Posts Tagged 'prediction'

Which predictors can rank?

To be able to build a ranking predictor, there must be some way of labelling the training records with (estimates of) their ranks, so that this property can be generalised to new records. This is often straightforward, even if the obvious target label doesn’t map directly to a ranking.

There are six mainstream prediction technologies:

  1. Decision trees. These are everyone’s favourites, but they are quite weak predictors, and can only be used to predict class labels. So no use for ranking.
  2. Neural networks. These are also well-liked, but undeservedly so. Neural networks can be effective predictors for problems where the boundaries between the classes are difficult and non-linear, but they are horrendously expensive to train. They should not be used without soul searching. They can, however, predict numerical values and so can do ranking.
  3. Support Vector Machines. These are two-class predictors that try to fit the optimal boundary (the maximal margin) between points corresponding to the records from each class. The distances from the boundary are an estimate of how confident the classifier is in the classification of each record, and so provide a kind of surrogate ranking: from large positive numbers down to 1 for one class and then from -1 to large negative numbers for the other class.
  4. Ensembles. Given any kind of simple predictor, a better predictor can be built by: creating samples of the records from the training dataset; building individual predictors from each sample; and the use the collection of predictors as single, global predictor by asking for the prediction of each one, and using voting to make the global prediction. Ensembles have a number of advantages, primarily that the individual predictors cancel out each others variance. But the number of predictors voting for the winning class can also be interpreted as a strength of opinion for that class; and so for a value on which to rank. In other words, a record can be unanimously voted normal, voted normal by all but one of the individual predictors, and so on.
  5. Random Forests. Random Forests are a particular form of ensemble predictor where each component decision tree is built making decisions about internal tests in a particularly robust and contextualized way. This makes them one of the most powerful prediction technologies known. The same technique, using the number of votes for the winning class, or the margin between the most popular and the next most popular class can be used as a ranking.
  6. Rules. Rules are used because they seem intuitive and explanatory, but they are very weak predictors. This is mostly because they capture local structure in data, rather than global structure captured by most other predictors. Rules cannot straightforwardly be used to rank.

So, although ranking predictors are very useful in adversarial situations, they are quite difficult to build and use.

Second-stage predictors

If we have a ranking predictor, we can choose to put the boundary in a position where the false negative rate is zero (which is what we desperately need to do in an adversarial situation).

The side-effect of this, of course, is that the false positive rate is likely to be very large, so at first glance this seems crazy. But only if we think that the problem can be solved by a single predictor.

Suppose, instead, that we think of the problem as one to be solved in stages.

The first predictor is intended mostly to reduce the problem to a manageable size. The boundary is set so that the false negative rate is zero, knowing that this means that many, many records will be ranked as “possibly abnormal”.

However, the total number of records is now 20% of what it was initially. We can now apply a more sophisticated predictor, still trying to model and predict normality, but able to spend more resources per record, and use a cleverer algorithm.

Again the result is a ranking of the records, into which we can insert a boundary such that the false negative rate is zero. This is still extremely conservative, so that many records that are quite normal may still be ranked on the “possibly abnormal” side of the boundary, but the number of records remaining is still smaller. Even if only half the records can be safely discarded as normal, we have still reduced the total number of records to 10% of the original.

This staged approach can be continued for as many stages as required. Each new stage works with fewer records, and can be correspondingly more sophisticated. Eventually, the number of questionable records may become small enough that a human can do the final check.

Assembling predictors in sequence means that predictors whose individual accuracies are not perfect can be arranged so that the overall effect is close to perfect.

Of course, we have not yet quite solved the problem of detecting possible bad things. The pool of abnormal records contains both records associated with bad guys and records associated with other more ordinary kinds of abnormality (eccentricity).

At this point, two lines of attack remain. The first is to build a model of eccentricity, and use it to eliminate those records from the data. The other is to observe that most bad-guy actions require some kind of collective effort, so that bad-guy records are likely to form some kind of grouping, while those of eccentrics are likely to be one-offs. (Of course, this does not solve the problem of the lone-wolf bad guy, which is why serial killers can be so hard to detect.)

Ranking versus boundaries

Knowledge discovery is full of chicken and egg problems — typically it’s not clear how to set the parameters for an algorithm until you’ve seen the results it gives on your data.

For prediction, the problem of how to specify the boundary is of this kind. Suppose that we want to build a predictor for normality, so that each record will be classified as “normal”, or “possibly abnormal”. We will have some examples of each in our training data (that is records already labelled “normal” or “abnormal”), and typically there will be many more normal records than abnormal.

But how abnormal does a record have to be before we label it as abnormal? And what parameters should we give the algorithm that builds the predictor? Different decisions will have different effects on the false positive and false negative rates. If we move the boundary so that more records are predicted to be abnormal, we reduce the number of false negatives, but increase the number of false positives. There are techniques for making a good choice (using the so-called ROC or Receiver-Operating-Characteristic curve) but these aren’t very useful in an adversarial situation, where false negatives matter a lot. Moving the boundary means changing the parameters of the algorithm that builds the predictor, so every time we want to try another position, we have to rebuild the predictor.

A better way to think about the problem is that the goal is to rank the records from most abnormal to least abnormal. In other words, the knowledge-discovery technique does not actually build a predictor, but something close to it.

Once we have a ranking, we can decide where to put the boundary, but after we have seen the analysis of the data, rather than before. There is a twofold win: we have avoided the chicken and egg problem of having to set the parameters of the algorithm, and we can easily explore the effect of different choices of the boundary without retraining.

Building a ranking predictor is a little harder than building a plain predictor, but not by much. Predictors are divided into two kinds: classifiers that predict a class label (from a finite set), and regressors that predict a continuous value. Ranking requires the second kind of predictor.

Making prediction usable

In the previous post, I pointed out that the goal of prediction in adversarial settings is to prevent bad things happening, but it doesn’t work well to attack the problem directly.

The first important insight is to see that the problem is really about predicting normality. Rather than try and predict what bad guys might do, which might be any of a very long list of things, try and predict what normal people will do. This is a great deal easier because normality has regularities.

This isn’t obvious, but it happens because we are social beings. The things we do, and the way we do them, are constrained by the fact that other people are involved. Even something as simple as walking down the sidewalk is a highly constrained process — people move to the appropriate side when they approach someone without any conscious thought, and without any apparent signal. In fact, this works so well that when you go to a country where the standard side on which to pass someone is different (left vs right) both you and the natives will feel uncomfortable without knowing why — and it takes some people years to correct their behaviour to local norms.

But our social nature means that the structure of normality is quite robust, and so can be learned with acceptable accuracy by a predictor.

Of course, this hasn’t solved the problem. But it has reduced the scale of the problem immensely. If we have a million records, a fairly simple predictor of normality can predict that 800,000 of them are certainly normal, reducing the problem by a factor of 5.

The records that remain can now be processed by a second predictor which (a) has a different task, and (b) has to handle much less data.

Doing prediction in adversarial settings

The overall goal of prediction in adversarial settings is to stop bad things happening — terrorist attacks, fraud, crime, money laundering, and lots of other things.

People intuitively think that the way to address this goal is to try and build a predictor for the bad thing. A few moments thought shows that building such a predictor is a very difficult, maybe impossible, thing to do. So some people immediately conclude that it’s silly or a waste of money, to try and address such goals using knowledge discovery.

There are a couple of obvious reasons why direct prediction won’t work. The first is that bad guys have a very large number of ways in which they can achieve their goal, and it’s impossible for the good guys to consider every single one in designing the predictive model.

This problem is very obvious in intrusion detection, trying to protect computer systems against attacks. There are two broad approaches. The first is to keep a list of bad things, and block any of them when they occur. This is how antivirus software works — every day (it was every week; soon it will be every hour) new additions to the list of bad things have to be downloaded. Of course, this doesn’t predict so-called zero-day attacks, which use some mechanism that has never been used before and so is not on the list of bad things. The second approach is to keep track of what has happened before, and prevent anything new from happening (without some explicit approval by the user). The trouble is that, although there are some regularities in what a user or a system does every day, there are always new things — new websites visited, email sent to new addresses. As a result, alarms are triggered so often that it drives everyone mad, and such systems often get turned off. Vista’s user authorization is a bit like this.

The other difficulty with using direct prediction is making it accurate enough. Suppose that there are two categories that we want to predict: good, and bad. A false positive is when a good record is predicted to be bad; and a false negative is when a bad record is predicted to be good. Both kinds of wrong predictions are a problem, but in different ways. A false positive causes annoyance and irritation, and generates extra work, since the record (and the person it belongs to) must be processed further. However, a false negative is usually much worse — because it means that a bad guy gets past the prediction mechanism.

Prediction technology is considered to be doing well if it achieves a prediction accuracy of around 90% (the percentage of records predicted correctly). It would be fabulous if it achieved an accuracy of 99%. But when the number of records is 1 million, a misclassification rate of 1% is 10,000 records! The consequences of this many mistakes would range from catastrophic to unusable.

These problems with prediction have been pointed out in the media and in some academic writing, as if they meant that prediction in adversarial settings is useless. This is a bit of an argument against a straw man. What is needed is a more thoughtful way of thinking about how prediction should be done, which I’ll talk about in the next posting.