Ranking versus boundaries

Knowledge discovery is full of chicken-and-egg problems: typically it’s not clear how to set the parameters for an algorithm until you’ve seen the results it gives on your data.

For prediction, the problem of how to specify the boundary is of this kind. Suppose that we want to build a predictor for normality, so that each record will be classified as “normal” or “possibly abnormal”. We will have some examples of each in our training data (that is, records already labelled “normal” or “abnormal”), and typically there will be many more normal records than abnormal ones.

But how abnormal does a record have to be before we label it as abnormal? And what parameters should we give the algorithm that builds the predictor? Different decisions will have different effects on the false positive and false negative rates. If we move the boundary so that more records are predicted to be abnormal, we reduce the number of false negatives, but increase the number of false positives. There are techniques for making a good choice (using the so-called ROC, or Receiver Operating Characteristic, curve) but these aren’t very useful in an adversarial situation, where false negatives matter a lot. Worse, moving the boundary means changing the parameters of the algorithm that builds the predictor, so every time we want to try another position, we have to rebuild the predictor.
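To make the trade-off concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each record already carries a numeric abnormality score; the scores, the labels, and the helper name error_counts are all invented for this example and are not taken from any particular algorithm.

```python
# Hypothetical abnormality scores (higher = more abnormal) and true labels.
# These numbers are invented purely for illustration.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
is_abnormal = [True, True, False, True, False, False, False, False]


def error_counts(scores, labels, boundary):
    """Return (false_positives, false_negatives) when every record whose
    score is at or above `boundary` is predicted to be abnormal."""
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= boundary and not y)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < boundary and y)
    return false_pos, false_neg


# Lowering the boundary predicts more records abnormal:
# false negatives fall while false positives rise.
for boundary in (0.9, 0.6, 0.5, 0.25):
    fp, fn = error_counts(scores, is_abnormal, boundary)
    print(f"boundary={boundary:.2f}  false positives={fp}  false negatives={fn}")
```

On this toy data, the strictest boundary misses two abnormal records and raises no false alarms, while the loosest misses none but raises three.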

A better way to think about the problem is that the goal is to rank the records from most abnormal to least abnormal. In other words, the knowledge-discovery technique does not actually build a predictor, but something close to it.

Once we have a ranking, we can decide where to put the boundary after we have seen the analysis of the data, rather than before. There is a twofold win: we have avoided the chicken-and-egg problem of having to set the parameters of the algorithm, and we can easily explore the effect of different choices of the boundary without retraining.
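A rough sketch of the idea in Python follows. The record ids, scores, and the helper names rank_records and cutoff_for_no_missed are hypothetical. The rule used to place the boundary (flag just enough top-ranked records to include every known abnormal one) is only one possible choice, picked here because false negatives are assumed to be the costly error.

```python
# Hypothetical (record id, abnormality score) pairs; the scores are
# computed once and never recomputed while we explore boundaries.
scored = [("a", 0.20), ("b", 0.95), ("c", 0.55),
          ("d", 0.70), ("e", 0.10), ("f", 0.80)]


def rank_records(scored):
    """Order records from most abnormal to least abnormal."""
    return sorted(scored, key=lambda item: item[1], reverse=True)


def cutoff_for_no_missed(ranked, known_abnormal):
    """Smallest number of top-ranked records to flag so that every
    known abnormal record id is included (zero false negatives on
    the labelled data)."""
    remaining = set(known_abnormal)
    if not remaining:
        return 0
    for position, (record_id, _) in enumerate(ranked, start=1):
        remaining.discard(record_id)
        if not remaining:
            return position
    return len(ranked)  # some known abnormal ids never appeared


ranked = rank_records(scored)
cutoff = cutoff_for_no_missed(ranked, {"b", "f", "c"})
flagged = [record_id for record_id, _ in ranked[:cutoff]]
print("ranking:", [record_id for record_id, _ in ranked])
print("flag the top", cutoff, "records:", flagged)
```

Trying a different boundary is just a different slice of the same ranked list; nothing has to be rebuilt.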

Building a ranking predictor is a little harder than building a plain predictor, but not by much. Predictors are divided into two kinds: classifiers that predict a class label (from a finite set), and regressors that predict a continuous value. Ranking requires the second kind of predictor.
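The difference can be illustrated with a small Python sketch. The scoring rule here (distance from a typical normal value, measured in units of spread) is a stand-in chosen for simplicity, not a recommended technique, and the function names, data values, and the boundary of 2.0 are all assumptions made for the example.

```python
def abnormality_score(value, normal_mean, normal_spread):
    """Regressor-style output: a continuous value. Here it is simply the
    distance from the typical normal value, in units of spread."""
    return abs(value - normal_mean) / normal_spread


def classify(value, normal_mean, normal_spread, boundary=2.0):
    """Classifier-style output: one label from a finite set. The boundary
    is baked in, and the ordering among records is lost."""
    score = abnormality_score(value, normal_mean, normal_spread)
    return "abnormal" if score >= boundary else "normal"


# Invented data: one measurement per record.
values = [10.5, 9.0, 14.5, 11.5, 17.0]
normal_mean, normal_spread = 10.0, 1.0

labels = [classify(v, normal_mean, normal_spread) for v in values]
ranking = sorted(
    values,
    key=lambda v: abnormality_score(v, normal_mean, normal_spread),
    reverse=True,
)

print("labels: ", labels)   # only two distinct outputs, so many ties
print("ranking:", ranking)  # a full ordering, most abnormal first
```

The labels alone cannot say which of the two abnormal records is the more abnormal, whereas the continuous scores order all five records.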

