Posts Tagged 'support vector machine'

Adversarial knowledge discovery is not just knowledge discovery with classified data

Someone made a comment to me this week implying that data mining/knowledge discovery in classified settings was a straightforward problem because the algorithms didn’t have to be classified, just the datasets.

This view isn’t accurate, for the following reason: mainstream/ordinary knowledge discovery builds models inductively from data by maximizing the fit between the model and the available data. In an adversarial setting, this creates problems: adversaries can make a good guess about what your model will look like, and so can do better at hiding their records (making them less obvious or detectable); and they can also tell exactly what kinds of data they should try and force you to use if/when you retrain your model.

A simple example. The two leading-edge predictors today are support vector machines and random forests and, in mainstream settings, there’s often little to choose between them in prediction performance. However, in an adversarial setting, the difference is huge: support vector machines are relatively easy to game, because even one record that’s in the wrong place or mislabelled can change the entire orientation of the decision surface. To make it even worse, the way in which the boundary changes is roughly predictable from the kind of manipulation made. (You can see how this works in: J. G. Dutrisac, David B. Skillicorn: Subverting prediction in adversarial settings. ISI 2008: 19-24.). Random forests, on the other hand, are much more robust predictors, partly because their ensemble characteristics makes it hard to force particular behaviour because of the inherent randomness ‘inside’ the algorithm.

The same thing happens with clustering algorithms. Algorithms such as Expectation-Maximization are easily misled by a few carefully chosen records; and no mainstream clustering technique does a good job of finding the kinds of ‘fringe’ clusters that occur when something unusual is present in the data, but adversaries have tried hard to make it as usual as possible.

In fact, adversarial knowledge discovery requires building the entire framework of knowledge discovery over again, taking into account the adversarial nature of the problem from the very beginning. Some parts of mainstream knowledge discovery can be used directly; others with some adaptation; and others can’t safely be used. There are also some areas where we don’t know very much about how to solve problems that aren’t interesting in the mainstream, but are critical in the adversarial domain.


Which predictors can rank?

To be able to build a ranking predictor, there must be some way of labelling the training records with (estimates of) their ranks, so that this property can be generalised to new records. This is often straightforward, even if the obvious target label doesn’t map directly to a ranking.

There are six mainstream prediction technologies:

  1. Decision trees. These are everyone’s favourites, but they are quite weak predictors, and can only be used to predict class labels. So no use for ranking.
  2. Neural networks. These are also well-liked, but undeservedly so. Neural networks can be effective predictors for problems where the boundaries between the classes are difficult and non-linear, but they are horrendously expensive to train. They should not be used without soul searching. They can, however, predict numerical values and so can do ranking.
  3. Support Vector Machines. These are two-class predictors that try to fit the optimal boundary (the maximal margin) between points corresponding to the records from each class. The distances from the boundary are an estimate of how confident the classifier is in the classification of each record, and so provide a kind of surrogate ranking: from large positive numbers down to 1 for one class and then from -1 to large negative numbers for the other class.
  4. Ensembles. Given any kind of simple predictor, a better predictor can be built by: creating samples of the records from the training dataset; building individual predictors from each sample; and the use the collection of predictors as single, global predictor by asking for the prediction of each one, and using voting to make the global prediction. Ensembles have a number of advantages, primarily that the individual predictors cancel out each others variance. But the number of predictors voting for the winning class can also be interpreted as a strength of opinion for that class; and so for a value on which to rank. In other words, a record can be unanimously voted normal, voted normal by all but one of the individual predictors, and so on.
  5. Random Forests. Random Forests are a particular form of ensemble predictor where each component decision tree is built making decisions about internal tests in a particularly robust and contextualized way. This makes them one of the most powerful prediction technologies known. The same technique, using the number of votes for the winning class, or the margin between the most popular and the next most popular class can be used as a ranking.
  6. Rules. Rules are used because they seem intuitive and explanatory, but they are very weak predictors. This is mostly because they capture local structure in data, rather than global structure captured by most other predictors. Rules cannot straightforwardly be used to rank.

So, although ranking predictors are very useful in adversarial situations, they are quite difficult to build and use.