Someone made a comment to me this week implying that data mining/knowledge discovery in classified settings was a straightforward problem because the algorithms didn’t have to be classified, just the datasets.
This view isn’t accurate, for the following reason: mainstream/ordinary knowledge discovery builds models inductively, by maximizing the fit between the model and the available data. In an adversarial setting, this creates problems: adversaries can make a good guess about what your model will look like, and so can do a better job of hiding their records (making them less obvious or detectable); and they can also tell exactly what kind of data they should try to force you to use if/when you retrain your model.
A simple example. The two leading-edge predictors today are support vector machines and random forests and, in mainstream settings, there’s often little to choose between them in prediction performance. However, in an adversarial setting, the difference is huge: support vector machines are relatively easy to game, because even one record that’s in the wrong place or mislabelled can change the entire orientation of the decision surface. To make it even worse, the way in which the boundary changes is roughly predictable from the kind of manipulation made. (You can see how this works in: J. G. Dutrisac, David B. Skillicorn: Subverting prediction in adversarial settings. ISI 2008: 19-24.) Random forests, on the other hand, are much more robust predictors, partly because the inherent randomness ‘inside’ the algorithm and its ensemble structure make it hard to force any particular behaviour.
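A toy sketch of the SVM side of this (my own illustration using scikit-learn, not the experiment from the paper): train a linear SVM on two well-separated classes, then add a single mislabelled record deep in the other class’s territory and retrain. The orientation of the separating hyperplane, read off from the weight vector, shifts in response to that one point.

```python
# Toy illustration (not from the paper): one mislabelled record
# reorienting a linear SVM's decision surface.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two well-separated classes in 2D, 20 records each.
X = np.vstack([rng.normal([-2, 0], 0.5, (20, 2)),
               rng.normal([2, 0], 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clean = SVC(kernel="linear", C=10).fit(X, y)

# Adversarial poisoning: one record placed deep inside class 1's
# region, but carrying a class-0 label.
X_poisoned = np.vstack([X, [[3.0, 3.0]]])
y_poisoned = np.append(y, 0)
poisoned = SVC(kernel="linear", C=10).fit(X_poisoned, y_poisoned)

def boundary_angle(model):
    # Angle of the hyperplane's normal vector, in degrees.
    w = model.coef_[0]
    return np.degrees(np.arctan2(w[1], w[0]))

print("clean boundary normal:   ", round(boundary_angle(clean), 1))
print("poisoned boundary normal:", round(boundary_angle(poisoned), 1))
```

The direction of the tilt follows from where the poisoned record was planted, which is the predictability the paper exploits.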
The same thing happens with clustering algorithms. Algorithms such as Expectation-Maximization are easily misled by a few carefully chosen records; and no mainstream clustering technique does a good job of finding the kinds of ‘fringe’ clusters that occur when something unusual is present in the data, but adversaries have tried hard to make it look as usual as possible.
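The EM fragility can be sketched the same way (again my own toy example, fitting a Gaussian mixture with scikit-learn): plant a handful of records between two clean clusters and the component means that EM recovers drift away from the true cluster centres.

```python
# Toy illustration: a few planted records shifting the component
# means recovered by Expectation-Maximization (Gaussian mixture).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Two tight, well-separated clusters of 100 records each.
X = np.vstack([rng.normal([0, 0], 0.3, (100, 2)),
               rng.normal([5, 5], 0.3, (100, 2))])

gm_clean = GaussianMixture(n_components=2, random_state=0).fit(X)

# Five planted records sitting between the true clusters.
decoys = np.tile([[2.5, 2.5]], (5, 1)) + rng.normal(0, 0.1, (5, 2))
gm_poisoned = GaussianMixture(n_components=2, random_state=0).fit(
    np.vstack([X, decoys]))

print("clean component means:\n", gm_clean.means_.round(2))
print("poisoned component means:\n", gm_poisoned.means_.round(2))
```

Five records out of two hundred are enough to pull both means inward, because EM weights every record by its responsibility rather than discounting suspicious ones.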
In fact, adversarial knowledge discovery requires building the entire framework of knowledge discovery over again, taking the adversarial nature of the problem into account from the very beginning. Some parts of mainstream knowledge discovery can be used directly; others need some adaptation; and still others can’t safely be used at all. There are also some areas where we don’t know very much, because the problems aren’t interesting in the mainstream but are critical in the adversarial domain.