Quiet Analysis

The biggest difference between adversarial data analysis and what happens in the mainstream is that, for adversarial analysis, there’s no such thing as noise.

Mainstream data mining tries to build models that fit the available data well. There are two issues that arise, although they are intertwined and usually not thought about very clearly. Both are called overfitting.

That’s because there are two different problems that can be present in the data. The first is random variation around some central normality. This often arises because there are deep underlying (usually social) processes that force common structure in the data, but humans are locally variable, so there’s some variation superimposed on this common structure. In a geometric sense, the effect of noise is to make the data points into a fuzzier cloud than they “should be”. There’s no need for a good model to capture this fuzziness; it obscures the underlying structure rather than revealing it.

The second is the presence of outlying data that isn’t interesting enough to want to capture in the model. For example, in data about humans there will tend to be some points that are far from all of the others, just because some people are very different. If the goal is to capture generalities, then these outliers may not be worth modelling. There’s also a technical problem. The way in which most models fit the data is based on statistically sound but semantically dubious error models (e.g. least squares) that penalize points very heavily when they fit badly. Thus the presence of even a single outlier can distort the main structure of a model much more than it “should”. Such a model will perform badly on new data because it has learned a structure that depends (heavily) on one single record.
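The distortion caused by a squared-error penalty is easy to see on a toy example. The sketch below is only an illustration, not anything taken from a real dataset: the points, the helper name fit_line, and the choice of a straight-line model are all assumptions made for the demonstration. It fits an ordinary least-squares line to ten points that lie exactly on y = 2x + 1, then refits after adding one badly fitting record.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Ten made-up points lying exactly on the line y = 2x + 1.
clean = [(float(x), 2.0 * x + 1.0) for x in range(10)]

# The same data plus one outlying record (the line would predict y = 21 at x = 10).
with_outlier = clean + [(10.0, 100.0)]

clean_slope, clean_intercept = fit_line(clean)
outlier_slope, outlier_intercept = fit_line(with_outlier)

print("slope without outlier:", round(clean_slope, 2))
print("slope with one outlier:", round(outlier_slope, 2))
```

One record out of eleven is enough to move the fitted slope from 2 to well above 5, so the model now describes the other ten points poorly. Error models that penalize large residuals less severely (absolute error, for example) are less sensitive to this, at the cost of treating the outlier as uninteresting, which is exactly the assumption that becomes risky in adversarial settings.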

In adversarial settings, model building that tries to prevent both kinds of overfitting is dangerous. Adversaries who are sophisticated enough to know that modelling is happening will be trying, as hard as they can, to look similar to normal data — and this tends to make them sit in exactly the regions that look like noise.

Adversaries who can force data into the model-building process, which isn’t as far-fetched as it seems, can exploit modelling techniques that are distorted by outliers. If they can insert such a record, they can change the main structure of the model, which can create places for them to hide in the data, since they have some control over where those places are.
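A minimal sketch of how an inserted record can create a hiding place, assuming a deliberately simple detector that flags any value more than three standard deviations from the training mean. The detector, the threshold, the helper name is_flagged, and all of the numbers are hypothetical; they are chosen only to make the effect visible.

```python
import statistics


def is_flagged(training, value, threshold=3.0):
    """Flag value if it lies more than `threshold` standard deviations
    from the mean of the training data (a simple z-score rule)."""
    mean = statistics.fmean(training)
    spread = statistics.pstdev(training)
    return abs(value - mean) / spread > threshold


# Made-up "normal" records clustered around 10.
baseline = [9.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0, 10.0, 9.0, 11.0]

# The value the adversary would like to pass off as normal.
adversary_value = 14.0

# The same training data after the adversary has inserted one extreme record.
poisoned = baseline + [40.0]

print("flagged with clean training data:   ", is_flagged(baseline, adversary_value))
print("flagged with poisoned training data:", is_flagged(poisoned, adversary_value))
```

With clean training data the value 14 stands out; once the single extreme record has been absorbed, the mean shifts and the estimated spread grows so much that 14 sits comfortably inside the "normal" region. The adversary chose where the extreme record went, and so had some say over where the new hiding place appeared.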

Bottom line: in adversarial settings, no data is less interesting — it may have a story to tell, but even when it doesn’t it has to be examined with some care.

In practice, the nasty problem is that many forms of data collection introduce both noise and artifacts into the collected data. A great deal of analyst time goes into bullet extraction from self-shot feet.

