Looking for bad guys II: Using concealment

Bad guys will take steps to hide their traces in data, unless they’re very naive, or knowledge-discovery tools have never been applied in their particular domain before.

At first glance, this kind of concealment looks as if it should make the task of finding them harder. In fact, the opposite is true: doing anything to try to look more normal runs a serious risk of looking more abnormal (see the previous post for why this matters).

Hercule Poirot has made this point. He often says that murderers are not content to leave things alone, and it is their attempts to make detection harder that, in the end, reveal who they are.

If a bad guy wants to create data values that look more normal than the values his actions would otherwise produce, he has two problems:

  • what are the normal values; and
  • can values close to them be generated?

Knowing normal values is harder than it looks. In a sense, such values are knowable, but the risk is that the more the issue is thought about, the more likely a person is to go into an infinite loop of improving the values. What’s the latest time that it’s acceptable to phone someone you’ve met at a party and make it seem casual? The first part is easy to work out, but it gets much harder with the extra qualification.

When more than one value has to be set appropriately, the problem becomes much harder because, in normal data, the values of different attributes are correlated. It is therefore possible to set two attributes each to plausible individual values, but still create an anomaly because these values rarely occur together. The observation that the values are usually correlated is also the explanation for how they come to be that way in normal data. People who don’t think about the values naturally produce the observed correlations.
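To make this concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two attributes are synthetic, the correlation between them is invented, and the Mahalanobis distance is just one common way of measuring how unusual a combination of values is. It is not a method taken from this post.

```python
import math
import random

def mahalanobis_2d(point, data):
    """Distance of `point` from the centre of `data`, scaled by the data's covariance."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d^T S^-1 d, with the inverse of the 2x2 covariance written out.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

# Synthetic "normal" records: the second attribute tracks the first closely.
rng = random.Random(0)
normal = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    normal.append((x, x + rng.gauss(0, 0.3)))

# Each coordinate of both records is individually unremarkable (about 1.5
# standard deviations from the mean), but the second pair rarely occurs together.
typical = (1.5, 1.5)
faked = (1.5, -1.5)

d_typical = mahalanobis_2d(typical, normal)
d_faked = mahalanobis_2d(faked, normal)
print(f"typical record: {d_typical:.2f}")
print(f"faked record:   {d_faked:.2f}")
```

The faked record scores several times further from the centre than the typical one, even though a check on each attribute separately would pass both.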

Even if the desired values were known, it may be hard to generate them. If they are usually the result of unconscious processes, then faking them is hard. This is true of speech and directly-observed action. People who can create the illusion that unnatural speech and action are actually natural are called actors, and are highly paid for this skill. I pointed out in an earlier post that faking numbers is difficult because there’s a digit distribution in actual numbers which is not reproduced in faked ones.
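The digit-distribution point can be sketched in a few lines of Python. I am assuming here that the distribution in question is Benford's law for leading digits, under which a first digit d turns up with probability log10(1 + 1/d). The data sets are stand-ins: powers of two play the part of "actual" numbers because they are known to follow Benford's law closely, and uniformly random three-digit numbers play the part of faked ones. The function names are made up for this example.

```python
import math
import random
from collections import Counter

# Benford's law: expected probability of each leading digit 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])

def benford_chi_square(values):
    """Chi-square statistic comparing leading digits of `values` with Benford's law."""
    counts = Counter(first_digit(v) for v in values)
    n = len(values)
    return sum((counts[d] - n * p) ** 2 / (n * p) for d, p in BENFORD.items())

# Stand-in for naturally occurring numbers: powers of two.
actual = [2 ** k for k in range(1, 301)]

# Stand-in for faked numbers: someone picking "random-looking" three-digit
# values tends to spread the leading digits far too evenly.
rng = random.Random(1)
faked = [rng.randint(100, 999) for _ in range(300)]

chi_actual = benford_chi_square(actual)
chi_faked = benford_chi_square(faked)

# 15.51 is the 5% critical value of chi-square with 8 degrees of freedom.
CRITICAL = 15.51
print(f"actual: {chi_actual:.1f}, suspicious: {chi_actual > CRITICAL}")
print(f"faked:  {chi_faked:.1f}, suspicious: {chi_faked > CRITICAL}")
```

A test like this only flags a set of numbers as a whole, and only for kinds of data where Benford's law is a reasonable expectation in the first place.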

So efforts by bad guys to conceal themselves, by making their data look more like normal data than it ordinarily would, create an opportunity to look for them: we can look for the signature of concealment as well as the signature of whatever bad things we would already be looking for.

Of course, this does make the challenge of distinguishing normality from the unusual more difficult. But it suggests that, in some sense, we expect to see the following structure in data: large clusters of normal records; small clusters of bad-guy records quite close to them; and then single-record outliers or very small clusters corresponding to eccentrics, much further away from the normal clusters.


