Posts Tagged 'outliers'

Looking for bad guys II: Using concealment

Bad guys will take steps to hide their traces in data, unless they’re very naive, or knowledge-discovery tools have never been applied in their particular domain before.

At first glance, it might look as if this kind of concealment makes the task of finding them harder. In fact, the opposite is true: doing anything to try to look more normal runs a serious risk of looking more abnormal (see the previous post for why this matters).

Hercule Poirot has made this point. He often says that murderers are not content to leave things alone, and it is their attempts to make detection harder that, in the end, reveal who they are.

If a bad guy wants to create data values that look more normal than the ones his actions would otherwise produce, he has two problems:

  • what are the normal values; and
  • can values close to them be generated?

Knowing normal values is harder than it looks. In a sense, such values are knowable, but the risk is that the more the issue is thought about, the more likely a person is to go into an infinite loop of improving the values. What’s the latest time that it’s acceptable to phone someone you’ve met at a party and make it seem casual? The first part is easy to work out, but it gets much harder with the extra qualification.

When more than one value has to be set appropriately, the problem becomes much harder because, in normal data, the values of different attributes are correlated. It is therefore possible to set two attributes each to plausible individual values, but still create an anomaly because these values rarely occur together. The observation that the values are usually correlated is also the explanation for how they come to be that way in normal data. People who don’t think about the values naturally produce the observed correlations.
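
To make the correlation point concrete, here is a minimal sketch in Python. The two attributes, the synthetic data, and the name mahalanobis_2d are all assumptions made up for illustration. Mahalanobis distance is one standard way to score how unusual a combination of values is; it is not necessarily what any particular knowledge-discovery system uses.

```python
import math
import random

def mahalanobis_2d(point, data):
    """Distance of `point` from the centre of `data`, scaled by the data's covariance."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d^T S^-1 d, using the closed-form inverse of a 2x2 matrix.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

# Synthetic "normal" records (an assumption): the second attribute tracks the first.
rng = random.Random(0)
normal = []
for _ in range(500):
    x = rng.gauss(50, 10)
    y = x + rng.gauss(0, 3)
    normal.append((x, y))

typical = (60, 61)  # both values ordinary, and ordinary together
faked = (60, 40)    # each value ordinary on its own, but rarely seen together

print("typical:", round(mahalanobis_2d(typical, normal), 2))  # small distance
print("faked:  ", round(mahalanobis_2d(faked, normal), 2))    # much larger distance
```

Checked one attribute at a time, both records sit within about one standard deviation of the mean. Only the joint score separates them, which is why setting each value to something plausible is not enough.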

Even if the desired values were known, it may be hard to generate them. If they are usually the result of unconscious processes, then faking them is hard. This is true of speech and directly-observed action. People who can create the illusion that unnatural speech and action are actually natural are called actors, and are highly paid for this skill. I pointed out in an earlier post that faking numbers is difficult because there’s a digit distribution in actual numbers which is not reproduced in faked ones.
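
One well-known digit distribution of this kind is Benford's law for leading digits. Assuming that is the sort of regularity meant here, the following Python sketch compares first-digit frequencies against it. The two data sets are synthetic stand-ins (powers of two for numbers that arise from a growth process, uniformly random three-digit integers for invented ones), and the function name benford_deviation is hypothetical.

```python
import math
import random
from collections import Counter

# Benford's law: probability that the leading digit is d.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def benford_deviation(values):
    """Chi-square-style distance between observed leading-digit shares and Benford's law."""
    counts = Counter(int(str(abs(v))[0]) for v in values if v != 0)
    total = sum(counts.values())
    return sum((counts.get(d, 0) / total - p) ** 2 / p for d, p in BENFORD.items())

# Stand-in for naturally occurring numbers: values produced by repeated doubling.
grown = [2 ** k for k in range(1, 301)]

# Stand-in for invented numbers: every leading digit equally likely.
rng = random.Random(1)
invented = [rng.randint(100, 999) for _ in range(300)]

print("grown:   ", round(benford_deviation(grown), 4))     # close to zero
print("invented:", round(benford_deviation(invented), 4))  # clearly larger
```

A large deviation does not prove that numbers were faked, since plenty of legitimate data (anything with a narrow range, for instance) does not follow Benford's law. It is one cheap signal to combine with others.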

So efforts by bad guys to conceal themselves, by making their data look more like normal data than it ordinarily would, create an opportunity to find them: we can look for the signature of concealment as well as the signature of whatever bad things we would already be looking for.

Of course, this does make the challenge of distinguishing normality from the unusual more difficult. But it suggests that, in some sense, we expect to see the following structure in data: large clusters of normal records; small clusters of bad-guy records quite close to them; and then single-record outliers or very small clusters corresponding to eccentrics, much further away from the normal clusters.

Looking for bad guys I: Using difference

Given that the traces left by bad guys are of three kinds:

  1. Difference;
  2. Concealment; and
  3. Manipulation

there are many implications for the way in which knowledge-discovery systems should be designed, and I’ll talk about several of them in upcoming posts.

Let’s start by thinking about difference, which is the most obvious quality of bad guys, and the only one most people think about. Bad guys are doing something that is different from most other people (although not in every setting — when we looked at Enron emails, deceptiveness was so common that it was the mainstream culture; the place resembled a pirate ship).

The problem is that lots of ordinary people are doing things that are different as well. In fact, the properties that we see as normal are actually averages over lots of individual behavior. Each person doesn’t think of making ‘normal’ choices or acting ‘normally’ — they just do whatever they decide to do.

From this point of view, it’s surprising that such a thing as normality exists. And in fact it only does in certain circumstances, and so in certain kinds of data.

Consider tastes in music. Twenty years ago, the only music that most people could easily listen to was what was played on the radio, and the slightly larger set that was available to buy at music stores. Radio stations played particular kinds of music, but there were only a smallish number of kinds: rock, pop, classical, oldies, etc. In this situation, it’s clear that most people’s tastes resembled those of the people around them; they could hardly not.

Today, the music world has changed completely. The existence of Internet radio and satellite radio makes it easy for people to experience different kinds of music. Anything that catches their attention can be followed up by downloading more material, including from artist websites that exist independently of the filtering effect of labels. Not surprisingly, tastes no longer cluster in the same way they once did, and so there isn’t any sense of ‘normal’ taste in music as there once would have been. (Anderson’s book, The Long Tail, is a good discussion of this issue. Note also that the use of tags recreates a kind of clustering, as people develop their individual tastes by following the tastes of others.)

The point is that situations where there are technical and/or social constraints make people and their actions look alike, even though all of their individual decisions are independent and free. If the set of possible decisions is small, there are fewer ways to be different. If the set of decisions is constrained by the social connections between people, it is more uncomfortable to make different decisions. This works in lots of small ways. Most people have a sense of what times of day are appropriate to telephone other people; and this has the side-effect of creating patterns of normality in call data, for example.

So, in many kinds of data, it does make sense to talk about normality. Bad guys are forced to act in certain ways as part of their activities, and these actions, some of them at least, will deviate from normality. The problem is that there are also other people whose actions deviate from normality, just because. Perhaps they are eccentric, or socially inept.

This creates a problem, but not an insuperable one. What it does mean is that, rather than thinking of the problem as modelling bad guys, we think instead of the problem as modelling normality.

If we know what normality looks like, we can set aside every record that fits the model and keep only those that don’t. This doesn’t finish the job, because what’s left is a mixture of bad-guy records and, for want of a better word, eccentric records.

But we have made significant progress on the overall problem because the size of the data has been reduced, probably by a very large amount. Now the problem becomes distinguishing bad guys from eccentrics. And that’s a reasonable problem.
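
The filtering step described above can be sketched in Python using the telephone example. Everything here is an assumption for illustration: the synthetic call history, the record layout, the function names fit_hour_model and residue, and the 1% cut-off. A real model of normality would use many more attributes than the hour of the call.

```python
from collections import Counter

def fit_hour_model(hours):
    """Model of normality: the share of past calls placed in each hour (0-23)."""
    counts = Counter(hours)
    total = len(hours)
    return {h: counts.get(h, 0) / total for h in range(24)}

def residue(records, model, min_share=0.01):
    """Keep only the records whose hour is rarer than min_share under the model."""
    return [r for r in records if model[r["hour"]] < min_share]

# Synthetic call history: busy from 09:00 to 21:00, quieter at 08:00 and 22:00,
# and a single call at 03:00.
history = []
for hour in range(9, 22):
    history += [hour] * 80
history += [8] * 30 + [22] * 30 + [3]

model = fit_hour_model(history)

new_calls = [
    {"caller": "A", "hour": 14},
    {"caller": "B", "hour": 3},
    {"caller": "C", "hour": 20},
    {"caller": "D", "hour": 4},
]

left_over = residue(new_calls, model)
print([r["caller"] for r in left_over])  # callers whose records did not fit the model
```

The records left over are simply the ones that do not fit. Nothing in this step says whether a 03:00 caller is a bad guy or an insomniac, which is exactly the second problem that remains.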