Looking for bad guys I: Using difference

Given that the traces left by bad guys are of three kinds:

  1. Difference;
  2. Concealment; and
  3. Manipulation

there are many implications for the way in which knowledge-discovery systems should be designed, and I’ll talk about several of them in upcoming posts.

Let’s start by thinking about difference, which is the most obvious quality of bad guys, and the only one most people think about. Bad guys are doing something that is different from most other people (although not in every setting — when we looked at Enron emails, deceptiveness was so common that it was the mainstream culture; the place resembled a pirate ship).

The problem is that lots of ordinary people are doing things that are different as well. In fact, the properties that we see as normal are actually averages over lots of individual behavior. People don’t think of themselves as making ‘normal’ choices or acting ‘normally’; they just do whatever they decide to do.

From this point of view, it’s surprising that such a thing as normality exists. And in fact it only does in certain circumstances, and so in certain kinds of data.

Consider tastes in music. Twenty years ago, the only music that most people could easily listen to was what was played on the radio, and the slightly larger set that was available to buy at music stores. Radio stations played particular kinds of music, but there were only a smallish number of kinds: rock, pop, classical, oldies, etc. In this situation, it’s clear that most people’s tastes resembled those of the people around them; they could hardly not.

Today, the music world has changed completely. The existence of Internet radio and satellite radio makes it easy for people to experience different kinds of music. Anything that catches their attention can be followed up by downloading more material, including from artist websites that exist independently of the filtering effect of labels. Not surprisingly, tastes no longer cluster in the same way they once did, and so there isn’t any sense of ‘normal’ taste in music as there once would have been. (Anderson’s book, The Long Tail, is a good discussion of this issue. Note also that the use of tags recreates a kind of clustering, as people develop their individual tastes by following the tastes of others.)

The point is that situations where there are technical and/or social constraints make people and their actions look alike, even though all of their individual decisions are independent and free. If the set of possible decisions is small, there are fewer ways to be different. If the set of decisions is constrained by the social connections between people, it is more uncomfortable to make different decisions. This works in lots of small ways. Most people have a sense of what times of day are appropriate to telephone other people; and this has the side-effect of creating patterns of normality in call data, for example.
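To make the call-data example concrete, here is a minimal sketch in Python of how a shared sense of "appropriate" calling times shows up as a pattern. The data, the function names, and the 2% cut-off are all made up for illustration; real call records would need more care (time zones, weekdays versus weekends, and so on).

```python
from collections import Counter

def hourly_profile(call_hours):
    """Return the share of calls placed in each hour of the day (0-23)."""
    counts = Counter(call_hours)
    total = len(call_hours)
    return {hour: counts.get(hour, 0) / total for hour in range(24)}

def rare_calls(call_hours, profile, min_share=0.02):
    """Return the calls placed in hours that hold less than min_share of all calls."""
    return [hour for hour in call_hours if profile[hour] < min_share]

# Hypothetical data: the hour (0-23) at which each of 100 calls was placed.
calls = [9] * 30 + [13] * 25 + [19] * 40 + [21] * 4 + [3] * 1

profile = hourly_profile(calls)
odd_calls = rare_calls(calls, profile)  # only the single 3 a.m. call falls below 2%
```

Nobody coordinated these calls, but because almost everyone avoids the small hours, the one call at 3 a.m. stands out against the profile.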

So, in many kinds of data, it does make sense to talk about normality. Bad guys are forced to act in certain ways as part of their activities, and these actions, some of them at least, will deviate from normality. The problem is that there are also other people whose actions deviate from normality, just because. Perhaps they are eccentric, or socially inept.

This creates a problem, but not an insuperable one. What it does mean is that, rather than thinking of the problem as modelling bad guys, we think instead of the problem as modelling normality.

If we know what normality looks like, we can keep all of the records that don’t fit that model. This doesn’t finish the job, because what’s left is a mixture of bad-guy records and, for want of a better word, eccentric records.

But we have made significant progress on the overall problem because the size of the data has been reduced, probably by a very large amount. Now the problem becomes distinguishing bad guys from eccentrics. And that’s a reasonable problem.
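The filtering step can be sketched in a few lines of Python. This is only one possible way to model normality, chosen for brevity: it summarises a single numeric attribute by its median and median absolute deviation (MAD), and keeps the records whose robust score exceeds a threshold. The records, the attribute (calls per day), the function names, and the conventional 3.5 threshold are all assumptions for illustration, not a description of any particular system.

```python
from statistics import median

def fit_normality(values):
    """Summarise 'normal' as the median and the median absolute deviation (MAD)."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    return centre, mad

def keep_unusual(records, key, threshold=3.5):
    """Keep only the records that do not fit the model of normality."""
    values = [key(r) for r in records]
    centre, mad = fit_normality(values)
    if mad == 0:
        # Degenerate case: at least half the values are identical,
        # so anything that differs from the centre counts as unusual.
        return [r for r in records if key(r) != centre]
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the bulk of the data is roughly normally distributed.
    return [r for r in records
            if 0.6745 * abs(key(r) - centre) / mad > threshold]

# Hypothetical records: (person, calls made per day).
records = [("a", 10), ("b", 12), ("c", 11), ("d", 9), ("e", 13),
           ("f", 10), ("g", 11), ("h", 95), ("i", 60)]

unusual = keep_unusual(records, key=lambda r: r[1])
```

Here nine records shrink to two, "h" and "i". Nothing in the filter says which of them, if either, is a bad guy and which is merely eccentric; that is the second, smaller problem.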


