What do the traces of bad guys look like?

The presumption of knowledge discovery in adversarial settings is that the traces left by bad guys are somehow different from those of ordinary people doing ordinary and innocuous things. What would those differences be?

There are three main sources of difference.

The first, and most obvious, is that to be a bad guy means to do things that most people don’t do. Given that reasonable attributes are collected, this almost inevitably means that there will be some differences between data records associated with bad guys and those associated with ordinary people.

This does depend on collecting data in which the differences are actually there. For example, if you’re looking for tax fraud, it’s probably useless to collect data about people’s height, since it’s unlikely that tax fraud is associated with being tall, or being short.

So the first thing we expect is that there will be differences between bad-guy data and normal data because of the requirements of whatever bad things they are doing.

People tend to think that this will be the most important, perhaps the only, difference. It isn't, unless the bad guys are exceptionally stupid, or knowledge discovery is being applied for the first time and they did not anticipate its use. This does happen: first attempts to find fraud are often spectacularly successful because they examine aspects of the data that nobody has looked at before.

However, smart bad guys will anticipate the use of knowledge discovery and so they will try, as much as possible, to make the values of the data collected about them seem as innocuous as possible.

So the second difference between bad-guy data and normal data is that bad-guy data is characterized by concealment. But isn’t concealment the absence of a difference? It turns out that, by and large, concealment itself generates a characteristic signature. Knowledge discovery techniques can be tuned to look for this signature, and so preferentially for bad guys.

How is it that concealment creates its own signature? Because, as humans, we're lousy at artificially producing anything that we also do naturally.

Think about having a conversation. We can all do it easily. But make it a conversation in front of lots of people, say giving a speech or acting in a play, and suddenly the way we do it changes: voice tremor, hesitation, strange speech rhythms, and so on.

The same phenomenon occurs when people try to construct data to resemble data that otherwise arises naturally. For example, when people invent a number (say, on a tax return or in a set of accounts), the distribution of its digits is quite distinctive: it deviates from the distribution found in naturally occurring numbers, which is described by Benford's Law. This has been used to detect potentially fraudulent tax returns.
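As a minimal sketch of how a Benford's-Law check might work (the function names and the chi-squared comparison are illustrative choices, not a description of any particular fraud-detection system): Benford's Law says that in many naturally occurring collections of numbers, the leading digit d appears with probability log10(1 + 1/d), so small digits dominate. Invented numbers tend to have much flatter leading-digit distributions, and a simple goodness-of-fit statistic can pick up the gap.

```python
import math
from collections import Counter

def leading_digit(x):
    """Return the leading (most significant) digit of a nonzero number."""
    s = str(abs(x)).lstrip("0.")
    return int(s[0])

def benford_expected(d):
    """Benford's Law: probability that the leading digit is d (1-9)."""
    return math.log10(1 + 1 / d)

def benford_chi2(values):
    """Chi-squared statistic comparing observed leading-digit frequencies
    against Benford's Law. Larger values suggest the numbers may not
    have arisen naturally."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    return sum(
        (counts.get(d, 0) - n * benford_expected(d)) ** 2
        / (n * benford_expected(d))
        for d in range(1, 10)
    )

# Numbers from a multiplicative process (exponential growth) fit Benford
# closely; numbers invented with uniform leading digits do not.
natural = [1.05 ** k for k in range(1, 400)]
invented = list(range(100, 1000))  # leading digits 1-9, equally often

print(benford_chi2(natural) < benford_chi2(invented))  # → True
```

In practice a real screening system would use more data, more digits, and a proper significance test, but the core idea is exactly this comparison: fabricated values fail to match the distribution that naturally generated values follow.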

The third difference between bad-guy traces and ordinary data is created when bad guys actively try to manipulate the knowledge discovery process.

Perhaps surprisingly, they don’t have to know a lot about exactly which knowledge discovery process is being used, and exactly how it works, to have a shot at this. But, again, the knowledge discovery process can be trained to look for the signature of manipulation.

In summary, there are three main ways in which the traces of bad guys differ from those of ordinary data:

  1. Difference. The bad-guy data is unusual, perhaps even outlying in the dataset.
  2. Concealment. The bad guys have changed data values to make them seem more normal, but this backfires because normal is hard to fake.
  3. Manipulation. The bad guys have changed data values to try to force particular behavior from the knowledge discovery process, but this backfires because the manipulation creates its own signature.

Of these three factors, the second and third are often much more significant than the first. “The wicked flee when no man pursueth” (Proverbs 28:1, KJV).
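To make the first kind of signal concrete, here is a deliberately crude sketch of "difference" detection: flagging records whose value lies far from the rest of the data using a z-score threshold. Real knowledge discovery techniques are far more sophisticated, and the three-sigma threshold and single-attribute setup are assumptions made purely for illustration.

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Return the indices of values lying more than `threshold`
    standard deviations from the mean -- a crude stand-in for the
    'difference' signal: records unusual enough to stand out."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * sd]

# Ninety-nine ordinary transaction amounts and one unusual one.
amounts = [100.0] * 99 + [10_000.0]
print(flag_outliers(amounts))  # → [99]
```

The concealment and manipulation signals are exactly what this kind of check misses: a smart bad guy keeps the flagged attribute looking ordinary, which is why the second and third signatures matter so much in practice.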
