Posts Tagged 'fraud'

Knowledge Discovery for Counterterrorism and Law Enforcement

My new book, Knowledge Discovery for Counterterrorism and Law Enforcement, is out. You can buy a copy from:

The publisher’s website

Amazon.

(Despite what these pages say, the book is available or will be within a day or two.)

As the holiday season approaches, perhaps you have a relative who’s in law enforcement, or intelligence, or security? What could be better than a book! Or maybe you’d like to buy one for yourself.

(A portion of the price of this book goes to support deserving university faculty.)

Doing prediction in adversarial settings

The overall goal of prediction in adversarial settings is to stop bad things happening — terrorist attacks, fraud, crime, money laundering, and lots of other things.

People intuitively think that the way to address this goal is to try to build a predictor for the bad thing. A few moments’ thought shows that building such a predictor is very difficult, maybe impossible. So some people immediately conclude that it’s silly, or a waste of money, to try to address such goals using knowledge discovery.

There are a couple of obvious reasons why direct prediction won’t work. The first is that bad guys have a very large number of ways in which they can achieve their goal, and it’s impossible for the good guys to consider every single one in designing the predictive model.

This problem is very obvious in intrusion detection, trying to protect computer systems against attacks. There are two broad approaches. The first is to keep a list of bad things, and block any of them when they occur. This is how antivirus software works — every day (it was every week; soon it will be every hour) new additions to the list of bad things have to be downloaded. Of course, this doesn’t predict so-called zero-day attacks, which use some mechanism that has never been used before and so is not on the list of bad things. The second approach is to keep track of what has happened before, and prevent anything new from happening (without some explicit approval by the user). The trouble is that, although there are some regularities in what a user or a system does every day, there are always new things — new websites visited, email sent to new addresses. As a result, alarms are triggered so often that it drives everyone mad, and such systems often get turned off. Vista’s user authorization is a bit like this.
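As a rough illustration of the two approaches, here is a minimal Python sketch. The event names, the KNOWN_BAD list and the NoveltyDetector class are all invented for this example; real intrusion-detection systems are far more elaborate.

```python
# Hypothetical list of known-bad things (a stand-in for antivirus signatures).
KNOWN_BAD = {"worm.exe", "trojan.dll"}


def signature_alarm(event: str) -> bool:
    """Approach 1: raise an alarm only for things already on the bad list."""
    return event in KNOWN_BAD


class NoveltyDetector:
    """Approach 2: raise an alarm for anything not seen (or approved) before."""

    def __init__(self, history):
        self.seen = set(history)

    def alarm(self, event: str) -> bool:
        return event not in self.seen

    def approve(self, event: str) -> None:
        # Explicit approval by the user: the event stops being "new".
        self.seen.add(event)


# Invented events for one day: (event, is it actually bad?)
today = [
    ("mail to alice@example.com", False),  # routine
    ("visit news.example.org", False),     # benign, but never seen before
    ("worm.exe", True),                    # known attack
    ("zero-day.bin", True),                # attack not yet on any list
]

novelty = NoveltyDetector(history=["mail to alice@example.com"])

missed_by_signatures = [e for e, bad in today if bad and not signature_alarm(e)]
false_alarms_from_novelty = [e for e, bad in today if not bad and novelty.alarm(e)]

print("missed by the bad list:", missed_by_signatures)
print("benign but flagged as new:", false_alarms_from_novelty)
```

In this toy run the bad list misses the attack it has never seen, while the novelty check raises an alarm on a harmless new website.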

The other difficulty with using direct prediction is making it accurate enough. Suppose that there are two categories that we want to predict: good, and bad. A false positive is when a good record is predicted to be bad; and a false negative is when a bad record is predicted to be good. Both kinds of wrong predictions are a problem, but in different ways. A false positive causes annoyance and irritation, and generates extra work, since the record (and the person it belongs to) must be processed further. However, a false negative is usually much worse — because it means that a bad guy gets past the prediction mechanism.

Prediction technology is considered to be doing well if it achieves a prediction accuracy of around 90% (the percentage of records predicted correctly). It would be fabulous if it achieved an accuracy of 99%. But when the number of records is 1 million, a misclassification rate of 1% is 10,000 records! That many mistakes would make a system anything from unusable to catastrophic.
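The arithmetic is easy to check. The sketch below is only an illustration: the function names are mine, and the figure of 100 bad records among the million, along with the separate 1% false-positive and false-negative rates, is an assumption made up to show how the errors might split between the two kinds.

```python
def expected_errors(n_records: int, accuracy: float) -> int:
    """Number of records a predictor with the given accuracy gets wrong."""
    return round(n_records * (1.0 - accuracy))


def split_errors(n_records: int, n_bad: int,
                 false_positive_rate: float, false_negative_rate: float):
    """Split the mistakes into false positives and false negatives.

    n_bad and both rates are assumptions supplied by the caller.
    """
    false_positives = round((n_records - n_bad) * false_positive_rate)
    false_negatives = round(n_bad * false_negative_rate)
    return false_positives, false_negatives


for accuracy in (0.90, 0.99):
    print(f"accuracy {accuracy:.0%}: {expected_errors(1_000_000, accuracy):,} records wrong")

# Assumed scenario: 100 bad records hidden among 1 million, 1% error each way.
fp, fn = split_errors(1_000_000, 100, 0.01, 0.01)
print(f"false positives: {fp:,}, false negatives: {fn:,}")
```

Under these assumed numbers, almost all of the mistakes are good records wrongly predicted to be bad, which is where the annoyance and extra work come from.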

These problems with prediction have been pointed out in the media and in some academic writing, as if they meant that prediction in adversarial settings is useless. This is a bit of an argument against a straw man. What is needed is a more thoughtful way of thinking about how prediction should be done, which I’ll talk about in the next posting.

Looking for Bad Guys III: Using manipulation

Bad guys who are aware that knowledge-discovery tools will be used to look for them may also try to actively manipulate the process to their own advantage.

One way to do this is to get an insider working for them, someone who can alter the data or the results of the analysis to their benefit. This is probably the most common method: over all of history, probably more sieges have been successful because someone opened the gates from the inside than because the walls were broken through. It’s easy to get caught up in the cleverness of technology and forget that sometimes suborning someone is the easiest attack.

However, the focus of this blog is knowledge discovery, so let me concentrate on that. Before we talk about how manipulation can be exploited as a discovery tool, we need to talk about what manipulation looks like; and before we can do that, we need to think about the structure of the knowledge-discovery process.

It’s helpful to divide up the stages of knowledge discovery into:

  1. Collecting the data (CCTV images, transaction logs);
  2. Analysing the data (the part that’s usually thought of as the heart of knowledge discovery);
  3. Deciding on what to do with the results and taking action.

Although an adversary can only attack the process via the data that is collected (assuming they don’t have an insider), it is helpful to think of three different kinds of attacks, directed against each of the three stages. The different attacks require understanding different aspects of the knowledge-discovery system.

Manipulating the data collection stage is probably the easiest, because it’s often possible to see and understand how the data is being collected. For example, the fields of view of CCTV cameras can usually be inferred from their positions (even if they are enclosed in black plastic bubbles) and so ways to move around them without coming into view can be worked out. Alternatively, disguises can be used to conceal who is being seen, even though an image is captured. One of the reasons identity theft is a big business is that it provides a way to have data captured about you, but data that is useless because it doesn’t connect to the real you.

Manipulating the decision and action stage is done using social engineering. This means trying to create the impression in the minds of the people who are making the decisions and taking the actions that the analysis system has made an error.

Manipulating the analysis stage is easier than it should be. This is because most knowledge-discovery technology has been tuned to give good results on data with natural variation. This gives bad guys an opportunity to insert data that is the worst possible from the point of view of the algorithms, and so hide their traces.
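To make this concrete, here is a toy Python sketch using one very simple analysis: flag any record more than three standard deviations from the mean. The data, the threshold and the function name are all invented for illustration, and this is not a claim about how any particular system works.

```python
from statistics import mean, pstdev


def flagged(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]


# Invented data: 100 ordinary records with values between 95 and 105.
ordinary = [95 + (i % 11) for i in range(100)]
bad_guy = 200          # the record the bad guy wants to hide
decoys = [1000] * 5    # records inserted purely to distort the statistics

before = flagged(ordinary + [bad_guy])
after = flagged(ordinary + [bad_guy] + decoys)

print("bad guy flagged without decoys:", bad_guy in before)
print("bad guy flagged with decoys:   ", bad_guy in after)
print("decoys flagged:                ", 1000 in after)
```

A handful of extreme records inflates the standard deviation so much that the record of interest no longer looks unusual. The decoys themselves do stand out, though.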

The technology used for knowledge discovery needs to be completely rethought to take manipulation into account. This is primarily why adversarial knowledge discovery is not just another application of knowledge discovery, but a completely different problem.

The good part about this is that attempts at manipulation also create an abnormal signature in the data; and the process can be tuned to look for this signature as well.

What do the traces of bad guys look like?

The presumption of knowledge discovery in adversarial settings is that the traces left by bad guys are somehow different from those of ordinary people doing ordinary and innocuous things. What would those differences be?

There are three main sources of difference.

The first, and most obvious, is that to be a bad guy means to do things that most people don’t do. Given that reasonable attributes are collected, this almost inevitably means that there will be some differences between data records associated with bad guys and those associated with ordinary people.

This does depend on collecting data in which the differences are actually there. For example, if you’re looking for tax fraud, it’s probably useless to collect data about people’s height, since it’s unlikely that tax fraud is associated with being tall, or being short.

So the first thing we expect is that there will be differences between bad-guy data and normal data because of the requirements of whatever bad things they are doing.

People tend to think that this will be the most important, perhaps the only, difference. This isn’t true, except if the bad guys are exceptionally stupid, or knowledge discovery is being applied for the first time and the bad guys did not anticipate its use. This does happen — first attempts to find fraud are often spectacularly successful because they look at aspects of the data that haven’t been examined before.

However, smart bad guys will anticipate the use of knowledge discovery and so they will try, as much as possible, to make the values of the data collected about them seem as innocuous as possible.

So the second difference between bad-guy data and normal data is that bad-guy data is characterized by concealment. But isn’t concealment the absence of a difference? It turns out that, by and large, concealment itself generates a characteristic signature. Knowledge discovery techniques can be tuned to look for this signature, and so preferentially for bad guys.

How is it that concealment creates its own signature? Because, as humans, we’re lousy at creating, artificially, anything that we also do naturally.

Think about having a conversation. We can all do it easily. But make it a conversation in front of lots of people, say giving a speech or acting in a play, and suddenly the way we do it changes: voice tremor, hesitation, strange speech rhythms, and so on.

The same phenomenon occurs when people try to construct data to resemble data that otherwise arises naturally. For example, when people create a fictitious number (for example, on a tax return or in accounts) the digit distribution is quite distinctive: in many naturally arising sets of numbers the leading digits follow Benford’s Law, and made-up numbers tend not to. This has been used to detect potentially fraudulent tax returns.
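A leading-digit check of this kind can be sketched in a few lines of Python. Benford’s Law says that in many naturally arising sets of numbers the leading digit d occurs with probability log10(1 + 1/d). The function names and both data sets below are my own illustration: powers of 2 stand in for "natural" numbers, and a set with evenly spread leading digits stands in for made-up ones. Real fabricated figures need not look like this, and real checks use proper statistical tests.

```python
import math
from collections import Counter


def benford_expected(d: int) -> float:
    """Probability of leading digit d (1..9) under Benford's Law."""
    return math.log10(1 + 1 / d)


def leading_digit(x) -> int:
    """First non-zero digit of a non-zero number."""
    return int(str(abs(x)).lstrip("0.")[0])


def leading_digit_freqs(values):
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def benford_distance(values) -> float:
    """Sum of absolute gaps between observed and Benford frequencies."""
    freqs = leading_digit_freqs(values)
    return sum(abs(freqs[d] - benford_expected(d)) for d in range(1, 10))


natural = [2 ** k for k in range(1, 200)]  # stand-in for naturally arising numbers
made_up = list(range(100, 1000))           # stand-in: every leading digit equally common

print(f"natural: {benford_distance(natural):.3f}")
print(f"made up: {benford_distance(made_up):.3f}")
```

The larger the distance, the less the data looks like it arose naturally. A large distance is only a reason to look more closely, not proof of fraud.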

The third difference between bad-guy traces and ordinary data is created when bad guys actively try to manipulate the knowledge discovery process.

Perhaps surprisingly, they don’t have to know a lot about exactly which knowledge discovery process is being used, and exactly how it works, to have a shot at this. But, again, the knowledge discovery process can be trained to look for the signature of manipulation.

In summary, there are three main ways in which the traces of bad guys differ from ordinary data:

  1. Difference. The bad-guy data is unusual, perhaps even outlying in the dataset.
  2. Concealment. The bad guys have changed data values to make them seem more normal, but this backfires because it can’t be done — normal is hard to fake.
  3. Manipulation. The bad guys have changed data values to try and force particular behavior from the knowledge discovery process, but this backfires because the manipulation creates its own signature.

Of these three factors, the second and third are often much more significant than the first. “The wicked flee when no man pursueth” (Proverbs 28:1, KJV).

What this blog is about

All of us leave traces in the data that we create, either intentionally or as a side-effect of the things we do in the world — walking in front of a CCTV camera, turning on a cell phone, or whatever.

Lots of this data is analyzed, for example by businesses that want to build a relationship with their customers.

I’m interested in the special case where some of the people about whom data is collected want to hide their existence, what they are like, and what they are doing, usually because they are up to no good.

In such situations, the way the data is collected, the way it is analyzed, and the way decisions are taken as a result all have to be rethought to take account of the adversarial nature of the situation.

I’m interested in how to do knowledge discovery in these adversarial situations, and this blog will talk about the issues, the technologies, and some of the known results.

Adversarial situations include:

  • crime;
  • fraud (medical, insurance);
  • money laundering;
  • organizational malfeasance;
  • industrial espionage;
  • national defence; and
  • counterterrorism.

What bad guys do in these situations has huge costs. The cost of terrorism is obvious, but it’s less well known that fraud costs an estimated 12% of GDP in developed economies.

Of course, the process of collecting and analyzing data is not necessarily benign, and many people have privacy concerns. We’ll talk about them too.