Posts Tagged 'knowledge discovery'

Call for Papers: Link Analysis, Counterterrorism and Security

The Call for the LACTS 2009 workshop is now available here.

The workshop takes place at the SIAM Data Mining Conference and brings together academics, practitioners, law enforcement, and intelligence people to talk about leading-edge work in the area of adversarial data analysis.

The workshop is intended primarily for early-stage work. The proceedings are published electronically, but authors may retain copyright.

The deadline for submissions is probably late December, but perhaps a little later (still being decided).

Knowledge Discovery for Counterterrorism and Law Enforcement

My new book, Knowledge Discovery for Counterterrorism and Law Enforcement, is out. You can buy a copy from:

The publisher’s website

Amazon.

(Despite what these pages say, the book is available or will be within a day or two.)

As the holiday season approaches, perhaps you have a relative who’s in law enforcement, or intelligence, or security? What could be better than a book! Or maybe you’d like to buy one for yourself.

(A portion of the price of this book goes to support deserving university faculty.)

More on Identity

In the two most recent posts, I mentioned the problem of figuring out when data records describe the same person. Casinos are required to ban certain people who have identified themselves as having a gambling problem, so they have to look carefully at everyone who books a room. They also, of course, have an interest in noticing when certain other people show up, for example card counters.

As I said yesterday, identity is a slippery thing to manage algorithmically. It’s only in the last century that governments have gotten into the act of certifying identity, via various forms of government-issued identification, going back to birth certificates.

Such documents are not necessarily very reliable. There’s a long history of forging them. But mostly identity gets fudged because people don’t use them directly — they copy names and addresses with characteristic human errors; and this process can be helped along by those who want to hide their identity. It’s socially acceptable to use variant names, and people constantly make mistakes with numbers. Those who want to can use these deniable mistakes to create multiple versions of their identities.

This is partly why there’s such an interest in biometrics. A biometric is an identity key that was given to you by God. The important distinction in biometrics is between a digital biometric and a non-digital one. A photo in a passport is a non-digital biometric — it can be used to associate the passport, and so its contents, with you, but doesn’t do much else. A digital biometric, such as a digitized photo, can act as a key to a large database of information about you.

Most biometrics are extremely easy to fool. You can read about some of the easy tricks here. Fingerprint scanners can be fooled by plastic wrap; iris scanners by printed photos of an iris.

In relationship/graph data, the problem with multiple records describing the same person is that they blur the structure of the connections around that person — making some paths seem longer, and some properties more diffuse. That’s why it’s important to be able to resolve identities when possible; but also why it’s important to stay agnostic over the long haul.
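Here is a minimal sketch of that blurring effect. The names, the edge list, and the alias table are all made up for illustration; the point is only that two records for one person can make a path look longer than it really is.

```python
from collections import deque

def shortest_path_length(edges, start, goal):
    """Breadth-first search over an undirected edge list; returns hops or None."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def merge_identities(edges, aliases):
    """Rewrite every edge so that alias names map to one canonical name."""
    merged = set()
    for a, b in edges:
        a, b = aliases.get(a, a), aliases.get(b, b)
        if a != b:
            merged.add((a, b))
    return sorted(merged)

# Hypothetical data: "J. Smith" and "John Smith" are the same person.
edges = [
    ("Alice", "John Smith"),
    ("J. Smith", "Bob"),
    ("Alice", "Carol"),
    ("Carol", "Dave"),
    ("Dave", "Bob"),
]
aliases = {"J. Smith": "John Smith"}

before = shortest_path_length(edges, "Alice", "Bob")
after = shortest_path_length(merge_identities(edges, aliases), "Alice", "Bob")
print(before, after)
```

With the two records kept apart, Alice reaches Bob only by the long way round; once they are merged, the shorter path through the single Smith node appears. Staying agnostic would mean keeping the alias table separate from the data, so that a merge can be undone later.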

Workshop on Link Analysis, Counterterrorism, and Security

If you’re interested in the content of this blog, and you live in the Atlanta area, you might be interested in coming to LACTS, the Workshop on Link Analysis, Counterterrorism, and Security. It’s being held on April 26th (Saturday) as part of the SIAM International Data Mining Conference. A one-day registration deal is available.

The proceedings will also be available online, both via my website and from SIAM after the workshop.

Here is the schedule:

0825-0830: Introduction
Antonio Badia and David Skillicorn

0830-0900: Detecting Hidden Passages in Documents
Saket S.R. Mengle and Nazli Goharian

0900-0930: Exploiting Sensitive Information in Background Mode using Latent Semantic Indexing
R. B. Bradford

0930-1000: Topic Detection Using Independent Component Analysis
Scott Grant, David Skillicorn, and James R. Cordy

1000-1030: Coffee Break

1030-1100: Using AI for Sensemaking in Investigative Analysis
Summer Adams, Ashok K. Goel, and Neha Sugandh

1100-1130: Vulnerability Assessment on Adversarial Organization: Unifying Command and Control Structure Analysis and Social Network Analysis
Il-Chul Moon, Kathleen M. Carley, and Alexander H. Levis

1130-1200: Torus Graph Inference for Detection of Localized Activity
Elizabeth A. Beer, Carey E. Priebe, and Edward R. Scheinerman

1200-1330: Lunch (on your own)

1330-1430: Workshop Keynote: “The Road to Link Intelligence”
Sherry Marcus, 21st Century Technologies.

1430-1500: Enhancing the Automated Analysis of Criminal Careers
Tim K. Cocx, Walter A. Kosters, and Jeroen F.J. Laros

1500-1530: Summarization and Information Loss in Network Analysis
Jamie F. Olson and Kathleen M. Carley

1530-1545: Summing Up
Antonio Badia and David Skillicorn

Which predictors can rank?

To be able to build a ranking predictor, there must be some way of labelling the training records with (estimates of) their ranks, so that this property can be generalised to new records. This is often straightforward, even if the obvious target label doesn’t map directly to a ranking.

There are six mainstream prediction technologies:

  1. Decision trees. These are everyone’s favourites, but they are quite weak predictors, and can only be used to predict class labels. So no use for ranking.
  2. Neural networks. These are also well-liked, but undeservedly so. Neural networks can be effective predictors for problems where the boundaries between the classes are difficult and non-linear, but they are horrendously expensive to train. They should not be used without soul searching. They can, however, predict numerical values and so can do ranking.
  3. Support Vector Machines. These are two-class predictors that try to fit the optimal boundary (the maximal margin) between points corresponding to the records from each class. The distances from the boundary are an estimate of how confident the classifier is in the classification of each record, and so provide a kind of surrogate ranking: from large positive numbers down to 1 for one class and then from -1 to large negative numbers for the other class.
  4. Ensembles. Given any kind of simple predictor, a better predictor can be built by: creating samples of the records from the training dataset; building an individual predictor from each sample; and then using the collection of predictors as a single, global predictor by asking each one for its prediction, and using voting to make the global prediction. Ensembles have a number of advantages, primarily that the individual predictors cancel out each other’s variance. But the number of predictors voting for the winning class can also be interpreted as a strength of opinion for that class, and so as a value on which to rank. In other words, a record can be unanimously voted normal, voted normal by all but one of the individual predictors, and so on.
  5. Random Forests. Random Forests are a particular form of ensemble predictor where each component decision tree is built making decisions about internal tests in a particularly robust and contextualized way. This makes them one of the most powerful prediction technologies known. The same technique, using the number of votes for the winning class, or the margin between the most popular and the next most popular class, can be used as a ranking.
  6. Rules. Rules are used because they seem intuitive and explanatory, but they are very weak predictors. This is mostly because they capture local structure in data, rather than global structure captured by most other predictors. Rules cannot straightforwardly be used to rank.

So, although ranking predictors are very useful in adversarial situations, they are quite difficult to build and use.
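To make the vote-counting idea from the ensemble item concrete, here is a rough sketch. The component predictor (a one-feature threshold set at the midpoint of the two class means), the training values, and the function names are my own assumptions for illustration; a real ensemble would use stronger components.

```python
import random

def train_stump(sample):
    """Fit a one-feature threshold: the midpoint between the two class means."""
    normal = [x for x, label in sample if label == 0]
    abnormal = [x for x, label in sample if label == 1]
    if not normal or not abnormal:      # degenerate sample, one class missing
        return None
    return (sum(normal) / len(normal) + sum(abnormal) / len(abnormal)) / 2

def train_ensemble(training, n_predictors, rng):
    """Build each component predictor from its own bootstrap sample."""
    stumps = []
    while len(stumps) < n_predictors:
        sample = [rng.choice(training) for _ in training]
        threshold = train_stump(sample)
        if threshold is not None:
            stumps.append(threshold)
    return stumps

def abnormal_votes(stumps, x):
    """Number of component predictors that vote 'abnormal' for value x."""
    return sum(1 for threshold in stumps if x > threshold)

# Hypothetical training data: (feature value, label), label 1 = abnormal.
training = [(1.0, 0), (2.0, 0), (2.5, 0), (3.0, 0), (4.0, 0), (5.0, 0),
            (6.0, 1), (8.0, 1), (9.0, 1)]
stumps = train_ensemble(training, n_predictors=25, rng=random.Random(0))

new_records = [2.0, 5.5, 6.2, 9.0]
ranked = sorted(new_records, key=lambda x: abnormal_votes(stumps, x), reverse=True)
for x in ranked:
    print(x, abnormal_votes(stumps, x), "of", len(stumps), "votes for abnormal")
```

The vote count is the ranking value: records near the top were called abnormal by nearly every component, records near the bottom by none, and the interesting ones sit in between.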

Second-stage predictors

If we have a ranking predictor, we can choose to put the boundary in a position where the false negative rate is zero (which is what we desperately need to do in an adversarial situation).

The side-effect of this, of course, is that the false positive rate is likely to be very large, so at first glance this seems crazy. But only if we think that the problem can be solved by a single predictor.

Suppose, instead, that we think of the problem as one to be solved in stages.

The first predictor is intended mostly to reduce the problem to a manageable size. The boundary is set so that the false negative rate is zero, knowing that this means that many, many records will be ranked as “possibly abnormal”.

However, the total number of records is now 20% of what it was initially. We can now apply a more sophisticated predictor, still trying to model and predict normality, but able to spend more resources per record, and use a cleverer algorithm.

Again the result is a ranking of the records, into which we can insert a boundary such that the false negative rate is zero. This is still extremely conservative, so that many records that are quite normal may still be ranked on the “possibly abnormal” side of the boundary, but the number of records remaining is smaller still. Even if only half the records can be safely discarded as normal, we have still reduced the total number of records to 10% of the original.

This staged approach can be continued for as many stages as required. Each new stage works with fewer records, and can be correspondingly more sophisticated. Eventually, the number of questionable records may become small enough that a human can do the final check.

Assembling predictors in sequence means that predictors whose individual accuracies are not perfect can be arranged so that the overall effect is close to perfect.
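The staged idea can be sketched in a few lines. Everything here is hypothetical: each record is a tuple of a cheap score, a more expensive score, and a known label, and the boundary at each stage is simply the lowest score given to any known-abnormal record. That guarantees zero false negatives only on the labelled data; in practice you would want some margin below it.

```python
def zero_false_negative_boundary(records, score_index):
    """Lowest score given to any known-abnormal record (label 1)."""
    return min(r[score_index] for r in records if r[2] == 1)

def run_stage(records, score_index):
    """Keep every record at or above the zero-false-negative boundary."""
    boundary = zero_false_negative_boundary(records, score_index)
    return [r for r in records if r[score_index] >= boundary]

# (cheap score, expensive score, label); label 1 = known abnormal.
labelled = [
    (0.10, 0.0, 0), (0.20, 0.1, 0), (0.30, 0.2, 0), (0.40, 0.1, 0),
    (0.50, 0.3, 0), (0.60, 0.2, 0), (0.70, 0.4, 0), (0.80, 0.9, 1),
    (0.65, 0.8, 1), (0.90, 0.5, 0),
]

stage1 = run_stage(labelled, score_index=0)   # cheap predictor, all records
stage2 = run_stage(stage1, score_index=1)     # costlier predictor, survivors only
print(len(labelled), "->", len(stage1), "->", len(stage2))
```

Both known-abnormal records survive both stages, while the expensive score is only ever computed for the records the cheap stage could not clear. With the numbers used in these posts, the same shape takes 1,000,000 records to 200,000 and then to 100,000.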

Of course, we have not yet quite solved the problem of detecting possible bad things. The pool of abnormal records contains both records associated with bad guys and records associated with other more ordinary kinds of abnormality (eccentricity).

At this point, two lines of attack remain. The first is to build a model of eccentricity, and use it to eliminate those records from the data. The other is to observe that most bad-guy actions require some kind of collective effort, so that bad-guy records are likely to form some kind of grouping, while those of eccentrics are likely to be one-offs. (Of course, this does not solve the problem of the lone-wolf bad guy, which is why serial killers can be so hard to detect.)

Ranking versus boundaries

Knowledge discovery is full of chicken and egg problems — typically it’s not clear how to set the parameters for an algorithm until you’ve seen the results it gives on your data.

For prediction, the problem of how to specify the boundary is of this kind. Suppose that we want to build a predictor for normality, so that each record will be classified as “normal”, or “possibly abnormal”. We will have some examples of each in our training data (that is records already labelled “normal” or “abnormal”), and typically there will be many more normal records than abnormal.

But how abnormal does a record have to be before we label it as abnormal? And what parameters should we give the algorithm that builds the predictor? Different decisions will have different effects on the false positive and false negative rates. If we move the boundary so that more records are predicted to be abnormal, we reduce the number of false negatives, but increase the number of false positives. There are techniques for making a good choice (using the so-called ROC or Receiver-Operating-Characteristic curve) but these aren’t very useful in an adversarial situation, where false negatives matter a lot. Moving the boundary means changing the parameters of the algorithm that builds the predictor, so every time we want to try another position, we have to rebuild the predictor.

A better way to think about the problem is that the goal is to rank the records from most abnormal to least abnormal. In other words, the knowledge-discovery technique does not actually build a predictor, but something close to it.

Once we have a ranking, we can decide where to put the boundary, but after we have seen the analysis of the data, rather than before. There is a twofold win: we have avoided the chicken and egg problem of having to set the parameters of the algorithm, and we can easily explore the effect of different choices of the boundary without retraining.
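As a small illustration of that second win, here is a sketch with invented scores and labels. The scores are computed once; after that, trying a new boundary is just a recount, with no retraining.

```python
def error_counts(scores, labels, boundary):
    """False positives and false negatives when score >= boundary means 'abnormal'."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= boundary and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < boundary and y == 1)
    return fp, fn

# Hypothetical abnormality scores from a ranking predictor, with true labels
# (1 = abnormal) for the same records.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 0, 1, 0, 0, 1, 1]

for boundary in (0.9, 0.7, 0.5, 0.4):
    fp, fn = error_counts(scores, labels, boundary)
    print(f"boundary={boundary:.1f}  false positives={fp}  false negatives={fn}")
```

Lowering the boundary from 0.9 to 0.4 drives the false negatives on this toy data from 2 to 0, at the cost of 2 false positives, which is exactly the trade described above.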

Building a ranking predictor is a little harder than building a plain predictor, but not by much. Predictors are divided into two kinds: classifiers that predict a class label (from a finite set), and regressors that predict a continuous value. Ranking requires the second kind of predictor.

Making prediction usable

In the previous post, I pointed out that the goal of prediction in adversarial settings is to prevent bad things happening, but it doesn’t work well to attack the problem directly.

The first important insight is to see that the problem is really about predicting normality. Rather than try and predict what bad guys might do, which might be any of a very long list of things, try and predict what normal people will do. This is a great deal easier because normality has regularities.

This isn’t obvious, but it happens because we are social beings. The things we do, and the way we do them, are constrained by the fact that other people are involved. Even something as simple as walking down the sidewalk is a highly constrained process — people move to the appropriate side when they approach someone without any conscious thought, and without any apparent signal. In fact, this works so well that when you go to a country where the standard side on which to pass someone is different (left vs right) both you and the natives will feel uncomfortable without knowing why — and it takes some people years to correct their behaviour to local norms.

But our social nature means that the structure of normality is quite robust, and so can be learned with acceptable accuracy by a predictor.

Of course, this hasn’t solved the problem. But it has reduced the scale of the problem immensely. If we have a million records, a fairly simple predictor of normality can predict that 800,000 of them are certainly normal, reducing the problem by a factor of 5.

The records that remain can now be processed by a second predictor which (a) has a different task, and (b) has to handle much less data.

Doing prediction in adversarial settings

The overall goal of prediction in adversarial settings is to stop bad things happening — terrorist attacks, fraud, crime, money laundering, and lots of other things.

People intuitively think that the way to address this goal is to try and build a predictor for the bad thing. A few moments’ thought shows that building such a predictor is a very difficult, maybe impossible, thing to do. So some people immediately conclude that it’s silly, or a waste of money, to try and address such goals using knowledge discovery.

There are a couple of obvious reasons why direct prediction won’t work. The first is that bad guys have a very large number of ways in which they can achieve their goal, and it’s impossible for the good guys to consider every single one in designing the predictive model.

This problem is very obvious in intrusion detection, trying to protect computer systems against attacks. There are two broad approaches. The first is to keep a list of bad things, and block any of them when they occur. This is how antivirus software works — every day (it was every week; soon it will be every hour) new additions to the list of bad things have to be downloaded. Of course, this doesn’t predict so-called zero-day attacks, which use some mechanism that has never been used before and so is not on the list of bad things. The second approach is to keep track of what has happened before, and prevent anything new from happening (without some explicit approval by the user). The trouble is that, although there are some regularities in what a user or a system does every day, there are always new things — new websites visited, email sent to new addresses. As a result, alarms are triggered so often that it drives everyone mad, and such systems often get turned off. Vista’s user authorization is a bit like this.
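A toy sketch of the two approaches shows both failure modes. The file names, host names, and both lists are made up for illustration and are not how any particular product works.

```python
KNOWN_BAD = {"worm.exe", "trojan.dll"}                   # hypothetical list of bad things
SEEN_BEFORE = {"mail.example.com", "news.example.org"}   # hypothetical history

def signature_alert(item):
    """First approach: alarm only on items already on the list of bad things."""
    return item in KNOWN_BAD

def novelty_alert(item):
    """Second approach: alarm on anything that has not been seen before."""
    return item not in SEEN_BEFORE

# (item, whether it is really bad)
events = [
    ("worm.exe", True),            # known attack
    ("new-exploit.bin", True),     # zero-day: not on any list yet
    ("blog.example.net", False),   # harmless, but new
    ("mail.example.com", False),   # harmless and familiar
]
for item, is_bad in events:
    print(item, is_bad, signature_alert(item), novelty_alert(item))
```

The list of bad things misses the zero-day item, and the novelty check raises an alarm on the harmless new website.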

The other difficulty with using direct prediction is making it accurate enough. Suppose that there are two categories that we want to predict: good, and bad. A false positive is when a good record is predicted to be bad; and a false negative is when a bad record is predicted to be good. Both kinds of wrong predictions are a problem, but in different ways. A false positive causes annoyance and irritation, and generates extra work, since the record (and the person it belongs to) must be processed further. However, a false negative is usually much worse — because it means that a bad guy gets past the prediction mechanism.

Prediction technology is considered to be doing well if it achieves a prediction accuracy of around 90% (the percentage of records predicted correctly). It would be fabulous if it achieved an accuracy of 99%. But when the number of records is 1 million, a misclassification rate of 1% is 10,000 records! With this many mistakes, the results would range from unusable to catastrophic.
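The arithmetic is worth spelling out. The function name below is mine; the numbers are the ones in the paragraph above.

```python
def misclassified(records, accuracy_pct):
    """Number of records predicted wrongly at a given percentage accuracy."""
    return records * (100 - accuracy_pct) // 100

records = 1_000_000
for accuracy_pct in (90, 99):
    print(f"{accuracy_pct}% accurate: {misclassified(records, accuracy_pct):,} wrong")
```

Even the fabulous 99% case leaves 10,000 wrong answers per million records, each one either an innocent person flagged or a bad guy missed.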

These problems with prediction have been pointed out in the media and in some academic writing, as if they meant that prediction in adversarial settings is useless. This is a bit of an argument against a straw man. What is needed is a more thoughtful way of thinking about how prediction should be done, which I’ll talk about in the next posting.

Zefra

A friend sent me a copy of the book Crisis in Zefra, a fictional story about a future peacekeeping/counterinsurgency deployment, intended as a longer-term scenario planning document.

It was written by Karl Schroeder and can be downloaded from his site here. The book was commissioned by the Canadian military to help think through the shape of future missions. It is set, notionally, in about 2025.

The book covers a short period of time in the run-up to elections in a sub-Saharan city, beginning with an apparently normal day, during which counterinsurgents mount a major attack.

The book attempts to extrapolate technologies that are in use today to the 20-year-out timeframe — with varying degrees of success. Overall the book is interesting and worth reading for anyone who’s interested in asymmetric warfare and associated technologies.

As with most scenarios, the most interesting aspects are the blind spots. There are, of course, endless possibilities for how particular technologies might develop and so there are many potential arguments about whether this or that technology will be usable in that time frame. But there seem to me to be a couple of more strategic issues:

  1. Although there is a strong awareness of data analysis and knowledge discovery as important technologies, they are regarded as analyst tools to be invoked for a specific purpose when something attracts an analyst’s attention. For example, the technology to go back through satellite imagery and track back the trajectories of trucks and other vehicles is imagined. However, there is not sufficient awareness that these technologies can be used in a less-supervised way: by analysing such data constantly and reporting anything that appears anomalous, for example. In fact, these technologies are often most effective when they are used in a symbiotic way, with analyst and analysis software working in a tightly-coupled way, in which both sides develop their “understanding” and “techniques”.
  2. As I’ve commented before, when talking about sensemaking, most asymmetric and counterinsurgency settings should be regarded as complex systems, not complicated ones (in the Cynefin sensemaking sense). This difference is not appreciated in the analysis — the underlying attitude is that understanding the threats of the city is a big, messy problem, but one that could be fully dealt with, given enough time and resources. There’s not enough awareness of the unexpected, and broad planning for contingencies (“unknown unknowns” in the late-lamented phrase).
  3. Granted the existence of the level of technology in the story, not enough attention is paid to its prophylactic use. For example, it’s assumed that each patrol is preceded by chemical sniffers that are able to detect the presence of explosives (the presumed solution to today’s huge problem of IEDs). However, given this technology, it would make sense to deploy it to provide barriers across which it would be difficult to take explosives at all, making it hard to get explosives into the city in the first place. Even if it is not practical to put a complete ring around the city, there are lots of benefits (as I’ve suggested in earlier discussions of defence in depth) to putting a border in place and looking hard at who tries to sneak around it.
  4. The one place where the technology extrapolation seems to be badly off is the assumption that the counterinsurgents will have biological weapons that can target anyone not a long-term inhabitant of the city because of their bone composition. This, it turns out, depends on what’s in the water that people drink over time. The technology to make these kinds of assessments exists: there have been stories about police using them to determine where a body has come from; and no doubt we’ll see an episode of Bones that depends on this soon. But it’s a huge step from being able to detect such things in the lab to being able to develop a biological vector that could be based on them. Not only is this probably too big a step for 20 years, but it’s also probably always going to be a bad idea because of the probability of unintended consequences.
  5. Maybe this is a blind spot for me, but the book assumes that patrols will continue to be an essential part of a deployment. And this in a world of semiautonomous devices that are at least as well equipped as an infantryman. I continue to be puzzled about the point of sending out patrols as a routine thing in Afghanistan. I can’t see any point to them in a city environment. Obviously, there are reasons for the military to go outside the perimeter of a secure area, but I don’t see the point of doing so to “show the flag” or any other diffuse goals.

These are not meant as criticisms of the book, which is an interesting and thought-provoking read.

Designing a passenger screening system

From the last two posts, it’s clear that a robust passenger screening system is actually a two-stage process. The no-fly list acts as a first stage. Its role is to eliminate some people from consideration (by self selection, because they think they may be on the list); and to cause others to take steps to conceal themselves or manipulate the process.

The second stage is a risk-management system whose primary target is people who have not yet been detected as terrorists, and who know that they haven’t, and amateurs. This is still potentially a large group of people, since there’s lots of grass roots terrorism about.

In designing this second stage, it’s important to think about what the threat model actually is. Actually the problem is harder than it looks, because it’s not possible to come up with every possible way in which a terrorist could cause an incident involving an aircraft. So the threat model has to play the odds, to some extent, and concentrate on likely threats.

As a general rule, it’s a good idea to build a model of normality, rather than a model of threat — so that everyone who doesn’t fit the model of normality is subject to extra scrutiny. This is hard on people who are, for whatever reason, different from the majority of others; but this is the price that has to be paid for broad-spectrum defence against many different kinds of attacks. (This is a civic problem, and suggests attention to models that try to distinguish ordinary abnormality from dangerous abnormality — which is possible, but not yet looked at very hard.)

The CAPS I system in the US was, mostly by accident, as much a normality model as a threat-based model. People were subjected to extra scrutiny if they met criteria that were loosely associated with terrorism: buying tickets for the front of the plane, paying cash, and buying one-way tickets. There’s some argument for these as relevant attributes; but it’s just as sensible to argue that these properties are quite rare and so the model is selecting for abnormality. One unfortunate side-effect of this model is that it also selects the wealthy, who found themselves facing extra screening often.

The other weakness of the CAPS I model was that the anticipated threat was too narrow, so the extra scrutiny was not broad enough. At the time, hijacking was considered a minor threat, while bombs in checked luggage were considered more likely (and more dangerous). As a result, several of the Sep 11th hijackers were flagged by the system and their checked luggage examined more carefully. Suicide bombing was not considered part of the threat.

The short-lived CAPS II program was more explicitly designed to model normality, with its concept of “rooted in the community”. The reason for checking commercial databases, as well as data directly related to the flight itself, was that someone who had an extensive credit record, owned a house, etc. could plausibly be considered a low risk, partly because the cost of creating such a track record would put it out of reach of many would-be terrorists.

This program raised many red flags with the public, partly because it was ill thought out, partly poorly presented, and partly because of mistakes made in other areas that led citizens not to trust government very much. (Note that the level of discomfort with businesses aggregating the same information is much lower.)

The Secure Flight program, the latest passenger screening program in the US, is a much simpler watchlist matching program. It also includes a built-in redress mechanism, and has been developed with extensive consultation. However, it is a much weaker program, relying on the completeness and integrity of the watch lists in use.

TSA discussion of Secure Flight

The requirement to use the system to detect criminals wanted for certain kinds of crimes has also been dropped.

What are no-fly lists for?

There’s a great deal of confusion in the discussion of airline security because several things are going on at once.

It’s sort of clear what the point of passenger screening is. The goal is to prevent anyone from carrying out an attack, either a hijacking or a bombing. The mechanism is to make sure that nobody can take the required tools or devices onto a flight.

Why is there an analysis component? Why not simply search everyone in exactly the same way, to make sure that they aren’t carrying anything they shouldn’t be? It’s a question of managing the costs and the risks. Analysing data about potential passengers allows them to be placed in categories. Each category consumes a different amount of resources to check; the amount is related to the perceived risk of people in that category.

Of course, there are advantages to not making these categories too rigid or predictable, as I discussed in the previous post.

This idea of risk management is a sensible one. But, given this framework, what is the point of a no-fly list? If I can be absolutely sure that a terrorist who gets on a plane does not have the ability to do anything different from the other passengers, is there any reason not to let him fly?

There are two reasons why a no-fly list might still be a good idea:

  1. It’s actually impossible to be sure that someone who gets on a plane cannot do anything destructive, no matter how much time is spent on checking beforehand. Even the Israelis, who are no slouches when it comes to airline security, and who’ve been doing it a long time, cannot guarantee that someone is actually innocuous. It’s not clear what the issues are, but it seems at least possible to make a swallowable IED, or for a group to take on board objects that are individually innocuous, but together could make something nasty.
    As a matter of practicality, it’s also the case that people who are highly motivated to carry out an attack are not worth the resources to screen completely. In other words, the point of passenger screening is to act as a safety net, catching people who aren’t known to be terrorists, either terrorists not yet discovered, or people who’ve suddenly snapped.
  2. A no-fly list puts barriers in the way of terrorists, making it hard for them to move and meet. This only makes sense for domestic travel, since borders already perform this function for international travel. In other words, once you become a terrorist you cut yourself off from moving freely around a country, because you can no longer use air transportation. Obviously, this matters more in large countries, and those with lots of islands or other physical barriers, than in small countries like those in Europe.

The biggest worry, and problem, with no-fly lists is their false positive rate, that is the number of innocent people who are mistakenly identified as terrorists and prevented from flying. This is partly a self-inflicted problem by governments which rushed to implement no-fly lists for largely cosmetic reasons, without taking the time to think about and implement reliable lists to begin with.

However, constructing and maintaining such lists is difficult because it requires making secret information widely available, which is not the way to keep it secret. (The list has to be at least partly secret because covert means might have been used to put some people on it; if they knew they were on it, they might be able to work out how.) There are technical ways to check whether someone is on a list without having to make the list public, using encryption techniques, but this idea has not, afaik, been used.
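To give a flavour of what such a check might look like, here is a deliberately simple stand-in: the list is distributed as keyed hashes (HMAC) of normalised names rather than as names. This is my own illustration, not a description of any deployed system, and it is much weaker than real private-matching protocols, since anyone holding the key can test guesses against the list. The names, the key, and the function names are all hypothetical.

```python
import hashlib
import hmac

def blind(name, key):
    """Keyed hash of a normalised name; the clear-text name is never stored."""
    normalised = " ".join(name.lower().split())
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()

def is_listed(name, blinded_list, key):
    """Check a name against the blinded list without seeing the list in clear."""
    candidate = blind(name, key)
    return any(hmac.compare_digest(candidate, entry) for entry in blinded_list)

# Hypothetical key, shared only with the checking stations.
key = b"example-shared-secret"

# The list owner blinds the names once and distributes only the hashes.
blinded_list = [blind(n, key) for n in ("John Doe", "Richard Roe")]

print(is_listed("  john   DOE ", blinded_list, key))
print(is_listed("Jane Roe", blinded_list, key))
```

Note that this does nothing about the identity problem: a variant spelling hashes to something completely different, so exact matching on blinded names is even more brittle than matching on clear ones.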

The second big problem with no-fly lists is that they are lists of identities, and it’s quite hard to robustly establish the identity of the person standing in front of you, if they are motivated not to make it easy. That’s why there are such high levels of interest in biometrics: they provide a way to link some property of your physical presence to other data about you, and so to establish your identity. As a result, identity theft is big business; according to Bruce Schneier, a bigger business than drugs in the U.S.

Governments have also piggybacked law enforcement onto no-fly lists, so that people who are wanted for crimes of sufficient seriousness can be added to such lists. Whether or not this is a good idea is a complex subject; but it does make ordinary citizens suspicious about how much mission creep is a factor in government programs aimed at detecting and preventing terrorism. More another time, perhaps.

Airline passenger screening

The so-called Carnival Booth algorithm shows the weakness of airline (and other) passenger screening.

Suppose that I’m a bad guy and I want to carry out an attack. I start with 100 possible attackers. I send them on flights around the country, in which they behave completely innocently. Anyone who ever gets pulled over for extra screening is removed from the pool.

Eventually my pool of 100 is reduced to a much smaller number, but I can have high confidence that the remaining members will not receive any extra screening when they travel. And now I can use them to carry out an attack, with high probability that they will not be given special consideration. (Of course, they will still receive normal screening, so my attack will have to involve mechanisms that are not normally detected.)

This is why randomization helps — if the criteria for extra screening are randomized slightly from hour to hour, or perhaps some people are selected completely at random, then I can never be sure that someone who hasn’t been selected yet, will not be selected next time, when they may be travelling less innocently.
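The argument can be checked with a small simulation. This is only a sketch under stated assumptions: each agent is given a fixed, hypothetical "profile score", the screening rule is a simple threshold plus an optional random draw, and all names and parameters are made up for illustration.

```python
import random

def screened(score, threshold, random_rate, rng):
    """Extra screening if the profile score exceeds the threshold, or by random draw."""
    return score > threshold or rng.random() < random_rate

def probe(scores, threshold, random_rate, flights, rng):
    """Return the agents who were never screened over `flights` innocent trips."""
    survivors = []
    for score in scores:
        if not any(screened(score, threshold, random_rate, rng) for _ in range(flights)):
            survivors.append(score)
    return survivors

def attack_screen_rate(survivors, threshold, random_rate, trials, rng):
    """Fraction of later (attack) flights by survivors that draw extra screening."""
    if not survivors:
        return None
    hits = 0
    for _ in range(trials):
        hits += screened(rng.choice(survivors), threshold, random_rate, rng)
    return hits / trials

if __name__ == "__main__":
    rng = random.Random(0)
    pool = [rng.random() for _ in range(100)]  # 100 possible attackers
    for rate in (0.0, 0.1):
        left = probe(pool, threshold=0.7, random_rate=rate, flights=10, rng=rng)
        print(rate, len(left), attack_screen_rate(left, 0.7, rate, 5000, rng))
```

With no random component, the survivors of the probing phase are never screened again. With even a modest random rate, the pool shrinks faster and the survivors still face the same chance of selection on the flight that matters.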

The corollary to randomization is that some people who are “obviously” innocent will sometimes be given extra screening. People tend to see this as silliness or a waste of resources, but it is actually a good use of resources. This case needs to be explained to the public more clearly.

The Carnival Booth algorithm was first explained by some students from MIT.

Does it help bad guys to know how knowledge discovery works?

The temptation faced by people who develop new ways to carry out knowledge discovery is to keep it all secret so that the bad guys can’t figure out what’s being done and so evade detection.

From the discussion in the last three posts I hope that you can see the weakness of this intuitively appealing idea. First of all, security by obscurity (that is, keeping the details of a system hidden as a way of keeping it secure) is just a bad idea. The problem is that you, the good guy, don’t necessarily find out when the bad guys figure out your system, and so they can lead you by the nose until you do. And no matter how obscure you try to be, the bad guys have a strong motivation to try and figure out your system and so, in the end, they will.

Security by obscurity does help to find amateurs and incompetents, but there are so few of these in settings where it matters that it’s not a good idea to treat them as the main problem.

In fact it’s a good idea if the bad guys know something about the kind of knowledge discovery that is being used against them. Why? Because it encourages them to try and use concealment and manipulation; and these, as we have seen, create signatures that often make them easier to discover than if they had not bothered.

Pickpockets used to use this technique. In a crowd, one member of the gang would yell out “Beware of pickpockets” (note the same counterintuition). The result: everyone would put their hands on their wallets, showing the other members of the gang where they were. As good guys, we can use the same technique. Every knowledge discovery system is improved by treating it as a two-stage problem. The first stage is something like a big sign saying “Knowledge discovery in progress”, and the second stage is actual knowledge discovery, tuned to watch for concealment and manipulation.

That’s not to say that bad guys need to know everything about how a knowledge-discovery system works. One good way to introduce uncertainty is to include some randomization, so that the results for the same record might vary from time to time. This makes it hard for a bad guy to learn the exact attribute values that will cause problems.

Although randomization has some attractive properties, it is politically difficult because it means that some records escape scrutiny that they might have received on other occasions. If one of these records turned out to be a significant false negative, there would be repercussions.

Looking for Bad Guys III: Using manipulation

Bad guys who are aware that knowledge-discovery tools will be used to look for them may also try to actively manipulate the process to their own advantage.

One way to do this is to get an insider working for them, someone who can alter the data or the results of the analysis to their benefit. This is probably the most common method: over all of history, probably more sieges have been successful because someone opened the gates from the inside than because the walls were broken through. It’s easy to get caught up in the cleverness of technology and forget that sometimes suborning someone is the easiest attack.

However, the focus of this blog is knowledge discovery, so let me concentrate on that. Before we talk about how manipulation can be exploited as a discovery tool, we need to talk about what manipulation looks like; and before we can do that, we need to think about the structure of the knowledge-discovery process.

It’s helpful to divide up the stages of knowledge discovery into:

  1. Collecting the data (CCTV images, transaction logs);
  2. Analysing the data (the part that’s usually thought of as the heart of knowledge discovery);
  3. Deciding on what to do with the results and taking action.

Although an adversary can only attack the process via the data that is collected (assuming they don’t have an insider), it is helpful to think of three different kinds of attacks, directed against each of the three stages. The different attacks require understanding different aspects of the knowledge-discovery system.

Manipulating the data collection stage is probably the easiest, because it’s often possible to see and understand how the data is being collected. For example, the fields of view of CCTV cameras can usually be inferred from their positions (even if they are enclosed in black plastic bubbles) and so ways to move around them without coming into view can be worked out. Alternatively, disguises can be used to conceal who is being seen, even though an image is captured. One of the reasons identity theft is a big business is that it provides a way to have data captured about you, but data that is useless because it doesn’t connect to the real you.

Manipulating the decision and action stage is done using social engineering. This means trying to create the impression in the minds of the people who are making the decisions and taking the actions that the analysis system has made an error.

Manipulating the analysis stage is surprisingly easy, and easier than it should be. This is because most knowledge-discovery technology has been tuned to give good results in data with natural variation. This gives an opportunity to insert data that is the worst possible from the point of view of the algorithms, and so enable bad guys to hide their traces.
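A toy example shows how little it can take. The sketch below assumes a detector that flags a record when it lies more than three standard deviations from the mean; the data, the threshold, and the function names are hypothetical. A handful of inserted decoy records inflates the standard deviation enough to hide a record that was previously obvious. The median-based score is a standard robust alternative that I have swapped in for comparison, not something taken from any particular system.

```python
import statistics

def zscore(value, data):
    """Classical score: distance from the mean in standard deviations."""
    return (value - statistics.fmean(data)) / statistics.pstdev(data)

def robust_score(value, data):
    """Distance from the median in units of the median absolute deviation (MAD).
    Assumes the MAD is non-zero, which holds for the sample data below."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    return (value - med) / mad

# Hypothetical transaction amounts with natural variation around 100.
CLEAN = [95, 98, 100, 101, 102, 99, 97, 103, 100, 105] * 10
# The adversary inserts ten extreme decoy records.
POISONED = CLEAN + [400] * 10
SUSPECT = 130

if __name__ == "__main__":
    print(abs(zscore(SUSPECT, CLEAN)) > 3)            # suspect stands out
    print(abs(zscore(SUSPECT, POISONED)) > 3)         # decoys mask it
    print(abs(robust_score(SUSPECT, POISONED)) > 3)   # robust score still flags it
```

Note that the decoys themselves are wildly abnormal, which is one small instance of manipulation leaving its own signature.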

The technology used for knowledge discovery needs to be completely rethought to take manipulation into account. This is primarily why adversarial knowledge discovery is not just another application of knowledge discovery, but a completely different problem.

The good part about this is that attempts at manipulation also create an abnormal signature in the data; and the process can be tuned to look for this signature as well.

Looking for bad guys II: Using concealment

Bad guys will take steps to hide their traces in data, unless they’re very naive, or knowledge-discovery tools have never been applied in their particular domain before.

At first glance, it might look as if this kind of concealment might make the task of finding them harder. In fact, the opposite is true — doing anything to try and look more normal runs a serious risk of looking more abnormal (see the previous post for why this matters).

Hercule Poirot has made this point. He often says that murderers are not content to leave things alone, and it is their attempts to make detection harder that, in the end, reveal who they are.

If a bad guy wants to create data values that look more normal than they would otherwise be as the result of whatever action he is doing, he has two problems:

  • what are the normal values; and
  • can values close to them be generated?

Knowing normal values is harder than it looks. In a sense, such values are knowable, but the risk is that the more the issue is thought about, the more likely a person is to go into an infinite loop of improving the values. What’s the latest time that it’s acceptable to phone someone you’ve met at a party and make it seem casual? The first part is easy to work out, but it gets much harder with the extra qualification.

When more than one value has to be set appropriately, the problem becomes much harder because, in normal data, the values of different attributes are correlated. It is therefore possible to set two attributes each to plausible individual values, but still create an anomaly because these values rarely occur together. The observation that the values are usually correlated is also the explanation for how they come to be that way in normal data. People who don’t think about the values naturally produce the observed correlations.
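The effect of correlation is easy to see with two attributes. The following sketch uses the Mahalanobis distance, one common way to measure how far a point is from the joint pattern; the sample data and attribute meanings are invented for illustration, and the code assumes the two attributes are not perfectly collinear.

```python
import math
import statistics

def mahalanobis_2d(point, data):
    """Distance of `point` from the centre of `data`, scaled by the 2x2 covariance."""
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = statistics.pvariance(xs)
    syy = statistics.pvariance(ys)
    sxy = statistics.fmean([(x - mx) * (y - my) for x, y in data])
    det = sxx * syy - sxy ** 2  # zero only if the attributes are perfectly collinear
    dx, dy = point[0] - mx, point[1] - my
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical records: (calls per day, distinct contacts), where the second
# attribute is roughly half the first, plus or minus one.
SAMPLE = [(x, 0.5 * x + (1 if i % 2 == 0 else -1)) for i, x in enumerate(range(10, 31))]

if __name__ == "__main__":
    print(mahalanobis_2d((28, 14), SAMPLE))  # follows the usual pattern
    print(mahalanobis_2d((28, 6), SAMPLE))   # each value plausible, pair is not
```

Here 28 calls a day is within the usual range, and so is 6 distinct contacts, but the pair sits far from the joint pattern and so gets a much larger distance than a pair that respects the correlation.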

Even if the desired values were known, it may be hard to generate them. If they are usually the result of unconscious processes, then faking them is hard. This is true of speech and directly-observed action. People who can create the illusion that unnatural speech and action are actually natural are called actors, and are highly paid for this skill. I pointed out in an earlier post that faking numbers is difficult because there’s a digit distribution in actual numbers which is not reproduced in faked ones.

So efforts by bad guys to conceal themselves in data, by making their data look more like normal data than it ordinarily would, create an opportunity to look for them: we can look for the signature of concealment as well as the signature of whatever bad things we would already be looking for.

Of course, this does make the challenge of distinguishing normality from the unusual more difficult. But it suggests that, in some sense, we expect to see the following structure in data: large clusters of normal records, small clusters of bad-guy records quite close to them; and then single-record outliers or very small clusters corresponding to eccentrics, much further away from the normal clusters.

Looking for bad guys I: Using difference

Given that the traces left by bad guys are of three kinds:

  1. Difference;
  2. Concealment; and
  3. Manipulation

there are many implications for the way in which knowledge-discovery systems should be designed, and I’ll talk about several of them in upcoming posts.

Let’s start by thinking about difference, which is the most obvious quality of bad guys, and the only one most people think about. Bad guys are doing something that is different from most other people (although not in every setting — when we looked at Enron emails, deceptiveness was so common that it was the mainstream culture; the place resembled a pirate ship).

The problem is that lots of ordinary people are doing things that are different as well. In fact, the properties that we see as normal are actually averages over lots of individual behavior. Each person doesn’t think of making ‘normal’ choices or acting ‘normally’ — they just do whatever they decide to do.

From this point of view, it’s surprising that such a thing as normality exists. And in fact it only does in certain circumstances, and so in certain kinds of data.

Consider tastes in music. Twenty years ago, the only music that most people could easily listen to was what was played on the radio, and the slightly larger set that was available to buy at music stores. Radio stations played particular kinds of music, but there were only a smallish number of particular kinds: rock, pop, classical, oldies, etc. In this situation, it’s clear that most people’s tastes resembled those of the people around them; they could hardly not.

Today, the music world has changed completely. The existence of Internet radio and satellite radio makes it easy for people to experience different kinds of music. Anything that catches their attention can be followed up by downloading more material, including from artist websites that exist independently of the filtering effect of labels. Not surprisingly, tastes no longer cluster in the same way they once did, and so there isn’t any sense of ‘normal’ taste in music as there once would have been. (Anderson’s book, The Long Tail, is a good discussion of this issue. Note also that the use of tags recreates a kind of clustering, as people develop their individual tastes by following the tastes of others.)

The point is that situations where there are technical and/or social constraints make people and their actions look alike, even though all of their individual decisions are independent and free. If the set of possible decisions is small, there are fewer ways to be different. If the set of decisions is constrained by the social connections between people, it is more uncomfortable to make different decisions. This works in lots of small ways. Most people have a sense of what times of day are appropriate to telephone other people; and this has the side-effect of creating patterns of normality in call data, for example.

So, in many kinds of data, it does make sense to talk about normality. Bad guys are forced to act in certain ways as part of their activities, and these actions, some of them at least, will deviate from normality. The problem is that there are also other people whose actions deviate from normality, just because. Perhaps they are eccentric, or socially inept.

This creates a problem, but not an insuperable one. What it does mean is that, rather than thinking of the problem as modelling bad guys, we think instead of the problem as modelling normality.

If we know what normality looks like, we can keep all of the records that don’t fit that model. This doesn’t finish the job, because what’s left is a mixture of bad-guy records and, for want of a better word, eccentric records.

But we have made significant progress on the overall problem because the size of the data has been reduced, probably by a very large amount. Now the problem becomes distinguishing bad guys from eccentrics. And that’s a reasonable problem.
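As a minimal sketch of modelling normality and keeping what does not fit, take the call-data example: learn which hours of the day account for a meaningful share of calls, then keep only the records outside those hours. The record layout, the `min_share` cut-off, and the function names are assumptions made for illustration.

```python
from collections import Counter

def fit_normal_hours(call_hours, min_share=0.02):
    """Model normality as the set of hours that carry at least `min_share` of calls."""
    counts = Counter(call_hours)
    total = len(call_hours)
    return {hour for hour, count in counts.items() if count / total >= min_share}

def residual(records, normal_hours):
    """Keep only the records that the model of normality does not explain."""
    return [r for r in records if r["hour"] not in normal_hours]

if __name__ == "__main__":
    history = [9] * 40 + [12] * 30 + [18] * 29 + [3] * 1
    normal = fit_normal_hours(history)
    calls = [{"id": "a", "hour": 9}, {"id": "b", "hour": 3}, {"id": "c", "hour": 18}]
    print(residual(calls, normal))  # only the 3 a.m. call remains
```

What remains is a much smaller set that still mixes bad guys with eccentrics (and shift workers), so a second, more careful stage is needed to separate them.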

What do the traces of bad guys look like?

The presumption of knowledge discovery in adversarial settings is that the traces left by bad guys are somehow different from those of ordinary people doing ordinary and innocuous things. What would those differences be?

There are three main sources of difference.

The first, and most obvious, is that to be a bad guy means to do things that most people don’t do. Given that reasonable attributes are collected, this almost inevitably means that there will be some differences between data records associated with bad guys and those associated with ordinary people.

This does depend on collecting data in which the differences are actually there. For example, if you’re looking for tax fraud, it’s probably useless to collect data about people’s height, since it’s unlikely that tax fraud is associated with being tall, or being short.

So the first thing we expect is that there will be differences between bad-guy data and normal data because of the requirements of whatever bad things they are doing.

People tend to think that this will be the most important, perhaps the only, difference. This isn’t true, except if the bad guys are exceptionally stupid, or knowledge discovery is being applied for the first time and the bad guys did not anticipate its use. This does happen — first attempts to find fraud are often spectacularly successful because they look at aspects of the data that haven’t been examined before.

However, smart bad guys will anticipate the use of knowledge discovery and so they will try, as much as possible, to make the values of the data collected about them seem as innocuous as possible.

So the second difference between bad-guy data and normal data is that bad-guy data is characterized by concealment. But isn’t concealment the absence of a difference? It turns out that, by and large, concealment itself generates a characteristic signature. Knowledge discovery techniques can be tuned to look for this signature, and so preferentially for bad guys.

How is it that concealment creates its own signature? Because, as humans, we’re lousy at creating, artificially, anything that we also do naturally.

Think about having a conversation. We can all do it easily. But make it a conversation in front of lots of people, say giving a speech or acting in a play, and suddenly the way we do it changes: voice tremor, hesitation, strange speech rhythms, and so on.

The same phenomena happen when people try to construct data to resemble data that otherwise arises naturally. For example, when people create a fictitious number (for example, on a tax return or in accounts), its digits tend not to follow the distribution found in naturally occurring numbers, which is described by Benford’s Law. This has been used to detect potentially fraudulent tax returns.
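Benford’s Law says that, in many naturally occurring collections of numbers, the leading digit d appears with probability log10(1 + 1/d), so about 30% of values start with 1 and fewer than 5% start with 9. The sketch below compares observed leading digits against that expectation with a chi-squared statistic. The function names and the two test sets are illustrative assumptions, and a real audit would need far more care about which data Benford’s Law applies to.

```python
import math
from collections import Counter

def benford_expected(digit):
    """Expected share of values whose leading digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)

def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])

def benford_chi2(values):
    """Chi-squared statistic of observed leading digits against Benford's Law.
    Values above roughly 15.5 (8 degrees of freedom, 5% level) are suspicious."""
    digits = [first_digit(v) for v in values if v > 0]
    n = len(digits)
    counts = Counter(digits)
    total = 0.0
    for d in range(1, 10):
        expected = n * benford_expected(d)
        total += (counts.get(d, 0) - expected) ** 2 / expected
    return total

if __name__ == "__main__":
    natural = [2 ** k for k in range(100)]                   # powers of 2 follow the law closely
    invented = [d * 100 + 37 for d in range(1, 10)] * 22     # every leading digit equally often
    print(benford_chi2(natural), benford_chi2(invented))
```

Numbers whose leading digits are spread evenly, which is what people tend to produce when inventing values, score far above the threshold, while the powers of 2 score well below it.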

The third difference between bad-guy traces and ordinary data is created when bad guys actively try to manipulate the knowledge discovery process.

Perhaps surprisingly, they don’t have to know a lot about exactly which knowledge discovery process is being used, and exactly how it works, to have a shot at this. But, again, the knowledge discovery process can be trained to look for the signature of manipulation.

In summary, there are three main ways in which the traces of bad guys differ from those of ordinary data:

  1. Difference. The bad-guy data is unusual, perhaps even outlying in the dataset.
  2. Concealment. The bad guys have changed data values to make them seem more normal, but this backfires because it can’t be done — normal is hard to fake.
  3. Manipulation. The bad guys have changed data values to try and force particular behavior from the knowledge discovery process, but this backfires because the manipulation creates its own signature.

Of these three factors, the second and third are often much more significant than the first. “The wicked flee when no man pursueth” (Proverbs 28:1, KJV).

Knowledge discovery — good or bad?

Most people have some awareness that computer algorithms can be used to extract useful knowledge from large amounts of data. This is the basis of customer relationship management, which is used by many businesses to evaluate (?improve) the quality of their interactions with their customers, both individuals and other businesses. This way of extracting knowledge is called knowledge discovery or data mining.

Most people have some intuitive idea of how this might work — after all humans are extremely good at extracting knowledge from certain kinds of data themselves. However, people tend to jump quite quickly to one of two diametrically opposite assumptions about how knowledge discovery works.

The first is a dystopian view: knowledge extraction technology can be used to learn everything about individuals from their social gaffes to their deepest thoughts. With this kind of power, governments will be unable to resist and will use knowledge discovery as a tool for control, in the style imagined in 1984. A variation on this theme is that knowledge discovery only looks effective and so will seduce governments and others into spending vast amounts of money and collecting huge datasets without any payback.

The second is a utopian view — knowledge extraction technology will make every interaction as efficient as possible, and will prevent all of the bad things in the world from happening.

The truth, of course, is somewhere between these two extremes. There are many powerful things that knowledge discovery can do, some of them non-obvious; but this requires careful thought about the process, and, potentially, considerable cost. We are a long way from using knowledge discovery to improve the collection of library fines.

There are serious issues around the intrusiveness of data collection for knowledge discovery. Many of these issues are less difficult and more manageable than they appear on the surface. The question of whether knowledge discovery is good or bad is more nuanced than almost all of the discussion about it would suggest. Stay tuned.

What this blog is about

All of us leave traces in the data that we create, either intentionally or as a side-effect of the things we do in the world — walking in front of a CCTV camera, turning on a cell phone, or whatever.

Lots of this data is analyzed, for example by businesses that want to build a relationship to customers.

I’m interested in the special case where some of the people about whom data is collected want to hide their existence, what they are like, and what they are doing, usually because they are up to no good.

In such situations, the way in which the data is collected, and then analyzed, and then the decisions that are taken as a result have to be rethought to take account of the adversarial nature of the situation.

I’m interested in how to do knowledge discovery in these adversarial situations, and this blog will talk about the issues, the technologies, and some of the known results.

Adversarial situations include:

  • crime;
  • fraud (medical, insurance);
  • money laundering;
  • organizational malfeasance;
  • industrial espionage;
  • national defence; and
  • counterterrorism.

What bad guys do in these situations has huge costs. The cost of terrorism is obvious, but it’s less well-known that fraud costs an estimated 12% of GDP in developed economies.

Of course, the process of collecting and analyzing data is not necessarily benign, and many people have privacy concerns. We’ll talk about them too.