Posts Tagged 'prediction'

What causes extremist violence?

This question has been the subject of active research for more than four decades. There have been many answers that don’t stand up to empirical scrutiny — because the number of those who participate in extremist violence is so small, and because researchers tend to interview them but fail to interview all of those just like them who didn’t commit violence.

Here’s a list of the properties that we now know don’t lead to extremist violence:

  • ideology or religion
  • deprivation or unhappiness
  • political/social alienation
  • discrimination
  • moral outrage
  • activism or illegal non-violent political action
  • attitudes/belief

How do we know this? Mostly because, if you take a population that exhibits any of these properties (typically many hundreds of thousands of people), you find that one or two have committed violence, but the others haven’t. So properties such as these have absolutely no predictive power.

On the other hand, there are a few properties that do lead to extremist violence:

  • being the child of immigrants
  • having access to a local charismatic figure
  • travelling to a location where one’s internal narrative is reinforced
  • participation in a small group echo chamber with those who have similar patterns of thought
  • having a disconnected-disordered or hypercaring-compelled personality

These don’t form a diagnostic set, because there are still many people who have one or more of them, and do not commit violence. But they are a set of danger signals, and the more of them an individual has, the more attention should be paid to them (on the evidence of the past 15 years).

You can find a full discussion of these issues, and the evidence behind them, in “Terrorists, Radicals, and Activists: Distinguishing Between Countering Violent Extremism and Preventing Extremist Violence, and Why It Matters” in Violent Extremism and Terrorism, Queen’s University Press, 2019.

 

“But I don’t have anything to hide”

This is the common response of many ordinary people when the discussion of (especially) government surveillance programs comes up. And they’re right, up to a point. In a perfect world, innocent people have nothing to fear from government.

The bigger problem, in fact, comes from the data collected and the models built by multinational businesses. Everyone has something to hide from them: the bottom line prices we are willing to pay.

We have not yet quite reached the world of differential pricing. We’ve become accustomed to the idea that the person sitting next to us on a plane may have paid (much) less for the identical travel experience, but we haven’t quite become reconciled to the idea that an online retailer might be charging us more for the same product than they charge other people, let alone that the chocolate bar at the corner store might be more expensive for us. If anything, we’re inclined to think that an organisation that has lots of data about us and has built a detailed model of us might give us a better price.

But it doesn’t require too much prescience to see that this isn’t always going to be the case. The seller’s slogan has always been “all the market can bear”.

Any commercial organization, under the name of customer relationship management, is building a model of your predicted net future value. Their actions towards you are driven by how large this is. Any benefits and discounts you get now are based on the expectation that, over the long haul, they will reap the converse benefits and more. It’s inherently an adversarial relationship.

Now think about the impact of data collection and modelling, especially with the realization that everything collected is there for ever. There’s no possibility of an economic fresh start, no bankruptcy of models that will wipe the slate clean and let you start again.

Negotiation relies on the property that each party holds back their actual bottom line. In a world where your bottom line is probably better known to the entity you’re negotiating with than it is to you, can you ever win? Or even win-win? Now tell me that you have nothing to hide.

[And, in the ongoing discussion of post-Snowden government surveillance, there’s still this enormous blind spot about the fact that multinational businesses collect electronic communication, content and metadata; location; every action on portable devices and some laptops; complete browsing and search histories; and audio around any of these devices. And they’re processing it all extremely hard.]

Time to build an artificial intelligence (well, try anyway)

NASA is a divided organisation, with one part running uncrewed missions, and the other part running crewed missions (and, apparently, internal competition between these parts is fierce). There’s no question that the uncrewed-mission part is advancing both science and solar system exploration, and so there are strong and direct arguments for continuing in this direction. The crewed-mission part is harder to make a case for; the ISS spends most of its time just staying up there, and has been reduced to running high-school science experiments. (Nevertheless, there is a strong visceral feeling that humans ought to continue to go into space, which is driving the various proposals for Mars.)

But there’s one almost invisible but extremely important payoff from crewed missions (no, it’s not Tang!): reliability. You only have to consider the difference in the level of response between last week’s failed liftoff of a supply rocket to the ISS and the accidents involving human deaths. When an uncrewed rocket fails, there’s an engineering investigation leading to a significant improvement in the vehicle or processes. When there’s a crewed rocket failure, it triggers a phase shift in the whole of NASA, something that’s far more searching. The Grissom/White/Chaffee accident triggered a complete redesign of the way life support systems worked; the Challenger disaster caused the entire launch approval process to be rethought; and the Columbia disaster led to a new regime of inspections after reaching orbit. Crewed missions cause a totally different level of attention to safety, which in turn leads to a new level of attention to reliability.

Some changes only get made when there’s a really big, compelling motivation.

Those of us who have been flying across oceans for a while cannot help but notice that 4-engined aircraft have almost entirely been replaced by 2-engined aircraft, and that a spare engine strapped under the wing, being ferried out to a remote location to replace one that failed, is a sight that has disappeared as well. This is just one area where increased reliability has paid off. Cars routinely last for hundreds of thousands of kilometres, which only a few high-end brands used to do forty years ago. There are, of course, many explanations, but the crewed space program is part of the story.

What does this have to do with analytics? I think the time is right to try to build an artificial intelligence. The field known as “artificial intelligence” or machine learning or data mining or knowledge discovery has been notorious for overpromising and underdelivering since at least the 1950s, so I need to be clear: I don’t think we actually know enough to build an artificial intelligence — but I think the attempt has potential as an organising principle for a large amount of research that is otherwise unfocused.

One reason why I think this might be the right time is that deep learning has shown how to solve (at least potentially) some of the roadblock problems (although I don’t know that the neural network bias in much deep learning work is necessary). Consciousness, which was considered an important part of intelligence, has turned out to be a bit of a red herring. Of course, we can’t tell what the next hard part of the problem will turn out to be.

The starting problem is this: how can an algorithm identify when it is “stuck” and “go meta”? (Yes, I know that computability tells us there isn’t a general solution, but we, as humans, do this all the time — so how can we encode this mechanism algorithmically, at least as an engineering strategy?) Any level of success here would pay off immediately in analytics platforms, which are really good at building models, and even assessing them, but hopeless at knowing when to revise them.
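As a crude engineering analogue (nothing like a solution to the underlying computability problem), here is a minimal sketch of what noticing being “stuck” might look like for a deployed model: watch recent prediction error and flag when it drifts well past training-time levels. The class, the thresholds, and the retrain_or_escalate hook are all invented for illustration.

```python
from collections import deque

class StuckDetector:
    """Flags when a model's recent error drifts far from its training-time error.

    A toy stand-in for 'knowing when to go meta': it cannot revise the model,
    it can only notice that revision is probably needed.
    """
    def __init__(self, training_error, window=200, tolerance=2.0):
        self.training_error = training_error   # mean error observed at training time
        self.recent = deque(maxlen=window)     # rolling window of per-record errors
        self.tolerance = tolerance             # how much drift we are willing to ignore

    def observe(self, error):
        self.recent.append(error)

    def is_stuck(self):
        if len(self.recent) < self.recent.maxlen:
            return False                       # not enough evidence yet
        return (sum(self.recent) / len(self.recent)) > self.tolerance * self.training_error

# Usage (hypothetical feedback loop):
# detector = StuckDetector(training_error=0.08)
# for record, truth in stream:
#     detector.observe(abs(model.predict(record) - truth))
#     if detector.is_stuck():
#         retrain_or_escalate()   # the 'go meta' hook
```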

If you see something, say something — and we’ll ignore it

I arrived on a late evening flight at a Canadian airport that will remain nameless, and I was the second person into an otherwise deserted Customs Hall. On a chair was a cloth shoulder bag and a 10″ by 10″ by 4″ opaque plastic container. Being a good citizen, I went over to the distant Customs officers on duty and told them about it. They did absolutely nothing.

There are lessons here about predictive modelling in adversarial settings. The Customs officers were using, in their minds, a Bayesian predictor, which is the way that we, as humans, make many of our predictions. In this Bayesian predictor, the prior probability that the ownerless items contained explosives was very small, so the overall probability that they should act was also very small — and so they didn’t act.

Compare this to the predictive model used by firefighters. When a fire alarm goes off, they don’t consider a prior at all. That is, they don’t consider factors such as: a lot of new students just arrived in town, we just answered a hoax call to this location an hour ago, or anything else of the same kind. They respond regardless of whether they consider it a ‘real’ fire or not.
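The difference between the two policies is easy to make concrete. The numbers below are invented purely for illustration; the point is only that a tiny prior swamps even strong evidence, while the firefighter rule never consults the prior at all.

```python
# Bayesian responder: act only if the posterior probability of a threat
# clears some threshold. With a tiny prior, even strong evidence rarely does.
prior = 1e-6                  # assumed prior that an unattended bag is a real threat
p_report_given_threat = 0.9   # chance a real threat produces this kind of report
p_report_given_benign = 0.05  # chance a benign bag produces this kind of report

posterior = (p_report_given_threat * prior) / (
    p_report_given_threat * prior + p_report_given_benign * (1 - prior)
)
bayesian_acts = posterior > 0.01   # effectively never

# Firefighter responder: the prior never enters the decision.
firefighter_acts = True            # every alarm gets a response

print(f"posterior = {posterior:.2e}; Bayesian acts: {bayesian_acts}; "
      f"firefighter acts: {firefighter_acts}")
```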

The challenge is how to train front-line defenders against acts of terror to use the firefighter predictive model rather than the Bayesian one. Clearly, there’s still some distance to go.

Making recommendations different enough

One of the major uses of data analytics in practice is to make recommendations, either explicitly or implicitly. This is one area where the interests of marketers and the interests of consumers largely run together. If I want to buy something (a product or access to an experience such as listening to a song or seeing a movie) I’d just as soon buy something I actually want — so a  seller who can suggest something that fits has a better chance of getting my business than one that presents a set of generic choices. Thus businesses build models of me to try and predict what I am likely to buy.

Some of these businesses are middlemen, and what they are trying to predict is what kind of ads to show me on behalf of other businesses. Although this is a major source of revenue for web-based businesses, I suspect it to be an edge-case phenomenon — that is, the only people who actually see ads on the web are people who are new to the web (and there are still lots of them every day) while those who’ve been around for a while develop blindness to everything other than content they actually want. You see businesses continually developing new ways to make their ads more obtrusive but, having trained us to ignore them, this often seems to backfire.

Other businesses use permission marketing, which they can do because they already have a relationship. This gives them an inside track — they know things about their customers that are hard to get from more open sources, such as what they have actually spent money on. When they can analyse the data available to them effectively, they are able to create the win-win situation where what they want to sell me is also what I want to buy.

But there’s a huge hole in the technology that represents an opportunity at least as large as that on which Google was founded: how new and different should the suggestion be?

For example, if you buy a book from Amazon by a popular author, your recommendation list is populated by all of the other books by that same author, the TV programs based on that author’s books, the DVDs of the TV shows, and on and on. This is largely a waste of the precious resource of quality attention. Most humans are capable of figuring out that, if they like one book by an author, they may well like others. (In fact, sites like Amazon are surprisingly bad at figuring out that, when an author writes multiple series featuring different characters and settings, an individual might like one series but not necessarily the others.)

So what would be better? The goal, surely, is to suggest products that are similar to the ones the individual has already liked or purchased, but also sufficiently different that that individual would not necessarily have noticed them. In other words, current systems present products in a sphere around the existing product, but what they should do is present products in an annulus around the existing product. Different, but not too different.
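A minimal sketch of the annulus idea, under two assumptions that are mine rather than the post’s: items are represented as vectors, and cosine similarity is an adequate stand-in for “similar”.

```python
import numpy as np

def annulus_recommendations(liked_item, catalogue, lower=0.5, upper=0.85, k=10):
    """Recommend items similar to a liked item, but not too similar.

    Current systems in effect keep everything with similarity above `upper`
    (the sphere); the annulus keeps only items between `lower` and `upper`.
    """
    liked = liked_item / np.linalg.norm(liked_item)
    cat = catalogue / np.linalg.norm(catalogue, axis=1, keepdims=True)
    sims = cat @ liked                              # cosine similarity to the liked item
    candidates = np.flatnonzero((sims >= lower) & (sims <= upper))
    # rank the annulus from most to least similar and return the top k
    return candidates[np.argsort(-sims[candidates])][:k]

# The hard part the next paragraph points to is choosing `lower`:
# how dissimilar is "different enough but not too different"? These bounds are guesses.
```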

This is surprisingly difficult to do. Deciding what similarity means is already a difficult problem; deciding what “just enough dissimilarity” means has been, so far, a too difficult problem. But what an opportunity!

Election winning language patterns

One of the Freakonomics books makes the point that, in football aka soccer, a reasonable strategy when facing a free kick is to stay in the middle of the goal rather than diving to one side or the other — but goaltenders hardly ever follow this strategy because they look like such fools if it doesn’t pay off. Better to dive, and look as if you tried, even if it turns out that you dived the wrong way.

We’ve been doing some work on what kind of language to use to win elections in the U.S. and there are some similarities between the strategy that works, and the goal tending strategy.

We looked at the language patterns of all of the candidates in U.S. presidential elections over the past 20 years, and a very clear language pattern for success emerged. Over all campaigns, the candidate who best deployed this language won, and the margin of victory relates quite strongly to how well the language was used (for example, Bush and Gore used this pattern at virtually identical levels in 2000).

What is this secret language pattern that guarantees success? It isn’t very surprising: high levels of positive words, non-existent levels of negative words, abstract words in preference to concrete ones, and complete absence of reference to the opposing candidate(s).
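A toy illustration of what scoring a speech against this pattern might look like (this is not the methodology behind the study; the word lists and weights are invented):

```python
# Toy scorer: counts positive, negative, and abstract words and opponent mentions,
# and combines them into a single 'pattern' score. All lists and weights are invented.
POSITIVE = {"hope", "together", "future", "strong", "great"}
NEGATIVE = {"fail", "wrong", "crisis", "disaster", "weak"}
ABSTRACT = {"freedom", "opportunity", "justice", "prosperity", "change"}

def pattern_score(speech, opponent_names):
    words = speech.lower().split()
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    abstracts = sum(w in ABSTRACT for w in words)
    opponents = sum(name.lower() in words for name in opponent_names)
    # negativity and opponent references count heavily against the pattern
    return positives + abstracts - 5 * negatives - 5 * opponents

print(pattern_score("a future of hope and opportunity together", ["Smith"]))
```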

What was surprising is how this particular pattern of success was learned and used. Although the pattern itself isn’t especially surprising, no candidate used it from the start; they all began with much more conventional patterns: negativity, content-filled policy statements, and comparisons between themselves and the other candidates. With success came a change in language, but the second part of the surprise is that the change happened, in every case, over a period of little more than a month. For some presidents, it happened around the time of their inaugurations; for others around the time of their second campaign, but it was never a gradual learning curve. This suggests that what happens is not a conscious or unconscious improved understanding of what language works, but rather a change of their view of themselves that allows them to become more statesmanlike (good interpretation) or entitled (bad interpretation). The reason that presidents are almost always re-elected is that they use this language pattern well in their second campaigns. (It’s not a matter of changing speechwriting teams or changes in the world and so in the topics being talked about — it’s almost independent of content.)

So there’s plenty of evidence that using language like this leads to electoral success but, just as for goaltenders, no candidate can bring himself or herself to use it, because they’d feel so silly if it didn’t work and they lost.

Predicting the Future

Arguably the greatest contribution of computing to the total of human knowledge is that relatively simple results from theoretical models of computation show that the future is an inherently unknowable place — not just in practice, but for fundamental reasons.

A Turing machine is a simple model of computation with two parts: an infinite row of memory elements, each able to contain a single character; and a state machine, a simple device that is positioned at one of the memory elements and makes moves by inspecting the single character in the current element and using a simple transition table to decide what to do next. Possible next moves include changing the current character to another one (from a finite alphabet), moving one element to the right or to the left, or stopping and turning off.

A Turing machine is a simple device; all of its parts are straightforward; and many real-world simulators have been built. But it is hypothesised that this simple device can compute any function that can be computed by any other computational device, and so it contains within its simple structure everything that can be found in the most complex supercomputer, except speed.
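In fact, writing such a simulator takes only a few lines. Here is a minimal one; the transition table is an arbitrary example chosen only because it halts, and, as the rest of this post argues, running the machine is in general the only way to find out what it will do.

```python
from collections import defaultdict

# Transition table: (state, symbol) -> (symbol to write, move, next state)
RULES = {
    ("A", 0): (1, +1, "B"),
    ("A", 1): (1, -1, "C"),
    ("B", 0): (1, -1, "A"),
    ("B", 1): (1, +1, "B"),
    ("C", 0): (1, -1, "B"),
    ("C", 1): (1, +1, "HALT"),
}

def run(max_steps=1000):
    tape = defaultdict(int)          # unbounded tape, initially all 0s
    head, state = 0, "A"
    for step in range(max_steps):
        if state == "HALT":
            return step, sum(tape.values())
        symbol, move, state = RULES[(state, tape[head])]
        tape[head] = symbol
        head += move
    return None, sum(tape.values())  # still running: we simply don't know what it will do

steps, ones = run()
print(f"halted after {steps} steps with {ones} ones on the tape")
```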
It has been suggested that the universe is a giant Turing machine, and everything we know so far about physics continues to work from this perspective, with the single exception that it requires that time is quantized rather than continuous — the universe ticks rather than runs.

But here’s the contribution of computation to epistemology: almost nothing interesting about the future behaviour of a Turing machine is knowable in any shortcut way, that is, in any way that is quicker than just letting the Turing machine run and seeing what happens. This includes questions like: will the Turing machine ever finish its computation? Will it revisit this particular memory element? Will this memory element ever again contain the same symbol that it does now? And many others. (These questions may be answerable in particular cases, but they can’t be answered in general — that is, you can’t inspect the transition table and the storage and draw conclusions in a general way.)

If most of the future behaviour of such a simple device is not accessible to “outside” analysis, then almost every property of more complex systems must be equally inaccessible.

Note that this is not an argument built from limitations that might be thought of as “practical”. Predicting what will happen tomorrow is not, in the end, impossible because we can’t gather enough data about today, or because we don’t have the processing power to actually build the predictions — it’s a more fundamental limitation in the nature of what it means to predict the future. This limitation is akin (in fact, quite closely related) to the fact that, within any formal system, there are true statements that cannot be proved within that system.

There are other arguments that also speak to the problem of predicting the future. These aren’t actually needed, given the argument above, but they are often advanced, and speak more to the practical difficulties.

The first is that non-linear systems are not easy to model, and often exhibit unexpected behaviour that is hard to infer even when we understand them in detail. Famously, bamboo canes can suddenly appear from the ground and then grow more than 2 feet in a day.

The second is that many real-world systems are chaotic, that is infinitesimal differences in their conditions at one moment in time can turn into enormous differences at a future time. This is why forecasting the weather is difficult: a small error in measurement at one  weather station today (caused, perhaps by a butterfly flapping its wings) can completely change tomorrow’s weather a thousand miles away. The problem with predicting the future here is that the current state cannot be measured to sufficient accuracy.
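The measurement problem shows up even in a one-line toy system. The logistic map below is a standard example of chaos (it is not a weather model); two starting values that differ by one part in a billion soon bear no resemblance to one another.

```python
def logistic(x, r=4.0):
    # the logistic map, a classic chaotic system for r = 4
    return r * x * (1 - x)

a, b = 0.2, 0.2 + 1e-9          # "today", measured with a tiny error
for _ in range(60):
    a, b = logistic(a), logistic(b)

print(f"after 60 steps: {a:.6f} vs {b:.6f}")   # the two trajectories have diverged completely
```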

So if the future is inherently, fundamentally impossible to predict, what do we mean when we talk about prediction in the context of knowledge discovery? The answer is that predictive models are not predicting a previously unknown future, but are predicting the recurrence of patterns that have existed in the past. It’s desperately important to keep this in mind.

Thus when a mortgage prediction system (should this new applicant be given a mortgage?) is built, it’s built from historical data: which of a pool of earlier applicants for mortgages did, and did not, repay those loans. The prediction for a new mortgage applicant is, roughly speaking, based on matching the new applicant to the pool of previous applicants and making a determination from what the outcomes were for those. In other words, the prediction assumes an approximate rerun of what happened before — “now” is essentially the same situation as “then”. It’s not really a prediction of the future; it’s a prediction of a rerun of the past.

All predictive models have (and must have) this historical replay character. Trouble starts when this gets forgotten, and models are used to predict scenarios that are genuinely in the future.  For example, in mortgage prediction, a sudden change in the wider economy may be significant enough that the history that is wired into the predictor no longer makes sense. Using the predictor to make new lending decisions becomes foolhardy.

Other situations have similar pitfalls, but they are a bit better hidden. For example, the dream of personalised medicine is to be able to predict the outcome for a patient who has been diagnosed with a particular disease and is being given a particular treatment. This might work, but it assumes that every new patient is close enough to some of the previous patients that there’s some hope of making a plausible prediction. At present, this is foundering on the uniqueness of each patient, especially as the available pool of existing patients for building the predictor is often quite limited. Without litigating the main issue, models that attempt to predict future global temperatures are vulnerable to the same pitfall: previous dependencies of temperatures on temperatures at earlier times do not provide a solid epistemological basis for predicting future temperatures based on temperatures now (and with the triple whammy of fundamental unpredictability, chaos, and non-linear systems).

All predictors should be built so that every new record passes through a preliminary step that compares it to the totality of the data used to build the predictor. New records that do not resemble records used for training cannot legitimately be passed to the predictor, since the result has a strong probability of being fictional. In other words, the fact that a predictor was built from a particular set of training data must be preserved in the predictor’s use. Of course, there’s an issue of how similar a new record must be to the training records to be plausibly predicted. But at least this question should be asked.
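A minimal sketch of such a preliminary step, using the distance to the nearest training record as the test of resemblance. scikit-learn is an assumed dependency here, and the threshold is exactly the open question the paragraph above ends with.

```python
from sklearn.neighbors import NearestNeighbors

class GuardedPredictor:
    """Refuses to predict for records that do not resemble the training data."""

    def __init__(self, model, X_train, threshold):
        self.model = model
        self.nn = NearestNeighbors(n_neighbors=1).fit(X_train)
        self.threshold = threshold   # how close is "close enough" is the unresolved question

    def predict(self, X):
        distances, _ = self.nn.kneighbors(X)
        predictions = self.model.predict(X)
        # None marks records too far from anything seen in training to predict honestly
        return [p if d <= self.threshold else None
                for p, d in zip(predictions, distances[:, 0])]
```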

So can we predict the future? No, we can only repredict the past.

You heard it here first

As I predicted on August 8th, Obama has won the U.S. presidential election. The prediction was made based on his higher levels of persona deception, that is the ability to present himself as better and more wonderful than he actually is. Romney developed this a lot during the campaign and the gap was closing, but it wasn’t enough.

On a side note, it’s been interesting to notice the emphasis in the media on factual deception, and the huge amount of fact checking that they love to do. As far as I can tell, factual deception has at best a tiny effect on political success, whether because it’s completely discounted or because the effect of persona is so much stronger. On the record, it seems to me hard to argue that Obama has been a successful president, and indeed I saw numerous interviews with voters who said as much — but who then went on to say that they would still be voting for him. So I’m inclined to the latter explanation.

Adversarial knowledge discovery is not just knowledge discovery with classified data

Someone made a comment to me this week implying that data mining/knowledge discovery in classified settings was a straightforward problem because the algorithms didn’t have to be classified, just the datasets.

This view isn’t accurate, for the following reason: mainstream/ordinary knowledge discovery builds models inductively from data by maximizing the fit between the model and the available data. In an adversarial setting, this creates problems: adversaries can make a good guess about what your model will look like, and so can do better at hiding their records (making them less obvious or detectable); and they can also tell exactly what kinds of data they should try and force you to use if/when you retrain your model.

A simple example: the two leading-edge predictors today are support vector machines and random forests and, in mainstream settings, there’s often little to choose between them in prediction performance. However, in an adversarial setting, the difference is huge. Support vector machines are relatively easy to game, because even one record that’s in the wrong place or mislabelled can change the entire orientation of the decision surface. To make it even worse, the way in which the boundary changes is roughly predictable from the kind of manipulation made. (You can see how this works in J. G. Dutrisac and David B. Skillicorn, Subverting prediction in adversarial settings, ISI 2008, pp. 19-24.) Random forests, on the other hand, are much more robust predictors, partly because their ensemble character, and the inherent randomness ‘inside’ the algorithm, make it hard to force particular behaviour.
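A small experiment along these lines can be run directly. This is a sketch using scikit-learn, not the setup from the cited paper: plant a single mislabelled record deep in the other class’s territory and measure how much of the input space changes its prediction under each classifier.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

# Poison the training set: one class-0 record planted deep in class-1 territory.
X_poisoned = np.vstack([X, [[3.0, 3.0]]])
y_poisoned = np.append(y, 0)

# A grid of test points covering the whole region.
grid = np.column_stack([g.ravel() for g in
                        np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))])

for name, clf in [("linear SVM", SVC(kernel="linear", C=10.0)),
                  ("random forest", RandomForestClassifier(n_estimators=200, random_state=0))]:
    before = clf.fit(X, y).predict(grid)
    after = clf.fit(X_poisoned, y_poisoned).predict(grid)
    changed = np.mean(before != after)
    print(f"{name}: {changed:.1%} of the region changes prediction after one poisoned record")
```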

The same thing happens with clustering algorithms. Algorithms such as Expectation-Maximization are easily misled by a few carefully chosen records; and no mainstream clustering technique does a good job of finding the kinds of ‘fringe’ clusters that occur when something unusual is present in the data but adversaries have tried hard to make it look as normal as possible.

In fact, adversarial knowledge discovery requires building the entire framework of knowledge discovery over again, taking into account the adversarial nature of the problem from the very beginning. Some parts of mainstream knowledge discovery can be used directly; others with some adaptation; and others can’t safely be used. There are also some areas where we don’t know very much about how to solve problems that aren’t interesting in the mainstream, but are critical in the adversarial domain.

Which predictors can rank?

To be able to build a ranking predictor, there must be some way of labelling the training records with (estimates of) their ranks, so that this property can be generalised to new records. This is often straightforward, even if the obvious target label doesn’t map directly to a ranking.

There are six mainstream prediction technologies:

  1. Decision trees. These are everyone’s favourites, but they are quite weak predictors, and can only be used to predict class labels. So no use for ranking.
  2. Neural networks. These are also well-liked, but undeservedly so. Neural networks can be effective predictors for problems where the boundaries between the classes are difficult and non-linear, but they are horrendously expensive to train. They should not be used without soul searching. They can, however, predict numerical values and so can do ranking.
  3. Support Vector Machines. These are two-class predictors that try to fit the optimal boundary (the maximal margin) between points corresponding to the records from each class. The distances from the boundary are an estimate of how confident the classifier is in the classification of each record, and so provide a kind of surrogate ranking: from large positive numbers down to 1 for one class and then from -1 to large negative numbers for the other class.
  4. Ensembles. Given any kind of simple predictor, a better predictor can be built by creating samples of the records from the training dataset, building an individual predictor from each sample, and then using the collection of predictors as a single, global predictor: ask each one for its prediction and use voting to make the global prediction. Ensembles have a number of advantages, primarily that the individual predictors cancel out each other’s variance. But the number of predictors voting for the winning class can also be interpreted as a strength of opinion for that class, and so as a value on which to rank. In other words, a record can be unanimously voted normal, voted normal by all but one of the individual predictors, and so on.
  5. Random Forests. Random forests are a particular form of ensemble predictor in which each component decision tree is built making decisions about its internal tests in a particularly robust and contextualized way. This makes them one of the most powerful prediction technologies known. The same technique applies: the number of votes for the winning class, or the margin between the most popular and the next most popular class, can be used as a ranking (see the sketch below).
  6. Rules. Rules are used because they seem intuitive and explanatory, but they are very weak predictors. This is mostly because they capture local structure in data, rather than global structure captured by most other predictors. Rules cannot straightforwardly be used to rank.

So, although ranking predictors are very useful in adversarial situations, they are quite difficult to build and use.
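As an illustration of the vote-counting idea in items 4 and 5, here is a sketch (assuming scikit-learn) that turns a random forest’s votes into a ranking from most to least suspicious. The “normal” label and the data are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_by_votes(forest, X):
    """Rank records from most to least suspicious using the forest's votes.

    predict_proba returns, roughly, the fraction of trees voting for each class
    (strictly, averaged per-tree class probabilities).
    """
    normal_column = list(forest.classes_).index("normal")
    votes_normal = forest.predict_proba(X)[:, normal_column]
    order = np.argsort(votes_normal)           # fewest "normal" votes first
    return order, votes_normal[order]

# Hypothetical usage, with labels "normal" / "abnormal" in y_train:
# forest = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)
# order, scores = rank_by_votes(forest, X_new)
```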

Second-stage predictors

If we have a ranking predictor, we can choose to put the boundary in a position where the false negative rate is zero (which is what we desperately need to do in an adversarial situation).

The side-effect of this, of course, is that the false positive rate is likely to be very large, so at first glance this seems crazy. But only if we think that the problem can be solved by a single predictor.

Suppose, instead, that we think of the problem as one to be solved in stages.

The first predictor is intended mostly to reduce the problem to a manageable size. The boundary is set so that the false negative rate is zero, knowing that this means that many, many records will be ranked as “possibly abnormal”.

However, the total number of records is now, say, 20% of what it was initially. We can now apply a more sophisticated predictor, still trying to model and predict normality, but able to spend more resources per record and to use a cleverer algorithm.

Again the result is a ranking of the records, into which we can insert a boundary such that the false negative rate is zero. This is still extremely conservative, so that many records that are quite normal may still be ranked on the “possibly abnormal” side of the boundary, but the number of records remaining is smaller again. Even if only half the records can be safely discarded as normal, we have still reduced the total number of records to 10% of the original.

This staged approach can be continued for as many stages as required. Each new stage works with fewer records, and can be correspondingly more sophisticated. Eventually, the number of questionable records may become small enough that a human can do the final check.

Assembling predictors in sequence means that predictors whose individual accuracies are not perfect can be arranged so that the overall effect is close to perfect.
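A sketch of the staged arrangement. It assumes each stage exposes a scoring function in which higher means more normal, and a threshold chosen conservatively from labelled data so that no known abnormal record scores above it; all names are illustrative.

```python
def cascade(records, stages):
    """Apply increasingly expensive normality predictors in sequence.

    `stages` is a list of (score_fn, safe_threshold) pairs, cheapest first.
    Records scoring above the threshold are discarded as certainly normal;
    the rest go on to the next, more sophisticated, stage.
    """
    remaining = records
    for score_fn, safe_threshold in stages:
        remaining = [r for r in remaining if score_fn(r) <= safe_threshold]
    return remaining   # the small residue left for closer (eventually human) inspection

# Each safe_threshold is set from that stage's ranking so that the false negative
# rate on known abnormal records is zero, accepting many false positives instead.
```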

Of course, we have not yet quite solved the problem of detecting possible bad things. The pool of abnormal records contains both records associated with bad guys and records associated with other more ordinary kinds of abnormality (eccentricity).

At this point, two lines of attack remain. The first is to build a model of eccentricity, and use it to eliminate those records from the data. The other is to observe that most bad-guy actions require some kind of collective effort, so that bad-guy records are likely to form some kind of grouping, while those of eccentrics are likely to be one-offs. (Of course, this does not solve the problem of the lone-wolf bad guy, which is why serial killers can be so hard to detect.)

Ranking versus boundaries

Knowledge discovery is full of chicken and egg problems — typically it’s not clear how to set the parameters for an algorithm until you’ve seen the results it gives on your data.

For prediction, the problem of how to specify the boundary is of this kind. Suppose that we want to build a predictor for normality, so that each record will be classified as “normal”, or “possibly abnormal”. We will have some examples of each in our training data (that is records already labelled “normal” or “abnormal”), and typically there will be many more normal records than abnormal.

But how abnormal does a record have to be before we label it as abnormal? And what parameters should we give the algorithm that builds the predictor? Different decisions will have different effects on the false positive and false negative rates. If we move the boundary so that more records are predicted to be abnormal, we reduce the number of false negatives, but increase the number of false positives. There are techniques for making a good choice (using the so-called ROC, or Receiver Operating Characteristic, curve) but these aren’t very useful in an adversarial situation, where false negatives matter a lot. Moving the boundary means changing the parameters of the algorithm that builds the predictor, so every time we want to try another position, we have to rebuild the predictor.

A better way to think about the problem is that the goal is to rank the records from most abnormal to least abnormal. In other words, the knowledge-discovery technique does not actually build a predictor, but something close to it.

Once we have a ranking, we can decide where to put the boundary, but after we have seen the analysis of the data, rather than before. There is a twofold win: we have avoided the chicken and egg problem of having to set the parameters of the algorithm, and we can easily explore the effect of different choices of the boundary without retraining.
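The twofold win is easy to see in code: once every record has a score, trying a different boundary is a filter over the scores rather than a retraining run. A sketch, with invented variable names:

```python
import numpy as np

def rates_at(scores, is_abnormal, boundary):
    """False negative and false positive rates if everything scoring at or
    below `boundary` is called abnormal. `is_abnormal` marks the true labels."""
    flagged = scores <= boundary
    false_negatives = np.mean(~flagged[is_abnormal])   # abnormal records we would miss
    false_positives = np.mean(flagged[~is_abnormal])   # normal records we would flag
    return false_negatives, false_positives

# scores, is_abnormal = ...   # from any ranking predictor and a labelled sample
# for b in np.quantile(scores, [0.05, 0.1, 0.2, 0.5]):
#     print(b, rates_at(scores, is_abnormal, b))       # no retraining needed
```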

Building a ranking predictor is a little harder than building a plain predictor, but not by much. Predictors are divided into two kinds: classifiers that predict a class label (from a finite set), and regressors that predict a continuous value. Ranking requires the second kind of predictor.

Making prediction usable

In the previous post, I pointed out that the goal of prediction in adversarial settings is to prevent bad things happening, but it doesn’t work well to attack the problem directly.

The first important insight is to see that the problem is really about predicting normality. Rather than try and predict what bad guys might do, which might be any of a very long list of things, try and predict what normal people will do. This is a great deal easier because normality has regularities.

This isn’t obvious, but it happens because we are social beings. The things we do, and the way we do them, are constrained by the fact that other people are involved. Even something as simple as walking down the sidewalk is a highly constrained process — people move to the appropriate side when they approach someone without any conscious thought, and without any apparent signal. In fact, this works so well that when you go to a country where the standard side on which to pass someone is different (left vs right) both you and the natives will feel uncomfortable without knowing why — and it takes some people years to correct their behaviour to local norms.

But our social nature means that the structure of normality is quite robust, and so can be learned with acceptable accuracy by a predictor.

Of course, this hasn’t solved the problem. But it has reduced the scale of the problem immensely. If we have a million records, a fairly simple predictor of normality can predict that 800,000 of them are certainly normal, reducing the problem by a factor of 5.

The records that remain can now be processed by a second predictor which (a) has a different task, and (b) has to handle much less data.
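A minimal sketch of such a first-stage normality filter. scikit-learn’s IsolationForest stands in here for “a fairly simple predictor of normality”, and keeping 20% of the records mirrors the factor-of-5 reduction above; both choices are assumptions rather than anything prescribed by the post.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def first_stage_filter(X, keep_fraction=0.2):
    """Discard the records the normality model is most confident about.

    Returns the indices of the least-normal `keep_fraction` of records, which
    are handed to the second, more sophisticated, predictor.
    """
    model = IsolationForest(n_estimators=200, random_state=0).fit(X)
    normality = model.score_samples(X)               # higher = more normal
    cutoff = np.quantile(normality, keep_fraction)   # keep only the least-normal tail
    return np.flatnonzero(normality <= cutoff)

# Hypothetical usage:
# suspicious_idx = first_stage_filter(X_million)
```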

Doing prediction in adversarial settings

The overall goal of prediction in adversarial settings is to stop bad things happening — terrorist attacks, fraud, crime, money laundering, and lots of other things.

People intuitively think that the way to address this goal is to try and build a predictor for the bad thing. A few moments’ thought shows that building such a predictor is a very difficult, maybe impossible, thing to do. So some people immediately conclude that it’s silly, or a waste of money, to try and address such goals using knowledge discovery.

There are a couple of obvious reasons why direct prediction won’t work. The first is that bad guys have a very large number of ways in which they can achieve their goal, and it’s impossible for the good guys to consider every single one in designing the predictive model.

This problem is very obvious in intrusion detection, trying to protect computer systems against attacks. There are two broad approaches. The first is to keep a list of bad things, and block any of them when they occur. This is how antivirus software works — every day (it was every week; soon it will be every hour) new additions to the list of bad things have to be downloaded. Of course, this doesn’t predict so-called zero-day attacks, which use some mechanism that has never been used before and so is not on the list of bad things. The second approach is to keep track of what has happened before, and prevent anything new from happening (without some explicit approval by the user). The trouble is that, although there are some regularities in what a user or a system does every day, there are always new things — new websites visited, email sent to new addresses. As a result, alarms are triggered so often that it drives everyone mad, and such systems often get turned off. Vista’s user authorization is a bit like this.
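The two approaches can be caricatured in a few lines (pure illustration, not how any real product works):

```python
KNOWN_BAD = {"signature_of_virus_1", "signature_of_virus_2"}   # updated daily, soon hourly
SEEN_BEFORE = set()                                            # everything this user has done so far

def blacklist_check(event):
    # misses anything not yet on the list: the zero-day problem
    return "block" if event in KNOWN_BAD else "allow"

def whitelist_check(event):
    # flags every genuinely new action: the alarm-fatigue problem
    if event in SEEN_BEFORE:
        return "allow"
    SEEN_BEFORE.add(event)
    return "ask the user"
```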

The other difficulty with using direct prediction is making it accurate enough. Suppose that there are two categories that we want to predict: good, and bad. A false positive is when a good record is predicted to be bad; and a false negative is when a bad record is predicted to be good. Both kinds of wrong predictions are a problem, but in different ways. A false positive causes annoyance and irritation, and generates extra work, since the record (and the person it belongs to) must be processed further. However, a false negative is usually much worse — because it means that a bad guy gets past the prediction mechanism.

Prediction technology is considered to be doing well if it achieves a prediction accuracy of around 90% (the percentage of records predicted correctly). It would be fabulous if it achieved an accuracy of 99%. But when the number of records is 1 million, a misclassification rate of 1% is 10,000 records! The consequences of this many mistakes would range from catastrophic to unusable.

These problems with prediction have been pointed out in the media and in some academic writing, as if they meant that prediction in adversarial settings is useless. This is a bit of an argument against a straw man. What is needed is a more thoughtful way of thinking about how prediction should be done, which I’ll talk about in the next posting.