Making recommendations different enough

One of the major uses of data analytics in practice is to make recommendations, either explicitly or implicitly. This is one area where the interests of marketers and the interests of consumers largely run together. If I want to buy something (a product, or access to an experience such as listening to a song or seeing a movie), I’d just as soon buy something I actually want, so a seller who can suggest something that fits has a better chance of getting my business than one that presents a set of generic choices. Thus businesses build models of me to try to predict what I am likely to buy.

Some of these businesses are middlemen, and what they are trying to predict is what kind of ads to show me on behalf of other businesses. Although this is a major source of revenue for web-based businesses, I suspect it is an edge-case phenomenon: the only people who actually see ads on the web are people who are new to the web (and there are still lots of them every day), while those who’ve been around for a while develop blindness to everything other than content they actually want. Businesses continually develop new ways to make their ads more obtrusive but, having trained us to ignore ads, they often find that this backfires.

Other businesses use permission marketing, which they can do because they already have a relationship. This gives them an inside track — they know things about their customers that are hard to get from more open sources, such as what they have actually spent money on. When they can analyse the data available to them effectively, they are able to create the win-win situation where what they want to sell me is also what I want to buy.

But there’s a huge hole in the technology that represents an opportunity at least as large as that on which Google was founded: how new and different should the suggestion be?

For example, if you buy a book from Amazon by a popular author, your recommendation list is populated by all of the other books by that same author, the TV programs based on that author’s books, the DVDs of the TV shows, and on and on. This is largely a waste of the precious resource of quality attention. Most humans are capable of figuring out that, if they like one book by an author, they may well like others. (In fact, sites like Amazon are surprisingly bad at figuring out that, when an author writes multiple series featuring different characters and settings, an individual might like one series but not necessarily the others.)

So what would be better? The goal, surely, is to suggest products that are similar to the ones the individual has already liked or purchased, but also sufficiently different that the individual would not necessarily have noticed them. In other words, current systems present products in a sphere around the existing product, but what they should do is present products in an annulus around it. Different, but not too different.
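
As a minimal sketch of the annulus idea (assuming, purely for illustration, that products are represented as feature vectors and that cosine similarity is the similarity measure; the band boundaries are arbitrary):

```python
import numpy as np

def recommend_annulus(target, candidates, lower=0.5, upper=0.85, k=10):
    """Suggest items whose similarity to `target` falls inside an annulus:
    similar enough to be relevant, but not so similar as to be redundant.
    `target` is a feature vector; `candidates` maps item id -> feature vector."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    scored = [(item, cosine(target, vec)) for item, vec in candidates.items()]
    # Keep only items inside the similarity band (the annulus), not the whole sphere.
    in_band = [(item, s) for item, s in scored if lower <= s <= upper]
    # Rank the survivors by similarity and return the top k.
    return sorted(in_band, key=lambda pair: pair[1], reverse=True)[:k]
```

The hard part, of course, is choosing the band: set `upper` too high and the suggestions collapse back into "more of the same"; set `lower` too low and they become effectively random.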

This is surprisingly difficult to do. Deciding what similarity means is already a difficult problem; deciding what “just enough dissimilarity” means has, so far, been too difficult a problem. But what an opportunity!

Election-winning language patterns

One of the Freakonomics books makes the point that, in football (aka soccer), a reasonable strategy when facing a penalty kick is to stay in the middle of the goal rather than diving to one side or the other. Goaltenders hardly ever follow this strategy, though, because they look like such fools if it doesn’t pay off. Better to dive, and look as if you tried, even if it turns out that you dived the wrong way.

We’ve been doing some work on what kind of language wins elections in the U.S., and there are some similarities between the strategy that works and the goaltending strategy.

We looked at the language patterns of all of the candidates in U.S. presidential elections over the past 20 years, and a very clear language pattern for success emerged. Over all campaigns, the candidate who best deployed this language won, and the margin of victory relates quite strongly to how well the language was used (for example, Bush and Gore used this pattern at virtually identical levels in 2000).

What is this secret language pattern that guarantees success? It isn’t very surprising: high levels of positive words, non-existent levels of negative words, abstract words in preference to concrete ones, and complete absence of reference to the opposing candidate(s).
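
As a rough sketch of how this kind of pattern can be measured (the word lists below are tiny, invented stand-ins; a real analysis would use curated lexicons and the actual names of the opposing candidates):

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE  = {"hope", "together", "future", "great", "strong"}
NEGATIVE  = {"failure", "wrong", "disaster", "weak"}
ABSTRACT  = {"freedom", "opportunity", "values", "justice"}
CONCRETE  = {"tax", "bill", "factory", "border"}
OPPONENTS = {"opponent"}   # placeholder for rival candidates' names

def pattern_score(text):
    """Score a speech against the winning pattern: reward positive and abstract
    words, penalise negative words, concrete words, and mentions of the opposition."""
    words = text.lower().split()
    n = max(len(words), 1)
    rate = lambda lexicon: sum(w in lexicon for w in words) / n
    return (rate(POSITIVE) + rate(ABSTRACT)
            - rate(NEGATIVE) - rate(CONCRETE) - rate(OPPONENTS))
```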

What was surprising is how this particular pattern of success was learned and used. Although the pattern itself isn’t especially surprising, no candidate used it from the start; they all began with much more conventional patterns: negativity, content-filled policy statements, and comparisons between themselves and the other candidates. With success came a change in language, but the second part of the surprise is that the change happened, in every case, over a period of little more than a month. For some presidents, it happened around the time of their inaugurations; for others around the time of their second campaign, but it was never a gradual learning curve. This suggests that what happens is not a conscious or unconscious improved understanding of what language works, but rather a change of their view of themselves that allows them to become more statesmanlike (good interpretation) or entitled (bad interpretation). The reason that presidents are almost always re-elected is that they use this language pattern well in their second campaigns. (It’s not a matter of changing speechwriting teams or changes in the world and so in the topics being talked about — it’s almost independent of content.)

So there’s plenty of evidence that using language like this leads to electoral success but, just as for goaltenders, no candidate can bring himself or herself to use it, because they’d feel so silly if it didn’t work and they lost.

Predicting the future

Arguably the greatest contribution of computing to the total of human knowledge is that relatively simple results from theoretical models of computation show that the future is an inherently unknowable place, not merely for practical reasons but for fundamental ones.

A Turing machine is a simple model of computation with two parts: an infinite row of memory elements, each able to contain a single character; and a state machine, a simple device that is positioned at one of the memory elements and makes moves by inspecting the single character in the current element and using a simple transition table to decide what to do next. Possible moves include changing the current character to another one (from a finite alphabet), moving one element to the right or to the left, or halting.

A Turing machine is a simple device; all of its parts are straightforward; and many real-world simulators have been built. But it is hypothesised that this simple device can compute any function that can be computed by any other computational device, and so it contains within its simple structure everything that can be found in the most complex supercomputer, except speed.
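
A minimal simulator makes the description concrete (the encoding and the example machine here are mine, purely for illustration):

```python
def run_turing_machine(transitions, tape, state="start", max_steps=10_000):
    """transitions maps (state, symbol) -> (new_symbol, move, new_state),
    where move is "L", "R", or "N" (stay). "_" stands for a blank cell."""
    cells = dict(enumerate(tape))          # sparse tape: position -> symbol
    pos = 0
    for _ in range(max_steps):
        if state == "halt":
            return state, cells
        symbol = cells.get(pos, "_")
        new_symbol, move, state = transitions[(state, symbol)]
        cells[pos] = new_symbol
        pos += {"R": 1, "L": -1, "N": 0}[move]
    return state, cells                    # step limit reached: still no answer

# A toy machine that overwrites a run of 1s with 0s and then halts.
flip = {
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "N", "halt"),
}
print(run_turing_machine(flip, "111"))
```

Note the arbitrary step limit: as the next paragraphs explain, there is no general shortcut for deciding in advance whether a given machine will ever halt.
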
It has been suggested that the universe is a giant Turing machine, and everything we know so far about physics continues to work from this perspective, with the single exception that it requires time to be quantized rather than continuous: the universe ticks rather than runs.

But here’s the contribution of computation to epistemology: almost nothing interesting about the future behaviour of a Turing machine is knowable in any shortcut way, that is, in any way that is quicker than just letting the Turing machine run and seeing what happens. This includes questions like: will the Turing machine ever finish its computation? Will it revisit this particular memory element? Will this memory element ever again contain the same symbol that it does now? And many others. (These questions may be answerable in particular cases, but they can’t be answered in general; that is, you can’t inspect the transition table and the storage and draw conclusions in a general way.)

If most of the future behaviour of such a simple device is not accessible to “outside” analysis, then almost every property of more complex systems must be equally inaccessible.

Note that this is not an argument built from limitations that might be thought of as “practical”. Predicting what will happen tomorrow is not, in the end, impossible because we can’t gather enough data about today, or because we don’t have the processing power to actually build the predictions; it’s a more fundamental limitation in the nature of what it means to predict the future. This limitation is akin (in fact, quite closely related) to the fact that, within any sufficiently rich formal system, there are statements that are true but that cannot be proved.

There are other arguments that also speak to the problem of predicting the future. These aren’t actually needed, given the argument above, but they are often advanced, and speak more to the practical difficulties.

The first is that non-linear systems are not easy to model, and often behave in unexpected ways that are hard to infer even when we understand their components in detail. Famously, bamboo canes can suddenly appear from the ground and then grow more than 2 feet in a day.

The second is that many real-world systems are chaotic, that is, infinitesimal differences in their conditions at one moment in time can turn into enormous differences at a future time. This is why forecasting the weather is difficult: a small error in measurement at one weather station today (caused, perhaps, by a butterfly flapping its wings) can completely change tomorrow’s weather a thousand miles away. The problem with predicting the future here is that the current state cannot be measured to sufficient accuracy.
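
The standard toy illustration of this sensitivity is the logistic map, which is not a weather model but shows the same effect (the parameter value below is just the classic chaotic choice):

```python
def logistic_trajectory(x0, r=4.0, steps=40):
    """Iterate the logistic map x -> r*x*(1-x), a textbook chaotic system."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.2000000)
b = logistic_trajectory(0.2000001)   # a one-in-ten-million difference at the start
for t in (0, 10, 20, 30, 40):
    print(t, abs(a[t] - b[t]))       # the gap grows from about 1e-7 to order 1
```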

So if the future is inherently, fundamentally impossible to predict, what do we mean when we talk about prediction in the context of knowledge discovery? The answer is that predictive models are not predicting a previously unknown future, but are predicting the recurrence of patterns that have existed in the past. It’s desperately important to keep this in mind.

Thus when a mortgage prediction system (should this new applicant be given a mortgage?) is built, it’s built from historical data: which of a pool of earlier applicants for mortgages did, and did not, repay those loans. The prediction for a new mortgage applicant is, roughly speaking, based on matching the new applicant to the pool of previous applicants and making a determination from what the outcomes were for those. In other words, the prediction assumes an approximate rerun of what happened before: “now” is essentially the same situation as “then”. It’s not really a prediction of the future; it’s a prediction of a rerun of the past.
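
A sketch of this "match against the historical pool" idea, using k-nearest neighbours on invented applicant features (the features, data, and choice of algorithm are all illustrative; this is not a claim about how real lenders build their models):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Columns: income (thousands), debt ratio, years in current job (made-up data).
history = np.array([[80, 0.2, 5.0], [45, 0.6, 1.0], [60, 0.3, 3.0], [30, 0.7, 0.5]])
repaid  = np.array([1, 0, 1, 0])     # 1 = repaid, 0 = defaulted

model = KNeighborsClassifier(n_neighbors=3).fit(history, repaid)

new_applicant = np.array([[55, 0.4, 2.0]])
# The "prediction" is really a vote among the most similar past applicants.
print(model.predict(new_applicant), model.predict_proba(new_applicant))
```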

All predictive models have (and must have) this historical replay character. Trouble starts when this gets forgotten, and models are used to predict scenarios that are genuinely in the future.  For example, in mortgage prediction, a sudden change in the wider economy may be significant enough that the history that is wired into the predictor no longer makes sense. Using the predictor to make new lending decisions becomes foolhardy.

Other situations have similar pitfalls, but they are a bit better hidden. For example, the dream of personalised medicine is to be able to predict the outcome for a patient who has been diagnosed with a particular disease and is being given a particular treatment. This might work, but it assumes that every new patient is close enough to some of the previous patients that there’s some hope of making a plausible prediction. At present, this is foundering on the uniqueness of each patient, especially as the available pool of existing patients for building the predictor is often quite limited. Without litigating the main issue, models that attempt to predict future global temperatures are vulnerable to the same pitfall: the way temperatures depended on earlier temperatures in the past does not provide a solid epistemological basis for predicting future temperatures from current ones (and they face the triple whammy of fundamental unpredictability, chaos, and non-linearity).

All predictors should be built so that every new record passes through a preliminary step that compares it to the totality of the data used to build the predictor. New records that do not resemble records used for training cannot legitimately be passed to the predictor, since the result has a strong probability of being fictional. In other words, the fact that a predictor was built from a particular set of training data must be preserved in the predictor’s use. Of course, there’s an issue of how similar a new record must be to the training records to be plausibly predicted. But at least this question should be asked.
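
A sketch of such a preliminary step, using distance to the nearest training record as the similarity check (the threshold is arbitrary and would have to be chosen for the data at hand):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def guarded_predict(model, train_X, x, max_distance):
    """Refuse to predict for records too far from anything seen in training."""
    nn = NearestNeighbors(n_neighbors=1).fit(train_X)
    distance, _ = nn.kneighbors(np.array([x]))
    if distance[0][0] > max_distance:
        return None   # "don't know": a prediction here would likely be fictional
    return model.predict(np.array([x]))[0]
```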

So can we predict the future? No, we can only repredict the past.

Radicalization as an infection

There are many ways to think about radicalization, but there’s one that fits the data fairly well and has actionable consequences, yet is being almost entirely ignored: modelling radicalization as an infection.

The research work didn’t model radicalization directly (how could you?) but rather how postings on particular jihadist topics spread in web forums. The relevant paper is:

Jiyoung Woo and Hsinchun Chen, “An event-driven SIR model for topic diffusion in web forums,” in Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics (ISI), pp. 108-113, June 2012.

They showed that the data fit well with a standard model called the SIR model (Susceptible-Infected-Recovered), and they computed the parameters for the infection rate and the recovery rate. In other words, they showed that it made sense to consider some of the participants to be susceptible to infection by radical ideology; some fraction of these were then infected and espoused this ideology; and some fraction of these then recovered and no longer held radical ideas. These fractions could be estimated empirically from the data.
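
For concreteness, here is a minimal discrete-time SIR simulation (the rates below are illustrative defaults, not the values estimated in the paper):

```python
def sir(S, I, R, beta=0.3, gamma=0.1, steps=100):
    """Discrete-time SIR: beta is the infection rate, gamma the recovery rate."""
    N = S + I + R
    history = [(S, I, R)]
    for _ in range(steps):
        new_infections = beta * S * I / N   # susceptible people picking up the ideology
        new_recoveries = gamma * I          # infected people moving on
        S -= new_infections
        I += new_infections - new_recoveries
        R += new_recoveries
        history.append((S, I, R))
    return history

print(sir(S=990, I=10, R=0)[-1])   # by the end, most of those infected have recovered
```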

But the important thing that’s being ignored in the current world situation is that there is a recovery rate! People do become radicalized, yes, but not all of those who are susceptible; and the vast majority of those who are radicalized eventually move to no longer being radicalized.

We don’t notice many of these people. They move to a mental place where they would be willing to carry out some kind of jihadist action, but the spell wears off before they actually act, and they are no longer radicalized. Only a few of those who become radicalized act on it.

Even people who go quite far down the road towards action, perhaps travelling to Syria and participating in violence, may still become disillusioned. Not everyone who returns to the West from the Middle East does so to try to carry out an attack. Many, perhaps most, have had a disease, have recovered from it, and want to get on with their lives.

The risk of a draconian response to returnees is that we may push them back into violence (“If I’m going to be treated as a violent extremist, I might as well act like one”).

Of course, this doesn’t make the job of the intelligence services any easier, at least in the short term. But emergencies tend to create blinkered views of the problem, and I think that this might be happening.

Inspire and Azan paper is out

The paper Edna Reid and I wrote about the language patterns in Inspire and Azan magazines has now appeared (at least online) in Springer’s Security Informatics journal. Here’s the citation:

“Language Use in the Jihadist Magazines Inspire and Azan”
David B. Skillicorn and Edna F. Reid
Security Informatics, 2014, 3:9 (Springer)

The paper examines the intensity of various kinds of language in these jihadist magazines. The main conclusions are:

  • These magazines use language as academic models of propaganda would predict, something that has not been empirically verified at this scale AFAIK.
  • The intellectual level of these magazines is comparable to other mass market magazines — they aren’t particularly simplistic, and they assume a reasonably well-educated readership.
  • The change in editorship/authorship after the deaths of Al-Awlaki and Samir Khan is clearly visible in Inspire. The new authors have changed with each issue, but there is an overarching similarity. Azan has articles claiming many different authors, but the writing style is similar across all articles and issues, so it’s either written by a single person or by a tightly knit group.
  • Jihadist language intensity has been steadily increasing over the past few issues of Inspire, after being much more stable during the Al-Awlaki years (this is worrying).
  • Inspire is experimenting with using gamification strategies to increase motivation for lone-wolf attacks and/or to decrease the reality of causing deaths and casualties. It’s hard to judge whether this is being done deliberately, or by osmosis — the levels of gamification language waver from issue to issue.

ISIS is putting out its own magazine. Its name, “Islamic State News”, and the fact that it is entirely pictorial (comic or graphic novel, depending on your point of view) say something about their view of the target audience.

Canada’s Anti-Spam Legislation: its one good feature spoiled

I commented earlier that the new Canadian Anti-Spam law and Spam Reporting Centre were a complete waste of money because:

1. Spam is no longer a consumer problem but a network problem, which this legislation does nothing to address.
2. Most spammers are beyond the reach of Canadian law enforcement, even if attribution could be cleanly done.
3. There’s an obvious countermeasure for spammers — send lots of spam to the Spam Reporting Centre and pollute the data.

There was one (unintended) good feature, however. Legitimate businesses that knew my email address, and therefore assumed, as businesses do, that I would love to get email from them about every imaginable topic, have felt obliged to ask for my permission to keep doing so. (They aren’t getting it!)

BUT all of these emails contain a link of the form “click here to continue getting our (um) marketing material”, because they’ve realised that nobody’s going to bother with a more cumbersome opt-in mechanism.

Guess what? Spear phishing attacks have been developed to piggyback on this flood of permission emails; I’ve had a number already this week. Since they appear to come from mainstream organisations and the emails look just like theirs, I’m sure they’re getting lots of fresh malware downloaded onto victims’ machines. So look out for even more botnets based in Canada. And thanks again, Government of Canada, for making all of this possible.

The right level of abstraction = the right level to model

I think the takeaway from my last post is that models of systems should aim to model them at the right level of abstraction, where that right level corresponds to the places where there are bottlenecks. These bottlenecks are places where, as we zoom out in terms of abstraction, the system suddenly seems simpler. The underlying differences don’t actually make a difference; they are just variation.

The difficulty is that it’s really, really hard to see or decide where these bottlenecks are. We rightly laud Newton for seeing that a wide range of different systems could all be described by a single equation; but it’s also true that Einstein showed that this apparent simplicity was actually an approximation for a certain (large!) subclass of systems, and so the sweet spot of system modelling isn’t quite where Newton thought it was.

For living systems, it’s even harder to see where the right level of abstraction lies. Linnaeus (apparently the most-cited human) certainly created a model that was tremendously useful, working at the level of the species. This model has frayed a bit with the advent of DNA technology, since the clusters from observations don’t quite match the clusters from DNA, but it was still a huge contribution. But it’s turning out to be very hard to figure out the right level of abstraction to capture ideas like “particular disease” or “particular cancer”, even though we can diagnose them quite well. The variations in what’s happening in cells are extremely difficult to map to what seems to be happening in the disease.

For human systems, the level of abstraction is even harder to get right. In some settings, humans are surprisingly sheep-like and broad-brush abstractions are easy to find. But dig a little, and it all falls apart into “each person behaves as they like”. So predicting the number of “friends” a person will have on a social media site is easy (it will be distributed around Dunbar’s number), but predicting whether or not they will connect with a particular person is much, much harder. Does advertising work? Yes, about half of it (as Ogilvy famously said). But will this ad influence this person? No idea. Will knowing the genre of this book or film improve the success rate of recommendations? Yes. Will it help with this book and this person? Not so much.

Note the connection between levels of abstraction and clustering. In principle, if you can cluster (or, better, bicluster) data about your system and get (a) strong clusters, and (b) not too many of them, then you have some grounds for saying that you’re modelling at the right level. But this approach founders on the details: which attributes to include, which algorithm to use, which similarity measure, which parameters, and so on and on.
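
One way to probe for such a level is sketched below: cluster at several granularities and look for a point where the clustering is both strong and small. The synthetic data, the choice of k-means, and the silhouette score are exactly the kind of arbitrary choices just described:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic data with three well-separated groups.
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    print(k, round(silhouette_score(data, labels), 3))
# A high score at a small k is (weak) evidence of a natural level of abstraction.
```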


