Canada’s Anti Spam — its one good feature spoiled

I commented earlier that the new Canadian Anti Spam law and Spam Reporting Centre were a complete waste of money because:

1. Spam is no longer a consumer problem, but a network problem which this legislation won’t help.
2. Most spammers are beyond the reach of Canadian law enforcement, even if attribution could be cleanly done.
3. There’s an obvious countermeasure for spammers — send lots of spam to the Spam Reporting Centre and pollute the data.

There was one (unintended) good feature, however. Legitimate businesses who knew my email address and therefore assumed, as businesses do, that I would love to get email from them about every imaginable topic, have felt obliged to ask for my permission to keep doing so. (They aren’t getting it!)

BUT all of these emails contain a link of the form “click here to continue getting our (um) marketing material”, because they’ve realised that nobody’s going to bother with a more cumbersome opt-in mechanism.

Guess what? Spear phishing attacks have been developed to piggyback on this flood of permission emails; I’ve had a number already this week. Since they appear to come from mainstream organisations and the emails look just like theirs, I’m sure they’re getting lots of fresh malware downloaded. So look out for even more botnets based in Canada. And thanks again, Government of Canada, for making all of this possible.

The right level of abstraction = the right level to model

I think the takeaway from my last post is that models of systems should aim to model them at the right level of abstraction, where that right level corresponds to the places where there are bottlenecks. These bottlenecks are places where, as we zoom out in terms of abstraction, the system suddenly seems simpler. The underlying differences don’t actually make a difference; they are just variation.

The difficulty is that it’s really, really hard to see or decide where these bottlenecks are. We rightly laud Newton for seeing that a wide range of different systems could all be described by a single equation; but it’s also true that Einstein showed that this apparent simplicity was actually an approximation for a certain (large!) subclass of systems, and so the sweet spot of system modelling isn’t quite where Newton thought it was.

For living systems, it’s even harder to see where the right level of abstraction lies. Linnaeus (apparently the most-cited human) certainly created a model that was tremendously useful, working at the level of the species. This model has frayed a bit with the advent of DNA technology, since the clusters from observations don’t quite match the clusters from DNA, but it was still a huge contribution. But it’s turning out to be very hard to figure out the right level of abstraction to capture ideas like “particular disease” or “particular cancer”, even though we can diagnose them quite well. The variations in what’s happening in cells are extremely difficult to map to what seems to be happening in the disease.

For human systems, the level of abstraction is even harder to get right. In some settings, humans are surprisingly sheep-like and broad-brush abstractions are easy to find. But dig a little, and it all falls apart into “each person behaves as they like”. So predicting the number of “friends” a person will have on a social media site is easy (it will be distributed around Dunbar’s number), but predicting whether or not they will connect with a particular person is much, much harder. Does advertising work? Yes, about half of it (as the saying usually attributed to John Wanamaker goes). But will this ad influence this person? No idea. Will knowing the genre of this book or film improve the success rate of recommendations? Yes. Will it help with this book and this person? Not so much.

Note the connection between levels of abstraction and clustering. In principle, if you can cluster (or, better, bicluster) data about your system and get (a) strong clusters, and (b) not too many of them, then you have some grounds for saying that you’re modelling at the right level. But this approach founders on the details: which attributes to include, which algorithm to use, which similarity measure, which parameters, and so on and on.

Three kinds of knowledge discovery

I’ve always made a distinction between “mainstream” data mining (or knowledge discovery or data analytics) and “adversarial” data mining — they require quite distinct approaches and algorithms. But my work with bioinformatic datasets has made me realise that there are more of these differences, and the differences go deeper than people generally understand. That may be part of the reason why some kinds of data mining are running into performance and applicability brick walls.

So here are 3 distinct kinds of data mining, with some thoughts about what makes them different:

1. Modelling natural/physical, that is clockwork, systems.
Such systems are characterised by apparent complexity, but underlying simplicity (the laws of physics). Such systems are entropy minimising everywhere. Even though parts of such systems can look extremely complex (think surface of a neutron star), the underlying system to be modelled must be simpler than its appearances would, at first glance, suggest.

What are the implications for modelling? Some data records will always be more interesting or significant than others — for most physical systems, records describing the status of deep space are much less interesting than those near a star or planet. So there are issues around the way data is sampled.
Some attributes will also be more interesting or significant than others, but (and here’s the crucial point) this significance is a global property. It’s possible to have irrelevant or uninteresting attributes, but these attributes are similarly uninteresting everywhere. Thus it makes sense to use attribute selection as part of the modelling process.

Because the underlying system is simpler than its appearance suggests, there is a bias towards simple models. In other words, physical systems are the domain of Occam’s Razor.

2. Living systems.
Such systems are characterised by apparent simplicity, but underlying complexity (at least relatively speaking). In other words, most living systems are really complicated underneath, but their appearances often conceal this complexity. It isn’t obvious to me why this should be so, and I haven’t come across much discussion about it — but living systems are full of what computing people call encapsulation, putting parts of systems into boxes with constrained interfaces to the outside.

One big example where this matters, and is starting to cause substantial problems for data mining, is the way diseases work. Most diseases are complex activities in the organism that has the disease, and their precise working out often depends on the genotype and phenotype of that organism as well as of the diseases themselves. In other words, a disease like influenza is a collaborative effort between the virus and the organism that has the flu — but it’s still possible to diagnose the disease because of large-scale regularities that we call symptoms.
It follows that, between the underlying complexity of disease, genotype, and phenotype, and the outward appearances of symptoms, or even RNA concentrations measured by microarrays, there must be substantial “bottlenecks” that reduce the underlying complexity. Our lack of understanding of these bottlenecks has made personalised medicine a much more elusive target than it seemed to be a decade ago. Systems involving living things are full of these bottlenecks that reduce the apparent complexity: species, psychology, language.

All of this has implications for data mining of systems involving living things, most of which have been ignored. First, the appropriate target for modelling should be these bottlenecks because this is where such systems “make the most sense”; but we don’t know where the bottlenecks are, that is, which part of the system (which level of abstraction) should be modelled. In general, this means we don’t know how to guess the appropriate complexity of model to fit with the system. (And the model should usually be much more complex than we expect: in neurology, one of the difficult lessons has been that the human brain isn’t divided into nice functional building blocks; rather it is filled with “hacks”. So is a cell.)

Because systems involving living things are locally entropy reducing, different parts of the system play qualitatively different roles. Thus some data records differ qualitatively in significance from others, so the implicit sampling involved in collecting a dataset is much more difficult, but also much more critical, than for clockwork systems.

Also, because different parts of the system are so different, the attributes relevant to modelling each part of the system will also tend to be different. Hence, we expect that biclustering will play an important role in modelling living systems. (Attribute selection may also still play a role, but only to remove globally uninteresting attributes; and this should probably be done with extreme caution.)

Systems of living things can also be said to have competing interests, even though these interests are not conscious. Thus such systems may involve communication and some kind of “social” interaction — which introduces a new kind of complexity: non-local entropy reduction. It’s not clear (to me at least) what this means for modelling, but it must mean that it’s easy to fall into a trap of using models that are too simple and too monolithic.

3. Human systems.
Human systems, of course, are also systems involving living things, but the big new feature is the presence of consciousness. Indeed, in settings where humans are involved but their actions and interactions are not conscious, models of the previous kind will suffice.

Systems involving conscious humans are locally and non-locally entropy reducing, but there are two extra feedback loops: (1) the loop within the mind of each actor which causes changes in behaviour because of modelling other actors and themself (the kind of thing that leads to “I know that he knows that I know that … so I’ll …”); (2) the feedback loop between actors and data miners.

The first feedback loop creates two processes that must be considered in the modelling:
a. Self-consciousness, which generates, for example, purpose tremor;
b. Social consciousness, which generates, for example, strong signals from deception.

The second feedback loop creates two other processes:
a. Concealment, the intent or action of actors hiding some attributes or records from the modelling;
b. Manipulation, the deliberate attempt to change the outcomes of any analysis that might be applied.

I argue that all data mining involving humans has an adversarial component, because the interests of those being modelled never run exactly with each other, or with those doing the modelling, and so all of these processes must be considered whenever modelling of human systems is done. (You can find much more on this topic by reading back in the blog.)

But one obvious effect is that records and attributes need to have metadata associated with them that carries information about properties such as uncertainty or trustworthiness. Physical systems and living systems might mislead you, but only with your implicit connivance or misunderstanding; systems involving other humans can mislead you either with intent or as a side-effect of misleading someone else.

As I’ve written about before, systems where actors may be trying to conceal or manipulate require care in choosing modelling techniques so as not to be misled. On the other hand, when actors are self-conscious or socially conscious they often generate signals that can help the modelling. However, a complete way of accounting for issues such as trust at the datum level has still to be designed.

Crime in Chicago

The Chicago Police Department makes details of all of its incidents available. For each one, there’s a record describing what kind of incident and crime it was, where it took place (thinly anonymized), and when it happened.

This data is available for more than a decade’s worth of crimes (a big file!) but I’ve used one subset of just over a month as a working dataset. What sorts of things can be learned from such data? One research project looked at seasonal patterns, and discovered that there are strong and consistent patterns over time.

I’m more interested in this dataset as a publicly available example of the kind of information that might be collected about terrorist incidents, IED explosions, and the like. Such data is of mixed type: some fields are numeric, others (times and dates) are cyclic, and still others are textual. So it’s not straightforward to apply knowledge discovery algorithms to them.
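To see why cyclic fields need special care, here is a minimal sketch of one common treatment: placing each time on a circle so that values either side of midnight end up close together. This is a standard encoding offered for illustration, not the technique used in this post; the function names and the example hours are assumptions.

```python
import math

def encode_cyclic(value, period):
    """Map a cyclic value (e.g. hour of day, period=24) to a point on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return (math.cos(angle), math.sin(angle))

def cyclic_distance(a, b, period):
    """Straight-line distance between the encoded positions of two cyclic values."""
    return math.dist(encode_cyclic(a, period), encode_cyclic(b, period))

# Hypothetical hours of day, chosen only to show the wrap-around effect.
late, early, noon = 23, 1, 12
near = cyclic_distance(late, early, 24)  # two hours apart across midnight
far = cyclic_distance(late, noon, 24)    # eleven hours apart
```

Treated as plain numbers, 23 and 1 differ by 22, which is more than the gap between 23 and 12; on the circle the ordering is reversed, which matches intuition about times of day.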

In what follows, I’m using a hashing technique to deal with the non-numeric fields, z-scoring to treat variation in each attribute as equally significant, and singular value decomposition to project the data into lower dimensions and to visualize it. To help understand which attributes are making an interesting difference, the resulting plot can be overlaid so that the colour of each point corresponds to the value of that particular attribute for the record corresponding to that point. (In an earlier post, I did a similar analysis for the complete START set of terrorist attacks.)
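The pipeline described above (hash, z-score, project) can be sketched roughly as follows. This is a minimal standard-library illustration under stated assumptions, not the code behind the figures: the hashing scheme, the field names, the five records, and the use of power iteration to approximate the two leading singular directions are all hypothetical choices.

```python
import hashlib
import math
import statistics

def hash_field(value, buckets=1000):
    """Map a non-numeric field to a stable integer bucket (an assumed scheme).
    hashlib is used because Python's built-in hash() of strings varies between runs."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def z_score_columns(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    scaled_cols = []
    for col in zip(*rows):
        mean, sd = statistics.fmean(col), statistics.pstdev(col)
        scaled_cols.append([(x - mean) / sd if sd > 0 else 0.0 for x in col])
    return [list(row) for row in zip(*scaled_cols)]

def project_2d(matrix, iters=300):
    """Coordinates of each row along two leading singular directions,
    approximated by power iteration with deflation."""
    a = [row[:] for row in matrix]
    n_rows, n_cols = len(a), len(a[0])
    coords = [[] for _ in range(n_rows)]
    for _ in range(2):
        v = [1.0 + 0.1 * j for j in range(n_cols)]  # arbitrary start vector
        for _ in range(iters):
            u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
            w = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # a @ v gives each row's coordinate along the direction v
        scores = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        for i in range(n_rows):
            coords[i].append(scores[i])
            for j in range(n_cols):
                a[i][j] -= scores[i] * v[j]  # remove this direction before the next pass
    return coords

# Hypothetical records: (primary type, arrest flag, hour, x coordinate).
records = [
    ("THEFT", 0, 14, 1.2),
    ("THEFT", 0, 15, 1.1),
    ("BATTERY", 1, 23, 3.4),
    ("BATTERY", 1, 22, 3.6),
    ("NARCOTICS", 1, 2, 2.0),
]
numeric = [[hash_field(kind), arrest, hour, x] for kind, arrest, hour, x in records]
points = project_2d(z_score_columns(numeric))
```

Each entry of `points` is a pair that could be plotted directly, and colouring the points by one original attribute gives the kind of overlay used in the figures. Note that hashing a category to an integer imposes an arbitrary ordering on the categories, so this is only one of several ways the non-numeric fields might be handled.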

Incidents are labelled with the physical coordinates of where they took place, so one way to visualize the data is to plot each of the attributes against position. Here is a figure showing the distribution of incidents in space, labelled by their FBI crime descriptor:

There are patterns to be seen, but they are weak and hard to pick out.

The advantage of clustering using singular value decomposition (SVD) is that we can see the effects of all of the attributes on which incidents resemble others, without having to know anything in advance about the significance of any one of them.

Here’s the clustering derived from SVD:

Clustering of all incidents derived from SVD


It’s clear from this figure that there are clusters, i.e. not all crimes are one-offs, nor are all crimes essentially the same. But what properties of these incidents account for the 7 or so strong clusters? That’s where being able to overlay single attributes on the clusters can help.

For example, if we overlay time of day on the clustering like this:

Clusters overlaid with time of day


we see that time of day is somehow orthogonal to the clustering — it has some relevance, but each of the clusters has the same internal structure with respect to time. So this doesn’t help to explain the clustering but it does suggest that there is deeper structure that is connected to time.

On the other hand, if we overlay the clustering with whether or not it led to an arrest:

Clustering overlaid with arrests


we see that arrested or not plays a major differentiating role in the clustering. Similarly, if we overlay whether or not the incident was domestic:

Clustering overlaid with domestic or not


we see that this also makes a big difference in the clustering.

If we overlay with the primary description of the incident:

Clustering overlaid with the primary classification of the incident.


then we see that another part of the clustering is explained. Notice that, while arrested or not varies from top to bottom, incident classification varies from left to right. In other words, macroscopically at least, there is no correlation between type of crime and arrest rate — they vary in an uncorrelated way — which is a good thing.

There’s a similar structure in the secondary crime description attribute:

Clustering overlaid with the secondary crime classification


We can get another sense of which attributes are driving the clustering by plotting the attributes, produced from the same SVD. In these plots, points far from the centre are more significant than those close to the centre, and those in the same direction from the centre are correlated. Furthermore, points representing incidents are “pulled” towards attributes for which they have large values. So this plot provides a sense of which attributes play the largest role in differentiating incidents, and how they fit together.

Variation among attributes aligned with variation among incidents
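The two reading rules for an attribute plot (distance from the centre indicates significance, shared direction indicates correlation) can be expressed as a small sketch. The attribute names and coordinates below are made up for illustration and are not read off the figure.

```python
import math

def significance(point):
    """Distance from the centre of the plot; larger means more influential."""
    return math.hypot(*point)

def direction_similarity(p, q):
    """Cosine of the angle between two points as seen from the centre."""
    denom = significance(p) * significance(q)
    return (p[0] * q[0] + p[1] * q[1]) / denom if denom else 0.0

# Hypothetical 2-D attribute coordinates; names and numbers are invented.
attribute_points = {
    "arrest": (0.1, 0.9),
    "domestic": (0.2, 0.7),
    "primary_type": (0.8, -0.1),
    "beat": (0.05, 0.02),
}
ranked = sorted(attribute_points,
                key=lambda name: significance(attribute_points[name]),
                reverse=True)
aligned = direction_similarity(attribute_points["arrest"], attribute_points["domestic"])
unrelated = direction_similarity(attribute_points["arrest"], attribute_points["primary_type"])
```

In this invented example, `aligned` is close to 1 (the two attributes point the same way from the centre) while `unrelated` is close to 0 (the directions are nearly perpendicular), and `ranked` lists the attributes from most to least influential.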


In fact, we can plot incidents and attributes in the same plot to make the connections obvious:

The relationship between attributes and incidents


This dataset also shows how difficult some of the problems of anomaly detection are. Suppose we wanted to answer the question: which incident was the most unusual in this dataset? The SVD provides a theoretically well-motivated answer: the one whose representative point is farthest from the origin. However, looking at the clustering, this theoretical answer seems rather weak.
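The distance-from-origin score is simple to compute, and a toy example suggests one possible reason it can feel unsatisfying when the data is strongly clustered. The points below are hypothetical and chosen only to make the effect visible; they are not taken from the Chicago data.

```python
import math

def rank_by_distance(points):
    """Return (index, distance) pairs, farthest from the origin first."""
    scored = [(i, math.hypot(*p)) for i, p in enumerate(points)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical projected incidents: two tight clusters plus one stray point.
projected = [
    (2.0, 2.1), (2.1, 2.0), (1.9, 2.0),   # cluster A
    (-2.0, -2.1), (-2.1, -1.9),           # cluster B
    (0.1, 0.0),                           # belongs to neither cluster
]
ranking = rank_by_distance(projected)
```

Here the stray point at index 5, which a person looking at the plot might call the odd one out, receives the lowest score, while ordinary members of the two clusters receive the highest. Whether this is the same weakness visible in the real clustering is an assumption.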

Update on Inspire and Azan magazines

Issue 12 of Inspire and Issue 5 of Azan are now out, so I’m updating the analysis of the language patterns in these two sequences of magazines.

To recap, both of these magazines are glossy and picture-heavy and intended primarily to encourage lone-wolf attacks by diaspora jihadists. It’s unclear how much impact they have actually had — several attackers have had copies, but so have many other non-attackers in the same environments. We have written a full analysis that can be downloaded from SSRN (here).

Here is the variation among issues for Inspire, based on the 1000 most-frequent words:


You can see that the first 8 issues, edited by Samir Khan, are quite similar to one another, except for Issues 3 and 7, which are different in tone (and quite similar to one another, although that isn’t obvious in this figure). The new issues, by unknown editors, don’t resemble one another very much, but they do have an underlying consistency (they form almost a straight line) which argues for some underlying organization.
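A comparison based on the most frequent words starts from a table with one row per issue and one column per word. The sketch below shows one plausible way to build such a table; the tokenizer, the use of relative frequencies, and the three placeholder texts are assumptions, and the step that turns the table into a plot (for example an SVD) is not shown.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_matrix(documents, top_n=1000):
    """Rows are documents, columns are the top_n most frequent words overall;
    each entry is that word's share of the document's tokens."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    total = Counter()
    for c in counts:
        total.update(c)
    vocabulary = [word for word, _ in total.most_common(top_n)]
    matrix = []
    for c in counts:
        size = sum(c.values()) or 1
        matrix.append([c[word] / size for word in vocabulary])
    return vocabulary, matrix

# Placeholder texts standing in for issues; not taken from either magazine.
issues = [
    "the editor writes about the news and the weather",
    "the editor writes about travel and food",
    "a different voice with different words entirely",
]
vocabulary, matrix = frequency_matrix(issues, top_n=5)
```

Rows that share an author’s habits (here the first two) end up with similar profiles over the common words, while the third row is all zeros on this small vocabulary, which is the kind of difference a projection would spread out on the page.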

The other interesting figures are based on a model of the intensity of jihadi language. The figure shows the variation among issues of both magazines, with jihadi intensity increasing from right to left:


Overall, the jihadist intensity of Azan is lower than that of Inspire; but the most recent four issues of Inspire represent a departure: their levels are much, much greater than previous issues of Inspire and all of the issues of Azan. This is a worrying trend.

Inspire and Azan magazines

I’ve been working (with Edna Reid) on understanding Inspire and Azan magazines from the perspective of their language use.

These two magazines are produced by Islamists, aimed at Western audiences, and intended primarily to motivate lone-wolf attacks. Inspire comes out of AQAP, whereas Azan seems to have a Pakistan/Afghanistan base and to be targeted more at South Asians.

Both magazines have some inherent problems: it’s difficult to convince others to carry out actions that will get them killed or imprisoned using such a narrow channel and appealing only to mind and emotions. The evidence for the effectiveness of these magazines is quite weak — those (few) who have carried out lone-wolf attacks in the West have often been found to have read these magazines — but so have many others in their communities who didn’t carry out such attacks.

Regardless of effectiveness, looking at language usage gives us a way to reverse engineer what’s going on in the minds of the writers and editors. For example, it’s clear that the first 8 issues of Inspire were produced by the same (two) people, but that issues 9-11 have been produced by three different people (but with some interesting underlying commonalities). It’s also clear that all of the issues of Azan so far are produced by one person (or perhaps a small group with a very similar mindset) despite the different names used as article authors.

Overall, Inspire lacks a strategic focus. Issues appear when some event in the outside world suggests a theme, and what gets covered, and how, varies quite substantially from issue to issue. Azan, on the other hand, has been tightly focused with a consistent message, and much more regular publication. Measures of informative and imaginative language are also consistently higher for Azan than for Inspire.

The intensity of jihadist language in Inspire has been steadily increasing in recent issues. The level of deception has also been increasing; the latter is surprising because previous studies have suggested that jihadi intensity tends to be correlated with low levels of deception. This may be a useful signal for intelligence organizations.

A draft of the paper about this is available on SSRN:

