Posts Tagged 'intelligence analysis'

Provenance — a neglected aspect of data analytics

Provenance is defined by Merriam-Webster as “the history of ownership of a valued object or work of art or literature”, but the idea has much wider applicability.

There are three kinds of provenance:

  1. Where did an object come from? This kind of provenance is often associated with food and drink: country of origin for fresh produce, brand for other kinds of food, appellation d’origine contrôlée for French wines, and many other examples. This kind of provenance is usually signalled by something that is attached to the object.
  2. Where did an object go on its way from source to destination? This is actually the most common form of provenance historically — the way that you know that a chair really is a Chippendale is to be able to trace its ownership all the way back to the maker. A chair without provenance is probably much less valuable, even though it may look like a Chippendale and the wood seems the right age. This kind of provenance is beginning to be associated with food. For example, some shipments now have temperature sensors attached to them that record the maximum temperature they ever encountered between source and destination. Many kinds of shipments have long had details about their pathway and progress available to shippers, but this is now being exposed to customers as well. So if you buy something from Amazon you can follow its progress (roughly) from warehouse to you.
  3. What else did the object encounter on its way from source to destination? This third kind of provenance is still in its infancy, and it comes in two forms. First, what other objects was it close to? This is the essence of Covid19 contact tracing apps, but it applies to any situation where closeness could be associated with poor outcomes. Second, were the objects that it was close to ones that were expected or made sense?

The first and second forms of provenance don’t lead to interesting data-analytics problems. They can be solved by recording technologies with, of course, issues of reliability, unforgeability, and non-repudiation.

But the third case raises many interesting problems. Public health models of the spread of infection usually assume some kind of random particle model of how people interact (with various refinements such as compartments). These models would be much more accurate if they could be based on actual physical encounter networks — but privacy quickly becomes an issue. Nevertheless, there are situations where encounter networks are already collected for other reasons: bus and train driver handovers, shift changes of other kinds, police-present incidents; and such data provides natural encounter networks. [One reason why Covid19 contact tracing apps work so poorly is that Bluetooth proximity is a poor surrogate for potentially infectious physical encounter.]

Customs also has a natural interest in provenance: when someone or something presents at the border, the reason they’re allowed to pass or not is all about provenance: hard coded in a passport, pre-approved by the issue of a visa, or with real-time information derived from, say, a vehicle licence plate.

Some of clearly suspicious, but hard to detect, situations arise from mismatched provenance. For example, if a couple arrive on the same flight, then they will usually have been seated together; if two people booked their tickets or got visas using the same travel agency at the same time then they will either arrive on different flights (they don’t know each other), or they will arrive on the same flight and sit together (they do know each other). In other words, the similarity of provenance chains should match the similarity of relationships, and mismatches between the two signal suspicious behaviour. Customs data analytics is just beginning to explore leveraging this kind of data.
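The matching idea can be sketched with a toy similarity computation. Everything below (the event encoding, the Jaccard measure, the travellers and their details) is an invented illustration, not an actual customs-analytics method:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of provenance events."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical provenance chains for two travellers: booking agency,
# booking day, flight, and seat block.
p1 = {"agency:Acme", "booked:2020-03-01", "flight:XY123", "seats:12AB"}
p2 = {"agency:Acme", "booked:2020-03-01", "flight:XY123", "seats:31F"}

chain_similarity = jaccard(p1, p2)

# They claim not to know each other, so the expected relationship
# similarity is low.
claimed_relationship = 0.0

# A large gap between provenance-chain similarity and claimed
# relationship is the kind of mismatch described above.
mismatch = chain_similarity - claimed_relationship
print(round(mismatch, 2))  # 0.6 (same agency, day, and flight, but seated apart)
```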

Understanding risk at the disaster end of the spectrum

In conventional risk analysis, risk is often expressed as

risk = threat probability x potential loss

When the values of the terms on the right hand side are in the middle of their ranges, then our intuition seems to understand this equation quite well.

But when the values are near their extremes, our intuition goes out the window, as the world’s coronavirus experience shows. The pandemic is what Taleb calls a black swan, an event where the threat probability is extremely low, but the potential loss is extremely high. For example, if the potential loss is of the order of 10^9 (a billion) then a threat probability of 1 in a thousand still has a risk of magnitude a million.
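The arithmetic behind this example is simple enough to check directly:

```python
def risk(threat_probability, potential_loss):
    """Conventional risk formula: risk = threat probability x potential loss."""
    return threat_probability * potential_loss

# A "black swan": tiny probability, enormous loss.
black_swan = risk(1e-3, 1e9)   # 1-in-a-thousand chance of a billion-unit loss
everyday = risk(0.5, 100)      # mid-range values our intuition handles well

print(black_swan)  # 1000000.0 -- still a million, despite the tiny probability
print(everyday)    # 50.0
```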

I came across another disaster waiting to happen, with the same kind of characteristics as the coronavirus pandemic — cyber attacks on water treatment facilities.

In the U.S., water treatment facilities are small organizations that don’t have specialized IT staff who can protect their systems. But cyber attacks on such facilities could cause mass casualties. While electricity grids, Internet infrastructure, and financial systems have received some protective attention, water treatment is the forgotten sibling. A classic example of a small (but growing) threat probability and a huge potential loss.

The threat isn’t even theoretical. Attacks have already been attempted.

What causes extremist violence?

This question has been the subject of active research for more than four decades. There have been many answers that don’t stand up to empirical scrutiny — because the number of those who participate in extremist violence is so small, and because researchers tend to interview them, but fail to interview all those identical to them who didn’t commit violence.

Here’s a list of the properties that we now know don’t lead to extremist violence:

  • ideology or religion
  • deprivation or unhappiness
  • political/social alienation
  • discrimination
  • moral outrage
  • activism or illegal non-violent political action
  • attitudes/belief

How do we know this? Mostly because, if you take a population that exhibits any of these properties (typically many hundreds of thousands of people), you find that one or two have committed violence, but the others haven’t. So properties such as these have absolutely no predictive power.
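The base-rate arithmetic can be made explicit. A sketch using the order-of-magnitude figures from the text (the population size is an assumed illustrative number, not a measured one):

```python
# Back-of-the-envelope figures: a property shared by hundreds of
# thousands of people, of whom one or two committed violence.
population_with_property = 500_000   # assumed illustrative figure
violent_among_them = 2

# Positive predictive value: P(violent | has the property)
ppv = violent_among_them / population_with_property
print(ppv)  # 4e-06 -- effectively zero predictive power
```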

On the other hand, there are a few properties that do lead to extremist violence:

  • being the child of immigrants
  • having access to a local charismatic figure
  • travelling to a location where one’s internal narrative is reinforced
  • participation in a small group echo chamber with those who have similar patterns of thought
  • having a disconnected-disordered or hypercaring-compelled personality

These don’t form a diagnostic set, because there are still many people who have one or more of them, and do not commit violence. But they are a set of danger signals, and the more of them an individual has, the more attention should be paid to them (on the evidence of the past 15 years).

You can find a full discussion of these issues, and the evidence behind them, in “Terrorists, Radicals, and Activists: Distinguishing Between Countering Violent Extremism and Preventing Extremist Violence, and Why It Matters” in Violent Extremism and Terrorism, Queen’s University Press, 2019.


‘AI’ performance not what it seems

As I’ve written about before, ‘AI’ tends to be misused to refer to almost any kind of data analytics or derived tool — but let’s, for the time being, go along with this definition.

When you look at the performance of these tools and systems, it’s often quite poor, but I claim we’re getting fooled by our own cognitive biases into thinking that it’s much better than it is.

Here are some examples:

  • Netflix’s recommendations for any individual user seem to overlap 90% with the ‘What’s trending’ and ‘What’s new’ categories. In other words, Netflix is recommending to you more or less what it’s recommending to everyone else. Other recommendation systems don’t do much better (see my earlier post on ‘The Sound of Music Problem’ for part of the explanation).
  • Google search results are quite good at returning, in the first few links, something relevant to the search query, but we don’t ever get to see what was missed and might have been much more relevant.
  • Google News produces what, at first glance, appear to be quite reasonable summaries of recent relevant news, but when you use it for a while you start to see how shallow its selection algorithm is — putting stale stories front and centre, and occasionally producing real howlers: weird stories from some tiny venue treated as if they were breaking and critical news.
  • Self driving cars that perform well, but fail completely when they see certain patches on the road surface. Similarly, facial recognition systems that fail when the human is wearing a t-shirt with a particular patch.

The commonality between these examples, and many others, is that the assessment from use is, necessarily, one-sided — we get to see only the successes and not the failures. In other words (HT Donald Rumsfeld), we don’t see the unknown unknowns. As a result, we don’t really know how well these ‘AI’ systems really do, and whether it’s actually safe to deploy them.

Some systems are ‘best efforts’ (Google News) and that’s fair enough.

But many of these systems are beginning to be used in consequential ways and, for that, real testing and real public test results are needed. And not just true positives, but false positives and false negatives as well. There are two main flashpoints where this matters: (1) systems that are starting to do away with the human in the loop (self driving cars, 737 Maxs); and (2) systems where humans are likely to say or think ‘The computer (or worse, the AI) can’t be wrong’; and these are starting to include policing and security tools. Consider, for example, China’s social credit system. The fact that it gives low scores to some identified ‘trouble makers’ does not imply that everyone who gets a low score is a trouble maker — but this false implication lies behind this and almost all discussion of ‘AI’ systems.

Annular similarity

When similarity is used for clustering, then obviously the most similar objects need to be placed in the same cluster.

But when similarity is being used for human consumption, a different dynamic is in play — humans usually already know what the most similar objects are, and are interested in those that are (just) beyond those.

This can be seen most clearly in recommender systems. Purchase an item or watch a Netflix show, and your recommendation list will fill up with new objects that are very similar to the thing you just bought/watched.

From a strictly algorithm point of view, this is a success — the algorithm found objects similar to the starting object. But from a human point of view this is a total fail because it’s very likely that you, the human, already know about all of these recommended objects. If you bought something, you probably compared the thing you bought with many or all of the objects that are now being recommended to you. If you watched something, the recommendations are still likely to be things you already knew about.

The misconception about what similarity needs to mean to be useful to humans is at the heart of the failure of recommender systems, and even the ad serving systems that many of the online businesses make their money from. Everyone has had the experience of buying something, only to have their ad feed (should they still see it) fill up with ads for similar products (“I see you just bought a new car — here are some other new cars you might like”).

What’s needed is annular similarity — a region that is centred at the initial object, but excludes new objects that are too similar, and focuses instead on objects that are a bit similar.

Amazon tries to do this via “People who bought this also bought” which can show useful add-on products. (They also use “People who viewed this also viewed” but this is much less effective because motivations are so variable.) But this mechanism also fails because buying things together doesn’t necessarily mean that they belong together — it’s common to see recommendations based on the fact that two objects were on special on the same day, and so more likely to be bought together because of the opportunity, rather than any commonality.

Annular similarity is also important in applications that help humans to learn new things: web search, online courses, intelligence analysis. That’s why we built the ATHENS divergent web search engine (refs below) — give it some search terms and it returns (clusters of) web pages that contain information that is just over the horizon from the search terms. We found that this required two annuli — we first constructed the information implicit in the search terms, then an annulus around that of information that we assumed would be known to someone who knew the core derived from the search terms, and only then did we generate another annulus which contains the results returned.

We don’t know many algorithmic ways to find annular similarity. In any distance-based clustering it’s possible, of course, to define an annulus around any point. But it’s tricky to decide on what the inner and outer radii should be, the calculations have to happen in high-dimensional space where the points are very sparse, and it’s not usually clear whether the space is isotropic.

Annular similarity doesn’t work (at least straightforwardly) in density-based (e.g. DBScan) or distribution-based clustering (e.g. EM) because the semantics of ‘cluster’ doesn’t allow for an annulus.

One way that does work (and was used extensively in the ATHENS system) is based on singular value decomposition (SVD). An SVD projects a high-dimensional space into a low-dimensional one in such a way as to preserve as much of the variation as possible. One of its useful side-effects is that a point that is similar to many other points tends to be projected close to the origin; and a point that is dissimilar to most other points also tends to be projected close to the origin because the dimension(s) it inhabits have little variation and tend to be projected away. In the resulting low-dimensional projection, points far from the origin tend to be interestingly dissimilar to those at the centre of the structure — and so an annulus imposed on the embedding tends to find an interesting set of objects.
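A minimal sketch of this projection-plus-annulus idea on synthetic data. The matrix, the number of retained dimensions, and the percentile thresholds are all assumptions for illustration; the ATHENS pipeline itself was more elaborate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy object-attribute matrix: rows are objects, columns are attributes.
# (Hypothetical data; a real system would use term-document-style matrices.)
X = rng.random((200, 50))

# Centre and project to a low-dimensional space with a truncated SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
embedding = U[:, :k] * s[:k]   # each row is an object's projected position

# Distance from the origin in the projection: points near the origin are
# either similar to everything or dissimilar to everything; interesting
# objects tend to sit further out.
r = np.linalg.norm(embedding, axis=1)

# Impose an annulus: keep objects whose radius falls between two
# (assumed, data-dependent) thresholds, chosen here as percentiles.
inner, outer = np.percentile(r, 60), np.percentile(r, 90)
annulus = np.where((r >= inner) & (r <= outer))[0]
print(len(annulus), "objects in the annulus")
```

Choosing the inner and outer radii is exactly the tricky part the text describes; percentiles are just one crude way to make the choice data-dependent.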

Unfortunately this doesn’t solve the recommender system problem because recommenders need to find similar points that have more non-zeroes than the initial target point — and the projection doesn’t preserve this ordering well. That means that the entire region around the target point has to be searched, which becomes expensive.

There’s an opportunity here to come up with better algorithms to find annular structures. Success would lead to advances in several diverse areas.

(A related problem is the Sound of Music problem, the tendency for a common/popular object to muddle the similarity structure of all of the other objects because of its weak similarity to all of them. The Sound of Music plays this role in movie recommendation systems, but think of wrapping paper as a similar object in the context of Amazon. I’ve written about this in a previous post.)


Tracy A. Jenkin, Yolande E. Chan, David B. Skillicorn, Keith W. Rogers:
Individual Exploration, Sensemaking, and Innovation: A Design for the Discovery of Novel Information. Decision Sciences 44(6): 1021-1057 (2013)
Tracy A. Jenkin, David B. Skillicorn, Yolande E. Chan:
Novel Idea Generation, Collaborative Filtering, and Group Innovation Processes. ICIS 2011
David B. Skillicorn, Nikhil Vats:
Novel information discovery for intelligence and counterterrorism. Decision Support Systems 43(4): 1375-1382 (2007)
Nikhil Vats, David B. Skillicorn:
Information discovery within organizations using the Athens system. CASCON 2004: 282-292


Islamist violent extremism and anarchist violent extremism

Roughly speaking, three explanations for islamist violent extremism have been put forward:

  1. It’s motivated by a religious ideology (perhaps a perversion of true Islam, but sincerely held by its adherents);
  2. It’s motivated by political or insurgent ends, and so the violence is instrumental;
  3. It’s the result of psychological disturbance in its adherents.

In the months after the 9/11 World Trade Center attacks, Marc Sageman argued vigorously for the first explanation, pointing out that those involved in al Qaeda at the time were well-educated and at least middle class, were religious, and showed no signs of psychological disturbances. There was considerable push back to his arguments, mostly promoting Explanation 3 but, in the end, most Western governments came around to his view.

In the decade since, most Western countries have slipped into Explanation 2. I have argued that this is largely because these countries are post-Christian, and so most of those in the political establishment have post-modern ideas about religion as a facade for power. They project this world view onto the Middle Eastern world, and so cannot see that Explanation 1 is even possible — to be religious is to be naive at best and stupid at worst. This leads to perennial underestimation of islamist violent extremist goals and willingness to work towards them.

It’s widely agreed that the motivation for Daesh is a combination of Explanations 1 and 2, strategically Explanation 1, but tactically Explanation 2.

The new feature, however, is that Daesh’s high-volume propaganda is reaching many psychologically troubled individuals in Western countries who find its message to be an organising principle and a pseudo-community.

“Lone wolf” attacks can therefore be divided into two categories: those motivated by Explanation 1, and those motivated by Explanation 3, and the latter are on the rise. Marc Sageman has written about the extent to which foiled “plots” in the U.S. come very close to entrapment of vulnerable individuals who imagine that they would like to be terrorists, and take some tiny initial step, only to find an FBI agent alongside them, urging them to take it further. (M. Sageman, The Stagnation in Terrorism Research, Terrorism and Political Violence, Vol. 26, No. 4, 2014, 565-580)

Understanding these explanations is critical to efforts at de-radicalization. Despite extensive efforts, I have seen very little evidence that de-radicalization actually works. But it makes a difference what you think you’re de-radicalizing from. Addressing Explanation 1 seems to be the most common strategy (“your view of Islam is wrong, see the views of respected mainstream Imams, jihad means personal struggle”).

Addressing Explanation 2 isn’t usually framed as de-radicalization but, if the violence is instrumental, then instrumental arguments would help (“it will never work, the consequences are too severe to be worth it”).

Addressing Explanation 3 is something we know how to do, but this explanation isn’t the popular one at present, and there are many pragmatic issues about getting psychological help to people who don’t acknowledge that they need it.

Reading the analysis of anarchist violence in the period from about 1880 to around 1920 has eerie similarities to the analysis of islamist violence in the past 15 years, both in the popular press, and in the more serious literature. It’s clear that there were some (but only a very few) who were in love with anarchist ideology (Explanation 1); many more who saw it as a way (the only way) to change society for the better (Explanation 2) — one of the popular explanations for the fading away of anarchist attacks is that other organisations supporting change developed; but there were also large numbers of troubled individuals who attached themselves to anarchist violence for psychological reasons. It’s largely forgotten how common anarchist attacks became during these few decades. Many were extremely successful — assassinations of a French president, an American president, an Austrian Empress, an Italian king — and, of course, the Great War was inadvertently triggered by an assassination of an Archduke.

Western societies had little more success stemming anarchist violence than we are having with islamist violence. The Great War probably had as much effect as anything, wiping out the demographic most associated with the problem. We will have to come up with a better solution.

(There’s a nice recap of anarchist violence and its connections to islamist violence here.)

Inspire and Azan paper is out

The paper Edna Reid and I wrote about the language patterns in Inspire and Azan magazines has now appeared (at least online) in Springer’s Security Informatics journal. Here’s the citation:

“Language Use in the Jihadist Magazines Inspire and Azan”
David B Skillicorn and Edna F Reid
Springer Security Informatics, 2014, 3:9

The paper examines the intensity of various kinds of language in these jihadist magazines. The main conclusions are:

  • These magazines use language as academic models of propaganda would predict, something that has not been empirically verified at this scale AFAIK.
  • The intellectual level of these magazines is comparable to other mass market magazines — they aren’t particularly simplistic, and they assume a reasonably well-educated readership.
  • The change in editorship/authorship after the deaths of Al-Awlaki and Samir Khan is clearly visible in Inspire. The authors have changed with each issue, but there is an overarching similarity. Azan has articles attributed to many different authors, but the writing style is similar across all articles and issues; so it’s either written by a single person or by a tightly knit group.
  • Jihadist language intensity has been steadily increasing over the past few issues of Inspire, after being much more stable during the Al-Awlaki years (this is worrying).
  • Inspire is experimenting with using gamification strategies to increase motivation for lone-wolf attacks and/or to decrease the reality of causing deaths and casualties. It’s hard to judge whether this is being done deliberately, or by osmosis — the levels of gamification language waver from issue to issue.

ISIS is putting out its own magazine. Its name, “Islamic State News”, and the fact that it is entirely pictorial (comic or graphic novel depending on your point of view) says something about their view of the target audience.

Pull from data versus push to analyst

One of the most striking things about the discussion of the NSA data collection that Snowden has made more widely known is the extent to which the paradigm for its use is database oriented. Both the media and, more surprisingly, the senior administrators talk only about using the data as a repository: “if we find a cell phone in Afghanistan we can look to see which numbers in the US it has been calling and who those numbers in turn call” has been the canonical justification. In other words, the model is: collect the data and then have analysts query it as needed.

The essence of data mining/knowledge discovery is exactly the opposite: allow the data to actively and inductively generate models with an associated quality score, and use analysts to determine which of these models is truly plausible and then useful. In other words, rather than having analysts create models in their heads and then use queries to see if they are plausible (a “pull” model), algorithmics generates models inductively and presents them to analysts (a “push” model). Since getting analysts to creatively think of reasonable models is difficult (and suffers from the “failure of imagination” problem), the inductive approach is both cheaper and more effective.

For example, given the collection of metadata about which phone numbers call which others, it’s possible to build systems that produce results of the form: here’s a set of phone numbers whose calling patterns are unlike any others (in the whole 500 million node graph of phones). Such a calling pattern might not represent something bad, but it’s usually worth a look. The phone companies themselves do some of this kind of analysis, for example to detect phones that are really business lines but are claiming to be residential and, in the days when long distance was expensive, to detect the same scammers moving across different phone numbers.
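A toy version of this kind of flagging might look as follows. The records and the “suspicious” rule are invented for illustration; real systems work on graphs with hundreds of millions of nodes and much richer calling-pattern features:

```python
from collections import Counter

# Hypothetical call-metadata records: (caller, callee) pairs.
calls = [
    ("A", "B"), ("A", "C"), ("B", "A"), ("C", "A"),
    ("D", "E"), ("E", "D"),
    # X only ever calls out, and is never called back.
    ("X", "P"), ("X", "Q"), ("X", "R"), ("X", "S"), ("X", "T"),
]

out_deg = Counter(src for src, _ in calls)
in_deg = Counter(dst for _, dst in calls)

# A crude "unusual pattern" feature: numbers with many outgoing calls but
# no incoming ones (one signature of, e.g., a line claiming to be
# residential while behaving like a business broadcast line).
numbers = set(out_deg) | set(in_deg)
suspicious = [n for n in numbers if out_deg[n] >= 5 and in_deg[n] == 0]
print(suspicious)  # ['X']
```

The point of the push model is that rules like this are generated and scored inductively from the data, rather than being dreamed up one query at a time by an analyst.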

I would hope that inductive model building is being used on collected data, and the higher-ups in the NSA either don’t really understand or are being cagey. But I’ve talked to a lot of people in government who collect large data but are completely stuck in the database model, and have no inkling of inductive modelling.

More thwarted attacks in Canada

Some things in life happen because of a lot of little decisions over time — if you don’t brush your teeth you’re going to get cavities; others happen very quickly — you might see a TV program about a hobby only once and it becomes something that you do through your whole life. Radicalisation is more like the latter than the former.

As a rule of thumb, in Western countries about 1 in 10,000 Muslims becomes a violent extremist. So that means that 9,999 people in the same families, suburbs, schools, work environments, with the same access to government services, and with the same neighbours don’t become radicalised. Right away, that’s a pretty strong signal that the causes of radicalisation are not macro causes, but much smaller ones, related to individual personalities and life journeys. The problem isn’t with any government’s international policies, or with its domestic policies, or with its social support system; it’s about the accidental events. Which means that there isn’t a lot to be done about it via the heavy hammers of government programs.

It also means that finding people who have become violent extremists is difficult. There is an advantage to a global brand like al Qaeda: it encourages wannabees to get in touch with it, providing an opportunity for intelligence and law enforcement to notice. Canada’s record at finding Islamist violent extremists before they carry out attacks has been good, much better than its record at finding those who’ve been blowing up hydro towers and banks precisely because these other violent extremists don’t need to communicate outside of whatever their small group is.

We’ll wait to see if Nuttall and Korody really did ‘self-radicalise’ without any contact with someone who was already radicalised, and whether the security services got onto them without a tipoff from someone who knew them — if either of these, that will be a first for Canada.

It’s not secret if it’s been in the papers

Everything (except for a few small factoids) that Snowden has revealed publicly so far also appeared in the May 10th 2006 USA Today front-page article, so much of the breast-beating of the past two weeks has had elements of farce associated with it.

And based on what’s come out so far, the US would have some trouble convicting Snowden of more than some low-level improper handling of data charges — someone with a security clearance is not prevented from saying things that are in the public domain. Obviously a trial would be something of an embarrassment as well. Perhaps that’s why the US pursuit of Snowden has been somewhat lackadaisical.

He may, of course, have taken other material which is more damaging. Even here, though, it’s hard to see what this could be. The media has been full of “Now our enemies (Russians, Chinese, al Qaeda) know that we intercept their signals”. But, of course, they already knew, not least because of the USA Today article. Reuters put out an article explaining how jihadists were adapting their technology now that they know about this US capability. Absolute rubbish! The only people who might not have known were low-level amateurs, and even then they’d have to be not very bright or rather disconnected from the internet. So knowledge of the existence of these programs does not aid the enemy.

What about targeting details? The US military testified before Congress last year that they worked on the assumption that their military networks (air gapped from the internet) were compromised; and the subtext wasn’t that they wished they had the skills to do the same to the military networks of other countries. Lists of compromised IP addresses are not especially valuable since enemies assume that all IP addresses might have been compromised. In other words, the enemy are not going to look at this kind of data and say “Shoot, they got into that system” because they will already have assumed that they had. (Of course, despite efforts to be professional, there’s always a difference between “We assume this system has been compromised” and “We know this system has been compromised”.)

Details of technologies used might be of some interest. Other countries will certainly already have this information (that’s what their intelligence services are for) but terrorist groups might not. On the other hand, the technical possibilities are fairly obvious — for example, there was a recent paper showing that content in encrypted Skype traffic could be detected in some detail.

What might be more interesting to enemies is details of timelines and policies, for example how quickly is something interesting likely to be noticed and how quickly would it flow up the chain of command for action to be taken. This kind of information is hard to infer from the technical layout of the system — but, for that reason, it’s probably something Snowden didn’t know much about.

How does collaboration change behaviour?

If you wanted to know whether someone who was collaborating with others or someone working alone would have the most variable behaviour, I think you could make arguments for both sides. On the one hand, someone collaborating is getting stimulus from those others which might lead to greater variation; on the other hand, some, especially in the business community, might characterise ‘stimulus’ as ‘interruption’ or constraints and think that it might lead to less variation.

The paper “Detecting Collaboration from Behavior” by Bauer, Garcia, Colbaugh, and Glass, presented at the recent ISI2013 in Seattle shows that the answer, at least for Wikipedia editors, is that collaboration increases variation as measured by entropy.
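As a sketch of the underlying measurement idea (not the paper’s actual methodology), behavioural variability can be quantified as the Shannon entropy of an editor’s action distribution; the action categories and logs below are hypothetical:

```python
import math
from collections import Counter

def behaviour_entropy(actions):
    """Shannon entropy (in bits) of an editor's action distribution."""
    counts = Counter(actions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical action logs: a solo editor repeats a narrow routine,
# while a collaborating editor's actions are more varied.
solo = ["edit", "edit", "edit", "edit", "revert", "edit", "edit", "edit"]
collab = ["edit", "discuss", "revert", "edit", "merge", "discuss", "edit", "review"]

print(behaviour_entropy(solo) < behaviour_entropy(collab))  # True
```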

Obviously, this result needs to be followed up in other domains to see if it continues to be true — but there isn’t a lot about the analysis that’s specific to Wikipedia, or editing, so it looks like it will. From an intelligence point of view this suggests another channel for seeing what’s going on inside and between groups of violent extremists. It fits nicely with the analysis my group is doing looking at how language patterns across conversations (for example, email threads) reveal the interactions among the authors.

You may also be interested in ‘fertilizer’ or ‘last minute flights’

or ‘7 amazing ways to remove explosive residue’.

As I mentioned in my last post, the online-advertising businesses are spending as much time building models of us all as the NSA is spending building models of violent extremists, and have access to more data.

So how are they doing? If we looked at the ads being served to people like the Tsarnaev brothers, would we find that these businesses have (unwittingly) built usable models of lone-wolf violent extremists — and so the pattern of ads served to such people is actually a signal of their potential for violence? There seems at least a decent chance that they have, and maybe this should be followed up.

Terrorist incidents come in only a few flavors

Terrorist attacks are different in many ways: they take place in different countries, with different motivations behind them, using different mechanisms, and with varying degrees of success. But are there any commonalities that could be used, for example, to categorize them and so to defend against them in more focused ways? The answer is yes, there are large-scale similarities.

To do this analysis, I started from the Global Terrorism Database developed by START, the National Consortium for the Study of Terrorism and Responses to Terrorism. The database contains details of all incidents that meet their coding standards since the beginning of 1970, and I used the version released at the end of 2012. There was one major discontinuity where new fields were added, but overall the coding has been consistent over the entire 40+ year period.

The image below shows the clustering of all attacks over that time period:

The large structure looks like a hinge, with clusters A and B at the top, clusters C and D forming the hinge itself, and clusters E, F, G, and H at the bottom. There’s also a distinction between the clusters at the front (B, D, F, and H) and those at the back (A, C, E, and G). (You’ll have to expand the figure to see the labels clearly.)

The first thing to notice is that there are only 8 clusters and, with the exception of H, which is quite diffuse, the clusters are fairly well defined. In other words, there are 8 distinctive kinds of terrorist attack (and only 8, over a very long time period).

Let’s dig into these clusters and see what they represent. The distinction between the front and the back is almost entirely related to issues of attribution: whether the attack was claimed, how clear that claim is (for example, are there multiple claims of responsibility for the same incident), and whether the incident is properly classed as terrorism or something else (quasi-military, for example).

The structure of the hinge differentiates between incidents involving capturing people (hijackings or kidnappings in A and B) and incidents that are better characterized as attacks (C, D, E, F, G, H).  The extremal ends of A and B (to the right) are incidents that lasted longer and/or the ransom was larger.

The differences between C/D, E/F, and G/H arise from the number of targets (which seems to be highly correlated with the number of different nationalities involved). So C and D are attacks on a single target, E and F are attacks on two targets, and G and H are attacks on three targets. Part of the diffuse structure of H happens because claims are always murkier for more complex attacks and part because there is a small group of incidents involving 4 targets that appears, as you’d expect, even further down and to the right.

Here are some interesting figures which overlay the intensity of a property on the clustering, so that you can see how it’s associated with the clusters:


This figure shows whether the incident was claimed or not. The color coding runs from dark red to bright yellow; I’m not specifying the direction, because it’s complicated, but the contrast shows differences. In each case, the available color spectrum is mapped to the range of values.


This figure shows the differences between incidents that involved hostages or kidnap victims and those that didn’t.

This figure shows that the country in which the incident took place is mostly unrelated to other properties of the incident; in other words, attacks are similar no matter where they take place.

This analysis shows that, despite human variability, those designing terrorist incidents choose from a fairly small repertoire of possibilities. That’s not to say that there couldn’t be attacks in which some people are also taken hostage; rather that those doing the planning don’t seem to conceptualize incidents that way, so when it happens it’s more or less by accident. Perhaps some kind of Occam’s razor plays a role: planning an incident is already difficult, so there isn’t much brainpower left over for extra cleverness, and there’s probably also a perception that complexity increases risk.
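To give a flavour of how a clustering like this can be set up (a sketch under simplifying assumptions, not the analysis above; the field names below are invented and far simpler than GTD's real coding), categorical incident records can be one-hot encoded and then compared by distance, so that incidents sharing attributes land near each other:

```python
def one_hot(record, vocab):
    """Encode a dict of categorical fields as a 0/1 vector over the
    (field, value) pairs listed in vocab."""
    return [1 if record.get(f) == v else 0 for (f, v) in vocab]

def distance(u, v):
    """Hamming distance between two 0/1 vectors."""
    return sum(a != b for a, b in zip(u, v))

# Hypothetical incident records (not GTD's real field names)
incidents = [
    {"type": "kidnapping", "claimed": "yes", "targets": "one"},
    {"type": "kidnapping", "claimed": "no",  "targets": "one"},
    {"type": "bombing",    "claimed": "no",  "targets": "two"},
]
vocab = sorted({(f, r[f]) for r in incidents for f in r})
vecs = [one_hot(r, vocab) for r in incidents]

# The two hostage-taking incidents sit closer to each other
# than either does to the bombing
print(distance(vecs[0], vecs[1]) < distance(vecs[0], vecs[2]))  # → True
```

A real pipeline would then project these vectors (for example, by SVD) to get a picture like the one above; the point of the sketch is just that categorical coding plus a distance is enough for cluster structure to emerge.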

Questions are data too

In the followup investigation of the Boston Marathon bombings, we see again the problem that data analytics has with questions.

Databases are built to store data. But, as Jeff Jones has most vocally pointed out, simply keeping the data is not enough in adversarial settings. You also need to keep the questions, and treat them as part of the ongoing data. The reason is obvious once you think about it — intelligence analysts need not only to know the known facts; they also need to know that someone else has asked the same question they just asked. Questions are part of the mental model of analysts, part of their situational awareness, but current systems don’t capture this part and preserve it so that others can build on it. In other words, we don’t just need to connect the dots; we need to connect the edges!

Another part of this is that, once questions are kept, they can be re-asked automatically. This is immensely powerful. At present, an analyst can pose a question (“has X ever communicated with Y?”), get a negative answer, only for information about such a communication to arrive a microsecond later and not be noticed. In fast changing environments, this can happen frequently, but it’s implausible to expect analysts to remember and re-pose their questions at intervals, just in case.
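A minimal sketch of what keeping and re-asking questions could look like, with invented names and no claim about how any real system implements it:

```python
class QueryStore:
    """Keep analysts' questions as data and re-ask them as new
    records arrive, instead of answering once and forgetting."""

    def __init__(self):
        self.records = []
        self.queries = []   # (analyst, description, predicate)
        self.alerts = []

    def ask(self, analyst, description, predicate):
        """Answer the question now, but also keep it alive."""
        self.queries.append((analyst, description, predicate))
        return any(predicate(r) for r in self.records)

    def add_record(self, record):
        """New data re-triggers every stored question it satisfies."""
        self.records.append(record)
        for analyst, description, predicate in self.queries:
            if predicate(record):
                self.alerts.append((analyst, description, record))

store = QueryStore()
q = lambda r: r.get("from") == "X" and r.get("to") == "Y"
print(store.ask("analyst1", "has X communicated with Y?", q))  # → False
store.add_record({"from": "X", "to": "Y", "time": "09:00"})
print(len(store.alerts))  # → 1: the stored question fired
```

The negative answer a microsecond before the communication arrives is no longer the end of the story: the arriving record finds the stored question and alerts the analyst who asked it, and other analysts can see that the question has been asked before.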

We still have some way to go with the tools and techniques available for intelligence analysis.

Inspire Magazine Issue 10

The tenth issue of this al Qaeda in the Arabian Peninsula magazine is out. Continuing the textual analysis I’ve done on the earlier issues, I can conclude two things:

  1. Issue 10 wasn’t written by whoever wrote Issue 9 (nor by those who wrote the previous issues since they’re dead). In almost every respect the language resembles that of earlier issues, and is bland with respect to almost every word category. Except …
  2. The intensity of Jihadist language, which has been steadily increasing over the series, decreases sharply in Issue 10. Whoever the new editors/authors are, their hearts are not in it as much as the previous ones.
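The intensity measurement can be sketched crudely. The real analysis uses richer word categories; this toy version, with an invented three-word category and invented snippets, just computes the fraction of a document's words drawn from a category list:

```python
def category_rate(text, category_words):
    """Fraction of a document's words drawn from a word category:
    a crude per-issue intensity score."""
    words = text.lower().split()
    hits = sum(w.strip(".,") in category_words for w in words)
    return hits / len(words) if words else 0.0

# Hypothetical category wordlist and toy issue texts
category = {"struggle", "duty", "sacrifice"}
issue9  = "our duty is struggle and sacrifice for the struggle"
issue10 = "this issue covers news events and interviews at length"

print(category_rate(issue9, category) > category_rate(issue10, category))  # → True
```

Tracked across issues, a rate like this is what lets one say the intensity rose steadily and then dropped sharply; the categories themselves are the analytically loaded part.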

Understanding High-Dimensional Spaces

My new book with the title above has been published by Springer, just in time for Christmas gift giving for the data miner on your list.

The book explores how to represent high-dimensional data (which almost all data is), and how to understand the models, particularly for problems where the goal is to find the most interesting subset of the records. “Interesting”, of course, means different things in different settings; a big part of the focus is on finding outliers and anomalies.

Partly the book is a reaction to the often unwitting assumption that clouds of data can be understood as if they had a single centre — for example, much of the work on social networks.

The most important technical ideas are (a) that clusters themselves need to be understood as having a structure which provides each one with a higher-level context that is usually important to make sense of them, and (b) that the empty space between clusters also provides information that can help to understand the non-empty space.
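The empty-space idea in (b) can be illustrated in miniature. Assuming cluster centroids are already known (the coordinates below are invented), a record's distance to its nearest centroid says whether it sits inside a cluster or in the empty space between them:

```python
from math import dist  # Python 3.8+

def nearest_cluster(point, centroids):
    """Distance from a point to its nearest cluster centroid.
    Points far from every centroid lie in the empty space between
    clusters, often the most interesting records."""
    return min(dist(point, c) for c in centroids)

centroids = [(0.0, 0.0), (10.0, 0.0)]
typical = (0.5, 0.3)   # inside a cluster
between = (5.0, 4.0)   # in the empty space between clusters

print(nearest_cluster(typical, centroids) < 1.0)  # → True
print(nearest_cluster(between, centroids) > 4.0)  # → True
```

This is of course only the simplest use of empty space; the book's argument is that the shape of the empty regions, not just distance into them, carries information.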

You can buy the book here.

The Analysis Chasm

I’ve recently heard a couple of government people (in different countries) complain about the way in which intelligence analysis is conceptualized, and so how intelligence organizations are constructed. There are two big problems:

1.  “Intelligence analysts” don’t usually interact with datasets directly, but rather via “data analysts”, who aren’t considered “real” analysts. I’m told that, at least in Canada, you have to have a social science degree to be an intelligence analyst. Unsurprisingly (at least for now), people with this background don’t have much feel for big data and for what can be learned from it. Intelligence analysts tend to treat the aggregate of the datasets and the data analysts as a large black box, and use it as a form of Go Fish. In other words, intelligence analysts ask data analysts “Have we seen one of these?”; the data analysts search the datasets and the models built from them, and write a report giving the answer. The data analyst doesn’t know why the question was asked and so cannot write the more helpful report that would be possible given some knowledge of the context. Neither side is getting as much benefit from the data as they could, and it’s mostly because of a separation of roles that developed historically, but makes little sense.

2. Intelligence analysts, and many data analysts, don’t understand inductive modelling from data. It’s not that they don’t have the technical knowledge (although they usually don’t) but they don’t have the conceptual mindset to understand that data can push models to analysts: “Here’s something that’s anomalous and may be important”; “Here’s something that only occurs a few times in a dataset where all behavior should be typical and so highly repetitive”; “Here’s something that has changed since yesterday in a way that nothing else has”. Data systems that do inductive modelling don’t have to wait for an analyst to think “Maybe this is happening”. The role of an analyst changes from being the person who has to think up hypotheses, to the person who has to judge hypotheses for plausibility. The first task is something humans aren’t especially good at, and it’s something that requires imagination, which tends to disappear in a crisis or under pressure. The second task is easier, although not something we’re necessarily perfect at.
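The kind of model that pushes hypotheses at analysts, as in point 2, can be sketched very simply. This toy version, with an invented event log, just flags behaviours that occur only a few times in data that should be highly repetitive:

```python
from collections import Counter

def push_anomalies(events, max_count=2):
    """Instead of waiting for an analyst to ask, surface behaviours
    that occur only a few times in data that should be repetitive."""
    counts = Counter(events)
    return [e for e, c in counts.items() if c <= max_count]

# Hypothetical event log: badge swipes by door
log = ["door_A"] * 40 + ["door_B"] * 35 + ["server_room"] * 1
print(push_anomalies(log))  # → ['server_room']
```

No analyst had to hypothesise "maybe someone is entering the server room"; the data itself nominated the rare event for human judgement, which is exactly the role reversal described above.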

There simply is no path for inductive models from data to get to intelligence analysts in most organizations today. It’s difficult enough to get data analysts to appreciate the possibilities; getting models across the chasm, unsolicited, to intelligence analysts is (to coin a phrase) a bridge too far.

Addressing both of these problems requires a fairly revolutionary redesign of the way intelligence analysis is done, and an equally large change in the kind of education that analysts receive. And it really is a different kind of education, not just a kind of training, because inductive modelling from data seems to require a mindset change, not the supply of some missing mental information. Until such changes are made, most intelligence organizations are fighting with one and a half arms tied behind their collective backs.

Edge typing to make transitivity useful

There have been several studies over the past year that have shown that we are influenced by properties of people to whom we are not directly connected. There seems to be a pattern: if I have property X, then my immediate friends tend to be more Xy than would otherwise be expected (not surprising), but their friends whom I don’t know are also affected by my Xiness, and sometimes even their friends (which does seem surprising).

All of which is to say that transitivity in social networks is more interesting and important than it might seem intuitively. Social network sites have tried, in various ways, to exploit transitivity, usually in some form of spreading: recommending things that I like or am doing to my immediate neighbours, and suggesting that people at distance 2 might usefully become friends at distance 1.

These attempts have, I think it is fair to say, been less than successful. A big part of the reason is the failure to model links (edges) as being of different kinds, as well as different intensities. Such sites do have access to intensity data, so they can estimate a weight for edges linking people (although probably this also is a bit shaky since many forms of contact are automated, so it’s not clear how much bonding each actually represents). In particular, connections that derive from work and those that derive from leisure seem like they should be treated differently, and some of the embarrassing faux pas have resulted from e.g. trying to get people to friend their boss’s boss. But, in general, people live in many different communities, and transitivity doesn’t work well across communities. It seems hard to tell when transitivity is and is not a good thing without distinguishing different kinds of connections, and so different kinds of edges. For example, a personal relationship could be represented by a red edge, a work relationship by a blue edge, and a family relationship by a green edge. Now transitivity along paths of the same colour becomes a much more powerful, and less treacherous, idea.
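A minimal sketch of colour-typed edges with transitivity restricted to same-colour paths (node names and colours invented; a real system would also need edge weights and many more types):

```python
from collections import defaultdict, deque

class TypedGraph:
    """Graph whose edges carry a type ('colour'); transitivity is
    followed only along paths of a single colour."""

    def __init__(self):
        self.adj = defaultdict(list)   # node -> [(neighbour, colour)]

    def add_edge(self, a, b, colour):
        self.adj[a].append((b, colour))
        self.adj[b].append((a, colour))

    def reachable(self, start, colour, max_hops=2):
        """Nodes within max_hops of start using only edges of one colour."""
        seen, frontier = {start}, deque([(start, 0)])
        while frontier:
            node, hops = frontier.popleft()
            if hops == max_hops:
                continue
            for nbr, c in self.adj[node]:
                if c == colour and nbr not in seen:
                    seen.add(nbr)
                    frontier.append((nbr, hops + 1))
        seen.discard(start)
        return seen

g = TypedGraph()
g.add_edge("me", "friend", "red")              # personal
g.add_edge("friend", "their_friend", "red")
g.add_edge("me", "boss", "blue")               # work
g.add_edge("boss", "bosses_boss", "blue")

# A friend-of-a-friend is a sensible suggestion...
print("their_friend" in g.reachable("me", "red"))   # → True
# ...but a red search never crosses into the work community
print("bosses_boss" in g.reachable("me", "red"))    # → False
```

The faux pas above falls out immediately: "bosses_boss" is at distance 2, but only along a blue path, so a colour-aware recommender never proposes it as a personal friend.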

From the perspective of data analysis, there are two challenges: how to acquire the information about what kind of edge a relationship is, and how to modify analysis techniques to take edges of different kinds into account.

The colour of an edge is not easy to induce from observing the activity on that edge. For example, suppose that you have access to someone’s email and you want to work out who are their friends, and who are professional contacts. The structure of email addresses doesn’t help much because a friend’s work email might be used, and because email addresses tend to be surrogates anyway. Time of day doesn’t help much because many people send personal email at work, and many send work emails out of working hours. The content of emails might help, but many organisations have extensive in-house non-work emails (for example, Enron had many emails about fantasy football that circulated only within the company). Social network sites have an advantage because they can ask users to explain which category of “friending” a particular contact is (this could be a big win — a category of “annoying person I don’t want to offend by removing the contact” could easily become the most popular edge type). In an intelligence or law enforcement setting, where the existence of the contact is acquired by observation or interception, the problem of categorising the contact is just as difficult.

Even if the edges can be labelled to indicate their type, using this information to improve the analysis of the resulting graph is difficult, and largely unstudied (AFAIK). Most techniques use some kind of iterative approach (see here for a recent example and some references). Integrating edge types into spectral approaches would be particularly useful — volunteers anyone?

Modelling expectations to help focus

I’ve argued for, and am struggling to build, knowledge-discovery systems that can inductively decide which parts of the available data, and which emergent knowledge from it, are likely to be most ‘interesting’, so that an intelligence analyst can be guided to focus his/her limited attention there.

One important way of approaching this, which has the added advantage that it hardens the system against being systematically misled (especially a problem in adversarial settings), is to build in ways of considering what should happen. In other words, as well as the ‘main’ modelling process, there should also be companion models that constantly project what the incoming data, the main model, and the results should look like.

So I was interested to see the discussion in New Scientist of work showing that the human brain appears to do exactly this — we predict what the scenes we are looking at should look like, presumably so that we can divert resources to aspects of the scene that don’t match this expectation. Now if only we could make this computational…

The article is here.
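A toy version of that expectation-then-surprise loop is not hard to make computational. This sketch (thresholds and data invented) uses a running mean and variance, maintained with Welford's online algorithm, as the "projection", and flags observations that fall outside it:

```python
class ExpectationMonitor:
    """Maintain a running expectation of incoming values and flag
    observations that don't match it, so attention goes to surprises."""

    def __init__(self, tolerance=3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.tolerance = tolerance

    def observe(self, x):
        """Welford's online mean/variance update; returns True if x
        is surprising relative to the current expectation."""
        if self.n >= 2:
            std = (self.m2 / (self.n - 1)) ** 0.5
            surprising = abs(x - self.mean) > self.tolerance * max(std, 1e-9)
        else:
            surprising = False   # too little history to have expectations
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return surprising

m = ExpectationMonitor()
flags = [m.observe(v) for v in [10, 11, 10, 9, 10, 11, 50]]
print(flags[-1])  # → True: 50 breaks the expectation
```

Like the visual system in the article, the model spends nothing on values that match its projection and raises a signal only where the data and the expectation diverge.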