Collect-Analyse-Share

The Australian government white paper on counterterrorism, about which I wrote last week, sank below public consciousness without trace, after a 1-day flurry of news stories focused on the $69m to be spent on biometric strengthening of visas from certain countries. So much for the threat of homegrown terrorism!

One of the important facets of the report, and the one closest to what I do, is the emphasis on intelligence-led counterterrorism. This was described as having three phases: collect, analyse, and share. All three were claimed to be deficient, but remedies were suggested for only two of them (as I blogged about previously). But maybe it’s worth spending some time on each of these phases.

Collect. I take it that the point of intelligence-led counterterrorism is that the normal sequence is intended to be intelligence–investigation–arrest–nonevent rather than the more typical event–investigation–arrest sequence that tends to happen with, e.g., crime, although less and less so. (In fact, for the majority of thwarted attacks, the sequence has been whistleblower–investigation–nonevent, which raises a number of deeper questions.)

If intelligence is to lead, there have to be ways to get knowledge from the data without having to look for it explicitly; in other words, the analysis has to be fundamentally inductive. But the data has to be there in the first place, so what kind of data should be collected?

There are two fundamentally different answers to this question, and so two very different kinds of data:

  1. Data collected about all of a particular class of objects, with the expectation that the records of interest will be a small, perhaps a vanishingly small, fraction of the total. For example, data about border crossing, customs, quarantine, financial transactions, taxes and communication is collected by most governments for all of the relevant events, and some kind of processing is done to decide which of these events deserve further attention. Notice that most commercial data collection is like this too: shops collect data about all of their customers, airlines about all of their passengers, and so on.

     This kind of widespread data collection often raises red flags because of concerns about privacy, government power, and sometimes commercial power. The interesting thing is that groups in every country invoke moral arguments for why certain kinds of data collection are wrong and others are right (or at least justified), but the boundary is different in different countries and at different times. In the U.S. there is deep suspicion about widespread government collection, but very little about commercial collection; in Europe it tends to be the other way around. A decent argument could be made that this battle was lost when income tax (or at least PAYE) was invented, since it meant that governments became involved in every detail of how money was made.

  2. Data collected about known “bad guys” or their associated actions. This kind of data is collected by governments using some kind of warrant-based procedure (that is, there is a need of some kind to demonstrate the potential badness of a “bad guy”) and by commercial organisations using a special-purpose investigative process. The goal in this case is either to confirm the suspected badness (search warrant, insurance investigation) or to discover other “bad guys” by association. This kind of data collection often raises fewer red flags because, in the case of governments, the process is typically quasi-judicial and so has inbuilt checks and balances; and in the commercial case, the costs of investigation act as a brake on doing it too much. (This latter cost is coming down quickly as more and more investigation can be done sitting in front of a screen, so this attitude may be changing.)
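
To make "discovering other bad guys by association" concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the contact pairs, the names, and the function `associates` are all hypothetical, and real link analysis is far messier than walking a small undirected graph.

```python
from collections import defaultdict

def associates(contacts, known, hops=1):
    """Return everyone within `hops` contact links of a person in `known`,
    excluding the members of `known` themselves."""
    graph = defaultdict(set)
    for a, b in contacts:              # treat each contact as undirected
        graph[a].add(b)
        graph[b].add(a)
    frontier, seen = set(known), set(known)
    for _ in range(hops):
        # Step one link outwards, ignoring anyone already reached.
        frontier = {n for p in frontier for n in graph[p]} - seen
        seen |= frontier
    return seen - set(known)

# Hypothetical contact records (calls, payments, shared addresses, ...).
contacts = [("ann", "bob"), ("bob", "cy"), ("cy", "dee"), ("eve", "fay")]
print(sorted(associates(contacts, {"ann"})))          # ['bob']
print(sorted(associates(contacts, {"ann"}, hops=2)))  # ['bob', 'cy']
```

Even in this toy version, each extra hop widens the net quickly, which is one reason the checks and brakes on this kind of collection matter.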

Analyse. The second phase is to take the collected data and use it inductively to build models of what it reveals. Very little of this happens today. Most analysis is of the slice-and-dice variety: given lots of data, usually of many different kinds, an analyst uses a sophisticated data-manipulation system to look at it in many different ways, and explore the connections that are implicit in it. The flagship in this area is probably Palantir (unpaid plug) which, though extremely expensive, does many interesting things on very large datasets. (See this video for a demo of how it can be used in an intelligence setting.)

The weakness with tools like this is that the analyst has to drag the knowledge out of the data, rather than having the data produce the knowledge. While we are far from doing this in a general way (“Computer, tell me what I should know”), we are, as a research community, making some progress. The biggest weakness of the white paper, as I see it, is that this possibility, and the partial successes already achieved, are invisible in it.
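
As a very small illustration of the data "producing the knowledge", the sketch below flags unusual records without anyone specifying in advance what unusual looks like. It uses the modified z-score based on the median absolute deviation, with the conventional constant 0.6745 and cutoff 3.5; the function name `flag_outliers` and the sample amounts are my own assumptions, and this is a toy, not a description of what any agency or product does.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`.

    The score is 0.6745 * |x - median| / MAD, where MAD is the median
    absolute deviation from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical transaction amounts; only the last one stands out.
amounts = [100, 102, 98, 101, 99, 103, 97, 5000]
print(flag_outliers(amounts))  # [7]
```

The analyst never asked for "transactions above some amount"; the threshold emerges from the data itself, which is the inductive flavour, in miniature.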

Share. The inability of intelligence and law enforcement organisations to share is a chronic problem, but I suspect it is often mischaracterised. Knowledge is power, and there’s a strange desire among some people to hang onto data, even if it has no value to them; but I suspect the issue is much more often pragmatic. Different law enforcement and intelligence organisations have different systems, and it’s often a nightmare to move data even within an organisation, let alone between organisations, just because of different formats, database schemas, encodings, and sheer size. This is not an easy problem to address. The temptation is often to wish for a single overarching, consistent database into which everything can be thrown. This is a poor idea for two reasons:

  • The price per unit of storage is often low (and decreasing) up to a certain size, but then takes a major jump. For example, a terabyte drive costs around $100, but a 100TB drive costs a lot more than 100 times as much. At any given moment, there’s a sweet spot between capacity, access time, and cost; moving away from it drives up the cost.
  • Any storage mechanism makes some kinds of analysis easy and other kinds difficult; the kinds that are difficult are done less often, so things get missed. In other words, variety in storage solutions goes with variety in the analysis carried out, and that variety reduces the gaps.

There is a deeper issue about sharing: the knowledge present in data is often very implicit (that is, it’s hard to know what’s there until you look) so giving someone else your data may also be unintentionally giving them more than you realised. People who work with census data have had to think about this issue for a long time because census data at the level of individuals is very private, but at the collective level is (has to be) very public. This is also an important issue for businesses that build models of their customers. There are interesting technical problems here that might make a bigger difference in the end than trying to legislate increased cooperation.
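
One of the simplest of these techniques, familiar from census publishing, is to withhold any table cell with too few people in it. The sketch below is a minimal version under my own assumptions: the function `publish_counts`, the field names, and the threshold of 5 are all hypothetical choices for illustration.

```python
from collections import Counter

def publish_counts(records, keys, min_cell=5):
    """Count records for each combination of `keys`, replacing any count
    smaller than `min_cell` with None so that small groups are not exposed."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {cell: (n if n >= min_cell else None)
            for cell, n in counts.items()}

# Hypothetical individual-level records.
records = [{"area": "north", "job": "teacher"}] * 6
records += [{"area": "north", "job": "judge"}]

print(publish_counts(records, ["area", "job"]))
# {('north', 'teacher'): 6, ('north', 'judge'): None}
```

Suppression on its own is not enough: if the total for an area is also published, a withheld cell can sometimes be recovered by subtraction. That is exactly the sense in which shared data can give away more than was intended.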
