Posts Tagged 'palantir'

Estimating the significance of a factoid

You only have to mention Palantir to attract lots of traffic — oops, I did it again šŸ™‚

Those of you who’ve been following along know that I’m interested in tools that help an analyst decide how to treat a new piece of information that arrives in an analysis system from the outside world. Many analysis tools provide exactly nothing to support analysts with this task — new data arrives and is stored away in the system, but an analyst only discovers it by happening to make a query whose results include the new data.

The next level of tool allows persistent queries: an analyst can ask about some topic, and the system remembers the query. If new data appears that would have matched the query, the system notifies the analyst (think Google Alerts). This is a big step up from an analyst’s point of view. Jeff Jonas has argued that, in fact, queries should be thought of as a symmetric form of data that can be accessed explicitly as well. For example, it may be significant for an analyst that another analyst has made the same or a similar query.
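
The persistent-query idea can be sketched in a few lines. This is a toy illustration (all names and the in-memory store are hypothetical, not any particular product’s design), showing both halves: matching new arrivals against standing queries, and keeping the queries themselves as inspectable data.

```python
# A minimal sketch (all names hypothetical) of a persistent-query store:
# queries are remembered as data, and later arrivals that would have
# matched trigger a notification, Google Alerts style.

class PersistentQueryStore:
    def __init__(self):
        self.records = []
        self.queries = []  # (analyst, predicate) pairs, kept as data

    def register_query(self, analyst, predicate):
        # The query itself is stored, so it can later be inspected,
        # e.g. to notice that two analysts asked the same question.
        self.queries.append((analyst, predicate))
        return [r for r in self.records if predicate(r)]

    def add_record(self, record):
        self.records.append(record)
        # Notify every analyst whose standing query the record matches.
        return [analyst for analyst, pred in self.queries
                if pred(record)]

store = PersistentQueryStore()
store.register_query("alice", lambda r: "palantir" in r)
notified = store.add_record("new report mentioning palantir")  # ["alice"]
```

Because the queries are stored symmetrically with the records, a system like this could also answer "who else has asked about this?", which is exactly Jonas’s point.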

However, this still requires analysts to manage their set of interests quite explicitly and over potentially long periods of time. We humans are not very good at managing the state of multiple mental projects, a fact that has made David Allen a lot of money. In the simplest case, if a system tells an analyst about a new result for a query that was originally made months ago, it may take a long time to recreate the situation that led to the original query being made, and so a long time to estimate the significance of the new information.

I don’t have a silver bullet to solve this class of problems. But I do think that it’s essential that tools become more proactive so that some of the judgement of how significant a newly arrived fact is can be made automatically and computationally. Of course, this is deeply contextual and very difficult.

It does seem helpful, though, to consider the spectrum of significance that might be associated with a new factoid. Let me suggest the following spectrum:

  • Normal. Such a factoid is already fully accounted for by the analyst’s existing mental model or situational awareness. Its significance is presumptively low. Often this can be estimated fairly well using the equivalences common = commonplace = normal. In other words, if it resembles a large number of previous normal factoids, then it’s a normal factoid.
  • Anomalous. Such a factoid lies outside the normal but is ‘so close’ that it is best accounted for as a small deviation from normal. It’s the kind of factoid for which a plausible explanation is easy to come up with in a very short time frame.
  • Interesting. Such a factoid calls into question the accuracy or completeness of the existing model or situational awareness — something has been missed, or the structure of the model is not what it appeared to be.
  • Novel. Such a factoid does not resemble any that were used to build the model or situational awareness in the first place, so its significance cannot be assessed in the current framework. The model must be incomplete in a substantial way.
  • Random. Stuff happens and some factoids will be so unusual that they have nothing to say about the existing model.

This is a spectrum, so there are no natural boundaries between these categories — and yet the actions that follow do depend on which of these five categories a factoid is placed in.

What makes estimating the significance of a new factoid difficult is that significance is greatest for the middle categories and lowest for the extremal ones. Both normal and random factoids are not significant, while interesting and novel ones are the most significant. Many natural technologies, intrusion detection systems for example, take a more monotonic view: the more unusual a datum, the more significant it is judged to be. But we know several techniques for measuring significance that have the right qualitative properties, and these make it plausible that we can build systems that present analysts with new factoids along with an indication of their presumptive significance.
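
One way to get the right qualitative shape is a hump-shaped score: low for factoids close to the model, rising through the middle of the spectrum, and falling again for wildly unusual ones. Here is a minimal sketch; the scoring function and the category thresholds are entirely made up for illustration, with "distance from normal" stood in for by a z-score.

```python
# Sketch of a non-monotonic significance score (function and thresholds
# are illustrative, not from the post): low for normal and random
# factoids, highest for the interesting/novel middle of the spectrum.

import math

def significance(z):
    # z: how far the factoid lies from the model of normal behaviour,
    # in standard deviations. The score is a bump, not monotonic: it
    # rises as z leaves the normal range, then falls again once the
    # factoid is so unusual it says nothing about the model.
    return z * z * math.exp(-z / 2.0)  # peaks at z = 4

def category(z):
    # Illustrative boundaries along the spectrum from the post.
    if z < 1.0:
        return "normal"
    if z < 2.0:
        return "anomalous"
    if z < 4.0:
        return "interesting"
    if z < 8.0:
        return "novel"
    return "random"
```

Contrast this with a monotonic intrusion-detection-style score, which would keep climbing with z and flag the random extreme as the most important, exactly the wrong behaviour for an analyst’s triage.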


The Australian government white paper on counterterrorism, about which I wrote last week, sank below public consciousness without trace, after a 1-day flurry of news stories focused on the $69m to be spent on biometric strengthening of visas from certain countries. So much for the threat of homegrown terrorism!

One of the important facets of the report, and the one closest to what I do, is the emphasis on intelligence-led counterterrorism. This was described using three phases, collect, analyse, and share; all of which were claimed to be deficient, but for only two of which any remedies were suggested (as I blogged about previously). But maybe it’s worth spending some time on each of these phases.

Collect. I take it that the point of intelligence-led counterterrorism is that the normal sequence is intended to be intelligence–investigation–arrest–nonevent rather than the more typical event–investigation–arrest sequence that tends to happen with, say, crime (though even there, less and less so). (In fact, for the majority of thwarted attacks, the sequence has been whistleblower–investigation–nonevent, which raises a number of deeper questions.)

Intelligence as a lead operation means ways to get knowledge from the data without having to explicitly look for it — in other words, it has to be fundamentally inductive. But the data has to be there in the first place, so what kind of data should be collected?

There are two fundamentally different answers to this question, and so two very different kinds of data:

  1. Data collected about all of a particular class of objects, with the expectation that the records of interest will be a small, perhaps vanishingly small, fraction of the total. For example, data about border crossings, customs, quarantine, financial transactions, taxes and communication is collected by most governments for all of the relevant events, and some kind of processing is done to decide which of these events deserve further attention. Notice that most commercial data collection is like this too: shops collect data about all of their customers, airlines about all of their passengers, and so on.

     This kind of widespread data collection often raises red flags because of concerns about privacy, government power, and sometimes commercial power. The interesting thing is that groups in every country invoke moral arguments for why certain kinds of data collection are wrong and others are right (or at least justified) — but the boundary is different in different countries and at different times. In the U.S. there is deep suspicion about widespread government collection, but very little about commercial collection; in Europe it tends to be the other way around. A decent argument could be made that this battle was lost when income tax (or at least PAYE) was invented, since it meant that governments became involved in every detail of how money was made.

  2. Data collected about known “bad guys” or their associated actions. This kind of data is collected by governments using some kind of warrant-based procedure (that is, there is a need of some kind to demonstrate the potential badness of a “bad guy”) and by commercial organisations using a special-purpose investigative process. The goal in this case is either to confirm the suspected badness (search warrant, insurance investigation) or to discover other “bad guys” by association. This kind of data collection often raises fewer red flags because, in the case of governments, the process is typically quasi-judicial and so has inbuilt checks and balances; and in the commercial case, the costs of investigation act as a brake on doing it too much. (This latter cost is coming down quickly as more and more investigation can be done sitting in front of a screen, so this attitude may be changing.)

Analyse. The second phase is to take the collected data and use it inductively to build models of what it reveals. Very little of this happens today. Most analysis is of the slice-and-dice variety: given lots of data, usually of many different kinds, an analyst uses a sophisticated data-manipulation system to look at it in many different ways, and explore the connections that are implicit in it. The flagship in this area is probably Palantir (unpaid plug) which, though extremely expensive, does many interesting things on very large datasets. (See this video for a demo of how it can be used in an intelligence setting.)

The weakness with tools like this is that the analyst has to drag the knowledge out of the data, rather than having the data produce the knowledge. While we are far from doing this in a general way (ā€œComputer, tell me what I should knowā€), we are, as a research community, making some progress. It is the white paper’s blindness to this possibility, and to the current partial successes, that I see as its biggest weakness.

Share. The inability of intelligence and law enforcement organisations to share is a chronic problem, but I suspect it is often mischaracterised. Knowledge is power, and there’s a strange desire among some people to hang onto data, even if it has no value to them; but I suspect the issue is much more often pragmatic. Different law enforcement and intelligence organisations have different systems, and it’s often a nightmare to move data even within an organisation, let alone between organisations, just because of different formats, database schemas, encodings, and sheer size. This is not an easy problem to address. The temptation is often to wish for a single overarching, consistent database into which everything can be thrown. This is a poor idea for two reasons:

  • The price per unit of storage is often low (and decreasing) up to a certain size, but then takes a major jump. For example, a terabyte drive costs around $100, but a 100TB storage system costs a lot more than 100 times as much. At any given moment, there’s a sweet spot between capacity, access time, and cost; moving away from it drives up the cost.
  • Any storage mechanism makes some kinds of analysis easy and other kinds difficult; the kinds that are difficult are done less often, so things get missed. In other words, a variety of storage solutions produces a variety of analyses, which reduces the gaps.

There is a deeper issue about sharing: the knowledge present in data is often very implicit (that is, it’s hard to know what’s there until you look) so giving someone else your data may also be unintentionally giving them more than you realised. People who work with census data have had to think about this issue for a long time because census data at the level of individuals is very private, but at the collective level is (has to be) very public. This is also an important issue for businesses that build models of their customers. There are interesting technical problems here that might make a bigger difference in the end than trying to legislate increased cooperation.
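
The census point can be made concrete with a toy example. The following is a sketch (the data and names are entirely made up) of a differencing attack: two aggregate statistics that each look safe to release jointly expose an individual’s record.

```python
# Toy differencing attack (hypothetical data): two individually
# harmless aggregate releases combine to reveal one person's value.

salaries = {"alice": 50, "bob": 60, "carol": 70}

total_all = sum(salaries.values())                      # released: 180
total_without_carol = sum(v for k, v in salaries.items()
                          if k != "carol")              # released: 110

carol_salary = total_all - total_without_carol          # leaked: 70
```

This is exactly the sense in which handing over data, or even aggregates of it, can give away more than the owner realised, and why census agencies put so much machinery between individual records and published tables.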