Posts Tagged 'connecting the dots'



Questionoids as well as factoids

In the previous post I talked about the problem of “connecting the dots” and how this innocuous-sounding phrase conceals problems that we don’t yet know how to solve — some we don’t even know how to attack.

There’s another side to the story, and that’s the questions that are applied to the collection of factoids. These are important for two reasons.

1.  Asking a question for which a particular factoid is the answer should perhaps have some impact on the importance/interestingness of that factoid. This isn’t a magic bullet (because unknown unknowns might also be important, but won’t be looked for). But it’s a start.

(Google is presumably using some variant of this idea to weight the importance of web pages. PageRank is based on explicitly created links as indicators of importance, but few of us create explicit links any more because it's easier to go via Google, so some other indicators must surely come into play. I haven't seen anything public about this, though.)

2.  Asking a question for which there is no matching factoid does not mean that the question should be discarded (as it is in e.g. database systems). Rather, such unanswered questions should become data themselves (and so should the answered ones). New factoids should be considered against the aggregate of these questions to see if they match; in other words, all queries should be persistent (a minimal sketch of this idea follows the list). That way, if someone asks about X and information about X is not known, the appearance of a factoid about X should cause a response to be generated, long after the analyst originally posed the question. Even if there was already a factoid about X, and so a response to the query, a new factoid about X will automatically generate a supplementary response.
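Here is a minimal sketch of what persistent queries might look like. The class names, the subject-equality matching rule, and the alerting step are all illustrative assumptions, not a description of any real system.

```python
# A minimal sketch of persistent queries: questions are stored as data and
# every newly arriving factoid is re-checked against all of them.
# The subject-equality matching rule is a placeholder assumption.
from dataclasses import dataclass, field

@dataclass
class Questionoid:
    analyst: str
    subject: str
    responses: list = field(default_factory=list)

class PersistentQueryStore:
    def __init__(self):
        self.questionoids: list[Questionoid] = []
        self.factoids: list[dict] = []

    def ask(self, analyst: str, subject: str) -> list[dict]:
        q = Questionoid(analyst, subject)
        self.questionoids.append(q)          # the question itself becomes data
        hits = [f for f in self.factoids if f["subject"] == subject]
        q.responses.extend(hits)             # answer from factoids already held, if any
        return hits

    def add_factoid(self, factoid: dict):
        self.factoids.append(factoid)
        for q in self.questionoids:          # every stored question is re-checked
            if factoid["subject"] == q.subject:
                q.responses.append(factoid)
                print(f"Alert {q.analyst}: new factoid about {q.subject}: {factoid}")

store = PersistentQueryStore()
store.ask("analyst-1", "X")                                            # no answer yet, but the query persists
store.add_factoid({"subject": "X", "detail": "purchase of interest"})  # triggers a response long after the question was asked
```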

In this view, which was first and most clearly enunciated by Jeff Jonas as part of the NORA and EAS systems, there are two forms of data: factoids and questionoids. When a factoid and a questionoid “match”, the pair causes a response to the outside world. But both kinds are worthy of meta-analysis, and the results of this analysis can be used to change the way the opposite kind of data is weighted.

The question data is also interesting in its own right. For example, an analyst may be interested to know that someone else has asked the same question as they did, even if it was asked months ago.

Connecting the dots is hard

In the aftermath of the Christmas Day attempted in-flight bombing, the issue of whether intelligence agencies should have been able to “connect the dots” beforehand has once again been heavily discussed.

Putting together disparate pieces of information to discover a pattern of interest is much more difficult than it looks, but the reason is subtle.

First, and obviously, it’s much easier to find some pieces of information that fit into a pattern when you already know the pattern, so connecting the dots after an incident always looks easy. This is just another way of saying that hindsight is 20/20.

What about the problem of putting together the pieces before an incident? Let’s suppose, for simplicity, that the pieces of information are simple factoids: this person did something, said something, or bought something that might be suspicious or interesting. We could give each of these factoids a weight indicating how important we suppose it to be, perhaps based on its inherent unusualness or its connection to a perceived risk, and perhaps also based on the reliability associated with it.

Even this simple first step is not straightforward, because the weight depends to some extent on our perception of plausible modes of attack: buying the chemical components of an explosive seems like it should get a high weight, but many other actions have ambiguous weights depending on what we (implicitly, beforehand) think is plausible or likely. Buying Jello might indicate an interest in growing bacteria, or just a taste for a cheap dessert.
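As a toy illustration, a weight might combine perceived risk, unusualness, and source reliability. The categories, scores, and multiplicative combination rule here are all invented assumptions, not anyone's actual scoring scheme.

```python
# A toy weighting of a single factoid by (assumed) perceived risk of the action
# category, its unusualness, and the reliability of the source.
# All categories, scores, and the combination rule are illustrative assumptions.
PERCEIVED_RISK = {"explosive precursor": 0.9, "flight booking": 0.4, "grocery": 0.05}

def factoid_weight(category: str, unusualness: float, reliability: float) -> float:
    """Combine perceived risk, unusualness (0..1), and reliability (0..1)."""
    risk = PERCEIVED_RISK.get(category, 0.1)   # ambiguous actions default to a low risk
    return risk * unusualness * reliability

# Buying Jello: a bacterial growth medium, or just a cheap dessert?
print(factoid_weight("grocery", unusualness=0.2, reliability=0.9))             # ~0.009
print(factoid_weight("explosive precursor", unusualness=0.8, reliability=0.7)) # ~0.50
```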

And if assigning weights to individual factoids is difficult, the difficulty is compounded by the sheer number of such factoids that exist. I don’t have any hard information, but from public statements we could estimate that perhaps 10,000 potential terrorists are being tracked around the world; on any given day, the number of factoids generated by their actions, communications, and web traffic could easily be hundreds of times greater, putting the total on the order of a million factoids per day.

So, it’s no surprise that individual factoids get underweighted when they first enter intelligence systems. The net effect of the failure to detect the recent attack is that all factoids will be given more weight, which, in relative terms, has no effect at all (except to keep already overstretched intelligence officers busier).

But this is the easy part. The connecting of these factoid dots is much, much harder.

First, the existence of a connection between two factoids can change (perhaps dramatically) the weight associated with both of them. So, theoretically at least, the potential association between each pair of factoids should be explored. In complexity terms, the number of comparisons is quadratic in the number of factoids: if there are 100 factoids, there are roughly 100×100 possible connections. Calculations whose complexity is quadratic in the size of their inputs are just on the boundary of the practically doable: possible for small numbers of inputs, but taking too long for larger ones. For 1,000,000 factoids per day, the number of pairwise connections to check is about 1,000,000,000,000, just doable on special-purpose hardware at a central site. In practice, I suspect that only a smallish subset of these connections is actually considered in real time, so there is now the possibility of failing to connect two dots just because there are a lot of dots, and therefore a lot more possible connections.
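A small sketch of the blow-up, and of the kind of shortcut that explains why only a subset of connections gets checked in practice. The shared-entity blocking key and the data format are assumptions made purely for illustration.

```python
# The quadratic blow-up of pairwise comparison, and a common shortcut:
# only compare factoids that share a cheap "blocking key" (here, an entity name),
# at the risk of never connecting two dots that fall in different buckets.
from itertools import combinations
from collections import defaultdict
from math import comb

factoids = [{"entity": f"person{i % 1000}", "id": i} for i in range(10_000)]

# Exhaustive comparison: n*(n-1)/2 pairs, i.e. quadratic in n.
print(comb(len(factoids), 2))               # 49,995,000 candidate connections

# Blocking: group by entity, compare only within groups.
buckets = defaultdict(list)
for f in factoids:
    buckets[f["entity"]].append(f)
candidate_pairs = [p for b in buckets.values() for p in combinations(b, 2)]
print(len(candidate_pairs))                 # 45,000 candidate connections, roughly 1000x fewer
```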

It gets worse. When a new connection changes the weights of the factoids that it connects, these changed weights now affect other factoids to which they are connected; and these in turn propagate a changed weight to the factoids to which they are connected; and so on. In other words, discovering a new connection between two factoids can alter the perceived weight of many, even all, of the other factoids. This means, among other things, that it’s hard to work with just part of the graph, because a change made somewhere else can radically change the meaning of the part.
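Here is a sketch of how a single changed weight ripples through the graph. The damped-averaging update rule and the damping factor are assumptions chosen only to illustrate the propagation, not any particular system's algorithm.

```python
# How a changed weight propagates: each factoid's weight is repeatedly nudged
# toward the average weight of its neighbours. The update rule and damping
# factor are illustrative assumptions.
def propagate(weights: dict, edges: list, damping: float = 0.3, iters: int = 20) -> dict:
    neighbours = {n: [] for n in weights}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    for _ in range(iters):
        weights = {
            n: (1 - damping) * w
               + damping * (sum(weights[m] for m in neighbours[n]) / max(len(neighbours[n]), 1))
            for n, w in weights.items()
        }
    return weights

chain = [("A", "B"), ("B", "C"), ("C", "D")]
before = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.1}
after = propagate({**before, "A": 0.9}, chain)   # a new connection raises A's weight
print(after)   # the change has rippled out to C and D, two and three hops away
```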

It gets even worse. The connections themselves, and aggregates of the connections, can become meta-factoids. For example, the fact that person A communicates with person B via phone but person B communicates with person A by email is potentially a useful factoid, revealing something about the communication infrastructure each has access to, their attitudes to it (perhaps their perception about security risks of different technologies), and even that they are trying to communicate covertly (since most people communicate symmetrically). The connections between factoids create a web or graph whose structure at many different levels can reveal relationships among the factoids that change their individual significance.
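A meta-factoid like the asymmetric-channel example above might be derived directly from connection data, as in the sketch below. The record format, the channel labels, and the report-each-pair-once rule are assumptions for illustration.

```python
# Deriving a meta-factoid from connections: flag pairs of people whose
# communication channels are asymmetric (A phones B, but B emails A).
# The record format and channel labels are illustrative assumptions.
from collections import defaultdict

contacts = [
    ("A", "B", "phone"),
    ("B", "A", "email"),
    ("C", "D", "email"),
    ("D", "C", "email"),
]

channels = defaultdict(set)
for sender, receiver, channel in contacts:
    channels[(sender, receiver)].add(channel)

meta_factoids = []
for (a, b), forward in channels.items():
    backward = channels.get((b, a), set())
    if backward and forward != backward and a < b:   # report each unordered pair once
        meta_factoids.append((a, b, forward, backward))

print(meta_factoids)   # [('A', 'B', {'phone'}, {'email'})]: possibly deliberate asymmetry
```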

The bottom line is that it’s extremely hard to find sets of dots with interesting connections. Part of this is the sheer complexity of the data structure and the algorithms that would be required. But we actually don’t know much about what the useful algorithms are. These webs or graphs have many emergent properties. If we understood them better, there would surely be ways to focus attention on only those parts of the data that have the greatest potential to lead to interesting factoids and connections. Network science is emerging as a new area of research where exactly these kinds of questions are being explored, but it is in its infancy, and we know only the most rudimentary properties of such structures: the common appearance of power laws, preferential attachment as a construction mechanism, some measures of the importance of nodes within a graph, and so on. But the big theories remain elusive.
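To make those rudimentary properties concrete, here is a small illustration using networkx (the choice of library is an assumption): a graph grown by preferential attachment, its heavy-tailed degree distribution, and one simple measure of node importance.

```python
# Rudimentary graph measures: preferential attachment as a construction
# mechanism, a heavy-tailed (power-law-like) degree distribution, and degree
# centrality as one simple measure of node importance.
import networkx as nx
from collections import Counter

G = nx.barabasi_albert_graph(n=10_000, m=2, seed=42)   # preferential attachment

degrees = [d for _, d in G.degree()]
print(Counter(degrees).most_common(5))   # many low-degree nodes, a handful of large hubs

centrality = nx.degree_centrality(G)
hubs = sorted(centrality, key=centrality.get, reverse=True)[:5]
print(hubs)                              # the few nodes most worth an analyst's attention
```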