Pull from data versus push to analyst

One of the most striking things about the discussion of the NSA data collection that Snowden has made more widely known is the extent to which the paradigm for its use is database oriented. Both the media and, more surprisingly, the senior administrators talk only about using the data as a repository: “if we find a cell phone in Afghanistan we can look to see which numbers in the US it has been calling and who those numbers in turn call” has been the canonical justification. In other words, the model is: collect the data and then have analysts query it as needed.

The essence of data mining/knowledge discovery is exactly the opposite: allow the data to actively and inductively generate models, each with an associated quality score, and use analysts to determine which of these models are truly plausible and then useful. In other words, rather than having analysts create models in their heads and then use queries to see if they are plausible (a “pull” model), algorithms generate models inductively and present them to analysts (a “push” model). Since getting analysts to creatively think of reasonable models is difficult (and suffers from the “failure of imagination” problem), the inductive approach is both cheaper and more effective.
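To make the “push” idea concrete, here is a minimal sketch of inductive model generation. It is not a description of any real system: the function name `induce_rules`, the record fields, and the use of simple one-to-one rules scored by confidence are all assumptions chosen for illustration. The code enumerates every rule of the form "field A has value a, therefore field B has value b" that the data supports, attaches a quality score to each, and returns them ranked, so the analyst's job becomes judging the candidates rather than inventing them.

```python
from collections import Counter
from itertools import permutations


def induce_rules(records, min_support=2):
    """Enumerate rules (field_a=value_a) -> (field_b=value_b) and score each.

    Hypothetical helper: records are dicts with hashable values.
    Returns (confidence, support, lhs, rhs) tuples, best first.
    """
    antecedent = Counter()  # how often each (field, value) pair occurs
    joint = Counter()       # how often each ordered pair of items co-occurs
    for rec in records:
        items = list(rec.items())
        antecedent.update(items)
        joint.update(permutations(items, 2))

    rules = []
    for (lhs, rhs), support in joint.items():
        if support >= min_support:
            # Confidence: fraction of records matching lhs that also match rhs.
            confidence = support / antecedent[lhs]
            rules.append((confidence, support, lhs, rhs))

    # Highest confidence first, ties broken by higher support.
    rules.sort(key=lambda r: (-r[0], -r[1]))
    return rules


if __name__ == "__main__":
    # Made-up records, purely for illustration.
    records = [
        {"carrier": "A", "plan": "prepaid"},
        {"carrier": "A", "plan": "prepaid"},
        {"carrier": "B", "plan": "contract"},
        {"carrier": "A", "plan": "contract"},
    ]
    for confidence, support, lhs, rhs in induce_rules(records):
        print(f"{lhs} -> {rhs}  confidence={confidence:.2f} support={support}")
```

The pull equivalent would be an analyst guessing that prepaid plans go with carrier A and writing a query to check it. Here nobody has to guess: the rule surfaces because the data supports it.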

For example, given the collection of metadata about which phone numbers call which others, it’s possible to build systems that produce results of the form: here’s a set of phone numbers whose calling patterns are unlike any others (in the whole 500-million-node graph of phones). Such a calling pattern might not represent something bad, but it’s usually worth a look. The phone companies themselves do some of this kind of analysis, for example to detect phones that are really business lines but are claiming to be residential and, in the days when long distance was expensive, to detect the same scammers moving across different phone numbers.
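A toy version of "calling patterns unlike any others" might look like the sketch below. Everything in it is an assumption made for illustration: the three features per phone (distinct numbers called, distinct callers, calls made), the median-based scoring, and the function names. A system working on a graph of hundreds of millions of nodes would need far richer features and a scalable method; this only shows the shape of the idea, which is that the ranking is produced from the data without anyone specifying in advance what an odd pattern looks like.

```python
from collections import defaultdict
from statistics import median


def calling_features(calls):
    """Map each phone to (distinct numbers called, distinct callers, calls made).

    `calls` is an iterable of (caller, callee) pairs; the feature choice is
    an illustrative assumption, not taken from any real system.
    """
    callees = defaultdict(set)
    callers = defaultdict(set)
    made = defaultdict(int)
    phones = set()
    for src, dst in calls:
        callees[src].add(dst)
        callers[dst].add(src)
        made[src] += 1
        phones.update((src, dst))
    return {p: (len(callees[p]), len(callers[p]), made[p]) for p in phones}


def anomaly_scores(features):
    """Score each phone by its summed robust deviation from the median."""
    if not features:
        return {}
    phones = list(features)
    n_dims = len(features[phones[0]])
    scores = dict.fromkeys(phones, 0.0)
    for d in range(n_dims):
        column = [features[p][d] for p in phones]
        med = median(column)
        # Median absolute deviation; fall back to 1.0 when the column is flat.
        mad = median(abs(x - med) for x in column) or 1.0
        for p in phones:
            scores[p] += abs(features[p][d] - med) / mad
    return scores


def push_to_analyst(calls, top_k=3):
    """Return the top_k phones whose calling pattern is least like the rest."""
    scores = anomaly_scores(calling_features(calls))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    # Made-up data: 20 phones that each call two neighbours, plus one
    # phone ("X") that calls all twenty of them once.
    calls = [(f"p{i}", f"p{(i + k) % 20}") for i in range(20) for k in (1, 2)]
    calls += [("X", f"p{i}") for i in range(20)]
    print(push_to_analyst(calls, top_k=1))
```

As in the text, a high score only means "worth a look": the output is a shortlist for an analyst, not a verdict.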

I would hope that inductive model building is being used on the collected data, and that the higher-ups in the NSA either don’t really understand it or are being cagey. But I’ve talked to a lot of people in government who collect large amounts of data but are completely stuck in the database model, and have no inkling of inductive modelling.
