
Pull from data versus push to analyst

One of the most striking things about the discussion of the NSA data collection that Snowden has made more widely known is the extent to which the paradigm for its use is database-oriented. Both the media and, more surprisingly, the senior administrators talk only about using the data as a repository: “if we find a cell phone in Afghanistan we can look to see which numbers in the US it has been calling and who those numbers in turn call” has been the canonical justification. In other words, the model is: collect the data and then have analysts query it as needed.

The essence of data mining/knowledge discovery is exactly the opposite: allow the data to actively and inductively generate models, each with an associated quality score, and use analysts to determine which of these models are truly plausible and then useful. In other words, rather than having analysts create models in their heads and then use queries to see if they are plausible (a “pull” model), algorithms generate models inductively and present them to analysts (a “push” model). Since getting analysts to creatively think of reasonable models is difficult (and suffers from the “failure of imagination” problem), the inductive approach is both cheaper and more effective.

For example, given the collection of metadata about which phone numbers call which others, it’s possible to build systems that produce results of the form: here’s a set of phone numbers whose calling patterns are unlike any others (in the whole 500 million node graph of phones). Such a calling pattern might not represent something bad, but it’s usually worth a look. The phone companies themselves do some of this kind of analysis, for example to detect phones that are really business lines but are claiming to be residential and, in the days when long distance was expensive, to detect the same scammers moving across different phone numbers.
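The push model described above can be sketched very roughly. The following toy example (all numbers, features, and records are invented for illustration, and real systems would work on graphs with hundreds of millions of nodes) builds a small feature vector for each phone number from call metadata and ranks numbers by how far they sit from the crowd; an analyst would then look only at the top of the ranking:

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical call-detail records: (caller, callee, hour_of_day).
calls = [
    ("A", "B", 10), ("A", "C", 11), ("B", "A", 14),
    ("C", "B", 15), ("C", "A", 16),
    # "X" calls many distinct numbers, all in the middle of the night.
    ("X", "B", 2), ("X", "C", 3), ("X", "A", 2),
    ("X", "D", 4), ("X", "E", 3), ("X", "F", 2),
]

def features(calls):
    """Per-number features: total calls, distinct callees, night-call fraction."""
    out_calls = defaultdict(list)
    for caller, callee, hour in calls:
        out_calls[caller].append((callee, hour))
    feats = {}
    for num, recs in out_calls.items():
        total = len(recs)
        distinct = len({c for c, _ in recs})
        night = sum(1 for _, h in recs if h < 6) / total
        feats[num] = (total, distinct, night)
    return feats

def anomaly_scores(feats):
    """Sum of per-dimension z-scores: how far each number is from typical behaviour."""
    dims = list(zip(*feats.values()))
    stats = [(mean(d), stdev(d) or 1.0) for d in dims]
    return {
        num: sum(abs(v - m) / s for v, (m, s) in zip(vec, stats))
        for num, vec in feats.items()
    }

scores = anomaly_scores(features(calls))
ranked = sorted(scores, key=scores.get, reverse=True)
# The algorithm pushes the ranking to the analyst; the analyst inspects the top few.
```

The point is the division of labour: the algorithm proposes unusual patterns with a score, and human judgement is spent only on deciding which proposals are plausible and useful.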

I would hope that inductive model building is being used on the collected data, and that the higher-ups in the NSA either don’t really understand this or are being cagey. But I’ve talked to many people in government who collect data at scale and are completely stuck in the database model, with no inkling of inductive modelling.

It’s not secret if it’s been in the papers

Everything (except for a few small factoids) that Snowden has revealed publicly so far also appeared in the May 10th 2006 USA Today front-page article, so much of the breast-beating of the past two weeks has had elements of farce associated with it.

And based on what’s come out so far, the US would have some trouble convicting Snowden of more than low-level charges of improper handling of data, since someone with a security clearance is not prevented from saying things that are already in the public domain. Obviously a trial would also be something of an embarrassment. Perhaps that’s why the US pursuit of Snowden has been somewhat halfhearted.

He may, of course, have taken other material which is more damaging. Even here, though, it’s hard to see what this could be. The media has been full of “Now our enemies (Russians, Chinese, al Qaeda) know that we intercept their signals”. But, of course, they already knew, not least because of the USA Today article. Reuters put out an article explaining how jihadists were adapting their technology now that they know about this US capability. Absolute rubbish! The only people who might not have known were low-level amateurs, and even then they’d have to be not very bright or rather disconnected from the internet. So knowledge of the existence of these programs does not aid the enemy.

What about targeting details? The US military testified before Congress last year that they worked on the assumption that their military networks (air gapped from the internet) were compromised; and the subtext wasn’t that they wished they had the skills to do the same to the military networks of other countries. Lists of compromised IP addresses are not especially valuable, since enemies assume that all IP addresses might have been compromised. In other words, the enemy are not going to look at this kind of data and say “Shoot, they got into that system” because they will already have assumed that they had. (Of course, despite efforts to be professional, there’s always a difference between “We assume this system has been compromised” and “We know this system has been compromised”.)

Details of technologies used might be of some interest. Other countries will certainly already have this information (that’s what their intelligence services are for) but terrorist groups might not. On the other hand, the technical possibilities are fairly obvious — for example, there was a recent paper showing that the content of encrypted Skype traffic could be inferred in some detail.

What might be more interesting to enemies is details of timelines and policies, for example how quickly is something interesting likely to be noticed and how quickly would it flow up the chain of command for action to be taken. This kind of information is hard to infer from the technical layout of the system — but, for that reason, it’s probably something Snowden didn’t know much about.

Government signals intelligence versus multinationals

In all of the discussion about the extent to which the U.S. NSA is collecting and analyzing data, the role of the private sector in similar analysis has been strangely neglected.

Observe, first, that none of the organizations asked to provide data to the NSA had to do anything special to do so. Verizon, the proximate example, was required to provide, for every phone call, the originating and destination numbers, the time, the duration, and, for mobile calls, the cell tower(s) involved — and all of this information was already being collected. Why would they collect it, if not to have it available for their own analysis? It isn’t for billing: part of the push toward envelope pricing plans was to save the cost of producing detailed bills, which was often greater than the cost of completing the calls themselves.
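For concreteness, a call-detail record carrying just the metadata fields listed above might look like the following sketch (the field names are mine, not any carrier’s actual schema):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CallRecord:
    """One call-detail record: the metadata a carrier already collects
    per call. Illustrative only; not Verizon's actual schema."""
    originating_number: str
    destination_number: str
    start_time: datetime
    duration_seconds: int
    cell_towers: tuple[str, ...]  # empty for landline calls

rec = CallRecord("555-0100", "555-0199",
                 datetime(2013, 6, 1, 12, 0), 120, ("tower-17",))
```

Note that nothing here is call content: it is pure traffic data, which is exactly why handing it over required no new collection machinery.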

Second, government signals intelligence is constrained in the kinds of data it is permitted to collect: traffic analysis (metadata) for everyone, but content only for foreign nationals and for those specifically permitted by warrants for cause. Multinationals, on the other hand, can collect content for everyone. If you have a Gmail account (I don’t), then Google not only sees all of your email traffic, but also sees and analyzes the content of every email you send and receive. If you send an email to someone with a Gmail account, the content of that email is also analyzed. Of course, Google is only one of the players; many other companies have access to emails, other online communications (IM, Skype), and search histories, including which link(s) in the search results you actually follow.

A common response to these differences is something like “Well, I trust large multinationals, but I don’t trust my government”. I don’t really understand this argument; multinationals are driven primarily (perhaps only) by the need for profits. Even when they say that they will behave well, they are unable to keep this promise: a public company cannot refrain from taking actions that will produce greater profits, since its interests are the interests of its shareholders. And, however well-meaning, when a company is headed for bankruptcy and one of its valuable assets is data and models about millions of people, it’s naive to believe that the value of that asset won’t be realized.

Another popular response is “Well, governments have the power of arrest, while the effect of a multinational is limited to the commercial sphere”. That’s true, but in Western democracies at least it’s hard for governments to exert their power without inviting scrutiny from the judicial system. At least there are checks and balances. If a multinational decides to exert its power, there is much less transparency and almost no mechanism for redress. For example, a search engine company can downweight my web site in results (this has already been done) and drive me out of business; an email company can lose all of my emails or pass their content to my competitors. I don’t lose my life or my freedom, but I could lose my livelihood.

A third popular response is “Well, multinationals are building models of me so that they can sell me things that are better aligned with my interests”. This is, at best, a half-truth. The reason they want a model of you is so that they can try to sell you things you might be persuaded to buy, not things that you should or want to buy. In other words, the purpose of targeted advertising is to get you to buy more than you otherwise would, and to buy the highest-profit-margin version of things you might actually want to buy. Your interests and the interests of advertisers are only partially aligned, even when they have built a completely accurate model of you.

Sophisticated modelling from data has its risks, and we’re still struggling to understand the tradeoffs between power and consequences and between cost and effectiveness. But, at this moment, the risks seem to me to be greater from multinational data analysis than from government data analysis.