Posts Tagged 'security'

Backdoors to encryption — 100 years of experience

The question of whether those who encrypt data, at rest or in flight, should be required to provide a master decryption key to government or law enforcement is back in the news, as it is periodically.

Many have made the obvious arguments about why this is a bad idea, and I won’t repeat them.

But let me point out that we’ve been here before, in a slightly different context. A hundred years ago, law enforcement came up against the fact that criminals knew things that could (a) be used to identify other criminals, and (b) prevent other crimes. This knowledge was inside their heads, rather than inside their cell phones.

Then, as now, it seemed obvious that law enforcement and government should be able to extract that knowledge, and interrogation with violence or torture was the result.

Eventually we reached (in Western countries, at least) an agreement that, although there could be a benefit to the knowledge in criminals’ heads, there was a point beyond which we weren’t going to go to extract it, despite its potential value.

The same principle surely applies when the knowledge is on a device rather than in a head. At some point, law enforcement must realise that not all knowledge is extractable.

(Incidentally, one of the arguments made about the use of violence and torture is that the knowledge extracted is often valueless, since the target will say anything to make it stop. It isn’t hard to see that devices could be made to use a similar strategy. They would have a PIN or password that could be used under coercion and that would appear to unlock the device, but would in fact give access only to a virtual subdevice which seemed innocuous. Especially as Customs in several countries are now demanding PINs and passwords as a condition of entry, such devices would be useful for innocent travellers as well as guilty ones — to protect commercial and diplomatic secrets, for a start.)
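A minimal sketch of such a duress scheme, in Python (purely illustrative: the credentials, volume names, and bare hashing here are invented stand-ins, and real deniable-encryption systems are far more involved):

```python
import hashlib
from typing import Optional

# Each credential maps to a different "volume". The duress PIN opens an
# innocuous decoy, so nothing about the unlock flow reveals that a real
# volume exists. (Illustrative only: a real system would use salted key
# derivation and hidden-volume encryption, not a plain hash lookup.)
VOLUMES = {
    hashlib.sha256(b"real-passphrase").hexdigest(): "real_volume",
    hashlib.sha256(b"duress-1234").hexdigest(): "decoy_volume",
}

def unlock(credential: str) -> Optional[str]:
    """Return the name of the volume this credential opens, or None."""
    return VOLUMES.get(hashlib.sha256(credential.encode()).hexdigest())
```

The key property is that the coerced party behaves identically in both cases: the device unlocks either way, and only the contents differ.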

‘AI’ performance not what it seems

As I’ve written about before, ‘AI’ tends to be misused to refer to almost any kind of data analytics or derived tool — but let’s, for the time being, go along with this definition.

When you look at the performance of these tools and systems, it’s often quite poor, but I claim we’re getting fooled by our own cognitive biases into thinking that it’s much better than it is.

Here are some examples:

  • Netflix’s recommendations for any individual user seem to overlap 90% with the ‘What’s trending’ and ‘What’s new’ categories. In other words, Netflix is recommending to you more or less what it’s recommending to everyone else. Other recommendation systems don’t do much better (see my earlier post on ‘The Sound of Music Problem’ for part of the explanation).
  • Google search results are quite good at returning, in the first few links, something relevant to the search query, but we don’t ever get to see what was missed and might have been much more relevant.
  • Google News produces what, at first glance, appear to be quite reasonable summaries of recent relevant news, but when you use it for a while you start to see how shallow its selection algorithm is — putting stale stories front and centre, and occasionally producing real howlers, weird stories from some tiny venue treated as if they were breaking and critical news.
  • Self-driving cars that perform well, but fail completely when they see certain patches on the road surface. Similarly, facial recognition systems that fail when the person is wearing a t-shirt with a particular patch printed on it.

The commonality between these examples, and many others, is that the assessment from use is, necessarily, one-sided — we get to see only the successes and not the failures. In other words (HT Donald Rumsfeld), we don’t see the unknown unknowns. As a result, we don’t really know how well these ‘AI’ systems actually perform, or whether it’s safe to deploy them.

Some systems are ‘best efforts’ (Google News) and that’s fair enough.

But many of these systems are beginning to be used in consequential ways and, for that, real testing and real public test results are needed. And not just true positives, but false positives and false negatives as well. There are two main flashpoints where this matters: (1) systems that are starting to do away with the human in the loop (self-driving cars, 737 MAX aircraft); and (2) systems where humans are likely to say or think ‘The computer (or worse, the AI) can’t be wrong’, and these are starting to include policing and security tools. Consider, for example, China’s social credit system. The fact that it gives low scores to some identified ‘trouble makers’ does not imply that everyone who gets a low score is a trouble maker — but this false implication lies behind this, and almost all, discussion of ‘AI’ systems.
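The false implication can be made concrete with Bayes’ rule. Even a very accurate flagging system, applied to a rare category, produces mostly false positives (the numbers below are invented for illustration, not taken from any real system):

```python
# Probability that someone flagged by the system really belongs to the
# flagged category, given the system's accuracy and the base rate.
def precision(sensitivity, specificity, base_rate):
    """P(actually a 'trouble maker' | flagged), by Bayes' rule."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

# A system that catches 99% of true cases and wrongly flags only 1% of
# everyone else, applied to a population where 1 in 1000 is a true case:
p = precision(0.99, 0.99, 0.001)   # roughly 0.09
```

So about nine flagged people in ten are innocent, even with a 99%-accurate system — which is exactly why a low score does not imply a trouble maker.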

China-Huawei-Canada fail

Huawei has been trying to convince the world that they are a private company with no covert relationships to the Chinese government that might compromise the security of their products and installations.

This attempt has been torpedoed by the Chinese ambassador to Canada who today threatened ‘retaliation’ if Canada joins three of the Five Eyes countries (and a number of others) in banning Huawei from provisioning 5G networks. (The U.K. hasn’t banned Huawei equipment, but BT is uninstalling it, and the unit set up jointly by Huawei and GCHQ to try to alleviate concerns about Huawei’s hardware and software has recently reported that it’s less certain about the security of these systems now than it was when the process started.)

It’s one thing for a government to act as a booster for national industries — it’s another to deploy government force directly.

China seems to have a tin ear for the way that the rest of the world does business; it can’t help but hurt them eventually.

Lessons from Wannacrypt and its cousins

Now that the dust has settled a bit, we can look at the Wannacrypt ransomware, and the other malware exploiting the same vulnerability, more objectively.

First, the reason this attack vector existed is that Microsoft, a long time ago, made a mistake in a file-sharing protocol. It was (apparently) exploited by the NSA, and then by others with less good intentions, but the vulnerability is all down to Microsoft.

There are three pools of vulnerable computers that played a role in spreading the Wannacrypt worm, as well as falling victim to it.

  1. Enterprise computers which were not being updated in a timely way because it was too complicated to maintain all of their other software systems at the same time. When Microsoft issues a patch, bad actors immediately try to reverse engineer it to work out what vulnerability it addresses. The last time I heard someone from Microsoft Security talk about this, they estimated it took about 3 days for this to happen. If you hadn’t updated in that time, you were vulnerable to an attack that the patch would have prevented. Many businesses evaluated the risk of updating in a timely way as greater than the risk of disruption because of an interaction of the patch with their running systems — but they may now have to re-evaluate that calculus!
  2. Computers running XP for perfectly rational reasons. Microsoft stopped supporting XP because they wanted people to buy new versions of their operating system (and often new hardware to be able to run it), but there are many, many people in the world for whom a computer running XP was a perfectly serviceable product, and who will continue to run it as long as their hardware keeps working. The software industry continues to get away with failing to warrant their products as fit for purpose, but it wouldn’t work in other industries. Imagine the discovery that the locks on a car stopped working after 5 years — could a manufacturer get away with claiming that the car was no longer supported? (Microsoft did, in this instance, release a patch for XP, but well after the fact.)
  3. Computers running unregistered versions of Microsoft operating systems (which therefore do not get updates). Here Microsoft is culpable for an opposite reason. People can run an unregistered version for years and years, provided they’re willing to re-install it periodically. It’s technically possible to prevent this kind of serial illegality, or at least make it much more difficult.

The analogy is with public health. When there’s a large pool of unvaccinated people, the risk to everyone increases. Microsoft’s business decisions make the pool of ‘unvaccinated’ computers much larger than it needs to be. And while this pool is out there, there will always be bad actors who can find a use for the computers it contains.
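The public-health analogy can even be made loosely quantitative, borrowing the epidemiologist’s basic reproduction number. A worm dies out when each infected machine infects fewer than one other on average, and the size of the ‘unvaccinated’ pool feeds directly into that number (all figures here are invented, purely for illustration):

```python
# The worm analogue of the epidemiological R0: expected new infections
# caused by one infected host. An outbreak is self-sustaining only when
# this exceeds 1, so shrinking the unpatched pool protects everyone.
def reproduction_number(successful_probes_per_host, fraction_vulnerable):
    """Expected new infections per infected host."""
    return successful_probes_per_host * fraction_vulnerable
```

With, say, 50 successful probes per infected machine, a 1% unpatched pool gives a dying outbreak (0.5) while a 5% pool gives a self-sustaining one (2.5).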

Advances in Social Network Analysis and Mining Conference — Sydney

This conference will be in Sydney in 2017, from 31st July to 3rd August.

As well as the main conference, there is also a workshop, FOSINT: Foundations of Open Source Intelligence, which may be of even more direct interest for readers of this blog.

Also I will be giving a tutorial on Adversarial Analytics as part of the conference.

Even more security theatre

I happened to visit a consulate to do some routine paperwork. Here’s the security process I encountered:

  1. Get identity checked from passport, details entered (laboriously) into online system.
  2. Cell phone locked away.
  3. Wanded by metal detection wand.
  4. Sent by secure elevator to another floor, to a waiting room with staff behind bullet-proof glass.

Here’s the thing: I got to carry my (unexamined) backpack with me through the whole process!

And what’s the threat from a cell phone in this context? Embarrassing pictures of the five year old posters on the wall of the waiting room?

I understand that government departments have difficulty separating serious from trivial risks, because if anything happened they would be blamed, regardless of how low-probability the risk was. But there’s no political reason not to make whatever precautions you do take actually effective against the risks they’re supposed to address.

“But I don’t have anything to hide” Part III

I haven’t been able to verify it, but Marc Goodman mentions (in an interview with Tim Ferriss) that the Mumbai terrorists searched the online records of hostages when they were deciding who to kill. Another reason not to be profligate about what you post on social media.

Government signals intelligence versus multinationals

In all of the discussion about the extent to which the U.S. NSA is collecting and analyzing data, the role of the private sector in similar analysis has been strangely neglected.

Observe, first, that none of the organizations asked to provide data to the NSA had to do anything special to do so. Verizon, the proximate example, was required to provide, for every phone call, the originating and destination numbers, the time, the duration, and the cell tower(s) involved for mobile calls — and all of this information was already being collected. Why would they collect it, if not to have it available for their own analysis? It isn’t for billing — part of the push towards envelope pricing plans was to save the cost of producing detailed bills, which was often greater than the cost of completing the calls themselves.

Second, government signals intelligence is constrained in the kind of data they are permitted to collect: traffic analysis (metadata) for everyone, but content only for foreign nationals and those specifically permitted by warrants for cause. Multinationals, on the other hand, can collect content for everyone. If you have a gmail account (I don’t), then Google not only sees all of your email traffic, but also sees and analyzes the content of every email you send and receive. If you send an email to someone with a gmail account, the content of that email is also analyzed. Of course, Google is only one of the players; many other companies have access to emails, other online communications (IM, Skype), and search histories, including which link(s) in the search results you actually follow.

A common response to these differences is something like “Well, I trust large multinationals, but I don’t trust my government”. I don’t really understand this argument; multinationals are driven primarily (perhaps only) by the need for profits. Even when they say that they will behave well, they are unable to keep that promise. A public company cannot refrain from taking actions that will produce greater profits, since its interests are the interests of its shareholders. And, however well meaning, when a company is headed for bankruptcy and one of its valuable assets is data and models about millions of people, it’s naive to believe that the value of that asset won’t be realized.

Another popular response is “Well, governments have the power of arrest, while the effect of a multinational is limited to the commercial sphere”. That’s true, but in Western democracies at least it’s hard for governments to exert their power without inviting scrutiny from the judicial system; at least there are checks and balances. If a multinational decides to exert its power, there is much less transparency and almost no mechanism for redress. For example, a search engine company can downweight my web site in results (this has already been done) and drive me out of business; an email company can lose all of my emails or pass their content to my competitors. I don’t lose my life or my freedom, but I could lose my livelihood.

A third popular response is “Well, multinationals are building models of me so that they can sell me things that are better aligned with my interests”. This is, at best, a half-truth. The reason they want a model of you is so that they can try to sell you things you might be persuaded to buy, not things that you should or want to buy. In other words, the purpose of targeted advertising is at least to get you to buy more than you otherwise would, and to buy the highest-profit-margin version of things you might actually want to buy. Your interests and the interests of advertisers are only partially aligned, even when they have built a completely accurate model of you.

Sophisticated modelling from data has its risks, and we’re still struggling to understand the tradeoffs between power and consequences and between cost and effectiveness. But, at this moment, the risks seem to me greater from multinational data analysis than from government data analysis.

Language learning as a model of radicalisation

The Canadian Prime Minister said today, in response to the arrests for the planned Via Rail attacks, and perhaps to the Boston Marathon bombings as well, that these are not a reason to “commit sociology”. I think he’s exactly right. As I said in the previous post, I’m dubious that levels of dissatisfaction with societies, or even with religions, play a major role in radicalisation — it’s a much more individual-specific process. This is why only a tiny fraction of people in exactly the same social, religious, and even family setting become radicalised.

I’m also deeply skeptical that anyone becomes radicalised via the Internet. Our survey results indicated that variations in access to the Internet, or to mass media channels with a frankly jihadist orientation, have no correlation with attitudes on radicalisation-relevant subjects or with dissatisfaction of any kind. I’m convinced that it always takes contact with a person, perhaps only one and perhaps only once, for radicalisation to happen.

Here’s where the analogy with language learning comes in. I learned French (in Australia) the same way I learned Latin (declensions, conjugations, agreement). I read French well and could speak it after a fashion. But the first time I heard French radio and then met people who actually spoke French, there was a kind of click in my brain and something changed about the way I used and learned French. I don’t think this is just autobiography; as I mentioned in the last post, learning languages via TV programs doesn’t work nearly as well as you might expect it to.

I’m fairly convinced something similar happens with radicalisation. An individual can watch the videos, talk the talk, fantasise the actions, but unless/until they make contact with someone who has actually done something, there isn’t any danger. Once this happens, of course, radicalisation can proceed very quickly indeed, which explains (I guess) the several cases where apparent changes have been very swift.

Inspire Magazine Issue 10

The tenth issue of this al Qaeda in the Arabian Peninsula magazine is out. Continuing the textual analysis I’ve done on the earlier issues, I can conclude two things:

  1. Issue 10 wasn’t written by whoever wrote Issue 9 (nor by those who wrote the previous issues, since they’re dead). In almost every respect the language resembles that of earlier issues, and is bland with respect to almost every word category. Except …
  2. The intensity of Jihadist language, which has been steadily increasing over the series, decreases sharply in Issue 10. Whoever the new editors/authors are, their hearts are not in it as much as the previous ones.
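For the curious, the flavour of word-category scoring behind conclusions like these can be sketched as follows. The lexicon here is a tiny invented stand-in, not the categories actually used in the analysis:

```python
# Score a text against a word-category lexicon: count hits, normalised
# per 1000 words so that issues of different lengths are comparable.
# (Illustrative stand-in lexicon; real category lists are much larger.)
JIHADIST_TERMS = {"jihad", "martyr", "kuffar", "mujahideen"}

def category_intensity(text, lexicon):
    """Lexicon hits per 1000 words of text."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,;:!?") in lexicon)
    return 1000.0 * hits / len(words)
```

Tracking a score like this across issues is what makes a sharp drop in intensity, against otherwise bland language, visible.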

Understanding High-Dimensional Spaces

My new book with the title above has been published by Springer, just in time for Christmas gift giving for the data miner on your list.

The book explores how to represent high-dimensional data (which almost all data is), and how to understand the models, particularly for problems where the goal is to find the most interesting subset of the records. “Interesting”, of course, means different things in different settings; a big part of the focus is on finding outliers and anomalies.

Partly the book is a reaction to the often unwitting assumption that clouds of data can be understood as if they had a single centre — for example, much of the work on social networks.

The most important technical ideas are (a) that clusters themselves need to be understood as having a structure which provides each one with a higher-level context that is usually important for making sense of them, and (b) that the empty space between clusters also provides information that can help to understand the non-empty space.

You can buy the book here.

Super Identities

I heard a talk on the UK Super Identity Project last week which stimulated some musings on this important topic.

Once upon a time, almost everyone lived in villages, and identity was not an issue — everyone you knew also knew you and many of them had done so since you were born. So identity issues hardly arose, apart from an occasional baby substitution (but note Solomon in 1 Kings 3:16-28 for an early identity issue). As rich people began to travel, new forms of identity evidence such as passports and letters of introduction were developed.

About a hundred years ago, and as the result of mass movement to cities, questions of identity became common. You can see from the detective stories of the time how easy it was to assume another identity, and how difficult it was to verify one, much as it is in cyberspace today. To deal with these issues, governments became involved as the primary definers of identity, getting in on the act with birth certificates (before that, e.g. baptismal records), and then providing a continuous record throughout life.

In parallel, there’s the development of biometric identifiers, mostly to deal with law enforcement: first the Bertillon system and then fingerprints (although, as I’ve noted here before, one of the first detective stories to include fingerprints — The Red Thumb Mark — is about how easy they are to forge).

The Super Identity project is trying to fuse a set of weak identifiers into a single identity with some reliability. Identities are important for three main reasons: (a) trust, for example so that I can assume that someone I’m interacting with online is the person I think it is; (b) monetizing, for example so that an advertiser can be sure that a customized ad is being sent to the right person; and (c) law enforcement and intelligence, for example establishing that several identities are actually the same underlying person.

There are many identifying aspects, almost all of which are bound to a particular individual in a weak way. They come in four main categories:

  1. Physical identifiers such as an address, or a place of employment.
  2. Biometrics (really a subset of the physical) such as fingerprints, iris patterns, voice and so on. These at first glance seem to be rather strongly bound to individuals, but all is not as it appears and they can often be forged in practice, if not in theory. There is an important subset of biometrics that are often forgotten, those that arise from subconscious processes; these include language use, and certain kinds of tics and habits. They are, in many ways, more reliable than more physical biometrics because they tend to be hidden from us, and so are harder to control.
  3. Online identifiers such as email addresses, social network presence, web pages, which are directly connected to individuals. Equally important are the indirect online identifiers that appear as an (often invisible) side-effect of online activity such as location.
  4. Identifiers associated with accessing the online world, that is, identifiers associated with bridging from the real world to the online world. These include IP addresses, beloved by governments despite their weakness; that weakness recently led to a police raid, complete with stun grenades, on an innocent house.

The problem with trying to fuse these weak identifying aspects into a single superidentity which can be robustly associated with an individual is this: it’s relatively difficult to avoid creating these identifying aspects, but it’s relatively easy to create more identifying aspects that can be used either to actively mislead or passively confuse the creation of the superidentity.

For example, there’s been some success in matching userids from different settings (gmail, facebook, flickr) and attributing them to the same person. But surely this can only work as long as that person makes no effort to prevent it. If I want to make it hard to match up my different forms of web presence then I can choose userids that don’t associate in a natural way — but I can also create extra bogus accounts that make the matching process much harder just from a computational point of view.

So it may be possible to create a cloud of identifying aspects, but it seems much more difficult to find the real person within that cloud, especially if they’re trying to make themselves hard to find. The Super Identity project would no doubt respond that most people aren’t trying to make themselves harder to identify. I doubt this; I think we’re moving to a world where obfuscation is going to be the only way to gain some privacy — a world in which the only way to dissociate ourselves from something we don’t want made public is to make the connection sufficiently doubtful that it cannot reliably be acted on. This might be called self-spamming.

For example, if a business decides to offer differential pricing to certain kinds of customers (which has already happened), then I want to be able to dissociate myself from the category that gets offered the higher price if I possibly can. If the business has too good a model of my identity, I may not be able to prevent them treating me the way they want to rather than the way I want them to. (This is, of course, why almost all data mining is, in the end, going to be adversarial.)

In the end, behavior is the best signal of identity because it’s hard for us to modify, partly because we don’t have conscious awareness of much of it, and partly because we don’t have conscious control even when we have awareness. No wonder behavior modelling is becoming a hot topic, particularly in the adversarial domain.

Finally — the end of the Castle Model of cybersecurity?

The Castle model is the way that cybersecurity has been done for the last 20 years. The idea is to build security that keeps bad guys out of your system — you can tell what the metaphor is by the names that are used: INTRUSION detection, fireWALL. Of course, this isn’t the whole story; people have been accustomed to having to do antivirus scans and (less likely) anti-malware scans, but the idea of perimeter defence is deeply ingrained.

We don’t even behave in the real world that way. If you owned a castle with thick walls and the drawbridge was up, you might still raise an eyebrow at a bunch of marauders wandering around inside looting and pillaging. But in the online world, we’re all too likely to let anyone who can get past the perimeter do pretty much anything they want. And, by the way, insiders are already inside the perimeter which is why they are such a large threat.

The credit card hack at Global Payments, made (finally) public last week, is a good example. First, the PCI DSS, which defines the standards for credit card processing security, only mandates that user data should be “protected” but doesn’t say how. Commentators on this incident have assumed that the data held by Global Payments was all encrypted, but there’s nothing in the requirements that says it has to be, so perhaps it wasn’t. But Global Payments clearly also didn’t have the right kind of sanity checks on exfiltration of data. Even if the hack came through an account belonging to someone who had a legitimate need to look at transactions, surely there should have been controls to limit such access to one day’s worth, or a few thousand, or something like that. Exporting 1.5 million transactions should surely have required some extra levels of authentication and the involvement of an actual person at Global Payments. But the bigger issue is that the PCI DSS doesn’t mandate any “inside the gates” security measures.

So what’s the alternative to the castle model? We are still thinking this through, but it must involve controls on who can do what inside the system (as we usually do in even moderately secure real-world settings), controls on exfiltration of data (downloading, copying to portable devices, outgoing email), and especially on the size of outgoing data, and better logging and internal observation (real-world buildings have a night watchman to limit what can be done in the quiet times).
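One such “inside the gates” control can be sketched very simply: cap how much any single account can export per day, and require escalation beyond the cap. The limit and the escalation outcome below are illustrative placeholders, not a real product’s API:

```python
from collections import defaultdict

# Sketch of an exfiltration control: track records exported per account
# per day, and force extra authentication past a cap. The cap of 5000 is
# an invented example ('one day's worth, or a few thousand').
DAILY_EXPORT_LIMIT = 5000

class ExfiltrationGuard:
    def __init__(self, limit=DAILY_EXPORT_LIMIT):
        self.limit = limit
        self.exported = defaultdict(int)   # (account, day) -> records so far

    def request_export(self, account, day, n_records):
        """Allow the export, or flag it for human sign-off past the cap."""
        if self.exported[(account, day)] + n_records > self.limit:
            return "escalate"              # extra authentication + a person
        self.exported[(account, day)] += n_records
        return "allow"
```

With a control this crude, exporting 1.5 million transactions through one account would have triggered escalation three hundred times over.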

Even the U.S. military, whose network is air-gapped from the internet, admits that penetration of their networks is so complete that it’s pointless to concentrate on defending their network’s borders and more important to focus on controlling access to the data held within these networks (BBC story).

It’s time for a change of metaphor in cybersecurity — the drawbridge is down whether we like it or not, and so we need to patrol the corridors and watch for people carrying suspiciously large bags of swag.

European Intelligence and Security Informatics conference

The program is now available here and looks impressive (note also the associated Open Source Intelligence workshop in which one of my students has a paper about our work on interestingness).

Low Hanging Fruit in Cybersecurity III

Any attempt to decide whether a particular action is “bad” or “good” requires some model of what “good” actually means. The only basis for intelligent action in almost any setting is to be able to have a plan for the expected, but also a mechanism for noticing the unexpected — to which some kind of meta-planning can be attached. This is, of course, a crucial part of how we function as humans; we don’t hang as software often does, because if we encounter the unexpected, we do something about it. (Indeed, an argument along this line has been used by J.R. Lucas to argue that the human mind is not a Turing machine.)

But most cybersecurity applications do not try (much) to build a model of what “good” or “expected” or “normal” should be like. Granted, this can be difficult; but I can’t help but think that often it’s not as difficult as it looks at first. Partly this is because of the statistical distribution that I discussed in my last post — although, on the internet, lots of things could happen, most of them are extremely unlikely. It may be too draconian to disallow them, but it seems right to be suspicious of them.

Actually, three different kinds of models of what should happen are needed. These are:

  1. A model of what “normal” input should look like. For example, for an intrusion detection system, this might be IP addresses and port numbers; for a user-behavioral system, this might be executables and times of day.
  2. A model of what “normal” transformations look like. Inputs arriving in the system lead to consequent actions. There should be a model of how these downstream actions depend on the system inputs.
  3. A model of what “normal” rates of change look like. For example, I may go to a web site in a domain I’ve never visited before; but over the course of different time periods (minutes, hours, days) the rate at which I encounter brand new web sites exhibits characteristic patterns.

An exception to the first model shows that something new is happening in the “outside” world — it’s a signal of novelty. An exception to the second model shows that the system’s model of activity is not rich enough — it’s a signal of interestingness. An exception to the third model shows that the environment is changing.

Activity that does not fit with any one of these models should not necessarily cause the actions to be refused or to sound alarms — but it does provide a hook to which a meta-level of analysis can be attached, using more sophisticated models with new possibilities that are practical only because they don’t get invoked very often.
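As an illustration of the third kind of model, here is a sketch of tracking the rate at which never-before-seen items (say, brand new web domains) appear per time window; the threshold is an invented placeholder for whatever characteristic pattern is learned from history:

```python
# Sketch of a rate-of-change model: remember everything seen so far, count
# how many items in each new window are genuinely novel, and flag windows
# whose novelty rate departs from the expected range. The fixed threshold
# stands in for a learned characteristic pattern.
class NoveltyRateModel:
    def __init__(self, max_new_per_window=10):
        self.seen = set()
        self.max_new = max_new_per_window

    def observe_window(self, items):
        """Return (number of novel items, whether the rate is anomalous)."""
        new = [x for x in items if x not in self.seen]
        self.seen.update(new)
        return len(new), len(new) > self.max_new
```

An anomaly here doesn’t refuse anything by itself; it is exactly the hook for the meta-level analysis described above.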

Again, think of the human analogy. We spend a great deal of our time running on autopilot/habit. This saves us cognitive effort for things that don’t need much. But, when anything unusual happens, we can quickly snap into a new mode where we can make different kinds of decisions as needed. This isn’t a single two-level hierarchy — in driving, for example, we typically have quite a sophisticated set of layers of attention, and move quickly to more attentive states as conditions require.

Cybersecurity systems would, it seems to me, work much more effectively if they used the combination of models of expected/normal behavior, organized in hierarchies, as their building blocks.

Low Hanging Fruit in Cybersecurity II

If cybersecurity exists to stop bad things happening in computing systems, then it seems to me that there are several implicit assumptions that underlie many approaches and techniques that might not be completely helpful. These are:

  • The distinction between “good” (or “allowable”) and “bad” is a binary distinction;
  • The decision about this distinction has to be made monolithically in a single step;
  • The distribution of likely things that could happen is uniform (flat).

Even to write them explicitly shows that they can’t quite be right, but nevertheless I suspect they exist, unexamined, in the design of many security systems.

What happens if we remove these assumptions?

If the distinction between “good” and “bad” is not discrete, then our systems instead allocate some kind of continuous risk or suspicion to actions. This creates an interesting new possibility — the decision about what to do about an action can now be decoupled from how the action is categorized. This is not even a possibility if the only distinction we recognize is binary.

From a purely technical point of view, this means that many different kinds of risk measuring algorithms can be developed and used orthogonally to decisions about what the outputs of these algorithms means. Critical boundaries can be determined after the set of risks has been calculated, and may even be derived from the distribution of such risks. For example, bad things are (almost always) rare, so a list of actions ordered by risk will normally have a bulge of “normal” actions and then a small number of anomalous actions. The boundary could be placed at the edge of the bulge.
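One robust way to place the boundary at the edge of the bulge, rather than at a preset value, is median plus a multiple of the median absolute deviation; the choice of multiplier here is illustrative:

```python
import statistics

# Derive the decision boundary from the distribution of risk scores
# itself: the bulk of 'normal' scores sits near the median, so a cutoff
# at median + k*MAD marks the edge of the bulge. k=3 is an illustrative
# choice, not a recommendation.
def risk_threshold(scores, k=3.0):
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    return med + k * mad
```

Because the median and MAD are barely affected by the rare bad actions themselves, the boundary tracks the normal bulge rather than being dragged around by outliers.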

Second, what if the decision about whether to allow an action doesn’t have to be made all at once? Then systems can have defence in depth. The first, outer, layer can assess the risk of a new action and decide whether or not to allow it. But it can be forgiving of potentially risky actions if there are further layers of categorization and defence to follow. What it can do is disallow the clearly and definitively bad things, reducing the number of potentially bad things that have to be considered at later stages.

From a technical point of view, this means that weaker but cheaper algorithms can be used on the front lines of defence, with more effective but more expensive algorithms available for later stages (where they work with less data, and so do not cost as much overall, despite being more expensive per instance).
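Here is a minimal sketch of such a layered defence. Both scoring functions and the thresholds are invented for illustration; the point is only the control flow — the cheap front line disposes of the clear cases, and only the residue pays for the expensive model.

```python
def cheap_score(action):
    # Weak but fast: a single rule-of-thumb feature (invented).
    return 0.9 if action.get("known_bad_signature") else action.get("rarity", 0.0)

def expensive_score(action):
    # Stand-in for a costlier model (behavioural analysis, sandboxing, ...).
    return 0.7 * action.get("rarity", 0.0) + 0.3 * action.get("privilege", 0.0)

def decide(action, low=0.2, high=0.8):
    r = cheap_score(action)
    if r >= high:
        return "deny"    # clearly and definitively bad: stop at layer one
    if r <= low:
        return "allow"   # clearly routine: stop at layer one
    # Middle ground: only these actions reach the expensive second layer.
    return "deny" if expensive_score(action) >= 0.5 else "allow"

decide({"known_bad_signature": True})       # denied at the first layer
decide({"rarity": 0.1})                     # allowed at the first layer
decide({"rarity": 0.6, "privilege": 0.9})   # escalated to the second layer
```

Because most actions are routine, the expensive function runs on only a small fraction of the traffic, which is why it can afford to be expensive.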

Third, what if our defence took into account that the landscape of expected actions is not uniform, so that low-probability events were automatically treated as more suspicious? For example, spam filtering does lots of clever things, but it doesn’t build a model of the sources of my email and flag emails from countries that I’ve never, ever received email from as inherently more likely to be spam. (Yes, I know that sender addresses can be spoofed.)
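A rarity-based prior of this kind is easy to sketch. The sender-country counts below are invented; the idea is just that mail from a source I have rarely or never seen starts out more suspicious.

```python
from collections import Counter

# Hypothetical history of sender countries for one person's inbox.
country_counts = Counter({"CA": 420, "US": 350, "UK": 90, "DE": 12})
total = sum(country_counts.values())

def spam_prior(country, smoothing=1):
    """Rarity-based prior suspicion: smoothed so that a never-seen
    country gets a small but nonzero probability, hence suspicion < 1."""
    seen = country_counts.get(country, 0)
    p = (seen + smoothing) / (total + smoothing * (len(country_counts) + 1))
    return 1 - p   # rarer source -> higher prior suspicion

spam_prior("CA")   # low suspicion: most of my mail comes from here
spam_prior("XX")   # never seen before: suspicion close to 1
```

This prior would then be combined with the content-based evidence the filter already computes, not used on its own.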

This idea has been used in behavioral profiling of computer activity, and it sort of works. But it needs to be combined with the ideas above, so that actions can be rated along a continuum: routine (allow); unusual, but still not that unusual (allow, but perhaps with a user question, or at least logged for occasional inspection); very unusual (require the user to explicitly allow); bizarre (disallow). Windows has a weak version of this, which hasn’t been well accepted by users, but it flags only one thing (program start) and it doesn’t build a model of each user’s typical behavior.
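The graded continuum might look like this; the thresholds and tier names are purely illustrative, and in a real system they would be derived from the observed distribution of risks rather than fixed.

```python
def action_policy(risk):
    """Map a continuous risk score onto a graded response, rather than
    a single binary allow/deny decision. Thresholds are invented."""
    if risk < 0.3:
        return "allow"            # routine
    if risk < 0.6:
        return "allow-and-log"    # unusual, but not that unusual
    if risk < 0.85:
        return "ask-user"         # very unusual: explicit approval needed
    return "deny"                 # bizarre
```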

For example, the set of IP addresses with which my computer interacts is quite large, and hard to represent by some kind of convex structure, so intrusion detection doesn’t work very well if it depends on wrapping/categorising those IP addresses that are OK, and blocking traffic from those that are not. And usually the set of OK IP addresses is not derived from those I interact with, but encoded in some set of rules that apply to many computers. But if instead I built a model of the IP addresses I interact with, allowing older ones to get stale and disappear, and then looked at new IP addresses and allowed them if they resembled (tricky) those I already interact with, and asked me about the others, then this might work better than current approaches. An IP address is a hierarchical structure, with a possible country followed by the top octet, and so on, so I can discriminate quite finely about what it might mean. Even a web server that is theoretically visible to every other IP address could still benefit from handling unlikely source IP addresses differently.
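A toy version of this idea, assuming IPv4 and an invented decay and scoring scheme, might look like the following: observed addresses are recorded at every level of the prefix hierarchy, old observations decay away, and a new address is scored by the longest familiar prefix it shares with past traffic.

```python
import ipaddress
from collections import defaultdict

class IPModel:
    """Toy model of 'addresses I interact with' (IPv4 only, for simplicity)."""

    def __init__(self, decay=0.99):
        self.weights = defaultdict(float)   # prefix value -> weight
        self.decay = decay

    def prefixes(self, ip):
        # The hierarchical structure of an address: /8, /16, /24, /32.
        packed = int(ipaddress.ip_address(ip))
        return [packed >> shift for shift in (24, 16, 8, 0)]

    def observe(self, ip):
        for key in self.weights:
            self.weights[key] *= self.decay   # stale prefixes fade away
        for p in self.prefixes(ip):
            self.weights[p] += 1.0

    def familiarity(self, ip):
        # Score by the deepest prefix level that is still well supported.
        score = 0.0
        for depth, p in enumerate(self.prefixes(ip), start=1):
            if self.weights.get(p, 0.0) > 0.5:
                score = depth / 4
        return score   # 0.0 = completely novel, 1.0 = seen exactly

model = IPModel()
for _ in range(3):
    model.observe("192.168.1.10")
model.familiarity("192.168.1.10")   # exact address seen before: 1.0
model.familiarity("192.168.1.99")   # same /24: fairly familiar
model.familiarity("10.0.0.1")       # nothing in common: novel
```

A policy like the graded one above could then allow familiar addresses, log or query moderately familiar ones, and treat completely novel ones with the most suspicion.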

OK, maybe this isn’t exactly low hanging fruit, but the ideas are straightforward and (IMHO) should be built into the design of more robust systems.

Call for Papers: Link Analysis, Counterterrorism and Security

The Call for the LACTS 2009 workshop is now available here.

The workshop takes place at the SIAM Data Mining Conference and brings together academics, practitioners, law enforcement, and intelligence people to talk about leading-edge work in the area of adversarial data analysis.

The workshop is intended primarily for early-stage work. The proceedings are published electronically, but authors may retain copyright.

The deadline for submissions is probably late December, but perhaps a little later (still being decided).

Knowledge Discovery for Counterterrorism and Law Enforcement

My new book, Knowledge Discovery for Counterterrorism and Law Enforcement, is out. You can buy a copy from:

The publisher’s website


(Despite what these pages say, the book is available or will be within a day or two.)

As the holiday season approaches, perhaps you have a relative who’s in law enforcement, or intelligence, or security? What could be better than a book! Or maybe you’d like to buy one for yourself.

(A portion of the price of this book goes to support deserving university faculty.)

Using private documents to improve search in public documents

I’m back from the SIAM International Conference on Data Mining, and the 5th Workshop on Link Analysis, Counterterrorism, and Security, which I helped to organize. The workshop papers are now online, along with some open problems that were discussed at the end of the workshop.

I’ll post about some ideas that were tossed around at the workshop and conference in the next few days.

Let me start by talking about the work of Roger Bradford. Information retrieval starts from a document-term matrix, which is typically extremely large and sparse, and then reduces the dimensionality by using an SVD, a process sometimes called latent semantic indexing. This creates a representation space for both documents and terms. A query is treated as if it were a kind of short document and mapped into this representation space. Its near neighbours are then the documents retrieved in response to the query; and they can be sorted in increasing distance from the query point as well.
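As a concrete (toy) illustration of this pipeline — the documents, the query, and the choice of k are all invented — both documents and query are projected onto the top-k right singular vectors of the document-term matrix and compared by cosine similarity:

```python
import numpy as np

# Four toy documents: two about network security, two about something else.
docs = [
    "attack on the network firewall",
    "firewall rules and network security",
    "cats and dogs at the park",
    "dogs playing in the park",
]

# Document-term matrix: rows are documents, columns are terms.
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(t) for t in vocab] for d in docs], dtype=float)

# Truncated SVD: keep the k largest singular values (latent semantic indexing).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Vk = Vt[:k, :].T        # terms x k: basis of the representation space
doc_vecs = A @ Vk       # each document's coordinates in that space

def retrieve(query):
    """Treat the query as a short document, map it into the space,
    and rank documents by decreasing cosine similarity."""
    q = np.array([query.split().count(t) for t in vocab], dtype=float)
    qv = q @ Vk
    sims = doc_vecs @ qv / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(qv) + 1e-12
    )
    return np.argsort(-sims)   # nearest documents first

order = retrieve("network security")
```

With k well below the number of terms, documents can score as near neighbours of a query even when they share few exact words — which is the point of the reduced space, and what makes Bradford’s public-plus-private construction interesting.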

Bradford showed that the original space can be built using a set of private documents together with a set of public documents, and that the resulting representation space allows better retrieval performance than a space derived from the public documents alone, without allowing the properties of the private documents to be inferred.

In fact, the set of private documents can be diluted by mixing them with other documents before the process starts, making it even more difficult to work backwards to the private documents.

This process has a number of applications that he talks about in the paper. One of the most interesting is that it allows different organizations, for example allies, to share sensitive information without compromising it to each other — and still get the benefits of the relationships in the full set of documents.