Posts Tagged 'privacy'

Privacy and social media

I was at a meeting last week at which one of the speakers said this (roughly paraphrased): 15 years ago, the amount of data visible on a typical Facebook user’s profile page would have required a warrant to collect (and the warrant would have been difficult to get). 100 years ago this amount of data probably couldn’t have been collected, at least not at reasonable cost.

I think he’s probably right. Empirical data, rather than academic theorizing, has consistently shown that people are willing to go public with an amazing amount of data about themselves. This decision may be pragmatic, because being visible brings benefits that outweigh the risks; it may reflect ignorance of what those risks are; or it may reflect an inability to understand, in a visceral way, just how public something posted on the internet is and how long it will last. As far as I know, there’s been little concrete research on which of these explanations is right.

This massive release of personal data is changing the discussion of what privacy is and what its role in society should be. This is especially true in places like the U.S. where the relevant law is expressed in terms of what the social expectation of privacy is — so that the boundary between public and private moves “automatically” as society changes.

But it’s worth reminding ourselves that little more than 100 years ago, nobody had any privacy, in the sense that everyone in your village or town knew everything about you, including your whole life history and that of your parents and grandparents and so on. Until about 100 years ago, almost nobody was ever alone, either inside or outside. The whole idea of privacy is an invention of urbanisation where, for the first time in history, someone other than a hermit could act anonymously. It’s also an invention of secularization since, in most religious traditions, God is conceived of as omniscient so that no human could act anonymously or invisibly in a deep sense.

Using private documents to improve search in public documents

I’m back from the SIAM International Conference on Data Mining, and the 5th Workshop on Link Analysis, Counterterrorism, and Security, which I helped to organize. The workshop papers are now online, along with some open problems that were discussed at the end of the workshop.

I’ll post about some ideas that were tossed around at the workshop and conference in the next few days.

Let me start by talking about the work of Roger Bradford. Information retrieval starts from a document-term matrix, which is typically extremely large and sparse, and then reduces the dimensionality by using an SVD, a process sometimes called latent semantic indexing. This creates a representation space for both documents and terms. A query is treated as if it were a kind of short document and mapped into this representation space. Its near neighbours are then the documents retrieved in response to the query, and they can be sorted by increasing distance from the query point so that the closest matches come first.
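To make these retrieval steps concrete, here is a minimal sketch in Python with NumPy. The vocabulary, the toy counts, and the function names `build_lsi`, `fold_in`, and `rank` are all made up for illustration, and cosine distance is just one common choice for measuring nearness; none of this is taken from the workshop paper.

```python
import numpy as np

def build_lsi(doc_term, k):
    """Truncated SVD of a documents-by-terms count matrix."""
    U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
    doc_coords = U[:, :k] * s[:k]   # each row: one document in the k-dim space
    term_basis = Vt[:k, :]          # each row: one latent dimension over terms
    return doc_coords, term_basis

def fold_in(query_counts, term_basis):
    """Treat the query as a short document and map it into the same space."""
    return query_counts @ term_basis.T

def rank(query_coords, doc_coords):
    """Return document indices sorted by increasing cosine distance."""
    norms = np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(query_coords)
    cosine = (doc_coords @ query_coords) / np.maximum(norms, 1e-12)
    return np.argsort(1.0 - cosine)

# Hypothetical vocabulary: privacy, data, warrant, search, matrix, svd
doc_term = np.array([
    [2, 1, 1, 0, 0, 0],   # doc 0
    [1, 2, 0, 0, 0, 0],   # doc 1
    [0, 0, 0, 1, 2, 1],   # doc 2
    [0, 0, 0, 1, 1, 0],   # doc 3 (does not contain "svd")
], dtype=float)

doc_coords, term_basis = build_lsi(doc_term, k=2)
query = np.array([0, 0, 0, 0, 0, 1], dtype=float)   # the single term "svd"
ranked = rank(fold_in(query, term_basis), doc_coords)
print(ranked)
```

In this toy corpus, document 3 never contains the term "svd", yet it should rank alongside document 2 because it shares "search" and "matrix" with it. That is the kind of latent relationship the reduced space is meant to capture.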

Bradford showed that the original space can be built using a set of private documents and a set of public documents, and that the resulting representation space allows better retrieval performance than the space derived from the public documents, without allowing the properties of the private documents to be inferred.

In fact, the set of private documents can be diluted by mixing them with other documents before the process starts, making it even more difficult to work backwards to the private documents.
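As a rough illustration of the general idea only (this is not Bradford's actual procedure, and it does not demonstrate the privacy claim), the sketch below fits the representation space on public, private, and diluting documents together, then indexes only the public ones. The function name `shared_term_basis` and all the toy matrices are assumptions made up for the example.

```python
import numpy as np

def shared_term_basis(public, private, decoys, k):
    """Fit the top-k term basis on public, private, and decoy rows together."""
    combined = np.vstack([public, private, decoys])
    _, _, Vt = np.linalg.svd(combined, full_matrices=False)
    return Vt[:k, :]

# Hypothetical toy count matrices (rows = documents, columns = terms).
public = np.array([[2, 1, 0, 0],
                   [0, 0, 1, 2]], dtype=float)
private = np.array([[1, 1, 1, 0],
                    [0, 1, 1, 1]], dtype=float)
decoys = np.array([[1, 0, 0, 1]], dtype=float)   # extra documents mixed in to dilute

basis = shared_term_basis(public, private, decoys, k=2)

# Only the public documents are projected and kept in the search index;
# the private and decoy rows influence the basis but are not stored.
public_index = public @ basis.T
print(public_index)
```

The design point is that the private documents shape the term relationships in `basis` without their rows appearing in the index; how much can still be inferred from the basis itself is exactly the question the paper addresses, and this sketch makes no claim about it.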

This process has a number of applications that he talks about in the paper. One of the most interesting is that it allows different organizations, for example allies, to share sensitive information without compromising it to each other — and still get the benefits of the relationships in the full set of documents.

Knowledge discovery — good or bad?

Most people have some awareness that computer algorithms can be used to extract useful knowledge from large amounts of data. This is the basis of customer relationship management, which is used by many businesses to evaluate (and perhaps improve) the quality of their interactions with their customers, both individuals and other businesses. This way of extracting knowledge is called knowledge discovery or data mining.

Most people have some intuitive idea of how this might work — after all humans are extremely good at extracting knowledge from certain kinds of data themselves. However, people tend to jump quite quickly to one of two diametrically opposite assumptions about how knowledge discovery works.

The first is a dystopian view, in which knowledge extraction technology can be used to learn everything about individuals, from their social gaffes to their deepest thoughts. With this kind of power, governments will be unable to resist and will use knowledge discovery as a tool for control, in the style imagined in 1984. A variation on this theme is that knowledge discovery only looks effective and so will seduce governments and others into spending vast amounts of money and collecting huge datasets without any payback.

The second is a utopian view — knowledge extraction technology will make every interaction as efficient as possible, and will prevent all of the bad things in the world from happening.

The truth, of course, is somewhere between these two extremes. There are many powerful things that knowledge discovery can do, some of them non-obvious; but this requires careful thought about the process, and, potentially, considerable cost. We are a long way from using knowledge discovery to improve the collection of library fines.

There are serious issues around the intrusiveness of data collection for knowledge discovery. Many of these issues are less difficult and more manageable than they appear on the surface. The question of whether knowledge discovery is good or bad is more nuanced than almost all of the discussion about it would suggest. Stay tuned.


