
Using private documents to improve search in public documents

I’m back from the SIAM International Conference on Data Mining, and the 5th Workshop on Link Analysis, Counterterrorism, and Security, which I helped to organize. The workshop papers are now online, along with some open problems that were discussed at the end of the workshop.

I’ll post about some ideas that were tossed around at the workshop and conference in the next few days.

Let me start by talking about the work of Roger Bradford. Information retrieval starts from a document-term matrix, which is typically extremely large and sparse, and then reduces the dimensionality by using a truncated SVD, a process sometimes called latent semantic indexing. This creates a representation space for both documents and terms. A query is treated as if it were a kind of short document and mapped into this representation space. Its near neighbours are then the documents retrieved in response to the query, and they can be sorted by increasing distance from the query point so that the closest matches come first.
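To make the pipeline concrete, here is a minimal sketch of latent semantic indexing in Python with NumPy. It is an illustration only, not Bradford's implementation: the toy documents, the function names, the raw word counts (no weighting), and the choice of k = 2 are all assumptions, and it ranks by cosine similarity, a common stand-in for distance in this setting.

```python
import numpy as np

def build_vocabulary(docs):
    """Map each distinct lowercase word to a column index."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return {w: j for j, w in enumerate(vocab)}

def count_vector(text, index):
    """Raw term counts for one text; words outside the vocabulary are ignored."""
    v = np.zeros(len(index))
    for w in text.lower().split():
        if w in index:
            v[index[w]] += 1.0
    return v

def lsi_fit(docs, k):
    """Return the vocabulary, the term projection V_k, and k-dim document vectors."""
    index = build_vocabulary(docs)
    X = np.array([count_vector(d, index) for d in docs])  # documents x terms
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T                # terms x k
    return index, V_k, X @ V_k    # X @ V_k equals U_k scaled by the singular values

def rank(query, index, V_k, doc_vecs):
    """Document indices ordered from most to least similar to the query."""
    q = count_vector(query, index) @ V_k  # treat the query as a short document
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q)
    sims = (doc_vecs @ q) / np.maximum(norms, 1e-12)  # cosine similarity
    return np.argsort(-sims)

# Hypothetical toy corpus: two documents about cats, two about markets.
docs = [
    "cat sat mat",
    "cat chased mouse",
    "stock market fell",
    "market prices rose stock",
]
index, V_k, doc_vecs = lsi_fit(docs, k=2)
order = rank("cat mouse", index, V_k, doc_vecs)
print(order)
```

In this toy corpus the two cat documents come out ahead of the two market documents for the query "cat mouse". A real system would use a sparse matrix, term weighting, and a few hundred dimensions rather than two.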

Bradford showed that the representation space can be built using a set of private documents together with a set of public documents, and that the resulting space allows better retrieval performance than a space derived from the public documents alone, without allowing the properties of the private documents to be inferred.

In fact, the set of private documents can be diluted by mixing them with other documents before the process starts, making it even more difficult to work backwards to the private documents.
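The data flow can be sketched as follows, again in Python with NumPy. This is an assumption-laden illustration, not the procedure from Bradford's paper: the three toy collections, the helper names, and k = 2 are all hypothetical, and the sketch makes no claim about how hard it is to work backwards to the private documents. It shows only that the space is fitted on the mixed collection while only the public documents are projected into it for retrieval.

```python
import numpy as np

def count_matrix(docs, index):
    """Documents x terms matrix of raw counts over a fixed vocabulary."""
    X = np.zeros((len(docs), len(index)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            if w in index:
                X[i, index[w]] += 1.0
    return X

def fit_term_projection(docs, k):
    """Fit a k-dimensional space on docs; return the vocabulary and V_k."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = count_matrix(docs, index)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return index, Vt[:k].T

# Hypothetical collections; the filler documents dilute the private ones.
public_docs = ["car engine repair", "automobile dealer sales"]
private_docs = ["car automobile fleet report", "fleet engine sales"]
filler_docs = ["weather report rain", "rain delays repair"]

# Fit the space on the mixed collection...
training_docs = public_docs + private_docs + filler_docs
index, V_k = fit_term_projection(training_docs, k=2)

# ...but project only the public documents into it for retrieval.
public_vecs = count_matrix(public_docs, index) @ V_k
print(public_vecs.shape)
```

Only the vectors for the public documents are kept for searching; the private and filler documents influence the shape of the space but are not themselves indexed.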

This process has a number of applications that he talks about in the paper. One of the most interesting is that it allows different organizations, for example allies, to share sensitive information without compromising it to each other — and still get the benefits of the relationships in the full set of documents.

Knowledge discovery — good or bad?

Most people have some awareness that computer algorithms can be used to extract useful knowledge from large amounts of data. This is the basis of customer relationship management, which is used by many businesses to evaluate, and ideally improve, the quality of their interactions with their customers, both individuals and other businesses. This way of extracting knowledge is called knowledge discovery or data mining.

Most people have some intuitive idea of how this might work; after all, humans are extremely good at extracting knowledge from certain kinds of data themselves. However, people tend to jump quite quickly to one of two diametrically opposite assumptions about how knowledge discovery works.

The first is a dystopian view: knowledge extraction technology can be used to learn everything about individuals, from their social gaffes to their deepest thoughts. With this kind of power available, governments will be unable to resist the temptation and will use knowledge discovery as a tool for control, in the style imagined in 1984. A variation on this theme is that knowledge discovery only looks effective, and so will seduce governments and others into spending vast amounts of money and collecting huge datasets without any payback.

The second is a utopian view — knowledge extraction technology will make every interaction as efficient as possible, and will prevent all of the bad things in the world from happening.

The truth, of course, is somewhere between these two extremes. There are many powerful things that knowledge discovery can do, some of them non-obvious; but this requires careful thought about the process, and, potentially, considerable cost. We are a long way from using knowledge discovery to improve the collection of library fines.

There are serious issues around the intrusiveness of data collection for knowledge discovery. Many of these issues are less difficult and more manageable than they appear on the surface. The question of whether knowledge discovery is good or bad is more nuanced than almost all of the discussion about it would suggest. Stay tuned.