Finding things you don’t already know

One of the big problems with the tools available on the web is that they are all convergent — in other words, they help you find out more about things you already know something about. Search engines are the most obvious example. Almost nobody looks beyond the first one or two results returned, so there’s almost no opportunity to be surprised by something that was returned in response to a query. Blogs and tags look, at first glance, as if they’re a little better; but after a while you come to realize that an awful lot of content is recycled among different blogs, and genuinely coming across something new and unexpected is a rarity.

We started working, some years ago, on a divergent information system called Athens. It’s designed to show you things that you don’t already know about, but which you are well-positioned to understand. So you get to find out new things, but not just random new things — things that should make some sense to you.

In a nutshell, the way it works is as follows:

  • You provide some “search terms” that indicate an area that you already know about.
  • To reduce the dependence of the system on your particular choices of initial search terms, Athens uses them as search terms, fetches a large set of pages, extracts their content, and treats this as a definition of the “area you already know about”.
  • Athens then searches one level out from this initial area using a combination of your original search terms and new terms discovered by analyzing the initial area. A large set of pages is retrieved and clustered. The content of these new pages is assumed to consist of things you already know about, because they are so closely related to the initial subject area.
  • Athens then repeats the process using combinations of search terms from the new clusters and from the topics of the initial area. Again, a large set of pages is retrieved and clustered.
  • These clusters are presented as the results of the search. They are two levels away from the initial search, and so are likely to be both novel and related to the initial area of knowledge.
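
The steps above can be sketched in a few lines. This is only a toy illustration, not the Athens implementation: the corpus, the `search` function, and the `new_terms` helper are all invented stand-ins, term discovery is reduced to simple word counts, and the clustering step is left out entirely.

```python
from collections import Counter

# Toy corpus standing in for the web; every document here is invented.
CORPUS = {
    "d1": "jazz saxophone improvisation harmony",
    "d2": "jazz trumpet harmony swing",
    "d3": "harmony acoustics resonance tuning",
    "d4": "swing dance rhythm tempo",
    "d5": "acoustics architecture concert hall design",
    "d6": "rhythm neuroscience timing perception",
    "d7": "gardening soil compost tomatoes",
}

def search(terms):
    """Hypothetical search engine: ids of documents containing any of the terms."""
    terms = set(terms)
    return {doc_id for doc_id, text in CORPUS.items() if terms & set(text.split())}

def new_terms(doc_ids, known_terms, limit=10):
    """Most frequent words in the given documents that are not already known."""
    counts = Counter(
        word
        for doc_id in sorted(doc_ids)
        for word in CORPUS[doc_id].split()
        if word not in known_terms
    )
    return [word for word, _ in counts.most_common(limit)]

def athens_sketch(seed_terms, limit=10):
    """Return documents two levels out from the area defined by the seed terms."""
    known = set(seed_terms)
    level0 = search(seed_terms)                          # the area you already know
    terms1 = new_terms(level0, known, limit)             # terms found in that area
    level1 = search(list(seed_terms) + terms1) - level0  # one level out
    known.update(terms1)
    terms2 = new_terms(level1, known, limit)             # terms found one level out
    level2 = search(terms2 + terms1) - level0 - level1   # two levels out
    return level2

print(athens_sketch(["jazz"]))
```

Starting from “jazz”, the sketch ends up at the documents on concert hall acoustics and the neuroscience of rhythm, and never reaches the unrelated gardening page. The real system works with fetched web pages and clusters at each level, which is where most of the hours go.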

This works well, but there are some intricate problems along the way.

The first is that people actually have trouble getting their minds around the idea of novel knowledge. They keep being sucked into expectations derived from search engines such as Google and Yahoo — they expect, unconsciously, to see things that look familiar. And when they don’t, they want to see and be reassured about the connections between the new ideas and the old.

The second problem is that it’s hard to present the content of a cluster of novel information in a way that makes sense, given that the user doesn’t know yet what it is about. This goes to show how much we fill in the results of ordinary search with background and contextual information without realizing that we are doing it.

The other thing that surprises people is how long the process takes. A search engine query takes less than a second. An Athens query can take 8-10 hours!

Athens is useful for novel knowledge discovery in a number of different ways, which I’ll talk about in the next post.

What can be learned from text IV

The fourth property that can be learned from text is the intention of the author. This actually covers a wide range of related ideas:

  • Why did the author write the text?
  • What does the text reveal about the author’s worldview and assumptions?
  • What effect on the minds of readers is the author trying to achieve?
  • What is the author trying to get the readers to do?
  • Are there secondary audiences that the author is trying to affect as well?

There is considerable disagreement about whether these questions make sense, depending on one’s view of how humans model the world and how, or if, we influence each other.

In one view, popular in media studies, worldviews and assumptions come in a small set of flavors into which most (all) media stories can be fitted — for example, conflict or human interest. Others would consider a larger set of possibilities. Intention then includes actions like convincing someone else of your worldview, or merging the worldviews of two different groups so that they can work together (think anti-globalisation and animal rights, for example).

In another view, associated more with sociology, assumptions are much more fine-grained and capture the ways in which we interpret what we see in the world around us. These assumptions are communicated between people in tacit ways that are hard to make operational.

There’s a kind of middle ground which is much more pragmatic. In this view, intentions are signalled much more by verbs. An intention looks something like: a subject, an object, a verb, and perhaps an adverb (although these terms might describe something much more complicated than just a syntactic structure). Such structures can (clumsily at the moment) be extracted from text, and the ones that are most relevant examined to try and extract the underlying intention, or at least to classify it by importance, and perhaps target. For example, the verb “rise up” has no meaning other than an incitement to some kind of negative action, so finding such intentions in text could be helpful in a law-enforcement or intelligence setting.

What can be learned from text III

Another property that can be learned from text is the author’s attitude to whatever the text is about. This is called, variously, sentiment analysis or appraisal theory. For obvious reasons, it has always been interesting to advertisers and marketers.

In its simplest form, it just analyzes text for associations of adjectives with the nouns of interest, for example films or people. This could be as simple as seeing whether the adjective “good” or “bad” appears near the noun(s) in question. It is not too difficult to extend this to other sets of adjectives that can be considered positive or negative: “the movie was exciting” (good), or “the movie was boring” (bad).

However, this process is not quite as easy as it looks. First of all, it’s hard in languages like English to be sure which adjective goes with which noun — proximity in the sentence is often used, but this is not very robust: “Although parts of the movie were good, overall it was bad” is not a positive comment about the movie.
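
Here is a minimal sketch of the proximity approach, and of how it goes wrong. The word lists, the function name `proximity_sentiment`, and the window size are all assumptions made for illustration; they are not taken from any particular sentiment tool.

```python
import re

POSITIVE = {"good", "exciting"}   # illustrative word lists, not a real lexicon
NEGATIVE = {"bad", "boring"}

def proximity_sentiment(text, noun, window=3):
    """Add +1/-1 for each opinion adjective within `window` tokens of a mention of `noun`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        if token != noun:
            continue
        nearby = tokens[max(0, i - window): i + window + 1]
        score += sum(word in POSITIVE for word in nearby)
        score -= sum(word in NEGATIVE for word in nearby)
    return score

print(proximity_sentiment("The movie was exciting", "movie"))
print(proximity_sentiment("The movie was boring", "movie"))
print(proximity_sentiment(
    "Although parts of the movie were good, overall it was bad", "movie"))
```

The first two sentences score +1 and -1, as you would hope. The third also scores +1, because “good” falls inside the window around “movie” and “bad” does not, even though the sentence as a whole is negative.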

Second, authors often use devices such as irony and sarcasm which look, syntactically, as if they are giving one opinion, but are actually giving the opposite opinion. Humans figure this out using deep background knowledge about the situation and about human mental life, so it’s difficult for an algorithm to mimic this level of understanding.

Third, texts often comment about the parts of an object as well as the whole object, and it becomes difficult to decide which adjectives go with which parts.

There are three levels of algorithmic analysis used for this problem:

  1. Using simple sets of opinion adjectives (and maybe other words) and trying to associate them to the nouns of interest using proximity, perhaps with a little extra sophistication, trying to pick out dependent clauses etc.
  2. Parsing the text more deeply and using natural language analysis techniques to associate opinion words with the nouns of interest.
  3. Using systemic functional linguistics approaches, which treat language generation as a goal-driven task by an individual in a societal setting, as well as a technology.

These levels are arranged in increasing order of sophistication, and also of complexity. However, even the best algorithms perform only at the 80% or so level, and that’s only capturing relatively unsophisticated judgements.

There are obvious applications to sentiment analysis in adversarial situations: trying to decide whether a terrorist group pronouncement or a threat represents a genuine opinion by the author or some form of propaganda; and who the propaganda might be aimed at.

What can be learned from text II

Today let me talk about the internal state channel and what can be learned from it.

First, let me point out that this channel is even more driven by subconscious processes, so we have very little control over it, even if we know how it works. This makes it very revealing.

Some of the properties that can be inferred from the internal state channel are:

  • personality. This is useful in many adversarial situations because you can’t usually get an adversary to sit down and take an MMPI test. Several ways to categorize personality from word usage have been developed, although they are obviously somewhat limited.
  • status of each participant in a conversation. In general, lower status participants tend to use first-person singular pronouns at higher rates, which gives clues about how each participant regards him/herself with respect to the others involved.
  • health. The health of an individual going forward for several months can be predicted by flexibility in pronoun use. This is a bit different from the other categories because it doesn’t rely on a particular signature of word frequencies, but on the ability of each individual to vary his/her word usage widely over time. In other words, unhealthy people maintain a single perspective on the world, and so use consistent pronouns to describe themselves and those around them; healthier people have a changing perspective that is reflected in changing pronouns. (Note the connections to first position, second position etc. associated with NLP and its antecedent psychological approaches.)
  • stress/depression. This is related to the previous category, but both stress and depression show up in characteristic ways in word usage. In the case of depression, the changes continue even after the depression ends, so that the never-depressed differ from the once-depressed.
  • community involvement or embeddedness. The way in which first-person singular and plural pronouns are used gives clues about how an individual feels in relation to a community.
  • deception. I’ve written extensively about this in previous posts.

Many of these results are the work of James Pennebaker and his group at the University of Texas at Austin. His work (here) is quite accessible. The paper by Chung and Pennebaker is particularly relevant.

What can be learned from text?

I’ve blogged before about text as a way of seeing both a content channel and an internal state channel from its author’s mind; and I’ve talked about looking for deception in the internal state channel extensively in the context of politics.

What else can be learned from text?

From the content channel, these kinds of things can be extracted:

  • the names of people, places, things, and sometimes other specifics such as money (these are called named entities).
  • the topic of a text, that is what it is about expressed as a single word or phrase (or perhaps a short list of such things). This property is what is used to collect together similar stories in Google News, for example.
  • events that occurred in the text. Although this seems straightforward, it is actually quite difficult and little work has been done on it.
  • a summary of the text. There is a wide range of possible summary styles: a tag cloud is one kind of summary; the snippet that comes below the URL in a page of search results is another; but there are also summaries that consist of sets of sentences drawn from the text that represent the main ideas or content.
  • a narrative from the text. Summaries try to cherry pick ideas from content, while a narrative tries to provide some sense of the flow of events.

All of these techniques make it easier to take a text, usually one of a very large number of texts, and decide whether it is worth looking at the raw text or not.

Other properties of the content channel that are often of interest are:

  • who the author is. This is an old kind of question, going back to questions like: Were Shakespeare’s plays written by Shakespeare, or the Earl of Essex, or someone else? It’s actually a range of questions: was this text written by author A or someone (anyone) else; was this text written by author A or author B; or match this large set of documents to this large set of authors.
  • properties of the author. For example, there is some work on determining author gender, author age, and author first language.

All of these problems have been studied, and many can be solved with knowledge discovery techniques whose accuracies are in the range 70-85%. This sounds pretty good, but is not terribly practical when the set of texts involved is in the tens of thousands. Most people who use Google News will have come across the occasional howler topic that is completely useless, for example.
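
To see why, here is the arithmetic for a collection of 20,000 texts (an illustrative size picked from the “tens of thousands” range), where a is the accuracy and N is the number of texts:

```latex
\text{expected errors} = (1 - a)\,N, \qquad
(1 - 0.85) \times 20{,}000 = 3{,}000, \qquad
(1 - 0.70) \times 20{,}000 = 6{,}000
```

So even at the top of the accuracy range, an analyst would be left with thousands of wrongly handled texts and no indication of which ones they are.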

Analyzing graph/relational data

One of the current puzzles is why knowledge discovery techniques for graph data do not perform as well, in practice, as they should in theory. The Netflix prize competition, which asks teams to predict user ratings of new movies based on several years of data about previous ratings, has turned out to be surprisingly difficult.

Ronald Coifman’s invited talk at the SIAM Data Mining Conference had something to add to this approach. He showed that the spectral approach to graph analysis, which works with eigenvectors of some matrix derived from the adjacency matrix of the graph, is really the same underneath as a wavelet approach, in which the structure in the graph is analyzed at varying scales. He has applied these ideas to graphs in which the edge affinities are derived from the thresholded pairwise affinities of data records, which makes it straightforward to turn attributed data into graph data without having to commit to a particular set of attributes in advance. This makes the approach easy to apply to data such as images and audio where there are a very large number of attributes.
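
The step of turning attributed data into a graph can be sketched as follows. This is an illustration under assumptions, not Coifman’s construction: the Gaussian affinity, the `sigma` and `threshold` parameters, and the function name `affinity_graph` are all choices made for the example, and the spectral or wavelet analysis that would then be applied to the graph is not shown.

```python
import math

def affinity_graph(records, sigma=1.0, threshold=0.5):
    """Adjacency sets linking records whose Gaussian affinity is at least `threshold`."""
    n = len(records)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            # Affinity decays with squared Euclidean distance between the records.
            affinity = math.exp(-math.dist(records[i], records[j]) ** 2 / sigma ** 2)
            if affinity >= threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph

# Two well-separated groups of invented two-attribute records.
records = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0), (5.5, 5.0)]
print(affinity_graph(records))
```

The first three records end up connected to one another, the last two to each other, and there are no edges between the groups. Only distances between whole records are used, so nothing depends on choosing particular attributes in advance.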

The abstract of the talk is here, and slides may eventually be posted on this site as well.

Maggioni’s web site is a good place to read more.

Using private documents to improve search in public documents

I’m back from the SIAM International Conference on Data Mining, and the 5th Workshop on Link Analysis, Counterterrorism, and Security, which I helped to organize. The workshop papers are now online, along with some open problems that were discussed at the end of the workshop.

I’ll post about some ideas that were tossed around at the workshop and conference in the next few days.

Let me start by talking about the work of Roger Bradford. Information retrieval starts from a document-term matrix, which is typically extremely large and sparse, and then reduces the dimensionality by using an SVD, a process sometimes called latent semantic indexing. This creates a representation space for both documents and terms. A query is treated as if it were a kind of short document and mapped into this representation space. Its near neighbours are then the documents retrieved in response to the query, and they can be ranked by increasing distance from the query point as well.
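
In symbols, one common formulation (a sketch of standard latent semantic indexing, not necessarily the exact variant Bradford uses) takes A to be the m × n document-term matrix, truncates its SVD to k dimensions, and folds a query vector q of n term counts into the same space:

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{d}_i = \text{row } i \text{ of } U_k, \qquad
\hat{q} = q^{\top} V_k \Sigma_k^{-1}, \qquad
\operatorname{sim}(\hat{q}, \hat{d}_i) =
  \frac{\hat{q} \cdot \hat{d}_i}{\lVert \hat{q} \rVert \, \lVert \hat{d}_i \rVert}
```

The folding-in formula is just what you get by treating q as an extra row of A. Some presentations scale the coordinates by the singular values before comparing, and some use Euclidean distance rather than cosine similarity; these are conventions rather than essentials.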

Bradford showed that the original space can be built using a set of private documents and a set of public documents, and that the resulting representation space allows better retrieval performance than the space derived from the public documents alone, without allowing the properties of the private documents to be inferred.

In fact, the set of private documents can be diluted by mixing them with other documents before the process starts, making it even more difficult to work backwards to the private documents.

This process has a number of applications that he talks about in the paper. One of the most interesting is that it allows different organizations, for example allies, to share sensitive information without compromising it to each other — and still get the benefits of the relationships in the full set of documents.

Spin in the US Presidential Primaries — Summary

As we enter what looks like it might be the end phase of the primary season, I thought I would summarize what I’ve written about spin during the process.

  1. What is spin? People often talk about spin as messing with the content of a communication: leaving bits out, or changing the emphasis. What I’m talking about here is a mental (unconscious) process where a person presents themselves or their content in a way that does not reflect what they know to be true about it. Politicians (and the rest of us) do this, to some extent, all of the time — trying to make a good impression. For a politician, outright lying is a poor idea (recall Clinton under sniper fire) but there is a lot of pressure to be “all things to all men”. Because the communication is not the speaker’s natural persona, this kind of spin produces a detectable signature in the communication.
  2. What is the model of spin (deception)? This work is based on Pennebaker’s empirically-derived model of word-usage changes when people are being deceptive. This model is characterized by (a) reduced rates of first-person singular pronouns; (b) reduced rates of exclusive words, words that mark the beginning of a phrase or clause that qualifies or refines what has gone before; (c) increased rates of negative-emotion words; and (d) increased rates of action verbs. These changes are unconsciously produced, so cannot be directly altered by a speaker, even one who knows the model. Although the model was developed for plain deception (outright lying) it seems to detect deception across the full range from lying, through spin, to negotiating and dating.
  3. Why does there have to be a context? Because the model relies on increases and decreases in word-usage rates, there must be some kind of context of similar communications or documents to be able to tell whether a given frequency represents an increase or a decrease. Therefore, absolute spin scores cannot be determined; instead we can only rank a set of communications from most to least spinful. Even within the context of the presidential primaries, underlying language use has changed, most obviously from a “getting to know me” phase to a “getting to know my policies” phase.
  4. The early primaries. From the beginning of 2008 until the 3rd week of February, all three candidates were introducing themselves. In the speeches given during this period, McCain has the least spin, followed by Clinton, followed by Obama, with noticeably higher levels of spin. McCain generally used (and uses) high rates of first-person singular pronouns, justifying his “straight talk” claim; Clinton generally used high rates of exclusive words, adding refinement and qualification to many of her statements. Obama’s speeches were lacking in both: he used “we” at extremely high rates (and “I” hardly at all), and his statements were simple and declarative. This makes for speeches that are light on content but, when well delivered, emotionally uplifting. (Reading Obama’s speeches rather than hearing them makes one wonder what all the fuss is about: the speeches themselves are rather dry, and the delivery is everything.)
  5. Obama decides he’s won. Over the weekend of February 24th, Obama’s language patterns changed dramatically, becoming very similar to Clinton’s. I conclude that this weekend his campaign did the calculations and decided that Clinton could not win the nomination (which seemed, and seems, mathematically true). He cannot have consciously altered his speech patterns, so this must be the result of reframing what’s going on to himself, presumably stepping out from behind the persona he had been using before that and presenting something closer to his real self.
  6. The past month. In the past month, both Obama and Clinton show higher levels of spin whenever the pressure on them has increased, and they have become defensive. For example, Obama’s levels of spin jumped back to January levels when the Wright controversy became public. In such situations, Obama’s level of spin is characteristically higher than Clinton’s.
  7. Responses versus statements. It is difficult to analyze and compare the debate statements of the candidates with their speeches. The question and answer form naturally changes the rates of word usage: for example, if the question is “Would you…” it’s much more likely that the answer will begin “I will…”. And, to make matters worse, debates are not really question and answer since candidates have prepared statements for likely questions and they will use them regardless of the form (and sometimes the content) of the question. It is not yet clear how applicable the deception model is in question and answer situations, so I have not analyzed the debates, except for the most recent, where Obama still shows higher spin than Clinton.
  8. Does spin work? Spin is a two-edged sword. On the one hand, it does work in the sense that it can make a speaker appealing to people who would otherwise not be attracted to him/her, which is why, of course, politicians use it. On the other hand, if a candidate steps out of character, even briefly, people may realize that it is a facade and react in a strong negative way. And, to make it harder, the facade and the language usage are largely subconscious, so a candidate may misstep without realizing it.
  9. “I” versus “we”. There’s a lot of (positive) discussion of Obama’s high rates of use of “we”. This pronoun is irrelevant to deception: people who are being honest or deceptive may use “we” at either high or low rates. However, the difference between using these two pronouns is partially understood. People who are being open and not status conscious typically use “I” a lot, while those who are being closed and status conscious typically use “we” a lot. In particular, “we” is often code for “you”: commanding without creating the impression of command. In other words, “we” is a weasel word.
  10. Growing into a persona. It’s possible that Clinton and McCain have been in politics so long that a persona that they originally assumed has now become so much a part of them that it has become their real personality; and that is why their levels of spin are low. By this explanation, the reason that Obama has such high levels of spin is that he’s a relative newcomer to the US national arena, and so he still “puts on” a persona. This doesn’t seem all that convincing — first, he has a long history in public life, although on a smaller stage; second, he seems to be able to step out from behind the persona when things are going well.
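
Items 2 and 3 above can be sketched together: count the four word categories in each text, standardize each rate against the rest of the set, and rank. Everything specific here is an assumption made for illustration. The word lists are tiny stand-ins rather than Pennebaker’s lexicons, the way the four signals are combined (an unweighted sum of z-scores) is a simple choice and not necessarily the procedure behind the figures in these posts, and `rank_by_spin` is a made-up name.

```python
import re
from statistics import mean, pstdev

# Tiny illustrative word lists with the direction each moves under deception:
# -1 means the rate drops, +1 means it rises.
CATEGORIES = {
    "first_person": ({"i", "me", "my", "mine", "myself"}, -1),
    "exclusive": ({"but", "except", "without", "unless", "although"}, -1),
    "negative_emotion": ({"hate", "afraid", "angry", "worthless", "sad"}, +1),
    "action": ({"go", "run", "fight", "take", "move"}, +1),
}

def rates(text):
    """Fraction of tokens in `text` that fall in each word category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = max(len(tokens), 1)
    return {name: sum(t in words for t in tokens) / total
            for name, (words, _) in CATEGORIES.items()}

def rank_by_spin(texts):
    """Indices of `texts` ordered from most to least spin, relative to this set only."""
    all_rates = [rates(t) for t in texts]
    scores = [0.0] * len(texts)
    for name, (_, sign) in CATEGORIES.items():
        column = [r[name] for r in all_rates]
        mu, sd = mean(column), pstdev(column)
        if sd == 0:          # no variation in this category, so it says nothing
            continue
        for i, value in enumerate(column):
            scores[i] += sign * (value - mu) / sd
    return sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)

texts = [
    "I think I was wrong, but I stayed, although my friends left.",
    "We fight, we move, we take it all, and they hate us.",
]
print(rank_by_spin(texts))
```

The second text ranks as more spinful than the first because, relative to the other, it has no first-person singular pronouns or exclusive words and more action and negative-emotion words. The output is only an ordering within the set, which is the point of item 3: change the set and the ordering can change.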

The analysis on which this summary is based (and the figures that go with it) can be found in earlier postings.

Offline for a while

I won’t be posting until April 28th as I’m off to the SIAM Data Mining Conference in Atlanta next week. Watch for stuff from the conference when I get back.

Clinton and Obama spin including Penn debate

Here are the results of the analysis of spin for Clinton and Obama, using speeches from 2008, and the Penn debate last night. 19 is Clinton, and 39 is Obama.
[Figure: Obama and Clinton including Penn debate]

Clinton: blue dots; Obama: blue stars.

As I’ve noted before, Clinton in general uses less spin than Obama, except when she’s under pressure. It’s also clear from this figure that there is a significant difference between the two orthogonal to the line that indicates level of spin. This is almost entirely due to Clinton’s liking for action verbs, to which Obama is relatively allergic.

It’s not quite fair to compare debate content to prepared speeches, because a debate requires some level of responsiveness to questions (which changes the model of deception). However, it was clear that most of the answers last night were heavily prepped, so perhaps it’s not too different.

The striking thing is that both performances represent all-time lows in spin for both candidates. This is somewhat surprising since Clinton is presumptively in the last few days of her candidacy. Both candidates seem to have decided that they can only be who they are — which has to be a good thing.

Mind you, Obama is still in love with the pronoun “we”. He almost said “when we become president” last night!

Text in Adversarial Situations

Whenever we, as humans, write or speak we reveal something about ourselves. Part of this is what we want to reveal — the purpose of our communication. But we also reveal a great deal that we did not necessarily intend to reveal, and this is part of what makes textual analysis interesting in adversarial situations.

It’s helpful to think of what is happening when we speak or write as happening simultaneously in two channels:

  1. The content channel, which serves the purpose for which the communication is intended, and is often carried by the ‘big’ words: nouns and verbs; and
  2. The internal state channel, which reveals information about our mental state, intentions, and feelings, and is often carried by the ‘little’ words such as conjunctions and verbs.

The internal state channel is what we examined when looking for deception in earlier posts.

We tend to think, intuitively, that we control the content channel consciously, although we can’t control the internal state channel. In fact, we actually do not control either channel very well, at the level of details; although, of course, when we set out to say something we usually manage to get the content we want across.

When we think about communication emanating from bad guys, there are a number of different scenarios:

  1. They are communicating in a public way to disseminate content. They may attempt concealment, but are aware that their communication could be intercepted both accidentally by almost anyone, and by people looking for it explicitly.
  2. They are communicating in a private way to disseminate content. They will have to attempt concealment, and know that communication that is intercepted will be scrutinized carefully.
  3. They are communicating in a public way, but it is their mental state that is most interesting. For example, propaganda is a form of content-filled public communication, but the mental models and intentions behind it are probably of more interest than the content.

Each of these scenarios requires a particular kind of analysis, but the overall structure is the same:

  • Use selection to find potentially interesting communication in the mass of communication in today’s media and internet. This relies on modelling normality; looking for concealment; and looking for evasion (reaction to simple selection techniques).
  • Use analysis techniques on the set of selected communications to extract content, authorship information, metainformation (e.g. traffic analysis), intention, emotional state, deception, and attitudes.

I’ll talk in more detail about these aspects in subsequent posts.

Hiding a secret in a virtual world

Yesterday, I talked about ways to hide a secret in the internet or web. One of the newer ways that is attracting attention is to hide a secret in a virtual world.

People usually start by thinking about hosted worlds such as Second Life. In these environments, there is some level of exposure because the hosting organization can see everything that is happening, and can prevent certain kinds of bad things from happening (in a limited way; griefers seem to be able to do a lot without much obstacle).

However, new ‘open source’ virtual world environments are rapidly being developed, and these provide more accessible ways to hide secrets. For example, the Multiverse allows anyone to host a virtual world, and virtual worlds can be connected to one another via teleport stations that allow a user to jump from one to another (and also to jump about within a single world). A virtual world is like a web site, and a teleport is like a click on a hyperlink. Such worlds are accessible by standard clients run by users.

At this moment, there is a lot of overhead in learning how to put together such a world; and both servers and clients need a lot of horsepower and bandwidth to create a useable experience. But this will come, of course.

Communicating in such virtual worlds has the advantages of communicating in the real world, plus some extras. Two individuals can ‘meet’ and ‘communicate’ in the virtual world just as they could in the real world. Of course, the owner of the virtual world can log what happens, but when such worlds become as common as web sites, some worlds can be maintained for some apparently innocent purpose, but actually to enable covert communication.

It’s also possible to have avatars in the virtual world that can dispense information when they are given a suitable passphrase, of the “The geese fly south” variety; but communicate innocently to others. So an avatar can hold a secret and give it to anyone who knows how to access it. This isn’t limited to a small amount of text either — movie screens that play arbitrary content can be created and played in response to a passphrase.

In the real world, people can be followed and their communications intercepted. This is much more difficult in a network of virtual worlds. First, teleportation means that an avatar can move around, even within a world, in a way that is hard to track as an outsider. Second, communication can happen at a distance; an avatar can ‘talk’ to another even when they aren’t virtually close. So communication can happen without an obvious meeting.

At present, AFAIK, it is not possible to leave objects in Multiverse worlds for collection by others; but this functionality might well be adopted.

One of my students has been building a virtual world representation of a web site. Details about some of these features can be found here.

There are going to be messy legal issues as well. It seems that the default interpretation of U.S. 4th Amendment protections is based on ‘expectation’ of privacy. This is a poor basis for argument given rapidly changing technology and different levels of understanding by ordinary people about how technology works. But it’s conceivable that a virtual world will acquire an expectation of privacy, even though it’s a public space.

Hiding a secret in the Internet

Although it might seem intuitively obvious that the way to hide a secret is to keep it hidden, there are some reasons why it makes sense to hide a secret ‘out in the open’ on the Internet or the web. Doing so may make it easier to pass to other people than sending it explicitly to them because the sender doesn’t have to know where the receiver actually is (or who they are pretending to be). Physical meetings create a strong trace, so it might be more attractive not to meet but to communicate instead.

There are a number of ways to communicate a secret covertly even though it is out in the open on the Internet. Here are some of them:

  1. Rely on hiding the secret in the middle of a torrent of other communication. For example, create a blog on the most boring subject imaginable, wait for any initial interest to die down, and then post very occasionally. With a little bit of cryptic language, the chance of anyone stumbling on it, and being interested enough to act on what they see, is very small.
  2. Encrypt the secret. This makes it possible for the secret to be in plain view, but only those who have the decryption key can open the envelope and look at the content.
  3. Put it in a password-protected place on a web site. There is already a lot of protected content on the web, so this doesn’t make it stand out.
  4. Put it in a web directory, but don’t provide any link to it. Although web crawlers can see the content, there’s a gentleman’s agreement (worth testing from time to time) that the large commercial search engines won’t index it. Since the content can’t be searched for, it can only be found by knowing where it is, and the web is a big place.
  5. Use a non-http protocol. Many peer to peer systems already move hidden content around the internet and, because each is free to use its own protocol, standard tools do not ‘see’ this traffic or its content.
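
One concrete, voluntary convention of the kind item 4 relies on is robots.txt, and treating it as the example here is an assumption on my part. The sketch below uses Python’s standard `urllib.robotparser` to show how a well-behaved crawler would decide what to fetch; the rules, the crawler name `ExampleBot`, and the example.com URLs are all invented.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt rules asking all crawlers to stay out of /private/.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A polite crawler asks before fetching; nothing forces it to.
hidden_ok = parser.can_fetch("ExampleBot", "https://example.com/private/secret.html")
public_ok = parser.can_fetch("ExampleBot", "https://example.com/index.html")
print(hidden_ok, public_ok)
```

The check is made entirely by the crawler, so it protects nothing from software that chooses to ignore it. Listing a directory in robots.txt also advertises that the directory exists to anyone who reads the file.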

Each of these approaches has disadvantages and weaknesses. They are:

  1. Knowledge discovery tools that look for particular words are already deployed. Tools that look for unusualness or abnormality are also beginning to be developed. It is becoming easier and easier to filter content and pick out particular kinds, without needing a good model of what those particular kinds might look like.
  2. Encryption seems like a strong way to protect content, but its weakness is the handling of the keys needed for decryption. These are another kind of secret that must be protected even more strongly than the secret we’re thinking about. Also, researchers in cryptography continue to discover unsuspected flaws in cryptographic schemes.
  3. Password protection of web sites is not very robust. Most servers protect particular directories/folders, but not recursively, so if you can guess the name of a subdirectory, you can usually get access to it.
  4. Relying on web crawlers not to index or otherwise capture data in unlinked files is just a dangerous idea, if you want to keep the data secret.
  5. Non-http protocols can provide a more robust protection scheme, but they require particular software to understand the variant protocol, so this new software becomes a different kind of secret.
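
To make the weakness in item 4 concrete, here is a minimal sketch. It assumes the gentleman’s agreement takes the form of a robots.txt file, which is one common convention but may not be the only mechanism meant above; the directory name and crawler name are invented. The check is purely advisory (a crawler has to choose to run it), and the file itself advertises the path it is supposed to hide.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site; the path is invented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-notes/
"""


def polite_crawler_may_fetch(robots_txt, url, agent="ExampleBot"):
    """Return what a crawler that honours robots.txt would decide for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def advertised_paths(robots_txt):
    """List the Disallow paths, which anyone can read from the file."""
    paths = []
    for line in robots_txt.splitlines():
        if line.lower().startswith("disallow:"):
            paths.append(line.split(":", 1)[1].strip())
    return paths


if __name__ == "__main__":
    print(polite_crawler_may_fetch(ROBOTS_TXT, "http://example.com/index.html"))
    print(polite_crawler_may_fetch(ROBOTS_TXT, "http://example.com/private-notes/a.html"))
    print(advertised_paths(ROBOTS_TXT))
```

Nothing in this code stops a crawler that simply skips the check, which is why the approach is a poor basis for real secrecy.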

So, although it is possible to hide a secret in public on the Internet, it is not so easy to do it well. One of the new hiding places is virtual worlds, which I’ll talk about tomorrow.

Artificial identities

Those of us on the side of the angels would prefer that identity functioned as a very strong digital key so that there was a direct mapping from individual to key to data. If such a thing existed, many of the problems of casinos, law enforcement, counterterrorism, and so on would be much easier. Hence the torrent of interest in biometrics, which readers will know is actually not much of a solution.

From our position as individuals, however, we would prefer that identity was a much weaker thing, so that we could interact with other people and organizations without making it possible for others to take anything they learn about us and turn it into a search key to learn more.

Neither side of this is either completely good or completely bad. Even in private life, it’s probably good on balance that one person can find out that someone they’re dating is abusive, or cheap, or whatever.

What’s changed in the modern world is the ease with which different data about a person can be fused. Once upon a time, this depended on a few direct keys, like full names or social security numbers. With technology, almost any property of a person can become a key with which to find out more about them.

The solution that most people imagine to the problems of preserving a private life in a world where it’s easy to learn more about someone is to somehow create partial identities. One way to do this is to create multiple email addresses, and use different ones in different contexts. Historically, authors used pen names to decouple their working identity from their personal identity.

When any attribute can become a key, partial identities don’t work any more (which doesn’t stop people trying to build approaches and systems based on them).
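
A small sketch of why partial identities fail, under invented data: two record sets that share no email address are still fused by joining on a pair of ordinary attributes. The field names, the records, and the `fuse` helper are all hypothetical, chosen only to illustrate the point.

```python
from collections import defaultdict

# Invented records: the same person uses a different email in each context.
shopping = [
    {"email": "reader42@example.com", "zip": "30303", "birth": "1970-05-17"},
]
forum = [
    {"email": "quiet.poster@example.org", "zip": "30303", "birth": "1970-05-17"},
    {"email": "someone.else@example.org", "zip": "98101", "birth": "1982-11-02"},
]


def fuse(left, right, key_fields):
    """Pair up records from two sources that agree on every field in key_fields."""
    index = defaultdict(list)
    for rec in left:
        index[tuple(rec[f] for f in key_fields)].append(rec)
    matches = []
    for rec in right:
        for other in index.get(tuple(rec[f] for f in key_fields), []):
            matches.append((other, rec))
    return matches


if __name__ == "__main__":
    print(len(fuse(shopping, forum, ("email",))))        # separate emails: no link
    print(len(fuse(shopping, forum, ("zip", "birth"))))  # ordinary attributes: linked
```

The separate email addresses do their job only as long as nobody thinks to join on something else.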

A better approach is to allow people to use false identities. This idea doesn’t necessarily appeal to people in Western culture because it seems as if it’s a little bit dishonest. In other cultures, this idea would seem completely natural. And actually in Anglo-Saxon common law a person is allowed to assume another identity, as long as there is no intent to defraud (although this is increasingly difficult in practice, for example when getting on a plane).

A false identity is only useful if certain properties can be associated with it, for example a fixed amount of money that the identity can spend. Building such identities with properties requires a kind of trusted agent, who can guarantee that there is a real identity behind the false one, and that the property genuinely does belong to the real identity. But this idea can be made to work.

Behavioral Detection

There was an interesting incident last week, where a TSA behavioral detection officer detected someone who turned out to have bomb-making equipment with which he was planning to fly.

Although there has been some media coverage, it was probably news to most people that TSA used behavioral detection. There are staff at a few US airports, with plans to roll out the program in more airports quite rapidly. As a program, it has been successful — about 1% of stops seem to produce arrests, which is actually a pretty good rate. Of course, almost all of these arrests are for crimes unrelated to air travel, which raises some difficult issues that will exercise the minds of those who are extremely keen on privacy (but note the existence of a Terry stop in the US which seems to me to more-or-less cover this case).

I’ve decided that when people use the word ‘profiling’ what they mean is predictive modelling based on arbitrary or intuitively derived attributes and models.

It isn’t appropriate to call this “behavioral profiling”. Its technical roots seem to come from two places. The first is Ekman’s work on microexpressions, which some people may be familiar with from Gladwell’s book, Blink. The idea is that all of us exhibit fleeting patterns of muscular use, primarily facial, that reveal parts of our underlying internal state. Some practice is evidently needed to learn to notice these in real time, but it can be done.

The second part is less public but seems to have been developed by the Israelis as a way to defend against suicide bombers. I’ve heard one talk by an Israeli about this (light on technical details), but they claim to have high success rates across a wide spectrum, for example including schoolboys.

In the conversation about the recent incident, many connections were drawn with air security at Ben-Gurion, the main airport in Israel. It’s not entirely fair to draw such comparisons. Ben-Gurion uses a defence in depth where, as I understand it, data analysis begins when you order your taxi to go to the airport (as well as the obvious analysis of flight manifests).

More on Identity

I’ve mentioned the problem of figuring out when data records describe the same person in the two most recent posts. Casinos are required to ban certain people who have identified themselves as having a gambling problem, so they have to look carefully at everyone who books a room. They also, of course, have an interest in noticing when certain other people show up, for example card counters.

As I said yesterday, identity is a slippery thing to manage algorithmically. It’s only in the last century that governments have gotten into the act of certifying identity, via various forms of government-issued identification, going back to birth certificates.

Such documents are not necessarily very reliable. There’s a long history of forging them. But mostly identity gets fudged because people don’t use them directly — they copy names and addresses with characteristic human errors; and this process can be helped along by those who want to hide their identity. It’s socially acceptable to use variant names, and people constantly make mistakes with numbers. Those who want to can use these deniable mistakes to create multiple versions of their identities.

This is partly why there’s such an interest in biometrics. A biometric is an identity key that was given to you by God. The important distinction in biometrics is between a digital biometric and a non-digital one. A photo in a passport is a non-digital biometric — it can be used to associate the passport, and so its contents, with you, but doesn’t do much else. A digital biometric, such as a digitized photo, can act as a key to a large database of information about you.

Most biometrics are extremely easy to fool. You can read about some of the easy tricks here. Fingerprint scanners can be fooled by plastic wrap; iris scanners by printed photos of an iris.

In relationship/graph data, the problem with multiple records describing the same person is that they blur the structure of the connections around that person — making some paths seem longer, and some properties more diffuse. That’s why it’s important to be able to resolve identities when possible; but also why it’s important to stay agnostic over the long haul.
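
As a rough illustration of staying agnostic, the sketch below keeps records separate and computes a same-person confidence only when asked. It uses plain string similarity from Python’s `difflib`, which is a stand-in chosen for brevity and not a description of how any real identity-resolution product works; the records, field names, and threshold are invented.

```python
from difflib import SequenceMatcher

# Invented records; the second is a deniable variant of the first.
rec_a = {"name": "Jonathan Smith", "phone": "555-0143", "address": "12 Elm Street"}
rec_b = {"name": "Jon Smith", "phone": "555-0134", "address": "12 Elm St"}
rec_c = {"name": "Maria Delgado", "phone": "555-0199", "address": "9 Harbour Road"}


def field_similarity(a, b):
    """Similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_person_confidence(r1, r2, fields=("name", "phone", "address")):
    """Average field similarity; the records themselves are never merged."""
    scores = [field_similarity(r1[f], r2[f]) for f in fields]
    return sum(scores) / len(scores)


def likely_same(r1, r2, threshold=0.7):
    """Decide at query time; a later query is free to use another threshold."""
    return same_person_confidence(r1, r2) >= threshold


if __name__ == "__main__":
    print(same_person_confidence(rec_a, rec_b))
    print(same_person_confidence(rec_a, rec_c))
```

Because nothing is merged, a decision that later looks wrong can simply be made differently next time.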

More on Las Vegas

Las Vegas is an interesting example for those who think about adversarial knowledge discovery because it shows how little people really value privacy. The casinos leave their customers no privacy — their every action is captured while they’re in the casino, and even making a hotel reservation starts a chain of analysis in motion. Imagine the fuss if a government did anything remotely like this!

I mentioned Jeff Jonas yesterday. He has made two contributions to adversarial knowledge discovery:

  1. An agnostic approach to people’s identity in data. One of the problems of data analysis, particularly when the data comes from multiple sources, is putting together the attributes that belong to the same person. Usually some kind of key is used, perhaps a biometric, so that records that belong together can be discovered to belong together. If you are trying to hide some of your data, you want to confuse the key as much as possible. There are many ways to do this, but the best ones are ones that can be disavowed if the question ever comes up. So people who are trying to hide use variants of their names, mix up digits of phone numbers and street addresses, and do anything else that can also happen accidentally. Some studies of criminal records have shown that nearly half have been altered in this kind of way. Jonas’s Non-Obvious Relationship system allows records that might belong to the same person to coexist. When some analysis is done of the connections between people, which is where this kind of blurring matters because it increases the distances between the nodes representing people, the software can make an on-the-fly determination of how confident it is that a set of records represents the same person, without making the determination irreversible.
  2. He argues that any data analysis system should treat the queries that are made to it as new parts of the data, that should be added to the system and kept in it. This has two advantages. The first is that, if the answer to a query appears in the data after the query has been made, the answer can still be provided to the person who asked. The second is that a second query of the same kind can produce a response about the existence of the first query. In other words, one person can discover that someone else was asking the same question. Both of these are important and useful in adversarial settings, and should be considered for other data-analysis systems.
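
The second idea, treating queries as data, can be sketched in a few lines. This is a toy in-memory version under my own assumptions, not Jonas’s design: the class name, method names, and exact-match queries are all hypothetical.

```python
class QueryAwareStore:
    """Toy store that remembers every query alongside the records."""

    def __init__(self):
        self.records = []   # list of dicts
        self.queries = []   # list of (who, key, value) tuples, kept forever

    def ask(self, who, key, value):
        """Return matching records plus anyone else who asked the same thing."""
        earlier = [
            asker for (asker, k, v) in self.queries
            if (k, v) == (key, value) and asker != who
        ]
        self.queries.append((who, key, value))
        hits = [r for r in self.records if r.get(key) == value]
        return hits, earlier

    def add(self, record):
        """Store a record and report whose earlier queries it now answers."""
        self.records.append(record)
        return [who for (who, k, v) in self.queries if record.get(k) == v]


if __name__ == "__main__":
    store = QueryAwareStore()
    print(store.ask("analyst_a", "name", "J. Doe"))  # no data yet, no earlier askers
    print(store.ask("analyst_b", "name", "J. Doe"))  # learns that analyst_a asked
    print(store.add({"name": "J. Doe"}))             # both can now be told
```

Both advantages show up here: a late-arriving record still reaches the people who asked, and the second asker learns about the first.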

Jonas’s blog is here.

Lessons from “21”, aka “Bringing Down the House”

The film “21” is (loosely) based on Mezrich’s book “Bringing Down the House”. It describes a group of card counters who managed to make a lot of money from North American casinos.

Las Vegas is an interesting environment to think about adversarial knowledge discovery because it’s so well-developed, because there’s money to spend on ideas that work, and because it’s fairly easy to measure how well tools are working. For example, the expected return from every game is tracked in real-time, and deviations attract attention within a very small time window.
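
One simple way to picture that kind of tracking is a z-score on the house’s average return over a window of hands. This is only a sketch of the general statistical idea; I have no knowledge of what casinos actually run, and the house edge, per-hand spread, and threshold below are invented numbers.

```python
import math


def return_z_score(payoffs, expected_mean, per_hand_std):
    """How many standard errors the observed mean payoff is from expectation."""
    n = len(payoffs)
    if n == 0:
        raise ValueError("need at least one hand")
    observed_mean = sum(payoffs) / n
    standard_error = per_hand_std / math.sqrt(n)
    return (observed_mean - expected_mean) / standard_error


def needs_attention(payoffs, expected_mean, per_hand_std, threshold=3.0):
    """Flag a window whose return deviates strongly in either direction."""
    return abs(return_z_score(payoffs, expected_mean, per_hand_std)) > threshold


if __name__ == "__main__":
    HOUSE_EDGE = 0.005   # assumed expected house payoff per unit bet
    SPREAD = 1.1         # assumed per-hand standard deviation
    quiet_table = [0.0] * 100    # house breaks even on every hand
    hot_table = [-1.0] * 100     # house loses one unit on every hand
    print(needs_attention(quiet_table, HOUSE_EDGE, SPREAD))
    print(needs_attention(hot_table, HOUSE_EDGE, SPREAD))
```

The shorter the window, the larger the standard error, so a small window trades sensitivity for speed of reaction.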

Jeff Jonas, now at IBM, has a lot of experience in this world — his Non-Obvious Relationship Awareness tool was/is heavily used by casinos to find an individual using more than one identity. If you ever have the chance to hear him speak, it’s always fun. Just one quote: “What happens in Vegas, stays in Vegas — on video”.

The main lesson from the book (and maybe the film) is how important social engineering can be in knowledge discovery. It’s no good using sophisticated data collection, and heavy-lifting data analysis, if the results are discounted because they don’t fit the preconceptions of the people who have to make decisions and take actions as the result of the knowledge.

In the casino context, the card counters played roles that were crafted to look like the people casinos want to see — people who, even when they’re winning big right now, will come back and lose even more tomorrow.

Building resistance to social engineering is difficult because it doesn’t lend itself to technological solutions. The “failure of imagination” with which the 9/11 Commission charges U.S. intelligence agencies is largely a social engineering issue. There were scattered pieces of information around, but they were discounted and/or ignored because nobody really believed in the actual possibility of an attack of that magnitude.

We don’t know a lot about how to increase imagination.

US presidential election spin — recent speeches

As promised, I’ve updated the plot of candidate spin to reflect only the speeches given in March.

The significance of this is that spin is always relative to some set of similar documents; and, in the Democratic campaign at least, the substance and tone of the conversation has changed substantially in the past few weeks. All candidates have moved away from presenting themselves as people to presenting their policies (McCain is back talking about himself, but this isn’t captured in this set of speeches).

[Figure: Spin in speeches in March]

9 is Obama’s race speech, 10 is his Iraq speech.

McCain still shows low levels of spin. However, the position of Obama and Clinton has largely reversed. Clinton shows high levels of spin, suggesting that her policy speeches are being simplified from the way she would usually speak (which is plausible). Obama continues to increase his level of first-person singular pronoun use, suggesting that he remains confident that he has won the nomination, despite the media’s attempts to keep stirring the pot.
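
For readers curious about the pronoun signal mentioned above, here is a minimal sketch of measuring first-person singular pronoun use as a rate per 100 words. It covers only this one signal and is not the spin model behind the plot; the pronoun list and the tokenization are my own simplifications.

```python
import re

FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}


def first_person_rate(text):
    """First-person singular pronouns per 100 words of text."""
    normalized = text.lower().replace("\u2019", "'")  # treat curly apostrophes as plain
    words = re.findall(r"[a-z']+", normalized)
    if not words:
        return 0.0
    # "i'm" and "i've" count as uses of "i".
    count = sum(1 for w in words if w.split("'")[0] in FIRST_PERSON_SINGULAR)
    return 100.0 * count / len(words)


if __name__ == "__main__":
    print(first_person_rate("I think my plan is good"))
    print(first_person_rate("We will win"))
```

As with spin itself, a rate like this only means something relative to a set of comparable speeches.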

Workshop on Link Analysis, Counterterrorism, and Security

If you’re interested in the content of this blog, and you live in the Atlanta area, you might be interested in coming to LACTS, the Workshop on Link Analysis, Counterterrorism, and Security. It’s being held on April 26th (Saturday) as part of the SIAM International Data Mining Conference. A one-day registration deal is available.

The proceedings will also be available online, both via my website and from SIAM after the workshop.

Here is the schedule:

0825-0830: Introduction
Antonio Badia and David Skillicorn

0830-0900: Detecting Hidden Passages in Documents
Saket S.R. Mengle and Nazli Goharian

0900-0930: Exploiting Sensitive Information in Background Mode using Latent Semantic Indexing
R. B. Bradford

0930-1000: Topic Detection Using Independent Component Analysis
Scott Grant, David Skillicorn, and James R. Cordy

1000-1030: Coffee Break

1030-1100: Using AI for Sensemaking in Investigative Analysis
Summer Adams, Ashok K. Goel, and Neha Sugandh

1100-1130: Vulnerability Assessment on Adversarial Organization: Unifying Command and Control Structure Analysis and Social Network Analysis
Il-Chul Moon, Kathleen M. Carley, and Alexander H. Levis

1130-1200: Torus Graph Inference for Detection of Localized Activity
Elizabeth A. Beer, Carey E. Priebe, and Edward R. Scheinerman

1200-1330: Lunch (on your own)

1330-1430: Workshop Keynote: “The Road to Link Intelligence”
Sherry Marcus, 21st Century Technologies.

1430-1500: Enhancing the Automated Analysis of Criminal Careers
Tim K. Cocx, Walter A. Kosters, and Jeroen F.J. Laros

1500-1530: Summarization and Information Loss in Network Analysis
Jamie F. Olson and Kathleen M. Carley

1530-1545: Summing Up
Antonio Badia and David Skillicorn
