Posts Tagged 'identity'

Blurring Identity

As I posted a few days ago, if we can’t avoid having data about ourselves, our actions, and our identities collected, then one plausible strategy is to create artificial data to supplement the real data. This makes it harder to find the real identity inside the larger blurred identity.
And now there’s a tool to help: Please Dont Stalk Me, which allows Twitter users to make it look as if their tweets come from anywhere in the world. Here’s the web site:


Fake Identities

Sixty years ago, all that was required to create a fake identity was to be able to forge documents — think ‘Allo ‘Allo. However, such identities didn’t stand up to much scrutiny since any check with a central location discovered the forgery. So intelligence organisations during the Cold War had to create deeper fake identities by inserting false records into government archives, or by taking over the identities of other people who weren’t using them. The nature of the process meant that they had to be developed ahead of when they would be needed, because there was no practical way to retro-insert the necessary documents.

Fast forward to today and have pity on the intelligence organisation employees whose job it is to keep many cover identities functioning in case they’re needed one day: posting on Facebook and LinkedIn, making comments on other people’s posts, and generally simulating a real person against the day when that record of existence might be needed as background for someone to assume the identity.

They have some new problems. First, it’s not enough to create tokens of identity; they have to involve activities, and so the work is constant. Second, everything they do is (potentially) recorded so that any mistake is captured forever. Third, and most importantly, we don’t know enough about how real people behave online to fake it successfully. For example, suppose I want to post on a social network site about as often as the median person with my demographics. I usually don’t have enough information to be able to guess what that median is with any reliability. Even if I do, there are deeper patterns to such postings, for example the distribution by time of day, that I really should try to mimic but practically can’t. If I’m an intelligence professional, I may unwittingly post during the work day when the identity I am simulating would more plausibly post in the evening. And if I try to construct a realistic social network around myself I have even more difficulty knowing what it “should” look like, let alone making it happen.
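The time-of-day point can be made concrete with a small sketch. It assumes we already have the hours at which two accounts posted (the sample data and function names below are invented for illustration) and compares the two hourly distributions using total variation distance, which is just one of several reasonable measures.

```python
from collections import Counter

def hour_histogram(post_hours):
    """Return a 24-bin distribution (fractions summing to 1) of posting hours.

    Assumes post_hours is a non-empty list of integer hours.
    """
    counts = Counter(h % 24 for h in post_hours)
    total = sum(counts.values())
    return [counts.get(h, 0) / total for h in range(24)]

def total_variation(p, q):
    """Total variation distance: 0 means identical, 1 means no overlap."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical data: the hour (0-23) of each post.
evening_poster = [19, 20, 20, 21, 21, 22, 22, 23]      # plausible real person
office_hours_cover = [9, 10, 11, 11, 14, 15, 16, 16]   # cover run from a desk

distance = total_variation(
    hour_histogram(evening_poster),
    hour_histogram(office_hours_cover),
)
print(f"distance between posting patterns: {distance:.2f}")
```

A large distance between a persona and the population it claims to belong to is the kind of signal that gives a desk-bound operator away.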

For a fun example of this process and its pitfalls, search for “Robin Sage” at your favorite search engine.

The flipside of the difficulty of faking an online identity is that my online presence over time becomes a better guarantee of my identity than that provided by governments. Because I’ve had a web page for a long time, and it’s been captured at unpredictable moments by the Wayback Machine, it provides quite a strong basis for my identity.

Super Identities

I heard a talk on the UK Super Identity Project last week which stimulated some musings on this important topic.

Once upon a time, almost everyone lived in villages, and identity was not an issue — everyone you knew also knew you and many of them had done so since you were born. So identity issues hardly arose, apart from an occasional baby substitution (but note Solomon in 1 Kings 3:16-28 for an early identity issue). As rich people began to travel, new forms of identity evidence such as passports and letters of introduction were developed.

About a hundred years ago, as the result of mass movement to cities, questions of identity became common. You can see from the detective stories of the time how easy it was to assume another identity, and how difficult it was to verify one, much as it is in cyberspace today. To deal with these issues, governments became involved as the primary definers of identity, getting in on the act with birth certificates (before that, e.g. baptismal records), and then providing a continuous record throughout life.

In parallel, there’s the development of biometric identifiers, mostly to deal with law enforcement, first the Bertillon system and then fingerprints (although, as I’ve noted here before, one of the first detective stories to include fingerprints, The Red Thumb Mark, is about how easy they are to forge).

The Super Identity project is trying to fuse a set of weak identifiers into a single identity with some reliability. Identities are important for three main reasons: (a) trust, for example so that I can assume that someone I’m interacting with online is the person I think it is; (b) monetizing, for example so that an advertiser can be sure that the customized ad is being sent to the right person; and (c) law enforcement and intelligence, for example to establish that several identities actually belong to the same underlying person.

There are many identifying aspects, almost all of which are bound to a particular individual in a weak way. They come in four main categories:

  1. Physical identifiers such as an address, or a place of employment.
  2. Biometrics (really a subset of the physical) such as fingerprints, iris patterns, voice and so on. These at first glance seem to be rather strongly bound to individuals, but all is not as it appears and they can often be forged in practice, if not in theory. There is an important subset of biometrics that is often forgotten: those that arise from subconscious processes. These include language use, and certain kinds of tics and habits. They are, in many ways, more reliable than more physical biometrics because they tend to be hidden from us, and so are harder to control.
  3. Online identifiers such as email addresses, social network presence, web pages, which are directly connected to individuals. Equally important are the indirect online identifiers that appear as an (often invisible) side-effect of online activity such as location.
  4. Identifiers associated with accessing the online world, that is identifiers associated with bridging from the real world to the online world. These include (beloved by governments despite their weakness) IP addresses which led to a recent police raid, complete with stun grenades, on an innocent house.

The problem with trying to fuse these weak identifying aspects into a single superidentity which can be robustly associated with an individual is this: it’s relatively difficult to avoid creating these identifying aspects, but it’s relatively easy to create more identifying aspects that can be used either to actively mislead or passively confuse the creation of the superidentity.

For example, there’s been some success in matching userids from different settings (gmail, facebook, flickr) and attributing them to the same person. But surely this can only work as long as that person makes no effort to prevent it. If I want to make it hard to match up my different forms of web presence then I can choose userids that don’t associate in a natural way — but I can also create extra bogus accounts that make the matching process much harder just from a computational point of view.
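To illustrate why bogus accounts make matching harder, here is a minimal sketch. It assumes the matcher simply scores string similarity between userids, using `difflib.SequenceMatcher` from the Python standard library; real matching systems surely use richer features. The userids, the threshold of 0.7, and the function names are all invented for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity score in [0, 1] between two userids, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_matches(target, others, threshold=0.7):
    """Return the userids that look like they could belong to `target`."""
    return sorted(u for u in others if similarity(target, u) >= threshold)

# Hypothetical userids: one mail account, and accounts seen on a photo site.
mail_id = "jsmith42"
photo_site_ids = ["j.smith42", "sunsetfan", "kat_walker"]

# Bogus accounts deliberately created to resemble the real one.
decoys = ["jsmith24", "jsmith_42", "jsmyth42", "jsmith4two"]

before = candidate_matches(mail_id, photo_site_ids)
after = candidate_matches(mail_id, photo_site_ids + decoys)

print("candidates without decoys:", before)
print("candidates with decoys:   ", after)
```

Without the decoys there is a single plausible candidate; with them, the matcher has several near-identical candidates and no way, from the userid alone, to tell which one is real.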

So it may be possible to create a cloud of identifying aspects, but it seems much more difficult to find the real person within that cloud, especially if they’re trying to make themselves hard to find. The Super Identity project would no doubt respond that most people aren’t concerned about making themselves harder to identify. I doubt this; I think we’re moving to a world where obfuscation is going to be the only way to gain some privacy: a world in which the only way to dissociate ourselves from something we don’t want made public is to make the connection sufficiently doubtful that it cannot reliably be acted on. This might be called self-spamming.

For example, if a business decides to offer differential pricing to certain kinds of customers (which has already happened), then I want to be able to dissociate myself from the category that gets offered the higher price if I possibly can. If the business has too good a model of my identity, I may not be able to prevent them treating me the way they want to rather than the way I want them to. (This is, of course, why almost all data mining is, in the end, going to be adversarial.)

In the end, behavior is the best signal of identity because it’s hard for us to modify, partly because we don’t have conscious awareness of much of it, and partly because we don’t have conscious control even when we have awareness. No wonder behavior modelling is becoming a hot topic, particularly in the adversarial domain.

How do I demonstrate that I am me?

The question of identity, how the question in the title gets answered, is one with an interesting history; and one that is changing again at the moment.

For much of human history, identity was almost completely determined by the fact that a person was born and grew up in a community where everyone knew them, and never moved far from this community. This is still true in many parts of the world, but was surprisingly true in the developed world until quite recently.

Things changed when migration to cities started in a big way, in Western countries perhaps around the 16th century and accelerating since then. Someone who moved to a city could become anyone they wanted as long as they kept away from people from the same general area as they were, who might know them or know of them. This was harder than it seemed, mostly because of the tendency of people with the same origin to live contiguously when they arrived in a city (so if you were from X but didn’t live in the X area, you automatically attracted attention). This ability to assume new identities was grist to the mill of detective stories up to about 100 years ago (notably Austin Freeman).

In the last 100 years, governments have become the guarantors of identity because of the requirement to collect taxes, mostly income taxes; and, for an increasing number of people, because of the need to cross borders. So governments issue identity documents that are tied to a single person via some kind of link, perhaps a biometric or even an address. And, for most people, this is where things stand now.

But there are new forms of identity beginning to be created, and new ways to blur identities as well.

I have had a web page with my photo on it, and links to my papers, and so on, since the web began. Copies of this web page have been periodically archived, at moments that I can’t control, by the Wayback Machine and probably several other places as well. If I want to prove my identity, I can now do it without any government intervention by pointing to these copies of my web page, which have information that links them uniquely to me. For many people, their Facebook or LinkedIn profile pages would do the same thing if they were publicly archivable. So identity is once again moving away from something that is government mediated to something that is more decentralized and community based.

On the other side of the coin, governments and others are actively creating artificial personas, sometimes called sock puppets. These personas are controlled by a real person, but one person can control many of them, and the postings of each persona don’t need to be the ones that the controller would naturally make. In other words if, on the internet, nobody can tell you’re a dog, it follows that nobody can tell you’re not a construct either.

In order to make these sock puppets realistic, a back story has to be created for each one; increasingly, this means that they have to have a created trail in places where this might be looked for. Once upon a time, intelligence organizations would go into official records and create entries for non-existent people; this is inherently difficult, especially in records that are owned by other governments (remember, governments validated identities); so often identities of people who had died were used as starting points. I expect we’ll see that same thing happening in the online world.

But there’s an important difference: while governments can go back and change history embodied in records, neither they nor anyone else can change the history embedded in web sites that, at random times, take a snapshot of some part of the web. So creating realistic sock puppets is actually really difficult.

There’s also the issue of language: one controller running multiple sock puppets cannot avoid using detectably similar language patterns for all of them; and eventually this will make it possible to detect artificial personas.
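As a rough sketch of what "detectably similar language patterns" could mean in practice, the code below compares how often each persona uses a handful of function words, which writers tend to choose without thinking. This is only one simple stylometric feature; the word list, sample texts, and function names are assumptions made up for the example, and serious work would need far longer texts and many more features.

```python
import math
import re
from collections import Counter

# A tiny, hypothetical list of function words to profile.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in",
                  "that", "it", "is", "but", "so", "really"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two frequency vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical postings: personas A and B share a controller, C does not.
persona_a = "So it is really the case that the plan is good, but it is late."
persona_b = "So the idea is really good, but it is the timing that is off."
persona_c = "Of a morning, walks in parks and gardens appeal to visitors."

same_controller = cosine(profile(persona_a), profile(persona_b))
different_writer = cosine(profile(persona_a), profile(persona_c))

print(f"A vs B: {same_controller:.2f}   A vs C: {different_writer:.2f}")
```

The two personas written by the same hand score much closer to each other than either does to the third, even though their subject matter differs.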

More on Identity

I’ve mentioned the problem of figuring out when data records describe the same person in the two most recent posts. Casinos are required to ban certain people who have identified themselves as having a gambling problem, so they have to look carefully at everyone who books a room. They also, of course, have an interest in noticing when certain other people show up, for example card counters.

As I said yesterday, identity is a slippery thing to manage algorithmically. It’s only in the last century that governments have gotten into the act of certifying identity, via various forms of government-issued identification, going back to birth certificates.

Such documents are not necessarily very reliable. There’s a long history of forging them. But mostly identity gets fudged because people don’t use them directly — they copy names and addresses with characteristic human errors; and this process can be helped along by those who want to hide their identity. It’s socially acceptable to use variant names, and people constantly make mistakes with numbers. Those who want to can use these deniable mistakes to create multiple versions of their identities.

This is partly why there’s such an interest in biometrics. A biometric is an identity key that was given to you by God. The important distinction in biometrics is between a digital biometric and a non-digital one. A photo in a passport is a non-digital biometric — it can be used to associate the passport, and so its contents, with you, but doesn’t do much else. A digital biometric, such as a digitized photo, can act as a key to a large database of information about you.

Most biometrics are extremely easy to fool. You can read about some of the easy tricks here. Fingerprint scanners can be fooled by plastic wrap; iris scanners by printed photos of an iris.

In relationship/graph data, the problem with multiple records describing the same person is that they blur the structure of the connections around that person — making some paths seem longer, and some properties more diffuse. That’s why it’s important to be able to resolve identities when possible; but also why it’s important to stay agnostic over the long haul.
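A small sketch shows the path-lengthening effect. It assumes an undirected graph given as a list of edges, and a hand-made alias table standing in for whatever identity resolution step is actually used; the names and functions are invented for illustration.

```python
from collections import deque

def shortest_path_length(edges, start, goal):
    """Breadth-first search over an undirected edge list; None if unreachable."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def merge_aliases(edges, aliases):
    """Rewrite each endpoint to its canonical name where an alias is known."""
    return [(aliases.get(a, a), aliases.get(b, b)) for a, b in edges]

# Hypothetical records: "J. Smith" and "John Smith" are the same person.
edges = [
    ("alice", "J. Smith"),
    ("John Smith", "bob"),
    ("alice", "carol"),
    ("carol", "dave"),
    ("dave", "bob"),
]
aliases = {"J. Smith": "John Smith"}

blurred = shortest_path_length(edges, "alice", "bob")
resolved = shortest_path_length(merge_aliases(edges, aliases), "alice", "bob")

print("alice to bob, records unresolved:", blurred)
print("alice to bob, records resolved:  ", resolved)
```

With the two records left separate, the short route through Smith is invisible and the only path runs the long way round; once they are merged, the shorter connection appears. Merging wrongly would, of course, create a path that does not exist, which is the argument for staying agnostic.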