Posts Tagged 'security'

Backdoors to encryption — 100 years of experience

The question of whether those who encrypt data, at rest or in flight, should be required to provide a master decryption key to government or law enforcement is back in the news, as it is periodically.

Many have made the obvious arguments about why this is a bad idea, and I won’t repeat them.

But let me point out that we’ve been here before, in a slightly different context. A hundred years ago, law enforcement came up against the fact that criminals knew things that could (a) be used to identify other criminals, and (b) prevent other crimes. This knowledge was inside their heads, rather than inside their cell phones.

Then, as now, it seemed obvious that law enforcement and government should be able to extract that knowledge, and interrogation with violence or torture was the result.

Eventually we reached (in Western countries, at least) an agreement that, although there could be a benefit to the knowledge in criminals’ heads, there was a point beyond which we weren’t going to go to extract it, despite its potential value.

The same principle surely applies when the knowledge is on a device rather than in a head. At some point, law enforcement must realise that not all knowledge is extractable.

(Incidentally, one of the arguments made about the use of violence and torture is that the knowledge extracted is often valueless, since the target will say anything to make it stop. It isn’t hard to see that devices could be made to use a similar strategy. They would have a PIN or password that could be used under coercion and that would appear to unlock the device, but would in fact give access only to a virtual subdevice which seemed innocuous. Especially as Customs in several countries are now demanding PINs and passwords as a condition of entry, such devices would be useful for innocent travellers as well as guilty ones — to protect commercial and diplomatic secrets, for a start.)
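A minimal sketch of such a duress scheme, in Python (purely illustrative: the credentials, volume names, and bare hashing here are invented stand-ins, and real deniable-encryption systems are far more involved):

```python
import hashlib
from typing import Optional

# Each credential maps to a different "volume". The duress PIN opens an
# innocuous decoy, so nothing about the unlock flow reveals that a real
# volume exists. (Illustrative only: a real system would use salted key
# derivation and hidden-volume encryption, not a plain hash lookup.)
VOLUMES = {
    hashlib.sha256(b"real-passphrase").hexdigest(): "real_volume",
    hashlib.sha256(b"duress-1234").hexdigest(): "decoy_volume",
}

def unlock(credential: str) -> Optional[str]:
    """Return the name of the volume this credential opens, or None."""
    return VOLUMES.get(hashlib.sha256(credential.encode()).hexdigest())
```

The key property is that the coerced party behaves identically in both cases: the device unlocks either way, and only the contents differ.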

‘AI’ performance not what it seems

As I’ve written about before, ‘AI’ tends to be misused to refer to almost any kind of data analytics or derived tool — but let’s, for the time being, go along with this definition.

When you look at the performance of these tools and systems, it’s often quite poor, but I claim we’re getting fooled by our own cognitive biases into thinking that it’s much better than it is.

Here are some examples:

  • Netflix’s recommendations for any individual user seem to overlap 90% with the ‘What’s trending’ and ‘What’s new’ categories. In other words, Netflix is recommending to you more or less what it’s recommending to everyone else. Other recommendation systems don’t do much better (see my earlier post on ‘The Sound of Music Problem’ for part of the explanation).
  • Google search results are quite good at returning, in the first few links, something relevant to the search query, but we don’t ever get to see what was missed and might have been much more relevant.
  • Google News produces what, at first glance, appear to be quite reasonable summaries of recent relevant news, but when you use it for a while you start to see how shallow its selection algorithm is — putting stale stories front and centre, and occasionally producing real howlers, weird stories from some tiny venue treated as if they were breaking and critical news.
  • Self-driving cars that perform well, but fail completely when they see certain patches on the road surface. Similarly, facial recognition systems that fail when the person is wearing a t-shirt with a particular patch printed on it.

The commonality between these examples, and many others, is that the assessment from use is, necessarily, one-sided — we get to see only the successes and not the failures. In other words (HT Donald Rumsfeld), we don’t see the unknown unknowns. As a result, we don’t really know how well these ‘AI’ systems actually perform, or whether it’s safe to deploy them.

Some systems are ‘best efforts’ (Google News) and that’s fair enough.

But many of these systems are beginning to be used in consequential ways and, for that, real testing and real public test results are needed. And not just true positives, but false positives and false negatives as well. There are two main flashpoints where this matters: (1) systems that are starting to do away with the human in the loop (self-driving cars, 737 MAX aircraft); and (2) systems where humans are likely to say or think ‘The computer (or worse, the AI) can’t be wrong’, and these are starting to include policing and security tools. Consider, for example, China’s social credit system. The fact that it gives low scores to some identified ‘trouble makers’ does not imply that everyone who gets a low score is a trouble maker — but this false implication lies behind this, and almost all, discussion of ‘AI’ systems.
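The false implication can be made concrete with Bayes’ rule. Even a very accurate flagging system, applied to a rare category, produces mostly false positives (the numbers below are invented for illustration, not taken from any real system):

```python
# Probability that someone flagged by the system really belongs to the
# flagged category, given the system's accuracy and the base rate.
def precision(sensitivity, specificity, base_rate):
    """P(actually a 'trouble maker' | flagged), by Bayes' rule."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

# A system that catches 99% of true cases and wrongly flags only 1% of
# everyone else, applied to a population where 1 in 1000 is a true case:
p = precision(0.99, 0.99, 0.001)   # roughly 0.09
```

So about nine flagged people in ten are innocent, even with a 99%-accurate system — which is exactly why a low score does not imply a trouble maker.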

China-Huawei-Canada fail

Huawei has been trying to convince the world that they are a private company with no covert relationships to the Chinese government that might compromise the security of their products and installations.

This attempt has been torpedoed by the Chinese ambassador to Canada who today threatened ‘retaliation’ if Canada joins three of the Five Eyes countries (and a number of others) in banning Huawei from provisioning 5G networks. (The U.K. hasn’t banned Huawei equipment, but BT is uninstalling it, and the unit set up jointly by Huawei and GCHQ to try to alleviate concerns about Huawei’s hardware and software has recently reported that it’s less certain about the security of these systems now than it was when the process started.)

It’s one thing for a government to act as a booster for national industries — it’s another to deploy government force directly.

China seems to have a tin ear for the way that the rest of the world does business; it can’t help but hurt them eventually.

Lessons from Wannacrypt and its cousins

Now that the dust has settled a bit, we can look at the Wannacrypt ransomware, and the other malware exploiting the same vulnerability, more objectively.

First, the reason this attack vector existed is that Microsoft, a long time ago, made a mistake in a file-sharing protocol. It was (apparently) exploited by the NSA, and then by others with less good intentions, but the vulnerability is all down to Microsoft.

There are three pools of vulnerable computers that played a role in spreading the Wannacrypt worm, as well as falling victim to it.

  1. Enterprise computers which were not being updated in a timely way because it was too complicated to maintain all of their other software systems at the same time. When Microsoft issues a patch, bad actors immediately try to reverse engineer it to work out what vulnerability it addresses. The last time I heard someone from Microsoft Security talk about this, they estimated it took about 3 days for this to happen. If you hadn’t updated in that time, you were vulnerable to an attack that the patch would have prevented. Many businesses evaluated the risk of updating in a timely way as greater than the risk of disruption because of an interaction of the patch with their running systems — but they may now have to re-evaluate that calculus!
  2. Computers running XP for perfectly rational reasons. Microsoft stopped supporting XP because they wanted people to buy new versions of their operating system (and often new hardware to be able to run it), but there are many, many people in the world for whom a computer running XP was a perfectly serviceable product, and who will continue to run it as long as their hardware keeps working. The software industry continues to get away with failing to warrant their products as fit for purpose, but it wouldn’t work in other industries. Imagine the discovery that the locks on a car stopped working after 5 years — could a manufacturer get away with claiming that the car was no longer supported? (Microsoft did, in this instance, release a patch for XP, but well after the fact.)
  3. Computers running unregistered versions of Microsoft operating systems (which therefore do not get updates). Here Microsoft is culpable for an opposite reason. People can run an unregistered version for years and years, provided they’re willing to re-install it periodically. It’s technically possible to prevent this kind of serial illegality, or at least make it much more difficult.

The analogy is with public health. When there’s a large pool of unvaccinated people, the risk to everyone increases. Microsoft’s business decisions make the pool of ‘unvaccinated’ computers much larger than it needs to be. And while this pool is out there, there will always be bad actors who can find a use for the computers it contains.
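The public-health analogy can even be made loosely quantitative, borrowing the epidemiologist’s basic reproduction number. A worm dies out when each infected machine infects fewer than one other on average, and the size of the ‘unvaccinated’ pool feeds directly into that number (all figures here are invented, purely for illustration):

```python
# The worm analogue of the epidemiological R0: expected new infections
# caused by one infected host. An outbreak is self-sustaining only when
# this exceeds 1, so shrinking the unpatched pool protects everyone.
def reproduction_number(successful_probes_per_host, fraction_vulnerable):
    """Expected new infections per infected host."""
    return successful_probes_per_host * fraction_vulnerable
```

With, say, 50 successful probes per infected machine, a 1% unpatched pool gives a dying outbreak (0.5) while a 5% pool gives a self-sustaining one (2.5).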

Advances in Social Network Analysis and Mining Conference — Sydney

This conference will be in Sydney in 2017, from 31st July to 3rd August.

As well as the main conference, there is also a workshop, FOSINT: Foundations of Open Source Intelligence, which may be of even more direct interest for readers of this blog.

Also I will be giving a tutorial on Adversarial Analytics as part of the conference.

Even more security theatre

I happened to visit a consulate to do some routine paperwork. Here’s the security process I encountered:

  1. Get identity checked from passport, details entered (laboriously) into online system.
  2. Cell phone locked away.
  3. Wanded by metal detection wand.
  4. Sent by secure elevator to another floor, to a waiting room with staff behind bullet-proof glass.

Here’s the thing: I got to carry my (unexamined) backpack with me through the whole process!

And what’s the threat from a cell phone in this context? Embarrassing pictures of the five year old posters on the wall of the waiting room?

I understand that government departments have difficulty separating serious from trivial risks, because if anything happened they would be blamed, regardless of how low-probability the risk was. But there’s no political reason not to make whatever precautions you do take actually effective against the risks they’re supposed to address.

“But I don’t have anything to hide” Part III

I haven’t been able to verify it, but Marc Goodman mentions (in an interview with Tim Ferriss) that the Mumbai terrorists searched the online records of hostages when they were deciding who to kill. Another reason not to be profligate about what you post on social media.

Government signals intelligence versus multinationals

In all of the discussion about the extent to which the U.S. NSA is collecting and analyzing data, the role of the private sector in similar analysis has been strangely neglected.

Observe, first, that none of the organizations asked to provide data to the NSA had to do anything special to do so. Verizon, the proximate example, was required to provide, for every phone call, the originating and destination numbers, the time, the duration, and the cell tower(s) involved for mobile calls — and all of this information was already being collected. Why would they collect it, if not to have it available for their own analysis? It isn’t for billing — part of the push towards envelope pricing plans was to save the cost of producing detailed bills, which was often greater than the cost of completing the calls themselves.

Second, government signals intelligence is constrained in the kind of data they are permitted to collect: traffic analysis (metadata) for everyone, but content only for foreign nationals and those specifically permitted by warrants for cause. Multinationals, on the other hand, can collect content for everyone. If you have a gmail account (I don’t), then Google not only sees all of your email traffic, but also sees and analyzes the content of every email you send and receive. If you send an email to someone with a gmail account, the content of that email is also analyzed. Of course, Google is only one of the players; many other companies have access to emails, other online communications (IM, Skype), and search histories, including which link(s) in the search results you actually follow.

A common response to these differences is something like “Well, I trust large multinationals, but I don’t trust my government”. I don’t really understand this argument; multinationals are driven primarily (perhaps only) by the need for profits. Even when they say that they will behave well, they are unable to keep that promise. A public company cannot refrain from taking actions that will produce greater profits, since its interests are the interests of its shareholders. And, however well meaning, when a company is headed for bankruptcy and one of its valuable assets is data and models about millions of people, it’s naive to believe that the value of that asset won’t be realized.

Another popular response is “Well, governments have the power of arrest, while the effect of a multinational is limited to the commercial sphere”. That’s true, but in Western democracies at least it’s hard for governments to exert their power without inviting scrutiny from the judicial system; at least there are checks and balances. If a multinational decides to exert its power, there is much less transparency and almost no mechanism for redress. For example, a search engine company can downweight my web site in results (this has already been done) and drive me out of business; an email company can lose all of my emails or pass their content to my competitors. I don’t lose my life or my freedom, but I could lose my livelihood.

A third popular response is “Well, multinationals are building models of me so that they can sell me things that are better aligned with my interests”. This is, at best, a half-truth. The reason they want a model of you is so that they can try to sell you things you might be persuaded to buy, not things that you should or want to buy. In other words, the purpose of targeted advertising is at least to get you to buy more than you otherwise would, and to buy the highest-profit-margin version of things you might actually want to buy. Your interests and the interests of advertisers are only partially aligned, even when they have built a completely accurate model of you.

Sophisticated modelling from data has its risks, and we’re still struggling to understand the tradeoffs between power and consequences and between cost and effectiveness. But, at this moment, the risks seem to me greater from multinational data analysis than from government data analysis.

Language learning as a model of radicalisation

The Canadian Prime Minister said today, in response to the arrests for the planned Via Rail attacks, and perhaps to the Boston Marathon bombings as well, that these are not a reason to “commit sociology”. I think he’s exactly right. As I said in the previous post, I’m dubious that levels of dissatisfaction with societies, or even with religions, play a major role in radicalisation — it’s a much more individual-specific process. This is why only a tiny fraction of people in exactly the same social, religious, and even family setting become radicalised.

I’m also deeply skeptical that anyone becomes radicalised via the Internet. Our survey results indicated that variations in access to the Internet, or to mass media channels with a frankly jihadist orientation, have no correlation with attitudes on radicalisation-relevant subjects or with dissatisfaction of any kind. I’m convinced that it always takes contact with a person, perhaps only one and perhaps only once, for radicalisation to happen.

Here’s where the analogy with language learning comes in. I learned French (in Australia) the same way I learned Latin (declensions, conjugations, agreement). I read French well and could speak it after a fashion. But the first time I heard French radio and then met people who actually spoke French, there was a kind of click in my brain and something changed about the way I used and learned French. I don’t think this is just autobiography; as I mentioned in the last post, learning languages via TV programs doesn’t work nearly as well as you might expect it to.

I’m fairly convinced something similar happens with radicalisation. An individual can watch the videos, talk the talk, fantasise the actions, but unless/until they make contact with someone who has actually done something, there isn’t any danger. Once this happens, of course, radicalisation can proceed very quickly indeed, which explains (I guess) the several cases where apparent changes have been very swift.

Inspire Magazine Issue 10

The tenth issue of this al Qaeda in the Arabian Peninsula magazine is out. Continuing the textual analysis I’ve done on the earlier issues, I can conclude two things:

  1. Issue 10 wasn’t written by whoever wrote Issue 9 (nor by those who wrote the previous issues, since they’re dead). In almost every respect the language resembles that of earlier issues, and is bland with respect to almost every word category. Except …
  2. The intensity of Jihadist language, which has been steadily increasing over the series, decreases sharply in Issue 10. Whoever the new editors/authors are, their hearts are not in it as much as the previous ones.
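For the curious, the flavour of word-category scoring behind conclusions like these can be sketched as follows. The lexicon here is a tiny invented stand-in, not the categories actually used in the analysis:

```python
# Score a text against a word-category lexicon: count hits, normalised
# per 1000 words so that issues of different lengths are comparable.
# (Illustrative stand-in lexicon; real category lists are much larger.)
JIHADIST_TERMS = {"jihad", "martyr", "kuffar", "mujahideen"}

def category_intensity(text, lexicon):
    """Lexicon hits per 1000 words of text."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,;:!?") in lexicon)
    return 1000.0 * hits / len(words)
```

Tracking a score like this across issues is what makes a sharp drop in intensity, against otherwise bland language, visible.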

Understanding High-Dimensional Spaces

My new book with the title above has been published by Springer, just in time for Christmas gift giving for the data miner on your list.

The book explores how to represent high-dimensional data (which almost all data is), and how to understand the models, particularly for problems where the goal is to find the most interesting subset of the records. “Interesting”, of course, means different things in different settings; a big part of the focus is on finding outliers and anomalies.

Partly the book is a reaction to the often unwitting assumption that clouds of data can be understood as if they had a single centre — for example, much of the work on social networks.

The most important technical ideas are (a) that clusters themselves need to be understood as having a structure which provides each one with a higher-level context that is usually important for making sense of them, and (b) that the empty space between clusters also provides information that can help to understand the non-empty space.

You can buy the book here.

Super Identities

I heard a talk on the UK Super Identity Project last week which stimulated some musings on this important topic.

Once upon a time, almost everyone lived in villages, and identity was not an issue — everyone you knew also knew you and many of them had done so since you were born. So identity issues hardly arose, apart from an occasional baby substitution (but note Solomon in 1 Kings 3:16-28 for an early identity issue). As rich people began to travel, new forms of identity evidence such as passports and letters of introduction were developed.

About a hundred years ago, and as the result of mass movement to cities, questions of identity became common. You can see from the detective stories of the time how easy it was to assume another identity, and how difficult it was to verify one, much as it is in cyberspace today. To deal with these issues, governments became involved as the primary definers of identity, getting in on the act with birth certificates (before that, e.g. baptismal records), and then providing a continuous record throughout life.

In parallel, there’s the development of biometric identifiers, mostly to deal with law enforcement: first the Bertillon system and then fingerprints (although, as I’ve noted here before, one of the first detective stories to include fingerprints — The Red Thumb Mark — is about how easy they are to forge).

The Super Identity project is trying to fuse a set of weak identifiers into a single identity with some reliability. Identities are important for three main reasons: (a) trust, for example so that I can assume that someone I’m interacting with online is the person I think it is; (b) monetizing, for example so that an advertiser can be sure that a customized ad is being sent to the right person; and (c) law enforcement and intelligence, for example establishing that several identities are actually the same underlying person.

There are many identifying aspects, almost all of which are bound to a particular individual in a weak way. They come in four main categories:

  1. Physical identifiers such as an address, or a place of employment.
  2. Biometrics (really a subset of the physical) such as fingerprints, iris patterns, voice and so on. These at first glance seem to be rather strongly bound to individuals, but all is not as it appears and they can often be forged in practice, if not in theory. There is an important subset of biometrics that are often forgotten, those that arise from subconscious processes; these include language use, and certain kinds of tics and habits. They are, in many ways, more reliable than more physical biometrics because they tend to be hidden from us, and so are harder to control.
  3. Online identifiers such as email addresses, social network presence, web pages, which are directly connected to individuals. Equally important are the indirect online identifiers that appear as an (often invisible) side-effect of online activity such as location.
  4. Identifiers associated with accessing the online world, that is, identifiers associated with bridging from the real world to the online world. These include IP addresses, beloved by governments despite their weakness; that weakness recently led to a police raid, complete with stun grenades, on an innocent house.

The problem with trying to fuse these weak identifying aspects into a single superidentity which can be robustly associated with an individual is this: it’s relatively difficult to avoid creating these identifying aspects, but it’s relatively easy to create more identifying aspects that can be used either to actively mislead or passively confuse the creation of the superidentity.

For example, there’s been some success in matching userids from different settings (gmail, facebook, flickr) and attributing them to the same person. But surely this can only work as long as that person makes no effort to prevent it. If I want to make it hard to match up my different forms of web presence then I can choose userids that don’t associate in a natural way — but I can also create extra bogus accounts that make the matching process much harder just from a computational point of view.

So it may be possible to create a cloud of identifying aspects, but it seems much more difficult to find the real person within that cloud, especially if they’re trying to make themselves hard to find. The Super Identity project would no doubt respond that most people aren’t trying to make themselves harder to identify. I doubt this; I think we’re moving to a world where obfuscation is going to be the only way to gain some privacy — a world in which the only way to dissociate ourselves from something we don’t want made public is to make the connection sufficiently doubtful that it cannot reliably be acted on. This might be called self-spamming.

For example, if a business decides to offer differential pricing to certain kinds of customers (which has already happened), then I want to be able to dissociate myself from the category that gets offered the higher price if I possibly can. If the business has too good a model of my identity, I may not be able to prevent them treating me the way they want to rather than the way I want them to. (This is, of course, why almost all data mining is, in the end, going to be adversarial.)

In the end, behavior is the best signal of identity because it’s hard for us to modify, partly because we don’t have conscious awareness of much of it, and partly because we don’t have conscious control even when we have awareness. No wonder behavior modelling is becoming a hot topic, particularly in the adversarial domain.

Finally — the end of the Castle Model of cybersecurity?

The Castle model is the way that cybersecurity has been done for the last 20 years. The idea is to build security that keeps bad guys out of your system — you can tell what the metaphor is by the names that are used: INTRUSION detection, fireWALL. Of course, this isn’t the whole story; people have been accustomed to having to do antivirus scans and (less likely) anti-malware scans, but the idea of perimeter defence is deeply ingrained.

We don’t even behave in the real world that way. If you owned a castle with thick walls and the drawbridge was up, you might still raise an eyebrow at a bunch of marauders wandering around inside looting and pillaging. But in the online world, we’re all too likely to let anyone who can get past the perimeter do pretty much anything they want. And, by the way, insiders are already inside the perimeter which is why they are such a large threat.

The credit card hack at Global Payments, made (finally) public last week, is a good example. First, the PCI DSS, which defines the standards for credit card processing security, only mandates that user data should be “protected” but doesn’t say how. Commentators on this incident have assumed that the data held by Global Payments was all encrypted, but there’s nothing in the requirements that says it has to be, so perhaps it wasn’t. But Global Payments clearly also didn’t have the right kind of sanity checks on exfiltration of data. Even if the hack came through an account belonging to someone who had a legitimate need to look at transactions, surely there should have been controls to limit such access to one day’s worth, or a few thousand, or something like that. Exporting 1.5 million transactions should surely have required some extra levels of authentication and the involvement of an actual person at Global Payments. But the bigger issue is that the PCI DSS doesn’t mandate any “inside the gates” security measures.

So what’s the alternative to the castle model? We are still thinking this through, but it must involve controls on who can do what inside the system (as we usually do in even moderately secure real-world settings), controls on exfiltration of data (downloading, copying to portable devices, outgoing email), and especially on the size of outgoing data, and better logging and internal observation (real-world buildings have a night watchman to limit what can be done in the quiet times).
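One such “inside the gates” control can be sketched very simply: cap how much any single account can export per day, and require escalation beyond the cap. The limit and the escalation outcome below are illustrative placeholders, not a real product’s API:

```python
from collections import defaultdict

# Sketch of an exfiltration control: track records exported per account
# per day, and force extra authentication past a cap. The cap of 5000 is
# an invented example ('one day's worth, or a few thousand').
DAILY_EXPORT_LIMIT = 5000

class ExfiltrationGuard:
    def __init__(self, limit=DAILY_EXPORT_LIMIT):
        self.limit = limit
        self.exported = defaultdict(int)   # (account, day) -> records so far

    def request_export(self, account, day, n_records):
        """Allow the export, or flag it for human sign-off past the cap."""
        if self.exported[(account, day)] + n_records > self.limit:
            return "escalate"              # extra authentication + a person
        self.exported[(account, day)] += n_records
        return "allow"
```

With a control this crude, exporting 1.5 million transactions through one account would have triggered escalation three hundred times over.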

Even the U.S. military, whose network is air-gapped from the internet, admits that penetration of their networks is so complete that it’s pointless to concentrate on defending their network’s borders and more important to focus on controlling access to the data held within these networks (BBC story).

It’s time for a change of metaphor in cybersecurity — the drawbridge is down whether we like it or not, and so we need to patrol the corridors and watch for people carrying suspiciously large bags of swag.

European Intelligence and Security Informatics conference

The program is now available here and looks impressive (note also the associated Open Source Intelligence workshop in which one of my students has a paper about our work on interestingness).

Low Hanging Fruit in Cybersecurity III

Any attempt to decide whether a particular action is “bad” or “good” requires some model of what “good” actually means. The only basis for intelligent action in almost any setting is to be able to have a plan for the expected, but also a mechanism for noticing the unexpected — to which some kind of meta-planning can be attached. This is, of course, a crucial part of how we function as humans; we don’t hang as software often does, because if we encounter the unexpected, we do something about it. (Indeed, an argument along this line has been used by J.R. Lucas to argue that the human mind is not a Turing machine.)

But most cybersecurity applications do not try (much) to build a model of what “good” or “expected” or “normal” should be like. Granted, this can be difficult; but I can’t help but think that often it’s not as difficult as it looks at first. Partly this is because of the statistical distribution that I discussed in my last post — although, on the internet, lots of things could happen, most of them are extremely unlikely. It may be too draconian to disallow them, but it seems right to be suspicious of them.

Actually, three different kinds of models of what should happen are needed. These are:

  1. A model of what “normal” input should look like. For example, for an intrusion detection system, this might be IP addresses and port numbers; for a user-behavioral system, this might be executables and times of day.
  2. A model of what “normal” transformations look like. Inputs arriving in the system lead to consequent actions. There should be a model of how these downstream actions depend on the system inputs.
  3. A model of what “normal” rates of change look like. For example, I may go to a web site in a domain I’ve never visited before; but over the course of different time periods (minutes, hours, days) the rate at which I encounter brand new web sites exhibits characteristic patterns.

An exception to the first model shows that something new is happening in the “outside” world — it’s a signal of novelty. An exception to the second model shows that the system’s model of activity is not rich enough — it’s a signal of interestingness. An exception to the third model shows that the environment is changing.

Activity that does not fit with any one of these models should not necessarily cause the actions to be refused or to sound alarms — but it does provide a hook to which a meta-level of analysis can be attached, using more sophisticated models with new possibilities that are practical only because they don’t get invoked very often.
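As an illustration of the third kind of model, here is a sketch of tracking the rate at which never-before-seen items (say, brand new web domains) appear per time window; the threshold is an invented placeholder for whatever characteristic pattern is learned from history:

```python
# Sketch of a rate-of-change model: remember everything seen so far, count
# how many items in each new window are genuinely novel, and flag windows
# whose novelty rate departs from the expected range. The fixed threshold
# stands in for a learned characteristic pattern.
class NoveltyRateModel:
    def __init__(self, max_new_per_window=10):
        self.seen = set()
        self.max_new = max_new_per_window

    def observe_window(self, items):
        """Return (number of novel items, whether the rate is anomalous)."""
        new = [x for x in items if x not in self.seen]
        self.seen.update(new)
        return len(new), len(new) > self.max_new
```

An anomaly here doesn’t refuse anything by itself; it is exactly the hook for the meta-level analysis described above.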

Again, think of the human analogy. We spend a great deal of our time running on autopilot/habit. This saves us cognitive effort for things that don’t need much. But, when anything unusual happens, we can quickly snap into a new mode where we can make different kinds of decisions as needed. This isn’t a single two-level hierarchy — in driving, for example, we typically have quite a sophisticated set of layers of attention, and move quickly to more attentive states as conditions require.

Cybersecurity systems would, it seems to me, work much more effectively if they used the combination of models of expected/normal behavior, organized in hierarchies, as their building blocks.

Low Hanging Fruit in Cybersecurity II

If cybersecurity exists to stop bad things happening in computing systems, then it seems to me that there are several implicit assumptions that underlie many approaches and techniques that might not be completely helpful. These are:

  • The distinction between “good” (or “allowable”) and “bad” is a binary distinction;
  • The decision about this distinction has to be made monolithically in a single step;
  • The distribution of likely things that could happen is uniform (flat).

Even to write them explicitly shows that they can’t quite be right, but nevertheless I suspect they exist, unexamined, in the design of many security systems.

What happens if we remove these assumptions?

If the distinction between “good” and “bad” is not discrete, then our systems instead allocate some kind of continuous risk or suspicion to actions. This creates an interesting new possibility — the decision about what to do about an action can now be decoupled from how the action is categorized. This is not even a possibility if the only distinction we recognize is binary.

From a purely technical point of view, this means that many different kinds of risk measuring algorithms can be developed and used orthogonally to decisions about what the outputs of these algorithms means. Critical boundaries can be determined after the set of risks has been calculated, and may even be derived from the distribution of such risks. For example, bad things are (almost always) rare, so a list of actions ordered by risk will normally have a bulge of “normal” actions and then a small number of anomalous actions. The boundary could be placed at the edge of the bulge.
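One robust way to place the boundary at the edge of the bulge, rather than at a preset value, is median plus a multiple of the median absolute deviation; the choice of multiplier here is illustrative:

```python
import statistics

# Derive the decision boundary from the distribution of risk scores
# itself: the bulk of 'normal' scores sits near the median, so a cutoff
# at median + k*MAD marks the edge of the bulge. k=3 is an illustrative
# choice, not a recommendation.
def risk_threshold(scores, k=3.0):
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    return med + k * mad
```

Because the median and MAD are barely affected by the rare bad actions themselves, the boundary tracks the normal bulge rather than being dragged around by outliers.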

Second, what if the decision about whether to allow an action doesn’t have to be made all at once? Then systems can have defence in depth. The first, outer, layer can assess the risk of a new action and decide whether or not to allow it. But it can be forgiving of potentially risky actions if there are further layers of categorization and defence to follow. What it can do is disallow the clearly and definitively bad things, reducing the number of potentially bad things that have to be considered at later stages.

From a technical point of view, this means that weaker but cheaper algorithms can be used on the front lines of defence, with more effective but more expensive algorithms available for later stages (where they work with less data, and so do not cost as much overall, despite being more expensive per instance).
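Here is a minimal sketch of such a layered defence. Both scoring functions and the thresholds are invented for illustration; the point is only the control flow — the cheap front line disposes of the clear cases, and only the residue pays for the expensive model.

```python
def cheap_score(action):
    # Weak but fast: a single rule-of-thumb feature (invented).
    return 0.9 if action.get("known_bad_signature") else action.get("rarity", 0.0)

def expensive_score(action):
    # Stand-in for a costlier model (behavioural analysis, sandboxing, ...).
    return 0.7 * action.get("rarity", 0.0) + 0.3 * action.get("privilege", 0.0)

def decide(action, low=0.2, high=0.8):
    r = cheap_score(action)
    if r >= high:
        return "deny"    # clearly and definitively bad: stop at layer one
    if r <= low:
        return "allow"   # clearly routine: stop at layer one
    # Middle ground: only these actions reach the expensive second layer.
    return "deny" if expensive_score(action) >= 0.5 else "allow"

decide({"known_bad_signature": True})       # denied at the first layer
decide({"rarity": 0.1})                     # allowed at the first layer
decide({"rarity": 0.6, "privilege": 0.9})   # escalated to the second layer
```

Because most actions are routine, the expensive function runs on only a small fraction of the traffic, which is why it can afford to be expensive.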

Third, what if our defence took into account that the landscape of expected actions is not uniform, so that low-probability events were automatically treated as more suspicious? For example, spam filtering does lots of clever things, but it doesn’t build a model of the sources of my email and flag emails from countries that I’ve never, ever received email from as inherently more likely to be spam. (Yes, I know that sender addresses can be spoofed.)
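A rarity-based prior of this kind is easy to sketch. The sender-country counts below are invented; the idea is just that mail from a source I have rarely or never seen starts out more suspicious.

```python
from collections import Counter

# Hypothetical history of sender countries for one person's inbox.
country_counts = Counter({"CA": 420, "US": 350, "UK": 90, "DE": 12})
total = sum(country_counts.values())

def spam_prior(country, smoothing=1):
    """Rarity-based prior suspicion: smoothed so that a never-seen
    country gets a small but nonzero probability, hence suspicion < 1."""
    seen = country_counts.get(country, 0)
    p = (seen + smoothing) / (total + smoothing * (len(country_counts) + 1))
    return 1 - p   # rarer source -> higher prior suspicion

spam_prior("CA")   # low suspicion: most of my mail comes from here
spam_prior("XX")   # never seen before: suspicion close to 1
```

This prior would then be combined with the content-based evidence the filter already computes, not used on its own.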

This idea has been used in behavioral profiling of computer activity, and it sort of works. But it needs to be combined with the ideas above, so that actions can be rated along a continuum: routine (allow); unusual, but still not that unusual (allow, but perhaps with a user question, or at least logged for occasional inspection); very unusual (require the user to explicitly allow); bizarre (disallow). Windows has a weak version of this, which hasn’t been well accepted by users, but it flags only one thing (program start) and it doesn’t build a model of each user’s typical behavior.
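The graded continuum might look like this; the thresholds and tier names are purely illustrative, and in a real system they would be derived from the observed distribution of risks rather than fixed.

```python
def action_policy(risk):
    """Map a continuous risk score onto a graded response, rather than
    a single binary allow/deny decision. Thresholds are invented."""
    if risk < 0.3:
        return "allow"            # routine
    if risk < 0.6:
        return "allow-and-log"    # unusual, but not that unusual
    if risk < 0.85:
        return "ask-user"         # very unusual: explicit approval needed
    return "deny"                 # bizarre
```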

For example, the set of IP addresses with which my computer interacts is quite large, and hard to represent by some kind of convex structure, so intrusion detection doesn’t work very well if it depends on wrapping/categorising those IP addresses that are OK, and blocking traffic from those that are not. And usually the set of OK IP addresses is not derived from those I interact with, but encoded in some set of rules that apply to many computers. But if instead I built a model of the IP addresses I interact with, allowing older ones to get stale and disappear, and then looked at new IP addresses and allowed them if they resembled (tricky) those I already interact with, and asked me about the others, then this might work better than current approaches. An IP address is a hierarchical structure, with a possible country followed by the top octet, and so on, so I can discriminate quite finely about what it might mean. Even a web server that is theoretically visible to every other IP address could still benefit from handling unlikely source IP addresses differently.
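A toy version of this idea, assuming IPv4 and an invented decay and scoring scheme, might look like the following: observed addresses are recorded at every level of the prefix hierarchy, old observations decay away, and a new address is scored by the longest familiar prefix it shares with past traffic.

```python
import ipaddress
from collections import defaultdict

class IPModel:
    """Toy model of 'addresses I interact with' (IPv4 only, for simplicity)."""

    def __init__(self, decay=0.99):
        self.weights = defaultdict(float)   # prefix value -> weight
        self.decay = decay

    def prefixes(self, ip):
        # The hierarchical structure of an address: /8, /16, /24, /32.
        packed = int(ipaddress.ip_address(ip))
        return [packed >> shift for shift in (24, 16, 8, 0)]

    def observe(self, ip):
        for key in self.weights:
            self.weights[key] *= self.decay   # stale prefixes fade away
        for p in self.prefixes(ip):
            self.weights[p] += 1.0

    def familiarity(self, ip):
        # Score by the deepest prefix level that is still well supported.
        score = 0.0
        for depth, p in enumerate(self.prefixes(ip), start=1):
            if self.weights.get(p, 0.0) > 0.5:
                score = depth / 4
        return score   # 0.0 = completely novel, 1.0 = seen exactly

model = IPModel()
for _ in range(3):
    model.observe("192.168.1.10")
model.familiarity("192.168.1.10")   # exact address seen before: 1.0
model.familiarity("192.168.1.99")   # same /24: fairly familiar
model.familiarity("10.0.0.1")       # nothing in common: novel
```

A policy like the graded one above could then allow familiar addresses, log or query moderately familiar ones, and treat completely novel ones with the most suspicion.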

OK, maybe this isn’t exactly low hanging fruit, but the ideas are straightforward and (IMHO) should be built into the design of more robust systems.

Call for Papers: Link Analysis, Counterterrorism and Security

The Call for the LACTS 2009 workshop is now available here.

The workshop takes place at the SIAM Data Mining Conference and brings together academics, practitioners, law enforcement, and intelligence people to talk about leading-edge work in the area of adversarial data analysis.

The workshop is intended primarily for early-stage work. The proceedings are published electronically, but authors may retain copyright.

The deadline for submissions is probably late December, but perhaps a little later (still being decided).

Knowledge Discovery for Counterterrorism and Law Enforcement

My new book, Knowledge Discovery for Counterterrorism and Law Enforcement, is out. You can buy a copy from:

The publisher’s website


(Despite what these pages say, the book is available or will be within a day or two.)

As the holiday season approaches, perhaps you have a relative who’s in law enforcement, or intelligence, or security? What could be better than a book! Or maybe you’d like to buy one for yourself.

(A portion of the price of this book goes to support deserving university faculty.)

Using private documents to improve search in public documents

I’m back from the SIAM International Conference on Data Mining, and the 5th Workshop on Link Analysis, Counterterrorism, and Security, which I helped to organize. The workshop papers are now online, along with some open problems that were discussed at the end of the workshop.

I’ll post about some ideas that were tossed around at the workshop and conference in the next few days.

Let me start by talking about the work of Roger Bradford. Information retrieval starts from a document-term matrix, which is typically extremely large and sparse, and then reduces the dimensionality by using an SVD, a process sometimes called latent semantic indexing. This creates a representation space for both documents and terms. A query is treated as if it were a kind of short document and mapped into this representation space. Its near neighbours are then the documents retrieved in response to the query; and they can be sorted in increasing distance from the query point as well.
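As a concrete (toy) illustration of this pipeline — the documents, the query, and the choice of k are all invented — both documents and query are projected onto the top-k right singular vectors of the document-term matrix and compared by cosine similarity:

```python
import numpy as np

# Four toy documents: two about network security, two about something else.
docs = [
    "attack on the network firewall",
    "firewall rules and network security",
    "cats and dogs at the park",
    "dogs playing in the park",
]

# Document-term matrix: rows are documents, columns are terms.
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(t) for t in vocab] for d in docs], dtype=float)

# Truncated SVD: keep the k largest singular values (latent semantic indexing).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Vk = Vt[:k, :].T        # terms x k: basis of the representation space
doc_vecs = A @ Vk       # each document's coordinates in that space

def retrieve(query):
    """Treat the query as a short document, map it into the space,
    and rank documents by decreasing cosine similarity."""
    q = np.array([query.split().count(t) for t in vocab], dtype=float)
    qv = q @ Vk
    sims = doc_vecs @ qv / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(qv) + 1e-12
    )
    return np.argsort(-sims)   # nearest documents first

order = retrieve("network security")
```

With k well below the number of terms, documents can score as near neighbours of a query even when they share few exact words — which is the point of the reduced space, and what makes Bradford’s public-plus-private construction interesting.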

Bradford showed that the original space can be built using a set of private documents together with a set of public documents, and that the resulting representation space allows better retrieval performance than a space derived from the public documents alone, without allowing the properties of the private documents to be inferred.

In fact, the set of private documents can be diluted by mixing them with other documents before the process starts, making it even more difficult to work backwards to the private documents.

This process has a number of applications that he talks about in the paper. One of the most interesting is that it allows different organizations, for example allies, to share sensitive information without compromising it to each other — and still get the benefits of the relationships in the full set of documents.