Posts Tagged 'data mining'

‘AI’ performance not what it seems

As I’ve written about before, ‘AI’ tends to be misused to refer to almost any kind of data analytics or derived tool — but let’s, for the time being, go along with this definition.

When you look at the performance of these tools and systems, it’s often quite poor, but I claim we’re getting fooled by our own cognitive biases into thinking that it’s much better than it is.

Here are some examples:

  • Netflix’s recommendations for any individual user seem to overlap 90% with the ‘What’s trending’ and ‘What’s new’ categories. In other words, Netflix is recommending to you more or less what it’s recommending to everyone else. Other recommendation systems don’t do much better (see my earlier post on ‘The Sound of Music Problem’ for part of the explanation).
  • Google search results are quite good at returning, in the first few links, something relevant to the search query, but we don’t ever get to see what was missed and might have been much more relevant.
  • Google News produces what, at first glance, appear to be quite reasonable summaries of recent relevant news, but when you use it for a while you start to see how shallow its selection algorithm is — putting stale stories front and centre, and occasionally producing real howlers, weird stories from some tiny venue treated as if they were breaking and critical news.
  • Self-driving cars perform well, but fail completely when they see certain patches on the road surface. Similarly, facial recognition systems fail when the human is wearing a t-shirt with a particular patch.

The commonality between these examples, and many others, is that the assessment from use is, necessarily, one-sided — we get to see only the successes and not the failures. In other words (HT Donald Rumsfeld), we don’t see the unknown unknowns. As a result, we don’t really know how well these ‘AI’ systems really do, and whether it’s actually safe to deploy them.

Some systems are ‘best efforts’ (Google News) and that’s fair enough.

But many of these systems are beginning to be used in consequential ways and, for that, real testing and real public test results are needed. And not just true positives, but false positives and false negatives as well. There are two main flashpoints where this matters: (1) systems that are starting to do away with the human in the loop (self-driving cars, 737 Maxs); and (2) systems where humans are likely to say or think ‘The computer (or worse, the AI) can’t be wrong’; and these are starting to include policing and security tools. Consider, for example, China’s social credit system. The fact that it gives low scores to some identified ‘trouble makers’ does not imply that everyone who gets a low score is a trouble maker, but this false implication lies behind this and almost all discussion of ‘AI’ systems.
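To make the false implication concrete, here is a small worked example. Every number in it (population size, base rate, detection rates) is a hypothetical assumption chosen for illustration, not a figure from any real system. It sketches how a scoring system can flag most genuine ‘trouble makers’ while the great majority of the people it flags are nothing of the kind.

```python
def confusion_counts(population, base_rate, true_positive_rate, false_positive_rate):
    """Return (TP, FP, FN, TN) counts for a hypothetical screening system."""
    positives = population * base_rate          # people who really are in the target group
    negatives = population - positives          # everyone else
    tp = positives * true_positive_rate         # correctly flagged
    fn = positives - tp                         # missed
    fp = negatives * false_positive_rate        # wrongly flagged
    tn = negatives - fp                         # correctly left alone
    return tp, fp, fn, tn


# Hypothetical assumptions: 1 in 1000 people is a 'trouble maker',
# the system flags 90% of them, and wrongly flags 5% of everyone else.
tp, fp, fn, tn = confusion_counts(1_000_000, 0.001, 0.90, 0.05)

recall = tp / (tp + fn)       # P(flagged | trouble maker)
precision = tp / (tp + fp)    # P(trouble maker | flagged)

print(f"flagged correctly: {tp:.0f}, flagged wrongly: {fp:.0f}")
print(f"P(flagged | trouble maker) = {recall:.2f}")
print(f"P(trouble maker | flagged) = {precision:.3f}")  # roughly 1.8% under these assumptions
```

A user who only ever sees the true positives would judge this system by the first probability; the people living under it experience the second.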

New data on human trafficking

The UN’s International Organisation for Migration has just released a large tranche of data about migration patterns, and a slick visualization interface:

Some of the results are quite counter-intuitive. Worth a look.

And so it begins

Stories out today that Google is now able to connect the purchasing habits of anyone it has a model for (i.e. almost everybody who’s ever been online) with Google’s own data on online activity.

For example, this story:

Google says that this enables them to draw the line between the ads that users have been shown, and the products that they buy. There’s a discrepancy in this story because Google also claim that they don’t get the list of products purchased using a credit card, but only the total amount. So a big hmmmmm.

(And if I were Google, I’d be concerned that there isn’t much of a link! Consumers might be less resentful if Google did indeed serve ads for things they wanted to buy, but everyone I’ve ever heard talk about online ads says the same thing: the ads either have nothing to do with their interests, or they are ads for things that they just bought.)

But connecting users to purchases (rather than ads to purchases) is the critical step to building a model of how much users are willing to pay — and this is the real risk of multinational data collection and analytics (as I’ve discussed in earlier posts).

“But I don’t have anything to hide”

This is the common response of many ordinary people when the discussion of (especially) government surveillance programs comes up. And they’re right, up to a point. In a perfect world, innocent people have nothing to fear from government.

The bigger problem, in fact, comes from the data collected and the models built by multinational businesses. Everyone has something to hide from them: the bottom line prices we are willing to pay.

We have not yet quite reached the world of differential pricing. We’ve become accustomed to the idea that the person sitting next to us on a plane may have paid (much) less for the identical travel experience, but we haven’t quite become reconciled to the idea that an online retailer might be charging us more for the same product than they charge other people, let alone that the chocolate bar at the corner store might be more expensive for us. If anything, we’re inclined to think that an organisation that has lots of data about us and has built a detailed model of us might give us a better price.

But it doesn’t require too much prescience to see that this isn’t always going to be the case. The seller’s slogan has always been “all the market can bear”.

Any commercial organization, under the name of customer relationship management, is building a model of your predicted net future value. Their actions towards you are driven by how large this is. Any benefits and discounts you get now are based on the expectation that, over the long haul, they will reap the converse benefits and more. It’s inherently an adversarial relationship.

Now think about the impact of data collection and modelling, especially with the realization that everything collected is there for ever. There’s no possibility of an economic fresh start, no bankruptcy of models that will wipe the slate clean and let you start again.

Negotiation relies on the property that each party holds back their actual bottom line. In a world where your bottom line is probably better known to the entity you’re negotiating with than it is to you, can you ever win? Or even win-win? Now tell me that you have nothing to hide.

[And, in the ongoing discussion of post-Snowden government surveillance, there’s still this enormous blind spot about the fact that multinational businesses collect electronic communication, content and metadata; location; every action on portable devices and some laptops; complete browsing and search histories; and audio around any of these devices. And they’re processing it all extremely hard.]

Refining “Data Science”

Regular readers will know that I have been thinking about the constellation of ideas that are getting a lot of play in universities and the research community around words like “data science” and “big data”, and especially the intersection of these ideas with the other constellation of “data mining”, “knowledge discovery” and “machine learning”.

I’ve argued that inductive model discovery (which I think is the core of all of these ideas) is a new way of doing science that is rapidly replacing the conventional Enlightenment or Popperian view of science. This is happening especially quickly in fields that struggled to apply the conventional scientific method, especially in medicine, the social “sciences”, business schools, and in the humanities.

Attending the International Conference on Computational Science meeting made me realise, however, that computational science is a part of this story as well.

Here’s how I see the connections between these three epistemologies:

  1. Conventional science. Understand systems via controlled experiments: setting up configurations that differ in only a few managed ways and seeing whether those differences correspond to different system behaviours. If they do, construct an “explanation”; if they don’t, it’s back to the drawing board.
  2. Computational science. Understand systems by building simulations of them and tweaking the simulations to see if the differences are those that are expected from the tweaks. (Simulations increase the range of systems that can be investigated when either the tweaks can’t be done on the real system, or when the system is hypothesised to be emergent from some simpler pieces.)
  3. Data science. Understand systems by looking at the different configurations that naturally occur and seeing how these correspond to different system behaviors. When they do, construct an “explanation”.

In other words, conventional science pokes the system being investigated in careful ways and sees how it reacts; computational science creates a replica of the system and pokes that; and data science looks at the system being poked and tries to match the reactions to the poking.

Underlying these differences in approach are also, of course, differences in validation: how one tells if an explanation is sufficient. The first two both start from a hypothesis and use statistical machinery to decide whether the hypothesis is supported sufficiently strongly. The difference is that computational science has more flexibility to set up controlled experiments and so, all things considered, can get stronger evidence. (But there is always the larger question of whether the simulation actually reproduces the system of interest; this is critical, but often ignored, and carries huge risks of “unknown unknowns”.) Data science, in contrast, validates its models of the system being studied by approaches such as the use of a test set, a component of the system that was not used to build the model, but which should behave as the original system did. It is also buttressed by the ability to generate multiple models and so compare among them.
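The test-set idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: the observations are synthetic (a noisy straight line), the model is an ordinary least-squares fit, and the function names are made up for this example rather than taken from any library.

```python
import random


def train_test_split(records, test_fraction, seed):
    """Shuffle the records and hold back a fraction that the model never sees."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(pairs):
    """Ordinary least squares for y = a * x + b."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_squared_error(model, pairs):
    a, b = model
    return sum((y - (a * x + b)) ** 2 for x, y in pairs) / len(pairs)


# Synthetic "naturally occurring" observations (an assumption for illustration).
noise = random.Random(0)
data = [(float(x), 2.0 * x + 1.0 + noise.gauss(0, 0.5)) for x in range(100)]

train, test = train_test_split(data, test_fraction=0.25, seed=1)
model = fit_line(train)                      # built from the training records only
train_error = mean_squared_error(model, train)
test_error = mean_squared_error(model, test)  # judged on records it has never seen
print(f"slope={model[0]:.3f} intercept={model[1]:.3f}")
print(f"train MSE={train_error:.3f} test MSE={test_error:.3f}")
```

If the test error is much worse than the training error, the “explanation” has fitted the particular records rather than the system.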

Data science is advancing on two fronts: first, the flexibility it provides to conventional science not to have to construct carefully balanced controlled experiments; second, and much more significantly, the opportunity it creates for making scientific progress in the social sciences and humanities, replacing “qualitative” by “quantitative” in unprecedented ways.

Why Data Science?

Data Science has become a hot topic lately. As usual, there’s not a lot of agreement about what data science actually is. I was on a panel last week, and someone asked afterwards what the difference was between data mining, which we’ve been doing for 15 years, and data science.

It’s a good question. Data science is a new way of framing the scientific enterprise in which a priori hypothesis creation is replaced by inductive modelling; and this is exactly what data mining/knowledge discovery is about (as I’ve been telling my students for a decade).

What’s changed, perhaps, is that scientists in many different areas have realised the existence and potential of this approach, and are commandeering it for their own.

I’ve included the slides from a recent talk I gave on this subject (at the University of Technology Sydney).

And once again let me emphasise that the social sciences and humanities did not really have access to the Enlightenment model of doing science (because they couldn’t do controlled experiments), but they certainly do to the new model. So expect a huge development in data social science and data humanities as soon as research students with the required computational skills move into academia in quantity.

Why data science (ppt slides)

More subtle lessons from the Sony hack

There are some obvious lessons to learn from the Sony hack: perimeter defence isn’t much use when the perimeter has thousands of gates in it (it looks as if the starting point was a straightforward spearphishing attack); and if you don’t compartmentalise your system inside the perimeter, then anyone who gets past it has access to everything.

But the less obvious lesson has to do with the difference between our human perception of the difficulties of de-anonymization and aggregation, and the actual power of analytics to handle both. For example, presumably Sony kept data on their employees’ health in properly-protected HIPAA-compliant storage, but there were occasional emails that mentioned individuals and their health status. The people sending these emails presumably didn’t feel as if any particular one was a breach of privacy, since the private content in each one was small. But they failed to realise that all of these emails get aggregated, at least in backups. So now all of those little bits of information are in one place, and the risk of building significant models from them has increased substantially.

Anyone with analytic experience and access to a large number of emails can find structures that are decidedly non-obvious; but this is far from intuitive to the public at large, and hence to Sony executives.

We need to learn to value data better, and to understand in a deep way that the value of data increases superlinearly with the amount that is collected into a single coherent unit.

Making recommendations different enough

One of the major uses of data analytics in practice is to make recommendations, either explicitly or implicitly. This is one area where the interests of marketers and the interests of consumers largely run together. If I want to buy something (a product or access to an experience such as listening to a song or seeing a movie) I’d just as soon buy something I actually want, so a seller who can suggest something that fits has a better chance of getting my business than one that presents a set of generic choices. Thus businesses build models of me to try and predict what I am likely to buy.

Some of these businesses are middlemen, and what they are trying to predict is what kind of ads to show me on behalf of other businesses. Although this is a major source of revenue for web-based businesses, I suspect it to be an edge-case phenomenon; that is, the only people who actually see ads on the web are people who are new to the web (and there are still lots of them every day) while those who’ve been around for a while develop blindness to everything other than content they actually want. You see businesses continually developing new ways to make their ads more obtrusive but, since they have already trained us to ignore ads, this often seems to backfire.

Other businesses use permission marketing, which they can do because they already have a relationship. This gives them an inside track — they know things about their customers that are hard to get from more open sources, such as what they have actually spent money on. When they can analyse the data available to them effectively, they are able to create the win-win situation where what they want to sell me is also what I want to buy.

But there’s a huge hole in the technology that represents an opportunity at least as large as that on which Google was founded: how new and different should the suggestion be?

For example, if you buy a book from Amazon by a popular author, your recommendation list is populated by all of the other books by that same author, the TV programs based on that author’s books, the DVDs of the TV shows, and on and on. This is largely a waste of the precious resource of quality attention. Most humans are capable of figuring out that, if they like one book by an author, they may well like others. (In fact, sites like Amazon are surprisingly bad at figuring out that, when an author writes multiple series featuring different characters and settings, an individual might like one series but not necessarily the others.)

So what would be better? The goal, surely, is to suggest products that are similar to the ones the individual has already liked or purchased, but also sufficiently different that that individual would not necessarily have noticed them. In other words, current systems present products in a sphere around the existing product, but what they should do is present products in an annulus around the existing product. Different, but not too different.
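The sphere-versus-annulus distinction can be sketched as a filter on distance. Everything specific here is an assumption for illustration: the two-number feature vectors, the Euclidean distance, and the radii are all hypothetical, and choosing them well is the hard part.

```python
import math


def recommend_annulus(liked, catalogue, inner_radius, outer_radius):
    """Return item names whose distance from the liked item lies in
    [inner_radius, outer_radius], nearest first."""
    scored = sorted(
        (math.dist(liked, features), name) for name, features in catalogue.items()
    )
    return [name for d, name in scored if inner_radius <= d <= outer_radius]


# Hypothetical catalogue: each item is a point in a made-up 2-D "taste space".
catalogue = {
    "same author, same series": (0.1, 0.0),
    "same author, other series": (0.3, 0.2),
    "different author, similar themes": (0.6, 0.5),
    "adjacent genre": (0.9, 0.7),
    "unrelated": (3.0, 2.5),
}
liked = (0.0, 0.0)

sphere = recommend_annulus(liked, catalogue, 0.0, 0.5)   # what current systems show
annulus = recommend_annulus(liked, catalogue, 0.5, 1.5)  # different, but not too different

print(sphere)   # ['same author, same series', 'same author, other series']
print(annulus)  # ['different author, similar themes', 'adjacent genre']
```

The filter itself is trivial; what the sketch leaves open is exactly what the post identifies as unsolved, namely a similarity measure and an inner radius that match what a person would already have found unaided.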

This is surprisingly difficult to do. Deciding what similarity means is already a difficult problem; deciding what “just enough dissimilarity” means has, so far, proved too difficult. But what an opportunity!

Predicting the Future

Arguably the greatest contribution of computing to the total of human knowledge is that relatively simple results from theoretical models of computation show that the future is an inherently unknowable place, not just in practice, but for fundamental reasons.

A Turing machine is a simple model of computation with two parts: an infinite row of memory elements, each able to contain a single character; and a state machine, a simple device that is positioned at one of the memory elements, and makes moves by inspecting the single character in the current element and using a simple transition table to decide what to do next. Possible next moves include changing the current character to another one (from a finite alphabet) or moving one element to the right or to the left, or stopping and turning off.

A Turing machine is a simple device; all of its parts are straightforward; and many real-world simulators have been built. But it is hypothesised that this simple device can compute any function that can be computed by any other computational device, and so it contains within its simple structure everything that can be found in the most complex supercomputer, except speed.

It has been suggested that the universe is a giant Turing machine, and everything we know so far about physics continues to work from this perspective, with the single exception that it requires that time is quantized rather than continuous: the universe ticks rather than runs.

But here’s the contribution of computation to epistemology: Almost nothing interesting about the future behaviour of a Turing machine is knowable in any shortcut way, that is in any way that is quicker than just letting the Turing machine run and seeing what happens. This includes questions like: will the Turing machine ever finish its computation? Will it revisit this particular memory element? Will this memory element ever again contain the same symbol that it does now? and many others. (These questions may be answerable in particular cases, but they can’t be answered in general; that is, you can’t inspect the transitions and the storage and draw conclusions in a general way.)

If most of the future behaviour of such a simple device is not accessible to “outside” analysis, then almost every property of more complex systems must be equally inaccessible.

Note that this is not an argument built from limitations that might be thought of as “practical”. Predicting what will happen tomorrow is not, in the end, impossible because we can’t gather enough data about today, or because we don’t have the processing power to actually build the predictions; it’s a more fundamental limitation in the nature of what it means to predict the future. This limitation is akin (in fact, quite closely related) to the fact that, within any consistent formal system rich enough to express arithmetic, there are statements that are true but cannot be proved within that system.
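The Turing machine described above is small enough to simulate in a few lines. This is a minimal sketch, not a reference implementation: it uses the common textbook variant in which each transition writes a symbol and moves the head in a single step, and the function names, state names, and the two example machines are all made up for illustration.

```python
from collections import defaultdict


def render(cells, blank):
    """Return the non-blank stretch of the tape as a string."""
    used = [i for i, s in cells.items() if s != blank]
    if not used:
        return ""
    return "".join(cells[i] for i in range(min(used), max(used) + 1))


def run_turing_machine(transitions, tape, start_state, halt_state, max_steps, blank="_"):
    """Run until the halt state is reached or the step budget is exhausted.

    transitions maps (state, symbol) -> (symbol_to_write, move, next_state),
    where move is "L", "R" or "N". Returns (halted, steps_taken, tape_string).
    A missing table entry raises KeyError.
    """
    cells = defaultdict(lambda: blank, enumerate(tape))  # unbounded tape of blanks
    state, head = start_state, 0
    for step in range(max_steps):
        if state == halt_state:
            return True, step, render(cells, blank)
        new_symbol, move, state = transitions[(state, cells[head])]
        cells[head] = new_symbol
        head += {"L": -1, "R": 1, "N": 0}[move]
    return state == halt_state, max_steps, render(cells, blank)


# Example 1: flip every bit, then stop at the first blank.
flip = {
    ("scan", "0"): ("1", "R", "scan"),
    ("scan", "1"): ("0", "R", "scan"),
    ("scan", "_"): ("_", "N", "halt"),
}
halted, steps, tape = run_turing_machine(flip, "0110", "scan", "halt", max_steps=100)
print(halted, steps, tape)  # True 5 1001

# Example 2: march right for ever. The simulator can only report "not yet".
march = {("go", "_"): ("_", "R", "go")}
still_running = run_turing_machine(march, "", "go", "halt", max_steps=1000)
print(still_running)  # (False, 1000, '')
```

The second example is the point of the argument in miniature: a step budget can report that a machine has not halted yet, but no choice of budget turns that into a general answer about whether it ever will.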

There are other arguments that also speak to the problem of predicting the future. These aren’t actually needed, given the argument above, but they are often advanced, and speak more to the practical difficulties.

The first is that non-linear systems are not easy to model, and often have unsuspected actions that are not easy to infer even when we have a detailed understanding of them. Famously, bamboo canes can suddenly appear from the ground and then grow more than 2 feet in a day.

The second is that many real-world systems are chaotic; that is, infinitesimal differences in their conditions at one moment in time can turn into enormous differences at a future time. This is why forecasting the weather is difficult: a small error in measurement at one weather station today (caused, perhaps, by a butterfly flapping its wings) can completely change tomorrow’s weather a thousand miles away. The problem with predicting the future here is that the current state cannot be measured to sufficient accuracy.
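Sensitivity to initial conditions is easy to see in a toy system. The sketch below uses the logistic map, a standard textbook example of chaos, as a stand-in; it is not a weather model, and the starting value and the size of the “measurement error” are arbitrary choices for illustration.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), which is chaotic for r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


measured = logistic_trajectory(0.2, 60)
actual = logistic_trajectory(0.2 + 1e-10, 60)   # a tiny measurement error
gaps = [abs(x - y) for x, y in zip(measured, actual)]

for step in (0, 10, 20, 30, 40, 50, 60):
    print(f"step {step:2d}: gap = {gaps[step]:.3e}")
```

The gap starts at about one part in ten billion and roughly doubles at each step, so within a few dozen steps the two trajectories have nothing to do with each other.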

So if the future is inherently, fundamentally impossible to predict, what do we mean when we talk about prediction in the context of knowledge discovery? The answer is that predictive models are not predicting a previously unknown future, but are predicting the recurrence of patterns that have existed in the past. It’s desperately important to keep this in mind.

Thus when a mortgage prediction system (should this new applicant be given a mortgage?) is built, it’s built from historical data: which of a pool of earlier applicants for mortgages did, and did not, repay those loans. The prediction for a new mortgage applicant is, roughly speaking, based on matching the new applicant to the pool of previous applicants and making a determination from what the outcomes were for those. In other words, the prediction assumes an approximate rerun of what happened before: “now” is essentially the same situation as “then”. It’s not really a prediction of the future; it’s a prediction of a rerun of the past.
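The “matching to previous applicants” idea can be sketched with a nearest-neighbour vote. This is one simple way to do the matching, not a claim about how any real lender works; the two features, their scaling, and the handful of historical records are all hypothetical.

```python
import math


def predict_repayment(applicant, history, k=3):
    """Predict by majority vote among the k most similar past applicants.

    history is a list of (features, repaid) pairs, where features is a tuple
    of numbers already scaled to comparable ranges.
    """
    neighbours = sorted(history, key=lambda record: math.dist(applicant, record[0]))[:k]
    repaid_votes = sum(1 for _, repaid in neighbours if repaid)
    return repaid_votes * 2 > len(neighbours)


# Hypothetical history: (scaled income, scaled debt load) -> did they repay?
history = [
    ((9.0, 1.0), True),
    ((8.0, 2.0), True),
    ((7.5, 1.5), True),
    ((3.0, 6.0), False),
    ((2.5, 7.0), False),
    ((4.0, 5.5), False),
]

looks_like_past_repayers = predict_repayment((8.5, 1.5), history)
looks_like_past_defaulters = predict_repayment((3.0, 6.5), history)
print(looks_like_past_repayers, looks_like_past_defaulters)  # True False
```

Nothing in the function knows anything about tomorrow: every answer it gives is a summary of what happened to similar people in the recorded past.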

All predictive models have (and must have) this historical replay character. Trouble starts when this gets forgotten, and models are used to predict scenarios that are genuinely in the future. For example, in mortgage prediction, a sudden change in the wider economy may be significant enough that the history that is wired into the predictor no longer makes sense. Using the predictor to make new lending decisions becomes foolhardy.

Other situations have similar pitfalls, but they are a bit better hidden. For example, the dream of personalised medicine is to be able to predict the outcome for a patient who has been diagnosed with a particular disease and is being given a particular treatment. This might work, but it assumes that every new patient is close enough to some of the previous patients that there’s some hope of making a plausible prediction. At present, this is foundering on the uniqueness of each patient, especially as the available pool of existing patients for building the predictor is often quite limited. Without litigating the main issue, models that attempt to predict future global temperatures are vulnerable to the same pitfall: previous dependencies of temperatures on temperatures at earlier times do not provide a solid epistemological basis for predicting future temperatures based on temperatures now (and with the triple whammy of fundamental unpredictability, chaos, and non-linear systems).

All predictors should be built so that predictions all pass through a preliminary step that compares them to the totality of the data used to build the predictor. New records that do not resemble records used for training cannot legitimately be passed to the predictor, since the result has a strong probability of being fictional. In other words, the fact that a predictor was built from a particular set of training data must be preserved in the predictor’s use. Of course, there’s an issue of how similar a new record must be to the training records to be plausibly predicted. But at least this question should be asked.
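One way to build that preliminary step is to wrap the predictor so that it refuses records that are too far from anything it was trained on. The sketch below is a minimal version under stated assumptions: the class name, the nearest-training-point test, the distance threshold, and the stand-in predictor are all hypothetical, and the threshold is precisely the open question of how similar is similar enough.

```python
import math


class GuardedPredictor:
    """Wrap a predictor so it only answers for records resembling its training data."""

    def __init__(self, training_points, predict, max_distance):
        self.training_points = list(training_points)  # kept alongside the model
        self.predict = predict
        self.max_distance = max_distance

    def __call__(self, record):
        nearest = min(math.dist(record, point) for point in self.training_points)
        if nearest > self.max_distance:
            return None  # refuse: nothing in the past looks like this record
        return self.predict(record)


# Hypothetical training records and a stand-in predictor, for illustration only.
training_points = [(9.0, 1.0), (8.0, 2.0), (3.0, 6.0), (2.5, 7.0)]


def stand_in_predictor(record):
    return record[0] > record[1]


guarded = GuardedPredictor(training_points, stand_in_predictor, max_distance=2.0)

familiar = guarded((8.5, 1.5))      # close to past records, so a prediction is made
unfamiliar = guarded((40.0, 0.5))   # unlike anything seen before, so no prediction
print(familiar, unfamiliar)  # True None
```

Returning None instead of a guess keeps the training data attached to the predictor in use, which is the property the paragraph above asks for.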

So can we predict the future? No, we can only repredict the past.

The right level of abstraction = the right level to model

I think the take away from my last post is that models of systems should aim to model them at the right level of abstraction, where that right level corresponds to the places where there are bottlenecks. These bottlenecks are places where, as we zoom out in terms of abstraction, the system suddenly seems simpler. The underlying differences don’t actually make a difference; they are just variation.

The difficulty is that it’s really, really hard to see or decide where these bottlenecks are. We rightly laud Newton for seeing that a wide range of different systems could all be described by a single equation; but it’s also true that Einstein showed that this apparent simplicity was actually an approximation for a certain (large!) subclass of systems, and so the sweet spot of system modelling isn’t quite where Newton thought it was.

For living systems, it’s even harder to see where the right level of abstraction lies. Linnaeus (apparently the most-cited human) certainly created a model that was tremendously useful, working at the level of the species. This model has frayed a bit with the advent of DNA technology, since the clusters from observations don’t quite match the clusters from DNA, but it was still a huge contribution. But it’s turning out to be very hard to figure out the right level of abstraction to capture ideas like “particular disease” or “particular cancer”, even though we can diagnose them quite well. The variations in what’s happening in cells are extremely difficult to map to what seems to be happening in the disease.

For human systems, the level of abstraction is even harder to get right. In some settings, humans are surprisingly sheep-like and broad-brush abstractions are easy to find. But dig a little, and it all falls apart into “each person behaves as they like”. So predicting the number of “friends” a person will have on a social media site is easy (it will be distributed around Dunbar’s number), but predicting whether or not they will connect with a particular person is much, much harder. Does advertising work? Yes, about half of it (as Ogilvy famously said). But will this ad influence this person? No idea. Will knowing the genre of this book or film improve the success rate of recommendations? Yes. Will it help with this book and this person? Not so much.

Note the connection between levels of abstraction and clustering. In principle, if you can cluster (or, better, bicluster) data about your system and get (a) strong clusters, and (b) not too many of them, then you have some grounds for saying that you’re modelling at the right level. But this approach founders on the details: which attributes to include, which algorithm to use, which similarity measure, which parameters, and so on and on.

Three kinds of knowledge discovery

I’ve always made a distinction between “mainstream” data mining (or knowledge discovery or data analytics) and “adversarial” data mining — they require quite distinct approaches and algorithms. But my work with bioinformatic datasets has made me realise that there are more of these differences, and the differences go deeper than people generally understand. That may be part of the reason why some kinds of data mining are running into performance and applicability brick walls.

So here are 3 distinct kinds of data mining, with some thoughts about what makes them different:

1. Modelling natural/physical, that is clockwork, systems.
Such systems are characterised by apparent complexity, but underlying simplicity (the laws of physics). Such systems are entropy minimising everywhere. Even though parts of such systems can look extremely complex (think surface of a neutron star), the underlying system to be modelled must be simpler than its appearances would, at first glance, suggest.

What are the implications for modelling? Some data records will always be more interesting or significant than others — for most physical systems, records describing the status of deep space are much less interesting than those near a star or planet. So there are issues around the way data is sampled.
Some attributes will also be more interesting or significant than others but, and here’s the crucial point, this significance is a global property. It’s possible to have irrelevant or uninteresting attributes, but these attributes are similarly uninteresting everywhere. Thus it makes sense to use attribute selection as part of the modelling process.

Because the underlying system is simpler than its appearance suggests, there is a bias towards simple models. In other words, physical systems are the domain of Occam’s Razor.

2. Living systems.
Such systems are characterised by apparent simplicity, but underlying complexity (at least relatively speaking). In other words, most living systems are really complicated underneath, but their appearances often conceal this complexity. It isn’t obvious to me why this should be so, and I haven’t come across much discussion about it — but living systems are full of what computing people call encapsulation, putting parts of systems into boxes with constrained interfaces to the outside.

One big example where this matters, and is starting to cause substantial problems for data mining, is the way diseases work. Most diseases are complex activities in the organism that has the disease, and their precise working out often depends on the genotype and phenotype of that organism as well as of the diseases themselves. In other words, a disease like influenza is a collaborative effort between the virus and the organism that has the flu — but it’s still possible to diagnose the disease because of large-scale regularities that we call symptoms.
It follows that, between the underlying complexity of disease, genotype, and phenotype, and the outward appearances of symptoms, or even RNA concentrations measured by microarrays, there must be substantial “bottlenecks” that reduce the underlying complexity. Our lack of understanding of these bottlenecks has made personalised medicine a much more elusive target than it seemed to be a decade ago. Systems involving living things are full of these bottlenecks that reduce the apparent complexity: species, psychology, language.

All of this has implications for data mining of systems involving living things, most of which have been ignored. First, the appropriate target for modelling should be these bottlenecks because this is where such systems “make the most sense”; but we don’t know where the bottlenecks are, that is which part of the system (which level of abstraction) should be modelled. In general, this means we don’t know how to guess the appropriate complexity of model to fit with the system. (And the model should usually be much more complex than we expect — in neurology, one of the difficult lessons has been that the human brain isn’t divided into nice functional building blocks; rather it is filled with “hacks”. So is a cell.)

Because systems involving living things are locally entropy reducing, different parts of the system play qualitatively different roles. Thus some data records are of qualitatively different significance from others, so the implicit sampling involved in collecting a dataset is much more difficult, but much more critical, than for clockwork systems.

Also, because different parts of the system are so different, the attributes relevant to modelling each part of the system will also tend to be different. Hence, we expect that biclustering will play an important role in modelling living systems. (Attribute selection may also still play a role, but only to remove globally uninteresting attributes; and this should probably be done with extreme caution.)

Systems of living things can also be said to have competing interests, even though these interests are not conscious. Thus such systems may involve communication and some kind of “social” interaction — which introduces a new kind of complexity: non-local entropy reduction. It’s not clear (to me at least) what this means for modelling, but it must mean that it’s easy to fall into a trap of using models that are too simple and too monolithic.

3. Human systems.
Human systems, of course, are also systems involving living things, but the big new feature is the presence of consciousness. Indeed, in settings where humans are involved but their actions and interactions are not conscious, models of the previous kind will suffice.

Systems involving conscious humans are locally and non-locally entropy reducing, but there are two extra feedback loops: (1) the loop within the mind of each actor which causes changes in behaviour because of modelling other actors and themselves (the kind of thing that leads to “I know that he knows that I know that … so I’ll …”); (2) the feedback loop between actors and data miners.

The first feedback loop creates two processes that must be considered in the modelling:
a. Self-consciousness, which generates, for example, purpose tremor;
b. Social consciousness, which generates, for example, strong signals from deception.

The second feedback loop creates two other processes:
a. Concealment, the intent or action of actors hiding some attributes or records from the modelling;
b. Manipulation, the deliberate attempt to change the outcomes of any analysis that might be applied.

I argue that all data mining involving humans has an adversarial component, because the interests of those being modelled never run exactly with each other, or with those doing the modelling, and so all of these processes must be considered whenever modelling of human systems is done. (You can find much more on this topic by reading back in the blog.)

But one obvious effect is that records and attributes need to have metadata associated with them that carries information about properties such as uncertainty or trustworthiness. Physical systems and living systems might mislead you, but only with your implicit connivance or misunderstanding; systems involving other humans can mislead you either with intent or as a side-effect of misleading someone else.

As I’ve written about before, systems where actors may be trying to conceal or manipulate require care in choosing modelling techniques so as not to be misled. On the other hand, when actors are self-conscious or socially conscious they often generate signals that can help the modelling. However, a complete way of accounting for issues such as trust at the datum level has still to be designed.

Protecting data

In the world of things, we often value objects more than the rest of the world would value them — the inherited china or silver, the souvenir bought on a meaningful trip, and so on.

In the world of data, this seems to be exactly the other way around: once we’ve captured some data (and perhaps used it to model our customers) then most of its value to us has been exhausted. So we fail to see that it has much greater value to others than it has to us — and fail to protect it well. For example, Adobe used and uses its data about customers for its own internal purposes. But it clearly failed to realize that this data was of huge potential value to criminals who can use it for identity theft.

The bottom line is: data should be protected by its real-world open-market value, not by its current value to the business. Until this sinks in, we are going to continue to see data breaches in businesses and governments.

Understanding “anomaly” in large dynamic datasets

A pervasive mental model of what it means to be an “anomaly” is that this concept is derived from difference or dissimilarity; anomalous objects or records are those that are far from the majority, common, ordinary, or safe records. This intuition is embedded in the language used — for example, words like “outlier”.

May I suggest that a much more helpful, and even more practical, intuition of what “anomaly” means comes from the consideration of boundaries rather than dissimilarity. Consider the following drastically simplified rendering of a clustering:


There are 3 obvious clusters and a selection of individual points. How are we to understand these points?

The point A, which would conventionally be considered the most obvious outlier, is probably actually the least interesting. Points like this are almost always the result of some technical problem on the path between data collection and modelling. You wouldn’t think this would happen with automated systems, but it’s actually surprisingly common for data not to fit properly into a database schema or for data to be shifted over one column in a spreadsheet, and that’s exactly the kind of thing that leads to points like A. An inordinate amount of analyst attention can be focused on such points because they look so interesting, but they’re hardly ever of practical importance.

Points B and C create problems for many outlier/anomaly detection algorithms because they aren’t particularly far from the centre of gravity of the entire dataset. Sometimes points like these are called local outliers or inliers and their significance is judged by how far they are (how dissimilar) from their nearest cluster.

Such accounts are inadequate because they are too local. A much better way to judge B and C is to consider the boundaries between each cluster and the aggregate rest of the clusters; and then to consider how close such points lie to these boundaries. For example, B lies close to the boundary between the lower left cluster and the rest and is therefore an interesting anomalous point. If it were slightly further down in the clustering it would be less anomalous because it would be closer to the lower left cluster and further from this boundary. Point C is more anomalous than B because it lies close to three boundaries: those between the lower left cluster and the rest, between the upper left cluster and the rest, and the rightmost cluster and the rest. (Note that a local outlier approach might not think C is anomalous because it’s close to all three clusters.)
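The boundary idea can be sketched in a few lines. The post does not say how the boundaries are computed, so this sketch makes a loud assumption: each cluster is represented only by a centroid, and the boundary between a cluster and the rest is the set of points equally far from that centroid and from the nearest other centroid. The coordinates, the `scale` parameter and the function name `boundary_score` are all made up for illustration.

```python
import math

def boundary_score(point, centroids, scale=1.0):
    """Sum, over clusters, of how close the point lies to that cluster's
    boundary with the rest (assumption: centroid-based boundaries)."""
    dists = [math.dist(point, c) for c in centroids]
    score = 0.0
    for k, d_k in enumerate(dists):
        nearest_other = min(d for j, d in enumerate(dists) if j != k)
        margin = d_k - nearest_other          # zero exactly on the boundary
        score += math.exp(-abs(margin) / scale)
    return score

# Hypothetical centroids: lower left, upper left, rightmost cluster.
centroids = [(0.0, 0.0), (0.0, 10.0), (10.0, 5.0)]

inside = boundary_score((0.2, 0.1), centroids)       # deep inside a cluster
far_a = boundary_score((100.0, -100.0), centroids)   # a distant point like A
like_b = boundary_score((0.0, 5.0), centroids)       # on a boundary, like B
like_c = boundary_score((3.75, 5.0), centroids)      # equidistant from all three, like C
```

With these made-up coordinates the scores order as inside < A < B < C, which matches the ranking argued for above: the distant point is far from every boundary, while the point near all three boundaries scores highest. Real clusters have shape and spread, so a centroid-only boundary is a crude stand-in.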

The point D is less anomalous than B and C, but is also close to a boundary, the boundary that wraps the rightmost cluster. So this idea can be extended to many different settings. For example, wrapping a cluster more or less tightly changes the set of points that are “outside” the wrapping and so gives an ensemble score for how unusual the points on the fringe of a cluster might be. This is especially important in adversarial settings, because these fringes are often where those with bad intent lurk.
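One way to read the wrapping idea is as a vote across wrappings of different tightness. The sketch below assumes, purely for illustration, that a wrapping is a circle around a cluster centroid; the radii, coordinates and the name `wrapping_score` are hypothetical.

```python
import math

def wrapping_score(point, centroid, radii):
    """Fraction of wrappings (circles of the given radii, an assumption)
    that leave the point outside the cluster."""
    d = math.dist(point, centroid)
    return sum(d > r for r in radii) / len(radii)

centroid = (10.0, 5.0)            # hypothetical rightmost cluster
radii = [1.0, 2.0, 3.0, 4.0]      # from tight to loose wrappings

core = wrapping_score((10.5, 5.0), centroid, radii)    # inside every wrapping
fringe = wrapping_score((12.5, 5.0), centroid, radii)  # outside the two tightest
remote = wrapping_score((20.0, 5.0), centroid, radii)  # outside all of them
```

A fringe point like D gets an intermediate score, because only the tighter wrappings exclude it. A real wrapping would follow the shape of the cluster rather than a circle.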

The heart of this approach is that anomaly must be a global property derived from all of the data, not just a local property derived from the neighbourhood of the point in question. Boundaries encode non-local properties in a way that similarity (especially similarity in a geometry, which is usually how clusterings are encoded) does not.

The other attractive feature of this approach is that it actually defines regions of the space based on the structure of the “normal” clusters. These regions can be precomputed and then, when new points arrive, it’s fast to decide how to understand them. In other words, the boundaries become ridge lines of high abnormality in the space and it’s easy to see and understand the height of any other point in the space. Thus the model works extremely effectively for dynamic data as long as there’s an initial set of normal data to prime the system. (New points can also be exploited as feedback to the system so that, if a sequence of points arrives in a region, the first few will appear as strong anomalies, but their presence creates a new cluster, and hence a new set of boundaries, so that newer points in the same region no longer appear anomalous.)

Businesses processing emails

The Daily Mail reports an experiment by the High-Tech Bridge company in which they sent private emails or uploaded documents containing unique URLs to 50 different platforms, and then waited to see whether these URLs were visited, and by whom.

Sure enough, several of them were visited by the businesses that had handled the matching document, including Facebook, Twitter, and Google. This won’t come as a surprise to readers of this blog, but once again points out the extent to which businesses like these are processing any documents they see to extract models of the sender/receiver.

There has been some confusion in the media about how this process might work. Evidently it’s not obvious to many that such a process is automated — there isn’t anyone ‘reading’ these documents, but they’re being processed by software which is capable of ingesting pages pointed to, and processing the contents of those pages as well. It would help if we agreed to verbs that distinguished ‘read by a human’ from ‘processed by software’ that were simple enough for the wider public to understand the difference.

Pull from data versus push to analyst

One of the most striking things about the discussion of the NSA data collection that Snowden has made more widely known is the extent to which the paradigm for its use is database oriented. Both the media and, more surprisingly, the senior administrators talk only about using the data as a repository: “if we find a cell phone in Afghanistan we can look to see which numbers in the US it has been calling and who those numbers in turn call” has been the canonical justification. In other words, the model is: collect the data and then have analysts query it as needed.

The essence of data mining/knowledge discovery is exactly the opposite: allow the data to actively and inductively generate models with an associated quality score, and use analysts to determine which of these models is truly plausible and then useful. In other words, rather than having analysts create models in their heads and then use queries to see if they are plausible (a “pull” model), algorithmics generates models inductively and presents them to analysts (a “push” model). Since getting analysts to creatively think of reasonable models is difficult (and suffers from the “failure of imagination” problem), the inductive approach is both cheaper and more effective.

For example, given the collection of metadata about which phone numbers call which others, it’s possible to build systems that produce results of the form: here’s a set of phone numbers whose calling patterns are unlike any others (in the whole 500 million node graph of phones). Such a calling pattern might not represent something bad, but it’s usually worth a look. The phone companies themselves do some of this kind of analysis, for example to detect phones that are really business lines but are claiming to be residential and, in the days when long distance was expensive, to detect the same scammers moving across different phone numbers.
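As a toy illustration of the push model, the sketch below ranks phone numbers by how unusual their calling pattern is, without any analyst posing a query. It uses a single feature (the number of distinct numbers called) and the distance from the median as the score. Both choices are assumptions made for brevity, not a description of any deployed system, and the phone numbers are fictional.

```python
from collections import defaultdict
from statistics import median

def unusual_callers(calls, top=3):
    """Rank callers by how far their distinct-callee count lies from the median.
    `calls` is an iterable of (caller, callee) pairs."""
    callees = defaultdict(set)
    for caller, callee in calls:
        callees[caller].add(callee)
    fanout = {number: len(called) for number, called in callees.items()}
    centre = median(fanout.values())
    ranked = sorted(fanout, key=lambda n: abs(fanout[n] - centre), reverse=True)
    return ranked[:top]

# Fictional metadata: most numbers call two others, one calls fifty.
calls = [(f"555-01{i:02d}", f"555-02{j:02d}") for i in range(1, 10) for j in range(2)]
calls += [("555-0100", f"555-03{j:02d}") for j in range(50)]

flagged = unusual_callers(calls, top=1)
```

The flagged number is pushed to an analyst, who decides whether it is worth a look. A realistic version would use much richer features of the call graph than fan-out alone.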

I would hope that inductive model building is being used on collected data, and the higher-ups in the NSA either don’t really understand or are being cagey. But I’ve talked to a lot of people in government who collect large data but are completely stuck in the database model, and have no inkling of inductive modelling.

Computing in Compromised Environments

As I’ve argued before, the Castle Model of cybersecurity is pretty much doomed — there’s no harm in antivirus and antimalware tools, but they provide only modest defence in a world where adversaries have access to the source code of the systems and tools that we run. Nobody, even at the high end, can assume that their systems haven’t been infiltrated by adversaries.

So if it’s impossible to keep the Vikings from roaming the hallways of the castle looking for things to steal, can anything be done to allow useful work to get done and at the same time protect against issues such as theft of intellectual property? The answer is yes, but it requires a change of mindset.

First, most things that can be stolen from the online world are not like pots of gold or the secret formula for antigravity — things for which existence is the fundamental property. Rather, most things that can be stolen are about choices from alternatives: will the tender bid be for this many dollars or that many dollars? Is the system going to use this technique or that technique? Is the software code going to be like this or like that? In other words, the property can be protected by adding uncertainty — if something is stolen but it may or may not be the true thing, then the stealing is much less rewarding, and might be useless.

As a concrete example, suppose the CEO is recommending to the Board that the business move in direction A, and this information is contained in a briefing note online. If there is also a briefing note recommending a move in direction B, and one recommending direction C and it’s not possible to tell which is the true one, then the theft of any or all of them provides adversaries with little information.

So the heart of the idea is to replicate information so that the true information is hidden in a welter of similar information that is interestingly different.

Making this idea work requires a couple of technical pieces which are buildable. For simplicity, I’ll describe the system in the case where there are only two copies of each document, but everything extends straightforwardly to as many replicas as you want, so that the uncertainty can be made arbitrarily large.

The first part is to defeat the possibility of working out which are the real documents by traffic and behavioral analysis. The ‘trick’ here is to use the ideas developed for the Frankenstein malware — create the fake documents from pieces of real documents, and create the fake editing actions by pasting together real editing actions. In other words, whenever a human carries out a sequence of edits, the actions and their timing are captured and replayed against fake documents. Thus even an observer with access to the complete system from the inside cannot distinguish between a live human working on a document and a piece of software doing the same (not, at least, without keystroke loggers, and even that can be worked around).

There are some obvious special cases: it helps to insert or remove ‘not’ around verbs; and it helps to change numbers in arbitrary ways. The point is not that the fakes should look plausible to careful analysis — it’s that they shouldn’t be detectable as fakes using automated analysis. Note that many real documents exist in unpolished and perhaps contradictory states as they are developed as well.

So the basic mechanism is that humans work on the real documents but software simulates humans working on the fake documents. Of course, the humans should be encouraged to work on the fake documents occasionally too.

The second part, then, is how the humans know which documents are the real ones in such a way that someone lurking inside the system can’t. Let’s suppose that each file exists with two name variants: fnA and fnB, one of which is real and the other fake. To let the humans keep track of which is real, we need one offline secret. Each user is given an integer which is their part of the secret. Each time they log on, the system sends them another random integer (which is chosen from a fixed range, large enough that it is difficult for adversaries to infer what the range might be). If this random number is greater than the user’s number, then version A is the real one; if it is smaller, then version B is. (This is a very simple version of Shamir’s secret sharing scheme, and all of the more sophisticated versions, including updating regimes, can be slotted in here.)
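The user's side of this rule is small enough to sketch. The range size, the function names and the handling of an exact tie are assumptions; the description above does not cover the tie, nor how the system arranges for the number it sends to be consistent with the truth, so that part is not modelled here.

```python
import secrets

RANGE = 1_000_000  # hypothetical fixed range for the random integers

def issue_challenge():
    """Random integer sent to the user at logon (server side, simplified)."""
    return secrets.randbelow(RANGE)

def real_variant(challenge, user_secret):
    """Decode which name variant is real, using the user's offline integer.
    Assumption: a tie is treated as 'B'; the source leaves this case open."""
    return "A" if challenge > user_secret else "B"

user_secret = 424_242            # held offline by the user, never stored online
challenge = issue_challenge()
variant = real_variant(challenge, user_secret)   # "A" or "B"
```

Someone who sees only the challenge learns nothing about which variant is real without the offline integer.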

A user cannot infer any other user’s offline secret; nor the range of the random numbers (although an adversary can know this since they can steal the code that implements it); and knowing someone else’s offline secret adds nothing. Each user’s offline secret can be changed at any time, even without any online consultation if the range of random numbers is allowed to be known in the offline world. The system itself can permute file names or make the apparent file names user-dependent with a few tweaks of the way in which numbers are generated. More complex secret-sharing can require more than one user to share their offline secrets to enable access to the true versions of particular files.

This looks, at first glance, like a lot of work. But the costs of our current security schemes are non-trivial, and both cycles and storage are relatively cheap. This scheme even makes it possible to use clouds again, something that has been pretty effectively torpedoed by the revelations of the level of interception in Five Eyes countries in the past week.

You may also be interested in ‘fertilizer’ or ‘last minute flights’

or ‘7 amazing ways to remove explosive residue’.

As I mentioned in my last post, the online-advertising businesses are spending as much time building models of us all as the NSA is spending building models of violent extremists, and have access to more data.

So how are they doing? If we looked at the ads being served to people like the Tsarnaev brothers, would we find that these businesses have (unwittingly) built usable models of lone-wolf violent extremists, so that the pattern of ads served to such people is actually a signal of their potential for violence? There seems at least a decent chance that they have, and maybe this should be followed up.

Government signals intelligence versus multinationals

In all of the discussion about the extent to which the U.S. NSA is collecting and analyzing data, the role of the private sector in similar analysis has been strangely neglected.

Observe, first, that none of the organizations that were asked to provide data to the NSA had to do anything special to do so. Verizon, the proximate example, was required to provide, for every phone call, the originating and destination numbers, the time, the duration, and the cell tower(s) involved for mobile calls, and all of this information was already collected. Why would they collect it, if not to have it available for their own analysis? It isn’t for billing: part of the push to envelope pricing plans was to save the costs of producing detailed bills, for which the cost was often greater than the cost of completing the call itself.

Second, government signals intelligence is constrained in the kind of data they are permitted to collect: traffic analysis (metadata) for everyone, but content only for foreign nationals and those specifically permitted by warrants for cause. Multinationals, on the other hand, can collect content for everyone. If you have a gmail account (I don’t), then Google not only sees all of your email traffic, but also sees and analyzes the content of every email you send and receive. If you send an email to someone with a gmail account, the content of that email is also analyzed. Of course, Google is only one of the players; many other companies have access to emails, other online communications (IM, Skype), and search histories, including which link(s) in the search results you actually follow.

A common response to these differences is something like “Well, I trust large multinationals, but I don’t trust my government”. I don’t really understand this argument; multinationals are driven primarily (only?) by the need for profits. Even when they say that they will behave well, they are unable to carry out this promise. A public company cannot refrain from taking actions that will produce greater profits, since its interests are the interests of its shareholders. And, however well meaning, when a company is headed for bankruptcy and one of its valuable assets is data and models about millions of people, it’s naive to believe that the value of that asset won’t be realized.

Another popular response is “Well, governments have the power of arrest, while the effect of multinationals is limited to the commercial sphere”. That’s true, but in Western democracies at least it’s hard for governments to exert their power without inviting scrutiny from the judicial system. At least there are checks and balances. If a multinational decides to exert its power, there is much less transparency and almost no mechanism for redress. For example, a search engine company can downweight my web site in results (this has already been done) and drive me out of business; an email company can lose all of my emails or pass their content to my competitors. I don’t lose my life or my freedom, but I could lose my livelihood.

A third popular response is “Well, multinationals are building models of me so that they can sell me things that are better aligned with my interests”. This is, at best, a half-truth. The reason they want a model of you is so that they can try to sell you things you might be persuaded to buy, not things that you should or want to buy. In other words, the purpose of targeted advertising is at least to get you to buy more than you otherwise would, and to buy the highest profit margin version of things you might actually want to buy. Your interests and the interests of advertisers are only partially aligned, even when they have built a completely accurate model of you.

Sophisticated modelling from data has its risks, and we’re still struggling to understand the tradeoffs between power and consequences and between cost and effectiveness. But, at this moment, the risks seem to me to be greater from multinational data analysis than from government data analysis.

Questions are data too

In the followup investigation of the Boston Marathon bombings, we see again the problem that data analytics has with questions.

Databases are built to store data. But, as Jeff Jones has most vocally pointed out, simply keeping the data is not enough in adversarial settings. You also need to keep the questions, and treat them as part of the ongoing data. The reason is obvious once you think about it — intelligence analysts need not only to know the known facts; they also need to know that someone else has asked the same question they just asked. Questions are part of the mental model of analysts, part of their situational awareness, but current systems don’t capture this part and preserve it so that others can build on it. In other words, we don’t just need to connect the dots; we need to connect the edges!

Another part of this is that, once questions are kept, they can be re-asked automatically. This is immensely powerful. At present, an analyst can pose a question (“has X ever communicated with Y?”), get a negative answer, only for information about such a communication to arrive a microsecond later and not be noticed. In fast changing environments, this can happen frequently, but it’s implausible to expect analysts to remember and re-pose their questions at intervals, just in case.
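A minimal sketch of keeping questions as data might look like the following. The class name, its methods and the restriction to a single kind of question ("has X ever communicated with Y?") are all assumptions made to keep the example short.

```python
class QuestionStore:
    """Keeps facts and the questions asked about them (illustrative only)."""

    def __init__(self):
        self.facts = set()       # unordered pairs known to have communicated
        self.questions = []      # (analyst, x, y) in the order asked

    def ask(self, analyst, x, y):
        """Record the question, then answer it from the facts known so far."""
        self.questions.append((analyst, x, y))
        return frozenset((x, y)) in self.facts

    def who_asked(self, x, y):
        """Analysts who have posed this question, so others can build on it."""
        pair = frozenset((x, y))
        return [a for a, qx, qy in self.questions if frozenset((qx, qy)) == pair]

    def add_communication(self, x, y):
        """Store a new fact and re-ask stored questions against it.
        Returns the analysts whose earlier question it now answers."""
        self.facts.add(frozenset((x, y)))
        return self.who_asked(x, y)

store = QuestionStore()
first_answer = store.ask("alice", "X", "Y")     # nothing known yet
to_notify = store.add_communication("Y", "X")   # arrives later; alice is alerted
```

The point of the design is that the negative answer is not the end of the question: when the matching fact arrives, the stored question fires without the analyst having to remember to re-pose it.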

We still have some way to go with the tools and techniques available for intelligence analysis.