Posts Tagged 'data mining'

And so it begins

Stories out today that Google is now able to connect the purchasing habits of anyone it has a model for (i.e. almost everybody who’s ever been online) with Google’s own data on online activity.

For example, this story:

Google says that this enables them to connect the ads that users have been shown with the products that they buy. There’s a discrepancy in this story, though, because Google also claim that they don’t get the list of products purchased using a credit card, only the total amount. So a big hmmmmm.

(And if I were Google, I’d be concerned that there isn’t much of a link! Consumers might be less resentful if Google did indeed serve ads for things they wanted to buy, but everyone I’ve ever heard talk about online ads says the same thing: the ads either have nothing to do with their interests, or they are ads for things that they just bought.)

But connecting users to purchases (rather than ads to purchases) is the critical step to building a model of how much users are willing to pay — and this is the real risk of multinational data collection and analytics (as I’ve discussed in earlier posts).

“But I don’t have anything to hide”

This is the common response of many ordinary people when the discussion of (especially) government surveillance programs comes up. And they’re right, up to a point. In a perfect world, innocent people have nothing to fear from government.

The bigger problem, in fact, comes from the data collected and the models built by multinational businesses. Everyone has something to hide from them: the bottom line prices we are willing to pay.

We have not yet quite reached the world of differential pricing. We’ve become accustomed to the idea that the person sitting next to us on a plane may have paid (much) less for the identical travel experience, but we haven’t quite become reconciled to the idea that an online retailer might be charging us more for the same product than they charge other people, let alone that the chocolate bar at the corner store might be more expensive for us. If anything, we’re inclined to think that an organisation that has lots of data about us and has built a detailed model of us might give us a better price.

But it doesn’t require too much prescience to see that this isn’t always going to be the case. The seller’s slogan has always been “whatever the market will bear”.

Any commercial organization, under the name of customer relationship management, is building a model of your predicted net future value. Their actions towards you are driven by how large this is. Any benefits and discounts you get now are based on the expectation that, over the long haul, they will reap the converse benefits and more. It’s inherently an adversarial relationship.
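
A toy version of such a model (purely illustrative; the function and the numbers are my own invention, not any vendor’s actual CRM formula) treats predicted net future value as the discounted sum of expected margins, weighted by the chance the customer is still around:

```python
# Illustrative sketch only: predicted net future value as discounted,
# retention-weighted expected margins. Names and numbers are invented.

def predicted_net_future_value(margins, retention, discount_rate):
    """margins: expected profit per period; retention: probability the
    customer stays for another period; discount_rate: per-period rate."""
    value = 0.0
    survival = 1.0
    for t, margin in enumerate(margins, start=1):
        survival *= retention          # chance the customer is still active
        value += survival * margin / (1 + discount_rate) ** t
    return value

# A customer expected to yield $100/year for 5 years, with 80% annual
# retention and a 10% discount rate, is "worth" about $212 today:
value = predicted_net_future_value([100] * 5, 0.8, 0.10)
```

Every discount offered to you now is a bet that this number, less the discount, stays comfortably positive.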

Now think about the impact of data collection and modelling, especially with the realization that everything collected is there for ever. There’s no possibility of an economic fresh start, no bankruptcy of models that will wipe the slate clean and let you start again.

Negotiation relies on the property that each party holds back their actual bottom line. In a world where your bottom line is probably better known to the entity you’re negotiating with than it is to you, can you ever win? Or even win-win? Now tell me that you have nothing to hide.

[And, in the ongoing discussion of post-Snowden government surveillance, there’s still this enormous blind spot about the fact that multinational businesses collect electronic communication, content and metadata; location; every action on portable devices and some laptops; complete browsing and search histories; and audio around any of these devices. And they’re processing it all extremely hard.]

Refining “Data Science”

Regular readers will know that I have been thinking about the constellation of ideas that are getting a lot of play in universities and the research community around words like “data science” and “big data”, and especially the intersection of these ideas with the other constellation of “data mining”, “knowledge discovery” and “machine learning”.

I’ve argued that inductive model discovery (which I think is the core of all of these ideas) is a new way of doing science that is rapidly replacing the conventional Enlightenment or Popperian view of science. This is happening especially quickly in fields that struggled to apply the conventional scientific method, especially in medicine, the social “sciences”, business schools, and in the humanities.

Attending the International Conference on Computational Science made me realise, however, that computational science is a part of this story as well.

Here’s how I see the connections between these three epistemologies:

  1. Conventional science. Understand systems via controlled experiments: setting up configurations that differ in only a few managed ways and seeing whether those differences correspond to different system behaviours. If they do, construct an “explanation”; if they don’t, it’s back to the drawing board.
  2. Computational science. Understand systems by building simulations of them and tweaking the simulations to see if the differences are those that are expected from the tweaks. (Simulations increase the range of systems that can be investigated when either the tweaks can’t be done on the real system, or when the system is hypothesised to be emergent from some simpler pieces.)
  3. Data science. Understand systems by looking at the different configurations that naturally occur and seeing how these correspond to different system behaviours. When they do, construct an “explanation”.
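
To caricature the third approach in code: instead of arranging configurations, we take (configuration, behaviour) observations as they happened to occur and fit a model that serves as the candidate “explanation”. The data and the least-squares fit below are invented for illustration:

```python
# Fit y = a + b*x to observations we did not design -- they just occurred.

def fit_line(points):
    """Ordinary least squares over observed (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return mean_y - slope * mean_x, slope  # intercept, slope

# Naturally occurring configurations and the behaviours seen with them:
observations = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
intercept, slope = fit_line(observations)
# The fitted line is the inductive "explanation" of the system.
```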

In other words, conventional science pokes the system being investigated in careful ways and sees how it reacts; computational science creates a replica of the system and pokes that; and data science looks at the system being poked and tries to match the reactions to the poking.

Underlying these differences in approach are also, of course, differences in validation: how one tells whether an explanation is sufficient. The first two both start from a hypothesis and use statistical machinery to decide whether the hypothesis is supported sufficiently strongly. The difference is that computational science has more flexibility to set up controlled experiments and so, all things considered, can get stronger evidence. (But there is always the larger question of whether the simulation actually reproduces the system of interest: critical, but often ignored, and with huge risks of “unknown unknowns”.) Data science, in contrast, validates its models of the system being studied by approaches such as the use of a test set: a component of the system that was not used to build the model, but which should behave as the original system did. It is also buttressed by the ability to generate multiple models and so compare among them.
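
A minimal sketch of the test-set style of validation (synthetic data, invented for illustration): the model is built from the training portion only, and judged on a held-out portion it never saw:

```python
import random

random.seed(0)
# Synthetic system: behaviour = 2 * configuration + noise.
data = [(x, 2 * x + random.gauss(0, 0.1)) for x in range(100)]
random.shuffle(data)
train, test = data[:70], data[70:]   # the test set plays no part in modelling

# Crude "model": a single ratio estimated from the training data only.
slope = sum(y for _, y in train) / sum(x for x, _ in train)

# Validation: the model should behave as the original system did.
mean_error = sum(abs(y - slope * x) for x, y in test) / len(test)
```

If the error on data the model never touched is small, the model has earned some trust; if not, it is back to the drawing board.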

Data science is advancing on two fronts: first, the flexibility it provides to conventional science not to have to construct carefully balanced controlled experiments; second, and much more significantly, the opportunity it creates for making scientific progress in the social sciences and humanities, replacing “qualitative” by “quantitative” in unprecedented ways.

Why Data Science?

Data Science has become a hot topic lately. As usual, there’s not a lot of agreement about what data science actually is. I was on a panel last week, and someone asked afterwards what the difference was between data mining, which we’ve been doing for 15 years, and data science.

It’s a good question. Data science is a new way of framing the scientific enterprise in which a priori hypothesis creation is replaced by inductive modelling; and this is exactly what data mining/knowledge discovery is about (as I’ve been telling my students for a decade).

What’s changed, perhaps, is that scientists in many different areas have realised the existence and potential of this approach, and are commandeering it for their own.

I’ve included the slides from a recent talk I gave on this subject (at the University of Technology Sydney).

And once again let me emphasise that the social sciences and humanities did not really have access to the Enlightenment model of doing science (because they couldn’t do controlled experiments), but they certainly do have access to the new model. So expect a huge development in data social science and data humanities as soon as research students with the required computational skills move into academia in quantity.

Why data science (ppt slides)

More subtle lessons from the Sony hack

There are some obvious lessons to learn from the Sony hack: perimeter defence isn’t much use when the perimeter has thousands of gates in it (it looks as if the starting point was a straightforward spearphishing attack); and if you don’t compartmentalise your system inside the perimeter, then anyone who gets past it has access to everything.

But the less obvious lesson has to do with the difference between our human perception of the difficulties of de-anonymization and aggregation, and the actual power of analytics to handle both. For example, presumably Sony kept data on their employees’ health in properly protected, HIPAA-compliant storage, but there were occasional emails that mentioned individuals and their health status. The people sending these emails presumably didn’t feel that any particular one was a breach of privacy; the private content in each one was small. But they failed to realise that all of these emails get aggregated, at least in backups. So now all of those little bits of information are in one place, and the risk of building significant models from them has increased substantially.
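
A small sketch of why the aggregation matters (the messages and keyword list here are invented): no single message says much, but pooled together they assemble a health profile:

```python
from collections import Counter

# Invented example messages; each one leaks only a fragment.
emails = [
    "Alice is out for surgery next week",
    "covering for Alice while she recovers",
    "Alice back from the hospital today",
]
HEALTH_TERMS = {"surgery", "recovers", "hospital"}

profile = Counter()
for message in emails:
    words = set(message.lower().split())
    if "alice" in words:                 # the fragment names an individual
        profile.update(words & HEALTH_TERMS)
# Individually innocuous; aggregated, a three-point health history.
```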

Anyone with analytic experience and access to a large number of emails can find structures that are decidedly non-obvious; but this is far from intuitive to the public at large, and hence to Sony executives.

We need to learn to value data better, and to understand in a deep way that the value of data increases superlinearly with the amount that is collected into a single coherent unit.
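
One simple way to see the superlinearity (an illustration, not a full argument): much of the value of pooled data lies in the links between records, and n records admit n(n-1)/2 potential pairwise links, so doubling a collection roughly quadruples its linkable structure:

```python
def pairwise_links(n):
    """Potential pairwise connections among n records: n choose 2."""
    return n * (n - 1) // 2

thousand = pairwise_links(1000)       # 499,500 potential links
two_thousand = pairwise_links(2000)   # 1,999,000: about 4x, not 2x
```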

Making recommendations different enough

One of the major uses of data analytics in practice is to make recommendations, either explicitly or implicitly. This is one area where the interests of marketers and the interests of consumers largely run together. If I want to buy something (a product or access to an experience such as listening to a song or seeing a movie) I’d just as soon buy something I actually want, so a seller who can suggest something that fits has a better chance of getting my business than one that presents a set of generic choices. Thus businesses build models of me to try to predict what I am likely to buy.

Some of these businesses are middlemen, and what they are trying to predict is what kind of ads to show me on behalf of other businesses. Although this is a major source of revenue for web-based businesses, I suspect it to be an edge-case phenomenon: that is, the only people who actually see ads on the web are people who are new to the web (and there are still lots of them every day), while those who’ve been around for a while develop blindness to everything other than content they actually want. Businesses continually develop new ways to make their ads more obtrusive but, having trained us to ignore ads, they often find these efforts backfire.

Other businesses use permission marketing, which they can do because they already have a relationship. This gives them an inside track — they know things about their customers that are hard to get from more open sources, such as what they have actually spent money on. When they can analyse the data available to them effectively, they are able to create the win-win situation where what they want to sell me is also what I want to buy.

But there’s a huge hole in the technology that represents an opportunity at least as large as that on which Google was founded: how new and different should the suggestion be?

For example, if you buy a book from Amazon by a popular author, your recommendation list is populated by all of the other books by that same author, the TV programs based on that author’s books, the DVDs of the TV shows, and on and on. This is largely a waste of the precious resource of quality attention. Most humans are capable of figuring out that, if they like one book by an author, they may well like others. (In fact, sites like Amazon are surprisingly bad at figuring out that, when an author writes multiple series featuring different characters and settings, an individual might like one series but not necessarily the others.)

So what would be better? The goal, surely, is to suggest products that are similar to the ones the individual has already liked or purchased, but also sufficiently different that that individual would not necessarily have noticed them. In other words, current systems present products in a sphere around the existing product, but what they should do is present products in an annulus around the existing product. Different, but not too different.
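
As a sketch of the annulus idea (the item vectors and the distance measure are placeholders I have invented; deciding what product similarity means is the hard part): keep candidates whose distance from the seed item falls in a band, excluding both near-duplicates and the truly unrelated:

```python
import math

def distance(a, b):
    """Euclidean distance between two (placeholder) item vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def annulus_recommend(seed, catalogue, d_min, d_max):
    """Recommend items that are similar, but not too similar."""
    return [name for name, vec in catalogue.items()
            if d_min < distance(seed, vec) <= d_max]

catalogue = {
    "same author, same series": (1.0, 1.0),   # a near-duplicate
    "same author, new series":  (1.5, 1.2),   # different, not too different
    "different genre entirely": (9.0, 8.0),   # too different
}
picks = annulus_recommend((1.0, 1.1), catalogue, d_min=0.3, d_max=3.0)
```

The sphere of current systems is the special case d_min = 0; everything interesting lives in choosing the inner radius.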

This is surprisingly difficult to do. Deciding what similarity means is already a difficult problem; deciding what “just enough dissimilarity” means has been, so far, a too difficult problem. But what an opportunity!