
The right level of abstraction = the right level to model

I think the takeaway from my last post is that we should model systems at the right level of abstraction, where that right level corresponds to the places where there are bottlenecks. These bottlenecks are places where, as we zoom out in terms of abstraction, the system suddenly seems simpler. The underlying differences don’t actually make a difference; they are just variation.

The difficulty is that it’s really, really hard to see or decide where these bottlenecks are. We rightly laud Newton for seeing that a wide range of different systems could all be described by a single equation; but it’s also true that Einstein showed that this apparent simplicity was actually an approximation for a certain (large!) subclass of systems, and so the sweet spot of system modelling isn’t quite where Newton thought it was.

For living systems, it’s even harder to see where the right level of abstraction lies. Linnaeus (apparently the most-cited human) certainly created a model that was tremendously useful, working at the level of the species. This model has frayed a bit with the advent of DNA technology, since the clusters from observations don’t quite match the clusters from DNA, but it was still a huge contribution. But it’s turning out to be very hard to figure out the right level of abstraction to capture ideas like “particular disease” or “particular cancer”, even though we can diagnose them quite well. The variations in what’s happening in cells are extremely difficult to map to what seems to be happening in the disease.

For human systems, the level of abstraction is even harder to get right. In some settings, humans are surprisingly sheep-like and broad-brush abstractions are easy to find. But dig a little, and it all falls apart into “each person behaves as they like”. So predicting the number of “friends” a person will have on a social media site is easy (it will be distributed around Dunbar’s number), but predicting whether or not they will connect with a particular person is much, much harder. Does advertising work? Yes, about half of it (as Ogilvy famously said). But will this ad influence this person? No idea. Will knowing the genre of this book or film improve the success rate of recommendations? Yes. Will it help with this book and this person? Not so much.

Note the connection between levels of abstraction and clustering. In principle, if you can cluster (or, better, bicluster) data about your system and get (a) strong clusters, and (b) not too many of them, then you have some grounds for saying that you’re modelling at the right level. But this approach founders on the details: which attributes to include, which algorithm to use, which similarity measure, which parameters, and so on and on.
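
As a purely illustrative sketch of that clustering test (nothing here comes from the post itself: scikit-learn, k-means, the toy data, and the silhouette score are all my assumptions), one could sweep over candidate numbers of clusters and ask whether there is a small number at which the clusters are strong:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    # Toy stand-in for "data about your system": three loose groups in 2-D.
    X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in (0.0, 3.0, 6.0)])

    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)  # closer to 1 means stronger clusters
        print(f"k={k}  silhouette={score:.2f}")

A clear peak at a small k is weak evidence that there is a natural level of abstraction there; no peak at all is a hint that the system may not simplify cleanly at any level.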

Government signals intelligence versus multinationals

In all of the discussion about the extent to which the U.S. NSA is collecting and analyzing data, the role of the private sector in similar analysis has been strangely neglected.

Observe, first, that none of the organizations that were asked to provide data to the NSA had to do anything special to do so. Verizon, the proximate example, was required to provide, for every phone call, the originating and destination numbers, the time, the duration, and the cell tower(s) involved for mobile calls; all of this information was already being collected. Why would they collect it, if not to have it available for their own analysis? It isn’t for billing: part of the push to envelope pricing plans was to save the cost of producing detailed bills, which was often greater than the cost of completing the calls themselves.

Second, government signals intelligence agencies are constrained in the kinds of data they are permitted to collect: traffic analysis (metadata) for everyone, but content only for foreign nationals and those specifically covered by warrants issued for cause. Multinationals, on the other hand, can collect content for everyone. If you have a Gmail account (I don’t), then Google not only sees all of your email traffic, but also sees and analyzes the content of every email you send and receive. If you send an email to someone with a Gmail account, the content of that email is also analyzed. Of course, Google is only one of the players; many other companies have access to emails, other online communications (IM, Skype), and search histories, including which link(s) in the search results you actually follow.

A common response to these differences is something like “Well, I trust large multinationals, but I don’t trust my government”. I don’t really understand this argument; multinationals are driven primarily (perhaps only) by the need for profits. Even when they say that they will behave well, they are unable to carry out this promise. A public company cannot refrain from taking actions that will produce greater profits, since its interests are the interests of its shareholders. And, however well meaning a company may be, when it is headed for bankruptcy and one of its valuable assets is data and models about millions of people, it’s naive to believe that the value of that asset won’t be realized.

Another popular response is “Well, governments have the power of arrest, while the effect of a multinational is limited to the commercial sphere”. That’s true, but in Western democracies at least it’s hard for governments to exert their power without inviting scrutiny from the judicial system. At least there are checks and balances. If a multinational decides to exert its power, there is much less transparency and almost no mechanism for redress. For example, a search engine company can downweight my web site in its results (this has already been done) and drive me out of business; an email company can lose all of my emails or pass their content to my competitors. I don’t lose my life or my freedom, but I could lose my livelihood.

A third popular response is “Well, multinationals are building models of me so that they can sell me things that are better aligned with my interests”. This is, at best, a half-truth. The reason they want a model of you is so that they can try to sell you things you might be persuaded to buy, not things that you should or want to buy. In other words, the purpose of targeted advertising is, at a minimum, to get you to buy more than you otherwise would, and to buy the highest-profit-margin version of things you might actually want to buy. Your interests and the interests of advertisers are only partially aligned, even when they have built a completely accurate model of you.

Sophisticated modelling from data has its risks, and we’re still struggling to understand the tradeoffs between power and consequences and between cost and effectiveness. But, at this moment, the risks seem to me to be greater from multinational data analysis than from government data analysis.

What can be learned from text III

Another property that can be learned from text is the author’s attitude to whatever the text is about. This is called, variously, sentiment analysis or appraisal theory. For obvious reasons, it has always been interesting to advertisers and marketers.

In its simplest form, it just analyzes text for associations of adjectives with the nouns of interest, for example films or people. This could be as simple as seeing whether the adjective “good” or “bad” appears near the noun(s) in question. It is not too difficult to extend this to other sets of adjectives that can be considered positive or negative: “the movie was exciting” (good), or “the movie was boring” (bad).
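
As a minimal sketch of this simplest form (the word lists, window size, and function name are invented for illustration, not taken from any real system):

    # Toy proximity-based sentiment: count positive and negative adjectives
    # that occur within a small window around the noun of interest.
    POSITIVE = {"good", "exciting", "great", "enjoyable"}
    NEGATIVE = {"bad", "boring", "awful", "tedious"}

    def crude_sentiment(text, noun, window=4):
        tokens = [t.strip(".,!?;:").lower() for t in text.split()]
        score = 0
        for i, t in enumerate(tokens):
            if t != noun:
                continue
            nearby = tokens[max(0, i - window): i + window + 1]
            score += sum(w in POSITIVE for w in nearby)
            score -= sum(w in NEGATIVE for w in nearby)
        return score  # positive suggests approval, negative disapproval

    print(crude_sentiment("The movie was exciting but the ending was bad.", "movie"))

Widening the window in this sketch soon attaches “bad” (which is about the ending) to “movie” as well, which is exactly the robustness problem described next.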

However, this process is not quite as easy as it looks. First of all, it’s hard in languages like English to be sure which adjective goes with which noun — proximity in the sentence is often used, but this is not very robust: “Although parts of the movie were good, overall it was bad” is not a positive comment about the movie.

Second, authors often use devices such as irony and sarcasm which look, syntactically, as if they are giving one opinion, but are actually giving the opposite opinion. Humans figure this out using deep background knowledge about the situation and about human mental life, so it’s difficult for an algorithm to mimic this level of understanding.

Third, texts often comment about the parts of an object as well as the whole object, and it becomes difficult to decide which adjectives go with which parts.

There are three levels of algorithmic analysis used for this problem:

  1. Using simple sets of opinion adjectives (and maybe other words) and trying to associate them with the nouns of interest using proximity, perhaps with a little extra sophistication such as trying to pick out dependent clauses.
  2. Parsing the text more deeply and using natural language analysis techniques to associate opinion words with the nouns of interest.
  3. Using systemic functional linguistics approaches, which treat language generation as a goal-driven task by an individual in a societal setting, as well as a technology.

These levels are arranged in increasing order of sophistication, and also of complexity. However, even the best algorithms perform only at the 80% or so level, and that’s only capturing relatively unsophisticated judgements.
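
As an illustration of the second level (a sketch only, assuming spaCy and its small English model are installed; the toy word lists are mine, and real systems use much richer lexicons and grammars), a dependency parse can be used to attach each opinion adjective to the noun it actually modifies rather than to whatever happens to be nearby:

    import spacy

    POSITIVE = {"good", "exciting"}
    NEGATIVE = {"bad", "boring"}

    nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm

    def opinions(text):
        doc = nlp(text)
        found = []
        for token in doc:
            if token.pos_ != "ADJ":
                continue
            if token.dep_ == "amod":
                # "the boring movie": adjective directly modifies the noun
                noun = token.head
            elif token.dep_ == "acomp":
                # "the movie was boring": adjective is the complement of the
                # verb, so the noun is that verb's subject
                subjects = [c for c in token.head.children if c.dep_ == "nsubj"]
                noun = subjects[0] if subjects else None
            else:
                continue
            if noun is not None:
                polarity = ("+" if token.lower_ in POSITIVE
                            else "-" if token.lower_ in NEGATIVE else "?")
                found.append((noun.text, token.text, polarity))
        return found

    print(opinions("Although parts of the movie were good, overall it was bad."))

Even with a correct parse, the second opinion in the example attaches to the pronoun “it”; deciding that “it” refers to the movie needs coreference resolution and background knowledge, which is part of why performance tops out where it does.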

There are obvious applications of sentiment analysis in adversarial situations: trying to decide whether a terrorist group pronouncement or a threat represents a genuine opinion by the author or some form of propaganda; and who the propaganda might be aimed at.