Predicting fraud risk from customer online properties

This interesting paper investigates how well a digital footprint (properties associated with online interactions with a business, such as platform and time of day) can be used to predict the risk of non-payment for a pay-on-delivery shopping business.

The properties that are predictive are not, by themselves, all that surprising: those who shop in the middle of the night are higher risk, those who come from price comparison websites are lower risk, and so on.

What is surprising is that, overall, predictive performance rivals, and perhaps exceeds, risk prediction from FICO (i.e. credit scores) — but these properties are much easier to collect, and the model based on them can be applied to those who don’t have a credit score. What’s more, the digital footprint and FICO-based models are not very correlated, and so using both does even better.
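
The value of combining two weakly correlated scores can be sketched numerically. This is a minimal illustration with invented numbers, not the paper's data or model: two individually imperfect, weakly correlated risk scores, simply averaged, rank defaulters better than either alone, as measured by AUC.

```python
# Toy illustration (all numbers invented): averaging two weakly correlated
# risk scores improves ranking quality (AUC) over either score alone.

def auc(scores, labels):
    """Probability that a randomly chosen positive outranks a randomly
    chosen negative (ties count half) -- the AUC of a score as a ranker."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 = defaulted, 0 = paid.  Two imperfect, imperfectly correlated risk scores:
labels    = [1, 1, 1, 1, 0, 0, 0, 0]
footprint = [0.9, 0.4, 0.8, 0.3, 0.7, 0.2, 0.1, 0.5]  # e.g. night-time shopping
fico_risk = [0.3, 0.9, 0.2, 0.8, 0.1, 0.6, 0.7, 0.4]  # inverted credit score

combined = [(a + b) / 2 for a, b in zip(footprint, fico_risk)]

print(auc(footprint, labels))  # 0.75
print(auc(fico_risk, labels))  # 0.625
print(auc(combined, labels))   # 1.0 -- the errors of the two scores cancel
```

Because the two scores err on different customers, the average separates the classes perfectly here; with correlated scores the gain would be much smaller.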

The properties that make up the digital footprint are so easy to collect that almost any online business or government department could (and should) be using them to get a sense of their customers.

I’ve heard (but can’t find a reference) that Australian online insurance quotes vary in price based on what time of day they are requested — I suppose based on an intuition that procrastination is correlated with risky behaviour. I’d be grateful if anyone has details of any organisation that is using this kind of predictive model for their customers.

Businesses like Amazon require payment up front — but they also face a range of risks, including falsified payments (stolen credit cards) and delivery hijacking. Approaches like this one might well help in detecting these other kinds of fraud.


Blockchain — hype vs reality

I was at a Fintech conference last week where (not surprisingly) there was a lot of talk about blockchains and their potential impact in the banking and insurance sectors. There seem to me to be two issues that don’t get enough attention (although wiser heads than mine have written about both):

  1. The current implementations are much too heavy-weight to survive. I’ve seen claims that something like 0.2% of the entire world’s energy already goes to driving blockchains; response times are too slow for many (perhaps most) real applications; and the data sizes that can be preserved on a blockchain are too small. This seems like a situation where being the second or third mover is a huge advantage. For example, the Algorand approach seems to have substantial advantages over current blockchain implementations.
  2. Disintermediation replaces the requirement to trust an intermediary with a requirement to trust the software developers who implement the interfaces to the blockchain. There are known examples where deposits to a cryptocurrency have permanently disappeared as the result of software bugs in the deposit code (video of a talk by Brendan Cordy), and this will surely get worse as blockchains are used for more complex things. The fundamental issue is that using a blockchain requires atomic commit operations, and software engineering is not really up to the challenge (although this is an interesting area where formal methods might make a difference).
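
The atomicity problem in point 2 can be illustrated with a deliberately broken sketch. This is an invented example, not the actual bug from the talk: a deposit routine that debits one ledger and credits another in two separate, non-atomic steps, with a simulated crash in between.

```python
# Invented illustration of a non-atomic deposit: if the process dies between
# the debit and the credit, the funds vanish from both ledgers.

class Chain:
    """A stand-in for a blockchain account ledger."""
    def __init__(self):
        self.balances = {}
    def credit(self, addr, amount):
        self.balances[addr] = self.balances.get(addr, 0) + amount

def unsafe_deposit(bank, chain, addr, amount, crash=False):
    bank[addr] -= amount          # step 1: debit the bank-side ledger
    if crash:                     # simulate a bug or crash between the steps
        raise RuntimeError("process died mid-deposit")
    chain.credit(addr, amount)    # step 2: credit the chain -- never runs

bank = {"alice": 100}
chain = Chain()
try:
    unsafe_deposit(bank, chain, "alice", 40, crash=True)
except RuntimeError:
    pass

print(bank["alice"])                   # 60 -- already debited
print(chain.balances.get("alice", 0))  # 0  -- never credited: 40 units gone
```

A correct implementation would need the two steps wrapped in an atomic commit (or an idempotent retry protocol), which is exactly the engineering discipline the paragraph above argues is routinely missing.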

Cash and money laundering

Although privacy advocates often favour continuing a cash economy, it’s becoming clear that cash is disproportionately where bad things happen.

There are two ways in which cash is used in the black (and often criminal) economy. The first is as the basis of a barter economy — those involved buy and sell almost exclusively in cash, and their activities never get onto the radar of banks, taxation departments, or financial intelligence units. There’s not much that can be done about this, except that those who live in the cash economy usually have a lifestyle well beyond their official income, so tax authorities might take an interest.

Cash is much more usable if it can be converted into electronic currency in banks. The number of ways this can be done is steadily diminishing. In Canada, banks and other financial institutions are required to report cash deposits above $10,000 unless they can show that they come from routine business activities (and there are quite specific rules about what this means). There are some loopholes, but not many and not big.

Australia has just taken steps towards banning cash deposits above $10,000 altogether. The uproar has been revealing.

One operator of an armoured car business complained to the media that he was moving ~$5 million a month in cash, about half from car dealers, and that this would ruin his business. The treasurer’s response was, roughly speaking, “Good”. (Businesses that sell transport already have quite strong restrictions about their cash activities in many countries.)

Also in Australia, the proportion of cash transactions dropped to 37% by 2016, with a corresponding drop in the total value they represent. However, the ratio of physical currency to GDP is at an all time high; there is $3000 in circulation for every Australian.

There are either some people with vast sums stashed under their mattresses, or this money is being used mostly by criminals. (Hint: it’s the latter.)

It’s no wonder that many countries are trying to be more aggressive about reducing the cash in circulation, mostly by removing high-denomination notes (because more value can be packed into a tighter space).

But it also seems clear that banks can’t resist the lure of large cash deposits, however much they should. And as long as banks don’t tell them, countries’ financial intelligence units can’t do a lot about it.


The Sound of Music Problem

Recommender and collaborative filtering systems all try to recommend new possibilities to each user, based on the global structure of what all users like and dislike. Underlying such systems is an assumption that we are all alike, just in different ways.

One of the thorny problems of recommender systems is the Sound of Music Problem — there are some objects that are liked by almost everybody, although not necessarily very strongly. Few people hate “The Sound of Music” (TSOM) film, but it’s more a tradition than a strong favourite. Objects like this — seen/bought by many, rated positively but not strongly — cause problems in recommender systems because they create similarities between many other objects that are stronger than ‘they ought to be’. If A likes an object X and TSOM, and B likes Y and TSOM, then TSOM makes A and B seem more similar than they would otherwise be, and also makes X and Y seem more similar as a consequence. When the common object is really common, this effect seriously distorts the quality of the recommendations.

For example, Netflix’s list of recommendations for each user has substantial overlap with its ‘Trending Now’ list — in other words, Netflix is recommending, to almost everyone, the same set of shows that almost everyone is watching. Nor is this problem specific to Netflix: Amazon similarly recommends globally popular items. Less obviously, Google’s page ranking algorithm produces a global ranking of all pages (more or less), from which the ones containing the given search terms are selected.
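
The A/B inflation effect takes only a few lines to demonstrate. A toy sketch with invented ratings (not any real system’s data): a single universally-liked item raises the cosine similarity between two users who otherwise share nothing.

```python
# Toy demonstration: one universally-liked item (TSOM) inflates the
# cosine similarity between two users with no other overlap.

import math

def cosine(u, v):
    """Cosine similarity of two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Columns: [X, Y, TSOM]; 1 = liked.
a = [1, 0, 1]   # user A likes X and The Sound of Music
b = [0, 1, 1]   # user B likes Y and The Sound of Music

print(cosine(a, b))          # about 0.5 -- they share only TSOM
print(cosine(a[:2], b[:2]))  # 0.0 -- drop TSOM and they share nothing
```

One standard family of fixes is to down-weight very common items before computing similarities (an inverse-user-frequency weighting, analogous to IDF in text retrieval), which shrinks TSOM’s contribution towards zero.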

The property of being a TSOM-like object is related to global popularity, but it isn’t quite the same. Such an object must have been seen/bought by many people, but it is its place in the network of objects that causes the trouble, not its raw frequency of positive ratings.

Solutions to this problem have been known for a while; for example, Matt Brand’s paper outlines a scalable approach that works well in practice, and this work deserves to be better known. It’s a bit of a mystery why such solutions haven’t been used by all of the businesses that do recommendation, and so why we live with recommendations that are so poor.

Provenance — the most important ignored concept

I’ve been thinking lately about the problems faced by the Customs and Border Protection parts of government. The essential problem they face is: when an object (possibly a person) arrives at a border, they must decide whether or not to allow that object to pass, and whether to charge for the privilege.

The basis for this decision is never the honest face of the traveller or the nicely wrapped package. Rather, all customs organisations gather extra information to use. At one time, this was a document of some sort accompanying the object, and itself certified by someone trustworthy; increasingly it is data collected about the object before it arrived at the border, and also the social or physical network in which it is embedded. For humans, this might be data provided on a visa application, travel payment details, watch lists, and social network activity. For physical objects, this might be shipper and destination properties and history.

In other words, customs wants to know the provenance of every object, in as much detail as possible, and going back as far as possible. All of the existing analytic techniques in this space are approximating that provenance data and how it can be leveraged.

Other parts of government, for example law enforcement and intelligence, are also interested in provenance. For example, sleeper agents are those embedded in another country with a false provenance. The use of aliases is an attempt to break chains of provenance.

Ordinary citizens are wary of this kind of government data collection, feeling that society rightly depends on the existence of two mechanisms for deleting parts of our personal history: ignoring non-consequential parts (youthful hijinks, sealed criminal records) at least after they disappear far enough into the past; and forgiving other parts. There’s a widespread acceptance that people can change and it’s too restrictive to make this impossible.

The way we handle this today is that we explicitly delete certain information from the provenance, but make decisions as if the provenance were complete. This works all of the way from government level to the level of a marriage.

There is another way to handle it, though. That is to keep the provenance complete, but make decisions in ways that acknowledge explicitly that using some data is less appropriate. The trouble is that nobody trusts organisations to make decisions in this second way.

There’s an obvious link between blockchains and provenance, but the whole point of a blockchain is to make erasure difficult. So they will really only have a role to play in this story if we, collectively, adopt the second mechanism.

The issues become more difficult if we consider the way in which businesses use our personal information. Governments have a responsibility to implement social decisions and so are motivated to treat individual data in socially approved ways. Businesses have no such constraint — indeed they have a fiduciary responsibility to leverage data as much as they can. A business that knows data about me has no motivation to delete or ignore some of it.

This is the dark side of provenance. As I’ve discussed before, the ultimate goal of online businesses is to build models of everyone on the planet and how much they’re willing to pay for every product. In turn, sales businesses can use these models to apply differential pricing to their customers, and so increase their profits.

This sets up an adversarial situation, where consumers are trying to look like their acceptable price is much lower than it is. Having a complete, unalterable personal provenance makes this much, much more difficult. But by the time most consumers realise this, it will be too late.

In all of these settings, the data collected about us is becoming so all-encompassing and permanent that it is going to change the way society works. Provenance is an under-discussed topic, but it’s becoming a central one.

What is the “artificial intelligence” that is all over the media?

Anyone who’s been paying attention to the media will be aware that “artificial intelligence” is a hot topic. And it’s not just the media — China recently announced a well-funded quasi-industrial campus devoted to “artificial intelligence”.

Those of us who’ve been around for a while know that “artificial intelligence” becomes the Next Big Thing roughly every 20 years, and so are inclined to take every hype wave with a grain (or truckload) of salt. Building something algorithmic that could genuinely be considered intelligent is much, much, much harder than it looks, and I don’t think we’re even close to solving this problem. Not that we shouldn’t try, as I’ve argued earlier, but mostly for the side-effects the attempt will spin off.

So what is it that the media (etc.) think has happened that’s so revolutionary? I think there are two main things:

  1. There’s no question that deep learning has made progress on some problems that had been intractable for a while, notably object recognition and language manipulation. So it’s not surprising that there’s a sense that suddenly “artificial intelligence” has made progress. However, much of this progress has been over-hyped by the businesses doing it. For example, object recognition has a huge, poorly understood weakness: a pair of images that look identical to us can produce, with the same algorithm, a correct identification of the objects in one image and complete garbage for the other. In other words, the continuity that somehow “ought” to hold between images differing only by minor pixel-level changes is not respected by the algorithmic object recogniser. In the language space, word2vec doesn’t seem to do what it’s claimed to do, and the community has had trouble reproducing some of the early successes.
  2. Algorithms are inserting themselves into decision making settings which previously required humans in the loop. When machines supplemented human physical strength, there was an outcry about the takeover of the machines; when algorithms supplemented human mental abilities, there was an outcry about the takeover of the machines; and now that algorithms are controlling systems without humans, there’s a fresh outcry. Of course, this isn’t as new as it looks — autopilots have been flying planes better than human pilots for a while, and automated train systems are commonplace. But driverless vehicles have pushed these abilities into public view in a new, forceful way, and it’s unsettling. And militaries keep talking about automated warfare (although this seems implausible given current technology).
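
The image-recognition weakness in point 1 can be sketched on a toy model. This uses a hand-weighted linear classifier, not a deep network, and all numbers are invented: a tiny per-feature perturbation, chosen against the sign of each weight in the style of fast-gradient-sign attacks, flips the classification even though the two inputs are almost identical.

```python
# Toy adversarial example (invented weights and input, linear model):
# a small, targeted perturbation flips the classifier's decision.

def sign(x):
    return 1.0 if x > 0 else -1.0 if x < 0 else 0.0

w = [2.0, -3.0, 1.0, 0.5]    # fixed "trained" weights
x = [0.5, 0.2, 0.3, 0.1]     # original input; its score is positive

def score(x):
    """Linear decision score; positive means class 1, negative class 0."""
    return sum(wi * xi for wi, xi in zip(w, x))

eps = 0.25                   # per-feature perturbation budget
# FGSM-style step: nudge each feature against the sign of its weight,
# the direction that decreases the score fastest per unit of change.
x_adv = [xi - eps * sign(wi) for wi, xi in zip(w, x)]

print(score(x))      # positive: original input classified as class 1
print(score(x_adv))  # negative: near-identical input classified as class 0
```

For a linear model the gradient of the score with respect to the input is just the weight vector, so this perturbation is exact; deep networks are nonlinear, but the same one-step trick against the local gradient is what produces the visually indistinguishable adversarial images described above.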

My sense is that these developments are all incremental, from a technical perspective if not from a social one. There are interesting new algorithms, many of them simply large-scale applications of heuristics, but nothing that qualifies as a revolution. I suspect that “artificial intelligence” is yet another bubble, as it was in the 60s and in the 80s.

New data on human trafficking

The UN’s International Organisation for Migration has just released a large tranche of data about migration patterns, along with a slick visualization interface.

Some of the results are quite counter-intuitive. Worth a look.