Annular similarity

When similarity is used for clustering, then obviously the most similar objects need to be placed in the same cluster.

But when similarity is being used for human consumption, a different dynamic is in play — humans usually already know what the most similar objects are, and are interested in those that are (just) beyond those.

This can be seen most clearly in recommender systems. Purchase an item or watch a Netflix show, and your recommendation list will fill up with new objects that are very similar to the thing you just bought/watched.

From a strictly algorithmic point of view, this is a success: the algorithm found objects similar to the starting object. But from a human point of view this is a total fail because it’s very likely that you, the human, already know about all of these recommended objects. If you bought something, you probably compared the thing you bought with many or all of the objects that are now being recommended to you. If you watched something, the recommendations are still likely to be things you already knew about.

The misconception about what similarity needs to mean to be useful to humans is at the heart of the failure of recommender systems, and even the ad serving systems that many of the online businesses make their money from. Everyone has had the experience of buying something, only to have their ad feed (should they still see it) fill up with ads for similar products (“I see you just bought a new car — here are some other new cars you might like”).

What’s needed is annular similarity — a region that is centred at the initial object, but excludes new objects that are too similar, and focuses instead on objects that are a bit similar.

Amazon tries to do this via “People who bought this also bought” which can show useful add-on products. (They also use “People who viewed this also viewed” but this is much less effective because motivations are so variable.) But this mechanism also fails because buying things together doesn’t necessarily mean that they belong together — it’s common to see recommendations based on the fact that two objects were on special on the same day, and so more likely to be bought together because of the opportunity, rather than any commonality.

Annular similarity is also important in applications that help humans to learn new things: web search, online courses, intelligence analysis. That’s why we built the ATHENS divergent web search engine (refs below) — give it some search terms and it returns (clusters of) web pages that contain information that is just over the horizon from the search terms. We found that this required two annuli — we first constructed the information implicit in the search terms, then an annulus around that of information that we assumed would be known to someone who knew the core derived from the search terms, and only then did we generate another annulus which contains the results returned.

We don’t know many algorithmic ways to find annular similarity. In any distance-based clustering it’s possible, of course, to define an annulus around any point. But it’s tricky to decide on what the inner and outer radii should be, the calculations have to happen in high-dimensional space where the points are very sparse, and it’s not usually clear whether the space is isotropic.
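To make the distance-based version concrete, here is a minimal sketch of an annulus around a target point. Choosing the inner and outer radii as quantiles of the observed distances is an assumption made purely for illustration (it sidesteps, rather than solves, the problem of picking the radii), and the function names and default values are hypothetical.

```python
import math


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = (len(sorted_vals) - 1) * q
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def annulus_indices(points, target, inner_q=0.05, outer_q=0.20):
    """Indices of points whose Euclidean distance from `target` lies in
    the annulus (r_in, r_out], nearest first.

    Assumption: the radii are the inner_q and outer_q quantiles of the
    distances from `target`, not values prescribed by any method.
    """
    if not points:
        return []
    dists = [math.dist(p, target) for p in points]
    ordered = sorted(dists)
    r_in = quantile(ordered, inner_q)
    r_out = quantile(ordered, outer_q)
    # Drop the too-similar core (d <= r_in) and everything beyond r_out
    hits = [i for i, d in enumerate(dists) if r_in < d <= r_out]
    return sorted(hits, key=lambda i: dists[i])
```

Even in this toy form the difficulties are visible: the quantiles are arbitrary, Euclidean distance treats every direction as equivalent, and in a sparse high-dimensional space the distances tend to bunch together so that the annulus boundaries carry little meaning.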

Annular similarity doesn’t work (at least straightforwardly) in density-based clustering (e.g. DBSCAN) or distribution-based clustering (e.g. EM) because the semantics of ‘cluster’ doesn’t allow for an annulus.

One way that does work (and was used extensively in the ATHENS system) is based on singular value decomposition (SVD). An SVD projects a high-dimensional space into a low-dimensional one in such a way as to preserve as much of the variation as possible. One of its useful side-effects is that a point that is similar to many other points tends to be projected close to the origin; and a point that is dissimilar to most other points also tends to be projected close to the origin because the dimension(s) it inhabits have little variation and tend to be projected away. In the resulting low-dimensional projection, points far from the origin tend to be interestingly dissimilar to those at the centre of the structure, and so an annulus imposed on the embedding tends to find an interesting set of objects.
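The idea can be sketched in a few lines of NumPy. This is not the ATHENS implementation; it is a minimal illustration that assumes the annulus is centred on the origin of the embedding, that the columns are mean-centred first, and that the radii are chosen as quantiles of the embedded norms. The function name, parameters, and defaults are all hypothetical.

```python
import numpy as np


def svd_annulus(matrix, k=2, inner_q=0.5, outer_q=0.9, center=True):
    """Embed the rows of `matrix` in k dimensions with a truncated SVD and
    return (indices of rows inside the annulus, embedded norm of every row).

    Assumptions: annulus centred on the origin; radii taken as quantiles
    of the embedded norms; optional column mean-centring.
    """
    A = np.asarray(matrix, dtype=float)
    if center:
        A = A - A.mean(axis=0)
    # Economy-size SVD: A = U @ diag(s) @ Vt, singular values descending
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    # Coordinates of each row in the k leading singular directions
    coords = U[:, :k] * s[:k]
    norms = np.linalg.norm(coords, axis=1)
    r_in, r_out = np.quantile(norms, [inner_q, outer_q])
    keep = np.nonzero((norms > r_in) & (norms <= r_out))[0]
    return keep, norms
```

With rows as objects (for example, documents) and columns as attributes (for example, terms), `keep` holds the rows that sit neither at the crowded centre of the embedding nor at its extreme edge. How many dimensions to retain and where to put the two radii remain judgment calls.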

Unfortunately this doesn’t solve the recommender system problem because recommenders need to find similar points that have more non-zeroes than the initial target point — and the projection doesn’t preserve this ordering well. That means that the entire region around the target point has to be searched, which becomes expensive.

There’s an opportunity here to come up with better algorithms to find annular structures. Success would lead to advances in several diverse areas.

(A related problem is the Sound of Music problem, the tendency for a common/popular object to muddle the similarity structure of all of the other objects because of its weak similarity to all of them. The Sound of Music plays this role in movie recommendation systems, but think of wrapping paper as a similar object in the context of Amazon. I’ve written about this in a previous post.)

 

Tracy A. Jenkin, Yolande E. Chan, David B. Skillicorn, Keith W. Rogers:
Individual Exploration, Sensemaking, and Innovation: A Design for the Discovery of Novel Information. Decision Sciences 44(6): 1021-1057 (2013)
Tracy A. Jenkin, David B. Skillicorn, Yolande E. Chan:
Novel Idea Generation, Collaborative Filtering, and Group Innovation Processes. ICIS 2011
David B. Skillicorn, Nikhil Vats:
Novel information discovery for intelligence and counterterrorism. Decision Support Systems 43(4): 1375-1382 (2007)
Nikhil Vats, David B. Skillicorn:
Information discovery within organizations using the Athens system. CASCON 2004: 282-292

 


China-Huawei-Canada fail

Huawei has been trying to convince the world that they are a private company with no covert relationships to the Chinese government that might compromise the security of their products and installations.

This attempt has been torpedoed by the Chinese ambassador to Canada who today threatened ‘retaliation’ if Canada joins three of the Five Eyes countries (and a number of others) in banning Huawei from provisioning 5G networks. (The U.K. hasn’t banned Huawei equipment, but BT is uninstalling it, and the unit set up jointly by Huawei and GCHQ to try to alleviate concerns about Huawei’s hardware and software has recently reported that it’s less certain about the security of these systems now than it was when the process started.)

It’s one thing for a government to act as a booster for national industries — it’s another to deploy government force directly.

China seems to have a tin ear for the way that the rest of the world does business; it can’t help but hurt them eventually.

The cybercrime landscape in Canada

Statscan recently released the results of their survey of cybercrime and cybersecurity in 2017 (https://www150.statcan.gc.ca/n1/pub/71-607-x/71-607-x2018007-eng.htm).

Here are some of the highlights:

  • About 20% of Canadian businesses had a cybersecurity incident (that they noticed). Of these around 40% had no detectable motive, another 40% were aimed at financial gain, and around 23% were aimed at getting information.
  • More than half had enough of an impact to prevent the business operating for at least a day.
  • Rates of incidents were much higher in the banking sector and pipeline transportation sector (worrying), and in universities (not unexpected, given their need to operate openly).
  • About a quarter of businesses don’t use an anti-malware tool, about a quarter do not have email security (not clear what this means, but presumably antivirus scanning of incoming email, and maybe exfiltration protection), and almost a third do not have network security. These are terrifying numbers.

Relatively few businesses have a policy re managing and reporting cybersecurity incidents; vanishingly few have senior executive involvement in cybersecurity.

It could be worse, but this must be disappointing to those in the Canadian government who’ve been developing and pushing out cyber awareness.

People in glass houses

There’s a throwaway line in Woodward’s book about the Trump White House (“Fear”, Simon and Schuster, 2018) where he says that the senior military were unwilling to carry out offensive cyber operations because they didn’t think the US would fare well under retaliation.

Then this week the GAO came out with a report on cybersecurity in DOD weapons systems (as opposed to DOD networks). It does not make happy reading. (Full report).

Here’s what seems to me to be the key quotation:

“We found that from 2012 to 2017, DOD testers routinely found mission critical cyber vulnerabilities in nearly all weapon systems that were under development. Using relatively simple tools and techniques, testers were able to take control of these systems and largely operate undetected”

Almost every word could be italicized and many added exclamation marks would hardly suffice.

To be fair, some of these systems are still under development. But the report makes clear that, for many of them, cybersecurity was not really considered in their design. The typical assumption was that weapons systems are standalone. But in a world where software runs everything, there has to be a mechanism for software updates at least, and so a connection to the outside world. As the Iranians discovered, even an update from a USB drive is not attack-proof. And security is a difficult property to retrofit, so these systems will never be as cyberattack resistant as we might all have wished.

Predicting fraud risk from customer online properties

This interesting paper presents the results of an investigation into how well a digital footprint, properties associated with online interactions with a business such as platform and time of day, can be used to predict risk of non-payment for a pay-on-delivery shopping business.

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3163781

The properties that are predictive are not, by themselves, all that surprising: those who shop in the middle of the night are higher risk, those who come from price comparison web sites are lower risk, and so on.

What is surprising is that, overall, predictive performance rivals, and perhaps exceeds, risk prediction from FICO (i.e. credit scores) — but these properties are much easier to collect, and the model based on them can be applied to those who don’t have a credit score. What’s more, the digital footprint and FICO-based models are not very correlated, and so using both does even better.

The properties collected for the digital footprint are so easy to collect that almost any online business or government department could (and should) be using them to get a sense of their customers.

I’ve heard (but can’t find a reference) that Australian online insurance quotes vary in price based on what time of day they are requested; I suppose this is based on an intuition that procrastination is correlated with risky behaviour. I’d be grateful if anyone has details of any organisation that is using this kind of predictive model for their customers.

Businesses like Amazon require payment up front — but they also face a panel of risks, including falsified payments (stolen credit cards) and delivery hijacking. Approaches like this might well help in detecting these other kinds of fraud.

Blockchain — hype vs reality

I was at a Fintech conference last week where (not surprisingly) there was a lot of talk about blockchains and their potential impact in the banking and insurance sectors. There seem to me to be two issues that don’t get enough attention (although wiser heads than mine have written about both):

  1. The current implementations are much too heavy-weight to survive. I’ve seen numbers like 0.2% of the entire world’s energy is already going to drive blockchains, the response times are too slow for many (?most) real applications, and the data sizes that can be preserved on a blockchain are too small. This seems like a situation where being the second or third mover is a huge advantage. For example the Algorand approach seems to have substantial advantages over current blockchain implementations.
  2. Removing the requirement to trust an intermediary (disintermediation) replaces it with a requirement to trust the software developers who implement the interfaces to the blockchain. There are known examples where deposits to a cryptocurrency have permanently disappeared as the result of software bugs in the deposit code (video of a talk by Brendan Cordy); and this will surely get worse as blockchains are used for more complex things. The fundamental issue is that using a blockchain requires atomic commit operations, and software engineering is not really up to the challenge (although this is an interesting area where formal methods might make a difference).

Cash and money laundering

Although privacy advocates often favour continuing a cash economy, it’s becoming clear that cash is overwhelmingly where bad things happen.

There are two ways in which cash is used in the black (and often criminal) economy. The first is as the basis of a barter economy — those involved buy and sell almost exclusively in cash, and their activities never get onto the radar of banks, taxation departments, or financial intelligence units. There’s not much that can be done about this, except that those who live in the cash economy usually have a lifestyle that’s much higher than their official income, so tax authorities might take an interest.

Cash is much more usable if it can be converted into electronic currency in banks. The number of ways this can be done is steadily diminishing. In Canada, banks and other financial institutions are required to report cash deposits above $10,000 unless they can show that they come from routine business activities (and there are quite specific rules about what this means). There are some loopholes, but not many and not big.

Australia has just taken steps towards banning cash deposits above $10,000 altogether. The uproar has been revealing.

One operator of an armoured car business complained to the media that he was moving ~$5 million a month in cash, about half from car dealers, and that this would ruin his business. The treasurer’s response was, roughly speaking, “Good”. (Businesses that sell transport already have quite strong restrictions about their cash activities in many countries.)

http://www.news.com.au/finance/economy/federal-budget/im-being-kicked-in-the-teeth-security-company-owner-says-10000-cash-limit-will-demolish-him/news-story/05c71552b8f5c63c6846b13335763063

Also in Australia, the proportion of cash transactions dropped to 37% by 2016, with a corresponding drop in the total value they represent. However, the ratio of physical currency to GDP is at an all time high; there is $3000 in circulation for every Australian.

https://www.smh.com.au/business/the-economy/we-re-turning-from-cash-but-demand-for-notes-has-never-been-higher-20180510-p4zegm.html

There are either some people with vast sums stashed under their mattresses, or this money is being used mostly by criminals. (Hint: it’s the latter.)

It’s no wonder that many countries are trying to be more aggressive about reducing the cash in circulation, mostly by removing high-denomination notes (because more value can be packed into a tighter space).

But it also seems clear that banks can’t resist the lure of large cash deposits, no matter how much they should. And as long as banks don’t report them, countries’ financial intelligence units can’t do a lot about it.

 

