Archive for the 'Uncategorized' Category

Pandemic theatre

Canada is the latest country to announce that it is rolling out a contact tracing app for covid-19. There are so many issues with this idea, and its timing, that we have to consider it pandemic theatre.

These contact tracing apps work as follows: each phone is given a random identifier. Whenever your phone and somebody else’s phone get close enough, they exchange these identifiers. If anyone is diagnosed with covid-19, their identifier is flagged and all of the phones that have been close to the flagged phone in the past 2 weeks are notified, so that their users know they have been close to someone who subsequently got the disease.
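The core of the scheme is tiny. Here is a minimal Python sketch; the identifier size, the lack of rotation, and the matching details are my own illustrative choices, not any particular app’s:

    import os, time

    def new_identifier():
        # each phone periodically generates a fresh random identifier
        return os.urandom(16)

    contact_log = []   # (timestamp, identifier) pairs this phone has seen

    def record_contact(their_id):
        contact_log.append((time.time(), their_id))

    def check_exposure(flagged_ids, window_days=14):
        # flagged_ids: identifiers of phones whose owners were diagnosed
        cutoff = time.time() - window_days * 24 * 3600
        return any(ts >= cutoff and ident in flagged_ids
                   for ts, ident in contact_log)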

First, Canada is very late to the party. This style of contact tracing app was first designed in Singapore; Australia rolled its version out at the end of April; and many other countries have had one available for a while. Rather than using one of the existing apps (which require very little centralised, and therefore specialised, infrastructure), Canada is developing its own — sometime soon, maybe.

Second, these apps have serious drawbacks, and might not be effective, even in principle. Bluetooth, which is used to detect a nearby phone, is a wireless system and so detects any other phone within a few metres. But it can’t tell that the other phone is behind a wall, or behind a screen, or even in a car driving by with the windows closed. So it’s going to detect many ‘contacts’ that can’t possibly have spread covid, especially in cities. Are people really going to isolate based on such a notification?

Third, these apps, collectively, have to capture a large number of contacts to actually help with the public health issue. It’s been estimated that around 80% of users need to download and use the app to get reasonable effectiveness. Uptake in practice has been much, much less than this, often around 20%. Although these apps have been in use for, let’s say, 45 days in countries that have them, I cannot find a single report of an exposure notification anywhere.

Governments are inclined to say things like “Well, contact tracing apps aren’t doing anything useful now, but in the later stages they’ll be incredibly useful” (and so, presumably, we don’t have to rush to build them). But it’s mostly about being seen to do something rather than actually doing something helpful.

Embarrassware

There’s a new wrinkle on ransomware.

Smarter criminals are now exfiltrating files they find that might embarrass the organisation whose site they’ve hacked. Almost any organisation has some dirty laundry it would rather not have publicised: demonstrations of incompetence, inappropriate emails, strategic directions, tactical decisions, ….

The criminals threaten to publish these documents within a short period of time as a way to increase the pressure to pay the ransom. Now even an organisation that has good backups may want to pay the ransom.

Actually finding content that the organisation might not want made public is a challenging natural language problem (although there is probably low-hanging fruit such as pornographic images). But, like the man (allegedly Arthur Conan Doyle) who sent a telegram to his friend saying “Fly, all is discovered” (The Strand, George Newnes, September 18, 1897, No. 831 – Vol. XXXII) and saw him leave town, it might not be necessary to specify which actual documents will be published.

Understanding risk at the disaster end of the spectrum

In conventional risk analysis, risk is often expressed as

risk = threat probability x potential loss

When the values of the terms on the right-hand side are in the middle of their ranges, our intuition handles this equation quite well.

But when the values are near their extremes, our intuition goes out the window, as the world’s coronavirus experience shows. The pandemic is what Taleb calls a black swan, an event where the threat probability is extremely low, but the potential loss is extremely high. For example, if the potential loss is of the order of 10^9 (a billion) then a threat probability of 1 in a thousand still has a risk of magnitude a million.
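The arithmetic is trivial, but worth making concrete:

    threat_probability = 1e-3     # a one-in-a-thousand event
    potential_loss = 1e9          # a billion-scale loss
    risk = threat_probability * potential_loss
    print(risk)                   # 1000000.0 -- still a disaster-sized risk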

I came across another disaster waiting to happen, with the same kind of characteristics as the coronavirus pandemic — cyber attacks on water treatment facilities.

https://www.csoonline.com/article/3541837/attempted-cyberattack-highlights-vulnerability-of-global-water-infrastructure.html

In the U.S., water treatment facilities are typically small organizations that don’t have specialized IT staff who can protect their systems. But a cyber attack on such a facility could cause mass casualties. While electricity grids, Internet infrastructure, and financial systems have received some protective attention, water treatment is the forgotten sibling. A classic example of a small (but growing) threat probability but a huge potential loss.

The threat isn’t even theoretical. Attacks have already been attempted.

Using technology for contact tracing done right

There has understandably been a lot of interest in using technology, especially cell phones, to help with tracking the spread of covid-19.

This raises substantial privacy issues, especially as we know that government powers grabbed in an emergency tend not to be rolled back when the emergency is over.

One of the difficulties is that not everybody with a cell phone carries it at all times (believe it or not), and not everybody leaves their location sensor turned on. So many of the proposals founder on issues such as these; all the more so as those who don’t want to be tracked are more likely to be evasive.

One of the cleverer ideas is an app used in Singapore, TraceTogether. If you install the app, and have Bluetooth turned on, then the app swaps identities with any other phone running the app that comes close enough to be detected.

Using public key infrastructure, the identities of the other phones you’ve encountered are stored, encrypted, on your phone (and vice versa on theirs).

If you get sick, the app sends your list of encountered phones to the government, which can use its key to decrypt them. The government can then notify everyone on the list and contact trace them in minutes.

Note that the app doesn’t record where you crossed paths with others, just that you did. This, together with the fact that nobody but the government can decrypt your contacts, gives you a substantial amount of privacy, probably the best you can hope for given the public health need.
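Here is a minimal sketch of that design, using PyNaCl’s sealed boxes to stand in for whatever public-key scheme the real app uses (the structure and names here are mine, for illustration):

    from nacl.public import PrivateKey, SealedBox

    gov_private = PrivateKey.generate()   # held only by the health authority
    gov_public = gov_private.public_key   # shipped inside the app

    def store_contact(their_phone_id):
        # each encounter is stored encrypted; even the phone's owner
        # can't read the accumulated log
        return SealedBox(gov_public).encrypt(their_phone_id)

    def government_decrypt(encrypted_log):
        # runs only on the government side, after a positive diagnosis
        box = SealedBox(gov_private)
        return [box.decrypt(entry) for entry in encrypted_log]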

The epidemiology of spam

As someone who’s had the same email address for nearly 40 years, I get a lot of spam. (Of course, almost all of it is automatically filtered away.)

It’s been noticeable that spam has been way down since January this year, and it became vanishingly rare once India was put on lockdown last week.

But this week it’s come roaring back as China once again opens for business. I guess we know where most of it comes from (and maybe spam has a role to play as a covid-19 detector — perhaps we can find out how many infections there are really in Iran, for example).

Detecting intent and abuse in natural language

One of my students has developed a system for detecting intent and abuse in natural language. As part of the validation, he has designed a short survey to get human assessments of how the system performs.

If you’d like to participate, the URL is

aquarius.cs.queensu.ca

Thanks in advance!

Towards a cashless economy

Australia is close to passing laws that would make it illegal to pay for anything with cash above $A10,000.

What’s interesting is who’s objecting: the Housing Association (cash for property, innocently, really?); farmers (maybe this is about barter and/or tax avoidance); dentists (??); and big retail (hmm, sounds like high-end products such as jewellery might be the issue here). Retailers are quoted as saying “customers like to save cash to make big payments”, which sounds rather implausible.

One of the things that works against stamping out money laundering is that it means stamping out the black, and most of the grey, economy. The pushback from these parts of the economy is presumably something between loss of perks and a feeling that the tax bite is too big.

Update to “A Gentle Guide to Money Laundering”

I’ve updated my guide to money laundering, mostly to include a discussion of Unexplained Wealth Orders, which seem likely to become a major part of the solution.

money laundering version 2 (Feb 2020)

More thoughts on Huawei

“5G” is marketing speak for whatever is coming next in computer networks. It promises 100 times greater speed and the ability to connect many more devices in a small space. However, “5G” is unlikely to exist as a real thing until two serious problems are addressed. First, there is no killer app that demands this increase in performance. Examples mentioned breathlessly by the media include being able to download an entire movie in seconds (which doesn’t seem to motivate many people), the ability for vehicles to communicate with one another (still years away), and the ability for Internet of Things devices to communicate widely (the whole communicating-lightbulbs phenomenon seems to have put consumers off rather than motivated them). Second, “5G” will require a much denser network of cell towers, and it’s far from clear how they will be paid for and powered. The 5G networks touted in the media today require specialized handsets that are incompatible with existing networks and exist only in the downtown cores of a handful of cities. So “5G” per se is hardly a pressing issue.

Nevertheless, it does matter who provides the next generation of network infrastructure because networks have become indispensable to ordinary life – not just entertainment, but communication and business. And that’s why several countries have been so vocal against Huawei’s attempts to become a key player.

There are two significant issues. First, a network switch provider can see, block, or divert all the traffic passing through its switches. Even encrypting the traffic content doesn’t help much; it’s still possible to see who’s communicating with whom and how often. Huawei, however much it claims to the contrary, is subject to Chinese law that requires it to cooperate with the Chinese government and so can never provide neutral services. It doesn’t help to say, as Huawei does, that because it never has acted at the behest of the Chinese government, it never will in the future. Nor does it help to say that no backdoor has ever been found in its software. All network switches have the capability to be updated over the Internet, so the software it is running today need not be the software it is running tomorrow. It is not surprising that many governments, including the US and Australia, have reservations about allowing Huawei to provide network infrastructure.

Second, the next generation of network infrastructure will have to be more complex than what exists now. A long-standing collaboration between the UK and Huawei tried to improve confidence in Huawei products by disassembling and testing them. Their concern, for a number of years, was that supposedly identical software built in China and built in the UK turned out to be of different sizes. This is a bad sign, because it suggests that the software pays attention to where it is being built and modifies itself accordingly (much as VW emissions software checked whether the vehicle was undergoing an emissions test and modified its behaviour). However, their 2019 report concluded that the issue stemmed from Huawei’s software construction processes, which were so flawed that they were unable to build software consistently anywhere. The software being studied is for today’s 4G network infrastructure, and the joint GCHQ-Huawei Centre concluded that it would take Huawei several years even to reach today’s software engineering state-of-the-art. It seems inconceivable that Huawei will be able to produce usable network infrastructure for an environment that will be many times more complex.

These two problems, in a way, cancel each other out – if the network infrastructure is of poor quality it probably can’t be manipulated explicitly by Huawei. But its poor quality increases the opportunity for attacks on networks by China (without involving Huawei), Russia, Iran, or even terrorist groups.

Huawei systems are cheaper than their competitors, and it’s a truism that convenience trumps security. But the long-term costs of a Huawei connected world may be more than we want to pay.

The difference between kinetic and cyber attacks

It’s striking — and worrying — that missile launches by North Korea, no matter how unimportant in the big picture, get worldwide news coverage every time they happen.

But North Korea’s ongoing cyberattacks, which are having serious effects and are raising startlingly large amounts of money for the regime, are mentioned only on technical sites, and only occasionally.

We have to hope that military and government have a more balanced view of the relative threat — but it seems clear that politicians don’t.

Backdoors to encryption — 100 years of experience

The question of whether those who encrypt data, at rest or in flight, should be required to provide a master decryption key to government or law enforcement is back in the news, as it is periodically.

Many have made the obvious arguments about why this is a bad idea, and I won’t repeat them.

But let me point out that we’ve been here before, in a slightly different context. A hundred years ago, law enforcement came up against the fact that criminals knew things that could (a) be used to identify other criminals, and (b) prevent other crimes. This knowledge was inside their heads, rather than inside their cell phones.

Then, as now, it seemed obvious that law enforcement and government should be able to extract that knowledge, and interrogation with violence or torture was the result.

Eventually we reached (in Western countries, at least) an agreement that, although there could be a benefit to the knowledge in criminals’ heads, there was a point beyond which we weren’t going to go to extract it, despite its potential value.

The same principle surely applies when the knowledge is on a device rather than in a head. At some point, law enforcement must realise that not all knowledge is extractable.

(Incidentally, one of the arguments made about the use of violence and torture is that the knowledge extracted is often valueless, since the target will say anything to get it to stop. It isn’t hard to see that devices can be made to use a similar strategy. They would have a pin code or password that could be used under coercion and that would appear to unlock the device, but would in fact produce access only to a virtual subdevice which seemed innocuous. Especially as Customs in several countries are now demanding pins and passwords as a condition of entry, such devices would be useful for innocent travellers as well as guilty — to protect commercial and diplomatic secrets for a start.)
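A toy version of such a coercion-resistant unlock, with obviously hypothetical PINs (a real implementation would derive separate encryption keys for each volume rather than compare hashes):

    import hashlib

    def _digest(pin):
        return hashlib.sha256(pin.encode()).hexdigest()

    REAL_PIN = _digest("4839")      # hypothetical
    DURESS_PIN = _digest("1122")    # hypothetical: the one given under coercion

    def unlock(pin):
        if _digest(pin) == REAL_PIN:
            return "real-volume"     # mount the true contents
        if _digest(pin) == DURESS_PIN:
            return "decoy-volume"    # mount an innocuous-looking subdevice
        return "locked"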

Democratic debates strategy

In an analysis of the language used by US presidential candidates in the last 7 elections, Christian Leuprecht and I showed that there’s a language pattern that predicts the winner, and even the margin. The pattern is this: use lots of positive language, use no negative language at all (even words like ‘don’t’ and ‘won’t’), talk about abstractions not policy, and don’t talk about your opponent(s). (For example, Trump failed on the fourth point, but was good on the others, while Hillary Clinton did poorly on all four.)
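Three of the four parts of the pattern are easy to operationalise, at least crudely (abstraction versus policy is harder). A toy scorer; the word lists here are illustrative stand-ins, nothing like the lexicons used in the actual study:

    def pattern_score(speech, positive, negative, opponent_names):
        words = [w.strip(".,!?").lower() for w in speech.split()]
        n = max(len(words), 1)
        return {
            "positive_rate": sum(w in positive for w in words) / n,
            "negative_rate": sum(w in negative for w in words) / n,       # want zero
            "opponent_mentions": sum(w in opponent_names for w in words), # want zero
        }

    score = pattern_score("We will build a great and hopeful future together",
                          positive={"great", "hopeful", "future", "together"},
                          negative={"don't", "won't", "never"},
                          opponent_names=set())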

In some ways, this pattern is intuitive: voters don’t make rational choices of the most qualified candidate — they vote for someone they relate to.

Why don’t candidates use this pattern? Because the media hates it! Candidates (except Trump) fear being labelled as shallow by the media, even though using the pattern helps them with voters. You can see this at work in the way the opinion pieces decide who ‘won’ the debates.

The Democratic debates show candidates using the opposite strategy: lots of detailed policy, lots of negativity (what’s wrong that I will fix), and lots of putting each other down.

Now it’s possible that the strategy needed to win a primary is different from the one that wins a general election. But if you want to assess the chances of those who might make it through, this pattern will help to predict how they would fare against Trump in 2020.

Incumbency effects in U.S. presidential campaigns: Language patterns matter, Electoral Studies, Vol. 43, 95–103.
https://www.sciencedirect.com/science/article/pii/S0261379416302062

Tips for adversarial analytics

I put together this compendium of things that are useful to know for those starting out in analytics for policing, signals intelligence, counterterrorism, anti-money-laundering, cybersecurity, and customs; and which might be useful to those using analytics when organisational priorities come into conflict with customers (as they almost always do).

Most of the content is either tucked away in academic publications, not publishable by itself, or common knowledge among practitioners but not written down.

I hope you find it helpful (pdf): Tips for Adversarial Analytics

Unexplained Wealth Orders

Money laundering detection conventionally focuses on finding the proceeds of crime. It has two deterrent effects: the proceeds are confiscated so that ‘crime doesn’t pay’; and discovering the proceeds can be used to track back to find the crime, and the criminals, that produced it.

Since criminals prefer not to leave traces, the proceeds of crime used to be primarily in cash — think drug dealers. As a result, criminals tended to accumulate large amounts of cash. To get any advantage from it, they had three options: spend it in a black economy, insert it into the financial system, or ship it to another country so that its origin was obscured.

Money laundering detection used to concentrate on these mechanisms. Many countries have made an effort to stamp out the cash economy for large scale purchases (jewels, cars, houses, art) by requiring cash transactions of size to be reported, and by removing large denomination currency from circulation (so that moving cash requires larger, more obtrusive volume). Most countries also require large cash deposits to banks to be reported. Preventing transport of cash across borders is more difficult — many countries have exit and entry controls on cash carried by travellers, but do much less well interdicting containers full of cash.

One reason why much of current money laundering detection is ineffective is that there are now wholesale businesses that provide money laundering as a service: give them your illicit money, and they’ll give you back some fraction of it in a way that makes it seem legitimate. These businesses break the link between the money and the crime, making prosecution almost impossible since there’s no longer a line that can be drawn from the crime to the money.

Unexplained wealth orders target the back end of the process instead. They require people who have and spend money in quantity to explain how they came by the money, even if the money is in the financial system and apparently plausible. This is extremely effective, because it means that criminals cannot easily spend their ill-gotten gains without risking their confiscation.

Of course, this is not a new idea. Police have always kept a look out for people who seemed to have more money than they should when they wanted to figure out who had committed a bank robbery or something similar.

The new factor in unexplained wealth orders is that the burden of proof shifts to the person spending the money to show that they came by it legitimately, rather than being on law enforcement to show that the money is proceeds of crime (which no longer works, because of the middlemen mentioned above). This creates a new problem for criminals.

Of course, the development and use of unexplained wealth orders raises questions of civil liberties, especially when the burden of proof shifts from one side to the other. However, unexplained wealth has always attracted the attention of taxation authorities, so these orders aren’t perhaps as new as they seem. Remember, Al Capone was charged with tax evasion, not racketeering.

Unexplained wealth orders seem like an effective new tool in the arsenal of money laundering detection. They deserve to be considered carefully.

What causes extremist violence?

This question has been the subject of active research for more than four decades. There have been many answers that don’t stand up to empirical scrutiny — because the number of those who participate in extremist violence is so small, and because researchers tend to interview them, but fail to interview all those identical to them who didn’t commit violence.

Here’s a list of the properties that we now know don’t lead to extremist violence:

  • ideology or religion
  • deprivation or unhappiness
  • political/social alienation
  • discrimination
  • moral outrage
  • activism or illegal non-violent political action
  • attitudes/belief

How do we know this? Mostly because, if you take a population that exhibits any of these properties (typically many hundreds of thousands of people), you find that one or two have committed violence, but the others haven’t. So properties such as these have absolutely no predictive power.
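The base-rate arithmetic makes the point starkly; the counts here are only of the magnitudes mentioned above:

    population_with_property = 500_000   # e.g. everyone holding a given belief
    violent = 2                          # those who actually committed violence
    rate = violent / population_with_property
    print(f"{rate:.6%}")   # 0.000400% -- effectively zero predictive power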

On the other hand, there are a few properties that do lead to extremist violence:

  • being the child of immigrants
  • having access to a local charismatic figure
  • travelling to a location where one’s internal narrative is reinforced
  • participation in a small group echo chamber with those who have similar patterns of thought
  • having a disconnected-disordered or hypercaring-compelled personality

These don’t form a diagnostic set, because there are still many people who have one or more of them, and do not commit violence. But they are a set of danger signals, and the more of them an individual has, the more attention should be paid to them (on the evidence of the past 15 years).

You can find a full discussion of these issues, and the evidence behind them, in “Terrorists, Radicals, and Activists: Distinguishing Between Countering Violent Extremism and Preventing Extremist Violence, and Why It Matters” in Violent Extremism and Terrorism, Queen’s University Press, 2019.


Detecting abusive language online

My student, Hannah Leblanc, has just defended her thesis looking at predicting abusive language. The document is

https://qspace.library.queensu.ca/handle/1974/26252

Rather than treat this as an empirical problem — gather all the signal you can, select attributes using training data, and then build a predictor using those attributes — she started with models of what might drive abusive language. In particular, abuse may be associated with subjectivity (objective language is less likely to be abusive, even if it contains individual words that might look abusive) and with otherness (abuse often results from one group targeting another). She also looked at emotion and mood signals and their association with abuse.

All of the models perform almost perfectly at detecting non-abuse; they struggle more with detecting abuse. Some of this comes from mislabelling — documents that are marked as abusive but really aren’t; but much of the rest comes from missing signal — abusive words disguised so that they don’t match the words of a lexicon.

Overall the model achieves accuracy of 95% and F-score of 0.91.
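To see how accuracy can look strong while abuse detection lags, here is the confusion-matrix arithmetic with made-up counts (not the thesis’s actual numbers):

    tp, fn = 160, 40      # abusive documents: caught vs missed
    tn, fp = 790, 10      # non-abusive documents: correct vs false alarms
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.95
    precision = tp / (tp + fp)                          # ~0.94
    recall = tp / (tp + fn)                             # 0.80
    f1 = 2 * precision * recall / (precision + recall)  # ~0.86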

Software quality in another valley

In the mid-90s, there was a notable drop in the quality of software, more or less across the board. The thinking at the time was that this was a result of software development businesses (*cough* Microsoft) deciding to hire physicists and mathematicians because they were smarter (maybe) than computer scientists and, after all, building software was a straightforward process as long as you were smart enough. This didn’t work out so well!

But I think we’re well into a second version of this pattern, driven by a similar misconception — that coding is the same thing as software design and engineering — and a woeful ignorance of user interface design principles.

Here are some recent examples that have crossed my path:

  • A constant redesign of web interfaces that doesn’t change the functionality, but moves all of the relevant parts to somewhere else on the screen, forcing users to relearn how to navigate.
  • A complete ignorance of Fitts’s Law (the bigger the target, the faster you can hit it with mouse or finger), especially, and perhaps deliberately, for the targets that kill a popup. Example: CNN’s ‘breaking news’ banner across the top. (See the sketch after this list.)
  • My Android DuckDuckGo app has a 50:50 chance, when you hit the back button, of going back a page or dropping you out of the app.
  • The Fast Company web page pops up two banners on every page (one inviting you to sign up for their email newsletter and one saying that they use cookies) which together consume more than half of the real estate. (The cookies popup fills the screen in many environments; thanks, GDPR!)
  • If you ask Alexa for a news bulletin, it starts at the moment you stopped listening last time — except that that was typically yesterday, so it essentially starts at a random point. (Yes, it tells you that you can listen to the latest episode, but the spell it requires to make this happen is so unclear I haven’t worked it out.)
  • And then there are all the little mysteries: Firefox add-ins that seem to lose some of their functionality; Amazon’s Kindle deal-of-the-day page that doesn’t list the deals of the day.
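On the Fitts’s Law point: the law is quantitative — movement time grows with the log of the ratio of distance to target width. A quick illustration (the constants a and b are device-dependent; the values here are placeholders):

    import math

    def movement_time(distance, width, a=0.1, b=0.15):
        # Fitts's law, Shannon formulation: MT = a + b * log2(D/W + 1)
        return a + b * math.log2(distance / width + 1)

    print(movement_time(800, 40))   # a generously sized button: ~0.76s
    print(movement_time(800, 8))    # a tiny popup-close target: ~1.10s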

There are several possible explanations for this wave of quality loss. The first is the one I suggested above: that there’s an increase in unskilled software builders who are just not able to build robust products, especially apps. About a third of our undergraduates seem to have side hustles where they’re building apps, and the quality of the academic work they submit doesn’t suggest that these apps represent value for money, even if free.

Second, it may be that the environments in which new software is deployed have reached a limit where robustness is no longer plausible. This could be true at the OS level (e.g. Windows), or phone systems (e.g. Android) or web browsers. In all of these environments the design goal has been to make them infinitely extensible but also (mostly) backwards compatible — and maybe this isn’t really possible. Certainly, it’s easy to get the impression that the developers never tried their tools — “How could they not notice that” is a standard refrain.

Third, it may be that there’s a mindset among the developers of free-to-the-user software (where the payment comes via monetising user behaviour) that free software doesn’t have to be good software — because the punters will continue to use it, and how can they complain?

Whichever of these explanations (or some other one) is true, it looks like we’re in for a period in which our computational lives are going to get more irritating and expensive.

‘AI’ performance not what it seems

As I’ve written about before, ‘AI’ tends to be misused to refer to almost any kind of data analytics or derived tool — but let’s, for the time being, go along with this definition.

When you look at the performance of these tools and systems, it’s often quite poor, but I claim we’re getting fooled by our own cognitive biases into thinking that it’s much better than it is.

Here are some examples:

  • Netflix’s recommendations for any individual user seem to overlap 90% with the ‘What’s trending’ and ‘What’s new’ categories. In other words, Netflix is recommending to you more or less what it’s recommending to everyone else. Other recommendation systems don’t do much better (see my earlier post on ‘The Sound of Music Problem’ for part of the explanation).
  • Google search results are quite good at returning, in the first few links, something relevant to the search query, but we don’t ever get to see what was missed and might have been much more relevant.
  • Google News produces what, at first glance, appear to be quite reasonable summaries of recent relevant news, but when you use it for a while you start to see how shallow its selection algorithm is — putting stale stories front and centre, and occasionally producing real howlers, weird stories from some tiny venue treated as if they were breaking and critical news.
  • Self driving cars that perform well, but fail completely when they see certain patches on the road surface. Similarly, facial recognition systems that fail when the human is wearing a t-shirt with a particular patch.

The commonality between these examples, and many others, is that the assessment from use is, necessarily, one-sided — we get to see only the successes and not the failures. In other words (HT Donald Rumsfeld), we don’t see the unknown unknowns. As a result, we don’t really know how well these ‘AI’ systems really do, and whether it’s actually safe to deploy them.

Some systems are ‘best efforts’ (Google News) and that’s fair enough.

But many of these systems are beginning to be used in consequential ways and, for that, real testing and real public test results are needed. And not just true positives, but false positives and false negatives as well. There are two main flashpoints where this matters: (1) systems that are starting to do away with the human in the loop (self driving cars, 737 Maxs); and (2) systems where humans are likely to say or think ‘The computer (or worse, the AI) can’t be wrong’; and these are starting to include policing and security tools. Consider, for example, China’s social credit system. The fact that it gives low scores to some identified ‘troublemakers’ does not imply that everyone who gets a low score is a troublemaker — but this false implication lies behind this and almost all discussion of ‘AI’ systems.

Huawei’s new problem

The Huawei Cyber Security Evaluation Centre (HCSEC) is a joint effort, between GCHQ and Huawei, to increase confidence in Huawei products for use in the UK. It’s been up and running since 2013.

Its 2018 report focused on issues of replicable builds. Binaries compiled in China were not the same size as binaries built in the UK. To a computer scientist, this is a bad sign, since it suggests that the code contains conditional compilation statements such as:

    #ifdef COUNTRY_UK
        insert_backdoor();
    #endif

In the intervening year, they dug into this issue, and the answer they came up with was unexpected. It turns out that the problem is not a symptom of malice, but a symptom of incompetence. The code is simply not well enough engineered to produce consistent results.

Others have discussed the technical issues in detail:

https://www.theregister.co.uk/2019/03/28/hcsec_huawei_oversight_board_savaging_annual_report/

but here are some quotes from the 2019 report:

“there remains no end-to-end integrity of the products as delivered by Huawei and limited confidence on Huawei’s ability to understand the content of any given build and its ability to perform true root cause analysis of identified issues. This raises significant concerns about vulnerability management in the long-term”

“Huawei’s software component management is defective, leading to higher vulnerability rates and significant risk of unsupportable software”

“No material progress has been made on the issues raised in the previous 2018 report”

“The Oversight Board continues to be able to provide only limited assurance that the long-term security risks can be managed in the Huawei equipment currently deployed in the UK”

Not only is the code quality poor, but they see signs of attempts to cover up the shortcuts and practices that led to the issue in the first place.

The report is also scathing about Huawei’s efforts/promises to clean up its act; and they estimate a best case timeline of 5 years to get to well-implemented code.

5G (whatever you take that to mean) will be at least ten times more complex than current networking systems. I think any reasonable computer scientist would conclude that Huawei will simply be unable to build such systems.

Canada, and some other countries, are still debating whether or not to ban Huawei equipment. This report suggests that such decisions can be depoliticised, and made purely on economic grounds.

But, from a security point of view, there’s still an issue — the apparently poor quality of Huawei software creates a huge threat surface that can be exploited by the governments of China (with or without Huawei involvement), Russia, Iran, and North Korea, as well as non-state actors and cyber criminals.

(Several people have pointed out that other network multinationals have not been scrutinised at the same depth and, for all we know, they may be just as bad. This seems to me implausible. One of the unsung advantages that Western businesses have is the existence of NASA, which has been pioneering reliable software for 50 years. If you’re sending a computer on a one-way trip to a place where no maintenance is possible, you pay a LOT of attention to getting the software right. The ideas and technology developed by NASA have had an influence in software engineering programs in the West that has tended to raise the quality of all of the software developed there. There have been unfortunate lapses, whenever the idea that software engineering is JUST coding becomes popular (Windows 95, Android apps) but overall the record is reasonably good. Lots better than the glimpse we get of Huawei, anyway.)

Annular similarity

When similarity is used for clustering, then obviously the most similar objects need to be placed in the same cluster.

But when similarity is being used for human consumption, a different dynamic is in play — humans usually already know what the most similar objects are, and are interested in those that are (just) beyond those.

This can be seen most clearly in recommender systems. Purchase an item or watch a Netflix show, and your recommendation list will fill up with new objects that are very similar to the thing you just bought/watched.

From a strictly algorithm point of view, this is a success — the algorithm found objects similar to the starting object. But from a human point of view this is a total fail because it’s very likely that you, the human, already know about all of these recommended objects. If you bought something, you probably compared the thing you bought with many or all of the objects that are now being recommended to you. If you watched something, the recommendations are still likely to be things you already knew about.

The misconception about what similarity needs to mean to be useful to humans is at the heart of the failure of recommender systems, and even the ad serving systems that many of the online businesses make their money from. Everyone has had the experience of buying something, only to have their ad feed (should they still see it) fill up with ads for similar products (“I see you just bought a new car — here are some other new cars you might like”).

What’s needed is annular similarity — a region that is centred at the initial object, but excludes new objects that are too similar, and focuses instead on objects that are a bit similar.

Amazon tries to do this via “People who bought this also bought” which can show useful add-on products. (They also use “People who viewed this also viewed” but this is much less effective because motivations are so variable.) But this mechanism also fails because buying things together doesn’t necessarily mean that they belong together — it’s common to see recommendations based on the fact that two objects were on special on the same day, and so more likely to be bought together because of the opportunity, rather than any commonality.

Annular similarity is also important in applications that help humans to learn new things: web search, online courses, intelligence analysis. That’s why we built the ATHENS divergent web search engine (refs below) — give it some search terms and it returns (clusters of) web pages that contain information that is just over the horizon from the search terms. We found that this required two annuli — we first constructed the information implicit in the search terms, then an annulus around that of information that we assumed would be known to someone who knew the core derived from the search terms, and only then did we generate another annulus which contains the results returned.

We don’t know many algorithmic ways to find annular similarity. In any distance-based clustering it’s possible, of course, to define an annulus around any point. But it’s tricky to decide on what the inner and outer radii should be, the calculations have to happen in high-dimensional space where the points are very sparse, and it’s not usually clear whether the space is isotropic.

Annular similarity doesn’t work (at least straightforwardly) in density-based clustering (e.g. DBSCAN) or distribution-based clustering (e.g. EM) because the semantics of ‘cluster’ doesn’t allow for an annulus.

One way that does work (and was used extensively in the ATHENS system) is based on singular value decomposition (SVD). An SVD projects a high-dimensional space into a low-dimensional one in such a way as to preserve as much of the variation as possible. One of its useful side-effects is that a point that is similar to many other points tends to be projected close to the origin; and a point that is dissimilar to most other points also tends to be projected close to the origin, because the dimension(s) it inhabits have little variation and tend to be projected away. In the resulting low-dimensional projection, points far from the origin tend to be interestingly dissimilar to those at the centre of the structure — and so an annulus imposed on the embedding tends to find an interesting set of objects.
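A minimal numpy sketch of the idea; the inner and outer radii are left as parameters because, as noted, choosing them is the hard part:

    import numpy as np

    def annular_candidates(X, r_inner, r_outer, k=2):
        # project the rows of X into k dimensions via an SVD
        U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
        coords = U[:, :k] * s[:k]
        radii = np.linalg.norm(coords, axis=1)
        # keep points interestingly far from the origin,
        # but not the extreme outliers
        return np.where((radii >= r_inner) & (radii <= r_outer))[0]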

Unfortunately this doesn’t solve the recommender system problem because recommenders need to find similar points that have more non-zeroes than the initial target point — and the projection doesn’t preserve this ordering well. That means that the entire region around the target point has to be searched, which becomes expensive.

There’s an opportunity here to come up with better algorithms to find annular structures. Success would lead to advances in several diverse areas.

(A related problem is the Sound of Music problem, the tendency for a common/popular object to muddle the similarity structure of all of the other objects because of its weak similarity to all of them. The Sound of Music plays this role in movie recommendation systems, but think of wrapping paper as a similar object in the context of Amazon. I’ve written about this in a previous post.)


Tracy A. Jenkin, Yolande E. Chan, David B. Skillicorn, Keith W. Rogers: Individual Exploration, Sensemaking, and Innovation: A Design for the Discovery of Novel Information. Decision Sciences 44(6): 1021-1057 (2013)

Tracy A. Jenkin, David B. Skillicorn, Yolande E. Chan: Novel Idea Generation, Collaborative Filtering, and Group Innovation Processes. ICIS 2011

David B. Skillicorn, Nikhil Vats: Novel information discovery for intelligence and counterterrorism. Decision Support Systems 43(4): 1375-1382 (2007)

Nikhil Vats, David B. Skillicorn: Information discovery within organizations using the Athens system. CASCON 2004: 282-292