Secrets and authentication: lessons from the Yahoo hack

Authentication (I’m allowed to access something or do something) is based on some kind of secret. The standard framing of this is that there are three kinds of secrets:

  1. Something I have (like a device that generates 1-time keys)
  2. Something I am (like a voiceprint or a fingerprint), or
  3. Something I know (like a password).

There are problems with the first two mechanisms. Having something (a front door key) is the way we authenticate getting into our houses and offices, but it doesn’t transfer well to the digital space. Being something looks like it works better but suffers from the problem that, if the secret becomes widely known, there’s often no way to change the something (“we’ll be operating on your vocal cords to change your voice, Mr. Smith”). Which is why passwords tend to be the default authentication mechanism.

At first glance, passwords look pretty good. I have a secret, the password, and the system I’m authenticating with has another secret, the encrypted version of the password. Unfortunately, the system’s secret isn’t very secret, because the encrypted version of my password is almost always transmitted in the clear, and the prevalence of wifi makes it easy to intercept. Getting from the system’s secret to mine is hard, which is supposed to prevent reverse engineering my secret from the system’s.

The problem is that the space of possible passwords is small enough that the easy mapping, from my secret to the system’s, can be tried for all strings of reasonable length. So brute force enables the reverse engineering that was supposed to be hard. Making passwords longer and more random helps, but only at the margin.
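To make the brute-force point concrete, here is a minimal sketch in Python. It assumes the system’s secret is an unsalted SHA-256 hash of the password, which is a stand-in for whatever encryption a real system uses; the names stored_form and brute_force are hypothetical, and the alphabet and length limit are kept tiny so that the example runs quickly.

```python
import hashlib
import itertools
import string


def stored_form(password: str) -> str:
    """The easy direction: my secret -> the system's secret.

    Assumption: an unsalted SHA-256 hash stands in for the
    'encrypted version' of the password.
    """
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def brute_force(target: str, alphabet: str = string.ascii_lowercase, max_len: int = 4):
    """Try the easy mapping on every string up to max_len characters."""
    for length in range(1, max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            guess = "".join(chars)
            if stored_form(guess) == target:
                return guess
    return None  # not found within the search space


leaked = stored_form("dog")      # what an eavesdropper might capture
recovered = brute_force(leaked)  # recover the password by enumeration
print(recovered)
```

Each extra character multiplies the search space by the alphabet size, which is why longer and more random passwords help, but the attacker’s loop stays exactly the same.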

We could instead make the secret a function instead of a string. As the very simplest example, the system could present me with a few small integers, and my authentication would be based on knowing that I’m supposed to add the first two and subtract the third. My response to the system is the resulting value. Your secret might be to add the first and the third and ignore the second.

But limitations on what humans can compute on the fly mean that the space of functions can’t actually be very large, so this doesn’t lead to a practical solution.
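The function-as-secret idea can be sketched as a tiny challenge-response exchange. This is only an illustration: the function names, the choice of three single-digit integers, and the way the system stores each user’s function are all assumptions made for the example.

```python
import random


def my_secret(a: int, b: int, c: int) -> int:
    """My secret: add the first two numbers and subtract the third."""
    return a + b - c


def your_secret(a: int, b: int, c: int) -> int:
    """Your secret: add the first and the third, ignore the second."""
    return a + c


def make_challenge(rng: random.Random, count: int = 3, low: int = 1, high: int = 9):
    """The system presents a few small integers."""
    return tuple(rng.randint(low, high) for _ in range(count))


def authenticate(secret_fn, response: int, challenge) -> bool:
    """The system applies the user's registered function and compares."""
    return response == secret_fn(*challenge)


challenge = make_challenge(random.Random())
response = my_secret(*challenge)  # computed in my head
print(authenticate(my_secret, response, challenge))
```

Because only a handful of functions are easy enough to compute mentally, an observer who sees a few challenges and responses can test every candidate function, and two different functions can even agree on a particular challenge by coincidence.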

Some progress can be made by insisting that the system and I have different secrets. Then a hack of either the system or of me by phishing isn’t enough to gain access to the system. There are a huge number of secret sharing schemes of varying complexity. But for the simplest example, my secret is a binary string of length n, and the system’s secret is another binary string of length n. We exchange encrypted versions of our strings, and the system authenticates me if the exclusive-or of its string and mine has a particular pattern. Usefully, I can also find out if the system is genuine by carrying out my own check. This particular pattern is (sort of) a third secret, but one that neither of us has to communicate and so is easier to protect.

This system can be broken, but it requires a brute force attack on the encrypted version of my secret, the encrypted version of the system’s secret, and then working out what function is applied to merge the two secrets (xor here, but it could be something much more complex). And that still doesn’t get access to the third secret.
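A minimal sketch of the exclusive-or scheme, under several assumptions: the strings are 16 bits long, the "particular pattern" is taken to be an exact n-bit value, and the encryption of the exchanged strings is left out entirely. The names make_pair and check are hypothetical.

```python
import secrets

N = 16  # assumed length of each binary string, in bits


def make_pair(pattern: int, n: int = N):
    """Create my string at random and derive the system's string
    so that the exclusive-or of the two equals the agreed pattern."""
    mine = secrets.randbits(n)
    systems = mine ^ pattern
    return mine, systems


def check(own: int, received: int, pattern: int) -> bool:
    """Either side runs the same check on its own string and the
    string it received from the other side."""
    return (own ^ received) == pattern


pattern = 0b1010110011110000  # the third secret, never transmitted
mine, systems = make_pair(pattern)

print(check(systems, mine, pattern))  # the system authenticates me
print(check(mine, systems, pattern))  # I confirm the system is genuine
```

Neither string reveals anything about the pattern on its own, since my string is uniformly random and the system’s string is that random value combined with the pattern. An attacker needs both strings and the combining function before the pattern is exposed.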

Passwords are the dinosaurs of the internet age; secret sharing is a reasonable approach for the short to medium term, but (as I’ve argued here before) computing in compromised environments is still the best hope for the longer term.

Security theatre lives

Sydney tests its emergency notification system in the downtown core at the same time of day every time. So if a person wanted to cause an incident, guess what time they would choose?

It also seems to be done on Fridays, which is exactly the worst day to choose, since it’s the most common day for Islamist incidents.

Security theatre = doing things that sound like they improve security without actually improving it (and sometimes making it worse).

“But I don’t have anything to hide” Part III

I haven’t been able to verify it, but Marc Goodman mentions (in an interview with Tim Ferriss) that the Mumbai terrorists searched the online records of hostages when they were deciding who to kill. Another reason not to be profligate about what you post on social media.

The growing role of data curation

My view of Data Science, or Big Data if you prefer, is that it divides naturally into three different subfields:

  1. Data curation, which deals with managing large amounts of heterogeneous data, but is primarily concerned with provenance, that is, tracking the metadata about the data.
  2. Computational science, which builds models of the real world inside computer systems to study their properties.
  3. Analytics, which infers the properties of systems based on data about them.

I’ve posted about these ideas previously (https://skillicorn.wordpress.com/2015/05/09/why-data-science/).

Data curation might have seemed like the poor cousin among these three, and certainly gets the least funding and attention.

But issues of provenance have suddenly become mainstream as everyone on the web struggles to figure out what to do about fake news stories. So far, the Internet has not really addressed the issues of metadata. Most of the big content providers know who generated the content that they create and distribute, but they don’t necessarily make this information known or available for those who read the content to leverage. It’s time for the data curation experts, who tend to come from information systems and library science, to step up.

Data curation is also about to become the front line in cyberattack. As I’ve suggested (Skillicorn, DB, Leuprecht, C, and Tait, V. 2016. Beyond the Castle Model of Cybersecurity. Government Information Quarterly.), a natural cyberdefence strategy is replication. Data exfiltration is made much more difficult if there are many, superficially similar, versions of any document or data that might be a target. However, progress in assigning provenance becomes the cyberattack that matches this cyberdefence.

So here’s the research question for data curation: how can I tell, from the internal evidence, and partial external evidence, whether this particular document is legitimate (or is the legitimate version of a set of almost-replicates)?

6.5/7 US presidential elections predicted from language use

I couldn’t do a formal analysis of Trump/Clinton language because Trump didn’t put his speeches online — indeed many of them weren’t scripted. But, as I posted recently, his language was clearly closer to our model of how to win elections than Clinton’s was.

So since 1992, the language model has correctly predicted the outcome, except for 2000 when the model predicted a very slight advantage for Gore over Bush (which is sort of what happened).

People judge candidates on who they seem to be as a person, a large part of which is transmitted by the language they use. Negative and demeaning statements obviously affect this, but so do positivity and optimism.

Voting is not rational choice

Pundits and the media continue to be puzzled by the popularity of Donald Trump. They point out that much of what he says isn’t true, that his plans lack content, that his comments about various subgroups are demeaning, and so on, and so on.

Underlying these plaintive comments is a fundamental misconception about how voters choose the candidate they will vote for. This has much more to do with the standard human judgements of character and personality, made in the first few seconds, than it does with calm, reasoned decision making.

Our analysis of previous presidential campaigns (about which I’ve posted earlier) makes it clear that this campaign is not fundamentally different in this respect. It’s always been the case that voters decide based on the person who appeals to them most on a deeper than rational level. As we discovered, the successful formula for winning is to be positive (Trump is good at this), not to be negative (Trump is poor at this), not to talk about policy (Trump is good at this), and not to talk about the opponent (Trump is poor at this). On the other hand, Hillary Clinton is poor at all four — she really, really believes in the rational voter.

We’ll see what happens in the election this week. But apart from the unusual facts of this presidential election, it’s easy to understand why Trump isn’t doing worse and Hillary Clinton isn’t doing better from the way they approach voters.

It’s not classified emails that are the problem

There’s been reporting that the email trove, belonging to Huma Abedin but found on the laptop of her ex-husband, got there as the result of automatic backups from her phone. This seems plausible; if it is true then it raises issues that go beyond whether any of the emails contain classified information or not.

First, it shows how difficult it is for ordinary people to understand, and realise, the consequences of their choices about configuring their life-containing devices. Backing up emails is good, but every user needs to understand what that means, and how potentially invasive it is.

Second, to work as a backup site, this laptop must have been Internet-facing and (apparently) unencrypted. That means that more than half a million email messages were readily accessible to any reasonably adept cybercriminal or nation-state. If there are indeed classified emails among them, then that’s a big problem.

But even if there are not, access to someone’s emails, given the existence of textual analytics tools, means that a rich picture can be built up of that individual: what they are thinking about, who they are communicating with (their ego network in the jargon), what the rhythm of their day is, where they are located physically, what their emotional state is like, and even how healthy they are.

For any of us, that kind of analysis would be quite invasive. But when the individual is a close confidante of the U.S. Secretary of State, and when many of the emails are from that same Secretary, a picture of them at this level of detail is valuable, and could be exploited by an adversary.

Lawyers and the media gravitate to the classified information issue. This is a 20th Century view of the problems that revealing large amounts of personal text causes. The real issue is an order of magnitude more subtle, but also an order of magnitude more dangerous.