Posts Tagged 'fraud'

Finally — the end of the Castle Model of cybersecurity?

The Castle Model is the way cybersecurity has been done for the last 20 years. The idea is to build security that keeps the bad guys out of your system — you can tell what the metaphor is from the names that are used: INTRUSION detection, fireWALL. Of course, this isn’t the whole story; people have become accustomed to running antivirus scans and (less often) anti-malware scans, but the idea of perimeter defence is deeply ingrained.

We don’t even behave that way in the real world. If you owned a castle with thick walls and the drawbridge up, you would still raise an eyebrow at a band of marauders wandering around inside, looting and pillaging. But in the online world, we’re all too likely to let anyone who can get past the perimeter do pretty much anything they want. And, by the way, insiders are already inside the perimeter, which is why they are such a large threat.

The credit card hack at Global Payments, finally made public last week, is a good example. First, the PCI DSS, which defines the standards for credit-card processing security, only mandates that user data be “protected” but doesn’t say how. Commentators on this incident have assumed that the data held by Global Payments was all encrypted, but nothing in the requirements says it has to be, so perhaps it wasn’t. Global Payments also clearly didn’t have the right kind of sanity checks on exfiltration of data. Even if the hack came through an account belonging to someone with a legitimate need to look at transactions, surely there should have been controls limiting such access to one day’s worth, or a few thousand records, or something like that. Exporting 1.5 million transactions should surely have required some extra level of authentication and the involvement of an actual person at Global Payments. But the bigger issue is that the PCI DSS doesn’t mandate any “inside the gates” security measures.

So what’s the alternative to the castle model? We are still thinking this through, but it must involve controls on who can do what inside the system (as we usually do in even moderately secure real-world settings), controls on exfiltration of data (downloading, copying to portable devices, outgoing email), and especially on the size of outgoing data, and better logging and internal observation (real-world buildings have a night watchman to limit what can be done in the quiet times).
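
As a toy illustration of the exfiltration controls described above, here is a minimal sketch in Python of a volume guard that holds any export that would push an account past a daily budget. Everything in it — the names, the threshold, the escalation policy — is invented for illustration, not a description of any real system.

    from collections import defaultdict
    from datetime import date

    # Illustrative daily budget: how many records one account may export
    # before a human has to approve the transfer (the threshold is invented).
    DAILY_EXPORT_LIMIT = 5000

    _exported_today = defaultdict(int)  # (user, date) -> records exported so far

    def approve_export(user: str, n_records: int) -> bool:
        """Return True if the export may proceed automatically; anything
        that would exceed the daily budget is held for manual review."""
        key = (user, date.today())
        if _exported_today[key] + n_records > DAILY_EXPORT_LIMIT:
            return False  # escalate: extra authentication and a human decision
        _exported_today[key] += n_records
        return True

    # Routine access goes through; a 1.5-million-record export is stopped:
    assert approve_export("analyst7", 120) is True
    assert approve_export("analyst7", 1_500_000) is False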

Even the U.S. military, whose networks are air-gapped from the internet, admits that penetration of its networks is so complete that it’s pointless to concentrate on defending their borders, and more important to focus on controlling access to the data held within them (BBC story).

It’s time for a change of metaphor in cybersecurity — the drawbridge is down whether we like it or not, and so we need to patrol the corridors and watch for people carrying suspiciously large bags of swag.

Anomalies in record-based data

Many organisations have large datasets whose entities are records, perhaps records of transactions. In some settings, such as detecting credit-card fraud, sophisticated sets of rules have been developed to decide which records deserve further attention as potentially fraudulent. What does an organisation do, however, when it has a large dataset like this, hasn’t developed a model of what “interesting” records look like, but would still like to focus attention on “interesting” records — usually because there aren’t enough resources even to look at all of the records individually?

One way to decide which records are interesting is to label records as uninteresting if there are a lot of other records like them. I have developed ways to rank records by interestingness using this idea.
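
This is not the actual ranking method used below, but a minimal sketch of the idea in Python: turn each record into a few crude numeric features, then score each record by the distance to its nearest neighbours, so that records with many close lookalikes rank as uninteresting. The feature choices and the toy records are invented for illustration.

    import numpy as np

    def featurize(record):
        # Crude, invented features: description length, supplier-name length,
        # whether the supplier looks like a bare number, and the dollar amount.
        desc, supplier, amount = record
        return [len(desc), len(supplier),
                float(supplier.replace(" ", "").isdigit()), amount]

    def interestingness(records, k=3):
        # Score each record by the mean distance to its k nearest neighbours;
        # records with many close lookalikes score low (uninteresting).
        X = np.array([featurize(r) for r in records], dtype=float)
        X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)  # normalise features
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        D.sort(axis=1)            # column 0 is each record's distance to itself
        return D[:, 1:k + 1].mean(axis=1)

    records = [
        ("REPAIR PARTS", "L", 5872.52),
        ("EARTH MOVING EQUIPMENT PARTS", "439", 2134.05),
        ("SUPPLY OF COATS AND TROUSERS", "ADA", 1032350.00),
        ("OFFICE SUPPLIES", "STAPLES", 120.00),
        ("OFFICE SUPPLIES", "STAPLES", 130.00),
        ("OFFICE SUPPLIES", "STAPLES", 125.00),
    ]
    for score, rec in sorted(zip(interestingness(records), records), reverse=True):
        print(f"{score:6.2f}  {rec}")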

So when the Sydney Morning Herald published a dataset of Australian defence contracts (700,000 of them), I thought I would try my approach. The results are interesting. Here are the most unusual records from this ranking (the columns are contract number, description, contracting agency, start date, end date, amount, and supplier):

1.   1217666,REPAIR PARTS,Department of Defence,16-October-2002,,5872.52,L
This one comes at the top of the list because the supplier name is unusual, only a single letter.

2.  1120859,Supply of,Department of Defence,15-May-2002,,0,C & L AEROSPACE
This one has a very short description and an amount of $0.

3.  854967,EARTH MOVING EQUIPMENT PARTS FOR REPAIR,Department of Defence,21-May-2002,,2134.05,439
Unusual because the supplier name is a number.

4.  956798,PRESSURE GAUGE (WRITE BACK  SEE ROSS DAVEY),Department of Defence,11-September-2002,,1,WORMALD FIRE & SAFETY
Unusual because of the extra detail in the description and the cost of $1.

5.  1053172,5310/66/105/3959.PURCHASE OF WASHER  FLAT.*CANCELLED* 29/04/03,Department of Defence,12-February-2003,,0,ID INTERNATIONAL
Unusual because of the $0 value, and because the description records a cancellation.

6.  868380,cancelled,Department of Defence,14-June-2002,,0,REDLINE
Unusual again because of the description and the dollar value.

7.  1043448,tetanus immunoglobulin-human,Department of Defence,10-January-2003,,1,AUSTRALIAN RED CROSS
Unusual because of the low dollar value.

8.  1014322,NATIONAL VISA PURCHASING,Department of Defence,18-October-2002,,26933.99,NAB 4715 2799 0000 0942
Unusual because the supplier is a bank account number (and so numeric); also a largish dollar value.

9.  1023922,NATIONAL VISA PURCHASING,Department of Defence,18-September-2002,,25586.63,NAB 4715 2799 0000 0942
Same sort of pattern as (8) — globally unusual but similar to (8); note the common date.

10.  968986,COIL  RADIO FREQUENCY,Department of Defence,27-September-2002,,2305.6,BAE
Unusual because of the short supplier name and large dollar value.

11.  887357,SWIMMING POOL COVER.,Department of Defence,07-May-2002,,7524,H & A TEC
Unusual supplier name and large (!!) dollar value — hope it’s a big pool.

12.  1010554,NAB VISA CARD,Department of Defence,02-August-2002,,16223.19,NAB 4715 2799 0000 0942
Another numeric bank account number as supplier, and a large dollar amount.

13.  1005569,Interest,Department of Defence,12-August-2002,,2222.99,NAB 4715 2799 0000 1494
And again.

14.  925011,FLIR RECORDER REPPRODUCER SET REPAIR KIOWA,Department of Defence,16-August-2002,,1100,BAE
Short supplier name, and a long description with unusual words.

15.  1012869,NAB VISA STATEMENT,Department of Defence,22-August-2002,,12934.87,NAB 4715 2799 0000 0942
Another financial transaction.

16.  1073019,NATIONAL VISA,Department of Defence,03-February-2003,,10060.16,NAB 4715 2799 0000 0942
And again.

17.  969039,SUSPENDERS  WHITE,Department of Defence,30-September-2002,,41800,ADA
Short supplier name and very large dollar amount (hopefully not just one suspender).

18.  1097060,Purchase of Coveralls  Flyers  Lightweight  Sage Green.,Department of Defence,11-February-2003,,18585.6,ADA
Again a short supplier name and a large dollar amount.

19.  959232,SUPPLY OF COATS AND TROUSERS DPDU,Department of Defence,23-September-2002,,1032350,ADA
Again a short supplier name and a very (!!) large dollar amount.

Clearly the process is turning up records that are quite unusual within this large set, and that might sometimes be worth further investigation.

This technique can be applied to any record-based data. As well as providing a version of the data ranked by interestingness, it also provides a graphical view of the data, and some indication of the density of unusual records compared to ordinary records. As the example shows, it also often turns up technical problems with the way the data was collected, since records with the wrong fields, or with fields in the wrong place, will usually show up as anomalous. Some of the top records above are (probably) there not because they are really unusual but because something went wrong with the capture of the supplier names. So it can be used for quality control as well.

Finding significance automatically

In a world where lots of data is collected and available, the critical issue for intelligence, law enforcement, fraud, and cybersecurity analysts is attention.

So the critical issue for tools to support such analysts is focus: how can the most significant and interesting pieces of data/information/knowledge be made the easiest to pay attention to?

This isn’t an easy issue to address, for many reasons, some of which I talked about a few posts ago in the context of connecting the dots. But the fundamental problems are: (1) significance and interestingness are highly context-dependent, so where to focus depends, in a complex way, on what the analyst already knows and understands; and (2) every new piece of information has the potential to completely alter the entire significance landscape in one hit.

Many existing tools try, underneath, to address the issue of focus indirectly, by providing ways for analysts to control their own focus more directly. For example, there are many analyst platforms that allow available information to be sliced and diced in many different ways. These allow two useful things to be done: (1) dross (the guaranteed insignificant stuff) can be removed (or at least hidden from sight); and (2) the rest of the data can be rearranged in many different ways, in the hope that human pattern-recognition skills can be brought to bear to find significance.

But it seems like a good idea to try and address the significance issue more directly. This has motivated a couple of the research projects I’m involved with:

  • The ATHENS system tries to find information on the web that is probably new to the user, but which s/he is well-positioned to understand; in other words, the new information is just over the horizon from the user’s current landscape. It builds this new information starting from a query that allows the user to provide the current context;
  • Finding anomalies in large graphs. Lots of data is naturally represented relationally as a graph, with nodes representing some kind of entities, and edges representing some kind of (weighted) similarity between some subset of the nodes (e.g. social networks). Graphs are difficult to work with because they don’t really have a representation that humans can work with — even drawing nice pictures of them tends to (a) occlude chunks once the graph gets big enough, and (b) hide the anomalous structure in the corners, because the nice representation is derived from the big structure (e.g. the simple bits of the automorphism group). We’ve developed some tools that find and highlight anomalous regions — anomalous in the sense that, if you were to stand at their nodes and look at the landscape of the rest of the graph, it would look unusual (a small sketch of this idea appears after this list).
  • Finding anomalies in text, caused either by a desire to obfuscate the content being talked about, or by an unusual internal mental state — being deceptive, or highly tense, for example.
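
To make the “stand at a node and look at the landscape” idea concrete, here is a minimal sketch in Python — not our actual tool, and the local-view features (degree, mean neighbour degree, local clustering) are chosen purely for illustration. It describes each node by its local view of the graph and ranks nodes by how far that view lies from the typical one.

    import networkx as nx
    import numpy as np

    def node_views(G):
        # What the graph looks like "standing at" each node: its degree,
        # the average degree of its neighbours, and its local clustering.
        clust = nx.clustering(G)
        feats = {}
        for v in G:
            nbr_deg = [G.degree(u) for u in G[v]]
            feats[v] = [G.degree(v),
                        float(np.mean(nbr_deg)) if nbr_deg else 0.0,
                        clust[v]]
        return feats

    def anomalous_nodes(G, top=3):
        # Rank nodes by how far their local view lies from the average view.
        feats = node_views(G)
        X = np.array(list(feats.values()), dtype=float)
        z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)
        scores = np.linalg.norm(z, axis=1)
        return sorted(zip(feats.keys(), scores), key=lambda p: -p[1])[:top]

    # A ring of ordinary nodes, plus one node wired to everything:
    G = nx.cycle_graph(12)
    G.add_edges_from((99, v) for v in range(12))
    print(anomalous_nodes(G))   # node 99 stands out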

Some other people are working in similar directions. For example, there is some work aimed at using social processes to help discover significance. In a sense, sites like Slashdot work this way — each user provides some assessment of quality/importance of some stories, and in return gets information about the quality/importance of other stories. This is also, of course, how refereed publications are supposed to work. The challenge is to contextualize this idea: what makes an object high quality/important for you may not mean anything to me. In other words, most significance lies somewhere on the spectrum from universal agreement to completely taste-based, and it’s hard to tell where, let alone compute it in a practical way.

Knowledge Discovery for Counterterrorism and Law Enforcement

My new book, Knowledge Discovery for Counterterrorism and Law Enforcement, is out. You can buy a copy from:

The publisher’s website

Amazon.

(Despite what these pages say, the book is available or will be within a day or two.)

As the holiday season approaches, perhaps you have a relative who’s in law enforcement, or intelligence, or security? What could be better than a book! Or maybe you’d like to buy one for yourself.

(A portion of the price of this book goes to support deserving university faculty.)

Doing prediction in adversarial settings

The overall goal of prediction in adversarial settings is to stop bad things happening — terrorist attacks, fraud, crime, money laundering, and lots of other things.

People intuitively think that the way to address this goal is to try to build a predictor for the bad thing. A few moments’ thought shows that building such a predictor is a very difficult, maybe impossible, thing to do. So some people immediately conclude that it’s silly, or a waste of money, to try to address such goals using knowledge discovery.

There are a couple of obvious reasons why direct prediction won’t work. The first is that bad guys have a very large number of ways in which they can achieve their goal, and it’s impossible for the good guys to consider every single one in designing the predictive model.

This problem is very obvious in intrusion detection, trying to protect computer systems against attacks. There are two broad approaches. The first is to keep a list of bad things, and block any of them when they occur. This is how antivirus software works — every day (it was every week; soon it will be every hour) new additions to the list of bad things have to be downloaded. Of course, this doesn’t predict so-called zero-day attacks, which use some mechanism that has never been used before and so is not on the list of bad things. The second approach is to keep track of what has happened before, and prevent anything new from happening (without some explicit approval by the user). The trouble is that, although there are some regularities in what a user or a system does every day, there are always new things — new websites visited, email sent to new addresses. As a result, alarms are triggered so often that they drive everyone mad, and such systems often get turned off. Vista’s User Account Control is a bit like this.

The other difficulty with using direct prediction is making it accurate enough. Suppose that there are two categories that we want to predict: good, and bad. A false positive is when a good record is predicted to be bad; and a false negative is when a bad record is predicted to be good. Both kinds of wrong predictions are a problem, but in different ways. A false positive causes annoyance and irritation, and generates extra work, since the record (and the person it belongs to) must be processed further. However, a false negative is usually much worse — because it means that a bad guy gets past the prediction mechanism.

Prediction technology is considered to be doing well if it achieves a prediction accuracy of around 90% (the percentage of records predicted correctly). It would be fabulous if it achieved an accuracy of 99%. But when the number of records is 1 million, a misclassification rate of 1% is 10,000 records! The consequences of this many mistakes would range from catastrophic to unusable.
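
The arithmetic is worth spelling out. A minimal check, using the one million records mentioned above:

    n_records = 1_000_000
    for accuracy in (0.90, 0.99):
        errors = round(n_records * (1 - accuracy))
        print(f"accuracy {accuracy:.0%}: {errors:,} records misclassified")
    # accuracy 90%: 100,000 records misclassified
    # accuracy 99%: 10,000 records misclassified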

These problems with prediction have been pointed out in the media and in some academic writing, as if they meant that prediction in adversarial settings is useless. This is a bit of an argument against a straw man. What is needed is a more thoughtful way of thinking about how prediction should be done, which I’ll talk about in the next posting.

Looking for Bad Guys III: Using manipulation

Bad guys who are aware that knowledge-discovery tools will be used to look for them may also try to actively manipulate the process to their own advantage.

One way to do this is to get an insider working for them, someone who can alter the data or the results of the analysis to their benefit. This is probably the most common method: over all of history, probably more sieges have succeeded because someone opened the gates from the inside than because the walls were broken through. It’s easy to get caught up in the cleverness of technology and forget that sometimes suborning someone is the easiest attack.

However, the focus of this blog is knowledge discovery, so let me concentrate on that. Before we talk about how manipulation can be exploited as a discovery tool, we need to talk about what manipulation looks like; and before we can do that, we need to think about the structure of the knowledge-discovery process.

It’s helpful to divide up the stages of knowledge discovery into:

  1. Collecting the data (CCTV images, transaction logs);
  2. Analysing the data (the part that’s usually thought of as the heart of knowledge discovery);
  3. Deciding on what to do with the results and taking action.

Although an adversary can only attack the process via the data that is collected (assuming they don’t have an insider), it is helpful to think of three different kinds of attacks, directed against each of the three stages. The different attacks require understanding different aspects of the knowledge-discovery system.

Manipulating the data collection stage is probably the easiest, because it’s often possible to see and understand how the data is being collected. For example, the fields of view of CCTV cameras can usually be inferred from their positions (even if they are enclosed in black plastic bubbles) and so ways to move around them without coming into view can be worked out. Alternatively, disguises can be used to conceal who is being seen, even though an image is captured. One of the reasons identity theft is a big business is that it provides a way to have data captured about you, but data that is useless because it doesn’t connect to the real you.

Manipulating the decision and action stage is done using social engineering. This means trying to create the impression in the minds of the people who are making the decisions and taking the actions that the analysis system has made an error.

Manipulating the analysis stage is surprisingly easy — easier than it should be. This is because most knowledge-discovery technology has been tuned to give good results on data with natural variation. This gives an opportunity to insert data that is the worst possible from the point of view of the algorithms, and so enables bad guys to hide their traces.
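
Here is a minimal sketch of what such manipulation can look like (the data and the detector are invented for illustration, not taken from any deployed system). A detector that flags values more than three standard deviations from the historical mean can be blinded by first seeding the history with a few extreme-but-innocuous records that inflate its estimate of natural variation:

    import numpy as np

    rng = np.random.default_rng(0)

    def is_flagged(history, value, threshold=3.0):
        # A detector tuned to natural variation: flag `value` if it lies
        # more than `threshold` standard deviations from the history.
        history = np.asarray(history, dtype=float)
        return abs(value - history.mean()) / history.std() > threshold

    history = rng.normal(100.0, 5.0, size=500)   # ordinary past transactions
    bad = 125.0                                  # the record to sneak through

    print(is_flagged(history, bad))              # True: z is about 5, caught

    # Before submitting the bad record, the adversary seeds the history
    # with a few extreme-but-innocuous records (chosen so the mean barely
    # moves but the spread grows) ...
    chaff = [35.0, 165.0, 30.0, 170.0, 40.0, 160.0]
    poisoned = np.concatenate([history, chaff])

    print(is_flagged(poisoned, bad))             # False: now slips through

Note that the chaff itself leaves an abnormal signature in the data — which is exactly the point made below: a defender who looks for manipulation, rather than only for bad records, can see it.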

The technology used for knowledge discovery needs to be completely rethought to take manipulation into account. This is primarily why adversarial knowledge discovery is not just another application of knowledge discovery, but a completely different problem.

The good part about this is that attempts at manipulation also create an abnormal signature in the data; and the process can be tuned to look for this signature as well.

What do the traces of bad guys look like?

The presumption of knowledge discovery in adversarial settings is that the traces left by bad guys are somehow different from those of ordinary people doing ordinary and innocuous things. What would those differences be?

There are three main sources of difference.

The first, and most obvious, is that to be a bad guy means to do things that most people don’t do. Given that reasonable attributes are collected, this almost inevitably means that there will be some differences between data records associated with bad guys and those associated with ordinary people.

This does depend on collecting data in which the differences are actually there. For example, if you’re looking for tax fraud, it’s probably useless to collect data about people’s height, since it’s unlikely that tax fraud is associated with being tall, or being short.

So the first thing we expect is that there will be differences between bad-guy data and normal data because of the requirements of whatever bad things they are doing.

People tend to think that this will be the most important, perhaps the only, difference. This isn’t true, unless the bad guys are exceptionally stupid, or knowledge discovery is being applied for the first time and the bad guys did not anticipate its use. This does happen — first attempts to find fraud are often spectacularly successful because they look at aspects of the data that haven’t been examined before.

However, smart bad guys will anticipate the use of knowledge discovery and so they will try, as much as possible, to make the values of the data collected about them seem as innocuous as possible.

So the second difference between bad-guy data and normal data is that bad-guy data is characterized by concealment. But isn’t concealment the absence of a difference? It turns out that, by and large, concealment itself generates a characteristic signature. Knowledge discovery techniques can be tuned to look for this signature, and so to look preferentially for bad guys.

How is it that concealment creates its own signature? Because, as humans, we’re lousy at creating, artificially, anything that we also do naturally.

Think about having a conversation. We can all do it easily. But make it a conversation in front of lots of people, say giving a speech or acting in a play, and suddenly the way we do it changes: voice tremor, hesitation, strange speech rhythms, and so on.

The same phenomena happen when people try to construct data to resemble data that otherwise arises naturally. For example, when people create a fictitious number (for example, on a tax return or in accounts), its digit distribution is quite different from that of naturally occurring numbers, which is described by Benford’s Law. This has been used to detect potentially fraudulent tax returns.
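
Benford’s Law says that, in many naturally arising collections of numbers, the leading digit d occurs with probability log10(1 + 1/d) — about 30% of numbers lead with a 1, and fewer than 5% with a 9. Here is a minimal sketch of the standard check, a chi-squared comparison of observed leading digits against the Benford distribution (the example data is invented):

    import math
    from collections import Counter

    def leading_digit(x):
        return int(str(abs(x)).lstrip("0.")[0])

    def benford_chi2(amounts):
        # Chi-squared distance between the observed leading-digit counts
        # and the counts Benford's Law predicts; large values suggest the
        # numbers may not have arisen naturally.
        counts = Counter(leading_digit(a) for a in amounts if a)
        n = sum(counts.values())
        chi2 = 0.0
        for d in range(1, 10):
            expected = n * math.log10(1 + 1 / d)
            chi2 += (counts.get(d, 0) - expected) ** 2 / expected
        return chi2

    # Amounts that grow multiplicatively fit Benford's Law well; an
    # arithmetic sequence of made-up amounts does not:
    natural = [1.02 ** k for k in range(1, 400)]
    invented = [4000 + 37 * k for k in range(400)]
    print(benford_chi2(natural), benford_chi2(invented))  # small vs. large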

The third difference between bad-guy traces and ordinary data is created when bad guys actively try to manipulate the knowledge discovery process.

Perhaps surprisingly, they don’t have to know a lot about exactly which knowledge discovery process is being used, and exactly how it works, to have a shot at this. But, again, the knowledge discovery process can be trained to look for the signature of manipulation.

In summary, there are three main ways in which the traces of bad guys differ from those of ordinary data:

  1. Difference. The bad-guy data is unusual, perhaps even outlying in the dataset.
  2. Concealment. The bad guys have changed data values to make them seem more normal, but this backfires because it can’t be done — normal is hard to fake.
  3. Manipulation. The bad guys have changed data values to try and force particular behavior from the knowledge discovery process, but this backfires because the manipulation creates its own signature.

Of these three factors, the second and third are often much more significant than the first. “The wicked flee when no man pursueth” (Proverbs 28:1, KJV).