Posts Tagged 'australia'

“Info Systems Must ‘Connect Dots’ on Terrorism” by Mannes and Hendler

There’s a new article in Defense News making some of the same points I’ve been making here: that accumulating data is not enough; what’s missing is tools to process the data and point out to human analysts what and where the important stuff is.

Mannes and Hendler make two good points: that it’s not about more data; and it’s not about being able to fuse data from different sources, although these are both good things. The problem is that there is still too much data to find out anything interesting, especially in a timely way.

I do believe that it’s important and useful to be able to automatically deduce what kind of object a given string describes, and so hypothesise what its attributes might be, so that people’s names come with locations and relatives. I’m not convinced that this has much to do with the Semantic Web (and I’m fairly sure you can generate a knock-down fight by asking any two Semantic Web researchers to give a precise definition of what it is).

It’s sad, in a way, that the technology that is used as the ubiquitous metaphor for post-9/11 knowledge discovery (connecting the dots) is still the piece that isn’t being done! (I made the same point about the Australian government’s white paper which has a gaping hole about exactly this issue.)

Anomalies in record-based data

Many organisations have large datasets whose entities are records, perhaps records of transactions. In some settings, such as detecting credit-card fraud, sophisticated sets of rules have been developed to decide which records deserve further attention as potentially fraudulent. What does an organisation do, however, when it has a large dataset like this, hasn’t developed a model of what “interesting” records look like, but would still like to focus attention on “interesting” records (usually because there aren’t enough resources even to look at all of the records individually)?

One way to decide which records are interesting is to label records as uninteresting if there are a lot of other records like them. I have developed ways to rank records by interestingness using this idea.
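The ranking method itself isn't spelled out in this post, but the general idea (a record is less interesting the more other records resemble it) can be illustrated with a minimal sketch. This is a stand-in under stated assumptions, not the approach that produced the results below: the feature choices (a coarse "shape" of each field plus a length bucket) and the names `shape`, `features` and `rank_by_rarity` are all hypothetical.

```python
import math
import re
from collections import Counter


def shape(value: str) -> str:
    """Collapse a field to a coarse pattern: letter runs -> A, digit runs -> 9."""
    collapsed = re.sub(r"[A-Za-z]+", "A", value)
    return re.sub(r"[0-9]+", "9", collapsed)


def features(record: dict) -> list:
    """Turn one record into hashable features (an illustrative choice only)."""
    feats = []
    for field, value in record.items():
        text = str(value)
        feats.append((field, "shape", shape(text)))
        feats.append((field, "length", min(len(text) // 5, 10)))
    return feats


def rank_by_rarity(records: list) -> list:
    """Return (score, record) pairs, most unusual first.

    A feature shared by many records contributes little to the score;
    a feature seen in only a few records contributes a lot.
    """
    counts = Counter(f for r in records for f in features(r))
    n = len(records)
    scored = []
    for r in records:
        feats = features(r)
        surprise = sum(-math.log(counts[f] / n) for f in feats)
        scored.append((surprise / max(len(feats), 1), r))
    # Sort on the score only, so that tied records are never compared as dicts.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

With records loaded as dictionaries (one key per column), `rank_by_rarity(records)[:20]` would give the twenty records that look least like the rest under these particular features. A supplier field that is a bare number or a single letter has a rare shape and a rare length, so it would tend to rise to the top.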

So when the Sydney Morning Herald published a dataset of Australian defence contracts (700,000 of them) I thought I would try my approach. The results are interesting. Here are the most unusual records from this ranking (the columns are contract number, description, contracting agency, start date, end date, amount, and supplier):

1.   1217666,REPAIR PARTS,Department of Defence,16-October-2002,,5872.52,L
This one comes at the top of the list because the supplier name is unusual, only a single letter.

2.  1120859,Supply of,Department of Defence,15-May-2002,,0,C & L AEROSPACE
This one has a very short description and an amount of $0.

3.  854967,EARTH MOVING EQUIPMENT PARTS FOR REPAIR,Department of Defence,21-May-2002,,2134.05,439
Unusual because the supplier name is a number

4.  956798,PRESSURE GAUGE (WRITE BACK  SEE ROSS DAVEY),Department of Defence,11-September-2002,,1,WORMALD FIRE & SAFETY
Unusual because of the extra detail in the description and the cost of $1

5.  1053172,5310/66/105/3959.PURCHASE OF WASHER  FLAT.*CANCELLED* 29/04/03,Department of Defence,12-February-2003,,0,ID INTERNATIONAL
Unusual because of the dollar value, and the unusual description because of the cancellation

6.  868380,cancelled,Department of Defence,14-June-2002,,0,REDLINE
Unusual again because of the description and dollar value

7.  1043448,tetanus immunoglobulin-human,Department of Defence,10-January-2003,,1,AUSTRALIAN RED CROSS
Unusual because of the low dollar value

8.  1014322,NATIONAL VISA PURCHASING,Department of Defence,18-October-2002,,26933.99,NAB 4715 2799 0000 0942
Unusual because the supplier is a bank account number (and so numeric); also a largish dollar value

9.  1023922,NATIONAL VISA PURCHASING,Department of Defence,18-September-2002,,25586.63,NAB 4715 2799 0000 0942
Same sort of pattern as (8): globally unusual but similar to (8); note the same day of the month

10.  968986,COIL  RADIO FREQUENCY,Department of Defence,27-September-2002,,2305.6,BAE
Unusual because of the short supplier name and large dollar value

11.  887357,SWIMMING POOL COVER.,Department of Defence,07-May-2002,,7524,H & A TEC
Unusual supplier name and large (!!) dollar value (hope it’s a big pool)

12.  1010554,NAB VISA CARD,Department of Defence,02-August-2002,,16223.19,NAB 4715 2799 0000 0942
Another numeric bank account number as supplier and large dollar amount

13.  1005569,Interest,Department of Defence,12-August-2002,,2222.99,NAB 4715 2799 0000 1494
And again

14.  925011,FLIR RECORDER REPPRODUCER SET REPAIR KIOWA,Department of Defence,16-August-2002,,1100,BAE
Short supplier name, long description with unusual words

15.  1012869,NAB VISA STATEMENT,Department of Defence,22-August-2002,,12934.87,NAB 4715 2799 0000 0942
Another financial transaction

16.  1073019,NATIONAL VISA,Department of Defence,03-February-2003,,10060.16,NAB 4715 2799 0000 0942
And again

17.  969039,SUSPENDERS  WHITE,Department of Defence,30-September-2002,,41800,ADA
Short supplier name and very large dollar amount (hopefully not just one suspender)

18.  1097060,Purchase of Coveralls  Flyers  Lightweight  Sage Green.,Department of Defence,11-February-2003,,18585.6,ADA
Again short supplier name and large dollar amount

19.  959232,SUPPLY OF COATS AND TROUSERS DPDU,Department of Defence,23-September-2002,,1032350,ADA
Again short supplier name and very (!!) large dollar amount

Clearly the process is turning up example records that seem to be quite unusual within this large set, and might sometimes be worth further investigation.

This technique can be applied to any record-based data. As well as providing a version of the data ranked by interestingness, it also provides a graphical view of the data, and some indication of what the density of unusual records is compared to ordinary records. As the example shows, what it also often turns up are technical problems with the way that the data was collected, since records with mistakes in fields, with the wrong fields, or with fields in the wrong place will usually turn up as anomalous. Some of the top records are there not because they are really unusual (probably) but because something went wrong with the capture of the supplier names. So it can be used for quality control as well.

Collect-Analyse-Share

The Australian government white paper on counterterrorism, about which I wrote last week, sank below public consciousness without trace, after a 1-day flurry of news stories focused on the $69m to be spent on biometric strengthening of visas from certain countries. So much for the threat of homegrown terrorism!

One of the important facets of the report, and the one closest to what I do, is the emphasis on intelligence-led counterterrorism. This was described using three phases, collect, analyse, and share; all of which were claimed to be deficient, but for only two of which any remedies were suggested (as I blogged about previously). But maybe it’s worth spending some time on each of these phases.

Collect.  I take it that the point of intelligence-led counterterrorism is that the normal sequence is intended to be intelligence–investigation–arrest–nonevent rather than the more typical event–investigation–arrest sequence that tends to happen with e.g. crime, although less and less so. (In fact, for the majority of thwarted attacks, the sequence has been whistleblower–investigation–nonevent, which raises a number of deeper questions.)

Intelligence as a lead operation means ways to get knowledge from the data without having to explicitly look for it — in other words, it has to be fundamentally inductive. But the data has to be there  in the first place, so what kind of data should be collected?

There are two fundamentally different answers to this question, and so two very different kinds of data:

  1. Data collected about all of a particular class of objects, with the expectation that the records of interest will be a small, perhaps a vanishingly small, fraction of the total. For example, data about border crossing, customs, quarantine, financial transactions, taxes and communication is collected by most governments for all of the relevant events, and some kind of processing is done to decide which of these events deserve further attention. Notice that most commercial data collection is like this too: shops collect data about all of their customers, airlines about all of their passengers, and so on.

     This kind of widespread data collection often raises red flags because of concerns about privacy, government power, and sometimes commercial power. The interesting thing is that groups in every country invoke moral arguments for why certain kinds of data collection are wrong and others are right (or at least justified), but the boundary is different in different countries and at different times. In the U.S. there is deep suspicion about widespread government collection, but very little about commercial collection; in Europe it tends to be the other way around. A decent argument could be made that this battle was lost when income tax (or at least PAYE) was invented, since it meant the governments became involved in every detail of how money was made.

  2. Data collected about known “bad guys” or their associated actions. This kind of data is collected by governments using some kind of warrant-based procedure (that is, there is a need of some kind to demonstrate the potential badness of a “bad guy”) and by commercial organisations using a special-purpose investigative process. The goal in this case is either to confirm the suspected badness (search warrant, insurance investigation) or to discover other “bad guys” by association. This kind of data collection often raises fewer red flags because, in the case of governments, the process is typically quasi-judicial and so has inbuilt checks and balances; and in the commercial case, the costs of investigation act as a brake on doing it too much. (This latter cost is coming down quickly as more and more investigation can be done sitting in front of a screen, so this attitude may be changing.)

Analyse. The second phase is to take the collected data and use it inductively to build models of what it reveals. Very little of this happens today. Most analysis is of the slice-and-dice variety: given lots of data, usually of many different kinds, an analyst uses a sophisticated data-manipulation system to look at it in many different ways, and explore the connections that are implicit in it. The flagship in this area is probably Palantir (unpaid plug) which, though extremely expensive, does many interesting things on very large datasets. (See this video for a demo of how it can be used in an intelligence setting.)

The weakness with tools like this is that the analyst has to drag the knowledge out of the data, rather than having the data produce the knowledge. While we are far from doing this in a general way (“Computer, tell me what I should know”), we are, as a research community, making some progress. It is the invisibility of this possibility and the current partial successes that I see as the biggest weakness of the white paper.

Share. The inability of intelligence and law enforcement organisations to share is a chronic problem but I suspect it is often mischaracterised. Knowledge is power, and there’s a strange desire among some people to hang onto data, even if it has no value to them; but I suspect the issue is much more often pragmatic. Different law enforcement and intelligence organisations have different systems, and it’s often a nightmare to move data even within an organisation, let alone between organisations, just because of different formats, database schemas, encodings, and sheer size. This is not an easy problem to address. The temptation is often to wish for a single overarching, consistent database into which everything can be thrown. This is a poor idea for two reasons:

  • The price per unit of storage is often low (and decreasing) up to a certain size, but then takes a major jump. For example, a terabyte drive costs around $100, but a 100TB drive costs a lot more than 100 times as much. At any given moment, there’s a sweet spot between capacity, access time, and cost; moving away from it drives up the cost.
  • Any storage mechanism makes some kinds of analysis easy and other kinds difficult; the kinds that are difficult are done less often, so things get missed. In other words, variation in storage solutions goes along with variation in the analysis carried out, which reduces the gaps.

There is a deeper issue about sharing: the knowledge present in data is often very implicit (that is, it’s hard to know what’s there until you look) so giving someone else your data may also be unintentionally giving them more than you realised. People who work with census data have had to think about this issue for a long time because census data at the level of individuals is very private, but at the collective level is (has to be) very public. This is also an important issue for businesses that build models of their customers. There are interesting technical problems here that might make a bigger difference in the end than trying to legislate increased cooperation.

Thoughts on the Australian Government White Paper on Counter-terrorism

The Australian government has just released a White Paper updating their policy on counterterrorism. Most of the content is eminently sensible, but there are a couple of questionable assumptions and/or directions.

1.  The section on resilience assumes that radicalisation can be mitigated by “reducing disadvantage” using government actions to address social and economic issues. This may well be so, but I don’t think there’s much evidence to support it. It’s clear that there are countries where economic and social grievances are significant drivers for radicalisation (e.g. Southern Thailand); but the results of a recent survey in Canada with which I was involved showed clearly that attitudes about economic and social issues were uncorrelated to radicalism. Although many Islamic immigrants to Canada (and indeed many immigrants) struggle with e.g. access to jobs, this does not seem to turn into a sense of grievance that might lead to radicalisation. Australia may be different, but there doesn’t seem to be any particular reason why it should be.

2. The section on intelligence-led counterterrorism talks about three components: the ability to collect; the ability to analyse; and the ability to share. There is existing capacity and proposed actions for the first and the third, but there is a great black hole in both existing capacity and proposed action for the second: analysis.

It’s easy to skip over this word and assume what it means; but I suspect that, when it’s unpacked, it tends to be taken to mean either “looking stuff up” or “having a human put stuff together to discover its significance”. It doesn’t take much thought to realise that this can’t be enough. The challenge in intelligence is (a) deciding how important each dot is, and (b) finding the interesting constellations of dots from among the many possible constellations. In practice, the number of dots is in the thousands (and up) each day, so this process must be largely automated.

There is a strange blind spot about the role and importance of analysis. I suspect that this is mostly because it’s not obvious how powerful inductive data modelling can be and it’s not on the conceptual map of most people, especially those whose training has been in the humanities and social sciences. But talking about collection and sharing without talking about analysis is like a sandwich without the filling — and you don’t make a better sandwich by improving the quality of the bread, if there’s still no filling.

Analysis is tough for intelligence agencies, who are fighting a battle to upgrade their capabilities at the same time as meeting the real-time challenges of what analysis they can already do. And, although data mining/knowledge discovery is a well-developed subject, adversarial data mining, which I’ve often argued here is quite a different subject, has received little attention. One way that governments can help is to let some of this upgrading happen in universities. As far as I am aware, there is almost no work on counterterrorism analysis happening in Australian universities, and the possibility gets only a tiny mention in the National Security Science and Innovation Strategy. There are several research groups looking at the social aspects of terrorism and counterterrorism, and one or two looking at the forensic aspects of data analysis, but a conspicuous absence of work on data analysis as a preventive and preemptive tool.

A part of the report that has attracted media attention is the intent to impose special visa requirements for applicants from 10 as-yet-unidentified countries (but the US imposed special requirements on 10 countries so it probably isn’t too hard to guess the list). Two parts of this are problematic. First, it will use new biometrics — although this seems to be a grand way of talking about fingerprints and facial photos. Biometrics get over-trusted; they are mostly relatively easy to spoof. Second, the report promises to use “advanced data analysis and risk profiling” to identify risky visa applicants. It’s hard to know what to make of this,  but it sounds like either something quite weak, or something with unworkably high false-positive and false-negative rates.
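The worry about false-positive rates is largely a matter of base rates, and a little arithmetic makes it concrete. The sketch below is illustrative only: the function name `screening_outcomes` and every number fed to it are hypothetical assumptions, not figures from the white paper or from any real visa system.

```python
def screening_outcomes(applicants: int, risky: int,
                       sensitivity: float, false_positive_rate: float) -> dict:
    """Expected outcomes when a classifier screens a mostly benign population.

    applicants          -- total number screened (hypothetical)
    risky               -- how many of them are genuinely risky (hypothetical)
    sensitivity         -- fraction of risky applicants that get flagged
    false_positive_rate -- fraction of benign applicants that get flagged
    """
    benign = applicants - risky
    true_pos = risky * sensitivity
    false_neg = risky - true_pos
    false_pos = benign * false_positive_rate
    flagged = true_pos + false_pos
    return {
        "true_positives": true_pos,
        "false_negatives": false_neg,
        "false_positives": false_pos,
        # Share of flagged applicants who are actually risky.
        "precision": true_pos / flagged if flagged else 0.0,
    }


# Hypothetical scenario: one million applicants, ten of them risky,
# and a screen that looks excellent on paper (90% sensitive, 1% false alarms).
result = screening_outcomes(1_000_000, 10, 0.90, 0.01)
```

Under those assumed numbers the screen flags roughly ten thousand benign applicants in order to catch nine risky ones, so fewer than one flagged applicant in a thousand is a genuine concern, and one risky applicant still gets through. Tightening the threshold to cut the false alarms pushes the miss rate up instead.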

3.  The problem with treating home-grown terrorism as a law enforcement problem is that catching and sentencing those who have planned or carried out attacks doesn’t do anything for those who are “next in line”. There’s a risk that dealing with a home-grown group simply radicalises their supporters to the point of violence. For example, this seems to be a potential risk after the sentencing of five men last week.

Other countries, for example Thailand and Saudi Arabia (although with questionable success), take a wider view and try to deradicalise those whose involvement with terrorist activity is marginal. In other words, any criminal events in the terrorism area are regarded as the tip of an iceberg; and other approaches (sometimes called “smart power”) are used to address the less-visible hinterland of the criminal event. While a law enforcement approach is good, there seems to be some scope for a wider approach to the problem. And the great majority of home-grown attacks have been discovered and prevented because of the actions of a whistle-blower within the attackers’ community, so motivating such whistle-blowing and making it easy seems like it should be a centrepiece of any proposed strategy.