Posts Tagged 'ranking'

Open Source Intelligence

There’s a report (here) about a National Press Club presentation of OSINT in the U.S. context. Two main points were made: the lack of correspondents making reports from local situations has an impact on the quality of available data (correspondents can go where professional intelligence gatherers cannot); and the amount of data on the Internet poses a challenge for analysis. In summary, there’s more data but less knowledge; and the knowledge is less well labelled in ways that traditionally made its extraction easy.

This is in some ways a U.S.-centric view. It’s ironic that one news organisation, Al Jazeera, is adding foreign correspondents and other heavyweight news-gathering capacity at a significant rate. The Washington Post may be getting lighter, but what’s happening in the U.S. is not entirely what’s happening in the rest of the world.

It’s also interesting that describing “organizing and prioritizing the material to be analyzed” as a challenge reveals the presupposition that organizing and prioritizing is somehow not part of analysis. Of course, readers here will know that it’s my view that “prioritizing”, or ranking, is at the heart of all intelligence analysis. The quotation shows a two-step attitude: first, find the good information in the vast wells of the Internet, and then analyse it. It’s better not to think of these steps as separate; there’s only one task, getting knowledge from data, and dividing the task arbitrarily isn’t helpful.

Anomalies in record-based data

Many organisations have large datasets whose entities are records, perhaps records of transactions. In some settings, such as detecting credit-card fraud, sophisticated sets of rules have been developed to decide which records deserve further attention as potentially fraudulent. What does an organisation do, however, when it has a large dataset like this and hasn’t developed a model of what “interesting” records look like, but would still like to focus attention on “interesting” records, usually because there aren’t enough resources to look at all of the records individually?

One way to decide which records are interesting is to label a record as uninteresting if there are a lot of other records like it. I have developed ways to rank records by interestingness using this idea.
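The post does not spell out the ranking method itself, so the following is only a minimal sketch of the general idea under assumptions of my own: each record is reduced to a few coarse features, and a record scores higher the rarer its feature values are across the dataset. The feature choices in `features` and the function name `rank_by_rarity` are hypothetical, picked purely for illustration.

```python
import math
from collections import Counter


def features(record):
    """Map a (description, amount, supplier) record to coarse feature values.

    These particular features are illustrative assumptions, not the
    author's actual method.
    """
    description, amount, supplier = record
    return (
        min(len(description.split()), 5),              # words in description, capped
        0 if amount <= 0 else len(str(int(amount))),   # digits in the whole-dollar amount
        min(len(supplier), 10),                        # supplier name length, capped
        supplier.replace(" ", "").isdigit(),           # supplier is purely numeric
    )


def rank_by_rarity(records):
    """Return the records sorted from most to least unusual."""
    feats = [features(r) for r in records]
    n = len(records)
    # One frequency table per feature position.
    counts = [Counter(f[i] for f in feats) for i in range(len(feats[0]))]

    def score(f):
        # Sum of surprisals: a value shared by many records adds little,
        # a value seen only once adds a lot.
        return sum(-math.log(counts[i][v] / n) for i, v in enumerate(f))

    ranked = sorted(zip(records, feats), key=lambda rf: score(rf[1]), reverse=True)
    return [record for record, _ in ranked]
```

A record that resembles many others accumulates a score near zero, so it sinks to the bottom of the ranking, which is the sense in which "lots of similar records" means "uninteresting".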

So when the Sydney Morning Herald published a dataset of Australian defence contracts (700,000 of them) I thought I would try my approach. The results are interesting. Here are the most unusual records from this ranking (the columns are contract number, description, contracting agency, start date, end date, amount, and supplier):

1.  1217666,REPAIR PARTS,Department of Defence,16-October-2002,,5872.52,L
This one comes at the top of the list because the supplier name is unusual: only a single letter.

2.  1120859,Supply of,Department of Defence,15-May-2002,,0,C & L AEROSPACE
This one has a very short description and an amount of $0.

3.  854967,EARTH MOVING EQUIPMENT PARTS FOR REPAIR,Department of Defence,21-May-2002,,2134.05,439
Unusual because the supplier name is a number.

4.  956798,PRESSURE GAUGE (WRITE BACK  SEE ROSS DAVEY),Department of Defence,11-September-2002,,1,WORMALD FIRE & SAFETY
Unusual because of the extra detail in the description and the cost of $1.

5.  1053172,5310/66/105/3959.PURCHASE OF WASHER  FLAT.*CANCELLED* 29/04/03,Department of Defence,12-February-2003,,0,ID INTERNATIONAL
Unusual because of the dollar value, and because the description records the cancellation.

6.  868380,cancelled,Department of Defence,14-June-2002,,0,REDLINE
Unusual again because of the description and dollar value.

7.  1043448,tetanus immunoglobulin-human,Department of Defence,10-January-2003,,1,AUSTRALIAN RED CROSS
Unusual because of the low dollar value.

8.  1014322,NATIONAL VISA PURCHASING,Department of Defence,18-October-2002,,26933.99,NAB 4715 2799 0000 0942
Unusual because the supplier is a bank account number (and so numeric); also a largish dollar value.

9.  1023922,NATIONAL VISA PURCHASING,Department of Defence,18-September-2002,,25586.63,NAB 4715 2799 0000 0942
Same sort of pattern as (8): globally unusual but similar to (8); note the shared supplier and the same day of the month.

10.  968986,COIL  RADIO FREQUENCY,Department of Defence,27-September-2002,,2305.6,BAE
Unusual because of the short supplier name and large dollar value.

11.  887357,SWIMMING POOL COVER.,Department of Defence,07-May-2002,,7524,H & A TEC
Unusual supplier name and large (!!) dollar value; hope it’s a big pool.

12.  1010554,NAB VISA CARD,Department of Defence,02-August-2002,,16223.19,NAB 4715 2799 0000 0942
Another numeric bank account number as supplier and a large dollar amount.

13.  1005569,Interest,Department of Defence,12-August-2002,,2222.99,NAB 4715 2799 0000 1494
And again.

14.  925011,FLIR RECORDER REPPRODUCER SET REPAIR KIOWA,Department of Defence,16-August-2002,,1100,BAE
Short supplier name, long description with unusual words.

15.  1012869,NAB VISA STATEMENT,Department of Defence,22-August-2002,,12934.87,NAB 4715 2799 0000 0942
Another financial transaction.

16.  1073019,NATIONAL VISA,Department of Defence,03-February-2003,,10060.16,NAB 4715 2799 0000 0942
And again.

17.  969039,SUSPENDERS  WHITE,Department of Defence,30-September-2002,,41800,ADA
Short supplier name and very large dollar amount (hopefully not just one suspender).

18.  1097060,Purchase of Coveralls  Flyers  Lightweight  Sage Green.,Department of Defence,11-February-2003,,18585.6,ADA
Again a short supplier name and large dollar amount.

19.  959232,SUPPLY OF COATS AND TROUSERS DPDU,Department of Defence,23-September-2002,,1032350,ADA
Again a short supplier name and very (!!) large dollar amount.

Clearly the process is turning up example records that seem to be quite unusual within this large set, and might sometimes be worth further investigation.

This technique can be applied to any record-based data. As well as providing a version of the data ranked by interestingness, it also provides a graphical view of the data, and some indication of the density of unusual records compared to ordinary records. As the example shows, it also often turns up technical problems with the way the data was collected, since records with mistakes in their fields, with the wrong fields, or with fields in the wrong place will usually turn up as anomalous. Some of the top records are there (probably) not because they are really unusual but because something went wrong with the capture of the supplier names. So the technique can be used for quality control as well.

Which predictors can rank?

To be able to build a ranking predictor, there must be some way of labelling the training records with (estimates of) their ranks, so that this property can be generalised to new records. This is often straightforward, even if the obvious target label doesn’t map directly to a ranking.

There are six mainstream prediction technologies:

  1. Decision trees. These are everyone’s favourites, but they are quite weak predictors, and can only be used to predict class labels. So no use for ranking.
  2. Neural networks. These are also well-liked, but undeservedly so. Neural networks can be effective predictors for problems where the boundaries between the classes are difficult and non-linear, but they are horrendously expensive to train. They should not be used without soul searching. They can, however, predict numerical values and so can do ranking.
  3. Support Vector Machines. These are two-class predictors that try to fit the optimal boundary (the maximal margin) between points corresponding to the records from each class. The distances from the boundary are an estimate of how confident the classifier is in the classification of each record, and so provide a kind of surrogate ranking: from large positive numbers down to 1 for one class and then from -1 to large negative numbers for the other class.
  4. Ensembles. Given any kind of simple predictor, a better predictor can be built by: creating samples of the records from the training dataset; building an individual predictor from each sample; and then using the collection of predictors as a single, global predictor by asking each one for its prediction and using voting to make the global prediction. Ensembles have a number of advantages, primarily that the individual predictors cancel out each other’s variance. But the number of predictors voting for the winning class can also be interpreted as a strength of opinion for that class, and so as a value on which to rank. In other words, a record can be unanimously voted normal, voted normal by all but one of the individual predictors, and so on.
  5. Random Forests. Random Forests are a particular form of ensemble predictor where each component decision tree is built making decisions about internal tests in a particularly robust and contextualized way. This makes them one of the most powerful prediction technologies known. The same technique can be used for ranking, using either the number of votes for the winning class or the margin between the most popular and the next most popular class.
  6. Rules. Rules are used because they seem intuitive and explanatory, but they are very weak predictors. This is mostly because they capture local structure in data, rather than global structure captured by most other predictors. Rules cannot straightforwardly be used to rank.
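The vote-counting idea in items 4 and 5 can be sketched in a few lines. This is a toy illustration under assumptions of my own, not a production ensemble: the component predictors are one-feature threshold classifiers ("stumps") trained on bootstrap samples, and the names `train_stump`, `train_ensemble`, `vote_fraction` and `rank` are hypothetical.

```python
import random


def train_stump(sample):
    """Fit a threshold classifier that predicts class 1 when x > threshold.

    `sample` is a list of (x, label) pairs with labels 0 or 1. The threshold
    is chosen from the sample's own x values to minimise training errors.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(label) for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def train_ensemble(data, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [
        train_stump([rng.choice(data) for _ in data])
        for _ in range(n_models)
    ]


def vote_fraction(ensemble, x):
    """Fraction of stumps voting for class 1: the value to rank on."""
    return sum(x > threshold for threshold in ensemble) / len(ensemble)


def rank(ensemble, xs):
    """Order new records from strongest to weakest support for class 1."""
    return sorted(xs, key=lambda x: vote_fraction(ensemble, x), reverse=True)
```

The global class prediction would be the majority vote, but keeping the vote fraction rather than collapsing it to a label is what turns the ensemble into a ranking predictor: a record with unanimous support sorts ahead of one that wins by a single vote.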

So, although ranking predictors are very useful in adversarial situations, they are quite difficult to build and use.