Many organisations have large datasets whose entities are records, perhaps records of transactions. In some settings, such as detecting credit-card fraud, sophisticated sets of rules have been developed to decide which records deserve further attention as potentially fraudulent. What does an organisation do, however, when it has a large dataset like this, hasn’t developed a model of what “interesting” records look like, but would still like to focus attention on “interesting” records — usually because there aren’t enough resources even to look at all of the records individually.
One way to decide which records are interesting, is to label records as uninteresting if there are lot of other records like them. I have developed ways to rank records by interestingness using this idea.
So when the Sydney Morning Herald published a dataset of Australian defence contracts (700,000 of them) I thought I would try my approach. The results are interesting. Here are the most unusual records from this ranking (the columns are contract number, description, contracting agency, start date, end data, amount, and supplier):
1. 1217666,REPAIR PARTS,Department of Defence,16-October-2002,,5872.52,L
This one comes at the top of the list because the supplier name is unusual, only a single letter.
2. 1120859,Supply of,Department of Defence,15-May-2002,,0,C & L AEROSPACE
This one has a very short description and an amount of $0.
3. 854967,EARTH MOVING EQUIPMENT PARTS FOR REPAIR,Department of Defence,21-May-2002,,2134.05,439
Unusual because the supplier name is a number
4. 956798,PRESSURE GAUGE (WRITE BACK SEE ROSS DAVEY),Department of Defence,11-September-2002,,1,WORMALD FIRE & SAFETY
Unusual because of the extra detail in the description and the cost of $1
5. 1053172,5310/66/105/3959.PURCHASE OF WASHER FLAT.*CANCELLED* 29/04/03,Department of Defence,12-February-2003,,0,ID INTERNATIONAL
Unusual because of the dollar value, and the unusual description because of the cancellation
6. 868380,cancelled,Department of Defence,14-June-2002,,0,REDLINE
Unusual again because of the description and dollar value
7. 1043448,tetanus immunoglobulin-human,Department of Defence,10-January-2003,,1,AUSTRALIAN RED CROSS
Unusual because of the low dollar value
8 1014322,NATIONAL VISA PURCHASING,Department of Defence,18-October-2002,,26933.99,NAB 4715 2799 0000 0942
Unusual because the supplier is a bank account number (and so numeric); also a largish dollar value
9. 1023922,NATIONAL VISA PURCHASING,Department of Defence,18-September-2002,,25586.63,NAB 4715 2799 0000 0942
Same sort of pattern as (8) — globally unusual but similar to (8), note the common date
10. 968986,COIL RADIO FREQUENCY,Department of Defence,27-September-2002,,2305.6,BAE
Unusual because of the short supplier name and large dollar value
11. 887357,SWIMMING POOL COVER.,Department of Defence,07-May-2002,,7524,H & A TEC
Unusal supplier name and large (!!) dollar value — hope it’s a big pool
12. 1010554,NAB VISA CARD,Department of Defence,02-August-2002,,16223.19,NAB 4715 2799 0000 0942
Another numeric bank account number as supplier and large dollar amount
13. 1005569,Interest,Department of Defence,12-August-2002,,2222.99,NAB 4715 2799 0000 1494
14. 925011,FLIR RECORDER REPPRODUCER SET REPAIR KIOWA,Department of Defence,16-August-2002,,1100,BAE
Shart supplier name, long description with unusual words
15. 1012869,NAB VISA STATEMENT,Department of Defence,22-August-2002,,12934.87,NAB 4715 2799 0000 0942
Another financial transaction
16. 1073019,NATIONAL VISA,Department of Defence,03-February-2003,,10060.16,NAB 4715 2799 0000 0942
17. 969039,SUSPENDERS WHITE,Department of Defence,30-September-2002,,41800,ADA
Short supplier name and very large dollar amount (hopefully not just one suspender)
18. 1097060,Purchase of Coveralls Flyers Lightweight Sage Green.,Department of Defence,11-February-2003,,18585.6,ADA
Again short supplier name and large dollar amount
959232,SUPPLY OF COATS AND TROUSERS DPDU,Department of Defence,23-September-2002,,1032350,ADA
Again short supplier name and very (!!) large dollar amount
Clearly the process is turning up example records that seem to be quite unusual within this large set, and might sometimes be worth further investigation.
This technique can be applied to any record-based data. As well as providing a version of the data ranked by interestingness, it also provides a graphical view of the data, and some indication of what the density of unusual records is compared to ordinary records. As the example shows, what it also often turns up are technical problems with the way that the data was collected, since mistakes in fields are records with the wrong fields, or with fields in the wrong place will usually turn up as anomalous.Some of the top records are there not because they are really unusual (probably) but because something went wrong with the capture of the supplier names. So it can be used for quality control as well.