Patterns of word usage in the UEA climate emails

I’m always pleased to see examples of real emails because they can act as testbeds for various textual analysis techniques. I’ve begun to analyse the “climategate” emails from the University of East Anglia. The figure below shows a plot of the structure of the words used. (This is quite a quick and dirty analysis: I didn’t try to remove email headers or otherwise clean up the content of the files.)

There are three parts to the structure. The arm to the right is an artifact: several Word files were included in the bodies of emails rather than as attachments, so my extraction software treats them as part of the text. This can be fixed, but it will take me some time.

The interesting property is the longitudinal structure from top to bottom in the figure. The phrases at the bottom are all content, while the phrases at the top are all identifiers of people and places (admittedly hard to see). Since the analysis algorithms know nothing of the semantics of emails, and are based purely on “bag of words” style analysis, this is an interesting and unexpected outcome.
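
For readers unfamiliar with the term, here is a minimal sketch of what a “bag of words” representation looks like. It is not the software used for the figure; the function names, the tokenisation rule, and the two sample emails are all hypothetical, chosen only to show that word order and meaning are thrown away and nothing but counts remain.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens and count them.

    Word order is discarded; only the counts survive.
    The tokenisation rule here is an assumption made for illustration.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))


def word_document_matrix(documents):
    """Return (vocabulary, rows), where rows[i][j] is the number of
    times vocabulary[j] occurs in documents[i]."""
    bags = [bag_of_words(doc) for doc in documents]
    vocabulary = sorted(set().union(*bags))
    rows = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, rows


# Hypothetical stand-ins for email bodies, not taken from the real corpus.
emails = [
    "Phil, the tree ring data are attached.",
    "Thanks Phil. The data look fine.",
]

vocabulary, rows = word_document_matrix(emails)
for word, counts in zip(vocabulary, zip(*rows)):
    print(f"{word:10s} {counts}")
```

Each email becomes a row of counts over a shared vocabulary, so any structure that later shows up in a plot has to come from which words occur together, not from any understanding of what they mean.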

