The Tweets of #GE2015

I have for some time been working on an approach to analysing the Letters of 1916 corpus. The idea behind this stems from a poster at last year’s Digital Humanities conference that investigated a single, long-running correspondence using ‘lexical distance’ from a rolling baseline to highlight changes and repetition in topics being discussed (the taller the bar, the ‘newer’ the subjects being discussed were; these fed into a rolling average, against which later letters were compared).


The graph above shows a first stab at this with the Letters corpus (actually the baseline in this case is an accumulation of all the letters). To get this far has required some work: the very first attempt ‘forgot’ the need to take account of the length of texts for each date range (thus a significant-looking rise round about Easter 1916 could be attributed solely to there being more letters for this period). I will go into further detail on this line of investigation in a later post.
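To illustrate the length problem, here is a minimal Python sketch. It is not the code used for the Letters work: the sample texts and the helper names `relative_freqs` and `euclidean` are made up for the example. It compares raw word counts with relative frequencies (counts divided by text length) when measuring Euclidean distance from a baseline.

```python
import math
from collections import Counter


def relative_freqs(tokens):
    """Word counts divided by text length, so long and short texts are comparable."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def euclidean(a, b):
    """Euclidean distance between two word -> value mappings."""
    words = set(a) | set(b)
    return math.sqrt(sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words))


# Invented sample texts, purely for illustration.
baseline = "dear mother i hope you are well the weather is fine".split()
short_text = "dear mother the rising has begun".split()
long_text = short_text * 10  # same vocabulary and proportions, ten times the volume

# Raw counts: the longer text looks far more "distant" simply because it is longer.
raw_short = euclidean(Counter(short_text), Counter(baseline))
raw_long = euclidean(Counter(long_text), Counter(baseline))

# Relative frequencies: volume no longer matters, only the mix of words.
norm_short = euclidean(relative_freqs(short_text), relative_freqs(baseline))
norm_long = euclidean(relative_freqs(long_text), relative_freqs(baseline))

print(raw_short, raw_long)
print(norm_short, norm_long)
```

With raw counts, ten times as much text produces a much larger distance even though nothing new is being said. With relative frequencies the two distances agree, which is the behaviour you want if a busy period is not to be mistaken for a change of topic.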

The biggest problem, which seems to underlie any work with the Letters corpus, is the sparseness of data: seemingly insignificant letters which just happen to repeat a word a lot — especially if it is a rare word — have an enormous effect.

Partly as a small afternoon’s hacking challenge, and partly in an attempt to see whether the kind of approach described above was even feasible, I decided to do the same with tweets containing the hashtag #GE2015 (for the UK General Election on 7th May 2015). As a greater challenge, and to learn some fresh bits of Python, I built a system that would analyse the tweets in real time (minute by minute) and push the results to a website. You may notice that, now that I’ve turned it off, the graph no longer updates. Whilst it was working, though, three Python scripts running simultaneously a) continually monitored the Twitter hashtag #GE2015, stripped out stop words and other extraneous characters, and accumulated the tweets for each minute in a file, b) calculated the lexical distance (Euclidean distance, treating the counts of each unique word as a vector) between each minute’s tweets and an accumulation of all the tweets up to that point, and c) converted the results to JSON and pushed them to my web server. The website above performed an AJAX call to fetch the data file every minute and refreshed the graph. (Thanks again visjs.)
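The three steps can be sketched in a single short Python script. This is a rough sketch under stated assumptions, not the scripts that actually ran on the night: the stop-word list, the sample tweets, the function names, the use of relative frequencies, the choice to compare each minute before adding it to the baseline, and the `x`/`y` keys in the JSON are all illustrative. The Twitter connection and the upload to the web server are left out entirely.

```python
import json
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "his", "has", "rt"}


def tokenise(text):
    """Lower-case, drop URLs and punctuation, then remove stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return [w for w in re.findall(r"[a-z0-9']+", text) if w not in STOP_WORDS]


def distance(minute_counts, baseline_counts):
    """Euclidean distance between relative word frequencies (an assumption)."""
    m_total = sum(minute_counts.values())
    b_total = sum(baseline_counts.values())
    if not m_total or not b_total:
        return 0.0  # nothing to compare against yet
    words = set(minute_counts) | set(baseline_counts)
    return math.sqrt(sum(
        (minute_counts[w] / m_total - baseline_counts[w] / b_total) ** 2
        for w in words
    ))


def analyse(minutes):
    """minutes: iterable of (label, [tweet text, ...]) in time order."""
    baseline = Counter()
    points = []
    for label, tweets in minutes:
        counts = Counter(tok for t in tweets for tok in tokenise(t))
        points.append({"x": label, "y": round(distance(counts, baseline), 4)})
        baseline.update(counts)  # assumption: compare first, then accumulate
    return points


# Invented sample data standing in for one file of tweets per minute.
sample = [
    ("00:01", ["Exit poll shock #GE2015", "Can the exit poll be right? #GE2015"]),
    ("00:02", ["Still doubting the exit poll #GE2015"]),
    ("00:03", ["Nuneaton result: Conservative hold #GE2015"]),
]

points = analyse(sample)
payload = json.dumps(points)  # the data file a web page could fetch each minute
print(payload)
```

In this toy data the third minute, which introduces a new topic, scores higher than the second, which mostly repeats words already in the baseline. That is the behaviour the graph relies on.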

A screen capture of the final graph

I turned on the whole operation shortly before midnight. Of course, I had to stay up all night watching the election to check that any big spikes in the graph actually represented something. Fortunately, the election provided. The first spike is the announcement of the Nuneaton result (which effectively confirmed the hitherto-disbelieved exit poll). The second is Douglas Alexander (Labour campaign manager and Shadow Foreign Secretary) losing his seat. The third, and largest, is Scottish Labour leader Jim Murphy losing his seat… After this point, all the spikes seem to be for significant figures losing seats. Without digging deeper, it is not apparent whether these later spikes are smaller because the results were less shocking, or because the baseline data was becoming normalised to the ‘theme’ of notable figures losing (hopefully the latter, as it would validate my model).

When tweets come marching in: screenshot of Terminal window showing the cleaned and tokenised tweets rolling in

By the end of the night (around 7am, when the Twitter client gave up and refused to reconnect) my laptop had processed over 3 million word tokens. A final annotated graph can be found here. (UPDATE: Sometimes visjs goes a little strange and moves the captions around… hopefully this can be fixed). If you’re really interested, this can be cross-referenced with the Guardian coverage.

I hope to dig a little deeper into these results, including looking at the relationship between amounts of data and the result. In general, though, there seemed to be a pretty steady flow of data and none of the exceptionally-sparse bits that added curiosity to the Letters work.

