The Tweets of #GE2015

Letters of 1916, Miscellaneous

I have for some time been working on an approach to analysing the Letters of 1916 corpus. The idea behind this stems from a poster at last year’s Digital Humanities conference that investigated a single, long-running correspondence using ‘lexical distance’ from a rolling baseline to highlight changes and repetition in topics being discussed (the taller the bar, the ‘newer’ the subjects being discussed were; these fed into a rolling average, against which later letters were compared).

 

The graph above shows a first stab at this with the Letters corpus (actually the baseline in this case is an accumulation of all the letters). To get this far has required some work: the very first attempt ‘forgot’ the need to take account of the length of texts for each date range (thus a significant-looking rise round about Easter 1916 could be attributed solely to there being more letters for this period). I will go into further detail on this line of investigation in a later post.
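To make the length problem concrete, here is a minimal sketch (not the script actually used for the graph above) of turning the raw word counts for a date range into relative frequencies, so that a period with more letters does not automatically look more 'different'. The function name and the toy tokens are assumptions for illustration.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Convert a list of word tokens into word -> share-of-all-tokens.

    Dividing by the total token count removes the effect of a date
    range simply containing more text than another.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Two hypothetical date ranges with the same vocabulary mix,
# one containing fifty times as much text as the other.
quiet_period = ["rising", "dublin"]
busy_period = ["rising", "dublin"] * 50

print(relative_frequencies(quiet_period) == relative_frequencies(busy_period))
```

With raw counts the busy period would sit far from any baseline purely because of its size; with relative frequencies the two periods are identical.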

The biggest problem, which seems to underlie any work with the Letters corpus, is the sparseness of data: seemingly insignificant letters which just happen to repeat a word a lot — especially if it is a rare word — have an enormous effect.

Partly as a small afternoon’s hacking challenge, and partly in an attempt to see whether the kind of approach described above was even feasible, I decided to do the same with tweets containing the hashtag #GE2015 (for the UK General Election on 7th May 2015). As a greater challenge, and to learn some fresh bits of Python, I built a system that would analyse the tweets in real time (minute by minute) and push the results to a website. You may notice that, now that I’ve turned it off, the graph no longer updates. Whilst it was working, though, three Python scripts ran simultaneously: a) one continually monitored the Twitter hashtag #GE2015, stripped out stop words and other extraneous characters, and accumulated the tweets for each minute in a file; b) one computed the lexical distance of each minute’s tweets (Euclidean distance, treating the counts of each unique word as a vector) from an accumulation of all the tweets up to that point; and c) one converted the results to JSON and pushed them to my web server. The website above performed an AJAX call to fetch the data file every minute and refreshed the graph. (Thanks again visjs.)
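The comparison step can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code that ran on the night: the stop-word list is a tiny hypothetical one, the tokeniser is deliberately crude, and scaling the counts to relative frequencies before measuring the distance is an assumption (the post only says that word counts were used as the vector).

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "rt"}


def tokenise(tweet):
    """Lower-case a tweet, keep runs of letters/digits/apostrophes, drop stop words."""
    words = re.findall(r"[a-z0-9']+", tweet.lower())
    return [w for w in words if w not in STOP_WORDS]


def to_frequencies(counts):
    """Scale raw counts so the values sum to 1 (assumed normalisation)."""
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


def euclidean_distance(vec_a, vec_b):
    """Euclidean distance between two sparse word vectors held as dicts."""
    vocabulary = set(vec_a) | set(vec_b)
    return math.sqrt(sum((vec_a.get(w, 0) - vec_b.get(w, 0)) ** 2 for w in vocabulary))


def distances_per_minute(minutes):
    """Compare each minute's tweets with everything seen before that minute."""
    baseline = Counter()
    results = []
    for tweets in minutes:
        current = Counter(tok for tweet in tweets for tok in tokenise(tweet))
        if baseline:
            results.append(
                euclidean_distance(to_frequencies(current), to_frequencies(baseline))
            )
        else:
            results.append(0.0)  # nothing to compare the first minute against
        baseline.update(current)  # fold this minute into the running baseline
    return results


# Three made-up 'minutes' of tweets: two on one topic, then a new one.
print(distances_per_minute([["exit poll"], ["exit poll"], ["Murphy loses seat"]]))
```

A minute that repeats the existing vocabulary scores close to zero, while a minute dominated by words the baseline has barely seen produces a spike. Because each minute is folded into the baseline afterwards, a topic that keeps recurring should score progressively lower.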

A screen capture of the final graph


I turned on the whole operation shortly before midnight. Of course, I had to stay up all night to watch the election to check that big spikes in the graph actually represented something (assuming there were any). Fortunately, the election provided. The first spike is the announcement of the Nuneaton result (which effectively confirmed the hitherto-disbelieved exit poll). The second is Douglas Alexander (Labour campaign manager and Shadow Foreign Secretary) losing his seat. The third, and largest, is Scottish Labour leader Jim Murphy losing his seat… After this point, all the spikes seem to be for significant figures losing seats. It is not apparent without digging deeper whether these later spikes were smaller because the losses were less shocking, or because the baseline data was getting normalised to the ‘theme’ of notable figures losing (hopefully the latter, since that would validate my model).

When tweets come marching in: screenshot of Terminal window showing the cleaned and tokenised tweets rolling in

By the end of the night (around 7am, when the Twitter client gave up and refused to reconnect) my laptop had processed over 3 million word tokens. A final annotated graph can be found here. (UPDATE: Sometimes visjs goes a little strange and moves the captions around… hopefully this can be fixed). If you’re really interested, this can be cross-referenced with the Guardian coverage.

I hope to dig a little deeper into these results, including looking at the relationship between amounts of data and the result. In general, though, there seemed to be a pretty steady flow of data and none of the exceptionally-sparse bits that added curiosity to the Letters work.


Letters of 1916: Social Network with VisJS

Letters of 1916

Over the last couple of days, I’ve been investigating various libraries for graph visualisations. One thing that came up from the last Letters of 1916 Twitter Chat, where we looked at some of Roman’s visualisations of the letters, was the difficulty of understanding and explaining outliers (like the love letters) when the data represented by each node cannot be accessed.

As a result, I decided to look into interactive graph tools (particularly web-based ones… they’re maybe a bit slower, but I speak passable JavaScript, and almost every modern browser can show them without much difficulty). The simplest seemed to be vis.js, which is astonishingly simple (just pass it some JSON for the nodes and edges). So I diverted the Python script I used to generate the graph in the last post into an html page with the vis.js library included, and hit refresh… This takes an awfully long time to render.
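For anyone wanting to try the same thing, a minimal sketch of producing the JSON that a vis.js network expects (a list of nodes with `id` and `label`, and a list of edges with `from` and `to`) might look like the following. The `build_network` function and the sample pairs are assumptions for illustration, not the script used for this post.

```python
import json


def build_network(letters):
    """Turn (sender, recipient) pairs into vis.js-style nodes and edges.

    One edge is emitted per letter, so repeated correspondence between
    the same two people yields several parallel edges.
    """
    ids = {}
    nodes, edges = [], []
    for sender, recipient in letters:
        for name in (sender, recipient):
            if name not in ids:
                ids[name] = len(ids) + 1
                nodes.append({"id": ids[name], "label": name})
        edges.append({"from": ids[sender], "to": ids[recipient]})
    return {"nodes": nodes, "edges": edges}


# Hypothetical sample: two letters exchanged between the same pair.
network = build_network([("James Finn", "May Fay"), ("May Fay", "James Finn")])
print(json.dumps(network, indent=2))
```

The resulting JSON can be embedded in the HTML page and handed to the library's network constructor as its data.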

One potential for this web-based approach is incorporating this into a digital (online) edition of some kind: each of the nodes is an html canvas object, so it’s possible to add any amount of data — names, pop-ups, canonical links — to them. I also like the ability to zoom in easily (scroll) and to drag nodes around to spot the links between larger clusters. (It looks like there are two relatively-tightly linked clusters, that are joined only by a chain of about four people.)

The other thing to note is that this graph tool does not overlay identical edges, so each line now represents a single letter. This creates a sort of tightly-knit bundle for two people who wrote to each other a lot. (‘James Finn’ and ‘May Fay’ are connected by so many edges that the graph library still hasn’t managed to stabilise, and they dance around each other and, when zoomed out, seem to flash like pulsars; I feel this is quite romantic, somehow.)

Click to load the full interactive graph 

All the data in this document comes to about 150 KB, but then factor in the loading of the JavaScript library and the rendering of the page in-browser and you might be there a while. (Chrome claims the page is not responding: it is.)

Unidentified persons

Since posting the graph in the previous post, with its big cluster of ‘unknowns’ in the middle, I’ve been trying to think of various ways round this — or, at least, to make it not quite so disruptive to the graph as a whole. Once the corpus is more fully-developed, hopefully a lot more of these people will be identified, but in the meantime I just wanted a way to ‘unbundle’ all the unknowns into discrete unknowns. The question then becomes, “How many unknown people are there in this bundle?”

At one end of the spectrum, we can assume that every single unknown sender or recipient was a distinct person. But this is probably not the case: a quick skim through the Excel file of data shows that particular people just didn’t write the recipient’s full name on the letter, which is quite understandable in the case of family members, for instance. Treating every unknown as a distinct person would magic a great number of additional people into existence and make the graph quite a lot more complicated.

At the other extreme, there is the situation we saw in the last graph, where we assume all the unknowns are one and the same person. This makes Mr. Unknown the most popular person in Dublin by a wide margin, and wildly distorts the graph.

In the end, I settled for somewhere in between, and assumed that each sender or recipient had precisely one unknown correspondent. These have ‘ANONC’ (for unknown creator) and ‘ANONR’ (unknown recipient) appended to their labels. Somewhere in the graph there will be a pair of nodes called just ‘ANONC’ and ‘ANONR’, where both sender and recipient were unknown.
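As a rough sketch of that rule (the exact label format, and the use of an empty cell to mean 'unknown', are assumptions here rather than a description of the actual script):

```python
def is_unknown(name):
    """Treat a missing or blank spreadsheet cell as an unknown person."""
    return name is None or not name.strip()


def label_unknowns(sender, recipient):
    """Give each known person exactly one unknown correspondent.

    An unknown sender becomes '<recipient> ANONC' and an unknown
    recipient becomes '<sender> ANONR'. If both are unknown, the pair
    falls back to the bare labels 'ANONC' and 'ANONR'.
    """
    if is_unknown(sender) and is_unknown(recipient):
        return ("ANONC", "ANONR")
    if is_unknown(sender):
        return (f"{recipient} ANONC", recipient)
    if is_unknown(recipient):
        return (sender, f"{sender} ANONR")
    return (sender, recipient)


print(label_unknowns("", "May Fay"))
print(label_unknowns("James Finn", None))
print(label_unknowns("", ""))
```

Every letter to a given person from an unknown sender then lands on the same placeholder node, rather than on one shared 'unknown' in the middle of the graph.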

(The other option, which has just occurred to me, would be to remove all letters involving an unknown person. This would have the effect of removing some actual people from the graph entirely.)

The many-names-of-Lady-Clonbrock problem

Another problem, which I identified in the previous post, is the lack of normalisation of individuals’ names. I’m working on ways round this — compiling a dictionary of aliases and having my Python script normalise the names seems the most obvious thing, though, of course, this means knowing who all the people are in the first place.
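A dictionary of aliases could be as simple as the sketch below. The spelling variants are invented purely for illustration and are not taken from the corpus; `normalise_name` is likewise a hypothetical helper.

```python
# Invented spelling variants, mapped to one canonical form.
ALIASES = {
    "lady clonbrock": "Lady Clonbrock",
    "clonbrock, lady": "Lady Clonbrock",
    "ly. clonbrock": "Lady Clonbrock",
}


def normalise_name(name):
    """Look a name up in the alias table, ignoring case and extra spaces.

    Names that are not in the table are returned unchanged (apart from
    having their whitespace tidied), so unknown people pass through.
    """
    tidy = " ".join(name.split())
    return ALIASES.get(tidy.lower(), tidy)


print(normalise_name("  lady   CLONBROCK "))
print(normalise_name("James Finn"))
```

The obvious cost is that every variant has to be entered by hand.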

I’m going to continue work on this: maybe introducing some kind of fuzziness into the searching, or using addresses instead of names, or both, might also be useful.
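One way the fuzziness might work, sketched with Python's standard-library `difflib`: match each name against a list of already-accepted names and adopt the closest one if it is similar enough. The `fuzzy_normalise` helper, the 0.85 cutoff and the misspelt example are assumptions for illustration; a real cutoff would need tuning against the corpus.

```python
import difflib


def fuzzy_normalise(name, known_names, cutoff=0.85):
    """Return the closest known name, or the original if nothing is close.

    difflib.get_close_matches ranks candidates by similarity ratio
    and drops anything scoring below the cutoff.
    """
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else name


known = ["Lady Clonbrock", "James Finn"]
print(fuzzy_normalise("Lady Clonbrok", known))  # a hypothetical misspelling
print(fuzzy_normalise("May Fay", known))
```

Too low a cutoff risks merging genuinely different people with similar names, which is probably where addresses could help as a second check.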

And finally, it would be nice to know the direction of each letter on the graph: something for another post.


Letters of 1916: Preliminary Social Network

Letters of 1916

For a first post on this blog, I thought I’d present a first network graph derived from the senders/recipients of the Letters of 1916 project. (Thanks to Roman Bleier for his Python-based analysis tools, which made picking apart Excel files pleasingly trivial.)

A few preliminary thoughts:

Our corpus congregates around a few individuals (which you’d expect, given the rather self-contained nature of the collections it’s derived from).

But then again, we’ve got lots of individuals, grouped round the edge, only connected by one letter (and not to any wider groups).

Lady Clonbrock was on first-name terms with some people, and not with others.

Most letters were sent to or by ‘anonymous’ (the unlabelled node in the middle) — which is almost certainly quite a few people. As such, the entire graph is massively skewed (actually it’s probably even less interconnected than it looks).

Click to see large image (6Mb)

There are so many flaws with this it’s scarcely worth mentioning them all (but I’ll valiantly try): the data is a raw spreadsheet dump from the ongoing work on the project; the lack of normalisation is a problem (see “Lady Clonbrock [née whatever]”); and the missing data really is a problem (any suggestions, aside from historical investigation?). There is also the fact that I don’t understand the details of the graph-plotting algorithm, so it’s difficult to say exactly in what way this graph represents the corpus.

However, my main thought that arises from this is: what will any sort of analysis of a network of letters (at the level of the letters) actually mean? At worst, it’s indicative only of how we’ve gone about collecting letters (and the people we’ve persuaded to bring them); at best, it might say something about the processes by which some letters have survived a hundred years. Anyway, it is a network of the data of the project — ‘a network of Dublin letters’ is too great a claim for the scanty evidence of this network.
