How big are the words modernists use?

It’s a fairly straightforward question to ask, one which most literary scholars would be able to provide a halfway decent answer to based on their own readings. Ernest Hemingway, Samuel Beckett and Gertrude Stein more likely to use short words, James Joyce, Marcel Proust and Virginia Woolf using longer ones, the rest falling somewhere between the two extremes.

Most Natural Language Processing textbooks or introductions to quantitative literary analysis demonstrate how the most frequently occurring words in a corpus will decline at a rate of about 50%, i.e. the most frequently occurring term will appear twice as often as the second, which is twice as frequent as the third, and so on and so on. I was curious to see whether another process was at work for word lengths, and whether we can see a similar decline at work in modernist novels, or whether more ‘experimental’ authors visibly buck the trend. With some fairly elementary analysis in NLTK, and data frames over into R, I generated a visualisation which looked nothing like this one.*

*The previous graph had twice as many authors and was far too noisy, with not enough distinction between the colours to make it anything other than a headwreck to read.

In narrowing down the amount of authors I was going to plot, I did incline myself more towards authors that I thought would be more variegated, getting rid of the ‘strong centre’ of modernist writing, not quite as prosodically charged as Marcel Proust, but not as brutalist as Stein either. I also put in a couple of contemporary writers for comparison, such as Will Self and Eimear McBride.

As we can see, after the rather disconnected percentages of corpora that use one letter words, with McBride and Hemingway on top at around 25%, and Stein a massive outlier at 11%, things become increasingly harmonious, and the longer the words get, the more the lines of the vectors coalesce.

Self and Hemingway dip rather egregiously with regard to their use of two-letter words (which is almost definitely because of a mutual disregard for a particular word, I’m almost sure of it), but it is Stein who exponentially increases her usage of two and three letter words. As my previous analyses have found, Stein is an absolute outlier in every analysis.

By the time the words are ten letters long, true to form it’s Self who’s writing is the only one above 1%.

Author: Chris Beausang

I am a person who is PhD’ing in NUI Maynooth, using the open-source statistical programming language R to perform operations on large textual corpora in order to situate the novelists Anne Enright and Eimear McBride in a more nuanced and informed way to early to mid-twentieth century modernism. My interest in the topic is based on the fact that I find too many people describe Enright and McBride as being influenced solely by Joyce or Beckett, which is fine, and probably not inaccurate, but it gets tedious, and I think there’s a lot of mileage yet to be gotten out of investigating their indebtedness to Stein, Mansfield or Woolf; we don’t necessarily have to attribute everything interesting in Irish letters to Irish men.

Leave a Reply

Your email address will not be published. Required fields are marked *