A (Proper) Statistical analysis of the prose works of Samuel Beckett


Content warning: If you want to get to the fun parts, the results of an analysis of Beckett’s use of language, skip to sections VII and VIII. Everything before that is navel-gazing methodology stuff.

If you want to know how I carried out my analysis, and utilise my code for your own purposes, here’s a link to my R code on my blog, with step-by-step instructions, because not enough places on the internet include that.

I: Things Wrong with my Dissertation’s Methodology

For my masters, I wrote a 20,000-word dissertation, which took as its subject an empirical analysis of the works of Samuel Beckett. I had a corpus of his entire works with the exception of his first novel, Dream of Fair to Middling Women, which is a forgivable lapse, because he ended up cannibalising it for his collection of short stories, More Pricks than Kicks.

Quantitative literary analysis is generally carried out in one of two ways, through either one of the open-source programming languages Python or R. The former you’re more likely to have heard of, being one of the few languages designed with usability in mind. The latter, R, would be more familiar to specialists, or people who work in the social sciences, as it is more abstruse than Python, doesn’t have many language cousins and has a very unfriendly learning curve. But I am attracted to difficulty, so I am using it for my PhD analysis.

I had about four months to carry out my analysis, so the idea of taking on a programming language in a self-directed learning environment was not feasible, particularly since I wanted to make a good go at the extensive body of secondary literature written on Beckett. I therefore made use of a corpus analysis tool called Voyant. This was a couple of years ago, so this was before its beta release, when it got all tricked out with some qualitative tools and a shiny new interface, which would have been helpful. Ah well. It can be run out of any browser, if you feel like giving it a look.

My analysis was also chronological, in that it looked at changes in Beckett’s use of language over time, with a view to proving the hypothesis that he used a less wide vocabulary as his career continued, in pursuit of his famed aesthetic of nothingness or deprivation. As I wanted to chart developments in his prose over time, I dated the composition of each text, and built a corpus for each year, from 1930–1987, excluding, of course, years in which he wrote only drama or poetry, which wouldn’t be helpful to quantify in conjunction with one another. Which didn’t stop me doing so for my masters analysis. It was a disaster.

II: Uniqueness

Uniqueness, the measurement used to quantify the general spread of Beckett’s vocabulary, was obtained by the generally accepted formula below:

unique word tokens / total words
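
For anyone who wants to try the ratio above on a text of their own, here is a minimal sketch in Python rather than R. The function names `tokenize` and `uniqueness` are hypothetical, and the tokenisation rule (lower-case everything, keep runs of letters and internal apostrophes) is an assumption, not necessarily what Voyant or any other tool does.

```python
import re

def tokenize(text):
    """Lower-case the text and return its word tokens.

    Assumption: a word is a run of letters, optionally joined by
    apostrophes (so "don't" counts as a single token).
    """
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())

def uniqueness(tokens):
    """Distinct word forms divided by total word tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = "No knowledge of where gone from. Nor of how. Nor of whom."
print(uniqueness(tokenize(sample)))  # 9 distinct forms over 12 tokens
```

Different tokenisation choices (hyphens, numerals, case) will shift the score slightly, so scores are only comparable when every text is tokenised the same way.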

There is a problem with this measurement, in that it takes no account of a text’s relative length. As a text gets longer, the likelihood of each word being used approaches 1. Therefore, a text gets less unique as it gets bigger. I have the correlations to prove it:

There have been various solutions proposed to this quandary, which somewhat stymies our comparative analyses. One among them is the use of vectorised measurements, which plot the text’s declining uniqueness against its word count, so we see a more impressionistic graph, such as this one, which should allow us to compare the uniqueness curves for James Joyce’s novel A Portrait of the Artist as a Young Man and his short story collection Dubliners.
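
A curve of this kind could be computed roughly as follows. This is a Python sketch under stated assumptions: `uniqueness_curve` is a hypothetical name, the tokens are assumed to be already lower-cased words in reading order, and the step of 1,000 words is arbitrary.

```python
def uniqueness_curve(tokens, step=1000):
    """Return (word_count, uniqueness) pairs measured every `step` tokens.

    The final point is always included, even when the text length is
    not a multiple of `step`.
    """
    seen = set()
    curve = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token)
        if i % step == 0 or i == len(tokens):
            # Uniqueness so far: distinct forms seen / tokens read
            curve.append((i, len(seen) / i))
    return curve

# A toy example with a tiny step so the decline is visible
print(uniqueness_curve(["a", "b", "a", "c", "a", "a"], step=2))
```

The resulting pairs can be handed to any plotting tool, with word count on the x-axis and uniqueness on the y-axis, one line per text.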


All well and good for two or maybe even five texts, but one can see how, with large scale corpora, this sort of thing can get very incoherent very quickly. Furthermore, if one were to examine the numbers on the y-axis, one would see that the differences here are tiny. This is another idiosyncrasy of stylostatistical methods; because of the way syntax works, the margins of difference wouldn’t be regarded as significant by most statisticians. These issues relating to the measurement are exacerbated by the fact that ‘particles,’ the atomic structures of literary speech (it, is, the, a, an, and, said, etc.), make up most of a text. In pursuit of greater statistical significance for their papers, digital literary critics remove these particles from their texts, which is another unforgivable thing that we do anyway. I did not, because I was concerned that I was complicit in the neoliberalisation of higher education. I also wrote a 4000 word chapter that outlined why what I was doing was awful.

IV: Ambiguity

The measure of ambiguity was arrived at by the following formula:

number of indefinite pronouns/total word count

I derived this measurement from Dr. Ian Lancashire’s study of the works of Agatha Christie, and counted Beckett’s use of a set of indefinite pronouns, ‘everyone,’ ‘everybody,’ ‘everywhere,’ ‘everything,’ ‘someone,’ ‘somebody,’ ‘somewhere,’ ‘something,’ ‘anyone,’ ‘anybody,’ ‘anywhere,’ ‘anything,’ ‘no one,’ ‘nobody,’ ‘nowhere,’ and ‘nothing.’ Those of you who know that there are more indefinite pronouns than just these, you are correct: I had found an incomplete list of indefinite pronouns, and I assumed that that was all. This is just one of the many things wrong with my study. My theory was that there were correlations to be detected between Beckett’s decreasing vocabulary and his increasing deployment of indefinite pronouns, relative to the total word count. I called the vocabulary measure ‘uniqueness,’ and the indefinite pronouns measure I called ‘ambiguity.’ This is tenuous, I know; indefinite pronouns advance information as they elide the provision of information. It is, like so much else in the quantitative analysis of literature, totally unforgivable, yet we do it anyway.
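
The count described above can be sketched in Python as follows. The names `INDEFINITE`, `tokenize` and `ambiguity` are hypothetical, the word list is exactly the (incomplete) one given above, and treating ‘no one’ as a single hit across two adjacent tokens is an assumption about how best to handle it.

```python
import re

# The single-word pronouns from the list above; 'no one' is handled
# separately because it spans two tokens.
INDEFINITE = {
    "everyone", "everybody", "everywhere", "everything",
    "someone", "somebody", "somewhere", "something",
    "anyone", "anybody", "anywhere", "anything",
    "nobody", "nowhere", "nothing",
}

def tokenize(text):
    """Lower-case the text and return runs of letters as word tokens."""
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())

def ambiguity(tokens):
    """Indefinite pronouns divided by total word count."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in INDEFINITE)
    # Count each adjacent pair 'no' + 'one' as one pronoun
    hits += sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == ("no", "one"))
    return hits / len(tokens)

print(ambiguity(tokenize("Somewhere again. Somehow again. Someone again.")))
```

Note that ‘somehow’ is not on the list, so in this example only ‘somewhere’ and ‘someone’ are counted.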

V: Hapax Richness

I initially wanted to take into account another phenomenon known as the hapax score, which charts occurrences of words that appear only once in a text or corpus. The formula to obtain it would be the following:

number of words that appear once/total word count
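
The ratio above could be sketched in Python like this. The function name `hapax_density` is hypothetical, and the input is assumed to be a list of lower-cased word tokens.

```python
from collections import Counter

def hapax_density(tokens):
    """Word forms that occur exactly once, divided by total word count."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(tokens)

tokens = "no knowledge of where gone from nor of how nor of whom".split()
print(hapax_density(tokens))  # 7 once-only forms over 12 tokens
```

Here ‘of’ and ‘nor’ recur and so are not hapaxes; the other seven forms each appear once.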

I believe that the hapax count would be of significance to a Beckett analysis because of the points at which his normally incompetent narrators have sudden bursts of loquaciousness, like when Molloy says something like ‘digital emunction and the peripatetic piss,’ before lapsing back into his ‘normal’ tone of voice. Once again, because I was often working with a pen and paper, this became impossible, but now that I know how to code, I plan to go over my masters analysis, and do it properly. The hapax score will form a part of this new analysis.

VI: Code & Software

A much more accurate way of analysing vocabulary, for the purposes of comparative analysis when your texts are of different lengths, therefore, would be to randomly sample it. Obviously not very easy when you’re working with a corpus analysis tool online, but far more straightforward when working through a programming language. A formula for representative sampling was found, and integrated into the code. My script is essentially a series of nested loops and if/else statements, that randomly and sequentially sample a text, calculate the uniqueness, indefiniteness and hapax density ten times, store the results in a variable, and then calculate the mean value for each by dividing the result by ten, the number of times that the first loop runs. I inputted each value into the statistical analysis program SPSS, because it makes pretty graphs with less effort than R requires.
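
The general shape of that procedure can be sketched as below. This is a Python approximation, not the R script itself: the sample size of 500 words stands in for the sampling formula, which is not reproduced here, and every function name is hypothetical. Each run picks a random starting point and takes a contiguous run of words from there, which is one reading of ‘randomly and sequentially’.

```python
import random
from collections import Counter

# Single-word indefinite pronouns from the list in section IV
# ('no one' is left out here to keep the sketch short).
INDEFINITE = {
    "everyone", "everybody", "everywhere", "everything",
    "someone", "somebody", "somewhere", "something",
    "anyone", "anybody", "anywhere", "anything",
    "nobody", "nowhere", "nothing",
}

def uniqueness(tokens):
    return len(set(tokens)) / len(tokens)

def hapax_density(tokens):
    counts = Counter(tokens)
    return sum(1 for n in counts.values() if n == 1) / len(tokens)

def ambiguity(tokens):
    return sum(1 for t in tokens if t in INDEFINITE) / len(tokens)

def sampled_scores(tokens, sample_size=500, runs=10, seed=None):
    """Average the three measures over `runs` random contiguous samples."""
    if sample_size <= 0 or len(tokens) < sample_size:
        raise ValueError("text is shorter than the requested sample size")
    rng = random.Random(seed)
    totals = {"uniqueness": 0.0, "hapax": 0.0, "ambiguity": 0.0}
    for _ in range(runs):
        # Random start, then a sequential window of fixed length
        start = rng.randint(0, len(tokens) - sample_size)
        window = tokens[start:start + sample_size]
        totals["uniqueness"] += uniqueness(window)
        totals["hapax"] += hapax_density(window)
        totals["ambiguity"] += ambiguity(window)
    # Mean of each measure: the running total divided by the number of runs
    return {name: total / runs for name, total in totals.items()}
```

Because every sample has the same length, the scores for a long novel and a short prose piece become comparable, at the cost of refusing any text shorter than the sample size.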

VII: Results

I used SPSS’ box plot function first to identify any outliers for uniqueness, hapax density and ambiguity. 1981 was the only year which scored particularly high for relative usage of indefinite pronouns.
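
For readers without SPSS, the usual box plot convention can be approximated by hand: a value is flagged when it lies more than 1.5 times the interquartile range beyond the quartiles. The Python sketch below assumes that convention; the function name `box_plot_outliers` is hypothetical, the quartile method may differ slightly from the one SPSS uses, and the numbers in the example are invented purely for illustration, not taken from the Beckett data.

```python
import statistics

def box_plot_outliers(scores, k=1.5):
    """Return the labelled scores lying more than k * IQR beyond the quartiles."""
    values = list(scores.values())
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {label: v for label, v in scores.items() if v < low or v > high}

# Invented illustrative scores, keyed by arbitrary labels
toy = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0, "f": 100.0}
print(box_plot_outliers(toy))
```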


It should be said that this measure too is correlated to the length of the text, which only stands to reason; as a text gets longer the relative incidence of a particular set of words will decrease. Therefore, as the only texts Beckett wrote this year, ‘The Way’ and ‘Ceiling,’ together add up to about 582 words (the fifth lowest year for prose output in his life), one would expect indefiniteness to be somewhat higher in comparison to other years. However, this doesn’t wholly account for its status as an outlier value. Towards the end of his life Beckett wrote increasingly short prose pieces. Comment C’est (How It Is) was his last novel, and was written almost thirty years before he died. This probably has a lot to do with his concentration on writing and directing his plays, but in his letters he attributed it to a failure to progress beyond the third novel in his so-called trilogy of Molloy, Malone meurt (Malone Dies) and L’Innommable (The Unnamable). It is in the year 1950, the year in which L’inno was completed, that Beckett began writing the Textes pour rien (Texts for Nothing), scrappy, disjointed pieces, many of which seem to be taking up from where L’inno left off, as do the Fizzles and the Faux Départs. ‘The Way,’ I think, is an outgrowth of a later phase in Beckett’s prose writing, which dispenses with the peripatetic loquaciousness and the understated lyricism of the trilogy and replaces them with a more brute and staccato syntax, one which is often dependent on the repetition of monosyllables:

No knowledge of where gone from. Nor of how. Nor of whom. None of whence come to. Partly to. Nor of how. Nor of whom. None of anything. Save dimly of having come to. Partly to. With dread of being again. Partly again. Somewhere again. Somehow again. Someone again.

Note also the prevalence of particle words, that will have been stripped out for the analysis, and the ways in which words with a ‘some’ prefix are repeated as a sort of refrain. This essential structure persists in the work, or at least the artefact of the work that the code produces, and hence makes of it the outlier that it is.


From plotting all the values together at once, we can see that uniqueness is partially dependent on hapax density; the words that appear only once in a particular corpus would be important in driving up the score for uniqueness. While there could be said to be a case for the hypothesis that Beckett’s texts get less unique and more ambiguous up until 1944, when he completed his novel Watt, and if we’re feeling particularly risky, up until 1960 when Comment C’est was completed, it would be wholly disingenuous to advance it beyond this point, when his style becomes far too erratic to categorise definitively. Comment C’est is Beckett’s most uncompromising prose work. It has no punctuation, no capitalisation, and narrates the story of two characters, in a kind of love, who communicate with one another by banging kitchen implements off one another:

as it comes bits and scraps all sorts not so many and to conclude happy end cut thrust DO YOU LOVE ME no or nails armpit and little song to conclude happy end of part two leaving only part three and last the day comes I come to the day Bom comes YOU BOM me Bom ME BOM you Bom we Bom

VIII: Conclusion

I would love to say that the general tone is what my model is being attentive to, which is why it identified Watt and How It Is as nadirs in Beckett’s career, but I think their presence on the chart is more a product of their relative length, as novels, versus the shorter pieces which he moved towards in his later career. Clearly, Beckett’s decision to write shorter texts makes this means of summing up his oeuvre in general insufficient. Whatever changes Beckett made to his aesthetic over time, we might not need such complicated graphs to map them, and I could have just used a word processor to find the answer: length. Bom and Pim aside, for whatever reason, after having written L’inno none of Beckett’s creatures presented themselves to him in novelistic form again. The partiality of vision and modal tone which pervades the post-L’inno works demonstrates, I think, far more effectively what it was that Beckett was ‘pitching’ for: a new conceptual aspect to his prose, which re-emphasised its bibliographic aspects, the most fundamental of which was their brevity, or the appearance of an incompleteness, by virtue of being honed to sometimes less than five hundred words.

The quantification of differing categories of words seems like a radical, and the most fun, thing to quantify in the analysis of literary texts, as the words are what we came for, but the problem is similar to one that overtakes one who attempts to read a literary text word by word by word, and unpack its significance as one goes: overdetermination. Words are kaleidoscopic, and the longer you look at them, the more threatening their darkbloom becomes, the more they swallow, excrete, the more alive they are, all round. Which is fine. Letting new things into your life is what it should be about, until their attendant drawbacks become clear, and you start to become ambivalent about all the fat and living things you have in your head. You start to wish you read poems instead, rather than novels, which make you go mad, and worse, start to write them. The point is words breed words, and their connections are too easily traced by computer. There’s something else about knowing their exact correlations to a decimal point. They seem so obvious now.

Marcel Proust’s ‘In Search of Lost Time: The Fugitive’ as speculative fiction

Speculative fiction is a straightforward enough concept to grasp. As the name indicates, it creates a breach in fiction’s conventions of representation and violates the rules that traditionally govern the world in which fiction takes place. In short, a speculative fiction begins with a ‘what if?’

Jorge Luis Borges is one of the most skilled practitioners of speculative fictions, though he rarely needs more than twenty or twenty five pages to exhaust his capacity to work through every aspect of the world that he has conjured up. Being as I am on the last volume of À la recherche I cannot over-emphasise how grateful I am to him for his capacity for brevity.

Of course, there are very few novels that don’t fall into the category delineated above; novels that are propelled by a question in the mind of the author are not a niche genre. There are certain coping mechanisms that one finds oneself devising when making one’s way through a 3500 page novel and one of them is to fixate on the abject strangeness of many of its key moments, many of which seem to border on aspects of science-fiction sub-genre.

Carol Clark, the translator of The Prisoner, writes: “practical considerations of money, which would be at the centre of a novel by Balzac or Zola, seem to be of little importance here. Again, one feels that Proust is carrying out a thought experiment: let there be a young man M and a girl A, living in flat F. Let the money available to M be infinite.” The use of the term ‘thought experiment’ conveys how bizarre the novel can be. The Prisoner describes how Marcel’s lover Albertine moves into his apartment and how Marcel expends seemingly endless funds on lavish gifts for her. When she leaves him, he promises her a Rolls Royce and a yacht if she returns. All this focus on the financial inconsistencies glosses over the fact that Albertine’s aunt, Mme Bontemps, seems to be perfectly fine with her niece living unmarried with a seemingly endlessly wealthy society dilettante with neurasthenia.

It’s not even fanciful to posit the existence of shape shifters in Proust’s novel: Odette de Crécy somehow manages to de-age as the novel continues, which the narrator comments on frequently with an appropriate incredulity, and the scope of Albertine’s face seems to change dramatically at some point after In the Shadow of Young Girls in Flower, to an extent that I don’t think can be attributed to the normal changes brought about by adolescence. This presumably serves a metaphorical end about the multiplicity of self and the necessary masquerades adopted by people in the normal course of society life, a necessity that is only bolstered when one deviates from the prescribed sexual ‘norm,’ as very few characters in this novel don’t.

Proust also engages in a kind of description that I find myself noticing quite a bit recently, and that is prose that attempts to grapple with reality on a quantum level, to convey phenomena that are not visible to the naked eye:

“the whole sky was filled with that radiant, palish blue that the walker lying in a field sometimes sees over his head, but so uniform, so deep that one feels the blue of which it is made was used without any admixture and with such inexhaustible richness that one could delve deeper and deeper into its substance without finding an atom of anything but that same blue.”

It is this willingness to represent the ineffable in text, in moments of confrontational strangeness, that gets Proust his best moments, as we see in the above, wherein an anonymous and yet universal representation of man, ‘the walker,’ falls into the sky endlessly, which is at once the sky and also seems to prefigure some kind of undiluted cordial, perhaps anticipating the famous madeleine dissolved in tea. The paragraph is positively bristling with paradoxes and abstrusities, not least among which is the suggestion that one can simply ‘find’ an atom, that atoms can be ‘pure’ and that they are colour-coded.

Marcel Proust’s ‘In Search of Lost Time: Sodom and Gomorrah’

At this stage, the fourth volume of six in Marcel Proust’s In Search of Lost Time, it doesn’t need saying that Proust is a hyper-critical author. He doesn’t allow his characters to get away with anything and dwells for sentence after sentence after sentence on their most minute flaws and concealed insecurities. However, there seem to be shades of difference in Proust’s treatment of particular characters based on their class. Regardless of how denigrating he may be towards the Guermantes or the Princess de Parma, their characterisations retain an idealised quality, their personas never lose their sheen of seemingly fundamental decency. The origin of this positive discrimination is somewhat unclear, as the focalisation of In Search of Lost Time’s perspective is so overdetermined. Blame could lie with the narrator, M, who is, after all, hopelessly besotted with all members of the aristocracy, regardless of the depth of their ignorance. Some blame could well be attached to Proust himself, with one eye on F. Scott Fitzgerald’s admiration of rich people, for being in some self-evident way different from the have-nots.

Characters such as Charles Morel and Françoise lack this ‘upper-class’ status, which would otherwise have allowed for their redemption, at least partially, from M’s perspective. Therefore, there is something altogether crueler about M’s probing evisceration of Françoise’s character, considering she is employed as his family’s servant. Françoise also has the dubious honour of being the only character that M has told to her face exactly what he thinks of her, something that he would not dare do to someone with a secure place on a social scale of any kind (as yet, anyway, I have only read the first four parts of six): “’You’re an excellent person, I said smarmily, you’re kind, you’ve a thousand good qualities, but you’re no further on than the day that you arrived in Paris, either in knowing about women’s clothes or in how to pronounce words properly and not commit howlers.’”

M’s identification of Françoise’s primary failing as linguistic is, I believe, revealing. First, her way of speaking is wholly idiosyncratic, because she is from rural France and was not formally educated. This can be seen in her occasional tendency towards exaggeration, on occasions such as being found by a member of the family in the kitchen, particularly when she is with her daughter: “‘She’s just had a spoonful of soup, Françoise said to me, and I forced her to suck on a bit of the carcass,’ so as thus to reduce her daughter’s supper to nothing, as though it would have been wrong for it to be plentiful. Even at lunch or dinner, if I made the mistake of going into the kitchen, Françoise would make as if they had finished and even apologise by saying: ‘I just wanted a bite of something,’ or ‘a mouthful.’” Her supposed ineptitude in expressing herself exasperates M, who constantly demonstrates his facility in doing so with an endlessly proliferating sequence of sub-clauses erupting at the least prompting.

This relates to another reason for preferring Françoise above all others that populate Proust’s ‘world entire,’ as parts in the novel that feature her are generally an occasion of humour, as M’s frustration with her manifests itself in a haughty and staccato sentence style, often a welcome relief from his normative mode. The second part of In Search of Lost Time, In the Shadow of Young Girls in Flower, contains what I believe to be the funniest part of the entire novel, if I can be allowed to decide this with two volumes remaining. This section of the novel describes a holiday that M, his grandmother and Françoise take in the coastal town of Balbec. They stay in a hotel and Françoise makes the acquaintance of a number of staff members, butlers and servants, etc. This has unexpected effects for M and his grandmother:

“she had also gotten to know one of the wine waiters, a kitchen-hand and a housekeeper from one of the floors. The result of this for our daily arrangements was that, whereas at the very beginning of her stay Françoise, knowing no one, had kept ringing for the most trivial reasons, at times when my grandmother and I would never have dared to ring – and if we raised some mild objection to this, she replies, ‘Well we’re paying them enough!’ as though she herself was footing the bills – now that she was on friendly terms with one of the personalities from below stairs, a thing which had initially seemed to augur well for our comfort if either of us happened to have cold feet in bed, she would not countenance the idea of ringing, even at times which were in no way untoward; she said it would ‘put them out,’ it would mean the…servants’ dinner-hour would be disturbed and they would not like that…The long and short of it was that we had to make do without proper hot water because Françoise was a friend of the man whose job it was to heat it.”

If that didn’t split your sides, Proust may not be the best place for you to get your laughs.

M probably gets annoyed as he does because he doesn’t want someone competing with him, in the realm of linguistic play, least of all an uneducated woman of the servant class, self-obsessed little twerp that he is.

Marcel Proust’s ‘In Search of Lost Time: The Guermantes Way’

A large proportion of Marcel Proust’s magnum opus In Search of Lost Time is given over to salon conversations. Salons have a long history as gatherings of educated members of the upper and middle classes keen to discuss art and politics over good food and wine.

Proust makes clear that these gatherings are not mini-utopias of intellectuals forging the uncreated conscience of their race within drawing rooms. Instead, they consist mostly of nouveau riche philistines, uneducated social climbers and artists who compromise themselves through their wishes to succeed within ‘society.’

The conversations between the attendees at these salons are rendered in Proust’s deadpan manner, a mode in which he is particularly adept. The idiot comments of the idiot attendees are expressed with a minimal amount of overt editorial glossing on the part of the narrator, allowing the members of the petit gentry to condemn themselves out of their own words and actions. If one were to open the third instalment in In Search of Lost Time, The Guermantes Way, on a random page, one would be more likely than not to find one of these people sounding off on something about which they understand little.

Note: So it actually took me five tries of a random page to find a demonstrative example. The first paragraph on page 236 reads: “‘But still, don’t let’s fool ourselves; the charming views of my nephew are going to land him in queer street. Particularly with Fezensac ill at the moment. That means Duras will be running the election, and you know how he likes to bluff,’ said the Duc, who had never managed to learn the precise meaning of certain words and thought that bluffing meant, not shooting a line, but creating complications.”

The effect of this exhaustive rendering of banal conversation is to suffocate the reader through over-exposure to the awful things that these boring people say, making it almost impossible not to despise these poor deluded souls. However, the appearance of a seemingly endless succession of conversations that the narrator is privy to prompts a question or two.

Getting access and moving through the ranks of society is a nuanced process. One risks becoming a figure of fun for others, or being exiled from them altogether for being perceived as a flatterer or for attending other salons, that is, for not showing sufficient loyalty to one’s hosts. Therefore each salon abides by a particular code of behaviour that one should not violate, if one wishes to maintain one’s position within them. The Verdurin salon demands absolute loyalty, the Guermantes insist that art and other ‘serious topics’ are too tedious to be discussed, and in the salon of Odette Swann (née de Crécy), being an anti-Semite is (ironically, considering M. Swann is Jewish) a bonus.

‘Wit’ and ‘eloquence’ are prized traits for any would-be salon attendee and these terms are placed within perverted commas to demonstrate how advisedly they are used in this instance; both manifest themselves more frequently as obnoxiousness. Therefore one wonders how the narrator seems to succeed in gaining access to these exclusive social clubs when he barely speaks; all the space he provides is given over to the conversation of others. Are we as readers supposed to believe that in this hyper-critical environment the narrator, M, is allowed to sit back in silence, committing every word of the conversations of others to his memory, and be invited back week after week? Especially since even the most trivial detail or impression can send him into a two- or three-page verbal effusion at the least notice?

One suspects that he is guilty of saying exactly the same kind of shallow nonsense enunciated by those around him and covers himself by devoting all his time to describing the foolishness of others.

Marcel Proust’s ‘In Search of Lost Time: In the Shadow of Young Girls in Flower’

In one of the more well-worn anecdotes of literary history, Marcel Proust’s masterpiece Du côté de chez Swann was rejected by Humblot, a reader for a publishing house. In a letter, Humblot wrote the following: “My dear friend, perhaps I am dense but I just don’t understand why a man should take thirty pages to describe how he turns over in his bed before he goes to sleep. It made my head swim.”

Trotting out these anecdotes in general introductions to cheap and cheerful Wordsworth editions serves a very particular end, a phenomenon that Julian Barnes describes in an essay written on Vincent Van Gogh’s life and work in the London Review of Books: “this…spurs us towards self-congratulation: look how we who have come later appreciate your work, how superior our eye and taste and sympathy are to those who snubbed and misprised you back in the day.” In other words, we look back at Humblot as perhaps the most tone-deaf reader in literary history, in contrast with us, those who, if the contingencies of fate were only aligned differently, would have been born in late nineteenth century France and would have appreciated Proust’s writing, as so many of his contemporaries did not.

This is to miss, if not the point, a point.

One of the themes that Proust consistently refers to is the relationship that exists between sensibility and habit. The general track of the novel (says I, being currently (almost) half way through) is how the narrator’s sensibility, his openness and receptivity to the world around him in all its strangeness and assorted different genera, comes to be overwhelmed by his habits. Sexual debauchery, love, drunkenness: no matter how novel and abject these feelings are when we first experience them, we, with surprising rapidity, become adjusted to them, to the point that we barely can be said to experience them at all.

Habit is not a malign force, however, though it calcifies our precious and individual sensibility. It is a wholly necessary force, allowing us to grow accustomed to people and places that our sensibility led us to despise instinctively. As Proust writes: “habit…also undertakes to endear us to people whom we disliked to begin with, alters the shapes of their face, improves their tone of voice, makes hearts grow fonder.”

The average sentence length in English writing is around 15-17 words; style guides generally recommend that sentences longer than twenty words be shortened, as it is likely that they are unclear or convoluted. From a very rudimentary quantitative analysis, I found Proust’s sentences to be, on average, 35 words long. It is therefore possible to view Humblot as not just the first, but one of the more perceptive of Proust’s critics, immediately getting to the heart of what it is that is unique about Proust’s style.
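
A rudimentary count of this sort might look like the Python sketch below. It is an illustration only: the function name `mean_sentence_length` is hypothetical, and the splitting rule (a sentence ends at a full stop, question mark or exclamation mark followed by whitespace) is a crude assumption that will miscount abbreviations, ellipses and quoted dialogue.

```python
import re

def mean_sentence_length(text):
    """Average number of whitespace-separated words per sentence."""
    # Split after ., ! or ? when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    lengths = [len(s.split()) for s in sentences]
    return sum(lengths) / len(lengths)

print(mean_sentence_length("Nor of how. Nor of whom. None of anything."))
```

Run over a whole volume, a figure like this is sensitive to the translation and edition used, so it is best treated as a rough indication rather than a measurement.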

The point behind Proust’s excessively long sentences is precisely this – their excess. What we judge as a coherent sentence in a novel runs to a certain length. We are accustomed to it and when we read, we are within the realm of habit. Proust’s prose is intended to be shocking, to awaken us to the possibilities of language and thought, to appeal to our sensibilities again by having our texts violently defamiliarised from ourselves.

I would accord more with Humblot’s reading than with the mainstream understanding of Proust as a canonical author, among the other masterpieces that we stock our bookshelves with and rarely read. James Grieve, a translator of À l’ombre des jeunes filles en fleurs, speaks pithily of Proust’s irreconcilable strangeness, based on the highly irregular nature of his prose style: “Proust’s reflections, his enunciation of philosophical and psychological truths…are often more important to him than his verisimilitudes. His composition was often not linear; he wrote in bits and pieces; transitions from one scene to another are sometimes awkward, clumsy even.” If that wasn’t devastating enough, Grieve delivers a final cruelty: “His paragraphing often seems idiosyncratic.”

Far from being a word virtuoso, a fluent weaver of imaginative reality, Proust is in many ways inept and it is in this way that we should appreciate him; his idiosyncrasies are what make In Search of Lost Time such a brilliant and bizarre novel.