Quantifying Modernism and the avant-garde

Introduction and Methodology

This post documents a statistical analysis carried out on a corpus of 500 novels. 250 of these texts are generally categorised as ‘realist’ and will be used as a benchmark against which we might define modernist literary style, a mode of writing which arose in the early twentieth century (though it should be noted that this chronology is increasingly subject to revision due to the work of new modernist scholars).

The first novel in the realist corpus, chronologically speaking, is Jane Austen’s Lady Susan, which was written in 1794. The final one is Thomas Hardy’s Jude the Obscure, which was published in 1895. This corpus contains the complete prose works (a phrase here encompassing novels, novellas and short story collections) of fifteen writers: Jane Austen, Emily, Anne and Charlotte Brontë, Stephen Crane, Honoré de Balzac, Charles Dickens, Fyodor Dostoevsky, George Eliot, Gustave Flaubert, Elizabeth Gaskell, Thomas Hardy, William Makepeace Thackeray, Leo Tolstoy and Émile Zola.

The corpus of 250 modernist novels begins in the year 1869, with Henry James’ first bloc of short stories, and continues all the way to Samuel Beckett’s 1988 novella ‘Stirrings Still’, so there is some overlap between the two corpora’s start and end points. This modernist corpus consists of the complete works of nineteen writers: Djuna Barnes, Samuel Beckett, Jorge Luis Borges, Elizabeth Bowen, Joseph Conrad, William Faulkner, F. Scott Fitzgerald, Ford Madox Ford, Ernest Hemingway, Henry James, James Joyce, Franz Kafka, D.H. Lawrence, Katherine Mansfield, Flann O’Brien, Marcel Proust, Gertrude Stein, Edith Wharton and Virginia Woolf.

This disproportion between the two corpora, with fifteen realists versus nineteen modernists, may seem disconcerting at first, but what is required in order for the statistical analyses to function is for the number of observations to be equal, rather than the number of novelists. Unfortunately, realist authors wrote more novels than modernist authors, and this compromised our ability to retain the same number of authors on each end of the generic spectrum.

One other aspect to consider is the international dimension. The realist corpus includes ten novelists who wrote in English, but there are also two Russian and three French realists, two of whom, Zola and the aforementioned Balzac, were far more prolific than any other writer in either corpus. Zola and Balzac composed 86 and 34 novels, short story collections or novellas respectively. This has the consequence that well over half of the realist corpus is in translation from another language, in comparison to just under 10% of the modernist corpus. I intend to address this at a later stage in my research. There has been some work published on the issues surrounding the quantification of literature in translation and across languages, but I do not yet possess a sufficient breadth of knowledge in this field to comment intelligently on the matter. I do think it is important to have French and Russian writers included in the realist corpus on the basis that many of them, be they Tolstoy, Flaubert or Balzac, exerted a significant influence on their modernist successors.

Whether or not these are ‘the best’ or most accurate translations is somewhat beside the point. From the reading I have done around the issue of literary translation, their being subject to change over time is in the nature of how text is received and re-constituted in different eras for different communities of readers (this discussion between Will Self and Kafka’s translators is particularly illuminating in this context; please do not be put off by Self, he gives the translators plenty of space to discuss the process, and you really should watch it). The germane point here is that the translations being analysed in this instance could not be considered the most contemporary. There might be an argument for retaining these older translations on the basis that they are more likely to be the versions of the text which would have been circulating in the early twentieth century, and therefore the translations modernist authors would have been more likely to have read, but making this claim would require a greater burden of proof, such as evidence of what languages each author read novels in and what their reading habits were more generally.

So, to turn to the analysis. My research is directed towards the quantitative analysis of grammar, the rationale being that we could, by examining varying quantities of particular categories of words, such as verbs, adjectives or prepositions, develop an understanding of how literary fiction changes from the beginning of the nineteenth century until the end of the twentieth, and, more specifically, how literary modernism departs from, or perhaps remains contiguous with, the previous generation of novel writing. The tagging was carried out using a part-of-speech (POS) tagger from the Natural Language Toolkit (NLTK) in Python.
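The counting step behind percentages of this kind can be sketched briefly. This is a minimal illustration, not the script used for the analysis: the function name `tag_proportions` is hypothetical, the sample sentence is tagged by hand with Penn Treebank tags, and counting punctuation as a token is an assumption. In practice the (word, tag) pairs might come from something like `nltk.pos_tag(nltk.word_tokenize(text))`.

```python
from collections import Counter


def tag_proportions(tagged_tokens):
    """Return each POS tag's share of all tokens, as a percentage.

    `tagged_tokens` is a list of (word, tag) pairs, the shape that
    NLTK's pos_tag returns. Punctuation is counted as a token here,
    which is an assumption rather than a documented choice.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100 * n / total for tag, n in counts.items()}


# Hand-tagged example (Penn Treebank tags): VBD marks a past tense verb.
sample = [("She", "PRP"), ("tied", "VBD"), ("the", "DT"),
          ("tapes", "NNS"), (".", ".")]
proportions = tag_proportions(sample)
print(proportions["VBD"])  # share of past tense verbs in the sample
```

Running this over every novel gives one percentage per tag per text, which is the kind of observation the comparisons below rely on.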

Results

From realism to modernism:

  • Average sentence length decreases by 4 words, from an average of 22 words to 18 words per sentence.
  • Personal pronouns (I, you, he, she, it, we, they, me, him, her, us, and them) increase by 1% from 5% to 6%. Interrogative pronouns (who and where), by contrast, decrease by 0.01% from 0.03% to 0.02%.
  • Verbs in the past tense increase by 1% from 6% to 7%.
  • Adverbs increase by 0.5% from 4.5% to 5%.
  • Prepositions (after, in, to, on, and with) decrease by 0.4% from 10.9% to 10.5%.
  • Wh Determiners (words beginning with wh, such as ‘where’ or ‘who’ acting to modify the noun phrase) decrease by 0.2% from 0.6% to 0.4%.
  • Particles (parts of speech with grammatical function with no meaning such as ‘up’ in the phrase ‘I tidied up the room’) increase by 0.1% from 0.4% to 0.5%.
  • Non third-person singular present verbs (verbs in first or second person) decrease by 0.1% from 1.6% to 1.5%.
  • Existentials (words such as ‘there’ which indicate that something exists) increase by 0.04%, from 0.17% to 0.21%.
  • Superlative adjectives (adjectives such as ‘best’, ‘biggest’, ‘worst’) decrease by 0.01% from 0.14% to 0.13%.

It will not have escaped your attention that a lot of these percentages are quite small. The extent to which any given text is made up of these hyper-specific categories is pretty minimal in the first place, which is why many of these quantities seem so laughably tiny. Rest assured that they are statistically significant. This does not mean that they are important, which would require a greater burden of proof, more analyses and more exploration, only that they are noteworthy considering the quantities involved.

One boxplot which might be of interest is the one below, which shows the ‘spread’ of the data for average sentence length in realism and modernism.

What we see on the left is the variation of the sentence length data (the term ‘variation’ here meaning the general ‘dispersedness’ of the data) for realism, which goes from 10 to roughly 35 words per sentence with an outlier or two on either end. If we consider modernism, we have everything from zero (Samuel Beckett’s novel How It Is, which has no full stops in it) up to forty-five, with far more outliers on the higher end. Higher outliers are data points with values more than 1.5 times the interquartile range above the third quartile; lower outliers, of which there are three, lie more than 1.5 times the interquartile range below the first quartile. For one’s own general knowledge, the modernist outliers for sentence length are:

  • William Faulkner’s Absalom, Absalom! (46.4) and Intruder in the Dust (42.3)
  • Marcel Proust’s Swann’s Way (42.9), In a Budding Grove (40.2), Time Regained (38), The Prisoner (37.2), The Captive (35.7), The Guermantes Way (34.1) and Sodom and Gomorrah (30.9)
  • Samuel Beckett’s Texts for Nothing and The Unnamable have 40.5 and 32.9 words per sentence respectively
  • Gertrude Stein’s novels The Making of Americans and Everybody’s Autobiography have 33.9 and 33.5 respectively.
  • Henry James’ The Ivory Tower and The Young Lovell score 31.8 and 29 respectively.
  • The three lower outlier values for sentence length all belong to works by Beckett: the aforementioned How It Is, Worstward Ho (4.9) and Ill Seen Ill Said (7).
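The outlier rule described above (1.5 times the interquartile range beyond the first or third quartile) can be sketched in a few lines. This is a minimal illustration with made-up sentence lengths, not the corpus data; the function names are hypothetical, and plotting libraries may compute quartiles slightly differently from the "inclusive" method assumed here.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits beyond which a value is an outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values lying outside the Tukey fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]


# Invented average sentence lengths for nine imaginary novels.
lengths = [12, 15, 17, 18, 19, 20, 22, 24, 46]
print(tukey_fences(lengths))
print(find_outliers(lengths))
```

With these invented values the quartiles are 17 and 22, so anything below 9.5 or above 29.5 words per sentence counts as an outlier, and only the 46 qualifies.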

It can be tempting, I think, when we see these sorts of names surface so prominently, in conjunction with a visual confirmation of the existence of an avant-garde, to think that modernism in its most pure form was a kind of relentless maximalism, an uncompromising movement towards longer sentences and more pronouns, and that all other manifestations of it are inadequate or insufficient in some way. This is a kind of boring and masculinist overview of the genre, which takes, I think, too many of the claims made by its most dogmatic adherents at face value, and it’s not a modernism I’m particularly interested in defending or instantiating. There can also, of course, be a regressive or rearguard aspect to modernism, which is perceptible in the following boxplot, which displays the distribution of past tense verbs.

As was pointed out above, modernism displays an increase in past tense verbs overall, but here we see a large number of outlier values moving against the overall trend. These novels are:

  • James Joyce’s Ulysses (4.3%) and Finnegans Wake (2.7%)
  • William Faulkner’s As I Lay Dying (4.2%) and Requiem for a Nun (3.6%)
  • Samuel Beckett’s Malone Dies (3.9%), Fizzles (2.5%), Company (2%), Texts for Nothing (1.8%), The Unnamable (1.7%), Worstward Ho (1.6%), Ill Seen Ill Said (1.4%) and a corpus of his miscellaneous and unpublished short fiction (2.2%).
  • Joseph Conrad and Ford Madox Ford’s collaborative novel The Nature of a Crime (2.6%)
  • Virginia Woolf’s The Waves (2.4%)
  • Gertrude Stein’s Tender Buttons (1.7%)

The higher modernism outlier is Virginia Woolf’s 1937 novel The Years (10%) and the lower realism outlier is Balzac’s 1841 novel Letters of Two Brides (2.7%).

In this way we can see that modernism is not just a unidirectional commitment to a narrow sequence of stylistic changes. Instead, it’s a contradictory movement in which a number of different stylistic markers jostle against and subvert one another. In this particular instance, for example, we can perceive the authors most generally understood to be among the most uncompromising (Joyce, Beckett, Stein, Woolf and Faulkner) resisting the overall trend.

From the two boxplots I’ve generated so far, you might have noticed that modernism tends to generate a greater number of outliers, and I can confirm that this trend of a greater degree of grammatical heterogeneity manifesting itself in modernist novel-writing than in realist novel-writing persists across the other categories of grammar, which you can validate by looking at the complete analysis here.

This struck me as an important development, so I quantified the extent of each data point’s outlier-ness, and then grouped them according to author. These values were then divided by the number of outlier data points, because some of these novelists only have a small number of novels in the corpus versus others. Austen’s complete works would be totally outnumbered by Balzac’s, for instance. The results appear below:

Please do note the values on the y-axis; Jane Austen is barely above zero because the only outlier text she wrote is Mansfield Park, which marks itself out for its disproportionate use of adjectives. I thought it better not to exclude her from the plot though, because I didn’t want it to turn into even more of a boys’ club than it might otherwise be. It would be useful, and exciting I think, to conceive of this plot as an indication of early breaches with conventional form, perhaps some nineteenth century anticipations of modernism. Reading Dostoevsky, Zola and Balzac in this manner would all be coterminous with changes taking place in the study of modernism now, but reading Thackeray and Eliot in these terms might be a more surprising development, and I’d be interested to read these texts in light of what we’re seeing here.
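One way of reading the per-author ‘outlier-ness’ score described above is as the average distance by which an author’s outlying texts fall beyond the boxplot fences. The sketch below follows that reading only; it is an assumption about the measure, not a reconstruction of the actual script. The function names, the fence values and the records are all invented for illustration.

```python
from collections import defaultdict


def excess_beyond_fences(value, low, high):
    """Distance by which a value lies outside [low, high]; 0.0 if inside."""
    if value > high:
        return value - high
    if value < low:
        return low - value
    return 0.0


def mean_outlier_excess(records, low, high):
    """Average excess per outlying text, grouped by author.

    `records` is an iterable of (author, value) pairs. Dividing by the
    number of outlying texts keeps prolific authors from dominating.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for author, value in records:
        excess = excess_beyond_fences(value, low, high)
        if excess > 0:
            totals[author] += excess
            counts[author] += 1
    return {author: totals[author] / counts[author] for author in totals}


# Invented data: two imaginary authors, hypothetical fences of 10 and 30.
records = [("A", 35.0), ("A", 40.0), ("B", 5.0), ("B", 20.0)]
scores = mean_outlier_excess(records, low=10.0, high=30.0)
print(scores)
```

Author B’s second text sits inside the fences and so contributes nothing to the score; authors with no outlying texts simply do not appear in the result.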

The modernism plot for deviation appears below:

The unlabelled entry between Faulkner and James is Hemingway.

From this plot we can see that the most avant-gardist prose writers, considered from the perspective of their grammar, appear to be Beckett, Stein, Woolf, Conrad and Joyce. Of course, this is nowhere near a definitive answer as to what modernist style is, or who its most innovative practitioners were; these measurements are atomistic and are quantifying individual words. But style is not just words in isolation, style is agglomerations of words, spaces between words, the clandestine networks and relations the phrases these words add up to compose in the mind of the reader, and, if these digital methodologies are to have any chance of illustrating this shift (an inadequate term in the first instance, since it is more an accumulation of changes distributed over a broad corpus than a sudden or transformational one that we are here concerned with) it is in these cumulative terms that style must be quantified, in order to avoid drifting into the reductive and schematic scientism that numerical analyses of this kind are frequently accused of perpetuating.

How big are the words modernists use?

It’s a fairly straightforward question to ask, one which most literary scholars would be able to provide a halfway decent answer to based on their own readings. Ernest Hemingway, Samuel Beckett and Gertrude Stein are more likely to use short words, James Joyce, Marcel Proust and Virginia Woolf longer ones, with the rest falling somewhere between the two extremes.

Most Natural Language Processing textbooks or introductions to quantitative literary analysis demonstrate how the frequency of the most common words in a corpus declines roughly in inverse proportion to their rank (Zipf’s law), i.e. the most frequently occurring term will appear about twice as often as the second and three times as often as the third, and so on. I was curious to see whether another process was at work for word lengths, and whether we can see a similar decline at work in modernist novels, or whether more ‘experimental’ authors visibly buck the trend. With some fairly elementary analysis in NLTK, and the data frames exported into R, I generated a visualisation which looked nothing like this one.*

*The previous graph had twice as many authors and was far too noisy, with not enough distinction between the colours to make it anything other than a headwreck to read.
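The word-length counts behind a plot of this kind can be sketched as follows. This is a minimal illustration rather than the original NLTK and R workflow: the function name is hypothetical, and the regular expression is a simplifying assumption that treats any unbroken run of letters as a word, so a contraction like ‘I’m’ is split in two.

```python
import re
from collections import Counter


def word_length_distribution(text):
    """Return {word length: percentage of all words} for a text."""
    # Runs of letters only; digits, underscores and punctuation are skipped.
    words = re.findall(r"[^\W\d_]+", text)
    counts = Counter(len(word) for word in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: 100 * n / total for length, n in sorted(counts.items())}


distribution = word_length_distribution("She tied the tapes")
print(distribution)
```

Computing one such distribution per author and plotting percentage against word length gives one line per author.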

In narrowing down the number of authors I was going to plot, I did incline myself more towards authors that I thought would be more variegated, getting rid of the ‘strong centre’ of modernist writing, not quite as prosodically charged as Marcel Proust, but not as brutalist as Stein either. I also put in a couple of contemporary writers for comparison, such as Will Self and Eimear McBride.

As we can see, after the rather disconnected percentages of corpora that use one letter words, with McBride and Hemingway on top at around 25%, and Stein a massive outlier at 11%, things become increasingly harmonious, and the longer the words get, the more the lines of the vectors coalesce.

Self and Hemingway dip rather egregiously with regard to their use of two-letter words (which is almost definitely because of a mutual disregard for a particular word, I’m almost sure of it), but it is Stein who exponentially increases her usage of two and three letter words. As my previous analyses have found, Stein is an absolute outlier in every analysis.

By the time the words are ten letters long, true to form it’s Self whose writing is the only one to manifest them at a rate of above 1%.

Can a recurrent neural network write good prose?

At this stage in my PhD research into literary style I am looking to machine learning and neural networks, and moving away from stylostatistical methodologies, partially out of fatigue. Statistical analyses are intensely process-based and always open, it seems to me, to fairly egregious ‘nudging’ in the name of reaching favourable outcomes. This brings a kind of bathos to some statistical analyses, as they account, to a greater extent than I’d like, for methodology and process, with the result that the novelty these approaches might have brought us is neglected. I have nothing against this emphasis on process necessarily, but I do also have a thing for outcomes, as well as the mysticism and relativity machine learning can bring, alienating us as it does from the process of the script’s decision making.

I first heard of the ‘sci-fi writer’ from a colleague of mine in my department. It’s Robin Sloan’s plug-in for the text editor Atom which allows you to ‘autocomplete’ texts based on your input. After sixteen hours of installing, uninstalling, moving directories around and looking up Stack Overflow, I got it to work. I typed in some Joyce and got stuff about Chinese spaceships as output, which was great, but science fiction isn’t exactly my area, and I wanted to train the network on a corpus of modernist fiction. Fortunately, I had the complete works of Joyce, Virginia Woolf, Gertrude Stein, Sara Baume, Anne Enright, Will Self, F. Scott Fitzgerald, Eimear McBride, Ernest Hemingway, Jorge Luis Borges, Joseph Conrad, Ford Madox Ford, Franz Kafka, Katherine Mansfield, Marcel Proust, Elizabeth Bowen, Samuel Beckett, Flann O’Brien, Djuna Barnes, William Faulkner and D.H. Lawrence to hand.

My understanding of this recurrent neural network, such as it is, runs as follows. The script reads the entire corpus of over 100 novels, and calculates the distance that separates every word from every other word. The network then hazards a guess as to what word follows the word or words that you present it with, then validates this against what actually follows. It then does so over and over and over, getting ‘better’ at predicting each time. The size of the corpus is significant in determining the length of time this will take, and mine would have required something around twelve days. I had to cut it off after twenty-four hours because I was afraid my laptop wouldn’t be able to handle it. At this point it had carried out the process 135,000 times, just below 10% of the full process. Once I get access to a computer with better hardware I can look into getting better results.
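The guess-then-check loop described above can be illustrated with something far simpler than a recurrent neural network. The sketch below is a count-based character n-gram model, swapped in purely as a stand-in: it learns nothing by gradient descent and is not how the plug-in works. All function names are hypothetical, and the training text is invented.

```python
from collections import Counter, defaultdict


def train_char_model(text, order=3):
    """Count which character follows each `order`-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context][text[i + order]] += 1
    return model


def predict_next(model, context):
    """Guess the most frequent follower of a context, or None if unseen."""
    followers = model.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def prediction_accuracy(model, text, order=3):
    """Share of positions in `text` where the guess matches what follows."""
    hits = total = 0
    for i in range(len(text) - order):
        guess = predict_next(model, text[i:i + order])
        total += 1
        hits += guess == text[i + order]
    return hits / total if total else 0.0


training_text = "the cat sat on the mat. the cat sat on the hat."
model = train_char_model(training_text)
print(repr(predict_next(model, "the")))
print(prediction_accuracy(model, training_text))
```

A neural network replaces the lookup table with learned weights that are adjusted a little after every wrong guess, which is what makes training on a large corpus so slow.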

How this will feed into my thesis remains nebulous. I might move in a sociological direction and take survey data on how closely readers reckon the final result approximates literary prose. But at this point I’m interested in what impact it might conceivably have on my own writing. I am currently trying to sustain progress on my first novel alongside my research, so, in a self-interested enough way, I pose the question: can neural networks be used in the creation of good prose?

There have been many books written on the place of cliometric methodologies in literary history. I’m thinking here of William S. Burroughs’ cut-ups, Mallarmé’s infinite book of sonnets, and the brief flirtation the literary world had with hypertext in the 90s, but beyond the avant-garde, I don’t think I could think of an example of an author who has foregrounded their use of numerical methods of composition. A poet friend of mine has dabbled in this sort of thing but finds it expedient not to emphasise the aleatory aspect of what she’s doing, as publishers tend to give a frosty reception when their writers suggest that their work is automated to some extent.

And I can see where they’re coming from. No matter how good they get at it, I’m unlikely to get to a point where I’ll read automatically generated literary art. Speaking for myself, when I’m reading, it is not just about the words. I’m reading Enright or Woolf or Pynchon because I’m as interested in them as I am in what they produce. How synthetic would it be to set Faulkner and McCarthy in conversation with one another if their congruencies were wholly manufactured by outside interpretation or an anonymous algorithmic process as opposed to the discursive tissue of the literary sphere, if a work didn’t arise from material and actual conditions? I know I’m making a lot of value-based assessments here that wouldn’t have a place in academic discourse, and on that basis what I’m saying is indefensible, but the probabilistic infinitude of it bothers me too. When I think about all the novelists I have yet to read I immediately get panicky about my own death, and the limitless possibilities of neural networks to churn out tomes and tomes of literary data in seconds just seems to me to exacerbate the problem.

However, speaking outside of my reader-identity, as a writer, I find it invigorating. My biggest problem as a writer isn’t writing nice sentences, given enough time I’m more than capable of that, the difficulty is finding things to wrap them around. Mood, tone, image, aren’t daunting, but a text’s momentum, the plot, I suppose, eludes me completely. It’s not something that bothers me, I consider plot to be a necessary evil, and resent novels that suspend information in a deliberate, keep-you-on-the-hook sort of way, but the ‘what next’ of composition is still a knotty issue.

The generation of text could be a useful way of getting an intelligent prompt that stylistically ‘borrows’ from a broad base of literary data, smashing words and images together in a generative manner to get the associative faculties going. I’m not suggesting that these scripts would be successful were they autonomous, I think we’re a few years off one of these algorithms writing a good novel, but I hope to demonstrate that my circa 350 generated words would be successful in facilitating the process of composition:

be as the whoo, put out and going to Ingleway effect themselves old shadows as she was like a farmers of his lake, for all or grips — that else bigs they perfectly clothes and the table and chest and under her destynets called a fingers of hanged staircase and cropping in her hand from him, “never married them my said?” know’s prode another hold of the utals of the bright silence and now he was much renderuched, his eyes. It was her natural dependent clothes, cattle that they came in loads of the remarks he was there inside him. There were she was solid drugs.

“I’m sons to see, then?’ she have no such description. The legs that somewhere to chair followed, the year disappeared curl at an entire of him frwented her in courage had approached. It was a long rose of visit. The moment, the audience on the people still the gulsion rowed because it was a travalious. But nothing in the rash.

“No, Jane. What does then they all get out him, but? Or perfect?”

“The advices?”

Of came the great as prayer. He said the aspect who, she lay on the white big remarking through the father — of the grandfather did he had seen her engoors, came garden, the irony opposition on his colling of the roof. Next parapes he had coming broken as though they fould

has a sort. Quite angry to captraita in the fact terror, and a sound and then raised the powerful knocking door crawling for a greatly keep, and is so many adventored and men. He went on. He had been her she had happened his hands on a little hand of a letter and a road that he had possibly became childish limp, her keep mind over her face went in himself voice. He came to the table, to a rashes right repairing that he fulfe, but it was soldier, to different and stuff was. The knees as it was a reason and that prone, the soul? And with grikening game. In such an inquisilled-road and commanded for a magbecross that has been deskled, tight gratulations in front standing again, very unrediction and automatiled spench and six in command, a

I don’t think I’d be alone in thinking that there’s some merit in parts of this writing. I wonder if there’s an extent to which Finnegans Wake has ‘tainted’ the corpus somewhat, because stylistically, I think that’s the closest analogue to what could be said to be going on here. Interestingly, it seems to be formulating its own puns, words like ‘unrediction,’ ‘automatiled spench’ (a tantalising meta-textual reference I think) and ‘destynets’, I think, would all be reminiscent of what you could expect to find in any given section of the Wake, but they don’t turn up in the corpus proper, at least according to a ctrl + f search. What this suggests to me is that the algorithm is plotting relationships on the level of the character, as well as phrasal units. However, I don’t recall the sci-fi model turning up paragraphs that were quite so disjointed and surreal — they didn’t make loads of sense, but they were recognisable, as grammatically coherent chunks of text. Although this could be the result of working with a partially trained model.
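The ctrl + f check for coinages can be made a little more systematic by comparing vocabularies. This is a minimal sketch with invented stand-in strings, not the real corpus or output; the function names are hypothetical, and unlike a ctrl + f substring search it matches whole words only.

```python
import re


def vocabulary(text):
    """Return the set of lower-cased words (runs of letters) in a text."""
    return set(re.findall(r"[^\W\d_]+", text.lower()))


def unseen_words(generated, corpus):
    """Words in the generated text that never occur in the corpus."""
    return sorted(vocabulary(generated) - vocabulary(corpus))


# Stand-in strings; a real check would load the full training corpus.
corpus_text = "the bright silence of old shadows"
generated_text = "Old shadows and destynets"
print(unseen_words(generated_text, corpus_text))
```

Any word the model could only have assembled character by character will show up in the resulting list, alongside ordinary words that happen to be missing from the corpus sample.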

So, how might they feed our creative process? Here’s my attempt at making nice sentences out of the above.

— I have never been married, she said. — There’s no good to be gotten out of that sort of thing at all.

He’d use his hands to do chin-ups, pull himself up over the second staircase that hung over the landing, and he’d hang then, wriggling across the awning it created over the first set of stairs, grunting out eight to ten numbers each time he passed, his feet just missing the carpeted surface of the real stairs, the proper stairs.

Every time she walked between them she would wonder which of the two that she preferred. Not the one that she preferred, but the one that were more her, which one of these two am I, which one of these two is actually me? It was the feeling of moving between the two that she could remember, not his hands. They were just an afterthought, something cropped in in retrospect.

She can’t remember her sons either.

Her life had been a slow rise, to come to what it was. A house full of men, chairs and staircases, and she wished for it now to coil into itself, like the corners of stale newspapers.

The first thing you’ll notice about this is that it is a lot shorter. I started off by translating the above, in as much as possible, into ‘plain words’ while remaining faithful to the n-grams I liked, like ‘bright silence’, ‘old shadows’ and ‘great as prayer’. In order to create images that play off one another, and to account for the dialogue, sentences that seemed to be doing similar things began to cluster together, so paragraphs organically started to shrink. Ultimately, once the ‘purpose’ of what I was doing started to come out, a critique of bourgeois values, memory loss, the nice phrasal units started to become spurious, and the eight or so paragraphs collapsed into the three and a half above. This is also one of my biggest writing issues: I’ll type three full pages and after the editing process they’ll come to no more than 1.5 paragraphs, maybe?

The thematic sense of dislocation and fragmentation could be a product of the source material, but most things I write are about substance-abusing depressives with broken brains cos I’m a twenty-five year old petit-bourgeois male. There’s also a fairly pallid Enright vibe to what I’ve done with the above, I think the staircases line could come straight out of The Portable Virgin.

Maybe a better-trained model could provide better prompts, but overall, if you want better results out of this for any kind of creative praxis, it’s probably better to be a good writer.

A Deleuzian Theory of Literary Style

I’m always surprised, when I read one of the thinkers generally, and perhaps lazily, lumped into the general category of post-structuralist, to find how great a disservice the term does to their work. To read Derrida, Foucault or Deleuze is not to find a triad of philosophers who struggle to produce a coherent system via addled half-thoughts in order to deconstruct, stymie or relativise everything. In fact, I’m not sure there’s another philosopher I’ve read who displays greater attention to detail in their work than Derrida, and Deleuze, far from being a deconstructionist, presents us with painstaking and intricate schemata and models of thought. The rhizome, to take the most well-known concept associated with Deleuze and his collaborator, Félix Guattari, doesn’t provide us with a free-for-all, but an intricately worked-out model to enable further thought. Difference and Repetition is likewise painstaking, and so involved is Deleuze’s model of difference that applying it in great depth to my theory of literary style might be something to do if one wished to be a mad person, particularly since, at an early stage in the work, he attempts to map his concepts to particular authors, such as Borges, Joyce, Beckett and Proust. But I’ll do my best.

My notion of literary style has been influenced by the fact of my dealing with the matter via computation, i.e. multi-variate analysis and machine learning. All the reading I’m doing on the subject is leading me towards a theory of literary style founded on redundancy. When I say redundancy, I don’t mean that what distinguishes literary language from ‘normal’ language is its superfluity, an excess of that which it communicates. For the Russian formalists, this was key in defining literary language, its surfeit of meaning. I don’t like this distinction much, as it assumes that we can neatly cleave necessary communication from unnecessary communication, as if there were a clear demarcation between the words we use for their usage (utilitarian) and the words we use for their beauty (aesthetic). The lines between the two are generally blurred, and both can reinforce the function of the other. The shortcomings of this category become yet more evident when we take into account authors who might have a plain style, works which depend on a certain reticence to speak. Of course, a certain degree of recursion sets in here, as we could argue that it is in the showcased plainness of these writers that the superfluity of the work manifests itself. Which presents us with the inevitable conclusion that the definition is flawed because it’s a tautology; it’s excessive because it’s literary, it’s literary because it’s excessive.

My own idea of redundancy comes from a number of articles in the computational journal Literary and Linguistic Computing, the entire corpus of which, from the mid-nineties until today, I am slowly making my way through. It provides an interesting narrative of the ways in which computational criticism has evolved in these years. At first, literary critics would have been sure that the words that traditional literary criticism tends to emphasise, the big ones, the sparkly ones, the nice ones, were most indicative of a writer’s style. What practitioners of algorithmic criticism have come to realise, however, is that it is the ‘particles’ of literary matter that are far more indicative of a writer’s style: the distribution of words such as ‘the’, ‘a’, ‘an’, ‘and’ and ‘said’, which are sometimes left out of corpus stylistics altogether, dismissed as ‘stopwords’, bandied about too often in textual materials of all kinds to be of any real use. It’s a bit too easy, with the barest dash of an awareness of how coding works, to start slipping into generalisations along the lines of neuroscience, so I won’t go too mad, but I will say that this is an example of the ways in which humans tend to identify patterns, albeit maybe not necessarily the determining, or most significant, patterns in any given situation.

We’re magpies when we read, for better or worse. When David Foster Wallace re-instates the subject of a clause at its end, a technique he becomes increasingly reliant on as Infinite Jest proceeds, we notice it, and it comes increasingly to the fore in our sense of his style. But, in the grand scheme of the thousand-odd-page novel, the extent to which this technique is made use of is, statistically speaking, insignificant. Sentences like ‘She tied the tapes,’ in Between the Acts, for instance, pass our awareness by because of their pedestrian qualities, much like many other sentences that contain words such as ‘said,’ because of the extent to which any text’s fabric is predominantly composed of such filler.

In Difference and Repetition, Deleuze is concerned with reversing a trend within Western philosophy, the tendency to mis-read the nature of difference, which he traces back to Plato and Kant, and the idealist/transcendentalist tendencies within their thought. They believed in singular, ideal forms, against which the notion of the Image is pitched, which can only be inferior, a simulacrum, a derivative copy. Despite his model of the dialectic, Hegel is no better when it comes to comprehending difference; Deleuze sees the notion of synthesis as profoundly damaging to difference, as the third-way synthesis has a tendency to understate it. Deleuze dismisses the process of the dialectic as ‘insipid monocentrality’. Deleuze’s issue seems to be that our notions of identity only allow difference into the picture as a rupture, or an exception which vindicates an overall sense of homogeneity. Difference should be emphasised to a greater extent, and become a principle of our understanding:

Such would be the nature of a Copernican revolution which opens up the possibility of difference having its own concept, rather than being maintained under the domination of a concept in general already understood as identical.

Recognising this would be the advent of difference-in-itself.

This is all fairly consistent with Deleuze’s sense of Being as being (!) in a constant state of becoming, an experiential-led model of ontology which doesn’t aim for essence, but praxis. It would be fairly unproblematic to map this onto literary style; literary stylistics should likewise depend on difference, rather than similarity which only allows difference into the picture as a rupture; difference should be our primary criterion when examining the ways in which style becomes itself.

Another tendency of the philosophical tradition as Deleuze understands it is a belief in the goodness of thought, and its inclination towards moral, useful ends, as embodied in the works of Descartes. Deleuze reminds us of myopia and stupidity, by arguing that thought is at its most vital when at a moment of encounter or crisis, when ‘something in the world forces us to think.’ These encounters remind us that thought is impotent and require us to violently grapple with the force of these encounters. This is not only an attempt to reverse the traditional moral image of thought, but to move towards an understanding of thought as self-engendering, an act of creation, not just of what is thought, but of thought itself.

It would be to take the least radical aspect of this conclusion to fuse it with the notion of textual deformance, developed by Jerome McGann, which is of particular magnitude within the digital humanities, considering that we often process our text via code, or visualise it, and build arguments from these simulacra. But, on a level of reading which is, technologically speaking, less sophisticated, it reflects the way in which we generate a stylistic ideal as we read, a sense of a writer’s style, whether these be based on the analogue, magpie method (or something more systematic, I don’t want to discount syllable-counts, metric analyses or close readings of any kind) or quantitative methodologies.

By bringing ourselves to these points of crisis, we will open up avenues at which fields of thought, composed themselves of differential elements, differential relations and singularities, will shift, and bring about a qualitative difference in the environment. We might think of this field in terms of a literary text, a sequence of actualised singularities, appearing aleatory outside of their anchoring context as within a novel. Readers might experience these as breakthrough moments or epiphanies when reading a text, realising that Infinite Jest apes the plot of William Shakespeare’s Hamlet, for example, as it begins to cast everything in a new light. In this way, texts are made and unmade according to the conditions which determine them. I, for one, find this to be so much more helpful in articulating what a text is than the blurb for post-structuralism (something like ‘endlessly deferred free-play of meaning’). Instead, we have a radical, consistently disarticulating and re-articulating literary artwork in a perpetual, affirming state of becoming, actualised by the reader at a number of sensitive points which at any stage might be worried into bringing about a qualitative shift in the work’s processes of meaning making.

A (Proper) Statistical analysis of the prose works of Samuel Beckett

Content warning: If you want to get to the fun parts, the results of an analysis of Beckett’s use of language, skip to sections VII and VIII. Everything before that is navel-gazing methodology stuff.

If you want to know how I carried out my analysis, and utilise my code for your own purposes, here’s a link to my R code on my blog, with step-by-step instructions, because not enough places on the internet include that.

I: Things Wrong with my Dissertation’s Methodology

For my masters, I wrote a 20,000-word dissertation, which took as its subject an empirical analysis of the works of Samuel Beckett. I had a corpus of his entire works with the exception of his first novel Dream of Fair to Middling Women, which is a forgivable lapse, because he ended up cannibalising it for his collection of short stories, More Pricks than Kicks.

Quantitative literary analysis is generally carried out in one of two open-source programming languages, Python or R. The former you’re more likely to have heard of, being one of the few languages designed with usability in mind. The latter, R, would be more familiar to specialists, or people who work in the social sciences, as it is more abstruse than Python, doesn’t have many language cousins and has a very unfriendly learning curve. But I am attracted to difficulty, so I am using it for my PhD analysis.

I had about four months to carry out my analysis, so the idea of taking on a programming language in a self-directed learning environment was not feasible, particularly since I wanted to make a good go at the extensive body of secondary literature written on Beckett. I therefore made use of a corpus analysis tool called Voyant. This was a couple of years ago, so this was before its beta release, when it got all tricked out with some qualitative tools and a shiny new interface, which would have been helpful. Ah well. It can be run out of any browser, if you feel like giving it a look.

My analysis was also chronological, in that it looked at changes in Beckett’s use of language over time, with a view to proving the hypothesis that he used a narrower vocabulary as his career continued, in pursuit of his famed aesthetic of nothingness or deprivation. As I wanted to chart developments in his prose over time, I dated the composition of each text, and built a corpus for each year, from 1930–1987, excluding, of course, years in which he wrote only drama or poetry, which wouldn’t be helpful to quantify in conjunction with prose. Which didn’t stop me doing so for my masters analysis. It was a disaster.

II: Uniqueness

Uniqueness, the measurement used to quantify the general spread of Beckett’s vocabulary, was obtained by the generally accepted formula below:

unique word tokens / total words
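
As a rough illustration of the formula above, here is a minimal sketch in Python (not the R script mentioned earlier). The tokeniser is an assumption: it lowercases the text and treats any run of letters, optionally joined by an apostrophe, as a word, which is cruder than what a corpus tool would do. The function names are hypothetical.

```python
import re

# Assumed tokenisation rule: runs of letters, allowing internal apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return WORD.findall(text.lower())

def uniqueness(text):
    """Distinct words divided by total words; 0.0 for an empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

sample = "Nor of how. Nor of whom. None of anything."
print(uniqueness(sample))  # 6 distinct words out of 9
```

Lowercasing means ‘Nor’ and ‘nor’ count as one word; whether that is desirable depends on the analysis.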

There is a problem with this measurement, in that it takes no account of a text’s relative length. As a text gets longer, the likelihood that each new word has already been used approaches 1. Therefore, a text gets less unique as it gets bigger. I have the correlations to prove it:

[Figure: correlation between text length and uniqueness]

There have been various solutions proposed to this quandary, which stymies our comparative analyses somewhat. One among them is the use of vectorised measurements, which plot the text’s declining uniqueness against its word count, so we see a more impressionistic graph, such as this one, which should allow us to compare the curves for James Joyce’s novel A Portrait of the Artist as a Young Man and his short story collection, Dubliners.

[Figure: uniqueness plotted against word count for A Portrait of the Artist as a Young Man and Dubliners]
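
The kind of curve described above, uniqueness recomputed as the word count grows, can be sketched as follows. This is an assumed reconstruction in Python rather than the code behind the graph; `uniqueness_curve` and its `step` parameter are hypothetical names, and the function expects a list of already-tokenised words.

```python
def uniqueness_curve(tokens, step=1000):
    """Return (word_count, uniqueness) pairs for growing prefixes of a text.

    A point is recorded every `step` words and once more at the final word.
    """
    seen = set()
    curve = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        if count % step == 0 or count == len(tokens):
            curve.append((count, len(seen) / count))
    return curve

# Toy input, purely for illustration: the ratio falls as words repeat.
toy = "a b a c a b d a".split()
for count, ratio in uniqueness_curve(toy, step=2):
    print(count, ratio)
```

Plotting one such curve per text gives the comparison described, though, as noted, it becomes hard to read beyond a handful of texts.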

All well and good for two or maybe even five texts, but one can see how, with large scale corpora, this sort of thing can get very incoherent very quickly. Furthermore, if one were to examine the numbers on the y-axis, one would see that the differences here are tiny. This is another idiosyncrasy of stylostatistical methods; because of the way syntax works, the margins of difference wouldn’t be regarded as significant by most statisticians. These issues relating to the measurement are exacerbated by the fact that ‘particles,’ the atomic structures of literary speech (it, is, the, a, an, and, said, etc.), make up most of a text. In pursuit of greater statistical significance for their papers, digital literary critics remove these particles from their texts, which is another unforgivable thing that we do anyway. I did not, because I was concerned that I was complicit in the neoliberalisation of higher education. I also wrote a 4000 word chapter that outlined why what I was doing was awful.

IV: Ambiguity

Ambiguity was calculated with the following formula:

number of indefinite pronouns / total word count

I derived this measurement from Dr. Ian Lancashire’s study of the works of Agatha Christie, and counted Beckett’s use of a set of indefinite pronouns, ‘everyone,’ ‘everybody,’ ‘everywhere,’ ‘everything,’ ‘someone,’ ‘somebody,’ ‘somewhere,’ ‘something,’ ‘anyone,’ ‘anybody,’ ‘anywhere,’ ‘anything,’ ‘no one,’ ‘nobody,’ ‘nowhere,’ and ‘nothing.’ Those of you who know that there are more indefinite pronouns than just these are correct; I had found an incomplete list of indefinite pronouns, and I assumed that that was all. This is just one of the many things wrong with my study. My theory was that there were correlations to be detected between Beckett’s decreasing vocabulary and his increasing deployment of indefinite pronouns, relative to the total word count. I called the vocabulary measure ‘uniqueness,’ and the indefinite pronouns measure I called ‘ambiguity.’ This is tenuous, I know; indefinite pronouns advance information even as they elide the provision of information. It is, like so much else in the quantitative analysis of literature, totally unforgivable, yet we do it anyway.
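
For anyone who wants to try the ambiguity measure on their own texts, here is a minimal Python sketch, assuming the sixteen-word list above and a simple letters-only tokeniser (both the tokeniser and the function names are hypothetical, not taken from the original analysis). One wrinkle: ‘no one’ is two words, so it is matched as a phrase, while still counting as two words in the total.

```python
import re

# The (admittedly incomplete) list used in the study described above.
INDEFINITE_PRONOUNS = [
    "everyone", "everybody", "everywhere", "everything",
    "someone", "somebody", "somewhere", "something",
    "anyone", "anybody", "anywhere", "anything",
    "no one", "nobody", "nowhere", "nothing",
]

# Whole-word match; a space in a list entry may be any run of whitespace.
PRONOUN = re.compile(
    r"\b(?:" + "|".join(p.replace(" ", r"\s+") for p in INDEFINITE_PRONOUNS) + r")\b",
    re.IGNORECASE,
)
# Assumed tokenisation rule for the total word count.
WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def count_indefinite(text):
    """Number of occurrences of the listed indefinite pronouns."""
    return len(PRONOUN.findall(text))

def ambiguity(text):
    """Listed indefinite pronouns divided by total words; 0.0 if empty."""
    total = len(WORD.findall(text))
    return count_indefinite(text) / total if total else 0.0

print(ambiguity("Somewhere again. Somehow again. Someone again."))
```

Note that ‘somehow’ is not on the list, so only two of the six words in that sample are counted.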

V: Hapax Richness

I initially wanted to take into account another phenomenon known as the hapax score, which charts occurrences of words that appear only once in a text or corpus. The formula to obtain it would be the following:

number of words that appear once / total word count
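
A minimal Python sketch of the hapax formula, under the same assumed tokenisation as before (lowercased runs of letters); the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed tokenisation rule: runs of letters, allowing internal apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def hapax_score(text):
    """Words occurring exactly once, divided by total words; 0.0 if empty."""
    tokens = WORD.findall(text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(tokens)

# 'how', 'whom', 'none' and 'anything' each appear once among 9 words.
print(hapax_score("Nor of how. Nor of whom. None of anything."))
```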

I believe that the hapax count would be of significance to a Beckett analysis because of the points at which his normally incompetent narrators have sudden bursts of loquaciousness, like when Molloy says something like ‘digital emunction and the peripatetic piss,’ before lapsing back into his ‘normal’ tone of voice. Once again, because I was often working with a pen and paper, this became impossible, but now that I know how to code, I plan to go over my masters analysis, and do it properly. The hapax score will form a part of this new analysis.

VI: Code & Software

A much more accurate way of analysing vocabulary, for the purposes of comparative analysis when your texts are of different lengths, would therefore be to randomly sample each text. This is obviously not very easy when you’re working with a corpus analysis tool online, but far more straightforward when working through a programming language. A formula for representative sampling was found, and integrated into the code. My script is essentially a series of nested loops and if/else statements that randomly and sequentially sample a text, calculate the uniqueness, indefiniteness and hapax density ten times, store the results in a variable, and then calculate the mean value for each by dividing the total by ten, the number of times that the first loop runs. I entered each value into the statistical analysis program SPSS, because it makes pretty graphs with less effort than R requires.
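
The general shape of that procedure can be sketched in Python as below. This is an assumed reconstruction, not the R script itself: the sampling formula is not given here, so the sketch simply draws a contiguous window of a fixed size from a random starting point, and `sample_size=500` is a placeholder value. All function names are hypothetical.

```python
import random
from collections import Counter

def uniqueness(tokens):
    """Distinct words divided by total words."""
    return len(set(tokens)) / len(tokens)

def hapax_density(tokens):
    """Words occurring exactly once, divided by total words."""
    counts = Counter(tokens)
    return sum(1 for n in counts.values() if n == 1) / len(tokens)

def sampled_mean(tokens, measure, sample_size=500, runs=10, seed=None):
    """Average `measure` over `runs` random contiguous windows of equal size.

    Because every window has the same length, texts of different lengths
    are compared on equal terms.
    """
    if len(tokens) < sample_size:
        raise ValueError("text is shorter than the sample size")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        start = rng.randrange(len(tokens) - sample_size + 1)
        window = tokens[start:start + sample_size]
        total += measure(window)
    return total / runs

# Toy text, purely illustrative: two words alternating a thousand times.
toy = ["on", "off"] * 1000
print(sampled_mean(toy, uniqueness, sample_size=100, seed=1))
```

The obvious cost of this design is that a text shorter than the window cannot be sampled at all, which matters for very short late pieces.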

VII: Results

I used SPSS’ box plot function first to identify any outliers for uniqueness, hapax density and ambiguity. 1981 was the only year which scored particularly high for relative usage of indefinite pronouns.

[Figure: SPSS box plot of outliers for uniqueness, hapax density and ambiguity]
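
For readers without SPSS, box plot outliers are conventionally flagged with Tukey’s fences: anything more than 1.5 times the interquartile range beyond the quartiles. The Python sketch below assumes that rule and uses the standard library’s quartile calculation, which may differ slightly from the hinges SPSS computes. The scores are invented toy values with placeholder labels, not Beckett’s.

```python
import statistics

def tukey_outliers(scores, k=1.5):
    """Return the entries lying more than k * IQR beyond the quartiles."""
    values = list(scores.values())
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {label: v for label, v in scores.items() if v < low or v > high}

# Invented values for illustration only; the labels are placeholders.
toy_scores = {"a": 0.010, "b": 0.011, "c": 0.012,
              "d": 0.013, "e": 0.014, "f": 0.050}
print(tukey_outliers(toy_scores))
```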

It should be said that this measure, too, is correlated to the length of the text, which only stands to reason; as a text gets longer the relative incidence of a particular set of words will decrease. Therefore, as the only texts Beckett wrote this year, ‘The Way’ and ‘Ceiling,’ together add up to about 582 words (the fifth lowest year for prose output in his life), one would expect indefiniteness to be somewhat higher in comparison to other years. However, this doesn’t wholly account for its status as an outlier value. Towards the end of his life Beckett wrote increasingly short prose pieces. Comment C’est (How It Is) was his last novel, and was written almost thirty years before he died. This probably has a lot to do with his concentration on writing and directing his plays, but in his letters he attributed it to a failure to progress beyond the third novel in his so-called trilogy of Molloy, Malone meurt (Malone Dies) and L’Innommable (The Unnamable). It is in the year 1950, the year in which L’inno was completed, that Beckett began writing the Textes pour rien (Texts for Nothing), scrappy, disjointed pieces, many of which seem to be taking up from where L’inno left off, as do the Fizzles and the Faux Départs. ‘The Way,’ I think, is an outgrowth of a later phase in Beckett’s prose writing, which dispenses with the peripatetic loquaciousness and the understated lyricism of the trilogy and replaces them with a more brute and staccato syntax, one which is often dependent on the repetition of monosyllables:

No knowledge of where gone from. Nor of how. Nor of whom. None of whence come to. Partly to. Nor of how. Nor of whom. None of anything. Save dimly of having come to. Partly to. With dread of being again. Partly again. Somewhere again. Somehow again. Someone again.

Note also the prevalence of particle words, which will have been stripped out for the analysis, and the ways in which words with a ‘some’ prefix are repeated as a sort of refrain. This essential structure persists in the work, or at least in the artefact of the work that the code produces, and hence makes of it the outlier that it is.

[Figure: uniqueness, hapax density and ambiguity plotted together by year]

From plotting all the values together at once, we can see that uniqueness is partially dependent on hapax density; the words that appear only once in a particular corpus would be important in driving up the score for uniqueness. While there could be said to be a case for the hypothesis that Beckett’s texts get less unique and more ambiguous up until 1944, when he completed his novel Watt, and, if we’re feeling particularly risky, up until 1960 when Comment C’est was completed, it would be wholly disingenuous to advance it beyond this point, when his style becomes far too erratic to categorise definitively. Comment C’est is Beckett’s most uncompromising prose work. It has no punctuation, no capitalisation, and narrates the story of two characters, in a kind of love, who communicate with one another by banging kitchen implements off one another:

as it comes bits and scraps all sorts not so many and to conclude happy end cut thrust DO YOU LOVE ME no or nails armpit and little song to conclude happy end of part two leaving only part three and last the day comes I come to the day Bom comes YOU BOM me Bom ME BOM you Bom we Bom

VIII: Conclusion

I would love to say that the general tone is what my model is being attentive to, which is why it identified Watt and How It Is as nadirs in Beckett’s career, but I think their presence on the chart is more a product of their relative length, as novels, versus the shorter pieces which he moved towards in his later career. Clearly, Beckett’s decision to write shorter texts makes this means of summing up his oeuvre in general insufficient. Whatever changes Beckett made to his aesthetic over time, we might not need to have such complicated graphs to map them, and I could have just used a word processor to find the answer: length. Bom and Pim aside, for whatever reason, after having written L’inno none of Beckett’s creatures presented themselves to him in novelistic form again. The partiality of vision and modal tone which pervades the post-L’inno works demonstrates, I think, far more effectively what it was that Beckett was ‘pitching’ for, a new conceptual aspect to his prose, which re-emphasised its bibliographic aspects, the most fundamental of which was their brevity, or the appearance of an incompleteness, by virtue of being honed to sometimes less than five hundred words.

Differing categories of words seem like a radical, and the most fun, thing to quantify in the analysis of literary texts, as the words are what we came for, but the problem is similar to one that overtakes one who attempts to read a literary text word by word by word, and unpack its significance as one goes: overdetermination. Words are kaleidoscopic, and the longer you look at them, the more threatening their darkbloom becomes, the more they swallow, excrete, the more alive they are, all round. Which is fine. Letting new things into your life is what it should be about, until their attendant drawbacks become clear, and you start to become ambivalent about all the fat and living things you have in your head. You start to wish you read poems instead, rather than novels, which make you go mad, and worse, start to write them. The point is words breed words, and their connections are too easily traced by computer. There’s something else about knowing their exact correlations to a decimal point. They seem so obvious now.