Quantifying Modernism and the avant-garde

Introduction and Methodology

This post documents a statistical analysis carried out on a corpus of 500 novels. 250 of these texts are generally categorised as ‘realist’ and will be used as a benchmark against which we might define modernist literary style, a mode of writing which arose in the early twentieth century (though it should be noted that this chronology is increasingly subject to revision due to the work of new modernist scholars).

The first novel in the naturalistic corpus, chronologically speaking, is Jane Austen’s novel Lady Susan, which was written in 1794. The final one is Thomas Hardy’s novel Jude the Obscure, which was published in 1895. This corpus contains the complete prose works, a phrase here encompassing novels, novellas and short story collections, of fifteen writers: Jane Austen, Emily, Anne and Charlotte Brontë, Stephen Crane, Honoré de Balzac, Charles Dickens, Fyodor Dostoevsky, George Eliot, Gustave Flaubert, Elizabeth Gaskell, Thomas Hardy, William Makepeace Thackeray, Leo Tolstoy and Émile Zola.

The corpus of 250 modernist novels begins in the year 1869, with Henry James’ first bloc of short stories, and continues all the way to Samuel Beckett’s 1988 novella ‘Stirrings Still’, so there is some overlap between these two corpora’s starting and end points. This modernist corpus consists of the complete works of nineteen writers: Djuna Barnes, Samuel Beckett, Jorge Luis Borges, Elizabeth Bowen, Joseph Conrad, William Faulkner, F. Scott Fitzgerald, Ford Madox Ford, Ernest Hemingway, Henry James, James Joyce, Franz Kafka, D.H. Lawrence, Katherine Mansfield, Flann O’Brien, Marcel Proust, Gertrude Stein, Edith Wharton and Virginia Woolf.

This disproportion between the two corpora, with fifteen realists versus nineteen modernists, may seem disconcerting at first, but what is required in order for the statistical analyses to function is for the number of observations to be equal, rather than the number of novelists. Unfortunately, realist authors wrote more novels than modernist authors, and this compromised our ability to retain the same number of authors on each end of the generic spectrum.

One other aspect to consider is the international dimension. The realist corpus includes ten novelists who wrote in English, but there are also two Russian and three French realists, two of whom, Zola and the aforementioned Balzac, were far more prolific than any other writer in either corpus. Zola and Balzac composed 86 and 34 novels, short story collections or novellas respectively. This has the consequence that well over half of the realist corpus is in translation from another language in comparison to just under 10% of the modernist corpus. I intend to address this when I am at a later stage in my research. There has been some work published on the issues surrounding the quantification of literature in translation and across language, but I do not yet possess a sufficient breadth of knowledge in this field to comment intelligently on the matter. I do think it is important to have French and Russian writers included in the realist corpus on the basis that many of them, be they Tolstoy, Flaubert or Balzac, exerted a significant influence on their modernist successors.

Whether or not these are ‘the best’ or most accurate translations is somewhat beside the point. From the reading I have done around the issue of literary translation, their being subject to change over time is in the nature of how text is received and re-constituted in different eras for different communities of readers (this discussion between Will Self and Kafka’s translators is particularly illuminating in this context; please do not be put off by Self, he gives the translators plenty of space to discuss the process, and you really should watch it). The germane point here is that the translations being analysed in this instance could not be considered to be the most contemporary. There might be an argument for retaining these older translations on the basis that they are more likely to be the versions of the text which would have been circulating in the early twentieth century and therefore the translations modernist authors would have been more likely to have read, but making this claim would require a greater burden of proof, such as what languages each author read novels in and what their reading habits were more generally.

So, to turn to the analysis. My research is directed towards the quantitative analysis of grammar, the rationale being that we could, by examining varying quantities of particular categories of words, such as verbs, adjectives or prepositions, develop an understanding of how literary fiction changes from the beginning of the nineteenth century until the end of the twentieth, and, more specifically, how literary modernism departs from, or, perhaps remains contiguous with, this previous generation of novel writing. This was carried out using a POS tagger from the Natural Language Toolkit in Python.
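As a rough illustration of the kind of measurement involved, here is a minimal sketch of how the share of each part-of-speech tag in a text might be computed. It is not the script used for this analysis: the function names `tag_proportions` and `tag_text` are hypothetical, the toy sentence is hand-tagged, and whether punctuation tokens should count towards the total is an assumption left open here.

```python
from collections import Counter


def tag_proportions(tagged_tokens):
    """Return each POS tag's share of all tagged tokens, as a percentage."""
    counts = Counter(tag for _word, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100 * n / total for tag, n in counts.items()}


def tag_text(text):
    """Tokenise and tag raw text with NLTK's default tagger.

    Needs NLTK plus its tokenizer and tagger data to be installed.
    """
    import nltk  # imported lazily so tag_proportions works without NLTK
    return nltk.pos_tag(nltk.word_tokenize(text))


# Hand-tagged toy sentence (Penn Treebank tags), so no NLTK data is required.
sample = [("She", "PRP"), ("walked", "VBD"), ("home", "NN"), ("slowly", "RB")]
shares = tag_proportions(sample)
print(shares)
```

For a real novel one would pass the output of `tag_text(novel_text)` to `tag_proportions` and record, say, the `VBD` (past tense verb) share for each text in the corpus.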

Results

From realism to modernism:

  • Average sentence length decreases by 4 words, from an average of 22 words to 18 words per sentence.
  • Personal pronouns (I, you, he, she, it, we, they, me, him, her, us, and them) increase by 1% from 5% to 6%. Interrogative pronouns (who and where) decrease by 0.01% from 0.03% to 0.02%.
  • Verbs in the past tense increase by 1% from 6% to 7%.
  • Adverbs increase by 0.5% from 4.5% to 5%.
  • Prepositions (after, in, to, on, and with) decrease by 0.4% from 10.9% to 10.5%.
  • Wh Determiners (words beginning with wh, such as ‘where’ or ‘who’ acting to modify the noun phrase) decrease by 0.2% from 0.6% to 0.4%.
  • Particles (parts of speech with grammatical function with no meaning such as ‘up’ in the phrase ‘I tidied up the room’) increase by 0.1% from 0.4% to 0.5%.
  • Non third-person singular present verbs (verbs in first or second person) decrease by 0.1% from 1.6% to 1.5%.
  • Existentials (words such as ‘there’ which indicate that something exists) increase by 0.04%, from 0.17% to 0.21%.
  • Superlative adjectives (adjectives such as ‘best’, ‘biggest’, ‘worst’) decrease by 0.01% from 0.14% to 0.13%.

It will not have escaped your attention that a lot of these percentages are quite small. The extent to which any given text is made up of these hyper-specific categories is pretty minimal in the first place, which is why many of these quantities seem so laughably tiny. Rest assured that they are statistically significant. This does not mean that they are important (that requires a greater burden of proof, more analyses, more exploration), but it does mean that they are noteworthy considering the quantities involved.
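I haven’t said here which significance test was used, so the following is only a sketch of one assumption-light way to check whether a difference between two groups of novels could plausibly be down to chance: a permutation test on the difference in means. The function name `permutation_p_value` is hypothetical and the numbers are invented toy values, not figures from the corpus.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_resamples=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        # Shuffle the labels and see how often chance alone matches the gap.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p of 0.
    return (hits + 1) / (n_resamples + 1)


# Invented sentence-length averages, for illustration only.
toy_realist = [22.1, 24.0, 21.5, 23.3, 20.9, 22.8]
toy_modernist = [17.2, 18.9, 16.5, 19.1, 18.0, 17.7]
p = permutation_p_value(toy_realist, toy_modernist)
print(p)
```

A small p-value says only that the gap is unlikely to be a shuffling accident; it says nothing about whether the gap matters, which is the distinction drawn above.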

One boxplot which might be of interest is the one below, which shows the ‘spread’ of the data for average sentence length between realism and modernism.

What we see on the left is the variation of the sentence length data (the term ‘variation’ here meaning the general ‘dispersedness’ of the data) for realism, which goes from 10 to roughly 35 words per sentence with an outlier or two on either end, whereas if we consider modernism, we have everything from zero (Samuel Beckett’s novel How It Is, which has no full stops in it) up to forty-five, with far more outliers on the higher end. Higher outliers are data points with values more than 1.5 times the interquartile range above the third quartile; lower outliers, of which there are three, are more than 1.5 times the interquartile range below the first quartile. For one’s own general knowledge, the modernist outliers for sentence length are:

  • William Faulkner’s Absalom, Absalom! (46.4) and Intruder in the Dust (42.3)
  • Marcel Proust’s Swann’s Way (42.9), In a Budding Grove (40.2), Time Regained (38), The Prisoner (37.2), The Captive (35.7), The Guermantes Way (34.1) and Sodom and Gomorrah (30.9).
  • Samuel Beckett’s Texts for Nothing and The Unnamable have 40.5 and 32.9 words per sentence respectively.
  • Gertrude Stein’s novels The Making of Americans and Everybody’s Autobiography have 33.9 and 33.5 respectively.
  • Henry James’ The Ivory Tower and The Young Lovell score 31.8 and 29 respectively.
  • The three lower outlier values for sentence length all belong to Beckett: the aforementioned How It Is, Worstward Ho (4.9) and Ill Seen Ill Said (7).
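The 1.5 × interquartile range rule described above can be sketched in a few lines. This is an illustration rather than the code behind the boxplot: `tukey_outliers` is a hypothetical name, the values are toy numbers, and different packages compute quartiles in slightly different ways, so the fences may not match a given plotting library exactly.

```python
from statistics import quantiles


def tukey_outliers(values, k=1.5):
    """Split values into low and high outliers using the k * IQR fences."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    low = [v for v in values if v < lower]
    high = [v for v in values if v > upper]
    return low, high, (lower, upper)


# Toy words-per-sentence averages, for illustration only.
toy_lengths = [0.0, 14.0, 16.0, 17.0, 18.0, 19.0, 20.0, 22.0, 46.4]
low, high, fences = tukey_outliers(toy_lengths)
print(low, high, fences)
```

Here the quartiles are 16 and 20, so the fences sit at 10 and 26, and the values 0.0 and 46.4 fall outside them.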

It can be tempting, I think, when we see these sorts of names surface so prominently, in conjunction with a visual confirmation of the existence of an avant-garde, to think that modernism in its purest form was a kind of relentless maximalism, an uncompromising movement towards longer sentences and more pronouns, and that all other manifestations of it are inadequate or insufficient in some way. This is a rather boring and masculinist overview of the genre, which takes, I think, too many of the claims made by its most dogmatic adherents at face value, and it’s not a modernism I’m particularly interested in defending or instantiating. There can also, of course, be a regressive or rearguard aspect to modernism, which is perceptible in the following boxplot, which displays the distribution of past tense verbs.

As was pointed out above, modernism displays an increase in past tense verbs overall, but here we see a large number of outlier values moving against the overall trend. These novels are:

  • James Joyce’s Ulysses (4.3%) and Finnegans Wake (2.7%)
  • William Faulkner’s As I Lay Dying (4.2%) and Requiem for a Nun (3.6%)
  • Samuel Beckett’s Malone Dies (3.9%), Fizzles (2.5%), Company (2%), Texts for Nothing (1.8%), The Unnamable (1.7%), Worstward Ho (1.6%), Ill Seen Ill Said (1.4%) and a corpus of his miscellaneous and unpublished short fiction (2.2%).
  • Joseph Conrad and Ford Madox Ford’s collaborative novel The Nature of a Crime (2.6%)
  • Virginia Woolf’s The Waves (2.4%)
  • Gertrude Stein’s Tender Buttons (1.7%)

The higher modernist outlier is Virginia Woolf’s 1937 novel The Years (10%) and the lower realist outlier is Balzac’s 1841 novel Letters of Two Brides (2.7%).

In this way we can see that modernism is not just a unidirectional commitment to a narrow sequence of stylistic changes. Instead, it’s a contradictory movement in which a number of different stylistic markers jostle against and subvert one another. In this particular instance, for example, we can perceive the authors generally understood to be among the most uncompromising (Joyce, Beckett, Stein, Woolf and Faulkner) resisting the overall trend.

From the two boxplots I’ve generated so far, you might have noticed that modernism tends to generate a greater number of outliers, and I can confirm that this trend of a greater degree of grammatical heterogeneity manifesting itself in modernist novel-writing than in naturalistic novel-writing persists across the other categories of grammar, which you can validate by looking at the complete analysis here.

This struck me as an important development, so I quantified the extent of each data point’s outlier-ness, and then grouped them according to author. These values were then divided by the number of outlier data points, because some of these novelists only have a small number of novels in the corpus versus others. Austen’s complete works would be totally outnumbered by Balzac’s, for instance. The results appear below:

Please do note the values on the y-axis; Jane Austen is barely above zero because the only outlier text she wrote is Mansfield Park, which marks itself out for its disproportionate use of adjectives. I thought it better not to exclude her from the plot though, because I didn’t want it to turn into even more of a boys’ club than it might otherwise be. It would be useful, and exciting I think, to conceive of this plot as an indication of early breaches with conventional form, perhaps some nineteenth-century anticipations of modernism. Reading Dostoevsky, Zola and Balzac in this manner would be coterminous with changes taking place in the study of modernism now, but reading Thackeray and Eliot in these terms might be a more surprising development, and I’d be interested to read these texts in light of what we’re seeing here.
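For anyone wanting to reproduce something like the per-author score, here is one way it might be computed for a single grammatical category. It rests on assumptions of my own for the purposes of illustration: ‘outlier-ness’ is taken to be the distance beyond the nearest 1.5 × IQR fence, measured in IQR units, and `outlier_scores_by_author` is a hypothetical name. The records are toy values.

```python
from collections import defaultdict
from statistics import quantiles


def outlier_scores_by_author(records, k=1.5):
    """Average distance beyond the fences per author, in IQR units.

    records is a list of (author, value) pairs for one category.
    Authors with no outlying texts are left out of the result.
    """
    values = [value for _author, value in records]
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return {}  # no spread, so no meaningful fences
    lower, upper = q1 - k * iqr, q3 + k * iqr
    excess = defaultdict(list)
    for author, value in records:
        if value > upper:
            excess[author].append((value - upper) / iqr)
        elif value < lower:
            excess[author].append((lower - value) / iqr)
    # Divide by the number of outlier points so prolific authors don't dominate.
    return {author: sum(d) / len(d) for author, d in excess.items()}


# Toy data: author "A" has one very low and one very high text.
toy_records = [
    ("A", 0.0), ("B", 14.0), ("B", 16.0), ("C", 17.0), ("C", 18.0),
    ("C", 19.0), ("D", 20.0), ("D", 22.0), ("A", 46.4),
]
scores = outlier_scores_by_author(toy_records)
print(scores)
```

Repeating this for each grammatical category and combining the per-category scores would give a single figure per author; how best to combine them is a further choice not settled here.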

The modernism plot for deviation appears below:

The unlabelled entry between Faulkner and James is Hemingway.

From this plot we can see that the most avant-gardist prose writers, considered from the perspective of their grammar, appear to be Beckett, Stein, Woolf, Conrad and Joyce. Of course, this is nowhere near a definitive answer as to what modernist style is, or who its most innovative practitioners were; these measurements are atomistic and are quantifying individual words. But style is not just words in isolation, style is agglomerations of words, spaces between words, the clandestine networks and relations the phrases these words add up to compose in the mind of the reader, and, if these digital methodologies are to have any chance of illustrating this shift (an inadequate term in the first instance, since it is more an accumulation of changes distributed over a broad corpus than a sudden or transformational one that we are here concerned with) it is in these cumulative terms that style must be quantified, in order to avoid drifting into the reductive and schematic scientism that numerical analyses of this kind are frequently accused of perpetuating.

Joanna Walsh’s ‘Seed’

The first thing one notices about Joanna Walsh’s online novella Seed is the quality of the design. Seed’s aesthetic is very consistent, and was obviously designed with an eye to the material at hand. For all this we have its illustrator Charlotte Hicks to thank, as well as the digital publishing company responsible for designing the platform on which the text is hosted. Seed is optimised for iOS, and, as the site tells us, is probably better viewed there, but it can also be read on a laptop or a PC.

The reader begins by being presented with seventeen different plants which open up onto different lexia, with suggestive and minimalistic titles such as ‘Baby’, ‘Touch’ or ‘Red’. Each one gives a brief insight into the life of an eighteen-year-old woman living in a middle-class housing estate in suburban England, coming to terms with herself, her environment, the people around her and the reality of her incipient young adulthood. By presenting the reader with seventeen different starting points (ignoring the opening explanatory remarks for a moment), and the means of proceeding in any way they might choose, the text emulates the same provisional and tentative steps that the narrator concurrently takes in the development of her own identity. In an interview, Walsh explains that the rhizoidal orientation of the text provided her with the opportunity to disorientate the reader, and perhaps engender in them the same uncertainty that the protagonist of the novella may be feeling at any given time, so that the reader has:

no sense of reading left to right, of the weight of the book, of how far they were through, or, sometimes, of the direction within the narrative.

Seed is therefore doing very deliberate and self-conscious things with the particularities of its format, typical of texts which, overtly or otherwise, draw attention to their digitality. Insofar as a firm distinction can be drawn between these two facets of the work, Seed therefore introduces a coherence/tension between its form and its content.

In a design quirk which enables this sense of openness that Seed conveys, the reader has the option of changing the text’s visual interface in order to display differently-coloured vines intertwined between each of the plants. The colours refer to each lexia’s subject matter, and invert the standardised and industrial nature of colour-coding, a tendency, or obsession, that the narrator exhibits throughout the text:

Fruits in the supermarket. They’re a different species. Those strawberries all white in the middle all the year round, like crunchy peaches. Everything so shiny. Not a speck of earth anywhere. Why would there be? It goes straight from the formica shed to our formica kitchen. Once cut my mother wraps it in cling film and puts it in the fridge.

The narrator’s sustained attention to post-industrial artefacts, the symptoms of contemporary, or then-contemporary, suburban living, is the strongest aspect of Seed. The narrator’s oscillation between a tone of matter-of-fact inventory and syntax-rupturing anxiety enacts the very process of interpretation, and the fact that so much narrative time is deployed in coming to terms with such quotidian objects, made to seem strange by their presence in a narrative medium known for attention to other, less strange things, intensifies the effect:

The doves in our garden say something else no they say somewhere else from their tall perspective looking down on lawns mowed with stripes, somewhere nature isn’t the same kind we have round here.

The site’s drawing together of Seed’s structure and content finds a corollary in the text’s actual word usage. Walsh uses leitmotifs, particularly the names of plants or descriptions of colours, in order to string the units of text together in more subtle ways, without making use of an overt visual interface.

It should be noted that the text is not as radically discontinuous as it might at first seem, or certainly was not regarded as such by Walsh, who said the following in an interview:

I’ve been thinking about the authority I’m still claiming as an ‘author’ in Seed; despite the degree of reader-control offered by the project, it’s still a fairly traditional ‘authorial’ work.

I had to write Seed as a linear text to ensure it will read ok for anyone who wants to follow the temporal narrative. That said, I never write in a ‘linear’ fashion, but in one that resembles the Seed reading experience: I write phrases, notes, paragraphs, then brings them together on shuffle, until they work.

Walsh’s comments may be surprising for those familiar with her writing methodology, which involves the use of cut-ups, or other aleatoric methods which introduce an element of chance into the composition process. It is surprising also, for those who are familiar with the somewhat niche history of digital or hypertextual literature. For many of hypertext’s trailblazing practitioners, such as Shelley Jackson or Michael Joyce, the crux of hypertextual literature was the game-playing that new digital formats allowed the author to engage in as an absent centre of meaning, which expedited the then-extremely trendy dalliances with post-structuralist philosophy and critical theory in a digital context. Within Seed’s units of text after all, there is no opportunity for interaction, except insofar as the text requires you to turn the page. In an interview with Review31, Walsh described how Seed barely resembles a hypertext in the original sense of the term at all, and that it is much better understood as a traditional work focalised around the author’s vision.

This is true, firstly for the structural reasons already outlined, but also because Seed’s formal architecture is best understood as functioning in the same way as literary works in print do, in that they imply, or gesture, far more readily than they state directly. This is axiomatic for all novels worthy of the name, but it presents an interesting means of thinking about how narrative works in the context of Seed in particular. While it might seem to present some amount of freedom or capacity for interaction, Seed is in fact circumscribing you even as it offers the chance of liberation. This has a nice visual metaphor in Seed’s visual interface which deliberately places a number of other flowers beyond the reader’s reach in darkness, suggesting both the thwarted ambition to move beyond the text that we’re presented with, and, as I’ve said already, the myopia of the narrator in her own environment:

it’s a fairly tight work, and I’ve said what I wanted to say in it. I love the idea of locked passages: part of my intent was to create a feeling of implied space beyond what is described (isn’t that the intent of most novels, to create, in however abstract a sense, a ‘world’, even if ‘world’ means a set of conceptual parameters?). I’d like to do a print edition to see if and how the circle of nonlinearity could be squared.

Though we have the ability to read Seed in any order we might like, each section is up to five pages long, and therefore requires us to read chronologically for a far greater length of time than hypertexts of the nineties do. Whether this can be attributed to the now mainstream nature of micro-textual formats, which requires literature to aspire to something else is probably a question for others to answer. Personally speaking, if writers working digitally can produce works as good as Seed, I won’t be unduly detained by the sociological reasonings why.

How big are the words modernists use?

It’s a fairly straightforward question to ask, one which most literary scholars would be able to provide a halfway decent answer to based on their own readings: Ernest Hemingway, Samuel Beckett and Gertrude Stein are more likely to use short words, James Joyce, Marcel Proust and Virginia Woolf longer ones, with the rest falling somewhere between the two extremes.

Most Natural Language Processing textbooks or introductions to quantitative literary analysis demonstrate how the frequency of the most common words in a corpus declines roughly in inverse proportion to their rank (Zipf’s law), i.e. the most frequently occurring term will appear about twice as often as the second and three times as often as the third, and so on. I was curious to see whether another process was at work for word lengths, and whether we can see a similar decline at work in modernist novels, or whether more ‘experimental’ authors visibly buck the trend. With some fairly elementary analysis in NLTK, and the resulting data frames exported to R, I generated a visualisation which looked nothing like this one.*

*The previous graph had twice as many authors and was far too noisy, with not enough distinction between the colours to make it anything other than a headwreck to read.
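The counting step behind a plot like this is simple enough to sketch. What follows is an illustration, not the script used for the graph: `word_length_shares` is a hypothetical name, and treating any unbroken run of letters as a word (so that ‘don’t’ splits in two) is a simplifying assumption.

```python
import re
from collections import Counter


def word_length_shares(text):
    """Return the percentage of words at each length, keyed by length."""
    # Runs of letters only; digits, underscores and punctuation are skipped.
    words = re.findall(r"[^\W\d_]+", text)
    counts = Counter(len(word) for word in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: 100 * n / total for length, n in sorted(counts.items())}


# Toy sentence of seven words: three of one letter, two of two, two of three.
sample_shares = word_length_shares("I am a cat on a mat")
print(sample_shares)
```

Running this over each author’s collected texts gives one row per author and one column per word length, which is the shape of table a line plot needs.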

In narrowing down the number of authors I was going to plot, I inclined more towards authors that I thought would be more variegated, getting rid of the ‘strong centre’ of modernist writing, not quite as prosodically charged as Marcel Proust, but not as brutalist as Stein either. I also put in a couple of contemporary writers for comparison, such as Will Self and Eimear McBride.

As we can see, after the rather disconnected percentages of corpora that use one-letter words, with McBride and Hemingway on top at around 25%, and Stein a massive outlier at 11%, things become increasingly harmonious, and the longer the words get, the more the lines of the vectors coalesce.

Self and Hemingway dip rather egregiously with regard to their use of two-letter words (which is almost definitely because of a mutual disregard for a particular word, I’m almost sure of it), but it is Stein who exponentially increases her usage of two- and three-letter words. As my previous analyses have found, Stein is an absolute outlier in every analysis.

By the time the words are ten letters long, true to form it’s Self whose writing is the only one to manifest them at a rate of above 1%.