The question that this blog post sets itself is: what differences and similarities can be detected between modernist and contemporary authors on the basis of three stylistic variables (hapax, unique and ambiguity), and how are these stylistic variables related to one another?

I: The Data

The data analysed in this project were derived from twenty-one corpora of avant-garde literary prose, using the open-source programming language R. The complete works of the following authors were used: James Joyce, Virginia Woolf, Gertrude Stein, Sara Baume, Anne Enright, Will Self, F. Scott Fitzgerald, Eimear McBride, Ernest Hemingway, Jorge Luis Borges, Joseph Conrad, Ford Madox Ford, Franz Kafka, Katherine Mansfield, Marcel Proust, Elizabeth Bowen, Samuel Beckett, Flann O’Brien, Djuna Barnes, William Faulkner and D.H. Lawrence.

Seventeen of these writers were active between 1895 and 1968, a period associated with the genre of writing that literary criticism refers to as ‘modernist’. The remaining four are still living, with novels published as early as 1991 and as late as 2016. These novelists are known for identifying as latter-day modernists, and see their novels as re-engaging with the modernist aesthetic in a significant way.

I.II Uniqueness

The unique variable is a generally accepted measurement used within digital literary criticism to quantify the ‘richness’ of a particular text’s vocabulary. Uniqueness is obtained by dividing the number of distinct word types in a text by the total number of words. For example, if a novel contained 20000 distinct word types and 100000 words in total, its uniqueness would be calculated as follows:

Uniqueness = 20000 / 100000 = 0.2
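For readers who would like to see the calculation spelled out, here is a minimal sketch in Python. It is not the R code used for the study. The `tokenize` helper and its rule for what counts as a word (lowercased runs of letters, with internal apostrophes kept) are my own simplifying assumptions, and a different rule would give slightly different values.

```python
import re


def tokenize(text):
    """Split text into lowercased word tokens (an assumed, simplified rule)."""
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())


def uniqueness(text):
    """Number of distinct word types divided by the total number of words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty text
    return len(set(tokens)) / len(tokens)


# Toy sentence, not drawn from any of the corpora: 8 types over 11 words.
sample = "the cat sat on the mat and the dog sat too"
print(round(uniqueness(sample), 3))
```

The same function applied to a whole corpus would give the per-author figure, under the tokenising assumption above.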

I.III Ambiguity

Ambiguity is a measure used to calculate the approximate obscurity of a text, or the extent to which it is composed of indefinite pronouns. The indefinite pronouns quantified in this study are as follows: ‘another’, ‘anybody’, ‘anyone’, ‘anything’, ‘each’, ‘either’, ‘enough’, ‘everybody’, ‘everyone’, ‘everything’, ‘little’, ‘much’, ‘neither’, ‘nobody’, ‘no one’, ‘nothing’, ‘one’, ‘other’, ‘somebody’, ‘someone’, ‘something’, ‘both’, ‘few’, ‘everywhere’, ‘somewhere’, ‘nowhere’, ‘anywhere’, ‘many’, ‘others’, ‘all’, ‘any’, ‘more’, ‘most’, ‘none’, ‘some’, ‘such’. The formula for ambiguity is:

number of indefinite pronouns / number of total words
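The formula can be sketched in Python as follows. This is an illustration and not the study's own R code. Two things here are assumptions of mine: the tokenising rule, and the decision to count the two-word pronoun ‘no one’ as a single hit, so that its ‘one’ is not counted a second time. The word list is the one given above.

```python
import re

# Single-word indefinite pronouns from the list above; 'no one' is handled separately.
INDEFINITE_PRONOUNS = {
    "another", "anybody", "anyone", "anything", "each", "either", "enough",
    "everybody", "everyone", "everything", "little", "much", "neither",
    "nobody", "nothing", "one", "other", "somebody", "someone", "something",
    "both", "few", "everywhere", "somewhere", "nowhere", "anywhere", "many",
    "others", "all", "any", "more", "most", "none", "some", "such",
}


def tokenize(text):
    """Split text into lowercased word tokens (an assumed, simplified rule)."""
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())


def ambiguity(text):
    """Number of indefinite pronouns divided by the total number of words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = 0
    i = 0
    while i < len(tokens):
        # Assumption: 'no one' counts once, and its 'one' is not counted again.
        if tokens[i] == "no" and i + 1 < len(tokens) and tokens[i + 1] == "one":
            hits += 1
            i += 2
            continue
        if tokens[i] in INDEFINITE_PRONOUNS:
            hits += 1
        i += 1
    return hits / len(tokens)


# Toy sentence: 4 pronouns ('no one', 'anything', 'everybody', 'something') in 8 words.
sample = "No one said anything, but everybody knew something."
print(ambiguity(sample))
```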

I.IV Hapax

Finally, the hapax variable calculates the density of hapax legomena, words which appear only once in a particular author’s oeuvre. The formula for this variable is:

number of hapax legomena / number of total words
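As with the other two variables, a short Python sketch may make the calculation concrete. It is an illustration under my own tokenising assumption and not the R code behind the study. To match the definition above, the whole of an author's oeuvre would be passed in as one text.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercased word tokens (an assumed, simplified rule)."""
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())


def hapax_density(text):
    """Number of words occurring exactly once divided by the total number of words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(tokens)


# Toy sentence: 'the' and 'sat' repeat, so 6 of the 8 types are hapaxes, over 11 words.
sample = "the cat sat on the mat and the dog sat too"
print(round(hapax_density(sample), 3))
```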

[Figure: bar chart giving an overview of the data]

II: Data Overview

Even before analysing the data in any depth, it stands to reason that these variables are interrelated. Hapax and unique are best understood as indications of a text’s heterogeneity: if a text is hapax-rich, its score for uniqueness will be similarly elevated. Ambiguity, as it counts a set of pre-defined words, can be considered a measure of a text’s homogeneity: as the occurrences of these commonplace words increase, hapax and uniqueness will be negatively affected. The aim of this study is first to determine how these measures vary according to the period in which the texts were written, i.e. across modern and contemporary corpora, then to determine which correlations exist between the stylistic variables, and which of the three is most subject to the fluctuations of another.

[Figures: further overviews for each variable]

IV.I: The Three Groups Hypothesis

A number of things are clear from these representations of the data. The first finding is that the authors fall into approximately three distinct groups. The first is the base level of early twentieth-century modernist authors, who are all relatively undifferentiated. These are Ernest Hemingway, Virginia Woolf, William Faulkner, Elizabeth Bowen, Marcel Proust, F. Scott Fitzgerald, D.H. Lawrence, Joseph Conrad and Ford Madox Ford. They are all below the mean for the hapax and unique variables.

[Figure: boxplot of outliers for the unique hapax variable]

The second group reaches into more extreme values for unique and hapax. These are Djuna Barnes, Jorge Luis Borges, Franz Kafka, Flann O’Brien, James Joyce, Eimear McBride and Sara Baume. Three of these authors are even outliers for the hapax variable, which can be seen in the box plot.

Joyce’s position as an extreme outlier in this context is probably due to his novel Finnegans Wake (1939), which was written in an amalgam of English, French, Irish, Italian and Norwegian. It’s no surprise, then, that Joyce’s value for hapax is so high. The following quotation may be sufficient to give an indication of how eccentric the language of the novel is:

La la la lach! Hillary rillarry gibbous grist to our millery! A pushpull, qq: quiescence, pp: with extravent intervulve coupling. The savest lauf in the world. Paradoxmutose caring, but here in a present booth of Ballaclay, Barthalamou, where their dutchuncler mynhosts and serves them dram well right for a boors’ interior (homereek van hohmryk) that salve that selver is to screen its auntey and has ringround as worldwise eve her sins (pip, pip, pip)

Though Borges’ and Barnes’ prose may not be as far removed from modern English as Finnegans Wake, both of these authors are known for their highly idiosyncratic use of language; Borges for his use of obscure terms derived from archaic sources, and Barnes for reversing normative grammatical and syntactic structures in unique ways.

The third and final group may be thought of as an intermediary between these two extremes, and consists of Katherine Mansfield, Samuel Beckett, Will Self and Anne Enright. These authors share characteristics of both groups: their values for ambiguity remain stable, while their uniqueness and hapax counts are far more pronounced than those of the first group, without reaching the values of the second.

[Figure: boxplot displaying Stein as an extreme outlier for ambiguity]

Gertrude Stein is the only author whose stylistic profile doesn’t quite fit into any of the three groups. She is perhaps best thought of as most closely analogous to the first group of early twentieth-century modernists, but her extreme value for ambiguity should be sufficient to distinguish her in this regard.

The value for ambiguity remains fairly stable throughout the dataset: the standard deviation is 0.03, and if Stein’s values are removed, it narrows from 0.03 to 0.01.

Two disclaimers need to be made about this general account from the descriptive statistics and graphs. The first is that there is a fundamental issue with making such a schematic account of these texts. The grouping approach that this project has taken thus far is insufficiently nuanced, as it could probably be argued that McBride could just as easily fit into the third group as the second. The second, which follows from the first, is that the stylistic variables do not adequately distinguish modern and contemporary corpora from one another.
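To show how a single extreme value can inflate a standard deviation in this way, here is a small Python sketch. The numbers are invented for illustration and are not the study's ambiguity scores. I have also assumed the sample standard deviation (`statistics.stdev`); the source does not say whether the sample or the population formula was used.

```python
from statistics import stdev

# Invented ambiguity scores for illustration only; the last value plays the outlier.
with_outlier = [0.030, 0.040, 0.050, 0.040, 0.030, 0.045, 0.150]
without_outlier = with_outlier[:-1]

sd_all = stdev(with_outlier)         # spread including the extreme value
sd_trimmed = stdev(without_outlier)  # spread once it has been removed

print(f"with outlier:    {sd_all:.3f}")
print(f"without outlier: {sd_trimmed:.3f}")
```

The spread without the extreme value is several times narrower, which is the same kind of effect as removing Stein from the ambiguity data.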

IV.II Word Count

[Figure: word count for the most prolific authors]

It should not escape our attention that the authors who score lowest for each variable, the first group of early twentieth-century authors, are also the most prolific. The correlation between word count and the stylistic variables was therefore calculated.

[Figure: Pearson correlation for word count and stylistic variables]

Both the Pearson correlation and Spearman’s rho suggest that word count is highly negatively correlated with hapax and unique (as word count increases, hapax and unique decrease, and vice versa), but not with ambiguity.

[Figure: Spearman’s rho for word count and stylistic variables]

The fact that Spearman’s rho scores significantly higher than the Pearson correlation suggests that the relationship between the two is non-linear. This can be seen in the scatter plot.

[Figure: scatter plot showing the relationship between word count and uniqueness]

In the case of both variables, the correlation is obviously negative, but the data points fall in a non-linear way, suggesting that the Spearman’s rho is the better measure for calculating the relationship. In both cases it would seem that Joyce is the outlier, and most likely to be the author responsible for distorting the correlation.
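The reasoning here, that a rank-based measure copes better with a curved but steadily falling relationship, can be checked on toy data. The sketch below is plain Python and not the SPSS procedure used for the study. The word counts and the decaying curve are invented for illustration, and the helper functions are my own.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented data: a score that always falls as word count rises, but along a curve.
word_counts = [50_000, 100_000, 200_000, 400_000, 800_000, 1_600_000]
scores = [1 / sqrt(n) for n in word_counts]

print(f"Pearson:  {pearson(word_counts, scores):.3f}")
print(f"Spearman: {spearman(word_counts, scores):.3f}")
```

Because the toy scores never rise as word count rises, Spearman’s rho is a perfect negative correlation, while Pearson’s is weaker because the points do not lie on a straight line.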

[Figure: scatter plot displaying the relationship between word count and hapax density]
[Figure: Pearson correlations for word count and each stylistic variable]

SPSS flags the correlation between hapax and unique as being significant, and this is clearly the most noteworthy relationship between the three stylistic variables. The Spearman’s rho exceeded the Pearson correlation by a marginal amount, and it was therefore decided that the relationship was non-linear, which is confirmed by the scatter plot below:

[Figure: Spearman’s rho correlation for word count and stylistic variables]

The stylistic variables of unique and hapax are therefore highly correlated.

VI: Conclusion

As was said already, the notion that stylistic variables are correlated stands to reason. However, it was not until the correlation tests were carried out that the extent to which uniqueness and hapax are determined by one another was made clear.

The biggest issue with this study is one that is still present within digital comparative analyses of literature generally: our apparent incapacity to compare texts of differing lengths. Attempts have been made elsewhere to account for the huge difference that a text’s length clearly makes to measures of its vocabulary, such as vectorised analyses that take measurements in 1000-word windows, but none have yet been wholly successful in accounting for this difference. This study is therefore one among many which presents its results with some qualifications, given the extent to which corpora of similar lengths clustered together. The only author who violated this trend was Joyce, who, despite a lengthy corpus of 265500 words, has the highest values for hapax and uniqueness, which marks his corpus out as idiosyncratic. Joyce is therefore the only one of the twenty-one authors whose writing style can be meaningfully distinguished from the others on the basis of the stylistic variables, because he so egregiously reverses the trend.

But we hardly needed an analysis of this kind to say Joyce writes differently from most authors, did we?
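For anyone curious about the windowed measurements mentioned in the conclusion, here is a rough Python sketch of one version: uniqueness is computed inside consecutive windows of a fixed number of words and then averaged, an approach sometimes called the mean segmental type-token ratio. It was not part of this study. The function name, the choice to drop a trailing partial window, and the toy ‘texts’ are all my own assumptions.

```python
def windowed_uniqueness(tokens, window=1000):
    """Mean uniqueness over consecutive, non-overlapping windows of `window` words.

    Assumption: a trailing window shorter than `window` is dropped.
    Returns None if the text is shorter than one window.
    """
    scores = []
    for start in range(0, len(tokens) - window + 1, window):
        chunk = tokens[start:start + window]
        scores.append(len(set(chunk)) / window)
    return sum(scores) / len(scores) if scores else None


def whole_text_uniqueness(tokens):
    """Uniqueness as defined in the post: distinct types over total words."""
    return len(set(tokens)) / len(tokens)


# Invented 'texts' that cycle through the same 500 words; one is twice as long.
short_text = [f"w{i % 500}" for i in range(10_000)]
long_text = [f"w{i % 500}" for i in range(20_000)]

print(whole_text_uniqueness(short_text), whole_text_uniqueness(long_text))
print(windowed_uniqueness(short_text), windowed_uniqueness(long_text))
```

On these toy texts the whole-text score halves when the length doubles, while the windowed score stays put. Real prose is not this tidy, which is part of why such measures have not settled the question.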


I’ve yet to tell anyone what my PhD research question is without boring them. In the interests of brevity, key in not murdering conversational rhythm dead, I’m not above lying about what it involves, so I tell people I’m counting which authors use full stops and how many, and what that might mean. I suppose that I can’t blame them; just the word ‘modernist’ turns people off.

So, what I am actually doing is using an open-source programming language (R) to ingest and index a large corpus of modernist prose authors (using a wide-ranging definition of ‘modernist’ to bring us beyond the tens and twenties of the nineteen hundreds to the fifties, in order to include people like Doris Lessing, for example) and compare them, on the basis of a largely arbitrary range of stylostatistical indices (richness of vocabulary, sentence length and punctuation usage, among others), with a number of living authors who have, at one time or another, identified themselves as writing within the modernist tradition, as re-vivifying a presumably extinct ethic of novel-writing. These contemporary modernists will be Eimear McBride, Will Self & Anne Enright.

My hope in doing so is to move beyond the essentialistic critical reception of Anne Enright and Eimear McBride as existing within a canon of Irish modernism, consisting only of Joyce, Beckett and Flann O’Brien, which reviewers are always keen to broach in analysing their works. Who’s to say Gertrude Stein might not be a better comparison? Or Proust? Or Woolf? Via computation and pseudo-formalistic analysis, I hope to focus my comparisons, and the comparisons of others, a bit more accurately.

All this justifies the Hegelian trajectory sometimes imposed on discussions of the novel as a genre; as if there was the modern novel, then there was the post-modern novel and now there is what we have now, the execrably named post-post-modern novel, or the newly sincere novel, which isn’t much better. How are we to draw these lines, and are literary scholars doomed forever to cut the timeline of literature into ever thinner slices?

It is David Foster Wallace, I think, who offers us the two best means of segmenting the modern from the post-modern in literary terms, by shaking his head and refusing to answer. But then he does answer, in two ways, though the first answer is Foster Wallace’s way of not answering, while still making a very astute point.

Answer the First

‘After modernism.’

Answer the Second

‘…there are certain, when I’m talking about post-modernism, I’m talking about, maybe the black humourists who came along in the nineteen sixties, post-Nabokovians, Pynchon, and Barthelme, and Barth, De Lillo…Coover…’

What engages Wallace about these authors, as he goes on to explain in the interview, is the fact that they wrote novels that were absolutely bristling with self-conscious possibilities; of the text as a text that is mediated, constructed, conflicted, created in the act of its reading, writing and post-mortem discussion(s), the writer as historically constructed, discursive persona and the reader as persona. So we have two things we can probably say about literary postmodernity. The first is that it is a temporal phenomenon, kicking off after whenever it is that modernism petered out, and the second is that a post-modern text is more self-conscious than a modernist one.

My own take would introduce a third encapsulation, and that is that post-modernism is an outgrowth from, and potential response to, modernism, rather than a rejection. This will come as a surprise to exactly zero people, and gets me to the fault line of this issue; that it is impossible to speak in broad terms about any literary grouping worth discussing that wouldn’t be essentially true of any other one. Literature’s pesky way of valuing ambiguity, referentiality and innovation ensures this.

As I was reading Katherine Mansfield’s Collected Short Stories, and Virginia Woolf’s novel The Voyage Out, I was trying to locate some qualitative phenomenon that one would not find in a post-modernist novel. And I was unsuccessful in doing so. I might say that post-modernists are more prone to textual experimentation than the modernists were; I’m always disappointed by modernist writers’ words appearing in a linear, left to right, up to down way. You’re more likely to find an image, a font change, or interruptive clause in the counter cultural writers coming in Gaddis’ wake.

But self-consciousness is not a quantifiable phenomenon, and to say that it increases or decreases is at least a little futile. (In the context of a literary discussion that is. Given a wide enough scope of inquiry, everything is futile.) To say that post-modern novels are self-conscious to an extent that was impossible before the sixties is untrue; Don Quixote encounters a counterfeit version of himself during one of his sagas, which was Miguel de Cervantes’ clever method of criticising those who were distributing pirated, unofficial and non-canonical versions of the Quixote. Laurence Sterne also provides a blank page in The Life and Opinions of Tristram Shandy, Gentleman, so that the reader may draw a character according to how they think she might look. As always, far more valuable literary discussions operate in the range of the qualitative rather than the quantitative. As such, back to Mansfield.

One could turn to a story such as ‘Psychology’ for example, which appears in Bliss and Other Stories. It is a story of about six pages, deriving its title from a pseudo-scientific movement that was then disrupting the notion that the self was knowable, and that we acted according to rational impulses. It’s a bold title, and by choosing it, Mansfield promises us much about what it is that motivates us, how we judge, how we interpret. But, rather than calling the story something like ‘What It Is To Be Human,’ she calls it ‘Psychology,’ shifting the focus from some Platonic realm wherein such lines of enquiry are easily defined, to the discipline or institution of psychology itself. Which is of course, carried out by a human agent, just as flawed and prone to unreason as the subject, and, in Mansfield’s time at least, male. And no one writes about how stupid men can be better than Mansfield.

The story represents two unnamed characters, male and female. The narrator makes it clear that they are deeply attracted to one another, perhaps even in love, but something, whether it be their own defensiveness or social convention, prevents them from expressing it. Mansfield represents this by doubling the presences in the text, providing each character with a ‘secret self.’ Significantly, these secret selves, at one or two points speak with the same voice:

‘Why should we speak? Isn’t this enough?’

Their ‘real’ conversation is stilted and awkward. The male character makes up an excuse to leave and in response, the female character inwardly rages:

‘You’ve hurt me; you’ve hurt me! We’ve failed!’ said her secret self while she handed him his coat and stick, smiling gaily.

In her despair, the female character is overly affectionate and glad to receive a normally unwelcome friend, then writes a letter to the departed object of her affection, in which she is far more at home with expressing herself, almost as if the mediated, imaginative space of a letter is far more comfortable than the ‘real’ social encounter, in which both of them flailed.

The subject they discuss is the ‘psychological novel,’ which I have seen practising modernist authors use as a term for the work that they and their contemporaries are doing with the novel form. (Joyce refers to Proust taking it as far as it can go in À la recherche.)

It might not be a stretch to see Mansfield as doing some meta-commentary in referring to the psychological novel, and in having here two characters whose inner, imaginative psychology is far more illustrative than their outer, social one. So, we have a story that is pointing to its ‘about-itselfness’ throughout, a narrative concerning the discontinuity of self-hood and the intractable crevasse that separates our inner being from the outer world. The contours of the inner/outer are perhaps more clearly drawn than you’d get in something written today, but were a double-blind test to be arranged, adjusted for historical changes (the appearance of trains and telegrams v. planes & the internet, the emphasis upon social convention, the use and meaning of the word ‘gay’), I’m not sure that a reader could be relied on to tell the difference between a modernist and a post-modern text.

Maybe it might be more useful to say that post-modernism is like modernism, only more so.