“Culturomics is the application of high-throughput data collection and analysis to the study of human culture” (Michel et al. 181).
However, to understand how culturomics came into being, it is useful to begin with some background information.
The Google Books project was officially launched in October 2004, and the Google Print Library Project, more commonly known as Google Book Search (GBS), was announced two months later (Eichenlaub 4). Originally, the aim of the GBS project was to digitise published books that were in the public domain, and it rested on a few partnerships, namely with the New York Public Library and some highly esteemed universities such as Stanford University and Oxford University (Ibid 4-6). However, the aims of GBS soon extended to digitising books that were still under copyright. Despite controversies with publishers and authors, which resulted in some high-profile lawsuits, other libraries and universities joined the project, including some from European countries (Ibid 6). By 2010, Google had scanned 12 million books (Eichenlaub 6). By 2013, more than 15 million books had been digitised in the GBS project, or “12% of all books ever published” (Rossi et al. 344).
The first edition of Ngram Viewer was “unveiled” in December 2010 (Hand) as a unique tool “capable of precisely and rapidly quantifying cultural trends based on massive quantities of data” (“Google N-gram Viewer” – Culturomics). It drew on a selected Google corpus of 5.2 million books in English, French, Spanish, German, Hebrew, Chinese and Russian, published from 1500 to 2008 (Hand; P. Cohen). It was designed to allow users to analyse the frequency of words, metaphors or phrases of up to five words, so it is possible, for example, to compare word trends in French and English by searching for woman vs. femme (“Google N-gram Viewer” – Culturomics). However, it was suggested that the data in Ngram Viewer are most reliable for English between 1800 and 2000 (Ibid). The co-creators of Ngram Viewer also conducted a study based on the tool; its publication coincided with the launch of Ngram Viewer and with the coinage of the term culturomics. Culturomics received a mixed reception, and the following annotated bibliography is an attempt to document these views by way of a critique.
Michel, Jean-Baptiste et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331.6014 (2011): 176–182. Web. 26 Nov. 2014. <http://stevenpinker.com/files/pinker/files/michel_et_al_quantitative_analysis_of_culture_science_2011.pdf>.
First published online in Science on 16 December 2010, this article coincided with Google’s launch of Ngram Viewer. Michel et al. set out to examine linguistic and cultural change using a method they call culturomics, a term they coined for the purpose. They describe their use of Ngram Viewer’s 5.2-million-book corpus, which they estimate to be 4% of all books ever published, and argue that a statistical analysis of this corpus makes it possible to study linguistic and cultural change quantitatively. The study highlights two salient factors that contribute to what they term “culturomic trends”: cultural change influences “the concepts we discuss”, while linguistic change “affects the words we use for those concepts”. In this assessment Michel et al. examine the growth of new words in the English lexicon by comparing words against editions of dictionaries, while also analysing words that have become obsolete. Thus, they believe “culturomic tools” will assist lexicographers in re-evaluating words that are no longer in common usage and in identifying lower-frequency words that are missing from dictionaries. Of interest is their probing of how government censorship suppresses individuals and groups; they offer a notable example through a comparison of German and English texts, showing how the Nazi regime suppressed references to the work of Marc Chagall, a Jewish artist. They also offer examples of how quickly people become famous in different periods, of collective memory, and of how quickly societies forget. In the article, Michel et al. blend narrative, mathematical equations, graphs, and some good old-fashioned excitement about the potential of Ngram and culturomics results as “a new type of evidence in the humanities.”
Hand, Eric. “Culturomics: Word Play.” Nature News 474.7352 (2011): 436–440. Web. 19 Nov. 2014. <http://www.nature.com.jproxy.nuim.ie/news/2011/110617/full/474436a.html>.
This article is an appropriate starting point for exploring the background to the inception of Ngram Viewer, which is fundamental to the concept of culturomics. Eric Hand provides useful information on the background of Erez Lieberman Aiden, a molecular biologist and applied mathematician, who collaborated with Jean-Baptiste Michel, another scientist and mathematician, to create the Google Ngram Viewer. Interestingly, Hand describes how the concept of Ngram Viewer grew out of a study by Lieberman Aiden, Michel and others that demonstrated how verbs became regularised over time, from the Old English period through to Modern English. In that study they used traditional humanities methods to uncover irregular verbs in use from Old English to Middle English, by the slow task of “[s]ifting through old grammar books.” Hand explains that the slow process of building a dataset for the study prompted Lieberman Aiden et al. to create a new tool that would overcome some of the difficulties they had endured. Hence the idea for Ngram Viewer was born. This type of insight is often neglected by humanities critics, who tend to refer to the culturomics team solely in terms of their backgrounds in science.
Lieberman, Erez; Michel, Jean-Baptiste; Jackson, Joe; Tang, Tina and Nowak, Martin A. “Quantifying the Evolutionary Dynamics of Language.” Nature 449.7163 (2007): 713–716. NCBI PubMed. Web. 16 Nov. 2014.
In this study, Lieberman et al. demonstrate how verbs became regularised over time, from the Old English period through Middle English to Modern English. To do so, they examined a large compilation of grammar textbooks covering these periods to extract irregular verbs and observe how they evolved over time according to how frequently they were used. To obtain frequency data they used the CELEX database of 17.9 million words, drawn from an assortment of textual sources. They then applied computational methods, by means of algorithms written in Python, and equations to perform a quantitative analysis and produce figures and tables of their findings. While the findings seem impressive, it is difficult for the average student without good technical knowledge to follow their quantitative methods and thus to understand how they arrived at their findings. Nonetheless, the article is worth reading in order to observe such research processes and to see why Lieberman and Michel were prompted to develop a software tool such as Ngram Viewer to assist researchers studying linguistic usage and lexical patterns over time.
Cohen, Patricia. “In 500 Billion Words, a New Window on Culture.” The New York Times 16 Dec. 2010. NYTimes.com. Web. 17 Nov. 2014. <http://www.nytimes.com/2010/12/17/books/17words.html>.
Writing shortly before the launch of Ngram Viewer and the public release of the accompanying publication, “Quantitative Analysis of Culture Using Millions of Digitized Books” by Jean-Baptiste Michel et al., Patricia Cohen claims that no “fanfare” was expected from Google for the forthcoming launch. Ironically, Cohen makes up for this with her coverage in The New York Times, in which she mentions that the article by Michel et al. would be available free of charge to non-subscribers in the online edition of Science. Cohen also suggests that the aim of the study was to “demonstrate how vast digital databases can transform our understanding of language, culture and the flow of ideas.” This type of sensationalist statement may not have gone down well with some humanities scholars. However, she adds that Michel and Lieberman Aiden had already anticipated resistance from humanities scholars, and therefore emphasised that culturomics merely provides information that still needs interpretation. The article also provides background on the concept of Ngram Viewer and culturomics, which grew out of the study undertaken by Jean-Baptiste Michel and Erez Lieberman Aiden from 2004 on verb irregularity in the English language, for which it took 18 months to collect the data before anything could be examined for evidence. Consequently, upon reading about Google’s plan to develop a digital library, Michel and Lieberman Aiden realised its potential for the type of study they were conducting and approached the director of research at Google, Peter Norvig, who was receptive to the idea.
Cohen, Dan. “Initial Thoughts on the Google Books Ngram Viewer and Datasets.” Dan Cohen 19 Dec. 2010. Web. 19 Nov. 2014. <http://www.dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/>.
Initially, Cohen expresses his excitement about Ngram Viewer as an “easy-to-use” tool with the potential to introduce scholars to digital research methods, declaring it a “gateway drug” in digital humanities. However, regarding the coinage of the term culturomics, Cohen is somewhat critical that Michel et al. could claim precedence in this type of quantitative analysis by attaching their own label to it without at least acknowledging related work in the domain (for example, a project he was involved with called “Victorian Books: A Distant Reading of Victorian Publications”). Moreover, Cohen points out the absence of a humanities scholar among the authors of the article by Michel et al.; nonetheless, he adds that since “digital humanities is nice”, the culturomics team ought to be welcomed to the domain. In examining the obvious problems with OCR and with using large data sets, Cohen commends the culturomics team and Google for admitting that these problems exist, and suggests optimistically that matters should improve over time, with the added bonus that Ngram also provides an opportunity for multilingual comparisons. However, Cohen has a genuine objection to the term “culturomics” itself: he suggests that the name implies a similarity to genomics, and thus that culturomics will provide a complete dataset for the study of culture, like the human genome. Nonetheless, Cohen still applauds some of the findings teased out by Michel et al., particularly in the area of censorship and the suppression of authors by government regimes. However, he is disappointed that the tool does not allow for a smoother transition from distant reading to close reading for the purpose of deeper historical research.
Morse-Gagné, Elise E. “Culturomics: Statistical Traps Muddy the Data.” Science 332.6025 (2011): 35. www.sciencemag.org. Web. 20 Nov. 2014.
In response to Michel et al.’s “Quantitative Analysis of Culture Using Millions of Digitized Books,” Elise Morse-Gagné questions the authors’ assumption that counting words in the English language is statistically meaningful, and indicates that such attempts tend to be linked to efforts to prove the supremacy of the English language. Morse-Gagné argues that the count merely reflects the global ratio of English speakers and publications in English, and is thus neither an expression of the number of English words available to a speaker nor an indication of the growth of the English lexicon. Indeed, she highlights that Michel et al. have measured only the written record, not the spoken record, of the English lexicon.
Schwartz, Tim. “Culturomics: Periodicals Gauge Culture’s Pulse.” Science 332.6025 (2011): 35–36. www.sciencemag.org. Web. 28 Nov. 2014.
Commending Michel et al. for their analysis of a Google Books corpus to identify shifts over time and in culture, Schwartz adds that their methods are an important contribution to the use of digital techniques as a “window into history.” Nevertheless, he points out that the study was not in fact conducted on 4% of all books ever published, but on a sample corpus of 4% of published books that were “deemed worthy” of being digitised by Google and the GBS partners. Schwartz also notes that, because of publication delays, periodicals are closer in time to events than books are; he therefore suggests that using the written record to analyse culture should involve a more varied range of textual sources.
Aiden, Erez Lieberman, Joseph P. Pickett, and Jean-Baptiste Michel. “Culturomics—Response.” Science 332.6025 (2011): 36–37. www.sciencemag.org. Web. 21 Nov. 2014.
In response to Morse-Gagné, Lieberman Aiden et al. share her concerns about other efforts to prove the superiority of the English language. Nonetheless, they defend their findings by reiterating that their article made clear what they were measuring, namely a “‘word’ as a meaningful string of alphabetic characters that is free of typographical errors and that appears in the text of published books with a frequency greater than one part per billion.” Thus, they claim that they took care to avoid the subjectivity associated with other attempts to measure the size of the lexicon, and so to avoid accusations of Anglocentrism. In response to Schwartz, they agree that their study would have benefitted from the inclusion of other types of written data, such as newspapers and magazines, and add that their article did place some emphasis on this point. However, they chose to focus on books because these were an available source of data, the digitisation of other published source material being in its infancy. From this, one observes the need for digital scholars to be very clear and concise in describing their methodologies, their sources and their rationales for using such sources, in order to justify their findings and conclusions. Of interest here, however, is the failure of Lieberman Aiden et al. to comment on Schwartz’s suggestion that their study was based on a sample corpus representing 4% of published books that were “deemed worthy” of being digitised by Google and the Google Books partners. As in other literature, this point is perhaps kept out of sight, and it is worth looking at more closely in the context of the politics of digitisation in the Google Books project.
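The working definition of a “word” quoted in this response can be illustrated with a minimal Python sketch. It is not the authors’ code: the function name lexicon and the toy counts are hypothetical, the requirement that a word be “meaningful” and free of typographical errors is not modelled, and the test for alphabetic characters relies on Python’s own notion of a letter.

```python
def lexicon(counts, threshold=1e-9):
    """Return the tokens that count as 'words' under a frequency threshold.

    counts    -- hypothetical mapping of token -> number of occurrences
    threshold -- minimum relative frequency; 1e-9 is one part per billion
    """
    total = sum(counts.values())
    if total == 0:
        return set()
    return {
        token
        for token, count in counts.items()
        # Keep purely alphabetic tokens whose share of the corpus exceeds the threshold.
        if token.isalpha() and count / total > threshold
    }


# Toy corpus with an exaggerated threshold so the effect is visible.
toy_counts = {"verb": 50, "x9": 30, "hapax": 1, "the": 919}
toy_lexicon = lexicon(toy_counts, threshold=0.01)
```

With these toy numbers, "x9" is excluded because it is not alphabetic and "hapax" because it falls below the threshold, which shows how the size of the measured lexicon depends on where the cut-off is drawn.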
Hitchcock, Tim. “Historyonics: Culturomics, Big Data, Code Breakers and the Casaubon Delusion.” Historyonics 19 June 2011. Web. 21 Nov. 2014. <http://historyonics.blogspot.ie/2011/06/culturomics-big-data-code-breakers-and.html>.
Tim Hitchcock presents an array of criticisms of Ngram Viewer and culturomics, some of which echo those made by Dan Cohen. Hitchcock begins by suggesting that the brand name “culturomics”, and the team behind it, ignored all earlier efforts by digital humanists in history, linguistics and library science to test new methodologies and create new modes of research for acquiring evidence. Indeed, he adds that the actual achievement of the culturomics team was overstated by triumphalist publicity, and insists that Ngram Viewer should at best be used alongside other methods of historical analysis. Moreover, what worries Hitchcock is not merely the hyperbole associated with the study but the excitement of scientists at discovering a “new science”, since it suggests that results obtained with Ngram Viewer might eventually supersede other modes of historical analysis. It seems that Hitchcock is most annoyed by the fact that Michel and Lieberman Aiden, as mathematicians, use patterns and mathematical equations to decipher historical development. Interestingly, Hitchcock compares culturomics to the experimental period of cliometric history, when historians used large data sets to formulate explanations of human behaviour in past societies. However, he claims that the difference between the two is that cliometrics began with a “model of how societies work” and tested that model against the statistics, whereas culturomics merely sets out to find patterns and then expects that encoded meanings will be discovered in them.
Indeed, Hitchcock goes so far as to imply that culturomics is a form of “code breaking” of the kind that was apparent in “analytical positivism”, similar to the projects that attempted to crack DNA codes from the 1950s onwards; pointing to Lieberman Aiden’s background as a molecular biologist, he implies that culturomics is a method of breaking the code by which human society functions. Thus, Hitchcock is pessimistic about the future of culturomics as a research method in historical analysis, as he claims it is not based on a model that accommodates the overall purpose of history. He adds that culturomics fails to recognise that history links interpretations of the past to the present, in a bid to find a greater understanding of how societies function today.
Meeks, Elijah. “The Digital Humanities as Culturomics.” Digital Humanities Specialist 23 June 2011. Web. 20 Nov. 2014. <https://dhs.stanford.edu/the-digital-humanities-as/the-digital-humanities-as-culturomics/>.
This blog post is of interest because Elijah Meeks notes that Aiden and Michel gave keynote speeches on culturomics at the DH11 conference. He felt encouraged by them, as they reflected the potential benefits of scientists and humanists learning from each other’s research methods. For Meeks, this would only serve to strengthen humanities scholarship and its educational scope.
Willems, Klaas. “‘Culturomics’ and the Representation of the Language of the Third Reich in Digitized German Books.” Interdisciplinary Journal for Germanic Linguistics and Semiotic Analysis 18.1 (2013): 87–99. Ghent University Academic Bibliography. Web. 23 Nov. 2014. <https://biblio.ugent.be/publication/4126846>.
In this article Willems examines the reliability of Ngram Viewer as a tool in “culturomic investigations” based on a monolingual corpus, in this case the available corpus of German books, estimated at 37 billion words. Specifically, Willems uses a case study to focus on the frequency of German expressions most commonly associated with the language of the Third Reich, taking the phrases and nouns identified by Michael and Doerr in their book published in 2002. Willems claims that the Nazi era is divided into two periods, the first from 1918 to 1933, when the movement developed as the National Socialist party, and the second from 1933 to 1945; in his study, however, he treats the period of the Third Reich as including both. He observes that some formulations arose during the period of the Third Reich, for example “Blutschutzgesetz ‘(Nuremberg) Blood Protection Act’”, while suggesting that other Nazi expressions already existed in the German lexicon but became more frequent with the onset of National Socialism. Indeed, Willems’s “culturomic” case study confirms many of the findings of Michael and Doerr’s study, with a few exceptions that he maintains need to be explained. For example, Michael and Doerr’s study implied that the term charakterlich was a Nazi expression referring to German character; however, Willems’s Ngram study of the word shows that it had been used previously. Upon further inspection, Willems finds that a similar expression, charakteristisch, is treated by Ngram as the same word because of flaws in the OCR of older printed German texts. Thus, Willems adds that the quantitative results produced by Ngram may be a consequence of inaccuracies in lemmatisation.
Although he does not assess the extent to which this may occur, it is a significant finding vis-à-vis the culturomics team’s claim that the tool is a significant research aid in the study of human culture. Willems concludes that the tool is useful for cultural and semiotic analyses, but that the problem of lemmatisation errors needs to be taken into account, and that the tool is not an end in itself; rather, it is useful for retrieving data that may have potential for further study. He does suggest, however, that once the tool incorporates data from other sources such as periodicals, its potential will increase further. Soon after Willems’s study, the culturomics team launched the second edition of Ngram Viewer.
Lin, Yuri; Michel, Jean-Baptiste; Lieberman Aiden, Erez; Orwant, Jon; Brockman, Will and Petrov, Slav. “Syntactic Annotations for the Google Books Ngram Corpus.” Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Republic of Korea: Association for Computational Linguistics, 2012. 169–174. Web. 17 Nov. 2014. <http://aclweb.org/anthology/P/P12/P12-3029.pdf>.
This conference paper discusses the second edition of the Google Books Ngram Corpus, which contains over 8 million books in eight languages and is described as “6% of all books ever published” (169-170). The authors describe how their methods have been updated since the first edition and mention improvements in OCR, and hence in Ngram Viewer’s handling of older historical texts. With regard to the flaws detected earlier, they believe this edition is an improvement, though by no means “perfect”. In particular they refer to the recognition of the long s, which OCR tends to read as an f. Nonetheless, they suggest that, through aggregation, the tool remains useful for studying the evolution of grammar. Of interest, the paper is written in the manner of a scientific procedure, somewhat emotionless in comparison to the culturomic excitement displayed in “Quantitative Analysis of Culture Using Millions of Digitized Books”.
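The long s problem mentioned in this paper can be made concrete with a small Python sketch. It is an illustration only, not the method used by Lin et al.: the helper names misread_form and merge_misreads are hypothetical, the counts are invented, and the rule applied (every s except a word-final one is read as f) is a deliberate simplification of historical printing practice.

```python
from collections import Counter


def misread_form(word):
    """Spelling an OCR engine might produce if each long s were read as 'f'.

    Simplifying assumption: every 's' except a word-final one was printed
    as a long s. Real typographic rules were more varied.
    """
    if len(word) < 2:
        return word
    return word[:-1].replace("s", "f") + word[-1]


def merge_misreads(counts, vocabulary):
    """Fold the counts of likely long-s misreadings back into the real word."""
    merged = Counter(counts)
    for word in vocabulary:
        variant = misread_form(word)
        # Only merge variants that are not themselves genuine words.
        if variant != word and variant not in vocabulary and variant in merged:
            merged[word] += merged.pop(variant)
    return merged


# Invented counts: 'beft' stands for 'best' as misread in an older text.
raw_counts = Counter({"best": 90, "beft": 10, "fast": 5})
cleaned = merge_misreads(raw_counts, vocabulary={"best", "fast"})
```

In this toy example the ten occurrences of "beft" are added to "best", which suggests why uncorrected OCR can make a word appear rarer in early texts than it really was.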
Eichenlaub, Naomi. “Checking In with Google Books, HathiTrust, and the DPLA.” Computers in Libraries 33.9 (2013): 4-9. Academic Search Complete. Web. 26 Nov. 2014.
Rossi, Ernest, Jane Mortimer, and Kathryn Rossi. “Therapeutic Hypnosis, Psychotherapy, and the Digital Humanities: The Narratives and Culturomics of Hypnosis, 1800–2008.” American Journal of Clinical Hypnosis 55.4 (2013): 343–359. Taylor and Francis+NEJM. Web. 22 Nov. 2014.
“Google N-Gram Viewer – Culturomics.” Culturomics. N.p., n.d. Web. 16 Nov. 2014. <http://www.culturomics.org/Resources/A-users-guide-to-culturomics>.
Image Source: Example of the Long “S” in “The Blind Beggar of Bethnal Green” in Robert Dodsley’s Trifles (London, 1745) in West, Andrew. “BabelStone: The Rules for Long S.” Web. 28 Nov. 2014. <http://babelstone.blogspot.ie/2006/06/rules-for-long-s.html>.