Computational Linguistics and the Dating of Early Irish Texts

Workshop and Lecture Series hosted by the members of the Chronologicon Hibernicum Project

15 December 2016.

Source: https://www.maynoothuniversity.ie/chronologiconhibernicum

Onscreen presentation offers advantages that could not have been imagined by the editors of previous generations.  One of the major advantages hypertext editions have over print is that they are fully searchable.  This has benefits beyond mere convenience: Thorlac Turville-Petre has recently argued regarding the study of Middle English linguistics through the digital medium that searchable texts and electronic concordances serve as powerful aids to full and accurate analyses of the language – a point that has not been overlooked by scholars of Medieval Irish.

Recent projects in Medieval Irish Studies have made an increasing number of digital resources available to the researcher.  Chronologicon Hibernicum (ChronHib) – A Probabilistic Chronological Framework for Dating Early Irish Language Developments and Literature is an ERC-funded research project that creates tools to facilitate the study of language development.  Its primary aim is to refine the methodology for dating Early Medieval Irish linguistic development and to build a chronological framework of linguistic changes that can be used to date literary texts within the Early Irish period (ca. 6th to mid-10th century).  The project pursues this through statistical methods for the seriation of linguistic data and through Bayesian inference for estimating dates.  A further goal is to harness the potential of existing digital resources and to develop new publicly available corpora, both to help date Old and Middle Irish texts and to gain deeper insights into the development of the phonology, morphology, syntax and lexicon of the Irish language.
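To give a rough sense of what estimating a date by Bayesian inference can look like, here is a minimal Python sketch.  It is not ChronHib's model: the logistic curve, its midpoint and rate, the candidate years and the counts of innovative and conservative forms are all invented for illustration.  The idea is simply that if a linguistic change spreads over time, the proportion of innovative forms in a text makes some dates of composition more probable than others.

```python
import math

def p_innovative(year, midpoint=800.0, rate=0.02):
    """Hypothetical logistic curve: expected share of innovative forms
    in a text written in `year`. Midpoint and rate are made-up values."""
    return 1.0 / (1.0 + math.exp(-rate * (year - midpoint)))

def date_posterior(innovative, conservative, years):
    """Posterior probability of each candidate year, assuming a flat prior
    over `years` and a binomial likelihood for the observed form counts."""
    log_like = []
    for year in years:
        p = p_innovative(year)
        log_like.append(innovative * math.log(p)
                        + conservative * math.log(1.0 - p))
    # Subtract the peak before exponentiating to avoid numerical underflow.
    peak = max(log_like)
    weights = [math.exp(value - peak) for value in log_like]
    total = sum(weights)
    return [w / total for w in weights]

# Candidate dates in ten-year steps across the Early Irish period (assumed grid).
years = list(range(600, 951, 10))

# Invented counts: 12 innovative and 28 conservative forms in one text.
posterior = date_posterior(innovative=12, conservative=28, years=years)
best_year = years[posterior.index(max(posterior))]

print("Most probable decade:", best_year)
```

A real framework would have to combine many linguistic variables, each with its own uncertain trajectory, and would report a range of plausible dates rather than a single year; the sketch only shows the basic step of turning observed forms into a probability distribution over dates.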

In early December, I was invited to attend a morning workshop and an afternoon lecture series entitled “Computational Linguistics and Dating of Early Irish Texts”, hosted by the members of the Chronologicon Hibernicum project at the National University of Ireland, Maynooth.  The aim of the workshop was to discuss how to harmonise the annotation schemes and headword choices in the existing linguistically parsed databases of Early Irish: the Milan Glosses (http://www.univie.ac.at/indogermanistik/milan_glosses.htm, Griffith 2013), the Priscian Glosses (http://www.univie.ac.at/indogermanistik/priscian/, Bauer 2015) and the Parsed Old and Middle Irish Corpus (https://www.dias.ie/celt/celt-publications-2/celt-theparsed-old-and-middle-irish-corpus-pomic/, Lash 2014), as well as the various in-progress databases, including the Annals of Ulster database, the Minor Glosses database and the Poems of Blathmac database.  Additionally, the workshop aimed to examine how existing online text repositories, such as CELT (https://www.ucc.ie/celt/) and Thesaurus Linguae Hibernicae (https://www.ucd.ie/thl/), could be utilised to advance the study of Early Medieval Irish linguistics.

Following some brief introductions, Prof. David Stifter, the principal investigator of ChronHib, discussed the progress of the project to date and presented the participants with a draft of the Guidelines for Analysis and Mark-up in ChronHib’s Lexical Databases.  A lively debate ensued and a number of issues were discussed, including the following:

  • Issues of importing data from existing databases such as CELT and TLH
  • Exploring the possibilities of tagging texts automatically
  • Database server and web-enabled data entry – IT
  • Prototyping and dirty data
  • Usability studies
  • Creating a live database
  • Formatting of the Masterfile

The afternoon lectures explored computational methods for linguistic research, especially as they apply to the study of Old and Middle Irish.  The three speakers each addressed a different aspect of computational linguistics – from building corpora of historical languages, to the application of computational approaches to linguistic analysis and unsupervised machine learning.  First up was Dr. Marius Jøhndal, a linguist researching the syntax of Latin at the Department of Philosophy, Classics, History of Art and Ideas at the University of Oslo.  Dr. Jøhndal discussed his experiences of the PROIEL Treebank project and the degree to which computational approaches can be applied when building corpora of historical languages.  The next speaker, Dr. Aaron Griffith of Utrecht University, examined the distribution of pre-verbal ceta ‘first’ < *kintu- in the glosses in order to demonstrate the kinds of research questions that can be addressed using digital corpora of historical languages.  Lastly, Prof. Gregory Toner provided an overview of the methods and some of the results of an ongoing project at Queen’s University, Belfast, which applies unsupervised machine-learning techniques from computational linguistics to the dating of medieval Irish texts.

All in all, the day was a success.  It was particularly exciting for a student of Digital Humanities with a primary research interest in Medieval Irish Studies to participate in such a vibrant discourse concerning the implementation of digital technology in the study of Medieval Irish linguistics.  It also raised questions for me regarding how Digital Humanities is understood in other humanities disciplines.  Throughout the morning session, participants encouraged members of the project to approach IT professionals and database specialists to resolve issues of interactivity and design.  When I enquired as to why the project had not thought to approach the Digital Humanities Department, a department with which the project shares a floor at Maynooth University, I was informed that they had not considered that databases formed part of the Digital Humanities subject matter.  For me, this points to a breakdown in communication between Digital Humanities and the wider humanities community.  I cannot help but wonder how many opportunities have been missed as a consequence of similar misunderstandings.

Speakers and presentation titles:

  • Marius Jøhndal – “Building and using online corpora for (historical) linguistic research”
  • Aaron Griffith – “Pre-verbal ceta ‘first’ in the glosses (and some thoughts of the origin of the notae augentes)”
  • Gregory Toner – “Machine learning and the dating of Medieval Irish texts”

Works Cited:

Thorlac Turville-Petre, ‘Editing Electronic Texts’, in Probable Truth: Editing Texts from Britain in the Twenty-First Century, eds V. Gillespie and A. Hudson, pp. 55-70, at pp. 61-2.