Practicum: Topic Modelling the B.M.H.’s Witness Statements

For this project I’ll be aiming to create topic models using the Bureau of Military History’s Witness Statements as the corpus. These statements were collected by the Irish State between 1947 and 1957 and relate to the period from the formation of the Irish Volunteers on 25th November 1913 to the beginning of the ceasefire (which led to peace negotiations and ultimately the Anglo-Irish Treaty) that came into force on 11th July 1921. They were collected in order to document and reconcile primary-source testimony and accounts of the events of this period before those memories were lost through the passing of time: ‘to assemble and co-ordinate material to form the basis for the compilation of the history of the movement for Independence from the formation of the Irish Volunteers on 25th November 1913, to the 11th July 1921’ (Report of the Director, 1957).

As the B.M.H. website states, the collection is not exhaustive, and many anti-Treaty survivors of the period did not cooperate with the collection process because of its perceived ‘Free State’ patronage. Nonetheless, the statements do represent an extraordinary source for scholars of the period in light of the secretive nature of much of the operations of the time. Their digitisation and public dissemination make for a wonderful project from a public history point of view.

The scale of the corpus is quite large from a traditional scholarly standpoint: 1,773 statements, or approximately 36,000 pages of interview statements and letters pertaining to the period. Happily, optical character recognition (O.C.R.) has been used to make the text searchable and, in effect, allows us to separate the textual content from the digital surrogates of the hardcopy originals. This means we can take the digitised, extracted textual data and, using statistical natural language processing software, mine it for topics. Topic modelling works by identifying patterns in a textual corpus; it is essentially a form of statistical modelling that teases out topics by applying algorithms which respond to factors like word recurrence and proximity. Ideally, this allows hidden themes and meanings to be drawn from a large corpus of text that might evade or overwhelm traditional human scholarly interrogation.
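To make that idea a little more concrete, here is a minimal toy sketch of one common topic modelling algorithm, latent Dirichlet allocation (LDA, the algorithm MALLET itself implements), using the Python library gensim purely for illustration. The three miniature ‘documents’ are invented placeholders, not real corpus text:

```python
# Toy illustration of LDA topic modelling with gensim.
# This is not the MALLET workflow used in the project; the three
# "documents" below are invented placeholders for illustration only.
from gensim import corpora, models

documents = [
    "volunteers drilled and paraded in dublin before the rising",
    "arms were landed and distributed among the volunteers",
    "the ceasefire led to negotiations and the treaty",
]

# Tokenise naively and build the bag-of-words corpus LDA expects.
texts = [doc.split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train a small LDA model; each topic comes out as a weighted word list.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic in lda.print_topics(num_words=5):
    print(topic)
```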

I’ll be using an open-source natural language processing software package called MALLET (MAchine Learning for LanguagE Toolkit) to train these topic models. It was developed by Andrew McCallum et al. at the University of Massachusetts Amherst. I am only now becoming familiar with the package for the purposes of this project, but I have found it very usable and quite accessible.
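For anyone curious what the basic workflow looks like, the sketch below wraps MALLET’s two standard command-line steps — importing a directory of text files, then training the model — in a small Python script. The file paths, topic count, and the location of the MALLET executable are placeholders of my own; the flags themselves are the standard ones described in the MALLET documentation and the Programming Historian tutorial:

```python
# Sketch of the basic two-step MALLET workflow, driven from Python.
# Paths and parameter values are placeholders; adjust to your install.
import subprocess

MALLET = "mallet-2.0.8/bin/mallet"  # path to the MALLET executable

# Step 1: import a directory of plain-text statements into MALLET's
# binary format, removing the default English stop words on the way.
subprocess.run([
    MALLET, "import-dir",
    "--input", "statements/",
    "--output", "witness.mallet",
    "--keep-sequence",          # required for topic modelling
    "--remove-stopwords",
], check=True)

# Step 2: train a topic model and write the top key words per topic
# plus the per-document topic proportions.
subprocess.run([
    MALLET, "train-topics",
    "--input", "witness.mallet",
    "--num-topics", "20",
    "--output-topic-keys", "witness_keys.txt",
    "--output-doc-topics", "witness_composition.txt",
], check=True)
```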

I have started this project using only a small selection of the whole data set, which allows me to refine my workflow and my knowledge of the software without running an already iterative process on a cumbersome data set. Once my workflow and competence are at an effective stage, I can introduce the entire corpus and hopefully produce some meaningful topic models.
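As a rough illustration, selecting such a working subset can be as simple as copying a random sample of the O.C.R.’d text files into a separate folder. The paths and the sample size below are placeholder assumptions:

```python
# One way to pull a small, reproducible random sample of statements
# for workflow testing before committing to the full corpus.
# Folder names and the sample size are placeholders.
import random
import shutil
from pathlib import Path

SOURCE = Path("statements")         # all 1,773 OCR'd text files
SAMPLE = Path("statements_sample")  # working subset for testing
SAMPLE.mkdir(exist_ok=True)

files = sorted(SOURCE.glob("*.txt"))
random.seed(42)  # fixed seed so the same sample can be regenerated
for f in random.sample(files, k=50):  # k must not exceed len(files)
    shutil.copy(f, SAMPLE / f.name)
```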

My first topics produced in MALLET are visible in the screenshots below. The next step in workflow design will be introducing named entity recognition packages to highlight the names of places, organisations, and people which may corrupt the topics produced. Identifying these will allow me to enhance the stop word lists (lists of words which the machine should ignore and not consider in its training) with corpus-specific terms; a sketch of the approach follows below.

[Slideshow: screenshots of the first topics generated in MALLET]
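As a sketch of what that step might look like, the snippet below uses spaCy — one candidate NER package; choosing it here is my assumption, not a settled decision — to harvest person, place, and organisation names from the sample and write them out as an extra stop word file that MALLET can consume at import time:

```python
# Sketch: use spaCy's named entity recogniser to collect the names of
# people, places, and organisations across the sample corpus, then
# write them out as an extra stop word list for MALLET.
# Assumes spaCy and its small English model are installed
# (python -m spacy download en_core_web_sm). Paths are placeholders.
from pathlib import Path
import spacy

nlp = spacy.load("en_core_web_sm")
entity_words = set()

for path in Path("statements_sample").glob("*.txt"):
    doc = nlp(path.read_text(encoding="utf-8"))
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "GPE", "ORG"}:
            # add each token of the entity, lowercased, as a stop word
            entity_words.update(t.lower() for t in ent.text.split())

# MALLET's import-dir step can take this file via --extra-stopwords.
Path("entity_stopwords.txt").write_text(
    "\n".join(sorted(entity_words)), encoding="utf-8"
)
```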

Once meaningful and clean topics (not filled with names of places, for example) are generated, I can begin to model the data. The shape and style of such models may be dependent on the data produced, but at this stage I intend to use Gephi, an open-source software package for visualising and exploring network data, to model and display the topics.
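As a rough sketch of how that hand-off might work, the snippet below reads MALLET’s document-topic composition file and writes a bipartite document-topic network in GEXF, a format Gephi opens natively. The file names, the 0.1 weight threshold, and the assumed column layout are all placeholders:

```python
# Sketch: turn MALLET's --output-doc-topics file into a bipartite
# document-topic network that Gephi can open (GEXF format).
# Assumes the MALLET 2.0.8 composition layout (doc index, doc name,
# then one proportion per topic); older MALLET versions interleave
# topic/weight pairs instead, so check your output before parsing.
import networkx as nx

G = nx.Graph()
with open("witness_composition.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("#"):  # skip the header line, if present
            continue
        parts = line.strip().split("\t")
        doc = parts[1]
        for topic_id, weight in enumerate(parts[2:]):
            w = float(weight)
            if w > 0.1:  # keep only substantial doc-topic links
                G.add_edge(doc, f"topic_{topic_id}", weight=w)

nx.write_gexf(G, "witness_topics.gexf")  # open this file in Gephi
```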

As my project and learning curve continue I will be blogging here regularly. I will include my own workflows and processes for those who would like to produce topic models of their own, but as of now most of my early efforts have been made with the help of programminghistorian.org.
