Practicum: Topic modelling update…

This is a continuation of a series of posts relating to a topic modelling project. The first post in the series can be found here.


 

At this stage I’m happy to report that topics are being generated and models are being produced. Unfortunately I’m still operating on a sample of 50 statements rather than the full corpus but, as I have discovered, the iterative nature of the process and the somewhat ‘trial and error’ way of developing the right workflow mean that this may have been a blessing in disguise. Once the workflow is refined, the size of the corpus doesn’t really have much bearing on things (within reason). While the workflow is still being refined, however, the amount of data to keep track of can become, let’s say, cumbersome.

I’ve got somewhat of a handle on the machine learning software and I’m beginning to discover the possibilities of Gephi. So far so good, one might say. However, there have been a few snags and glitches along the way. The first mistake I made was to naively save my .mallet files, which store the information mined from the source files, in the same folder as the sample witness statements. The result was that every iteration (barring the first, as no .mallet files were present yet) included the previously stored .mallet files and their contents as part of the corpus. I noticed this when topics began to appear containing code and other content from these .mallet files. One happy result was that, when modelling in Gephi, these files were shown as unrelated to any topics bar the ‘rogue’ mallet topic, showing that the process was at least working.
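A quick check before each import run would have caught this mistake early. The following is only a minimal sketch, assuming the witness statements are plain .txt files sitting in a single folder. The function name and the file names in the usage note are hypothetical and are not part of mallet.

```python
from pathlib import Path


def find_stray_files(corpus_dir, allowed_suffixes=(".txt",)):
    """Return the files in corpus_dir that do not look like source texts.

    Assumption: every genuine statement has one of the allowed extensions,
    so anything else (e.g. a saved .mallet file) is treated as stray.
    """
    corpus = Path(corpus_dir)
    return sorted(
        p for p in corpus.iterdir()
        if p.is_file() and p.suffix.lower() not in allowed_suffixes
    )
```

Calling `find_stray_files("sample_statements")` on a folder holding the statements plus a saved `sample.mallet` would return just the path to `sample.mallet`. The simpler fix, of course, is to write the .mallet output to a different folder altogether.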


Mallet, as discussed in the previous post, allows stop words (common, less meaningful words such as ‘the’, ‘and’, ‘it’ etc.) to be excluded by adding the option ‘--remove-stopwords’ to the command line. A list of these stop words is included in the mallet package as a text file. However, this file is simply illustrative and is not, in fact, called on by the software when the option is used. The user cannot add words to the list by editing the text file, because the stop words mallet works from are actually written into the code, and changing them would require rewriting and recompiling mallet itself. This came as a surprise to this user and meant that what I had assumed would be a simple task, editing and experimenting with stop words, was now somewhat more complicated. There are ways in mallet to point at custom-made stop word lists and include them, but we decided that a custom Python script designed to pull our stop words out of the corpus itself would be more useful, as this would allow us to remove certain word patterns too. In essence, this means we could remove the phrase ‘Bureau of Military History’ but allow the word ‘military’ to remain. As many of the statements include cover sheets with template forms for the interviewer to fill out, this word pattern removal was very helpful in cleaning our corpus further, and is another example of one step back but two steps forward, if you will.
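To give a flavour of the idea, here is a rough sketch of phrase and stop word removal in Python. It is not the script used in the project; the function names, the punctuation handling and the sample word lists are all assumptions made for illustration. Phrases are removed first, so that ‘Bureau of Military History’ disappears as a unit while a stand-alone ‘military’ survives.

```python
import re


def remove_phrases(text, phrases):
    """Delete whole phrases (case-insensitive), leaving other words alone."""
    # Longest first, so a long phrase is not broken up by a shorter one inside it.
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.split()
        if not words:
            continue  # skip empty entries in the phrase list
        # Allow any run of whitespace (including line breaks) between words.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    return text


def remove_stop_words(text, stop_words):
    """Drop single-word tokens that appear in stop_words (case-insensitive)."""
    stop = {w.lower() for w in stop_words}
    kept = [
        tok for tok in text.split()
        # Ignore surrounding punctuation when comparing against the list.
        if tok.lower().strip(".,;:!?'\"()") not in stop
    ]
    return " ".join(kept)


def clean(text, phrases, stop_words):
    """Remove phrases first, then individual stop words."""
    return remove_stop_words(remove_phrases(text, phrases), stop_words)


sample = "Statement to the Bureau of Military History on military drill"
print(clean(sample, ["Bureau of Military History"], ["the", "to", "on"]))
# prints: Statement military drill
```

Doing the phrase pass before the stop word pass matters: if ‘of’ were stripped first, the phrase would no longer match.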


As my project and learning curve continue I will be blogging here regularly. I will include my own workflows and processes for those who would like to produce their own topic models as I have, but as of now most of my early efforts have been made with the help of programminghistorian.org.
