Over the past month I’ve been working with Harriet Wheelock, keeper of collections at the RCPI heritage centre. The collection I’ve been working with is the Kirkpatrick Index, which is part of a wider set of collections assembled by Thomas Kirkpatrick. The Kirkpatrick Index consists of manuscript notes, newspaper cuttings, photographs, documents and other material collected by Kirkpatrick between 1869 and 1954, and it contains biographical details of over 10,000 doctors who had contact with the RCPI or RCSI. The result is a rich source of in-depth information, not only from a medical history perspective but from a personal one too. The index in its current form is one of the most requested collections in the RCPI, and many of the requests come from people interested in family history.
Before explaining the work in detail, I feel it’s important to give some attention to Kirkpatrick himself for, without him, the collection would not exist. Kirkpatrick was a practising doctor as well as a historian. Dedicated to both fields, he was registrar of the RCPI for forty-four years and also general secretary of the Royal Academy of Medicine in Ireland. He took a particular interest in venereal diseases and held a clinic for women in the early hours of the morning to protect their anonymity. As he grew older, however, it was reported that on Sundays he was “not on the wards but in the Worth Library”, tending to the volumes and books. His “great pleasure was to dust and polish these volumes, which to this day remain in mint condition.” These are only small snapshots; throughout his life he continued to contribute to both medical and historical life.
My initial goal when approaching this project was to scope the index and build a proof of concept that could be used in the future by anyone wishing to complete the task of digitising the collection and making it accessible. Clearly 10,000 envelopes, each containing information on one doctor, would be too much for me to structure and digitise by myself, but I could work out an approach and produce some examples of how it could be done. The data I was to structure came to me in several formats. The first was the envelopes assembled by Kirkpatrick, which contain the bulk of the information. These envelopes are extremely hard to structure as each one contains a different level of information and detail. For example, one may contain a photograph, a birth cert, a newspaper article and a clipping from a medical journal, while another may only contain a name, dates of birth and death and the degree earned in the college. The amount of material varies widely, with some envelopes containing six pages of information. Next were the index cards. These were typed in the 1960s and contain all the information written on the front of the envelopes, such as dates of birth and death, the degree earned and in what year, and where the doctor was practising. Finally, I had access to an Excel sheet containing information on 3000 records that are not in the envelope format but on sheets of paper collected by Kirkpatrick. These sheets contain similar information to the envelopes. I was therefore faced with a problem of chaining data: linking the Excel sheet data to the 3000 manuscript sheets, linking the index cards to the envelopes, and then working out how to structure them all. I was also asked to integrate with the CALM archival system, as the RCPI heritage centre has a licence for its use.
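To make the structuring problem a little more concrete, here is a minimal sketch of how one doctor’s record might be modelled so that a sparse envelope and a rich one fit the same shape. The class and field names are my own assumptions for illustration only; they are not taken from CALM or from any schema the RCPI uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IndexCard:
    """Summary details typed on an index card (field names are assumptions)."""
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    degree: Optional[str] = None
    degree_year: Optional[int] = None
    place_of_practice: Optional[str] = None


@dataclass
class EnvelopeItem:
    """One physical item found inside an envelope."""
    kind: str  # e.g. "photograph", "newspaper article"
    description: str = ""


@dataclass
class DoctorRecord:
    """Links an index card to whatever its envelope happens to contain."""
    card: IndexCard
    envelope_items: List[EnvelopeItem] = field(default_factory=list)

    def item_kinds(self) -> List[str]:
        # Distinct kinds of material, sorted so the output is stable.
        return sorted({item.kind for item in self.envelope_items})


# A sparse record and a richer one use the same structure.
sparse = DoctorRecord(card=IndexCard(name="Example Doctor A"))
rich = DoctorRecord(
    card=IndexCard(name="Example Doctor B", degree="Licentiate"),
    envelope_items=[
        EnvelopeItem("photograph"),
        EnvelopeItem("newspaper article", "obituary"),
        EnvelopeItem("photograph", "group portrait"),
    ],
)
```

Making every field optional is a deliberate choice here, since the envelopes vary so much that any required field beyond a name would exclude some records.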
This was my first project dealing with different and disjointed data formats on this scale. I started with the index cards as I felt they were a good place to begin my chain of linking: they give structure and lay the groundwork for the envelopes. I was lucky that the cards were typed, which meant I could run OCR on them. The cards themselves are quite large, but again I was helped by the fact that a photocopy of all the cards had been made in the past, with 23 cards per page. The photocopy I received was not of the highest quality, but unfortunately it’s all I have to work with as the digital file of the scan is no longer accessible.
For the OCR I first took the easy approach and simply hoped that the OCR provided by Google Drive could recognise the font. This proved very inaccurate and I was lucky to get even 20% of the text correctly transcribed. I next researched other options, including ABBYY and Acrobat Pro, but I could not use these for this project because they are commercial programs. Finally I decided to use Tesseract, an open source OCR engine originally developed at Hewlett-Packard and later sponsored by Google. Running Tesseract with the default setup did not recognise the font to an acceptable standard, which meant I would have to train Tesseract on the data set. To do this I followed the guidelines set out in the Tesseract manual. The process is fairly automatic, as Tesseract provides the tools for training new fonts and languages. The only time-consuming part was manually correcting the box files to give corrected data for training. You need a box editor to do this, and I used jTessBoxEditor, a lightweight Java box editor which I found to work very well. I wrote some scripts for preparing the files for training and in the future will write one for running the training itself. My plan is to hand these over to the RCPI heritage centre so that they can use them if they need these tools in the future.
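As an illustration of the kind of helper that can take some of the pain out of box-file correction, here is a small sketch (not one of my actual scripts) that reads box-file lines and flags the boxes worth looking at first in the editor. It assumes the six-column box layout of symbol, left, bottom, right, top and page; the `EXPECTED` character set and the function names are hypothetical.

```python
import string
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse lines of the form 'symbol left bottom right top page'."""
    boxes = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split from the right so the symbol keeps any odd characters.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            raise ValueError(f"unexpected box line: {line!r}")
        symbol, *numbers = parts
        boxes.append(Box(symbol, *(int(n) for n in numbers)))
    return boxes


# Characters I would expect on a typed card (an assumption, adjust as needed).
EXPECTED = set(string.ascii_letters + string.digits + ".,-()&'")


def suspicious_boxes(boxes: Iterable[Box], expected=EXPECTED) -> List[Box]:
    """Return boxes whose symbol contains a character outside `expected`."""
    return [b for b in boxes if any(ch not in expected for ch in b.symbol)]


sample = ["K 10 20 30 40 0", "~ 32 20 40 40 0", "", "i 42 20 48 40 0"]
boxes = parse_box_lines(sample)
flagged = suspicious_boxes(boxes)
```

The flagged list is only a hint about where to start; every box still needs a human eye before it goes into training.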
Overall the OCRing of the index cards seems to be working well, with only a few mistakes that keep recurring. I’m thinking of adding a dictionary to correct these, as many words and terms are repeated across the cards.
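The dictionary idea could look something like the sketch below, which snaps each OCR token to the closest entry in a list of known terms using Python’s standard `difflib` module. The word list and the 0.8 cutoff are placeholders I made up for illustration; a real list would be built from the cards themselves.

```python
import difflib

# Hypothetical vocabulary of terms that recur on the cards.
KNOWN_TERMS = ["Dublin", "Licentiate", "Fellow", "Member", "Physician", "Surgeon"]


def correct_token(token, vocabulary=KNOWN_TERMS, cutoff=0.8):
    """Return the closest known term, or the token unchanged if none is close."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line, vocabulary=KNOWN_TERMS, cutoff=0.8):
    """Correct each whitespace-separated token in a line of OCR output.

    Punctuation attached to a token is treated as part of it, so a real
    version would probably want to strip it before matching.
    """
    return " ".join(correct_token(t, vocabulary, cutoff) for t in line.split())


fixed = correct_line("Fellcw Dublln 1901")
```

A high cutoff keeps the correction conservative, which matters here because names and dates must not be silently rewritten into something else.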
Once the index cards have been converted to text, the collection will be searchable in a digital format. This will at least remove the need to come in physically and check the index by hand. The information on membership and army enrolment on these cards will also be of great benefit to users of the collection. The manuscripts themselves will then be my next challenge. I plan to manually digitise a small number of records and then manually link them to the index card data. This will be part of my proof of concept and will show how much time digitising the files is likely to take. For the 3000 sheets I might be able to automate the linking, as each sheet and each entry in the Excel sheet carries a roll number. The roll number is handwritten on the sheets, which makes it more difficult to OCR, but on the other hand the handwriting on every sheet belongs to one person, Kirkpatrick. Overall I’m enjoying my experience working with the Kirkpatrick Index and hope I can achieve a lot more in the future.
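If the roll numbers can be read reliably, the linking step itself is simple. The sketch below shows one possible way to match spreadsheet rows to scanned sheets on a normalised roll number, leaving anything unreadable for manual review. The `roll_number`, `name` and `file` keys are assumptions for the example, not the real column names in the Excel sheet.

```python
def normalise_roll(value):
    """Keep only the digits and drop leading zeros, so '0042' matches ' 42'."""
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    if not digits:
        return ""  # nothing readable, e.g. a failed OCR result
    return digits.lstrip("0") or "0"


def link_by_roll_number(spreadsheet_rows, scanned_sheets):
    """Pair each scanned sheet with the spreadsheet row sharing its roll number.

    Returns (links, unmatched): links is a list of (row, sheet) pairs and
    unmatched holds the sheets that need to be linked by hand.
    """
    by_roll = {}
    for row in spreadsheet_rows:
        key = normalise_roll(row["roll_number"])
        if key:
            by_roll[key] = row

    links, unmatched = [], []
    for sheet in scanned_sheets:
        row = by_roll.get(normalise_roll(sheet["roll_number"]))
        if row is None:
            unmatched.append(sheet)
        else:
            links.append((row, sheet))
    return links, unmatched


rows = [
    {"roll_number": "0042", "name": "Example Doctor A"},
    {"roll_number": 7, "name": "Example Doctor B"},
]
sheets = [
    {"roll_number": " 42", "file": "sheet_001.tif"},
    {"roll_number": "?", "file": "sheet_002.tif"},
]
links, unmatched = link_by_roll_number(rows, sheets)
```

Keeping the unmatched sheets in their own list means the automated pass never guesses: anything it cannot link confidently is handed back to a person.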