For my thesis, I carried out an analysis of the Property Losses (Ireland) Committee (PLIC) papers. The context of these papers and the work carried out on them will be explained in another blog post, but before my research and observations could begin I had to create the dataset. This post covers how I created it. I was able to access the material and conduct academic research offering alternative historical perspectives without any direct permissions from the archive itself, which really speaks to the current nature of dissemination and the wealth of knowledge available to us.
The National Archives of Ireland currently houses the claims in a searchable web format containing 6,567 rows. The first task was to web scrape this data, which I did with Python's BeautifulSoup package. I first scraped the list of street names so I could inject them into the URL and crawl the website by stepping through the row offset. This was my first time web scraping and it wasn't perfect, but I caught the majority of fields; the data was in a table format, so it wasn't too difficult to isolate and grab. For the missed records I used the Google Chrome extension "Web Scraper", which helped to collect the files I had missed. I then combined the CSV files generated by both scrapes and removed duplicates: Python's set() function was used to generate unique rows from each file, which were then written to a new CSV file.
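The merge-and-deduplicate step can be sketched as follows. The column layout and rows here are illustrative only, not the actual PLIC schema:

```python
import csv
import io

# Two CSV exports: one from the BeautifulSoup scraper, one from the
# Web Scraper extension. The columns and values are made up for
# illustration only.
scrape_a = "location;scope\nAbbey Street;Claim for £100\n"
scrape_b = ("location;scope\nAbbey Street;Claim for £100\n"
            "Dame Street;Claim for £50\n")

def unique_rows(*csv_texts):
    """Combine several CSV exports and drop duplicate rows via set()."""
    rows = set()
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text), delimiter=";")
        next(reader)  # skip each file's header row
        rows.update(tuple(row) for row in reader)
    return sorted(rows)

# The duplicate Abbey Street row appears only once in the result.
print(unique_rows(scrape_a, scrape_b))
```

Rows are hashed as tuples because lists are not hashable, so set() can discard exact duplicates across both files.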
The table followed this format originally.
Next came cleaning the data. After loading it into a pandas DataFrame, I went through the scope column of each row, which contained a description of the contents of the claim. The following is an example:
“Claim for £134,410 for damage by fire to buildings, stock-in-trade, plants, customers goods and consequential loss at 85-90 Abbey Street Middle and 94-96 Abbey Street Middle, Dublin. Payment of £96,841 recommended by Committee.”
I wanted to extract the monetary values for analysis. The amounts claimed and paid were in the £sd format, which meant they could be easily identified, and I wrote a regular expression to extract these values.
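Since the exact expression is not reproduced here, the following is a sketch of what such a £sd pattern might look like, with optional shillings and pence groups after the pound amount:

```python
import re

# Illustrative £sd pattern: pounds (with optional thousands commas),
# then optional "Ns" shillings and "Nd" pence groups. This is a
# reconstruction, not the exact expression used in the project.
LSD = re.compile(r"£([\d,]+)(?:\s+(\d+)s\.?)?(?:\s+(\d+)d\.?)?")

scope = ("Claim for £134,410 for damage by fire to buildings ... "
         "Payment of £96,841 recommended by Committee.")

# Each match is a (pounds, shillings, pence) tuple; missing groups
# come back as empty strings.
print(LSD.findall(scope))
```

Running it on the sample scope text above yields two matches, one for the amount claimed and one for the payment recommended.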
I looked over the files and marked for revision anything that landed outside my regular expression. In this way, I could see all the exceptions, which included claims sent to the Under-Secretary of Dublin, claims sent to the Minister of Finance, claims where the claimant was no longer extant, claims redirected to the Prisoner's Effects Office, claims not proceeded with, claims awarded a nil payment, claims deemed the liability of insurance companies, claims received after the deadline, claims rejected as consequential loss, and claims marked as declined. A regular expression was written to identify each of these cases by looking for a keyword.
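A sketch of that keyword lookup, with hypothetical category names and keywords rather than the ones actually used in the project:

```python
import re

# Hypothetical keyword patterns for a few of the exception categories;
# the actual keywords used in the project may differ.
EXCEPTION_PATTERNS = {
    "under_secretary": re.compile(r"Under-Secretary", re.IGNORECASE),
    "nil_payment": re.compile(r"\bnil\b", re.IGNORECASE),
    "not_proceeded": re.compile(r"not proceeded", re.IGNORECASE),
    "too_late": re.compile(r"too late", re.IGNORECASE),
}

def classify_exceptions(scope):
    """Return every exception category whose keyword appears in scope."""
    return [name for name, pattern in EXCEPTION_PATTERNS.items()
            if pattern.search(scope)]

print(classify_exceptions("Claim not proceeded with; nil payment recommended."))
```

Each scope string that falls outside the monetary pattern can then be tagged with whichever exception categories its keywords match.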
The old £sd system was not base 10, so I decided the best way to handle calculations was to convert the values to pence, carry out the calculation on the pence amounts, and convert back to the £sd system. When I captured the values, I took each group from the regular expression and multiplied it by the appropriate factor: pound values by 240, shilling values by 12, and pence values stayed the same. I then subtracted the payment from the claim amount, and also calculated the percentage difference between the two; this was used for examining bias later. When I was finished, I converted the new pence values back to the £sd format by repeatedly taking the modulus and carrying the remainder down to the next unit.
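The conversion described above can be sketched as follows (1 pound = 20 shillings = 240 pence, 1 shilling = 12 pence), using the figures from the sample claim quoted earlier:

```python
def to_pence(pounds, shillings, pence):
    # 1 pound = 20 shillings = 240 pence; 1 shilling = 12 pence
    return pounds * 240 + shillings * 12 + pence

def from_pence(total_pence):
    # Reverse the conversion with divmod: pounds first, then shillings,
    # with the final remainder left as pence.
    pounds, remainder = divmod(total_pence, 240)
    shillings, pence = divmod(remainder, 12)
    return pounds, shillings, pence

claimed = to_pence(134410, 0, 0)
paid = to_pence(96841, 0, 0)
difference = claimed - paid

print(from_pence(difference))                # shortfall in £sd: (37569, 0, 0)
print(round(100 * difference / claimed, 2))  # percentage difference: 27.95
```

Doing the arithmetic entirely in pence sidesteps the mixed-radix carrying that subtraction in £sd would otherwise require.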
After this was finished, I also wanted a geographical aspect to my dataset, so I used OpenStreetMap tools to obtain the names of all streets in Dublin and their GPS coordinates. I first tried downloading the nodes by zooming to the appropriate level over Dublin and capturing that snippet, but this did not work as there were too many nodes to download in this fashion. I then downloaded the entire OSM data for Ireland and Northern Ireland. To prevent name clashes and to save time, I needed to apply a polygon of County Dublin to separate it from the rest of Ireland. I used Nominatim to look up the reference ID for the polygon of Dublin and then looked it up at http://polygons.openstreetmap.fr/index.py to extract the needed polygon information.
I applied the polygon to the map with the following command:
osmconvert ireland-latest.osm.pbf -B=dublin.poly -o=dublin-latest.osm
After that, I executed the following commands with osmfilter and osmconvert to extract the street names and GPS data to a CSV file:
osmfilter dublin-latest.osm --keep="highway=*" --drop-version > dublin-streets.osm
osmconvert64 dublin-streets.osm --all-to-nodes --csv="@id @lat @lon highway name" > dublin-streets.csv
In OSM, streets are classified as highways, and by the end I had a CSV of all streets in County Dublin with their GPS coordinates.
I then had to link this to my dataset of insurance claims. First I had to extract the street names from the location column of the table, where the location of the claim always followed the last semicolon. I wrote a regular expression to extract this information.
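A sketch of that extraction, with an illustrative location string rather than a real row, might look like:

```python
import re

# Illustrative location field; the street of the claim follows the
# final semicolon. The pattern below is a reconstruction, not the
# exact expression used in the project.
location = "Dublin; damage to premises; Abbey Street Middle"

# Anchor at the end of the string and capture everything after the
# last semicolon, trimming surrounding whitespace.
match = re.search(r";\s*([^;]+?)\s*$", location)
street = match.group(1) if match else None
print(street)
```

The `[^;]+` class guarantees the capture cannot reach back past the final semicolon, so it always grabs the last segment.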
I then paired these street names with the names in dublin-streets.csv through fuzzy matching. With some minimal manual cleaning, the dataset was ready for analysis.
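The fuzzy-matching idea can be illustrated with the standard library's difflib (the project may have used a different library); the street names here are invented:

```python
import difflib

# Hypothetical inputs: street names extracted from the claims versus
# names from the OSM-derived dublin-streets.csv.
claim_streets = ["Abbey Street Midle", "Dame Stret"]
osm_streets = ["Abbey Street Middle", "Abbey Street Lower", "Dame Street"]

def best_match(name, candidates, cutoff=0.6):
    """Return the closest candidate name, or None if nothing is close."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for name in claim_streets:
    print(name, "->", best_match(name, osm_streets))
```

A cutoff stops wildly dissimilar names from being paired, which is where the final manual cleaning pass comes in.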