Setting Limits: Knowing When to Stop Encoding

My Digital Scholarly Editing class has been hard at work recently preparing a website featuring the diaries of World War I soldier Albert Woodman. In doing so, we have needed to encode the handwritten diary in a machine-readable format. We chose a TEI encoding schema (TEI stands for, appropriately enough, the Text Encoding Initiative). During some of our very animated debates over how our encoding structure should look, I’ve been reminded of discussions that have happened in meetings for my practicum, in which I have been working on a new design for Susan Schreibman’s Versioning Machine, which I have written about here.

In both encoding the diary and in working on the practicum, we have looked at the same question from two different sides: how much encoding is enough?

Taking a handwritten document and presenting it on-screen is a complex task. At its most basic, an electronic edition of the diary can be accomplished by simply scanning every page and putting the digital facsimiles online. Indeed, our digital scholarly edition of the Woodman Diary includes digital images of the diary pages. But that alone would make for a gross underutilisation of the digital medium. Cross-referenced entries and annotations are major features that the digital diary could provide, not to mention a more readable version of the text for those who may have trouble deciphering Albert Woodman’s handwriting. All of these things require the text to be encoded, rather than simply scanned.

Encoding our text is by no means a straightforward matter, however. TEI schemas include a plethora of options for capturing not only the words on the page, but also scribbles, removed words, additions, and even elements such as gaps in the text or horizontal lines drawn across the page. All of these possibilities had to, at one point or another, be addressed in our discussions of how to render the diary. Ultimately, the encoding choices came down to the tension between rendering the text as closely as possible to how it appeared on the page and the practicality of creating a readable digital object in a reasonable amount of time.

In order to make those decisions, we first had to think about how we wanted the digital scholarly edition to look. Would the text reflect the layout of the page, becoming basically a typeset version of Albert Woodman’s written diary, or would it distance itself from the physical object, taking form in a layout that provided smoother readability in a browser window?

Ultimately, we decided to go with both approaches. When a user is simply browsing the contents of the diary, the text appears in isolation and takes a shape that is as easily readable as possible: a single day’s text appears in its entirety, regardless of how many physical diary pages the entry required or whether other days appeared on the same original pages. When a user chooses to view a facsimile of a diary page, however, a transcription appears alongside the original image that preserves the original’s line breaks, and any text on the page not related to the selected day appears in greyed-out form. If the selected day’s entry continues beyond the displayed page, the continuation is likewise omitted from the accompanying transcription.

In order to create both layouts, we needed an encoding structure that recognizes both individual dates and individual diary pages. We therefore included both, giving dates the greater hierarchical importance because they were our primary method for organizing the diary. In order to create a display that presented text side-by-side with the facsimile, however, we also needed to encode a line break for each individual line on the paper, regardless of whether a break at that point would make syntactic sense in the text. One of the nice things about XML is that, if a certain encoded structure is not amenable to a certain display style, it can simply be ignored, and thus our standard reading view will simply ignore those line breaks.

Ignoring line breaks would, however, leave us with giant masses of text for each individual day, and so we also decided to encode paragraphs, preserving Albert Woodman’s conceptual organization in addition to that imposed upon him by the width of his diary pages. The text in the reading view will then be organized by those paragraphs, breaking up the text to make it more manageable.
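As a rough illustration of how one encoding can feed both views, here is a minimal Python sketch using only the standard library. The fragment is not our actual markup: the diary wording, the date, and the attribute values are invented, the TEI namespace is left off for brevity, and <lb/> is treated simply as a separator between written lines.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment: one dated entry containing a page
# marker, paragraphs, and a line break for each written line.
SAMPLE = """
<div type="entry" n="1918-01-01">
  <pb n="1"/>
  <p>Up early and on duty<lb/>
at the signal office all day.</p>
  <p>Wrote home in the<lb/>
evening.</p>
</div>
"""

def reading_view(entry):
    """Return one string per paragraph, ignoring the <lb/> markers."""
    paragraphs = []
    for p in entry.iter("p"):
        text = "".join(p.itertext())          # <lb/> contributes no text
        paragraphs.append(" ".join(text.split()))
    return paragraphs

def facsimile_lines(entry):
    """Return one string per written line, splitting at each <lb/>."""
    lines = []
    for p in entry.iter("p"):
        current = p.text or ""
        for child in p:
            if child.tag == "lb":
                lines.append(current.strip())
                current = child.tail or ""
            else:
                current += "".join(child.itertext()) + (child.tail or "")
        lines.append(current.strip())
    return lines

entry = ET.fromstring(SAMPLE)
print(reading_view(entry))
print(facsimile_lines(entry))
```

The same source yields two paragraphs for the reading view and four short lines for the transcription that sits beside the facsimile; nothing in the encoding has to change between the two.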

We began looking at other ways to respect the text of the diary as it was written on the page, but we soon realized that trying to capture every element of the diary would quickly become overwhelming. We decided to encode the author’s deletions, such as words he scribbled out, with the <del> tag, underlined words with the <hi> tag, and text inserted in the middle of a sentence with the <add> tag, all with the appropriate attributes when necessary. But capturing more than this (horizontal lines drawn to break up entries on a page, for example, or words curved slightly to be crammed into the space at the end of a line) would be overkill, especially since we were not planning to render any of those elements in our display of the text. Our design, in other words, came to inform our encoding, just as much as our encoding plans needed to be reflected in our design.
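To make the three tags concrete, the following Python sketch reads a single invented sentence marked up with <del>, <add>, and <hi>. The sentence and the attribute values ("strikethrough", "above", "underline") are assumptions chosen for illustration, and dropping deletions from the plain text is only one possible rendering choice, not necessarily the one our site makes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence; the wording and attribute values are invented.
SAMPLE = (
    '<p>Went to <del rend="strikethrough">town</del> camp '
    '<add place="above">early</add> and it was '
    '<hi rend="underline">very</hi> cold.</p>'
)

def final_text(elem):
    """Collect the text as the author left it: <del> dropped, <add> kept."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "del":
            parts.append(final_text(child))
        parts.append(child.tail or "")
    return "".join(parts)

def reading_text(elem):
    """Final text with runs of whitespace collapsed."""
    return " ".join(final_text(elem).split())

def marked_spans(elem):
    """List (tag, text) for every deletion, addition, and highlight."""
    wanted = {"del", "add", "hi"}
    return [(n.tag, "".join(n.itertext())) for n in elem.iter() if n.tag in wanted]

p = ET.fromstring(SAMPLE)
print(reading_text(p))
print(marked_spans(p))
```

Because the deletion is still present in the source, a different view could show it struck through instead of hiding it; the encoding records what happened on the page and leaves the display decision for later.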

While decisions about the digital representation of the diary were important, however, a digital scholarly edition includes more than just the words that were on the paper. We also had to decide how to annotate our text.

A century-old wartime diary would naturally include a number of terms and references that want some degree of explanation, even if it were not a personal account. The fact that the diary includes a number of references that are specific to Albert Woodman’s personal life and experiences only complicated our decisions further. Theoretically, we could have decided to use tags to encode nearly every word for a potential annotation, but doing so would bloat our encoding to such a degree that it would be next to impossible to complete our digital object on any kind of reasonable schedule. And so, we again had to decide what to encode and what to leave alone.

People and places were obvious choices for inclusion—both provide valuable context for the diary, and are rich subjects for expansion in notes after they are captured with the <persName> and <placeName> tags and assigned appropriate attributes.
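As a sketch of why tagging names pays off, the Python below gathers every <persName> and <placeName> in an invented passage and groups the mentions by identifier. The passage, the "ref" attribute, and the identifier values are assumptions for illustration rather than a record of the attributes we settled on.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical passage; names, wording, and identifiers are invented.
SAMPLE = (
    '<p>Met <persName ref="#person1">Jack</persName> on the road to '
    '<placeName ref="#place1">the town</placeName>; '
    '<persName ref="#person1">J.</persName> had news from home.</p>'
)

def index_names(elem):
    """Group the text of each tagged name under its (tag, ref) pair."""
    index = defaultdict(list)
    for node in elem.iter():
        if node.tag in ("persName", "placeName"):
            key = (node.tag, node.get("ref"))
            index[key].append("".join(node.itertext()))
    return dict(index)

print(index_names(ET.fromstring(SAMPLE)))
```

Two different spellings of the same person end up under one key, which is what lets a single note serve every mention.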

Other terms were more troublesome, however. The diary is rife with military terms, bits of dialectal and period slang, sprinklings of foreign words, and simple abbreviations for brevity’s sake. Initially, we attempted to capture all of these using the <distinct> tag and then to distinguish between them all with “type” and “time” attributes such as “slang”, “mil”, and “WWI”. Keeping everything straight quickly became a logistical nightmare, however, and we realized that we were also creating distinctions that we didn’t need for our rendering of the edition. As far as the annotations were concerned, a term is a term is a term, whether it be military jargon or simple period slang. Those distinctions could be made in the annotations themselves, rather than in the overall encoding. Furthermore, creating a linked annotation for so many words would turn our diary into an unreadable stew of hyperlinks, and attempting to read the diary while clicking on every individual annotation would likely exhaust users before they reached the end of a single day. So, we simplified, giving any relevant term a <term> tag and a simple “ref” attribute marker, which is more than sufficient for our organizational needs. Foreign words and basic abbreviations are encoded as such internally but are displayed as-is on the website, free of any extraneous annotations. That gave us a firm framework to work in without making the diary’s encoding too complex to write (or to read).
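The simplified scheme can be sketched in a few lines of Python. Everything specific here is hypothetical: the sentence, the identifiers, the note wording, and the layout of the NOTES table are stand-ins, meant only to show the military-versus-slang distinction living in the annotation record while the text itself carries nothing but <term> and "ref".

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation table; the "kind" field holds the distinction
# that the encoding itself no longer records.
NOTES = {
    "#dump": {"kind": "military", "note": "A store of supplies or ammunition."},
    "#cushy": {"kind": "slang", "note": "Easy or comfortable."},
}

# Hypothetical sentence using the simplified <term ref="..."> markup.
SAMPLE = (
    '<p>Drew rations at the <term ref="#dump">dump</term> and had a '
    '<term ref="#cushy">cushy</term> night.</p>'
)

def annotated_terms(elem, notes):
    """Pair each tagged term with its note record (None if no note exists)."""
    return [
        ("".join(t.itertext()), notes.get(t.get("ref")))
        for t in elem.iter("term")
    ]

for word, record in annotated_terms(ET.fromstring(SAMPLE), NOTES):
    print(word, "->", record)
```

Adding a new category of term then means adding a value in the notes, not revisiting every tagged word in the diary.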

And so, the answer to the question “How much encoding is enough?” lies in deciding what the encoding is for. If our presentation is too oppressive, we are defeating our own purpose—as we also would be doing if we made our encoding so complicated that we were unable to finish it in time for our release. After all, what good is a digital scholarly edition if nobody ever gets to read it?

Further Reading:

“P5: Guidelines for Electronic Text Encoding and Interchange.” TEI Consortium. 16 September 2014. Web. 22 March 2015. <http://www.tei-c.org/release/doc/tei-p5-doc/en/html/>
