The Tweets of #GE2015

Letters of 1916, Miscellaneous

I have for some time been working on an approach to analysing the Letters of 1916 corpus. The idea behind this stems from a poster at last year’s Digital Humanities conference that investigated a single, long-running correspondence using ‘lexical distance’ from a rolling baseline to highlight changes and repetition in topics being discussed (the taller the bar, the ‘newer’ the subjects being discussed were; these fed into a rolling average, against which later letters were compared).

 

The graph above shows a first stab at this with the Letters corpus (actually the baseline in this case is an accumulation of all the letters). To get this far has required some work: the very first attempt ‘forgot’ the need to take account of the length of texts for each date range (thus a significant-looking rise round about Easter 1916 could be attributed solely to there being more letters for this period). I will go into further detail on this line of investigation in a later post.
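
For the curious, the fix amounts to comparing relative word frequencies rather than raw counts, so that a date range containing many letters is directly comparable with one containing few. A minimal sketch of the idea in Python (illustrative only, not the project code):

from collections import Counter

def relative_frequencies(tokens):
    # Normalise raw word counts by the total number of tokens, so that
    # date ranges with very different amounts of text remain comparable.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}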

The biggest problem, which seems to underlie any work with the Letters corpus, is the sparseness of data: seemingly insignificant letters which just happen to repeat a word a lot — especially if it is a rare word — have an enormous effect.

Partly as a small afternoon’s hacking challenge, and partly as an attempt to see whether the kind of approach described above was even feasible, I decided to do the same with tweets containing the hashtag #GE2015 (for the UK General Election on 7th May 2015). As a greater challenge, and to learn some fresh bits of Python, I built a system that would analyse the tweets in real time (minute by minute) and push the results to a website. You may notice, now that I’ve turned it off, that the graph no longer updates. Whilst it was working, though, three Python scripts running simultaneously a) continually monitored the Twitter hashtag #GE2015, stripped out stop words and other extraneous characters, and accumulated the tweets for each minute in a file, b) compared the lexical distance (Euclidean distance, with each unique word count as a vector) of each minute’s tweets against an accumulation of all the tweets up to that point, and c) converted the results to JSON and pushed them to my web server. The website above performed an AJAX call to fetch the data file every minute and refreshed the graph. (Thanks again, vis.js.)
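
The heart of step b) is small enough to sketch here: treating every unique word count as one dimension of a vector, the distance between the current minute and the accumulated baseline can be computed along the following lines (a simplified sketch with invented names, not the actual scripts):

import math
from collections import Counter

def lexical_distance(current, baseline):
    # Euclidean distance between two word-count vectors, with every
    # unique word treated as one dimension.
    vocabulary = set(current) | set(baseline)
    return math.sqrt(sum((current.get(w, 0) - baseline.get(w, 0)) ** 2
                         for w in vocabulary))

baseline = Counter()

def process_minute(tokens):
    # Compare this minute's cleaned tokens against everything seen so
    # far, then fold them into the accumulated baseline.
    current = Counter(tokens)
    distance = lexical_distance(current, baseline)
    baseline.update(current)
    return distance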

A screen capture of the final graph

I turned on the whole operation shortly before midnight. Of course, I had to stay up all night to watch the election to check that big spikes in the graph actually represented something (assuming there were any). Fortunately, the election provided. The first spike is the announcement of the Nuneaton result (which effectively confirmed the hitherto-disbelieved exit poll). The second is Douglas Alexander (Labour campaign manager and Shadow Foreign Secretary) losing his seat. The third, and largest, is Scottish Labour leader Jim Murphy losing his seat… After this point, all the spikes seem to be for significant figures losing seats. It is not apparent without digging deeper whether these later losses were less shocking, or whether the baseline data was becoming normalised to the ‘theme’ of notable figures losing (hopefully the latter, as that would validate my model).

When tweets come marching in: screenshot of Terminal window showing the cleaned and tokenised tweets rolling in

By the end of the night (around 7am, when the Twitter client gave up and refused to reconnect) my laptop had processed over 3 million word tokens. A final annotated graph can be found here. (UPDATE: Sometimes visjs goes a little strange and moves the captions around… hopefully this can be fixed). If you’re really interested, this can be cross-referenced with the Guardian coverage.

I hope to dig a little deeper into these results, including looking at the relationship between amounts of data and the result. In general, though, there seemed to be a pretty steady flow of data and none of the exceptionally-sparse bits that added curiosity to the Letters work.

Read comments / comment on this post

The Albert Woodman Diary: Editing a text for the general public

AFF606a

This is the final blog post of this module, the end product of which will be a Digital Scholarly Edition of Albert Woodman’s two diaries from the First World War. As such, rather than focus on a specific topic, I thought it appropriate to explain some of the design and encoding decisions made (so far) in the process of constructing the edition. The first decision taken in the edition’s conception was the choice of target audience. Given the hundred-year anniversary of the First World War, as well as of the Easter Rising, we considered the diary of an Irish signaller serving in the British army to be of widespread interest.

It has been said — in numerous ways and by numerous experts in the field of scholarly editing[1] — that the market for print versions of scholarly editions consists principally, if not entirely, of editors themselves. In the same way that the general public do not need to read reams of medical journals, relying instead on their doctor to offer appropriate treatment, most users of scholarly editions are not terribly interested in the process behind their creation. What we desire from a doctor is the correct prescription, and from a textual scholar the “correct” text. It is for this reason that so-called ‘reading editions’ exist: the text of the edition is no less valid — and dependent on the same work of validation — yet the actual validation, the process by which the text is validated, is not presented, as it is not useful to the reader.

This generalisation is true as far as it goes, but fails to see the whole picture in numerous ways. Firstly, it fails to consider the fact that the scholarly/reading edition dichotomy in print does not necessarily apply to the digital realm, where scholarly apparatus can be included by default and ignored or turned off, without the material detriment it would incur in print (i.e. doubling the size of the volume). Digital editions allow the presentation of TEI-encoded text, which, in the absence of anything else, including explicit representation in an interface, can serve as a manifestation of the textual scholarship underlying the edition. Facsimiles and side-by-side views with transcribed texts give a greater material context. A more ‘normalised’ text facilitates reading. And, of course, these views can — if the edition is designed this way — be changed according to the user.

The second point is that, as I will suggest in this post, some forms of apparatus are actually necessary for a general audience, though not necessarily the detailed textual notes. Part of this is due to the fact that we are dealing with unique historical documents, rather than ‘works’ of literature that might invite a greater degree of textual criticism. Unless we created a diplomatic edition (which, as I have suggested above, would be of less interest to a wider audience), a much greater proportion of the scholarly apparatus of the edition would be focused on historical context. In fact, what we came to realise in the production of Woodman’s Diary is that the more general an audience, the more contextualisation is required. Of course, we hope the edition will be of use as a primary source to historians studying the period; it seems logical, though, that a specialist in the period will require less contextualising information than a general reader.

Taking the above considerations into account, the Woodman project decided to favour ‘external’ contextualisation over textual or diplomatic features. One of the ‘views’ of the text chosen was therefore an edited text, incorporating clickable links to more contextual information (mapping named entities to biographical or geographical information, and annotating terms whose specific context made understanding difficult for a general audience). It was also decided to present a side-by-side view of the facsimile and transcription of each diary page. This latter choice was, I would argue, not so much for the purposes of allowing greater interrogation at a textual level (though this is undoubtedly a possibility) as of providing, especially for a wider audience, additional contextualisation by giving some notion of the physical characteristics of the diary.[2]

The question for an editor becomes, then, what should the text of the edition be, and what should be annotated? Or, in this case, what concessions to the audience need to be made? In general, we favoured leaving the presented text very much ‘un-normalised’, particularly when dealing with what Greg terms “accidentals” (spelling, word division, some punctuation) that impact on the presentation rather than the meaning of the content. This was a difficult decision to make: especially if we truly regard these features as accidental rather than based on some form of authorial intent, it would surely have made much greater sense, for a modern, general audience, to modernise and correct spellings and some punctuation. Perhaps the most striking example of this is the absence of an apostrophe in virtually every contraction (e.g. “havent”). It is very hard to declare this an intentional feature of the text, and thus it would seem logical for it to be corrected. Of course, from the opposite perspective, it could be argued that such corrections are excessively normative (perhaps even conveying a sense of cultural or educational condescension on the part of the editor).

I do not think either view is conclusive, and as such our rationale was to consider both the nature of the text we are trying to present and the intended audience. Our approach was therefore to emphasise explanation over correction. In this we are, as I argued earlier, helped by the dynamic nature of digital editions, which allow an explanation to be hidden and not infringe on the reading process, but also to be merely a click away. Abbreviations therefore are encoded using the TEI <choice> element, giving the text ‘as written’ and the expansion as an optional ‘explanation’. The determining factor in editing, annotating and explaining the text was whether a given aspect would be comprehensible. Thus “havent” was deemed to be understandable, whilst military acronyms, we felt, needed (a lot) more explanation.
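
As a concrete (and much simplified) illustration of how the <choice> encoding supports this: the transcription keeps the form ‘as written’ in <abbr> alongside an <expan>, and the site can present either. The sketch below uses Python and lxml with an invented fragment, not the actual diary encoding:

from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

sample = ('<p xmlns="http://www.tei-c.org/ns/1.0">Saw the '
          '<choice><abbr>C.O.</abbr><expan>Commanding Officer</expan></choice> '
          'this morning.</p>')

def reading_text(fragment, expanded=False):
    # Return the text of a TEI fragment, keeping either the form
    # 'as written' (<abbr>) or the optional explanation (<expan>).
    root = etree.fromstring(fragment)
    drop = "abbr" if expanded else "expan"
    for element in root.iter("{%s}%s" % (TEI_NS, drop)):
        element.getparent().remove(element)
    return " ".join("".join(root.itertext()).split())

print(reading_text(sample))                 # Saw the C.O. this morning.
print(reading_text(sample, expanded=True))  # Saw the Commanding Officer this morning.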

In general, our argument for ‘not editing’ the text too heavily was the nature of the text itself. We were not dealing with a text that was intended to be edited or published, or indeed, read by more than a few people. Also, we are talking about a war diary, written daily in extreme circumstances. “Accidentals” therefore feel less like presentational nuances than like actual artefacts of composition that are, consequently, a vital part of contextualising the text for a general audience, rather than something that should be editorially glossed over.[3]


 

[1] Elena Pierazzo particularly made this point in the DiXiT Camp in Borås.

[2] I would suggest pictures of the diary alongside the text give a sense of the materiality of the text without resorting to specialist descriptions of the “comprises an octavo notebook” sort, which is of greater value to textual scholars and bibliographers.

[3] As a parting shot, to undermine my conclusion, I feel in hindsight that we may have been excessive in transcribing punctuation, particularly where it seems Woodman may have let his pen come to rest in a point at the end of each line. This does not seem to be punctuation, whether correct, incorrect, or subsequently altered, with any intent at all — indeed, it is not punctuation but a dot, and thus may be considered truly accidental.

Read comments / comment on this post

Woodman Diary: pre-compiling for zero-infrastructure

AFF606a

Since the development of database-backed websites, starting with CGI scripts, continuing with languages such as Perl and PHP, and, latterly, culminating in web frameworks such as Django (written in Python) and Ruby on Rails, it would be only moderate hyperbole to suggest that no-one writes HTML any more. More precisely, it would be true to say that HTML is used to write templates for content, which is then compiled in order to be sent to a user’s web browser. There are essentially two reasons for this. Firstly, manually writing HTML for each page to be delivered requires a degree of repetitive overhead that a tiny amount of programming can quite easily make redundant. Each HTML document requires a number of elements that are common to each page: a head element containing metadata, a title element, and scripts and CSS stylesheets; each page will probably incorporate the same navigation elements, and the same footer elements. Keeping these elements that will never change, or will change predictably, separate from the actual content of each page allows much greater maintainability (a change in the header only needs to be made in one place) and avoids what amounts to a great deal of repetitive typing. It is for this reason that PHP was initially designed: less as a programming language, and more as a way of including standard content across a number of pages without having to type it out manually.

The second factor to be considered is the nature of the content in question. The web today is less a standard way of representing and linking digital documents (as Tim Berners-Lee designed it to be) and more a means for delivering dynamic and interactive content. It would, for instance, be inconceivable that a site such as Facebook could exist if every page was a separate document that had to be written manually in HTML, or even made into HTML pages in advance. For this kind of site, HTML is merely the form of delivering content in a way that the browser can render; the data of the site is held separately in a database, and pages are created by firstly querying the database, and secondly inserting the result into a template. Even sites with less dynamic content (WordPress blogs; newspaper websites) employ a database-backed approach, though this is, I would argue, a pragmatic decision rather than a necessity.

Whilst researching the options for building the site infrastructure for the Albert Woodman Diary project, we found these two factors essential in our considerations (not to mention the pragmatics of actually getting a given approach to work within a time constraint). The first, that of avoiding repetition, was both clearly beneficial and, as the data was to be encoded in TEI-XML, a given anyway: XSLT, as a standard approach for converting an XML document into another XML document (in this case XHTML, which is essentially HTML expressed as XML), functions as a templating language in the same way that PHP or ERB (the templating language used with Rails) allows the insertion of arbitrary content into HTML. Using XSLT templates, the ‘standard’ content of each page would only need to be written once.
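
In practice this is only a few lines of Python: the stylesheet containing the shared page furniture is compiled once and can then be applied to any TEI document. A minimal sketch using lxml (the file names are placeholders, and the real stylesheet is considerably more involved):

from lxml import etree

# Compile the stylesheet once; it holds the common header, footer and
# navigation, plus templates matching the TEI elements of an entry.
transform = etree.XSLT(etree.parse("diary-page.xsl"))

tei_entry = etree.parse("entry.xml")
html_page = transform(tei_entry)

with open("entry.html", "wb") as out:
    out.write(etree.tostring(html_page, method="html", pretty_print=True))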

The more important question to be considered was the extent to which a database was required to store and query the content. Implicitly underlying this decision was the relative complexity of getting a database (especially an XML database) set up and running. As I previously noted, the necessity of querying underlying data is dependent on precisely how dynamic the content of a site is, but depends equally on the complexity of the data query required to generate a given page. WordPress, for example, is backed by a database, with a user request for a page resulting in the database being queried for a given post’s content, the content being inserted into a standard HTML template, and the HTML being returned to the user’s browser and rendered. In this model, there is no particular reason why the data of each blog post could not be pre-rendered in HTML and sent to the user on request without requiring a database (indeed, this is precisely what various WordPress caching plugins do). One reason for this is the relative simplicity of the query itself: request a single row from a single table. A more important factor is the predictability of this query. If a page is to be pre-rendered, it is necessary to anticipate that a user will issue a page request that corresponds to a given query. In the case of a blog, this is relatively obvious: a user will wish to see all the posts. Even more complex queries, such as a page listing the posts in each category, can be anticipated and pre-rendered.[1] A final factor is the number of query permutations: a user wishing to see a page of posts belonging to either of two categories will require a page to be pre-generated for each combination of two categories… which could lead to potentially thousands of pre-rendered pages, making an on-the-fly database query much more effective.

This line of thought was also applied to Woodman’s Diary in deciding whether to have a site backed by a database or simply to convert the XML to HTML: Would the content change? How arbitrary a query would be required for a user to navigate the content? How many pages would need to be rendered in such a case? And, most pragmatically, would the technical complexity of setting up a native XML database outweigh the benefits? Clearly, the content of the diary is static, so the pages could be rendered beforehand. Looking at the number of diary entries (a couple of hundred), using XSLT to generate each page and the links between them in advance was clearly plausible. Likewise, a page corresponding to each named entity (again, a fairly large, but also unchanging, number) could be pre-rendered. In this respect, the complexity derived from both setting up an XML database (such as eXist) and learning XQuery[2] was not worth it, in terms of the end result, compared to simply pre-compiling every HTML page and sticking them on a web server.
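
The pre-compilation itself then amounts to a loop over the entry files, handing each one to the same compiled stylesheet along with the names of its neighbouring pages so that the ‘previous’/‘next’ links are baked in. A sketch of the idea, assuming a stylesheet that declares prev and next parameters (paths and file names invented):

import glob
import os
from lxml import etree

transform = etree.XSLT(etree.parse("diary-page.xsl"))
entries = sorted(glob.glob("tei/*.xml"))
os.makedirs("site", exist_ok=True)

def html_name(xml_path):
    return os.path.basename(xml_path).replace(".xml", ".html")

for i, path in enumerate(entries):
    prev_link = html_name(entries[i - 1]) if i > 0 else ""
    next_link = html_name(entries[i + 1]) if i + 1 < len(entries) else ""
    page = transform(etree.parse(path),
                     prev=etree.XSLT.strparam(prev_link),
                     next=etree.XSLT.strparam(next_link))
    with open(os.path.join("site", html_name(path)), "wb") as out:
        out.write(etree.tostring(page, method="html"))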

Of course, what is lost by this approach, as suggested above, is the ability of the user to arbitrarily query the data, either using the XML structure (“give me every day in June that’s a Sunday and mentions Woodman’s wife”) or through full-text searching. In part, this functionality can be achieved by, for instance, a filterable list of named entities and terms; otherwise, it was felt that arbitrary queries were unnecessary. A user wishing to perform complex queries would, anyway, be able to download the underlying TEI documents and manipulate them as they wished.


 

[1] Later blogging platforms, notably Jekyll — a static blog generator written in Ruby — take this approach: the blogger writes posts using Markdown syntax, which the generator compiles into the HTML for a blog site, with a home page, post-pages, category-pages, and all the links between. The blogger then uploads all the HTML files to a basic web server, and does not have to bother with configuring a database.

[2] Using a relational database, such as MySQL, was not considered an option. Even simple analysis shows it to require a kind of quasi-rendering of the XML data, in order to put a given piece of content into a relational database table — which makes it no more beneficial than pre-rendering the HTML page.

Read comments / comment on this post

Letters of 1916: An edition, an archive, a thematic research collection, a network?

AFF606a

Since the Letters of 1916 is an as-yet incomplete, and certainly unpublished, project, its potential value is necessarily speculative. We do not know in what precise form the project will be disseminated, nor what uses future scholars will find for it. This is, of course, always the case in scholarly editing. As Peter Robinson argues, the quality of an edition is ultimately dependent on its usefulness (Robinson 2013); and the usefulness of an edition is, essentially, a combination of the usefulness of the edition’s material and the usefulness of its presentation. While the first aspect of this usefulness can be more readily assessed (and, especially if the material is previously unpublished, it is almost useful by default), determining precisely how future researchers will want to engage with the material is more difficult. That said, some speculation about use-cases — ideally, a systematic and critical engagement with potential users — is a basic necessity in the construction of a digital scholarly edition. At this point in the project, then, and within the scope of a blog post, it seems fair to speculate.

By the broadly sketched criteria above, it is clear to see why the Letters of 1916 has such potential value as a project, though some of these ‘points of value’ will often seem to overlap and must be unpicked. Clearly, the source material is useful: inevitably, some letters are more important than others, but such a wealth of letters — covering the period of the Easter Rising, and set against the background of the First World War — will undoubtedly be valuable to researchers. In this regard, the more material accumulated, the more useful the resource. We can add to this the fact that much of the material was previously unavailable for one reason or another: either pragmatically (being stored in an archive and thus inaccessible without, at best, a good deal of inconvenience) or actually (remaining unknown in the collections — or attics — of private individuals). Thus the materials themselves are useful, and digital availability is useful as it allows access to the materials.

I have considered the latter argument — allowance of access — as a part of the ‘material usefulness’ of the edition for the reason that it does not necessarily prescribe any particular form for the edition, except that it exist and be a viable form of dissemination for the material. This separation of the value of the material and the value of the edition is particularly relevant with the Letters of 1916 due to the heterogeneous nature of the source materials. More conventional editions tend to be editions of something, normally a work of some kind. While this term is far from unambiguous, it suggests both a cohesiveness and a completeness, within some named boundary. Collections of letters published as editions — the correspondence of Charles Darwin springs to mind — normally have a central focus, in this case, Darwin. The chronological ordering of the letters again reinforces this centrality.

As such, the word ‘edition’ applies only in so far as Letters is the product of ‘editing’. Each letter digitised, encoded and disseminated is an ‘edition’ of that letter (itself cohesive); an ‘edition’ of all the letters could be more properly considered an electronic archive (McGann) or a ‘thematic research collection’ (Palmer). The distinction would depend, ultimately, on the extent to which the collection could be deemed ‘thematically coherent’ (Palmer); this is very much a matter of debate but, I would suggest, merely a matter of terminology. In either case, however, the only way to begin to explore the letters is through ‘searching’, either via a free-text search or by browsing and filtering lists of letters.

The obvious solution to this is ‘compartmentalisation’ into an appropriate number of ‘sub-editions’. The separation of letters that do not, by this rationale, ‘belong together’ seems advantageous. Both the MacGreevy Archive[1] and the Shelley-Godwin Archive[2] follow an approach of separating the content into collections (with the latter, this seems to be the case, even though only the Frankenstein papers have thus far been published; and in both cases, the division of content is perhaps more obvious). Such a compartmentalisation could be achieved by manually assigning ‘categories’ to letters.

Dividing the letters into categories has many advantages. The redundant complexity necessary to create a data structure applicable to all the letters would be avoided. Equally, it avoids the ruthless normalisation of letters to fit within a more generally applicable structure. In short, the representation of letters can be tailored much more appropriately to the letters themselves. The strong imposition of some ontological categories — whilst normative and externally derived, they may also be externally documented — may make it easier for researchers to retrieve specific information. It may thus be possible to consider each sub-edition properly as a thematic research collection.

There are problems with this, however. Categorisation, as opposed to tagging, requires — from a standpoint of usability — that the categories as applied correspond to the user’s notion of how a category should be applied. Tagging letters (in other words, assigning multiple categories) partially solves this problem, but at the expense of making the categories themselves less informative.

This is one suggestion. Another possibility to be considered is that a more cohesive edition might be constructed through tracing links between individuals writing or receiving letters, individuals named in letters, places named in letters; or derived characteristics, such as topics generated through latent semantic analysis. The former provide strongly identifiable aspects that letters may have in common; the latter, by assessing similarity, can identify letters that are ‘closer’, which can then be linked to by ranking.
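
To make the second, ‘derived’ kind of linking concrete, here is a hedged sketch using scikit-learn (the snippets stand in for transcribed letters; this is not project code): each letter becomes a TF-IDF vector, and the letters ranked most similar to it by cosine similarity could be surfaced as ‘related letters’. Topic modelling proper would add a step such as latent semantic analysis over these vectors, but the principle of linking by ranked similarity is the same.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

letters = [
    "Dearest May, the post has been terribly slow this week ...",
    "Regarding the consignment of supplies to the barracks ...",
    "My dear mother, the weather in Dublin has been dreadful ...",
]

# Each letter becomes a TF-IDF vector; similarity is then measured
# pairwise between every letter and every other letter.
vectors = TfidfVectorizer(stop_words="english").fit_transform(letters)
similarity = cosine_similarity(vectors)

def related(index, top_n=2):
    # Rank the other letters by how 'close' they are to this one.
    ranked = similarity[index].argsort()[::-1]
    return [int(i) for i in ranked if i != index][:top_n]

print(related(0))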

This provides a level of ‘forced’ cohesion to the edition as a whole, while at a level of usability providing researchers with alternative modes of investigation. Radford, in ‘Positivism, Foucault, and the Fantasia of the Library’ refers to the conceptualisation by Umberto Eco of the library as a labyrinth, a ‘net’ where “every point can be connected with every other point, and, where the connections are not yet designed, they are […] conceivable and designable”. This conception, which appears similar to the Deleuzian idea of the ‘rhizome’ and Latour’s ‘Actor-Network Theory’, is explored as a model for data representation by Lyn Robinson and Mike Maguire, and by Sukovic. An edition constructed along these lines would permit a researcher to ‘walk’ through virtual corridors, following a myriad of links between texts of various types; the notion of ‘searching’ is replaced by one of exploration.

To be borne in mind is the fact that such a ‘labyrinth edition’ is only valid as a form of investigation from the inside. Strictly as a mode of exploration, it links letters together by common features, but, at the level of user interaction, never rises above the level of the letters — as individual entities — themselves. The user is only ever guided from one letter to the next.

Viewing such a labyrinth externally, we essentially have a complex graph of links (a rhizome). Analysing such links should allow the creation of a general picture of the edition (or archive) as a whole. This perspective is, however, constrained due to the ‘fixed incompleteness’ — it is bounded by the limits of the content of the edition. In principle, this rhizomatic network should be able to reach out to include aspects beyond itself (in what Deleuze calls ‘lines of flight’), which become assimilated into this network (in other words, more letters, all letters, even lost ones). The implication of this is that such a network is never closed; at the same time, latent semantic analysis is dependent on closure — deriving some positivist conclusion from what is present.

In this analysis, an external view of this network is a model derived from the content itself, and is therefore a model only of the content itself; it is not a model of, say, correspondence by post in 1916. As such, methods along these lines are dependent on the means by which ‘closure’ is achieved. These means are very much factors in the real world: what kinds of letters have survived; what kinds of letters make it into archives; constraints of digitisation put in place by archives or other copyright holders; and myriad other factors. A model of the letters that can make valid claims beyond the materials of its construction must also model, or ‘reassemble’ (in Bruno Latour’s words) the factors of its construction.

 


Deleuze, Gilles. A Thousand Plateaus: Capitalism and Schizophrenia. Minneapolis: University of Minnesota Press, 1987. Print.

Latour, Bruno. Reassembling the Social: An Introduction to Actor-Network-Theory. Oxford; New York: Oxford University Press, 2005. Open WorldCat. Web. 14 Aug. 2014.

McGann, Jerome. “Electronic Archives and Critical Editing.” Literature Compass 7.2 (2010): 37–42. Wiley Online Library. Web. 28 Oct. 2014.

Palmer, Carole L. “Thematic Research Collections.” A Companion to Digital Humanities. N.p. Web. 11 July 2014.

Robinson, Lyn, and Mike Maguire. “The Rhizome and the Tree: Changing Metaphors for Information Organisation.” Journal of Documentation 66.4 (2010): 604–613. CrossRef. Web. 14 May 2014.

Robinson, Peter. “Towards a Theory of Digital Editions.” The Journal of the European Society for Textual Scholarship (2013): 105. Print.

Sukovic, Suzana. “Information Discovery in Ambiguous Zones of Research.” Library Trends 57.1 (2008): 82–87. Web. 15 May 2014.

 

[1] http://www.macgreevy.org/

[2] http://shelleygodwinarchive.org/

Read comments / comment on this post

Even when text is an ordered hierarchy of content objects, it isn’t

AFF606a

It is debatable whether the theory that text is, as proposed by DeRose et al., an ‘Ordered Hierarchy of Content Objects’ (1990), preceded the practical implications of representing text using a hierarchical markup language. Even the abstract to ‘What is Text, Really?’ appears contradictory: within two sentences one finds the utilitarian argument (“this model [using SGML] contains advantages for the writer, publisher, and researcher”), preceded by the theoretical claim (“text is best represented as an OHCO, because that is what text really is” [my emphasis]).

In ‘Refining our notion of what text really is’, Allen Renear demonstrates that a text has “no unique logical hierarchy” that “represents the text itself” (1993). At best, he argues, a text may be treated as consisting of one of several hierarchies representing a given “analytical perspective” (Renear 1993). Examples of this are easy to imagine: a poem may be broken down into a hierarchy of stanzas, though this hierarchy may be incompatible with a hierarchy of grammatical clauses, which may span two stanzas. Thus we have an ‘overlapping hierarchy’, or, in Renear’s term, two perspectives. And, as he goes on to argue, a perspective which overlaps may be broken down into two sub-perspectives — which, it seems to me, are themselves perspectives (and thus the process can continue down to the atomic level, at which point the whole notion of a hierarchy is redundant).

Renear’s article was written over twenty years ago, and the intervening period has seen the arrival of a new language (XML), the refinement and elaboration of schema for encoding texts (the TEI), and the increasingly sophisticated use of syntactic tricks to resolve problems of encoding ‘multiple perspectives’. My argument in this post, however, is that text — below the structural level, at the level of words and characters — is not discrete, and thus not hierarchical, even when, in practice, problems of ‘overlapping hierarchies’ are disregarded. To phrase this argument differently: at the level of words or characters, a text may be ordered, but it is not hierarchical.

The basis of this argument lies in the way that we use XML (or, formerly, SGML) to mark up texts. As a language for encoding discrete data, XML is hierarchical in the same way as other languages such as JSON and YAML. The following simple examples of XML and JSON — each encoding a small amount of discrete data — contain the same data:

<person>
  <name>
    <forename>John</forename>
    <surname>Smith</surname>
  </name>
  <address>
    <streetAddress>1 Main Street</streetAddress>
    <town>Maynooth</town>
    <country>Ireland</country>
  </address>
</person>

The same data represented in the JSON format:

{
  "person": {
    "name": {
      "forename": "John",
      "surname": "Smith"
    },
    "address": {
      "streetAddress": "1 Main Street",
      "town": "Maynooth",
      "country": "Ireland"
    }
  }
}

[Example 1]

In the document-structural elements of the TEI, this is also the case (though some TEI-header elements may contain text rather than discrete data):

<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The Ballad of Red Riding Hood</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p n="1">
        Once upon a time there was a wicked wolf [...]
      </p>
      <p n="2">
        Little Red Riding Hood set off through the forest [...]
      </p>
    </body>
  </text>
</TEI>

And in JSON:

{
  "TEI": {
    "teiHeader": {
      "fileDesc": {
        "titleStmt": {
          "title": "The Ballad of Red Riding Hood"
        }
      }
    },
    "text": {
      "body": {
        "p": [
          {
              "n": "1",
              "#text": "Once upon a time there was a wicked wolf [...]"
            },
            {
              "n": "2",
                "#text": "Little Red Riding Hood set off through the forest [...]"
          }
        ]
      }
    }
  }
}

[Example 2]

In the above example, it should be noted that the <body> element contains a number of <p> elements, but each of these contains only a block of text: in other words, regardless of length (potentially massive), each remains a discrete data element.

At this level (within the paragraphs of the above examples), the text ceases to function as a hierarchy: it is, as Buzzetti and McGann assert, “marked by the linear ordering of the string of characters which constitutes it” (Buzzetti & McGann, 2007). Strings of characters in a given order constitute words that, in a given order, constitute sentences (or lines) that constitute paragraphs (or stanzas); and this order is implicit within the text.

The value of XML for encoding textual data, then, lies in its ability to be used inline: that is, inserted into the ordered flow of text. In this usage, as distinct from the representation of discrete data, XML tags serve as descriptive markers for a string of characters, rather than as a means of structuring it. According to Raymond, “[inline markup] is not a data model, it is a type of data representation” (Raymond, 1995). In this regard, as Buzzetti and McGann state, its function is identical to that of any other punctuation mark that adds additional meaning to a word or sentence (Buzzetti & McGann, 2007).

To this must be added that ordered text has what might be called a ‘ground state’: text that does not need to be marked up for some additional semantic value does not need to be marked up at all. The following trivial, and verbose, examples demonstrate the value of this:

<p>
  <neutralString>Once upon a time</neutralString>
  <name>Little Red Riding Hood</name>
  <neutralString>walked home</neutralString>
</p>

[Example 3]

Or even:

<p>
  <word>Once</word>
  <word>upon</word>
  <word>a</word>
  <word>time</word>
  <name>Little Red Riding Hood</name>
  <word>walked</word>
  <word>home</word>
</p>

[Example 4]

It may be argued that the above examples are, in fact, examples of linear text as an ordered hierarchy of content objects. However, though the paragraph has been made into a valid hierarchy of discrete data, the structure of the text is governed by the order of the elements — the inline markup being “dependent on a subset of character positions within textual data” (Raymond, 1995) — and not by the hierarchy. And, even then, the order is still implicit. Converted to JSON, the same data is nothing more than a part-of-speech-tagged bag of words:

{
  "p": {
    "word": [
      "Once",
      "upon",
      "a",
      "time",
      "walked",
      "home"
    ],
    "name": "Little Red Riding Hood"
  }
}

[Example 5]

By the same token, the paragraphs of text in the TEI example above, at their particular hierarchical level, have only an implicit order. It is, I would suggest, this match between the ‘order’ which constitutes the text — ‘naturally’, as it would seem — and the non-disruptive (to this order) nature of inline markup that makes XML so useful for textual encoding. Moreover, maintaining the implicit order of content (at whatever level) in the text simplifies processing: provided the output follows the ‘natural’ order of the text, less work is required (by way of contrast, imagine the JSON example given above, but with each word given an explicit ‘position’ attribute: it would be possible to reassemble the text into its correct order, but only by looking up each word in turn by its position attribute).
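
To make that thought experiment concrete, here is a small sketch (with invented data) of what reassembling a position-attributed bag of words would involve: the order has to be reconstructed by an explicit sort, whereas in the XML examples above it simply falls out of the document.

# Words stored as discrete data items with explicit positions: the text
# can only be put back together by sorting on the position attribute.
tagged_words = [
    {"position": 3, "text": "a"},
    {"position": 1, "text": "Once"},
    {"position": 5, "text": "Little Red Riding Hood"},
    {"position": 4, "text": "time"},
    {"position": 2, "text": "upon"},
    {"position": 7, "text": "home"},
    {"position": 6, "text": "walked"},
]

sentence = " ".join(word["text"] for word in
                    sorted(tagged_words, key=lambda w: w["position"]))
print(sentence)  # Once upon a time Little Red Riding Hood walked home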

The value of XML, then, is its capacity both to represent the discrete (and logically subordinate: paragraphs ‘belonging’ to a chapter) and to describe the continuous, without destroying the nature of the latter. The language is an imperfect fudge between hierarchical and ordered structures. Yet, in the absolute, XML is hierarchical. Take the following sentence:

And then the man shouted, “Johnny Hallyday est un idiot,” before falling over drunk.

Which may be marked up thus:

<sentence>And then the man shouted, "<french><name>Johnny Hallyday</name> est un idiot,</french>" before falling over drunk.</sentence>

[Example 6]

In this example, the ‘embedding’ of the <name> tag within the <french> tag is merely a function of the hierarchical nature of the language, which requires properly-nested elements, rather than the expression of some hierarchical relation between the name of a person and the language used in some dialogue (though, coincidentally, Johnny Hallyday is French). Of course, the reason the tags must be nested in this way is to preserve the ordering of text within the marked spans. This is precisely the reason why problems of ‘overlapping hierarchies’ occur.

There is no intrinsic reason why a given string of characters cannot ‘belong’ to two categories (the French and the name). Indeed, as Michael Witmore has argued in a blog post, text is a ‘massively addressable object’. That is to say, any aspect of the text can be addressed in multiple ways. What would be created is a ‘flat’ ontology, where two semantic descriptors have, effectively, the same status. This is precisely how both relational databases and graph databases operate. However, achieving massive addressability with inline tags that mark spans is impossible, due to overlapping hierarchies. Therefore linear text must be broken down into discrete chunks that can be individually referenced: in short, made ‘addressable’. The obvious consequence of this is that it is no longer linear text. Whether such a trade-off — the processing required to break down a text, explicitly recording its order, and then to ‘reconstruct’ it — is worth the effort is certainly debatable. At some point, with the increasing use of stand-off markup to work round the problems of encoding several overlapping hierarchies and maintaining data coherence, this may prove the simpler approach.
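
As a minimal sketch of the stand-off idea (with character offsets chosen purely for this illustration): each annotation points at a span of the character stream rather than wrapping it, so two annotations are free to overlap or coincide without any question of nesting.

text = ('And then the man shouted, "Johnny Hallyday est un idiot," '
        'before falling over drunk.')

# Stand-off annotations: (start, end, category) triples referring to
# character positions in the text. Overlap poses no problem, because
# the annotations live outside the character stream itself.
annotations = [
    (27, 56, "french"),  # Johnny Hallyday est un idiot,
    (27, 42, "name"),    # Johnny Hallyday
]

for start, end, category in annotations:
    print(category, "->", text[start:end])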


Buzzetti, Dino, and Jerome McGann. “Electronic Textual Editing: Critical Editing in a Digital Horizon.” Electronic Textual Editing. Web.

DeRose, Steven J. et al. “What Is Text, Really?” ACM SIGDOC Asterisk Journal of Computer Documentation 21.3 (1997): 1–24. Print.

Raymond, Darrell R., and Derick Wood. Markup Reconsidered. Technical report, Department of Computer Science, The University of Western Ontario, 1995. Google Scholar. Web. 6 Jan. 2015.

Renear, Allen H., Elli Mylonas, and David Durand. “Refining Our Notion of What Text Really Is: The Problem of Overlapping Hierarchies.” (1993).

Witmore, Michael. “Text: A Massively Addressable Object.” 2010. Web. 22 Sept. 2014.

Read comments / comment on this post

Letters of 1916: Social Network with VisJS

Letters of 1916

Over the last couple of days, I’ve been investigating various libraries for graph visualisations. One thing that came up from the last Letters of 1916 Twitter Chat, where we looked at some of Roman’s visualisations of the letters, was the difficulty of understanding and explaining outliers (like the love letters) when the data represented by each node cannot be accessed.

As a result, I decided to look into interactive graph tools (particularly web-based ones… they’re maybe a bit slower, but I speak passable JavaScript, and almost every modern browser can show them without much difficulty). The simplest seemed to be vis.js, which is astonishingly easy to use (just pass it some JSON for the nodes and edges). So I redirected the Python script I used to generate the graph in the last post to output an html page with the vis.js library included, and hit refresh… It takes an awfully long time to render.
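
For the record, the JSON that vis.js expects really is minimal: an array of nodes and an array of edges. A rough sketch of the generating step (the structure is illustrative; the real script reads the project spreadsheet rather than a hard-coded list):

import json

# (sender, recipient) pairs extracted from the spreadsheet of letters.
letters = [
    ("James Finn", "May Fay"),
    ("James Finn", "May Fay"),
    ("Lady Clonbrock", "James Finn"),
]

people = sorted({name for pair in letters for name in pair})
index = {name: i for i, name in enumerate(people)}
nodes = [{"id": i, "label": name} for name, i in index.items()]
edges = [{"from": index[sender], "to": index[recipient]}
         for sender, recipient in letters]

with open("network.json", "w") as out:
    json.dump({"nodes": nodes, "edges": edges}, out, indent=2)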

One potential use of this web-based approach is incorporating it into a digital (online) edition of some kind: the nodes are just JavaScript objects drawn to an html canvas, so it’s possible to attach any amount of data — names, pop-ups, canonical links — to them. I also like the ability to zoom in easily (scroll) and to drag nodes around to spot the links between larger clusters. (It looks like there are two relatively-tightly linked clusters, that are joined only by a chain of about four people.)

The other thing to note is that this graph tool does not overlay identical edges, so each line now represents a single letter. This creates a sort of tightly-knit bundle for two people who wrote to each other a lot (‘James Finn’ and ‘May Fay’ are connected by so many edges that the graph library still hasn’t managed to stabilise, and they dance around each other and — when zoomed out — seem to flash like pulsars; I feel this is quite romantic, somehow.)

Click to load the full interactive graph 

All the data in this document comes to about 150kb, but then factor in the loading of the javascript library and the rendering of the page in-browser and you might be there a while. (Chrome claims the page is not responding: it is.)

Unidentified persons

Since posting the graph in the previous post, with its big cluster of ‘unknowns’ in the middle, I’ve been trying to think of various ways round this — or, at least, to make it not quite so disruptive to the graph as a whole. Once the corpus is more fully-developed, hopefully a lot more of these people will be identified, but in the meantime I just wanted a way to ‘unbundle’ all the unknowns into discrete unknowns. The question then becomes, “How many unknown people are there in this bundle?”

At one end of the spectrum, we can assume that every single unknown sender or recipient was a distinct person. But this is probably not the case — a quick skim through the Excel file of data shows that particular people just didn’t write the recipient’s full name on the letter, which is quite understandable in the case of family members, for instance. Taking this approach, a great number of additional people are magicked into existence, which makes the graph quite a lot more complicated.

At the other extreme, there is the situation we saw in the last graph, where we assume all the unknowns are one and the same person. This makes Mr. Unknown the most popular person in Dublin by a wide margin, and wildly distorts the graph.

In the end, I settled for somewhere in between, and assumed that each sender or recipient had precisely one unknown correspondent. These have ‘ANONC’ (for unknown creator) and ‘ANONR’ (unknown recipient) appended to their labels. Somewhere in the graph there will be a pair of nodes called just ‘ANONC’ and ‘ANONR’, where both sender and recipient were unknown.
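
In code, the compromise is just a relabelling step applied to each letter before the edges are built. A sketch of the idea (not the actual script):

def label_unknowns(sender, recipient):
    # Give each known person exactly one unknown correspondent: an
    # unknown sender becomes "<recipient> ANONC", an unknown recipient
    # becomes "<sender> ANONR". If both are unknown, fall back to the
    # bare ANONC/ANONR pair.
    if not sender and not recipient:
        return "ANONC", "ANONR"
    if not sender:
        return recipient + " ANONC", recipient
    if not recipient:
        return sender, sender + " ANONR"
    return sender, recipient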

(The other option, which has just occurred to me, would be to remove all letters involving an unknown person. This would have the effect of removing some actual people from the graph entirely.)

The many-names-of-Lady-Clonbrock problem

Another problem, which I identified in the previous post, is the lack of normalisation of individuals’ names. I’m working on ways round this — compiling a dictionary of aliases and having my Python script normalise the names seems the most obvious thing, though, of course, this means knowing who all the people are in the first place.
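
The alias dictionary itself would be little more than a lookup applied before names go into the graph; a minimal sketch (the variant spellings here are made up):

ALIASES = {
    "Lady A. Clonbrock": "Lady Clonbrock",
    "Clonbrock, Lady": "Lady Clonbrock",
    "Jim Finn": "James Finn",
}

def normalise(name):
    # Map known variants of a name onto a single canonical form;
    # unrecognised names pass through unchanged.
    return ALIASES.get(name.strip(), name.strip())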

I’m going to continue work on this: maybe introducing some kind of fuzziness into the searching, or using addresses instead of names, or both, might also be useful.

And finally, it would be nice to know the direction of each letter on the graph: something for another post.

Read comments / comment on this post

Letters of 1916: Preliminary Social Network

Letters of 1916

For a first post on this blog, I thought I’d present a first network graph derived from the senders/recipients of the Letters of 1916 project. (Thanks to Roman Bleier for his Python-based analysis tools, which made picking apart Excel files pleasingly trivial.)

A few preliminary thoughts:

Our corpus congregates around a few individuals (which you’d expect, given the rather self-contained nature of the collections it’s derived from).

But then again, we’ve got lots of individuals, grouped round the edge, only connected by one letter (and not to any wider groups).

Lady Clonbrock was on first-name terms with some people, and not with others.

Most letters were sent to or by ‘anonymous’ (the unlabelled node in the middle) — which is almost certainly quite a few people. As such, the entire graph is massively skewed (actually it’s probably even less interconnected than it looks).

Click to see large image (6Mb)

There are so many flaws with this it’s scarcely worth mentioning them all (but I’ll valiantly try): the data is a raw spreadsheet dump from the ongoing work on the project; the lack of normalisation is a problem (see “Lady Clonbrock [née whatever]”); and the missing data really is a problem (any suggestions, aside from historical investigation?). There is also the fact that I don’t understand the details of the graph-plotting algorithm, so it’s difficult to say exactly in what way this graph represents the corpus.

However, the main thought that arises from this is: what will any sort of analysis of a network of letters (at the level of the letters) actually mean? At worst, it’s indicative only of how we’ve gone about collecting letters (and the people we’ve persuaded to bring them); at best, it might say something about the processes by which some letters have survived a hundred years. Anyway, it is a network of the data of the project — ‘a network of Dublin letters’ is too great a claim for the scanty evidence of this network.

Read comments / comment on this post

Hello

Miscellaneous

Hello digitis[/z]ers.

Welcome to my new blog. This rather rubbish message will be updated in due course. In the meantime, here is a quote about DH:

― Is it about robots?
― No, it is not about robots.

— “What is Digital Humanities?”, Conversation in pub, Dublin, 2014.

Read comments / comment on this post