Even when text is an ordered hierarchy of content objects, it isn’t

It is debatable whether the theory that text is, as proposed by DeRose et al., an ‘Ordered Hierarchy of Content Objects’ (1990), preceded the practical implications of representing text using a hierarchical markup language. Even the abstract to ‘What is Text, Really?’ appears contradictory: within two sentences one finds the utilitarian argument (“this model [using SGML] contains advantages for the writer, publisher, and researcher”), preceded by the theoretical claim (“text is best represented as an OHCO, because that is what text really is” [my emphasis]).

In ‘Refining our notion of what text really is’, Allen Renear demonstrates that a text has “no unique logical hierarchy” that “represents the text itself” (1993). At best, he argues, a text may be treated as consisting of one of several hierarchies representing a given “analytical perspective” (Renear 1993). Examples of this are easy to imagine: a poem may be broken down into a hierarchy of stanzas, though this hierarchy may be incompatible with a hierarchy of grammatical clauses, which may span two stanzas. Thus we have an ‘overlapping hierarchy’, or, in Renear’s terms, two perspectives. And, as he goes on to argue, a perspective which overlaps may be broken down into two sub-perspectives, which, it seems to me, are themselves perspectives (and thus the process can continue down to the atomic level, at which point the whole notion of a hierarchy is redundant).

Renear’s article was written over twenty years ago, and the intervening period has seen the arrival of a new language (XML), the refinement and elaboration of schemas for encoding texts (the TEI), and increasingly sophisticated uses of syntactic tricks to resolve problems of encoding ‘multiple perspectives’. My argument in this post, however, is that text below the structural level (at the level of words and characters) is not discrete, and thus not hierarchical, even when, in practice, problems of ‘overlapping hierarchies’ are disregarded. To phrase this argument differently: at the level of words or characters, a text may be ordered or hierarchical, but not both.

The basis of this argument lies in the way that we use XML (or, formerly, SGML) to mark up texts. As a language for encoding discrete data, XML is hierarchical in the same way as other languages such as JSON and YAML. The following simple examples of XML and JSON, each encoding a small amount of discrete data, contain the same data:

<person>
  <name>
    <forename>John</forename>
    <surname>Smith</surname>
  </name>
  <address>
    <streetAddress>1 Main Street</streetAddress>
    <town>Maynooth</town>
    <country>Ireland</country>
  </address>
</person>

The same data represented in the JSON format:

{
  "person": {
    "name": {
      "forename": "John",
      "surname": "Smith"
    },
    "address": {
      "streetAddress": "1 Main Street",
      "town": "Maynooth",
      "country": "Ireland"
    }
  }
}

[Example 1]
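
The claim that the two encodings hold the same data can be checked mechanically. The following is only a minimal sketch in Python, using the standard library: the helper `element_to_value` is my own illustrative name, and it assumes that sibling elements have unique tag names and carry no attributes, which happens to be true of this record but not of XML in general.

```python
import json
import xml.etree.ElementTree as ET

XML_SOURCE = """
<person>
  <name>
    <forename>John</forename>
    <surname>Smith</surname>
  </name>
  <address>
    <streetAddress>1 Main Street</streetAddress>
    <town>Maynooth</town>
    <country>Ireland</country>
  </address>
</person>
"""

JSON_SOURCE = """
{
  "person": {
    "name": {"forename": "John", "surname": "Smith"},
    "address": {
      "streetAddress": "1 Main Street",
      "town": "Maynooth",
      "country": "Ireland"
    }
  }
}
"""


def element_to_value(element):
    """Return the element's text if it is a leaf, else a dict of its children."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    # Assumption: sibling tag names are unique, so no child overwrites another.
    return {child.tag: element_to_value(child) for child in children}


root = ET.fromstring(XML_SOURCE)
from_xml = {root.tag: element_to_value(root)}
from_json = json.loads(JSON_SOURCE)

# Dictionary comparison ignores key order, which is harmless for discrete data.
same_data = from_xml == from_json
print(same_data)
```

Note that the comparison ignores the order of the keys entirely. For a record of this kind nothing is lost by doing so.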

In the document-structural elements of the TEI, this is also the case (though some TEI-header elements may contain text rather than discrete data):

<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The Ballad of Red Riding Hood</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p n="1">
        Once upon a time there was a wicked wolf [...]
      </p>
      <p n="2">
        Little Red Riding Hood set off through the forest [...]
      </p>
    </body>
  </text>
</TEI>

And in JSON:

{
  "TEI": {
    "teiHeader": {
      "fileDesc": {
        "titleStmt": {
          "title": "The Ballad of Red Riding Hood"
        }
      }
    },
    "text": {
      "body": {
        "p": [
          {
            "n": "1",
            "#text": "Once upon a time there was a wicked wolf [...]"
          },
          {
            "n": "2",
            "#text": "Little Red Riding Hood set off through the forest [...]"
          }
        ]
      }
    }
  }
}

[Example 2]

In the above example, it should be noted that the <body> element contains a number of <p> elements, but these contain only a block of text: in other words, regardless of length (potentially massive), it remains a discrete data element.

At this level (within the paragraphs of the above examples), the text ceases to function as a hierarchy: it is, as Buzzetti and McGann assert, “marked by the linear ordering of the string of characters which constitutes it” (Buzzetti & McGann, 2007). Strings of characters in a given order constitute words that, in a given order, constitute sentences (or lines) that constitute paragraphs (or stanzas); and this order is implicit within the text.

The value of XML for encoding textual data, then, lies in its ability to be used inline: that is, inserted into the ordered flow of text. In this usage, as distinct from the representation of discrete data, XML tags serve as descriptive markers for a string of characters, rather than as a means of structuring. According to Raymond, “[inline markup] is not a data model, it is a type of data representation” (Raymond, 1995). In this regard, as Buzzetti and McGann state, its function is identical to that of any other punctuation mark that adds additional meaning to a word or sentence (Buzzetti & McGann, 2007).

To this must be added that ordered text has what might be called a ‘ground state’: text that does not need to be marked up for some additional semantic value does not need to be marked up at all. The following trivial, and verbose, examples demonstrate the value of this:

<p>
  <neutralString>Once upon a time</neutralString>
  <name>Little Red Riding Hood</name>
  <neutralString>walked home</neutralString>
</p>

[Example 3]

Or even:

<p>
  <word>Once</word>
  <word>upon</word>
  <word>a</word>
  <word>time</word>
  <name>Little Red Riding Hood</name>
  <word>walked</word>
  <word>home</word>
</p>

[Example 4]

It may be argued that the above examples are, in fact, examples of linear text as an ordered hierarchy of content objects. However, though the paragraph has been made into a valid hierarchy of discrete data, the structure of the text is governed by the order of the elements — the inline markup being “dependent on a subset of character positions within textual data” (Raymond, 1995) — and not by the hierarchy. And, even then, the order is still implicit. Converted to JSON, the same data is nothing more than a part-of-speech-tagged bag of words:

{
  "p": {
    "word": [
      "Once",
      "upon",
      "a",
      "time",
      "walked",
      "home"
    ],
    "name": "Little Red Riding Hood"
  }
}

[Example 5]
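
The loss of order can be made concrete. The sketch below (Python, standard library only) parses the paragraph from Example 4 and reads it in two ways: once in document order, and once grouped by tag name. The grouping step is my own stand-in for a naive XML-to-JSON conversion, not the behaviour of any particular converter.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_SOURCE = (
    "<p><word>Once</word><word>upon</word><word>a</word><word>time</word>"
    "<name>Little Red Riding Hood</name>"
    "<word>walked</word><word>home</word></p>"
)

paragraph = ET.fromstring(XML_SOURCE)

# Iterating over an element yields its children in document order,
# so the position of <name> between 'time' and 'walked' is preserved.
in_document_order = [(child.tag, child.text) for child in paragraph]

# Grouping by tag name (a stand-in for a naive conversion to JSON)
# keeps order within each group but loses the interleaving of the groups.
groups = defaultdict(list)
for child in paragraph:
    groups[child.tag].append(child.text)
grouped = dict(groups)

print(in_document_order)
print(grouped)
```

From the grouped form alone there is no way to tell where the name fell in relation to the words. That information existed only in the order of the elements.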

By the same token, the paragraphs of text in the first example, at their particular hierarchical level, have only an implicit order. It is, I would suggest, this match between the ‘order’ which constitutes the text — ‘naturally’, as it would seem — and the non-disruptive (to this order) nature of inline markup that makes XML so useful for textual encoding. Moreover, maintaining the implicit order of content (at whatever level) in the text simplifies processing data: provided the output follows the ‘natural’ order of the text, less work is required (by way of contrast, imagine the JSON example given above, but with each word given an explicit ‘position’ attribute: it would be possible to reassemble the text into its correct order, but only by looking up each word in turn by its position attribute).
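
To illustrate the extra work involved, here is a rough sketch in Python of the alternative: each chunk of Example 4 stored as a separate record with an explicit position. The record layout and the field names (`type`, `text`, `position`) are hypothetical, and I sort by position rather than looking up each position in turn, which amounts to the same reconstruction.

```python
# Hypothetical records: the chunks of Example 4, stored in no particular
# order, each carrying an explicit 'position' within the paragraph.
chunks = [
    {"type": "word", "text": "home", "position": 7},
    {"type": "name", "text": "Little Red Riding Hood", "position": 5},
    {"type": "word", "text": "a", "position": 3},
    {"type": "word", "text": "Once", "position": 1},
    {"type": "word", "text": "walked", "position": 6},
    {"type": "word", "text": "time", "position": 4},
    {"type": "word", "text": "upon", "position": 2},
]


def reassemble(records):
    """Rebuild the linear text by sorting the records on their position."""
    ordered = sorted(records, key=lambda record: record["position"])
    return " ".join(record["text"] for record in ordered)


text = reassemble(chunks)
print(text)
```

The text can be recovered, but only because the order has been written down explicitly and a processing step has been added to restore it.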

The value of XML, then, is its capacity both to represent the discrete (and logically subordinate: paragraphs ‘belonging’ to a chapter) and to describe the continuous, without destroying the nature of the latter. The language is an imperfect fudge between hierarchical and ordered structures. Yet, in the absolute, XML is hierarchical. Take the following sentence:

And then the man shouted, “Johnny Hallyday est un idiot,” before falling over drunk.

Which may be marked up thus:

<sentence>And then the man shouted, "<french><name>Johnny Hallyday</name> est un idiot,</french>" before falling over drunk.</sentence>

[Example 6]

In this example, the ‘embedding’ of the <name> tag within the <french> tag is merely a function of the hierarchical nature of the language, which requires properly-nested elements, rather than the expression of some hierarchical relation between the name of a person and the language used in some dialogue (though, coincidentally, Johnny Hallyday is French). Of course, the reason the tags must be nested in this way is to preserve the ordering of text within the marked spans. This is precisely the reason why problems of ‘overlapping hierarchies’ occur.
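
Both halves of this point can be tested with a standard XML parser. The sketch below (Python, standard library) first recovers the linear sentence from the nested markup of Example 6, and then shows that an overlapping variant is rejected as not well-formed. The overlapping string is a contrived example of my own, and `is_well_formed` is simply an illustrative helper name.

```python
import xml.etree.ElementTree as ET

NESTED = (
    '<sentence>And then the man shouted, "<french><name>Johnny Hallyday</name> '
    'est un idiot,</french>" before falling over drunk.</sentence>'
)

# Contrived for illustration: <name> opens inside <french> but closes after it.
OVERLAPPING = (
    "<sentence><french>Il dit <name>Johnny</french> Hallyday</name></sentence>"
)


def is_well_formed(source):
    """Return True if the parser accepts the string as well-formed XML."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


# itertext() walks the tree in document order, so joining its output
# gives back the sentence exactly as it ran before the tags were added.
linear_text = "".join(ET.fromstring(NESTED).itertext())

print(linear_text)
print(is_well_formed(NESTED), is_well_formed(OVERLAPPING))
```

The nested version gives up its text untouched, while the overlapping version cannot be parsed at all.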

There is no intrinsic reason why a given string of characters cannot ‘belong’ to two categories (the French and the name). Indeed, as Michael Witmore has argued in a blog post, text is a ‘massively addressable object’. That is to say, any aspect of the text can be addressed in multiple ways. What would be created is a ‘flat’ ontology, where two semantic descriptors have, effectively, the same status. This is precisely how both relational databases and graph databases operate. However, achieving massive addressability with inline tags that mark spans is impossible, due to overlapping hierarchies. Linear text must therefore be broken down into discrete chunks that can be individually referenced: in short, made ‘addressable’. The obvious consequence of this is that it is no longer linear text. Whether such a trade-off (the processing required to break down a text, explicitly record its order, and then ‘reconstruct’ it) is worth the effort is certainly debatable. At some point, with the increasing use of stand-off markup to work round the problems of encoding several overlapping hierarchies and maintaining data coherence, this may prove the simpler approach.
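
As a rough illustration of what such an ‘addressable’ encoding might look like, the Python sketch below splits the quoted phrase into tokens and records annotations separately, as spans of token positions. This is a toy stand-off model of my own, not the TEI’s or any tool’s actual format; the `stressed` label in particular is invented purely to produce a span that overlaps the name without being nested in it.

```python
TEXT = "Johnny Hallyday est un idiot"

# Break the linear text into discrete, individually addressable chunks.
tokens = TEXT.split()

# Hypothetical annotations as (label, start, end) over token indices,
# with 'end' exclusive. 'name' and 'stressed' overlap without nesting.
annotations = [
    ("french", 0, 5),
    ("name", 0, 2),
    ("stressed", 1, 3),
]


def labels_for(token_index):
    """Return every label whose span covers the given token."""
    return sorted(
        label for label, start, end in annotations if start <= token_index < end
    )


def text_of(label):
    """Reconstruct the text covered by the first annotation with this label."""
    for name, start, end in annotations:
        if name == label:
            return " ".join(tokens[start:end])
    return None


print(labels_for(1))
print(text_of("name"), "|", text_of("stressed"))
```

All three descriptors have the same status, and the overlap causes no difficulty. The cost is the one described above: every span of text now has to be reconstructed from the tokens.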


Buzzetti, Dino, and Jerome McGann. “Electronic Textual Editing: Critical Editing in a Digital Horizon.” Electronic Textual Editing. Web.

DeRose, Steven J. et al. “What Is Text, Really?” ACM SIGDOC Asterisk Journal of Computer Documentation 21.3 (1997): 1–24. Print.

Raymond, Darrell R., and Derick Wood. Markup Reconsidered. Technical report, Department of Computer Science, The University of Western Ontario, 1995. Google Scholar. Web. 6 Jan. 2015.

Renear, Allen H., Elli Mylonas, and David Durand. “Refining Our Notion of What Text Really Is: The Problem of Overlapping Hierarchies.” (1993).

Witmore, Michael. “Text: A Massively Addressable Object.” 2010. Web. 22 Sept. 2014.

Comments
rhadden

2017-01-15 13:55:09

Thanks for your reply! (Normally only get spam on this blog.) I'm not sure what my mindset was when I wrote this (two years ago!), but here's my response now.

Firstly, I think the bit of Renear you've cited (especially point 5) implies that text isn't an OHCO.

The question I would ask is 'what is a content object?' I think the point on which I disagree with you is your assertion that text is language (and nothing more): by doing so, you pre-define a set of valid, compatible 'content objects' -- by taking words as a content object, each with a grammatical function, you are able to build upwards into a syntax tree. But then we should ask, is text reducible to grammatical sentences? You might also run into problems with higher sets of 'formal chunk'. Imagine a novel with a sentence that started in one chapter and finished in the next chapter (I can't think of an example, but I'm sure some funky author or other has done it!) Then there are other examples of OHCO that are usually cited (grammatical syntax vs poetic lines).

In answer to your last question then, I'd go with Michael Witmore's description of text as a "massively addressable object". Or a graph, or a rhizome. (To be a little esoteric, maybe text is a Schrödinger's cat kind of quantum superposition: all possible hierarchical and non-hierarchical structures exist at once, but by engaging with the text we can't help but bring some perspective that constructs a particular hierarchy).

Not that this definition is necessarily useful as a way to think of, or especially encode, text. Hence OHCO, XML etc., especially versus a graph model: using XML, you have to hack in "alternative" hierarchies, but you get one default hierarchy for free (hence you can select a single content object in one go, as opposed to reconstructing it, as you would have to if you encoded each word separately with a number representing its order in the text).

I hope that's a good answer :)

Anthony Durity

2017-01-15 12:56:07

Renear et al. state in their conclusion,

“So we have the following positive conclusions:

1. Perspectives -- theories, methodologies, and analytical practices -- are at least as important as genre in the identification of text objects [15]
2. Perspectives frequently determine hierarchies of objects
3. Non-hierarchical perspectives can often be decomposed into hierarchical sub-perspectives

But we note:

4. Perspectives do not always determine hierarchies
5. Non-hierarchical perspectives cannot always be decomposed into hierarchical sub-perspectives”

Discrete data is a term which makes no sense to me. Language is made up of discrete components, that's what makes it language.

We're talking about _texts_ here, not random chunks of language. Texts are almost OHCO by _definition_ I would assert. The sentence is decomposed grammatically into a syntax tree, and from there on up through formal chunks like the paragraph, section, chapter, part, and so on the text can be decomposed into its constituent parts.

I put it to you this way, if texts are not OHCO, then what are they?

:)
