Woodman Diary: pre-compiling for zero-infrastructure

Since the advent of database-backed websites, beginning with CGI scripts, continuing through languages such as Perl and PHP, and latterly culminating in web frameworks such as Django (written in Python) and Ruby on Rails, it would be only moderate hyperbole to suggest that no-one writes HTML any more. More precisely, HTML is now used to write templates, which are filled with content and compiled into the pages sent to a user’s web browser. There are essentially two reasons for this. Firstly, manually writing HTML for each page involves a degree of repetition that a tiny amount of programming can easily make redundant. Every HTML document requires a number of elements that are common to all pages: a head element containing metadata, a title element, and links to scripts and CSS stylesheets; each page will probably also incorporate the same navigation and footer elements. Keeping these elements, which will never change or will change predictably, separate from the actual content of each page makes the site far more maintainable (a change to the header only needs to be made in one place) and avoids a great deal of repetitive typing. It is for this reason that PHP was initially designed: less as a programming language, and more as a way of including standard content across a number of pages without having to type it out manually.
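To make the first point concrete, here is a minimal sketch of the "write the shell once" idea, using Python's standard `string.Template` rather than PHP or any real framework. The page names, footer text and `render_page` helper are all invented for illustration.

```python
from string import Template

# Shared page shell: written once, reused for every page.
PAGE = Template("""<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>$title</title>
  <link rel="stylesheet" href="style.css">
</head>
<body>
  <nav><a href="index.html">Home</a></nav>
  <main>$content</main>
  <footer>$footer</footer>
</body>
</html>
""")

# Hypothetical footer text; changing it here changes it on every page.
FOOTER = "Example footer text"


def render_page(title: str, content: str) -> str:
    """Insert page-specific content into the shared shell."""
    return PAGE.substitute(title=title, content=content, footer=FOOTER)


# Hypothetical pages: only the title and body differ between them.
pages = {
    "about.html": render_page("About", "<p>About this site.</p>"),
    "contact.html": render_page("Contact", "<p>How to get in touch.</p>"),
}
```

The sketch does no HTML escaping, so it is only safe for content the site author controls; a real templating system would handle that for you.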

The second factor to be considered is the nature of the content in question. The web today is less a standard way of representing and linking digital documents (as Tim Berners-Lee designed it to be) and more a means for delivering dynamic and interactive content. It would, for instance, be inconceivable that a site such as Facebook could exist if every page were a separate document that had to be written manually in HTML, or even generated as HTML in advance. For this kind of site, HTML is merely the format in which content is delivered so that the browser can render it; the data of the site is held separately in a database, and pages are created by firstly querying the database, and secondly inserting the result into a template. Even sites with less dynamic content (WordPress blogs; newspaper websites) employ a database-backed approach, though this is, I would argue, a pragmatic decision rather than a necessity.

Whilst we were researching the options for building the site infrastructure for the Albert Woodman’s Diary project, these two factors were central to our considerations (not to mention the pragmatics of actually getting a given approach to work within a time constraint). The first, avoiding repetition, was both clearly beneficial and, as the data was to be encoded in TEI-XML, a given anyway. XSLT is the standard approach for converting an XML document into another XML document (in this case XHTML, which is essentially HTML expressed as an XML vocabulary), and so functions as a templating language in the same way that PHP or ERB (the templating language used with Rails) allows the insertion of arbitrary content into HTML. Using XSLT templates, the ‘standard’ content of each page would only need to be written once.

The more important question to be considered was the extent to which a database was required to store and query the content. Implicitly underlying this decision was the relative complexity of getting a database (especially an XML database) set up and running. As I previously noted, the necessity of querying underlying data depends on precisely how dynamic the content of a site is, but it depends equally on the complexity of the query required to generate a given page. WordPress, for example, is backed by a database: a user request for a page results in the database being queried for a given post’s content, the content being inserted into a standard HTML template, and the HTML being returned to the user’s browser and rendered. In this model, there is no particular reason why each blog post could not be pre-rendered as HTML and sent to the user on request without requiring a database (indeed, this is precisely what various WordPress caching plugins do). One reason for this is the relative simplicity of the query itself: request a single row from a single table. A more important factor is the predictability of this query. If a page is to be pre-rendered, it is necessary to anticipate that a user will issue a page request that corresponds to a given query. In the case of a blog, this is relatively obvious: a user will wish to see each of the posts. Even more complex queries, such as a page listing the posts in each category, can be anticipated and pre-rendered.[1] A final factor is the number of query permutations: a user wishing to see a page of posts belonging to either of two categories will require a page to be pre-generated for each combination of two categories, which could lead to potentially thousands of pre-rendered pages, making an on-the-fly database query much more effective.
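The arithmetic behind that last point can be sketched in a few lines. The figures below (500 posts, 100 categories) are purely hypothetical, chosen only to show how quickly pairwise combinations outgrow the simple cases.

```python
from math import comb


def pages_needed(n_posts: int, n_categories: int) -> dict[str, int]:
    """Count the pages to pre-render for each kind of anticipated query."""
    return {
        "post pages": n_posts,                             # one page per post
        "single-category listings": n_categories,          # one per category
        "two-category listings": comb(n_categories, 2),    # one per unordered pair
    }


# Hypothetical blog: 500 posts filed under 100 categories.
counts = pages_needed(n_posts=500, n_categories=100)
print(counts["two-category listings"])  # prints 4950
```

With these assumed figures, the two-category listings alone outnumber every other page on the site combined, which is the point at which querying on demand starts to look more sensible than pre-rendering.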

This line of thought was also applied to Woodman’s Diary in deciding whether to have a site backed by a database or simply to convert the XML to HTML: Would the content change? How arbitrary a query would be required for a user to navigate the content? How many pages would need to be rendered in such a case? And, most pragmatically, would the technical complexity of setting up a native XML database outweigh the benefits? Clearly, the content of the diary is static, so the pages could be rendered beforehand. Given the number of diary entries (a couple of hundred), using XSLT to generate each page, and the links between them, in advance was clearly plausible. Likewise, a page corresponding to each named entity (again, a fairly large but unchanging number) could be pre-rendered. In this respect, the complexity of both setting up an XML database (such as eXist) and learning XQuery[2] was not worth it in terms of the end result, compared to simply pre-compiling every HTML page and sticking them on a web server.
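As a rough illustration of what "pre-compiling every page and the links between them" involves, here is a minimal sketch in Python. It is not the XSLT the project used: the entry ids, titles, text and the `render_site` helper are all invented, and in practice the entries would be read from the TEI-XML rather than typed into a list.

```python
from html import escape

# Hypothetical entries, in diary order: (id, title, body text).
entries = [
    ("entry-001", "First entry", "Text of the first entry."),
    ("entry-002", "Second entry", "Text of the second entry."),
    ("entry-003", "Third entry", "Text of the third entry."),
]


def link(target, label):
    """Return a link to a neighbouring entry, or nothing at either end."""
    if target is None:
        return ""
    return f'<a href="{target[0]}.html">{label}</a>'


def render_site(entries):
    """Build every entry page up front, keyed by output filename."""
    site = {}
    for i, (entry_id, title, body) in enumerate(entries):
        prev_entry = entries[i - 1] if i > 0 else None
        next_entry = entries[i + 1] if i < len(entries) - 1 else None
        site[f"{entry_id}.html"] = (
            f"<h1>{escape(title)}</h1>\n"
            f"<p>{escape(body)}</p>\n"
            f"<nav>{link(prev_entry, 'Previous')} {link(next_entry, 'Next')}</nav>\n"
        )
    return site


site = render_site(entries)
print(len(site))  # prints 3
```

Because the set of entries is fixed, every previous/next link can be worked out at build time; nothing needs to be looked up when a reader requests a page.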

Of course, what is lost by this approach, as suggested above, is the ability of the user to arbitrarily query the data, either using the XML structure (“give me every day in June that’s a Sunday and mentions Woodman’s wife”) or through full-text searching. In part, this functionality can be achieved by, for instance, a filterable list of named entities and terms; otherwise, it was felt that arbitrary queries were unnecessary. A user wishing to perform complex queries would, anyway, be able to download the underlying TEI documents and manipulate them as they wished.
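For a sense of what such a query might look like for a user working from the downloaded files, here is a sketch using Python's standard `xml.etree.ElementTree`. The TEI namespace, the `when` attribute on `date` and the `ref` attribute on `persName` are standard TEI, but the structure of the sample (one `div type="entry"` per day, a `#wife` identifier) is an assumption made for illustration, and the dates and text are invented; the project's actual encoding may well differ.

```python
import xml.etree.ElementTree as ET
from datetime import date

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample; the element layout is assumed, not taken from the project.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <div type="entry"><date when="1916-06-04"/><p>Wrote to <persName ref="#wife">my wife</persName>.</p></div>
  <div type="entry"><date when="1916-06-05"/><p>Letter from <persName ref="#wife">my wife</persName>.</p></div>
  <div type="entry"><date when="1916-06-11"/><p>Quiet day.</p></div>
  <div type="entry"><date when="1916-07-02"/><p>Saw <persName ref="#wife">my wife</persName>.</p></div>
</body></text></TEI>"""


def sundays_in_june_mentioning(root, person_ref):
    """Return ISO dates of June Sundays whose entry mentions person_ref."""
    matches = []
    for entry in root.iterfind(".//tei:div[@type='entry']", TEI):
        day = date.fromisoformat(entry.find("tei:date", TEI).get("when"))
        mentioned = any(
            name.get("ref") == person_ref
            for name in entry.iterfind(".//tei:persName", TEI)
        )
        # weekday() counts from Monday = 0, so Sunday is 6.
        if day.month == 6 and day.weekday() == 6 and mentioned:
            matches.append(day.isoformat())
    return matches


root = ET.fromstring(SAMPLE)
result = sundays_in_june_mentioning(root, "#wife")
print(result)  # prints ['1916-06-04']
```

This is exactly the kind of one-off question that is easy to answer with the source XML in hand, and that a pre-rendered site cannot anticipate.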


[1] Later blogging platforms, notably Jekyll (a static blog generator written in Ruby), take this approach: the blogger writes posts using Markdown syntax, which the generator compiles into the HTML for a blog site, with a home page, post pages, category pages, and all the links between them. The blogger then uploads all the HTML files to a basic web server, and does not have to bother with configuring a database.

[2] Using a relational database, such as MySQL, was not considered an option. Even a cursory analysis shows that putting a given piece of content into a relational database table would require a kind of quasi-rendering of the XML data, which makes it no more beneficial than pre-rendering the HTML page.

