Spreading the Data Around – An Annotated Bibliography

After I finished writing an earlier post about Jordan Mechner’s experience with his Prince of Persia source code, I started thinking about methods of keeping data safe and accessible in the humanities. While the issue faced by Mechner was one of his data existing on floppy disks in an outdated format, floppy disks themselves are a thing of the past. Even forms of portable data storage that were common only a few years ago—USB flash drives and SD cards—are quickly being replaced by something new: Cloud Storage. But is just throwing our data up on servers enough to keep it preserved? Now, to be fair, losing large amounts of data on multiple occasions has made me somewhat paranoid when it comes to keeping my files safe. But corporations, archivists, and even hobbyists often raise similar questions, and many scholars and professionals in a variety of fields have written about those issues and possible solutions.

Having researched what others have to say about preservation, I currently feel that the ideal solution would be to enlist the interested public in a kind of decentralized group storage, a combination of crowdsourcing and cloud storage (“Crowdstorage,” if you will) in which the population working on a project also participates in its preservation and curation. This, however, raises a number of issues about copyright law and intellectual property, and is probably impossible in many cases.

So, at this juncture, rather than chasing rainbows with the idea of throwing the data to the crowd, I thought I’d let the sources I read speak for themselves—with some annotations of my own, of course.

Digital Preservation Coalition (DPC). Preservation Management of Digital Materials: The Handbook. November 2008. Web. Accessed 25 November 2014. <http://www.dpconline.org/component/docman/doc_download/299-digital-preservation-handbook>
Originally a print book, this handbook now exists as a PDF file online. The 160-page document details the Digital Preservation Coalition’s best practices for handling digital materials. Of particular interest is Chapter 4.3, “Storage and Preservation” (p. 103), which is a central topic of the book in spite of appearing near the end. One important point noted in the chapter is the concept of keeping both “preservation” and “access” copies in digital form, much as may be done in an analogue archive: one copy exists to preserve the original data, while another exists for people to access and use (further discussed in Chapter 4.5, “Access,” on p. 122). The document asserts that “not all resources can or need to be preserved forever” and urges deciding what to preserve (and for how long) as early as possible (p. 103).
Notably, this large document makes little mention of storage techniques beyond physical media, and the use of online, decentralized servers is barely mentioned.  Third-party services are discussed in Chapter 3.3 (p. 51), but the discussion is focused more on legality and logistics and less on what services third parties may actually provide.
In summary, while the handbook does not address off-site networked storage directly, it does an excellent job of framing existing preservation practices in the Digital Humanities, as well as common issues and considerations for preservationists to be aware of.

Gantz, John F. et al. The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010. Framingham, MA: IDC Information and Data, 2007. Web. Accessed 27 November 2014. <https://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf>
Though roughly seven years old, the forecast by Gantz et al. does a fantastic job of illustrating the particular challenges faced in Digital Curation. The report calculates that, in 2006, the amount of data produced was 161 billion gigabytes, about three million times the amount of information in every book ever written (p. 1). The paper predicted that that value would increase at a compound rate of 57% every year to 2010, meaning that by the beginning of the current decade, the forecast expected the digital world to be 988 billion gigabytes in size (p. 3). Personally, given the proliferation of streaming entertainment in the last ten years and the increased size of digital downloads, not to mention the explosive growth of social media, I believe that the estimate put forth by Gantz et al. is accurate. The analyses of data production in this document are a fascinating read for anyone interested in digital information, but two topics that are particularly compelling for this post’s purposes have to do with storage. First, Gantz et al. predict that, in 2010, companies will be producing only about 30% of the digital universe (with the other 70% created by users), but will be responsible for safeguarding over 85% of it (p. 9). With the advent of cloud computing, this certainly appears to have come to pass. Second, there is the issue of long-term storage, with digital media degrading far more quickly than its analogue counterpart, so much so that keeping digital data valid requires transferring it to new media every few years (p. 11). All in all, this analysis illustrates that the issue of archiving and storage is not limited to the Digital Humanities or even to archivists; it is faced by anyone dealing with any kind of data, contemporary or otherwise, that needs to be safeguarded and accessed digitally.
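As a quick sanity check on those figures, here is a minimal sketch in Python. It assumes the 57% rate compounds once a year over the four years from 2006 to 2010; that horizon is my own reading rather than something taken from the report, and the function names are mine.

```python
# Figures as quoted above from Gantz et al.; the four-year horizon is an assumption.
BASE_2006 = 161.0        # billion gigabytes produced in 2006
FORECAST_2010 = 988.0    # billion gigabytes forecast for 2010
ANNUAL_GROWTH = 0.57     # compound annual growth rate
YEARS = 2010 - 2006


def compound(base, rate, years):
    """Size after `years` of growth at a fixed annual `rate`."""
    return base * (1 + rate) ** years


def implied_rate(start, end, years):
    """Annual rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years) - 1


projected = compound(BASE_2006, ANNUAL_GROWTH, YEARS)
rate = implied_rate(BASE_2006, FORECAST_2010, YEARS)

print(f"Projected 2010 size: {projected:.0f} billion GB")
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption the projection comes to about 978 billion gigabytes, within roughly one percent of the 988 billion quoted, and the rate implied by the two endpoints is a little over 57%. The small gap is what one would expect from a rounded growth rate.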

Inter-University Consortium for Political and Social Research. Guide to Social Science Data Preservation and Archiving, 5th Edition. Ann Arbor, MI: Institute for Social Research University of Michigan, 2012. Web. Accessed 27 November 2014. <http://www.icpsr.umich.edu/files/deposit/dataprep.pdf>
This document outlines the methods and guidelines for data preservation and archiving used by the Inter-University Consortium for Political and Social Research (ICPSR). The paper identifies a number of areas that need to be considered before data is deposited in a public archive, and notable among these are issues of intellectual property and copyright (p. 11). The guidelines remind users that the owners of data may not be the same as the individuals who collect it (as can be the case when researchers work on behalf of institutions), and also may not be the same as the entities responsible for archiving that data (p. 11). Both of these issues need to be considered when data is archived and distributed. While institutions are likely to expect their data to be shared to some degree, they often place particular limitations on the manner in which that sharing may occur, as well as the audience with whom the sharing may take place. Furthermore, there is more than one method of sharing, and five common approaches are outlined on page 12. While an archivist’s primary concern may be preservation of data, the practice goes hand-in-hand with making the data accessible to appropriate audiences, and so issues of sharing and accessibility need to be considered from the very beginning of the preservation process. After all, if no one is going to be given access to the data, why is it being preserved in the first place? Other sections of this document, particularly “Importance of Good Data Management” (p. 19) and “Master Datasets and Work Files” (p. 33), contain valuable information for those interested in protecting data, but I would recommend the document in its entirety as required reading for anyone beginning a Digital Curation project for the first time.

Kraus, Kari and Rachel Donahue. “‘Do You Want to Save Your Progress?’: The Role of Professional and Player Communities in Preserving Virtual Worlds.” Digital Humanities Quarterly 6.2 (2012). DHQ. Web. Accessed 25 November 2014. <http://digitalhumanities.org:8080/dhq/vol/6/2/000129/000129.html>
Imagine my surprise when I stumbled across an article discussing exactly the concept that got me thinking about using the crowd for preservation in the first place: the actions of the video game playing community with regard to old and outdated software titles. Kraus and Donahue discuss in detail the issues that make video games particularly susceptible to decay and loss, even relative to other born-digital artefacts.  Here are a few examples that are particularly challenging for video games: software and hardware dependence, particularly in the case of titles for consoles or arcade cabinets; lack of documentation from the development or annotations for tools used; questions of authenticity as to whether a program has been free from later adjustment or tampering; and intellectual property rights, often closely guarded by large companies even when they have no plans to license or reuse an outdated title.
Kraus and Donahue go on to examine the preservation measures taken by the different groups involved in video games: the companies that make the games and the players themselves. The article goes into a lot of detail about various types of preservation and gives a number of examples, but it essentially boils down to this: the player community is what is preserving these artefacts, not professional archivists, nor even the creators of the games themselves.

McClean, Tom. “Not with a Bang but a Whimper: The Politics of Accountability and Open Data in the UK.” 15 August 2011. Web. Accessed 27 November 2014. <http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1899790>
This is an important paper as a counterpoint to my idealized vision of using the crowd for curation. McClean outlines the British government’s 2010 decision to “begin the online publication of historical data from the Combined Online Information System, known as COINS” (p. 1). The goal was transparency and accountability in terms of budget and expense in the public sector. The essay argues that the release of data has had little impact so far, and gives some reasons for coming to this conclusion. The opening of the essay is heavy on political background, since McClean asserts that politics were the main driving force behind the release of the data (p. 2). Similarly, McClean cites several political and bureaucratic reasons for the data release’s failure to meet expectations (p. 5). However, one argument in McClean’s analysis of the data release itself is very enlightening for the purposes of this blog: that the raw data is far too complex to be useful, and that no tools are provided to assist the public in interpreting it (p. 8). For example, McClean writes: “On the most basic level, the dataset is simply too large to be processed with commonly-available software packages such as Microsoft Excel; doing so requires slightly more specialized software, and this alone probably puts it out of reach of a good many ordinary citizens” (p. 8). Furthermore, McClean notes that the data needed to be reformatted and otherwise modified in order to even be viable for interpretation (p. 9). Although McClean neglects to cite any sources supporting his claim that “after a brief burst of publicity immediately after its release, COINS has not featured prominently in media reports” (p. 9), one can still assume that the difficulties outlined in his writing are real barriers to the free use of the released data.
McClean’s essay highlights the dangers of simply releasing data to the public and assuming that they will have the resources, knowledge, and inclination to use it. Without providing cleanly formatted data and the tools necessary for interpretation, the decision to release the contents of COINS seems far more political than practical.
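To give a sense of what “slightly more specialized software” might mean in practice, here is a minimal sketch in Python of summarizing a file one row at a time instead of loading it into a spreadsheet. The column names `department` and `amount` and the sample rows are hypothetical and are not taken from COINS; the row limit is that of a modern Excel worksheet.

```python
import csv
import io
from collections import defaultdict

EXCEL_MAX_ROWS = 1_048_576  # row limit of a modern Excel worksheet


def total_by_department(lines):
    """Sum the hypothetical `amount` column per `department`, one row at a time."""
    totals = defaultdict(float)
    row_count = 0
    for row in csv.DictReader(lines):
        totals[row["department"]] += float(row["amount"])
        row_count += 1
    return dict(totals), row_count


# Tiny stand-in for a real export; a file opened with open(...) works the same way.
sample = io.StringIO(
    "department,amount\n"
    "Health,120.5\n"
    "Education,80.0\n"
    "Health,29.5\n"
)

totals, row_count = total_by_department(sample)
print(totals)
print("Too large for one worksheet:", row_count > EXCEL_MAX_ROWS)
```

Because only one row is held in memory at a time, the same loop would cope with a file of many millions of rows. It also illustrates McClean’s point: even a short script like this assumes a degree of comfort with programming that many citizens do not have.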

National Records of Scotland. The National Records of Scotland and born digital records: a strategy for today and tomorrow. September 2014. Web. Accessed 27 November 2014. <http://www.nrscotland.gov.uk/files/record-keeping/nrs-digital-preservation-strategy.pdf>
This extremely recent white paper, barely two months old, details a five-year plan put forth by the National Records of Scotland for preserving the digital artefacts in its archive, with the goal of being “efficient, flexible, high quality, and trustworthy” (p. 3). The report states that “the Scottish Government’s corporate electronic document and records management system holds 10.5 terabytes of records” (p. 3), and that is only part of the material for which the National Records of Scotland is potentially responsible. The challenge is compounded by the fact that the National Records apparently does not have the freedom to decide what to preserve and what to discard, but is instead obliged to keep all the records that other government entities pass to it for preservation (p. 4). This, coupled with the fact that each institution has its own idiosyncrasies in the formats in which it produces data, creates an especially difficult situation. The National Records recognizes that it cannot “do everything now and then walk away from the issue” (p. 4), and thus has adopted a forward-thinking approach to preservation. This approach is particularly important in light of the fact that the National Records of Scotland currently holds only about 256GB of data (p. 8), far less than it may soon be asked to curate. For the purposes of this blog post, the white paper illustrates the immediate contemporary issues faced by a large-scale repository. While it does not propose a firm solution to the National Records of Scotland’s situation, I am interested to see whether the institution partners with other groups in the governmental structure to alleviate issues of storage and access.

O’Carroll, Aileen et al. Caring for Digital Content: Mapping International Approaches. Maynooth: NUI Maynooth (2013). Web. Accessed 27 November 2014. <http://dri.ie/caring-for-digital-content-2013.pdf>
The Digital Repository of Ireland is one of the country’s leaders in digital preservation, and this publication is geared specifically toward examining different approaches to keeping digital content. Of particular note for the purposes of this blog post is the section “Multi-Site Digital Repositories,” which details a number of repositories that “host data within a federated structure that allows sharing of metadata and data across institutions” (p. 15). The five repositories profiled in this section (each with a link) have decentralized their data by creating a structure wherein the data is stored in multiple interconnected locations. This not only allows multiple institutions to pool their resources and work more effectively by sharing information, but also safeguards against data loss by not keeping all of their data in one place. One very interesting thing to note is that, while the five institutions (ARTstor, the Digital Repository of Ireland, the IQSS Dataverse Network, OpenAIRE, and the Texas Digital Library) all operate under the same principle of decentralizing their data, each goes about that process very differently. ARTstor, for example, provides institutions with tools to store and publish data privately or publicly (p. 13), while IQSS “delegates the access controls to its users” while keeping the systems themselves centralized (p. 14). Caring for Digital Content includes a large amount of information on other types of collections and services in addition to those mentioned in “Multi-Site Digital Repositories,” constituting a broad overview of data curation strategies in a number of different countries. While the primary goal of the publication was to help the Digital Repository of Ireland “future-proof” its own digital collection (p. 27), the institution has done the broader field of Digital Humanities a great service in making its research findings available to the public at large.

Online Computer Library Center, Inc. and The Center for Research Libraries. Trustworthy Repositories Audit & Certification: Criteria and Checklist. Dublin, Ohio: Online Computer Library Center, Inc., 2007. Web. Accessed 27 November 2014. <http://www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf>
This document, created through the cooperation of the Research Libraries Group and the National Archives and Records Administration of the United States, outlines the criteria for certification by those bodies of a “Trustworthy Repository.” The certification process and requirements are valuable for the purposes of this blog post as they outline the expectations for an official digital repository, which would be a very different entity from the idealized “Crowdstored” collection suggested in the introduction. Among these requirements is that of specifying legal permissions for the repository’s contents (p. 13), which again highlights the importance of considering permission and copyright law when maintaining a repository, as well as the special arrangements regarding intellectual property that would have to exist if a collection were to be stored and curated by the crowd. The requirement that would probably pose the greatest obstacle to a crowd-curated collection is this: “Repository commits to defining, collecting, tracking, and providing, on demand, its information integrity measurements” (p. 15). In a collection where data is stored freely by users, tracking changes to the data becomes difficult, and keeping every instance of the data free from corruption is nearly impossible. Finding ways to counter this problem would be an important next step in developing a collection of this type.
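One common way to approach that kind of integrity tracking is to record a checksum for every file and compare each copy against the record. The sketch below is a minimal illustration in Python, not something the checklist prescribes: the choice of SHA-256, the function names, and the “master” and “copy” folders are all my own assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its checksum."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(manifest: dict, copy_root: Path) -> list:
    """List the files in a contributor's copy that are missing or altered."""
    problems = []
    for name, expected in manifest.items():
        candidate = copy_root / name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(name)
    return problems


# Demonstration with two throwaway folders standing in for a master and a crowd copy.
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "master"
    copy = Path(tmp) / "copy"
    for root in (master, copy):
        root.mkdir()
        (root / "notes.txt").write_text("field notes")
        (root / "scan.dat").write_bytes(b"\x00\x01\x02")
    (copy / "scan.dat").write_bytes(b"\x00\x01\x03")  # simulate silent corruption

    manifest = build_manifest(master)
    damaged = verify_copy(manifest, copy)

print(damaged)
```

A check like this only detects damage; it does not say which copy is authoritative or who is responsible for repairing it, which is exactly where a crowd-stored collection would need some agreed process.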

Poole, Alex H. “Now is the Future Now? The Urgency of Digital Curation in the Digital Humanities.” Digital Humanities Quarterly 7.2 (2013). DHQ. Web. <http://digitalhumanities.org:8080/dhq/vol/7/2/000163/000163.html>
This article opens by referencing the 2006 report by the American Council of Learned Societies, in which the group stated that accessibility by the public was a critical next step in data curation. The essay states that, as a result of this report, many entities “since have embraced the importance of promoting the digital humanities through democratized access to and an expanded audience for cultural heritage materials and through posing new research questions — indeed, they have done so at an accelerating rate” (n.p.). Section II of the article goes into detail about the different incentives and disincentives to sharing curated data, noting that shared data allows for a much wider degree of computation, analysis, and utilization, but on the other hand also opens up the data to misuse and corruption (n.p.). Furthermore, the essay states that sharing data has another potential downside, that of giving an unscrupulous researcher the opportunity to steal—and get credit for—another’s work (n.p.). The paper goes on to discuss developments in digital curation in the seven years following the report by the American Council of Learned Societies, and in Section IV details a few case studies in the United Kingdom. Most notable for the purposes of this blog post is that “researchers shared their methods and tools more freely than their experimental data,” being much more careful with the data that was often tied to their livelihood (n.p.). Furthermore, “[p]ersonal relationships loomed large in researchers’ willingness to share their data externally; conversely, they felt apprehensive about cyber-sharing” (n.p.), suggesting that researchers were often very nervous about losing control over their data. These findings suggest that researchers often treat digital data in the same ways as analogue data in terms of sharing and control, and that the concept of “Crowdstorage” will run into similar obstacles where academic data is concerned.

Wharton, Robin. “Digital Humanities, Copyright Law, and the Literary.” Digital Humanities Quarterly 7.1 (2013). DHQ. Web. <http://digitalhumanities.org:8080/dhq/vol/7/1/000147/000147.html>
As many of the other sources on this list suggest, the availability of digital data has many ramifications when it comes to intellectual property and copyright law. This article looks at copyright law from primarily a United States perspective, particularly with regard to Section 101, Title 17 of U.S. Copyright Code, which concerns itself with literary objects. The section “U.S. Copyright Law and the Literary” emphasizes that, in the United States, “[o]nly copying of expression gives rise to copyright infringement; use of the ideas does not” (n.p.), which draws a kind of legal box around what is “literary” and what is merely an idea. All of that, however, is somewhat tangential to the issues faced by archivists in Digital Humanities with regard to copyright law, since archiving is more concerned with preserving digital objects in their actual expressed form. Therefore, in the United States, the bigger issue is one of “fair use” (as outlined in the section “Digital Humanities and Copyright Law”), wherein the use of literature for literary scholarship is allowed because of the supposition that “literary scholarship will not resemble, except in the most superficial way, the literary objects with which it engages” (n.p.). The tensions this definition creates in the digital world are legion, and the article goes on to address some of them, but the biggest issue for the purposes of this blog is that once curated objects are made available to users in a way that allows users to duplicate the data, the distribution can hardly be justified under “fair use.” The idealized “Crowdstorage” concept, then, would need to seriously consider the ramifications that copyright law would have on any such collection’s legal existence.

