[supplied title]

response to an inquiry on digital preservation

About a year ago, I received a few questions from a journalist researching a story on digital preservation and wrote back a long reply. I hope my answers were helpful, but as far as I can tell the story was never published. I've wondered if by pushing back against the "digital dark age" framing I ended up making the topic seem less compelling.

In any case, since it's been so long I figure it's ok to post my response. I don't feel comfortable posting the exact questions I got, since I didn't write them, but I think you can gather the gist from my answers. I'd been meaning to revise this into a post all last year, but never got around to it. I'm putting it up here with only a few minor edits, mostly to add/update links (though I've left them in email style). I think I'd write basically the same thing today.

Dear [redacted],

Sorry I'm later with my reply than I intended to be. Also, I wrote a lot more than I expected to, for which I apologize, but I wanted to draw some distinctions that I don’t often see made in press coverage of digital preservation. I hope it still proves useful to you.

First, I want to push back a bit on the way you've framed the question. I think it's true that if you take a collection of material on digital storage media and a collection of paper records and leave them unattended for decades, the paper records are more likely to survive until you try to read them. But why make the ability to hold up under conditions of inattention and neglect the first point of reference? Digital media present particular challenges to preservation, certainly, but it's generally true of all media formats that they're less likely to survive if no one is actively taking care of them.

I also think that paper has become so integrated into our world that we don't always recognize the steps that we take to save paper as "preservation activities" because they can seem so routine. But look at the first three examples you give of lost manuscripts recently discovered. Harper Lee's manuscript for Go Set a Watchman was held in a safe-deposit box; the Irish researchers who found the Einstein manuscript were doing research in the Albert Einstein Archives; the Quranic manuscript leaves found in Birmingham had already been selected for preservation in a university research library. Each of these discoveries had more to do with recognizing the significance of the manuscripts than with establishing their existence. As physical items, these documents were all already being managed for preservation.

I don't want to exaggerate my point too much here. Many types of paper, along with related media like parchment and vellum, have proven remarkably durable over the centuries. I seriously doubt there will be a digital analogue to the manuscripts found in caves in Dunhuang in the early 20th century. I just think it's important to recognize that most of the historical manuscripts that are still around today survived because people actively worked to ensure their survival.

From this perspective, I think we're going to see some very divergent outcomes with digital material. Many artifacts and records that are not actively managed are going to be lost, or already have been lost. But material that we, as individuals or as institutions, put time and effort into maintaining is likely to survive for quite a while. The key point is that we need to make a commitment to digital preservation and to devote the resources necessary to carry out that commitment. The threats to digital preservation are as much economic, social, and political as they are technical, and many of those threats - war, economic collapse and so on - apply to all cultural heritage materials across the board, not just digital materials.

That said, I take your question to be "what can we do to preserve digital material that we've committed ourselves to preserving?" and that's the question I really should be answering. I'll try not to get too caught up in the details.

There are a lot of different types of digital material, on a lot of different media, but the main principle I think we've learned is: start paying attention to what you want to keep earlier rather than later. This is especially true for materials stored on digital media that are going or have gone out of production, particularly if those media can only be read using hardware that is also becoming obsolete. Some of the more difficult cases of technological and format obsolescence are materials stored on media like old types of magnetic tape where the data was never migrated forward as technology changed.

There's often a period when a new technology comes online where you can still move relatively easily between the old and the new because so many people are in the same situation, and there’s an economic incentive for the makers of the new technology to help you leave behind the old one. If you don't take advantage of that time to move your stuff forward - to a new type of data tape, from tape to disk, from floppies to hard drives, from one version of a format to a newer one, etc. - it can be much more difficult to do this later.

This is a real challenge for the museum, as our collection includes many different types of early computer storage media, ranging from punched cards and paper tape to metal and magnetic tape (in addition to more modern formats). This is not an area I've worked on directly, but as I understand it some of this material can only be read either with original hardware or through sometimes difficult engineering and reverse engineering processes. A fair amount of this material - I don't know how much - was acquired by the museum after it had been sitting in storage for years, long after it had last been accessed.

Once you get to more modern formats - and by "modern" I mean from around the time of floppies and early personal computers to the present - the challenges of reading the media today aren't as great because the equipment to read them was more widely produced and remains available. But the clock is always ticking, and most people who still own their floppy disks probably no longer have floppy drives to read them.

Many libraries and archives are now investing in that equipment - usually focusing on 5.25" floppies and later, but their choices will depend on the needs of their collections - as acquisitions are coming in now that include computer media and hardware along with paper. These range from the personal papers of individuals like writers, artists, academics, lawyers, etc. to the records of organizations and businesses. You'd be surprised by how much of that stuff has been kept and can still be read with the proper equipment. There are always going to be some unreadable disks, but I don’t think we’re looking at a catastrophic situation.

There is a problem here for cultural heritage institutions, though: there's often a gap of a few decades or more between when material is created and when it's transferred to an archive, library, or museum. We're fortunate that you can still find equipment to read floppies and Zip disks, but that won't always be the case. We're also fortunate that people have kept their media items past the point when they were able to read them on their own computers. Some sort of time gap is always going to be there, and some people will never want to donate to a separate institution, so there's also a need to educate people on how to manage their own digital material.

Digital preservation can't be something that only "digital preservation professionals" do; it needs to be something within reach of individuals who want to keep their own stuff during their lifetimes and then pass it on to friends and family. You're probably not going to run some ISO-certified preservation repository at home, but you should be able to keep your photos and home videos for yourself for as long as you want to keep them.

Along these lines, I recommend taking a look at the Library of Congress' resources on personal digital archiving: http://digitalpreservation.gov/personalarchiving/
There's also a conference on Personal Digital Archiving, which tends to be more practitioner oriented:

Moving on to the software question…

Assuming that you can read the data off of whatever physical media you're working with, and assuming you're not dealing with a situation where the original hardware and media is an integral part of what you're preserving - as might be the case with new media art - you're then faced with the issue of how to ensure that that data remains readable and accessible in the long term. How you approach this is going to depend on a number of factors, such as file format, operating system, and intellectual property law.

Maybe I'm too optimistic, but I think we'll find that many widely adopted and widely shared file formats are going to remain readable for a surprisingly long time, in large part because they've been so widely adopted and shared. The need to make files readable across platforms and the need to make applications backwards compatible with older file format versions can end up combining to benefit digital preservation.

I won't get into it in this email, but I'm generally persuaded by David Rosenthal's arguments that format obsolescence isn't a huge threat when it comes to widely adopted formats (PDF, commonly used word processing formats, JPEG, HTML, etc.) because people have created, and have incentives to continue to create, applications that can render those formats. A good place to start is his post of a few years ago on obsolescence scenarios: http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html
There's also a good list of Rosenthal's posts on format obsolescence here: https://wsampson.wordpress.com/2011/07/18/format-obsolescence-maybe-not-such-a-bugaboo/

Beyond just taking formats as they are, there's also ongoing work on standardizing formats to make them more durable for long-term preservation. Take a look at the EU-funded Preforma project, which is focusing on formats for documents (PDF), images (TIFF), and audio-visual files (multiple formats): http://www.preforma-project.eu/project.html The hope here is that these formats will be integrated into applications that create these types of files, so that people wanting to produce a preservation-ready file could choose the preservation version of the format at the time of creating the file. Think of it as analogous to choosing acid-free paper over cheaper paper when printing a book meant to last.

Of course, there's a lot of material out there that doesn't use widely available formats, and many applications are not backwards compatible. There are also cases where you can open an old format in a new application, but not without an unacceptable loss of fidelity to the original. In these cases, you can often get a fairly faithful rendering of the original file using emulation, and emulation has gotten much easier to implement over the years. But in other cases emulation might not be an option because of technical problems or legal restrictions.

Some of the most at-risk content consists of files that can only be opened in one application, especially if the developer is not committed to backwards compatibility or to making old software versions available for preservation purposes. There are specialized applications where it’s hard to see preservation even being possible without buy-in from the company that produces the application. There are also particular challenges to preserving social media and content created with cloud-based applications, which I won’t get into at this point.

So to sum up (since I really should be wrapping this up): I don't think the situation facing digital preservation is as dire as it's sometimes presented (cf. recent and past articles about a "looming Digital Dark Age"). Digital material does require more intervention than paper to keep it alive, but inattention has never been an effective preservation strategy. There's a core of material that's likely to be kept accessible as long as we maintain the ability to use computers. But there's also a significant amount of material at a higher risk of loss, either because it's from the really early era of computing (1970s and before), or because it's dependent on very particular software or hardware environments that are difficult to reproduce.

Finally, I want to emphasize that there's a community of practice around digital preservation that has been working on these issues for years, and I hope any article on the topic will acknowledge that work. I don't think preserving things "forever" can ever be considered a solved problem, but I am optimistic that we aren't suddenly going to lose access to much of our digital heritage, at least not while people are actively working to preserve it.

I’ll leave you with a few more links to resources below:

On emulation, I'll again recommend some of David Rosenthal's work, as he's just published a paper on "Emulation & Virtualization as Preservation Strategies":

On developing a global framework for software preservation, including participation from industry, see the UNESCO PERSIST project:

On conservation of digital art, see this recent Cornell white paper:
https://ecommons.cornell.edu/handle/1813/41368 (paper)
http://blogs.loc.gov/digitalpreservation/2015/12/authenticity-amidst-change-the-preservation-and-access-framework-for-digital-art-objects/ (interview with authors)

A few of the organizations working on digital preservation:

National Digital Stewardship Alliance (US):

Open Preservation Foundation (international, started as an EU-funded initiative):

Digital Preservation Coalition (UK):

Best, Andrew