[supplied title]

response to an inquiry on digital preservation

About a year ago, I received a few questions from a journalist researching a story on digital preservation and wrote back a long reply. I hope my answers were helpful, but as far as I can tell the story was never published. I've wondered if by pushing back against the "digital dark age" framing I ended up making the topic seem less compelling.

In any case, since it's been so long I figure it's ok to post my response. I don't feel comfortable posting the exact questions I got, since I didn't write them, but I think you can gather the gist from my answers. I'd been meaning to revise this into a post all last year, but never got around to it. I'm putting it up here with only a few minor edits, mostly to add/update links (though I've left them in email style). I think I'd write basically the same thing today.

Dear [redacted],

Sorry I'm later with my reply than I intended to be. Also, I wrote a lot more than I expected to, for which I apologize, but I wanted to draw some distinctions that I don’t often see made in press coverage of digital preservation. I hope it still proves useful to you.

First, I want to push back a bit on the way you've framed the question. I think it's true that if you take a collection of material on digital storage media and a collection of paper records and leave them unattended for decades, the paper records are more likely to survive until you try to read them. But why make the ability to hold up under conditions of inattention and neglect the first point of reference? Digital media present particular challenges to preservation, certainly, but it's generally true of all media formats that they're less likely to survive if no one is actively taking care of them.

I also think that paper has become so integrated into our world that we don't always recognize the steps that we take to save paper as "preservation activities" because they can seem so routine. But look at the first three examples you give of lost manuscripts recently discovered. Harper Lee's manuscript for Go Set a Watchman was held in a safe-deposit box; the Irish researchers who found the Einstein manuscript were doing research in the Albert Einstein Archives; the Quranic manuscript leaves found in Birmingham had already been selected for preservation in a university research library. Each of these discoveries had more to do with recognizing the significance of the manuscripts than with establishing their existence. As physical items, these documents were all already being managed for preservation.

I don't want to exaggerate my point too much here. Many types of paper, along with related media like parchment and vellum, have proven remarkably durable over the centuries. I seriously doubt there will be a digital analogue to the manuscripts found in caves in Dunhuang in the early 20th century. I just think it's important to recognize that most of the historical manuscripts that are still around today survived because people actively worked to ensure their survival.

From this perspective, I think we're going to see some very divergent outcomes with digital material. Many artifacts and records that are not actively managed are going to be lost, or already have been lost. But material that we, as individuals or as institutions, put time and effort into maintaining is likely to survive for quite a while. The key point is that we need to make a commitment to digital preservation and to devote the resources necessary to carry out that commitment. The threats to digital preservation are as much economic, social, and political as they are technical, and many of those threats - war, economic collapse and so on - apply to all cultural heritage materials across the board, not just digital materials.

That said, I take your question to be "what can we do to preserve digital material that we've committed ourselves to preserving?" and that's the question I really should be answering. I'll try not to get too caught up in the details.

There are a lot of different types of digital material, on a lot of different media, but the main principle I think we've learned is: start paying attention to what you want to keep earlier rather than later. This is especially true for materials stored on digital media that is going or has gone out of production, especially if that media can only be read using hardware that is also going obsolete. Some of the more difficult cases of technological and format obsolescence are materials stored on media like old types of magnetic tape where the data was never migrated forward as technology changed.

There's often a period when a new technology comes online where you can still move relatively easily between the old and the new because so many people are in the same situation, and there’s an economic incentive for the makers of the new technology to help you leave behind the old one. If you don't take advantage of that time to move your stuff forward - to a new type of data tape, from tape to disk, from floppies to hard drives, from one version of a format to a newer one, etc. - it can be much more difficult to do this later.

This is a real challenge for the museum, as our collection includes many different types of early computer storage media, ranging from punched cards and paper tape to metal and magnetic tape (in addition to more modern formats). This is not an area I've worked on directly, but as I understand it some of this material can only be read either with original hardware or through sometimes difficult engineering and reverse engineering processes. A fair amount of this material - I don't know how much - was acquired by the museum after it had been sitting in storage for years, long after it had last been accessed.

Once you get to more modern formats - and by "modern" I mean from around the time of floppies and early personal computers to the present - the challenges of reading the media today aren't as great because the equipment to read them was more widely produced and remains available. But the clock is always ticking, and most people who still own their floppy disks probably no longer have floppy drives to read them.

Many libraries and archives are now investing in that equipment - usually focusing on 5.25" floppies and later, but their choices will depend on the needs of their collections - as acquisitions are coming in now that include computer media and hardware along with paper. These range from the personal papers of individuals like writers, artists, academics, lawyers, etc. to the records of organizations and businesses. You'd be surprised by how much of that stuff has been kept and can still be read with the proper equipment. There are always going to be some unreadable disks, but I don’t think we’re looking at a catastrophic situation.

There is a problem here for cultural heritage institutions, though: there's often a gap of a few decades or more between when material is created and when it's transferred to an archive, library, or museum. We're fortunate that you can still find equipment to read floppies and Zip disks, but that won't always be the case. We're also fortunate that people have kept their media items past the point when they were able to read them on their own computers. Some sort of time gap is always going to be there, and some people will never want to donate to a separate institution, so there's also a need to educate people on how to manage their own digital material.

Digital preservation can't be something that only "digital preservation professionals" do; it needs to be something within reach of individuals who want to keep their own stuff during their lifetimes and then pass it on to friends and family. You're probably not going to run some ISO-certified preservation repository at home, but you should be able to keep your photos and home videos for yourself for as long as you want to keep them.
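To make that concrete: one simple habit that goes a long way is keeping checksums of your files, so that later on you can tell whether a copy has been silently corrupted or lost. Here's a minimal sketch of that kind of "fixity check" in Python, just to show the idea - the folder path and manifest filename are placeholders I made up, and there are real tools that do this with more polish:

```python
# A minimal sketch of personal-scale "fixity checking": record a checksum for
# each file in a folder, then re-run later to see whether anything has changed
# or disappeared. The folder path and manifest name are hypothetical.
import hashlib
import json
from pathlib import Path

ARCHIVE = Path("~/Pictures/family-photos").expanduser()  # placeholder folder
MANIFEST = ARCHIVE / "checksums.json"

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot() -> dict:
    return {
        str(p.relative_to(ARCHIVE)): sha256(p)
        for p in ARCHIVE.rglob("*")
        if p.is_file() and p != MANIFEST
    }

if MANIFEST.exists():
    old = json.loads(MANIFEST.read_text())
    new = snapshot()
    for name in old:
        if name not in new:
            print(f"missing: {name}")
        elif old[name] != new[name]:
            print(f"changed: {name}")
    # (files added since the manifest was written aren't checked until you
    # regenerate it)
else:
    MANIFEST.write_text(json.dumps(snapshot(), indent=2))
    print("manifest written")
```

Run it once to record a manifest, then run it again whenever you want to confirm that everything is still there and unchanged - for example, after copying the folder to a new drive.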

Along these lines, I recommend taking a look at the Library of Congress' resources on personal digital archiving: http://digitalpreservation.gov/personalarchiving/
There's also a conference on Personal Digital Archiving, which tends to be more practitioner oriented:
http://www.lib.umich.edu/pda2016
https://library.stanford.edu/projects/personal-digital-archiving-2017

Moving on to the software question…

Assuming that you can read the data off of whatever physical media you're working with, and assuming you're not dealing with a situation where the original hardware and media are an integral part of what you're preserving - as might be the case with new media art - you're then faced with the issue of how to ensure that the data remains readable and accessible in the long term. How you approach this is going to depend on a number of factors, such as file format, operating system, and intellectual property law.

Maybe I'm too optimistic, but I think we'll find that many widely-adopted and widely-shared file formats are going to remain readable for a surprisingly long time, in large part because they've been so widely-adopted and shared. The need to make files readable across platforms and the need to make applications backwards compatible with older file format versions can end up combining to benefit digital preservation.

I won't get into it in this email, but I'm generally persuaded by David Rosenthal's arguments that format obsolescence isn't a huge threat when it comes to widely adopted formats (PDF, commonly used word processing formats, JPEG, HTML, etc.) because people have created, and have incentives to continue to create, applications that can render those formats. A good place to start is his post of a few years ago on obsolescence scenarios: http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html
There's also a good list of Rosenthal's posts on format obsolescence here: https://wsampson.wordpress.com/2011/07/18/format-obsolescence-maybe-not-such-a-bugaboo/

Beyond just taking formats as they are, there's also ongoing work on standardizing formats to make them more durable for long-term preservation. Take a look at the EU-funded Preforma project, which is focusing on formats for documents (PDF), images (TIFF), and audio-visual files (multiple formats): http://www.preforma-project.eu/project.html
The hope here is that these formats will be integrated into applications that create these types of files, so that people wanting to produce a preservation-ready file could choose the preservation version of the format at the time of creating the file. Think of it as analogous to choosing acid-free paper over cheaper paper when printing a book meant to last.

Of course, there's a lot of material out there that doesn't use widely-available formats and many applications are not backwards compatible. There are also cases where you can open an old format in a new application, but not without an unacceptable loss of fidelity to the original. In these cases, you can often get a fairly faithful rendering of the original file using emulation, and emulation has gotten much easier to implement over the years. But in other cases emulation might not be an option because of technical problems or legal restrictions.

Among the most at-risk content are files that can only be opened in a single application, especially if the developer is not committed to backwards compatibility or to making old software versions available for preservation purposes. There are specialized applications where it’s hard to see preservation even being possible without buy-in from the company that produces the application. There are also particular challenges to preserving social media and content created with cloud-based applications, which I won’t get into at this point.

So to sum up (since I really should be wrapping this up): I don't think the situation facing digital preservation is as dire as it's sometimes presented (cf. recent and past articles about a "looming Digital Dark Age"). Digital material does require more intervention than paper to keep it alive, but inattention has never been an effective preservation strategy. There's a core of material that's likely to be kept accessible as long as we maintain the ability to use computers. But there's also a significant amount of material at a higher risk of loss, either because it's from the really early era of computing (1970s and before), or because it's dependent on very particular software or hardware environments that are difficult to reproduce.

Finally, I want to emphasize that there's a community of practice around digital preservation that has been working on these issues for years, and I hope any article on the topic will acknowledge that work. I don't think preserving things "forever" can ever be considered a solved problem, but I am optimistic that we aren't suddenly going to lose access to much of our digital heritage, at least not while people are actively working to preserve it.

I’ll leave you with a few more links to resources below:

On emulation, I'll again recommend some of David Rosenthal's work, as he's just published a paper on "Emulation & Virtualization as Preservation Strategies":
http://blog.dshr.org/2015/11/emulation-virtualization-as.html

On developing a global framework for software preservation, including participation from industry, see the UNESCO PERSIST project:
http://unesco.nl/digital-sustainability
http://unesco.nl/en/node/2665

On conservation of digital art, see this recent Cornell white paper:
https://ecommons.cornell.edu/handle/1813/41368 (paper)
http://blogs.loc.gov/digitalpreservation/2015/12/authenticity-amidst-change-the-preservation-and-access-framework-for-digital-art-objects/ (interview with authors)

A few of the organizations working on digital preservation:

National Digital Stewardship Alliance (US):
http://ndsa.org/

Open Preservation Foundation (international, started as an EU-funded initiative):
http://openpreservation.org/

Digital Preservation Coalition (UK):
http://dpconline.org/

Best, Andrew

inherent virtue

Let's say you don't think digital will survive, so you print everything that you consider (at the time of printing) to be important.

You:

  • choose decent quality papers and inks, maybe even the best you can afford
  • use a high quality printer
  • use folders and boxes that won't quickly degrade
  • regularly restock your printing and storage supplies
  • keep your boxes in an appropriate climate
  • protect against damage from fire, water, and animals
  • move all of your printouts every time you move, or make provisions to have someone manage them on your behalf
  • continue to pay your paper, ink, printer, folder, box, storage unit, housing, and moving bills
  • find someone to take custody of your printouts after you die and manage them at least as well as you did

While you're doing all of that, take a moment to stop and ask yourself, are these printouts being preserved because something inherent in the physical properties of paper makes them survive, or because you've decided it's worth investing time and money to make sure they survive? Then ask yourself what it would take to preserve the things you didn't, or couldn't, print but that you still want to be kept.

unresolved

When I posted last January, and then in February, I thought I might be on pace to beat my blogging output of 2013, when I only managed to write three posts. And then I didn't post again in 2014.

This is in no way a New Year's Resolution, but I'm getting tired of not writing things, so I'm going to make a real effort to write again this year. I have a new job, have learned a bunch of things that are technical but hopefully not too boring, and I'm hoping to pick up some of my history interests again. If I post once a month, that would be great.

This post counts as January's, but I hope it's not the only one this month.

what Silicon Valley could have looked like

A few days ago, The Atlantic Cities ran a piece featuring designer Alfred Twu's visualizations of "What Silicon Valley Might Look Like If All of Its Employees Actually Lived There". These are imaginary designs, of course, but they show how dense the region could be if future development were aimed at bringing in more residents and reducing the number of people who commute from San Francisco and elsewhere.

What people might not know is that there was a brief moment when the southern Bay Area could have been developed more densely in the first place. During the 1950s, the Bay Area Rapid Transit Commission investigated the possibilities for building rapid transit in the Bay Area (as one might guess from the title of the commission). Their work ultimately led to the construction of the Bay Area Rapid Transit (BART) system in place today.

Back when I was a grad student in history, I did some research into the early BART planning process and ended up writing a paper on how the West Bay counties - Santa Clara, San Mateo, and Marin - ended up dropping out before any BART development got started. I don't have time to get into all the details here, but briefly: the BART Commission asked the firm Parsons, Brinckerhoff, Hall, and MacDonald (PBHM) to produce a report examining the possibilities of rapid transit for the nine Bay Area counties.

This report came out in 1956 and recommended a multi-stage development schedule.1 The first stage would have covered the urban core (SF-Oakland-Alameda-Berkeley) and stretched into Contra Costa, Alameda, Marin, and San Mateo counties. The Peninsula endpoint would have been Palo Alto. The second stage of development would have brought BART to Santa Clara County and San Jose. Obviously, only some of these plans were implemented and even now BART barely touches San Mateo county.

One of the guiding ideas behind the 1956 PBHM report and the whole early BART planning process was that rapid transit would be used principally to relieve highway congestion rather than shape new development. That might seem like an odd way to look at things, but as far as I could tell from my research, the majority of people involved in planning BART didn't think that (suburban) people would ride rapid transit without the external motivation of congested highways.

So the justification for leaving out the South Bay in the first stage of BART construction was that the area still lacked the population needed to generate the kind of car traffic that BART would then relieve. At the time, Santa Clara County was still fairly agricultural: population growth in San Jose and what would become known as Silicon Valley was just starting to take off. The PBHM report actually considered highway construction to be preferable to rapid transit in Santa Clara County in the near-term.

Karl Belser, the Santa Clara County Planning Director, saw things differently. I'm just going to quote from my paper here:

Presenting a view of rapid transit at odds with PBHM's assumptions, Belser stated that the county had already exhausted its most logical alignments for highway building and was “scraping the bottom of the barrel for added freeway lanes.”2 Instead, the county’s future growth and prosperity depended on the immediate construction of both rapid transit and highways. Belser anticipated a county population of 750,000 by 1965 and one million by 1975; PBHM’s figures projected a high of around 1.1 million only in 1990. Belser called for the “three way linkage of the San Francisco, San Jose, and Oakland area by rapid transit as the means of welding these three major population concentrations together into one great metropolitan complex.” Advocating a realistic approach to planning, he pointed out that even the first stage of construction could take over ten years to complete. By that time Santa Clara clearly would need rapid transit. Furthermore, if constructed at present the line would pass through “relatively open country” without having to deal with existing development and high land costs. Now Belser came to the core of his disagreement with PBHM: freeway and rapid transit

are dynamically competitive and it is difficult enough to overcome tradition and habit without having such bents built into the physical pattern. In northern Santa Clara County and southern Alameda County the possibility of changing the direction of development and orienting it specifically to the transit system is still open. It would be possible to provide a type of urban living facility which would be primarily based on the transit system for mobility. This would…be a sort of assured patronage for the long range use of the facility…Such a new direction needs to be understood and planned for at the earliest possible time in a rapidly developing area such as ours.

Drawing comparisons with Europe – especially Paris – he thought that his county still had the chance to use transit to help build multiple-unit housing for people with lower incomes who could not afford cars – the very people many BART advocates simply ignored. Projecting future growth heavily oriented towards manufacturing, Belser worried that, “if industry locates itself hit or miss without regard to rapid transit, it becomes impossible, as it is today in Los Angeles, to locate effective desire lines on which to locate the line…Thus if the lines of the system were defined now it would be possible through proper parallel planning to connect areas of residence with areas of employment.” [end quotation from my paper]

Pretty much none of what Belser envisioned actually came to pass, and in 1970, with Santa Clara County's car-oriented pattern well in place, Belser published an article called "The Making of Slurban America" lamenting what happened to the county.3


  1. Parsons, Brinckerhoff, Hall, and MacDonald, Regional Rapid Transit: A Report to the Bay Area Rapid Transit Commission 1953-1955, 1956. 

  2. Karl Belser, “Rapid Transit Extension to San Jose An Address Made By Karl J. Belser,” 11 Sep 1956, BART Commission Progress Reports, A18-1.1, California State Archives. All of the Belser quotes in this post are from this same document. 

  3. Karl J. Belser, "The Making of Slurban America," Cry California, 5 (Fall 1970): 1-21.  

re-launch

I guess I could sit here and mess with the color scheme for this blog indefinitely, or I could just start writing blog posts again. I only wrote three posts in 2013, and by the end of the year I was down to just checking my site every now and then to make sure it was still online.1 I'm making a few changes that I hope will get me blogging regularly again:

  1. New URL

Back in 2012, after reading a bunch of advice on choosing domain names, I decided to use my own name for my other site. Much of the advice I read boiled down to consistency and identifiability. People change blog titles all the time but personal names, while not necessarily stable, tend to have more persistence. So if you've decided to use your "real" name online and are creating a website to go with that identity, it makes sense to go with an eponymous domain. I still think that's a good idea and I still have plans to use that domain as a personal website. It might just link out to writings and projects, but that's ok.

For the blog itself, however, I found myself surprisingly uncomfortable seeing my own name out there every time I wanted to post a link. It got to the point that I didn't want to post links to my own writing, or write things that would end up with my name in the URL.2 Hence the new domain, which makes the URL match the blog title. Will I always have a blog with this title? Maybe not, but I've moved on from blog to blog before.

  2. Octopress instead of Wordpress

The bigger change, which might not be that visible beyond the difference in layout/theme, is the move from Wordpress to Octopress. I'll probably put up a separate post about the differences, so I won't get into too much detail here, but essentially Wordpress is a dynamic content management system while Octopress - which I had not heard of before some friends mentioned it on twitter - is a type of static site generator. What this boils down to is that Wordpress runs on the web server and builds each page on request (pulling content from a database), while Octopress does a bunch of stuff on my computer that generates static files (html and css, obviously, but you can include lots of things) which are then uploaded to the web server.
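Just to illustrate the general idea - this is not Octopress itself, which is a full Ruby/Jekyll toolchain, but a toy sketch in Python that assumes a folder of markdown source files and the third-party markdown package - a static site generator basically boils down to this:

```python
# A toy static site generator: read Markdown source files, render each one to
# an HTML page, and write the results to an output folder that can be uploaded
# to any web host. Folder names here are made up for the example.
from pathlib import Path

import markdown  # third-party package: pip install markdown

SOURCE = Path("posts")    # hypothetical folder of .md files
OUTPUT = Path("public")   # the generated static site ends up here

TEMPLATE = """<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>{title}</title></head>
<body>{body}</body></html>"""

OUTPUT.mkdir(exist_ok=True)
for src in SOURCE.glob("*.md"):
    body = markdown.markdown(src.read_text())
    page = TEMPLATE.format(title=src.stem, body=body)
    (OUTPUT / f"{src.stem}.html").write_text(page)
    print(f"generated {src.stem}.html")
```

Everything that ends up in the output folder is an ordinary file, so nothing runs on the server at all; publishing is just a matter of copying that folder to the host, which is where things like Rsync come in.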

There are advantages and disadvantages to this move, most of which I'll gloss over here, but the most noticeable difference from a reading point of view is the lack of comments. Since the site is static and not backed by a database or anything like that, there's no place to save user input without using a third-party add-on. Octopress is built to work out of the box with the Disqus comment system, which embeds the comments interface as a third-party application at the bottom of each post, but I don't have a Disqus account and won't set one up until I've looked into their policies more closely. I have added a contact page and I welcome feedback, although I can't promise I'll be able to respond to every email.

From my point of view as someone who wants to learn some new skills, there are advantages in moving to Octopress that outweigh the loss of things like built-in comments. Octopress relies on a bunch of things that I've either never used before or used only occasionally: git, ruby, markdown, css processing with Sass, SSH and Rsync.3 I can also maintain my site entirely from the terminal if I want to, which gives me an excuse to use in-terminal text editors and shell commands more often. That all might sound like a lot of work, but this is stuff I want to learn anyway and having an ongoing project makes me more likely to stick with it. And writing simple posts like this is easy.

  3. Shorter posts, if not in length, then in time spent writing

When I get started, I can write a lot fairly quickly, but all too often I end up writing long posts infrequently. I don't think the occasional longer post is a bad thing, but waiting until I have something essay-like to write has been bad for my blogging. Those posts take a long time, and this has generated a feedback loop: I really want to write something, but then I think I should get more background before writing, then I think about how long it will take to write, and then I keep putting it off. I usually do a fair bit of that background gathering, so I end up learning a fair amount, but then everything just stays with me and I never write it down. There are still a few longer things I want to write, but I'm going to try not to let them get in the way of other stuff. I didn't have to move to a new website to change how I approach blogging, but I figure this is a good time to start.

So that's it. In the past I'd try to think of some neat wrap-up paragraph for this post, but I think it's enough that I'm actually writing again.


  1. Meanwhile, I kept getting notices from my web host that my installation of Wordpress needed to be upgraded. I did the upgrades, at least. I also renewed my domain. So I never gave up on the site completely. 

  2. This is probably something I should have gotten over, but whatever. It's easy enough, if not free, to get a new domain for the blog and blog-related activities. 

  3. Technically, you don't have to use Rsync to publish a site with Octopress. But it's the method I'm currently using to communicate with my hosting provider.