[supplied title]

collecting and archiving digital video

A couple of years ago I gave a presentation titled "Collecting and archiving born-digital video" at the Society of California Archivists (SCA) 2017 annual conference in Pasadena. I'd always meant to post it online, but I wanted to write up text to accompany the slides. I've finally now done that.

For context, my talk was part of a session titled "Born digital: care, feeding, and intake processes." Other speakers represented the LOCKSS program at Stanford, the Internet Archive, and the Jet Propulsion Laboratory. A full description of the panel is available under session 10 in the SCA 2017 conference program.

A note about the text, which is true of pretty much every talk I've given since library school: I have a difficult time reading from notes when I do public speaking, so unless there's a long quotation I want to read exactly, I go without them. As a result, I don't have an official text for the talk, just the slides.

Since I'm essentially writing it up for the first time in 2019, it's not exactly what I said in 2017, not least because it's probably longer than fifteen minutes as written. But I've decided to go with the present tense anyway.

Slide 1: title slide

Hi, I'm Andrew Berger and I'm going to be talking about collecting and archiving in-house digital video, by which I mean video produced in digital formats. For background, I'm going to begin by telling a story, a story about a museum.

Slide 2: Background: oral history as collecting

The Computer History Museum has had an active oral history program since the early 2000s. There are some oral histories from even earlier, mostly on audio tape, but the pace really picked up in the 2000s with the use of DV tape. Almost every oral history from that point on has been on video.

Around 2010-2011, there was a shift from using digital tape to using digital files, with all-file workflows taking over entirely around 2012-2013. Around the same time there was also a shift towards treating oral histories as productions, with an eye towards reuse in exhibits and documentaries. All oral histories are now shot in HD, many with special lighting, separate microphones, and a green screen.

Oral histories are seen as a collecting activity, with every production ultimately creating files that the museum is dedicated to preserving in the permanent collection.

Slide 3: Story continues: lectures and events

Oral histories aren't the only kind of video the museum produces. Lectures and other events are routinely recorded, dating all the way back to the 1980s. These recordings span analog and digital formats, with DV tape taking over from Betacam and VHS in the early 2000s. Just like oral histories, these moved to all-file workflows starting around 2010-2011.

The museum has kept these videos as well, but is that the same as collecting them?

Slide 4: Story continues: exhibit videos

The museum also produces its own exhibit videos. These include not just the videos shown in the exhibits (on-site and online) but also the raw footage shot specifically to be used in exhibit videos. As with just about any video production process, it takes hours of raw footage to produce a few minutes of a final production.

Exhibit video production increased significantly around 2009 in preparation for two major exhibits. As with oral history and event video, the museum retains its exhibit video.

Slide 5: Meanwhile: more video productions

Beyond oral histories, events, and exhibits, the museum produces even more kinds of video. There's footage of docent trainings, educational events, marketing campaigns--essentially, any public facing museum function could have a video component.

Slide 6: But wait, there's more!: Revolutionaries TV show

But wait, there's more! For five years the museum also produced a television show for the local PBS affiliate, KQED. This was an hour-long interview show that drew heavily from the museum's live lecture series.

The production of the TV show was essentially all-digital from the start. The lectures were recent enough to have been shot in HD, then additional editing and post-production work was done to transform them into files meeting broadcast standards.

The museum has retained copies of its television episodes as well.

Slide 7: To sum up: a lot of digital video

To summarize, the museum has produced and is producing a lot of digital video, with a significant increase starting around 2009. This increase has coincided with a shift towards file-based rather than tape-based workflows.

Slide 8: How much is "a lot"? Chart showing video production by year in terabytes

Now, you might be wondering, how much do I mean by "a lot"? This chart shows the approximate growth in video production from 2011 to 2015, measured in terabytes. The chart starts in 2011 because that's the first year for which I could generate reliable data, in part because for earlier years I would have to go back to tape. You can see there was nearly a doubling in production in four years, from just over 10 TB in 2011 to 20 TB in 2015. [Note from 2019: it's around 30 TB now.]

On a panel with representatives from LOCKSS, the Internet Archive, and JPL, twenty terabytes in a year might not sound like a lot. But keep in mind that for every byte of video, someone had to schedule the event, set up the mikes and cameras, monitor the feeds, transfer the data, and edit the final products. This is artisanal, hand-crafted data.

Slide 9: Where does it all go? DV tape (picture of boxes of DV tape)

Where has all this data been going? First, as I've mentioned, it mostly went onto DV tape, which is stored in archival boxes.

Slide 10: Where does it all go? Hard drives (picture of pile of hard drives)

For a while, a lot of the data was also being stored on external hard drives. I don't even want to talk about the hard drives.

Slide 11: Where does it all go? Servers (no photo of server shown)

Finally, it started being stored on servers. These have grown quite a bit in size. Since about 2013, the media production team has had their own system combining server space with LTO tape to use for editing and backup. And in 2015, the museum launched a digital preservation repository. The vast majority of the data in the repository so far is video.

Slide 12: The challenge of digital video: developing a new approach

How has the museum adapted to the challenge of managing and preserving all this video? That's the subject of the rest of my talk.

Slide 13: File formats: a problem, but...

First, file formats. I'm not actually going to spend a lot of time talking about formats, even though it sometimes feels like that's the main thing people want to talk about when I get asked about digital video preservation. I will say that there are multiple different production formats and codecs in the collection, some of which are proprietary.

In the absence of a generally agreed upon normalization target within the broader digital video community, we've settled for preserving the "original" digital formats as long as they can be opened with open-source tools, specifically VLC and/or ffmpeg. We might normalize or migrate to other formats in the future but at this point the costs and risks of normalization would outweigh the benefits. This decision does have the side effect of making the preservation of VLC and ffmpeg more important to our overall preservation efforts.

Note that this policy for digital sources differs from our policy for digitization from analog sources, where the choice of file format is entirely ours.
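To make the "can be opened with open-source tools" test a bit more concrete, here is a minimal sketch of one way such a check could be scripted. It assumes ffmpeg is installed and on the PATH, and the function names are hypothetical; it is an illustration, not a description of the museum's actual process. The idea is to ask ffmpeg to decode the whole file while discarding the output, and to treat any reported error as a failure.

```python
import subprocess


def decode_check_command(path):
    """Build an ffmpeg command that decodes a file and throws the result away.

    -v error   : only print messages at error level or worse
    -i <path>  : the file to read
    -f null -  : decode everything but write no output file
    """
    return ["ffmpeg", "-v", "error", "-i", str(path), "-f", "null", "-"]


def can_be_decoded(path):
    """Return True if ffmpeg reads the whole file without reporting errors.

    Assumption: a zero exit code plus empty error output is a good enough
    proxy for "this file can still be opened"; a real policy might also
    want a human to look at the result in VLC.
    """
    result = subprocess.run(
        decode_check_command(path), capture_output=True, text=True
    )
    return result.returncode == 0 and not result.stderr.strip()
```

Something like `can_be_decoded("lecture_switch.mov")` (a made-up filename) would then give a yes/no answer per file. Decoding a full file is slow for large video, so a check like this would more plausibly run once at ingest than routinely.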

Slide 14: Technical infrastructure: video files are intense

Video files are intense: not only do they take up a lot of disk space, they also require more powerful processors for rendering and playback. The museum has had to upgrade its technical infrastructure to handle them. As I mentioned above, server space for editing and storage has been greatly expanded. So has internal network bandwidth and computing power for staff that work heavily with video.

Slide 15: Conceptual approach: what are we looking at?

But the biggest changes have been in terms of what I'd call the conceptual approach. What does it mean to say you collect or archive video? Oral histories, as I've mentioned, fit solidly into the collecting paradigm, even if they are produced in-house. They've always been on a workflow to be accessioned, cataloged, transcribed, shared.

What about the other video production types? What about their raw footage? What if there's only raw footage?

On this slide I've included two photos of boxes of digital video tape. One contains oral histories and has nicely printed labels. The other contains lectures, and though you can't see it on the slide, the boxes are labeled in pencil. I think this captures the tentative nature of how video was being preserved outside of oral history, as if it still wasn't ready for the permanence of labels.

Slide 16: Conceptual approach: what is video, really?

One way to look at the problem is to step back and ask, what is video, really? Yes, videos are shot to create digital objects, but video is also made to document activities. Video production can be seen as an institutional function that produces institutional records as video.

As it happens, the museum doesn't really have an institutional archive separate from the general collection. Nonetheless, it's been helpful to approach video as records and to ask, if there were an institutional archive, and if it were arranged around functions and departments, then where would these video files fit into that arrangement? Can they be understood in terms of records scheduling and retention (but without actually using those terms in casual conversation)?

Slide 17: Appraisal and selection: can't preserve literally everything

Given that it is not, in fact, possible to preserve everything, we've been working out appraisal and selection policies with the media production team and other interested parties, depending on the type of video project and how it fits within the context of the whole institution.

For example, a single lecture could produce upwards of 500 GB: three camera feeds, one live "switch" feed drawing from the three cameras, final edits for the online audience, and possibly even another version for broadcast television. For long-term preservation we save only the switch, the public audience version (i.e. the version posted to YouTube), and the final broadcast file if there is one. The goal is to balance future re-use value with the goal of documenting the event, rather than saving every single file.

For other kinds of production, such as exhibit videos, we keep much more of the raw footage, sometimes even every camera feed. The media team also saves more data on their production system, but these are considered active records within the context of the whole system.

Slide 18: Staffing: regular lines of communication

Another significant change we've made has been in terms of staffing. Previously, there was a position of Oral History Coordinator, which focused mostly on the oral history program, but had some duties related to video production. For various reasons, this position didn't manage or monitor video files after they were produced, and no archivist became involved until the end of the workflow. But the previous holders of this position kept getting promoted (a good thing!) to work more specifically on content production, which had the side effect of making us take a fresh look at the whole video production and archiving workflow.

In the new arrangement, there's an AV Archivist who works closely with one of the Media Producers as the main points of contact between teams. The AV Archivist keeps abreast of all productions that will potentially produce material to be preserved and, together with me, tracks files from production to preservation and access.

There's also a regular check-in between the archivists and the media team where we keep up with current projects and review what's working and what could be improved.

Slide 19: The moment of transfer

The final significant change we made was to how we manage the actual moment of transfer of files between the production team and the archives. Previously, files were being dragged and dropped over the network, which is not the ideal way to move files that could be hundreds of gigabytes in size. When a copy was interrupted, there was often no way to tell that something had gone wrong until an archivist found a corrupt file.

Now the production system has been configured to generate checksums for each file, and to send files to the archives staging area with corresponding checksums included in an accompanying XML file. This way the archivists can validate the checksums as the files arrive and identify problems earlier in the process.
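For readers who want to see what that validation step might look like, here is a minimal sketch. The XML layout (a `<file>` element with `name` and `sha256` attributes), the choice of SHA-256, and the function names are all assumptions made for illustration; the actual manifest format and checksum algorithm used by the production system are not described here.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so very large video files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_transfer(manifest_path):
    """Compare files in the staging area against the checksums in a manifest.

    Assumed (hypothetical) manifest layout, stored next to the files:
        <transfer>
          <file name="lecture_switch.mov" sha256="..."/>
        </transfer>
    Returns a dict mapping each problem file name to a short description;
    an empty dict means every listed file arrived intact.
    """
    manifest_path = Path(manifest_path)
    staging_dir = manifest_path.parent
    problems = {}
    for entry in ET.parse(manifest_path).getroot().iter("file"):
        name = entry.get("name")
        expected = (entry.get("sha256") or "").lower()
        target = staging_dir / name
        if not target.is_file():
            problems[name] = "missing"
        elif sha256_of(target) != expected:
            problems[name] = "checksum mismatch"
    return problems
```

An interrupted copy would show up here as a "checksum mismatch" (or "missing") entry as soon as the transfer lands, rather than weeks later when someone tries to open the file.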

This is all still a work in progress, so I don't really have a grand conclusion. There's still a lot to do in terms of improving file management and settling on appraisal and selection decisions.

But I will say that as the museum has continued to increase video production in nearly all areas, it's difficult to see how the increase could have been managed without making these larger organizational changes. We may have started with preservation as the end of the workflow, but bringing preservation concerns upstream has helped us see the bigger picture. This has really paid off when requests have come in for previously produced footage and we've been able to say, without too much trouble, yes, we have that and we can get it again.

history book idea: the transformation of the fur trade

Commercial fur trapping is a form of hunting, and hunting is legal, so I probably shouldn't have been surprised to read in the Los Angeles Times that fur trapping still exists in California. If you read the article, you'll see that commercial fur trapping may not be around for much longer, as a bill has been introduced into the legislature to prohibit it. The number of people still trapping in California is really quite small.

Fur trading was, of course, a big deal in the history of the North American west and reading the article reminded me of a topic I thought about studying when I was in history, but ultimately decided not to pursue: what happened when the fur trade stopped being a big deal.

Most people who know some of the history of the North American west know the general outlines of the fur trade: how a combination of commercial trapping, trade, and colonial expansion pushed west across the continent until it reached the west coast and the Pacific.1 And how the combined effects of overhunting (by sea and on land), settler colonialism (which both devastated the indigenous communities that played a key role in the trade and cut off hunting lands), and changing fashion trends (particularly in Europe) led to the decline of the trade by the end of the nineteenth century. After that point, the fur trade essentially drops out of most western history narratives.

But here's the interesting thing: people still wear fur. Not as many people, and it's come to be very controversial, but fur has remained an economically viable business into the twenty-first century. So what happened? As far as I know: some trapping continued on a much smaller scale, mostly in northern Canada, while other fur traders made a transition to what I've seen called fur farming, essentially raising animals for their pelts. And somewhere along the way people stop talking about the fur trade and start talking about the fur industry.

How that transition happened seems like an interesting history to me. It crosses a whole range of subjects, from what happened with the people who continued to trap, to how the practice of "farming" was developed, to changes in consumption patterns and the moral status of wearing fur.

Now I'm not going to claim to have done a comprehensive literature search, as I'd only come up with this idea as a possible alternative to my then-current project, which was about railroads. But at the time I was thinking about this, admittedly over a decade ago, I didn't find any major study of this transition.

On the off-chance anyone reading this knows of any research that's already been done on these lines, I'd be interested to see it. And if it really hasn't been studied, then maybe someone wants to take it on for a book or dissertation? I'm pretty sure I'm still not going to write much more about it than this post.


  1. Probably less well-known is how the Russian-dominated fur trade reached the Pacific and Alaska through a similar process heading east. 

maintenance note

I can't believe it's been five years since I decided to move my blogging to this URL. At the time, I'd been running a wordpress blog at an eponymous URL for a little over a year, and during that time I'd come to feel so self-conscious about linking to posts with my name literally in the URL that I felt I needed to change it.

My old blog was feeling insecure in another way: it used http and not https. I was ok with this for the public URLs but it really bothered me on the administrative side. I wanted to be able to edit posts anywhere, but I was reluctant to log in from shared networks without the extra security. I know I could have composed posts outside of wordpress and then pasted in the text later, but the habits I'd developed as a blogger during the years when I did a fair amount of blogging led me to be more comfortable writing directly into the administrative interface of whatever platform I'd been using.

So after thinking through the various workarounds I could have developed to keep using wordpress - migrating the whole blog to a new URL, developing a new compose-and-publish workflow, only making edits and doing other administrative tasks at home - I got a new URL, wrote a long post about why, and then promptly nearly quit blogging.

Some of that was about work. I'd already not been posting very often, and then I got fairly busy with a few projects, spent a fair amount of time learning things outside of work hours, decided to move back to California (I was in a term position anyway), found a position in California much more quickly than I expected to, and then moved. At which point, I'd really gotten out of the blogging habit.

But some of it was about site maintenance. Static site generators create "simple" html sites, but the process of generating those pages can be somewhat difficult to piece together. Especially if you're not already familiar with the language(s) the generator is written in. Wordpress, on the other hand, does a pretty good job at being the kind of software where you can just log in and post. Wordpress requires you to do some other maintenance tasks, particularly around upgrading, but if you're not heavily customizing it and you're only a single user, it's not too complex.

At the time I launched this blog, I had decided to go with Octopress, which primarily uses Ruby. I thought I'd go on to learn Ruby at that point, but that never happened. And since I didn't start posting regularly, I felt like every time I wanted to write something, I had to re-learn the whole Octopress system again. And then troubleshoot issues like malformed URLs. One of the dangers of static sites is you're often re-generating every page each time you publish, so you can break everything at once pretty easily.

Eventually, I got tired of feeling put off from blogging by the software itself and started looking for alternatives. I'd still like to learn Ruby, but realistically it made more sense for me to stick with tools that are more familiar to me, and I eventually settled on Pelican, which uses Python. Pelican also happens to have an Octopress-style theme, which made the move much easier since I didn't have to go searching for new layouts.1

Has it worked out? Kind of. I made the change in the middle of last year, but I still didn't write a single post last year. Now that I'm writing again, I can say that I find Pelican easier to use, but it turns out that may have more to do with certain aspects of its design than with the underlying programming language.2

That didn't stop me from briefly breaking my site a few weeks ago when I somehow managed to upload it without the proper CSS files, and I've been catching some odd formatting errors related to how markdown works, but I think I'm getting the publishing process worked out. So if you're actually reading this blog and see things go haywire, please bear with me. I'm pretty sure I can fix whatever goes wrong. At the very least I can always back up to the previous version of the site.

As for wordpress, I actually did consider going back to it last fall. My hosting provider now supports some easy ways to implement https, something that previously seemed out of reach to me as a regular site owner. So I went back and enabled it on my old blog,3 which is still up because I haven't migrated any of the posts yet.

At the same time, I also decided to retire my old wordpress.com blog. I've been around blogs since Blogger was the big thing (along with TypePad), and I remember when wordpress started offering free blogs at wordpress.com. It felt like such an improvement! But over the years their monetization efforts have led to some ugly and jarring ad placement, and even though I never looked at that old old blog more than a couple of times per year, I just couldn't take it anymore.

Since it's much easier to migrate from wordpress to wordpress, I decided to import all those old posts to my self-managed wordpress site. That worked pretty cleanly, except I'd used a hierarchy for categories on wordpress.com, and now they're all in a flat structure. It also would have been nice to have an easy way to make it clear that everything prior to July 2012 was imported from a different URL. Maybe I'm too much of an archivist in all aspects of life.

In any case, with so much content now in wordpress, I started to wonder if I should just redirect the whole blog to my current URL and replace Pelican/Octopress entirely. But I concluded that I still like the static site concept enough to stick with it for now. Plus, I spent a lot of time looking at wordpress themes and didn't come away with any I really wanted to use.

So this, I hope, will be the retirement plan for my old eponymous URL: over the next few months, I'm going to migrate selected posts from my old blog(s) here. They'll show up under their original dates, but I'm going to insert a note in each one indicating where they were originally posted.

Then I'll make another wordpress export and a backup and take everything else down, leaving only a note redirecting people here.4 Everything not migrated will live on only in my personal digital archives, or as remnants in the Internet Archive. I may work in digital preservation, but that doesn't mean I think we're duty bound to keep online everything we've ever posted as regular people going about our lives.


  1. Although this has turned out to mean that I still have to install Ruby in order to make certain edits to the theme. But that's just to run one command, which makes it more like an ordinary software dependency rather than a core bit of knowledge required to run the site. 

  2. It's been so long since I've used Octopress I can't give a detailed comparison. But off the top of my head, the way Pelican separates content from themes seems easier to maintain. Plus the process of starting a new post also seems easier in Pelican. 

  3. I've also enabled it on this blog. I've been persuaded by the argument that even if you don't need https for administering your site, it still provides a security and privacy benefit to those who are reading it. 

  4. I'm keeping the eponymous URL itself. Who knows, maybe I'll put a proper personal profile page there, with something like professional/biographical information. I'm more comfortable using my name for that than for ordinary blog posts. 

I still define the "digital humanities" the same way

Just a few months into my first job as an archivist, a Masters student in a Digital Arts and Humanities program contacted the reference desk asking whether any of the archivists would be willing to do an interview by email about the role of the archive in the "digital age." Having just recently been a graduate student, and having done that sort of assignment myself, it felt a bit odd to be on the other side of the question. I was the least experienced person on reference, but I ended up being the one to do the interview.

I wrote a whole lot more than I thought I would; maybe I was still in student-writing mode. Although the interview was published on the student's blog, I remember thinking that I needed to save my own copy because you never know how long someone is going to maintain their website. I'm glad I did because I went looking for it recently and discovered that it's no longer online,1 though you can still find it in the Wayback Machine.

Re-reading it nearly six years later, I'm surprised at how close my answers were to what I'd probably say today. Even though my job at the time didn't really involve digital preservation, I still ended up working that topic into many of my answers. If I were to revise anything, I'd probably say more now about open access, paywalls, and sustainability models. And though the questions are mainly about digitization, I'd also emphasize the need to talk about all the sources that are now being created in digital formats from the very start.2

In any case, I've reproduced the interview in its entirety below with the questions in bold. With the exception of the hyperlinks, I've resisted the urge to edit anything, though I've made a mental note to look up proper comma usage.

Who determines the worth of something to be digitised – “Quis custodiet ipsos custodes?”, who watches the watchmen. What is the role of the archivist in choosing what to digitise – and how does the cultural/social context of that archivist influence what is kept? (e.g. what was stored and kept from Ancient Greece influences our idea of what the classics are. What about all the works that weren’t retained? Someone decided what was kept, or circumstance e.g. war dictated it, and that has shaped our Western canon)

If you take “digitisation” to mean any sort of digital reproduction – and that’s the loose interpretation of the term that I’m going to be using – then my answer to the first part of the question is that lots of people do, and most of them aren’t archivists. A family scanning old photograph albums or digitising family videos; someone digitising a personal music collection; a community organization putting their materials online; a government agency scanning paper materials for an online exhibit – I think all of this counts. This is all happening in addition to the digitisation efforts that go on in archives (and in related institutions like libraries and museums). So from this perspective, archivists are only one of many groups making decisions about what gets digitised.

But just because I’m taking a broad interpretation of what it means to digitise that doesn’t mean that I think all digitisation is more or less the same. How you digitise something – the technology you use, the metadata you create, the file formats you choose – has an impact on how well you’ll be able to maintain that digital object over time. So even though there’s a lot of people creating digital stuff all the time, I still think that archives (and, again, related institutions) will have a fair amount of influence on what gets kept over the long run, provided of course that they actually commit the time and resources required for digital preservation. Not everyone is going to be able to make that kind of commitment.

So from that longer term perspective, I think that all of the social factors that have long influenced archivists’ work – their backgrounds, training, professional practices, institutional environments, and so on – will continue to operate. I guess I’m kind of side-stepping the second part of the question here, but I don’t think it’s really that much different with digitisation and digital materials than it has been with other kinds of material. It will continue to be the case that those with more power and more resources, both as individuals and as organizations, will be more likely to have their stuff kept over time. Nevertheless, I do think the potential is there for us to preserve a much broader range of materials from our current era than we have for earlier times, simply because so many people and groups now have the ability to create and keep their own stuff. But we will have to make an effort to do that; I don’t think it will simply happen.

What is the best approach, the “quick and dirty” method, of gathering the data (e.g. the flick/turn of pages by a member of the public who scans the books, in good resolution – but not archive standard), or the high resolution approach where you could feasibly use something like Microsoft Silverlight to zoom in on every tiny bit of the page. Where does the worth of something become enough to justify the latter, and will that be mostly older works (since we typically value these higher)?

I think this really depends on what you want to do with it. If your goal is to produce a high-quality digital surrogate that can take the place of the original for most purposes – or if the original is in a format or state where it’s likely to decay within a fairly short period of time – then ideally you would take the highest level approach that you could afford. This means not just using archival standards for file formats and such, but also having a digital preservation system that can handle the digital objects that you’re creating.

If, on the other hand, you’re trying to create copies that are just “good enough” for some purpose – maybe you’re a researcher who just needs to be able to read the text of documents, or you’re doing an art project and you only need a certain resolution, or you just want to be able to listen to an mp3 of some song or speech on your not so great headphones, or you want examples of sources to use in a class you’re teaching but they just have to look ok on a slide – then maybe you don’t need to invest in making high quality copies. Plenty of people nowadays are taking phones and tablets and inexpensive cameras into archives and getting pictures that are good enough for their purposes but that are not really up to archival preservation standards.

Of course, in practice it’s not always easy to tell whether your goals will change. Say you start out digitising with a quick and dirty approach and then later you decide you need to adopt a higher standard. What do you do with what you’ve already digitised? Do you go back and redo it? This is a difficult question and I don’t really have any answer beyond “it depends.” It’s easy enough to say that you should always take the high-quality approach, but that’s not always possible at the outset.

Should you digitise systematically (by chronology, for example) or digitise on demand, which is more important?

I think ideally you would take a systematic approach and organize your digitisation projects around coherent wholes: whole collections, or logical groupings within collections, or groups of related collections. I do think digitisation on demand programs are still a good idea simply because they can increase access, but you have to be careful about how you represent what does not get digitised. As a researcher, I’ve ordered paper photocopies or taken digital photographs of thousands of documents, but it’s been pretty rare for me to copy an entire series or even an entire folder. That is to say, I’ve been highly selective and I wouldn’t want someone looking over the materials I’ve collected from any one archival collection to make the mistake of thinking that what I’ve digitised for my personal use is fully representative of what’s in the collection. Aggregating all the requests made by all the people who’ve used a collection would probably be more representative, but it would almost certainly still leave significant gaps, especially in larger or less-frequently used collections.

Still, even partial digitisation can be usefully suggestive to later researchers. Maybe a request that got digitised turned up something that wasn’t in the finding aid for that collection, and then another researcher comes along and uses that as an entry point for their own research, which then results in more of the collection being digitised, and so on. I think that kind of outcome would be great, but it still might never lead to the whole collection getting digitised without some kind of additional, systematic effort.

Are we digitising to save space – or to preserve – can works be destroyed to save space – or is this unethical?

I firmly believe that if you’ve already made a commitment to preserve the original materials, you need to stand by that as best you can. Digitise for access and preservation, but if you can keep the original – if it’s not irretrievably damaged or decayed – then you shouldn’t destroy it just to make space. This calculation might look different in a situation like in a library where you’re consolidating a collection and you find you have multiple copies of some widely-available book. But even then I don’t think you should destroy the “last” copy just because you’ve digitised the content.

Are paywalls a positive thing for Digital Humanities – should these materials be open source (is this feasible, how would it finance itself)?

I don’t really think I know enough about how projects are financed to be able to answer this question in detail. Certainly I’m in favor of open access/open source models where feasible. Paywalls generally restrict both who can engage in digital humanities work and how far that work reaches outside of the academic community. So to the extent that people working in the digital humanities aspire to make their work more widely available than traditional (academic) humanities work, paywalls can work against that. But I don’t doubt that there are also arguments to be made about how paywalls have made it possible to do some work that wouldn’t have been done at all under a different model.

What is your opinion on the role of large corporations in Digital Humanities? Is the Google approach to digital archiving (e.g. digitisation of books) a positive force – or should large corporations be kept out of the digital arts world?

As with paywalls, I don’t have personal experience working with corporations, but I don’t see why they can’t have some role, provided that corporate interests aren’t what’s driving scholarly work. Just taking Google as an example you can see a wide range of outcomes. I think Google’s book scanning has been a net positive so far, especially with respect to public domain books. Although what’s really made it valuable to me has been all the work that libraries have done through HathiTrust to provide another way of accessing digitised books from Google and other sources. I really prefer their catalog and interface to Google’s, and I’m glad that the agreements with Google didn’t prevent that from being developed.

On the other extreme, Google had a newspaper digitisation project for a while but abandoned it before it was done. That’s exactly the kind of outcome we should be trying to avoid.

What are your thoughts about a concept in “Rainbows End” where a machine digitises complete libraries and then pulps them afterwards?

If you’re running rare materials through the machine and you’re supposed to have made a commitment to preserve them, it sounds awful. But I can think of situations where a machine like that might come in handy. Say you don’t have a preservation mandate, you’re sure that what you want to digitise isn’t rare or unique, and you’re fine with having ebooks, then maybe a machine like that doesn’t sound like such a nightmare. I will admit to digitising a few copies of my own books where the pages were falling out and then disposing of the paper book afterward. But I wouldn’t have done that if I hadn’t known that dozens or even hundreds of libraries owned copies of those books.

What are your thoughts on databases for the humanities and your ideas re expanding text and art culture through digitisation?

Overall, I think the development and expansion of research databases has been a great benefit to researchers. Certainly this has been my own experience in doing historical research. When I was an undergraduate, I wrote my final major research paper based on pamphlets I found in the English Short Title Catalog and which I had to find and read via microfilm. I ended up printing out hundreds of pages because there’s really only so long you can sit in front of a microfilm machine. One year later, all of these pamphlets were online. I had similar experiences as a graduate student.

At the same time, databases aren’t without their costs. Many of the ones that I’ve found most useful are ones that I’ve only been able to access because I was fortunate enough to be affiliated with universities that paid for subscriptions. There are inequalities across institutions with respect to access to these databases and I think that’s a real problem.

Many databases also come with restrictions that can have a real effect on the kind of research that can be done using them. Can you download whole documents or are you limited to a certain number of pages? Are you allowed to do things like text mining? A few places are now providing bulk access for research purposes, but I think that’s still the exception.

Finally, I think database providers need to be transparent about both how their database has been created and how it actually works. How were/are the materials chosen? If the database is based on an existing print or microfilm collection, how was that collection created and was anything left out in the microfilming or digitisation process? If it’s black and white, was there anything originally in color? Was oversize material included?

Also important: How does the search function work? A lot of things could be happening in the background: there’s almost certainly a stop list of words not included in the search (like the word “the”), there’s probably also some kind of system in place for identifying roots and stems (how does the database handle plurals?), and there may even be some support for identifying synonyms. Whenever I see people citing search result counts, I wonder: do you know how those numbers were generated?

I’ve gone on at length here, but I recommend reading Benjamin Schmidt’s “What historians don’t know about database design…” for an extended take on this issue.
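
To make the point about search counts a bit more concrete, here's a minimal sketch in Python of how a stop list and plural handling can change the number of hits for the same query. Everything in it is an assumption made up for illustration: the stop list, the crude `naive_stem` function, the `count_hits` helper, and the three sample titles. It doesn't describe how any actual database works, and real search systems are far more elaborate.

```python
import re

# Hypothetical stop list; real databases rarely publish theirs.
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}


def naive_stem(word):
    """Crude plural stripping, standing in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def tokenize(text, use_stop_words=True, use_stemming=True):
    """Lowercase the text, split it into words, then apply the chosen rules."""
    words = re.findall(r"[a-z]+", text.lower())
    if use_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if use_stemming:
        words = [naive_stem(w) for w in words]
    return words


def count_hits(documents, query, **options):
    """Count the documents that contain every term left in the query."""
    terms = set(tokenize(query, **options))
    if not terms:
        # Nothing left to search for once stop words are removed.
        return 0
    return sum(1 for doc in documents if terms <= set(tokenize(doc, **options)))


# Made-up titles, purely for illustration.
DOCUMENTS = [
    "A pamphlet on the liberties of the subject",
    "Pamphlets and the liberty of the press",
    "The press and its critics",
]

for stemming in (True, False):
    hits = count_hits(DOCUMENTS, "pamphlet liberty", use_stemming=stemming)
    print(f"stemming={stemming}: {hits} hits")
```

Same query, same documents, and the count changes depending on a setting the person searching usually can't see. That's the kind of thing I'd want to know before citing a number.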

What is your definition of Digital Humanities (DH)?

I try not to have one, to be honest. I don’t mean to be glib. I have been following the ongoing debates about how to define the field, but what I’m primarily interested in is what this all means for research, preservation, and access. For example, as more people do or want to do types of text mining across collections, archives need to be thinking about how they can facilitate that. The same goes for access to born digital materials. And farther along in the research process, there’s the question of how you preserve the work that’s being produced: What forms will it take? Databases? Complex websites? Will there be accompanying data sets? Those are more the kinds of questions that I’m interested in right now. So while I think that the definitional debate is an important one, as it will shape the kind of work that gets done, I also feel like I’m at enough remove from it that I don’t need to stake out my own position right now. I could very well be wrong about that, though.

Interview originally conducted May 2013


  1. Since her site is offline and she's no longer using the social media account she referenced in her email from 2013, I've decided not to use her name in the post. I figure she might have deliberately taken everything down and might not want this post to skew any search results. Click through to the Wayback Machine link if you want more context. 

  2. Why yes, I did deliberately avoid the term "born digital" here. 

memories of a grad school listserv

From my personal email archives:

Date: Thu, 13 Nov 2003 22:39:56 -0800  
To: <hist-grad@[redacted].edu>
From: Andrew Berger <[redacted].edu>  
Subject: hist-grad reform  
In-Reply-To: <[redacted]>  
References: <[redacted]>  
Mime-Version: 1.0  
Content-Type: text/plain; charset="us-ascii"; format=flowed  
Sender: owner-hist-grad@[redacted].EDU  
Precedence: bulk

Year after year one topic always generates the largest proportion of 
traffic on this list. In light of this fact I would like to propose that we 
split hist-grad into two lists:

List 1: hist-grad: whatever it is hist-grad is supposed to be

List 2: meta-hist-grad: for the discussion of what is and is not 
appropriate to be posted on hist-grad

Please send all replies to this e-mail to list 2.

Thank you

I can't remember what on-list dispute led me to send this message and I don't really want to dig through my archives to figure it out. I do remember at least one person replied, on-list, in a way that suggested they took it seriously. Splitting the list wouldn't solve the problem, they said.

response to an inquiry on digital preservation

About a year ago, I received a few questions from a journalist researching a story on digital preservation and wrote back a long reply. I hope my answers were helpful, but as far as I can tell the story was never published. I've wondered if by pushing back against the "digital dark age" framing I ended up making the topic seem less compelling.

In any case, since it's been so long I figure it's ok to post my response. I don't feel comfortable posting the exact questions I got, since I didn't write them, but I think you can gather the gist from my answers. I'd been meaning to revise this into a post all last year, but never got around to it. I'm putting it up here with only a few minor edits, mostly to add/update links (though I've left them in email style). I think I'd write basically the same thing today.

Dear [redacted],

Sorry I'm later with my reply than I intended to be. Also, I wrote a lot more than I expected to, for which I apologize, but I wanted to draw some distinctions that I don’t often see made in press coverage of digital preservation. I hope it still proves useful to you.

First, I want to push back a bit on the way you've framed the question. I think it's true that if you take a collection of material on digital storage media and a collection of paper records and leave them unattended for decades, the paper records are more likely to survive until you try to read them. But why make ability to hold up under conditions of inattention and neglect the first point of reference? Digital media present particular challenges to preservation, certainly, but it's generally true of all media formats that they're less likely to survive if no one is actively taking care of them.

I also think that paper has become so integrated into our world that we don't always recognize the steps that we take to save paper as "preservation activities" because they can seem so routine. But look at the first three examples you give of lost manuscripts recently discovered. Harper Lee's manuscript for Go Set a Watchman was held in a safe-deposit box; the Irish researchers who found the Einstein manuscript were doing research in the Albert Einstein Archives; the Quranic manuscript leaves found in Birmingham had already been selected for preservation in a university research library. Each of these discoveries had more to do with recognizing the significance of the manuscripts than with establishing their existence. As physical items, these documents were all already being managed for preservation.

I don't want to exaggerate my point too much here. Many types of paper, along with related media like parchment and vellum, have proven remarkably durable over the centuries. I seriously doubt there will be a digital analogue to the manuscripts found in caves in Dunhuang in the early 20th century. I just think it's important to recognize that most of the historical manuscripts that are still around today survived because people actively worked to ensure their survival.

From this perspective, I think we're going to see some very divergent outcomes with digital material. Many artifacts and records that are not actively managed are going to be lost, or already have been lost. But material that we, as individuals or as institutions, put time and effort into maintaining is likely to survive for quite a while. The key point is that we need to make a commitment to digital preservation and to devote the resources necessary to carry out that commitment. The threats to digital preservation are as much economic, social, and political as they are technical, and many of those threats - war, economic collapse and so on - apply to all cultural heritage materials across the board, not just digital materials.

That said, I take your question to be "what can we do to preserve digital material that we've committed ourselves to preserving?" and that's the question I really should be answering. I'll try not to get too caught up in the details.

There are a lot of different types of digital material, on a lot of different media, but the main principle I think we've learned is: start paying attention to what you want to keep earlier rather than later. This is especially true for materials stored on digital media that is going or has gone out of production, especially if that media can only be read using hardware that is also going obsolete. Some of the more difficult cases of technological and format obsolescence are materials stored on media like old types of magnetic tape where the data was never migrated forward as technology changed.

There's often a period when a new technology comes online where you can still move relatively easily between the old and the new because so many people are in the same situation, and there’s an economic incentive for the makers of the new technology to help you leave behind the old one. If you don't take advantage of that time to move your stuff forward - to a new type of data tape, from tape to disk, from floppies to hard drives, from one version of a format to a newer one, etc. - it can be much more difficult to do this later.

This is a real challenge for the museum, as our collection includes many different types of early computer storage media, ranging from punched cards and paper tape to metal and magnetic tape (in addition to more modern formats). This is not an area I've worked on directly, but as I understand it some of this material can only be read either with original hardware or through sometimes difficult engineering and reverse engineering processes. A fair amount of this material - I don't know how much - was acquired by the museum after it had been sitting in storage for years, long after it had last been accessed.

Once you get to more modern formats - and by "modern" I mean from around the time of floppies and early personal computers to the present - the challenges of reading the media today aren't as great because the equipment to read them was more widely produced and remains available. But the clock is always ticking, and most people who still own their floppy disks probably no longer have floppy drives to read them.

Many libraries and archives are now investing in that equipment - usually focusing on 5.25" floppies and later, but their choices will depend on the needs of their collections - as acquisitions are coming in now that include computer media and hardware along with paper. These range from the personal papers of individuals like writers, artists, academics, lawyers, etc. to the records of organizations and businesses. You'd be surprised by how much of that stuff has been kept and can still be read with the proper equipment. There are always going to be some unreadable disks, but I don’t think we’re looking at a catastrophic situation.

There is a problem here for cultural heritage institutions, though: there's often a gap of a few decades or more between when material is created and when it's transferred to an archive, library, or museum. We're fortunate that you can still find equipment to read floppies and Zip disks, but that won't always be the case. We're also fortunate that people have kept their media items past the point when they were able to read them on their own computers. Some sort of time gap is always going to be there, and some people will never want to donate to a separate institution, so there's also a need to educate people on how to manage their own digital material.

Digital preservation can't be something that only "digital preservation professionals" do; it needs to be something within reach of individuals who want to keep their own stuff during their lifetimes and then pass it on to friends and family. You're probably not going to run some ISO-certified preservation repository at home, but you should be able to keep your photos and home videos for yourself for as long as you want to keep them.

Along these lines, I recommend taking a look at the Library of Congress' resources on personal digital archiving: http://digitalpreservation.gov/personalarchiving/

There's also a conference on Personal Digital Archiving, which tends to be more practitioner oriented:
2016: http://www.lib.umich.edu/pda2016 [2018-11-28: changed link to archived version]
2017: https://library.stanford.edu/projects/personal-digital-archiving-2017

Moving on to the software question…

Assuming that you can read the data off of whatever physical media you're working with, and assuming you're not dealing with a situation where the original hardware and media is an integral part of what you're preserving - as might be the case with new media art - you're then faced with the issue of how to ensure that that data remains readable and accessible in the long term. How you approach this is going to depend on a number of factors, such as file format, operating system, and intellectual property law.

Maybe I'm too optimistic, but I think we'll find that many widely-adopted and widely-shared file formats are going to remain readable for a surprisingly long time, in large part because they've been so widely-adopted and shared. The need to make files readable across platforms and the need to make applications backwards compatible with older file format versions can end up combining to benefit digital preservation.

I won't get into it in this email, but I'm generally persuaded by David Rosenthal's arguments that format obsolescence isn't a huge threat when it comes to widely adopted formats (PDF, commonly used word processing formats, JPEG, HTML, etc.) because people have created, and have incentives to continue to create, applications that can render those formats.

A good place to start is his post of a few years ago on obsolescence scenarios: http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html

There's also a good list of Rosenthal's posts on format obsolescence here: https://wsampson.wordpress.com/2011/07/18/format-obsolescence-maybe-not-such-a-bugaboo/

Beyond just taking formats as they are, there's also ongoing work on standardizing formats to make them more durable for long-term preservation. Take a look at the EU-funded Preforma project, which is focusing on formats for documents (PDF), images (TIFF), and audio-visual files (multiple formats): http://www.preforma-project.eu/project.html The hope here is that these formats will be integrated into applications that create these types of files, so that people wanting to produce a preservation-ready file could choose the preservation version of the format at the time of creating the file. Think of it as analogous to choosing acid-free paper over cheaper paper when printing a book meant to last.

Of course, there's a lot of material out there that doesn't use widely-available formats and many applications are not backwards compatible. There are also cases where you can open an old format in a new application, but not without an unacceptable loss of fidelity to the original. In these cases, you can often get a fairly faithful rendering of the original file using emulation, and emulation has gotten much easier to implement over the years. But in other cases emulation might not be an option because of technical problems or legal restrictions.

Some of the most at-risk content consists of files that can only be opened in one application, especially if the developer is not committed to backwards compatibility or to making old software versions available for preservation purposes. There are specialized applications where it’s hard to see preservation even being possible without buy-in from the company that produces the application. There are also particular challenges to preserving social media and content created with cloud-based applications, which I won’t get into at this point.

So to sum up (since I really should be wrapping this up): I don't think the situation facing digital preservation is as dire as it's sometimes presented (cf. recent and past articles about a "looming Digital Dark Age"). Digital material does require more intervention than paper to keep it alive, but inattention has never been an effective preservation strategy. There's a core of material that's likely to be kept accessible as long as we maintain the ability to use computers. But there's also a significant amount of material at a higher risk of loss, either because it's from the really early era of computing (1970s and before), or because it's dependent on very particular software or hardware environments that are difficult to reproduce.

Finally, I want to emphasize that there's a community of practice around digital preservation that has been working on these issues for years, and I hope any article on the topic will acknowledge that work. I don't think preserving things "forever" can ever be considered a solved problem, but I am optimistic that we aren't suddenly going to lose access to much of our digital heritage, at least not while people are actively working to preserve it.

I’ll leave you with a few more links to resources below:

On emulation, I'll again recommend some of David Rosenthal's work, as he's just published a paper on "Emulation & Virtualization as Preservation Strategies":
http://blog.dshr.org/2015/11/emulation-virtualization-as.html

On developing a global framework for software preservation, including participation from industry, see the UNESCO PERSIST project:
http://unesco.nl/digital-sustainability
http://unesco.nl/en/node/2665

On conservation of digital art, see this recent Cornell white paper:
https://ecommons.cornell.edu/handle/1813/41368 (paper)
http://blogs.loc.gov/digitalpreservation/2015/12/authenticity-amidst-change-the-preservation-and-access-framework-for-digital-art-objects/ (interview with authors)

A few of the organizations working on digital preservation:

National Digital Stewardship Alliance (US):
http://ndsa.org/

Open Preservation Foundation (international, started as an EU-funded initiative):
http://openpreservation.org/

Digital Preservation Coalition (UK):
http://dpconline.org/

Best, Andrew

inherent virtue

Let's say you don't think digital will survive so you print everything (that you consider to be at the time of printing) important.

You:

  • choose decent quality papers and inks, maybe even the best you can afford
  • use a high quality printer
  • use folders and boxes that won't quickly degrade
  • regularly keep up your printing and storage supplies
  • keep your boxes in an appropriate climate
  • protect against damage from fire, water, and animals
  • move all of your printouts every time you move, or make provisions to have someone manage them on your behalf
  • continue to pay your paper, ink, printer, folder, box, storage unit, housing, and moving bills
  • find someone to take custody of your printouts after you die and manage them at least as well as you did

While you're doing all of that, take a moment to stop and ask yourself, are these printouts being preserved because something inherent in the physical properties of paper makes them survive, or because you've decided it's worth investing time and money to make sure they survive? Then ask yourself what it would take to preserve the things you didn't, or couldn't, print but that you still want to be kept.

unresolved

When I posted last January, and then in February, I thought I might be on pace to beat my blogging output of 2013, when I only managed to write three posts. And then I didn't post again in 2014.

This is in no way a New Year's Resolution, but I'm getting tired of not writing things, so I'm going to make a real effort to write again this year. I have a new job, have learned a bunch of things that are technical but hopefully not too boring, and I'm hoping to pick up some of my history interests again. If I post once a month, that would be great.

This post counts as January's, but I hope it's not the only one.