spending less time looking things up with json and jq

I spent a fair bit of time recently trying to figure out how to spend less of my time looking for data in JSON files that contain millions of objects. Since the solution I've come up with is working well enough for now, but was also non-obvious to me as I put it together, I figured I'd write it up.

Here's where I started:

  • a file full of metadata: 2,199,547 lines of metadata I exported from a system in JSON format (one object per line)
  • a file full of identifiers: 186,258 unique identifiers, also as JSON
  • a goal: every line from the metadata file corresponding to each of the identifiers

If you're not familiar with JSON and the tool I'm going to talk about in this post, jq, I recommend working through "Reshaping JSON with JQ", a tutorial at The Programming Historian that uses data from Twitter and an art museum as examples. It covers a lot of ground, especially the basics, better than I could.

This post might not make a lot of sense without that background, though I've tried to lay things out as clearly as I can. I'm going to start out by going through how I approached the problem and why I rejected a bunch of ideas I had at first. If that doesn't interest you, you can skip ahead to my solution.

the structure of the data

The lines in the metadata file look like this:1

{"id":"druid:fg419sg6615","title_ssi":"Palo Alto Railroad","accessioned_latest_dttsi":"2017-04-01T02:44:07Z","status_ssi":"v3 Accessioned"}
{"id":"druid:another_id","title_ssi":"another title","accessioned_latest_dttsi":"2017-04-01T02:44:07Z","status_ssi":"v3 Accessioned"}
{"id":"druid:unmatched_id","title_ssi":"not in the identifier file","accessioned_latest_dttsi":"2020-02-01T02:44:07Z","status_ssi":"v3 Accessioned"}

The lines in the identifier file look like this:

{"id":"druid:fg419sg6615"}
{"id":"druid:another_id"}

Other key features of the data:

  1. Each line in each file is unique
  2. Each "id" value is unique within each file
  3. Each "id" value in the identifier file matches an "id" value in the metadata file

That's basically it for the structure of the data. The files themselves are not particularly complex. The challenge is in the scale: there are ways to process a file with 2 million lines that are fast, and ways that are really slow.

the goal

Again, the result I wanted was:

  • every line in the metadata file that
  • has the same value for "id" as a line in the identifier file

And I wanted that result to also be JSON.

the solutions I rejected

The simplest way to do this, conceptually, would be to write a loop. Leaving aside the specifics of how to do this in any particular language, it could look something like:

for each line in the identifier file  
    look for the same "id" value in the metadata file  
    when you find it, output that metadata to a third file for storing results  
repeat until you've done this 185,000 times
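
For concreteness, a version of that loop in bash with jq might look something like this (a sketch only; as the next paragraph explains, you would not want to actually run it against files this size):

# for each identifier, scan the entire metadata file looking for it
while read -r line; do
  id=$(echo "$line" | jq -r '.id')
  jq -c --arg id "$id" 'select(.id == $id)' metadata.json >> results.json
done < identifiers.json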

The problem with this strategy is that it processes the metadata file once every time it runs through the loop. Even if that took only a second, it would take 185,000 seconds, or about 51 hours, to get through all of the ids.

Even worse, it took a lot more than one second to look up one value in the metadata file with a command like:

jq 'select(.id == "druid:fg419sg6615")' metadata.json

It actually took more like 15 seconds for that one lookup. At that rate, 185,000 lookups would take about a month. That was clearly not something I was going to wait for.

So I looked at a few alternatives, based on what I've learned over the years about processing text files with command line tools. The main two I considered were:

1. using grep instead of jq

grep is incredibly fast at finding text in files. Since every identifier is unique, why not just grep for it? I tried this out and it's true: grep finds a single value in the metadata file much, much more quickly than the jq expression I used above.

The problem with using grep here is that it makes no use of the JSON structure, so it doesn't make a distinction between an identifier in the "id" field and the same string of text elsewhere in the file. It's possible to correct for this in how you write the command, but it's kind of annoying to have to do that when there's already a field right there that says "id".
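
For what it's worth, with data shaped like this (no spaces around the colons, and the full value in quotes), you can anchor the match to the field name, something like:

grep -F '"id":"druid:fg419sg6615"' metadata.json

This still depends on each line being serialized exactly the same way, which is part of what makes it annoying.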

2. using parallel processing

Instead of one looooong looooop, I considered using parallel processing to have the computer run as many lookups as possible at the same time. This way it would still make 185,000 lookups, but at least it would be running multiple lookups at the same time.

I'm not going to go into the details, but I did try both of these strategies together: I used GNU parallel to distribute as many as four grep commands at once. This went quickly enough that I could have accepted it as a solution and called it a day. But it still took over six hours, if not a literal day.
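
Roughly, the combination looked something like this (a sketch rather than the exact command, with the ids pulled out of the identifier file first):

# run up to four greps at a time, one per identifier
jq -r '.id' identifiers.json | parallel -j 4 grep -F {} metadata.json > results.json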

I knew there had to be a better way. And there was.

a solution: use a whole lot of memory but read the files only once

If you've read this far and you have experience parsing large files, you're probably wondering: why not read through the files just once? Why take only one value at a time? Those are good questions!

When I started this process, I knew the loop would be slow and I knew it should be possible to search the metadata file for all of the "id" values I wanted at once. What I didn't know was how to actually do that. Which is why I explored the alternatives first. If they were good enough, I could stop there. But they weren't.

So I went back to the jq manual to see what else it offered. And then I went to the search engines and Stack Overflow-type sites, because the jq manual tends to be too sparse with its examples for me to figure out how to use any of the functionality I haven't used before.

Here's what I came up with; it's a long chain of operations:

cat metadata.json identifiers.json | jq -s 'group_by(.id) | .[] as $groups | if ($groups | length) == 2 then $groups[0] + $groups[1] else empty end' > results.json

A line like that requires some explanation. In general terms, here's what happens:

  1. Concatenate the two files into one big blob of JSON
  2. Read the whole JSON blob into memory
  3. Group everything together by the "id" field
  4. For each grouping, count the size of the group
  5. A size of 2 means a match between an identifier JSON and a metadata JSON
  6. In that case, merge the two JSON objects together and output that result
  7. If the size isn't 2, don't output anything

As complex as that looks, it turns out that it runs really quickly: on the same computer where it took 15 seconds to get one value, it took about 45 seconds to get all 185,000 values using this method. That's a whole lot less than 6 hours.

There was one catch: this command used around 8 GB of memory at its peak while it ran. Out of curiosity, I ran it on a computer with 4 GB of memory and it froze the system, then crashed. So it's not a perfect solution.

Here are the details for each command:

cat metadata.json identifiers.json

This concatenates the two sets of JSON lines into the one big blob of JSON.

jq -s

This loads the whole JSON blob into memory and makes it all available to jq for processing. The next set of commands between the 'single quotes' are all for jq.
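
As a tiny illustration of what -s (for "slurp") does, take two made-up objects on separate lines:

printf '{"a":1}\n{"b":2}\n' | jq -c -s '.'

This prints [{"a":1},{"b":2}]: the separate inputs become one array that the rest of the filter can work on all at once.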

group_by(.id)

This groups all the lines with matching "id" values together as arrays. The result is one big array made up of arrays:

[
  [
    {
      "id": "druid:another_id","title_ssi": "another title",
      "accessioned_latest_dttsi": "2017-04-01T02:44:07Z",
      "status_ssi": "v3 Accessioned"
    },
    {
      "id": "druid:another_id"
    }
  ],
  [
    {
      "id": "druid:fg419sg6615",
      "title_ssi": "Palo Alto Railroad",
      "accessioned_latest_dttsi": "2017-04-01T02:44:07Z",
      "status_ssi": "v3 Accessioned"
    },
    {
      "id": "druid:fg419sg6615"
    }
  ],
  [
    {
      "id": "druid:unmatched_id",
      "title_ssi": "not in the identifier file",
      "accessioned_latest_dttsi": "2020-02-01T02:44:07Z",
      "status_ssi": "v3 Accessioned"
    }
  ]
]

The grouping on "id" means that each sub-array contains

  • at least one JSON object from the metadata file,
  • and if there's a match in the identifier file, there will be a second JSON object that came from the identifier file

In other words, each sub-array will have a length of 2 if there's a match and 1 if there isn't. Remember that in this data there is either a match or there isn't - with different data you'd need to look for other patterns.

The next step is to get jq to do this count on the fly. The trick is getting the groupings out of the one big outer array and into a variable:

.[] as $groups

The .[] operation looks inside the big array, making each smaller array available to jq. You could think of it as removing the outer brackets, though that's probably not technically correct.
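
A quick way to see .[] on its own, with a made-up array of arrays:

echo '[[1,2],[3]]' | jq -c '.[]'

This prints [1,2] and then [3] as two separate results; the outer array is gone.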

Next, the as $groups is jq's way of storing results in a variable. So what this does is take the sub-arrays and store them in a variable that can be accessed as $groups. Now it's time to identify the matches.

if ($groups | length) == 2 then $groups[0] + $groups[1] else empty end

This line is a conditional that separates groups with matches from groups without matches. jq has a function to count length (appropriately named length), so the first part of the statement checks each group to see if the length is 2, signifying a match. If that's true, then it's time to output the metadata.
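
As a standalone example, running length over a trimmed-down version of one of the groupings above counts its two objects:

echo '[{"id":"druid:fg419sg6615","title_ssi":"Palo Alto Railroad"},{"id":"druid:fg419sg6615"}]' | jq 'length'

This prints 2, which in this data means the id appears in both files.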

But there's one more catch here: the value of each grouping is an array of two objects, the one from the identifier file and the one from the metadata file. For example:

[
  {
    "id": "druid:fg419sg6615"
  },
  {
    "id": "druid:fg419sg6615",
    "title_ssi": "Palo Alto Railroad",
    "accessioned_latest_dttsi": "2017-04-01T02:44:07Z",
    "status_ssi": "v3 Accessioned"
  }
]

So if we were to output the value at this point, we'd be outputting arrays, not the JSON that's the real end goal.

My strategy for getting around this was to use jq's ability to merge two JSON objects into one, which somewhat confusingly gets represented by addition with the + sign. This is what's happening with $groups[0] + $groups[1]. That operation takes the fields from both objects and merges them into one object. When both objects share a key, jq keeps the value from the right-hand object, and since the two "id" values are identical here, the merged object simply ends up with a single "id" field.
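
To see the merge on its own, with the same kind of trimmed-down grouping:

echo '[{"id":"druid:fg419sg6615"},{"id":"druid:fg419sg6615","title_ssi":"Palo Alto Railroad"}]' | jq -c '.[0] + .[1]'

This prints {"id":"druid:fg419sg6615","title_ssi":"Palo Alto Railroad"}: one object, with the fields of both and only one "id".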

So in the end, the result of this command is that the metadata for matches is sent to the output.

What if there isn't a match (i.e. the array length is not 2)? The else empty means that nothing is output at all. Unlike some other languages and tools, jq seems to require you to have an else statement, even if it's just else empty.

> results.json

Finally, the last part of this long string of commands is to output the whole set of results into a new file using bash redirection.

Conclusion

I still can't believe that the longer, more complex command runs so quickly. That just shows the value of using a tool like jq that's optimized for processing JSON rather than treating it all like an unstructured block of text.

The only drawback is the memory usage. It happens that I have 8 or 16 GB of RAM available, depending on the computer I use, but it would be nice to find a way to run this that isn't so memory-intensive. As a test, I re-ran this on a small, low-end laptop with 4 GB of RAM and the system ran out of memory, froze up, and finally killed the command rather than crashing entirely.2

So what I have isn't perfect but it fulfills the high-level criteria of "good enough." Now I'm going to go looking for queries to run.


  1. The actual data has many more fields, but I've truncated this in the interest of readability. What matters most here is that each line is a JSON object with a value for "id". 

  2. I did find that if I split the big metadata file into two, ran the command on each part, then merged the two sets of results, I could still get to my goal, albeit a bit more slowly (a rough sketch of this follows below). So that's one possible way around the memory limit.
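
     A sketch of that workaround (the exact split point doesn't matter much, as long as each half plus the identifiers fits in memory):

     # split the metadata into two halves, run the same filter on each half
     # plus the identifiers, then collect both sets of results into one file
     split -l 1100000 metadata.json metadata_part_
     for part in metadata_part_aa metadata_part_ab; do
       cat "$part" identifiers.json | jq -s 'group_by(.id) | .[] as $groups | if ($groups | length) == 2 then $groups[0] + $groups[1] else empty end'
     done > results.json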

collecting and archiving digital video

A couple of years ago I gave a presentation titled "Collecting and archiving born-digital video" at the Society of California Archivists (SCA) 2017 annual conference in Pasadena. I'd always meant to post it online, but I wanted to write up text to accompany the slides. I've finally now done that.

For context, my talk was part of a session titled "Born digital: care, feeding, and intake processes." Other speakers represented the LOCKSS program at Stanford, the Internet Archive, and the Jet Propulsion Laboratory. A full description of the panel is available under session 10 in the SCA 2017 conference program.

A note about the text, which is true of pretty much every talk I've given since library school: I have a difficult time reading from notes when I do public speaking, so unless there's a long quotation I want to read exactly, I go without them. As a result, I don't have an official text for the talk, just the slides.

Since I'm essentially writing it up for the first time in 2019, it's not exactly what I said in 2017, not least because it's probably longer than fifteen minutes as written. But I've decided to go with the present tense anyway.

Slide 1: title slide

Hi, I'm Andrew Berger and I'm going to be talking about collecting and archiving in-house digital video, by which I mean video produced in digital formats. For background, I'm going to begin by telling a story, a story about a museum.

Slide 2: Background: oral history as collecting

The Computer History Museum has had an active oral history program since the early 2000s. There are some oral histories from even earlier, mostly on audio tape, but the pace really picked up in the 2000s with the use of DV tape. Almost every oral history from that point on has been on video.

Around 2010-2011, there was a shift from using digital tape to using digital files, with all-file workflows taking over entirely around 2012-2013. Around the same time there was also a shift towards treating oral histories as productions, with an eye towards reuse in exhibits and documentaries. All oral histories are now shot in HD, many with special lighting, separate microphones, and a green screen.

Oral histories are seen as a collecting activity, with every production ultimately creating files that the museum is dedicated to preserving in the permanent collection.

Slide 3: Story continues: lectures and events

Oral histories aren't the only kind of video the museum produces. Lectures and other events are routinely recorded, dating all the way back to the 1980s. These recordings span analog and digital formats, with DV tape taking over from Betacam and VHS in the early 2000s. Just like oral histories, these moved to all-file workflows starting around 2010-2011.

The museum has kept these videos as well, but is that the same as collecting them?

Slide 4: Story continues: exhibit videos

The museum also produces its own exhibit videos. These include not just the videos shown in the exhibits (on-site and online) but also the raw footage shot specifically to be used in exhibit videos. As with just about any video production process, it takes hours of raw footage to produce a few minutes of a final production.

Exhibit video production increased significantly around 2009 in preparation for two major exhibits. As with oral history and event video, the museum retains its exhibit video.

Slide 5: Meanwhile: more video productions

Beyond oral histories, events, and exhibits, the museum produces even more kinds of video. There's footage of docent trainings, educational events, marketing campaigns--essentially, any public-facing museum function could have a video component.

Slide 6: But wait, there's more!: Revolutionaries TV show

But wait, there's more! For five years the museum also produced a television show for the local PBS affiliate, KQED. This was an hour-long interview show that drew heavily from the museum's live lecture series.

The production of the tv show was essentially all-digital from the start. The lectures were recent enough to have been shot in HD, then additional editing and post-production work was done to transform them into files meeting broadcast standards.

The museum has retained copies of its television episodes as well.

Slide 7: To sum up: a lot of digital video

To summarize, the museum has produced and is producing a lot of digital video, with a significant increase starting around 2009. This increase has coincided with a shift towards file-based rather than tape-based workflows.

Slide 8: How much is "a lot"? Chart showing video production by year in terabytes

Now, you might be wondering, how much do I mean by "a lot"? This chart shows the approximate growth in video production from 2011 to 2015, measured in terabytes. The chart starts in 2011 because that's the first year for which I could generate reliable data, in part because for earlier years I would have to go back to tape. You can see there was nearly a doubling in production in four years, from just over 10 TB in 2011 to 20 TB in 2015. [Note from 2019: it's around 30 TB now.]

On a panel with representatives from LOCKSS, the Internet Archive, and JPL, twenty terabytes in a year might not sound like a lot. But keep in mind that for every byte of video, someone had to schedule the event, set up the mikes and cameras, monitor the feeds, transfer the data, and edit the final products. This is artisanal, hand-crafted data.

Slide 9: Where does it all go? DV tape (picture of boxes of DV tape)

Where has all this data been going? First, as I've mentioned, it mostly went onto DV tape, which is stored in archival boxes.

Slide 10: Where does it all go? Hard drives (picture of pile of hard drives)

For a while, a lot of the data was also being stored on external hard drives. I don't even want to talk about the hard drives.

Slide 11: Where does it all go? Servers (no photo of server shown)

Finally, it started being stored on servers. These have grown quite a bit in size. Since about 2013, the media production team has had their own system combining server space with LTO tape to use for editing and backup. And in 2015, the museum launched a digital preservation repository. The vast majority of the data in the repository so far is video.

Slide 12: The challenge of digital video: developing a new approach

How has the museum adapted to the challenge of managing and preserving all this video? That's the subject of the rest of my talk.

Slide 13: File formats: a problem, but...

First, file formats. I'm not actually going to spend a lot of time talking about formats, even though it sometimes feels like that's the main thing people want to talk about when I get asked about digital video preservation. I will say that there are multiple different production formats and codecs in the collection, some of which are proprietary.

In the absence of a generally agreed upon normalization target within the broader digital video community, we've settled for preserving the "original" digital formats as long as they can be opened with open-source tools, specifically VLC and/or ffmpeg. We might normalize or migrate to other formats in the future but at this point the costs and risks of normalization would outweigh the benefits. This decision does have the side effect of making the preservation of VLC and ffmpeg more important to our overall preservation efforts.

Note that this policy for digital sources is different from our policy for digitization from analog sources, where the file format is entirely our choice.

Slide 14: Technical infrastructure: video files are intense

Video files are intense: not only do they take up a lot of disk space, they also require more powerful processors for rendering and playback. The museum has had to upgrade its technical infrastructure to handle them. As I mentioned above, server space for editing and storage has been greatly expanded. So has internal network bandwidth and computing power for staff that work heavily with video.

Slide 15: Conceptual approach: what are we looking at?

But the biggest changes have been in terms of what I'd call the conceptual approach. What does it mean to say you collect or archive video? Oral histories, as I've mentioned, fit solidly into the collecting paradigm, even if they are produced in-house. They've always been on a workflow to be accessioned, cataloged, transcribed, shared.

What about the other video production types? What about their raw footage? What if there's only raw footage?

On this slide I've included two photos of boxes of digital video tape. One contains oral histories and has nicely printed labels. The other contains lectures, and though you can't see it on the slide, the boxes are labeled in pencil. I think this captures the tentative nature of how video was being preserved outside of oral history, as if it still wasn't ready for the permanence of labels.

Slide 16: Conceptual approach: what is video, really?

One way to look at the problem is to step back and ask, what is video, really? Yes, videos are shot to create digital objects, but video is also made to document activities. Video production can be seen as an institutional function that produces institutional records as video.

As it happens, the museum doesn't really have an institutional archive separate from the general collection. Nonetheless, it's been helpful to approach video as records and to ask, if there were an institutional archive, and if it were arranged around functions and departments, then where would these video files fit into that arrangement? Can they be understood in terms of records scheduling and retention (but without actually using those terms in casual conversation)?

Slide 17: Appraisal and selection: can't preserve literally everything

Given that it is not, in fact, possible to preserve everything, we've been working out appraisal and selection policies with the media production team and other interested parties, depending on the type of video project and how it fits within the context of the whole institution.

For example, a single lecture could produce upwards of 500 GB: three camera feeds, one live "switch" feed drawing from the three cameras, final edits for the online audience, and possibly even another version for broadcast television. For long-term preservation we save only the switch, the public audience version (i.e. the version posted to YouTube), and the final broadcast file if there is one. The goal is to balance future re-use value against documenting the event, rather than saving every single file.

For other kinds of production, such as exhibit videos, we keep much more of the raw footage, sometimes even every camera feed. The media team also saves more data on their production system, but these are considered active records within the context of the whole system.

Slide 18: Staffing: regular lines of communication

Another significant change we've made has been in terms of staffing. Previously, there was a position of Oral History Coordinator, which focused mostly on the oral history program, but had some duties related to video production. For various reasons, this position didn't manage or monitor video files after they were produced, and no archivist became involved until the end of the workflow. But the previous holders of this position kept getting promoted (a good thing!) to work more specifically on content production, which had the side effect of making us take a fresh look at the whole video production and archiving workflow.

In the new arrangement, there's an AV Archivist who works closely with one of the Media Producers as the main points of contact between teams. The AV Archivist keeps abreast of all productions that will potentially produce material to be preserved and, in combination with myself, tracks files from production to preservation and access.

There's also a regular check-in between the archivists and the media team where we keep up with current projects and review what's working and what could be improved.

Slide 19: The moment of transfer

The final significant change we made was to how we manage the actual moment of transfer of files between the production team and the archives. Previously, files were being dragged and dropped over the network, which is not the ideal way to move files that could be hundreds of gigabytes in size. When a copy was interrupted, there was often no way to tell that something had gone wrong until an archivist found a corrupt file.

Now the production system has been configured to generate checksums for each file, and to send files to the archives staging area with corresponding checksums included in an accompanying XML file. This way the archivists can validate the checksums as the files arrive and identify problems earlier in the process.
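
The manifest format itself isn't the interesting part; the check is the same basic idea as the standard checksum-file workflow that md5sum (or sha256sum) supports, roughly like this (the file names here are made up for illustration, and in our case the checksums actually travel in the XML file the production system generates):

# on the production side: record a checksum for every file in the transfer
md5sum lecture_cameras*.mov > manifest.md5

# on the archives side, once the files arrive: verify each file against the manifest
md5sum -c manifest.md5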

This is all still a work in progress, so I don't really have a grand conclusion. There's still a lot to do in terms of improving file management and settling on appraisal and selection decisions.

But I will say that as the museum has continued to increase video production in nearly all areas, it's difficult to see how the increase could have been managed without making these larger organizational changes. We may have started with preservation as the end of the workflow, but bringing preservation concerns upstream has helped us see the bigger picture. This has really paid off when requests have come in for previously produced footage and we've been able to say, without too much trouble, yes, we have that and we can get it again.