I spent a fair bit of time recently trying to figure out how to spend less of my time looking for data in JSON files that contain millions of objects. Since the solution I've come up with is working well enough for now, but was also non-obvious to me as I put it together, I figured I'd write it up.
Here's where I started:
- a file full of metadata: 2,199,547 lines of metadata I exported from a system in JSON format (one object per line)
- a file full of identifiers: 186,258 unique identifiers, also as JSON
- a goal: every line from the metadata file corresponding to each of the identifiers
If you're not familiar with JSON and the tool I'm going to talk about in this post, jq, I recommend working through "Reshaping JSON with JQ", a tutorial at The Programming Historian that uses data from Twitter and an art museum as examples. It covers a lot of ground, especially the basics, better than I could.
This post might not make a lot of sense without that background, though I've tried to lay things out as clearly as I can. I'm going to start out by going through how I approached the problem and why I rejected a bunch of ideas I had at first. If that doesn't interest you, you can skip ahead to my solution.
the structure of the data
The lines in the metadata file look like this:1
{"id":"druid:fg419sg6615","title_ssi":"Palo Alto Railroad","accessioned_latest_dttsi":"2017-04-01T02:44:07Z","status_ssi":"v3 Accessioned"}
{"id":"druid:another_id","title_ssi":"another title","accessioned_latest_dttsi":"2017-04-01T02:44:07Z","status_ssi":"v3 Accessioned"}
{"id":"druid:unmatched_id","title_ssi":"not in the identifier file","accessioned_latest_dttsi":"2020-02-01T02:44:07Z","status_ssi":"v3 Accessioned"}
The lines in the identifier file look like this:
{"id":"druid:fg419sg6615"}
{"id":"druid:another_id"}
Other key features of the data:
- Each line in each file is unique
- Each "id" value is unique within each file
- Each "id" value in the identifier file matches an "id" value in the metadata file
That's basically it for the structure of the data. The files themselves are not particularly complex. The challenge is in the scale: there are ways to process a file with 2 million lines that are fast, and ways that are really slow.
the goal
Again, the result I wanted was:
- every line in the metadata file that
- has the same value for "id" as a line in the identifier file
And I wanted that result to also be JSON.
the solutions I rejected
The simplest way to do this, conceptually, would be to write a loop. Leaving aside the specifics of how to do this in any particular language, it could look something like:
- for each line in the identifier file
- look for the same "id" value in the metadata file
- when you find it, output that metadata to a third file for storing results
- repeat until you've done this 185,000 times
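In bash, the same idea could be sketched out like this, with jq doing the matching one identifier at a time (a sketch of the shape of the loop, not code I actually ran):

# for each identifier, scan the whole metadata file for the matching object
while read -r line; do
  id=$(echo "$line" | jq -r '.id')
  jq -c --arg id "$id" 'select(.id == $id)' metadata.json >> results.json
done < identifiers.json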
The problem with this strategy is that it processes the metadata file once every time it runs through the loop. Even if that took only a second, it would take 185,000 seconds, or about 51 hours, to get through all of the ids.
Even worse, it took a lot more than one second to look up one value in the metadata file with a command like:
jq 'select(.id == "druid:fg419sg6615")' metadata.json
It actually took more like 15 seconds for that one lookup. At that rate, 185,000 lookups would take about a month. That was clearly not something I was going to wait for.
So I looked at a few alternatives, based on what I've learned over the years about processing text files with command line tools. The main two I considered were:
1. using grep instead of jq
grep is incredibly fast at finding text in files. Since every identifier is unique, why not just grep for it? I tried this out and it's true: grep finds a single value in the metadata file much, much more quickly than the jq expression I used above.
The problem with using grep here is that it makes no use of the JSON structure, so it doesn't make a distinction between an identifier in the "id" field and the same string of text elsewhere in the file. It's possible to correct for this in how you write the command, but it's kind of annoying to have to do that when there's already a field right there that says "id".
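For example, since the metadata file has no spaces around its colons, you can approximate "only match the id field" by including the field name in the fixed string you search for:

grep -F '"id":"druid:fg419sg6615"' metadata.json

That works on this particular file, but only because the formatting happens to be that consistent, which is exactly the kind of fragility I'd rather not depend on.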
2. using parallel processing
Instead of one looooong looooop, I considered using parallel processing to have the computer run as many lookups as possible at the same time. This way it would still make 185,000 lookups, but at least it would be running multiple lookups at the same time.
I'm not going to go into the details, but I did try both of these strategies together: I used GNU parallel to distribute as many as four grep commands at once. This went quickly enough that I could have accepted it as a solution and called it a day. But it still took over six hours, if not a literal day.
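A rough sketch of that combination (not the exact command): extract the ids with jq, then let GNU parallel run up to four greps at once:

jq -r '.id' identifiers.json | parallel -j 4 grep -F '"id":"{}"' metadata.json > results.json

parallel substitutes each id into the {} placeholder and caps the number of simultaneous jobs at four.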
I knew there had to be a better way. And there was.
a solution: use a whole lot of memory but read the files only once
If you've read this far and you have experience parsing large files, you're probably wondering: why not read through the files just once? Why look up only one value at a time? Those are good questions!
When I started this process, I knew the loop would be slow and I knew it should be possible to search the metadata file for all of the "id" values I wanted at once. What I didn't know was how to actually do that. Which is why I explored the alternatives first. If they were good enough, I could stop there. But they weren't.
So I went back to the jq manual to see what else it offered. And then I went to the search engines and the Stack Overflow-style sites, because the jq manual tends to be too sparse with its examples for me to figure out how to use any of the functionality I haven't used before.
Here's what I came up with; it's a long chain of operations:
cat metadata.json identifiers.json | jq -s 'group_by(.id) | .[] as $groups | if ($groups | length) == 2 then $groups[0] + $groups[1] else empty end' > results.json
A line like that requires some explanation. In general terms, here's what happens:
- Concatenate the two files into one big blob of JSON
- Read the whole JSON blob into memory
- Group everything together by the "id" field
- For each grouping, count the size of the group
- A size of 2 means a match between an identifier JSON and a metadata JSON
- In that case, merge the two JSON objects together and output that result
- If the size isn't 2, don't output anything
As complex as that looks, it turns out that it runs really quickly: on the same computer where it took 15 seconds to get one value, it took about 45 seconds to get all 185,000 values using this method. That's a whole lot less than 6 hours.
There was one catch: this command used around 8 GB of memory at its peak while it ran. Out of curiosity, I ran it on a computer with 4 GB of memory and it froze the system, then crashed. So it's not a perfect solution.
Here are the details for each command:
cat metadata.json identifiers.json
This concatenates the two sets of JSON lines into the one big blob of JSON.
jq -s
This loads the whole JSON blob into memory and makes it all available to jq for processing. The next set of commands between the 'single quotes' are all for jq.
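As a tiny illustration of what -s (short for --slurp) does, unrelated to the real data, two separate objects on standard input come out as one array:

echo '{"a": 1} {"b": 2}' | jq -s '.'

That prints the array [{"a": 1}, {"b": 2}] (pretty-printed across several lines), which is why the group_by in the next step has a single array to work on.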
group_by(.id)
This groups all the lines with matching "id" values together as arrays. The result is one big array made up of arrays:
[
[
{
"id": "druid:another_id","title_ssi": "another title",
"accessioned_latest_dttsi": "2017-04-01T02:44:07Z",
"status_ssi": "v3 Accessioned"
},
{
"id": "druid:another_id"
}
],
[
{
"id": "druid:fg419sg6615",
"title_ssi": "Palo Alto Railroad",
"accessioned_latest_dttsi": "2017-04-01T02:44:07Z",
"status_ssi": "v3 Accessioned"
},
{
"id": "druid:fg419sg6615"
}
],
[
{
"id": "druid:unmatched_id",
"title_ssi": "not in the identifier file",
"accessioned_latest_dttsi": "2020-02-01T02:44:07Z",
"status_ssi": "v3 Accessioned"
}
]
]
The grouping on "id" means that each sub-array contains
- at least one JSON object from the metadata file,
- and if there's a match in the identifier file, there will be a second JSON object that came from the identifier file
In other words, each sub-array will have a length of 2 if there's a match and 1 if there isn't. Remember that in this data there is either a match or there isn't - with different data you'd need to look for other patterns.
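One quick way to sanity-check that assumption (a side query, separate from the main command) is to look at just the group sizes:

cat metadata.json identifiers.json | jq -s 'group_by(.id) | map(length)'

On the sample lines above, that reports sizes of 2, 2, and 1: two matched ids and one unmatched one.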
The next step is to get jq to do this count on the fly. The trick is getting the groupings out of the one big outer array and into a variable:
.[] as $groups
The .[] operation looks inside the big array, making each smaller array available to jq. You could think of it as removing the outer brackets, though that's probably not technically correct.
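As a tiny example, unrelated to the real data:

echo '[[1, 2], [3]]' | jq '.[]'

prints the two inner arrays, [1, 2] and [3], as two separate results.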
Next, the as $groups is jq's way of storing results in a variable. So what this does is take the sub-arrays and store them in a variable that can be accessed as $groups. Now it's time to identify the matches.
if ($groups | length) == 2 then $groups[0] + $groups[1] else empty end
This line is a conditional that separates groups with matches from groups without matches. jq has a function to count length (appropriately named length), so the first part of the statement checks each group to see if the length is 2, signifying a match. If that's true, then it's time to output the metadata.
But there's one more catch here: the value of each grouping is an array of two objects, the one from the identifier file and the one from the metadata file. For example:
[
{
"id": "druid:fg419sg6615"
},
{
"id": "druid:fg419sg6615",
"title_ssi": "Palo Alto Railroad",
"accessioned_latest_dttsi": "2017-04-01T02:44:07Z",
"status_ssi": "v3 Accessioned"
}
]
So if we were to output the value at this point, we'd be outputting arrays, not the JSON that's the real end goal.
My strategy for getting around this was to use jq's ability to merge two JSON objects into one, which somewhat confusingly gets represented by addition with the + sign. This is what's happening with $groups[0] + $groups[1]. That operation takes the fields from both objects and merges them into one object. The duplicate "id" keys collapse into a single field rather than being repeated (jq keeps the value from the right-hand object, which is identical here anyway).
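A standalone illustration, again unrelated to the real data:

jq -n '{"id": "a"} + {"id": "a", "title": "t"}'

prints a single object, {"id": "a", "title": "t"}, with just one "id" field.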
So in the end, the result of this command is that the metadata for matches is sent to the output.
What if there isn't a match (i.e. the array length is not 2)? The else empty means that nothing is output at all. Unlike some other languages and tools, jq seems to require you to have an else statement, even if it's just else empty.
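You can see the same pattern on a trivial stream of numbers, where only the value that satisfies the condition makes it to the output:

echo '1 2 3' | jq 'if . == 2 then . else empty end'

prints only 2.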
> results.json
Finally, the last part of this long string of commands is to output the whole set of results into a new file using bash redirection.
Conclusion
I still can't believe that the longer, more complex command runs so quickly. That just shows the value of using a tool like jq that's optimized for processing JSON rather than treating it all like an unstructured block of text.
The only drawback is the memory usage. It happens that I have 8 or 16 GB of RAM available, depending on the computer I use, but it would be nice to find a way to run this that isn't so memory-intensive. As a test, I re-ran this on a small, low-end laptop with 4 GB of RAM and the system ran out of memory, froze up, and finally killed the command rather than crashing entirely.2
So what I have isn't perfect but it fulfills the high-level criteria of "good enough." Now I'm going to go looking for queries to run.
-
The actual data has many more fields, but I've truncated this in the interest of readability. What matters most here is that each line is a JSON object with a value for "id". ↩
-
I did find that if I split the big metadata file into two, ran the command on each part, then merged the two sets of results, I could still get to my goal, albeit a bit more slowly. So that's one possible way around the memory limit. ↩
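A rough sketch of that workaround (the line count is just a placeholder; the part names are split's default suffixes):

split -l 1100000 metadata.json metadata_part_
cat metadata_part_aa identifiers.json | jq -s 'group_by(.id) | .[] as $groups | if ($groups | length) == 2 then $groups[0] + $groups[1] else empty end' > results_part_1.json
cat metadata_part_ab identifiers.json | jq -s 'group_by(.id) | .[] as $groups | if ($groups | length) == 2 then $groups[0] + $groups[1] else empty end' > results_part_2.json
cat results_part_1.json results_part_2.json > results.json

The key is that identifiers.json goes in with each half, and since every "id" is unique in the metadata file, the two sets of results don't overlap.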