[supplied title]

copying files isn't always a straightforward process (or, some things I've learned working with digital archives)

Copying files is a task that seems like it should be simple, and often it is. Pick the right tool for your needs, set up a workflow, repeat. You often don't even need to know what you're copying: you can just duplicate the bits, verify that the copies match, and you're done.

Except ...

sometimes filenames are too long

Are you copying to a Windows system? Then you might have to watch out for long paths and long filenames. Lots of Windows systems are configured with filename or file path character limits. Usually this issue will creep up on you. Maybe a filename on a USB drive looks long but doesn't seem to be causing any problems. But then you try to copy it from the drive to a new location on a Windows machine, and the path to that location is itself a few levels deep. Something like D:\archives\collections\collection_number\accession_number\identifier. And suddenly you find that the long filename is too long to be copied because the combination of the destination path (on drive D:) plus the filename put it over the limit.

I've been lucky to have never learned a solution to this problem: I've used Windows in my archives work but never as the destination for long-named files. So while I've seen the issue, I've always had the option to send files to Mac or Linux systems that don't have the same limits. I believe that recent versions of Windows 10 now offer the option of removing the previous limits. But you may not have access to that yet in your workplace, depending on how often your systems get updates.
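If you do have to send files to a Windows destination, one way to avoid being surprised is to measure the combined path lengths before copying anything. What follows is a minimal sketch in Python, not a tested workflow. It assumes the classic Windows limit of 260 characters for a full path (your systems may be configured differently), and the function name is made up for illustration.

```python
from pathlib import PureWindowsPath

# Assumption: the classic Windows MAX_PATH of 260 characters, which counts
# the drive letter, the separators, and a terminating null character.
MAX_PATH = 260


def too_long_for_destination(dest_root, relative_paths, limit=MAX_PATH):
    """Return (relative_path, full_length) pairs that would exceed the limit."""
    problems = []
    for rel in relative_paths:
        # PureWindowsPath builds Windows-style paths on any operating system.
        full = str(PureWindowsPath(dest_root, rel))
        # +1 leaves room for the terminating null character.
        if len(full) + 1 > limit:
            problems.append((rel, len(full)))
    return problems
```

Calling it with the destination folder and a list of the source's relative paths gives you the names to shorten, or to send somewhere else, before the copy starts.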

sometimes filenames use characters that other systems won't accept

This is another problem I've seen most often when copying to a Windows system. Windows has relatively strict rules for allowable characters in filenames. Unix-based systems, especially Linux ones, are a lot more accepting. So you might find that you can't copy a file from a non-Windows system to a Windows system without either having to rename it or letting the name get mangled during the copying process.
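To make "relatively strict" a bit more concrete, here is a rough Python sketch that flags names likely to be rejected by Windows. The rules it encodes (the characters < > : " / \ | ? *, ASCII control characters, reserved device names such as CON and NUL, and names ending in a space or period) come from Microsoft's published naming conventions as I understand them. The function name is hypothetical, and the check is a starting point rather than a complete validator.

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

# Device names that Windows reserves, with or without a file extension.
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def windows_name_problems(name):
    """Return reasons a single file name (not a full path) may fail on Windows."""
    problems = []
    if INVALID_CHARS.search(name):
        problems.append("invalid character")
    if name.split(".")[0].upper() in RESERVED:
        problems.append("reserved name")
    if name.endswith((" ", ".")):
        problems.append("trailing space or period")
    return problems
```

An empty list means the sketch found nothing to complain about, not that the name is guaranteed to work.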

I've seen a few different characters, including question marks, inserted into filenames when one system couldn't deal with the characters it was being asked to interpret. This doesn't always prevent copying the bits so it's a good idea to have a check in place to make sure that filenames, not just files, have gotten copied. You don't want someone to come back years later to find a folder full of names like secret??_�_.txt. Sometimes you might find you have to rename the files yourself. It's a good idea to make a log that includes the original filenames if you have to do that.
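One possible shape for that filename check, sketched in Python: list the relative names on each side and compare the two sets. The function names are made up for illustration, and the comparison is deliberately strict, so names that only look identical (see the encoding section below) will be reported as different.

```python
from pathlib import Path


def relative_names(root):
    """Collect every file and directory name under root, relative to root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*")}


def compare_names(source_root, dest_root):
    """Return (missing_from_dest, unexpected_in_dest) as sorted lists."""
    src = relative_names(source_root)
    dst = relative_names(dest_root)
    return sorted(src - dst), sorted(dst - src)
```

A name that shows up in the first list while a near-twin appears in the second is a good hint that something was renamed or mangled along the way.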

sometimes filesystems are too insensitive to case

I've run into this one when copying (or trying to copy) files to Windows as well, but I've seen it in multiple operating systems. This is more an issue of filesystem support than a problem with the filenames themselves. Some filesystems are case sensitive: they treat the names New_File and new_file as different files. But many are case preserving without being case sensitive: they display New_File exactly how you typed it (with or without capitals) while in the background treating upper and lower case characters as if they're the same, making it impossible to create both a New_File and a new_file in the same folder. Not sure what your system does? Try to create new files with those names and see what happens. You'll either end up with two files or get a message back from the machine telling you you're asking it to do something it can't.

So what happens if someone gives you a drive containing both New_File and new_file (from a case sensitive system) and you try to copy those two files to a system that sees both of those filenames as the same?

Good question! I don't have a great answer. It seems to depend on which systems are involved and which tools you're using. I've seen:

  • One of the files doesn't appear in the destination system. Maybe it was just not copied (because the destination system sees only one file with that name) or maybe both files were copied but the second one to get copied overwrote the first, leaving only one file in the destination. You aren't shown an error and have to work out that there's a missing file yourself.
  • The copying tool reports an error. It tells you something like "can't copy new_file because the file already exists" (referring to New_File, which was copied successfully moments earlier). You stare at the error dialog box wondering what happened to your day.
  • The copying tool automatically renames one of the files, adding something like (conflicted copy) into the file name, so you end up with New_File and new_file (conflicted copy).
  • The copying tool is itself case insensitive in how it reads and displays filenames so it doesn't even indicate to you that there are two files with nearly the same name for you to copy in the first place. Instead, it looks like there's only one file. You might not even know you're missing something until you run a check (with a different tool) to make sure you copied the expected number of files and discover that one is missing.
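Since several of these outcomes are silent, it can help to look for colliding names on the source before copying. Below is a minimal Python sketch of that idea. It assumes that Python's casefold() is a close enough stand-in for whatever folding rules the destination filesystem really uses, which won't hold for every character, and the function name is hypothetical.

```python
from collections import defaultdict


def case_collisions(names):
    """Group names that differ only by upper or lower case."""
    groups = defaultdict(list)
    for name in names:
        # Assumption: casefold() approximates the destination's case rules.
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Pass it the names from a single folder, or relative paths for a whole tree, and any group it returns is a set of files that a case-insensitive destination would treat as one.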

sometimes there's more than one way to encode a character

I recommend searching for "Unicode normalization" and weeping.

I ran into this when copying files that had umlauts (and other characters from outside the ASCII range) in their names from a Mac to a Linux system. I was using rsync and I noticed that each time I reran the command, it would delete the files with the not-entirely-ASCII filenames from the destination location and then recopy them. The problem, I learned, was that Macs and Linux systems make different encoding choices and these encodings aren't always translated across system boundaries. To the human eye, the umlaut on the Mac and the umlaut on Linux might look the same, but at the system level they were treated as different characters, giving the files different names.

I was running rsync with the --delete option, which should result in the destination directory matching the source directory exactly. But because the systems used different encodings, rsync kept deleting files on the destination that it saw as extraneous (because they did not appear to match any names on the source) and then re-copying those same files to the destination (because it did not recognize that those files had already been present before the command deleted them). [1]

What made this especially puzzling to me was that the names looked correct on both sides of the transfer and the rsync command always reported success. Up until that moment, I'd thought there was only one valid way to encode each character using Unicode. I had no idea there could be multiple valid ways to arrive at what looked like the same character to a human eye.
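The "multiple valid ways" part is easy to see in a few lines of Python. This sketch compares a precomposed ü with a plain u followed by a combining diaeresis, which is likely the kind of difference involved here: older Mac filesystems tend to store names in a decomposed form, while names created on Linux are usually composed. The helper name is made up for illustration.

```python
import unicodedata

composed = "\u00fc"     # "ü" as one precomposed code point (NFC form)
decomposed = "u\u0308"  # "u" plus a combining diaeresis (NFD form)


def same_after_normalizing(a, b, form="NFC"):
    """Compare two names after converting both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

The two strings render identically, but composed == decomposed is False because one is a single code point and the other is two. Comparing names only after normalizing both sides is one way to stop a checking script from treating them as different files.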

You might also run into a more general problem of non-recognized characters, where one system doesn't recognize some or all characters in a filename that was produced on a different system. On the broader issues surrounding filename encoding, I highly recommend reading Elvia Arroyo-Ramirez, Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files (2015) and Ashley Blewer, Artist_Exhibition-copy (FINAL)(2).mov: Preserving diacritics in filenames as significant properties in media conservation (2019).

sometimes files have attributes that you don't notice at first that turn out to matter

Permissions are the instance of this issue that I've seen the most. Different filesystems can have different permissions schemes. You might want to copy files using the -a (for archive) option of rsync because that's the option that's supposed to copy "everything" without you needing to think about the details. But then it turns out that the permissions to be copied don't actually exist on the destination system because the source filesystem (maybe a USB drive, formatted as NTFS or exFAT) and the destination (maybe a Linux server) don't use the same permissions. So either your copying attempt fails or you get a bunch of error messages back saying that everything was copied except for the permissions.

I've generally gotten around this by using the rsync options of -r (for recursive) to copy directories and --times to copy the file's timestamps, the main attribute I've always tried to preserve. I've rarely been in a position where I needed to preserve a file or directory's original permissions. I'd probably end up making a disk image if I needed to do that. Or I'd make a log file and record the original permissions there.
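For the log file option, something like the following Python sketch could record permissions, owners, and modification times before copying. It is only an illustration: the column names and function names are my own choices, and the numeric owner and group IDs it records only mean something on the system where the log was made.

```python
import csv
import os
import stat
from datetime import datetime, timezone

FIELDS = ["path", "mode", "uid", "gid", "mtime_utc"]


def describe(path):
    """Collect the attributes worth logging for one file or directory."""
    st = os.lstat(path)  # lstat describes a symbolic link itself, not its target
    return {
        "path": str(path),
        "mode": stat.filemode(st.st_mode),  # text form, e.g. "-rw-r-----"
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mtime_utc": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
    }


def write_attribute_log(root, log_path):
    """Walk root and write one CSV row per file and directory."""
    with open(log_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                writer.writerow(describe(os.path.join(dirpath, name)))
```

Keep the log somewhere outside the folder being described, otherwise the log will end up listing itself.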

Beyond permissions, different filesystems may have other ("extended") attributes worth considering when making copies. I haven't spent a lot of time with these, but it can be worth getting familiar with the supported attributes of common filesystems. I've heard stories about important information being stored in the tags that you can associate with files on macOS (OS X) systems. Those tags might be a feature that's unique to macOS, and I think they're in the attributes. But I could be wrong.

Macs also create things called resource forks, which often appear on non-Mac systems as files that start with ._, but I'm not going to go into detail about those in this post. Partly because I never did have to research them myself. Resource forks are a bigger issue for older Mac filesystems as they often contained essential information for reading a file, and failing to copy a resource fork could result in the corresponding file being unreadable. For newer Mac systems, it often doesn't matter if you copy the resource fork. It might have some information (like last downloaded date), but it's generally not information critical to being able to open the file later. If you're on a newer Mac filesystem and you look at resource forks in a hex editor, you might see a message like This resource fork intentionally left blank. I'm not sure what that really means but have taken it as a sign that I can stop thinking about that file.

sometimes you're getting files from a "cloud" service and the download method you choose affects the filenames

All the consumer-oriented cloud services I've used (products with "box" or "drive" in the names) have provided an interface that looks like a traditional folder-file filesystem. But are the names you see in those interfaces the actual names of the files? Are they even storing your files as "files"? Who knows? Google Drive will let you name two "files" in the same "folder" with exactly the same "name" so it's clearly not enforcing the rules that you would expect out of an ordinary file system.

What I have found when I was trying to come up with a standard workflow for downloading files from cloud providers was that the filenames you ended up with could vary depending on the method you used to download.

It's been a couple of years since I looked closely, but the big difference I remember had to do with spaces and a few other "special" characters in filenames. Depending on whether you downloaded the file directly (i.e. by itself, via a browser) or downloaded the whole folder that contained it (usually as a .zip), you'd get either:

  • the filename, but with spaces and "special" characters replaced by underscores
  • the filename exactly as it appeared in the cloud interface [2]

And if you downloaded an entire "folder" where two "files" have the same "name" you'd either see an error or see that one of the files got automatically renamed for download, as a "traditional" filesystem will require names to be unique within a folder.

Bonus annoyance: when using a browser's "save page as" option to save a webpage as a file, you might find that the browser will try to name the resulting file with the webpage "title" (i.e. the value in the webpage's HTML <title> tag, which may or may not be the human-readable title of the document). If that title uses a character that isn't valid on Windows (like a |) but is valid on the system you used to save the page, you might end up with that character in the filename. Then, later on, you might try to copy that file to a Windows system and then run into one of the filename incompatibility issues described above. (Not that this has ever happened to me!)

sometimes files turn out not to be files

I was tasked with copying half a million files from a hard drive once. They were all source code and it turned out that they had been kept in a system that used symbolic links to relate different files and directories to each other. This made "copying" the "files" much more complicated than I had expected. I ended up making a disk image because I couldn't get the "ordinary" copying process to work well enough to be sure I hadn't lost data. It is often possible to copy symbolic links, but there was something weird about these files that prevented me from getting an error-free copy. I think the problem may not have been the symbolic links themselves but some incompatibility whose origins were in the 1980s (these were files from an old system), though I never worked that out.
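If you suspect a drive is full of symbolic links, it can be worth surveying them before choosing a copying strategy. Here is a minimal Python sketch, with a made-up function name, that lists each link, where it points, and whether the target actually exists. It would not have solved the problem described here, but it shows the kind of information that's useful to have up front.

```python
import os


def find_symlinks(root):
    """Return sorted (relative_path, target, target_exists) for each symlink."""
    links = []
    # followlinks=False: report links to directories without descending into them.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links.append((
                    os.path.relpath(path, root),
                    os.readlink(path),     # where the link points, as stored
                    os.path.exists(path),  # False when the target is missing
                ))
    return sorted(links)
```

Links whose targets don't exist, or that point outside the folder being copied, are the ones most likely to cause trouble in an "ordinary" copy.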

The disk image I made just pushed the problem of copying the files down the road for when someone had the chance to extract them. But we needed to return the hard drive to the donor and I couldn't keep sinking time into figuring out what was going on. I remember explaining the situation to my fellow archivists when I left that job. I hope I was suitably apologetic for leaving them with that mess, but at least we had all the bytes, right?

  1. Apparently, newer versions of rsync can handle this problem but the Mac was running an older version that couldn't translate the encodings. Did I ever want to know this much about rsync and its versions? I did not. I just wanted to copy some files. 

  2. Except possibly on Windows, where an unsupported character might get converted to an underscore no matter what. Again, I can't quite remember the details, just that it's worth looking at multiple download methods and watching what happens.