Every so often you hear about some technological breakthrough that has the potential to solve all of our digital storage problems forever by effectively reducing costs to zero or near-zero. Things like DNA storage, atomic-scale storage, and even next-generation tape storage have all been trotted out as possible solutions to the problem of where to stash our ever-increasing mountains of digital data.
Of course, saying that an experimental technology might one day substantially reduce storage costs is not quite the same thing as saying we've got the problem licked. But even if one or more of these technologies eventually bears the kinds of fruit their promoters say they will, that will not by itself solve all of the problems and costs associated with digital preservation. Not by a long shot.
And to explain what I mean by that, let me tell you about a project I've been working on to store 10TB of digital video recently added to our collection. (Or, more accurately, digitized from physical media already in our collection, so that we can preserve the content beyond the point at which the original films and videotapes deteriorate so badly they can no longer be used.)
What makes this project particularly relevant here is that I am, in a way, working with free storage, in the form of UVic's enterprise tape archiving system. Now of course it's not really free (a back-of-the-envelope calculation suggests something like $400/TB/year), but the University doesn't charge back for it, so it's free to the Library. But the Library is not getting a free ride here, at least not once you factor in the cost of staff time. Preserving this particular 10TB turns out to have some rather significant upfront costs, and some ongoing ones, that are wholly unrelated to the cost of storage media.
Here is a brief description of our workflow on this project:
1. The digitized files come to us from a vendor on portable hard drives, each containing anywhere from one to three TB. Obviously we're not going to process them on the original drives, so we needed to build a custom workstation with lots of disk storage to carry out the steps that follow:
2. The files come in the form of relatively well structured directories containing the same video digitized in multiple formats, along with XML metadata and checksums. To confirm the files have not been corrupted in transit, we first need to check them against the vendor-supplied checksums. Since there are hundreds of files, the only way to do this efficiently was to write a Python script that iterates through all of the directories and recalculates checksums for all of the digitized content (a sketch of that kind of check appears after this list). Doing that required becoming familiar with the vendor's file naming conventions and directory structure.
3. Once that has been done, we don't want to simply store the files the way the vendor delivered them to us. Over time we're likely to accumulate many more files in a wide range of proprietary formats, so ideally we need to standardize them a bit, so that people 50 years from now will have a hope of figuring out what all this stuff is. To do that, we run another job that encapsulates the directories in a preservation packaging format known as BagIt, using another custom script we developed for the purpose (see the bagging sketch below). Given the amount of content, running this job can take hours.
4. When the bags have been created, we fire up a software client that transfers them from the workstation in the Library to the tape archive system (TSM) over in our Enterprise Data Centre. This job generally runs overnight; sometimes it doesn't complete and has to be rerun the next day.
5. Once all the content has been transferred, we can't just assume the transfer worked. We have to restore the content from the tape archive to ensure that it is in fact retrievable and that no errors were introduced in transit. So we pull it all back again (another overnight job) and then check the integrity of the bags using yet another custom script developed for the purpose (see the validation sketch further down). Again, lots of checksumming is involved, so hours will go by before it completes.
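To make step 2 a little more concrete, here is a minimal sketch of that kind of fixity check. Our actual script is built around the vendor's naming conventions, which aren't shown here; this version simply assumes, purely for illustration, that every delivered file arrives with a sidecar .md5 file containing its expected digest.

```python
# A minimal sketch of the step 2 fixity check (not our production script).
# Illustrative assumption: each delivered file sits next to a sidecar ".md5"
# file whose first token is the expected hex digest (md5sum-style output).
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1024 * 1024):
    """Recalculate the MD5 of a file, reading in chunks to cope with large video files."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_delivery(root):
    """Walk the delivery directory and compare each file against its sidecar checksum."""
    failures = []
    for sidecar in Path(root).rglob("*.md5"):
        target = sidecar.with_suffix("")            # e.g. clip001.mov.md5 -> clip001.mov
        expected = sidecar.read_text().split()[0]   # digest is the first token in the sidecar
        if not target.exists() or md5_of(target) != expected.lower():
            failures.append(target)
    return failures

if __name__ == "__main__":
    bad = verify_delivery("/mnt/delivery")          # hypothetical mount point for a vendor drive
    print("All checksums match" if not bad else f"{len(bad)} files failed verification")
```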
Rinse and repeat, since we can't work on the whole 10TB all at once. In fact it's becoming apparent that TSM probably works better if you don't throw more than a single TB in its direction at a time.
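For step 3, the bagging job can be sketched with the Library of Congress's bagit-python library. This is not our actual script, and the staging path and bag-info values below are purely illustrative.

```python
# A sketch of step 3 using the bagit-python library (pip install bagit),
# not our in-house script. make_bag() converts a directory into a BagIt bag
# in place: the payload is moved under data/ and checksum manifests are written.
import bagit

# Hypothetical staging directory holding one vendor delivery.
staging_dir = "/mnt/staging/video_batch_01"

bag = bagit.make_bag(
    staging_dir,
    bag_info={"Source-Organization": "University of Victoria Libraries"},  # illustrative metadata
    checksums=["md5", "sha256"],  # which manifests to generate
)

print(f"Bagged {len(list(bag.payload_files()))} payload files at {bag.path}")
```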
Only when all of these steps have been followed are we able to say with some confidence that the materials have been effectively stored. Of course, over time the tapes will have to be refreshed regularly, but that's part of the "free" service, so we're not counting those costs (at least, not here). The Library will also need to create and maintain records so we'll know where to find these files if and when they're needed. And going forward, we should audit the content regularly to ensure bit rot hasn't set in.
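The post-restore check in step 5, and the periodic fixity audit just mentioned, both boil down to revalidating each bag against its own manifests. Here is a hedged sketch, again using bagit-python rather than our actual script, with an illustrative restore path:

```python
# A sketch of the post-restore check / periodic fixity audit, using bagit-python
# rather than our actual script. It revalidates every bag found under an
# illustrative restore directory, recalculating all checksums against the manifests.
import bagit
from pathlib import Path

def audit_bags(restore_root):
    """Validate every bag directory under restore_root; return those that fail."""
    failed = []
    for bag_dir in sorted(p for p in Path(restore_root).iterdir() if p.is_dir()):
        try:
            bag = bagit.Bag(str(bag_dir))
            bag.validate(processes=4)  # full validation recalculates every checksum
        except bagit.BagError as exc:  # covers validation failures and non-bag directories
            failed.append((bag_dir, str(exc)))
    return failed

if __name__ == "__main__":
    problems = audit_bags("/mnt/restore")  # hypothetical restore location
    for bag_dir, reason in problems:
        print(f"FAILED: {bag_dir}: {reason}")
    print("no errors" if not problems else f"{len(problems)} bags failed the audit")
```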
To be honest, this isn't a perfect preservation storage solution. Ideally we'd want multiple copies of these files strewn across a wide geographic area; what we wind up with is two copies, one onsite and one stored offsite, but still on the Island (so in the same earthquake zone). But this is what we can afford, at least for now.
So you'll have to forgive me if I tend to discount the folks who say DNA storage (or whatever) will solve the long-term storage problem. At best, it will help to solve one aspect of the problem, by bringing the costs of the storage media down. But as I hope my little example demonstrates, getting our content onto that media in some kind of structured way will still require lots of elbow grease, and that doesn't come cheap.