If you ask Anthony Weiner, digital records—especially those on the Internet—can seem impossibly hard to get rid of. When a picture or document is reduced to a series of 1s and 0s, it becomes transmissible, reproducible, downloadable, and storable. You can’t burn digital books, and ideas like cloud computing make it possible to back up data in multiple places, ensuring even an accidental fire won’t incinerate your thesis or wedding photos.
The digitization of data gives it protection from physical catastrophes, but, as it stands now, it’s far from eternal. The problem isn’t so much that the data itself might be lost, but that there will be no way to read it.
Try opening a WordPerfect document in Windows Vista, 7, or 8, for example, and you’ll quickly find that Microsoft has stopped supporting the software. Likewise, Apple hasn’t supported ClarisWorks since 2004, ditching its old office suite after 13 years, and the PlayStation 4, which came out in late 2013*, can’t read the original Crash Bandicoot CD-ROM from 1996 (which is really depressing in all honesty because it was a great game). And heaven forbid you need to recover data from a floppy disk. “Preserving bits is not terribly hard. The question is, ‘What do the bits mean?’” says Vint Cerf, a father of the Internet and Google’s “chief Internet evangelist.”
It’s only been around 50 years since the original floppy disk was invented, and many modern laptops have already forgone its successor, the CD drive. Suddenly the longevity of a paper document begins to look promising. “If we’re thinking 1,000 years, 3,000 years ahead in the future, we have to ask ourselves, ‘How do we preserve all the bits we need in order to correctly interpret the digital objects we create?’” says Cerf. “If we don’t find a solution to that problem, our 21st century will be an information black hole.”
As delicious (and sad) as the irony might be that the denizens of the “information age” would leave behind no usable information, there is at least one person determined to make sure that doesn’t happen. At Carnegie Mellon, Mahadev Satyanarayanan, or as nearly everyone calls him, Satya, has begun to develop a platform designed to catalog and record both the digital objects we create and, crucially, descriptions of the software and hardware that make them interpretable. Known as The Olive Archive, Satya’s platform is designed to address one of the trickiest types of data to preserve: executable files.
Archiving static data like a picture or a text document is one thing, but much of today’s important digital information is dynamic. Video games, interactive databases, and pieces of application software are more difficult to preserve because they require not only that a computer be able to read the bits and bytes, but also that future hardware be capable of inputting commands and interpreting changes in the program. Crash Bandicoot spins when a player presses the “square” button on a PlayStation controller, but even if all the code from the game were preserved in an archive, a computer doesn’t have a “square” button. The same set of issues will plague computers of the future as GPUs, CPUs, motherboards, and other hardware continue to evolve.
One solution would be to preserve a version of every piece of hardware; while this might be overkill, it would let us recreate the ecosystem of any piece of data we might wish to recover. The Olive Archive aims to accomplish the same sort of preservation, but its approach is far more elegant and won’t require warehouses full of ancient hardware, which would likely break down over decades anyway. Satya wants to create “virtual machines”: maps or descriptions of the hardware that will allow old programs to be recreated using software. Emulating past hardware with current software isn’t a new idea (in fact anyone with an Android smartphone can download a PlayStation emulator from the Internet and run Crash Bandicoot today), but the idea of building a repository capable of opening and executing any digital object is a massive undertaking.
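To make the idea of recreating hardware in software concrete, here is a toy sketch in Python. It interprets programs for an invented machine with a single register; the instruction names (`LOAD`, `ADD`, `PRINT`) and the `run` function are made up for illustration and have nothing to do with Olive or any real console. Real emulators model actual processors, memory, and input devices in far more detail.

```python
def run(program):
    """Interpret a program for an invented one-register machine.

    `program` is a list of (operation, argument) pairs. The machine,
    its instructions, and this function are hypothetical examples.
    """
    acc = 0        # the machine's only register
    output = []    # stands in for the machine's display
    pc = 0         # program counter: index of the next instruction

    while pc < len(program):
        op, arg = program[pc]
        if op == "LOAD":
            acc = arg            # put a value in the register
        elif op == "ADD":
            acc += arg           # add a value to the register
        elif op == "PRINT":
            output.append(acc)   # "display" the register's contents
        else:
            raise ValueError(f"unknown instruction: {op!r}")
        pc += 1

    return output


if __name__ == "__main__":
    old_program = [("LOAD", 2), ("ADD", 3), ("PRINT", None)]
    print(run(old_program))
```

The "hardware" here exists only as a description of how each instruction behaves, which is the sense in which a virtual machine is a map of a computer rather than the computer itself.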
So what would be required to ensure that an article like this one survives into the next millennium? Satya first suggests saving it as a PDF to make it easier to store. A PDF requires Adobe Reader to open, so citizens of the future will need access to the same version of the program it was saved in (10.1.12 in my case). Adobe Reader only functions within the context of an operating system, though, so a version of Windows (or MacOS or Linux) compatible with our PDF reader will need to be included in our virtual machine. Finally, the operating system runs on hardware of some sort. As I type these words, they’re appearing on the screen of a Lenovo ThinkPad Y470. The virtual machine could emulate the hardware found in my specific computer, but any PC capable of running a version of Windows that is compatible with Adobe Reader would do the job. Every bit of data in a digital file lives in an ecosystem composed of software, operating system, and hardware. Satya’s goal is to create an archive of these components that can be assembled as needed, uploaded to a server, and accessed by users over the Internet. “File formats are not created in isolation; they go hand in hand with the software that uses those formats,” he says. Olive launches within your Internet browser, much as a YouTube video might; the difference is that the software run through Olive emulates entire computer environments, so you can click, type, and make changes as if you were actually using the old hardware and software.
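The chain described above (document, reader, operating system, hardware) can be pictured as a small dependency lookup. What follows is only an illustrative sketch in Python, not Olive’s actual design: the catalog layout, the `requires` field, and the `resolve_stack` function are hypothetical, and “Windows 7” and “generic x86 PC” are stand-ins chosen for the example.

```python
# Hypothetical catalog: each archived component names the one it runs on.
# The entries and the "requires" field are assumptions for illustration,
# not Olive's real data model.
CATALOG = {
    "article.pdf": {"requires": "Adobe Reader 10.1.12"},
    "Adobe Reader 10.1.12": {"requires": "Windows 7"},
    "Windows 7": {"requires": "generic x86 PC"},
    "generic x86 PC": {"requires": None},  # bottom layer: emulated hardware
}


def resolve_stack(item, catalog):
    """Return everything needed to open `item`, from the file down to hardware."""
    stack = []
    seen = set()
    while item is not None:
        if item in seen:
            raise ValueError(f"circular dependency at {item!r}")
        if item not in catalog:
            raise KeyError(f"{item!r} is not in the archive")
        seen.add(item)
        stack.append(item)
        item = catalog[item]["requires"]  # step down one layer
    return stack


if __name__ == "__main__":
    for layer in resolve_stack("article.pdf", CATALOG):
        print(layer)
```

If any layer is missing from the catalog, the lookup fails, which mirrors the article’s point: the bits of the PDF alone are not enough without the rest of the stack.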
Since beginning work in earnest in 2013, the team has already archived things like Windows 3.11, decades-old games such as Oregon Trail and DOOM, and even TurboTax 1997, in case you happen to be running 18 years late on your return for that year. According to Satya, Olive’s potential to archive digital objects is nearly limitless; the system should be able to keep up with huge shifts in computing, like the development of quantum computers or even the rejection of binary. To run some antiquated computer program, a futuristic machine will simply emulate the machine that was itself emulating the program’s original hardware when the program was added to the archive.
Currently, Intel’s x86-compatible hardware rules the world. Nearly every personal computer on the planet uses some variant of the x86 architecture, a set of instructions that governs nearly all fundamental hardware behavior, ranging from running program code to allocating memory. Out of necessity, Intel has made its architecture fully backwards compatible. Satya believes that because of x86’s sheer ubiquity, it will continue to be important into the near future. In a thousand years, however, anything seems possible. But Olive should be able to keep on archiving by adding new layers of emulation. “We believe that whatever replaces Intel x86, there is so much legacy software written for it, that writing an emulator for x86 is something that will happen. If nobody else, the maintainers of Olive can do it,” he says. “As long as you have emulations, you can layer them. In 5,000 AD, if we’re trying to run something that was written in 2015, I might [use] five layers of emulations.”
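Satya’s layering idea can be sketched as a search through a list of available emulators. This Python example is an illustration under stated assumptions, not a description of how Olive works: the `EMULATORS` list, the platform names “platform-A” and “platform-B,” and the `emulation_chain` function are all invented for the example.

```python
from collections import deque

# Hypothetical emulators, each written as (guest, host): the guest is the
# platform being emulated, the host is the platform the emulator runs on.
# "platform-A" and "platform-B" are invented future machines.
EMULATORS = [
    ("PlayStation", "x86"),
    ("x86", "platform-A"),
    ("platform-A", "platform-B"),
]


def emulation_chain(target, current, emulators):
    """Return the shortest list of platforms from `target` up to `current`.

    Each consecutive pair in the result is one layer of emulation.
    Returns None if no chain of emulators connects the two platforms.
    """
    hosts = {}
    for guest, host in emulators:
        hosts.setdefault(guest, []).append(host)

    queue = deque([[target]])  # breadth-first search finds the fewest layers
    visited = {target}
    while queue:
        path = queue.popleft()
        if path[-1] == current:
            return path
        for host in hosts.get(path[-1], []):
            if host not in visited:
                visited.add(host)
                queue.append(path + [host])
    return None


if __name__ == "__main__":
    chain = emulation_chain("PlayStation", "platform-B", EMULATORS)
    print(" on ".join(chain))
    print("layers of emulation:", len(chain) - 1)
```

Each new generation of hardware only needs an emulator for its immediate predecessor; the older layers are reused as they are, which is why the archive can keep growing without rewriting everything.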
All that computing power…so far in the future…just to run through the jungle as an orange marsupial wearing jean shorts.
* Correction: The article originally said the PlayStation 4 came out in late 2015, which hasn’t happened yet.
David Shultz is a freelance journalist covering biology and science of all sorts. He tweets at @dshultz14.