Archiving the Web: The Future of Internet History

Imagine trying to find an article you remember from a years-old magazine without a solid starting point. Or trying to track down the best-quality version of a rare film without access to a proper database. Even on the internet, two decades into its evolution, attempts to catalog, index, and archive the web have been isolated, underfunded, abandoned, or narrow in scope. Even the largest resource, run by the Internet Archive, stores just 0.2% of the pages indexed by Google. That’s despite having been used to hold politicians accountable, win legal battles, and verify sources for important information.

The internet has niche archives scattered all over. There’s DocuWiki, Marxists.org, Google’s scans of newspapers dating back to the 1700s, and of course the rebellious archives of peer-hosted copyrighted material that lurk deeper under the surface. The most complete, however, is the Internet Archive.

The Internet Archive hosts millions of texts, images, videos, audio files, and software programs. It’s free, well organized, and fits the original utopian vision of the internet as a resource that’s bigger, more complete, and more accessible than anything that could exist in a physical location. It hosts the complete Library of Congress along with thousands of other collections donated by institutions worldwide. That’s in addition to saving a day-by-day copy of some of the internet’s most important resources.


The Burning of the Library of Alexandria, Hermann Goll (1876)

Unlike the library at Alexandria, where fire destroyed the only copies of countless ancient texts thousands of years ago, these resources are unlikely to disappear outright. The internet around them, however, is rotting away at an alarming pace. For an organism that survives on its interconnectivity through hyperlinks, the death of one page can impact thousands of others.

Take The Million Dollar Homepage for example. The 2005 project sold 1,000,000 pixels of space at the price of $1 per pixel. The page attracted companies with big budgets, some spending tens of thousands on an experimental advertising method. For some, that was money down the drain: 20-30% of the links on The Million Dollar Homepage are now dead, and many more redirect to totally different places.


The Million Dollar Homepage from 2005, overlaid with the status of its links now
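The rot is easy to measure for yourself. Here’s a minimal sketch of the kind of audit that produces those dead-link figures, assuming the third-party requests library is installed; the URL list is hypothetical:

```python
# Minimal link-rot audit: request each URL and bucket the result.
# Requires the third-party requests library (pip install requests).
import requests

urls = [
    "http://www.milliondollarhomepage.com/",   # hypothetical sample list
    "http://example.com/long-gone-advertiser",
]

for url in urls:
    try:
        # allow_redirects=True follows pages that now point elsewhere
        resp = requests.get(url, timeout=10, allow_redirects=True)
        if resp.history:
            status = f"redirects to {resp.url}"
        elif resp.ok:
            status = "alive"
        else:
            status = f"dead (HTTP {resp.status_code})"
    except requests.RequestException:
        status = "dead (no response)"
    print(f"{url}: {status}")
```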

The internet evolves. Huge chunks fade into dark non-cyberspace every few years, and the average lifespan of a web page is just 100 days. A 2012 study found that 30% of the content accessible in 2009 was dead just three years later. Strikingly, much of that dead content is news stories. With access to only part of the truth, future historians will view the events of our lifetimes through a distorted lens.

I’ve written before about how content is tightly segregated and delivered to readers by biased algorithms. That’s happening right now: the filter bubble is doing damage today, and has had cataclysmic effects on everything from politics to violent crime. But failure to properly archive the content of the past is as good as willfully submitting to a skewed version of the truth in the future. It’s understandable that our records from hundreds of years back are patchy and disputed, but I would never have thought that would continue to be a problem in the internet age.

Conceptually, the internet is decentralized: unless every service provider were knocked out at once, it would still exist and be accessible. Individual hosts, however (like those serving Wikipedia and the Internet Archive), are centralized. The data lives on a number of servers that could, in theory, go down. What’s more, corporations often have little financial interest in keeping our wealth of knowledge alive, leaving the fate of essential, well-intentioned non-profits in the hands of the public.

With the issue of availability comes the question of quantity. The Internet Archive automatically crawls the web’s most-trafficked pages. Its spider, developed by enthusiasts for a non-profit, has logged over 300 billion web pages since the Archive’s founding in 1996. That seems like a staggering amount until you check how many pages Google currently indexes: 130 trillion.

In other words, 99.8% of the current, indexed internet is not archived (300 billion of 130 trillion is roughly 0.2%), never mind the long-dead pages that might never be seen again by a single being: human, spider, or robot.
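You can probe that gap yourself: the Internet Archive exposes a public availability endpoint that reports whether a given URL has a snapshot in the Wayback Machine. A minimal sketch using only the standard library:

```python
# Query the Internet Archive's Wayback availability endpoint for a URL.
import json
import urllib.parse
import urllib.request

def latest_snapshot(url: str):
    """Return the closest archived snapshot for `url`, or None if unarchived."""
    query = urllib.parse.urlencode({"url": url})
    endpoint = f"https://archive.org/wayback/available?{query}"
    with urllib.request.urlopen(endpoint) as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest")

snap = latest_snapshot("example.com")
print(snap["url"] if snap else "not archived")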

Google has the best web crawler ever built; so much so that any estimate of the size of the internet is little more than a vague guess outside the data Google presents. Given the stark difference between the pages captured by the Archive and those indexed by Google, it’s clear Google isn’t willing to lend its technology for the good of the future, instead opting to spend millions on inventing new ways to advertise while mumbling ‘don’t be evil’ with questionable conviction. If internet giants like Google don’t support and democratize the living history of the web and the openness of information, it falls to the whims of the public and the efforts of non-profits.

Storage space is also an issue. The numbers are murky because server space at the Archive’s magnitude is provided by enterprises that negotiate each deal differently, but storing the amount of data the Archive holds can be estimated to cost over $18m. Multiply that by the gap between the Archive’s holdings and the estimated size of the web (more than 400 times as many pages) and you’re looking at roughly $9b, a cost that very few can front. Even if the Archive had the means to crawl as deep as Google, it would break the bank trying to store that information, especially considering its $10m annual budget.
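That arithmetic is back-of-envelope, and worth showing. Using the rough figures above (the product lands a little under the $9b quoted, depending on rounding):

```python
# Back-of-envelope storage-cost arithmetic using the article's rough figures.
archive_pages = 300e9          # pages logged by the Archive's spider
google_pages = 130e12          # pages Google reports indexing
archive_storage_cost = 18e6    # estimated cost of Archive-scale storage, USD

scale_factor = google_pages / archive_pages           # ~433x more pages
web_scale_cost = archive_storage_cost * scale_factor  # ~$7.8b

print(f"Archived share of the indexed web: {archive_pages / google_pages:.2%}")
print(f"Cost to store it all: ~${web_scale_cost / 1e9:.1f}b")
```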

Archain

As I write this, a new blockchain-based archiving initiative is quietly gaining momentum. Archain describes itself as a “permanent de-centralised and uncensorable archive for the internet”. For users, it’s a way to save a web page to a permanent, indestructible database. Unlike traditional servers, which can go down and are a single organization’s responsibility to maintain, Archain is built on a blockchain, which means that everybody using it is also helping it grow. Once data is written to a blockchain, it’s practically impossible to rewrite or delete, which is why cryptocurrencies like Bitcoin can exist without moderation or intervention from banks.
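The property doing the work there is simple to demonstrate: each block commits to the hash of the block before it, so rewriting history breaks every later link. A toy sketch of that general principle (not Archain’s actual design):

```python
# Toy hash chain: each block commits to the hash of its predecessor,
# so rewriting any past block invalidates every block after it.
# An illustration of the general principle, not Archain's implementation.
import hashlib

def block_hash(prev_hash: str, data: str) -> str:
    return hashlib.sha256((prev_hash + data).encode()).hexdigest()

# Build a small chain of archived "pages".
chain = []
prev = "0" * 64  # genesis value
for page in ["page A", "page B", "page C"]:
    prev = block_hash(prev, page)
    chain.append({"data": page, "hash": prev})

# Tamper with the earliest record, then re-verify the whole chain.
chain[0]["data"] = "page A (rewritten)"
prev = "0" * 64
for block in chain:
    recomputed = block_hash(prev, block["data"])
    status = "valid" if recomputed == block["hash"] else "INVALID"
    print(f"{block['data']}: {status}")
    prev = recomputed  # propagate the recomputed hash, as a verifier would
```

Running this marks every block from the tampered one onward as INVALID, which is exactly why edits to a blockchain-backed archive are detectable.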

As blockchain technology improves, we’re seeing more and more developers devising impenetrable, permanent ways to store and transfer information. The only thing lacking now is widespread support and adoption, but I suspect that will change as more people grow uncertain about the future of internet history.

Space landscape-obsessed dreck penman. Appears on TechCrunch, The Next Web, and on Secret Cave in a far less restrained capacity.