TMCnet - World's Largest Communications and Technology Community




Tracey S. Roth

Dot Com Commerce

Managing Editor, C@LL CENTER CRM Solutions

[July 12, 2000]

Saving The Web For Posterity

My father has a box in the attic somewhere. It contains old copies of the New York Times. There's one from November 22, 1963 (JFK's assassination), as well as the issue published July 20, 1969 (the first moon landing). He also kept the issue from my day of birth, as well as my brother's. (Neither of our birthdays was particularly newsworthy otherwise, but it's fun to read the humdrum details of life on those days: weather, local department store sales and music or book reviews.)

I used to read newspapers religiously. I'm something of a news fanatic, and being an hour out of touch, except when sleeping, makes me edgy. I no longer bother with newspapers as much, as I now get all my headlines online. CNN.com has a devoted fan in me, and I usually visit both Time.com and US News and World Report at least once a week. Newsweek is the only news publication that still holds a place of honor in print in my mailbox.

It occurs to me, as a result, that news no longer "hits the newspapers." Instead, it trickles to the newspapers. It hits the Web first. By the time a headline reaches the front page in 72-point type, most of us have already read about it. But how do you save a Web page for posterity? Sure, years later you might be able to dig up the text of an article that appeared on a newsworthy day, but it's not quite the same, is it?

At least one organization has recognized the need to preserve aspects of the Internet that might otherwise be lost to the future. Alexa Internet, a wholly owned subsidiary of Amazon.com, made history, so to speak, in late 1998 by donating a copy of the public World Wide Web to the Library of Congress. That first donation, comprising two terabytes of information, was presented in the form of an interactive sculpture entitled "World Wide Web 1997: 2 Terabytes in 63 Inches." (A terabyte is roughly a thousand billion bytes: exactly 10 to the 12th power by the decimal definition, or 2 to the 40th power, about 1.1 thousand billion, by the binary one.) The piece was made of four computer monitors and driven by 44 digital tapes. The intent was to provide a complete picture of the Web from 1997 onward. The Library of Congress was the natural recipient for this donation: It is the largest library in the world and is purported to contain nearly every item ever published in any format or language. In recent years, the Library of Congress began its National Digital Library Program, an online archive of many of its rare historical documents, of which the Web snapshot will become a part.
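The parenthetical above gives two figures for a terabyte, and they are close but not identical. A minimal Python sketch of the arithmetic follows; the constant and function names are illustrative assumptions, not anything taken from Alexa or the Library of Congress.

```python
# Two common meanings of "terabyte" (names here are illustrative only).
DECIMAL_TB = 10 ** 12   # a thousand billion bytes
BINARY_TB = 2 ** 40     # 1,099,511,627,776 bytes


def percent_larger(a: int, b: int) -> float:
    """Return how much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100


if __name__ == "__main__":
    print(f"decimal terabyte: {DECIMAL_TB:,} bytes")
    print(f"binary terabyte:  {BINARY_TB:,} bytes")
    print(f"binary is larger by {percent_larger(BINARY_TB, DECIMAL_TB):.2f} percent")
```

The binary figure comes out a little under 10 percent larger than the decimal one, so "a thousand billion bytes" is a fair approximation for either definition.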

At the time, Alexa Internet president and CEO Brewster Kahle was quoted as saying, "The fabric of the Web is a temporary one at best unless we commit to its long-term care and feeding. With our donation of the Web Archive to the Library of Congress, we're trying to build an infrastructure that transforms the Web into a resource to benefit future generations of scholars and historians."

Alexa continues to archive the Web on a regular basis, using a robot that "crawls" the Internet every six to eight weeks, analyzing, updating and storing information. The data the company collects is not only archived but also offered through its free Alexa service, a search engine that can locate Web pages that are no longer available (otherwise known as "404" to savvy Netizens). At its inception in 1996, the archive contained text only. As of this year, it has begun to collect images.

Another recipient of Alexa's donations is the Internet Archive, at www.archive.org. The Internet Archive is a non-profit founded in 1996 (by Brewster Kahle of Alexa, among others) to offer free source material to historians, researchers and the general public. At this time, the site states that "you will need Unix programming skills to gain access to and use an entire collection." That precludes most of us from gaining any benefit from the archive until it's in a more user-friendly format, or we manage to slog through Unix For Dummies, whichever comes first. Currently, the Internet Archive boasts somewhere in the neighborhood of one billion pages of Web content -- exceeding 14 terabytes -- and its rate of growth is approximately two terabytes a month as of March of this year.
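The figures quoted above (about one billion pages, more than 14 terabytes, roughly two terabytes of growth a month) invite some back-of-the-envelope arithmetic. The Python sketch below is only an illustration: it assumes decimal terabytes, treats 14 terabytes as a lower bound, and assumes the growth rate stays constant, none of which the Internet Archive has stated. All names are hypothetical.

```python
TB = 10 ** 12  # decimal terabyte; an assumption, the figures do not say which

ARCHIVE_BYTES = 14 * TB       # "exceeding 14 terabytes" (a lower bound)
PAGE_COUNT = 1_000_000_000    # "in the neighborhood of one billion pages"
GROWTH_PER_MONTH = 2 * TB     # "approximately two terabytes a month"


def average_page_bytes(total_bytes: int, pages: int) -> float:
    """Average stored size of one page, in bytes."""
    return total_bytes / pages


def projected_bytes(start_bytes: int, growth_per_month: int, months: int) -> int:
    """Archive size after the given number of months, assuming constant growth."""
    return start_bytes + growth_per_month * months


def months_to_reach(target_bytes: int, start_bytes: int, growth_per_month: int) -> int:
    """Whole months until the archive reaches the target, assuming constant growth."""
    remaining = max(0, target_bytes - start_bytes)
    return -(-remaining // growth_per_month)  # ceiling division


if __name__ == "__main__":
    print(average_page_bytes(ARCHIVE_BYTES, PAGE_COUNT))               # bytes per page
    print(projected_bytes(ARCHIVE_BYTES, GROWTH_PER_MONTH, 12) // TB)  # TB after a year
    print(months_to_reach(100 * TB, ARCHIVE_BYTES, GROWTH_PER_MONTH))  # months to 100 TB
```

Under those assumptions the average page works out to about 14,000 bytes, and a year of steady growth would take the archive from 14 terabytes to 38.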

Naysayers might argue that the archiving exercise holds no merit. After all, a great deal of what lurks on the Internet is admittedly complete drivel. Does anyone in the future really need to view, for instance, a message board regularly visited by Backstreet Boys fans? Will we be despondent if a personal Web site created to boast photos of the first birthday of Billy, the son of Mr. and Mrs. Jones of Hicksville, New York, becomes 404? It doesn't seem likely. However, valuable historical information such as the beginnings of e-commerce (reputed to have occurred in the "adult entertainment" community) and the earliest online presidential campaign information is worth saving, particularly to historians.

As with almost anything Internet-related these days, there are issues to be settled and lawyers to be called. Alexa's practices make some companies and individuals edgy, and many organizations have put filters on their sites so their pages can't be accessed and archived. CNN reported in May of this year that several lawsuits had actually been filed against Alexa for privacy infringement. Particularly in the realm of personal Web sites, I can see where the distrust might lie. An embittered ex-spouse might create a Web page designed to display less-than-savory details or photos of his or her former partner, only to regret the decision later and shut down the site. Unfortunately for him or her, the Alexa robot has already crawled across the site, forever preserving in history a momentary lapse of reason, taste and judgment for all to see. Remember that photo your friend with the digital camera took last year? The one that involved you, six gin-and-tonics, a bowl of onion dip and the backyard kiddie pool? Rest assured, decades from now, your grandchildren will find it.
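The "filters" mentioned above are not described in any detail. One common mechanism, assumed here purely for illustration, is a robots.txt file that asks a named crawler to stay away. The Python sketch below uses the standard library's `urllib.robotparser` to show how a well-behaved robot would read such a file. The robots.txt contents, the example URL, the `may_crawl` helper and the use of `ia_archiver` as the crawler's name are all assumptions for the example, not details reported in the column.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes one crawler from the whole site.
# "ia_archiver" is assumed here as the name the archiving robot goes by.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/photos/pool-party.html"
    print(may_crawl(ROBOTS_TXT, "ia_archiver", url))   # the excluded robot
    print(may_crawl(ROBOTS_TXT, "SomeOtherBot", url))  # a robot not named in the file
```

A rule like this is only a request. It helps only if the robot chooses to honor it and only if it is in place before the robot's visit, which is cold comfort for the page that has already been crawled.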

As the applications and future usefulness of Web archiving continue to unfold, I'll follow them with interest, though I think I'll join my father in collecting posterity-worthy newspapers. I may stay away from digital cameras, as well.

Tracey S. Roth welcomes your thoughts at troth@tmcnet.com.


Technology Marketing Corporation

2 Trap Falls Road Suite 106, Shelton, CT 06484 USA
Ph: +1-203-852-6800, 800-243-6002
