[July 12, 2000]
Saving The Web For Posterity
My father has a box in the attic somewhere. It contains old copies of
the New York Times. There's one from November 22, 1963 (JFK's
assassination), as well as the issue published July 20, 1969 (the first
moon landing). He also kept the issue from my day of birth, as well as my
brother's. (Neither of our birthdays was particularly newsworthy
otherwise, but it's fun to read the humdrum details of life on those days:
weather, local department store sales and music or book reviews.)
I used to read newspapers religiously. I'm something of a news fanatic,
and being an hour out of touch, except when sleeping, makes me edgy. I no
longer bother with newspapers as much, as I now get all my headlines
online. CNN.com has a devoted fan in me,
and I usually visit both Time.com and US
News and World Report at least once a week. Newsweek is the only news
publication that still holds a place of honor in print in my mailbox.
It occurs to me, as a result, that news no longer "hits the
newspapers." Instead, it trickles to the newspapers. It hits the Web
first. By the time a headline reaches the front page in 72-point type,
most of us have already read about it. But how do you save a Web page for
posterity? Sure, you might be able to sort through the text of an article
that appeared on a newsworthy day years later, but it's not quite the
same, is it?
At least one organization has recognized the need for preserving
aspects of the Internet that might otherwise be lost to the future. Alexa
Internet, a wholly owned subsidiary of Amazon.com,
first made history, so to speak, in late 1998 by donating a copy of the
public World Wide Web to the Library of
Congress. The first donation, comprising two terabytes of
information, was presented in the form of an interactive sculpture
titled "World Wide Web 1997: 2 Terabytes in 63 Inches." (A
terabyte is 2 to the 40th power bytes, or roughly a thousand billion
bytes.) The piece was made of four computer monitors and driven by 44
digital tapes. The intent was to provide a complete picture of the Web
from 1997 onward. The Library of Congress was the natural recipient for
this donation: It is the largest library in the world and is purported to
contain nearly every item ever published in any format or language. In
recent years, the Library of Congress began its National Digital Library
Program, an online archive of many of its rare historical documents, of
which the Web snapshot will become a part.
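The two figures given for a terabyte in the parenthetical above are close but not identical, and a quick calculation shows the gap. The following is a minimal Python sketch of that arithmetic; the constant and function names are illustrative and not drawn from Alexa or the Library of Congress.

```python
# Compare the binary and decimal readings of "terabyte".
BINARY_TERABYTE = 2 ** 40      # 2 to the 40th power bytes
DECIMAL_TERABYTE = 10 ** 12    # a thousand billion bytes


def percent_difference(binary: int, decimal: int) -> float:
    """Return how much larger the binary figure is, as a percentage."""
    return (binary - decimal) / decimal * 100


print(f"{BINARY_TERABYTE:,} bytes")
print(f"{percent_difference(BINARY_TERABYTE, DECIMAL_TERABYTE):.1f}% larger")
```

By this reckoning the binary terabyte is about 10 percent larger than a thousand billion bytes, which is why "roughly" is the safer word.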
At the time, Alexa Internet president and CEO Brewster Kahle was quoted
as saying, "The fabric of the Web is a temporary one at best unless
we commit to its long-term care and feeding. With our donation of the Web
Archive to the Library of Congress, we're trying to build an
infrastructure that transforms the Web into a resource to benefit future
generations of scholars and historians."
Alexa's creation continues to archive on a regular basis, using a robot
to "crawl" through the Internet every six to eight weeks,
analyzing, updating and storing information. The data the company collects
is not only archived, but offered on its free Alexa service, a search
engine that can locate Web pages that are no longer available (otherwise
known as "404" to savvy Netizens). At its inception in
1996, the archive contained text only. As of this year, it has begun to
collect images. Another recipient of Alexa's donation resides at www.archive.org.
The site, called the Internet Archive, is a non-profit founded in 1996 (by
Brewster Kahle of Alexa, among others) to offer free sources to
historians, researchers and the general public. At this time, the site
states that, "you will need Unix programming skills to gain access to
and use an entire collection." That precludes most of us from
gaining any benefit from the archive until it's in a more user-friendly
format, or we manage to slog through Unix For Dummies, whichever
comes first. Currently, the Internet Archive boasts somewhere in the
neighborhood of one billion pages of Web content -- exceeding 14 terabytes
-- and its rate of growth is approximately two terabytes a month as of
March of this year.
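Those round numbers invite some back-of-the-envelope arithmetic. The sketch below is only a rough estimate: it treats "one billion pages," "14 terabytes" and "two terabytes a month" as exact, uses the thousand-billion-byte terabyte, and assumes growth stays constant, none of which the Internet Archive claims. The names are illustrative.

```python
# Rough figures quoted for the Internet Archive as of March 2000.
ARCHIVE_BYTES = 14 * 10**12          # "exceeding 14 terabytes"
PAGE_COUNT = 10**9                   # "one billion pages"
GROWTH_BYTES_PER_MONTH = 2 * 10**12  # "two terabytes a month"


def average_page_bytes(total_bytes: int, pages: int) -> float:
    """Average stored size of one page, in bytes."""
    return total_bytes / pages


def months_to_double(total_bytes: int, growth_per_month: int) -> float:
    """Months until the archive doubles, assuming constant growth."""
    return total_bytes / growth_per_month


print(average_page_bytes(ARCHIVE_BYTES, PAGE_COUNT))
print(months_to_double(ARCHIVE_BYTES, GROWTH_BYTES_PER_MONTH))
```

Under those assumptions the average page weighs in at about 14,000 bytes, and the collection would double in roughly seven months.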
Naysayers might argue that the archiving exercise has no merit.
After all, a great deal of what lurks on the Internet is admittedly
complete drivel. Does anyone in the future really need to view, for
instance, a message board regularly visited by Backstreet Boys fans? Will
we be despondent if a personal Web site created to boast photos of the
first birthday of Billy, the son of Mr. and Mrs. Jones of Hicksville, New
York, becomes 404? It doesn't seem likely. However, valuable historical
information such as the beginnings of e-commerce (reputed to have
occurred in the "adult entertainment" community) and the
earliest online presidential campaign information is worth saving,
particularly to historians.
As with almost anything Internet-related these days, there are issues
to be settled and lawyers to be called. Alexa's practices make some
companies and individuals edgy, and many organizations have put filters on
their sites so their pages can't be accessed and archived. CNN reported in
May of this year that several lawsuits had actually been filed against
Alexa for privacy infringement. Particularly in the realm of personal Web
sites, I can see where the distrust might lie. An embittered ex-spouse
might create a Web page designed to display less-than-savory details or
photos of his or her former partner, only to regret the decision later and
shut down the site. Unfortunately for him or her, the Alexa robot has
already crawled across the site, forever preserving in history a momentary
lapse of reason, taste and judgment for all to see. Remember that photo
your friend with the digital camera took last year? The one that involved
you, six gin-and-tonics, a bowl of onion dip and the backyard kiddie pool?
Rest assured, decades from now, your grandchildren will find it.
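The column does not say how the filters mentioned above work. One common mechanism, assumed here, is a robots.txt file that tells crawlers which pages they may fetch. The sketch below uses Python's standard urllib.robotparser module to check such a file; the sample rules, the ia_archiver user-agent name, the example.com addresses and the may_crawl helper are all illustrative assumptions rather than a description of Alexa's actual robot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one archiving robot entirely,
# and keep every other robot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_crawl(ROBOTS_TXT, "ia_archiver", "http://example.com/index.html"))
print(may_crawl(ROBOTS_TXT, "OtherBot", "http://example.com/index.html"))
print(may_crawl(ROBOTS_TXT, "OtherBot", "http://example.com/private/pool.jpg"))
```

A filter of this kind only works if the robot chooses to honor it, and it does nothing for pages that were crawled before the rule was added.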
As the applications and future usefulness of Web archiving continue
to unfold, I'll follow them with interest, though I think I'll join my
father in collecting posterity-worthy newspapers. I may stay away from
digital cameras, as well.
Tracey S. Roth welcomes your thoughts at troth@tmcnet.com.