
Tipical Charlie
Welcome to Tipical Charlie, a repository of all kinds of tips related to computing, from web developer and technologist, Charlie Arehart.
I'll mostly share my own tips that I've found others enjoyed hearing about. I'll welcome tips from others, too.
(Wondering where I came up with the name?)

Resurrecting "dead" web site content with the web archive

posted Monday, 9 May 2005

Have you ever been to a site looking for content that simply no longer exists? Perhaps you had a link (or someone gave it to you), and it now gets a "file not found" error. Of course, if a desired page has simply been moved, you may be able to find the content with enough digging (or via the Google site search I just wrote about), but what if it really is gone? Are you stuck? Maybe not. If you've never seen the "web archive", you're in for a treat.

The Internet Archive is an ambitious project which for years has been archiving, at regular intervals, the current state of web pages on millions of sites. You can simply visit the site, put in a domain name (or the complete URL to a specific page), and if it's been archived, you'll see the old page in all its glory. Of course, on the surface it seems just plain fun (the site even refers to itself playfully as the "wayback machine"). For instance, search for google.com and you'll see that their oldest hit is from 1998:

http://web.archive.org/web/19981202230410/http://www.google.com/

It's pretty amazing to see that it was as simple then as it is now. Yahoo also started out much simpler:

http://web.archive.org/web/19961017235908/http://www2.yahoo.com/

Of course, for each of these (and any archived site) the archive may hold dozens of snapshots, each showing what the site looked like at a different point in time.
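
If you're curious how to read those snapshot URLs, here's a rough Python sketch. It assumes (going only by the two examples above) that the long number after /web/ is a 14-digit date and time in the form YYYYMMDDhhmmss, followed by the original address. The function name parse_snapshot_url is just something I made up for illustration.

```python
from datetime import datetime

# Prefix shared by the snapshot URLs shown above.
ARCHIVE_PREFIX = "http://web.archive.org/web/"


def parse_snapshot_url(snapshot_url):
    """Split a snapshot URL into (capture time, original URL).

    Assumption: the path segment right after /web/ is a 14-digit
    YYYYMMDDhhmmss stamp, as in the Google and Yahoo examples.
    """
    if not snapshot_url.startswith(ARCHIVE_PREFIX):
        raise ValueError("not a web.archive.org snapshot URL")
    rest = snapshot_url[len(ARCHIVE_PREFIX):]
    # Everything up to the first slash is the stamp; the rest is the page.
    stamp, _, original = rest.partition("/")
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return captured, original


captured, original = parse_snapshot_url(
    "http://web.archive.org/web/19981202230410/http://www.google.com/"
)
print(captured.isoformat(), original)
```

Run against the Google link, this reports a capture on 2 December 1998, which matches the "oldest hit is from 1998" remark.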

But back to the real point of this entry: if you try to visit a URL and it's no longer there, whether it's the whole site or a single page, try the archive. It doesn't just archive the front page; it spiders as much of the site as it can. Indeed, once you call up a page you can often follow the links on it to find that other pages have been archived as well.

For instance, if you try to visit http://www.allaire.com, it now takes you to http://www.macromedia.com (which will someday soon take you to http://www.adobe.com, but that's another story). But visit the archive, and you can see that there are lots of past versions of the Allaire site archived (from 1997-2004):

http://web.archive.org/web/*/http://www.allaire.com
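
The links above all follow the same pattern, so you can build them yourself instead of going through the archive's search box. Here's a minimal Python sketch of that idea. The two function names are my own, and the pattern is inferred from the example links in this entry rather than from any official documentation, so treat it as an assumption.

```python
# Prefix taken from the example links in this entry.
ARCHIVE_PREFIX = "http://web.archive.org/web/"


def archive_listing_url(page_url):
    """Link to the list of all archived copies of page_url.

    Uses the "*" form seen in the allaire.com example.
    """
    return ARCHIVE_PREFIX + "*/" + page_url


def archive_snapshot_url(page_url, timestamp):
    """Link to one archived copy of page_url.

    Assumption: timestamp is a 14-digit YYYYMMDDhhmmss string,
    as in the Google and Yahoo examples.
    """
    if not (len(timestamp) == 14 and timestamp.isdigit()):
        raise ValueError("timestamp must be 14 digits (YYYYMMDDhhmmss)")
    return ARCHIVE_PREFIX + timestamp + "/" + page_url


print(archive_listing_url("http://www.allaire.com"))
print(archive_snapshot_url("http://www.google.com/", "19981202230410"))
```

The same approach works for a deep link to a single page: pass the full dead URL rather than just the domain.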

But my real point is that you may want to search for some specific page on a site. For instance, I often read a web site or email with a link to a Microsoft article that is no longer at the URL given. Often I can find the article at that original URL in the archive. It's just awesome. Try it out. (There's even a way to set up a shortcut in your browser to jump to the archive for a page automatically. More on that another time.)


1. a reader left...
Friday, 13 May 2005 6:16 pm

if the content is recently missing you can generally find it using the google search "cache:www.yoursite.com/yourpage.html" - i find that web archive tends to lag on indexing many sites while google usually has a more recent copy

Sean Tierney [legaltech@gmail.com]


2. Charlie Arehart left...
Sunday, 5 June 2005 11:05 pm

Excellent point, Sean. I had meant to hint at that, too, when I wrote this, and was certainly planning to follow up with a future entry about that awesome tool (covering both how to right-click on a page to see any cached copy and how to make it easier to access from the Google toolbar).