Author Topic: Wayback Machine lets you browse over 85 billion web pages archived from 1996 on!  (Read 1699 times)


Software Santa

  • Administrator
  • Posts: 4281
The Internet Archive Wayback Machine lets you browse through over 85 billion web pages archived from 1996 to a few months ago.



Quote
About the Wayback Machine

Browse through 85 billion web pages archived from 1996 to a few months ago. To start surfing the Wayback, type in the web address of a site or page where you would like to start, and press enter. Then select from the archived dates available. The resulting pages point to other archived pages at as close a date as possible. Keyword searching is not currently supported.

What is the Internet Archive Wayback Machine?

The Internet Archive Wayback Machine is a service that allows people to visit archived versions of Web sites. Visitors to the Wayback Machine can type in a URL, select a date range, and then begin surfing on an archived version of the Web. Imagine surfing circa 1999 and looking at all the Y2K hype, or revisiting an older version of your favorite Web site. The Internet Archive Wayback Machine can make all of this possible.

Can I link to old pages on the Wayback Machine?

Yes! The Wayback Machine is built so that it can be used and referenced. If you find an archived page that you would like to reference on your Web page or in an article, you can copy the URL. You can even use fuzzy URL matching and date specification... but that's a bit more advanced.
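For anyone who wants to build such links by hand (a detail the FAQ doesn't spell out, but visible in any archived URL): the Wayback address embeds a timestamp of up to fourteen digits (YYYYMMDDhhmmss) between the archive host and the original address, and an inexact or shortened timestamp resolves to the nearest capture. A minimal Python sketch, with example.com as a placeholder:

```python
# Minimal sketch of the Wayback Machine's URL scheme: a timestamp of up
# to 14 digits (YYYYMMDDhhmmss) sits between the archive host and the
# original address. A shortened or inexact timestamp is resolved by the
# service to the closest capture it holds -- the "fuzzy" date matching
# the FAQ alludes to.

def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a link to an archived copy of original_url near timestamp."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"

# Example with a placeholder site and the 8/10/2007 date mentioned
# later in this post:
print(wayback_url("http://example.com/", "20070810"))
```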

Why isn't the site I'm looking for in the archive?

Some sites may not be included because the automated crawlers were unaware of their existence at the time of the crawl. It's also possible that some sites were not archived because they were password protected, blocked by robots.txt, or otherwise inaccessible to our automated systems. Site owners may also have requested that their sites be excluded from the Wayback Machine. When this has occurred, you will see a "blocked site error" message. When a site is excluded because of robots.txt, you will see a "robots.txt query exclusion error" message.

What does it mean when a site's archive data has been "updated"?

When our automated systems crawl the web every few months or so, we find that only about 50% of all pages on the web have changed since our previous visit. This means that much of the content in our archive is duplicate material. If you don't see a "*" next to an archived document, then the content on the archived page is identical to the previously archived copy.
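As a toy illustration of the idea (an assumption for clarity, not the Archive's actual implementation): an unchanged page can be detected by hashing its content, so an identical capture needs no new storage and earns no "*" marker.

```python
# Toy illustration of duplicate detection between crawls. This is NOT
# the Internet Archive's actual implementation -- just the idea that an
# unchanged page hashes to the same digest and needs no new storage.
import hashlib

def digest(page: bytes) -> str:
    return hashlib.sha1(page).hexdigest()

crawl_spring = {"http://example.com/": b"<html>hello</html>"}
crawl_summer = {"http://example.com/": b"<html>hello</html>"}  # unchanged

for url, content in crawl_summer.items():
    previous = crawl_spring.get(url)
    if previous is not None and digest(previous) == digest(content):
        print(url, "- identical to previous capture (no '*' marker)")
    else:
        print(url, "- updated (shown with a '*' marker)")
```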

Who was involved in the creation of the Internet Archive Wayback Machine?

"The original idea for the Internet Archive Wayback Machine began in 1996, when the Internet Archive first began archiving the web. Now, five years later, with over 100 terabytes and a dozen web crawls completed, the Internet Archive has made the Internet Archive Wayback Machine available to the public. The Internet Archive has relied on donations of web crawls, technology, and expertise from Alexa Internet and others. The Internet Archive Wayback Machine is owned and operated by the Internet Archive."

How was the Wayback Machine made?

Alexa Internet, in cooperation with the Internet Archive, designed a three-dimensional index that allows browsing of web documents over multiple time periods, and turned this unique feature into the Wayback Machine.
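To make "browsing over multiple time periods" concrete, here is a toy sketch of an index keyed by URL and capture date (an assumed illustration, not Alexa's actual data structure); it also shows the "closest available date" behavior described earlier in this FAQ:

```python
# Toy sketch of a URL-by-date index (an assumption for illustration,
# not Alexa's actual structure). Each URL maps to its captures sorted
# by date, and lookups return the capture closest to a requested date.
import bisect

index = {
    "http://example.com/": [          # (YYYYMMDD, snapshot id), sorted
        ("19961120", "snapshot-a"),
        ("20070810", "snapshot-b"),
        ("20100726", "snapshot-c"),
    ],
}

def closest_capture(url: str, date: str):
    captures = index.get(url)
    if not captures:
        return None
    dates = [d for d, _ in captures]
    pos = bisect.bisect_left(dates, date)
    # Compare the neighbors on either side of the requested date; crude
    # numeric distance is good enough for a sketch.
    neighbors = captures[max(pos - 1, 0):pos + 1]
    return min(neighbors, key=lambda cap: abs(int(cap[0]) - int(date)))

print(closest_capture("http://example.com/", "20080101"))  # snapshot-b
```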

How large is the Wayback Machine?

The Internet Archive Wayback Machine contains almost 2 petabytes of data and is currently growing at a rate of 20 terabytes per month. This eclipses the amount of text contained in the world's largest libraries, including the Library of Congress.

What type of machinery is used in this Internet Archive?

Much of the Internet Archive is stored on hundreds of slightly modified x86 servers. The computers run the Linux operating system. Each computer has 512 MB of memory and can hold just over 1 terabyte of data on ATA disks. However, we are developing a new way of storing our data on a smaller machine; each machine will store 1 terabyte. For more information, go to www.petabox.org.

How do you archive dynamic pages?

There are many different kinds of dynamic pages, some of which are easily stored in an archive and some of which fall apart completely. When a dynamic page renders standard HTML, the archive works beautifully. When a dynamic page contains forms, JavaScript, or other elements that require interaction with the originating host, the archive will not contain the original site's functionality.

Why are some sites harder to archive than others?

If you look at our collection of archived sites, you will find some broken pages, missing graphics, and some sites that aren't archived at all. Here are some things that make it difficult to archive a web site:

    * Robots.txt -- We respect robot exclusion instructions.
    * JavaScript -- JavaScript elements are often hard to archive, especially if they generate links without including the full name in the page. Also, if JavaScript needs to contact the originating server in order to work, it will fail when archived.
    * Server-side image maps -- Like any functionality on the web, if it needs to contact the originating server in order to work, it will fail when archived.
    * Unknown sites -- The archive contains crawls of the Web completed by Alexa Internet. If Alexa doesn't know about your site, it won't be archived.
    * Orphan pages -- If there are no links to your pages, the crawlers won't find them (robots don't enter queries in search boxes).

As a general rule of thumb, simple HTML is the easiest to archive.

Some sites are not available because of robots.txt or other exclusions. What does that mean?

The Standard for Robot Exclusion (SRE) is a means by which web site owners can instruct automated systems not to crawl their sites. Web site owners can specify files or directories that are disallowed from a crawl, and they can even create specific rules for different automated crawlers. All of this information is contained in a file called robots.txt. While robots.txt has been adopted as the universal standard for robot exclusion, compliance with it is strictly voluntary; in fact, most web sites do not have a robots.txt file, and many web crawlers are not programmed to obey its instructions anyway. However, Alexa Internet, the company that crawls the web for the Internet Archive, does respect robots.txt instructions, and even does so retroactively. If a web site owner decides they prefer not to have a web crawler visiting their files and sets up robots.txt on the site, the Alexa crawlers will stop visiting those files and will make unavailable all files previously gathered from that site. This means that sometimes, while using the Internet Archive Wayback Machine, you may find a site that is unavailable due to robots.txt (you will see a "robots.txt query exclusion error" message).

Sometimes a web site owner will contact us directly and ask us to stop crawling or archiving a site, and we endeavor to comply with these requests. When you come across a "blocked site error" message, that means that a site owner has made such a request and it has been honored.
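To make this concrete, here is a minimal sketch of what such an exclusion looks like and how a compliant crawler evaluates it, using Python's standard urllib.robotparser. The user-agent ia_archiver is the one historically used by the Alexa / Internet Archive crawler; example.com is a placeholder site.

```python
# Minimal sketch of a robots.txt exclusion and how a compliant crawler
# evaluates it. "ia_archiver" is the user-agent historically used by the
# Alexa / Internet Archive crawler; example.com is a placeholder site.
from urllib.robotparser import RobotFileParser

# A robots.txt that tells the archive crawler to stay out entirely:
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks every URL against the rules before fetching.
print(parser.can_fetch("ia_archiver", "http://example.com/page.html"))   # False
print(parser.can_fetch("SomeOtherBot", "http://example.com/page.html"))  # True
```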

How can I help the Internet Archive and the Wayback Machine?

The Internet Archive actively seeks donations of digital materials for preservation. If you have digital materials that may be of interest to future generations, please let us know by sending an email to info at archive dot org. The Internet Archive is also seeking additional funding to continue this important mission; you can use the donate tab above. Thank you for considering us in your charitable giving.

Can I search the Archive?

Using the Internet Archive Wayback Machine, it is possible to search for the names of sites contained in the Archive (URLs) and to specify date ranges for your search. We hope to implement a full text search engine at some point in the future.
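As an aside for today's readers (this endpoint is a later addition and is not mentioned in the FAQ above): the modern Wayback Machine exposes a simple availability API that performs exactly this URL-plus-date lookup. A minimal sketch using only the Python standard library:

```python
# Minimal sketch of a URL-and-date lookup against the Wayback Machine's
# availability endpoint -- a later addition, not described in the FAQ
# quoted above. Uses only the standard library.
import json
import urllib.request
from urllib.parse import urlencode

def closest_snapshot(url: str, timestamp: str = "20070810"):
    """Return the URL of the archived snapshot closest to timestamp."""
    query = urlencode({"url": url, "timestamp": timestamp})
    with urllib.request.urlopen(f"https://archive.org/wayback/available?{query}") as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

print(closest_snapshot("example.com"))
```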

Why am I getting broken or gray images on a site?

Broken images (when there is a small red "x" where the image should be) occur when the images are not available on our servers. Usually this means that we did not archive them. Gray images are the result of robots.txt exclusions. The site in question may have blocked robot access to their images directory.

The Original Software Santa ... Somebody Else ... owned this Domain in 1998!



Set the Wayback Machine for 8/10/2007 ... When the NEXT Software Santa was just starting out.



Scoot the Wayback Machine ahead 3 years to 7/26/2010 ... Software Santa had added many more Presents.



Here is the Wayback Machine set to 8/24/2012 ... Software Santa changed the site's look {and URL}.

http://www.archive.org/web/web.php
« Last Edit: October 31, 2014, 02:58:47 AM by Software Santa »

 
