About this service

In order to check the validity of the technical reports that W3C publishes, the Systems Team has developed a link checker.

A first version was developed in August 1998 by Renaud Bruyeron. Since it was lacking some functionality, Hugo Haas rewrote it more or less from scratch in November 1999. It has been improved by Ville Skyttä and many other volunteers since.

The source code is available publicly under the W3C IPR software notice from CPAN (released versions) and a Mercurial repository (development and archived release versions).

What it does

The link checker reads an HTML or XHTML document or a CSS style sheet and extracts a list of anchors and links.

It checks that no anchor is defined twice.

It then checks that all the links are dereferenceable, including the fragments. It warns about HTTP redirects, including directory redirects.

It can recursively check part of a Web site.

There is a command line version and a CGI version. Both support HTTP Basic authentication; the CGI version achieves this by passing the authorization information from the user's browser through to the site being checked. (A minimal sketch of the extract-and-check process appears at the end of this page.)

Use it online

There is an online version of the link checker.

In the online version (and in general, when run as a CGI script), the number of documents that can be checked recursively is limited.

Both the command line version and the online one sleep at least one second between requests to each server to avoid abuse and target server congestion.

Access keys

The following access keys are implemented throughout the site in an attempt to help users using screen readers.

Home: access key "1" leads back to the service's home page.
Downloads: access key "2" leads to downloads.
Documentation: access key "3" leads to the documentation index for the service.
Feedback: access key "4" leads to the feedback instructions.

Robots exclusion

The link checker honors robots exclusion rules. To place rules specific to the W3C Link Checker in /robots.txt files, sites can use the W3C-checklink user agent string. For example, to allow the link checker to access all documents on a server and to disallow all other robots, one could use the following:

    User-Agent: *
    Disallow: /

    User-Agent: W3C-checklink
    Disallow:

Robots exclusion support in the link checker is based on the LWP::RobotUA Perl module. It currently supports the "original 1994 version" of the standard. The robots META tag, i.e. <meta name="robots" content="...">, is not supported. Other than that, the link checker's implementation goes all the way in trying to honor robots exclusion rules; if a /robots.txt disallows it, not even the first document submitted as the root of a link checker run is fetched. (A sketch of an equivalent robots.txt check appears at the end of this page.)

Note that /robots.txt rules affect only user agents that honor them; they are not a generic method of access control.

Known issues

If a link checker run in "summary only" mode takes a long time, some user agents may stop loading the results page due to a timeout. We have placed workarounds in the code hoping to avoid this, but have not yet found one that works reliably in all browsers. If you experience these timeouts, try avoiding "summary only" mode, or try using the link checker with another browser.
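Example: extracting and checking links

As an illustration of the extract-and-check process described under "What it does", here is a minimal sketch in Python. The link checker itself is written in Perl (using LWP); the function and class names below are purely illustrative and are not part of its code base. The sketch also shows the at-least-one-second pause between requests to the same server mentioned under "Use it online".

    import time
    import urllib.request
    from collections import Counter
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkAndAnchorCollector(HTMLParser):
        """Collects anchor names/ids and link targets from one HTML document."""
        def __init__(self):
            super().__init__()
            self.anchors = []   # fragment identifiers defined in the document
            self.links = []     # URLs the document points to

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if "id" in attrs:
                self.anchors.append(attrs["id"])
            if tag == "a" and "name" in attrs:
                self.anchors.append(attrs["name"])
            if tag in ("a", "link") and "href" in attrs:
                self.links.append(attrs["href"])
            if tag in ("img", "script") and "src" in attrs:
                self.links.append(attrs["src"])

    last_request = {}  # per-host timestamp of the previous request

    def polite_fetch(url):
        """Fetch a URL, sleeping at least one second between requests to a host."""
        host = urlparse(url).netloc
        wait = 1.0 - (time.monotonic() - last_request.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_request[host] = time.monotonic()
        return urllib.request.urlopen(url, timeout=30)

    def check(url):
        parser = LinkAndAnchorCollector()
        parser.feed(polite_fetch(url).read().decode("utf-8", errors="replace"))

        # No anchor may be defined twice.
        for name, count in Counter(parser.anchors).items():
            if count > 1:
                print(f"duplicate anchor: {name!r} defined {count} times")

        # Every link must be dereferenceable. (Checking fragments, as the real
        # link checker does, would additionally require parsing each target
        # document and looking up the fragment among its anchors.)
        for href in parser.links:
            target = urljoin(url, href)
            try:
                response = polite_fetch(target)
                print(target, response.status)
            except OSError as exc:
                print(target, "broken:", exc)

    if __name__ == "__main__":
        check("https://www.w3.org/")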
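Example: honoring robots exclusion

The robots exclusion behaviour described above can be approximated as follows. The link checker itself relies on the LWP::RobotUA Perl module, so this Python sketch using the standard library's robotparser module is only an analogy; the W3C-checklink user agent string is the one documented above, and the start URL is hypothetical.

    from urllib.parse import urljoin, urlparse
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "W3C-checklink"

    def allowed(url):
        """Return True if /robots.txt on the target host permits fetching url."""
        parts = urlparse(url)
        rules = RobotFileParser(urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt"))
        rules.read()
        return rules.can_fetch(USER_AGENT, url)

    # As described above, even the document submitted as the root of a run
    # is skipped when robots.txt disallows it.
    start = "https://example.org/some/page.html"
    if not allowed(start):
        print(f"robots.txt forbids checking {start}; nothing is fetched")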