Tuesday's lunchtime links

Written by Adrian Holovaty on August 20, 2002

Evolt.org: Describing Document Text for Accessibility -- "A key focus of accessible web site design is providing equivalent alternatives to auditory and visual content."

Joe Gregorio points out that nytimes.com's robots.txt file (the file that tells robots which parts of a Web site they're allowed to crawl and index) isn't very friendly: It bans robots from just about every file on the server. (More info about robots.txt.) Now that he mentions it, I've never come across an NYT story via a Google search. What a hostile policy!
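
For anyone curious how a well-behaved robot reads such a file, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt contents, the example.com URLs and the can_fetch helper are made up for illustration; this is not the actual nytimes.com file, just the simplest rule set that shuts out every robot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (NOT the real nytimes.com file):
# "User-agent: *" applies to every robot, "Disallow: /" blocks every path.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "http://www.example.com/2002/08/20/story.html"
    print(can_fetch(HYPOTHETICAL_ROBOTS_TXT, "Googlebot", story))
```

An empty "Disallow:" line means the opposite (everything is allowed), and a rule such as "Disallow: /private/" blocks only URLs under that path. Keep in mind the file is purely advisory: it works only because polite robots choose to check it.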

Comments

Posted by Jim Olivera on August 20, 2002, at 8:29 p.m.:

Could this be the work of some asinine lawyer at the NY Times?

Posted by Mike on August 20, 2002, at 9:32 p.m.:

Does it really matter though? Would Google even make an attempt to get past the registration screen on nytimes.com and other newspaper sites?

Posted by Jay Small on August 20, 2002, at 10:01 p.m.:

Google can't crawl past a registration window unless it is given a back door, though its robots are among the best at getting past other search-spider stoppers such as URLs with arguments.

I assume (but don't know) that the folks at washingtonpost.com, who just implemented sectional-access registration but also use Google as their site search engine, gave the Google folks a way in. Sounds as though NYTimes.com didn't care to be spidered. Dunno about WSJ.com -- though it's a paid subscription site, it does put out significant blocks of content in "free" areas.

Posted by Adrian Holovaty on August 20, 2002, at 10:33 p.m.:

The Times recently started using Google search, too. But I think the Times and the Post have custom, site-specific Google implementations, i.e. whenever or however Google indexes them, it doesn't add the pages to its mega database.

Google's SiteSearch page gives more info.

Posted by Jay Small on August 20, 2002, at 10:57 p.m.:

Yep, Google will sell you a "pizza box" rack server for this purpose. But I'm betting the washpost gang wants its articles in the main Google index, so you'd still need some back-door method to get that done (wonder if the pizza box can merge its data with the main Google DB on demand; that'd be an interesting thing to find out). Looks like the Times cares less about whether its articles show up in Google.