August 20, 2002, 8:50 PM ET
How news sites keep robots away
Today's lunchtime links entry, and the reader comments it prompted, got me thinking about news sites' robots.txt files. Are other sites as robot-hostile as nytimes.com? I took a peek at a few news sites' files -- after all, if they're accessible to robots, they're accessible to humans -- and here are a few observations, along with links to the files themselves.
Robot-hostile sites
- Nytimes.com, as mentioned in today's previous blog entry, bans just about every file.
- Foxnews.com disallows every robot from everything! Coincidentally, David Gallagher wrote about this earlier today. (David points out that some Fox News content is indeed indexed in search engines, despite the disallowance.)
- Bostonherald.com bans pretty much everything.
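For anyone curious what "disallows every robot from everything" looks like in practice, here's a rough Python sketch using the standard library's urllib.robotparser. The robots.txt text, the ExampleBot name and the example.com URLs are made up for illustration; they aren't copied from any of the sites above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans every robot from every path
# (an illustration, not any real site's file).
BAN_EVERYTHING = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's contents as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and example.com are placeholder names.
    print(is_allowed(BAN_EVERYTHING, "ExampleBot", "http://example.com/news/story.html"))
```

With the ban-everything file, is_allowed comes back False for any URL. Swap in a file that only disallows, say, /cgi-bin/ and ordinary story URLs come back True. This only covers a well-behaved robot that actually checks the file.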
Trends
- Many sites, such as post-gazette.com, hide their "contact us" pages. This is a good way of cutting down on (some) newsroom spam. (Not all robots follow the rules, however, which leads some to set intricate traps.)
- Other directories typically hidden are those that contain includes, advertising, images, JavaScript, story archives and server code (CGIs). No surprise there.
- Then there's the case of hidden directories for internal use, like "testing" and "development" and "temp". The number of sites that identified these directories surprised me; wouldn't sites be better off keeping them a secret by not linking to them, so the robots wouldn't have a means of finding them? Indeed, some robots -- the nasty ones -- specifically seek out directories that are banned by robots.txt. Pointing out those sensitive directories is like hiding from someone and yelling out, "Don't look over here!"
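To show why listing internal directories in robots.txt gives them away, here's a small Python sketch that pulls every Disallow path out of a file's text. The sample file and the disallowed_paths function are my own illustration, not taken from any particular site.

```python
from typing import List

# Hypothetical robots.txt that names its internal directories.
SAMPLE = """\
User-agent: *
Disallow: /testing/      # internal use
Disallow: /development/
Disallow: /temp/
"""


def disallowed_paths(robots_txt: str) -> List[str]:
    """Collect the path from every non-empty Disallow line."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE):
        print(path)
```

A dozen lines is all it takes to turn the "keep out" list into a list of places to look, which is the whole problem.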
Random observations
- Boston.com's robots.txt file includes a note explaining that robots "could cause some undesirable effects if they stumbled onto our voting scripts!" That leads me to recommend they modify their voting scripts to allow only one vote per IP address per day.
- Not surprisingly, big chains tend to share robots.txt settings. Examples: Knight Ridder, Tribune Company, Cox.
- Dallasnews.com and the other Belo Interactive sites lack a robots.txt file. Projo.com does, however, have a very cool 404 page. (By the way, if you visit the Belo Interactive site with Mozilla, prepare to be insulted with a "Browser Not Supported" message.)
- You find some interesting stuff when snooping around robots.txt files. I found the URL for Villagevoice.com's server logs from Sept. 1996 to (interestingly) the day before Sept. 11, 2001. I also stumbled upon a page on azcentral.com that spit out some of the site's database table names. Somebody might want to fix that!
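The one-vote-per-IP-address-per-day idea suggested above for voting scripts could look something like this in Python. It's a minimal sketch, not how Boston.com's scripts actually work: the DailyVoteLimiter class is a made-up name, and it keeps its records in memory, where a real script would presumably use a database or a file.

```python
from datetime import date
from typing import Optional, Set, Tuple


class DailyVoteLimiter:
    """Hypothetical helper: accept at most one vote per IP address per day."""

    def __init__(self) -> None:
        # Each entry is an (ip_address, day) pair that has already voted.
        self._seen: Set[Tuple[str, date]] = set()

    def try_vote(self, ip_address: str, day: Optional[date] = None) -> bool:
        """Record the vote and return True, or return False for a repeat."""
        if day is None:
            day = date.today()
        key = (ip_address, day)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


if __name__ == "__main__":
    limiter = DailyVoteLimiter()
    print(limiter.try_vote("192.0.2.1"))  # first vote today
    print(limiter.try_vote("192.0.2.1"))  # repeat from the same address
```

The first call returns True and the second returns False, so a robot hammering the script from one address gets counted once. The catch is that people behind a shared proxy also share an IP address, so they'd share a single vote.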
August 20, 2002, 12:51 PM ET
Tuesday's lunchtime links
Evolt.org: Describing Document Text for Accessibility -- "A key focus of accessible web site design is providing equivalent alternatives to auditory and visual content."
Joe Gregorio points out that nytimes.com's robots.txt file (the file that spells out which parts of a Web site a robot is allowed to index) isn't very friendly. Namely, it bans robots from just about every file on the server. (More info about robots.txt.) Now that he mentions it, I've never come across an NYT story via a Google search. What a hostile policy!
