How news sites keep robots away

Written by Adrian Holovaty on August 21, 2002

After today's lunchtime links entry and the reader comments it prompted, I got to thinking about news sites' robots.txt files. Are other sites as robot-hostile as nytimes.com? I took a peek at a few news sites' files -- after all, if they're accessible to robots, they're accessible to humans -- and here are a few observations, along with links to the files themselves.

Robot-hostile sites

Nytimes.com, as mentioned in today's earlier blog entry, bans just about every file.
Foxnews.com disallows every robot from everything! Coincidentally, David Gallagher wrote about this earlier today. (David points out that some Fox News content is indeed indexed in search engines, despite the disallowance.)
Bostonherald.com bans pretty much everything.
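
For reference, a file that shuts out every robot needs only two lines. Here's a minimal sketch in Python, using the standard library's urllib.robotparser, of how a well-behaved robot would read such a file. The file contents, the "ExampleBot" name and the example.com URLs are made up for illustration; they aren't copied from any of the sites above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans every robot from every path.
BAN_EVERYTHING = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that only hides the CGI directory.
BAN_CGI_ONLY = """\
User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a rule-following robot may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "http://example.com/news/story.html"
    print(is_allowed(BAN_EVERYTHING, "ExampleBot", story))
    print(is_allowed(BAN_CGI_ONLY, "ExampleBot", story))
```

Keep in mind this only models polite robots. Nothing in the file actually stops a robot that chooses to ignore it.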

Trends

Many sites, such as post-gazette.com, hide their "contact us" pages. This is a good way of cutting down on (some) newsroom spam. (Not all robots follow the rules, however, which leads some to set intricate traps.)
Other directories typically hidden are those that contain includes, advertising, images, JavaScript, story archives and server code (CGIs). No surprise there.
Then there's the case of hidden directories for internal use, like "testing" and "development" and "temp". The number of sites that identified these directories surprised me; wouldn't sites be better off keeping them a secret by not linking to them, so the robots wouldn't have a means of finding them? Indeed, some robots -- the nasty ones -- specifically seek out directories that are banned by robots.txt. Pointing out those sensitive directories is like hiding from someone and yelling out, "Don't look over here!"
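
To show how little effort the "Don't look over here!" problem takes to exploit, here's a rough Python sketch that pulls the Disallow paths out of a robots.txt file and flags the ones that look like internal directories. The sample file and the list of hint words ("test", "dev", "temp") are my own assumptions for illustration, not taken from any real site.

```python
# Hypothetical robots.txt, loosely modeled on the directory names above.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/    # server code
Disallow: /images/
Disallow: /testing/
Disallow: /development/
Disallow: /temp/
"""

# Assumed substrings that suggest a directory is for internal use.
SENSITIVE_HINTS = ("test", "dev", "temp")


def disallowed_paths(robots_txt):
    """Collect every non-empty Disallow value, ignoring comments."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


def advertised_secrets(robots_txt):
    """Return the disallowed paths whose names hint at internal use."""
    return [
        path
        for path in disallowed_paths(robots_txt)
        if any(hint in path.lower() for hint in SENSITIVE_HINTS)
    ]


if __name__ == "__main__":
    for path in advertised_secrets(SAMPLE_ROBOTS_TXT):
        print("Advertised in robots.txt:", path)
```

A nasty robot wouldn't even need the hint list; it could simply request every path the file names.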

Random observations

Boston.com's robots.txt file includes a note explaining that robots "could cause some undesirable effects if they stumbled onto our voting scripts!" That leads me to recommend they modify their voting scripts to allow only one vote per IP address per day.
Not surprisingly, big chains tend to share robots.txt settings. Examples: Knight Ridder, Tribune Company, Cox.
Dallasnews.com and the other Belo Interactive sites lack a robots.txt file. Projo.com does, however, have a very cool 404 page. (By the way, if you visit the Belo Interactive site with Mozilla, prepare to be insulted with a "Browser Not Supported" message.)
You find some interesting stuff when snooping around robots.txt files. I found the URL for Villagevoice.com's server logs from Sept. 1996 to (interestingly) the day before Sept. 11, 2001. I also stumbled upon a page on azcentral.com that spit out some of the site's database table names. Somebody might want to fix that!
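
As for the one-vote-per-IP-per-day idea suggested for Boston.com's voting scripts, here's a minimal Python sketch of the bookkeeping involved. I have no idea how their scripts actually work; the VoteLimiter class and its try_vote method are hypothetical names, and the record is kept in memory only, where a real script would presumably use a database or log file.

```python
from datetime import date


class VoteLimiter:
    """Hypothetical helper: accept one vote per IP address per day."""

    def __init__(self):
        # Each entry is an (ip_address, day) pair that has already voted.
        self._seen = set()

    def try_vote(self, ip_address, day):
        """Record the vote and return True, or return False if this
        IP address has already voted on the given day."""
        key = (ip_address, day)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


if __name__ == "__main__":
    limiter = VoteLimiter()
    today = date(2002, 8, 21)
    print(limiter.try_vote("192.0.2.1", today))  # first vote of the day
    print(limiter.try_vote("192.0.2.1", today))  # repeat from same address
```

With a check like this in place, a robot that stumbled onto the voting script would register at most one vote a day from its address, rather than one per request.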

Comments

Posted by Jay Small on August 21, 2002, at 2:23 p.m.:

Hi, Adrian. Projo.com happens to be one of the sites I oversee for Belo Interactive, so I was surprised to see you got an error message on it with Mozilla. I'm not able to repeat that error, at least not on the first several pages I browse this morning (either with Mozilla or Netscape 7 preview).

But I'd certainly like to investigate it -- wondering if you can provide any details that would help track it down. As for the robots.txt files on BI sites, you're right. We're still trying to figure out how best to combine a robots.txt file with the user registration protocols we run. As I commented yesterday, it may be handy to negotiate a back door for key searchers to index articles without running into registration.

Posted by Adrian Holovaty on August 21, 2002, at 2:40 p.m.:

No, no...Projo was fine...I meant the main Belo Interactive site. :) Sorry for the confusion; I should've been more clear. I've reworded the blog entry.

Posted by Jay Small on August 21, 2002, at 4:30 p.m.:

Whew! I feel better. Of course, we really ought to get the corporate site straight, too, though it's managed completely differently than our local sites. One small step at a time! <g>

Posted by Anil on August 22, 2002, at 6:55 a.m.:

We changed servers at the Voice during the days after the attacks in September, pushing up a scheduled move, which is why our logging system changed then. One of our servers will show the old logs at the URL you linked to, and one of ours will show current stats, so some of your visitors might not see what you've described.

Posted by Adrian on August 22, 2002, at 2:39 p.m.:

Ah, I see. Sure enough, I clicked on the link again and saw the current stats. Thanks for the insight.
