Content-to-code ratio statistics

Written by Adrian Holovaty on November 12, 2002

Thanks to everybody for the positive feedback on GetContentSize, which I presented Friday.

I logged the tool's results to a database, so I'm able to present a few interesting observations/statistics. (I've been away from my computer for a few days, so I apologize for not having this earlier.)

Total Web pages examined: 4,296

(Pages ranged from news sites to blogs to, yes, porn sites.)

Average percent text content: 21.57%

Highest percent text content: This page (86.57%)

Lowest percent text content: This page (0.02%)

Average percent text content for URLs ending in ".com" or ".com/": 17.71%

(I figured this might be a decent way to narrow down the results to commercial home pages.)
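For anyone who wants to try the same filter on their own data, here is a rough sketch of the ".com" or ".com/" test in Python. The function names and the sample rows are made up for illustration; the real numbers above came from the logged database, not from this code.

```python
def is_dot_com_home(url: str) -> bool:
    """True when the URL ends in ".com" or ".com/", the filter described above."""
    return url.endswith((".com", ".com/"))


def average_percent(rows, predicate=lambda url: True):
    """Mean percent content over rows whose URL passes the predicate.

    rows is an iterable of (url, percent_content) pairs.
    Returns None when no row matches.
    """
    values = [pct for url, pct in rows if predicate(url)]
    return sum(values) / len(values) if values else None


# Made-up sample rows, not data from the actual log.
sample = [
    ("http://example.com", 10.0),
    ("http://example.com/", 20.0),
    ("http://example.org/", 60.0),
    ("http://example.com/news/story.html", 30.0),
]

print(average_percent(sample, is_dot_com_home))  # averages only the two home-page rows
print(average_percent(sample))                   # averages all four rows
```

The suffix test is crude: it skips commercial home pages that end in something like "/index.html" and it counts any ".com" home page as commercial, which is why it is only a rough way to narrow things down.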

Average page size: 27,910 bytes

If there's another statistic you'd like to see, post a comment here, and I'll query the database to get it, as long as the statistic is obtainable by MySQL. The logged fields are: URL, page size (in bytes), percent content and the date/time.
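To give an idea of what kind of statistic is easy to pull from those four fields, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The table name, column names and sample rows are all assumptions; only the list of logged fields comes from the description above.

```python
import sqlite3

# In-memory SQLite database standing in for the real MySQL log.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gcs_log (          -- hypothetical table name
        url             TEXT,
        page_size       INTEGER,    -- bytes
        percent_content REAL,
        logged_at       TEXT        -- date/time of the request
    )
    """
)

# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO gcs_log VALUES (?, ?, ?, ?)",
    [
        ("http://example.com/", 20000, 10.0, "2002-11-08 09:00:00"),
        ("http://example.org/", 30000, 30.0, "2002-11-08 09:05:00"),
        ("http://example.net/a.html", 40000, 50.0, "2002-11-09 14:30:00"),
    ],
)


def summary(connection):
    """Return the headline statistics as a dict, using plain SQL aggregates."""
    row = connection.execute(
        """
        SELECT COUNT(*),
               AVG(percent_content),
               MAX(percent_content),
               MIN(percent_content),
               AVG(page_size)
        FROM gcs_log
        """
    ).fetchone()
    keys = ("pages", "avg_percent", "max_percent", "min_percent", "avg_size")
    return dict(zip(keys, row))


print(summary(conn))
```

Anything that boils down to COUNT, AVG, MIN or MAX with a WHERE clause on those columns should be obtainable the same way.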


Posted by Barry Parr on November 12, 2002, at 10:36 p.m.:

I think it would be useful to know the ratios by decile. What was the ratio of the site at the bottom 10%, bottom 20%, etc.?
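One possible way to compute the decile figures asked about here is the nearest-rank method, sketched below in Python. The function name and the sample values are hypothetical, and nearest-rank is just one of several percentile definitions; nothing here reflects how the tool's database is actually queried.

```python
def decile_cutoffs(percents):
    """Map 10, 20, ..., 90 to the percent-content value at that decile.

    Uses the nearest-rank method: the k-th percentile is the value at
    1-based position ceil(k * n / 100) in the sorted list.
    """
    ordered = sorted(percents)
    n = len(ordered)
    if n == 0:
        return {}
    cutoffs = {}
    for k in range(10, 100, 10):
        rank = (k * n + 99) // 100  # integer ceiling of k * n / 100
        cutoffs[k] = ordered[rank - 1]
    return cutoffs


# Made-up ratios for ten pages, not real results.
sample = [4.0, 35.5, 12.0, 21.0, 8.5, 60.0, 17.0, 27.0, 15.0, 44.0]
for k, value in decile_cutoffs(sample).items():
    print(f"bottom {k}%: {value}%")
```

The ceiling is done with integer arithmetic so that floating-point rounding cannot push a rank up by one.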

Posted by Devon on November 13, 2002, at 7:50 a.m.:

Would you be able to create a small search engine so people can find out what pages had similar ratios? Like, if I typed in "", it would give me its ratio and pages that had a ratio within 2% or something? That could be interesting.
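The lookup described here could be sketched roughly as follows in Python. The function name, the sample URLs and the reading of "within 2%" as two percentage points are all assumptions made for illustration.

```python
def similar_pages(rows, target_url, tolerance=2.0):
    """Find pages whose percent content is within `tolerance` points of the target.

    rows is an iterable of (url, percent_content) pairs.
    Returns (target_percent, matches), with matches sorted closest first,
    or None if target_url was never logged.
    """
    ratios = dict(rows)
    if target_url not in ratios:
        return None
    target = ratios[target_url]
    matches = [
        (url, pct)
        for url, pct in ratios.items()
        if url != target_url and abs(pct - target) <= tolerance
    ]
    matches.sort(key=lambda item: abs(item[1] - target))
    return target, matches


# Made-up rows, not data from the actual log.
sample = [
    ("http://example.com/", 20.0),
    ("http://example.org/", 21.5),
    ("http://example.net/", 25.0),
    ("http://example.edu/", 19.0),
]
print(similar_pages(sample, "http://example.com/"))
```

If a URL was logged more than once, building the dict keeps only the last ratio seen for it, which may or may not be the behavior a real search page would want.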

Posted by Ben on November 13, 2002, at 11:18 a.m.:

This may just be a tweak to the JavaScript on the front end, but I think it would be nice to have variations of the tool, such as a GCS-Lite bookmarklet, which just spits out the ratio number in an alert box... or a multiple-page input, so that I could run it on several pages at once (or a spider engine so that it could be run by a developer/administrator on their own site)... or a GCS-dex that spiders all relevant news sources each day and lists them all.
