Thanks to everybody for the positive feedback on GetContentSize, which I presented Friday.
I logged the tool's results to a database, so I'm able to present a few interesting observations/statistics. (I've been away from my computer for a few days, so I apologize for not having this earlier.)
Total Web pages examined: 4,296
(Pages ranged from news sites to blogs to, yes, porn sites.)
Average percent text content: 21.57%
Highest percent text content: This page (86.57%)
Lowest percent text content: This page (0.02%)
Average percent text content for URLs ending in ".com" or ".com/": 17.71%
(I figured this might be a decent way to narrow down the results to commercial home pages.)
Average page size: 27,910 bytes
If there's another statistic you'd like to see, post a comment here, and I'll query the database to get it, as long as the statistic is obtainable by MySQL. The logged fields are: URL, page size (in bytes), percent content and the date/time.
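For anyone curious what such a query might look like, here is a rough sketch of the ".com" average from above. It is not the actual GetContentSize code: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name (pages), the column names and the sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical table mirroring the four logged fields.
# The table and column names are assumptions, not the real schema.
SCHEMA = """
CREATE TABLE pages (
    url TEXT,
    page_size INTEGER,
    percent_content REAL,
    logged_at TEXT
)
"""


def build_demo_db():
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?, ?)",
        [
            # Invented sample data, not results from the real log.
            ("http://example.com/", 20000, 10.0, "2002-11-12 10:00:00"),
            ("http://example.com", 30000, 20.0, "2002-11-12 10:05:00"),
            ("http://example.org/", 40000, 60.0, "2002-11-12 10:10:00"),
            ("http://example.com/about.html", 10000, 50.0, "2002-11-12 10:15:00"),
        ],
    )
    return conn


def average_percent_for_dot_com(conn):
    """Average percent content for URLs ending in '.com' or '.com/'."""
    row = conn.execute(
        "SELECT AVG(percent_content) FROM pages "
        "WHERE url LIKE '%.com' OR url LIKE '%.com/'"
    ).fetchone()
    return row[0]  # None if no URL matches


if __name__ == "__main__":
    demo = build_demo_db()
    print(average_percent_for_dot_com(demo))
```

The two LIKE patterns only match URLs whose last characters are ".com" or ".com/", so a deeper page such as "http://example.com/about.html" is left out, which is the point of the filter.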
Posted by Barry Parr on November 13, 2002, at 4:36 a.m.:
I think it would be useful to know the ratios by decile. What was the ratio of the site at the bottom 10%, bottom 20%, etc.?
Posted by Devon on November 13, 2002, at 1:50 p.m.:
Would you be able to create a small search engine so people can find out what pages had similar ratios? Like, if I typed in "http://cnnsi.com/", it would give me its ratio and pages that had a ratio within 2% or something? That could be interesting.
Posted by Ben on November 13, 2002, at 5:18 p.m.: