November 7, 2002, 10:26 PM ET
The content-to-code ratio
Earlier today, Steve Outing of E-Media Tidbits wrote about "page bloat behind the scenes" -- the fact that many major news sites have incredibly bulky HTML under the hood. Then Barry Parr followed up, posting content-to-code ratios for six major news sites.
This topic intrigued me, so I threw together an application that calculates the ratio of text content to total page size for a given Web page. It'll strip all the HTML, JavaScript and CSS, and determine how much of the document is actual text.
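A minimal sketch of that calculation, written in Python rather than PHP, and not the actual GetContentSize source. The names `TextExtractor` and `content_ratio` are hypothetical, and two details are assumptions: whitespace runs are collapsed before counting, and sizes are measured in UTF-8 bytes.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []      # text fragments found outside skipped tags
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def content_ratio(html):
    """Return (text_bytes, total_bytes, percent) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    # Assumption: collapse whitespace runs so markup indentation
    # is not counted as text content.
    text = " ".join("".join(parser.chunks).split())
    text_bytes = len(text.encode("utf-8"))
    total_bytes = len(html.encode("utf-8"))
    percent = 100.0 * text_bytes / total_bytes if total_bytes else 0.0
    return text_bytes, total_bytes, percent


sample = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello, world.</p></body></html>"
)
text_bytes, total_bytes, percent = content_ratio(sample)
print(f"{text_bytes} of {total_bytes} bytes are text ({percent:.2f}%)")
```

The sketch only measures the HTML string it is handed, so fetching the page is left out. Like the real tool, it ignores images and external files.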
I gave it the unsexy name GetContentSize, and I've put it online so everybody can play around with it, just for kicks. (All you fans of object-oriented PHP can also download the source code.)
GetContentSize will tell you, for example, that CNN.com's home page -- just the page itself, not any attached images, JavaScript classes or style sheets -- weighs 47,000 bytes but only devotes 8.70 percent of that to text content. (Really makes you wonder what the other 91.3 percent of the document accomplishes.)
Other interesting ratios:
- dallasnews.com -- 5.85%
- abcnews.com -- 5.94%
- news.bbc.co.uk/1/hi/uk/default.stm -- 7.54%
- chicagotribune.com -- 8.86%
- washingtonpost.com -- 9.70%
- nytimes.com -- 10.85%
- latimes.com -- 10.95%
- boston.com -- 14.70%
Not surprisingly, blogs outperformed news sites considerably -- probably because blogs tend to use CSS to separate content from code, and they're more text-driven than news-site home pages anyway. Some examples:
- dashes.com/anil/ -- 25.94%
- doc.weblogs.com -- 26.18%
- kottke.org -- 40.00%
- ashbykuhlman.net -- 44.63%
- holovaty.com -- 46.67%
- simon.incutio.com -- 46.93%
- hypergene.net/blog/ -- 59.03%
Of course, having a low ratio isn't horrible. HTML structure is important. And this tool doesn't account for photos (which are an important part of many sites' content) or JavaScript-generated content. Still, I think something's wrong when less than 10 percent of a Web page's raw code is devoted to text content. Load time and rendering time remain important concerns.
UPDATE, Nov. 14, 12:14 AM: I've done a bit of follow-up analysis and have changed my methodology slightly.
