adrian holovaty


November 7, 2002, 10:26 PM ET

The content-to-code ratio

Earlier today, Steve Outing of E-Media Tidbits wrote about "page bloat behind the scenes" -- the fact that many major news sites have incredibly bulky HTML under the hood. Then Barry Parr followed up, posting content-to-code ratios for six major news sites.

This topic intrigued me, so I threw together an application that calculates the ratio of text content to total page size for a given Web page. It'll strip all the HTML, JavaScript and CSS, and determine how much of the document is actual text.
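The idea can be sketched in a few lines. The following is an illustrative Python version, not the GetContentSize source (which is PHP); the names `TextExtractor` and `content_ratio` are made up for this example, and the choice to collapse whitespace before counting is an assumption about what should count as "text."

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element
        self._chunks = []      # text fragments found outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Assumption: collapse whitespace runs so indentation is not
        # counted as content.
        return " ".join("".join(self._chunks).split())


def content_ratio(html):
    """Return (text_bytes, total_bytes, percent_text) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text_bytes = len(parser.text().encode("utf-8"))
    total_bytes = len(html.encode("utf-8"))
    percent = 100.0 * text_bytes / total_bytes if total_bytes else 0.0
    return text_bytes, total_bytes, percent


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style>"
        "<script>var x=1;</script></head>"
        "<body><p>Hello world</p></body></html>"
    )
    text_bytes, total_bytes, percent = content_ratio(sample)
    print(f"{text_bytes} of {total_bytes} bytes are text ({percent:.2f} percent)")
```

Tags, comments, inline scripts and inline style rules all count toward the total but not toward the text, which is the same split the ratio is meant to expose.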

I gave it the unsexy name GetContentSize, and I've put it online so everybody can play around with it, just for kicks. (All you fans of object-oriented PHP can also download the source code.)

GetContentSize will tell you, for example, that CNN.com's home page -- just the page itself, not any attached images, JavaScript classes or style sheets -- weighs 47,000 bytes but only devotes 8.70 percent of that to text content. (Really makes you wonder what the other 91.3 percent of the document accomplishes.)
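To put that in absolute terms, here is the back-of-the-envelope arithmetic as a small Python snippet. It only uses the two figures quoted above; since the 47,000-byte page size is rounded, the results are approximate.

```python
total_bytes = 47_000    # CNN.com home page size quoted above (rounded)
percent_text = 8.70     # share of the page reported as text content

text_bytes = total_bytes * percent_text / 100
markup_bytes = total_bytes - text_bytes

print(f"text:   about {text_bytes:,.0f} bytes")
print(f"markup: about {markup_bytes:,.0f} bytes")
```

In other words, roughly 4,100 bytes of text ride along with nearly 43,000 bytes of everything else.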

Other interesting ratios:

Not surprisingly, blogs outperformed news sites considerably -- probably because blogs tend to use CSS to separate content from code, and they're more text-driven than news-site home pages anyway. Some examples:

Of course, having a low ratio isn't horrible. HTML structure is important. And this tool doesn't account for photos (which are an important part of many sites' content) or for JavaScript-generated content. Still, I think something's wrong when less than 10 percent of a Web page's raw code is devoted to text content. Load time and rendering time remain important concerns.

UPDATE, Nov. 14, 12:14 AM: I've done a bit of follow-up analysis and have changed my methodology slightly.
