GetContentSize changed slightly

Written by Adrian Holovaty on November 14, 2002

Some comments here, some comments on Webgraphics and various e-mails have convinced me to alter my GetContentSize tool a bit.

The tool measures the percent of a single HTML document -- not including any attached images or supplementary files -- devoted to what I call "text content." Text content is everything content-related that the user is able to view. (Or, in the case of blind users, all content a screen reader can access.) HTML code, in and of itself, is not text content. The sentence you're reading right now, however, is.

When I launched the tool last week, it calculated the text content by stripping away all HTML tags in a given document and calculating the ratio of the number of remaining characters to the number of characters in the first place. A problem, some pointed out, is that this didn't account for content embedded in HTML tags -- particularly alt attributes (text meant to describe/replace images) and title attributes (supplementary descriptive information often coded into links, acronyms, etc., and traditionally accessible via browser tooltips).

That's changed.

The tool now includes all contents of every title, alt and summary attribute. That's definitely text content, and it deserves to be counted as such.

(The seldom-used summary attribute, by the way, is supposed to be used to describe data tables. Not many people use it, but I figured I'd include it.)
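For readers who want to see the calculation spelled out, here is a minimal sketch of the idea in Python. It is not the tool's actual source: the names TextContentCounter and content_ratio are hypothetical, whitespace is counted exactly as it appears, and skipping the contents of style elements is an assumption (the thread only confirms that JavaScript is stripped).

```python
from html.parser import HTMLParser

# Attributes whose values count as text content, per the post.
COUNTED_ATTRS = {"title", "alt", "summary"}
# Elements whose contents are ignored. "script" follows the thread;
# "style" is an assumption made for this sketch.
SKIPPED_ELEMENTS = {"script", "style"}


class TextContentCounter(HTMLParser):
    """Counts characters of visible text plus title/alt/summary values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_ELEMENTS:
            self._skip_depth += 1
        for name, value in attrs:
            if name in COUNTED_ATTRS and value:
                self.chars += len(value)

    def handle_endtag(self, tag):
        if tag in SKIPPED_ELEMENTS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments and doctypes never reach this method, so they are
        # treated as code, not content.
        if self._skip_depth == 0:
            self.chars += len(data)


def count_text_chars(html: str) -> int:
    counter = TextContentCounter()
    counter.feed(html)
    counter.close()
    return counter.chars


def content_ratio(html: str) -> float:
    """Text content as a percentage of all characters in the document."""
    if not html:
        return 0.0
    return 100.0 * count_text_chars(html) / len(html)
```

Under these assumptions, a fragment such as `<p title="hi">Hello</p>` counts 7 characters of text content (2 from the title attribute, 5 from the paragraph) out of 23 in total.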

I have a feeling this will increase the content-to-code ratios for many blogs, as many bloggers tend to use the title attribute liberally, but it probably won't affect other sites' ratios much.

Comments

Posted by Anil on November 14, 2002, at 7:37 a.m.:

Nice work, a great improvement. I'm not just saying that because my ratio went from 25.94% to 38.73% either. Heh.

Posted by Craig on November 14, 2002, at 6:42 p.m.:

Seconding Anil. These improvements now offer a much better indication of how much of a page's weight is devoted to mark-up.

Posted by Nate on November 14, 2002, at 7:05 p.m.:

Webgraphics saw hardly any ratio improvement, but I'm still very grateful. This tool has brought me to think of my markup in a whole other light.

Posted by Carl on November 14, 2002, at 8:40 p.m.:

If there's no improvement in the ratio, that could mean a site that's not covering all the bases accessibility-wise. A site with pictures but no alt tags, for example.

Posted by Vincent Flanders on November 15, 2002, at 12:52 a.m.:

While the tool is nice, there are, I believe, some flaws.

I took one of my documents and ran it through your program. Removing certain items improved the content score, but I'm not sure some of the improvements are valid:

41.68 -- original content percentage

61.12 -- stripped comments and all Dublin Core META tags

61.87 -- changed from XHTML 1.0 trans to HTML 4.01 trans

63.09 -- removed accessibility material (accesskeys, tabindex, etc.)

67.42 -- removed blank spaces until I got bored

Conclusions:

1. You're punishing people who use tougher validation.

2. META tags should be broken out into a separate category. (I only used Dublin Core)

3. It seems OK to me to count comments.

4. It seems to me you're punishing people who add accessibility features.

5. I don't know what to think about blank spaces.

Posted by Mark on November 15, 2002, at 6:35 p.m.:

I agree with some of Vincent's points that a single score can penalize useful markup (PHB says "look, we can improve our score by removing all that accessibility cruft! And delete that DOCTYPE, it's useless!"), but I have no idea why he thinks comments should be counted as text content. They are not shown to any end user under any circumstances (if you're "viewing source", you're not acting as an end user); they are simply taking up valuable bandwidth because developers are too lazy to write scripts to remove them. Do you release programs with full debugging and symbol tables turned on? No, it wastes both disk space and memory. Ditto HTML comments. (Ditto unnecessary whitespace.)

Posted by Vincent Flanders on November 16, 2002, at 3:44 a.m.:

Actually, I misspoke -- well, I wasn't specific enough. What I *meant* to say is that comments should count -- against content. Hmm. OK, let me try again: "Comments should have a negative impact on the score."

I'm not sure Javascript should be stripped out. I think it should be counted as a "negative." In my mind, most of it is a waste.

It might be helpful to tell folks how much whitespace is being wasted -- in my Dreamweaver-generated code, it was almost 4.5%.
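One rough way to put a number on that, sketched in Python. The definition of "wasted" here is an assumption: every run of whitespace is treated as collapsible to a single space, which ignores places like pre elements where whitespace matters. The function name is hypothetical.

```python
import re


def redundant_whitespace_percent(html: str) -> float:
    """Percentage of characters saved by collapsing whitespace runs."""
    if not html:
        return 0.0
    # Naive: collapses every run of whitespace to one space, including
    # inside <pre> and attribute values.
    collapsed = re.sub(r"\s+", " ", html)
    return 100.0 * (len(html) - len(collapsed)) / len(html)
```

For example, the string "a  b" (two spaces) has 4 characters and collapses to 3, so 25% of it counts as redundant whitespace by this definition.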

Posted by Adrian on November 16, 2002, at 4:14 a.m.:

Vincent: But comments *do* have a negative impact on the score. In your above example, stripping comments (and the Dublin Core META tags) increased your percentage by nearly 20 points, from 41.68 to 61.12.

Also, JavaScript is already stripped out. (I apologize if I wasn't clear about that.)

As for the point about punishing added accessibility features -- you're quite right. Things like this site's link rel="prev" and link rel="next" (explained here) are almost completely code, yet they're invaluable accessibility and navigation aids. Just because they add to the code weight doesn't mean they aren't worth it.

It goes to show that you shouldn't necessarily strive to make your text-content-to-code ratio as high as possible. Rather, it's all about balance. The more results I see, the more I favor a number like 50% or 60% as the "ideal" ratio.

The original goal was to highlight sites that had incredibly low percentages; those, I think we all agree, have got to make some changes.

Posted by alison on November 18, 2002, at 3:30 a.m.:

i found a getContentSize bookmarklet for yo ass.

Posted by Jorn Barger on November 20, 2002, at 2:57 p.m.:

My link-dense timelines lose major points-- I'd suggest counting links as content when they're in high-content text-areas, but not when they're in sidebar/footer areas.

Posted by Jorn Barger on November 20, 2002, at 3:04 p.m.:

I'd like to see Google indicate the amount of content in each found page-- but not the ratio, the total text.

Posted by Nielsen-url from Jorn Barger on November 20, 2002, at 3:18 p.m.:

I'd love to see a SourceForge project to explore parsing webpage structures, starting with the ones described in the Jakob-Nielsen article I'm linking here.

Posted by some no talent hack on November 22, 2002, at 2:40 a.m.:

Why are linked CSS pages not included as non-content text? Most formatting is now done in CSS. Or is this the point? Hmmm.

Posted by Mr. Farlops on November 22, 2002, at 5:42 a.m.:

I think the link question could be solved simply. Count the text bounded by A as content and count all the external site links--look for domains in HREF--as content. Simply ignore internal links, don't penalize or praise.

I agree that all the accessibility markup should not be penalized. LINK, REL, SUMMARY, LONGDESC, ACCESSKEY and such are good.

I don't know what to say about whitespace. I guess that's another reason you should design your pages on a Mac or one of the children of Unix. They only use one character for end-of-line while NT uses two! Then again, when you're tweaking perl in the middle of the night, that whitespace can be awful helpful to separate code from markup. I guess whitespace should be ignored, some folks and some scripts need it.

I don't know what to suggest for tables. I guess you could penalize tables without TH or SUMMARY because these are layout tables or improperly designed data tables. Perhaps you should have a percentage for "improper use of tables" but that would probably tick people off. Layout tables, in this post-CSS world, are still a sore point between pragmatists and the revolutionary vanguard (Of which I am one but I don't preach too much. They'll come around in the end.).

Posted by anonymous on December 12, 2002, at 5:26 p.m.:

So why is this important?

Posted by Adrian on December 12, 2002, at 6:15 p.m.:

This was an important change because it altered the results of the GetContentSize tool.

Posted by Andrew Urquhart on December 12, 2002, at 10:41 p.m.:

Tool doesn't have a user-agent string

If you modify this tool to specify a user agent that contains a URL pointing directly to this tool, then webmasters will be able to notice your URL in their logs, find your tool, discover how "content-friendly" their pages are and possibly take remedial action. Adding a user-agent is also just generally a nice thing to do; that way your tool doesn't look like an email address harvester and so doesn't get blocked from some sites. It also appears in some statistics packages under its own category, rather than under "miscellaneous" or "unknown", for example.
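A minimal sketch of what that could look like, assuming a Python fetcher built on the standard urllib. The user-agent string and its URL are placeholders, not the tool's real values, and build_request is a hypothetical helper.

```python
from urllib.request import Request

# Placeholder identity: product name, version and URL are assumptions.
USER_AGENT = "GetContentSize/1.0 (+http://example.com/getcontentsize/)"


def build_request(url: str) -> Request:
    """Builds a request that identifies the tool in server logs."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

Passing the result to urllib.request.urlopen would then send the header with every fetch, so the tool's URL shows up in the target site's access log.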

Some Misgivings

A truer measure of content-friendliness would be the ratio of text to [image|video|audio|object|embed|etc]. The ratio of text to text+markup is not useful IMHO; it just penalises those that use mark-up. Screen readers require good markup to allow users to navigate around your pages, especially via heading tags (e.g. skip to the next <h2>). A page with well-structured markup is much more useful to a screen reader than a block of all of the raw text. Here's an idea: You could sum the content-size data from HEAD requests on linked non-text content (excluding stylesheets and scripts) to determine a more useful "text to other content" indicator.

Just my tuppence's worth.

Posted by kpaul on December 21, 2002, at 1:19 a.m.:

Just found another similar tool:

Webpage size checker

Not sure what they count/don't count, though.

Posted by gorka on February 26, 2003, at 7:25 p.m.:

What about a Java version of the algorithm? Thanks

gorkag@yahoo.es

Posted by Kynn on April 11, 2003, at 7:09 p.m.:

My sites appear to be "poorly graded" on this. One of the reasons seems to be that I include metadata, which is a Good Thing To Do.

I tend to use BLOCKQUOTE instead of P when marking up blocks of quotes. This likewise hurts my "score" because of the size of the tags themselves -- P is shorter than BLOCKQUOTE. Each blockquote generates an extra 25 characters which are not counted as "text".
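The arithmetic can be checked with a few lines of Python, assuming lowercase tags with no attributes. The 25 characters are the full length of a BLOCKQUOTE tag pair; compared with a P pair, the difference is 18.

```python
def tag_pair_length(name: str) -> int:
    """Characters used by an attribute-free opening and closing tag."""
    return len(f"<{name}>") + len(f"</{name}>")


blockquote_cost = tag_pair_length("blockquote")  # <blockquote></blockquote>
p_cost = tag_pair_length("p")                    # <p></p>
extra_over_p = blockquote_cost - p_cost
```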

The meta tag named "description" should be considered "text" by the rules stated here, since it IS shown to users often.

I am worried about increasing use of this particular metric for analyzing Web pages, because I feel it discourages many good practices (accessibility, comments in code, metadata, proper use of markup) which can result in those practices being discarded in order to gain some meaningless "high score" on this particular metric.

I feel this is a worthless metric, all things considered, when dealing with Web site analysis.

--Kynn

Posted by steph on May 21, 2003, at 6:17 p.m.:

Nice tool. However, is this program penalising cases where URLs are very long in anchor tags?

Posted by Adrian on May 25, 2003, at 9:41 p.m.:

Steph: It sure is. And that's why the results of this tool should be taken with a grain of salt. This tool should be used to get a *general idea* of how much code you're using, not as an "end all be all" sort of thing.

Posted by Colin on August 5, 2003, at 5:30 a.m.:

i think this is a neat tool, and in my own opinion, i don't think it "punishes" those that use more code (accessibility features and such need more code than not!) but it does make me look through and see if there is any way to make it more streamlined and sleeker coding... good job!

Posted by Jacques Distler on September 19, 2003, at 6 p.m.:

If you're using compression (mod_gzip or mod_deflate), then the relevant comparison is between the amount of "text content" and the number of compressed bytes sent down the wire. White spaces aren't penalized, because white spaces compress down to practically nothing. Ditto for semantically-correct tags, accesskeys, etc.

And if you are not using compression, then piddling around eliminating white spaces or HTML comments (which make your code more readable, and hence more maintainable) is a waste of time. Turning on compression will cut the size of your file by 60-80%. Nothing else you could do comes even close.
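The whitespace point is easy to try out. The sketch below uses Python's standard gzip module as a stand-in for what mod_gzip or mod_deflate would send; server settings differ, so the numbers are only indicative, and the function names are hypothetical.

```python
import gzip


def compressed_size(html: str) -> int:
    """Bytes on the wire if the document were sent gzip-compressed."""
    return len(gzip.compress(html.encode("utf-8")))


def savings_percent(html: str) -> float:
    """How much smaller the compressed document is, as a percentage."""
    raw = len(html.encode("utf-8"))
    if raw == 0:
        return 0.0
    return 100.0 * (raw - compressed_size(html)) / raw


page = "<p>hello world</p>"
padded = page + " " * 5000  # same page with 5,000 trailing spaces

# Extra compressed bytes caused by the 5,000 spaces.
whitespace_cost = compressed_size(padded) - compressed_size(page)
```

In this sketch, whitespace_cost comes out as a small fraction of the 5,000 raw characters added, which is the effect described above.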

Posted by Joshua Olson on April 30, 2004, at 7 p.m.:

Seems that some of the posters here are missing an important factor. The content/markup ratio can be used to determine how well search engines will index your content. The gzip compression used during transmission isn't important, and the linked-in CSS files aren't important. It's all about how much garbage (anything not content) is obscuring the message of the website when you look at the source code of the page -- similar to the way many search engines would have to look at the page.

Posted by alek on August 9, 2004, at 3:22 p.m.:

I liked it - simple (as you say) ... although one can spend a lotta time doing that last few tenths of a percent optimization and I'm not sure how truly worthwhile that would be.

Posted by JalanSutera on June 27, 2005, at 10:35 a.m.:

Thanks for a nice and simple tool. After using your service I conclude that I have used too many images in my site. I will reduce it. Thanks...