Microformats could describe online news intelligently

Written by Adrian Holovaty on March 18, 2005

A lot of the buzz at South by Southwest was about the concept of microformats, which are lightweight, informal standards for adding metadata to Web pages by using existing XHTML elements. Tantek Çelik and Eric Meyer both spoke enthusiastically about the idea, and I'm grateful I had a chance to speak with both personally.

A good example is XFN, a way of identifying human relationships within a Web page's code by putting a rel attribute on <a> tags. (I've written about XFN previously.) For instance, I'm friends with Simon Willison, so I put rel="friend" in the link to his Web site, and services such as rubhub do cool things with the aggregated data.

XFN works because it's easy for humans to understand, easy to implement, uses existing infrastructure (XHTML) and solves a small, specific problem.

A few other microformats have been invented so far. Some examples:

hCalendar -- A way to designate calendar information within a Web page
hCard -- A way to designate contact information, as in a vCard
VoteLinks -- A way to designate whether you agree or disagree with something you're linking to

Why spend the time adding that metadata to your Web pages? Because it makes it easy for automated tools to aggregate information, and it creates a bunch of interesting possibilities.

I love the idea of microformats, and, as I'm involved in the online-news industry, I'm naturally interested in their possible applications in a journalism context. Here are a few ideas I've been bouncing around; I'd love to see what people think.

Background-story relationships

News sites -- well, decent ones, anyway -- often link stories to previous coverage on the same issue. But there's no reliable way to automate the detection of previous coverage. I was thinking a rel="background-story" attribute for link tags could work.

A rel="background-story" could be used on internal or external links -- in the latter case, it'd be used if a newspaper is following up on the work of another news outlet.

I brought this up with Tantek at South by Southwest, and he suggested that <link rel="prev"> might be a better way of marking it up. The problem I see with that is that it implies a linear "previous" relationship, whereas a news story generally has more than one background story. It's true that some series of news stories are linear, but it's probably not a good idea to pigeonhole all news stories in this way. Journalism describes life, and life isn't linear.

(That said, <link rel="prev"> could, and probably should, be used for multi-part news stories, such as those in a series.)

Possible applications: Automated visualization tools that create news "trees"; much more intelligent news automation by aggregators such as Google News and Topix.net.
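To make the "tree" idea concrete, here is a rough Python sketch of what such a tool might do. It assumes the hypothetical rel="background-story" value proposed above, and it works on a small in-memory dictionary of made-up pages (PAGES) rather than fetching anything from the Web; the names RelLinkParser, background_links and build_tree are mine, not part of any existing tool.

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Collect href values of <a>/<link> tags whose rel contains a given token."""

    def __init__(self, rel_token):
        super().__init__()
        self.rel_token = rel_token
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel holds space-separated tokens, so split before comparing.
        tokens = (attr_map.get("rel") or "").lower().split()
        if self.rel_token in tokens and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def background_links(html):
    """Return the URLs a page marks as background stories."""
    parser = RelLinkParser("background-story")
    parser.feed(html)
    return parser.hrefs


def build_tree(url, pages, seen=None):
    """Return nested dicts of background stories, skipping pages already visited."""
    seen = set() if seen is None else seen
    if url in seen:
        return {}
    seen.add(url)
    return {child: build_tree(child, pages, seen)
            for child in background_links(pages.get(url, ""))}


# Made-up pages for illustration only.
PAGES = {
    "/news/budget-vote": (
        '<p>The council <a rel="background-story" href="/news/budget-proposal">'
        'proposed the budget</a> after <a rel="background-story" '
        'href="/news/audit">an audit</a>.</p>'
    ),
    "/news/budget-proposal": '<a rel="background-story" href="/news/audit">audit</a>',
    "/news/audit": "<p>No earlier coverage.</p>",
}

tree = build_tree("/news/budget-vote", PAGES)
```

Note that one story points to two background stories here, which is exactly the non-linear case that a single "prev" link can't express.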

Story reaction relationships

It'd be beneficial to designate relationships between a news story and opinion pieces commenting on that news story. The news story itself could use <link rel="opinion-reaction"> to point to the opinion page, and the opinion page could use <link rev="opinion-reaction"> to point back to the facts. (rev is the reverse of rel: it describes the current page's relationship to the linked page.)

Possible applications: If news aggregators used this, they'd be less likely to publish "opinion" content as news, which would solve a significant problem many journalists have with automated news sites such as Google News.
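As a sketch of how an aggregator might use this, the Python below treats any page carrying a rev="opinion-reaction" link as opinion and leaves it out of the news list. The attribute value is the hypothetical one proposed above, the sample pages are invented, and the function names (opinion_targets, news_only) are assumptions for illustration, not anything Google News or Topix.net actually does.

```python
from html.parser import HTMLParser


class RevParser(HTMLParser):
    """Collect the stories a page says it is an opinion reaction to."""

    def __init__(self):
        super().__init__()
        self.reacts_to = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rev, like rel, is a space-separated list of tokens.
        tokens = (attr_map.get("rev") or "").lower().split()
        if "opinion-reaction" in tokens and attr_map.get("href"):
            self.reacts_to.append(attr_map["href"])


def opinion_targets(html):
    """Return the news stories this page reacts to (empty list if none)."""
    parser = RevParser()
    parser.feed(html)
    return parser.reacts_to


def news_only(pages):
    """Keep only the URLs of pages that do not declare themselves opinion."""
    return [url for url, html in pages.items() if not opinion_targets(html)]


# Invented pages: one news story and one column reacting to it.
PAGES = {
    "/news/stadium-vote": (
        '<head><link rel="opinion-reaction" href="/opinion/stadium-column" /></head>'
    ),
    "/opinion/stadium-column": (
        '<head><link rev="opinion-reaction" href="/news/stadium-vote" /></head>'
    ),
}

filtered = news_only(PAGES)
```

The filter relies only on what the opinion page says about itself, so it still works when the news story was published before any reaction existed.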

Reporter relationships

It's rather surprising to think about, but there's no good way to designate the reporter(s) of a news story in a machine-readable format. Yes, there's <meta name="Author">, but that could as easily be set to "Chicago Tribune" as to "Mike Royko." (In a newspaper, who's the "author"? The reporter? The publication? The editors/publishers?)

I propose rel="reporter", an attribute that would be placed on the link to the reporter's bio page -- on news sites that give their reporters bio pages, of course.

Possible applications: Web-wide aggregation of content reported by a particular author. More intelligent parsing of reporter information by news aggregators.
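Here is a minimal Python sketch of that aggregation, assuming the hypothetical rel="reporter" value above and using the bio-page URL as the reporter's identifier. The example.com pages and bylines are invented, and stories_by_reporter is just an illustrative name.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ReporterParser(HTMLParser):
    """Collect bio-page URLs from links marked rel="reporter"."""

    def __init__(self):
        super().__init__()
        self.bios = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        tokens = (attr_map.get("rel") or "").lower().split()
        if "reporter" in tokens and attr_map.get("href"):
            self.bios.append(attr_map["href"])


def stories_by_reporter(pages):
    """Map each reporter bio URL to the story URLs that credit it."""
    index = defaultdict(list)
    for url, html in pages.items():
        parser = ReporterParser()
        parser.feed(html)
        for bio in parser.bios:
            index[bio].append(url)
    return dict(index)


# Invented stories; the third byline has no rel attribute, so it is ignored.
PAGES = {
    "https://example.com/news/1": (
        '<p>By <a rel="reporter" href="https://example.com/staff/jdoe">Jane Doe</a>'
        ' and <a rel="reporter" href="https://example.com/staff/rroe">Richard Roe</a></p>'
    ),
    "https://example.com/news/2": (
        '<p>By <a rel="reporter" href="https://example.com/staff/jdoe">Jane Doe</a></p>'
    ),
    "https://example.com/news/3": (
        '<p>By <a href="https://example.com/staff/rroe">Richard Roe</a></p>'
    ),
}

index = stories_by_reporter(PAGES)
```

Keying on the bio URL rather than the printed name sidesteps the problem of two reporters who share a name, though it does depend on each site keeping those URLs stable.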

Factual relationships

This one gets a little abstract. It'd be fascinating to mark up every standalone "fact" in a news article and link it to a page of "proof" or "support." The link would carry rel="proof", and automated agents could traverse the Web to build a gigantic "proof structure," tracing this fact back to an earlier fact, which would in turn be traced back to a previous fact. It's obvious this metadata would be incredibly expensive to maintain. But maybe news organizations should start thinking about creating infrastructure to gather this type of data.
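For the curious, here is a very rough Python sketch of the traversal, assuming the hypothetical rel="proof" value and a handful of invented pages held in memory. It follows only the first unvisited proof link from each page, which is a simplification of my own; a real agent would need to handle branching, fetching and trust.

```python
from html.parser import HTMLParser


class ProofParser(HTMLParser):
    """Collect URLs from links marked rel="proof"."""

    def __init__(self):
        super().__init__()
        self.proofs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        tokens = (attr_map.get("rel") or "").lower().split()
        if "proof" in tokens and attr_map.get("href"):
            self.proofs.append(attr_map["href"])


def proof_links(html):
    parser = ProofParser()
    parser.feed(html)
    return parser.proofs


def proof_chain(url, pages):
    """Trace a fact back through its proofs, stopping at a dead end or a loop."""
    chain = [url]
    seen = {url}
    while True:
        candidates = proof_links(pages.get(chain[-1], ""))
        next_url = next((c for c in candidates if c not in seen), None)
        if next_url is None:
            return chain
        chain.append(next_url)
        seen.add(next_url)


# Invented pages; the last one points back to the first to show loop handling.
PAGES = {
    "/news/crime-down": '<a rel="proof" href="/data/police-report">Crime fell 5 percent</a>',
    "/data/police-report": '<a rel="proof" href="/data/raw-incidents">incident totals</a>',
    "/data/raw-incidents": '<a rel="proof" href="/news/crime-down">as reported</a>',
}

chain = proof_chain("/news/crime-down", PAGES)
```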

Comments

Posted by Mark Hamilton on March 19, 2005, at 12:38 a.m.:

I was reading along, finding it all very interesting, thinking it sounded like a lot of work for (perhaps) small gain, until I hit "rel=proof" which really blew my mind. There's a revolution in that formula, if you can find a way to make it work on both sides: ease of use by the writer/publisher and ease-of-use of the "proof structure" by the reader.

Of course, then we have to get into the whole nasty argument about what's a fact. :-)

Posted by Már Örlygsson on March 20, 2005, at 1:03 a.m.:

This just dawned on me: Microformats are *the* way to boot-strap the SemWeb from the grassroots without any RDF overhead, chickens, eggs, etc.

Posted by Eric Meyer on March 26, 2005, at 10:43 p.m.:

I'm with Mark: if you can make the "proof" thing work, you're talking an enormous revolution in the making. Arguments about what's a fact and what isn't could actually be programmatically evaluated, with enough "proof" information. Something that gets very few "proof" links would be treated as less credible than something with a lot of "proof" links, perhaps. (Yes, I know that exposes the whole system to being gamed, but it's a start.)

Már: Yes, yes, a thousand times yes!!! You're exactly on the mark. That's why we often refer to them as being a foundation for a "lowercase semantic web"-- that is, something far less formal (and far more accessible) than the Semantic Web efforts.

Posted by brian on March 27, 2005, at 8:37 p.m.:

If you are looking for a well-used and well-defined system for references and citations, you can look to the BibTeX system for LaTeX. It can easily be converted into an XHTML representation. This doesn't show a relationship between articles, but it could be used and/or extended to show things like author, or proof where it points to a journal reference, etc.

A simple example from this BibTeX site:

http://www.ecst.csuchico.edu/~jacobsd/bib/formats/bibtex.html

@article{Gettys90,
  author = {Jim Gettys and Phil Karlton and Scott McGregor},
  title = {The {X} Window System, Version 11},
  journal = {Software Practice and Experience},
  volume = {20},
  number = {S2},
  year = {1990},
  abstract = {A technical overview of the X11 functionality. This is an update
    of the X10 TOG paper by Scheifler \& Gettys.}
}

<div class="article">
<div class="author">Jim Gettys and Phil Karlton and Scott McGregor</div>
<div class="title">The X Window System, Version 11</div>
<div class="journal">Software Practice and Experience</div>
<div class="abstract">A technical overview of the X11 functionality. This is an update of the X10 TOG paper by Scheifler &amp; Gettys.</div>
</div>
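A conversion like the one above could be sketched in a few lines of Python. This is only an illustration under some assumptions: the entry is already parsed into a dictionary (no real BibTeX parsing happens here), the class names simply reuse the BibTeX field names, and entry_to_xhtml is a made-up function name.

```python
from html import escape


def entry_to_xhtml(entry_type, fields):
    """Render already-parsed BibTeX fields as nested <div> elements."""
    lines = ['<div class="%s">' % escape(entry_type, quote=True)]
    for name, value in fields.items():
        # escape() turns &, < and > into entities so the output stays valid XHTML.
        lines.append('<div class="%s">%s</div>' % (escape(name, quote=True), escape(value)))
    lines.append("</div>")
    return "\n".join(lines)


# Fields copied by hand from the entry above, with BibTeX braces and \& removed.
FIELDS = {
    "author": "Jim Gettys and Phil Karlton and Scott McGregor",
    "title": "The X Window System, Version 11",
    "journal": "Software Practice and Experience",
    "abstract": (
        "A technical overview of the X11 functionality. This is an update "
        "of the X10 TOG paper by Scheifler & Gettys."
    ),
}

markup = entry_to_xhtml("article", FIELDS)
```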

Posted by Tim on March 29, 2005, at 2:58 a.m.:

I am extremely new to this, but what I see in the work I am doing on structured tagging of content in XML, and even in attempts at deciphering free-form tagging of user-submitted photos and other content, reminds me of writing a research paper in high school before the Internet was around. I used the library card catalog and the Dewey Decimal System to find information and put it on bibliography cards. It was the same info: author, subject, title, summary, description, date, other works, etc., all related to books and periodicals that had information on one search term. Each printed item could be found in a sea of tomes by the decimal number on the spine. Libraries took the time to catalog all the search data, like that used in microformatting. I agree time should be taken to catalog Web page content; Web pages are just small electronic periodicals and books anyway. I may be off some on what you guys are saying, but the data my papers input each day is lost once it is printed on a newspaper, and it should not be. It should be stored for later retrieval, as it will be relevant to searches like those in your PROOF tag.

Posted by Creford on August 27, 2005, at 4:50 p.m.:

Thanks for your article and links! This enabled me to understand the fresh concept "Microformats" in microformats.org.