I've just released templatemaker, which is something I've been hacking on and off (mostly off) the past couple of months. It's a Python library for extracting data from similarly formatted text strings.
What the heck does that mean?
Well, say you want to get the raw data from a bunch of Web pages that use the same template -- like restaurant reviews on Yelp.com, for instance. You can give templatemaker an arbitrary number of HTML files, and it will create the "template" that was used to create those files. ("Template," in this case, means a string with a number of "holes" in it, where the holes represent the parts of the page that change.) Once you've got the template, you can then give it any HTML file that uses that same template, and it will give you the raw data: "The value for hole 1 is 'July 6, 2007', the value for hole 2 is 'blue'," etc.
I searched but couldn't find anything else that did this. I heard from a few folks that Google and Yahoo have internal tools that do this sort of thing, probably in a much more robust fashion, but I'm unaware of any open-source code that does this. (Disclaimer: I have not checked CPAN. CPAN probably has five implementations of this functionality.)
The library uses a longest-common substring algorithm, which is implemented in C, via Python's C interface, for performance. (My original implementation was in Python, and it was noticeably slow in that area.) This means you need to compile it in order to use it, but it has no dependencies, so it should be quick and easy.
I'm interested in seeing what uses people have for this, and I'm also planning on adding a ton more features. And I'd love it if a competent C programmer could take a look at the C bits to make sure everything is kosher -- it's been a while since I've written C code. :-)
Let me know if you have any use for this, and may the suggestions and patches flow freely!