May 2, 2008, 1:36 AM ET
Request: Headless HTML rendering engine?
Warning: Seriously geeky request ahead!
I'm looking for a way to render arbitrary Web pages -- including CSS and JavaScript -- and access the resulting DOM tree programmatically, i.e., in an automated/headless fashion. I want to be able to ask the following questions of the resulting DOM tree:
- For a given element, what font family, size, and color is the text?
- How tall and wide (in pixels) is a given <div>, <table>, etc.?
- What are the x/y coordinates of a given element (from the upper-left corner of the page, or lower-left, or wherever)?
- For a given element, what is its text content?
The rendering must be state-of-the-art, handling the same advanced CSS that Firefox, Safari and IE handle. It should work on Linux. Bonus points if there's a Python API for this magical DOM tree.
This is all stuff that standard in-page JavaScript could accomplish, but the catch with me is that I need to be able to do it in a completely automated way, on arbitrary pages, on a headless server.
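To make the in-page version of those four questions concrete, here is a minimal TypeScript sketch. It is an illustration under stated assumptions, not the API of any particular engine: `describeElement`, `ElementReport` and the `*Like` interfaces are hypothetical names invented for this example. The only real browser calls it relies on are the standard `getComputedStyle`, `getBoundingClientRect`, `textContent`, and `scrollX`/`scrollY`.

```typescript
// Small structural types (hypothetical names) so the sketch also compiles
// outside a browser. In a real page, the built-in Element and window
// objects satisfy them.
interface StyleLike { fontFamily: string; fontSize: string; color: string; }
interface RectLike { left: number; top: number; width: number; height: number; }
interface ElementLike {
  textContent: string | null;
  getBoundingClientRect(): RectLike;
}
interface WindowLike {
  scrollX: number;
  scrollY: number;
  getComputedStyle(el: ElementLike): StyleLike;
}

// One record answering the four questions from the list above.
interface ElementReport {
  fontFamily: string;
  fontSize: string;
  color: string;
  width: number;   // rendered width in CSS pixels
  height: number;  // rendered height in CSS pixels
  x: number;       // offset from the left edge of the page
  y: number;       // offset from the top edge of the page
  text: string;
}

function describeElement(el: ElementLike, win: WindowLike): ElementReport {
  const style = win.getComputedStyle(el);   // styles after the CSS cascade
  const rect = el.getBoundingClientRect();  // box relative to the viewport
  return {
    fontFamily: style.fontFamily,
    fontSize: style.fontSize,
    color: style.color,
    width: rect.width,
    height: rect.height,
    // Adding the scroll offset converts viewport coordinates to page coordinates.
    x: rect.left + win.scrollX,
    y: rect.top + win.scrollY,
    text: el.textContent ?? "",
  };
}
```

In a page this would be called as `describeElement(someElement, window)`. The hard part of the request is untouched by the sketch: something still has to load the page, apply the CSS and run the JavaScript on a server with no display before these calls return meaningful values.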
I know Gecko and Webkit provide this, but I'm not sure where to start with them. The docs and articles I've read seem to be focused more on embedding the full browser window in a GUI application than embedding the rendering engine itself and manipulating the resulting pages.
Help! If you have any clues, I'd be grateful if you left a comment or got in touch with me.
April 4, 2008, 3:24 AM ET
Check out my Radiohead remix
Radiohead is holding a "contest" called Radiohead Remix, in which they're inviting fans to remix the song called "Nude" from their latest album. They've released the raw tracks -- separate, isolated audio clips of vocals, guitar, percussion, etc. -- and are encouraging people to remix the tracks to create something different, then upload it to radioheadremix.com. I put "contest" in quotes because there's no prize other than a guarantee that the band members will listen to your remix. But that's still kind of a cool prize.
I listened to a bunch of the submitted remixes on Wednesday and was kind of disappointed that none of them did anything interesting musically. Most retained the same techno/electronica feel as the original song, kept the song's melody intact and added a couple of drum beats. So tonight, I took a shot at making my own remix.
For context, I'd suggest listening to the original song first. You can find it on the "In Rainbows" album or listen to this remix to get an idea of the song's melody/mood.
My remix is called "Nude (jazzy acoustic)," and you can listen to it on the site.
It uses only Thom Yorke's vocal track from the recording, which means I was able to change the song's chords from the classic Radiohead melancholy to something a bit happier/jazzier. It has four guitar tracks -- two rhythm, one bass and one fingerstyle melody part for the extended fadeout. I cut up the vocal track in six places to fit the rhythm better, but the pieces are in the same order as on the original recording, including the extended wordless vocals at the end.
It's kind of soulful, in a weird falsetto-Eddie-Vedder-ish way, especially compared to the original recording (which is also soulful, but in a much different way!).
If you like it, please vote and tell all your friends to vote! I'd love for Radiohead to hear this. :-)
February 18, 2008, 7:28 PM ET
EveryBlock hiring a Python screen-scraping expert
Attention Python screen-scraping experts! We're looking to hire another full-time developer at EveryBlock. Our site, which just launched a few weeks ago, compiles a wealth of granular geographic data and publishes it on a block-by-block basis. We offer a distinct Web page (plus an RSS feed and e-mail alerts) for every city block in Chicago, New York and San Francisco. We're expanding to more cities and more data sources. And we have a ton of fun features and projects up our sleeves.
This position involves contributions to all of our site's technology and data, with a concentration on screen-scraping public data from government Web sites. Some specifics we're looking for are:
- Mastery of screen-scraping
- Experience programming in Python
- Experience with geographic data
Experience with Django is a nice-to-have.
For more on EveryBlock, check out our launch announcement and this recent interview.
This is an opportunity to work on an exciting and important project with a talented and experienced Web development team. We're currently only four people, so you'll have a lot of freedom and opportunities to make a difference.
This is a full-time, salaried position, on-location in our modest downtown Chicago office. We're a startup, funded by a grant, trying to make the world a better place. Please contact me if you're interested or have any questions. Tell me about the gnarliest site you've ever scraped.
February 15, 2008, 12:32 AM ET
A couple of EveryBlock interviews
Back in 2006, I had a very enjoyable interview with Robert Niles at Online Journalism Review. Now, Robert and I have gotten back together for another e-mail conversation about my latest project, EveryBlock: check it out.
And there's more! Earlier today, Rex Sorgatz published an interview with me about EveryBlock, with more of a technology focus.
Thanks to both Robert and Rex for the great questions.
Comments:
Posted by Sebastian on February 20 at 5:44 PM ET:
Really good interviews, Congrats!
You mentioned:
"The second layer is the data storage layer, which we built in a way that can handle an arbitrary number of data types, each with arbitrary attributes. For example, a restaurant inspection has a violation (or multiple violations), whereas a crime has a crime type (e.g., homicide)."
I have a question: Does this data storage layer work with django's ORM? How do you query this? I have been looking for ways to work with schema-less databases in django and I'm thinking what you have done might be it.
Post a comment:
Comments on this entry are closed.
Don't see any comments? That's because my Web hosting provider has made a server upgrade that broke the commenting feature on this site. I'm working to restore that; please check back later.
January 31, 2008, 1:02 AM ET
In memory of chicagocrime.org
It's with mixed feelings that I announce the end of one of my projects, chicagocrime.org. This site has been serving Chicago residents since May 2005. I hope you'll indulge me in a brief retrospective.
Chicagocrime.org was one of the original map mashups, combining crime data from the Chicago Police Department with Google Maps. It offered a page and RSS feed for every city block in Chicago and a multitude of ways to browse crime data by type, by location type (e.g., sidewalk or apartment), by ZIP code, by street/address, by date, and even by an arbitrary route. The New York Times Magazine featured it in its 2005 "Year in Ideas" issue, and it won the 2005 Batten Award for Innovations in Journalism.
It's been a fun ride. When I launched the site, Google Maps hadn't yet released the mapping API that's so common -- even passé -- today. I can't help but feel like an old-timer: "Back in my day, we had to reverse-engineer Google's obfuscated JavaScript just to get maps embedded on our own sites!" Now it seems like every other Web site finds an excuse to use those familiar, bubbly, yellow-white-blue-pastel map tiles.
Chicagocrime.org wasn't the first Google Maps mashup. That honor belongs to Paul Rademacher's HousingMaps, which, at that time, was modestly titled "Craigslist + Google Maps." The straightforwardness of that original title illustrates the excitement of it all: just the mere fact that somebody had mixed Craigslist data with Google's maps was new and remarkable. Kudos to Paul for keeping the site up and running for all these years. Not only was it a groundbreaking technical achievement; it remains genuinely useful.
A lot of good has come out of chicagocrime.org. At the local level, countless Chicago residents have contacted me to express their thanks for the public service. Community groups have brought print-outs of the site to their police-beat meetings, and passionate citizens have taken the site's reports to their aldermen to point out troublesome intersections where the city might consider installing brighter street lights.
It's done some good on a larger scale, too. The site helped influence Google to open up its mapping API for all to use. It inspired at least a dozen "spin-off" sites in other cities, from Berkeley to New Haven to Houston -- most of whose designs were very similar to Wilson's beautiful chicagocrime.org design. And the site's slashdotting forced me to write parts of Django's cache system. (Django itself was released open-source two months later; chicagocrime.org was the first public Django-powered site not run by the Lawrence Journal-World.)
A few weeks ago, I received an e-mail from the folks at Amazon EC2, where the crime site is hosted, saying the server instance that houses the site will be terminated on February 15 and that it will no longer be accessible after January 31. This is happening because I was an early user of EC2 and their network has gone through some changes that require all customers of a certain tenure to rebuild their servers. Instead of going through the hassle of upgrading my server instance, I'll let the Amazon staff shut it down on Thursday. All pages will redirect to the appropriate pages on my newest project, EveryBlock.
In many ways, EveryBlock is the next generation of chicagocrime.org. I've often described it to people as "chicagocrime.org on steroids -- more than just crime, and more than just Chicago." It's brought to you by the same people (Wilson and me from chicagocrime.org, plus Paul and Dan, who've worked on similar projects), and it has the same philosophies. As we developed EveryBlock, we kept chicagocrime.org firmly in our minds -- this new thing we were making had to be a superset, an expansion, a significant step forward. So there's almost nothing you could do on the old chicagocrime.org that you can't do on EveryBlock. And, unlike chicagocrime.org, which was always a side project, EveryBlock has a team of four people improving it full-time, meaning we have the resources to add features, such as e-mail alerts (just added yesterday), that chicagocrime.org never had. We hope EveryBlock is a worthy successor.
This story has a fitting epilogue. Just a few weeks after chicagocrime.org goes offline, the site will be featured in an exhibition at New York's Museum of Modern Art, called Design and the Elastic Mind. Chicagocrime.org will have ended its life and become a museum piece.
Comments:
Posted by Michal Migurski on February 18 at 9:21 PM ET:
Good luck, I'm curious who you'll find!
Posted by Jonas on February 19 at 7:11 AM ET:
Well, that pretty much describes me... what a bummer I'm halfway around the world in good ol' Europe. Best of luck with EveryBlock, it's a great idea.
Posted by Popcorn Mariachi on February 20 at 4:28 PM ET:
I'm with Jonas. This sounds like the perfect dream job, but I'm just too far away (Same country, at least)
Oh well.
Posted by Joshua Bloom on February 20 at 5:04 PM ET:
How about Beautiful Soup and a couple of programmer interns?