Readability and HTML::ExtractMain

I'm a big fan of Arc90's Readability tool, which "makes reading on the Web more enjoyable by removing the clutter around what you're reading." It identifies the main body of the article or blog you're reading, re-presents it using an easy-to-read stylesheet, and hides everything else. It's a clever app, and I use it almost every day.

I needed to pull out the main content of a web page for a personal project. It took me a few days to realize that Readability does exactly that, and that Arc90 actually encourages ports to other platforms.

I just released HTML::ExtractMain, my Perl rewrite of Readability's content identification strategies. It's available on CPAN, and free to use under standard open source licenses. It's been a while since I released code as open source, and it feels good to be able to scratch my own itch while sharing code with other developers.

2 Comments

Funny: I know someone at Arc90, and I'm curious whether you find Leonard's Beautiful Soup tool useful for similar purposes.

The Arc90 code basically implements a series of rules indicating what part of an HTML document is probably interesting (e.g. if a block has ID "footer", you can probably ignore it).

The original is built in JavaScript and uses the browser's DOM methods. Nirmal Patel's Python port (http://nirmalpatel.com/fcgi/hn.py) uses Beautiful Soup as its parser. My Perl port uses HTML::TreeBuilder as its parser.
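To make the "series of rules" idea concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not Readability's actual algorithm, nor how HTML::ExtractMain works internally: the `UNLIKELY` hint list, the `MainContentFinder` class, and the `extract_main_text` function are all hypothetical names made up for illustration. It assumes reasonably well-formed markup, skips containers whose `id` or `class` looks like page chrome, and returns the paragraph text of the largest remaining container.

```python
import re
from html.parser import HTMLParser

# Hypothetical hint list for illustration; not Readability's actual rule set.
UNLIKELY = re.compile(r"footer|sidebar|nav|comment|header|advert", re.I)
CONTAINERS = {"div", "article", "section", "td", "body"}


class MainContentFinder(HTMLParser):
    """Collect paragraph text per container, skipping unlikely containers."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open containers: dicts with 'skip' and 'text'
        self.candidates = []   # text of closed containers that were not skipped
        self.in_paragraph = 0  # depth of currently open <p> tags

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph += 1
        elif tag in CONTAINERS:
            attr = dict(attrs)
            label = f"{attr.get('id') or ''} {attr.get('class') or ''}"
            # A container nested inside a skipped container is skipped too.
            parent_skipped = bool(self.stack) and self.stack[-1]["skip"]
            self.stack.append({
                "skip": parent_skipped or bool(UNLIKELY.search(label)),
                "text": [],
            })

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = max(0, self.in_paragraph - 1)
        elif tag in CONTAINERS and self.stack:
            # Assumes containers are properly nested and closed.
            block = self.stack.pop()
            if not block["skip"] and block["text"]:
                self.candidates.append(" ".join(block["text"]))

    def handle_data(self, data):
        text = data.strip()
        # Only text inside a <p> counts, credited to the innermost container.
        if text and self.in_paragraph and self.stack:
            self.stack[-1]["text"].append(text)


def extract_main_text(html):
    """Return the paragraph text of the highest-scoring (longest) container."""
    finder = MainContentFinder()
    finder.feed(html)
    finder.close()
    return max(finder.candidates, key=len, default="")


SAMPLE = """
<html><body>
<div id="header"><p>Site name and navigation links</p></div>
<div id="content"><p>First paragraph of the article.</p>
<p>Second paragraph with more detail.</p></div>
<div id="footer"><p>Copyright notice that is fairly long, longer than the
article itself, to show it is still ignored.</p></div>
</body></html>
"""

print(extract_main_text(SAMPLE))
```

On the sample page, the `header` and `footer` blocks are dropped because of their IDs, even though the footer holds more text than the article, and the text of the `content` block is what gets printed. A real port would likely combine many more signals than text length and a single name pattern.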

About

Anirvan Chatterjee is a San Francisco Bay Area tech geek and bibliophile.

About this Entry

This page contains a single entry by Anirvan Chatterjee published on August 1, 2009 7:05 PM.
