Readability and HTML::ExtractMain

I'm a big fan of Arc90's Readability tool, which "makes reading on the Web more enjoyable by removing the clutter around what you're reading." It identifies the main body of the article or blog you're reading, re-presents it using an easy-to-read stylesheet, and hides everything else. It's a clever app, and I use it almost every day.

I needed to pull out the main content of a web page for a personal project. It took me a few days to realize that Readability does exactly that, and that Arc90 actually encourages ports to other platforms.

I just released HTML::ExtractMain, my Perl rewrite of Readability's content identification strategies. It's available on CPAN and free to use under standard open source licenses. It's been a while since I released code as open source, and it feels good to scratch my own itch while sharing code with other developers.

3 Comments

Funny: I know someone at Arc90, and I'm curious whether you find Leonard's Beautiful Soup tool useful for similar purposes.

The Arc90 code basically implements a series of rules indicating what part of an HTML document is probably interesting (e.g. if a block has ID "footer", you can probably ignore it).

The original is built in JavaScript and uses the browser's DOM methods. Nirmal Patel's Python port (http://nirmalpatel.com/fcgi/hn.py) uses Beautiful Soup as its parser. My Perl port uses HTML::TreeBuilder as its parser.
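To make the "ignore the footer" idea concrete, here is a minimal sketch in Python using only the standard library's html.parser. It is not the Readability or HTML::ExtractMain rule set: the UNLIKELY pattern, the MainTextParser class, and the extract_main_text function are hypothetical names invented for illustration, and the sketch assumes reasonably well-formed HTML in which every non-void tag is closed. It only drops text inside elements whose id or class looks like page chrome, whereas the real tools also score the remaining candidate blocks.

```python
import re
from html.parser import HTMLParser

# Hypothetical "probably not content" patterns, for illustration only.
UNLIKELY = re.compile(r"footer|sidebar|comment|nav", re.I)
# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Tags whose text is never article content.
NON_CONTENT_TAGS = {"script", "style"}


class MainTextParser(HTMLParser):
    """Collects text that is not inside an 'unlikely' element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one bool per open element: True if it is skipped
        self.chunks = []  # text fragments that survived the filter

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # Combine id and class into one string to test against the pattern.
        label = " ".join(filter(None, (attrs.get("id"), attrs.get("class"))))
        parent_skipped = bool(self.stack) and self.stack[-1]
        skipped = (parent_skipped
                   or tag in NON_CONTENT_TAGS
                   or bool(UNLIKELY.search(label)))
        self.stack.append(skipped)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        inside_skipped = bool(self.stack) and self.stack[-1]
        text = data.strip()
        if text and not inside_skipped:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the page text with unlikely blocks removed."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Given a page with one div whose id is "content" and another whose id is "footer", extract_main_text keeps only the text from the first. A simple name-based filter like this is easy to fool, which is why a page with unusual markup can yield the wrong block.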

Hey Chatterjee,
Thanks for your awesome work in porting the Readability algorithm to Perl.
I have just one observation.

Your module seems to miss the main post on the Google blogs, while Arc90's Readability 'bookmark' captures it fine.

url = http://googleblog.blogspot.com/2010/03/introducing-google-ad-innovations.html

captured post=
Sign up to get our posts via email. No more than one message per day.Delivered by FeedBurner

It seems to affect only the Google tech blogs.

-Billa
