Recently in Tech Category

I recently spent a month using Yahoo instead of Google as my default search engine. (Incidentally, the "Google's better, because the Yahoo home page is too busy" argument is bunk--Yahoo's search page at http://search.yahoo.com/ is every bit as clean and simple as Google's.)

I was surprised at how decent Yahoo's search was. Though Google's ranking still seemed to make more intuitive sense, Yahoo did a reasonably good job throughout. Unfortunately, I never felt like Yahoo's search was actually doing a noticeably better job than Google's. It was just a series of minor disappointments when I noticed results that Google could have done better with.

Yahoo's typo detection is very poor, compared to Google's:

  • When I searched for "eage book mod_perl" (with "eagle" misspelled "eage"), Yahoo didn't catch my typo. Google caught the error, and gave me exactly what I wanted--information on Practical mod_perl, a.k.a. the "Eagle" book.
  • When I searched for "ighthouseapp", Yahoo didn't figure out that I wanted to find out about the Lighthouse product, located at www.lighthouseapp.com. Google did.
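
For a feel of what catching these two typos involves, here's a minimal "did you mean" sketch built on Python's standard difflib module. It is not how Google or Yahoo do it; the tiny vocabulary, the suggest function, and the 0.8 cutoff are all assumptions made up for illustration, and a real engine would lean on query logs and its index instead.

```python
import difflib

# Hypothetical vocabulary for illustration only; a real engine would
# draw on query logs or its own index rather than a hand-typed list.
VOCABULARY = ["eagle", "book", "mod_perl", "lighthouseapp", "puppet"]


def suggest(query, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the query with each unknown term swapped for its closest known term."""
    corrected = []
    for term in query.lower().split():
        if term in vocabulary:
            corrected.append(term)
            continue
        # get_close_matches ranks candidates by similarity ratio and
        # returns nothing when no candidate reaches the cutoff.
        matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else term)
    return " ".join(corrected)


print(suggest("eage book mod_perl"))
print(suggest("ighthouseapp"))
```

Terms with no close neighbor in the vocabulary pass through unchanged, so the sketch never "corrects" a word it knows nothing about.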

Some of the results were just bizarre, but have since been fixed by a reindex:

  • I searched for "yahoo india news" and the fourth result down was a nonexistent page on Yahoo's own site, leading to a 404.
  • For some reason, results from Target.com were coming up absurdly high for a wide variety of searches. It looks like this has been addressed now.

I was also surprised at how tech-centric Google is, vs. Yahoo:

  • When I searched for "puppet" on Yahoo, the first page of results referred to, well, puppets. Over on Google, the first two hits were for the open source system administration software called Puppet. (Of course, in this case, I really was looking for the Puppet software, so the point goes to Google, but the bias is interesting to note.)

My biggest practical irritation with Yahoo Search wasn't with the web search itself, but the lack of an integrated blog search. I frequently jump to Google's blog search when I'm trying to find out what people are saying about something. Not having that be just a click away changed the way I interacted with the web, very much for the worse.

The Yahoo challenge was fun, but it actually made me appreciate Google even more. Next up, I'm looking forward to trying Bing for a month. When doing side-by-side tests, it seems to return significantly more relevant results than Yahoo Search (which is likely one good reason for the recent deal). I may end up back at Google, but I want to know that I'm using it for the right reasons, and not just laziness-induced lock-in.


I'm a big fan of Arc90's Readability tool, which "makes reading on the Web more enjoyable by removing the clutter around what you're reading." It identifies the main body of the article or blog you're reading, re-presents it using an easy-to-read stylesheet, and hides everything else. It's a clever app, and I use it almost every day.

I needed to be able to pull out the main content of a web page for a personal project; it took me a few days till I realized that Readability does exactly that, and that Arc90 actually encourages ports to other platforms.

I just released HTML::ExtractMain, my Perl rewrite of Readability's content identification strategies. It's online at CPAN, and free to use under standard open source licenses. It's been a while since I released code as open source, and it feels good to be able to scratch my own itch while sharing code with other developers.
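
To give a rough idea of what "content identification" means here, below is a small Python sketch of one common heuristic: score each container by the paragraph text sitting directly inside it, and keep the winner. This is an assumption-laden toy, not Readability's actual algorithm and not the HTML::ExtractMain code; the class name, the list of container tags, and the comma bonus are all invented for illustration.

```python
from html.parser import HTMLParser

# Tags treated as candidate containers (an assumption for this sketch).
CONTAINERS = {"div", "article", "section", "td", "body"}


class MainContentFinder(HTMLParser):
    """Score each container by the paragraph text directly inside it."""

    def __init__(self):
        super().__init__()
        self.stack = []        # indexes of the currently open containers
        self.blocks = []       # one {"score", "text"} dict per container seen
        self.paragraph = None  # text chunks of the open <p>, or None

    def _flush(self):
        """Credit the finished paragraph to the innermost open container."""
        if self.paragraph is None:
            return
        text = " ".join("".join(self.paragraph).split())
        self.paragraph = None
        if text and self.stack:
            block = self.blocks[self.stack[-1]]
            # Longer, comma-rich paragraphs look more like article prose.
            block["score"] += len(text) + 10 * text.count(",")
            block["text"].append(text)

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self._flush()
            self.blocks.append({"score": 0, "text": []})
            self.stack.append(len(self.blocks) - 1)
        elif tag == "p":
            self._flush()
            self.paragraph = []

    def handle_endtag(self, tag):
        if tag in CONTAINERS:
            self._flush()
            if self.stack:
                self.stack.pop()
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self.paragraph is not None:
            self.paragraph.append(data)


def extract_main_text(html):
    """Return the paragraphs of the highest-scoring container, one per line."""
    finder = MainContentFinder()
    finder.feed(html)
    finder.close()
    if not finder.blocks:
        return ""
    best = max(finder.blocks, key=lambda block: block["score"])
    return "\n".join(best["text"])
```

Feeding it a page with a short navigation div, a long post div, and a footer div should return just the post's paragraphs, since the nav and footer paragraphs are too short to compete.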

There's a raging debate around climate change and intellectual property, and the planet's fate may be linked to the way we think about patent protectionism.

G77 countries at the Bonn climate conference have been demanding access to green technology intellectual property as a requirement for moving ahead. Industrialized countries have been stonewalling, arguing that greentech IP is private, and can't be shared. G77 countries have come back with a proposal where developed nations would pay into a pool that would buy access to greentech IP, to be shared with developing nations, which has been worrying US politicians advocating for stronger IP rights. This is one of several important threads involved in international climate negotiations, but it has been substantially underreported in the American IP reform community.

I'm particularly interested in India's comparison of greentech IP to essential HIV/AIDS drugs, framing their current demands in light of a widely understood battle over patent protection vs. humanitarian access. I'm hoping to see more folks pick up on this angle, and see where the comparison works, and where it doesn't.

I've been enjoying playing with GeoPlanet, a free RESTful geography data API provided by Yahoo's Geo Technologies group. GeoPlanet issues WOEIDs ("Where on Earth IDs"), unique identifiers for places, ranging from plazas up to neighborhoods, districts, cities, counties, states, and nations. WOEIDs are intelligently linked up, so you can programmatically navigate between parent, child, and adjoining places. It deals well with ambiguity, and it's really nicely internationalized, both in terms of input and output. You can feed it free text (e.g. "Berkeley, California") and get back a ranked series of best-matches, along with latitude/longitude, bounding box, parent areas, and human-readable names.

There's a great interview with one of GeoPlanet's developers at O'Reilly Radar:

"[U]sually...geography is handled as a purely spatial problem. What I mean by that is that things are handled in longitude and latitudes. And traditionally, if you have a place such as a city or town which is polygonal on the map, it's usually boiled down to a centroid, which again is a coordinate pair. And then all of the questions relate to the coordinate pair...instead of taking a spatially-based approach to location, we take a place-based approach...It could be a park. It could be a region like the Pacific Northwest. It could be a continent and even the earth is a named place....we take all of these different names places and all of these different granularities and we give them unique identifiers called Where On Earth ids or WOE ids for short...

So coming back to the point about really open location, one of the goals that we want here is that we want to be able to ensure that we can all...refer unambiguously to the same place no matter how it's called. So the United States is the United States or it's USA. Or it's Les Etats Unis. All of the different labels are assigned with the same Where On Earth identifier. And it's really exposing that identifier out is we think the prime benefit...We won't tell you everything about them or their census statistics or the population. It's really, "Here's the identifier. This is where it can be found. And this is how this identifier relates to other identifiers." (more...)
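
The place-based idea is easy to model in a few lines. Here's a minimal Python sketch of it: every place gets one identifier, any number of labels resolve to that identifier, and places link to their parents. It makes no calls to GeoPlanet, and the Place and Gazetteer classes and the ID numbers are invented for illustration (they are not real WOEIDs). It also skips the ranked, ambiguous matching the real service does.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Place:
    woeid: int                       # made-up identifier, not a real WOEID
    name: str
    parent: Optional["Place"] = None
    children: list = field(default_factory=list)


class Gazetteer:
    """Toy place registry: many labels, one identifier per place."""

    def __init__(self):
        self.by_id = {}
        self.by_label = {}

    def add(self, woeid, name, parent_id=None, aliases=()):
        place = Place(woeid, name, parent=self.by_id.get(parent_id))
        if place.parent is not None:
            place.parent.children.append(place)
        self.by_id[woeid] = place
        # Every label, in any language, points at the same Place object.
        for label in (name, *aliases):
            self.by_label[label.casefold()] = place
        return place

    def resolve(self, label):
        """Look up a place by any of its labels, ignoring case."""
        return self.by_label.get(label.casefold())

    def ancestors(self, place):
        """Walk the parent links from a place up to the root."""
        chain = []
        while place.parent is not None:
            place = place.parent
            chain.append(place)
        return chain


def build_demo():
    gazetteer = Gazetteer()
    gazetteer.add(1, "Earth")
    gazetteer.add(2, "United States", parent_id=1, aliases=("USA", "Les Etats Unis"))
    gazetteer.add(3, "California", parent_id=2)
    gazetteer.add(4, "Berkeley", parent_id=3)
    return gazetteer


demo = build_demo()
print(demo.resolve("Les Etats Unis").woeid)
print([p.name for p in demo.ancestors(demo.resolve("Berkeley"))])
```

The point of the design is that an app only ever stores the identifier; the labels and the parent chain can be looked up whenever they're needed.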

Yahoo's work around developing interesting open platforms is totally underhyped. I'm ready to consider locking myself into Yahoo's WOEID system in my own apps; it's rich and open enough for my needs.

I recently launched a new web tool called DesiFilter.

Like a lot of folks from immigrant communities, I tend to be hyper-aware of names from my culture. If I'm watching a movie, part of my brain goes "hey, wow!" when I see that the gaffer's backup caterer is named Banerjee or Patel or Khan.

DesiFilter sample results

South Asian American community journalists and bloggers will regularly do the same--scanning long lists of names to find community members involved in larger news stories. So I built a tool to help out, based on a list of over 26,000 uniquely South Asian first and last names I collected and hand-edited. (The word "Desi" is often used interchangeably with South Asian in diaspora.)

You just give DesiFilter a URL or a bunch of text, and it'll find and highlight possible South Asian names. Commercial name ethnicity matching tools have been around for a while, and are used for things like targeted marketing and political campaigning. I believe this is the first such tool that handles South Asian names that's freely available to the public.

It wasn't particularly hard to build; the tech side (powered by Perl's Regexp::Assemble) was a breeze compared to the difficult task of collecting and refining name lists. South Asian names come from all over, so I ended up making a lot of awkward decisions to maximize usability in majority-Anglo countries, including throwing out most Anglo and many Portuguese names common in South Asia to minimize false positives. This means, for example, that it'll fail to identify John Abraham as a South Asian name. Short of a hard-to-build-and-visualize system of weights, I can't think of a much better solution.
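
The matching step can be sketched in a few lines of Python. This is not DesiFilter's code: where Regexp::Assemble merges a long list of patterns into one optimized regex, the sketch below just joins the names into a plain alternation, which is a simpler stand-in. The three names come from the examples above; the function names and the [[ ]] highlight markers are made up for illustration.

```python
import re

# A tiny stand-in for the real hand-edited list of 26,000+ names.
NAMES = ["Banerjee", "Patel", "Khan"]


def build_matcher(names):
    """Compile one regex that matches any listed name as a whole word."""
    # Longest names first, so a longer name wins over a shorter one
    # that happens to be its prefix.
    alternatives = sorted((re.escape(name) for name in names), key=len, reverse=True)
    return re.compile(r"\b(?:%s)\b" % "|".join(alternatives))


def highlight(text, matcher, before="[[", after="]]"):
    """Wrap every matched name in the given markers."""
    return matcher.sub(lambda match: before + match.group(0) + after, text)


matcher = build_matcher(NAMES)
print(highlight("Gaffer: A. Patel; catering by R. Khan and J. Smith", matcher))
print(highlight("John Abraham", matcher))
```

The second call shows the trade-off described above: a name that was deliberately left off the list goes unhighlighted.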

DesiFilter got some big love on Sepia Mutiny. I'm currently working on some features to make it more useful to the folks over at the South Asian Journalists Association.

I really enjoyed listening to CBC Spark's short radio documentary piece on the proliferation of USB drive-borne computer viruses in Sierra Leone. It's a very short peek into everyday technology culture in a faraway place, and it reminded me both of my virus-laden early 1990s and of technology culture in India.

Listen to it online on the Spark episode page, or just grab the MP3; the Sierra Leone section is from 18:50 to 25:50. Associated photos of Sierra Leone computer spaces are online at Flickr.

I've been looking into moving a Movable Type instance onto a managed hosted blogging platform. The obvious next step up from Movable Type would be TypePad, the hosted blogging service operated by Six Apart, the makers of Movable Type. I really like MT and Six Apart, and would normally be happy to stick to their platform.

But TypePad's slower rate of development, nonfunctional multi-author import tool, and aging look and feel are worrisome, enough to make me seriously evaluate WordPress.com, the other best-known hosted blogging platform.

I want reasonably modern business-class weblog hosting, with multiple authors, a custom domain, pretty-ish URLs, no ads, developer flexibility, FeedBurner support, and business-style billing. Is that too much to ask for? I did some research, and here's what I found:

Feature | TypePad | WordPress.com
Cost | $150/year (Pro account) | $55/year (domain, CSS, no ads)
Platform stability | high | high
Usability | high | very high
Rate of active development | medium | high
Business-friendly invoicing, etc. | high | medium
Disk space | 1 GB | 3 GB
Bandwidth | 10 GB | unmetered
Canned themes | "hundreds" | 70+
Widgets | many, but limited growth | many
Category support | 1 level | multiple levels
Tag/keyword support | yes | yes
Clean URLs | medium | very
Spam blocking quality | high | high
# of blogs | unlimited | 1
Custom CSS | yes | yes
Custom HTML | yes | limited
Custom JavaScript | yes | no
FeedBurner support | yes | no (autodiscovery URLs fixed)
# of authors | unlimited? | unlimited?
# of administrators | 1 | unlimited?
Import multiple authors | no | ?
Edit posted author | no | yes

Some of the two platforms' gaps are puzzling. I can't comprehend why TypePad doesn't support multiple levels of categories, something Movable Type's supported for ages. And it's head-bangingly frustrating that WordPress.com just doesn't work with FeedBurner, because there's no way for administrators or widget authors to edit the feed autodiscovery URL.
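
For context, feed autodiscovery is just a link tag in the page's head that tells feed readers where the feed lives. Here's a minimal Python sketch of how a reader might find it, which shows why a fixed tag is a dealbreaker: if you can't point its href at your FeedBurner address, readers never see that address. The FeedLinkFinder class and the sample URL are hypothetical, for illustration only.

```python
from html.parser import HTMLParser

# MIME types that conventionally mark a feed in an autodiscovery link.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect the href of every feed autodiscovery <link> tag."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
            self.feeds.append(attr["href"])


def discover_feeds(html):
    """Return the feed URLs a reader would discover in this page."""
    finder = FeedLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.feeds


# Hypothetical page whose autodiscovery tag points at a FeedBurner URL.
page = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" '
    'href="http://feeds.feedburner.com/example">'
    '<link rel="stylesheet" href="style.css">'
    '</head><body></body></html>'
)
print(discover_feeds(page))
```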

In the end, I grudgingly picked TypePad because it gives users full HTML and JavaScript access, allowing first-class integration with third-party services like FeedBurner. (It's nice knowing you always have an escape hatch if the platform isn't giving you every service you need.)

It's frustrating having to make these choices. Why aren't there better options out there?

Scribd takedown notice excerpt

I'm used to hearing about people receiving DMCA takedown notices, a procedure in which a copyright owner tells a service provider that they're hosting infringing data of some type, and requests its removal or the disabling of access to it. Being a techie with an interest in fair use, I often side with reform-minded groups that focus on abuses of the system, where DMCA takedown notices are incorrectly targeted, or ignore fair use rights.

Given that stance, I was surprised to find myself sending a DMCA takedown notice earlier this week.

While looking at online document-sharing service Scribd, I found a copy of an article that I'd written several years ago. It was intact, and had my original copyright line on it, but the document was marked as being licensed under a Creative Commons license, when I'd never licensed it as such. The user, whose username made him or her difficult to identify and contact directly, had probably taken my article and uploaded it to Scribd, with a default Creative Commons license applied to all the content in the account.

As far as I could tell, the use was harmless, but I didn't like the fact that the article's licensing details were incorrect. What to do? I emailed Scribd's copyright contact, describing the situation, and explaining that I didn't have a problem with the document being on Scribd, but that it was being redistributed with incorrect licensing information. I wouldn't have had a problem if it were uploaded for personal use (sort of like saving a photocopy of an original article).

A real live human being from Scribd got back to me, suggesting they could act only if I sent a DMCA notice, and including a sample DMCA takedown notice form letter. I filled it in: name, address, URLs, and replied. Within minutes, the document was taken down. Gone.

And now I have regrets. Should I have demanded more forcefully to speak to the original user, who'd clearly found the article of interest, instead of working only through Scribd's copyright department? Should I have participated in the process at all? How to balance the needs of users and content providers?

Apple iMac Flat Panel

At work, I use a five-year-old Apple iMac G4, with 256 MB of RAM and OS X 10.3. It works great. We have a cultural bias against older computers, but there's so much pleasure in working with tools that you know inside and out, and that work just like you expect them to.

My work stack's pretty simple. I spend most of my days inside Terminal and Camino (usually RPMozley's optimized G4 builds, for maximum speed). I jump into Firefox while in design mode, TextWrangler while working with text files, GraphicConverter for very occasional image editing, and Psi for XMPP IM. In terms of productivity apps, I use Word and Excel 2004 most of the time, and jump into NeoOffice when dealing with the new 2008 XML Office file formats.

I'm still in love with the flat panel iMac hardware, featuring an elegant 180° swiveling screen I use all the time, turning it so that passers-by or folks on the other side of the office can see what I'm pointing at. It's a shame that Apple doesn't offer anything comparable now.

It would be really easy for me to upgrade, but why? It depresses the hell out of me when I see people I know stuck in eternal upgrade loops, throwing money at unused performance. Or even worse, unusable performance, when all the gains from an upgrade disappear under the weight of increasingly bulky and unusable software, leaving no meaningful boost in productivity.

As a computer professional, and a server-side guy, I find that creativity isn't in the expense or bling of your tools, but in how you use them. I'm lucky to be able to work in a space where I can choose good tools, and refine my understanding of them over time. (For that matter, it also helps that we use a lot of web-based and command-line tools, which don't require heavy or platform-specific clients.)

Your computer's probably faster than mine, but I probably like mine better.

Groan. As if JotSpot's integration (and death) within Google weren't bad enough, FeedBurner's integration into the Google system has been awkward as well.

They changed the hostname hosting feeds from "feeds.feedburner.com" to the ugly "feeds2.feedburner.com." Inelegant, but workable.

Subscriber tracking has been off, and it's been suggested that it may take a week for all subscribers to be counted again; strangely enough, it's Google Reader stats that are missing. Confusing, but tolerable. (Can't the two parts of the company talk to each other?)

But most annoyingly, FeedBurner's site analytics features have been removed, making it impossible to gauge combined feed and site aggregate traffic. This used to be one of FeedBurner's best features. I understand that there's likely an internal imperative to reduce duplication with Google Analytics, but the obvious solution would have been to offer some kind of linkage between the FeedBurner and Analytics products.

As it stands now, users have to log into one system, then log into another, and manually make a best-guess calculation of reach. This is a major step backwards. I realize both FeedBurner and Google Analytics are free services, but I've grown to expect a much better end-user experience from these teams.