Tech: April 2009 Archives

I've been enjoying playing with GeoPlanet, a free RESTful geography data API provided by Yahoo's Geo Technologies group. GeoPlanet issues WOEIDs ("Where on Earth IDs"), unique identifiers for places ranging from plazas up to neighborhoods, districts, cities, counties, states, and nations. WOEIDs are intelligently linked up, so you can programmatically navigate between parent, child, and adjoining places. It deals well with ambiguity, and it's really nicely internationalized, both in terms of input and output. You can feed it free text (e.g. "Berkeley, California") and get back a ranked series of best matches, along with latitude/longitude, bounding box, parent areas, and human-readable names.
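To make the "linked up" idea concrete, here is a minimal in-memory sketch in Python of places connected by parent, child, and adjoining links, with several labels resolving to one identifier. It is not the GeoPlanet API: the `Place` and `Gazetteer` classes, the tiny data set, and the ID values (1 through 4) are all made up for illustration, and the hierarchy is simplified (no county level).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Place:
    place_id: int                      # stand-in for a WOEID; values here are invented
    name: str
    place_type: str
    parent_id: Optional[int] = None    # link to the enclosing place, if any
    neighbor_ids: Set[int] = field(default_factory=set)   # adjoining places
    aliases: Set[str] = field(default_factory=set)        # other labels for the same place


class Gazetteer:
    """Hypothetical toy index of places, keyed by ID and by label."""

    def __init__(self) -> None:
        self._places: Dict[int, Place] = {}
        self._by_label: Dict[str, List[int]] = {}

    def add(self, place: Place) -> None:
        self._places[place.place_id] = place
        # Every label (primary name or alias) maps back to the same ID.
        for label in {place.name, *place.aliases}:
            self._by_label.setdefault(label.casefold(), []).append(place.place_id)

    def lookup(self, text: str) -> List[Place]:
        ids = self._by_label.get(text.strip().casefold(), [])
        return [self._places[i] for i in ids]

    def ancestors(self, place_id: int) -> List[Place]:
        # Walk parent links upward until a place with no parent is reached.
        chain = []
        current = self._places[place_id].parent_id
        while current is not None:
            chain.append(self._places[current])
            current = self._places[current].parent_id
        return chain

    def children(self, place_id: int) -> List[Place]:
        return [p for p in self._places.values() if p.parent_id == place_id]

    def neighbors(self, place_id: int) -> List[Place]:
        return [self._places[i] for i in sorted(self._places[place_id].neighbor_ids)]


gaz = Gazetteer()
gaz.add(Place(1, "United States", "country", aliases={"USA", "Les Etats Unis"}))
gaz.add(Place(2, "California", "state", parent_id=1))
gaz.add(Place(3, "Berkeley", "city", parent_id=2, neighbor_ids={4}))
gaz.add(Place(4, "Oakland", "city", parent_id=2, neighbor_ids={3}))

print([p.name for p in gaz.ancestors(3)])
print(gaz.lookup("usa")[0].place_id == gaz.lookup("Les Etats Unis")[0].place_id)
```

The point of the sketch is that an app stores only the identifier and asks the index for everything else: names, parents, and neighbors.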

There's a great interview with one of GeoPlanet's developers at O'Reilly Radar:

"[U]sually...geography is handled as a purely spatial problem. What I mean by that is that things are handled in longitude and latitudes. And traditionally, if you have a place such as a city or town which is polygonal on the map, it's usually boiled down to a centroid, which again is a coordinate pair. And then all of the questions relate to the coordinate pair...instead of taking a spatially-based approach to location, we take a place-based approach...It could be a park. It could be a region like the Pacific Northwest. It could be a continent and even the earth is a named place....we take all of these different named places and all of these different granularities and we give them unique identifiers called Where On Earth ids or WOE ids for short...

So coming back to the point about really open location, one of the goals that we want here is that we want to be able to ensure that we can all...refer unambiguously to the same place no matter how it's called. So the United States is the United States or it's USA. Or it's Les Etats Unis. All of the different labels are assigned with the same Where On Earth identifier. And it's really exposing that identifier out is we think the prime benefit...We won't tell you everything about them or their census statistics or the population. It's really, "Here's the identifier. This is where it can be found. And this is how this identifier relates to other identifiers." (more...)

Yahoo's work around developing interesting open platforms is totally underhyped. I'm ready to consider locking myself into Yahoo's WOEID system in my own apps; it's rich and open enough for my needs.

I recently launched a new web tool called DesiFilter.

Like a lot of folks from immigrant communities, I tend to be hyper-aware of names from my culture. If I'm watching a movie, part of my brain goes "hey, wow!" when I see that the gaffer's backup caterer is named Banerjee or Patel or Khan.

DesiFilter sample results

South Asian American community journalists and bloggers will regularly do the same--scanning long lists of names to find community members involved in larger news stories. So I built a tool to help out, based on a list of over 26,000 uniquely South Asian first and last names I collected and hand-edited. (The word "Desi" is often used interchangeably with South Asian in diaspora.)

You just give DesiFilter a URL or a bunch of text, and it'll find and highlight possible South Asian names. Commercial name ethnicity matching tools have been around for a while, and are used for things like targeted marketing and political campaigning. I believe this is the first freely available public tool of its kind that handles South Asian names.

It wasn't particularly hard to build; the tech side (powered by Perl's Regexp::Assemble) was a breeze compared to the difficult task of collecting and refining name lists. South Asian names come from all over, so I ended up making a lot of awkward decisions to maximize usability in majority-Anglo countries, including throwing out most Anglo and many Portuguese names common in South Asia to minimize false positives. This means, for example, that it'll fail to identify John Abraham as a South Asian name. Short of a hard-to-build-and-visualize system of weights, I can't think of a much better solution.
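The matching step can be sketched roughly as follows. Regexp::Assemble combines many patterns into a single regular expression; this Python version uses a plain alternation from the standard `re` module as a stand-in, without that module's optimizations. The four-name list, the function names, and the `[[...]]` highlight markers are assumptions for illustration, not DesiFilter's actual code or data.

```python
import re

# Tiny illustrative list; the real tool relies on a hand-edited list of 26,000+ names.
NAMES = ["Banerjee", "Patel", "Khan", "Chatterjee"]


def build_name_pattern(names):
    """Compile one case-insensitive pattern that matches any listed name as a whole word."""
    # Escape each name so punctuation in a name is treated literally,
    # and try longer names first so they win over shorter ones.
    alternatives = sorted((re.escape(n) for n in names), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def highlight(text, pattern, start="[[", end="]]"):
    """Wrap every matched name in start/end markers, leaving other text untouched."""
    return pattern.sub(lambda m: start + m.group(0) + end, text)


NAME_PATTERN = build_name_pattern(NAMES)

credits = "Catering by R. Banerjee and J. Smith; gaffer: A. Khan"
print(highlight(credits, NAME_PATTERN))
# prints: Catering by R. [[Banerjee]] and J. Smith; gaffer: A. [[Khan]]
```

The word boundaries keep "Khan" from matching inside a longer word such as "Khanna". Any name left off the list is simply never matched, which is the false-negative trade-off described above.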

DesiFilter got some big love on Sepia Mutiny. I'm currently working on some features to make it more useful to the folks over at the South Asian Journalists Association.

About

Anirvan Chatterjee is a San Francisco Bay Area tech geek and bibliophile.

About this Archive

This page is an archive of entries in the Tech category from April 2009.
