Recently in Tech Category

Remembering Aaron Swartz


Good grief. Aaron Swartz is dead. As a fellow startup founder driven by social justice principles, I looked to Aaron both as an example and as a peer.

We were in touch off and on for several years. Though we'd emailed before that, I first met the Internet hero and boy genius in person around 2007, when we got together to talk about book data, digital libraries, open data, and spidering/scraping strategies. He was incredibly smart and principled, and I was struck by his ability to dive so deeply into new areas.

When he sold Reddit, I gave him advice based on what I'd learned selling my own startup.

And when I was about to leave, I very explicitly used Aaron as a model of what it could look like for a startup founder to do tech-driven social justice work after exiting. (I wasn't nearly as successful at it as him.)

Aaron worked on so many things I use regularly.

Aaron was a hero of the open net. We're all poorer for his absence.

An Online Guide to "Anokha: Soundz of the Asian Underground"

While migrating some old web content, I came across a fan page I'd created in 1997 for the album Anokha: Soundz of the Asian Underground. It felt like an incredibly important compilation at the time, and helped break the Asian Underground and new British Asian electronic music scenes. Back then, I'd been hunting for reviews to get a better sense of the context, and decided to put up a fan page about the album, linking to off-site reviews.

I just came back to my Online Guide to Anokha website after twelve years, and for nostalgia's sake, decided to look at some of the reviews.

Click. 404. Click. 404. Click. 404.

All twenty-four outgoing links had died. In many cases, the entire website had disappeared. Tim Berners-Lee famously said "cool URIs don't change." Apparently nobody had informed the rest of us.

I suffered a moment of panic, realizing I could never go back to recapture the context of the document. And then, a sigh. There was always the Internet Archive. I loaded up the page in the Internet Archive.

Click. Yes. Click. Yes. Click. Yes.

The Internet Archive pulled up seventeen of the twenty-four links. A few had issues, including reliance on dead video plugin formats. One now-dead site had blocked crawlers, preventing the Internet Archive from storing any record of its existence. But by and large, it was all there. I breathed a sigh of relief.

The Internet Archive still feels like a massively unsung hero of the net, single-handedly staving off the forces of amnesia-by-linkrot for those who choose to embrace the open web. It gives me pause to think about how much less we'll be archiving as we spend more time in walled gardens like Facebook, or with hard-to-index dynamic/AJAX content.
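The load-the-dead-link-in-the-Archive routine described above can be scripted against the Wayback Machine's availability API. Here's a minimal Python sketch; the endpoint and JSON response shape reflect my reading of the Archive's public API docs, and error handling is omitted:

```python
import json
import urllib.parse

WAYBACK_API = "https://archive.org/wayback/available"

def availability_url(url, timestamp=None):
    """Build a Wayback Machine availability-API query for a (possibly dead) link."""
    params = {"url": url}
    if timestamp:
        # YYYYMMDD: ask for the snapshot closest to this date
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urllib.parse.urlencode(params)

def closest_snapshot(response_text):
    """Pull the closest archived snapshot URL out of an API response, if any."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

Fetching `availability_url("http://example.com/review.html", "19980101")` with any HTTP client and passing the body to `closest_snapshot` yields either an archived copy's URL or None, which is exactly the click-yes / click-404 distinction above.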

Below, for your linkrot-loving enjoyment, are the broken links from 1998; you can also see the page with archived links at the Internet Archive.



About Talvin Singh

I recently spent a month using Yahoo instead of Google as my default search engine. (Incidentally, the "Google's better, because the Yahoo home page is too busy" argument is bunk--Yahoo's search page is every bit as clean and simple as Google's.)

I was surprised at how decent Yahoo's search was. Though Google's ranking still seemed to make more intuitive sense, Yahoo did a reasonably good job throughout. Unfortunately, I never felt like Yahoo's search was doing a noticeably better job than Google's--just a series of minor disappointments when I noticed results that Google could have handled better.

Yahoo's typo detection is very poor, compared to Google's:

  • When I searched for "eage book mod_perl" (with "eagle" misspelled as "eage"), Yahoo didn't catch my typo. Google caught the error, and gave me exactly what I wanted--information on Practical mod_perl, a.k.a. the "Eagle" book.
  • When I searched for "ighthouseapp", Yahoo didn't figure out that I wanted to find out about the Lighthouse product; Google did.

Some of the results were just bizarre, but have since been fixed by a reindex:

  • I searched for "yahoo india news", and the fourth result on the search results page was a nonexistent page on Yahoo's own site, leading to a 404.
  • For some reason, results from one particular site were coming up absurdly high for a wide variety of searches. It looks like this has been addressed now.

I was also surprised at how tech-centric Google is compared to Yahoo:

  • When I searched for "puppet" on Yahoo, the first page of results referred to, well, puppets. Over on Google, the first two hits were for the open source system administration software called Puppet. (Of course, in this case, I really was looking for the Puppet software, so the point goes to Google, but the bias is interesting to note.)

My biggest practical irritation with Yahoo Search wasn't with the web search itself, but the lack of an integrated blog search. I frequently jump to Google's blog search when I'm trying to find out what people are saying about something. Not having that be just a click away changed the way I interacted with the web, very much for the worse.

The Yahoo challenge was fun, but it actually made me appreciate Google even more. Next up, I'm looking forward to trying Bing for a month. When doing side-by-side tests, it seems to return significantly more relevant results than Yahoo Search (which is likely one good reason for the recent deal). I may end up back at Google, but I want to know that I'm using it for the right reasons, and not just laziness-induced lock-in.

I'm a big fan of Arc90's Readability tool, which "makes reading on the Web more enjoyable by removing the clutter around what you're reading." It identifies the main body of the article or blog you're reading, re-presents it using an easy-to-read stylesheet, and hides everything else. It's a clever app, and I use it almost every day.

I needed to be able to pull out the main content of a web page for a personal project; it took me a few days till I realized that Readability does exactly that, and that Arc90 actually encourages ports to other platforms.

I just released HTML::ExtractMain, my Perl rewrite of Readability's content identification strategies. It's online at CPAN, and free to use under standard open source licenses. It's been a while since I released code as open source, and it feels good to be able to scratch my own itch while sharing code with other developers.
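HTML::ExtractMain itself is Perl and lives on CPAN. Purely as an illustration of the general Readability-style heuristic--not the module's actual algorithm--here's a minimal Python sketch that scores candidate blocks by text length and link density:

```python
import re

def extract_main(html):
    """Readability-style heuristic sketch: pick the block of markup with the
    most plain text and the lowest link density. Illustrative only; real
    implementations (Readability, HTML::ExtractMain) use a proper HTML parser
    and more careful scoring."""
    best_block, best_score = None, 0.0
    # Treat each <div>...</div> as a candidate content block (no nesting handled).
    for block in re.findall(r"<div\b[^>]*>(.*?)</div>", html, re.S | re.I):
        text = re.sub(r"<[^>]+>", "", block)  # strip tags, keep visible text
        link_text = "".join(re.findall(r"<a\b[^>]*>(.*?)</a>", block, re.S | re.I))
        link_density = len(link_text) / max(len(text), 1)
        # Long runs of text with few links look like article body, not chrome.
        score = len(text.strip()) * (1.0 - link_density)
        if score > best_score:
            best_block, best_score = block, score
    return best_block
```

Given a page with a link-heavy navigation block and a text-heavy article block, this picks the article; sidebars and footers lose because most of their characters sit inside anchor tags.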

There's a raging debate around climate change and intellectual property, and the planet's fate may be linked to the way we think about patent protectionism.

G77 countries at the Bonn climate conference have been demanding access to green technology intellectual property as a requirement for moving ahead. Industrialized countries have been stonewalling, arguing that greentech IP is private and can't be shared. G77 countries have come back with a proposal in which developed nations would pay into a pool that would buy access to greentech IP, to be shared with developing nations--a proposal that has been worrying US politicians advocating for stronger IP rights. This is one of several important threads in international climate negotiations, but it has been substantially underreported in the American IP reform community.

I'm particularly interested in India's comparison of greentech IP to essential HIV/AIDS drugs, framing their current demands in light of a widely understood battle over patent protection vs. humanitarian access. I'm hoping to see more folks pick up on this angle, and see where the comparison works, and where it doesn't.

I've been enjoying playing with GeoPlanet, a free RESTful geography data API provided by Yahoo's Geo Technologies group. GeoPlanet issues WOEIDs ("Where on Earth IDs"), unique identifiers for places ranging from plazas up to neighborhoods, districts, cities, counties, states, and nations. WOEIDs are intelligently linked, so you can programmatically navigate between parent, child, and adjoining places. It deals well with ambiguity, and it's really nicely internationalized, both in terms of input and output. You can feed it free text (e.g. "Berkeley, California") and get back a ranked series of best matches, along with latitude/longitude, bounding box, parent areas, and human-readable names.

There's a great interview with one of GeoPlanet's developers at O'Reilly Radar:

"[U]sually...geography is handled as a purely spatial problem. What I mean by that is that things are handled in longitude and latitudes. And traditionally, if you have a place such as a city or town which is polygonal on the map, it's usually boiled down to a centroid, which again is a coordinate pair. And then all of the questions relate to the coordinate pair...instead of taking a spatially-based approach to location, we take a place-based approach...It could be a park. It could be a region like the Pacific Northwest. It could be a continent and even the earth is a named place....we take all of these different named places and all of these different granularities and we give them unique identifiers called Where On Earth ids or WOE ids for short...

So coming back to the point about really open location, one of the goals that we want here is that we want to be able to ensure that we can all...refer unambiguously to the same place no matter how it's called. So the United States is the United States or it's USA. Or it's Les Etats Unis. All of the different labels are assigned with the same Where On Earth identifier. And really exposing that identifier out is, we think, the prime benefit...We won't tell you everything about them or their census statistics or the population. It's really, "Here's the identifier. This is where it can be found. And this is how this identifier relates to other identifiers."

Yahoo's work around developing interesting open platforms is totally underhyped. I'm ready to consider locking myself into Yahoo's WOEID system in my own apps; it's rich and open enough for my needs.
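For a sense of how simple the REST interface is, here's a hedged Python sketch that builds GeoPlanet request URLs. The endpoint shape, the `places.q('...')` query form, and the `appid` parameter are my reading of the v1 docs--treat them as assumptions rather than gospel:

```python
from urllib.parse import quote

# Base URL as described in Yahoo's GeoPlanet v1 documentation (assumption).
GEOPLANET_BASE = "http://where.yahooapis.com/v1"

def places_url(query, appid, fmt="json"):
    """Free-text place lookup: returns a URL whose response is a ranked list
    of matching places, each with a WOEID, centroid, bounding box, and parents."""
    return f"{GEOPLANET_BASE}/places.q('{quote(query)}')?appid={appid}&format={fmt}"

def woeid_url(woeid, appid, fmt="json"):
    """Look up a single place by its WOEID; sibling endpoints like
    /place/{woeid}/parent navigate the place hierarchy."""
    return f"{GEOPLANET_BASE}/place/{woeid}?appid={appid}&format={fmt}"
```

Fetching `places_url("Berkeley, California", YOUR_APPID)` with any HTTP client gives back the ranked best-matches described above; from each result's WOEID you can walk to parent, child, and adjoining places.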

I recently launched a new web tool called DesiFilter.

Like a lot of folks from immigrant communities, I tend to be hyper-aware of names from my culture. If I'm watching a movie, part of my brain goes "hey, wow!" when I see that the gaffer's backup caterer is named Banerjee or Patel or Khan.

DesiFilter sample results

South Asian American community journalists and bloggers will regularly do the same--scanning long lists of names to find community members involved in larger news stories. So I built a tool to help out, based on a list of over 26,000 uniquely South Asian first and last names I collected and hand-edited. (The word "Desi" is often used interchangeably with South Asian in diaspora.)

You just give DesiFilter a URL or a bunch of text, and it'll find and highlight possible South Asian names. Commercial name ethnicity matching tools have been around for a while, and are used for things like targeted marketing and political campaigning. I believe this is the first such tool that handles South Asian names that's freely available to the public.

It wasn't particularly hard to build; the tech side (powered by Perl's Regexp::Assemble) was a breeze compared to the difficult task of collecting and refining name lists. South Asian names come from all over, so I ended up making a lot of awkward decisions to maximize usability in majority-Anglo countries, including throwing out most Anglo and many Portuguese names common in South Asia to minimize false positives. This means, for example, that it'll fail to identify John Abraham as a South Asian name. Short of a hard-to-build-and-visualize system of weights, I can't think of a much better solution.
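For illustration, the find-names-in-text approach can be sketched in a few lines of Python. Regexp::Assemble is Perl and builds a more compact trie-style pattern; this is just an approximation of the same idea, with a toy name list standing in for the real 26,000-entry one:

```python
import re

def build_name_matcher(names):
    """Compile a name list into one scanning regex, in the spirit of Perl's
    Regexp::Assemble: word-boundary anchored, case-insensitive alternation.
    Python's re picks the leftmost alternative, so we sort longest-first
    to approximate longest-match behavior."""
    ordered = sorted(set(names), key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

def highlight_names(text, matcher):
    """Wrap each matched name in *asterisks*, DesiFilter-style."""
    return matcher.sub(lambda m: "*" + m.group(0) + "*", text)
```

So `highlight_names("Key grip: A. Patel; gaffer: J. Smith", build_name_matcher(["Banerjee", "Patel", "Khan"]))` marks up only Patel. As the post says, the regex part is the easy bit; the curation of the name list is where all the real work is.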

DesiFilter got some big love on Sepia Mutiny. I'm currently working on some features to make it more useful to the folks over at the South Asian Journalists Association.

I really enjoyed listening to CBC Spark's short radio documentary piece on the proliferation of USB drive-borne computer viruses in Sierra Leone. It's a very short peek into everyday technology culture in a far away place, and reminded me both of my virus-laden early 1990s, as well as technology culture in India.

Listen to it online on the Spark episode page, or just grab the MP3; the Sierra Leone section is from 18:50 to 25:50. Associated photos of Sierra Leone computer spaces are online at Flickr.

I've been looking into moving a Movable Type instance onto a managed hosted blogging platform. The obvious next step up from Movable Type would be TypePad, the hosted blogging service operated by Six Apart, the makers of Movable Type. I really like MT and Six Apart, and would normally be happy to stick to their platform.

But TypePad's slower rate of development, nonfunctional multi-author import tool, and aging look and feel are worrisome--enough to make me seriously evaluate WordPress.com, the other best-known hosted blogging platform.

I want reasonable modern business-class weblog hosting, with multiple authors, a custom domain, pretty-ish URLs, no ads, developer flexibility, FeedBurner support, and business-style billing. Is that too much to ask for? I did some research, and here's what I found:

Feature                           | TypePad                   | WordPress.com
Cost                              | $150/year (Pro account)   | $55/year (domain, CSS, no ads)
Platform stability                | high                      | high
Usability                         | high                      | very high
Rate of active development        | medium                    | high
Business-friendly invoicing, etc. | high                      | medium
Disk space                        | 1 GB                      | 3 GB
Bandwidth                         | 10 GB                     | unmetered
Canned themes                     | "hundreds"                | 70+
Widgets                           | many, but limited growth  | many
Category support                  | 1 level                   | multiple levels
Tag/keyword support               | ?                         | ?
Clean URLs                        | medium                    | very
Spam blocking quality             | high                      | high
# of blogs                        | unlimited                 | 1
Custom CSS                        | yes                       | yes
Custom HTML                       | yes                       | limited
Custom JavaScript                 | yes                       | no
FeedBurner support                | yes                       | no (autodiscovery URLs fixed)
# of authors                      | unlimited?                | unlimited?
# of administrators               | 1                         | unlimited?
Import multiple authors           | no                        | ?
Edit posted author                | no                        | yes

Some of the two platforms' gaps are puzzling. I can't comprehend why TypePad doesn't support multiple levels of categories, something Movable Type has supported for ages. And it's head-bangingly frustrating that WordPress.com just doesn't work with FeedBurner, because there's no way for administrators or widget authors to edit the feed autodiscovery URL.

In the end, I grudgingly ended up picking TypePad, because it gives users full HTML and JavaScript access, allowing first-class integration with third-party services like FeedBurner. (It's nice knowing you always have an escape hatch if the platform isn't giving you every service you need.)

It's frustrating having to make these choices. Why aren't there better options out there?

Scribd takedown notice excerpt

I'm used to hearing about people receiving DMCA takedown notices, a procedure in which a copyright owner tells a service provider that they're hosting infringing data of some type, and requesting removal or the disabling of access. Being a techie with an interest in fair use, I often side with reform-minded groups that focus on abuses of the system, where DMCA takedown notices are incorrectly targeted, or ignore fair use rights.

Given that stance, I was surprised to find myself sending a DMCA takedown notice earlier this week.

While looking at the online document-sharing service Scribd, I found a copy of an article I'd written several years ago. It was intact, and had my original copyright line on it, but the document was marked as licensed under a Creative Commons license, which I'd never applied. The user, whose username made him or her difficult to identify and contact directly, had probably taken my article, uploaded it to Scribd, and applied a default Creative Commons license covering all the content in the account.

As far as I could tell, the use was harmless, but I didn't like the fact that the article's licensing details were incorrect. What to do? I emailed Scribd's copyright contact, describing the situation, and explaining that I didn't have a problem with the document being on Scribd, but that it was being redistributed with incorrect licensing information. I wouldn't have had a problem if it were uploaded for personal use (sort of like saving a photocopy of an original article).

A real live human being from Scribd got back to me, suggesting they could act only if I sent a DMCA notice, and including a sample DMCA takedown notice form letter. I filled it in: name, address, URLs, and replied. Within minutes, the document was taken down. Gone.

And now I have regrets. Should I have demanded more forcefully to speak to the original user, who'd clearly found the article of interest, instead of working only through Scribd's copyright department? Should I have participated in the process at all? How to balance the needs of users and content providers?