Tagging Text Automatically
Classification matters: as users, we do better when a well-structured classification scheme helps us find the information we're looking for. In the web 1.0 world, that played out as a hierarchy imposed by site designers (typified by the eternal sitemap) that the audience was forced to work with. Web 2.0 saw a move away from this predefined architecture and towards audience-defined taxonomies (sometimes called folksonomies) built on tags - end users associate tags with pieces of content, and use various mechanisms to navigate between similarly-tagged items.
Tagging is a great strategy in certain circumstances, but it has a few important drawbacks. The one we'll talk about here is the blank-slate problem: if you're relying entirely upon the audience to generate your tags, then new content in a system suffers an inherent disadvantage. When a visitor comes to the site, they'll explore the existing tag architecture, but they won't find the new content (since it hasn't been tagged yet). They may still be able to find it via some other mechanism (search, for example), but unless they then tag it, the content will stay buried. It's a rich-get-richer situation - well-tagged content will be found and tagged more often, while under-tagged content will not be found and will remain under-tagged.
The obvious solution to this problem is to start all new content off with some starting tags - but that raises the question of where those tags come from. You could have the content creator enter them, and for small amounts of content with a distinct creator that's fine. But what if you have a system that has massive amounts of content entered at one time? Or a system in which content is generated automatically (through feeds, for instance)? The ideal solution here would be to automatically extract tags from the text itself - and as luck would have it, it is entirely possible to do just that.
It turns out that several web services will take in your content and spit back keywords. Two of the best known are Tagthe.net (TTN) and Yahoo's Term Extractor (YTE). Both are free, and both work reasonably well. At Viget, we tend to use YTE, since we find it produces slightly more relevant results and can return multi-word tags. It's the one we'll implement in this example, though a TTN implementation would be very similar.
Implementation: The Basics
The first step is to get an application ID from Yahoo! - it's essentially an API key that identifies your application. You can get one here (you may have to log in or register for an account with Yahoo!).
Once you've got your application ID, it's as easy as POSTing your content to the Term Extractor URL (at http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction). You'll get an XML response with your tags in a ResultSet (each tag is a Result).
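To make the request and response shapes concrete, here is a minimal offline sketch. It assumes the service takes two form fields, appid and context (the names used in the full example), and it parses a simplified, made-up response rather than a real one; the placeholder key and the sample tags are illustrative only.

```ruby
require 'uri'
require 'rexml/document'

# Form-encode the two fields sent in the POST body.
# 'YOUR_APP_ID' is a placeholder, not a real application ID.
body = URI.encode_www_form(
  'appid'   => 'YOUR_APP_ID',
  'context' => 'Italian sculptors and painters'
)

# Hypothetical, simplified response: a ResultSet holding one Result per tag.
# A real response may carry extra attributes that are omitted here.
sample_xml = <<~XML
  <ResultSet>
    <Result>italian sculptors</Result>
    <Result>painters</Result>
  </ResultSet>
XML

# Collect the text of every Result directly under the root element.
doc  = REXML::Document.new(sample_xml)
tags = doc.elements.to_a('*/Result').map(&:text)

puts body
puts tags.inspect
```

Nothing here touches the network, so it is a safe way to check the parsing step before wiring in a real application ID.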
Implementation: The Ruby Way
Here's the full example from a working Rails application. We drop this code into a TagExtractor class and call it with TagExtractor.extract(text).
require 'net/http'
require 'rexml/document'
require 'uri'

class TagExtractor
  APP_KEY = 'key'
  API_SITE_URL = 'api.search.yahoo.com'
  API_PAGE_URL = '/ContentAnalysisService/V1/termExtraction'

  # public wrapper for the retrieve and parse process
  def self.extract(text)
    options = { 'context' => text }
    tag_xml = retrieve(options)
    parse(tag_xml)
  end

  # pass the content to YTE for term extraction
  def self.retrieve(options)
    options['appid'] = APP_KEY
    res = nil
    Net::HTTP.start(API_SITE_URL) do |http|
      req = Net::HTTP::Post.new(API_PAGE_URL)
      req.form_data = options
      res = http.request(req)
    end
    res.body
  end

  # parse the XML returned from YTE into an array of tags
  def self.parse(xml)
    tags = Array.new
    doc = REXML::Document.new(xml)
    doc.elements.each("*/Result") do |result|
      tags << result.text
    end
    tags
  end

  # a bare `private` has no effect on `def self.` methods, so hide the helpers explicitly
  private_class_method :retrieve, :parse
end
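One thing the class above does not do is guard against a bad response. Since the whole point is to seed new content with tags automatically, a failed or truncated reply probably shouldn't stop the content from being saved. The following is a rough sketch of one way to handle that; safe_tags is a hypothetical helper, not part of the original class, and the sample strings are invented.

```ruby
require 'rexml/document'

# Hypothetical helper: turn a response body into an array of tags,
# falling back to an empty array when the body is not well-formed XML.
def safe_tags(xml)
  doc = REXML::Document.new(xml.to_s)
  doc.elements.to_a('*/Result').map(&:text)
rescue REXML::ParseException
  []
end

# Invented sample bodies: one well-formed, one cut off mid-element.
good = '<ResultSet><Result>folksonomy</Result><Result>tag cloud</Result></ResultSet>'
bad  = '<ResultSet><Result>truncated'

good_tags = safe_tags(good)
bad_tags  = safe_tags(bad)

puts good_tags.inspect
puts bad_tags.inspect
```

With this approach, content that can't be tagged automatically simply starts with no tags, which is no worse than the situation we started with.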