
Tagging Text Automatically

Ben Scofield, Development Director, November 21, 2006

Classifications are important; as users, we do better when a well-structured classification scheme helps us find the information we’re looking for. In the web 1.0 world, that played out in a hierarchy imposed by site designers (typified by the eternal sitemap) that the audience was forced to work with. Web 2.0 saw a move away from this predefined architecture and towards audience-defined taxonomies (sometimes called folksonomies) built on tags: end users associate tags with pieces of content and use various mechanisms to navigate between similarly tagged items.

The Problem

Tagging is a great strategy in certain circumstances, but it has a few important drawbacks. The one we’ll talk about here is the blank-state problem: if you’re relying entirely upon the audience to generate your tags, then new content in a system suffers an inherent disadvantage. When a visitor browses the site, they’ll explore the existing tag architecture, but they won’t find the new content (since it hasn’t been tagged yet). They may still be able to find it via some other mechanism (search, for example), but unless they then tag it, the content will stay buried. It’s a rich-get-richer situation: well-tagged content will be found and tagged more often, while under-tagged content will not be found and will remain under-tagged.

The Solution

The obvious solution to this problem is to seed all new content with a few starting tags, but that raises the question of where those tags come from. You could have the content creator enter them, and for small amounts of content with a distinct creator that's fine. But what if you have a system that has massive amounts of content entered at one time? Or a system in which content is generated automatically (through feeds, for instance)? The ideal solution here would be to extract tags automatically from the text itself, and as luck would have it, it is entirely possible to do just that.

It turns out that there are multiple web services out there that will take in your content and spit back keywords. The two best known of these are Tagthe.net (TTN) and Yahoo's Term Extractor (YTE). They're both free, and they both work reasonably well. At Viget, we tend to use YTE: we find it produces slightly more relevant results, and it can return multi-word tags. We'll be implementing it in this example, though a TTN example is very similar.

Implementation: The Basics

The first step is to get an application ID from Yahoo! - it's essentially an API key that identifies your application. You can request one from Yahoo! (you may have to log in or register for an account first).

Once you've got your application ID, it's as easy as POSTing your content to the Term Extractor URL (at http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction). The request carries two form parameters: appid (your application ID) and context (the text to analyze). You'll get an XML response with your tags in a ResultSet (each tag is a Result).
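To make the response handling concrete, here is a minimal sketch of pulling tags out of a response shaped as described above. The sample XML is hypothetical (a real response may carry namespaces and attributes not shown here), and tags_from is an illustrative helper name, not part of any Yahoo! library.

```ruby
require 'rexml/document'

# Hypothetical response body: a ResultSet whose Result children
# each hold one tag. Invented for illustration only.
SAMPLE_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <ResultSet>
    <Result>italian sculptors</Result>
    <Result>the virgin mary</Result>
  </ResultSet>
XML

# Collect the text of every Result element under the ResultSet root.
def tags_from(xml)
  doc = REXML::Document.new(xml)
  tags = []
  doc.elements.each('ResultSet/Result') { |result| tags << result.text }
  tags
end

p tags_from(SAMPLE_XML)
```

An empty ResultSet simply yields an empty array, so callers do not need a special case for text that produces no tags.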

Implementation: The Ruby Way

Here's the full example from a working Rails application. We drop this code into a TagExtractor class and call it with TagExtractor.extract(text).

require 'net/http'
require 'rexml/document'
require 'uri'

class TagExtractor
  APP_KEY = 'key' # replace with your Yahoo! application ID
  API_SITE_URL = 'api.search.yahoo.com'
  API_PAGE_URL = '/ContentAnalysisService/V1/termExtraction'

  # public wrapper for the retrieve and parse process
  def self.extract(text)
    tag_xml = retrieve('context' => text)
    parse(tag_xml)
  end

  # pass the content to YTE for term extraction
  def self.retrieve(options)
    options['appid'] = APP_KEY
    res = nil

    Net::HTTP.start(API_SITE_URL) do |http|
      req = Net::HTTP::Post.new(API_PAGE_URL)
      req.form_data = options
      res = http.request(req)
    end

    res.body
  end

  # parse the XML returned from YTE into an array of tags
  def self.parse(xml)
    tags = Array.new
    doc = REXML::Document.new(xml)
    doc.elements.each("*/Result") do |result|
      tags << result.text
    end
    tags
  end

  # a bare `private` has no effect on methods defined with `def self.`,
  # so the helpers are hidden explicitly
  private_class_method :retrieve, :parse
end
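
Once the extractor returns raw terms, they still need a little cleanup before they can serve as starting tags. The following is a sketch of one way to do that, not code from the original application: seed_tags, its normalization rules, and the default limit are all assumptions, and CannedExtractor is a stand-in for TagExtractor so that the example runs without a network call.

```ruby
# Hypothetical helper: turns raw extractor output into starting tags.
# The name, the normalization rules, and the default limit are
# assumptions for illustration.
def seed_tags(text, extractor:, limit: 5)
  extractor.extract(text)
           .compact                          # an empty Result element has nil text
           .map { |tag| tag.strip.downcase } # normalize whitespace and case
           .reject(&:empty?)
           .uniq
           .first(limit)
end

# Stand-in with the same interface as TagExtractor.extract; it returns
# canned terms instead of calling the web service.
class CannedExtractor
  def self.extract(_text)
    ['Ruby on Rails', ' web services ', 'ruby on rails', nil, '']
  end
end

p seed_tags('any text', extractor: CannedExtractor)
```

In an application, TagExtractor itself would be passed as the extractor; keeping it a parameter also makes it easy to swap in a canned version for tests.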

Comments

DEkart said on 11/25 at 03:00 AM

Looks very interesting, but it seems that it does not work with non-english text.

Ben Scofield - Sr. Developer said on 11/27 at 08:22 AM

Yes - unfortunately, Yahoo’s offering only works on English content at the moment. Tagthe.net is a little better (it handled Italian, Spanish, and French when I tested it recently), but it still doesn’t seem to work on other widespread languages (including Russian, Chinese, and Japanese). I’m not aware of a solution for those languages at the moment, but I’m sure people are working on them.

Michael @ SEOG said on 11/27 at 11:08 AM

I’m not sure if keyword extraction really gives the full benefit of tagging. Just parsing existing keywords is not the same as someone tagging it to categories that make sense. There are many pieces of information that may not have a certain word in them but should still be tagged with it. I always thought the advantage of tagging was the “meta-data” that is placed on the article that helps other people to discover it and place it in context.

Ben Scofield - Sr. Developer said on 11/27 at 02:17 PM

I agree with you, Michael - human-mediated tagging is far superior to this method. Keyword extraction of this sort is not without merit, however; most importantly, it generates a usable first draft of metadata (which is especially valuable when you have a large set of documents). It’s also not as if keywords are completely divorced from the tags that a human would choose, so direct extraction does provide some of the benefits of ‘real’ tagging.

Trackback: Viget’s Four Labs » Blog Archive » Testing with Mock Objects in Rails on 11/30 at 12:20 PM

[...] My last two posts on Ruby on Rails both dealt with using external resources in your application (Akismet and the Yahoo Term Extractor, respectively). One difficulty that often arises when code relies upon external pieces is testing - third parties often frown on the high-volume, rapid hits that a comprehensive test suite generates, and test runs can be dramatically slowed when they rely on resources outside the local machine or network. Luckily, Rails has built-in support for a technique that resolves both of these difficulties: mock objects. [...]

Daniel said on 07/24 at 04:56 AM

A very interesting idea. I have been playing with tagthe.net and it doesn’t seem to work at all for me, unless the tags generated are meant to be a checksum!

I note this post is 2006 and I have been reading up on this for a little while. Have you any developments worth noting? Website submission has grown beyond META tags and focusses on the <body></body> section I feel although I’m a little flaky on what’s hot and what’s not. That combined with the latest tagging ideas, Yahoo’s pipes and you have a very intelligent site?

Ben said on 07/25 at 06:19 PM

Thanks for the comment, Daniel! It has been quite a while since we’ve worked with this, so I’m not sure that I’d recommend anything in this post nowadays without testing it again first :)

I have been thinking a little bit about tagging in general recently, though—and in particular about tagging UIs. I’m hoping to have some time to spend on playing with different approaches sometime soon.

Daniel said on 07/26 at 07:06 AM

That’s great news. I suppose I came to your post whilst reasoning website transparency for optimised submission, maintenance and promotion. If you look at my URL you can see nothing simpler than a mind dump of promotion ideas, but it’s basic, involves nothing new and that’s what I’m looking for! I’d be interested to see any follow up you have. :)
