Alternatives to Sphinx: Fuzzy String Matching in PostgreSQL
Tony Pitale, Former Viget
It always seemed to me a big leap between search as a simple string match and a full-text search engine, such as Sphinx. With the contrib files (/opt/local/share/postgresql83/contrib from macports) available in PostgreSQL I have found what I consider to be a happy balance which lies somewhere in between. We can get a few of the simpler features of Sphinx without very much work at all.
I'll be focusing on 2 particular algorithms for doing what is often referred to as "Fuzzy String Matching". First up is Soundex. To quote Wikipedia, "Soundex is a phonetic algorithm …" and "The goal is for homophones to be encoded to the same representation …". Soundex is a very, very simple algorithm. In fact, I wrote some code in a freshman course to encode words and phrases. In the end, words are collapsed into the first letter, followed by three digits. Soundex is useful for finding "similar" words, especially misspellings. Other forms of phonetic algorithms include Metaphone, and Double-Metaphone.
Using Soundex (or any of the others) is a simple task. Use the given functions provided as you would use any other function. In my database I have a table, full of items, each with an item name. If a user searches for something like "Robart" (instead of Robert) the query will look something like this:
SELECT * FROM items WHERE difference(name, 'Robart') > 2;
When using the difference function, the values range from 0 (no match) to 4 (exact match). I've found that a two is pretty good at finding most misspellings in user queries. In my above example 'Robart' and 'Robert' are actually identical encodes (R163) but the name of the item we are looking for might have 'Robert' somewhere in the middle of the phrase and would therefore not match at all. Other downsides to Soundex are that it only works well in English and it only really works well on relatively short words as about three consonants will be used in the encode (the rest are thrown away).
Next, we have Trigram or Trigraph which the PostgreSQL docs describe as, "… a group of three consecutive characters taken from a string." The way in which two strings are matched is by counting the number of Trigrams they share.
The module for Trigram in PostgreSQL is a little more difficult to set up because it requires the creation of an index. The benefit of this is that it improves the overall performance of searching. To complete the setup we must add a new index to our table and column of choice. Given a table named 'items' and a column 'name' we might have something like this (in a migration, of course):
CREATE INDEX items_name_trigram_index ON items USING gist (name gist_trgm_ops);
With this index in place we can now query like so:
SELECT * FROM items WHERE name % 'Robart';
The percent operator is special to columns indexed with the options 'gist_trgm_ops' (as in the index we just created). If you want to use other operators such as ~* (case-insensitive regex), you would be wise to add another index.
Mixing It Up
If we put all of these things together, we might have a query that looks something like this:
SELECT * FROM items WHERE name % 'Robart' OR difference(name, 'Robart') > 2 OR name ~* 'Robart';
Installation is a breeze. My preferred way to "import" sql files is by using the -f flag and a filename with the psql command. You could also use the '\i' operator when already working within your database. Though, you have to be sure to use the full path.
psql -f /opt/local/share/postgresql83/contrib/fuzzystrmatch.sql <dbname> psql -f /opt/local/share/postgresql83/contrib/pg_trgm.sql <dbname>
Another stepping stone between fuzzy string matching and a full-text Sphinx setup would be the PostgreSQL contrib-included tsearch2. Tsearch2 provides full-text searching vectors and indexing from within PostgreSQL itself. It will likely never be as fast or as feature-full as Sphinx. So, if your immediate needs include fantastic speed searching over many gigabytes of text data, Sphinx is very likely your best bet and well worth the effort of installing and configuring.
However, when it comes to simplicity and time versus performance I think the benefits to trying out fuzzy string matching are tangible.