How We Implemented Lucene Fuzziness and Wildcards

In version 3.2.8, we’re adding a couple of configuration parameters to our Advanced Search (i.e., Lucene) implementation in the Enterprise Edition. I have to say, I’m always impressed with how easy it is to implement seemingly complex features with Lucene’s API. Kudos to the Lucene folks!

We decided to give administrators control over a few aspects of how Lucene analyzes the searches users enter. For those looking for them, here is where the settings can be found on our Advanced Search Settings screen:

New Lucene Settings

Here is the full scoop on what these settings do.

Advanced Search Match Style

The first setting, Advanced Search Match Style, determines the technique the system uses to match documents during Lucene searches entered by users. The options are listed in rough order from narrowest to broadest.

  • Exact Match (wraps search in double quotes)
  • All Terms (uses AND operator between terms)
  • All Terms Fuzzy (AND operator and tilde appended to each term)
  • All Terms Wildcard (AND operator and asterisk appended to each term)
  • Any Term (OR operator; this is Lucene’s default)
  • Any Term Fuzzy (OR operator and tilde appended to each term)
  • Any Term Wildcard (OR operator and asterisk appended to each term)
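As a rough illustration of the list above, each option boils down to an operator plus an optional suffix. The sketch below is hypothetical: the MatchStyle enum, its field names, and toExplicitQuery are invented for this example and are not SoftSlate’s actual code. It also writes the operator out between the terms so the effect is visible in the query string, whereas the real implementation relies on the parser’s default operator.

```java
// Hypothetical model of the seven match styles; names are illustrative only.
enum MatchStyle {
    EXACT_MATCH(false, "", true),
    ALL_TERMS(true, "", false),
    ALL_TERMS_FUZZY(true, "~", false),
    ALL_TERMS_WILDCARD(true, "*", false),
    ANY_TERM(false, "", false),
    ANY_TERM_FUZZY(false, "~", false),
    ANY_TERM_WILDCARD(false, "*", false);

    final boolean requireAllTerms; // true: AND between terms, false: OR
    final String termSuffix;       // "~", "*", or "" appended to each term
    final boolean quoteWholeInput; // true only for Exact Match

    MatchStyle(boolean requireAllTerms, String termSuffix, boolean quoteWholeInput) {
        this.requireAllTerms = requireAllTerms;
        this.termSuffix = termSuffix;
        this.quoteWholeInput = quoteWholeInput;
    }

    // Builds a query string with the operator spelled out, purely for illustration.
    String toExplicitQuery(String input) {
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        if (quoteWholeInput) {
            return "\"" + trimmed + "\"";
        }
        String operator = requireAllTerms ? " AND " : " OR ";
        StringBuilder query = new StringBuilder();
        for (String term : trimmed.split("\\s+")) {
            if (query.length() > 0) {
                query.append(operator);
            }
            query.append(term).append(termSuffix);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Print what each style would produce for the same input.
        for (MatchStyle style : values()) {
            System.out.println(style + ": " + style.toExplicitQuery("hubble space telescope"));
        }
    }
}
```

With this sketch, ALL_TERMS_FUZZY turns hubble space telescope into hubble~ AND space~ AND telescope~, and EXACT_MATCH simply wraps the input in double quotes.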

There are a couple of Lucene features at play here. Let’s go through them briefly.

The Default Operator: AND or OR

First, let’s talk about the “default operator” for searches. Lucene uses the OR operator by default. For example, if a user enters hubble space telescope as the search, by default it returns products that match either “hubble” or “space” or “telescope”. For very large product catalogs, or if your users commonly search for multiple-word phrases where one of the words is very common (like “space” perhaps), you might find the OR operator returns too many matches. In that case you can change the matching style to one of the “All Terms” options so it uses the AND operator instead. That requires a match on all terms in the user’s input, e.g., “hubble” and “space” and “telescope”. In terms of the Java code, the setting tells us which way to call QueryParser.setDefaultOperator(). Incidentally, the user can still switch the search to a different operator by typing AND or OR (in capital letters) himself. For example, hubble OR telescope will return matches on either “hubble” or “telescope”, even if you’ve specified an “All Terms” option in the setting.
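To make the difference concrete, here is a toy model of the two operators in plain Java. It is not Lucene code: OperatorDemo and its matches method are hypothetical, and a product is reduced to a bare set of lowercase terms. In the real system the choice is made through QueryParser.setDefaultOperator().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of OR versus AND matching; not how Lucene works internally.
final class OperatorDemo {

    // requireAll = true models the "All Terms" (AND) styles,
    // requireAll = false models the "Any Term" (OR) styles.
    static boolean matches(Set<String> productTerms, List<String> queryTerms, boolean requireAll) {
        int hits = 0;
        for (String term : queryTerms) {
            if (productTerms.contains(term)) {
                hits++;
            }
        }
        return requireAll ? hits == queryTerms.size() : hits > 0;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("hubble", "space", "telescope");
        Set<String> heater = new HashSet<String>(Arrays.asList("space", "heater"));
        Set<String> poster = new HashSet<String>(Arrays.asList("hubble", "space", "telescope", "poster"));

        System.out.println("space heater, OR:  " + matches(heater, query, false));
        System.out.println("space heater, AND: " + matches(heater, query, true));
        System.out.println("hubble poster, AND: " + matches(poster, query, true));
    }
}
```

In this model a “space heater” product matches under OR because of the common word “space”, but drops out under AND, which is the kind of narrowing the “All Terms” options are meant to provide.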

Fuzziness and Wildcards

Second, we have options to automatically add fuzziness or a wildcard to each of the terms entered by the user. Fuzziness means that Lucene will forgive a small spelling mistake or fat-fingering, and consider a term to match if it is close to another term. E.g., “huble” might match “hubble” if one of the fuzzy options is enabled. In Java code, this involves appending a tilde character (~) to the end of each term entered by the user. For example, the search above would be submitted to Lucene as “hubble~ space~ telescope~”. Note that if one of the non-Fuzzy settings is chosen, the user can still run a fuzzy search by appending the tildes himself in the search string. The setting only determines whether SoftSlate automatically adds them if they are not there.

The Wildcard options work similarly. If one of them is set, SoftSlate will automatically append an asterisk to each term, making it match on partial words. E.g., “spa” would be submitted as “spa*” and therefore match “space”. Note again that the user does have some control over the situation. Even if a Wildcard setting is not chosen, he can add an asterisk to a term to turn it into a wildcard search. You might be wondering, what if the setting specifies Fuzzy and the user inputs a wildcard? In this case the user wins and the input is left alone as a wildcard search. Any term the user submits with a special character appended to it is left alone (out of respect for his wishes).
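The “user wins” rule can be sketched as a small preprocessing step on the search string before it reaches the parser. This is a simplified, hypothetical version: the TermSuffixer class is invented for the example, the set of special characters it checks (~, * and ?) is an assumption, and it ignores quoted phrases and other query syntax entirely.

```java
import java.util.StringJoiner;

// Hypothetical sketch: append a fuzzy or wildcard suffix to each plain term.
final class TermSuffixer {

    // Assumed set of trailing characters that mean "the user already decided".
    private static final String USER_SPECIAL = "~*?";

    // suffix is "~" for the Fuzzy styles or "*" for the Wildcard styles.
    static String addSuffix(String input, String suffix) {
        StringJoiner out = new StringJoiner(" ");
        for (String term : input.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input produces an empty query
            }
            out.add(leaveAlone(term) ? term : term + suffix);
        }
        return out.toString();
    }

    private static boolean leaveAlone(String term) {
        // Operators typed by the user are not search terms.
        if (term.equals("AND") || term.equals("OR") || term.equals("NOT")) {
            return true;
        }
        // A term that already ends in a special character is respected as-is.
        char last = term.charAt(term.length() - 1);
        return USER_SPECIAL.indexOf(last) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(addSuffix("hubble space telescope", "~")); // hubble~ space~ telescope~
        System.out.println(addSuffix("hub* telescope", "~"));         // hub* telescope~
    }
}
```

The second call shows the Fuzzy setting meeting a user-supplied wildcard: hub* is passed through untouched, while the plain term telescope still gets its tilde.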

Stemming and Wildcards

We discovered an interesting problem when using wildcard searches with the Snowball analyzer. The Snowball analyzer can be specified via the “Lucene Analyzer” setting. Its purpose is to “stem” the terms that are indexed and the terms submitted by users. Stemming means that a term is reduced to its root so that, for example, “running” and “runs” are both reduced to the root term “run”. This posed a problem when combined with the options to automatically add a wildcard to each term the user submits. Internally, Lucene assumes that the wildcard is part of the user’s input, not something added automatically on his behalf, so by design it does not stem wildcard searches when they are submitted. However, the terms in the index are already stored in stemmed form. So if “running” was submitted, SoftSlate added a wildcard to make it “running*”, and Lucene saw the asterisk and skipped stemming the term. But since the index contained only “run”, not “running”, the search failed to match the very word the user had typed! To make a long story short, we work around this by also searching for the original term the user entered, without the wildcard at the end, to make sure it is covered. That is, “running” is reworked as “running OR running*”.
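A minimal sketch of that workaround, assuming it is applied term by term to the raw search string. The StemSafeWildcard class is hypothetical, and the parentheses around each pair are an addition of this example (to keep the OR from interacting with the operator between terms); the description above only gives the “running OR running*” form.

```java
import java.util.StringJoiner;

// Hypothetical sketch of the stemming workaround for automatic wildcards.
final class StemSafeWildcard {

    // Rewrites each plain term as "(term OR term*)" so that the unmodified
    // term still goes through the analyzer (and gets stemmed), while the
    // wildcard version covers partial-word matches.
    static String expandAll(String input) {
        StringJoiner out = new StringJoiner(" ");
        for (String term : input.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            if (term.endsWith("*") || term.endsWith("~")) {
                out.add(term); // the user chose a style already; leave it alone
            } else {
                out.add("(" + term + " OR " + term + "*)");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandAll("running"));       // (running OR running*)
        System.out.println(expandAll("running shoes")); // (running OR running*) (shoes OR shoes*)
    }
}
```

The plain half of each pair is what the analyzer stems to “run”, which is the form actually stored in the index.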

Advanced Search Fuzziness

The Advanced Search Fuzziness setting is a factor from 0.0 to 1.0 that tells the system how fuzzy the fuzzy searches should be. This setting applies if one of the Fuzzy options is selected for the Advanced Search Match Style, or if the user himself appends a tilde (~) to the end of a search term. A higher value makes fuzzy searches less fuzzy (i.e., narrower). A lower value makes them broader. In terms of the Java code, the value of this setting is piped directly into QueryParser.setFuzzyMinSim(). Lucene’s default, by the way, is 0.5.
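One way to picture the factor is as a threshold on a similarity score between the typed term and an indexed term. The sketch below assumes the score is 1 minus the edit distance divided by the length of the shorter term, which is our understanding of how the Lucene releases of this era compute it. Treat both the formula and the comparison against the threshold as assumptions; the FuzzySimilarity class is hypothetical and is not part of Lucene.

```java
// Hypothetical sketch of a similarity threshold; not Lucene's own code.
final class FuzzySimilarity {

    // Classic Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed formula: 1 - distance / length of the shorter term.
    static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0f : 0.0f;
        }
        return 1.0f - ((float) editDistance(a, b) / shorter);
    }

    // Assumption: a term matches when its similarity reaches the minimum.
    static boolean matches(String a, String b, float minSim) {
        return similarity(a, b) >= minSim;
    }

    public static void main(String[] args) {
        System.out.println(similarity("huble", "hubble"));
        System.out.println(matches("huble", "hubble", 0.5f));
        System.out.println(matches("huble", "hubble", 0.9f));
    }
}
```

Under these assumptions “huble” is one edit away from “hubble” and scores about 0.8, so it matches at the default of 0.5 but would be rejected if the setting were raised to 0.9.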

About David Tobey

I'm a web developer and consultant based in lovely Schenectady, NY, where I run SoftSlate, LLC. In my twenties I worked in book publishing, where I met my wife. In my thirties I switched careers and became a computer programmer. I am still in the brainstorming phase for what to do in my forties, and I am quickly running out of time to make a decision. If you have any suggestions, please let me know. UPDATE: I have run out of time and received no (realistic) suggestions. I guess it's programming for another decade.