Custom 'Fuzzy' search system in ASP.NET

johna by | June 19, 2013 | ASP.NET Web Forms Web Development

Many or perhaps most websites have the need for a search facility to search the website content or other things like products. Ideally you want an intelligent search, a bit like a search engine, where search results are ordered by relevancy, misspellings are tolerated, and maybe even word stemming is used (so a search for "teach" also returns results for "teacher", "teaches", "teaching", "taught", etc).

Although there are quite a few solutions to do this, there are often compromises to make, and some can be expensive.

For a recent project (ASP.NET, C#) I required a sophisticated search and, after investigating many solutions, I decided to write my own. This article explains the process but does not include any code. If there is demand I may create a sample project with code examples in a future article.

I looked into Oracle's Fuzzy Search, SQL Server's full text search, plus others like Microsoft Search Server. I wanted immediate indexing and better handling of misspelling and stemming.

The first step was to take care of the spell checking and word stemming requirements. I found a free library called NHunspell which can do this, together with free dictionary and thesaurus files.

I decided the best way to search my database was to create a search index column where I store the stemmed version of each word from any column that is searchable, with identifiers in place to mark which column the word is in.

At this point I should mention how ordering by relevancy is handled. Let's consider a blog post as an example. The blog post has a title, a subtitle, and the article itself. If search words are found in the title they should be considered more relevant than if found in the subtitle or article. And if found in the subtitle they would be more relevant than if in the article. If there is more than one search word then if multiple words are found together then they should be considered more relevant than if found separately.

To achieve this, in my search index column I store every word even if it is repeated.

As an example, if we consider a blog article with a title of "Sample blog title", a subtitle of "My subtitle", and the article is "The quick brown fox jumps over the lazy dog" my search index column would be:
|sample|blog|title|{1}|my|subtitle|{2}|the|quick|brown|fox|jumps|over|the|lazy|dog

I use the pipe character to separate words as my search query will be targeting whole words not partial matches (ie. A search for "sam" should not match "sample"), but any character could be used. I use {1} and {2} to indicate the end of each column's (title, subtitle and article) words.

Before indexing, all words are converted to lower case, any non-alphanumeric characters are replaced with spaces (except my column identifiers), words ending in an apostrophe and the letter s have this removed, multiple spaces are trimmed back to a single space, and, because the thesaurus files I used did not seem to handle word stemming perfectly, I also implement a couple of manual fixes. Lastly I have a bunch of abbreviations which I convert to the non-abbreviated version. I put this functionality in my "WordFix" method, which I also use later for user searches.

With this cleaned text I then convert it into a string array, split on spaces.

I then iterate through the array, skipping my column identifiers. I then spell check the word (unless it has any numeric characters in it), and if it is a valid word I then stem it and record the stemmed version. If spell check fails then I add the word to a user dictionary table if it isn't already in it. The user dictionary means that we can exclude words from our own content from showing as spelling errors during user searches. That leads me to a fundamental of this system, if a word is in our database then it is considered a valid word. Each word and column identifier is added to a string, just like the sample I showed above.

With the search index created we can move on to the user search facility. I first run through the users search phrase through my "WordFix" function to clean it up.

The next step is to create an array of unique words from the user's search phrase. I run each word through spell checking and the user dictionary to see if there are any words the system doesn't recognise. If there are I show a message saying that the word was misspelled and offer some suggestions, which NHunspell provides. If there are no suggestions then I don't show a message about that word.

I then convert the users search phrase into an array of search words which I then run through the word stemmer to get the stemmed version.

Now I can produce my database query. I used Linq to SQL and my query was built by iterating through the array or words and use Linq .Contains("|" + word + "|"). Even if there is record paging I don't do it at this stage — I return the entire result set.

At this point I have a result set List object. I have added an extra property for my ordering by relevancy, a relevancy score property. So now I run each row through my relevancy scorer routine which assigns a score to each row which I can use to order my results and then do paging if required.

The relevancy scorer is a little complicated as it has to check for consecutive and non-consecutive matches, and determine in which column the match is in. This is where some sample code would really help, but for the moment I will just explain the process.

I record all possible single and multiple word combinations of the user's search phrase. So for "big brown dog" I would record "big brown dog" (3 words), "big brown" (2 words), "brown dog" (2 words), "big" (1 word), "brown" (1 word) and "dog" (1 word).

Then I record the character position of each of the column identifiers.

For each of these word combinations I check if that exists (eg. IndexOf("|big|brown|dog|"). If it does I work out in which column it is in (based on the column identifier positions) and assign a score based on number of consecutive words, column it's in, and I also give more weight depending on the position in the column as search terms closer to the beginning of the column are considered more relevant. I skip any lesser word combinations (if I found "big brown dog" then I know that all the other combinations are there too).

With this data I can now show the results, but I also need hit highlighting. My hit highlighter function has to work through the content and find the matched terms by again stemming each word in the content to display and comparing that to the search words.

If all this sounds like it is going to put a lot of demand on the web server, then that is probably right. However, I run this search system on a few very high volume websites and they run fine, albeit on virtual servers with plenty of CPU and memory. The biggest bottleneck is getting the data from SQL Server.

To keep memory use low I suggest retrieving and storing as little information for each row as you will need to display your results. Alternately you can just retrieve a database identifier and the search index then retrieve full data for only the rows that are to be displayed once the relevancy scoring is completed.

UPDATE 13 April 2017: I've created an example system for how this can work. See the project on GitHub.

Related Posts

Web Development

How to set up a debugging using the Turnkey Linux LAMP stack and VS Code

by johna | December 19, 2023
The second part in my guide to setting up a website and database using the Turnkey Linux LAMP stack.

Website Hosting Web Development

How to set up a website and database using the Turnkey Linux LAMP stack

by johna | November 18, 2023
If you need to host your own website for the purposes of web development, Turnkey Linux LAMP Stack is an easy to install all-in-one solution that you can set up on a spare computer or a VM (Virtual Machine).

Web Development

Intermittent "Unable to read data from the transport connection: net_io_connectionclosed" errors

by johna | May 6, 2020
If you are having intermittent problems sending email in .NET using System.Net.Mail consider switching libraries.

Comments

Thomas Maierhofer

by Thomas Maierhofer | September 8, 2014

Hi John, it is always nice to see what people do with NHunspell. I've taken the projects homepage to http://www.crawler-lib.net/ maybe you update the article if you find time. Thanks in advance.

Reply

Leave a Reply
John Avis

by John Avis | September 8, 2014

Thanks Thomas. I have updated the link in the article. NHunspell has been very useful for me and I will take a look at your website to see your other work.

Reply

About

...random postings about web development and programming, Internet, computers and electronics topics.

I recommend ASPnix for web hosting and Crazy Domains for domain registration.

Subscribe

Get the latest posts delivered to your inbox.