Information Retrieval : Spell Correction in Search Engines – Algorithms & Strategies
Spell Checking & Correction is an important component of any information retrieval(aka document retrieval) system like Google, Bing, amazon, eBay and alike. User specifies his/her information need in the form of search query, which is later used by IR system to retrieve ranked set of relevant documents. For example, user might search “red shirt” hoping to see shirts which are red in color. Since the search string is a human language input, it is quite common for spell errors to occur while being typed. It becomes crucial to detect and correct for spell error before sending it to relevance engine to fetch relevant documents.
In this blog post, my main focus is to talk about a specific type of error that is more common in apps called fat-finger error and how we could use it strategically to improve speller accuracy. Before we get to this, briefly discussing how a spell checking/correction happens traditionally. Feel free to skip to the next two paragraphs if you are already aware of how spell check/correct works.
How spell checking/correction traditionally works ?
The classic method relies on a dictionary based approach, search text is tokenized into words. Each word is checked against a dictionary, if the word is in the dictionary, then it assumes spelling is correct. If the word doesn’t exactly match any word in dictionary, spell correction is needed. Edit distance measure is calculated between tokenized word and every word in dictionary to find close/similar words. Word with least(say one) edit distance is replaced as spell-corrected word. In most cases, more than one word with edit distance one would show-up, hence most spell correct algorithms use additional information like frequency of word to decide the most best possible word to replace. In practical setup, collation and normalization would be used in addition to the above mentioned underlying principle.
To get a little deeper into edit distance – edit distance is the minimum number of operations needed to transform string A to string B. The operations could be INSERT(I), DELETE(D), REPLACE(R), TRANSPOSE(T) (illustrated in pic below). Depending on what type of operations are supported, we have different variants of edit distance measures. For instance, levenshtein distance supports only INSERT, DELETE and REPLACE which is one of the most commonly used in industry. Since there are going to be various ways of doing IDRT operations, a technique called dynamic programming is used to make computations faster.
edit distance(Ineoramtan,Information) = 4
Getting back to main topic, In this age of start-ups, there are interesting marketplaces/eCommerce companies getting built for almost everything not just consumer products/electronics but for household services, groceries, ride sharing,etc. For all these to be successful, content discovery through a fundamentally strong search system becomes a key differentiator. Marketplaces offer choice to consumers to search via mobile site, desktop web and Mobile App. As these channels come with different form factors in display/input format – for example, users searching on app might have touch screen experience with a small qwerty keyboard vs. web user who might have a big desktop keyboard input. Hence algorithms that work behind each of channels need to be customized leveraging these differences in interaction experience.
It all started with this hypothesis, “Given form-factor differences between web searches vs. app searches, are there any distinct spelling error patterns in web vs. app which can be leveraged for improving our speller accuracy ?”
I looked at all searches for the last 7 days that happened on web vs. app. Ran a standard speller to identify misspelled words(considering only 1 edit distance apart) and classified them by operation (Insert, Delete, Replace, Transpose) and plotted them side by side on a bar chart (see chart). One thing stood out, there was a significant difference in no. of replace operations seen between web vs. app. ie App searches have higher replace based spell corrections compared to web.
Spell Error Comparison – Web vs. App
This definitely looks like a pattern. Lets look at some example replace operations in app – panasinic, oendrive, nikin – what do you think, they have in common ? All have one replace operation needed for spell correction. But most interestingly, all of the replace operations involve adjacent letters in the qwerty layout. Panasi[o]nic needed one replace operation from i to o, o appears just next to i in the qwerty layout. Now, why does this happen ? It is easy to reason why – fat finger syndrome. More below.
What is fat finger error ?
It is a source of common spell error in smartphones. As app users, we are more likely to misspell by clicking adjacent letters(to left or right) instead of the intended one. For example: If user intends to search “shoes”, the most likely spell errors as seen in our searches on app is “shor[e]s” or “shi[o]es”. If we check the keyboard layout, ‘e’ occurs before ‘r’ so user intended to click ‘e’ but mistakenly adjacent letter ‘r’ got in. Similarly, ‘i’ occurs before ‘o’, so the user here intended ‘o’ but the adjacent letter ‘i’ got in. The difference in replace based operations is mainly attributable to such replace operations involving adjacent proximity letters in keyboard layout.
Fun bit about fat finger error, source of this word goes back to investment banking where traders bidding for a stock commit errors due to which actual bid/order placed is much higher than intended value leading financial loss – something like short selling 1,00,000 shares instead of 10,000 due to an extra 0.
Given this insight, How could we make our speller channel-aware and intelligent ?
By giving more weight-age to ‘replace’ based edit distance operations involving proximity letters in app searches would help increase accuracy of speller. As of this writing, none of the commercially available spell checker modules provide flexibility in such features. Even the default speller module in Apache Solr/elastic search don’t support them yet.
Hope this piece of research help to build upon accuracy of spell correction. Please feel free to shoot me an email/comment for questions/clarifications.