Information Retrieval : Spell Correction in Search Engines – Algorithms & Strategies

Spell checking and correction is an important component of any information retrieval (aka document retrieval) system, such as Google, Bing, Amazon, eBay and the like. A user specifies an information need in the form of a search query, which the IR system then uses to retrieve a ranked set of relevant documents. For example, a user might search “red shirt” hoping to see shirts that are red in color. Since the search string is human language input, it is quite common for spelling errors to occur while it is being typed. It is therefore crucial to detect and correct spelling errors before sending the query to the relevance engine to fetch relevant documents.
In this blog post, my main focus is a specific type of error that is more common in apps, called the fat-finger error, and how we could use it strategically to improve speller accuracy. Before we get to that, I will briefly discuss how spell checking/correction traditionally happens. Feel free to skip the next two paragraphs if you already know how spell check/correct works.

How does spell checking/correction traditionally work?

The classic method relies on a dictionary-based approach. The search text is tokenized into words, and each word is checked against a dictionary. If the word is in the dictionary, its spelling is assumed to be correct. If the word doesn’t exactly match any word in the dictionary, spell correction is needed. An edit distance is calculated between the tokenized word and every word in the dictionary to find close/similar words, and the word with the least edit distance (say, one) is used as the spell-corrected replacement. In most cases more than one word at edit distance one shows up, so most spell-correction algorithms use additional information, such as word frequency, to decide the best possible replacement. In a practical setup, collation and normalization are used in addition to this underlying principle.
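As a rough illustration of the dictionary lookup described above, here is a minimal Python sketch. The tiny dictionary, its frequencies, and the function names (`correct_word`, `correct_query`) are all made up for this example; a production speller would use a real vocabulary and a much faster candidate lookup than scanning every word.

```python
from collections import Counter

# Hypothetical toy dictionary with made-up frequencies (illustration only).
WORD_FREQ = Counter({"shirt": 120, "short": 80, "shift": 40, "red": 200, "shoes": 150})


def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, replace) using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca with cb, or match
        prev = curr
    return prev[-1]


def correct_word(word: str, max_distance: int = 1) -> str:
    """Return the word itself if known, else the closest dictionary word."""
    if word in WORD_FREQ:
        return word
    # Smallest distance first; ties are broken by higher frequency.
    candidates = [(levenshtein(word, w), -freq, w) for w, freq in WORD_FREQ.items()]
    distance, _, best = min(candidates)
    return best if distance <= max_distance else word


def correct_query(query: str) -> str:
    """Tokenize on whitespace and correct each token independently."""
    return " ".join(correct_word(token) for token in query.lower().split())
```

With this toy dictionary, `correct_query("red shrt")` picks “shirt” over “short”: both are one edit away from “shrt”, and “shirt” has the higher (made-up) frequency.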

To get a little deeper into edit distance: edit distance is the minimum number of operations needed to transform string A into string B. The operations could be INSERT (I), DELETE (D), REPLACE (R) and TRANSPOSE (T) (illustrated in the example below). Depending on which operations are supported, we get different variants of edit distance. For instance, Levenshtein distance, one of the most commonly used in industry, supports only INSERT, DELETE and REPLACE. Since there are many possible sequences of IDRT operations that transform one string into the other, a technique called dynamic programming is used to make the computation faster.

edit distance(Ineoramtan,Information) = 4
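The dynamic programming idea can be sketched as follows. This is one common formulation, not necessarily the one any particular speller uses: a table where cell (i, j) holds the distance between the first i characters of A and the first j characters of B. The `allow_transpose` flag is an assumption of this sketch; it switches between plain Levenshtein (I, D, R) and the variant that also counts a swap of two adjacent letters as one operation (often called optimal string alignment).

```python
def edit_distance(a: str, b: str, allow_transpose: bool = True) -> int:
    """Edit distance via dynamic programming.

    d[i][j] is the minimum number of operations needed to turn a[:i] into b[:j].
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # DELETE
                d[i][j - 1] + 1,         # INSERT
                d[i - 1][j - 1] + cost,  # REPLACE (or match when cost is 0)
            )
            # TRANSPOSE: the last two characters of each prefix are swapped.
            if (allow_transpose and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For the example above, `edit_distance("Ineoramtan", "Information")` gives 4 when transposition is allowed (replace e with f, swap “am” to “ma”, replace a with i, insert o). With `allow_transpose=False` the swap has to be paid for as two operations, so the result is 5.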

Getting back to the main topic: in this age of start-ups, interesting marketplaces/eCommerce companies are being built for almost everything, not just consumer products and electronics but also household services, groceries, ride sharing, etc. For all of these to be successful, content discovery through a fundamentally strong search system becomes a key differentiator. Marketplaces let consumers search via mobile site, desktop web and mobile app. These channels come with different form factors in display and input: for example, a user searching on the app might have a touch screen with a small QWERTY keyboard, whereas a web user might have a big desktop keyboard. Hence the algorithms that work behind each channel need to be customized to leverage these differences in interaction experience.

It all started with this hypothesis: “Given the form-factor differences between web searches and app searches, are there any distinct spelling error patterns in web vs. app that can be leveraged to improve our speller accuracy?”

I looked at all searches from the last 7 days on web vs. app. I ran a standard speller to identify misspelled words (considering only those 1 edit distance apart), classified them by operation (Insert, Delete, Replace, Transpose) and plotted them side by side on a bar chart (see chart). One thing stood out: there was a significant difference in the number of replace operations between web and app, i.e. app searches have more replace-based spell corrections than web searches.
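The classification step can be sketched roughly like this. It is not the code used for the analysis above; the function names and the input format (pairs of typed word and corrected word) are assumptions for illustration. The operation is named from the point of view of the correction, so a dropped letter counts as an insert.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def _one_char_removed(longer: str, shorter: str) -> bool:
    """True if deleting exactly one character from `longer` gives `shorter`."""
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def classify_single_edit(typed: str, intended: str) -> Optional[str]:
    """Name the single operation that corrects `typed` into `intended`.

    Returns None when the words are equal or more than one edit apart.
    """
    if typed == intended:
        return None
    if len(typed) == len(intended):
        diffs = [i for i in range(len(typed)) if typed[i] != intended[i]]
        if len(diffs) == 1:
            return "replace"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and typed[diffs[0]] == intended[diffs[1]]
                and typed[diffs[1]] == intended[diffs[0]]):
            return "transpose"
        return None
    if len(typed) + 1 == len(intended) and _one_char_removed(intended, typed):
        return "insert"   # a letter is missing and must be inserted
    if len(typed) == len(intended) + 1 and _one_char_removed(typed, intended):
        return "delete"   # an extra letter must be deleted
    return None


def operation_counts(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count corrections by operation for one channel (e.g. web or app)."""
    counts: Counter = Counter()
    for typed, intended in pairs:
        operation = classify_single_edit(typed, intended)
        if operation is not None:
            counts[operation] += 1
    return counts
```

Running `operation_counts` once on the web pairs and once on the app pairs gives the two sets of bars to plot side by side.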

This definitely looks like a pattern. Let’s look at some example replace operations in app searches: panasinic, oendrive, nikin. What do you think they have in common? All need one replace operation for spell correction. But most interestingly, all of the replace operations involve adjacent letters in the QWERTY layout. Panasi[o]nic needs one replace operation from i to o, and o appears right next to i in the QWERTY layout. Now, why does this happen? It is easy to reason why: fat finger syndrome. More below.

What is a fat finger error?
It is a common source of spelling errors on smartphones. As app users, we are more likely to misspell by tapping a letter adjacent (to the left or right) to the intended one. For example, if a user intends to search “shoes”, the most likely spelling errors seen in our app searches are “shor[e]s” and “shi[o]es”. If we check the keyboard layout, ‘e’ sits just before ‘r’, so the user intended to tap ‘e’ but the adjacent letter ‘r’ got in by mistake. Similarly, ‘i’ sits just before ‘o’, so the user intended ‘o’ but the adjacent letter ‘i’ got in. The difference in replace-based operations is mainly attributable to such replacements involving adjacent letters in the keyboard layout.
A fun bit about the fat finger error: the term goes back to investment banking, where traders placing orders for a stock make input errors that cause the actual order to be much larger than intended, leading to financial loss; something like short selling 100,000 shares instead of 10,000 due to an extra 0.
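Coming back to keyboards, the left/right adjacency described above is easy to test in code. The sketch below assumes a standard three-row QWERTY letter layout and only looks at same-row neighbours, as in the examples; the names `horizontal_neighbors` and `is_fat_finger_replace` are made up for illustration.

```python
# Letter rows of a standard QWERTY layout (assumed; digits and symbols ignored).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def horizontal_neighbors(letter: str) -> set:
    """Letters immediately to the left and right of `letter` on its row."""
    letter = letter.lower()
    for row in QWERTY_ROWS:
        idx = row.find(letter)
        if idx != -1:
            # Slicing handles the row edges: an edge key has only one neighbour.
            return set(row[max(idx - 1, 0):idx] + row[idx + 1:idx + 2])
    return set()


def is_fat_finger_replace(typed: str, intended: str) -> bool:
    """True if `typed` differs from `intended` by one adjacent-key replacement."""
    if len(typed) != len(intended):
        return False
    diffs = [(t, c) for t, c in zip(typed, intended) if t != c]
    if len(diffs) != 1:
        return False
    typed_letter, intended_letter = diffs[0]
    return typed_letter in horizontal_neighbors(intended_letter)
```

For example, `is_fat_finger_replace("shors", "shoes")` and `is_fat_finger_replace("shies", "shoes")` are both true, while a replacement such as “shaes” for “shoes” is not, because ‘a’ and ‘o’ are not neighbours.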

Given this insight, how could we make our speller channel-aware and intelligent?
Giving more weight to ‘replace’-based edit operations involving proximity letters in app searches would help increase the accuracy of the speller. As of this writing, none of the commercially available spell checker modules provide this kind of flexibility. Even the default speller modules in Apache Solr and Elasticsearch don’t support it yet.
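One possible way to express this preference, offered only as a sketch, is a weighted edit distance in which replacing a letter with its keyboard neighbour is cheaper when the query comes from the app. The cost value of 0.5, the channel names `"app"` and `"web"`, and all function names here are assumptions for illustration, not tuned values or an existing library API.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ADJACENT_REPLACE_COST = 0.5  # hypothetical discount; would need tuning on real data


def are_adjacent(x: str, y: str) -> bool:
    """True if x and y sit next to each other on the same QWERTY row."""
    for row in QWERTY_ROWS:
        i, j = row.find(x), row.find(y)
        if i != -1 and j != -1:
            return abs(i - j) == 1
    return False


def replace_cost(typed_ch: str, candidate_ch: str, channel: str) -> float:
    """Cost of replacing one letter; cheaper for neighbours on the app channel."""
    if typed_ch == candidate_ch:
        return 0.0
    if channel == "app" and are_adjacent(typed_ch, candidate_ch):
        return ADJACENT_REPLACE_COST
    return 1.0


def weighted_distance(typed: str, candidate: str, channel: str) -> float:
    """Levenshtein-style DP with a channel-aware replace cost."""
    prev = [float(j) for j in range(len(candidate) + 1)]
    for i, tc in enumerate(typed, start=1):
        curr = [float(i)]
        for j, cc in enumerate(candidate, start=1):
            curr.append(min(prev[j] + 1.0,                              # delete
                            curr[j - 1] + 1.0,                          # insert
                            prev[j - 1] + replace_cost(tc, cc, channel)))  # replace
        prev = curr
    return prev[-1]


def rank_candidates(typed: str, candidates: list, channel: str) -> list:
    """Sort candidates by weighted distance, then alphabetically for ties."""
    return sorted(candidates, key=lambda c: (weighted_distance(typed, c, channel), c))
```

With these assumed costs, “shors” is at distance 0.5 from “shoes” on the app channel but at distance 1.0 from “shops” or “short”, so the fat-finger candidate wins. On the web channel all three are at distance 1.0 and another signal, such as word frequency, would have to break the tie.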

I hope this piece of research helps you improve the accuracy of spell correction. Please feel free to send me an email or comment with questions or clarifications.


For the last 6 months, I have been closely following trends in information management. Below are a few of my observations.

  • Data source explosion: Business problems are gaining complexity day by day, hence there is a huge demand for analyzing data from a multitude of sources to help companies frame strategies for growth. GPS data accumulated by telecom companies offers insights into customers’ current location and enables context-aware recommendations. In fact, some telecom companies have introduced location-based pricing. Sensor data helps identify security threats to secure networks. Social network data has opened up as a channel for marketing services/products. Analysis of such closely knit data leads to behavioral and contextual targeting. Traditional data analysis tools/algorithms fail to perform efficiently because such data is huge in size and needs newer data structures for efficient analysis.
  • Databases that go beyond the relational model are gaining popularity: NoSQL databases and graph/tree/XML-based databases.
  • Open source tools continue to emerge (R, RapidMiner, Weka).
  • Growing need for massive dataset analysis.
  • Artificial Intelligence (AI) and NLP are gaining popularity among data analysts (in addition to ML techniques).
  • Multimedia analytics: There is a need to gather critical metrics like customer footfall and to quantify customer satisfaction using facial expressions. All these applications demand high-end signal processing (both image and video). There is a lot of scope for innovation in this area.
  • Privacy-preserving techniques for data analysis. These in turn encourage companies to outsource some of their critical data analysis to third parties.
  • Agile methodologies for analytics projects, to cope with rapidly changing customer/business needs.
  • Bio-inspiration/bio-imitation: Learning from nature and natural processes to develop analogous techniques that could potentially solve real-world problems. Some classic examples are neural networks inspired by the working of the human brain, path optimization inspired by ant colonies, the 280-degree field of view of the honey bee (vision), etc.
  • More and more data is being made publicly available.
  • Real-time data integration, insight generation and business decisions.
  • Complex visualization techniques through newer technologies like Adobe Flex, MS Silverlight, etc., which are known for building RIAs (Rich Internet Applications).

I am sure these are just a few items on the list, which is by no means exhaustive. Feel free to share your comments.
