Deep learning neural networks have shown promising results in problems related to vision, speech and text, with varying degrees of success. In this post I look at a text problem: predicting a person's gender from their name. RNNs are a good fit, as the task involves learning from sequences (here, sequences of characters). Traditional RNNs are hard to train due to vanishing gradients; two variants of the RNN help solve this problem:
(i) LSTM (Long Short-Term Memory): uses memory/forget gates to retain or discard patterns learnt in a sequence that are useful for predicting the target variable. We will use this for our model (I recommend colah's blog for a deeper understanding of the theory behind LSTMs).
(ii) GRU (Gated Recurrent Unit)
To state the problem formally, we are interested in predicting whether a given name is male or female. There have been past attempts to predict gender using simple rules on the name, as seen in NLTK, which for example relies on the last character of the name to classify gender and suffers from poor accuracy because it over-generalizes.
We use the character sequence making up the name as our X variable, with the Y variable as m/f indicating the gender. We use a stacked LSTM model and a final dense layer with softmax activation (a many-to-one setup). Categorical cross-entropy loss is used with the Adam optimizer, and a 20% dropout layer is added for regularization to avoid over-fitting. The schematic shows the model setup.
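The setup described above can be sketched in Keras roughly as follows. This is a minimal illustration, not the exact original: the LSTM layer width (128) is an assumption, while the two stacked LSTMs, 20% dropout, softmax dense layer, categorical cross-entropy and Adam follow the description in the text.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

VOCAB_SIZE = 39   # a-z, 0-9, space, dot and the END token (see below)
MAX_LEN = 30      # names truncated/padded to 30 characters

model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(MAX_LEN, VOCAB_SIZE)),
    Dropout(0.2),                     # 20% dropout for regularization
    LSTM(128),                        # second stacked LSTM; many-to-one output
    Dropout(0.2),
    Dense(2, activation='softmax'),   # [male, female] class probabilities
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```

Training would then be a single `model.fit(X, Y, batch_size=1000, epochs=50)` call with the encoded inputs described below.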


About the dataset

We use the Indian names dataset available in the mbejda GitHub account, a collection of male and female Indian names gathered from public records. Basic preprocessing is done to remove duplicates and special characters. The final distribution of m/f classes is 55%:45%. Note that we use the full name here; some names have more than two words depending on surname, family name, etc.

Implementation using Keras

The complete dataset, code and Python notebook are available in my GitHub repo. The implementation is done using Keras with a TensorFlow backend. Input representations and parameters are highlighted below.

(i) Vocabulary: a set of 39 characters comprising a-z, 0-9, space, dot and a special END token.
(ii) Max sequence length: chosen as 30, i.e. names exceeding 30 characters are truncated; names with fewer than 30 characters are padded with the END token.
(iii) One-hot encoding: each character is one-hot encoded as a [1 x 39] array.
(iv) Batch size: 1000 samples per batch.
(v) Epochs: 50, i.e. we iterate over the entire dataset 50 times.
(vi) Y labels: represented as a [1 x 2] array, with the first column indicating male and the second female, e.g. [1 0] for male.
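The encoding in (i)-(iii) can be sketched as below. The ordering of characters in the vocabulary and the choice to map unknown characters to space are my assumptions; the repo has the actual preprocessing.

```python
import numpy as np

# 26 letters + 10 digits + dot + space = 38 chars, plus the END token = 39
VOCAB = list("abcdefghijklmnopqrstuvwxyz0123456789. ") + ["END"]
CHAR_IDX = {c: i for i, c in enumerate(VOCAB)}
MAX_LEN = 30

def encode_name(name):
    """Return a (MAX_LEN, 39) one-hot matrix: truncate to 30, pad with END."""
    chars = list(name.lower())[:MAX_LEN]
    chars += ["END"] * (MAX_LEN - len(chars))
    x = np.zeros((MAX_LEN, len(VOCAB)))
    for i, c in enumerate(chars):
        # unknown characters fall back to space (an assumption)
        x[i, CHAR_IDX.get(c, CHAR_IDX[" "])] = 1
    return x
```

Stacking these matrices over all names yields the X tensor of shape (samples, 30, 39) fed to the model.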


Loss and accuracy charts as a function of epochs. As seen from the loss chart, after 40 epochs the validation loss saturates while the training loss keeps reducing, indicating over-fitting.


The model took ~2 hours to run on my moderately high-end laptop (CPU only). The final classification accuracy stands at 86% on the validation dataset. More stats around precision, recall and the confusion matrix are presented below.

The model shows higher precision for identifying males (88%) than females (84%). Of the 3,028 samples in validation, 411 were incorrectly predicted, giving an accuracy of 86.43% and an error rate of 13.57%.

Interpreting the Model (what did the model learn ?)

Looking at the predictions made on the validation set, the model seems to have learnt the following sequence patterns. For example, the occurrence of "mr." implies a likely male name, while occurrences of "smt" and "mrs" are indicative of female names. The beauty of deep learning is its end-to-end learning capability, without the explicit feature-engineering step needed in traditional machine learning. Additionally, female names tend to end with the vowels a, i, o, which the model seems to have picked up. There are more complex patterns learnt by the model that are not easily visible through inspection.

Next steps

Some thoughts on improving model accuracy:

(i) Use a pre-trained character embedding instead of one-hot encoding (like this)
(ii) Better sampling
a. Limit to first name and last name.
b. Sub-sample the male class to balance m-f classes at 1:1.
c. Remove all non-alphabetic characters from the vocabulary.
(iii) Hyper-parameter tuning
a. Max sequence length
b. Number of stacked layers
c. LSTM gates

I would love to hear your comments, questions and suggestions regarding this post; please share it if you find it interesting.

Machine Learning – Linear Regression using gradient descent vs. neural network

Machine learning, or more specifically supervised learning, broadly encompasses two classes of problems: regression and classification. Regression deals with predicting a continuous variable, while classification deals with categorical prediction. For example, predicting a house price is a regression problem, since the price can take any real value, while email spam detection is a classification problem, as the outcome is a special kind of categorical variable, i.e. binary (spam 1 / non-spam 0). Numerous algorithms built over the last few decades fall under one of these two classes. The focus of this post is regression techniques; I will reserve another post for classification techniques. Feel free to subscribe/bookmark this page for upcoming posts.

  • Introduction – Regression
  • Algorithms
    • Ordinary Least Squares (OLS)
    • Gradient Descent
    • Neural Networks
  • Comparison of Algorithms
  • Conclusions & Inferences

Regression techniques help us understand the relationship of a continuous variable as a function of one or more independent (aka explanatory) variables. It can be considered a curve-fitting problem where we are interested in learning a function y = f(X), with X = (x1, x2, ..., xn), that best fits y. Now, how do we quantify "best fit"? Most techniques use some measure of error such as SSE (sum of squared errors), also called the cost function or objective function, to quantify the quality of the fit. We want to learn the parameters that give the least error, so formally we can define this as an optimization problem: minimize SSE over the parameters of f(X).

Linear vs. non-linear: when y = f(x) takes the form y = a0 + a1 * x1 + a2 * x2 + ... + an * xn, we call this linear regression: we are fitting a straight line (or hyperplane) to the data. In non-linear regression we learn y = f(x) of more complex forms involving logs, exponents or higher-order polynomials of the independent variables, for example y = a0 + a1 * log(x1) + a2 * e^x + a3 * x^3.

Simple vs. multiple regression: when y = f(x) has more than one explanatory variable, it is called multiple regression.

Let's take a simple linear regression problem (dataset), apply a couple of algorithms, and compare accuracy, complexity and run times. We will conclude this post with an in-depth understanding of the techniques. The following three techniques are explored in detail using R libraries. Feel free to download the notebook from my GitHub, run it yourself, and play with the parameters and plots. The notebooks are also available in HTML so you can explore the charts and documentation here. A high-level summary is presented below; I highly recommend checking the notebook here.

Link to jupyter notebook (code, charts & data) here

  1. Ordinary Least Squares
    A closed-form solution that uses matrix multiplication and inversion to find the parameters which minimize the error (SSE). This analytic solution always finds the minimum of the error. Looking at the (x, y) plot, we attempt to fit a non-linear model using linear regression by exploiting variable-transformation techniques.
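To make the closed form concrete, here is a hedged sketch on synthetic data (the blog's actual dataset and transformations live in the linked R notebook; the ground-truth coefficients below are made up for illustration). The x^2 column is the variable transformation that lets a linear solver fit a non-linear curve.

```python
import numpy as np

# Synthetic data with known coefficients: y = 2 + 3x + 5x^2 + noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2 + 3 * x + 5 * x ** 2 + rng.normal(0, 0.01, 200)

# Design matrix with an intercept column and the x^2 transformed feature,
# then solve the normal equation (X^T X) theta = X^T y.
# np.linalg.solve avoids forming an explicit matrix inverse.
X = np.column_stack([np.ones_like(x), x, x ** 2])
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

With low noise, `theta` recovers approximately (2, 3, 5) in a single matrix solve, no iteration needed.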
  2. Gradient Descent
    An open-form solution that uses mathematical optimization: parameters are initialized randomly and iteratively adjusted along the gradient that reduces SSE. We start with a learning rate of 0.01 and iterate 120,000 times.
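A minimal sketch of batch gradient descent for a straight-line fit, reusing the post's settings (learning rate 0.01, 120,000 iterations); the data here is synthetic, not the blog's dataset.

```python
import numpy as np

# Synthetic data: y = 1.5 + 0.5x + noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
y = 1.5 + 0.5 * x + rng.normal(0, 0.01, 100)

a0, a1, lr = 0.0, 0.0, 0.01   # random/zero init and learning rate
for _ in range(120_000):
    err = a0 + a1 * x - y          # residuals under current parameters
    a0 -= lr * err.mean()          # gradient of mean squared error w.r.t. a0
    a1 -= lr * (err * x).mean()    # gradient of mean squared error w.r.t. a1
```

After enough iterations the parameters converge to (approximately) the same values the closed-form OLS solution would give on this data.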
  3. Neural Networks
    Now generally popular among AI (Artificial Intelligence) practitioners, neural networks loosely mimic the working of the human brain, modelling hidden layers with weights propagating across layers between input and output nodes. Internally, they use gradient descent with back-propagation to learn the weights. We train a network with 3 hidden layers of 3 nodes each; input and output are single nodes, as we have a single explanatory variable and a single response variable.



Let's compare which of the models performs best in terms of fitting the data. Please note, I am not using a train/test split here, as the idea is to emphasize the techniques; so we are talking about training error. Let's use RMSE (root mean squared error) to compare the three models:

| Ordinary Least Squares | Gradient Descent | Neural Network |
| --- | --- | --- |
| 0.000534 | 0.000984 | 0.000278 |

Looking at the RMSE, the neural network seems to do a great job of learning the data. NNs are known for their strong memorization capability, which causes over-fitting and leads to a high-variance system. Also note that the neural network did not involve any feature engineering: we just passed in x values, whereas the other methods used x^2 as a feature. NNs are known to be great at identifying latent features without the explicit feature engineering usually done by domain experts of the respective problem space.

The table below explains more about each of the three techniques, how they differ, and when to use which.

| Ordinary Least Squares | Gradient Descent | Neural Network |
| --- | --- | --- |
| Closed form, analytic solution | Open form, iterative optimization | Open form, iterative optimization |
| Slow, involves matrix inverse computation | Fast for large datasets | Slow; training usually done on GPUs, which handle matrix computations easily |
| No hyperparameters | Hyperparameters: epochs (iterations), alpha (learning rate) | Hyperparameters: hidden layers, nodes, activation fn., learning rate |
| Feature engineering required | Feature engineering required | Little/no feature engineering needed |
| More data is good | More data is good | While more data is good, NNs require a lot of data for good generalization |
| Good interpretability; easy to communicate findings | Good interpretability; easy to communicate findings | Black box; difficult to explain, suffers from poor interpretability |
| Most commonly used for building offline models on small datasets | Commonly used for large-scale learning involving thousands of features | Commonly used in text/vision/speech, mostly via CNN and RNN architectures |
| Tools: R, SAS, Python | Tools: Python, R | Tools: Deeplearning4j, Theano, TensorFlow, Python, R |
| Always finds the minimum of SSE | Can get stuck in local minima due to bad initialization/learning rate | Can get stuck in local minima due to bad initialization/learning rate |

What we did not talk about, but important in this context ?

  • Cross Validation
  • Feature engineering and reduction(PCA)
  • Hyperparameter tuning
  • Sampling
  • Objective Functions
  • Regularization L1/L2
  • Interaction Effects

A lot of content to digest here; feel free to share any feedback/comments you have about any part of the blog post, I would love to chat. I will be back with a similar post on classification in the coming days, comparing logistic regression, decision trees, random forests (ensembles) and neural networks.

Information Retrieval : Spell Correction in Search Engines – Algorithms & Strategies

Spell checking and correction is an important component of any information retrieval (aka document retrieval) system like Google, Bing, Amazon, eBay and the like. The user specifies his/her information need in the form of a search query, which the IR system uses to retrieve a ranked set of relevant documents. For example, a user might search "red shirt" hoping to see shirts which are red in color. Since the search string is human-language input, spelling errors commonly occur while typing. It becomes crucial to detect and correct spelling errors before sending the query to the relevance engine to fetch relevant documents.
In this blog post, my main focus is a specific type of error that is more common in apps, called the fat-finger error, and how we could use it strategically to improve speller accuracy. Before we get to that, let's briefly discuss how spell checking/correction happens traditionally. Feel free to skip the next two paragraphs if you already know how spell check/correct works.

 How does spell checking/correction traditionally work?

The classic method relies on a dictionary-based approach: the search text is tokenized into words, and each word is checked against a dictionary. If the word is in the dictionary, the spelling is assumed correct. If it doesn't exactly match any dictionary word, spell correction is needed: an edit-distance measure is calculated between the tokenized word and every dictionary word to find close/similar words, and the word with the least (say, one) edit distance replaces the misspelled word. In most cases, more than one word with edit distance one shows up, so most spell-correct algorithms use additional information, such as word frequency, to decide on the best possible replacement. In a practical setup, collation and normalization are used in addition to this underlying principle.

To get a little deeper into edit distance: it is the minimum number of operations needed to transform string A into string B. The operations can be INSERT (I), DELETE (D), REPLACE (R) and TRANSPOSE (T) (illustrated in the pic below). Depending on which operations are supported, we get different variants of the edit-distance measure. For instance, Levenshtein distance, one of the most commonly used in industry, supports only INSERT, DELETE and REPLACE. Since there are many possible ways of applying IDRT operations, a technique called dynamic programming is used to make the computation efficient.

edit distance(Ineoramtan,Information) = 4
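The dynamic-programming computation described above can be sketched as follows for the Levenshtein variant (insert/delete/replace only; adding transpose would give the Damerau-Levenshtein variant):

```python
def levenshtein(a, b):
    """Minimum number of INSERT/DELETE/REPLACE operations to turn a into b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a[i-1]
                           dp[i][j - 1] + 1,        # insert b[j-1]
                           dp[i - 1][j - 1] + cost) # replace (or match)
    return dp[m][n]
```

For example, `levenshtein("shoes", "shores")` is 1 (one insert), which is why "shores" would surface as a close candidate for "shoes".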

Getting back to the main topic: in this age of start-ups, interesting marketplaces/eCommerce companies are being built for almost everything, not just consumer products/electronics but household services, groceries, ride sharing, etc. For all of these to be successful, content discovery through a fundamentally strong search system is a key differentiator. Marketplaces offer consumers the choice to search via mobile site, desktop web and mobile app. These channels come with different form factors in display/input format; for example, users searching on an app have a touch-screen experience with a small qwerty keyboard, vs. a web user who has a full desktop keyboard. Hence the algorithms behind each channel should be customized to leverage these differences in interaction experience.

It all started with this hypothesis: "Given the form-factor differences between web and app searches, are there distinct spelling-error patterns in web vs. app which can be leveraged to improve our speller accuracy?"

I looked at all searches over the last 7 days on web vs. app, ran a standard speller to identify misspelled words (considering only those 1 edit distance apart), classified them by operation (insert, delete, replace, transpose) and plotted them side by side on a bar chart (see chart). One thing stood out: there was a significant difference in the number of replace operations between web and app, i.e. app searches have more replace-based spell corrections than web searches.

This definitely looks like a pattern. Let's look at some example replace operations on app: panasinic, oendrive, nikin. What do they have in common? All need one replace operation for spell correction. Most interestingly, all of the replace operations involve adjacent letters in the qwerty layout. Panasi[o]nic needed one replace from i to o, and o appears just next to i on the qwerty layout. Now, why does this happen? It is easy to reason why: fat-finger syndrome. More below.

What is fat finger error ?
It is a common source of spelling errors on smartphones. As app users, we are more likely to misspell by tapping a letter adjacent (to the left or right) to the intended one. For example, if a user intends to search "shoes", the most likely spell errors seen in our app searches are "shor[e]s" or "shi[o]es". On the keyboard layout, 'e' occurs before 'r', so the user intended to tap 'e' but the adjacent letter 'r' got in. Similarly, 'i' occurs before 'o', so the user intended 'o' but the adjacent 'i' got in. The difference in replace-based operations is mainly attributable to such replacements involving adjacent letters on the keyboard layout.
A fun bit about the fat-finger error: the term goes back to investment banking, where traders bidding for a stock commit errors such that the actual bid/order placed is much larger than intended, leading to financial loss, something like short selling 100,000 shares instead of 10,000 due to an extra 0.

Given this insight, How could we make our speller channel-aware and intelligent ?
Giving more weight to replace-based edit operations involving proximate letters in app searches would help increase speller accuracy. As of this writing, none of the commercially available spell-checker modules provide flexibility for such features; even the default speller modules in Apache Solr/Elasticsearch don't support them yet.
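One way to realize this idea is to discount the replace cost when the two letters are adjacent on a qwerty row, so fat-finger candidates rank higher. This is a hedged sketch of the weighting, not a production speller: the 0.5 discount and the row-only notion of adjacency are my assumptions.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def replace_cost(c1, c2):
    """Cheaper replace for letters adjacent on the same qwerty row."""
    for row in QWERTY_ROWS:
        if c1 in row and c2 in row and abs(row.index(c1) - row.index(c2)) == 1:
            return 0.5   # likely fat-finger slip: discount the edit
    return 1.0

def weighted_edit_distance(a, b):
    """Levenshtein DP with proximity-weighted replace costs."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = float(i)
    for j in range(n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else replace_cost(a[i - 1], b[j - 1])
            dp[i][j] = min(dp[i - 1][j] + 1,       # delete
                           dp[i][j - 1] + 1,       # insert
                           dp[i - 1][j - 1] + sub) # weighted replace
    return dp[m][n]
```

Under this weighting, `weighted_edit_distance("panasinic", "panasonic")` is 0.5 (i and o are adjacent keys), so "panasonic" would outrank candidates needing a non-adjacent replace.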

I hope this piece of research helps build upon the accuracy of spell correction. Please feel free to shoot me an email/comment with questions/clarifications.

In this blog post, I am going to discuss mobile advertising networks, the ecosystem and industry trends. This understanding will serve as a base for future posts I plan to write around data science problems in the advertising industry. Let's start top-down. Advertising is a multi-billion dollar industry (estimated at $500Bn a year globally as of 2010). Traditional advertising relied on TV, radio and print: fliers and billboards. With the advent of digital technologies, advertising dollars have been fast shifting to web and mobile. Specifically, mobile growth has outpaced desktop growth due to the availability of inexpensive mobile devices in the form of smartphones and tablets, and reduced mobile data rates in recent days. We know Google, Facebook and Twitter make money mostly through ads. It is important to realize that these companies put data at the centre of all their products, which helps create targeted ads relevant to users, resulting in great CTRs (click-through rate: a key advertising metric measuring ad effectiveness/relevance; more on metrics in my next blog post).

So, what exactly is an Ad Network?
An advertising network is a marketplace for ads, connecting advertisers (the demand side) with publishers (the supply side). It is merely an entity Read More

We talk about big data quite often these days, so I wanted to cover some fundamentals around data. Do you know the singular form of data? How does data differ from information and knowledge? How do insights convert to actions? Here is my attempt at answering some of these.
Data is often raw, in binary, represented as a number, a character or a string. Data is the plural of datum. Information is anything that puts context to data. For example, the number 89 by itself doesn't mean anything unless we fit a context to it, e.g. the car's speed is 89 kmph. Knowledge, on the other hand, is about knowing how things around us work and how the larger world is interconnected; it is obtained from experience, experiment, research, etc. In the infographic below I have tried to explain these terms using a simplified version of the intelligence that can be embedded in cars. More on the infographic follows.
Data 2 Actions
I have taken a very simplified version here. Now just imagine: when we talk about intelligent cars, we are usually talking about hundreds of such parameters instead of a single variable (the car's speed), all collected from multiple sensors in a realtime streaming format to make such decisions. Knowledge is obtained through predictive algorithms which continuously learn [in AI terms, "adapt"] from data and help make recommendations about the safety of the vehicle. Now imagine these hundreds of parameters collected every millisecond from thousands of connected cars around you: this is what forms "Big Data".
Hope you liked this post. I will be writing more articles, as I have been getting requests from friends around the globe. Stay tuned!

I have spent most of my life working with data and its applications to problems. When I look at patterns in the algorithms we use to analyze data, one thing that emerges is the increased use of meta-algorithms. Boosting (like AdaBoost) is one such meta-algorithm, which combines multiple weak learners (classifiers) to improve prediction accuracy. Random forests, a very prominent Read More

We are in the era of big data, with new sources of data emerging at an exponential rate: sensor data, EHR, social network/media data and machine-generated data. In this blog post, I will discuss social network data specifically: its applications to data science problems, solutions and visualizations. In simple terms, a network is a group of nodes interconnected by links (also called edges). In a social network, users are the nodes and connections are the links/edges. Consider a Facebook user's network: by adding friends, we are creating the links. Before getting into more technical details of networks, let's spend some time on the more interesting area: their applications to data science problems.
LinkedIn, Facebook and other social networks use the network information to power "People you may know" and offer people recommendations. Product companies like Microsoft and Oracle use network analytics to identify key influencers in leading tech forums/online community networks, helping market their products through the greater reach of the identified influencers. The WWW is another example of a network: pages are interconnected, and analyzing that network helps us understand information flow across the web. The "People you may know" feature generally works using triangulation, i.e. if A knows B and B knows C, then it is likely that A knows C. Most people-recommendation features work on this principle.
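The triangulation principle can be sketched as a friend-of-friend lookup, ranking candidates by how many mutual friends they share with the user. This is a toy illustration of the idea, not any network's actual recommender:

```python
from collections import Counter

def people_you_may_know(graph, user):
    """graph: dict mapping each user to the set of their friends.

    Returns friends-of-friends who are not already direct friends,
    ranked by number of mutual friends (the triangulation signal).
    """
    counts = Counter()
    for friend in graph[user]:
        for fof in graph[friend]:
            if fof != user and fof not in graph[user]:
                counts[fof] += 1   # one more shared connection
    return [person for person, _ in counts.most_common()]
```

For instance, if A knows only B, and B knows C, then C is recommended to A with one mutual friend.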
Read More

Clustering is an unsupervised classification (learning) technique, where the objective is to maximize inter-cluster distance while minimizing intra-cluster distance. By unsupervised, we mean clustering/segmenting/classifying data based on all the available attributes, with no class information available; supervised classification, on the other hand, uses class information.
As usual, before we jump into the 'how', let's answer the 'why'. Clustering is applied to a variety of problems, ranging from biological systems to exploratory analysis of data (as a pre-processing technique). Many predictive analytics algorithms use clustering solutions as one of their components. It is used by all major brands for CRM, to understand their customers better. Another use of clustering is outlier detection or fraudulent transaction identification. Similar-site discovery services work extensively on clustering algorithms, where sites are segmented/clustered based on website attributes like domain category, number of users, traffic, content type, corporate or personal, blog, image blog, video blog, etc. For example, if you entered InMobi, you would get a list of companies in this space, mainly its competitors: Mojiva, Millennial Media, AdMob, Quattro, Mobclix, etc. If you are looking for an image-hosting site and want to know the alternatives, this will be helpful.

We talk about similarity in terms of distance measures like

(i) Euclidean distance

(ii) Manhattan distance
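For concreteness, the two measures for n-dimensional points can be sketched as:

```python
import math

def euclidean(p, q):
    """Straight-line distance: sqrt of summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))
```

For the points (0, 0) and (3, 4), Euclidean distance is 5.0 while Manhattan distance is 7.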

Read More

Hadoop is an open source framework for writing and running distributed application that process huge amounts of data ( more famously called Big Data). The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term”

It has two components

– Distributed storage (uses HDFS, the Hadoop Distributed File System)
Ensures the data is distributed evenly across all the nodes of the Hadoop cluster. There is an option to replicate data across nodes (redundancy) to provide the ability to recover from failures.

– Distributed computing (uses MR, the MapReduce paradigm)
Once the data is available on the Hadoop cluster, the MR code (typically written in Java or C++) is moved to each of the nodes for computation on the data. MapReduce has two phases: mapper and reducer.
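The two phases can be illustrated with the classic word-count example. This is a single-machine sketch of the MapReduce style (a real Hadoop job would run the mappers and reducers in parallel across nodes, typically in Java):

```python
from itertools import groupby

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return (word, sum(counts))

def map_reduce(lines):
    # The sort stands in for Hadoop's shuffle/sort step, which groups
    # all pairs with the same key onto the same reducer.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(word, [c for _, c in grp])
                for word, grp in groupby(pairs, key=lambda kv: kv[0]))
```

Running `map_reduce(["the cat", "the dog"])` yields `{"cat": 1, "dog": 1, "the": 2}`.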

One of the early examples of distributed computing is the SETI@home project, where volunteers offered CPU time on their personal computers for research on radio-telescope data to find intelligent life outside Earth. However, this differs from Hadoop MR in that data is moved to the place of computing in SETI, while code is moved to the place of the data in the latter. Other projects include finding the largest prime numbers, sorting petabytes of data in the shortest time, etc.

Applications of Hadoop MR – big data

  • Weblog analysis
  • Fraud detection
  • Text mining
  • Search engine indexing
  • LinkedIn uses it for "Who viewed your profile" and "People you may know" recommendations
  • Book recommendations

Hadoop MR wrapper applications include

  • Pig: a data-flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
  • Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (translated by the runtime engine to MapReduce jobs) for querying the data.
  • Mahout: machine-learning implementations on MapReduce.

In the recent past, a variety of new social media sites have emerged: location-based services (like Foursquare, Gowalla), group deals (like Groupon) and microblogging (like Twitter, Facebook). These sites provide integration features with other social media sites; for instance, a Foursquare check-in can be configured to automatically publish a tweet with the URL of the checked-in location. With all this data being primarily open, there are various business opportunities to leverage it by integrating it with internal customer data.
A retailer's major concern is the need to understand their customers better, gaining a 360-degree view of the customer. Most companies have a strategy to integrate internal customer-behavior data across POS, e-commerce, mail order, etc. By leveraging social media data and integrating with location-based services (like Foursquare) and microblogging services (Twitter), retailers now have the ability to track customers.
Read More
