Deep learning gender from name -LSTM Recurrent Neural Networks

Deep learning neural networks have shown promising results in problems related to vision, speech and text with varying degrees of success. I have tried looking at a text problem here, where we are trying to predict gender from name of the person. RNNs are a good fit for this as it involves learning from sequences (in this case sequence of characters). Traditional RNNs have learning problems due to vanishing gradients. Recent advancements have shown two variants of RNN can help solve the problem
(i) LSTM or Long short term memory- uses memory/forget gates to retain or pass patterns learnt in sequence useful for predicting target variable. we will use this for our model. (recommend colah’s blog for a deeper understanding of theory behind LSTM)
(ii) GRU or gated recurrent unit
To formally state the problem, we are interested in predicting if the given name is male/female. In the past, there have been attempts to predict gender based on simple rules on name as seen in NLTK, for example that relies on last character in name to classify gender which suffers from poor accuracy due to high generalization.
we will use character sequences which make up the name as our X variable, with Y variable as m/f indicating the gender. we use a stacked LSTM model and a final dense layer with softmax activation (many-to-one setup). categorical cross-entropy loss is used with adam optimizer. A 20% dropout layer is added for regularization to avoid over-fitting. The schematic shows the model setup.

LSTM_RNN_architecture

About the dataset

We use indian names dataset available in mbejda github account which has a collection of male and female indian name database collected from public records. Basic preprocessing is done to remove duplicates, special characters. Final distribution of m/f classes is 55%:45%. It is to be noted we use the full name here. some names have more than 2 words depending on surname, family name,etc.

Implementation using keras

The complete dataset, code and python notebook is avaialble in my github repo.The complete implementation is done using keras with tensorflow backend. Input representations & parameters are highlighted below

(i) vocabulary : we have a set of 39 chars including a-z, 0–9, space, dot and a special END token.
(ii) max sequence length: is chosen as 30 ie. chars exceeding 30 chars are truncated. if name has less than 30 chars “END” token is padded.
(iii) one hot encoding: each of the character is one-hot encoded represented as [1 X 39] dimension array.
(iv) batch size: 1000 samples in a batch
(v) epochs : 50 epochs or 50 times we iterate over the entire dataset (see once)
(vi)Y labels : represented as array of [1 X 2] with first column indicating male, second col for female. ex: [1 0] for male.

acc_charts_png

Loss & Accuracy charts as a function of epochs. As seen from loss charts, after 40 epochs the validation loss saturates while training loss keep reducing indicating over-fitting problems.

loss_charts_png

The model took ~ 2 hours to run on my moderate high end laptop (CPU based). The final classification accuracy stands at 86% on validation dataset. More stats around precision,recall and confusion matrix is presented below.

Model shows a higher precision for identifying male (88%) vs. female(84%). Of the total 3,028 samples in validation, 411 samples were incorrectly predicted leading to accuracy of 86.43% and error of 13.5%

Interpreting the Model (what did the model learn ?)

looking at the predictions made on validation set, the model seems to have learnt following patterns in sequences. For example occurance“mr.” implies a likely male name, while occurance of “smt” and “mrs” are indicative of female names. The beauty of deep learning is about end to end learning capability without needing an explicit feature engineering step needed in traditional learning. Additionally female names tend to end with vowels a,i,o which model seems to have picked. There are more complex patterns picked up by model, not easily visible through inspection.

Next steps

Some thoughts on improving the model accuracy are

(i) Using pre-trained character embedding instead of one-hot encoding (like this)
(ii) Better sampling
a. limit to first name and last name.
b. sub-sample male class to balance m-f classes as 1:1.
c. remove all non-chars from vocabulary.
(iii) hyper-parameter tuning
a. max length of sequence
b. stacked layers.
c. LSTM gates

I would love to hear your comments / questions / suggestions regarding the post, please share if you find this interesting.

Advertisements
2 comments
  1. mohit narang said:

    great one , I am having a similar model for data quality to predict whether a customer name is correct or not.

    • cool one Mohit ! curious to know, when you mean correct – did you mean spell errors ?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: