Deep neural networks have shown promising results on problems in vision, speech and text, with varying degrees of success. Here I look at a text problem: predicting a person's gender from their name. RNNs are a good fit because the task involves learning from sequences (in this case, sequences of characters). Traditional RNNs are hard to train because of vanishing gradients. Two RNN variants help address this problem:
(i) LSTM, or long short-term memory: uses a memory cell and gates (such as the forget gate) to retain or discard patterns learnt in the sequence that are useful for predicting the target variable. We will use this for our model. (I recommend colah’s blog for a deeper understanding of the theory behind LSTMs.)
(ii) GRU, or gated recurrent unit.
To state the problem formally: given a name, we want to predict whether it is male or female. There have been earlier attempts to predict gender with simple rules on the name. The example in NLTK, for instance, relies on the last character of the name to classify gender, which gives poor accuracy because the rule over-generalizes.
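To make the last-character idea concrete, here is a minimal sketch of such a rule in plain Python. It is not NLTK's code: the function names, the toy name list and the default label are all made up for illustration. It simply predicts the gender seen most often for each final letter.

```python
from collections import Counter, defaultdict


def train_last_letter_rule(labelled_names):
    """Map each final letter to the gender seen most often with it."""
    counts = defaultdict(Counter)
    for name, gender in labelled_names:
        cleaned = name.strip().lower()
        if cleaned:
            counts[cleaned[-1]][gender] += 1
    return {letter: c.most_common(1)[0][0] for letter, c in counts.items()}


def predict_gender(rule, name, default="m"):
    """Predict from the last character only; fall back to a default label."""
    cleaned = name.strip().lower()
    if not cleaned:
        return default
    return rule.get(cleaned[-1], default)


# Toy data, invented for this example only.
toy_names = [
    ("priya", "f"), ("anita", "f"), ("kavita", "f"),
    ("rahul", "m"), ("amit", "m"), ("krishna", "m"),
]
rule = train_last_letter_rule(toy_names)
```

With this toy data every name ending in "a" is predicted as female, so "krishna" is misclassified even though it appears in the training list. That is the kind of over-generalization a character-sequence model is meant to avoid.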
We use the character sequence that makes up the name as our X variable, and the gender (m/f) as our Y variable. The model is a stacked LSTM followed by a final dense layer with softmax activation (a many-to-one setup). It is trained with categorical cross-entropy loss and the Adam optimizer. A 20% dropout layer is added for regularization to avoid over-fitting. The schematic shows the model setup.
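A rough Keras sketch of this setup could look like the following. The post only fixes the ingredients: stacked LSTM, 20% dropout, dense softmax output, categorical cross-entropy and Adam. The number of stacked layers (two), the hidden size (128), the position of the dropout layer and the name build_model are my assumptions here. The input shape (30, 39) comes from the sequence length and vocabulary size described in the implementation section.

```python
from tensorflow.keras import Input
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential

MAX_LEN = 30     # characters per name (from the post)
VOCAB_SIZE = 39  # size of the one-hot vector per character (from the post)


def build_model(hidden_units=128):
    """Stacked LSTM, many-to-one. hidden_units=128 is an assumed value."""
    model = Sequential([
        Input(shape=(MAX_LEN, VOCAB_SIZE)),
        # First LSTM returns the full sequence so the next LSTM can consume it.
        LSTM(hidden_units, return_sequences=True),
        # Second LSTM returns only its final state (many-to-one).
        LSTM(hidden_units),
        # 20% dropout; placing it after the LSTM stack is an assumption.
        Dropout(0.2),
        # Two output classes: male, female.
        Dense(2, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model
```

Training would then be a call along the lines of model.fit(X, y, batch_size=1000, epochs=50), where X has shape (samples, 30, 39) and y has shape (samples, 2).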
About the dataset
We use the Indian names dataset available in the mbejda GitHub account, a collection of male and female Indian names compiled from public records. Basic preprocessing removes duplicates and special characters. The final distribution of m/f classes is 55%:45%. Note that we use the full name here; some names have more than two words, depending on surname, family name, etc.
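The preprocessing can be sketched roughly as below. The exact cleaning rules are not spelled out in this post, so the details are assumptions: names are lower-cased, anything other than a-z, 0-9, space and dot is dropped, and duplicates are removed after cleaning. The helper names clean_name and deduplicate are made up for this example.

```python
import re

# Assumption: only lowercase letters, digits, space and dot are kept.
_DISALLOWED = re.compile(r"[^a-z0-9. ]")


def clean_name(raw):
    """Lower-case, strip special characters and collapse repeated spaces."""
    name = _DISALLOWED.sub("", raw.lower())
    return " ".join(name.split())


def deduplicate(records):
    """Keep the first occurrence of each (cleaned name, gender) pair."""
    seen = set()
    unique = []
    for raw, gender in records:
        key = (clean_name(raw), gender)
        if key[0] and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

For example, deduplicate([("Ravi Kumar", "m"), ("ravi  kumar@", "m")]) keeps a single ("ravi kumar", "m") record.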
Implementation using Keras
The complete dataset, code and Python notebook are available in my GitHub repo. The implementation uses Keras with the TensorFlow backend. The input representation and parameters are listed below:
(i) Vocabulary: a set of 39 characters comprising a-z, 0–9, space, dot and a special END token.
(ii) Max sequence length: 30 characters. Names longer than 30 characters are truncated; names shorter than 30 characters are padded with the “END” token.
(iii) One-hot encoding: each character is one-hot encoded as an array of dimension [1 X 39].
(iv) Batch size: 1000 samples per batch.
(v) Epochs: 50, i.e. we iterate over the entire dataset 50 times (each epoch sees every sample once).
(vi) Y labels: represented as an array of [1 X 2], with the first column indicating male and the second column female, e.g. [1 0] for male.
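The input representation above can be sketched in plain Python as follows. The vocabulary contents, sequence length and label layout come from the list. The ordering of characters inside the vocabulary, the choice to silently drop characters outside it, and the helper names are assumptions for illustration; the notebook may do this differently.

```python
import string

MAX_LEN = 30
END = "END"  # special padding token

# 26 letters + 10 digits + space + dot + END = 39 symbols.
# The index order is an assumption; any fixed order works.
VOCAB = list(string.ascii_lowercase) + list(string.digits) + [" ", ".", END]
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(VOCAB)}

# First column = male, second column = female.
LABELS = {"m": [1, 0], "f": [0, 1]}


def to_tokens(name):
    """Truncate to MAX_LEN characters, then pad with END up to MAX_LEN."""
    # Assumption: characters outside the vocabulary are dropped.
    chars = [ch for ch in name.lower() if ch in CHAR_TO_INDEX]
    chars = chars[:MAX_LEN]
    return chars + [END] * (MAX_LEN - len(chars))


def one_hot(token):
    """Return a [1 X 39] vector with a single 1 at the token's index."""
    vec = [0] * len(VOCAB)
    vec[CHAR_TO_INDEX[token]] = 1
    return vec


def encode_name(name):
    """Encode a name as a 30 x 39 nested list of 0/1 values."""
    return [one_hot(token) for token in to_tokens(name)]
```

Each name therefore becomes a 30 x 39 matrix, so a batch of 1000 names converted to an array has shape (1000, 30, 39), and the matching labels have shape (1000, 2).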