This project seeks to build a classifier to predict someone's gender (binary categories) based on their full names. It is IMPORTANT to note that the model's predictions are only valid for Indonesian names.

LingAdeu/predicting-gender-based-on-name

Using Personal Names to Predict Gender: A 3-Character N-Gram Approach

About

In this project, I investigate whether a conventional machine learning algorithm combined with character n-grams can outperform a character-level Long Short-Term Memory (LSTM) model, which has already achieved an excellent F1 score of 0.93 (Septiandri, 2017). To this end, I compared different models using 3-character n-grams restricted to word boundaries, so that the models can learn the spacing between name parts (e.g., first and last name). The experiment selected a Support Vector Machine (SVM) with a linear kernel as the final model, achieving an F1 score of 0.94, slightly above the LSTM model's performance. Given this result, the project concludes that a conventional model can perform at least as well as an LSTM, a type of Recurrent Neural Network (RNN), when using word-boundary 3-character n-grams.
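To make the idea of word-boundary 3-character n-grams concrete, here is a minimal pure-Python sketch. It is not taken from the notebook: the function name `word_boundary_trigrams`, the lowercasing step, and the sample name are assumptions made for illustration. Each name part is padded with a space on both sides, and n-grams are extracted within each padded part only, so no n-gram spans two name parts while the leading and trailing spaces still tell the model where a part starts and ends.

```python
def word_boundary_trigrams(name, n=3):
    """Return character n-grams that never span two name parts.

    Hypothetical helper for illustration; the notebook may build
    its features differently (e.g. through a library vectorizer).
    """
    grams = []
    for part in name.lower().split():
        # Spaces mark the start and the end of a name part.
        padded = f" {part} "
        if len(padded) <= n:
            # Very short parts yield a single (possibly shorter) gram.
            grams.append(padded)
            continue
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


# Illustrative name, not a record from the dataset.
print(word_boundary_trigrams("Budi Santoso"))
```

In a full pipeline, counts or weights of these n-grams would serve as the input features for a classifier such as the linear SVM.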

Important

For convenience, the notebook that analyzes gender differences in names and trains and tests the model can be read here. This is a notebook-viewer version, which displays the model's pipeline better.

Content

.
├── README.md                   <- The top-level README for using this project
├── data
│   └── indonesian-names.csv    <- The dataset for training and testing the model
├── model
│   └── final_model.pkl         <- The final model (SVM with linear kernel function)
├── notebook
│   └── notebook.ipynb          <- The Jupyter notebook to build the model
├── requirements.txt            <- The requirements file for reproducing the environment
└── src
    └── app.py                  <- Streamlit app

Feedback

If you have any questions or suggestions for improvement, feel free to contact me via LinkedIn or Gmail.
