In this project, I investigate whether a conventional machine learning algorithm with character n-grams can outperform a character-level Long Short-Term Memory (LSTM) network, which already achieved an excellent F1 score of 0.93 (Septiandri, 2017). To this end, I compared different models using 3-character n-grams extracted within word boundaries, so that the models can learn the spacing between name parts (e.g., first and last name). The experiment selected a Support Vector Machine (SVM) with a linear kernel as the final model, achieving an F1 score of 0.94, slightly above the LSTM model's performance. Given this higher performance over the char-LSTM, this project concludes that a conventional model can perform at least as well as an LSTM, a type of Recurrent Neural Network (RNN), when using word-boundary 3-character n-grams.
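The approach above can be sketched as a scikit-learn pipeline. This is a minimal illustration, not the project's exact configuration: the names, labels, and hyperparameters below are made up, and the real training data lives in `data/indonesian-names.csv`. The `char_wb` analyzer is one way to get character n-grams restricted to word boundaries, padding each name part with spaces so the model sees where parts begin and end.

```python
# Hypothetical sketch: word-boundary 3-char n-grams fed into a linear-kernel SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy examples only; the actual project trains on data/indonesian-names.csv.
names = ["Budi Santoso", "Siti Rahayu", "Agus Wibowo", "Dewi Lestari"]
genders = ["m", "f", "m", "f"]

model = Pipeline([
    # "char_wb" builds character n-grams only from text inside word
    # boundaries, so a trigram never spans two name parts.
    ("ngrams", CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))),
    ("svm", SVC(kernel="linear")),
])
model.fit(names, genders)

print(model.predict(["Putri Handayani"]))
```

The pipeline exposes the usual `fit`/`predict` interface, so it can be swapped against other classifiers for comparison.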
Important
For convenience, the notebook for exploring gender differences based on names and for training and testing the model can be read here. This notebook-viewer version renders the model's pipeline better.
.
├── README.md <- The top-level README for using this project
├── data
│ └── indonesian-names.csv <- The dataset for training and testing the model
├── model
│ └── final_model.pkl <- The final model (SVM with linear kernel function)
├── notebook
│ └── notebook.ipynb <- The Jupyter notebook to build the model
├── requirements.txt <- The requirements file for reproducing the environment
└── src
└── app.py <- Streamlit app
If you have any questions or suggestions for improvement, feel free to contact me here: