In this project, I investigate whether a conventional machine learning algorithm with character n-grams can outperform a character-level Long Short-Term Memory (LSTM) network, which already achieved an excellent F1 score of 0.93 (Septiandri, 2017). To this end, I compared different models using 3-character n-grams extracted within word boundaries, so that the models can learn the spacing between name parts (e.g., first and last name). The experiment selected a Support Vector Machine (SVM) with a linear kernel as the final model, achieving an F1 score of 0.94, slightly above the LSTM model's performance. Given this higher performance over the char-LSTM, this project concludes that a conventional model can perform at least as well as an LSTM, a type of Recurrent Neural Network (RNN), when using word-boundary 3-character n-grams.
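The approach above can be sketched as a scikit-learn pipeline. This is a minimal illustration, not the project's exact configuration: the names, labels, and hyperparameters below are made up, and the real training data lives in `data/indonesian-names.csv`. The `char_wb` analyzer is one way to get character n-grams restricted to word boundaries, padding each name part with spaces so the model sees where parts begin and end.

```python
# Hypothetical sketch: word-boundary 3-char n-grams fed into a linear-kernel SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy examples only; the actual project trains on data/indonesian-names.csv.
names = ["Budi Santoso", "Siti Rahayu", "Agus Wibowo", "Dewi Lestari"]
genders = ["m", "f", "m", "f"]

model = Pipeline([
    # "char_wb" builds character n-grams only from text inside word
    # boundaries, so a trigram never spans two name parts.
    ("ngrams", CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))),
    ("svm", SVC(kernel="linear")),
])
model.fit(names, genders)

print(model.predict(["Putri Handayani"]))
```

The pipeline exposes the usual `fit`/`predict` interface, so it can be swapped against other classifiers for comparison.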
Important
For convenience, the notebook for exploring gender differences based on names and for training and testing the model can be read here. This notebook-viewer version renders the model's pipeline better.
.
├── README.md <- The top-level README for using this project
├── data
│ └── indonesian-names.csv <- The dataset for training and testing the model
├── model
│ └── final_model.pkl <- The final model (SVM with linear kernel function)
├── notebook
│ └── notebook.ipynb <- The Jupyter notebook to build the model
├── requirements.txt <- The requirements file for reproducing the environment
└── src
└── app.py <- Streamlit app
If you have any questions or suggestions for improvement, feel free to contact me here: