This project compares different supervised learning models for the classification of particles produced during inelastic electron-proton scattering. The goal is to identify each particle and determine the best model among the following:
- Decision Tree
- Random Forest
- Multilayer Perceptron
- K-Nearest Neighbor
The data are the simulated responses of six different detectors, produced with the GEANT4 simulation platform. The dataset is available on Kaggle.
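As a minimal sketch of what the loaded data look like, the frame below mimics the column layout described in the feature table; in the real project the frame would come from `pd.read_csv` on the Kaggle CSV (the file name is not given here, so synthetic rows are used instead):

```python
import pandas as pd

# Synthetic stand-in for the Kaggle detector-response dataset,
# with one illustrative row per particle class.
df = pd.DataFrame({
    "id":    [-11, 211, 321, 2212],     # particle identifier
    "p":     [1.2, 0.8, 1.5, 2.1],      # momentum
    "theta": [0.3, 0.5, 0.2, 0.4],      # scattering angle
    "beta":  [0.999, 0.98, 0.95, 0.91], # relativistic velocity
    "nphe":  [12, 0, 0, 0],             # number of photoelectrons
    "ein":   [0.09, 0.01, 0.0, 0.0],    # input energy
    "eout":  [0.08, 0.0, 0.0, 0.0],     # output energy
})
print(df.describe())
```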
The features in the dataset are as follows:

Feature | Meaning | Dimension |
---|---|---|
id | Particle name | |
p | Momentum | |
theta | Scattering angle | |
beta | Relativistic velocity | |
nphe | Number of photoelectrons | |
ein | Input energy | |
eout | Output energy | |
Each id corresponds to a specific particle:

id | Particle | Symbol | Mass (MeV) |
---|---|---|---|
-11 | Positron | e⁺ | 0.511 |
211 | Pion | π⁺ | 139.57 |
321 | Kaon | K⁺ | 493.68 |
2212 | Proton | p | 938.27 |
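The id-to-particle correspondence can be encoded as a simple lookup table; the dictionary below is illustrative (the names and standard rest masses in MeV follow the table above):

```python
# Map each dataset id to its particle name and rest mass (MeV).
PARTICLES = {
    -11:  ("positron", 0.511),
    211:  ("pion",     139.57),
    321:  ("kaon",     493.68),
    2212: ("proton",   938.27),
}

def particle_name(pid):
    """Return the particle name for a given dataset id."""
    return PARTICLES[pid][0]

def rest_mass(pid):
    """Return the rest mass in MeV for a given dataset id."""
    return PARTICLES[pid][1]

print(particle_name(2212), rest_mass(2212))
```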
Regardless of the underlying physical model (here, the Standard Model), it is sufficient to know that each particle is characterized by a specific set of values, in particular its rest mass.
The following libraries were used for preliminary data preparation:
- NumPy
- Matplotlib
- Pandas
- Seaborn
- Imbalanced-learn
In addition, physical and intuitive observations enabled further preparation and simplification of the dataset.
Finally, before building and training the machine learning models, a resampling procedure was applied to balance the dataset.
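The report does not specify which resampling strategy from Imbalanced-learn was used, so the sketch below shows plain random oversampling (a stand-in for `RandomOverSampler`), implemented with scikit-learn utilities: minority-class rows are replicated until every class matches the majority-class count.

```python
import numpy as np
from sklearn.utils import resample

def oversample(X, y, random_state=0):
    """Random oversampling: replicate minority-class rows until every
    class matches the majority-class count."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [], []
    for c in classes:
        Xc, yc = X[y == c], y[y == c]
        Xr, yr = resample(Xc, yc, replace=True, n_samples=n_max,
                          random_state=random_state)
        X_parts.append(Xr)
        y_parts.append(yr)
    return np.vstack(X_parts), np.concatenate(y_parts)

# Toy imbalanced set: four samples of class 0, one of class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 0, 0, 1])
X_bal, y_bal = oversample(X, y)
print(np.bincount(y_bal))  # both classes now have 4 samples
```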
Supervised learning models were chosen, in particular classifiers. These, based on the previously mentioned algorithms, are provided by the Scikit-learn library:
- DecisionTreeClassifier
- RandomForestClassifier
- MLPClassifier
- KNeighborsClassifier
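A minimal training-and-comparison loop over the four classifiers might look as follows; synthetic data from `make_classification` stands in for the six detector features and four particle classes, and default hyperparameters are assumed except where noted:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 6 features (like the detector responses), 4 classes.
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "ML Perceptron": MLPClassifier(max_iter=500, random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbor": KNeighborsClassifier(),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {results[name]:.3f}")
```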
Specifically, for the RandomForestClassifier the importance of each feature in model training was evaluated, confirming what had been inferred during the exploratory analysis.
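Feature importances come directly from the fitted forest's `feature_importances_` attribute; the sketch below uses synthetic data, with feature names taken from the dataset table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Feature names follow the dataset; the data here are synthetic.
features = ["p", "theta", "beta", "nphe", "ein", "eout"]
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

# Importances sum to 1; sort to see the most informative features first.
for name, imp in sorted(zip(features, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```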
For the MLPClassifier, the hyperparameters were optimized by means of a grid search.
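The report does not list which hyperparameters were searched, so the grid below uses common MLP choices (hidden layer sizes and the L2 penalty `alpha`) purely as an example of `GridSearchCV`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Hypothetical grid: values chosen for illustration only.
param_grid = {
    "hidden_layer_sizes": [(50,), (100,)],
    "alpha": [1e-4, 1e-3],
}

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
search = GridSearchCV(MLPClassifier(max_iter=300, random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```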
Finally, accuracy and, visually, confusion matrices were used as metrics to evaluate the models. The results do not show a clear predominance of one model over the others, although the KNeighborsClassifier is the least efficient:
Classifier | Accuracy |
---|---|
Decision Tree | 89.6% |
Random Forest | 93.2% |
ML Perceptron | 93.1% |
K-Nearest Neighbor | 88.6% |
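Both metrics above are available in `sklearn.metrics`; a toy example on made-up labels shows how they are computed:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy true/predicted labels, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(accuracy_score(y_true, y_pred))   # fraction of correct labels
print(confusion_matrix(y_true, y_pred)) # rows: true class, cols: predicted
```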