[DMKD 2024] A Practical Approach to Novel Class Discovery in Tabular Data

ColinTr/PracticalNCD

PracticalNCD

Code used to generate the results of the Data Mining and Knowledge Discovery (DMKD) journal paper "A Practical Approach to Novel Class Discovery in Tabular Data".

License: MIT

🔍 Overview

This Python library provides a set of tools for the machine learning problem of Novel Class Discovery (NCD).

In this library, you will find the following tools illustrated through Jupyter Notebooks:

  • A hyperparameter optimization procedure tailored to transfer the results from the known classes to the novel classes.
  • An estimation of the number of clusters by applying clustering quality metrics in the latent space of NCD methods.
  • Two unsupervised clustering algorithms modified to utilize the data available in the NCD setting.
  • A novel method called PBN (for Projection-Based NCD).
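
One way to picture the first item is to hide a few known classes in each cross-validation fold and treat them as stand-in novel classes, so that hyperparameters are scored on a task that resembles the real one. The sketch below only builds such class folds. It is an illustration under that assumption, not the repository's implementation, and the name `hidden_class_folds` is hypothetical.

```python
import random
from typing import Hashable, Iterator, List, Sequence, Tuple


def hidden_class_folds(
    known_classes: Sequence[Hashable], n_folds: int, seed: int = 0
) -> Iterator[Tuple[List[Hashable], List[Hashable]]]:
    """Yield (visible, hidden) class lists; every known class is hidden exactly once.

    Hypothetical helper: the hidden classes play the role of novel classes
    while hyperparameters are tuned on labelled data only.
    """
    classes = list(known_classes)
    if not 2 <= n_folds <= len(classes):
        raise ValueError("n_folds must be between 2 and the number of known classes")
    # Shuffle with a private, seeded generator so the folds are reproducible.
    random.Random(seed).shuffle(classes)
    for i in range(n_folds):
        hidden = classes[i::n_folds]  # every n_folds-th class, starting at offset i
        hidden_set = set(hidden)
        visible = [c for c in classes if c not in hidden_set]
        yield visible, hidden


folds = list(hidden_class_folds(range(6), n_folds=3))
for visible, hidden in folds:
    print("train on", sorted(visible), "| pretend these are novel:", sorted(hidden))
```

Each candidate hyperparameter setting would then be scored on how well the hidden classes are recovered, averaged over the folds.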

🐍 Setting up the Python environment

Option 1 - With Anaconda:

# Create the virtual environment and install the packages with conda
conda env create --file environment.yml --prefix ./venvpracticalncd

# Activate the virtual environment (on Linux: conda activate ./venvpracticalncd)
conda activate .\venvpracticalncd

# Add package missing from conda repositories
pip install iteration-utilities==0.11.0

Option 2 - Without Anaconda:

Prerequisite: Python 3.10.9 installed as the default Python 3.10 version.

# Create the empty virtual environment
# (py is the Windows launcher; on Linux, use: python3.10 -m venv venvpracticalncd)
py -3.10 -m venv venvpracticalncd

# Activate the virtual environment
# On windows:
  .\venvpracticalncd\Scripts\activate
# On linux:
  source venvpracticalncd/bin/activate
  
# Install the needed packages
pip install -r requirements.txt

# And finish by installing pytorch independently
pip install torch==1.12.1 --index-url https://download.pytorch.org/whl/cu113

Finishing touches

# Add the virtual environment as a jupyter kernel
ipython kernel install --name "venvpracticalncd" --user

# Check if torch supports GPU (you need CUDA 11 installed)
python -c "import torch; print(torch.cuda.is_available())"
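
For a slightly more detailed check than the one-liner above, a small script can report the interpreter version and whether torch and CUDA are usable. This is a minimal sketch. The function name `environment_report` and the keys of the returned dictionary are illustrative and not part of this repository.

```python
import importlib.util
import sys
from typing import Any, Dict


def environment_report() -> Dict[str, Any]:
    """Summarise the active interpreter and the torch/CUDA setup (hypothetical helper)."""
    report: Dict[str, Any] = {
        "python": "%d.%d.%d" % sys.version_info[:3],
        # The setup instructions target Python 3.10.
        "python_is_3_10": sys.version_info[:2] == (3, 10),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the script also runs without torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `cuda_available` is False even though a GPU is present, the installed torch build or the GPU driver is the usual place to look.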

💻 Usage

Three notebooks are available:

  • Full_notebook.ipynb lets you train and evaluate the models when the number of clusters k is known in advance.
  • Full_notebook_with_k_estimation.ipynb does the same when k is not known in advance and has to be estimated.
  • results_wrt_n_unknown_classes.ipynb evaluates the performance of all the models as the number of novel classes increases. It was used to generate Figure C1 of Appendix C of the paper.
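
The idea behind estimating k, scoring several candidate cluster counts with a clustering quality metric computed on a latent representation, can be sketched as follows. This is an illustration only. The silhouette score and the tiny k-means are stand-ins chosen for this example, the `latent` points are toy data rather than the output of an NCD model, and none of the function names come from the notebooks.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

Point = Sequence[float]


def kmeans_labels(points: Sequence[Point], k: int, n_iter: int = 50) -> List[int]:
    """Minimal deterministic k-means with farthest-point initialisation."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mean_silhouette(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters: Dict[int, List[Point]] = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # Mean distance to the other members of the same cluster (dist(p, p) is 0).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Mean distance to the closest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


def estimate_k(latent: Sequence[Point], candidates: Iterable[int]) -> Tuple[int, Dict[int, float]]:
    """Return the candidate k with the highest mean silhouette, plus all scores."""
    scores = {k: mean_silhouette(latent, kmeans_labels(latent, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Toy "latent space": three well-separated groups of four points each.
latent = [(x + dx, y + dy) for x, y in [(0, 0), (10, 10), (0, 20)] for dx in (0, 1) for dy in (0, 1)]
best_k, scores = estimate_k(latent, range(2, 6))
print("estimated k:", best_k)
```

With real data, the latent vectors would come from a trained model and the candidate range would be chosen from prior knowledge of the task.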

📊 Datasets

The datasets are automatically downloaded from https://archive.ics.uci.edu/ on the first execution.
If the download fails, try disabling any proxies.

Note that the data splits for some datasets are random, so the results can vary from those reported in the paper.

The most affected datasets are:

  • LetterRecognition
  • USCensus1990
  • multiple_feature
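
To keep your own runs comparable with each other, a split can be driven by a fixed seed. The sketch below shows the principle with the standard library only. `seeded_split` is a hypothetical helper, not the splitting code used in this repository, and fixing a seed makes your splits repeatable without reproducing the exact splits behind the paper's numbers.

```python
import random
from typing import List, Tuple


def seeded_split(n_samples: int, test_fraction: float = 0.2, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices); the same seed always gives the same split."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_samples))
    # A private generator avoids touching the global random state.
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_a, test_a = seeded_split(100, test_fraction=0.2, seed=42)
train_b, test_b = seeded_split(100, test_fraction=0.2, seed=42)
print("same split on both calls:", (train_a, test_a) == (train_b, test_b))
```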

📜 Citation

If you found this work useful, please use the following citation:

@article{tr2024practical,
   title = {A Practical Approach to Novel Class Discovery in Tabular Data},
   author = {Troisemaine, Colin and Reiffers{-}Masson, Alexandre and Gosselin, St{\'{e}}phane and Lemaire, Vincent and Vaton, Sandrine},
   journal = {Data Mining and Knowledge Discovery},
   year = {2024},
   month = {May},
   day = {31},
   issn = {1573-756X},
   doi = {10.1007/s10618-024-01025-y}
}

⚖️ License

Copyright (c) 2023 Orange.

This code is released under the MIT license. See the LICENSE file for more information.
