Quick Start Notebook #186

FarnazH · 2023-11-12T19:11:33Z

This PR contains quick_start.ipynb, which showcases various functionalities of the package alongside a clear comparison of methods. While working on this notebook, I improved the package code and docstrings and fixed some bugs. These changes were pushed directly to the main branch so that we can move faster with our release. The method comparison figures (selecting from one cluster) have been added to the paper. There is still some work that needs to be done.

@marco-2023, can you please:

Complete the two TODO items comparing the selections through diversity measures.
Add an example of the selection method.
Review this notebook (feel free to go ahead and make changes and push to this branch).

@maximilianvz, can you please review this PR and share any comments you have on the notebook?

The changes made to the notebook were:

1. Added a function to render the result tables
2. Measured the diversity of the selected sets
3. Added selections based on n-similarity methods

Additionally:

4. Added an example using n-similarity methods to compute diversity

marco-2023 · 2023-11-28T20:10:42Z

@FarnazH @maximilianvz I went through the notebook and:

1. Added a function to render the tables (list of lists) as markdown cells
2. Completed the TODO items
3. Added an example of selection using the n-similarity methods
4. Used the n-similarity methods to compute diversity in one case (this is extra, and I would like your opinion on it)

Can you tell me what you think about it after a quick look?

Should we calculate item 4 (diversity computed with the n-similarity methods) for all selections, or drop it altogether?
All n-similarity methods selected the same points. According to Ramon, this is to be expected due to the (low) dimensionality of the problem at hand. Should we just use one then?

Several things to note are:

I was only able to use two diversity measures from the diversity module. The rest need binary chains as elements.
The documentation of the n-similarity methods is not being generated on the website.
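For reference, a table-rendering helper of the kind mentioned above (a list of lists turned into a markdown table) could look roughly like the sketch below. This is not the notebook's actual implementation: the signature, the float formatting, and the sample rows are assumptions made for illustration, and the numbers are placeholders rather than results.

```python
def render_table(data, caption=None):
    """Return a markdown table for a list of rows; the first row is the header.

    Hypothetical sketch: the real notebook helper may differ in signature
    and formatting.
    """
    if not data:
        return ""
    header, *rows = data

    def fmt(value):
        # Floats get a fixed precision; everything else is stringified as-is.
        return f"{value:.3f}" if isinstance(value, float) else str(value)

    lines = [
        "| " + " | ".join(fmt(cell) for cell in header) + " |",
        "|" + "|".join(" --- " for _ in header) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(fmt(cell) for cell in row) + " |")
    table = "\n".join(lines)
    return f"**{caption}**\n\n{table}" if caption else table


# Placeholder rows, not values taken from the notebook.
table = render_table([["Method", "Diversity"], ["method A", 1.5], ["method B", 1.25]])
print(table)
# In a notebook, the string can be shown with
# IPython.display.display(IPython.display.Markdown(table)).
```

Returning a string instead of displaying it directly keeps the helper easy to test outside a notebook.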

maximilianvz · 2023-12-01T21:04:25Z

@FarnazH and @marco-2023, I have several comments:

There are some typos in the notebook (I may have missed some, so I encourage others to check for this, too):

On 9 occasions, "medoid" is misspelled as "mediod".
In the explanation under Example 1: MaxMin Selector, it is stated that "This can a user-defined function or a sklearn.metrics.pairwise_distances function". This needs to be changed to "This can BE a user-defined function or a sklearn.metrics.pairwise_distances function".
In the documentation for the render_table function, the following is said: "The data to be rendered in a table, each inner list represents a row with the first row being the header. All" I believe the trailing "All" was written mistakenly.
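To make the sentence quoted in the second item concrete, here is a minimal sketch of a greedy MaxMin selection driven by a user-defined distance function. It does not use the package's API: `pairwise`, `maxmin_select`, and the `start` argument are hypothetical names, and the toy points are invented. A matrix produced by `sklearn.metrics.pairwise_distances` could be passed in place of the hand-built one.

```python
import math


def pairwise(points, metric=math.dist):
    """Build a full distance matrix from a user-defined metric."""
    return [[metric(p, q) for q in points] for p in points]


def maxmin_select(dist, size, start=0):
    """Greedy MaxMin: repeatedly add the point farthest from the selected set.

    Hypothetical helper for illustration; not the package's selector class.
    """
    if size > len(dist):
        raise ValueError("size exceeds the number of points")
    selected = [start]
    # Distance from every point to its nearest selected point.
    min_dist = list(dist[start])
    while len(selected) < size:
        candidates = [i for i in range(len(dist)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(a, b) for a, b in zip(min_dist, dist[nxt])]
    return selected


# Toy data: two near-duplicate points plus three well-separated ones.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
dist = pairwise(points)
print(maxmin_select(dist, size=3))
```

The near-duplicate at index 1 is skipped because its distance to the already selected point at index 0 is the smallest of all candidates.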

In this part of the notebook, the header is "N-Similarity based methods". However, the title of the plot in this section is "Comparing (N)Similarity-Based Selectors". I'd appreciate consistency in how "N-Similarity Based"/"(N)Similarity-Based" is written throughout the notebook.

There is a warning in this part of the notebook (which doesn't seem to affect performance) that would be nice to have removed.

This is a small thing, but for new users, it may be helpful to include a sentence at the end of the explanation here that clarifies that larger values for logdet and smaller values for wdud correspond to greater diversity.
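As a rough illustration of the logdet direction, the sketch below assumes the measure is defined as log det(XᵀX + I) for a feature matrix X; that definition is an assumption here and should be checked against the diversity module. `logdet_diversity` is a hypothetical name, and the two toy sets are invented. With NumPy, `numpy.linalg.slogdet` would be the usual route; plain Python is used only to keep the example dependency-free. wdud is not sketched.

```python
import math


def logdet_diversity(features):
    """Compute log det(X^T X + I) for X given as a list of feature rows.

    Assumed definition for illustration; verify against the package.
    """
    n_feat = len(features[0])
    # Gram matrix over features plus the identity (symmetric positive definite).
    gram = [
        [
            sum(row[i] * row[j] for row in features) + (1.0 if i == j else 0.0)
            for j in range(n_feat)
        ]
        for i in range(n_feat)
    ]
    # Cholesky factorization M = L L^T, so log det M = 2 * sum(log L_ii).
    low = [[0.0] * n_feat for _ in range(n_feat)]
    for i in range(n_feat):
        for j in range(i + 1):
            s = gram[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(n_feat))


# Invented toy sets: one tightly packed, one spread out.
tight = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(logdet_diversity(tight), logdet_diversity(spread))
```

Under this definition the spread-out set scores higher, which matches the "larger logdet means more diverse" reading suggested above.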

When comparing multiple selection methods on data within a single cluster, comparisons of selection diversity are given for distance-based, partition-based, and N-similarity-based methods. However, in the latter half of the notebook, where selection is done for data with multiple clusters, there is no such diversity comparison performed. It would probably be best to add this.
@marco-2023, you would know more about this than me, but you point out that the selected sets are the same for all similarity indices, which is a consequence of the data being low-dimensional. I assume this is unavoidable if we're using 2-dimensional data (which was done so things could be easily visualized), but I'm not sure how effectively this conveys the usefulness of the various similarity indices of the N-similarity-based methods offered by the package. It may not be worth the effort (i.e., I'm not adamant that things need to be changed here), but perhaps we could use a higher-dimensional example and sacrifice visualization in instances where we want to showcase the N-similarity-based methods and how indices affect diversity. If this were done, we should include a note warning that the choice of similarity index won't affect selection diversity when working in low-dimensional space. Alternatively, we can just use one similarity index and provide this warning.

(This is unrelated to the notebook.) @FarnazH, correct me if I'm wrong, but I believe we should update the documentation of OptiSim, which currently states that the medoid center is chosen as the initial point. Like DISE, OptiSim has a ref-index argument with a default value of zero, so it isn't guaranteed that the medoid center is the initial point (in most cases, I'd imagine it won't be).
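The point about the initial sample can be checked with a small sketch: compute the medoid of a toy dataset and compare its index with a reference index of zero. Nothing here calls OptiSim or DISE; `medoid_index` is a hypothetical helper, the points are invented, and `ref_index` simply mirrors the argument named in the comment.

```python
import math


def medoid_index(points):
    """Index of the point with the smallest summed distance to all others."""
    totals = [sum(math.dist(p, q) for q in points) for p in points]
    return min(range(len(points)), key=totals.__getitem__)


# Invented data: the first row is an outlier, the last one sits in the middle.
points = [(5.0, 5.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
ref_index = 0  # default value mentioned in the comment above

print("medoid index:", medoid_index(points))
print("starts at medoid:", medoid_index(points) == ref_index)
```

Unless the data happen to be ordered so that the medoid comes first, a default reference index of zero starts the selection from a different point than the medoid.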

FanwangM · 2024-06-27T21:38:40Z

When computing diversity, a distance matrix is used. I think we should use the feature matrix instead.

FanwangM

Thanks for sharing the updated notebooks. They look good to me. I did some cleanup and will keep working on these notebooks.
I will merge them first and then get a cleaner version soon.

Add new quick_start.ipynb notebook

ea4e15a

FarnazH requested review from maximilianvz and marco-2023 November 12, 2023 19:11

FarnazH assigned FarnazH and FanwangM Nov 12, 2023

Clarify args of GridPartition in quick_start

ac89b72

This was referenced Nov 12, 2023

[methods.partition GridPartitioning] #134

Open

optimize_radius not using information of clusters #154

Closed

Address TODO items

9c4b28c


Add suggested fixes

bacf052

marco-2023 mentioned this pull request Dec 15, 2023

New average similarity method #189

Merged

FanwangM mentioned this pull request Mar 5, 2024

Update in Docs/Code necessary #194

Closed

Rename quick_start.ipynb to split it into smaller notebooks

0f6af11

FarnazH added 2 commits August 12, 2024 14:30

Some API updates to distance and partition methods

cdbbca6

Notebooks split into 3

1df961e

FarnazH requested a review from FanwangM August 14, 2024 21:05

FanwangM and others added 7 commits August 14, 2024 23:43

Merge branch 'main' into quick_start_notebook

7a92af1

Clean up test_partition.py

3ea3b2b

Clean up the distance based Jupyter notebook

10a151d

Use more descriptive text

091f7d7

Add comments for the sys path configurations

6233d2d

Use rendered table to display the diversity

928729b

Cleaning up similarity based Jupyter notebook

367f741

FanwangM approved these changes Aug 15, 2024


Cleaning up partition based methods

072ab6a

FanwangM merged commit 5a7e001 into main Aug 15, 2024
10 of 11 checks passed

FanwangM mentioned this pull request Aug 15, 2024

Refactoring the jupyter notebook #216

Closed

3 tasks
