Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quick Start Notebook #186

Merged
merged 15 commits into from
Aug 15, 2024
Merged

Quick Start Notebook #186

merged 15 commits into from
Aug 15, 2024

Conversation

FarnazH
Copy link
Member

@FarnazH FarnazH commented Nov 12, 2023

This PR contains the quick_start.ipynb showcasing various functionalities of the package alongside a clear comparison of methods. While working on this notebook, I improved the package code/docstring and fixed some bugs. These changes are directly pushed to the main branch so that we can move faster with our release. The method comparison figures (selecting from one cluster) have been added to the paper. There is still some work that needs to be done.

@marco-2023, can you please:

  1. Complete the two TODO items comparing the sections through diversity measures.
  2. Add an example of the selection method.
  3. Review this notebook (feel free to go ahead and make changes and push to this branch).

@maximilianvz, can you please review this PR and share any comments you have on the notebook?

The changes made to the notebook where:
1. Added function to render the result tables
2. Measured the diversity of the selected sets
3. Added selections based on n-similarity methods

Additionally:
4. Added example using n-similarity methods to compute diversity
@marco-2023
Copy link
Collaborator

@FarnazH @maximilianvz I went through the notebook and:

  1. Added a function to render the tables (list of lists) as markdown cells
  2. Completed the TODO items
  3. Added an example of selection using the n-similarity methods
  4. Used the n-similarity methods to compute diversity in one case. (this is extra and would like your opinion on this).

Can you tell me what you think about it after a quick look?

  • Should we calculate 4. for all selections or not use it altogether?
  • All n-similarity methods selected the same points. According to Ramon, this is to be expected due to the (low) dimensionality of the problem at hand. Should we just use one then?

Several things to note are:

  1. I was only able to use two diversity measures from the diversity module. The rest need binary chains as elements.
  2. The documentation of the n-similarity methods is not being generated on the website.

@maximilianvz
Copy link
Collaborator

@FarnazH and @marco-2023, I have several comments:

  • There are some typos in the notebook (I may have missed some, so I encourage others to check for this, too):
  1. On 9 occasions, "medoid" is misspelled as "mediod".
  2. In the explanation under Example 1: MaxMin Selector, it is stated that "This can a user-defined function or a sklearn.metrics.pairwise_distances function". This needs to be changed to "This can BE a user-defined function or a sklearn.metrics.pairwise_distances function".
  3. In the documentation for the render_table function, the following is said: "The data to be rendered in a table, each inner list represents a row with the first row being the header. All" I believe the trailing "All" was written mistakenly.
  • In this part of the notebook, the header is "N-Similarity based methods". However, the title of the plot in this section is "Comparing (N)Similarity-Based Selectors". I'd appreciate consistency in how "N-Similarity Based"/"(N)Similarity-Based" is written throughout the notebook:
Screenshot 2023-12-01 at 3 33 38 PM
  • There is a warning in this part of the notebook (which doesn't seem to affect performance) that would be nice to have removed:

image

  • This is a small thing, but for new users, it may be helpful to include a sentence at the end of the explanation here that clarifies that larger values for logdet and smaller values for wdudcorrespond to greater diversity.
Screenshot 2023-12-01 at 3 39 01 PM
  • When comparing multiple selection methods on data within a single cluster, comparisons of selection diversity are given for distance-based, partition-based, and N-similarity-based methods. However, in the latter half of the notebook, where selection is done for data with multiple clusters, there is no such diversity comparison performed. It would probably be best to add this.

  • @marco-2023, you would know more about this than me, but you point out that the selected sets are the same for all similarity indices, which is a consequence of the data being low-dimensional. I assume this is unavoidable if we're using 2-dimensional data (which was done so things could be easily visualized), but I'm not sure how effectively this conveys the usefulness of the various similarity indices of the N-similarity-based methods offered by the package. It may not be worth the effort (i.e., I'm not adamant that things need to be changed here), but perhaps we could use a higher-dimensional example and sacrifice visualization in instances where we want to showcase the N-similarity-based methods and how indices affect diversity. If this were done, we should include a note warning that the choice of similarity index won't affect selection diversity when working in low-dimensional space. Alternatively, we can just use one similarity index and provide this warning.

Screenshot 2023-12-01 at 3 45 11 PM
  • (this is unrelated to the notebook) @FarnazH, correct me if I'm wrong, but I believe we should update the documentation of OptiSim, which currently states that the medoid centre is chosen as the initial point. Like DISE, OptiSim has a ref-index argument with a default value of zero, so it isn't guaranteed that the medoid center is the initial point (in most cases, I'd imagine it won't be):

image

@FanwangM
Copy link
Collaborator

When computing diversity, a distance matrix is used. We should use the feature matrix instead I think.

@FarnazH FarnazH requested a review from FanwangM August 14, 2024 21:05
Copy link
Collaborator

@FanwangM FanwangM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing the updated notebooks. They look good to me. I did some cleaning up and keep working on these notebooks.
I will merge them first and then get a cleaner version soon.

@FanwangM FanwangM merged commit 5a7e001 into main Aug 15, 2024
10 of 11 checks passed
@FanwangM FanwangM mentioned this pull request Aug 15, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants