Updating metadata with the corresponding image samples in shards #77

Open
ypwang61 opened this issue Jan 25, 2024 · 2 comments

ypwang61 commented Jan 25, 2024

Hi all, thanks for your great work! I have a technical question. I used dataset2metadata to generate metadata from the shards and stored a 'post_process_feature' (a mask) in the metadata. I now want to update the metadata with new values that require computing the CLIP score after masking the image. To do that, I need to read the image/text data from the shards so that each sample aligns exactly with the uid of its metadata entry and I can apply the right mask. Is there a function in datacomp/dataset2metadata that does something similar, or can you offer any pointers? Or do I need to align the uids of the metadata and the raw data myself?

Collaborator

sagadre commented Jan 26, 2024

Hi @ypwang61! If I understand correctly, you are currently using dataset2metadata to compute your post_process_feature, which is a mask that will affect CLIP score computation. In that same step you could also compute the updated CLIP scores after masking, so that you don't have to make two passes or realign the metadata to the shards.
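
A minimal sketch of the single-pass idea, in plain Python. It does not use the dataset2metadata API: `compute_mask`, `embed_image`, and `embed_text` are hypothetical callables standing in for the mask model and the CLIP encoders, and the image is simplified to a flat list of pixel values.

```python
import math
from typing import Any, Callable, Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def process_sample(
    uid: str,
    image: List[float],
    text: str,
    compute_mask: Callable[[List[float]], List[float]],   # hypothetical mask model
    embed_image: Callable[[List[float]], List[float]],    # hypothetical CLIP image encoder
    embed_text: Callable[[str], List[float]],             # hypothetical CLIP text encoder
) -> Dict[str, Any]:
    """Compute the mask and the masked CLIP score for one sample in a single pass."""
    mask = compute_mask(image)
    # Apply the mask element-wise while the image is still in memory.
    masked_image = [pixel * m for pixel, m in zip(image, mask)]
    score = cosine_similarity(embed_image(masked_image), embed_text(text))
    # Both values are keyed by the same uid, so no later realignment is needed.
    return {"uid": uid, "mask": mask, "masked_clip_score": score}
```

Because the mask and the score are produced while the same sample is in hand, they end up under one uid without any lookup back into the shards.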

Author

ypwang61 commented Jan 27, 2024

Thank you very much for your reply, @sagadre! Yes, I do need to calculate the post_process_feature. The problem is that computing the mask takes much longer than computing the CLIP scores, so we stored the masks in post_process_feature first. That is why I am now wondering whether there is an easy way to realign the metadata to the shards, for example by matching the uids in the metadata against the uids of the samples in the shards. If that is not convenient, I should combine the mask and new CLIP score computation in the first pass :)
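
The uid matching described above could be sketched roughly as follows, using only the standard library. This rests on assumptions not confirmed in this thread: that each shard is a WebDataset-style tar in which the files of one sample share a key and sit next to each other (`<key>.jpg`, `<key>.txt`, `<key>.json`), and that the `.json` file carries a `"uid"` field. `iter_aligned_samples` and `uid_to_feature` are hypothetical names, not part of datacomp or dataset2metadata.

```python
import json
import os
import tarfile
from typing import Any, Dict, Iterator, Optional, Tuple

Aligned = Tuple[str, Optional[bytes], str, Any]


def _align(sample: Dict[str, bytes], uid_to_feature: Dict[str, Any]) -> Optional[Aligned]:
    """Return (uid, image_bytes, caption, feature) if the sample's uid is in the metadata."""
    if "json" not in sample:
        return None
    uid = json.loads(sample["json"]).get("uid")  # assumed field name
    if uid not in uid_to_feature:
        return None
    caption = sample.get("txt", b"").decode("utf-8")
    return uid, sample.get("jpg"), caption, uid_to_feature[uid]


def iter_aligned_samples(shard_path: str, uid_to_feature: Dict[str, Any]) -> Iterator[Aligned]:
    """Stream one tar shard and yield only the samples whose uid has a stored feature."""
    with tarfile.open(shard_path) as tar:
        current_key, sample = None, {}
        for member in tar:
            if not member.isfile():
                continue
            # Assumes names like "000000012.jpg": one key, one extension.
            key, ext = os.path.splitext(member.name)
            if key != current_key:
                # The key changed, so the previous sample is complete.
                aligned = _align(sample, uid_to_feature)
                if aligned is not None:
                    yield aligned
                current_key, sample = key, {}
            sample[ext.lstrip(".")] = tar.extractfile(member).read()
        # Flush the last sample in the shard.
        aligned = _align(sample, uid_to_feature)
        if aligned is not None:
            yield aligned
```

Here `uid_to_feature` would be a plain dict built from the stored metadata (uid mapped to its mask), however that metadata happens to be saved. Each yielded tuple then has the raw image bytes, the caption, and the matching mask together, which is enough to compute the masked CLIP score in a second pass.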
