Conflict between webdataset implementation and OpenCLIP #935

Open
wanghao14 opened this issue Sep 11, 2024 · 3 comments

Comments

@wanghao14

I encountered the following error when finetuning OpenCLIP on my own data:

File "./src/training/data.py", line 281, in group_by_keys_nothrow
    fname, value = filesample["fname"], filesample["data"]
KeyError: 'fname'

This issue was previously reported in webdataset issue #384. The problem appears to stem from how OpenCLIP loads datasets in WebDataset format:

  1. The function tarfile_to_samples_nothrow is added to the pipeline.
  2. Within tarfile_to_samples_nothrow, tar_file_expander is called to iterate over the opened tar files.
  3. When the data in a tar file is exhausted, {} (the default value of eof_value) is yielded, as implemented in webdataset.
  4. When that {} is passed to group_by_keys_nothrow, the error is triggered, because an empty dict has no "fname" key.

As I am new to working with WebDataset and not fully familiar with the underlying principles, I would appreciate any guidance or solutions you can provide.
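
To make the four steps above concrete, here is a minimal sketch that reproduces the same KeyError without webdataset installed. Both functions are hypothetical stand-ins written for illustration: toy_file_expander only imitates the end-of-shard marker described in step 3, and toy_group_by_keys is a heavily simplified version of the grouping logic, not the actual OpenCLIP code.

```python
def toy_file_expander(files, eof_value={}):
    """Hypothetical stand-in for tar_file_expander.

    Yields one dict per file, then an end-of-shard marker unless
    eof_value is None. The default {} is never mutated here.
    """
    for fname, data in files:
        yield {"fname": fname, "data": data}
    if eof_value is not None:
        yield eof_value


def toy_group_by_keys(filesamples):
    """Simplified stand-in for group_by_keys_nothrow.

    Groups consecutive files that share the same name before the extension.
    """
    current = None
    for filesample in filesamples:
        # This lookup fails when filesample is the empty marker {}.
        fname, value = filesample["fname"], filesample["data"]
        prefix, suffix = fname.rsplit(".", 1)
        if current is None or current["__key__"] != prefix:
            if current is not None:
                yield current
            current = {"__key__": prefix}
        current[suffix] = value
    if current is not None:
        yield current


shard = [("0001.jpg", b"img"), ("0001.txt", b"caption")]

try:
    list(toy_group_by_keys(toy_file_expander(shard)))
    error = None
except KeyError as exc:
    error = exc

print(repr(error))
```

With eof_value=None the stand-in expander emits no marker, and the same grouping function returns a single complete sample.
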
@fartashf

The fix is to pass eof_value=None to tar_file_expander on this line:

files = tar_file_expander(streams, handler=handler)

This is a breaking change in webdataset, and I am not sure whether the right solution is to ask OpenCLIP to fix it or to ask webdataset to change the default value.
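
Since older webdataset releases may not accept an eof_value keyword at all, one way to apply the change without pinning a version is to pass it only when the installed function supports it. The following is a sketch under that assumption: expand_without_eof_marker is a hypothetical helper, and old_expander and new_expander are stand-ins for the two signatures rather than the real webdataset functions.

```python
import inspect


def expand_without_eof_marker(expander, streams, handler):
    """Call `expander`, passing eof_value=None only if it accepts that keyword."""
    kwargs = {}
    if "eof_value" in inspect.signature(expander).parameters:
        kwargs["eof_value"] = None
    return expander(streams, handler=handler, **kwargs)


# Hypothetical stand-in for an expander without the eof_value parameter.
def old_expander(streams, handler=None):
    yield from streams


# Hypothetical stand-in for an expander that emits an end marker by default.
def new_expander(streams, handler=None, eof_value={}):
    yield from streams
    if eof_value is not None:
        yield eof_value


files = [{"fname": "0001.jpg", "data": b"img"}]
old_out = list(expand_without_eof_marker(old_expander, files, handler=None))
new_out = list(expand_without_eof_marker(new_expander, files, handler=None))

# Both variants now produce only real file entries.
print(old_out == new_out == files)
```

In OpenCLIP the call would presumably become expand_without_eof_marker(tar_file_expander, streams, handler), but that wiring has not been verified against any specific release.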

@rwightman
Collaborator

Hmm, that's a bit of a pain. It looks like webdataset might have been trying to fix an issue that I ran into, which is why I added this extra bit of code (the nothrow variants) in the first place: some webdatasets had colliding filenames across shards, which would cause a crash. I think the webdataset change might prevent that and other possible issues (such as completing a sample with bits from two different shards)...

To fix it, I'd have to force using the latest webdataset and remove these hacks. And the painful part is verifying that it all works :/
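
For illustration, here is a sketch of how a grouping function could treat the empty dict as a shard boundary, so that a sample is never completed with files from the next shard. This is an assumption about the intent of the marker, not a copy of either library's code; group_by_keys_with_boundaries is a hypothetical name, the key splitting is simplified, and the usual check that drops incomplete samples is omitted.

```python
def group_by_keys_with_boundaries(filesamples):
    """Group files into samples, flushing the current sample at each empty marker."""
    current = None
    for filesample in filesamples:
        if not filesample:
            # Empty dict: assumed end-of-shard marker. Emit what we have
            # and start fresh so nothing carries over into the next shard.
            if current is not None:
                yield current
            current = None
            continue
        fname, value = filesample["fname"], filesample["data"]
        prefix, _, suffix = fname.rpartition(".")
        if current is not None and prefix != current["__key__"]:
            yield current
            current = None
        if current is None:
            current = {"__key__": prefix}
        current[suffix] = value
    if current is not None:
        yield current


# Two shards that both contain a file with the key "0001".
stream = [
    {"fname": "0001.jpg", "data": b"a"},
    {},  # end of first shard
    {"fname": "0001.txt", "data": b"b"},
    {},  # end of second shard
]
samples = list(group_by_keys_with_boundaries(stream))
print(len(samples))
```

Without the marker, the two colliding "0001" entries would be merged into one sample that mixes data from both shards.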

@wanghao14
Author

@rwightman Thank you for the detailed explanation regarding the cause of the issue. I really appreciate your efforts in maintaining the code.

Given the situation, do you think @fartashf's method could be a viable solution to retain your hacks while ensuring compatibility with the latest version of webdataset?

Thanks again for your hard work and dedication.
