Pickling tokenizer fails due to builtins.CoreBPE #231

Closed
jerheff opened this issue Dec 23, 2023 · 4 comments

Comments

@jerheff

jerheff commented Dec 23, 2023

I am using tiktoken in a dataset preprocessing step for a PyTorch DataLoader. The DataLoader supports multiprocessing when creating batches, which spawns worker processes. This fails with the exception:

TypeError: cannot pickle 'builtins.CoreBPE' object

I am not familiar with Rust, but this thread seems to suggest that a few methods in the Rust implementation would enable pickling the tokenizer.
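
For readers hitting the same error, one common workaround for objects that cannot be pickled is to pickle the recipe for rebuilding them instead of the object itself. The following is a minimal sketch of that idea, not tiktoken's implementation. `FakeEncoding`, `make_fake_encoding`, and `RebuildOnUnpickle` are hypothetical names introduced here for illustration; the stand-in class holds a `threading.Lock` only so that it fails to pickle the way `CoreBPE` does.

```python
import pickle
import threading


class FakeEncoding:
    """Hypothetical stand-in for an unpicklable tokenizer (not tiktoken's API)."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()  # locks cannot be pickled

    def encode(self, text):
        # Toy encoding: one integer per character.
        return [ord(ch) for ch in text]


def make_fake_encoding(name):
    # Module-level factory, so pickle can store it by reference.
    return FakeEncoding(name)


class RebuildOnUnpickle:
    """Wraps an object and pickles only the factory call that recreates it."""

    def __init__(self, factory, *args):
        self._factory = factory
        self._args = args
        self._obj = factory(*args)

    def __getattr__(self, attr):
        # Only reached for names missing on the wrapper; forward public ones.
        if attr.startswith("_"):
            raise AttributeError(attr)
        return getattr(self._obj, attr)

    def __reduce__(self):
        # Unpickling calls RebuildOnUnpickle(factory, *args) in the new process.
        return (type(self), (self._factory, *self._args))


wrapped = RebuildOnUnpickle(make_fake_encoding, "demo")
clone = pickle.loads(pickle.dumps(wrapped))
print(clone.encode("hi"))  # [104, 105]
```

With tiktoken installed, the wrapper could presumably be built as `RebuildOnUnpickle(tiktoken.get_encoding, "cl100k_base")`, although that is untested here. Each worker then pays the cost of constructing its own tokenizer once, at unpickling time.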

@sk-g

sk-g commented Jan 3, 2024

+1

It fails with this error when used in a multiprocessing context, e.g.:

    train_dataset = train_dataset.map(
        tokenization_function,
        batched=True,
        batch_size=1000,
        num_proc=args.num_proc,
        load_from_cache_file=not args.overwrite_cache,
        desc=f"Running tokenizer on train dataset with {len(train_dataset)} items",
    )
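
A pattern that often sidesteps this class of error is to avoid capturing the tokenizer in the mapped function at all, and to build it lazily inside each worker process instead. The sketch below shows the general shape under stated assumptions: `StandInTokenizer` and `get_tokenizer` are hypothetical names, the stand-in is not tiktoken's API, and the `"text"` and `"input_ids"` column names are assumed. It illustrates the pattern only and is not a confirmed fix for the call above.

```python
import functools
import threading


class StandInTokenizer:
    """Hypothetical stand-in for an unpicklable tokenizer (not tiktoken's API)."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()  # makes instances unpicklable

    def encode(self, text):
        # Toy encoding: one integer per character.
        return [ord(ch) for ch in text]


@functools.lru_cache(maxsize=None)
def get_tokenizer(name):
    # Built on first use in each process, then reused from the cache.
    # With tiktoken this body would call tiktoken.get_encoding(name).
    return StandInTokenizer(name)


def tokenization_function(batch):
    # The tokenizer is looked up at call time, so the function itself
    # holds no reference to an unpicklable object.
    tokenizer = get_tokenizer("demo")
    return {"input_ids": [tokenizer.encode(text) for text in batch["text"]]}


print(tokenization_function({"text": ["hi", "ok"]}))
# {'input_ids': [[104, 105], [111, 107]]}
```

Whether this is enough depends on how the framework serializes the mapped function, so it is worth checking against the specific library version in use.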

@sk-g

sk-g commented Jan 3, 2024

Related issue: #181

@hauntsaninja
Collaborator

I allow Encoding to be pickled in tiktoken 0.6. Please let me know if the implementation doesn't work well for you!

@hauntsaninja closed this as not planned on Feb 9, 2024
@jerheff
Author

jerheff commented Feb 9, 2024

@hauntsaninja Works for me. Thanks!
