Add support for text data & tokenization #577

Merged
s314cy merged 2 commits into develop from 572-tokenizer-support-s314cy
Jul 31, 2023

s314cy (Contributor) commented Apr 20, 2023

closes #572 and closes #491

add support for text data & tokenization:

  • tokenize samples
  • load labelled text data in the browser/node
  • load unlabelled text data in the browser/node
  • lazily load text data in node
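
As a rough illustration of the lazy-loading item above (not the loader from this PR), reading text line by line can be sketched with a generator. The `lazyLines` name and the chunked input are hypothetical; in Node the chunks would typically come from a file stream and the generator would be async, which is simplified away here.

```typescript
// Hypothetical sketch, not the loader from this PR: yield lines one at a time
// from chunked text, so only the current chunk and a partial line are buffered.
function* lazyLines (chunks: Iterable<string>): Generator<string> {
  let buffer = ''
  for (const chunk of chunks) {
    buffer += chunk
    const parts = buffer.split('\n')
    // The last part may be an incomplete line; keep it for the next chunk.
    buffer = parts.pop() ?? ''
    for (const line of parts) yield line
  }
  // Emit a trailing line that has no final newline.
  if (buffer.length > 0) yield buffer
}

// Usage: lines are produced only as the consumer asks for them.
const lines = Array.from(lazyLines(['hello wo', 'rld\nsecond ', 'line\nthird']))
console.log(lines) // ['hello world', 'second line', 'third']
```

Because the generator only pulls the next chunk when the consumer requests another line, a large text file never has to be held in memory as a whole.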

this PR also reworks the data preprocessing pipeline, which is now much more modular and makes it easy to add new preprocessing functions!
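
To give an idea of what a modular pipeline can look like, here is a minimal sketch in which each preprocessing step is a plain function and steps are chained with a small `compose` helper. All names here (`Preprocess`, `makeTokenizer`, `makePadder`, `compose`) are hypothetical and are not the Disco.js API; the whitespace tokenizer is a toy stand-in for a real tokenizer.

```typescript
// Hypothetical sketch of a modular preprocessing pipeline; these names are
// illustrative and are not the Disco.js API.
type Preprocess<In, Out> = (sample: In) => Out

// Toy whitespace tokenizer mapping each word to an id in a fixed vocabulary.
// A real task would plug in a proper (e.g. subword) tokenizer here.
function makeTokenizer (vocab: Map<string, number>, unknownId: number): Preprocess<string, number[]> {
  return (text) => text
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => vocab.get(word) ?? unknownId)
}

// Pads or truncates a token id sequence to a fixed length.
function makePadder (length: number, padId: number): Preprocess<number[], number[]> {
  return (ids) => {
    const out = ids.slice(0, length)
    while (out.length < length) out.push(padId)
    return out
  }
}

// Chains two steps; adding a new preprocessing function is one more compose call.
function compose<A, B, C> (f: Preprocess<A, B>, g: Preprocess<B, C>): Preprocess<A, C> {
  return (sample) => g(f(sample))
}

const vocab = new Map<string, number>([['hello', 2], ['world', 3]])
const preprocess = compose(makeTokenizer(vocab, 1), makePadder(4, 0))
console.log(preprocess('hello brave world')) // [2, 1, 3, 0]
```

In this sketch the output is a plain array of token ids, so unlabelled text can flow through the same steps as labelled text.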

it also fixes the CI by:

  • making the GitHub Actions data cache run-specific
  • ensuring the data download script bypasses the gbucket cache
  • replacing the example data's BSD tar archive with a GNU tar archive (macOS vs. Linux), since the BSD archive caused issues in the CI

s314cy added the feature (New feature or request) and discojs (Related to Disco.js) labels Apr 20, 2023
s314cy self-assigned this Apr 20, 2023
Tokenize = 'tokenize'
}

export function getPreprocessImage (task: Task): PreprocessText {
Member

should this one be called image?

Member

also mind adding a comment if you will output a stream of token ids?

for LLMs, we can then also support datasets without any label being needed

also let's say where/how people could load different tokenizers (task config or hardcoded either is fine)

Contributor Author

I'll make sure that the PR follows your comments once it's out of the "draft" stage!

@martinjaggi
Member

very cool, thanks for getting this started!

s314cy force-pushed the 572-tokenizer-support-s314cy branch 2 times, most recently from 283003b to 8834d99, April 24, 2023 14:26
s314cy force-pushed the 572-tokenizer-support-s314cy branch from 57803d6 to e8c307f, May 4, 2023 10:37
s314cy force-pushed the 572-tokenizer-support-s314cy branch 2 times, most recently from 55be642 to 6711773, May 23, 2023 12:54
s314cy force-pushed the 572-tokenizer-support-s314cy branch 2 times, most recently from 699116f to 9c96d71, July 6, 2023 12:13
s314cy force-pushed the 572-tokenizer-support-s314cy branch from bce22ad to f78a7e9, July 31, 2023 11:48
s314cy marked this pull request as ready for review July 31, 2023 11:48
s314cy merged commit acd4250 into develop Jul 31, 2023
6 checks passed
s314cy deleted the 572-tokenizer-support-s314cy branch July 31, 2023 11:49
Linked issues that merging this pull request may close:

  • support modern tokenizers for NLP
  • add NLP usecase
2 participants