Scrape examples from foodista #69

Merged (3 commits) on May 8, 2024


blester125
Collaborator

This PR scrapes training data from foodista, a shared collection of recipes and information about cooking tools, techniques, and ingredients that is distributed under the CC-BY 3.0 license.

Data is collected in 4 steps:

  1. An index of pages is built from the sitemap.
  2. All pages are downloaded.
  3. The pages are converted to dolma examples, with the raw HTML stored under the "text" key.
  4. The HTML is parsed as part of a dolma processor.
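
The four steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code in this PR: every function name (`build_index`, `download_page`, `to_example`, `parse_example`) is hypothetical, the record fields other than `"text"` are assumptions about a dolma-style layout, and the real step 4 runs inside a dolma processor rather than as a plain function.

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def build_index(sitemap_xml: str) -> list[str]:
    """Step 1: collect every page URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def download_page(url: str) -> str:
    """Step 2: fetch one page (no retries or rate limiting in this sketch)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def to_example(url: str, html: str) -> dict:
    """Step 3: wrap the raw HTML in a record, stored under the "text" key.

    The "id", "source", and "metadata" fields are assumed for illustration.
    """
    return {"id": url, "text": html, "source": "foodista", "metadata": {"url": url}}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def parse_example(example: dict) -> dict:
    """Step 4: replace the raw HTML in "text" with the extracted plain text."""
    extractor = _TextExtractor()
    extractor.feed(example["text"])
    extractor.close()
    return {**example, "text": "\n".join(extractor.chunks)}
```

Keeping the raw HTML in the example (step 3) and parsing it later (step 4) means the extraction logic can be changed and re-run without downloading the pages again.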

closes #11

@blester125 requested a review from craffel May 7, 2024 19:26
food/examples/4.json (outdated, resolved)
food/preprocess.py (resolved)
food/examples/2.json (outdated, resolved)
@blester125 merged commit 664312c into main May 8, 2024
2 checks passed
@blester125 deleted the feat/food branch May 8, 2024 15:13
Linked issue: Food Content
2 participants