Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Provenance data #61

Merged
merged 17 commits into from
May 23, 2024
Merged

Data Provenance data #61

merged 17 commits into from
May 23, 2024

Conversation

shayne-longpre
Copy link
Collaborator

Adding the Data Provenance data.

@shayne-longpre shayne-longpre mentioned this pull request Apr 9, 2024
2 tasks
Copy link
Collaborator

@blester125 blester125 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be marked as a WIP? There seems to be a lot of things, like the include.txt missing?

data_provenance/download.py Outdated Show resolved Hide resolved
data_provenance/download.py Outdated Show resolved Hide resolved
data_provenance/download.py Outdated Show resolved Hide resolved
@shayne-longpre
Copy link
Collaborator Author

@blester125 could you take a look at this pull request when you get a chance? The only thing I was unable to test is the to_dolma call as I'm unable to pip install dolma for some reason... any ideas?

image

@shayne-longpre
Copy link
Collaborator Author

Also, how do we run the default linter?

@Skylion007
Copy link

Skylion007 commented Apr 26, 2024

@shayne-longpre run pre-commit

@shayne-longpre
Copy link
Collaborator Author

@shayne-longpre run pre-commit

Could you direct me on how to run that?

@Skylion007
Copy link

pip install pre-commit
pre-commit install
pre-commit run -a

@shayne-longpre
Copy link
Collaborator Author

pip install pre-commit pre-commit install pre-commit run -a

Done. Thanks!

@Skylion007
Copy link

We should just setup a pre-commit.ci here to do autoformat PRs. I'd be happy to do it if I can get proper permissions

@blester125
Copy link
Collaborator

We should just setup a pre-commit.ci here to do autoformat PRs. I'd be happy to do it if I can get proper permissions

We already have a CI that blocks merges until the formatting is correct (uses black and isort), so I don't think we need to have the CI actually re-write commits. The tools are pretty reliable, but I think using pre-commit (and thus having people need to spot check formatting changes by adding them back to git staging) is less error prone.

Copy link
Collaborator

@blester125 blester125 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this in! There are a few tweaks to make but we're pretty close!

data_provenance/download.py Outdated Show resolved Hide resolved
data_provenance/download.py Outdated Show resolved Hide resolved
data_provenance/download.py Outdated Show resolved Hide resolved
data_provenance/download.py Show resolved Hide resolved
data_provenance/download.py Outdated Show resolved Hide resolved
data_provenance/to-dolma.py Outdated Show resolved Hide resolved
data_provenance/to-dolma.py Show resolved Hide resolved
data_provenance/to-dolma.py Outdated Show resolved Hide resolved
data_provenance/to-dolma.py Outdated Show resolved Hide resolved
data_provenance/to-dolma.py Show resolved Hide resolved
@shayne-longpre
Copy link
Collaborator Author

@blester125 could you take a look at this pull request when you get a chance? The only thing I was unable to test is the to_dolma call as I'm unable to pip install dolma for some reason... any ideas?

image

Still hitting this issue cc @blester125

@blester125
Copy link
Collaborator

@blester125 could you take a look at this pull request when you get a chance? The only thing I was unable to test is the to_dolma call as I'm unable to pip install dolma for some reason... any ideas?
image

Still hitting this issue cc @blester125

🤔 What version of things are you using? I'm on python3.11 and I was able to install both dolma 0.9.1 and dolma 1.0.3 on ubuntu 22.04 and 23.04 respectively.

Maybe try to see if you can bump your python version so that there is a pre-built wheel for it? The list of wheels are here https://pypi.org/project/dolma/#files

@shayne-longpre
Copy link
Collaborator Author

@blester125 I've reviewed the new licenses with Aviya and we trimmed them down further for an abundance of caution.

The number of included datasets is now ~340, as compared to ~500 before. Everything appears to be working!

@@ -24,7 +29,16 @@ class PermissiveLicenses(StringEnum):
GFDL = "GNU Free Documentation License"
APACHE_2 = "Apache 2 License - https://www.apache.org/licenses/LICENSE-2.0"
MIT = "MIT License"
BSD = "BSD License"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any source in here already that was BSD licensed? If so we should make sure to update it to use one of the BSD_2 or BSD_3 to avoid breakage

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any code datasources have been checked in yet so I don't think anyone would have used it.

@blester125
Copy link
Collaborator

I pulled the branch to double check that the dolma stuff is working, but when I run download with the include_test.csv file it seems like neither of the datasets have a user_parent key. Is there a step I missed? The README just says to run download.

Also, how much effort would it be to setup so it runs from the data_provenance dir instead of one level up where we have to set the PYTHONPATH?

@shayne-longpre
Copy link
Collaborator Author

user_parent

@blester125 sorry the new HuggingFace object key is "dataset" not "user_parent"! I just updated it -- thanks for catching.

As for PYTHONPATH I'm not clear actually on what the changes are? python paths always confuse me

Shayne Longpre and others added 2 commits May 14, 2024 20:20
This commit does the following:
* Updates the data_provenance code so it will be run from its dir
  instead of the repo root.
* Updates the include_test.csv as it didn't match the include.csv (was
  missing the `GitHub License` column)
* Adds some logging
@blester125
Copy link
Collaborator

I just pushed a commit that fixes some issues I found while testing (the include_test.csv was missing the 'GitHub License) and I updated it so the code should be run from the data_provenancedir instead of the repo root (which removes the need for thePYTHONPATH). and the datasets in the test csv seemed to be using targetsinstead oflabels` so I updated the code to be able to handle both.

I was able to run the download and to dolma script with the new include_test.csv and the results look good, I think it's ready to merge but @shayne-longpre should take a quick look at my changes to make sure I didn't miss anything.

@shayne-longpre shayne-longpre merged commit 28e210b into main May 23, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants