
Feat/wiki scraper #51

Merged: 1 commit into main on Sep 24, 2024

Conversation

blester125 (Collaborator)

This PR adds scripts that can be used to get an XML export of MediaWiki sites that don't provide dumps. The resulting dump contains a list of <page> elements, one for each exported page. Each page has multiple <revision> elements, which can be used to create an author list, and the most recent <revision>'s <text> holds the MediaWiki markup representation of the page to use as the document text.
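
As a rough illustration of that layout, here is a minimal sketch of reading one export file, assuming the standard MediaWiki export schema (<page>/<revision>/<contributor>/<text>); the function name and the oldest-first revision ordering are assumptions, not taken from the PR's scripts:

```python
import xml.etree.ElementTree as ET

def pages_from_export(path):
    """Yield (title, authors, latest_wikitext) for each <page> in an export file."""
    root = ET.parse(path).getroot()
    for page in root.iterfind("{*}page"):  # "{*}" namespace wildcard needs Python 3.8+
        title = page.findtext("{*}title")
        revisions = page.findall("{*}revision")
        # Every contributor across the revisions feeds the author list;
        # anonymous edits carry an <ip> instead of a <username>.
        authors = {
            rev.findtext("{*}contributor/{*}username") or "anonymous"
            for rev in revisions
        }
        # Exports typically list revisions oldest-first, so the last one is the newest.
        text = revisions[-1].findtext("{*}text") if revisions else ""
        yield title, authors, text
```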

An index of pages is built using the Special:AllPages query URL, and then exports are made using Special:Export.
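
A hedged sketch of the Special:Export half, assuming the wiki serves Special:Export at <base>/Special:Export and accepts a newline-separated `pages` form field (parameter support varies by MediaWiki version, so treat these details as assumptions rather than what export_pages.py actually does); the batch size of 35 echoes the limit mentioned in the review discussion below:

```python
import requests

def export_batch(wiki_base, titles):
    """POST one batch of titles to Special:Export and return the XML text."""
    resp = requests.post(
        f"{wiki_base}/Special:Export",
        # Newline-separated titles; many wikis cap how much they will
        # export per request, hence the small batches.
        data={"pages": "\n".join(titles)},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text

def export_pages(wiki_base, titles, batch_size=35):
    """Yield one XML export document per batch of page titles."""
    for start in range(0, len(titles), batch_size):
        yield export_batch(wiki_base, titles[start:start + batch_size])
```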

@craffel (Collaborator) left a comment

Which wikis have you tried this on? Can it be easily parallelized to get more than 35 pages at once?

wikiscrape/README.md (outdated review comment, resolved)
wikiscrape/export_pages.py (2 outdated review comments, resolved)
wikiscrape/list_pages.py (outdated review comment, resolved)
@StellaAthena (Collaborator)

I think it's a good idea to replace Wikipedia's custom mathematics syntax with LaTeX. Does it make sense to do it at this stage of the pipeline, or later?

@blester125 (Collaborator, Author)

@craffel Yeah, it should be easy to parallelize. It runs off files that list page titles (one per line), so you can parallelize over those files (we are already parallelizing over wikis), and we can split the inputs pretty easily for more parallelism.
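
A minimal sketch of that file-level parallelism, assuming one title file per worker; `export_titles_file` is a hypothetical stand-in for whatever export_pages.py does for a single file:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def export_titles_file(path: Path) -> str:
    """Export all pages listed (one title per line) in a single title file."""
    titles = path.read_text().splitlines()
    ...  # run the Special:Export logic for this shard of titles
    return f"{path.name}: {len(titles)} titles"

def export_all(title_dir: str, workers: int = 8) -> None:
    """Fan the per-file work out over a process pool."""
    files = sorted(Path(title_dir).glob("*.txt"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(export_titles_file, files):
            print(result)
```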

@StellaAthena I think it would be best to have that happen later. I was thinking that after this export there would be a step that converts the XML to dolma, with raw wiki markup as the text field. The next step would then convert wikitext to plaintext, and that is where the math conversion would happen.
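
A hedged sketch of that intermediate step: each exported page becomes one dolma-style JSON line whose text field is still raw wiki markup. The field names (id/text/source/metadata) follow the common dolma layout and are assumptions here, not the PR's exact schema:

```python
import gzip
import json

def pages_to_dolma(pages, out_path, source="wiki/example"):
    """Write (title, authors, wikitext) tuples as gzipped dolma-style JSON lines."""
    with gzip.open(out_path, "wt") as f:
        for title, authors, wikitext in pages:
            record = {
                "id": title,
                "text": wikitext,  # raw markup; plaintext + LaTeX math conversion come later
                "source": source,
                "metadata": {"authors": sorted(authors)},
            }
            f.write(json.dumps(record) + "\n")
```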

Add tools to scrape MediaWiki wikis that don't publish dumps.

Add a tool that exports the XML based on the list of pages.

Add the ability to convert wikis to dolma.

Make the download and extract script support multiple workers.

Create a WTF Wikipedia parsing server that uses a worker pool to allow for timeouts.

Add a script that removes HTML tags found in many wiki dumps.

Add shadow paging to the creation of wikitext dolma files.

Add shadow paging to dolma preprocessing.

Add a script that removes `None` lines from dolma files.

Add a script that combines dolma shards while tracking what was used
where, to allow for aligned combinations of later versions.
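
The commit message mentions shadow paging for dolma creation and preprocessing. A minimal sketch of that idea as I read it (an assumption, not the PR's implementation): write each shard to a shadow file and atomically swap it into place, so a crash never leaves a half-written shard behind:

```python
import os

def write_shard_atomically(path, lines):
    """Write lines to `path` via a shadow file plus an atomic rename."""
    shadow = path + ".shadow"
    with open(shadow, "w") as f:
        for line in lines:
            f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the shadow file is fully on disk
    os.replace(shadow, path)  # atomic swap on both POSIX and Windows
```
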
@blester125 (Collaborator, Author)

Datasets have been uploaded to https://huggingface.co/datasets/blester125/wiki-dolma

WikiMedia + Talk pages are cleaner and have 14.6 billion tokens.
WikiTeam3 wikis have 65.1 billion tokens. They are less clean; various default/boilerplate pages pop up.

blester125 merged commit f567cd1 into main on Sep 24, 2024
2 checks passed
blester125 deleted the feat/wiki-scraper branch on September 24, 2024 at 16:45