
Feat/wiki scraper #51

Merged: 1 commit into main on Sep 24, 2024

Conversation

blester125 (Collaborator)

This PR adds scripts that can be used to get an XML export of MediaWiki sites that don't provide dumps. The resulting dump contains a list of <page> elements, one for each exported page. Each page has multiple <revision> elements, which can be used to create an author list, and the most recent <revision>'s <text> holds the MediaWiki markup representation of the page to use as the document text.
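
As a rough illustration of that layout, here is a minimal sketch of reading one export file, assuming the standard MediaWiki export schema (<page>/<revision>/<contributor>/<text>); the function name and the oldest-first revision ordering are assumptions, not taken from the PR's scripts:

```python
import xml.etree.ElementTree as ET

def pages_from_export(path):
    """Yield (title, authors, latest_wikitext) for each <page> in an export file."""
    root = ET.parse(path).getroot()
    for page in root.iterfind("{*}page"):  # "{*}" namespace wildcard needs Python 3.8+
        title = page.findtext("{*}title")
        revisions = page.findall("{*}revision")
        # Every contributor across the revisions feeds the author list;
        # anonymous edits carry an <ip> instead of a <username>.
        authors = {
            rev.findtext("{*}contributor/{*}username") or "anonymous"
            for rev in revisions
        }
        # Exports typically list revisions oldest-first, so the last one is the newest.
        text = revisions[-1].findtext("{*}text") if revisions else ""
        yield title, authors, text
```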

An index of pages is built using the Special:AllPages query URL, and then exports are made using Special:Export.
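
A hedged sketch of the Special:Export half, assuming the wiki serves Special:Export at <base>/Special:Export and accepts a newline-separated `pages` form field (parameter support varies by MediaWiki version, so treat these details as assumptions rather than what export_pages.py actually does); the batch size of 35 echoes the limit mentioned in the review discussion below:

```python
import requests

def export_batch(wiki_base, titles):
    """POST one batch of titles to Special:Export and return the XML text."""
    resp = requests.post(
        f"{wiki_base}/Special:Export",
        # Newline-separated titles; many wikis cap how much they will
        # export per request, hence the small batches.
        data={"pages": "\n".join(titles)},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text

def export_pages(wiki_base, titles, batch_size=35):
    """Yield one XML export document per batch of page titles."""
    for start in range(0, len(titles), batch_size):
        yield export_batch(wiki_base, titles[start:start + batch_size])
```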

@craffel (Collaborator) left a comment

Which wikis have you tried this on? Can it be easily parallelized to get more than 35 pages at once?

wikiscrape/README.md (outdated review comment, resolved)
wikiscrape/export_pages.py (2 outdated review comments, resolved)
wikiscrape/list_pages.py (outdated review comment, resolved)
@StellaAthena (Collaborator)

I think it's a good idea to replace Wikipedia's custom mathematics syntax with LaTeX. Does it make sense to do it at this stage of the pipeline, or later?

@blester125 (Collaborator, Author)

@craffel Yeah, it should be easy to parallelize. It runs off files that list page titles (one per line), so you can parallelize over those files (we are already parallelizing over wikis), and we can split the inputs pretty easily for more parallelism.
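
A minimal sketch of that file-level parallelism, assuming one title file per worker; `export_titles_file` is a hypothetical stand-in for whatever export_pages.py does for a single file:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def export_titles_file(path: Path) -> str:
    """Export all pages listed (one title per line) in a single title file."""
    titles = path.read_text().splitlines()
    ...  # run the Special:Export logic for this shard of titles
    return f"{path.name}: {len(titles)} titles"

def export_all(title_dir: str, workers: int = 8) -> None:
    """Fan the per-file work out over a process pool."""
    files = sorted(Path(title_dir).glob("*.txt"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(export_titles_file, files):
            print(result)
```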

@StellaAthena I think it would be best to have that happen later. I was thinking that after this export there would be a step that converts the XML to dolma, with raw wiki markup as the text field. The next step would then convert wikitext to plaintext, and that is where the math conversion would happen.
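
A hedged sketch of that intermediate step: each exported page becomes one dolma-style JSON line whose text field is still raw wiki markup. The field names (id/text/source/metadata) follow the common dolma layout and are assumptions here, not the PR's exact schema:

```python
import gzip
import json

def pages_to_dolma(pages, out_path, source="wiki/example"):
    """Write (title, authors, wikitext) tuples as gzipped dolma-style JSON lines."""
    with gzip.open(out_path, "wt") as f:
        for title, authors, wikitext in pages:
            record = {
                "id": title,
                "text": wikitext,  # raw markup; plaintext + LaTeX math conversion come later
                "source": source,
                "metadata": {"authors": sorted(authors)},
            }
            f.write(json.dumps(record) + "\n")
```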

Add tools to scrape MediaWiki wikis that don't publish dumps.

Add a tool that exports the XML based on the list of pages.

Add the ability to convert wikis to dolma.

Make the download and extract script support multiple workers.

Create a WTF Wikipedia parsing server that uses a worker pool to allow for timeouts.

Add a script that removes HTML tags found in many wiki dumps.

Add shadow paging to the creation of wikitext dolma files.

Add shadow paging to dolma preprocessing.

Add a script that removes `None` lines from dolma files.

Add a script that combines dolma shards while tracking what was used
where, to allow for aligned combinations of later versions.
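
The commit message mentions shadow paging for dolma creation and preprocessing. A minimal sketch of that idea as I read it (an assumption, not the PR's implementation): write each shard to a shadow file and atomically swap it into place, so a crash never leaves a half-written shard behind:

```python
import os

def write_shard_atomically(path, lines):
    """Write lines to `path` via a shadow file plus an atomic rename."""
    shadow = path + ".shadow"
    with open(shadow, "w") as f:
        for line in lines:
            f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure the shadow file is fully on disk
    os.replace(shadow, path)  # atomic swap on both POSIX and Windows
```
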
@blester125 (Collaborator, Author)

Datasets have been uploaded to https://huggingface.co/datasets/blester125/wiki-dolma

WikiMedia + Talk pages are cleaner and have 14.6 billion tokens.
WikiTeam3 wikis have 65.1 billion tokens. They are less clean; various default/boilerplate pages pop up.

blester125 merged commit f567cd1 into main on Sep 24, 2024
2 checks passed
blester125 deleted the feat/wiki-scraper branch on September 24, 2024 at 16:45