Feat/wiki scraper #51

Merged 1 commit on Sep 24, 2024
Commits on Sep 24, 2024

  1. Tooling to download and process Wikis

    Add tools to scrape MediaWiki wikis that don't publish dumps (page-listing sketch below).

    Add a tool that exports the XML dump based on the list of pages.

    Add the ability to convert wikis to dolma.

    The download-and-extract script supports multiple workers (multi-worker sketch below).

    Create a WTF Wikipedia parsing server that uses a worker pool to allow for timeouts (worker-pool sketch below).

    Add a script that removes HTML tags found in many wiki dumps.

    Add shadow paging to the creation of wikitext dolma files (shadow-paging sketch below).

    Add shadow paging to dolma preprocessing.

    Add a script that removes `None` lines from dolma files (filtering sketch below).

    Add a script that can combine dolma shards while tracking what was used
    where, to allow for aligned combinations of later versions (combining sketch below).
    blester125 committed Sep 24, 2024
    Commit 4e571b6
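
The commit listing does not include the scraper code itself, so the following is a rough sketch of the page-enumeration step for a MediaWiki site that publishes no dumps, using the standard MediaWiki Action API (`action=query&list=allpages`). The wiki URL and output file name are placeholders, not values from this repository.

```python
"""List every page title on a MediaWiki site via the Action API.

A minimal sketch of the page-enumeration step; the real tooling in this
commit may differ. The wiki URL below is a placeholder.
"""
import requests

API_URL = "https://example-wiki.org/w/api.php"  # placeholder endpoint

def list_all_pages(api_url):
    """Yield every page title, following `apcontinue` pagination."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": "max",
        "format": "json",
    }
    while True:
        resp = requests.get(api_url, params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # carries apcontinue for the next batch

if __name__ == "__main__":
    with open("pages.txt", "w") as f:
        for title in list_all_pages(API_URL):
            f.write(title + "\n")
```

The resulting title list is what the XML-export step would consume, for example via the same site's Special:Export mechanism.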
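
The multi-worker download-and-extract behavior could look roughly like the process-pool sketch below. `fetch_and_extract` and the chunk size are hypothetical stand-ins for whatever the actual script does per worker.

```python
"""Fan page downloads out over multiple workers.

A generic sketch of the multi-worker pattern; the chunking and the
fetch_and_extract worker are stand-ins for the real script.
"""
from multiprocessing import Pool

def fetch_and_extract(titles):
    # Placeholder: download the pages in `titles` and extract their wikitext.
    return [f"downloaded {t}" for t in titles]

def chunk(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == "__main__":
    with open("pages.txt") as f:
        pages = [line.strip() for line in f if line.strip()]
    with Pool(processes=8) as pool:
        for result in pool.imap_unordered(fetch_and_extract, chunk(pages, 100)):
            print(len(result), "pages processed")
```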
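
The WTF Wikipedia parsing server wraps a JavaScript library, so the sketch below only illustrates the worker-pool-with-timeout pattern the commit describes, in Python: a stuck parse is abandoned after a deadline instead of blocking the rest of the pipeline. `parse_wikitext` is a placeholder for the real parser call.

```python
"""Illustrate the worker-pool-with-timeout pattern behind the parsing server.

A sketch only: a pool plus a per-task timeout keeps one hung parse from
blocking everything else. parse_wikitext stands in for the real parser.
"""
import multiprocessing as mp

def parse_wikitext(wikitext):
    # Placeholder for the real wikitext -> plain text conversion.
    return wikitext.strip()

def parse_with_timeout(pool, wikitext, timeout=10.0):
    """Run one parse in the pool; return None if it exceeds the timeout."""
    async_result = pool.apply_async(parse_wikitext, (wikitext,))
    try:
        return async_result.get(timeout=timeout)
    except mp.TimeoutError:
        return None  # caller can skip or retry the page

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        print(parse_with_timeout(pool, "== Heading ==\nSome article text."))
```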
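
Shadow paging here presumably means writing each output shard to a shadow path and atomically renaming it into place only once it is complete, so an interrupted run never leaves a truncated file that looks finished. The dolma-style record fields below are illustrative, not taken from the commit.

```python
"""Write a dolma-style shard via a shadow page: temp file + atomic rename.

A sketch of the shadow-paging idea; field names and paths are illustrative.
"""
import gzip
import json
import os

def write_shard(records, final_path):
    shadow_path = final_path + ".shadow"       # never the "live" name
    with gzip.open(shadow_path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    os.replace(shadow_path, final_path)        # atomic swap: all or nothing

if __name__ == "__main__":
    docs = [
        {"id": "wiki/Example_Page", "text": "Example article text.",
         "source": "wiki", "metadata": {"dump": "2024-09-24"}},
    ]
    write_shard(docs, "wiki-0000.json.gz")
```

Because only complete shards ever appear under their final names, a restarted run can safely skip any output file that already exists.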
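
The `None`-line cleanup is sketched below under the assumption that the offending lines are the literal string `None` written where a JSON record should have been; the real script may detect them differently. It reuses the same shadow-paging trick to rewrite the shard safely.

```python
"""Drop literal `None` lines from a gzipped dolma shard.

A sketch; assumes the bad lines are the literal string "None" rather than
a JSON record, and rewrites the shard through a shadow file.
"""
import gzip
import os

def strip_none_lines(path):
    shadow = path + ".shadow"
    dropped = 0
    with gzip.open(path, "rt", encoding="utf-8") as src, \
         gzip.open(shadow, "wt", encoding="utf-8") as dst:
        for line in src:
            if line.strip() == "None":
                dropped += 1
                continue
            dst.write(line)
    os.replace(shadow, path)
    return dropped

if __name__ == "__main__":
    print(strip_none_lines("wiki-0000.json.gz"), "lines removed")
```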
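
One plausible shape for the shard-combining step with provenance tracking: concatenate shards in a stable order and write a manifest recording which output line range came from which input, so a later version of the same shards can be combined in an aligned way. File names are illustrative.

```python
"""Combine dolma shards and record which input produced which output lines.

A sketch of the provenance-tracking idea: alongside the combined shard it
writes a manifest mapping each source shard to its output line range.
"""
import gzip
import json

def combine(shard_paths, out_path, manifest_path):
    manifest = []
    line_no = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for shard in sorted(shard_paths):        # stable order keeps runs aligned
            start = line_no
            with gzip.open(shard, "rt", encoding="utf-8") as src:
                for line in src:
                    out.write(line)
                    line_no += 1
            manifest.append({"shard": shard, "start": start, "end": line_no})
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)

if __name__ == "__main__":
    combine(["wiki-0000.json.gz", "wiki-0001.json.gz"],
            "wiki-combined.json.gz", "wiki-combined.manifest.json")
```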