Tooling to download and process Wikis
Add tools to scrape MediaWiki wikis that don't publish dumps. Add a tool that exports XML based on the list of pages. Add the ability to convert wikis to the dolma format.
1 parent 1ef9a1c, commit 05d64f2
Showing 28 changed files with 1,818 additions and 10 deletions.
@@ -160,3 +160,6 @@ cython_debug/
#.idea/
.python-version
**/licensed_pile_log.txt

node_modules
package-lock.json
@@ -1,16 +1,18 @@
beautifulsoup4
charset_normalizer
datasets
dolma
google-cloud-storage
internetarchive
logging_json
markdown-it-py
pandas
patool
pre-commit
pyunpack
rdflib
requests>=2.13
smart_open
tenacity
pandas
jsonlines
datasets
tqdm
ultimate-sitemap-parser
@@ -0,0 +1,23 @@
# Wiki

## Notes

The following scanners output a .history.xml to parse:
* "Internet Archive HTML5 Uploader ...": seems to ship as .7z
* "wikiteam3 (v...)": these get released as .zstandard files.
* Official Wikipedia dumps
* "Internet Archive Python library ..." >= 1.0.4

The following use the old format:
* "Internet Archive Python library 0.X.X": comes as a zip file; you need to make a new dir with -d when you unzip.

The archive URL can be created with `f"archive.org/details/{item_id}"`.

Some of the items have multiple uploads. For example, `wiki-kris159shoutwikicom_w` has multiple history files, so we need to parse out the date and pick the most recent one, i.e., `kris159shoutwikicom_w-20180506-history-xml.7z` over `kris159shoutwikicom_w-20140129-history.xml.7z`.
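A minimal sketch of the date parsing this implies, assuming the wikiteam-style `<item>-<YYYYMMDD>-history` naming shown above (the file names here are just the examples from this note):

```python
import re
from datetime import datetime

# Example upload names for a single item with multiple history dumps.
names = [
    "kris159shoutwikicom_w-20140129-history.xml.7z",
    "kris159shoutwikicom_w-20180506-history-xml.7z",
]

def dump_date(name: str) -> datetime:
    """Pull the YYYYMMDD timestamp out of a history dump file name."""
    m = re.search(r"-(\d{8})-history", name)
    return datetime.strptime(m.group(1), "%Y%m%d") if m else datetime.min

print(max(names, key=dump_date))  # kris159shoutwikicom_w-20180506-history-xml.7z
```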
## Special Cases

Shout Wiki, WikiTravelAllLanguages
@@ -0,0 +1 @@
data*/*
@@ -0,0 +1,23 @@
# Wiki Dumps from the Internet Archive

We need to download 4.4 TB from the Internet Archive.

With a Gigabit connection it would take about 9 hours to download.

Anecdotally, the Internet Archive tends to serve somewhere between 1 and 10 Mbps per connection, and the longer a download runs the less bandwidth they give you.

| Bandwidth | Hosts | Time to DL |
|-----------|------:|-----------:|
| 1 Gb/s    |     1 | 9h 40m     |
|           |     4 | 2.3h       |
|           |    10 | 0.9h       |
| 10 Mb/s   |     1 | 40d 17h    |
|           |     4 | 10d +      |
|           |    10 | 4d +       |
| 1 Mb/s    |     1 | 407d 9h    |
|           |     4 | 101d +     |
|           |    10 | 40d +      |
|           |   100 | 4d +       |
|           |   500 | 0.8d       |

We really need hardware-based parallelism.
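For reference, a minimal sketch of the arithmetic behind the table, assuming the stated bandwidth is available per host and the 4.4 TB splits evenly across hosts:

```python
def download_time_hours(total_tb: float, bandwidth_mbps: float, hosts: int) -> float:
    """Rough wall-clock estimate: total bits divided by aggregate bandwidth."""
    total_bits = total_tb * 1e12 * 8  # TB -> bits (decimal units)
    return total_bits / (bandwidth_mbps * 1e6 * hosts) / 3600

print(f"{download_time_hours(4.4, 1000, 1):.1f} h")    # ~9.8 h, i.e. the 9h 40m row
print(f"{download_time_hours(4.4, 10, 4) / 24:.1f} d")  # ~10.2 d for 10 Mb/s on 4 hosts
```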
@@ -0,0 +1,135 @@
"""Download wiki dumps from the internet archive."""

import argparse
import functools
import json
import multiprocessing.dummy as mp
import os
import random

import internetarchive
import pyunpack
import utils

from licensed_pile import logs

parser = argparse.ArgumentParser(
    description="Download wiki dumps from the internet archive."
)
parser.add_argument("--wiki_metadata", default="data/ia-wikis.jsonl")
parser.add_argument("--test_run", type=int, help="")
parser.add_argument("--num_threads", type=int, default=32, help="")
parser.add_argument("--worker_id", type=int, required=True, help="")
parser.add_argument("--num_workers", type=int, required=True, help="")


# TODO: Default downloading to .../dumps
def download_and_extract(
    ident: str,
    dl_file,
    output_dir: str = "/fruitbasket/users/bdlester/projects/licensed_pile/wiki/archive/data/dumps",
    verbose: bool = False,
):
    """Download one dump file from IA and unpack it into output_dir/ident."""
    logger = logs.get_logger("wiki/archive")
    dest = os.path.join(output_dir, ident)
    if os.path.exists(dest):
        logger.info(
            f"Skipping download of {dl_file['name']} for {ident} as {dest} already exists on disk."
        )
        return dest
    logger.info(f"Downloading {dl_file['name']} for {ident}.")
    internetarchive.download(
        ident, checksum=True, verbose=verbose, files=dl_file["name"], destdir=output_dir
    )
    logger.info(f"Extracting download for {ident} to {dest}.")
    pyunpack.Archive(os.path.join(dest, dl_file["name"])).extractall(dest)
    return dest


def download_ia(wiki):
    logger = logs.get_logger("wiki/archive")
    if (ident := wiki["metadata"]["identifier"]) in utils.KNOWN_BAD:
        logger.warning(f"Skipping {ident} as it is listed under utils.KNOWN_BAD")
        return None
    dl_file = utils.find_download(wiki)
    return download_and_extract(ident, dl_file)


def download_fandom(wiki):
    logger = logs.get_logger("wiki/archive")
    logger.warning("Fandom downloads not implemented yet, downloading from IA.")
    return download_ia(wiki)


def download_wikimedia(wiki):
    logger = logs.get_logger("wiki/archive")
    logger.warning("Wikimedia downloads not implemented yet, downloading from IA.")
    return download_ia(wiki)


def scrape_wiki(wiki):
    logger = logs.get_logger("wiki/archive")
    logger.warning("Wiki re-scrapes not implemented yet, downloading from IA.")
    return download_ia(wiki)


def process_wiki(i, wiki, offset):
    """Decide where to fetch the dump for one wiki and download it."""
    logger = logs.get_logger("wiki/archive")
    if "metadata" not in wiki:
        logger.error(f"Metadata missing from line {i}, malformed record")
        return None
    ident = wiki["metadata"]["identifier"]
    if not utils.filter_language(wiki["metadata"].get("language")):
        lang = wiki["metadata"].get("language")
        logger.warning(f"{ident} appears to not be english, found: {lang}")
        return None
    if not utils.check_alive(wiki):
        logger.info(f"{ident} is offline, getting dump from IA.")
        return download_ia(wiki)
    if not utils.verify_license(wiki):
        logger.error(f"The IA license for {ident} doesn't match the source.")
        return None
    if utils.check_fandom(wiki):
        logger.info(f"{ident} is a fandom wiki, downloading dump from there.")
        return download_fandom(wiki)
    if utils.check_wikimedia(wiki):
        logger.info(f"{ident} is a WikiMedia wiki, downloading dump from there.")
        return download_wikimedia(wiki)
    if utils.check_out_of_date(wiki, offset):
        logger.warning(f"IA dump for {ident} is very out of date, re-scraping.")
        return scrape_wiki(wiki)


# TODO: configure dest_dir
def main(args):
    logger = logs.get_logger("wiki/archive")
    logger.info(f"Reading wiki metadata from {args.wiki_metadata}")
    with open(args.wiki_metadata) as f:
        wiki_metadata = [json.loads(l) for l in f if l]
    logger.info(f"{len(wiki_metadata)} wikis to download.")

    if args.test_run:
        logger.info(f"Test Run: Only downloading {args.test_run} wikis")
        random.shuffle(wiki_metadata)
        wiki_metadata = wiki_metadata[: args.test_run]

    # Round-robin shard the wikis so multiple hosts can split the download.
    wiki_metadata = [
        w for i, w in enumerate(wiki_metadata) if i % args.num_workers == args.worker_id
    ]
    logger.info(
        f"{len(wiki_metadata)} wikis to download as {args.worker_id}/{args.num_workers}."
    )

    # Serial version, useful for debugging:
    # f = functools.partial(process_wiki, offset=None)
    # [f(*w) for w in enumerate(wiki_metadata)]

    with mp.Pool(args.num_threads) as pool:
        pool.starmap(
            functools.partial(process_wiki, offset=None), enumerate(wiki_metadata)
        )


if __name__ == "__main__":
    args = parser.parse_args()
    logs.configure_logging("wiki/archive")
    main(args)
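The `--worker_id`/`--num_workers` flags provide the hardware parallelism the README above calls for: each host takes every `num_workers`-th wiki. A small illustrative check (not part of the commit) that these round-robin shards are disjoint and cover everything:

```python
wikis = [f"wiki-{i}" for i in range(10)]
num_workers = 4

# Same selection expression as in main(), evaluated for every worker id.
shards = [
    [w for i, w in enumerate(wikis) if i % num_workers == worker_id]
    for worker_id in range(num_workers)
]

assert sorted(w for shard in shards for w in shard) == sorted(wikis)
```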
@@ -0,0 +1,65 @@
"""Download wiki dump metadata from the internet archive.

The licenseurl regexes we are using to search are mutually exclusive, so we can
split the query into multiple chunks instead of `OR`ing them together to get some
parallelism out of the metadata scrape.
"""

import argparse
import functools
import json
import multiprocessing.dummy as mp
import os

import internetarchive

from licensed_pile import logs
from licensed_pile.licenses import PermissiveLicenses

parser = argparse.ArgumentParser(
    description="Download metadata for wiki dumps from the IA."
)
parser.add_argument("--output_dir", default="data/metadata/", help="")
parser.add_argument("--file_name", default="ia-wiki-metadata.jsonl")
# TODO: Respect these
parser.add_argument("--include_wikicollections", action="store_true", help="")
parser.add_argument("--licenses", choices=[], action="append", help="")


def get_metadata(idx: int, query: str, file_name: str, output_dir: str):
    """Fetch item metadata from IA using query and save it to disk."""
    with open(os.path.join(output_dir, f"{idx:>05}_{file_name}"), "w") as wf:
        # search_items() yields plain dicts; iter_as_items() yields Item objects
        # that carry the full metadata record we want to save.
        for item in internetarchive.search_items(query).iter_as_items():
            wf.write(json.dumps(item.item_metadata) + "\n")


def make_queries(licenses, include_wikicollections):
    if include_wikicollections:
        raise NotImplementedError("...")
    license_regexs = licenses
    for license_regex in license_regexs:
        yield f"collection:(wikiteam) AND licenseurl:({license_regex})"


def main(args):
    # TODO: have something that translates from the PermissiveLicenses enum to regexes
    if args.licenses is None:
        args.licenses = [
            r"*\/by\/*",
            r"*\/by-sa\/*",
            "*publicdomain*",
            "*GNU_Free_Documentation_License*",
        ]
    queries = list(make_queries(args.licenses, args.include_wikicollections))
    with mp.Pool(len(queries)) as pool:
        pool.starmap(
            functools.partial(
                get_metadata, file_name=args.file_name, output_dir=args.output_dir
            ),
            enumerate(queries),
        )


if __name__ == "__main__":
    args = parser.parse_args()
    logs.configure_logging("wiki/archive")
    main(args)
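This metadata script writes one JSONL shard per query into `data/metadata/`, while the download script reads a single `data/ia-wikis.jsonl` by default. One way to bridge the two, assuming a simple concatenation is all that's needed (this merge step is not part of the commit):

```python
import glob

# Concatenate the per-query metadata shards (e.g. 00000_ia-wiki-metadata.jsonl)
# into the single file that download_archive.py expects by default.
with open("data/ia-wikis.jsonl", "w") as out:
    for shard in sorted(glob.glob("data/metadata/*_ia-wiki-metadata.jsonl")):
        with open(shard) as f:
            for line in f:
                if line.strip():
                    out.write(line)
```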