
Library of Congress public domain books (loc_books) #73

Open
storytracer wants to merge 11 commits into main

Conversation


@storytracer storytracer commented May 15, 2024

This PR adds the code and documentation to download and export public domain books from the Library of Congress Selected Digitized Books collection. It closes issue #74.

@blester125 (Collaborator) left a comment

Thanks for the hard work! It looks really good and basically ready to merge!

I left a few comments. Can you also create a script that calls all the steps in order?

from licensed_pile.licenses import PermissiveLicenses
from licensed_pile.write import to_dolma

data_path = Path(__file__).resolve().parent / "data"

Can we make this configurable via CLI arguments?
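A minimal sketch of what this could look like with click (which the repo already uses elsewhere); the option name and default directory are illustrative, not taken from the PR:

```python
# Hypothetical sketch: expose the data path as a CLI option instead of
# a module-level constant. "--data-path" and its default are assumptions.
from pathlib import Path

import click


@click.command()
@click.option(
    "--data-path",
    default="data",
    show_default=True,
    help="Directory where downloaded files are stored.",
)
def main(data_path):
    # Resolve once to a Path and use it wherever the old constant was used.
    click.echo(str(Path(data_path)))


if __name__ == "__main__":
    main()
```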


self.progress_bar.total = len(download_urls)

with mp.Pool(10) as pool:

We should make the number of threads configurable via the CLI.
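One way this could look, assuming `mp` is `multiprocessing.dummy` as elsewhere in the PR; `num_threads` and the placeholder `download_book` are illustrative:

```python
# Illustrative sketch: the hard-coded Pool(10) replaced by a num_threads
# value that a CLI flag could supply. download_book is a stand-in for
# the real download function.
import functools
import multiprocessing.dummy as mp  # thread-backed Pool


def download_book(url, suffix=""):
    return url + suffix  # placeholder for the real download side effect


def download_all(download_urls, num_threads=10):
    with mp.Pool(num_threads) as pool:
        return list(
            pool.imap(functools.partial(download_book, suffix=".txt"), download_urls)
        )
```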

self.progress_bar.total = len(download_urls)

with mp.Pool(10) as pool:
results = pool.imap(functools.partial(self.download_book), download_urls)

Can you add a comment about how we don't actually use this results variable for more than updating the progress bar?

I thought there would be a bug based on consuming the iterable too early but I see that the real point is the download side effect
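The suggested comment might look something like this in code form; the function and the `progress` list are illustrative stand-ins:

```python
# Sketch: the imap iterator is consumed only so each download side effect
# runs and the progress counter advances; the yielded values are discarded.
import multiprocessing.dummy as mp


def run_downloads(urls, worker, progress):
    with mp.Pool(4) as pool:
        # NOTE: results are unused beyond driving progress updates;
        # iterating forces each download to actually happen.
        for _ in pool.imap(worker, urls):
            progress.append(1)  # stand-in for self.progress_bar.update(1)
```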


data_path = Path(__file__).resolve().parent / "data"

metadata_exports_path = data_path / "exports/metadata"

Globals like this should be all upper case, e.g., METADATA_EXPORTS_PATH

Would it be hard to move this into the class by setting them during the __init__? Then you could pass in data_path to make it easy to configure.
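A hypothetical restructuring along these lines; the class name and constant are illustrative, not taken from the PR:

```python
# Sketch: the path becomes an instance attribute set in __init__, with
# an upper-case module constant holding only the relative default.
from pathlib import Path

DEFAULT_DATA_PATH = Path("data")


class MetadataExporter:
    def __init__(self, data_path=DEFAULT_DATA_PATH):
        self.data_path = Path(data_path)
        self.metadata_exports_path = self.data_path / "exports" / "metadata"
```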

self.progress_bar.update(1)
pass

def urls_to_download(self, text_file_urls):

Is there a reason we are pre-computing the files already downloaded? Would it be simpler to have a check in the download_book method and just log that a book was already downloaded and it is getting skipped?
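The alternative being suggested could be sketched like this; the filename scheme, logger name, and placeholder fetch are assumptions:

```python
# Sketch: check (and log) inside the download method itself instead of
# pre-computing the set of URLs still to download.
import logging
from pathlib import Path

logger = logging.getLogger("loc_books")


def download_book(url, output_folder):
    target = Path(output_folder) / url.rsplit("/", 1)[-1]
    if target.exists():
        logger.info("Skipping %s: already downloaded", url)
        return target
    # ... the real HTTP fetch would happen here ...
    target.write_text("")  # placeholder content
    return target
```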

@click.option("--dolma-shard-size", default=1, help="Shard file size in GB")
@click.option(
"--dolma-filename",
required=True,

Probably doesn't need to be required
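For example, the option could carry a derived default instead of `required=True`; the default filename shown is purely illustrative:

```python
# Sketch: a sensible default in place of required=True.
import click


@click.command()
@click.option(
    "--dolma-filename",
    default="loc_books.jsonl.gz",
    show_default=True,
    help="Name of the output Dolma shard files.",
)
def export(dolma_filename):
    click.echo(dolma_filename)
```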


from licensed_pile import logs

data_path = Path(__file__).resolve().parent / "data"

Same as above

self.existing_pages_count += total_pages - len(pages_to_download)
self.progress_bar.total += len(pages_to_download)

with PoolExecutor(max_workers=10) as executor:

Is there a specific reason to use this over multiprocessing.dummy.Pool like you did before?
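Both forms run the worker in threads, so behavior should match; a side-by-side sketch with an illustrative `fetch` stand-in:

```python
# The ThreadPoolExecutor style used here beside the multiprocessing.dummy
# Pool style used earlier; both produce identical results.
import multiprocessing.dummy as mp
from concurrent.futures import ThreadPoolExecutor as PoolExecutor


def fetch(page):
    return page * 2  # stand-in for the real page download


def with_executor(pages, max_workers=10):
    with PoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, pages))


def with_dummy_pool(pages, workers=10):
    with mp.Pool(workers) as pool:
        return list(pool.imap(fetch, pages))
```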


self.date_facets = date_facets

def check_existing_files(self, total_pages, output_folder):

Same question as above.

pages_to_download.append(page)
return pages_to_download

def download_page(self, facet_url, output_folder, page):

Is there a reason we don't have retries here like above?
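The retry pattern being referred to might be sketched like this; the retry count, backoff, and `fetch` callable are illustrative stand-ins, not the PR's actual implementation:

```python
# Sketch: retry a flaky download a few times with linear backoff
# before giving up and re-raising the last error.
import time


def download_with_retries(fetch, url, max_retries=3, backoff=0.5):
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(backoff * (attempt + 1))
```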
