-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Library of Congress public domain books (loc_books) #73
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the hard work! It looks really good and basically ready to merge!
Left a few comments, can you also create a script that calls all the steps in order?
from licensed_pile.licenses import PermissiveLicenses | ||
from licensed_pile.write import to_dolma | ||
|
||
data_path = Path(__file__).resolve().parent / "data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this configurable based on cli arguments?
|
||
self.progress_bar.total = len(download_urls) | ||
|
||
with mp.Pool(10) as pool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should make this num threads configurable via cli
self.progress_bar.total = len(download_urls) | ||
|
||
with mp.Pool(10) as pool: | ||
results = pool.imap(functools.partial(self.download_book), download_urls) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment about how we don't actually use this results
variable for more than updating the progress bar?
I thought there would be a bug based on consuming the iterable too early but I see that the real point is the download side effect
|
||
data_path = Path(__file__).resolve().parent / "data" | ||
|
||
metadata_exports_path = data_path / "exports/metadata" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Globals like this should be all upper case, e.g., METADATA_EXPORTS_PATH
Would it be hard to move this into the class by setting them during the __init__
? Then you could pass in data_path
to make it easy to configure.
self.progress_bar.update(1) | ||
pass | ||
|
||
def urls_to_download(self, text_file_urls): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we are pre-computing the files already downloaded? Would it be simpler to have a check in the download_book method and just log that a book was already downloaded and it is getting skipped?
@click.option("--dolma-shard-size", default=1, help="Shard file size in GB") | ||
@click.option( | ||
"--dolma-filename", | ||
required=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably doesn't need to be required
|
||
from licensed_pile import logs | ||
|
||
data_path = Path(__file__).resolve().parent / "data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
self.existing_pages_count += total_pages - len(pages_to_download) | ||
self.progress_bar.total += len(pages_to_download) | ||
|
||
with PoolExecutor(max_workers=10) as executor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason to use this over multiprocessing.dummy.Pool
like you did before?
loc_books/metadata.py
Outdated
|
||
self.date_facets = date_facets | ||
|
||
def check_existing_files(self, total_pages, output_folder): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question as above.
loc_books/metadata.py
Outdated
pages_to_download.append(page) | ||
return pages_to_download | ||
|
||
def download_page(self, facet_url, output_folder, page): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason we don't have retries like above?
This PR adds the code and documentation to download and export public domain books from the Library of Congress Selected Digitized Books collection. The PR closes issue #74 .