
Library of Congress public domain books (loc_books) #73

Open
storytracer wants to merge 11 commits into main

Conversation


@storytracer storytracer commented May 15, 2024

This PR adds the code and documentation to download and export public domain books from the Library of Congress Selected Digitized Books collection. It closes issue #74.

@blester125 (Collaborator) left a comment

Thanks for the hard work! It looks really good and basically ready to merge!

I left a few comments. Can you also create a script that calls all the steps in order?

from licensed_pile.licenses import PermissiveLicenses
from licensed_pile.write import to_dolma

data_path = Path(__file__).resolve().parent / "data"

Can we make this configurable via CLI arguments?
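A minimal sketch of what this could look like with click (which the repo already uses elsewhere); the option name and default directory are illustrative, not taken from the PR:

```python
# Hypothetical sketch: expose the data path as a CLI option instead of
# a module-level constant. "--data-path" and its default are assumptions.
from pathlib import Path

import click


@click.command()
@click.option(
    "--data-path",
    default="data",
    show_default=True,
    help="Directory where downloaded files are stored.",
)
def main(data_path):
    # Resolve once to a Path and use it wherever the old constant was used.
    click.echo(str(Path(data_path)))


if __name__ == "__main__":
    main()
```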


self.progress_bar.total = len(download_urls)

with mp.Pool(10) as pool:

We should make the number of threads configurable via the CLI.
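One way this could look, assuming `mp` is `multiprocessing.dummy` as elsewhere in the PR; `num_threads` and the placeholder `download_book` are illustrative:

```python
# Illustrative sketch: the hard-coded Pool(10) replaced by a num_threads
# value that a CLI flag could supply. download_book is a stand-in for
# the real download function.
import functools
import multiprocessing.dummy as mp  # thread-backed Pool


def download_book(url, suffix=""):
    return url + suffix  # placeholder for the real download side effect


def download_all(download_urls, num_threads=10):
    with mp.Pool(num_threads) as pool:
        return list(
            pool.imap(functools.partial(download_book, suffix=".txt"), download_urls)
        )
```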

self.progress_bar.total = len(download_urls)

with mp.Pool(10) as pool:
results = pool.imap(functools.partial(self.download_book), download_urls)

Can you add a comment about how we don't actually use this results variable for more than updating the progress bar?

I thought there would be a bug based on consuming the iterable too early but I see that the real point is the download side effect
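The suggested comment might look something like this in code form; the function and the `progress` list are illustrative stand-ins:

```python
# Sketch: the imap iterator is consumed only so each download side effect
# runs and the progress counter advances; the yielded values are discarded.
import multiprocessing.dummy as mp


def run_downloads(urls, worker, progress):
    with mp.Pool(4) as pool:
        # NOTE: results are unused beyond driving progress updates;
        # iterating forces each download to actually happen.
        for _ in pool.imap(worker, urls):
            progress.append(1)  # stand-in for self.progress_bar.update(1)
```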


data_path = Path(__file__).resolve().parent / "data"

metadata_exports_path = data_path / "exports/metadata"

Globals like this should be all upper case, e.g., METADATA_EXPORTS_PATH

Would it be hard to move this into the class by setting them during the __init__? Then you could pass in data_path to make it easy to configure.
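A hypothetical restructuring along these lines; the class name and constant are illustrative, not taken from the PR:

```python
# Sketch: the path becomes an instance attribute set in __init__, with
# an upper-case module constant holding only the relative default.
from pathlib import Path

DEFAULT_DATA_PATH = Path("data")


class MetadataExporter:
    def __init__(self, data_path=DEFAULT_DATA_PATH):
        self.data_path = Path(data_path)
        self.metadata_exports_path = self.data_path / "exports" / "metadata"
```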

self.progress_bar.update(1)
pass

def urls_to_download(self, text_file_urls):

Is there a reason we are pre-computing the files already downloaded? Would it be simpler to have a check in the download_book method and just log that a book was already downloaded and it is getting skipped?
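The alternative being suggested could be sketched like this; the filename scheme, logger name, and placeholder fetch are assumptions:

```python
# Sketch: check (and log) inside the download method itself instead of
# pre-computing the set of URLs still to download.
import logging
from pathlib import Path

logger = logging.getLogger("loc_books")


def download_book(url, output_folder):
    target = Path(output_folder) / url.rsplit("/", 1)[-1]
    if target.exists():
        logger.info("Skipping %s: already downloaded", url)
        return target
    # ... the real HTTP fetch would happen here ...
    target.write_text("")  # placeholder content
    return target
```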

@click.option("--dolma-shard-size", default=1, help="Shard file size in GB")
@click.option(
"--dolma-filename",
required=True,

Probably doesn't need to be required
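For example, the option could carry a derived default instead of `required=True`; the default filename shown is purely illustrative:

```python
# Sketch: a sensible default in place of required=True.
import click


@click.command()
@click.option(
    "--dolma-filename",
    default="loc_books.jsonl.gz",
    show_default=True,
    help="Name of the output Dolma shard files.",
)
def export(dolma_filename):
    click.echo(dolma_filename)
```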


from licensed_pile import logs

data_path = Path(__file__).resolve().parent / "data"

Same as above

self.existing_pages_count += total_pages - len(pages_to_download)
self.progress_bar.total += len(pages_to_download)

with PoolExecutor(max_workers=10) as executor:

Is there a specific reason to use this over multiprocessing.dummy.Pool like you did before?
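Both forms run the worker in threads, so behavior should match; a side-by-side sketch with an illustrative `fetch` stand-in:

```python
# The ThreadPoolExecutor style used here beside the multiprocessing.dummy
# Pool style used earlier; both produce identical results.
import multiprocessing.dummy as mp
from concurrent.futures import ThreadPoolExecutor as PoolExecutor


def fetch(page):
    return page * 2  # stand-in for the real page download


def with_executor(pages, max_workers=10):
    with PoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, pages))


def with_dummy_pool(pages, workers=10):
    with mp.Pool(workers) as pool:
        return list(pool.imap(fetch, pages))
```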


self.date_facets = date_facets

def check_existing_files(self, total_pages, output_folder):

Same question as above.

pages_to_download.append(page)
return pages_to_download

def download_page(self, facet_url, output_folder, page):

Is there a reason we don't have retries here like above?
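The retry pattern being referred to might be sketched like this; the retry count, backoff, and `fetch` callable are illustrative stand-ins, not the PR's actual implementation:

```python
# Sketch: retry a flaky download a few times with linear backoff
# before giving up and re-raising the last error.
import time


def download_with_retries(fetch, url, max_retries=3, backoff=0.5):
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(backoff * (attempt + 1))
```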
