Added extract_text_by_page() #73

JustBobinAround · 2023-11-22T18:10:25Z

I needed to be able to extract text by page number, so I added the feature. Please let me know if you want a different name or if the function already exists.

jrmuizel · 2023-11-22T18:26:16Z

src/lib.rs

+
+pub fn output_doc_page(doc: &Document, output: &mut dyn OutputDev, page: u32) -> Result<(), OutputError> {
+    if let Ok(_) = doc.trailer.get(b"Encrypt") {
+        eprintln!("Encrypted documents are not currently supported: See https://github.com/J-F-Liu/lopdf/issues/168")


It looks like this was copied from an older version of output_doc?

Yes, it's just to pass the wanted page number and to remove the for loop that runs through each page. I needed access to the text per page in a set of documents for a vector DB I'm writing right now.

Also, I was wondering if I could write a prelude module for this, because currently the Document struct from lopdf is not accessible unless the user adds the same version of lopdf to their own Cargo.toml. Without that import, the user cannot call output_doc_page() with a reference to an already loaded Document, so they are forced to call extract_text_by_page, which reads from the file every time they want a page. A prelude module that re-exports the appropriate crates would resolve the issue. I'll submit that in a different pull request if you want.
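To make the re-export idea concrete, here is a minimal, dependency-free sketch. Everything in it is a stand-in: `backend` plays the role of lopdf, `extractor` plays the role of this crate, and `page_text` is a hypothetical helper, not an existing function. In the real crate the equivalent would presumably be a `pub use` of lopdf or of its `Document` type.

```rust
// Stand-in for the external PDF crate (lopdf in this thread).
// Hypothetical and heavily simplified: a document is just a list of page texts.
mod backend {
    #[derive(Debug, Default)]
    pub struct Document {
        pub pages: Vec<String>,
    }
}

// Stand-in for the extraction crate.
pub mod extractor {
    // Re-export the backend type so callers can name it without adding
    // their own, version-matched dependency on the backend crate.
    pub use super::backend::Document;

    // A prelude that gathers the names most callers need.
    pub mod prelude {
        pub use super::{page_text, Document};
    }

    /// Returns the text of `page` (1-based by assumption), or `None` if the
    /// page does not exist.
    pub fn page_text(doc: &Document, page: u32) -> Option<&str> {
        let index = page.checked_sub(1)? as usize;
        doc.pages.get(index).map(String::as_str)
    }
}
```

With something like this in place, a caller could write `use extractor::prelude::*;` and work with `Document` directly, so the type always matches the version the extraction crate was built against.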

Maybe accepting a reference to a document instead of a path would be better? Then, if we want to extract multiple pages from the same document, we don't have to load the document every time.
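A rough sketch of the difference between the two API shapes, using only the standard library. The `Document` type, `PageError`, `page_text`, and `page_text_reparsing` are all hypothetical names for illustration, and the "file" is simulated by a string whose pages are separated by form feed characters; none of this is lopdf's or this crate's actual API.

```rust
/// Hypothetical stand-in for a parsed PDF document.
pub struct Document {
    pages: Vec<String>,
}

impl Document {
    /// Pretends to parse a file. Here the input is plain text whose pages
    /// are separated by form feed characters (an assumption for the sketch).
    pub fn parse(raw: &str) -> Document {
        Document {
            pages: raw.split('\u{0C}').map(str::to_owned).collect(),
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum PageError {
    OutOfRange { page: u32, page_count: usize },
}

/// By-reference variant: the caller parses once and reuses the document
/// for as many pages as needed. Pages are 1-based by assumption.
pub fn page_text(doc: &Document, page: u32) -> Result<&str, PageError> {
    (page as usize)
        .checked_sub(1)
        .and_then(|index| doc.pages.get(index))
        .map(String::as_str)
        .ok_or(PageError::OutOfRange {
            page,
            page_count: doc.pages.len(),
        })
}

/// Source-based variant: convenient for a single page, but it parses the
/// whole input again on every call.
pub fn page_text_reparsing(raw: &str, page: u32) -> Result<String, PageError> {
    let doc = Document::parse(raw);
    page_text(&doc, page).map(str::to_owned)
}
```

Keeping the source-based function as a thin wrapper over the by-reference one means both entry points stay available, and callers extracting many pages pay the parsing cost only once.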

Added extract_text_by_page()

737e5cd

jrmuizel reviewed Nov 22, 2023

Merge branch 'jrmuizel:master' into master

98c5aae

JustBobinAround Nov 23, 2023

SzelamC Nov 28, 2023
