Additional PDF libraries for TestGrammar C++ PoC #18

petervwyatt · 2022-08-09T02:33:43Z

Since each PDF library seems to have its own set of nuances, having a wider choice might show up further PDF file malformations and non-compliances, or even allow for additional checks.

Other multi-platform C/C++ PDF SDKs to consider:

updated pdfium (although the internal interfaces currently being utilized seem to have changed)
QPDF
MuPDF
PoDoFo

ceztko · 2024-07-16T08:58:16Z

I don't want to create expectations here, but I'm working hard to have a 1.0.0 PoDoFo release later this year, which will feature a stable API. That will be a good moment to evaluate PoDoFo.

petervwyatt · 2024-07-17T04:27:51Z

Great!
I also hope to soon write up an "experience report" article on some of the lesser supported aspects needed by tools that might do deep file validation based on the Arlington PDF Model (e.g. knowing if something was an indirect reference or not, knowing if duplicate keys were present, knowing if a string was a hex-string or not, etc.).

ceztko · 2024-07-18T08:05:15Z

Good! For the 1.0 stable API, I'm actually taking an axe to a lot of internal APIs that are not pretty enough to be exposed publicly, but after 1.0 it's certainly possible to investigate whether more details about the parsing process can be exposed. Here is the PoDoFo status of the things you mentioned:

Knowing if something was an indirect reference or not
Knowing if duplicate keys were present (in a PdfDictionary). [Partial] Today indexing is done on the raw value (escaped string): this can be detected by comparing the raw data with the unescaped/expanded string (both are kept). The entries are stored in a std::map, which doesn't allow multiple entries with exactly the same key. It could be migrated to std::multimap, allowing even this situation to be detected
Knowing if a string was a hex-string or not

A comprehensive list in the article will certainly help.
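
To make the "hex-string or not" point concrete, here is a minimal sketch of how a parser could record the syntax a string was written in. It is not PoDoFo code: `StringSyntax` and `classifyStringToken` are hypothetical names, and the sketch assumes the caller already has the raw token bytes including their delimiters (C++17 for `std::string_view`).

```cpp
#include <string_view>

// How a string object was written in the file (hypothetical type).
enum class StringSyntax { Literal, Hex, NotAString };

// Classifies a raw token, delimiters included, by its outer syntax only.
// It does not validate the contents between the delimiters.
StringSyntax classifyStringToken(std::string_view token) {
    if (token.size() < 2)
        return StringSyntax::NotAString;
    if (token.front() == '(' && token.back() == ')')
        return StringSyntax::Literal;          // e.g. (JS)
    if (token.front() == '<' && token.back() == '>') {
        // "<<" opens a dictionary, not a hex string.
        if (token[1] == '<')
            return StringSyntax::NotAString;
        return StringSyntax::Hex;              // e.g. <4A53> or <>
    }
    return StringSyntax::NotAString;
}
```

A validator would need the library to keep this flag alongside the decoded bytes, since `(JS)` and `<4A53>` decode to the same value.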

petervwyatt · 2024-07-19T00:16:08Z

A few more off the top of my head and without a lot of detail (some of this may also belong in the documentation so you know what to expect from the API - vs. having to work it out heuristically or getting a surprise 😀). Some of this is obtuse stuff but important if doing detailed low-level validation:

support for trailer dictionary as a "normal" dictionary (since no object ID) - so can iterate all keys, support private keys, etc.
knowing if a string object is Unicode or not (i.e. can get at the BoMs and BCP-47 language markers, etc)
getting to the exact raw bytes of a string object (not re-encoded in UTF-8 or whatever is standard for the programming language; with escape sequences in-situ)
functionality/API still works even with unknown/unsupported encryption, since that only impacts strings and streams but the rest of the PDF objects are still functional and the DOM is navigable
duplicate keys: this is more complicated than it sounds, since (for example) /JS, /J#53, /#4aS and /#4a#53 are all the SAME key technically yet differently semantically. And being able to access each of these separately...
getting to the exact raw bytes of name objects (with #-hex codes, for example vs treated as UTF-8 or de-escaped)
treatment of keys that have explicit null values - can these still be seen/accessed? I know the spec says to treat them as non-existent, but knowing a key is present vs not can be important
handling of objects with object numbers > trailer Size entry? Are these hidden? Still accessible via API?
how revisions (incremental updates) of files are handled (is it possible to access the trailer of each revision? access old or deleted objects that are still present in the file? etc)
documentation on what "version" means in the API - is it just the header comment? Also the DocCatalog Version entry? What if revisions (incremental updates) are present? Can the DocCatalog Version entry also be extracted independently?
access to Linearization objects (technically not linked into the PDF DOM)
how are "hybrid reference PDFs" processed? Can they be processed as either a pre-PDF 1.5 processor (without any cross-reference and object stream support) and/or post-PDF 1.5 processor (with cross-reference and object stream support)
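
The name-related items above (duplicate keys, raw bytes of name objects) can be illustrated with a small sketch of `#`-hex decoding. It is not taken from any of the libraries mentioned: `decodePdfName` and `hexValue` are hypothetical helpers, and leaving a malformed escape such as `#zz` in place verbatim is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Returns the value of a hex digit, or -1 if c is not one.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes the #xx escapes of a raw name token. A leading '/' is dropped
// if present. Malformed escapes are copied through unchanged (assumption).
std::string decodePdfName(std::string_view raw) {
    if (!raw.empty() && raw.front() == '/')
        raw.remove_prefix(1);
    std::string out;
    out.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '#' && i + 2 < raw.size()) {
            const int hi = hexValue(raw[i + 1]);
            const int lo = hexValue(raw[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out.push_back(raw[i]);
    }
    return out;
}
```

With this helper, `/JS`, `/J#53`, `/#4aS` and `/#4a#53` all decode to `JS`, which is why a validator needs both the decoded name and the raw spelling to tell them apart.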

ceztko · 2024-07-19T08:19:56Z

duplicate keys: this is more complicated than it sounds, since (for example) /JS, /J#53, /#4aS and /#4a#53 are all the SAME key technically yet differently semantically. And being able to access each of these separately...

Ah, I had understood this as duplicate keys in XRef sections/streams, not keys in a dictionary. In a PoDoFo PdfDictionary today, indexing is done on the raw value (escaped string): this can be detected by comparing the raw data with the unescaped/expanded string (both are kept). The entries are stored in a std::map, which doesn't allow multiple entries with exactly the same key. It could be migrated to std::multimap, allowing even this situation to be detected (at least by iterating all pairs stored in the dictionary).
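
As a rough illustration of the std::multimap idea, the sketch below keeps every entry and reports keys that occur more than once. It does not reflect PoDoFo's actual data structures: `KeyIndex`, `duplicatedKeys` and `rawSpellings` are hypothetical, and indexing by decoded name with the raw spelling as the mapped value is an assumption of this sketch (the comment above describes indexing on the raw value).

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical index: decoded key name -> raw spelling as written in the file.
using KeyIndex = std::multimap<std::string, std::string>;

// Returns each decoded key that appears more than once, in sorted order.
std::vector<std::string> duplicatedKeys(const KeyIndex& index) {
    std::vector<std::string> dups;
    // Jump from one distinct key to the next with upper_bound.
    for (auto it = index.begin(); it != index.end();
         it = index.upper_bound(it->first)) {
        if (index.count(it->first) > 1)
            dups.push_back(it->first);
    }
    return dups;
}

// Returns every raw spelling recorded for one decoded key, in insertion order.
std::vector<std::string> rawSpellings(const KeyIndex& index,
                                      const std::string& decoded) {
    std::vector<std::string> out;
    const auto range = index.equal_range(decoded);
    for (auto it = range.first; it != range.second; ++it)
        out.push_back(it->second);
    return out;
}
```

Unlike std::map, whose `insert` leaves the container unchanged when the key already exists, std::multimap retains both `JS` and `J#53` under the decoded key `JS`, so each occurrence stays individually accessible.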

petervwyatt added this to the TestGrammar C++ PoC milestone Aug 13, 2023

petervwyatt added the enhancement New feature or request label Mar 16, 2024
