Additional PDF libraries for TestGrammar C++ PoC #18

Open
petervwyatt opened this issue Aug 9, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@petervwyatt
Member

Since each PDF library seems to have its own set of nuances, having a wider choice might show up further PDF file malformations and non-compliances, or even allow for additional checks.

Other multi-platform C/C++ PDF SDKs to consider:

  • updated pdfium (although the internal interfaces currently being utilized seem to have changed)
  • QPDF
  • MuPDF
  • PoDoFo
@petervwyatt petervwyatt added this to the TestGrammar C++ PoC milestone Aug 13, 2023
@petervwyatt petervwyatt added the enhancement New feature or request label Mar 16, 2024
@ceztko

ceztko commented Jul 16, 2024

I don't want to create expectations here, but I'm working hard on a PoDoFo 1.0.0 release for later this year, which will feature a stable API. That will be a good moment to evaluate PoDoFo.

@petervwyatt
Member Author

Great!
I also hope to soon write up an "experience report" article on some of the less well supported aspects needed by tools that might do deep file validation based on the Arlington PDF Model (e.g. knowing if something was an indirect reference or not, knowing if duplicate keys were present, knowing if a string was a hex-string or not, etc.).
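
To illustrate the hex-string point, here is a minimal sketch (C++17) of how a parser could record the original syntax of a string token before decoding it. The names `StringSyntax` and `classify_string_token` are hypothetical and do not come from any of the SDKs listed above; the only fact relied on is PDF syntax itself: literal strings open with `(`, hex strings open with a single `<`, and `<<` opens a dictionary.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical names for illustration only; not part of any PDF SDK.
enum class StringSyntax { Literal, Hex, NotAString };

// Classifies a raw PDF token by its opening delimiter:
//   (...)  literal string
//   <...>  hex string
//   <<     start of a dictionary, so not a string
StringSyntax classify_string_token(std::string_view raw) {
    if (raw.empty()) return StringSyntax::NotAString;
    if (raw.front() == '(') return StringSyntax::Literal;
    if (raw.front() == '<' && raw.size() >= 2 && raw[1] != '<')
        return StringSyntax::Hex;
    return StringSyntax::NotAString;
}

// Both tokens below decode to the same five bytes ("Hello"), which is
// why the distinction is lost unless the parser records it explicitly.
void classify_examples() {
    assert(classify_string_token("(Hello)") == StringSyntax::Literal);
    assert(classify_string_token("<48656C6C6F>") == StringSyntax::Hex);
    assert(classify_string_token("<</Type /Catalog>>") == StringSyntax::NotAString);
}
```

A library that only keeps the decoded bytes cannot answer this question afterwards, so the flag would have to be stored on the string object at parse time.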

@ceztko

ceztko commented Jul 18, 2024

Good! For the 1.0 stable API, I'm actually taking an axe to a lot of internal APIs that are not pretty enough to be exposed publicly, but after 1.0 it's certainly possible to investigate whether more details about the parsing process can be exposed. For the things you mentioned, here is the PoDoFo status:

  • Knowing if something was an indirect reference or not
  • Knowing if duplicate keys were present (in a PdfDictionary). [Partial] Today indexing is done on the raw value (escaped string): this can be detected by comparing the raw data with the unescaped/expanded string (both are kept). The entries are stored in a std::map, which doesn't allow multiple entries with exactly the same key. This could be migrated to std::multimap, which would allow detecting even this situation
  • Knowing if a string was a hex-string or not

A comprehensive list in the article will certainly help.

@petervwyatt
Member Author

A few more off the top of my head and without a lot of detail (some of this may also belong in the documentation so you know what to expect from the API - vs. having to work it out heuristically or getting a surprise 😀). Some of this is obtuse stuff but important if doing detailed low-level validation:

  • support for trailer dictionary as a "normal" dictionary (since no object ID) - so can iterate all keys, support private keys, etc.
  • knowing if a string object is Unicode or not (i.e. can get at the BOMs and BCP-47 language markers, etc.)
  • getting to the exact raw bytes of a string object (not re-encoded in UTF-8 or whatever is standard for the programming language; with escape sequences in-situ)
  • functionality/API still works even with unknown/unsupported encryption, since that only impacts strings and streams but the rest of the PDF objects are still functional and the DOM is navigable
  • duplicate keys: this is more complicated than it sounds, since (for example) /JS, /J#53, /#4aS and /#4a#53 are all the SAME key technically yet differently semantically. And being able to access each of these separately...
  • getting to the exact raw bytes of name objects (with #-hex codes, for example vs treated as UTF-8 or de-escaped)
  • treatment of keys that have explicit null values - can these still be seen/accessed? I know the spec says to treat them as non-existent, but knowing a key is present vs not can be important
  • handling of objects with object numbers > trailer Size entry? Are these hidden? Still accessible via API?
  • how revisions (incremental updates) of files are handled (is it possible to access the trailer of each revision? access old or deleted objects that are still present in the file? etc)
  • documentation on what "version" means in the API - is it just the header comment? Also the DocCatalog Version entry? What if revisions (incremental updates) are present? Can the DocCatalog Version entry also be extracted independently?
  • access to Linearization objects (technically not linked into the PDF DOM)
  • how are "hybrid reference PDFs" processed? Can they be processed as either a pre-PDF 1.5 processor (without any cross-reference and object stream support) and/or a post-PDF 1.5 processor (with cross-reference and object stream support)?
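
To make the name-object points above concrete, here is a rough C++17 sketch of `#xx` decoding for a raw name token. `decode_name` and `same_key` are hypothetical helper names, not functions from any PDF library, and leaving malformed escapes in place is an assumption made for the sketch; a validator would probably want to report them instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Value of one hex digit, or -1 if c is not a hex digit.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: decodes #xx escapes in a raw name token.
// A leading '/' is stripped if present. Malformed escapes are copied
// through unchanged (an assumption made for this sketch).
std::string decode_name(std::string_view raw) {
    if (!raw.empty() && raw.front() == '/') raw.remove_prefix(1);
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '#' && i + 2 < raw.size()) {
            const int hi = hex_value(raw[i + 1]);
            const int lo = hex_value(raw[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out.push_back(raw[i]);
    }
    return out;
}

// Two raw spellings name the same key when their decoded bytes match.
bool same_key(std::string_view a, std::string_view b) {
    return decode_name(a) == decode_name(b);
}

// The four spellings from the list above all decode to "JS".
void name_examples() {
    assert(decode_name("/JS") == "JS");
    assert(decode_name("/J#53") == "JS");
    assert(decode_name("/#4aS") == "JS");
    assert(decode_name("/#4a#53") == "JS");
}
```

For validation purposes an API would need to expose both views: the decoded bytes for key lookup, and the raw spelling so that each variant can still be told apart.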

@ceztko

ceztko commented Jul 19, 2024

  • duplicate keys: this is more complicated than it sounds, since (for example) /JS, /J#53, /#4aS and /#4a#53 are all the SAME key technically yet differently semantically. And being able to access each of these separately...

Ah, I understood duplicated keys in XRef sections/streams, not keys in a dictionary. In a PoDoFo PdfDictionary today indexing is done on the raw value (escaped string): this can be detected by comparing the raw data with the unescaped/expanded string (both are kept). The entries are stored in a std::map, which doesn't allow multiple entries with exactly the same key. This could be migrated to std::multimap, which would allow detecting even this situation (at least by iterating all pairs stored in the dictionary).
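
As a rough illustration of the std::multimap idea, the sketch below stores every entry under its decoded key and keeps the raw spelling alongside. This is not PoDoFo's implementation: `RawEntry`, `DictEntries` and `duplicate_keys` are hypothetical names, keying on the decoded name rather than the raw value is an assumption chosen for the example, and the value is a plain string standing in for a parsed object.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration; not PoDoFo's actual classes.
struct RawEntry {
    std::string raw_key;  // key as written in the file, e.g. "J#53"
    std::string value;    // placeholder for the parsed value object
};

// Keyed by the decoded name, so every spelling of a key lands together.
using DictEntries = std::multimap<std::string, RawEntry>;

// Returns each decoded key that occurs more than once, in sorted order.
std::vector<std::string> duplicate_keys(const DictEntries& dict) {
    std::vector<std::string> dups;
    for (auto it = dict.begin(); it != dict.end();
         it = dict.upper_bound(it->first)) {
        if (dict.count(it->first) > 1) dups.push_back(it->first);
    }
    return dups;
}

// A dictionary in which /JS and /J#53 are both present.
DictEntries example_dictionary() {
    DictEntries dict;
    dict.emplace("JS", RawEntry{"JS", "(first)"});
    dict.emplace("JS", RawEntry{"J#53", "(second)"});
    dict.emplace("S", RawEntry{"S", "/JavaScript"});
    return dict;
}

void duplicate_examples() {
    const DictEntries dict = example_dictionary();
    const std::vector<std::string> dups = duplicate_keys(dict);
    assert(dups.size() == 1);
    assert(dups[0] == "JS");
    // Each spelling stays individually accessible through equal_range.
    auto range = dict.equal_range("JS");
    assert(range.first->second.raw_key == "JS");
    assert(std::next(range.first)->second.raw_key == "J#53");
}
```

Since C++11, std::multimap keeps entries with equal keys in insertion order, so iterating a key's range also reflects the order in which the entries were added.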
