Update documentation to clarify some limitations of qpdf JSON

Jay Berkenbilt 2022-09-06 10:00:50 -04:00
parent ed04b80caf
commit f95e0549cc
2 changed files with 72 additions and 7 deletions

TODO

@ -11,8 +11,6 @@ Next
Before Release:
* Stay on top of https://github.com/pikepdf/pikepdf/pull/315
* Consider whether otherwise unreferenced object streams should be
included in json output. Probably not. Or maybe optionally.
* Support json v2 in the C API. At a minimum, write_json,
create_from_json, and update_from_json need to be there and should
take the same kinds of functions as the C API for logger.
@ -56,6 +54,20 @@ direct objects, which are always "resolved" in QPDFObjectHandle.
Possible future JSON enhancements
=================================
* Consider not including unreferenced objects and trimming the trailer
in the same way that QPDFWriter does (except don't remove `/ID`).
This means excluding the linearization dictionary and hint stream,
the encryption dictionary, all keys from trailer that are removed by
QPDFWriter::getTrimmedTrailer except `/ID`, any object streams, and
the xref stream as long as all those objects are unreferenced. (They
always should be, but there could be some bizarre case of someone
creating a PDF file that has an indirect reference to one of those,
in which case we need to preserve it.) If this is done, make
`--preserve-unreferenced` preserve unreferenced objects and also
those extra keys. Search for "linear" and "trailer" in json.rst to
update the various places in the documentation that discuss this.
Also update the help for --json and --preserve-unreferenced.
* Add to JSON output the information available from a few additional
informational options:
@ -376,7 +388,8 @@ I find it useful to make reference to them in this list.
convertible back to a valid PDF. Since providing the password may
reveal additional details, --show-encryption could potentially retry
with this option if the first time doesn't work. Then, with the file
open, we can read the encryption dictionary normally.
open, we can read the encryption dictionary normally. If this is
done, search for "raw, encrypted" in json.rst.
* In libtests, separate executables that need the object library
from those that strictly use public API. Move as many of the test


@ -52,6 +52,22 @@ changes a handful of defaults so that the resulting JSON is as close
as possible to the original input and is ready for being converted
back to PDF.
The qpdf JSON data includes unreferenced objects. This may be
addressed in a future version of qpdf. For now, that means that
certain objects that are not useful in the JSON representation are
included. This includes linearization and encryption dictionaries,
linearization hint streams, object streams, and the cross-reference
(xref) stream associated with the trailer dictionary where applicable.
For the best experience with qpdf JSON, you can run the file through
qpdf first to remove encryption, linearization, and object streams.
For example:

::

   qpdf --decrypt --object-streams=disable in.pdf out.pdf
   qpdf --json-output out.pdf out.json

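The two commands above can also be driven from a script. The following is a minimal sketch, assuming a ``qpdf`` executable is on ``PATH``; the helper names ``qpdf_json_commands`` and ``pdf_to_clean_json`` are hypothetical and not part of qpdf. Only the command-line flags shown in the example above are used.

```python
import subprocess


def qpdf_json_commands(in_pdf, clean_pdf, out_json):
    """Return the two qpdf command lines as argument lists (hypothetical helper)."""
    return [
        # Step 1: remove encryption and object streams.
        ["qpdf", "--decrypt", "--object-streams=disable", in_pdf, clean_pdf],
        # Step 2: write qpdf JSON from the cleaned file.
        ["qpdf", "--json-output", clean_pdf, out_json],
    ]


def pdf_to_clean_json(in_pdf, clean_pdf, out_json):
    """Run both steps in order; requires a qpdf executable on PATH."""
    for argv in qpdf_json_commands(in_pdf, clean_pdf, out_json):
        # check=True raises CalledProcessError on any non-zero exit status.
        subprocess.run(argv, check=True)
```

Passing argument lists rather than a shell string avoids quoting problems with file names that contain spaces. Note that this sketch treats every non-zero exit status as a failure, which may be stricter than desired if qpdf only reports warnings.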
.. _json-terminology:
JSON Terminology
@ -299,10 +315,46 @@ Object Values
Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``.
As such, none of the things ``QPDFWriter`` does apply. This includes
recompression of streams, renumbering of objects, removal of
unreferenced objects, anything to do with object streams (which are
not represented by qpdf JSON at all since they are PDF syntax, not
semantics), encryption, decryption, linearization, QDF mode, etc. See
:ref:`rewriting` for a more in-depth discussion.
unreferenced objects, encryption, decryption, linearization, QDF
mode, etc. See :ref:`rewriting` for a more in-depth discussion. This
has a few noteworthy implications:
- Decryption is handled transparently by qpdf. As there are no QPDF
APIs, even internal to the library, that allow retrieval of
encrypted data in its raw, encrypted form, qpdf JSON always includes
decrypted data. It is possible that a future version of qpdf may
allow access to raw, encrypted string and stream data.
- Objects that are related to a PDF file's structure, rather than its
content, are included in the JSON output, even though they are not
particularly useful. A future version of qpdf may omit them, in
which case the :qpdf:ref:`--preserve-unreferenced` flag may be
usable to get the existing behavior. For now, to avoid this, run the
file through ``qpdf --decrypt --object-streams=disable in.pdf
out.pdf`` to generate a new PDF file that contains no unreferenced
or structural objects.
- Linearized PDF files include a linearization dictionary which is not
referenced from any other object and which references the
linearization hint stream by offset. The JSON from a linearized PDF
file contains both of these objects, even though they are not useful
in the JSON. Offset information is not represented in the JSON, so
there's no way to find the linearization hint stream from the
JSON. If a new PDF is created from that JSON, the objects will be
read back in but will just be unreferenced objects that
``QPDFWriter`` ignores when the file is rewritten.
- The JSON from a file with object streams will include the original
object stream and will also include all the objects in the stream
as top-level objects.
- In files with object streams, the trailer "dictionary" is a
stream. In qpdf JSON files, the ``"trailer"`` key will contain a
dictionary with all the keys in it relating to the stream, and the
stream will also appear as an unreferenced object.
- Encrypted files are decrypted, but the encryption dictionary still
appears in the JSON output.
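To see which objects in a given JSON file fall into the categories listed above, one can look for objects that are not reachable from the trailer. The following is a rough sketch, not something qpdf provides. It assumes the qpdf JSON version 2 layout in which the top-level ``"qpdf"`` key holds a two-element array (a header followed by a dictionary of ``"obj:N G R"`` keys plus ``"trailer"``) and in which indirect references are written as strings of the form ``"N G R"``. The function names and the sample data are made up for illustration.

```python
import re

# Assumed spelling of indirect references in qpdf JSON v2: "<obj> <gen> R".
REF_RE = re.compile(r"^\d+ \d+ R$")


def iter_refs(node):
    """Yield every indirect-reference string found anywhere inside node."""
    if isinstance(node, str):
        if REF_RE.match(node):
            yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)


def unreferenced_objects(qpdf_json):
    """Return the sorted "obj:N G R" keys not reachable from the trailer."""
    objects = qpdf_json["qpdf"][1]  # assumed layout: [header, objects]
    seen = set()
    pending = [objects["trailer"]]
    while pending:
        node = pending.pop()
        for ref in iter_refs(node):
            key = "obj:" + ref
            if key in objects and key not in seen:
                seen.add(key)
                pending.append(objects[key])
    return sorted(k for k in objects if k.startswith("obj:") and k not in seen)


# Hand-written sample, not real qpdf output: object 3 stands in for a
# linearization dictionary that nothing refers to.
sample = {
    "qpdf": [
        {"jsonversion": 2},
        {
            "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
            "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
            "obj:3 0 R": {"value": {"/Linearized": 1}},
            "trailer": {"value": {"/Root": "1 0 R", "/Size": 4}},
        },
    ]
}

print(unreferenced_objects(sample))  # prints ['obj:3 0 R']
```

On real output, the reported keys would be expected to include the linearization dictionary, hint stream, object streams, xref stream, and encryption dictionary where those are present, since the trailer in the JSON does not lead to them.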
.. _json.example: