Update documentation to clarify some limitations of qpdf JSON

2024-12-22 19:08:59 +00:00 · 2022-09-06 10:00:50 -04:00
commit f95e0549cc
parent ed04b80caf
2 changed files with 72 additions and 7 deletions
--- a/19
+++ b/19
@@ -11,8 +11,6 @@ Next
 Before Release:

 * Stay on top of https://github.com/pikepdf/pikepdf/pull/315
-* Consider whether otherwise unreferenced object streams should be
-  included in json output. Probably not. Or maybe optionally.
 * Support json v2 in the C API. At a minimum, write_json,
  create_from_json, and update_from_json need to be there and should
  take the same kinds of functions as the C API for logger.
@@ -56,6 +54,20 @@ direct objects, which are always "resolved" in QPDFObjectHandle.
 Possible future JSON enhancements
 =================================

+* Consider not including unreferenced objects and trimming the trailer
+  in the same way that QPDFWriter does (except don't remove `/ID`).
+  This means excluding the linearization dictionary and hint stream,
+  the encryption dictionary, all keys from trailer that are removed by
+  QPDFWriter::getTrimmedTrailer except `/ID`, any object streams, and
+  the xref stream as long as all those objects are unreferenced. (They
+  always should be, but there could be some bizarre case of someone
+  creating a PDF file that has an indirect reference to one of those,
+  in which case we need to preserve it.) If this is done, make
+  `--preserve-unreferenced` preserve unreferenced objects and also
+  those extra keys. Search for "linear" and "trailer" in json.rst to
+  update the various places in the documentation that discuss this.
+  Also update the help for --json and --preserve-unreferenced.
+
 * Add to JSON output the information available from a few additional
  informational options:

@@ -376,7 +388,8 @@ I find it useful to make reference to them in this list.
  convertible back to a valid PDF. Since providing the password may
  reveal additional details, --show-encryption could potentially retry
  with this option if the first time doesn't work. Then, with the file
-  open, we can read the encryption dictionary normally.
+  open, we can read the encryption dictionary normally. If this is
+  done, search for "raw, encrypted" in json.rst.

 * In libtests, separate executables that need the object library
  from those that strictly use public API. Move as many of the test
--- a/manual/json.rst
+++ b/manual/json.rst
@@ -52,6 +52,22 @@ changes a handful of defaults so that the resulting JSON is as close
 as possible to the original input and is ready for being converted
 back to PDF.

+The qpdf JSON data includes unreferenced objects. This may be
+addressed in a future version of qpdf. For now, that means that
+certain objects that are not useful in the JSON representation are
+included. These include linearization and encryption dictionaries,
+linearization hint streams, object streams, and the cross-reference
+(xref) stream associated with the trailer dictionary where applicable.
+For the best experience with qpdf JSON, you can run the file through
+qpdf first to remove encryption, linearization, and object streams.
+For example:
+
+::
+
+   qpdf --decrypt --object-streams=disable in.pdf out.pdf
+   qpdf --json-output out.pdf out.json
+
+
 .. _json-terminology:

 JSON Terminology
@@ -299,10 +315,46 @@ Object Values
 Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``.
 As such, none of the things ``QPDFWriter`` does apply. This includes
 recompression of streams, renumbering of objects, removal of
-unreferenced objects, anything to do with object streams (which are
-not represented by qpdf JSON at all since they are PDF syntax, not
-semantics), encryption, decryption, linearization, QDF mode, etc. See
-:ref:`rewriting` for a more in-depth discussion.
+unreferenced objects, encryption, decryption, linearization, QDF
+mode, etc. See :ref:`rewriting` for a more in-depth discussion. This
+has a few noteworthy implications:
+
+- Decryption is handled transparently by qpdf. As there are no QPDF
+  APIs, even internal to the library, that allow retrieval of
+  encrypted data in its raw, encrypted form, qpdf JSON always includes
+  decrypted data. It is possible that a future version of qpdf may
+  allow access to raw, encrypted string and stream data.
+
+- Objects that are related to a PDF file's structure, rather than its
+  content, are included in the JSON output, even though they are not
+  particularly useful. A future version of qpdf may omit them and
+  let the :qpdf:ref:`--preserve-unreferenced` flag restore the
+  existing behavior. For now, to avoid this, run the
+  file through ``qpdf --decrypt --object-streams=disable in.pdf
+  out.pdf`` to generate a new PDF file that contains no unreferenced
+  or structural objects.
+
+  - Linearized PDF files include a linearization dictionary which is not
+    referenced from any other object and which references the
+    linearization hint stream by offset. The JSON from a linearized PDF
+    file contains both of these objects, even though they are not useful
+    in the JSON. Offset information is not represented in the JSON, so
+    there's no way to find the linearization hint stream from the
+    JSON. If a new PDF is created from that JSON output, the
+    objects will be read back in but will just be unreferenced objects
+    that will be ignored by ``QPDFWriter`` when the file is rewritten.
+
+  - The JSON from a file with object streams will include the original
+    object stream and will also include all the objects in the stream
+    as top-level objects.
+
+  - In files with object streams, the trailer "dictionary" is a
+    stream. In qpdf JSON files, the ``"trailer"`` key will contain
+    that stream's dictionary, including its stream-related keys, and
+    the stream will also appear as an unreferenced object.
+
+  - Encrypted files are decrypted, but the encryption dictionary still
+    appears in the JSON output.

 .. _json.example:
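
To illustrate the limitation documented in this change, the following is a minimal sketch (not part of the commit) of how one might list the unreferenced objects in qpdf JSON output. It assumes the version 2 layout, in which the second element of the top-level ``"qpdf"`` array maps ``"obj:N G R"`` keys and ``"trailer"`` to objects, and in which indirect references appear as strings of the form ``"N G R"``. The function names and the sample data are hypothetical. The sketch does not trim the trailer the way ``QPDFWriter`` does, so anything referenced from a trailer key is counted as referenced.

```python
import re

# Assumption: indirect references in qpdf JSON v2 are strings like "12 0 R".
REF_RE = re.compile(r"^\d+ \d+ R$")


def iter_refs(node):
    """Yield every indirect reference string found anywhere inside node."""
    if isinstance(node, str):
        if REF_RE.match(node):
            yield node
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_refs(value)


def unreferenced_objects(qpdf_json):
    """Return the "obj:..." keys that cannot be reached from the trailer."""
    # Assumption: the second array element holds the objects and the trailer.
    objects = qpdf_json["qpdf"][1]
    seen = set()
    pending = [objects["trailer"]]
    while pending:
        node = pending.pop()
        for ref in iter_refs(node):
            key = "obj:" + ref
            if key in objects and key not in seen:
                seen.add(key)
                pending.append(objects[key])
    return sorted(k for k in objects if k.startswith("obj:") and k not in seen)


# Hypothetical, hand-written data standing in for real qpdf output.
sample = {
    "qpdf": [
        {"jsonversion": 2},
        {
            "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
            "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
            "obj:3 0 R": {"value": {"/Linearized": 1}},
            "trailer": {"value": {"/Root": "1 0 R", "/Size": 4}},
        },
    ]
}
```

Calling ``unreferenced_objects(sample)`` reports only the stand-in linearization dictionary, since nothing reachable from the trailer refers to it. On real output, the same walk would be expected to flag object streams and the xref stream as well.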