diff --git a/TODO b/TODO index ce962173..49d8be62 100644 --- a/TODO +++ b/TODO @@ -11,8 +11,6 @@ Next Before Release: * Stay on top of https://github.com/pikepdf/pikepdf/pull/315 -* Consider whether otherwise unreferenced object streams should be - included in json output. Probably not. Or maybe optionally. * Support json v2 in the C API. At a minimum, write_json, create_from_json, and update_from_json need to be there and should take the same kinds of functions as the C API for logger. @@ -56,6 +54,20 @@ direct objects, which are always "resolved" in QPDFObjectHandle. Possible future JSON enhancements ================================= +* Consider not including unreferenced objects and trimming the trailer + in the same way that QPDFWriter does (except don't remove `/ID`). + This means excluding the linearization dictionary and hint stream, + the encryption dictionary, all keys from trailer that are removed by + QPDFWriter::getTrimmedTrailer except `/ID`, any object streams, and + the xref stream as long as all those objects are unreferenced. (They + always should be, but there could be some bizarre case of someone + creating a PDF file that has an indirect reference to one of those, + in which case we need to preserve it.) If this is done, make + `--preserve-unreferenced` preserve unreferenced objects and also + those extra keys. Search for "linear" and "trailer" in json.rst to + update the various places in the documentation that discuss this. + Also update the help for --json and --preserve-unreferenced. + * Add to JSON output the information available from a few additional informational options: @@ -376,7 +388,8 @@ I find it useful to make reference to them in this list. convertible back to a valid PDF. Since providing the password may reveal additional details, --show-encryption could potentially retry with this option if the first time doesn't work. Then, with the file - open, we can read the encryption dictionary normally. 
+ open, we can read the encryption dictionary normally. If this is + done, search for "raw, encrypted" in json.rst. * In libtests, separate executables that need the object library from those that strictly use public API. Move as many of the test diff --git a/manual/json.rst b/manual/json.rst index 85210ee5..b062aacc 100644 --- a/manual/json.rst +++ b/manual/json.rst @@ -52,6 +52,22 @@ changes a handful of defaults so that the resulting JSON is as close as possible to the original input and is ready for being converted back to PDF. +The qpdf JSON data includes unreferenced objects. This may be +addressed in a future version of qpdf. For now, that means that +certain objects that are not useful in the JSON representation are +included. This includes linearization and encryption dictionaries, +linearization hint streams, object streams, and the cross-reference +(xref) stream associated with the trailer dictionary where applicable. +For the best experience with qpdf JSON, you can run the file through +qpdf first to remove encryption, linearization, and object streams. +For example: + +:: + + qpdf --decrypt --object-streams=disable in.pdf out.pdf + qpdf --json-output out.pdf out.json + + .. _json-terminology: JSON Terminology @@ -299,10 +315,46 @@ Object Values Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``. As such, none of the things ``QPDFWriter`` does apply. This includes recompression of streams, renumbering of objects, removal of -unreferenced objects, anything to do with object streams (which are -not represented by qpdf JSON at all since they are PDF syntax, not -semantics), encryption, decryption, linearization, QDF mode, etc. See -:ref:`rewriting` for a more in-depth discussion. +unreferenced objects, encryption, decryption, linearization, QDF +mode, etc. See :ref:`rewriting` for a more in-depth discussion. This +has a few noteworthy implications: + +- Decryption is handled transparently by qpdf. 
As there are no QPDF + APIs, even internal to the library, that allow retrieval of + encrypted data in its raw, encrypted form, qpdf JSON always includes + decrypted data. It is possible that a future version of qpdf may + allow access to raw, encrypted string and stream data. + +- Objects that are related to a PDF file's structure, rather than its + content, are included in the JSON output, even though they are not + particularly useful. In a future version of qpdf, this may be fixed, + and the :qpdf:ref:`--preserve-unreferenced` flag may provide a way + to retain the existing behavior. For now, to avoid this, run the + file through ``qpdf --decrypt --object-streams=disable in.pdf + out.pdf`` to generate a new PDF file that contains no unreferenced + or structural objects. + + - Linearized PDF files include a linearization dictionary that is not + referenced from any other object and that references the + linearization hint stream by offset. The JSON from a linearized PDF + file contains both of these objects, even though they are not useful + in the JSON. Offset information is not represented in the JSON, so + there's no way to find the linearization hint stream from the + JSON. If a new PDF is created from that JSON, the + objects are read back in but remain unreferenced, so they are + ignored by ``QPDFWriter`` when the file is rewritten. + + - The JSON from a file with object streams will include the original + object stream and will also include all the objects in the stream + as top-level objects. + + - In files with object streams, the trailer "dictionary" is a + stream. In qpdf JSON files, the ``"trailer"`` key will contain a + dictionary with all the keys in it relating to the stream, and the + stream will also appear as an unreferenced object. + + - Encrypted files are decrypted, but the encryption dictionary still + appears in the JSON output. .. _json.example:
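The unreferenced objects described in this change can be spotted from the JSON side with a small script. The following is a rough sketch, not part of qpdf, and it rests on assumptions about the qpdf JSON v2 layout: that the second element of the top-level ``"qpdf"`` array maps ``"obj:N G R"`` keys to objects, that indirect references appear as strings of the form ``"N G R"``, and that stream objects keep their dictionary under ``"stream"`` / ``"dict"``. The function names are hypothetical.

```python
import re

# Assumption: indirect references are JSON strings such as "12 0 R".
REF = re.compile(r"^\d+ \d+ R$")


def collect_refs(node, found):
    """Recursively add every reference-shaped string under node to found."""
    if isinstance(node, str):
        if REF.match(node):
            found.add(node)
    elif isinstance(node, dict):
        for value in node.values():
            collect_refs(value, found)
    elif isinstance(node, list):
        for value in node:
            collect_refs(value, found)


def object_body(entry):
    """Return the part of an object entry that can hold references."""
    if "stream" in entry:
        # Only the stream dictionary is walked; stream data is skipped.
        return entry["stream"].get("dict", {})
    return entry.get("value")


def unreferenced_objects(doc):
    """List objects that cannot be reached by following refs from the trailer."""
    objects = doc["qpdf"][1]  # assumed position of the object map
    pending = set()
    collect_refs(object_body(objects.get("trailer", {})), pending)
    reachable = set()
    while pending:
        ref = pending.pop()
        if ref in reachable:
            continue
        reachable.add(ref)
        entry = objects.get("obj:" + ref)
        if entry is not None:
            collect_refs(object_body(entry), pending)
    defined = {key[len("obj:"):] for key in objects if key.startswith("obj:")}
    by_number = lambda ref: tuple(int(part) for part in ref.split()[:2])
    return sorted(defined - reachable, key=by_number)


# Hand-written sample data standing in for real qpdf output.
sample = {
    "qpdf": [
        {"jsonversion": 2},
        {
            "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
            "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
            "obj:3 0 R": {"value": {"/Linearized": 1}},
            "obj:4 0 R": {"stream": {"dict": {"/Type": "/ObjStm"}, "data": ""}},
            "trailer": {"value": {"/Root": "1 0 R", "/Size": 5}},
        },
    ]
}

print(unreferenced_objects(sample))  # prints ['3 0 R', '4 0 R']
```

To try this on real output, load the file written by ``qpdf --json-output out.pdf out.json`` with ``json.load`` and pass the result to ``unreferenced_objects``. Reachability is computed from the trailer, so an object that is referenced only by another unreferenced object is still reported.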