mirror of
https://github.com/qpdf/qpdf.git
synced 2024-12-22 10:58:58 +00:00
Update documentation to clarify some limitations of qpdf JSON
parent
ed04b80caf
commit
f95e0549cc
19
TODO
@@ -11,8 +11,6 @@ Next
 Before Release:
 
 * Stay on top of https://github.com/pikepdf/pikepdf/pull/315
-* Consider whether otherwise unreferenced object streams should be
-  included in json output. Probably not. Or maybe optionally.
 * Support json v2 in the C API. At a minimum, write_json,
   create_from_json, and update_from_json need to be there and should
   take the same kinds of functions as the C API for logger.
@@ -56,6 +54,20 @@ direct objects, which are always "resolved" in QPDFObjectHandle.
 Possible future JSON enhancements
 =================================
 
+* Consider not including unreferenced objects and trimming the trailer
+  in the same way that QPDFWriter does (except don't remove `/ID`).
+  This means excluding the linearization dictionary and hint stream,
+  the encryption dictionary, all keys from trailer that are removed by
+  QPDFWriter::getTrimmedTrailer except `/ID`, any object streams, and
+  the xref stream as long as all those objects are unreferenced. (They
+  always should be, but there could be some bizarre case of someone
+  creating a PDF file that has an indirect reference to one of those,
+  in which case we need to preserve it.) If this is done, make
+  `--preserve-unreferenced` preserve unreferenced objects and also
+  those extra keys. Search for "linear" and "trailer" in json.rst to
+  update the various places in the documentation that discuss this.
+  Also update the help for --json and --preserve-unreferenced.
+
 * Add to JSON output the information available from a few additional
   informational options:
 
@@ -376,7 +388,8 @@ I find it useful to make reference to them in this list.
   convertible back to a valid PDF. Since providing the password may
   reveal additional details, --show-encryption could potentially retry
   with this option if the first time doesn't work. Then, with the file
-  open, we can read the encryption dictionary normally.
+  open, we can read the encryption dictionary normally. If this is
+  done, search for "raw, encrypted" in json.rst.
 
 * In libtests, separate executables that need the object library
   from those that strictly use public API. Move as many of the test
@@ -52,6 +52,22 @@ changes a handful of defaults so that the resulting JSON is as close
 as possible to the original input and is ready for being converted
 back to PDF.
 
+The qpdf JSON data includes unreferenced objects. This may be
+addressed in a future version of qpdf. For now, that means that
+certain objects that are not useful in the JSON representation are
+included. This includes linearization and encryption dictionaries,
+linearization hint streams, object streams, and the cross-reference
+(xref) stream associated with the trailer dictionary where applicable.
+For the best experience with qpdf JSON, you can run the file through
+qpdf first to remove encryption, linearization, and object streams.
+For example:
+
+::
+
+   qpdf --decrypt --object-streams=disable in.pdf out.pdf
+   qpdf --json-output out.pdf out.json
+
+
 .. _json-terminology:
 
 JSON Terminology
@@ -299,10 +315,46 @@ Object Values
 Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``.
 As such, none of the things ``QPDFWriter`` does apply. This includes
 recompression of streams, renumbering of objects, removal of
-unreferenced objects, anything to do with object streams (which are
-not represented by qpdf JSON at all since they are PDF syntax, not
-semantics), encryption, decryption, linearization, QDF mode, etc. See
-:ref:`rewriting` for a more in-depth discussion.
+unreferenced objects, encryption, decryption, linearization, QDF
+mode, etc. See :ref:`rewriting` for a more in-depth discussion. This
+has a few noteworthy implications:
+
+- Decryption is handled transparently by qpdf. As there are no QPDF
+  APIs, even internal to the library, that allow retrieval of
+  encrypted data in its raw, encrypted form, qpdf JSON always includes
+  decrypted data. It is possible that a future version of qpdf may
+  allow access to raw, encrypted string and stream data.
+
+- Objects that are related to a PDF file's structure, rather than its
+  content, are included in the JSON output, even though they are not
+  particularly useful. In a future version of qpdf, this may be fixed,
+  and the :qpdf:ref:`--preserve-unreferenced` flag may be able to be
+  used to get the existing behavior. For now, to avoid this, run the
+  file through ``qpdf --decrypt --object-streams=disable in.pdf
+  out.pdf`` to generate a new PDF file that contains no unreferenced
+  or structural objects.
+
+- Linearized PDF files include a linearization dictionary which is not
+  referenced from any other object and which references the
+  linearization hint stream by offset. The JSON from a linearized PDF
+  file contains both of these objects, even though they are not useful
+  in the JSON. Offset information is not represented in the JSON, so
+  there's no way to find the linearization hint stream from the
+  JSON. If a new PDF is created from JSON that was written, the
+  objects will be read back in but will just be unreferenced objects
+  that will be ignored by ``QPDFWriter`` when the file is rewritten.
+
+- The JSON from a file with object streams will include the original
+  object stream and will also include all the objects in the stream
+  as top-level objects.
+
+- In files with object streams, the trailer "dictionary" is a
+  stream. In qpdf JSON files, the ``"trailer"`` key will contain a
+  dictionary with all the keys in it relating to the stream, and the
+  stream will also appear as an unreferenced object.
+
+- Encrypted files are decrypted, but the encryption dictionary still
+  appears in the JSON output.
 
 .. _json.example:
 
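As an illustration of the json.rst changes above, the following is a minimal Python sketch of how a consumer of qpdf JSON might spot the structural objects the new text describes. It is not part of the commit. It assumes the version 2 layout in which the top-level ``"qpdf"`` key holds a two-element list (a header, then a dictionary of objects keyed as ``"obj:N G R"`` plus ``"trailer"``), that non-stream objects sit under ``"value"`` and streams under ``"stream"``/``"dict"``, and that the trailer still carries ``/Encrypt`` for a file that was encrypted. The function name ``find_structural_objects`` is hypothetical. In line with the documentation, the linearization hint stream is not detected, since nothing in the JSON identifies it.

```python
# Sketch only: the JSON layout used here is an assumption stated in the lead-in.

# Stream /Type values that describe file structure rather than content.
STRUCTURAL_STREAM_TYPES = {
    "/ObjStm": "object stream",
    "/XRef": "cross-reference stream",
}


def find_structural_objects(doc):
    """Return {object key: label} for structural objects found in qpdf JSON."""
    objects = doc["qpdf"][1]  # assumed v2 layout: [header, objects]
    found = {}
    for key, entry in objects.items():
        if not key.startswith("obj:"):
            continue  # skip the "trailer" entry
        if "stream" in entry:
            stream_type = entry["stream"].get("dict", {}).get("/Type")
            if stream_type in STRUCTURAL_STREAM_TYPES:
                found[key] = STRUCTURAL_STREAM_TYPES[stream_type]
        else:
            value = entry.get("value")
            # A linearization dictionary is recognized by its /Linearized key.
            if isinstance(value, dict) and "/Linearized" in value:
                found[key] = "linearization dictionary"
    # Assumption: /Encrypt in the trailer is an indirect reference like "5 0 R".
    trailer = objects.get("trailer", {}).get("value", {})
    encrypt_ref = trailer.get("/Encrypt")
    if isinstance(encrypt_ref, str) and f"obj:{encrypt_ref}" in objects:
        found[f"obj:{encrypt_ref}"] = "encryption dictionary"
    return found
```

To try it on real output, load the file written by ``qpdf --json-output`` with ``json.load`` and pass the resulting dictionary to the function. An empty result after running the file through ``qpdf --decrypt --object-streams=disable`` would be consistent with the workaround described in the documentation.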