.. cSpell:ignore moddifyannotations
.. cSpell:ignore feff

.. _json:

qpdf JSON
=========

.. _json-overview:

Overview
--------

Beginning with qpdf version 11.0.0, the qpdf library and command-line
program can produce a JSON representation of a PDF file. qpdf version
11 introduces JSON format version 2. Prior to qpdf 11, versions 8.3.0
onward had a more limited JSON representation accessible only from the
command-line. For details on what changed, see :ref:`json-v2-changes`.
The rest of this chapter documents qpdf JSON version 2.

Please note: this chapter discusses *qpdf JSON format*, which
represents the contents of a PDF file. This is distinct from the
*QPDFJob JSON format*, which provides a higher-level interface for
interacting with qpdf the way the command-line tool does. For
information about that, see :ref:`qpdf-job`.

The qpdf JSON format is specific to qpdf. There are two ways to use
qpdf JSON:

- The :qpdf:ref:`--json` command-line flag causes creation of a JSON
  representation of all the objects in a PDF file, excluding stream
  data. This includes an unambiguous representation of the PDF object
  structure and also provides JSON-formatted summaries of other
  information about the file. This functionality is built into
  ``QPDFJob`` and can be accessed from the ``qpdf`` command-line tool
  or from the ``QPDFJob`` C or C++ API.

- qpdf can create a JSON file that completely represents a PDF file.
  You can think of this as using JSON as an *alternative syntax* for
  representing a PDF file. Using qpdf JSON, it is possible to
  convert a PDF file to JSON, manipulate the structure or contents of
  the objects at a low level, and convert the results back to a PDF
  file. This functionality can be accessed from the command-line with
  the :qpdf:ref:`--json-output`, :qpdf:ref:`--json-input`, and
  :qpdf:ref:`--update-from-json` flags, or from the API using the
  ``QPDF::writeJSON``, ``QPDF::createFromJSON``, and
  ``QPDF::updateFromJSON`` methods.

.. _json-terminology:

JSON Terminology
----------------

Notes about terminology:

- In JavaScript and JSON, that thing that has keys and values is
  typically called an *object*.

- In PDF, that thing that has keys and values is typically called a
  *dictionary*. An *object* is a PDF object such as integer, real,
  boolean, null, string, array, dictionary, or stream.

- Some languages that use JSON call an *object* a *dictionary*, a
  *map*, or a *hash*.

- Sometimes, it's called an *object* if it has fixed keys and a
  *dictionary* if it has variable keys.

This manual is not entirely consistent about its use of *dictionary*
vs. *object* because sometimes one term or another is clearer in
context. Just be aware of the ambiguity when reading the manual. We
frequently use the term *dictionary* to refer to a JSON object because
of the consistency with PDF terminology.

.. _what-qpdf-json-is-not:

What qpdf JSON is not
---------------------

Please note that qpdf JSON offers a convenient syntax for manipulating
PDF files at a low level using JSON syntax. JSON syntax is much easier
to work with than native PDF syntax, and there are good JSON libraries
in virtually every commonly used programming language. Working with
PDF objects in JSON removes the need to worry about stream lengths,
cross reference tables, and PDF-specific representations of Unicode or
binary strings that appear outside of content streams. It does not
eliminate the need to understand the semantic structure of PDF files.
Working with qpdf JSON still requires familiarity with the PDF
specification.

In particular, qpdf JSON *does not* provide any of the following
capabilities:

- Text extraction. While you could use qpdf JSON syntax to navigate to
  a page's content streams and font structures, text within pages is
  still encoded using PDF syntax within content streams, and there is
  no assistance for text extraction.

- Reflowing text, document structure. qpdf JSON does not add any new
  information or insight into the content of PDF files. If you have a
  PDF file that lacks any structural information, qpdf JSON won't help
  you solve any of those problems.

This is what we mean when we say that JSON provides an *alternative
syntax* for working with PDF data. Semantically, it is identical to
native PDF.

.. _qpdf-json:

qpdf JSON Format
----------------

This section describes how qpdf represents PDF objects in JSON format.
It also describes how to work with qpdf JSON to create or
modify PDF files.

.. _json.objects:

qpdf JSON Object Representation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section describes the representation of PDF objects in qpdf JSON
version 2. PDF objects are represented within the ``"objects"``
dictionary of a qpdf JSON file. This is true both for PDF serialized
to JSON (:qpdf:ref:`--json-output`, ``QPDF::writeJSON``) and for
objects as they appear in the output of ``qpdf`` with the
:qpdf:ref:`--json` option.

Each key in the ``"objects"`` dictionary is either ``"trailer"`` or a
string of the form ``"obj:O G R"`` where ``O`` and ``G`` are the
object and generation numbers and ``R`` is the literal string ``R``.
This is the PDF syntax for the indirect object reference prepended by
``obj:``. The value, representing the object itself, is a JSON object
whose structure is described below.

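As a minimal sketch of the key format just described, the following
Python converts between ``"objects"`` keys and object/generation
numbers. The helper names ``parse_object_key`` and
``key_for_reference`` are made up for this illustration and are not
part of qpdf.

.. code-block:: python

   import re

   # "obj:" followed by the PDF indirect reference syntax "O G R"
   _KEY_RE = re.compile(r"^obj:(\d+) (\d+) R$")


   def parse_object_key(key):
       """Return (object, generation) for an "obj:O G R" key, None for "trailer"."""
       if key == "trailer":
           return None
       match = _KEY_RE.match(key)
       if match is None:
           raise ValueError(f"not a qpdf JSON object key: {key!r}")
       return int(match.group(1)), int(match.group(2))


   def key_for_reference(ref):
       """Turn an indirect reference such as "3 0 R" into its "objects" key."""
       return "obj:" + ref


   print(parse_object_key("obj:3 0 R"))  # (3, 0)
   print(key_for_reference("3 0 R"))     # obj:3 0 R

Because the key is just the reference with ``obj:`` prepended, looking
up the target of a reference never requires parsing the reference
itself.
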
Top-level Stream Objects
  Stream objects are represented as a JSON object with the single key
  ``"stream"``. The stream object has a key called ``"dict"`` whose
  value is the stream dictionary as an object value (described below)
  with the ``"/Length"`` key omitted. Other keys are determined by the
  value for json stream data (:qpdf:ref:`--json-stream-data`, or a
  parameter of type ``qpdf_json_stream_data_e``) as follows:

  - ``none``: stream data is not represented; no other keys are
    present

  - ``inline``: the stream data appears as a base64-encoded string as
    the value of the ``"data"`` key

  - ``file``: the stream data is written to a file, and the path to
    the file is stored in the ``"datafile"`` key. A relative path is
    interpreted as relative to the current directory when qpdf is
    invoked.

  Keys other than ``"dict"``, ``"data"``, and ``"datafile"`` are
  ignored. This is primarily for future compatibility in case a newer
  version of qpdf includes additional information.

  As with the native PDF representation, the stream data must be
  consistent with whatever filters and decode parameters are specified
  in the stream dictionary.

Top-level Non-stream Objects
  Non-stream objects are represented as a dictionary with the single
  key ``"value"``. Other keys are ignored for future compatibility.
  The value's structure is described in "Object Values" below.

  Note: in files that use object streams, the trailer "dictionary" is
  actually a stream, but in the JSON representation, the value of the
  ``"trailer"`` key is always written as a dictionary (with a
  ``"value"`` key like other non-stream objects). There will also be
  a stream object whose key is the object ID of the cross-reference
  stream, even though this stream will generally be unreferenced. This
  makes it possible to assume ``"trailer"`` points to a dictionary
  without having to consider whether the file uses object streams or
  not. It is also consistent with how ``QPDF::getTrailer`` behaves in
  the C++ API.

Object Values
  Within ``"value"`` or ``"stream"."dict"``, PDF objects are
  represented as follows:

  - Objects of type Boolean or null are represented as JSON objects of
    the same type.

  - Objects that are numeric are represented as numeric in the JSON
    without regard to precision. Internally, qpdf stores numeric
    values as strings, so qpdf will preserve arbitrary precision
    numerical values when reading and writing JSON. It is likely that
    other JSON readers and writers will have implementation-dependent
    ways of handling numerical values that are out of range.

  - Name objects are represented as JSON strings that start with ``/``
    and are followed by the PDF name in canonical form with all PDF
    syntax resolved. For example, the name whose canonical form (per
    the PDF specification) is ``text/plain`` would be represented in
    JSON as ``"/text/plain"`` and in PDF as ``/text#2fplain``.

  - Indirect object references are represented as JSON strings that
    look like a PDF indirect object reference and have the form
    ``"O G R"`` where ``O`` and ``G`` are the object and generation
    numbers and ``R`` is the literal string ``R``. For example,
    ``"3 0 R"`` would represent a reference to the object with object
    ID 3 and generation 0.

  - PDF strings are represented as JSON strings in one of two ways:

    - ``"u:utf8-encoded-string"``: this format is used when the PDF
      string can be unambiguously represented as a Unicode string and
      contains no unprintable characters. This is the case whether the
      input string is encoded as UTF-16, UTF-8 (as allowed by PDF
      2.0), or PDF doc encoding. Strings are only represented this way
      if they can be encoded without loss of information.

    - ``"b:hex-string"``: this format is used to represent any binary
      string value that can't be represented as a Unicode string.
      ``hex-string`` must have an even number of characters that range
      from ``a`` through ``f``, ``A`` through ``F``, or ``0`` through
      ``9``.

    qpdf writes empty strings as ``"u:"``, but both ``"b:"`` and
    ``"u:"`` are valid representations of the empty string.

    There is full support for UTF-16 surrogate pairs. Binary strings
    encoded with ``"b:..."`` are the internal PDF representations.
    As such, the following are equivalent:

    - ``"u:\ud83e\udd54"`` -- representation of U+1F954 as a surrogate
      pair in JSON syntax

    - ``"b:FEFFD83EDD54"`` -- representation of U+1F954 as the bytes
      of a UTF-16 string in PDF syntax with the leading ``FEFF``
      indicating UTF-16

    - ``"b:efbbbff09fa594"`` -- representation of U+1F954 as the
      bytes of a UTF-8 string in PDF syntax (as allowed by PDF 2.0)
      with the leading ``EF``, ``BB``, ``BF`` sequence (which is just
      the UTF-8 encoding of ``FEFF``).

    - A JSON string whose contents are ``u:`` followed by the UTF-8
      representation of U+1F954. This is the potato emoji.
      Unfortunately, I am not able to render it in the PDF version
      of this manual.

  - PDF arrays are represented as JSON arrays of objects as described
    above.

  - PDF dictionaries are represented as JSON objects whose keys are
    the string representations of names and whose values are
    representations of PDF objects.

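The string conventions above can be illustrated with a short Python
sketch. The ``classify`` helper is hypothetical (it is not a qpdf
API), and it assumes that every JSON string inside ``"value"`` or
``"dict"`` is a name, an indirect reference, a ``u:`` string, or a
``b:`` string, as described in this section.

.. code-block:: python

   import json
   import re

   _REF_RE = re.compile(r"^\d+ \d+ R$")


   def classify(value):
       """Classify one JSON value taken from "value" or "stream"."dict"."""
       if not isinstance(value, str):
           return ("other", value)  # number, boolean, null, array, dictionary
       if value.startswith("u:"):
           return ("string", value[2:])
       if value.startswith("b:"):
           return ("bytes", bytes.fromhex(value[2:]))
       if value.startswith("/"):
           return ("name", value[1:])
       if _REF_RE.match(value):
           return ("reference", value)
       raise ValueError(f"unexpected string in qpdf JSON: {value!r}")


   potato = "\U0001F954"

   # The surrogate pair in JSON syntax decodes to a single code point.
   assert classify(json.loads('"u:\\ud83e\\udd54"')) == ("string", potato)

   # The UTF-16 bytes with the leading FEFF marker decode to the same text.
   _, utf16 = classify("b:FEFFD83EDD54")
   assert utf16.decode("utf-16") == potato

   # So do the UTF-8 bytes with the leading EF BB BF marker.
   _, utf8 = classify("b:efbbbff09fa594")
   assert utf8.decode("utf-8-sig") == potato

   print(classify("/text/plain"))  # ('name', 'text/plain')
   print(classify("3 0 R"))        # ('reference', '3 0 R')
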
.. _json.output:

qpdf JSON Output
~~~~~~~~~~~~~~~~

The format of the JSON written by qpdf's :qpdf:ref:`--json-output`
flag or the ``QPDF::writeJSON`` API call is a JSON object consisting
of a single key: ``"qpdf"``. This may be the only key, or it may be
embedded in the output of ``qpdf --json``. Unknown keys are ignored
for future compatibility. It is guaranteed that qpdf will never add
any keys whose names start with ``xdata``, so users are free to add
their own metadata using keys whose names start with ``xdata`` without
fear of clashing with a future version of qpdf.

The ``"qpdf"`` key points to a two-element JSON array. The first element is
a JSON object with the following keys:

- ``"jsonversion"`` -- a number indicating the JSON version used for
  writing. This will always be ``2``.

- ``"pdfversion"`` -- a string containing the PDF version as indicated
  in the PDF header (e.g. ``"1.7"``, ``"2.0"``)

- ``"pushedinheritedpageresources"`` -- a boolean indicating whether
  the library pushed inherited resources down to the page level.
  Certain library calls cause this to happen, and qpdf needs to know
  when reading a JSON file back in whether it should do this as it may
  cause certain objects to be renumbered.

- ``"calledgetallpages"`` -- a boolean indicating whether
  ``getAllPages`` was called prior to writing the JSON output. This
  method causes page tree repair to occur, which may renumber some
  objects (in very rare cases of corrupted page trees), so qpdf needs
  to know this information when reading a JSON file back in.

- ``"maxobjectid"`` -- a number indicating the object ID of the
  highest numbered object in the file. This is provided to make it
  easier for software that wants to add new objects to the file as you
  can safely start with one above that number when creating new
  objects. Note that the value of ``"maxobjectid"`` may be higher than
  the actual maximum object that appears in the input PDF since it
  takes into consideration any dangling indirect object references
  from the original file. This prevents you from unwittingly creating
  an object that doesn't exist but that is referenced, which may have
  unintended side effects. (The PDF specification explicitly allows
  dangling references and says to treat them as nulls. This can happen
  if objects are removed from a PDF file.)

The second element is a JSON object containing the actual PDF objects
as described in :ref:`json.objects`.

Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``.
As such, none of the things ``QPDFWriter`` does apply. This includes
recompression of streams, renumbering of objects, anything to do with
object streams (which are not represented by qpdf JSON at all since
they are PDF syntax, not semantics), encryption, decryption,
linearization, QDF mode, etc. See :ref:`rewriting` for a more in-depth
discussion.

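The layout described in this section can be unpacked in a few lines
of Python. This is only a sketch: ``split_qpdf_json`` is a
hypothetical helper, the embedded JSON is a cut-down illustration
rather than real qpdf output, and ``xdata-note`` is an invented
example of a user-supplied ``xdata`` key.

.. code-block:: python

   import json

   text = """
   {
     "qpdf": [
       {
         "jsonversion": 2,
         "pdfversion": "1.3",
         "pushedinheritedpageresources": false,
         "calledgetallpages": false,
         "maxobjectid": 5
       },
       {
         "trailer": {"value": {"/Root": "1 0 R"}}
       }
     ],
     "xdata-note": "hypothetical user metadata"
   }
   """


   def split_qpdf_json(doc):
       """Return (header, objects) from the two-element "qpdf" array."""
       header, objects = doc["qpdf"]
       if header.get("jsonversion") != 2:
           raise ValueError("expected qpdf JSON version 2")
       return header, objects


   header, objects = split_qpdf_json(json.loads(text))
   print(header["pdfversion"])  # 1.3
   print(sorted(objects))       # ['trailer']
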
.. _json.example:

qpdf JSON Example
~~~~~~~~~~~~~~~~~

The JSON below shows an example of a simple PDF file represented in
qpdf JSON format.

.. code-block:: json

   {
     "qpdf": [
       {
         "jsonversion": 2,
         "pdfversion": "1.3",
         "pushedinheritedpageresources": false,
         "calledgetallpages": false,
         "maxobjectid": 5
       },
       {
         "obj:1 0 R": {
           "value": {
             "/Pages": "2 0 R",
             "/Type": "/Catalog"
           }
         },
         "obj:2 0 R": {
           "value": {
             "/Count": 1,
             "/Kids": [ "3 0 R" ],
             "/Type": "/Pages"
           }
         },
         "obj:3 0 R": {
           "value": {
             "/Contents": "4 0 R",
             "/MediaBox": [ 0, 0, 612, 792 ],
             "/Parent": "2 0 R",
             "/Resources": {
               "/Font": {
                 "/F1": "5 0 R"
               }
             },
             "/Type": "/Page"
           }
         },
         "obj:4 0 R": {
           "stream": {
             "data": "eJxzCuFSUNB3M1QwMlEISQOyzY2AyEAhJAXI1gjIL0ksyddUCMnicg3hAgDLAQnI",
             "dict": {
               "/Filter": "/FlateDecode"
             }
           }
         },
         "obj:5 0 R": {
           "value": {
             "/BaseFont": "/Helvetica",
             "/Encoding": "/WinAnsiEncoding",
             "/Subtype": "/Type1",
             "/Type": "/Font"
           }
         },
         "trailer": {
           "value": {
             "/ID": [
               "b:98b5a26966fba4d3a769b715b2558da6",
               "b:98b5a26966fba4d3a769b715b2558da6"
             ],
             "/Root": "1 0 R",
             "/Size": 6
           }
         }
       }
     ]
   }

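To show how such a file can be navigated, here is a small Python
sketch that starts at the trailer and lists the page objects. It
works on a trimmed copy of the ``"objects"`` dictionary from the
example above. The ``resolve`` and ``page_refs`` helpers are
illustrative only; they assume that every reference they follow
points to a non-stream object that exists.

.. code-block:: python

   import re

   _REF_RE = re.compile(r"^\d+ \d+ R$")

   # Trimmed copy of the second element of the "qpdf" array.
   objects = {
       "obj:1 0 R": {"value": {"/Pages": "2 0 R", "/Type": "/Catalog"}},
       "obj:2 0 R": {
           "value": {"/Count": 1, "/Kids": ["3 0 R"], "/Type": "/Pages"}
       },
       "obj:3 0 R": {"value": {"/Parent": "2 0 R", "/Type": "/Page"}},
       "trailer": {"value": {"/Root": "1 0 R", "/Size": 6}},
   }


   def resolve(value):
       """Follow one indirect reference; return direct values unchanged."""
       if isinstance(value, str) and _REF_RE.match(value):
           return objects["obj:" + value]["value"]
       return value


   def page_refs(node_ref):
       """Collect references to page objects below a page tree node."""
       node = resolve(node_ref)
       if node.get("/Type") == "/Pages":
           refs = []
           for kid in resolve(node["/Kids"]):
               refs.extend(page_refs(kid))
           return refs
       return [node_ref]


   root = resolve(objects["trailer"]["value"]["/Root"])
   print(page_refs(root["/Pages"]))  # ['3 0 R']
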
.. _json.input:

qpdf JSON Input
~~~~~~~~~~~~~~~

JSON in the output format described in :ref:`json.output` can
be used in two different ways:

- By using the :qpdf:ref:`--json-input` flag or calling
  ``QPDF::createFromJSON`` in place of ``QPDF::processFile``, a qpdf
  JSON file can be used in place of a PDF file as the input to qpdf.

- By using the :qpdf:ref:`--update-from-json` flag or calling
  ``QPDF::updateFromJSON`` on an initialized ``QPDF`` object, a qpdf
  JSON file can be used to apply changes to an existing ``QPDF``
  object. That ``QPDF`` object can have come from any source including
  a PDF file, a qpdf JSON file, or the result of any other process
  that results in a valid, initialized ``QPDF`` object.

Here are some important things to know about qpdf JSON input.

- When a qpdf JSON file is used as the primary input file, it must be
  complete. This means:

  - A PDF version number must be specified with the ``"pdfversion"``
    key.

  - Stream data must be present for all streams.

  - The trailer dictionary must be present, though only the
    ``"/Root"`` key is required.

- Certain fields from the input are ignored whether creating or
  updating from a JSON file:

  - ``"maxobjectid"`` is ignored, so it is not necessary to update it
    when adding new objects.

  - ``"/Length"`` is ignored in all stream dictionaries. qpdf doesn't
    put it there when it creates JSON output, and it is not necessary
    to add it.

  - ``"/Size"`` is ignored if it appears in a trailer dictionary as
    that is always recomputed by ``QPDFWriter``.

  - Unknown keys at the top level of the file, within ``"objects"``,
    at the top level of each individual object (inside the object that
    has the ``"value"`` or ``"stream"`` key) and directly within
    ``"stream"`` are ignored for future compatibility. You should
    avoid putting your own values in those places if you wish to avoid
    risking that your JSON files will not work in future versions of
    qpdf. The exception to this advice is at the top level of the
    overall file where it is explicitly supported for you to add your
    own keys. For example, you could add your own metadata at the top
    level, and qpdf will ignore it. Note that extra top-level keys are
    not preserved when qpdf reads your JSON file.

- When qpdf reads a PDF file, the internal object numbers are always
  preserved. However, when qpdf writes a file using ``QPDFWriter``,
  ``QPDFWriter`` does its own numbering and, in general, does not
  preserve input object numbers. That means that a qpdf JSON file that
  is used to update an existing PDF must have object numbers that
  match the input file it is modifying. In practical terms, this means
  that you can't use a JSON file created from one PDF file to modify
  the *output of running qpdf on that file*.

  To put this more concretely, the following is valid:

  ::

     qpdf --json-output in.pdf pdf.json
     # edit pdf.json
     qpdf in.pdf out.pdf --update-from-json=pdf.json

  The following will not produce predictable results because
  ``out.pdf`` won't have the same object numbers as ``pdf.json`` and
  ``in.pdf``.

  ::

     qpdf --json-output in.pdf pdf.json
     # edit pdf.json
     qpdf in.pdf out.pdf --update-from-json=pdf.json
     # edit pdf.json again
     # Don't do this
     qpdf out.pdf out2.pdf --update-from-json=pdf.json

- When updating from a JSON file (:qpdf:ref:`--update-from-json`,
  ``QPDF::updateFromJSON``), existing objects are updated in place.
  This has the following implications:

  - You may omit both ``"data"`` and ``"datafile"`` if the object you
    are updating is already a stream. In that case the original stream
    data is preserved. You must always provide a stream dictionary,
    but it may be empty. Note that an empty stream dictionary will
    clear the old dictionary. There is no way to indicate that an old
    stream dictionary should be left alone, so if your intention is to
    replace the stream data and preserve the dictionary, the
    original dictionary must appear in the JSON file.

  - You can change one object type to another object type including
    replacing a stream with a non-stream or a non-stream with a
    stream. If you replace a non-stream with a stream, you must
    provide data for the stream.

  - Objects that you do not wish to modify can be omitted from the
    JSON. That includes the trailer. That means you can use the output
    of a qpdf JSON file that was written using
    :qpdf:ref:`--json-object` to have it include only the objects you
    intend to modify.

  - You can omit the ``"pdfversion"`` key. The input PDF version will
    be preserved.

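As a sketch of what a partial update file might look like, the
following Python builds a JSON document that contains a single stream
object with inline data. The ``stream_update`` helper and the sample
content stream are invented for illustration, and the header is
assumed to have been copied from earlier :qpdf:ref:`--json-output`
output with ``"pdfversion"`` left out. Whether qpdf accepts a
particular hand-built file should be checked against qpdf itself.

.. code-block:: python

   import base64
   import json


   def stream_update(header, obj_num, gen, stream_dict, data):
       """Build a qpdf JSON document that replaces one stream object."""
       key = f"obj:{obj_num} {gen} R"
       stream = {
           # An empty dictionary clears the old one; "/Length" is not needed.
           "dict": stream_dict,
           # Inline stream data is base64-encoded.
           "data": base64.b64encode(data).decode("ascii"),
       }
       return {"qpdf": [header, {key: {"stream": stream}}]}


   # Header assumed to be copied from the original JSON output.
   header = {
       "jsonversion": 2,
       "pushedinheritedpageresources": False,
       "calledgetallpages": False,
       "maxobjectid": 5,
   }

   # No "/Filter" in the dictionary, so the data is given uncompressed.
   content = b"BT /F1 24 Tf 72 720 Td (Hello) Tj ET\n"
   update = stream_update(header, 4, 0, {}, content)
   print(json.dumps(update, indent=2))

Because the stream dictionary here is empty, the data must be
unfiltered so that it stays consistent with the dictionary.
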
.. _json.workflow-cli:

qpdf JSON Workflow: CLI
~~~~~~~~~~~~~~~~~~~~~~~

This section includes a few examples of using qpdf JSON.

- Convert a PDF file to JSON format, edit the JSON, and convert back
  to PDF. This is an alternative to using QDF mode (see :ref:`qdf`) to
  modify PDF files in a text editor. Each method has its own
  advantages and disadvantages.

  ::

     qpdf --json-output in.pdf pdf.json
     # edit pdf.json
     qpdf --json-input pdf.json out.pdf

- Extract only a specific object into a JSON file, modify the object
  in JSON, and use the modified object to update the original PDF. In
  this case, we're editing object 4, whatever that may happen to be.
  You would have to know through some other means which object you
  wanted to edit, such as by looking at other JSON output or using a
  tool (possibly but not necessarily qpdf) to identify the object.

  ::

     qpdf --json-output in.pdf pdf.json --json-object=4,0
     # edit pdf.json
     qpdf in.pdf --update-from-json=pdf.json out.pdf

  Rather than using :qpdf:ref:`--json-object` as in the above example,
  you could edit the JSON file to remove the objects you didn't need.
  You could also just leave them there, though the update process
  would be slower.

  You could also add new objects to a file by adding them to
  ``pdf.json``. Just be sure the object number doesn't conflict with
  an existing object. The ``"maxobjectid"`` field in the original
  output can help with this. You don't have to update it if you add
  objects as it is ignored when the file is read back in.

- Use :qpdf:ref:`--json-input` and :qpdf:ref:`--json-output` together
  to demonstrate preservation of object numbers. In this example,
  ``a.json`` and ``b.json`` will have the same objects and object
  numbers. The files may not be identical since strings may be
  normalized, fields may appear in a different order, etc. However,
  ``b.json`` and ``c.json`` are probably identical.

  ::

     qpdf --json-output in.pdf a.json
     qpdf --json-input --json-output a.json b.json
     qpdf --json-input --json-output b.json c.json

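The "edit pdf.json" step that adds a new object can be scripted. The
Python sketch below picks an unused object number from
``"maxobjectid"`` and from the keys already present, then adds a
non-stream object. The ``add_object`` helper and the sample document
are hypothetical; in practice the document would be loaded from
``pdf.json`` and written back before running
:qpdf:ref:`--update-from-json`.

.. code-block:: python

   def add_object(doc, value):
       """Add a new non-stream object; return its indirect reference."""
       header, objects = doc["qpdf"]
       # Object numbers already used as "obj:O G R" keys in this file.
       used = [int(k[4:].split()[0]) for k in objects if k.startswith("obj:")]
       # Start one above the highest number known to be in use.
       new_id = max([header.get("maxobjectid", 0)] + used) + 1
       objects[f"obj:{new_id} 0 R"] = {"value": value}
       return f"{new_id} 0 R"


   doc = {
       "qpdf": [
           {"jsonversion": 2, "maxobjectid": 5},
           {"obj:3 0 R": {"value": {"/Type": "/Page"}}},
       ]
   }

   ref = add_object(doc, {"/Title": "u:Added by script"})
   print(ref)  # 6 0 R

   # A second call does not reuse the number even though "maxobjectid"
   # in the header was left unchanged.
   print(add_object(doc, ["1 0 R", 42]))  # 7 0 R
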
.. _json.workflow-api:

qpdf JSON Workflow: API
~~~~~~~~~~~~~~~~~~~~~~~

Everything that can be done using the qpdf CLI can be done using the
C++ API. See comments in :file:`QPDF.hh` for ``writeJSON``,
``createFromJSON``, and ``updateFromJSON`` for details.

.. _json-guarantees:

JSON Compatibility Guarantees
-----------------------------

The qpdf JSON representation includes a JSON serialization of the raw
objects in the PDF file as well as some computed information in a more
easily extracted format. qpdf provides some guarantees about its JSON
format. These guarantees are designed to simplify the experience of a
developer working with the JSON format.

Compatibility
  The top-level JSON object is a dictionary (JSON "object"). The JSON
  output contains various nested dictionaries and arrays. With the
  exception of dictionaries that are populated by the fields of
  PDF objects from the file, all instances of a dictionary are
  guaranteed to have exactly the same keys.

  The top-level JSON structure contains a ``"version"`` key whose
  value is a simple integer. The value of the ``"version"`` key will be
  incremented if a non-compatible change is made. A non-compatible
  change would be any change that involves removal of a key, a change
  to the format of data pointed to by a key, or a semantic change
  that requires a different interpretation of a previously existing
  key.

  With a specific qpdf JSON version, future versions of qpdf are free
  to add additional keys but not to remove keys or change the type of
  object that a key points to.

Documentation
  The :command:`qpdf` command can be invoked with the
  :qpdf:ref:`--json-help` option. This will output a JSON
  structure that has the same structure as the JSON output that qpdf
  generates, except that each field in the help output is a description
  of the corresponding field in the JSON output. The specific
  guarantees are as follows:

  - A dictionary in the help output means that the corresponding
    location in the actual JSON output is also a dictionary with
    exactly the same keys; that is, no keys present in help are
    absent in the real output, and no keys will be present in the
    real output that are not in help. It is possible for a key to be
    present and have a value that is explicitly ``null``. As a
    special case, if the dictionary has a single key whose name
    starts with ``<`` and ends with ``>``, it means that the JSON
    output is a dictionary that can have any value as a key. This is
    used for cases in which the keys of the dictionary are things
    like object IDs.

  - A string in the help output is a description of the item that
    appears in the corresponding location of the actual output. The
    corresponding output can have any value including ``null``.

  - A single-element array in the help output indicates that the
    corresponding location in the actual output is either a single
    item or is an array of any length. The single item or each
    element of the array has whatever format is implied by the single
    element of the help output's array.

  - A multi-element array in the help output indicates that the
    corresponding location in the actual output is an array of the
    same length. Each element of the output array has whatever format
    is implied by the corresponding element of the help output's
    array.

  For example, the help output includes a ``"pagelabels"``
  key whose value is an array of one element. That element is a
  dictionary with keys ``"index"`` and ``"label"``. In addition to
  describing the meaning of those keys, this tells you that the actual
  JSON output will contain a ``"pagelabels"`` array, each of whose
  elements is a dictionary that contains an ``"index"`` key, a
  ``"label"`` key, and no other keys.

Directness and Simplicity
  The JSON output contains the value of every object in the file, but
  it also contains some summary data. This is analogous to how qpdf's
  library interface works. The summary data is similar to the helper
  functions in that it allows you to look at certain aspects of the
  PDF file without having to understand all the nuances of the PDF
  specification, while the raw objects allow you to mine the PDF for
  anything that the higher-level interfaces are lacking.

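The four rules listed under "Documentation" above can be written down
as a recursive check. The Python below is a rough sketch, not
something qpdf provides: ``conforms`` is a hypothetical function, the
help and output fragments are invented, and accepting ``null`` at any
location is a deliberately lenient assumption.

.. code-block:: python

   def conforms(help_node, out_node):
       """Check an output value against the matching help value."""
       if out_node is None:
           return True  # assumption: null is tolerated anywhere
       if isinstance(help_node, str):
           return True  # a description; the output may be any value
       if isinstance(help_node, dict):
           if not isinstance(out_node, dict):
               return False
           keys = list(help_node)
           if len(keys) == 1 and keys[0].startswith("<") and keys[0].endswith(">"):
               # Placeholder key: any key is allowed, values share one format.
               return all(conforms(help_node[keys[0]], v) for v in out_node.values())
           if set(out_node) != set(keys):
               return False  # keys must match exactly
           return all(conforms(help_node[k], out_node[k]) for k in keys)
       if isinstance(help_node, list):
           if len(help_node) == 1:
               # A single item or an array of any length.
               items = out_node if isinstance(out_node, list) else [out_node]
               return all(conforms(help_node[0], item) for item in items)
           return (
               isinstance(out_node, list)
               and len(out_node) == len(help_node)
               and all(conforms(h, o) for h, o in zip(help_node, out_node))
           )
       return False


   help_fragment = {
       "pagelabels": [{"index": "description", "label": "description"}]
   }
   good = {"pagelabels": [{"index": 0, "label": {"/S": "/D"}}]}
   bad = {"pagelabels": [{"index": 0, "label": None, "extra": 1}]}

   print(conforms(help_fragment, good))  # True
   print(conforms(help_fragment, bad))   # False

This checks complete output only; it does not account for keys left
out because of :qpdf:ref:`--json-key`.
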
.. _json.considerations:

JSON: Special Considerations
----------------------------

For the most part, the built-in JSON help tells you everything you need
to know about the JSON format, but there are a few non-obvious things to
be aware of:

- If a PDF file has certain types of errors in its pages tree (such as
  page objects that are direct or multiple pages sharing the same
  object ID), qpdf will automatically repair the pages tree. If you
  specify ``"objects"`` (and, with qpdf JSON version 1, also
  ``"objectinfo"``) without any other keys, you will see the original
  pages tree without any corrections. If you specify any of the keys
  that require page tree traversal (for example, ``"pages"``,
  ``"outlines"``, or ``"pagelabels"``), then ``"objects"`` (and
  ``"objectinfo"``) will show the repaired page tree so that object
  references will be consistent throughout the file. This is not an
  issue with :qpdf:ref:`--json-output`, which doesn't repair the pages
  tree.

- While qpdf guarantees that keys present in the help will be present
  in the output, those fields may be null or empty if the information
  is not known or absent in the file. Also, if you specify
  :qpdf:ref:`--json-key`, the keys that are not listed
  will be excluded entirely except for those that
  :qpdf:ref:`--json-help` says are always present.

- In a few places, there are keys with names containing
  ``pageposfrom1``. The values of these keys are null or an integer. If
  an integer, they point to a page index within the file numbering from
  1. Note that JSON indexes from 0, and you would also use 0-based
  indexing using the API. However, 1-based indexing is easier in this
  case because the command-line syntax for specifying page ranges is
  1-based. If you were going to write a program that looked through
  the JSON for information about specific pages and then use the
  command-line to extract those pages, 1-based indexing is easier.
  Besides, it's more convenient to subtract 1 in a real programming
  language than it is to add 1 in shell code.

- The image information included in the ``"pages"`` section of the JSON
  output includes the key ``"filterable"``. Note that the value of
  this field may depend on the :qpdf:ref:`--decode-level` that you
  invoke qpdf with. The JSON output includes a top-level key
  ``"parameters"`` that indicates the decode level that was used for
  computing whether a stream was filterable. For example, JPEG images
  will be shown as not filterable by default, but they will be shown
  as filterable if you run :command:`qpdf --json
  --decode-level=all`.

- The ``"encrypt"`` key's values will be populated for non-encrypted
  files. Some values will be null, and others will have values that
  apply to unencrypted files.

- The qpdf library itself never loads an entire PDF into memory. This
  remains true for PDF files represented in JSON format. In general,
  qpdf will hold the entire object structure in memory once a file has
  been fully read (objects are loaded into memory lazily but stay
  there once loaded), but it will never have more than two copies of a
  stream in memory at once. That said, if you ask qpdf to write JSON
  to memory, it will do so, so be careful about this if you are
  working with very large PDF files. There is nothing in the qpdf
  library itself that prevents working with PDF files much larger than
  available system memory. qpdf can both read and write such files in
  JSON format. If you need to work with a PDF file's JSON
  representation in memory, it is recommended that you use either
  ``none`` or ``file`` as the argument to
  :qpdf:ref:`--json-stream-data`, or if using the API, use
  ``qpdf_sj_none`` or ``qpdf_sj_file`` as the json stream data value.
  If using ``none``, you can use other means to obtain the stream
  data.

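To illustrate why 1-based ``pageposfrom1`` values are convenient, the
Python sketch below turns a list of such values into a
comma-separated page range string of the kind used on the qpdf
command line, and also shows the conversion to 0-based indexes. The
``page_range`` helper is hypothetical, and the input list stands in
for values gathered from real JSON output.

.. code-block:: python

   def page_range(pages_from_1):
       """Collapse 1-based page numbers into a string such as "1-3,7"."""
       numbers = sorted({p for p in pages_from_1 if p is not None})
       parts = []
       start = prev = None
       for n in numbers:
           if start is None:
               start = prev = n
           elif n == prev + 1:
               prev = n  # extend the current run
           else:
               parts.append(str(start) if start == prev else f"{start}-{prev}")
               start = prev = n
       if start is not None:
           parts.append(str(start) if start == prev else f"{start}-{prev}")
       return ",".join(parts)


   # Values as they might be collected from "pageposfrom1" keys; null
   # entries (None in Python) are skipped.
   found = [3, 1, 2, None, 7]

   print(page_range(found))                        # 1-3,7
   print([p - 1 for p in found if p is not None])  # [2, 0, 1, 6]
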
.. _json-v2-changes:

Changes from JSON v1 to v2
--------------------------

The following changes were made to qpdf's JSON output format for
version 2.

- The representation of objects has changed. For details, see
  :ref:`json.objects`.

  - The representation of strings is now unambiguous for all strings.
    Strings are prefixed with either ``u:`` for Unicode strings or
    ``b:`` for byte strings.

  - Names are shown in qpdf's canonical form rather than in PDF
    syntax. (Example: the PDF-syntax name ``/text#2fplain`` appeared
    as ``"/text#2fplain"`` in v1 but appears as ``"/text/plain"`` in
    v2.)

  - The top-level representation of an object in ``"objects"`` is a
    dictionary containing either a ``"value"`` key or a ``"stream"``
    key, making it possible to distinguish streams from other objects.

- The ``"objectinfo"`` key has been removed in favor of a
  representation in ``"objects"`` that differentiates between a stream
  and other kinds of objects. In v1, it was not possible to tell a
  stream from a dictionary within ``"objects"``.

- Within the ``"objects"`` dictionary, keys are now ``"obj:O G R"``
  where ``O`` and ``G`` are the object and generation numbers.
  ``"trailer"`` remains the key for the trailer dictionary. In v1, the
  ``obj:`` prefix was not present. The rationale for this change is as
  follows:

  - Having a unique prefix (``obj:``) makes it much easier to search
    in the JSON file for the definition of an object.

  - Having the key still contain ``O G R`` makes it much easier to
    construct the key from an indirect reference. You just have to
    prepend ``obj:``. There is no need to parse the indirect object
    reference.

- In the ``"encrypt"`` object, the ``"modifyannotations"`` key was
  misspelled as ``"moddifyannotations"`` in v1. This has been
  corrected.

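Two of the changes above are mechanical enough to sketch in Python:
prepending ``obj:`` to a v1 object key, and resolving ``#xx`` escapes
in a PDF-syntax name to get the form shown in v2. Both helpers are
hypothetical and are not a converter between the two formats. The
name helper assumes the escaped bytes are plain ASCII.

.. code-block:: python

   import re


   def v1_key_to_v2(key):
       """Map a v1 "objects" key such as "3 0 R" to its v2 form."""
       return key if key == "trailer" else "obj:" + key


   def canonical_name(pdf_syntax_name):
       """Resolve "#xx" hex escapes, e.g. "/text#2fplain" -> "/text/plain"."""
       return re.sub(
           r"#([0-9A-Fa-f]{2})",
           lambda m: chr(int(m.group(1), 16)),
           pdf_syntax_name,
       )


   print(v1_key_to_v2("3 0 R"))            # obj:3 0 R
   print(v1_key_to_v2("trailer"))          # trailer
   print(canonical_name("/text#2fplain"))  # /text/plain
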
Motivation for qpdf JSON version 2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

qpdf JSON version 2 was created to make it possible to manipulate PDF
files using JSON syntax instead of native PDF syntax. This makes it
possible to make low-level updates to PDF files from just about any
programming language or even to do so from the command-line using
tools like ``jq`` or any editor that's capable of working with JSON
files. There were several limitations of JSON format version 1 that
made this impossible:

- Strings, names, and indirect object references in the original PDF
  file were all converted to strings in the JSON representation. For
  casual human inspection, this was fine, but in the general case,
  there was no way to tell the difference between a string that looked
  like a name or indirect object reference and an actual name or
  indirect object reference.

- PDF strings were not unambiguously represented in the JSON format.
  The way qpdf JSON v1 represented a string was to try to convert the
  string to UTF-8. This was done by assuming a string that was not
  explicitly marked as Unicode was encoded in PDF doc encoding. The
  problem is that there is not a perfect bidirectional mapping between
  Unicode and PDF doc encoding, so if a binary string happened to
  contain characters that couldn't be bidirectionally mapped, there
  would be no way to get back to the original PDF string. Even when
  possible, trying to map from the JSON representation of a binary
  string back to the original string required knowledge of the mapping
  between PDF doc encoding and Unicode.

- There was no representation of stream data. If you wanted to extract
  stream data, you could use :qpdf:ref:`--show-object`, so this wasn't
  that important for inspection, but it was a blocker for being able
  to go from JSON back to PDF. qpdf JSON version 2 allows stream data
  to be included inline as base64-encoded data. There is also an
  option to write all stream data to external files, which makes it
  possible to work with very large PDF files in JSON format even with
  tools that try to read the entire JSON structure into memory.

- The PDF version from the PDF header was not represented in qpdf JSON
  v1.