Split documentation into multiple pages, change theme

This commit is contained in:
Jay Berkenbilt 2021-12-18 09:01:52 -05:00
parent f3d1138b8a
commit 10fb619d3e
16 changed files with 6263 additions and 6261 deletions

2
TODO

@ -30,8 +30,6 @@ Before release:
I can do about, and it doesn't seem worth fixing. Maybe mention it
somewhere?
* README-maintainer: Fix installation of documentation to website
* Get navigation working properly
* Figure out where to put :ref:`search` so we get doc search
Soon:


@ -0,0 +1,14 @@
.. _acknowledgments:
Acknowledgment
==============
QPDF was originally created in 2001 and modified periodically between
2001 and 2005 during my employment at `Apex CoVantage
<http://www.apexcovantage.com>`__. Upon my departure from Apex, the
company graciously allowed me to take ownership of the software and
continue maintaining it as an open source project, a decision for which I
am very grateful. I have made considerable enhancements to it since
that time. I feel fortunate to have worked for people who would make
such a decision. This work would not have been possible without their
support.

1675
manual/cli.rst Normal file

File diff suppressed because it is too large


@ -11,4 +11,7 @@ project = 'QPDF'
copyright = '2005-2021, Jay Berkenbilt'
author = 'Jay Berkenbilt'
release = '10.4.0'
html_theme = 'alabaster'
html_theme = 'agogo'
html_theme_options = {
"body_max_width": None,
}

747
manual/design.rst Normal file

@ -0,0 +1,747 @@
.. _ref.design:
Design and Library Notes
========================
.. _ref.design.intro:
Introduction
------------
This section was written prior to the implementation of the qpdf package
and was subsequently modified to reflect the implementation. In some
cases, for purposes of explanation, it may differ slightly from the
actual implementation. As always, the source code and test suite are
authoritative. Even if there are some errors, this document should serve
as a road map to understanding how this code works.
In general, one should adhere strictly to a specification when writing
but be liberal in reading. This way, the product of our software will be
accepted by the widest range of other programs, and we will accept the
widest range of input files. This library attempts to conform to that
philosophy whenever possible but also aims to provide strict checking
for people who want to validate PDF files. If you don't want to see
warnings and are trying to write something that is tolerant, you can
call ``setSuppressWarnings(true)``. If you want to fail on the first
error, you can call ``setAttemptRecovery(false)``. The default behavior
is to generate warnings for recoverable problems. Note that recovery
will not always produce the desired results even if it is able to get
through the file. Unlike most other PDF tools, which produce generic
warnings such as "This file is damaged," qpdf generally issues a
detailed error message that would be most useful to a PDF developer.
This is by design as there seems to be a shortage of PDF validation
tools out there. This was, in fact, one of the major motivations behind
the initial creation of qpdf.
.. _ref.design-goals:
Design Goals
------------
The QPDF package includes support for reading and rewriting PDF files.
It aims to hide from the user details involving object locations,
modified (appended) PDF files, the directness/indirectness of objects,
and stream filters including encryption. It does not aim to hide
knowledge of the object hierarchy or content stream contents. Put
another way, a user of the qpdf library is expected to have knowledge
about how PDF files work, but is not expected to have to keep track of
bookkeeping details such as file positions.
A user of the library never has to care whether an object is direct or
indirect, though it is possible to determine whether an object is direct
or not if this information is needed. All access to objects deals with
this transparently. All memory management details are also handled by
the library.
The ``PointerHolder`` object is used internally by the library to deal
with memory management. This is basically a smart pointer object very
similar in spirit to C++11's ``std::shared_ptr`` object, but predating
it by several years. This library also makes use of a technique for
giving fine-grained access to methods in one class to other classes by
using public subclasses with friends and only private members that in
turn call private methods of the containing class. See
``QPDFObjectHandle::Factory`` as an example.
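The fine-grained access technique described above can be illustrated with a small standalone sketch. The names ``Vault``, ``Opener``, and ``Auditor`` are hypothetical and chosen only for illustration; this is not code from qpdf, just one way the pattern might look.

```cpp
#include <string>
#include <utility>

class Auditor; // the only class that will be granted access

class Vault
{
  public:
    // Public nested class with only private members. Because Auditor is
    // its friend, Auditor alone can reach Vault::secret() through it.
    class Opener
    {
        friend class ::Auditor;

      private:
        static std::string
        open(Vault const& v)
        {
            // A nested class may call private methods of its enclosing class.
            return v.secret();
        }
    };

    explicit Vault(std::string contents) :
        contents_(std::move(contents))
    {
    }

  private:
    std::string
    secret() const
    {
        return contents_;
    }

    std::string contents_;
};

class Auditor
{
  public:
    static std::string
    inspect(Vault const& v)
    {
        // Allowed: Auditor is a friend of Vault::Opener.
        return Vault::Opener::open(v);
    }
};
```

Any other class that tries to call ``Vault::Opener::open`` fails to compile, so access is granted to exactly one method for exactly one caller rather than to the whole private interface.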
The top-level qpdf class is ``QPDF``. A ``QPDF`` object represents a PDF
file. The library provides methods for both accessing and mutating PDF
files.
The primary class for interacting with PDF objects is
``QPDFObjectHandle``. Instances of this class can be passed around by
value, copied, stored in containers, etc. with very low overhead.
Instances of ``QPDFObjectHandle`` created by reading from a file will
always contain a reference back to the ``QPDF`` object from which they
were created. A ``QPDFObjectHandle`` may be direct or indirect. If
indirect, the ``QPDFObject`` the ``PointerHolder`` initially points to
is a null pointer. In this case, the first attempt to access the
underlying ``QPDFObject`` will result in the ``QPDFObject`` being
resolved via a call to the referenced ``QPDF`` instance. This makes it
essentially impossible to make coding errors in which certain things
will work for some PDF files and not for others based on which objects
are direct and which objects are indirect.
Instances of ``QPDFObjectHandle`` can be directly created and modified
using static factory methods in the ``QPDFObjectHandle`` class. There
are factory methods for each type of object as well as a convenience
method ``QPDFObjectHandle::parse`` that creates an object from a string
representation of the object. Existing instances of ``QPDFObjectHandle``
can also be modified in several ways. See comments in
:file:`QPDFObjectHandle.hh` for details.
An instance of ``QPDF`` is constructed by using the class's default
constructor. If desired, the ``QPDF`` object may be configured with
various methods that change its default behavior. Then the
``QPDF::processFile()`` method is passed the name of a PDF file, which
permanently associates the file with that QPDF object. A password may
also be given for access to password-protected files. QPDF does not
enforce encryption parameters and will treat user and owner passwords
equivalently. Either password may be used to access an encrypted file.
``QPDF`` will allow recovery of a user password given an owner password.
The input PDF file must be seekable. (Output files written by
``QPDFWriter`` need not be seekable, even when creating linearized
files.) During construction, ``QPDF`` validates the PDF file's header,
and then reads the cross reference tables and trailer dictionaries. The
``QPDF`` class keeps only the first trailer dictionary though it does
read all of them so it can check the ``/Prev`` key. ``QPDF`` class users
may request the root object and the trailer dictionary specifically. The
cross reference table is kept private. Objects may then be requested by
number or by walking the object tree.
When a PDF file has a cross-reference stream instead of a
cross-reference table and trailer, requesting the document's trailer
dictionary returns the stream dictionary from the cross-reference stream
instead.
There are some convenience routines for very common operations such as
walking the page tree and returning a vector of all page objects. For
full details, please see the header files
:file:`QPDF.hh` and
:file:`QPDFObjectHandle.hh`. There are also some
additional helper classes that provide higher level API functions for
certain document constructions. These are discussed in :ref:`ref.helper-classes`.
.. _ref.helper-classes:
Helper Classes
--------------
QPDF version 8.1 introduced the concept of helper classes. Helper
classes are intended to contain higher level APIs that allow developers
to work with certain document constructs at an abstraction level above
that of ``QPDFObjectHandle`` while staying true to qpdf's philosophy of
not hiding document structure from the developer. As with qpdf in
general, the goal is to take away some of the more tedious bookkeeping
aspects of working with PDF files, not to remove the need for the
developer to understand how the PDF construction in question works. The
driving factor behind the creation of helper classes was to allow the
evolution of higher level interfaces in qpdf without polluting the
interfaces of the main top-level classes ``QPDF`` and
``QPDFObjectHandle``.
There are two kinds of helper classes: *document* helpers and *object*
helpers. Document helpers are constructed with a reference to a ``QPDF``
object and provide methods for working with structures that are at the
document level. Object helpers are constructed with an instance of a
``QPDFObjectHandle`` and provide methods for working with specific types
of objects.
Examples of document helpers include ``QPDFPageDocumentHelper``, which
contains methods for operating on the document's page trees, such as
enumerating all pages of a document and adding and removing pages; and
``QPDFAcroFormDocumentHelper``, which contains document-level methods
related to interactive forms, such as enumerating form fields and
creating mappings between form fields and annotations.
Examples of object helpers include ``QPDFPageObjectHelper`` for
performing operations on pages such as page rotation and some operations
on content streams, ``QPDFFormFieldObjectHelper`` for performing
operations related to interactive form fields, and
``QPDFAnnotationObjectHelper`` for working with annotations.
It is always possible to retrieve the underlying ``QPDF`` reference from
a document helper and the underlying ``QPDFObjectHandle`` reference from
an object helper. Helpers are designed to be helpers, not wrappers. The
intention is that, in general, it is safe to freely intermix operations
that use helpers with operations that use the underlying objects.
Document and object helpers do not attempt to provide a complete
interface for working with the things they are helping with, nor do they
attempt to encapsulate underlying structures. They just provide a few
methods to help with error-prone, repetitive, or complex tasks. In some
cases, a helper object may cache some information that is expensive to
gather. In such cases, the helper classes are implemented so that their
own methods keep the cache consistent, and the header file will provide
a method to invalidate the cache and a description of what kinds of
operations would make the cache invalid. If in doubt, you can always
discard a helper class and create a new one with the same underlying
objects, which will ensure that you have discarded any stale
information.
By convention, document helpers are called
``QPDFSomethingDocumentHelper`` and are derived from
``QPDFDocumentHelper``, and object helpers are called
``QPDFSomethingObjectHelper`` and are derived from ``QPDFObjectHelper``.
For details on specific helpers, please see their header files. You can
find them by looking at
:file:`include/qpdf/QPDF*DocumentHelper.hh` and
:file:`include/qpdf/QPDF*ObjectHelper.hh`.
In order to avoid creation of circular dependencies, the following
general guidelines are followed with helper classes:
- Core class interfaces do not know about helper classes. For example,
no methods of ``QPDF`` or ``QPDFObjectHandle`` will include helper
classes in their interfaces.
- Interfaces of object helpers will usually not use document helpers in
their interfaces. This is because it is much more useful for document
helpers to have methods that return object helpers. Most operations
in PDF files start at the document level and go from there to the
object level rather than the other way around. It can sometimes be
useful to map back from object-level structures to document-level
structures. If there is a desire to do this, it will generally be
provided by a method in the document helper class.
- Most of the time, object helpers don't know about other object
helpers. However, in some cases, one type of object may be a
container for another type of object, in which case it may make sense
for the outer object to know about the inner object. For example,
there are methods in the ``QPDFPageObjectHelper`` that know
``QPDFAnnotationObjectHelper`` because references to annotations are
contained in page dictionaries.
- Any helper or core library class may use helpers in their
implementations.
Prior to qpdf version 8.1, higher level interfaces were added as
"convenience functions" in either ``QPDF`` or ``QPDFObjectHandle``. For
compatibility, older convenience functions for operating with pages will
remain in those classes even as alternatives are provided in helper
classes. Going forward, new higher level interfaces will be provided
using helper classes.
.. _ref.implementation-notes:
Implementation Notes
--------------------
This section contains a few notes about QPDF's internal implementation,
particularly around what it does when it first processes a file. This
section is a bit of a simplification of what it actually does, but it
could serve as a starting point to someone trying to understand the
implementation. There is nothing in this section that you need to know
to use the qpdf library.
``QPDFObject`` is the basic PDF Object class. It is an abstract base
class from which are derived classes for each type of PDF object.
Clients do not interact with Objects directly but instead interact with
``QPDFObjectHandle``.
When the ``QPDF`` class creates a new object, it dynamically allocates
the appropriate type of ``QPDFObject`` and immediately hands the pointer
to an instance of ``QPDFObjectHandle``. The parser reads a token from
the current file position. If the token is neither a dictionary nor an
array opener, an object is immediately constructed from the single token
and the parser returns. Otherwise, the parser iterates in a special mode
in which it accumulates objects until it finds a balancing closer.
During this process, the "``R``" keyword is recognized and an indirect
``QPDFObjectHandle`` may be constructed.
The ``QPDF::resolve()`` method, which is used to resolve an indirect
object, may be invoked from the ``QPDFObjectHandle`` class. It first
checks a cache to see whether this object has already been read. If not,
it reads the object from the PDF file and caches it. It then returns the
resulting ``QPDFObjectHandle``. The calling object handle then replaces
its ``PointerHolder<QPDFObject>`` with the one from the newly returned
``QPDFObjectHandle``. In this way, only a single copy of any direct
object need exist and clients can access objects transparently without
knowing or caring whether they are direct or indirect objects.
Additionally, no object is ever read from the file more than once. That
means that only the portions of the PDF file that are actually needed
are ever read from the input file, thus allowing the qpdf package to
take advantage of this important design goal of PDF files.
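The resolve-and-cache behavior described above can be sketched with a toy model. Everything here (``Document``, ``Handle``, ``Object``, the ``reads()`` counter, and the use of ``std::shared_ptr`` in place of ``PointerHolder``) is an assumption made for illustration and does not reflect qpdf's actual classes.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Toy stand-in for a parsed PDF object.
struct Object
{
    std::string value;
};

// Object ID and generation number.
using ObjGen = std::pair<int, int>;

class Document
{
  public:
    // "file" plays the role of the on-disk objects, keyed by ID/generation.
    explicit Document(std::map<ObjGen, std::string> file) :
        file_(std::move(file))
    {
    }

    std::shared_ptr<Object>
    resolve(ObjGen og)
    {
        auto cached = cache_.find(og);
        if (cached != cache_.end()) {
            return cached->second; // already read: no file access
        }
        ++reads_; // simulate reading from the input file
        auto found = file_.find(og);
        auto obj = std::make_shared<Object>();
        // A reference to a nonexistent object resolves to null.
        obj->value = (found == file_.end()) ? "null" : found->second;
        cache_[og] = obj;
        return obj;
    }

    int
    reads() const
    {
        return reads_;
    }

  private:
    std::map<ObjGen, std::string> file_;
    std::map<ObjGen, std::shared_ptr<Object>> cache_;
    int reads_ = 0;
};

class Handle
{
  public:
    Handle(Document& doc, ObjGen og) :
        doc_(&doc),
        og_(og)
    {
    }

    bool
    resolved() const
    {
        return obj_ != nullptr;
    }

    std::string const&
    value()
    {
        if (!obj_) {
            // First access: ask the owning document to resolve the object.
            obj_ = doc_->resolve(og_);
        }
        return obj_->value;
    }

  private:
    Document* doc_;
    ObjGen og_;
    std::shared_ptr<Object> obj_; // null until first access
};
```

Two handles that refer to the same object share a single cached ``Object``, and the simulated file is read at most once per object no matter how many handles access it.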
If the requested object is inside of an object stream, the object stream
itself is first read into memory. Then the tokenizer reads objects from
the memory stream based on the offset information stored in the stream.
Those individual objects are cached, after which the temporary buffer
holding the object stream contents is discarded. In this way, the first
time an object in an object stream is requested, all objects in the
stream are cached.
The following example should clarify how ``QPDF`` processes a simple
file.
- Client constructs ``QPDF`` ``pdf`` and calls
``pdf.processFile("a.pdf");``.
- The ``QPDF`` class checks the beginning of
:file:`a.pdf` for a PDF header. It then reads the
cross reference table mentioned at the end of the file, ensuring that
it is looking before the last ``%%EOF``. After getting to the ``trailer``
keyword, it invokes the parser.
- The parser sees "``<<``", so it calls itself recursively in
dictionary creation mode.
- In dictionary creation mode, the parser keeps accumulating objects
until it encounters "``>>``". Each object that is read is pushed onto
a stack. If "``R``" is read, the last two objects on the stack are
inspected. If they are integers, they are popped off the stack and
their values are used to construct an indirect object handle which is
then pushed onto the stack. When "``>>``" is finally read, the stack
is converted into a ``QPDF_Dictionary`` which is placed in a
``QPDFObjectHandle`` and returned.
- The resulting dictionary is saved as the trailer dictionary.
- The ``/Prev`` key is searched. If present, ``QPDF`` seeks to that
point and repeats except that the new trailer dictionary is not
saved. If ``/Prev`` is not present, the initial parsing process is
complete.
If there is an encryption dictionary, the document's encryption
parameters are initialized.
- The client requests the root object. The ``QPDF`` class gets the value of
the root key from the trailer dictionary and returns it. It is an unresolved
indirect ``QPDFObjectHandle``.
- The client requests the ``/Pages`` key from root
``QPDFObjectHandle``. The ``QPDFObjectHandle`` notices that it is
indirect so it asks ``QPDF`` to resolve it. ``QPDF`` looks in the
object cache for an object with the root dictionary's object ID and
generation number. Upon not seeing it, it checks the cross reference
table, gets the offset, and reads the object present at that offset.
It stores the result in the object cache and returns the cached
result. The calling ``QPDFObjectHandle`` replaces its object pointer
with the one from the resolved ``QPDFObjectHandle``, verifies that it
is a valid dictionary object, and returns the (unresolved indirect)
``QPDFObjectHandle`` to the top of the Pages hierarchy.
As the client continues to request objects, the same process is
followed for each new requested object.
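The stack handling of the "``R``" keyword in the walkthrough above can be sketched as follows. This is a simplified illustration that works on pre-split tokens; ``Item``, ``is_integer``, and ``accumulate`` are hypothetical names, and the real parser handles nesting, generation limits, and many token types that this sketch ignores.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct Item
{
    enum Kind { integer, other, indirect };
    Kind kind = other;
    std::string text; // token text for integer/other items
    int obj = 0;      // object number for indirect items
    int gen = 0;      // generation number for indirect items
};

// True if the token consists only of decimal digits.
static bool
is_integer(std::string const& token)
{
    if (token.empty()) {
        return false;
    }
    for (char c : token) {
        if (!std::isdigit(static_cast<unsigned char>(c))) {
            return false;
        }
    }
    return true;
}

// Accumulate items until ">>" is seen, collapsing "<int> <int> R" into a
// single indirect reference as the tokens arrive.
std::vector<Item>
accumulate(std::vector<std::string> const& tokens)
{
    std::vector<Item> stack;
    for (auto const& token : tokens) {
        if (token == ">>") {
            break;
        }
        std::size_t n = stack.size();
        if (token == "R" && n >= 2 &&
            stack[n - 2].kind == Item::integer &&
            stack[n - 1].kind == Item::integer) {
            Item ref;
            ref.kind = Item::indirect;
            ref.obj = std::stoi(stack[n - 2].text);
            ref.gen = std::stoi(stack[n - 1].text);
            stack.pop_back();
            stack.pop_back();
            stack.push_back(ref);
        } else {
            Item item;
            item.kind = is_integer(token) ? Item::integer : Item::other;
            item.text = token;
            stack.push_back(item);
        }
    }
    return stack;
}
```

For the token sequence ``/Root 1 0 R /Size 7 >>``, the two integers preceding ``R`` are replaced by one indirect item, while the trailing ``7`` stays an ordinary integer. A caller would then pair up keys and values to build the dictionary.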
.. _ref.casting:
Casting Policy
--------------
This section describes the casting policy followed by qpdf's
implementation. This is of no concern to qpdf's end users and largely of no
concern to people writing code that uses qpdf, but it could be of
interest to people who are porting qpdf to a new platform or who are
making modifications to the code.
The C++ code in qpdf is free of old-style casts except where unavoidable
(e.g. where the old-style cast is in a macro provided by a third-party
header file). When there is a need for a cast, it is handled, in order
of preference, by rewriting the code to avoid the need for a cast,
calling ``const_cast``, calling ``static_cast``, calling
``reinterpret_cast``, or calling some combination of the above. As a
last resort, a compiler-specific ``#pragma`` may be used to suppress a
warning that we don't want to fix. Examples may include suppressing
warnings about the use of old-style casts in code that is shared between
C and C++ code.
The ``QIntC`` namespace, provided by
:file:`include/qpdf/QIntC.hh`, implements safe
functions for converting between integer types. These functions do range
checking and throw a ``std::range_error``, which is a subclass of
``std::runtime_error``, if conversion from one integer type to another
results in loss of information. There are many cases in which we have to
move between different integer types because of incompatible integer
types used in interoperable interfaces. Some are unavoidable, such as
moving between sizes and offsets, and others are there because of old
code that is too entrenched to be fixable without breaking source
compatibility and causing pain for users. QPDF is compiled with extra
warnings to detect conversions with potential data loss, and all such
cases should be fixed by either using a function from ``QIntC`` or a
``static_cast``.
When the intention is just to switch the type because of exchanging data
between incompatible interfaces, use ``QIntC``. This is the usual case.
However, there are some cases in which we are explicitly intending to
use the exact same bit pattern with a different type. This is most
common when switching between signed and unsigned characters. A lot of
qpdf's code uses unsigned characters internally, but ``std::string`` and
``char`` are signed. Using ``QIntC::to_char`` would be wrong for
converting from unsigned to signed characters because a negative
``char`` value and the corresponding ``unsigned char`` value greater
than 127 *mean the same thing*. There are also
cases in which we use ``static_cast`` when working with bit fields where
we are not representing a numerical value but rather a bunch of bits
packed together in some integer type. Also note that ``size_t`` and
``long`` both typically differ between 32-bit and 64-bit environments,
so sometimes an explicit cast may not be needed to avoid warnings on one
platform but may be needed on another. A conversion with ``QIntC``
should always be used when the types are different even if the
underlying size is the same. QPDF's CI build builds on 32-bit and 64-bit
platforms, and the test suite is very thorough, so it is hard to make
any of the potential errors here without being caught in build or test.
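To make the distinction between range-checked conversion and same-bit-pattern casts concrete, here is a minimal standalone sketch. ``checked_convert``, ``converts_safely``, and ``same_bits`` are hypothetical helpers written in the spirit of the policy above; they are not the functions provided by ``QIntC``.

```cpp
#include <stdexcept>
#include <type_traits>

// Convert between integer types, throwing if the value cannot be
// represented in the destination type.
template <typename To, typename From>
To
checked_convert(From value)
{
    static_assert(
        std::is_integral<From>::value && std::is_integral<To>::value,
        "checked_convert works on integer types only");
    To result = static_cast<To>(value);
    // The value is preserved only if it round-trips and keeps its sign.
    bool round_trips = (static_cast<From>(result) == value);
    bool same_sign = ((result < To(0)) == (value < From(0)));
    if (!(round_trips && same_sign)) {
        throw std::range_error("integer conversion would lose information");
    }
    return result;
}

// Report whether a conversion would succeed instead of throwing.
template <typename To, typename From>
bool
converts_safely(From value)
{
    try {
        checked_convert<To>(value);
        return true;
    } catch (std::range_error const&) {
        return false;
    }
}

// Reinterpret a byte as char, keeping the bit pattern. A range check
// would be wrong here: byte 0xE9 and the corresponding negative char
// value denote the same character.
inline char
same_bits(unsigned char c)
{
    return static_cast<char>(c);
}
```

With these helpers, converting 200 to ``signed char`` is rejected as a numeric conversion, while ``same_bits`` passes the same byte through unchanged, which is the behavior wanted when moving character data between signed and unsigned buffers.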
Non-const ``unsigned char*`` is used in the ``Pipeline`` interface. The
pipeline interface has a ``write`` call that uses ``unsigned char*``
without a ``const`` qualifier. The main reason for this is
to support pipelines that make calls to third-party libraries, such as
zlib, that don't include ``const`` in their interfaces. Unfortunately,
there are many places in the code where it is desirable to have
``const char*`` with pipelines. None of the pipeline implementations
in qpdf
currently modify the data passed to write, and doing so would be counter
to the intent of ``Pipeline``, but there is nothing in the code to
prevent this from being done. There are places in the code where
``const_cast`` is used to remove the const-ness of pointers going into
``Pipeline``\ s. This could theoretically be unsafe, but there is
adequate testing to assert that it is safe and will remain safe in
qpdf's code.
.. _ref.encryption:
Encryption
----------
Encryption is supported transparently by qpdf. When opening a PDF file,
if an encryption dictionary exists, the ``QPDF`` object processes this
dictionary using the password (if any) provided. The primary decryption
key is computed and cached. No further access is made to the encryption
dictionary after that time. When an object is read from a file, the
object ID and generation of the object in which it is contained is
always known. Using this information along with the stored encryption
key, all stream and string objects are transparently decrypted. Raw
encrypted objects are never stored in memory. This way, nothing in the
library ever has to know or care whether it is reading an encrypted
file.
An interface is also provided for writing encrypted streams and strings
given an encryption key. This is used by ``QPDFWriter`` when it rewrites
encrypted files.
When copying encrypted files, unless otherwise directed, qpdf will
preserve any encryption in force in the original file. qpdf can do this
with either the user or the owner password. There is no difference in
capability based on which password is used. When 40-bit or 128-bit
encryption keys are used, the user password can be recovered with the
owner password. With 256-bit keys, the user and owner passwords are used
independently to encrypt the actual encryption key, so while either can
be used, the owner password can no longer be used to recover the user
password.
Starting with version 4.0.0, qpdf can read files that are not encrypted
but that contain encrypted attachments, but it cannot write such files.
qpdf also requires the password to be specified in order to open the
file, not just to extract attachments, since once the file is open, all
decryption is handled transparently. When copying files like this while
preserving encryption, qpdf will apply the file's encryption to
everything in the file, not just to the attachments. When decrypting the
file, qpdf will decrypt the attachments. In general, when copying PDF
files with multiple encryption formats, qpdf will choose the newest
format. The only exception to this is that clear-text metadata will be
preserved as clear-text if it is that way in the original file.
One point of confusion some people have about encrypted PDF files is
that encryption is not the same as password protection. Password
protected files are always encrypted, but it is also possible to create
encrypted files that do not have passwords. Internally, such files use
the empty string as a password, and most readers try the empty string
first to see if it works and prompt for a password only if the empty
string doesn't work. Normally such files have an empty user password and
a non-empty owner password. In that way, if the file is opened by an
ordinary reader without specification of password, the restrictions
specified in the encryption dictionary can be enforced. Most users
wouldn't even realize such a file was encrypted. Since qpdf always
ignores the restrictions (except for the purpose of reporting what they
are), qpdf doesn't care which password you use. QPDF will allow you to
create PDF files with non-empty user passwords and empty owner
passwords. Some readers will require a password when you open these
files, and others will open the files without a password and not enforce
restrictions. Having a non-empty user password and an empty owner
password doesn't really make sense because it would mean that opening
the file with the user password would be more restrictive than not
supplying a password at all. QPDF also allows you to create PDF files
with the same password as both the user and owner password. Some readers
will not ever allow such files to be accessed without restrictions
because they never try the password as the owner password if it works as
the user password. Nonetheless, one of the powerful aspects of qpdf is
that it allows you to finely specify the way encrypted files are
created, even if the results are not useful to some readers. One use
case for this would be for testing a PDF reader to ensure that it
handles odd configurations of input files.
.. _ref.random-numbers:
Random Number Generation
------------------------
QPDF generates random numbers to support generation of encrypted data.
Starting in qpdf 10.0.0, qpdf uses the crypto provider as its source of
random numbers. Older versions used the OS-provided source of secure
random numbers or, if allowed at build time, insecure random numbers
from stdlib. Starting with version 5.1.0, you can disable use of
OS-provided secure random numbers at build time. This is especially
useful on Windows if you want to avoid a dependency on Microsoft's
cryptography API. You can also supply your own random data provider. For
details on how to do this, please refer to the top-level README.md file
in the source distribution and to comments in
:file:`QUtil.hh`.
.. _ref.adding-and-remove-pages:
Adding and Removing Pages
-------------------------
While qpdf's API has supported adding and modifying objects for some
time, version 3.0 introduces specific methods for adding and removing
pages. These are largely convenience routines that handle two tricky
issues: pushing inheritable resources from the ``/Pages`` tree down to
individual pages and manipulation of the ``/Pages`` tree itself. For
details, see ``addPage`` and surrounding methods in
:file:`QPDF.hh`.
.. _ref.reserved-objects:
Reserving Object Numbers
------------------------
Version 3.0 of qpdf introduced the concept of reserved objects. These
are seldom needed for ordinary operations, but there are cases in which
you may want to add a series of indirect objects with references to each
other to a ``QPDF`` object. This causes a problem because you can't
determine the object ID that a new indirect object will have until you
add it to the ``QPDF`` object with ``QPDF::makeIndirectObject``. The
only way to add two mutually referential objects to a ``QPDF`` object
prior to version 3.0 would be to add the new objects first and then make
them refer to each other after adding them. Now it is possible to create
a *reserved object* using
``QPDFObjectHandle::newReserved``. This is an indirect object that stays
"unresolved" even if it is queried for its type. So now, if you want to
create a set of mutually referential objects, you can create
reservations for each one of them and use those reservations to
construct the references. When finished, you can call
``QPDF::replaceReserved`` to replace the reserved objects with the real
ones. This functionality will never be needed by most applications, but
it is used internally by QPDF when copying objects from other PDF files,
as discussed in :ref:`ref.foreign-objects`. For an example of how to use reserved
objects, search for ``newReserved`` in
:file:`test_driver.cc` in qpdf's sources.
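The idea behind reservations can be shown with a toy model that is independent of qpdf. ``ToyDocument`` and its methods are hypothetical; they mimic only the reserve-then-replace sequence, not the signatures of ``QPDFObjectHandle::newReserved`` or ``QPDF::replaceReserved``.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

class ToyDocument
{
  public:
    // Hand out an object number now; the body is supplied later.
    int
    reserve()
    {
        int id = next_id_++;
        reserved_.insert(id);
        return id;
    }

    // Fill in a previously reserved object.
    void
    replace_reserved(int id, std::string const& body)
    {
        if (reserved_.erase(id) == 0) {
            throw std::logic_error("object is not reserved");
        }
        objects_[id] = body;
    }

    bool
    has_reservations() const
    {
        return !reserved_.empty();
    }

    std::string const&
    body(int id) const
    {
        return objects_.at(id);
    }

    // Text of an indirect reference to the given object (generation 0).
    static std::string
    ref(int id)
    {
        return std::to_string(id) + " 0 R";
    }

  private:
    int next_id_ = 1;
    std::set<int> reserved_;
    std::map<int, std::string> objects_;
};
```

Because both object numbers are known before either body exists, each body can be constructed with its reference to the other already in place, and nothing needs to be patched after the objects are added.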
.. _ref.foreign-objects:
Copying Objects From Other PDF Files
------------------------------------
Version 3.0 of qpdf introduced the ability to copy objects into a
``QPDF`` object from a different ``QPDF`` object, which we refer to as
*foreign objects*. This allows arbitrary
merging of PDF files. The "from" ``QPDF`` object must remain valid after
the copy as discussed in the note below. The
:command:`qpdf` command-line tool provides limited
support for basic page selection, including merging in pages from other
files, but the library's API makes it possible to implement arbitrarily
complex merging operations. The main method for copying foreign objects
is ``QPDF::copyForeignObject``. This takes an indirect object from
another ``QPDF`` and copies it recursively into this object while
preserving all object structure, including circular references. This
means you can add a direct object that you create from scratch to a
``QPDF`` object with ``QPDF::makeIndirectObject``, and you can add an
indirect object from another file with ``QPDF::copyForeignObject``. The
fact that ``QPDF::makeIndirectObject`` does not automatically detect a
foreign object and copy it is an explicit design decision. Copying a
foreign object seems like a sufficiently significant thing to do that it
should be done explicitly.
The other way to copy foreign objects is by passing a page from one
``QPDF`` to another by calling ``QPDF::addPage``. In contrast to
``QPDF::makeIndirectObject``, this method automatically distinguishes
between indirect objects in the current file, foreign objects, and
direct objects.
Please note: when you copy objects from one ``QPDF`` to another, the
source ``QPDF`` object must remain valid until you have finished with
the destination object. This is because the original object is still
used to retrieve any referenced stream data from the copied object.
.. _ref.rewriting:
Writing PDF Files
-----------------
The qpdf library supports file writing of ``QPDF`` objects to PDF files
through the ``QPDFWriter`` class. The ``QPDFWriter`` class has two
writing modes: one for non-linearized files, and one for linearized
files. See :ref:`ref.linearization` for a description of how
linearization is implemented. This section describes how we write
non-linearized files including the creation of QDF files (see :ref:`ref.qdf`).
This outline was written prior to implementation and is not exactly
accurate, but it provides a correct "notional" idea of how writing
works. Look at the code in ``QPDFWriter`` for exact details.
- Initialize state:
- next object number = 1
- object queue = empty
- renumber table: old object id/generation to new id/0 = empty
- xref table: new id -> offset = empty
- Create a QPDF object from a file.
- Write header for new PDF file.
- Request the trailer dictionary.
- For each value that is an indirect object, grab the next object
number (via an operation that returns and increments the number). Map
object to new number in renumber table. Push object onto queue.
- While there are more objects on the queue:
- Pop queue.
- Look up object's new number *n* in the renumbering table.
- Store current offset into xref table.
- Write :samp:`{n} 0 obj`.
- If object is null, whether direct or indirect, write out null,
thus eliminating unresolvable indirect object references.
- If the object is a stream, write stream contents, piped
through any filters as required, to a memory buffer. Use this
buffer to determine the stream length.
- If object is not a stream, array, or dictionary, write out its
contents.
- If object is an array or dictionary (including stream), traverse
its elements (for array) or values (for dictionaries), handling
recursive dictionaries and arrays, looking for indirect objects.
When an indirect object is found, if it is not resolvable, ignore.
(This case is handled when writing it out.) Otherwise, look it up
in the renumbering table. If not found, grab the next available
object number, assign to the referenced object in the renumbering
table, and push the referenced object onto the queue. As a special
case, when writing out a stream dictionary, replace length,
filters, and decode parameters as required.
Write out dictionary or array, replacing any unresolvable indirect
object references with null (the PDF spec says a reference to a
nonexistent object is legal and resolves to null) and any
resolvable ones with references to the renumbered objects.
- If the object is a stream, write ``stream\n``, the stream contents
(from the memory buffer), and ``\nendstream\n``.
- When done, write ``endobj``.
Once we have finished the queue, all referenced objects will have been
written out and all deleted objects or unreferenced objects will have
been skipped. The new cross-reference table will contain an offset for
every new object number from 1 up to the number of objects written. This
can be used to write out a new xref table. Finally we can write out the
trailer dictionary with appropriately computed /ID (see spec, 8.3, File
Identifiers), the cross reference table offset, and ``%%EOF``.
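The renumbering part of the outline above can be modeled in a few lines. What follows is a minimal sketch in Python, not the code in ``QPDFWriter``: ``Ref``, ``plan_write``, and the sample objects are hypothetical names chosen for illustration, and the sketch only computes the renumber table and the write order rather than emitting any bytes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference (object id, generation)."""
    obj: int
    gen: int = 0


def plan_write(objects, trailer):
    """Return (renumber, order) for the objects reachable from the trailer.

    objects maps (id, gen) -> value; a value is a Ref, dict, list, or scalar.
    """
    renumber = {}      # old (id, gen) -> new id; the new generation is always 0
    queue = deque()    # objects that have a new number but are not yet written
    order = []         # old keys in the order they would be written
    next_id = 1

    def enqueue(ref):
        nonlocal next_id
        key = (ref.obj, ref.gen)
        if key not in objects:
            return  # unresolvable reference: it would be written as null
        if key not in renumber:
            renumber[key] = next_id
            next_id += 1
            queue.append(key)

    def scan(value):
        # Traverse arrays and dictionaries recursively, looking for references.
        if isinstance(value, Ref):
            enqueue(value)
        elif isinstance(value, dict):
            for item in value.values():
                scan(item)
        elif isinstance(value, list):
            for item in value:
                scan(item)

    scan(trailer)
    while queue:
        key = queue.popleft()
        order.append(key)
        scan(objects[key])
    return renumber, order


# Made-up input: object 7 is unreferenced and object 42 does not exist.
objects = {
    (1, 0): {"/Type": "/Catalog", "/Pages": Ref(5)},
    (5, 0): {"/Kids": [Ref(9)], "/Count": 1},
    (9, 0): {"/Parent": Ref(5)},
    (7, 0): "unreferenced",
}
trailer = {"/Root": Ref(1), "/Info": Ref(42)}
renumber, order = plan_write(objects, trailer)
```

In this model the unreferenced object never enters the queue, so it is skipped, and the new object numbers run from 1 up to the number of objects written, which is what makes the new cross-reference table straightforward to produce.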
.. _ref.filtered-streams:
Filtered Streams
----------------
Support for streams is implemented through the ``Pipeline`` interface
which was designed for this package.
When reading streams, create a series of ``Pipeline`` objects. The
``Pipeline`` abstract base class requires implementation of ``write()`` and
``finish()`` and provides an implementation of ``getNext()``. Each
pipeline object, upon receiving data, does whatever it is going to do
and then writes the data (possibly modified) to its successor.
Alternatively, a pipeline may be an end-of-the-line pipeline that does
something like store its output to a file or a memory buffer ignoring a
successor. For additional details, look at
:file:`Pipeline.hh`.
``QPDF`` can read raw or filtered streams. When reading a filtered
stream, the ``QPDF`` class creates one ``Pipeline`` object for each
appropriate filter and chains them together. The last filter
should write to whatever type of output is required. The ``QPDF`` class
has an interface to write raw or filtered stream contents to a given
pipeline.
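To make the chaining idea concrete, here is a small Python model of the pattern. It is only a sketch under stated assumptions: the real interface is the C++ one declared in :file:`Pipeline.hh`, and the ``Inflate`` and ``Buffer`` classes below are illustrative stand-ins rather than qpdf classes.

```python
import zlib
from abc import ABC, abstractmethod


class Pipeline(ABC):
    """Model of the pattern: subclasses implement write() and finish()."""

    def __init__(self, next_pipeline=None):
        self._next = next_pipeline

    def get_next(self):
        if self._next is None:
            raise RuntimeError("this pipeline has no successor")
        return self._next

    @abstractmethod
    def write(self, data: bytes) -> None:
        ...

    @abstractmethod
    def finish(self) -> None:
        ...


class Inflate(Pipeline):
    """Filter stage: decompresses zlib data and passes it to its successor."""

    def __init__(self, next_pipeline):
        super().__init__(next_pipeline)
        self._z = zlib.decompressobj()

    def write(self, data):
        out = self._z.decompress(data)
        if out:
            self.get_next().write(out)

    def finish(self):
        tail = self._z.flush()
        if tail:
            self.get_next().write(tail)
        self.get_next().finish()


class Buffer(Pipeline):
    """End-of-the-line stage: collects everything it receives in memory."""

    def __init__(self):
        super().__init__(None)
        self._chunks = []
        self.data = None

    def write(self, data):
        self._chunks.append(data)

    def finish(self):
        self.data = b"".join(self._chunks)


original = b"BT /F1 12 Tf (Hello) Tj ET"
raw = zlib.compress(original)

sink = Buffer()
head = Inflate(sink)
for i in range(0, len(raw), 4):   # feed the raw stream data in small chunks
    head.write(raw[i:i + 4])
head.finish()
```

The caller only ever talks to the head of the chain. Reading raw rather than filtered data amounts to writing straight to the sink without the filter stage in front of it.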
.. _ref.object-accessors:
Object Accessor Methods
-----------------------
..
This section is referenced in QPDFObjectHandle.hh
For general information about how to access instances of
``QPDFObjectHandle``, please see the comments in
:file:`QPDFObjectHandle.hh`. Search for "Accessor
methods". This section provides a more in-depth discussion of the
behavior and the rationale for the behavior.
*Why were type errors made into warnings?* When type checks were
introduced into qpdf in the early days, it was expected that type errors
would only occur as a result of programmer error. However, in practice,
type errors would occur with malformed PDF files because of assumptions
made in code, including code within the qpdf library and code written by
library users. The most common case would be chaining calls to
``getKey()`` to access keys deep within a dictionary. In many cases,
qpdf would be able to recover from these situations, but the old
behavior often resulted in crashes rather than graceful recovery. For
this reason, the errors were changed to warnings.
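The chained-``getKey()`` situation can be illustrated with a toy model. The ``Handle`` class below is a hypothetical Python stand-in, not ``QPDFObjectHandle``. It assumes the behavior described above: a type mismatch records a warning and yields a null value, so a chain of lookups on a malformed file runs to completion instead of crashing.

```python
class Handle:
    """Toy object handle: type mismatches warn and return a null handle."""

    def __init__(self, value, warnings):
        self.value = value
        self.warnings = warnings   # shared list that collects warning text

    def is_dictionary(self):
        return isinstance(self.value, dict)

    def get_key(self, key):
        if not self.is_dictionary():
            kind = type(self.value).__name__
            self.warnings.append(
                f"get_key({key!r}) called on {kind}; returning null"
            )
            return Handle(None, self.warnings)
        return Handle(self.value.get(key), self.warnings)


# Made-up malformed structure: /Pages should be a dictionary, not an integer.
warnings = []
trailer = Handle({"/Root": {"/Pages": 17}}, warnings)
kids = trailer.get_key("/Root").get_key("/Pages").get_key("/Kids")
```

Here ``kids.value`` ends up as ``None`` and one warning is recorded, which the caller can inspect after the fact.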
*Why even warn about type errors when the user can't usually do anything
about them?* Type warnings are extremely valuable during development.
Since it's impossible to catch at compile time things like typos in
dictionary key names or logic errors around what the structure of a PDF
file might be, the presence of type warnings can save lots of developer
time. They have also proven useful in exposing issues in qpdf itself
that would have otherwise gone undetected.
*Can there be a type-safe ``QPDFObjectHandle``?* It would be great if
``QPDFObjectHandle`` could be more strongly typed so that you'd have to
check that something was of a particular type before calling
type-specific accessor methods. However, implementing this at this stage
of the library's history would be quite difficult, and it would make
the common pattern of drilling into an object no longer work. While it
would be possible to have a parallel interface, it would create a lot of
extra code. If qpdf were written in a language like Rust, an interface
like this would make a lot of sense, but, for a variety of reasons, the
qpdf API is consistent with other APIs of its time, relying on exception
handling to catch errors. The underlying PDF objects are inherently not
type-safe. Forcing stronger type safety in ``QPDFObjectHandle`` would
ultimately cause a lot more code to have to be written and would likely
make software that uses qpdf more brittle, and even so, checks would
have to occur at runtime.
*Why do type errors sometimes raise exceptions?* The way warnings work
in qpdf requires a ``QPDF`` object to be associated with an object
handle for a warning to be issued. It would be nice if this could be
fixed, but it would require major changes to the API. Rather than
throwing away these conditions, we convert them to exceptions. It's not
that bad though. Since any object handle that was read from a file has
an associated ``QPDF`` object, it would only be type errors on objects
that were created explicitly that would cause exceptions, and in that
case, type errors are much more likely to be the result of a coding
error than invalid input.
*Why does the behavior of a type exception differ between the C and C++
API?* There is no way to throw and catch exceptions in C short of
something like ``setjmp`` and ``longjmp``, and that approach is not
portable across language barriers. Since the C API is often used from
other languages, it's important to keep things as simple as possible.
Starting in qpdf 10.5, exceptions that used to crash code using the C
API will be written to stderr by default, and it is possible to register
an error handler. There's no reason that the error handler can't
simulate exception handling in some way, such as by using ``setjmp`` and
``longjmp`` or by setting some variable that can be checked after
library calls are made. In retrospect, it might have been better if the
C API object handle methods returned error codes like the other methods
and set return values in passed-in pointers, but this would complicate
both the implementation and the use of the library for a case that is
actually quite rare and largely avoidable.

342
manual/installation.rst Normal file

@ -0,0 +1,342 @@
.. _ref.installing:
Building and Installing QPDF
============================
This chapter describes how to build and install qpdf. Please see also
the :file:`README.md` and
:file:`INSTALL` files in the source distribution.
.. _ref.prerequisites:
System Requirements
-------------------
The qpdf package has few external dependencies. In order to build qpdf,
the following packages are required:
- A C++ compiler that supports C++14.
- zlib: http://www.zlib.net/
- jpeg: http://www.ijg.org/files/ or https://libjpeg-turbo.org/
- *Recommended but not required:* gnutls: https://www.gnutls.org/ to be
able to use the gnutls crypto provider, and/or openssl:
https://openssl.org/ to be able to use the openssl crypto provider.
- GNU make 3.81 or newer: http://www.gnu.org/software/make
- perl version 5.8 or newer: http://www.perl.org/; required for running
the test suite. Starting with qpdf version 9.1.1, perl is no longer
required at runtime.
- GNU diffutils (any version): http://www.gnu.org/software/diffutils/
is required to run the test suite. Note that this is the version of
diff present on virtually all GNU/Linux systems. This is required
because the test suite uses :command:`diff -u`.
Part of qpdf's test suite does comparisons of the contents of PDF files by
converting them to images and comparing the images. The image comparison
tests are disabled by default. Those tests are not required for
determining correctness of a qpdf build if you have not modified the
code since the test suite also contains expected output files that are
compared literally. The image comparison tests provide an extra check to
make sure that any content transformations don't break the rendering of
pages. Transformations that affect the content streams themselves are
off by default and are only provided to help developers look into the
contents of PDF files. If you are making deep changes to the library
that cause changes in the contents of the files that qpdf generates,
then you should enable the image comparison tests. Enable them by
running :command:`configure` with the
:samp:`--enable-test-compare-images` flag. If you enable
this, the following additional packages are required by the test
suite. Note that in no case are these items required to use qpdf.
- libtiff: http://www.remotesensing.org/libtiff/
- GhostScript version 8.60 or newer: http://www.ghostscript.com
If you do not enable this, then you do not need to have tiff and
ghostscript.
Pre-built documentation is distributed with qpdf, so you should
generally not need to rebuild the documentation. In order to build the
documentation from source, you need to install `Sphinx
<https://sphinx-doc.org>`__. To build the PDF version of the
documentation, you need :command:`pdflatex`, :command:`latexmk`, and a fairly complete
LaTeX installation. Detailed requirements can be found in the Sphinx
documentation.
.. _ref.building:
Build Instructions
------------------
Building qpdf on UNIX is generally just a matter of running

::

   ./configure
   make

You can also run :command:`make check` to run the test
suite and :command:`make install` to install. Please run
:command:`./configure --help` for options on what can be
configured. You can also set the value of ``DESTDIR`` during
installation to install to a temporary location, as is common with many
open source packages. Please see also the
:file:`README.md` and
:file:`INSTALL` files in the source distribution.
Building on Windows is a little bit more complicated. For details,
please see :file:`README-windows.md` in the source
distribution. You can also download a binary distribution for Windows.
There is a port of qpdf to Visual C++ version 6 in the
:file:`contrib` area generously contributed by Jian
Ma. This is also discussed in more detail in
:file:`README-windows.md`.
While ``wchar_t`` is part of the C++ standard, qpdf uses it in only one
place in the public API, and it's just in a helper function. It is
possible to build qpdf on a system that doesn't have ``wchar_t``, and
it's also possible to compile a program that uses qpdf on a system
without ``wchar_t`` as long as you don't call that one method. This is a
very unusual situation. For a detailed discussion, please see the
top-level README.md file in qpdf's source distribution.
There are some other things you can do with the build. Although qpdf
uses :command:`autoconf`, it does not use
:command:`automake` but instead uses a
hand-crafted non-recursive Makefile that requires GNU make. If you're
really interested, please read the comments in the top-level
:file:`Makefile`.
.. _ref.crypto:
Crypto Providers
----------------
Starting with qpdf 9.1.0, the qpdf library can be built with multiple
implementations of providers of cryptographic functions, which we refer
to as "crypto providers." At the time of writing, a crypto
implementation must provide MD5 and SHA2 (256, 384, and 512-bit) hashes
and RC4 and AES256 with and without CBC encryption. In the future, if
digital signature support is added to qpdf, there may be additional requirements
beyond this.
Starting with qpdf version 9.1.0, the available implementations are
``native`` and ``gnutls``. In qpdf 10.0.0, ``openssl`` was added.
Additional implementations may be added if needed. It is also possible
for a developer to provide their own implementation without modifying
the qpdf library.
.. _ref.crypto.build:
Build Support For Crypto Providers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When building with qpdf's build system, crypto providers can be enabled
at build time using various :command:`./configure`
options. The default behavior is for
:command:`./configure` to discover which crypto providers
can be supported based on available external libraries, to build all
available crypto providers, and to use an external provider as the
default over the native one. This behavior can be changed with the
following flags to :command:`./configure`:
- :samp:`--enable-crypto-{x}`
(where :samp:`{x}` is a supported crypto
provider): enable the :samp:`{x}` crypto
provider, requiring any external dependencies it needs
- :samp:`--disable-crypto-{x}`:
disable the :samp:`{x}` provider, and do not
link against its dependencies even if they are available
- :samp:`--with-default-crypto={x}`:
make :samp:`{x}` the default provider even if
a higher priority one is available
- :samp:`--disable-implicit-crypto`: only build crypto
providers that are explicitly requested with an
:samp:`--enable-crypto-{x}`
option
For example, if you want to guarantee that the gnutls crypto provider is
used and that the native provider is not built, you could run
:command:`./configure --enable-crypto-gnutls
--disable-implicit-crypto`.
If you build qpdf using your own build system, in order for qpdf to work
at all, you need to enable at least one crypto provider. The file
:file:`libqpdf/qpdf/qpdf-config.h.in` provides
macros ``DEFAULT_CRYPTO``, whose value must be a string naming the
default crypto provider, and various symbols starting with
``USE_CRYPTO_``, at least one of which has to be enabled. Additionally,
you must compile the source files that implement a crypto provider. To
get a list of those files, look at
:file:`libqpdf/build.mk`. If you want to omit a
particular crypto provider, as long as its ``USE_CRYPTO_`` symbol is
undefined, you can completely ignore the source files that belong to a
particular crypto provider. Additionally, crypto providers may have
their own external dependencies that can be omitted if the crypto
provider is not used. For example, if you are building qpdf yourself and
are using an environment that does not support gnutls or openssl, you
can ensure that ``USE_CRYPTO_NATIVE`` is defined, ``USE_CRYPTO_GNUTLS``
is not defined, and ``DEFAULT_CRYPTO`` is defined to ``"native"``. Then
you must include the source files used in the native implementation,
some of which were added or renamed from earlier versions, to your
build, and you can ignore
:file:`QPDFCrypto_gnutls.cc`. Always consult
:file:`libqpdf/build.mk` to get the list of source
files you need to build.
.. _ref.crypto.runtime:
Runtime Crypto Provider Selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can use the :samp:`--show-crypto` option to
:command:`qpdf` to get a list of available crypto
providers. The default provider is always listed first, and the rest are
listed in lexical order. Each crypto provider is listed on a line by
itself with no other text, enabling the output of this command to be
used easily in scripts.
You can override which crypto provider is used by setting the
``QPDF_CRYPTO_PROVIDER`` environment variable. There are few reasons to
ever do this, but you might want to do it if you were explicitly trying
to compare behavior of two different crypto providers while testing
performance or reproducing a bug. It could also be useful for people who
are implementing their own crypto providers.
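Because the ``--show-crypto`` output is one provider per line with the default first, it is easy to consume from a script. The following Python sketch shows one way to do that and to prepare an environment that overrides the provider. The helper names and the sample output are assumptions made for illustration; the actual list depends on how qpdf was built.

```python
import os


def parse_show_crypto(output: str):
    """Split ``qpdf --show-crypto`` output into (default, others)."""
    names = [line.strip() for line in output.splitlines() if line.strip()]
    if not names:
        raise ValueError("no crypto providers listed")
    return names[0], names[1:]


def env_with_provider(provider: str, base=None):
    """Return a copy of the environment with QPDF_CRYPTO_PROVIDER set."""
    env = dict(os.environ if base is None else base)
    env["QPDF_CRYPTO_PROVIDER"] = provider
    return env


# Made-up sample output, used here instead of actually running qpdf.
sample_output = "gnutls\nnative\nopenssl\n"
default_provider, other_providers = parse_show_crypto(sample_output)
```

To compare providers, the environment returned by ``env_with_provider`` could be passed as the ``env`` argument of ``subprocess.run`` when invoking :command:`qpdf`, once per provider name.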
.. _ref.crypto.develop:
Crypto Provider Information for Developers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you are writing code that uses libqpdf and you want to force a
certain crypto provider to be used, you can call the method
``QPDFCryptoProvider::setDefaultProvider``. The argument is the name of
a built-in or developer-supplied provider. To add your own crypto
provider, you have to create a class derived from ``QPDFCryptoImpl`` and
register it with ``QPDFCryptoProvider``. For additional information, see
comments in :file:`include/qpdf/QPDFCryptoImpl.hh`.
.. _ref.crypto.design:
Crypto Provider Design Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This section describes a few bits of rationale for why the crypto
provider interface was set up the way it was. You don't need to know any
of this information, but it's provided for the record and in case it's
interesting.
As a general rule, I want to avoid as much as possible including large
blocks of code that are conditionally compiled such that, in most
builds, some code is never built. This is dangerous because it makes it
very easy for invalid code to creep in unnoticed. As such, I want it to
be possible to build qpdf with all available crypto providers, and this
is the way I build qpdf for local development. At the same time, if a
particular packager feels that it is a security liability for qpdf to
use crypto functionality from other than a library that gets
considerable scrutiny for this specific purpose (such as gnutls,
openssl, or nettle), then I want to give that packager the ability to
completely disable qpdf's native implementation. Or if someone wants to
avoid adding a dependency on one of the external crypto providers, I
don't want the availability of the provider to impose additional
external dependencies within that environment. Both of these are
situations that I know to be true for some users of qpdf.
I want registration and selection of crypto providers to be thread-safe,
and I want it to work deterministically for a developer to provide their
own crypto provider and be able to set it up as the default. This was
the primary motivation behind requiring C++11 as doing so enabled me to
exploit the guaranteed thread safety of local block static
initialization. The ``QPDFCryptoProvider`` class uses a singleton
pattern with thread-safe initialization to create the singleton instance
of ``QPDFCryptoProvider`` and exposes only static methods in its public
interface. In this way, if a developer wants to call any
``QPDFCryptoProvider`` methods, the library guarantees the
``QPDFCryptoProvider`` is fully initialized and all built-in crypto
providers are registered. Making ``QPDFCryptoProvider`` actually know
about all the built-in providers may seem a bit sad at first, but this
choice makes it extremely clear exactly what the initialization behavior
is. There's no question about provider implementations automatically
registering themselves in a nondeterministic order. It also means that
implementations do not need to know anything about the provider
interface, which makes them easier to test in isolation. Another
advantage of this approach is that a developer who wants to develop
their own crypto provider can do so in complete isolation from the qpdf
library and, with just two calls, can make qpdf use their provider in
their application. If they decided to contribute their code, plugging it
into the qpdf library would require a very small change to qpdf's source
code.
The decision to make the crypto provider selectable at runtime was one I
struggled with a little, but I decided to do it for various reasons.
Allowing an end user to switch crypto providers easily could be very
useful for reproducing a potential bug. If a user reports a bug that
some cryptographic thing is broken, I can easily ask that person to try
with the ``QPDF_CRYPTO_PROVIDER`` variable set to different values. The
same could apply in the event of a performance problem. This also makes
it easier for qpdf's own test suite to exercise code with different
providers without having to make every program that links with qpdf
aware of the possibility of multiple providers. In qpdf's continuous
integration environment, the entire test suite is run for each supported
crypto provider. This is made simple by being able to select the
provider using an environment variable.
Finally, making crypto providers selectable in this way establishes a
pattern that I may follow again in the future for stream filter
providers. One could imagine a future enhancement where someone could
provide their own implementations for basic filters like
``/FlateDecode`` or for other filters that qpdf doesn't support.
Implementing the registration functions and internal storage of
registered providers was also easier using C++11's functional
interfaces, which was another reason to require C++11 at this time.
.. _ref.packaging:
Notes for Packagers
-------------------
If you are packaging qpdf for an operating system distribution, here are
some things you may want to keep in mind:
- Starting in qpdf version 9.1.1, qpdf no longer has a runtime
dependency on perl. This is because fix-qdf was rewritten in C++.
However, qpdf still has a build-time dependency on perl.
- Make sure you are getting the intended behavior with regard to crypto
providers. Read :ref:`ref.crypto.build` for details.
- Passing :samp:`--enable-show-failed-test-output` to
:command:`./configure` will cause any failed test
output to be written to the console. This can be very useful for
seeing test failures generated by autobuilders where you can't access
qtest.log after the fact.
- If qpdf's build environment detects the presence of autoconf and
related tools, it will check to ensure that automatically generated
files are up-to-date with recorded checksums and fail if it detects a
discrepancy. This feature is intended to prevent you from
accidentally forgetting to regenerate automatic files after modifying
their sources. If your packaging environment automatically refreshes
automatic files, it can cause this check to fail. Suppress qpdf's
checks by passing :samp:`--disable-check-autofiles`
to :command:`./configure`. This is safe since qpdf's
:command:`autogen.sh` just runs autotools in the
normal way.
- QPDF's :command:`make install` does not install
completion files by default, but as a packager, it's good if you
install them wherever your distribution expects such files to go. You
can find completion files to install in the
:file:`completions` directory.
- Packagers are encouraged to install the source files from the
:file:`examples` directory along with qpdf
development packages.

177
manual/json.rst Normal file

@ -0,0 +1,177 @@
.. _ref.json:
QPDF JSON
=========
.. _ref.json-overview:
Overview
--------
Beginning with qpdf version 8.3.0, the :command:`qpdf`
command-line program can produce a JSON representation of the
non-content data in a PDF file. It includes a dump in JSON format of all
objects in the PDF file excluding the content of streams. This JSON
representation makes it very easy to look in detail at the structure of
a given PDF file, and it also provides a great way to work with PDF
files programmatically from the command-line in languages that can't
call or link with the qpdf library directly. Note that stream data can
be extracted from PDF files using other qpdf command-line options.
.. _ref.json-guarantees:
JSON Guarantees
---------------
The qpdf JSON representation includes a JSON serialization of the raw
objects in the PDF file as well as some computed information in a more
easily extracted format. QPDF provides some guarantees about its JSON
format. These guarantees are designed to simplify the experience of a
developer working with the JSON format.
Compatibility
The top-level JSON object output is a dictionary. The JSON output
contains various nested dictionaries and arrays. With the exception
of dictionaries that are populated by the fields of objects from the
file, all instances of a dictionary are guaranteed to have exactly
the same keys. Future versions of qpdf are free to add additional
keys but not to remove keys or change the type of object that a key
points to. The qpdf program validates this guarantee, and in the
unlikely event that a bug in qpdf should cause it to generate data
that doesn't conform to this rule, it will ask you to file a bug
report.
The top-level JSON structure contains a "``version``" key whose value
is a simple integer. The value of the ``version`` key will be
incremented if a non-compatible change is made. A non-compatible
change would be any change that involves removal of a key, a change
to the format of data pointed to by a key, or a semantic change that
requires a different interpretation of a previously existing key. A
strong effort will be made to avoid breaking compatibility.
Documentation
The :command:`qpdf` command can be invoked with the
:samp:`--json-help` option. This will output a JSON
structure that has the same structure as the JSON output that qpdf
generates, except that each field in the help output is a description
of the corresponding field in the JSON output. The specific
guarantees are as follows:
- A dictionary in the help output means that the corresponding
location in the actual JSON output is also a dictionary with
exactly the same keys; that is, no keys present in help are absent
in the real output, and no keys will be present in the real output
that are not in help. As a special case, if the dictionary has a
single key whose name starts with ``<`` and ends with ``>``, it
means that the JSON output is a dictionary that can have any keys,
each of which conforms to the value of the special key. This is
used for cases in which the keys of the dictionary are things like
object IDs.
- A string in the help output is a description of the item that
appears in the corresponding location of the actual output. The
corresponding output can have any format.
- An array in the help output always contains a single element. It
indicates that the corresponding location in the actual output is
also an array, and that each element of the array has whatever
format is implied by the single element of the help output's
array.
For example, the help output includes a "``pagelabels``"
key whose value is an array of one element. That element is a
dictionary with keys "``index``" and "``label``". In addition to
describing the meaning of those keys, this tells you that the actual
JSON output will contain a ``pagelabels`` array, each of whose
elements is a dictionary that contains an ``index`` key, a ``label``
key, and no other keys.
Directness and Simplicity
The JSON output contains the value of every object in the file, but
it also contains some processed data. This is analogous to how qpdf's
library interface works. The processed data is similar to the helper
functions in that it allows you to look at certain aspects of the PDF
file without having to understand all the nuances of the PDF
specification, while the raw objects allow you to mine the PDF for
anything that the higher-level interfaces are lacking.
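The rules listed under Documentation above are mechanical enough to check in code. The function below is a minimal Python sketch of such a check, assuming the full JSON output (no ``--json-keys`` filtering) and treating null as acceptable anywhere, since fields may be null when information is unknown. The ``conforms`` name and the tiny help and output samples are made up for illustration and are not real qpdf output.

```python
def conforms(help_node, actual, path="$"):
    """Return a list of mismatches between actual JSON and the help structure."""
    if actual is None:
        return []   # assumption: null is allowed wherever data may be absent
    if isinstance(help_node, str):
        return []   # a string is a description; the value may have any format
    if isinstance(help_node, list):
        if not isinstance(actual, list):
            return [f"{path}: expected array"]
        problems = []
        for i, item in enumerate(actual):
            # Every element must match the single element of the help array.
            problems += conforms(help_node[0], item, f"{path}[{i}]")
        return problems
    if isinstance(help_node, dict):
        if not isinstance(actual, dict):
            return [f"{path}: expected dictionary"]
        keys = list(help_node)
        problems = []
        if len(keys) == 1 and keys[0].startswith("<") and keys[0].endswith(">"):
            # Special case: any keys are allowed; each value follows the pattern.
            for key, value in actual.items():
                problems += conforms(help_node[keys[0]], value, f"{path}.{key}")
            return problems
        for key in sorted(set(help_node) - set(actual)):
            problems.append(f"{path}: missing key {key!r}")
        for key in sorted(set(actual) - set(help_node)):
            problems.append(f"{path}: unexpected key {key!r}")
        for key in help_node:
            if key in actual:
                problems += conforms(help_node[key], actual[key], f"{path}.{key}")
        return problems
    return [f"{path}: unsupported help node"]


# Hypothetical excerpt of a help structure and two candidate outputs.
help_json = {
    "version": "JSON format version",
    "pagelabels": [{"index": "starting page", "label": "page label dictionary"}],
    "objects": {"<n n R>": "dictionary of original objects"},
}
good = {
    "version": 1,
    "pagelabels": [{"index": 0, "label": {"/S": "/D"}}],
    "objects": {"1 0 R": {"/Type": "/Catalog"}},
}
bad = {
    "version": 1,
    "pagelabels": [{"index": 0, "label": None, "extra": 1}],
    "objects": {},
}
```

With these samples, ``conforms(help_json, good)`` returns an empty list, while the second sample is reported for the extra key inside its ``pagelabels`` element.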
.. _json.limitations:
Limitations of JSON Representation
----------------------------------
There are a few limitations to be aware of with the JSON structure:
- Strings, names, and indirect object references in the original PDF
file are all converted to strings in the JSON representation. In the
case of a "normal" PDF file, you can tell the difference because a
name starts with a slash (``/``), and an indirect object reference
looks like ``n n R``, but if there were to be a string that looked
like a name or indirect object reference, there would be no way to
tell this from the JSON output. Note that there are certain cases
where you know for sure what something is, such as knowing that
dictionary keys in objects are always names and that certain things
in the higher-level computed data are known to contain indirect
object references.
- The JSON format doesn't support binary data very well. Mostly the
details are not important, but they are presented here for
information. When qpdf outputs a string in the JSON representation,
it converts the string to UTF-8, assuming usual PDF string semantics.
Specifically, if the original string is UTF-16, it is converted to
UTF-8. Otherwise, it is assumed to have PDF doc encoding, and is
converted to UTF-8 with that assumption. This causes strange things
to happen to binary strings. For example, if you had the binary
string ``<038051>``, this would be output to the JSON as ``\u0003•Q``
because ``03`` is not a printable character and ``80`` is the bullet
character in PDF doc encoding and is mapped to the Unicode value
``2022``. Since ``51`` is ``Q``, it is output as is. If you wanted to
convert back from here to a binary string, you would have to recognize
Unicode values whose code points are higher than ``0xFF`` and map
those back to their corresponding PDF doc encoding characters. There
is no way to tell the difference between a Unicode string that was
originally encoded as UTF-16 or one that was converted from PDF doc
encoding. In other words, it's best if you don't try to use the JSON
format to extract binary strings from the PDF file, but if you really
had to, it could be done. Note that qpdf's
:samp:`--show-object` option does not have this
limitation and will reveal the string as encoded in the original
file.
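Both limitations can be seen in a short Python sketch. ``guess_kind`` applies the slash and ``n n R`` heuristics, and its answers are only guesses for exactly the reason given above. ``to_pdfdoc_bytes`` attempts the reverse mapping for binary strings; its table is a deliberately partial excerpt of PDF doc encoding, and both function names are hypothetical.

```python
import re

_REF = re.compile(r"\d+ \d+ R")

# Partial excerpt of the Unicode -> PDF doc encoding mapping for code points
# above 0xFF. A complete converter would need the full table from the PDF spec.
_PDFDOC_FROM_UNICODE = {
    0x2022: 0x80,  # bullet
    0x2020: 0x81,  # dagger
    0x2021: 0x82,  # double dagger
    0x2026: 0x83,  # ellipsis
    0x2014: 0x84,  # em dash
    0x2013: 0x85,  # en dash
}


def guess_kind(value: str) -> str:
    """Heuristic only: a real string could look like a name or a reference."""
    if _REF.fullmatch(value):
        return "reference?"
    if value.startswith("/"):
        return "name?"
    return "string"


def to_pdfdoc_bytes(text: str) -> bytes:
    """Map a JSON string back to bytes, assuming it came from PDF doc encoding."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0xFF:
            out.append(code_point)
        elif code_point in _PDFDOC_FROM_UNICODE:
            out.append(_PDFDOC_FROM_UNICODE[code_point])
        else:
            raise ValueError(f"U+{code_point:04X} is not in the partial table")
    return bytes(out)


# The example from the text: <038051> appears in the JSON as \u0003, bullet, Q.
recovered = to_pdfdoc_bytes("\u0003\u2022Q")
```

The sketch cannot tell whether the JSON string started life as UTF-16 or as PDF doc encoding, so it is only meaningful when you already know the original was a binary or PDF-doc-encoded string.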
.. _json.considerations:
JSON: Special Considerations
----------------------------
For the most part, the built-in JSON help tells you everything you need
to know about the JSON format, but there are a few non-obvious things to
be aware of:
- While qpdf guarantees that keys present in the help will be present
in the output, those fields may be null or empty if the information
is not known or absent in the file. Also, if you specify
:samp:`--json-keys`, the keys that are not listed
will be excluded entirely except for those that
:samp:`--json-help` says are always present.
- In a few places, there are keys with names containing
``pageposfrom1``. The values of these keys are null or an integer. If
an integer, they point to a page index within the file numbering from
1. Note that JSON indexes from 0, and you would also use 0-based
indexing using the API. However, 1-based indexing is easier in this
case because the command-line syntax for specifying page ranges is
1-based. If you were going to write a program that looked through the
JSON for information about specific pages and then use the
command-line to extract those pages, 1-based indexing is easier.
Besides, it's more convenient to subtract 1 in a program written in a real
programming language than it is to add 1 in shell code.
- The image information included in the ``page`` section of the JSON
output includes the key "``filterable``". Note that the value of this
field may depend on the :samp:`--decode-level` that
you invoke qpdf with. The JSON output includes a top-level key
"``parameters``" that indicates the decode level used for computing
whether a stream was filterable. For example, JPEG images will be
shown as not filterable by default, but they will be shown as
filterable if you run :command:`qpdf --json
--decode-level=all`.
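As an illustration of the ``pageposfrom1`` point, the Python sketch below walks a JSON structure, collects every integer stored under a key whose name contains ``pageposfrom1``, and formats the result as a 1-based page range. The function names are hypothetical and the sample data is invented for the example rather than taken from real qpdf output.

```python
def collect_page_positions(node, found=None):
    """Gather every integer stored under a key containing 'pageposfrom1'."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if "pageposfrom1" in key:
                # Values are null or an integer; nulls are simply skipped.
                if isinstance(value, int) and not isinstance(value, bool):
                    found.add(value)
            else:
                collect_page_positions(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_page_positions(item, found)
    return found


def to_page_range(positions):
    """Format 1-based positions as a comma-separated range such as '1-3,7'."""
    runs = []
    for position in sorted(positions):
        if runs and position == runs[-1][1] + 1:
            runs[-1][1] = position     # extend the current consecutive run
        else:
            runs.append([position, position])
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)


# Invented sample data shaped loosely like outline and page entries.
sample = {
    "outlines": [
        {"title": "Intro", "destpageposfrom1": 1,
         "kids": [{"title": "A", "destpageposfrom1": 2, "kids": []}]},
        {"title": "Missing", "destpageposfrom1": None, "kids": []},
        {"title": "End", "destpageposfrom1": 7, "kids": []},
    ],
    "pages": [{"pageposfrom1": 3}],
}
page_range = to_page_range(collect_page_positions(sample))
```

Because the values are already 1-based, the resulting string can be used directly as a page range on the command line, for example ``qpdf in.pdf --pages in.pdf 1-3,7 -- out.pdf``, with no off-by-one adjustment.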

91
manual/library.rst Normal file

@ -0,0 +1,91 @@
.. _ref.using-library:
Using the QPDF Library
======================
.. _ref.using.from-cxx:
Using QPDF from C++
-------------------
The source tree for the qpdf package has an
:file:`examples` directory that contains a few
example programs. The :file:`qpdf/qpdf.cc` source
file also serves as a useful example since it exercises almost all of
the qpdf library's public interface. The best source of documentation on
the library itself is reading comments in
:file:`include/qpdf/QPDF.hh`,
:file:`include/qpdf/QPDFWriter.hh`, and
:file:`include/qpdf/QPDFObjectHandle.hh`.
All header files are installed in the
:file:`include/qpdf` directory. It is recommended that
you use ``#include <qpdf/QPDF.hh>`` rather than adding
:file:`include/qpdf` to your include path.
When linking against the qpdf static library, you may also need to
specify ``-lz -ljpeg`` on your link command. If your system understands
how to read libtool :file:`.la` files, this may not
be necessary.
The qpdf library is safe to use in a multithreaded program, but no
individual ``QPDF`` object instance (including ``QPDF``,
``QPDFObjectHandle``, or ``QPDFWriter``) can be used in more than one
thread at a time. Multiple threads may simultaneously work with
different instances of these and all other QPDF objects.
.. _ref.using.other-languages:
Using QPDF from other languages
-------------------------------
The qpdf library is implemented in C++, which makes it hard to use
directly in other languages. There are a few things that can help.
"C"
The qpdf library includes a "C" language interface that provides a
subset of the overall capabilities. The header file
:file:`qpdf/qpdf-c.h` includes information about
its use. As long as you use a C++ linker, you can link C programs
with qpdf and use the C API. For languages that can directly load
methods from a shared library, the C API can also be useful. People
have reported success using the C API from other languages on Windows
by directly calling functions in the DLL.
Python
A Python module called
`pikepdf <https://pypi.org/project/pikepdf/>`__ provides a clean and
highly functional set of Python bindings to the qpdf library. Using
pikepdf, you can work with PDF files in a natural way and combine
qpdf's capabilities with other functionality provided by Python's
rich standard library and available modules.
Other Languages
Starting with version 8.3.0, the :command:`qpdf`
command-line tool can produce a JSON representation of the PDF file's
non-content data. This can facilitate interacting programmatically
with PDF files through qpdf's command line interface. For more
information, please see :ref:`ref.json`.
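
A minimal sketch of this approach in Python, assuming that
:command:`qpdf` 8.3.0 or newer is on the search path, might look like
the following. The helper names are hypothetical. The exit status is
deliberately not treated as fatal because qpdf can exit with a nonzero
status when it only issues warnings.

.. code-block:: python

   import json
   import subprocess

   def qpdf_json_command(pdf_path):
       """Build the argument list that asks qpdf for its JSON representation."""
       return ["qpdf", "--json", pdf_path]

   def load_qpdf_json(pdf_path):
       """Run qpdf and parse what it writes to standard output as JSON."""
       result = subprocess.run(
           qpdf_json_command(pdf_path), capture_output=True, check=False
       )
       return json.loads(result.stdout)

Calling ``load_qpdf_json("input.pdf")`` would return an ordinary Python
dictionary that can be inspected with standard tools.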
.. _ref.unicode-files:
A Note About Unicode File Names
-------------------------------
When strings are passed to qpdf library routines either as ``char*`` or
as ``std::string``, they are treated as byte arrays except where
otherwise noted. When Unicode is desired, qpdf wants UTF-8 unless
otherwise noted in comments in header files. In modern UNIX/Linux
environments, this generally does the right thing. In Windows, it's a
bit more complicated. Starting in qpdf 8.4.0, passwords that contain
Unicode characters are handled much better, and starting in qpdf 8.4.1,
the library attempts to properly handle Unicode characters in filenames.
In particular, in Windows, if a UTF-8 encoded string is used as a
filename in either ``QPDF`` or ``QPDFWriter``, it is internally
converted to ``wchar_t*``, and Unicode-aware Windows APIs are used. As
such, qpdf will generally operate properly on files with non-ASCII
characters in their names as long as the filenames are UTF-8 encoded for
passing into the qpdf library API, but there are still some rough edges,
such as the encoding of the filenames in error messages or CLI output
messages. Patches or bug reports are welcome for any continuing issues
with Unicode file names in Windows.
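
To make the conversion concrete, this Python sketch shows the two byte
layouts involved for one sample filename: the UTF-8 bytes a caller would
pass to the library and the UTF-16 code units that a Windows
``wchar_t*`` string holds. It illustrates the encodings only and is not
the library's own conversion code.

.. code-block:: python

   name = "r\u00e9sum\u00e9.pdf"  # sample filename with two non-ASCII characters

   utf8_bytes = name.encode("utf-8")       # what is passed into the qpdf API
   utf16_units = name.encode("utf-16-le")  # layout of a Windows wchar_t string

   # Converting from the UTF-8 form yields the same wide-character data.
   round_trip = utf8_bytes.decode("utf-8").encode("utf-16-le")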

12
manual/license.rst Normal file

@ -0,0 +1,12 @@
.. _ref.license:
License
=======
QPDF is licensed under `the Apache License, Version 2.0
<http://www.apache.org/licenses/LICENSE-2.0>`__ (the "License").
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the License for the specific language governing
permissions and limitations under the License.

197
manual/linearization.rst Normal file

@ -0,0 +1,197 @@
.. _ref.linearization:
Linearization
=============
This chapter describes how ``QPDF`` and ``QPDFWriter`` implement
creation and processing of linearized PDFs.
.. _ref.linearization-strategy:
Basic Strategy for Linearization
--------------------------------
To avoid the incestuous problem of having the qpdf library validate its
own linearized files, we have a special linearized file checking mode
which can be invoked via :command:`qpdf
--check-linearization` (or :command:`qpdf
--check`). This mode reads the linearization parameter
dictionary and the hint streams and validates that object ordering,
parameters, and hint stream contents are correct. The validation code
was first tested against linearized files created by external tools
(Acrobat and pdlin) and then used to validate files created by
``QPDFWriter`` itself.
.. _ref.linearized.preparation:
Preparing For Linearization
---------------------------
Before creating a linearized PDF file from any other PDF file, the PDF
file must be altered such that all page attributes are propagated down
to the page level (and not inherited from parents in the ``/Pages``
tree). We also have to know which objects refer to which other objects,
being concerned with page boundaries and a few other cases. We refer to
this part of preparing the PDF file as
*optimization*, discussed in
:ref:`ref.optimization`. Note that, in this context, the
term *optimization* is a qpdf term, and the
term *linearization* is a term from the PDF
specification. Do not be confused by the fact that many applications
refer to linearization as optimization or web optimization.
When creating linearized PDF files from optimized PDF files, there are
really only a few issues that need to be dealt with:
- Creation of hints tables
- Placing objects in the correct order
- Filling in offsets and byte sizes
.. _ref.optimization:
Optimization
------------
In order to perform various operations such as linearization and
splitting files into pages, it is necessary to know which objects are
referenced by which pages, page thumbnails, and root and trailer
dictionary keys. It is also necessary to ensure that all page-level
attributes appear directly at the page level and are not inherited from
parents in the pages tree.
We refer to the process of enforcing these constraints as
*optimization*. As mentioned above, note
that some applications refer to linearization as optimization. Although
this optimization was initially motivated by the need to create
linearized files, we are using these terms separately.
PDF file optimization is implemented in the
:file:`QPDF_optimization.cc` source file. That file
is richly commented and serves as the primary reference for the
optimization process.
After optimization has been completed, the private member variables
``obj_user_to_objects`` and ``object_to_obj_users`` in ``QPDF`` have
been populated. Any object that has more than one value in the
``object_to_obj_users`` table is shared. Any object that has exactly one
value in the ``object_to_obj_users`` table is private. To find all the
private objects in a page or a trailer or root dictionary key, one
merely has to make this determination for each element in the
``obj_user_to_objects`` table for the given page or key.
Note that pages and thumbnails have different object user types, so the
above test on a page will not include objects referenced by the page's
thumbnail dictionary and nothing else.
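
The shared/private test can be sketched in Python with ordinary
dictionaries. The table contents and the string form of the object
users below are made up for illustration; the real tables are private
members of ``QPDF`` with their own types.

.. code-block:: python

   from collections import defaultdict

   # Hypothetical miniature table: object user -> object numbers it references.
   obj_user_to_objects = {
       "page:1": {4, 5, 9},
       "page:2": {6, 9},
       "root:/Outlines": {7},
   }

   # Invert it: object number -> set of users that reference the object.
   object_to_obj_users = defaultdict(set)
   for user, objects in obj_user_to_objects.items():
       for obj in objects:
           object_to_obj_users[obj].add(user)

   def private_objects(user):
       """Objects referenced by this user and by no other user."""
       return sorted(
           obj for obj in obj_user_to_objects[user]
           if len(object_to_obj_users[obj]) == 1
       )

   # An object with more than one user is shared.
   shared = sorted(
       obj for obj, users in object_to_obj_users.items() if len(users) > 1
   )

Here object 9 is shared because both pages use it, while objects 4 and 5
are private to the first page.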
.. _ref.linearization.writing:
Writing Linearized Files
------------------------
We will create files with only primary hint streams. We will never write
overflow hint streams. (As of PDF version 1.4, Acrobat doesn't either,
and they are never necessary.) The hint streams contain offset
information to objects that point to where they would be if the hint
stream were not present. This means that we have to calculate all object
positions before we can generate and write the hint table. This means
that we have to generate the file in two passes. To make this reliable,
``QPDFWriter`` in linearization mode invokes exactly the same code twice
to write the file to a pipeline.
In the first pass, the target pipeline is a count pipeline chained to a
discard pipeline. The count pipeline simply passes its data through to
the next pipeline in the chain but can return the number of bytes passed
through it at any intermediate point. The discard pipeline is an end of
line pipeline that just throws its data away. The hint stream is not
written and dummy values with adequate padding are stored in the first
cross reference table, linearization parameter dictionary, and /Prev key
of the first trailer dictionary. All the offset, length, object
renumbering information, and anything else we need for the second pass
is stored.
At the end of the first pass, this information is passed to the ``QPDF``
class which constructs a compressed hint stream in a memory buffer and
returns it. ``QPDFWriter`` uses this information to write a complete
hint stream object into a memory buffer. At this point, the length of
the hint stream is known.
In the second pass, the end of the pipeline chain is a regular file
instead of a discard pipeline, and we have known values for all the
offsets and lengths that we didn't have in the first pass. We have to
adjust offsets that appear after the start of the hint stream by the
length of the hint stream, which is known. Anything that is of variable
length is padded, with the padding code surrounding any writing code
that differs in the two passes. This ensures that changes to the way
things are represented never result in offsets that were gathered
during the first pass becoming incorrect for the second pass.
Using this strategy, we can write linearized files to a non-seekable
output stream with only a single pass to disk or wherever the output is
going.
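
The two-pass idea can be sketched in a few lines of Python. This is a
heavily simplified model, not the ``QPDFWriter`` code: the class names,
the fake file layout, and the fixed-width comment field standing in for
padded values are all assumptions made for the example.

.. code-block:: python

   import io

   class Discard:
       """End-of-line pipeline that throws its data away."""
       def write(self, data):
           pass

   class Count:
       """Pass data to the next pipeline and remember how many bytes went by."""
       def __init__(self, next_pipeline):
           self.next_pipeline = next_pipeline
           self.count = 0

       def write(self, data):
           self.count += len(data)
           self.next_pipeline.write(data)

   def write_file(pipe, body, hint=None, body_offset=0):
       """Identical code for both passes; only the known values differ."""
       pipe.write(b"%PDF-1.4\n")
       # Fixed-width field: a dummy 0 in pass 1, the real offset in pass 2.
       pipe.write(b"%% body at %010d\n" % body_offset)
       if hint is not None:
           pipe.write(hint)
       measured = pipe.count  # position at which the body starts
       pipe.write(body)
       return measured

   body = b"1 0 obj\n<< >>\nendobj\n"

   # Pass 1: count bytes, discard output, and leave the hint data out.
   counter = Count(Discard())
   offset_without_hint = write_file(counter, body)

   # Between passes: the hint data and therefore its length become known.
   hint = b"%% hint: body is %d bytes\n" % len(body)
   real_offset = offset_without_hint + len(hint)

   # Pass 2: same code, real values, real destination.
   output = io.BytesIO()
   final = Count(output)
   measured = write_file(final, body, hint, real_offset)

Because the offset field has the same width in both passes, the position
measured in the second pass equals the one predicted from the first pass
plus the length of the hint data.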
.. _ref.linearization-data:
Calculating Linearization Data
------------------------------
Once a file is optimized, we have information about which objects access
which other objects. We can then process these tables to decide which
part (as described in "Linearized PDF Document Structure" in the PDF
specification) each object is contained within. This tells us the exact
order in which objects are written. The ``QPDFWriter`` class asks for
this information and enqueues objects for writing in the proper order.
It also turns on a check that causes an exception to be thrown if an
object is encountered that has not already been queued. (This could
happen only if there were a bug in the traversal code used to calculate
the linearization data.)
.. _ref.linearization-issues:
Known Issues with Linearization
-------------------------------
There are a handful of known issues with this linearization code. These
issues do not appear to impact the behavior of linearized files which
still work as intended: it is possible for a web browser to begin to
display them before they are fully downloaded. In fact, it seems that
various other programs that create linearized files have many of these
same issues. These items make reference to terminology used in the
linearization appendix of the PDF specification.
- Thread Dictionary information keys appear in part 4 with the rest of
Threads instead of in part 9. Objects in part 9 are not grouped
together functionally.
- We are not calculating numerators for shared object positions within
content streams or interleaving them within content streams.
- We generate only page offset, shared object, and outline hint tables.
It would be relatively easy to add some additional tables. We gather
most of the information needed to create thumbnail hint tables. There
are comments in the code about this.
.. _ref.linearization-debugging:
Debugging Note
--------------
The :command:`qpdf --show-linearization` command can show
the complete contents of linearization hint streams. To look at the raw
data, you can extract the filtered contents of the linearization hint
tables using :command:`qpdf --show-object=n
--filtered-stream-data`. Then, to convert this into a bit
stream (since linearization tables are bit streams written without
regard to byte boundaries), you can pipe the resulting data through the
following perl code:
.. code-block:: perl
use bytes;
binmode STDIN;
undef $/;
my $a = <STDIN>;
my @ch = split(//, $a);
map { printf("%08b", ord($_)) } @ch;
print "\n";
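
If you would rather pull individual fields out of the data than read a
long string of bits, a small bit reader does the job. The following
Python sketch reads consecutive fields of the given bit widths, most
significant bit first, which is the same order in which the Perl code
above prints the bits. The field widths in the usage example are
arbitrary and do not correspond to any particular hint table.

.. code-block:: python

   def read_bit_fields(data, widths):
       """Read consecutive bit fields of the given widths from a bytes object."""
       bits = "".join(format(byte, "08b") for byte in data)
       values = []
       pos = 0
       for width in widths:
           # A zero-width field occupies no bits; treat it as 0.
           values.append(int(bits[pos:pos + width], 2) if width else 0)
           pos += width
       return values

   # Two bytes, read as a 3-bit, a 5-bit, and a 2-bit field.
   fields = read_bit_fields(bytes([0b10110011, 0b01000000]), [3, 5, 2])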

186
manual/object-streams.rst Normal file

@ -0,0 +1,186 @@
.. _ref.object-and-xref-streams:
Object and Cross-Reference Streams
==================================
This chapter provides information about the implementation of object
stream and cross-reference stream support in qpdf.
.. _ref.object-streams:
Object Streams
--------------
Object streams can contain any regular object except the following:
- stream objects
- objects with generation > 0
- the encryption dictionary
- objects containing the /Length of another stream
In addition, Adobe Reader (at least as of version 8.0.0) appears to not
be able to handle having the document catalog appear in an object stream
if the file is encrypted, though this is not specifically disallowed by
the specification.
There are additional restrictions for linearized files. See
:ref:`ref.object-streams-linearization` for details.
The PDF specification refers to objects in object streams as "compressed
objects" regardless of whether the object stream is compressed.
The generation number of every object in an object stream must be zero.
It is possible to delete and replace an object in an object stream with
a regular object.
The object stream dictionary has the following keys:
- ``/N``: number of objects
- ``/First``: byte offset of first object
- ``/Extends``: indirect reference to stream that this extends
Stream collections are formed with ``/Extends``. They must form a
directed acyclic graph. These can be used for semantic information and
are not meaningful to the PDF document's syntactic structure. Although
qpdf preserves stream collections, it never generates them and doesn't
make use of this information in any way.
The specification recommends limiting the number of objects in an object
stream for efficiency in reading and decoding. Acrobat 6 uses no more
than 100 objects per object stream for linearized files and no more than
200 objects per stream for non-linearized files. ``QPDFWriter``, in object
stream generation mode, never puts more than 100 objects in an object
stream.
Object stream contents consist of *N* pairs of integers, each of which
is the object number and the byte offset of the object relative to the
first object in the stream, followed by the objects themselves,
concatenated.
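
The layout just described can be decoded with a short Python sketch. It
assumes that the stream data has already been decompressed, that ``n``
and ``first`` come from ``/N`` and ``/First`` in the stream dictionary,
and that the offsets appear in increasing order. The function name is
hypothetical.

.. code-block:: python

   def split_object_stream(data, n, first):
       """Return {object number: raw object bytes} for decoded stream data."""
       header = data[:first].split()
       pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
       objects = {}
       for i, (number, offset) in enumerate(pairs):
           start = first + offset
           # Each object runs up to the start of the next one, or to the end.
           end = first + pairs[i + 1][1] if i + 1 < n else len(data)
           objects[number] = data[start:end].strip()
       return objects

   # Tiny hand-made example: object 11 at offset 0, object 12 at offset 11.
   header = b"11 0 12 11\n"
   body = b"<< /A 1 >>\n[1 2 3]"
   objects = split_object_stream(header + body, n=2, first=len(header))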
.. _ref.xref-streams:
Cross-Reference Streams
-----------------------
For non-hybrid files, the value following ``startxref`` is the byte
offset to the xref stream rather than the word ``xref``.
For hybrid files (files containing both xref tables and cross-reference
streams), the xref table's trailer dictionary contains the key
``/XRefStm`` whose value is the byte offset to a cross-reference stream
that supplements the xref table. A PDF 1.5-compliant application should
read the xref table first. Then it should replace any object that it has
already seen with any defined in the xref stream. Then it should follow
any ``/Prev`` pointer in the original xref table's trailer dictionary.
The specification is not clear about what should be done, if anything,
with a ``/Prev`` pointer in the xref stream referenced by an xref table.
The ``QPDF`` class ignores it, which is probably reasonable since, if
this case were to appear for any sensible PDF file, the previous xref
table would probably have a corresponding ``/XRefStm`` pointer of its
own. For example, if a hybrid file were appended, the appended section
would have its own xref table and ``/XRefStm``. The appended xref table
would point to the previous xref table, which would point to the
``/XRefStm``, meaning that the new ``/XRefStm`` doesn't have to point to
it.
Since xref streams must be read very early, they may not be encrypted,
and they may not contain indirect objects for keys required to read them,
which are these:
- ``/Type``: value ``/XRef``
- ``/Size``: value *n+1*: where *n* is highest object number (same as
``/Size`` in the trailer dictionary)
- ``/Index`` (optional): value
``[:samp:`{n count}` ...]`` used to determine
which objects' information is stored in this stream. The default is
``[0 /Size]``.
- ``/Prev``: value :samp:`{offset}`: byte
offset of previous xref stream (same as ``/Prev`` in the trailer
dictionary)
- ``/W [...]``: sizes of each field in the xref table
The other fields in the xref stream, which may be indirect if desired,
are the union of those from the xref table's trailer dictionary.
.. _ref.xref-stream-data:
Cross-Reference Stream Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The stream data is binary and encoded in big-endian byte order. Entries
are concatenated, and each entry has a length equal to the total of the
entries in ``/W`` above. Each entry consists of one or more fields, the
first of which is the type of the field. The number of bytes for each
field is given by ``/W`` above. A 0 in ``/W`` indicates that the field
is omitted and has the default value. The default value for the field
type is "``1``". All other default values are "``0``".
PDF 1.5 has three field types:
- 0: for free objects. Format: ``0 obj next-generation``, same as the
free table in a traditional cross-reference table
- 1: regular non-compressed object. Format: ``1 offset generation``
- 2: for objects in object streams. Format: ``2 object-stream-number
index``, the number of object stream containing the object and the
index within the object stream of the object.
It seems standard to have the first entry in the table be ``0 0 0``
instead of ``0 0 ffff`` if there are no deleted objects.
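
The decoding rules above translate into a short Python sketch. It
assumes the stream data has already been decompressed and that ``w`` and
``index`` hold the values of ``/W`` and ``/Index``; the function name
and the sample bytes are made up for the example.

.. code-block:: python

   def decode_xref_stream(data, w, index):
       """Decode xref stream entries into {object number: (type, f2, f3)}."""
       defaults = (1, 0, 0)  # field type defaults to 1, everything else to 0
       entries = {}
       pos = 0
       # /Index is a flat list of (first object number, count) pairs.
       for first, count in zip(index[0::2], index[1::2]):
           for obj in range(first, first + count):
               fields = []
               for width, default in zip(w, defaults):
                   if width == 0:
                       fields.append(default)  # omitted field
                   else:
                       chunk = data[pos:pos + width]
                       fields.append(int.from_bytes(chunk, "big"))
                   pos += width
               entries[obj] = tuple(fields)
       return entries

   # Three entries with /W [1 2 1] and /Index [0 3].
   sample = bytes.fromhex("00000000" "01001100" "02000501")
   entries = decode_xref_stream(sample, w=[1, 2, 1], index=[0, 3])

In this sample, object 0 is free, object 1 is a regular object at byte
offset 17, and object 2 is stored at index 1 of object stream 5.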
.. _ref.object-streams-linearization:
Implications for Linearized Files
---------------------------------
For linearized files, the linearization dictionary, document catalog,
and page objects may not be contained in object streams.
Objects stored within object streams are given the highest range of
object numbers within the main and first-page cross-reference sections.
It is okay to use cross-reference streams in place of regular xref
tables. There are no special considerations.
Hint data refers to object streams themselves, not the objects in the
streams. Shared object references should also be made to the object
streams. There are no references in any hint tables to the object numbers
of compressed objects (objects within object streams).
When numbering objects, all shared objects within both the first and
second halves of the linearized files must be numbered consecutively
after all normal uncompressed objects in that half.
.. _ref.object-stream-implementation:
Implementation Notes
--------------------
There are three modes for writing object streams:
:samp:`disable`, :samp:`preserve`, and
:samp:`generate`. In disable mode, we do not generate
any object streams, and we also generate an xref table rather than xref
streams. This can be used to generate PDF files that are viewable with
older readers. In preserve mode, we write object streams such that
written object streams contain the same objects and ``/Extends``
relationships as in the original file. This is equal to disable if the
file has no object streams. In generate, we create object streams
ourselves by grouping objects that are allowed in object streams
together in sets of no more than 100 objects. We also ensure that the
PDF version is at least 1.5 in generate mode, but we preserve the
version header in the other modes. The default is
:samp:`preserve`.
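
The grouping rule for generate mode can be modeled as follows. This is
only a sketch of the "no more than 100 objects" constraint; the
eligibility test is a stand-in supplied by the caller, and the way
``QPDFWriter`` actually distributes objects among streams may differ.

.. code-block:: python

   def group_for_object_streams(object_numbers, eligible, max_per_stream=100):
       """Split eligible objects into groups of at most max_per_stream."""
       candidates = [n for n in object_numbers if eligible(n)]
       return [
           candidates[i:i + max_per_stream]
           for i in range(0, len(candidates), max_per_stream)
       ]

   # Pretend every 50th object is a stream and therefore not eligible.
   groups = group_for_object_streams(range(1, 251), lambda n: n % 50 != 0)
   sizes = [len(group) for group in groups]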
We do not support creation of hybrid files. When we write files, even in
preserve mode, we will lose any xref tables and merge any appended
sections.

33
manual/overview.rst Normal file

@ -0,0 +1,33 @@
.. _ref.overview:
What is QPDF?
=============
QPDF is a program and C++ library for structural, content-preserving
transformations on PDF files. QPDF's website is located at
https://qpdf.sourceforge.io/. QPDF's source code is hosted on GitHub
at https://github.com/qpdf/qpdf.
QPDF provides many useful capabilities to developers of PDF-producing
software or for people who just want to look at the innards of a PDF
file to learn more about how they work. With QPDF, it is possible to
copy objects from one PDF file into another and to manipulate the list
of pages in a PDF file. This makes it possible to merge and split PDF
files. The QPDF library also makes it possible for you to create PDF
files from scratch. In this mode, you are responsible for supplying
all the contents of the file, while the QPDF library takes care of all
the syntactical representation of the objects, creation of
cross-reference tables and, if you use them, object streams, encryption,
linearization, and other syntactic details. You are still responsible
for generating PDF content on your own.
QPDF has been designed with very few external dependencies, and it is
intentionally very lightweight. QPDF is *not* a PDF content creation
library, a PDF viewer, or a program capable of converting PDF into other
formats. In particular, QPDF knows nothing about the semantics of PDF
content streams. If you are looking for something that can do that, you
should look elsewhere. However, once you have a valid PDF file, QPDF can
be used to transform that file in ways that perhaps your original PDF
creation tool can't handle. For example, many programs generate simple PDF
files but can't password-protect them, web-optimize them, or perform
other transformations of that type.

96
manual/qdf.rst Normal file

@ -0,0 +1,96 @@
.. _ref.qdf:
QDF Mode
========
In QDF mode, qpdf creates PDF files in what we call *QDF
form*. A PDF file in QDF form, sometimes called a QDF
file, is a completely valid PDF file that has ``%QDF-1.0`` as its third
line (after the pdf header and binary characters) and has certain other
characteristics. The purpose of QDF form is to make it possible to edit
PDF files, with some restrictions, in an ordinary text editor. This can
be very useful for experimenting with different PDF constructs or for
making one-off edits to PDF files (though there are other reasons why
this may not always work). Note that QDF mode does not support
linearized files. If you enable linearization, QDF mode is automatically
disabled.
It is ordinarily very difficult to edit PDF files in a text editor for
two reasons: most meaningful data in PDF files is compressed, and PDF
files are full of offset and length information that makes it hard to
add or remove data. A QDF file is organized in a manner such that, if
edits are kept within certain constraints, the
:command:`fix-qdf` program, distributed with qpdf, is
able to restore edited files to a correct state. The
:command:`fix-qdf` program takes no command-line
arguments. It reads a possibly edited QDF file from standard input and
writes a repaired file to standard output.
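
Since :command:`fix-qdf` is a plain filter, it is easy to drive from a
script. The following Python sketch assumes that :command:`fix-qdf` is
on the search path; the helper names are hypothetical.

.. code-block:: python

   import subprocess

   def run_filter(command, data):
       """Feed bytes to a program's standard input and return its output."""
       completed = subprocess.run(
           command, input=data, capture_output=True, check=True
       )
       return completed.stdout

   def fix_qdf(edited_qdf):
       """Repair an edited QDF file held in memory."""
       return run_filter(["fix-qdf"], edited_qdf)

A typical use would read the edited file in binary mode, pass its
contents to ``fix_qdf``, and write the returned bytes to a new file.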
The following attributes characterize a QDF file:
- All objects appear in numerical order in the PDF file, including when
objects appear in object streams.
- Objects are printed in an easy-to-read format, and all line endings
are normalized to UNIX line endings.
- Unless specifically overridden, streams appear uncompressed (when
qpdf supports the filters and they are compressed with a non-lossy
compression scheme), and most content streams are normalized (line
endings are converted to just UNIX-style linefeeds).
- All stream lengths are represented as indirect objects, and the
stream length object is always the next object after the stream. If
the stream data does not end with a newline, an extra newline is
inserted, and a special comment appears after the stream indicating
that this has been done.
- If the PDF file contains object streams, if object stream *n*
contains *k* objects, those objects are numbered from *n+1* through
*n+k*, and the object number/offset pairs appear on a separate line
for each object. Additionally, each object in the object stream is
preceded by a comment indicating its object number and index. This
makes it very easy to find objects in object streams.
- All beginnings of objects, ``stream`` tokens, ``endstream`` tokens,
and ``endobj`` tokens appear on lines by themselves. A blank line
follows every ``endobj`` token.
- If there is a cross-reference stream, it is unfiltered.
- Page dictionaries and page content streams are marked with special
comments that make them easy to find.
- Comments precede each object indicating the object number of the
corresponding object in the original file.
When editing a QDF file, any edits can be made as long as the above
constraints are maintained. This means that you can freely edit a page's
content without worrying about messing up the QDF file. It is also
possible to add new objects so long as those objects are added after the
last object in the file or subsequent objects are renumbered. If a QDF
file has object streams in it, you can always add the new objects before
the xref stream and then change the number of the xref stream, since
nothing generally ever references it by number.
It is not generally practical to remove objects from QDF files without
messing up object numbering, but if you remove all references to an
object, you can run qpdf on the file (after running
:command:`fix-qdf`), and qpdf will omit the now-orphaned
object.
When :command:`fix-qdf` is run, it goes through the file
and recomputes the following parts of the file:
- the ``/N``, ``/W``, and ``/First`` keys of all object stream
dictionaries
- the pairs of numbers representing object numbers and offsets of
objects in object streams
- all stream lengths
- the cross-reference table or cross-reference stream
- the offset to the cross-reference table or cross-reference stream
following the ``startxref`` token

2643
manual/release-notes.rst Normal file

File diff suppressed because it is too large Load Diff

33
manual/weak-crypto.rst Normal file

@ -0,0 +1,33 @@
.. _ref.weak-crypto:
Weak Cryptography
=================
Starting with version 10.4, qpdf is taking steps to reduce the likelihood
of a user *accidentally* creating PDF files with insecure cryptography
but will continue to allow creation of such files indefinitely with
explicit acknowledgment.
The PDF file format makes use of RC4, which is known to be a weak
cryptography algorithm, and MD5, which is a weak hashing algorithm. In
version 10.4, qpdf generates warnings for some (but not all) cases of
writing files with weak cryptography when invoked from the command-line.
These warnings can be suppressed using the
:samp:`--allow-weak-crypto` option.
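
For scripts that intentionally write such files, the acknowledgment can
be built into the command. The Python sketch below assembles an argument
list for 40-bit encryption, which uses RC4; the function name and the
choice of key length are only for illustration.

.. code-block:: python

   def weak_encrypt_command(infile, outfile, user_password, owner_password):
       """Argument list for writing a 40-bit encrypted file on purpose."""
       return [
           "qpdf",
           "--allow-weak-crypto",  # explicit acknowledgment of weak crypto
           "--encrypt", user_password, owner_password, "40", "--",
           infile, outfile,
       ]

   command = weak_encrypt_command("in.pdf", "out.pdf", "user", "owner")

The resulting list can be handed to ``subprocess.run`` or joined for
display.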
It is planned for qpdf version 11 to be stricter, making it an error to
write files with insecure cryptography from the command-line tool in
most cases without specifying the
:samp:`--allow-weak-crypto` flag and also to require
explicit steps when using the C++ library to enable use of insecure
cryptography.
Note that qpdf must always retain support for weak cryptographic
algorithms since this is required for reading older PDF files that use
them. Additionally, qpdf will always retain the ability to create files
using weak cryptographic algorithms since, as a development tool, qpdf
explicitly supports creating older or deprecated types of PDF files
since these are sometimes needed to test or work with older versions of
software. Even if other cryptography libraries drop support for RC4 or
MD5, qpdf can always fall back to its internal implementations of those
algorithms, so they are not going to disappear from qpdf.