mirror of https://github.com/qpdf/qpdf.git
Split documentation into multiple pages, change theme
This commit is contained in:
parent
f3d1138b8a
commit
10fb619d3e
2
TODO
|
@ -30,8 +30,6 @@ Before release:
|
|||
I can do about, and it doesn't seem worth fixing. Maybe mention it
|
||||
somewhere?
|
||||
* README-maintainer: Fix installation of documentation to website
|
||||
* Get navigation working properly
|
||||
* Figure out where to put :ref:`search` so we get doc search
|
||||
|
||||
Soon:
|
||||
|
||||
|
|
|
@ -0,0 +1,14 @@
|
|||
.. _acknowledgments:
|
||||
|
||||
Acknowledgment
|
||||
==============
|
||||
|
||||
QPDF was originally created in 2001 and modified periodically between
|
||||
2001 and 2005 during my employment at `Apex CoVantage
|
||||
<http://www.apexcovantage.com>`__. Upon my departure from Apex, the
|
||||
company graciously allowed me to take ownership of the software and
|
||||
continue maintaining it as an open source project, a decision for which I
|
||||
am very grateful. I have made considerable enhancements to it since
|
||||
that time. I feel fortunate to have worked for people who would make
|
||||
such a decision. This work would not have been possible without their
|
||||
support.
|
File diff suppressed because it is too large
|
@ -11,4 +11,7 @@ project = 'QPDF'
|
|||
copyright = '2005-2021, Jay Berkenbilt'
|
||||
author = 'Jay Berkenbilt'
|
||||
release = '10.4.0'
|
||||
html_theme = 'alabaster'
|
||||
html_theme = 'agogo'
|
||||
html_theme_options = {
|
||||
"body_max_width": None,
|
||||
}
|
||||
|
|
|
@ -0,0 +1,747 @@
|
|||
.. _ref.design:
|
||||
|
||||
Design and Library Notes
|
||||
========================
|
||||
|
||||
.. _ref.design.intro:
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
This section was written prior to the implementation of the qpdf package
|
||||
and was subsequently modified to reflect the implementation. In some
|
||||
cases, for purposes of explanation, it may differ slightly from the
|
||||
actual implementation. As always, the source code and test suite are
|
||||
authoritative. Even if there are some errors, this document should serve
|
||||
as a road map to understanding how this code works.
|
||||
|
||||
In general, one should adhere strictly to a specification when writing
|
||||
but be liberal in reading. This way, the product of our software will be
|
||||
accepted by the widest range of other programs, and we will accept the
|
||||
widest range of input files. This library attempts to conform to that
|
||||
philosophy whenever possible but also aims to provide strict checking
|
||||
for people who want to validate PDF files. If you don't want to see
|
||||
warnings and are trying to write something that is tolerant, you can
|
||||
call ``setSuppressWarnings(true)``. If you want to fail on the first
|
||||
error, you can call ``setAttemptRecovery(false)``. The default behavior
|
||||
is to generate warnings for recoverable problems. Note that recovery
|
||||
will not always produce the desired results even if it is able to get
|
||||
through the file. Unlike most other PDF tools, which produce generic
|
||||
warnings such as "This file is damaged," qpdf generally issues a
|
||||
detailed error message that would be most useful to a PDF developer.
|
||||
This is by design as there seems to be a shortage of PDF validation
|
||||
tools out there. This was, in fact, one of the major motivations behind
|
||||
the initial creation of qpdf.
|
||||
|
||||
.. _ref.design-goals:
|
||||
|
||||
Design Goals
|
||||
------------
|
||||
|
||||
The QPDF package includes support for reading and rewriting PDF files.
|
||||
It aims to hide from the user details involving object locations,
|
||||
modified (appended) PDF files, the directness/indirectness of objects,
|
||||
and stream filters including encryption. It does not aim to hide
|
||||
knowledge of the object hierarchy or content stream contents. Put
|
||||
another way, a user of the qpdf library is expected to have knowledge
|
||||
about how PDF files work, but is not expected to have to keep track of
|
||||
bookkeeping details such as file positions.
|
||||
|
||||
A user of the library never has to care whether an object is direct or
|
||||
indirect, though it is possible to determine whether an object is direct
|
||||
or not if this information is needed. All access to objects deals with
|
||||
this transparently. All memory management details are also handled by
|
||||
the library.
|
||||
|
||||
The ``PointerHolder`` object is used internally by the library to deal
|
||||
with memory management. This is basically a smart pointer object very
|
||||
similar in spirit to C++11's ``std::shared_ptr`` object, but predating
|
||||
it by several years. This library also makes use of a technique for
|
||||
giving fine-grained access to methods in one class to other classes by
|
||||
using public nested classes with friends and only private members that in
|
||||
turn call private methods of the containing class. See
|
||||
``QPDFObjectHandle::Factory`` as an example.
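
The access-control technique described above can be sketched in isolation.
This is a minimal illustration, not qpdf's actual code: the class names
``Handle`` and ``Document`` and the method ``makeIndirect`` are hypothetical
stand-ins. The nested ``Factory`` class is public, but all of its members are
private and only ``Document`` is its friend, so only ``Document`` can create
handles this way.

```cpp
#include <cassert>

class Document; // the only class allowed to mint handles (hypothetical)

class Handle
{
  public:
    // Public nested class whose members are all private. Because only
    // Document is a friend, only Document can call makeIndirect(), even
    // though the Factory type itself is visible to everyone.
    class Factory
    {
        friend class Document;

      private:
        static Handle
        makeIndirect(int objid)
        {
            // A nested class may use private members of its enclosing
            // class, so it can reach the private constructor.
            return Handle(objid);
        }
    };

    int
    getObjectID() const
    {
        return objid;
    }

  private:
    explicit Handle(int objid) :
        objid(objid)
    {
        assert(objid > 0); // object IDs are positive in this sketch
    }

    int objid;
};

class Document
{
  public:
    Handle
    makeHandle(int objid)
    {
        return Handle::Factory::makeIndirect(objid);
    }
};
```

Any other class that tries to call ``Handle::Factory::makeIndirect`` fails to
compile, which gives per-class access to one method without making the whole
of ``Document`` a friend of ``Handle``.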
|
||||
|
||||
The top-level qpdf class is ``QPDF``. A ``QPDF`` object represents a PDF
|
||||
file. The library provides methods for both accessing and mutating PDF
|
||||
files.
|
||||
|
||||
The primary class for interacting with PDF objects is
|
||||
``QPDFObjectHandle``. Instances of this class can be passed around by
|
||||
value, copied, stored in containers, etc. with very low overhead.
|
||||
Instances of ``QPDFObjectHandle`` created by reading from a file will
|
||||
always contain a reference back to the ``QPDF`` object from which they
|
||||
were created. A ``QPDFObjectHandle`` may be direct or indirect. If
|
||||
indirect, the ``QPDFObject`` the ``PointerHolder`` initially points to
|
||||
is a null pointer. In this case, the first attempt to access the
|
||||
underlying ``QPDFObject`` will result in the ``QPDFObject`` being
|
||||
resolved via a call to the referenced ``QPDF`` instance. This makes it
|
||||
essentially impossible to make coding errors in which certain things
|
||||
will work for some PDF files and not for others based on which objects
|
||||
are direct and which objects are indirect.
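
The lazy resolution described above can be modeled with a small, self-contained
sketch. It is a simplification under stated assumptions rather than qpdf's
implementation: ``Object``, ``Handle``, and ``Document`` are hypothetical
stand-ins for ``QPDFObject``, ``QPDFObjectHandle``, and ``QPDF``;
``std::shared_ptr`` plays the role of ``PointerHolder``; and the "file" is just
a map from object ID to text.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Stand-in for QPDFObject: here every object is just a string value.
struct Object
{
    std::string value;
};

class Handle;

// Stand-in for QPDF: owns the "file", modeled as a map from object ID
// to the text of the object stored there.
class Document
{
  public:
    std::map<int, std::string> file;
    int reads = 0; // how many times an object was loaded from the "file"

    Handle getObject(int objid);
    std::shared_ptr<Object> resolve(int objid);
};

class Handle
{
  public:
    // Direct object: the value is present from the start.
    explicit Handle(std::string const& value) :
        obj(std::make_shared<Object>(Object{value}))
    {
    }

    // Indirect object: starts with a null pointer and remembers its owner.
    Handle(Document* owner, int objid) :
        owner(owner),
        objid(objid)
    {
    }

    bool
    isIndirect() const
    {
        return objid != 0;
    }

    // Callers use the same accessor for direct and indirect handles;
    // resolution happens on first use.
    std::string const&
    getValue()
    {
        if (!obj) {
            assert(owner != nullptr);
            obj = owner->resolve(objid);
        }
        return obj->value;
    }

  private:
    std::shared_ptr<Object> obj;
    Document* owner = nullptr;
    int objid = 0;
};

inline Handle
Document::getObject(int objid)
{
    return Handle(this, objid);
}

inline std::shared_ptr<Object>
Document::resolve(int objid)
{
    ++reads;
    return std::make_shared<Object>(Object{file.at(objid)});
}
```

Code that calls ``getValue()`` never branches on whether the handle is direct
or indirect, which is the property the paragraph above relies on.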
|
||||
|
||||
Instances of ``QPDFObjectHandle`` can be directly created and modified
|
||||
using static factory methods in the ``QPDFObjectHandle`` class. There
|
||||
are factory methods for each type of object as well as a convenience
|
||||
method ``QPDFObjectHandle::parse`` that creates an object from a string
|
||||
representation of the object. Existing instances of ``QPDFObjectHandle``
|
||||
can also be modified in several ways. See comments in
|
||||
:file:`QPDFObjectHandle.hh` for details.
|
||||
|
||||
An instance of ``QPDF`` is constructed by using the class's default
|
||||
constructor. If desired, the ``QPDF`` object may be configured with
|
||||
various methods that change its default behavior. Then the
|
||||
``QPDF::processFile()`` method is passed the name of a PDF file, which
|
||||
permanently associates the file with that QPDF object. A password may
|
||||
also be given for access to password-protected files. QPDF does not
|
||||
enforce encryption parameters and will treat user and owner passwords
|
||||
equivalently. Either password may be used to access an encrypted file.
|
||||
``QPDF`` will allow recovery of a user password given an owner password.
|
||||
The input PDF file must be seekable. (Output files written by
|
||||
``QPDFWriter`` need not be seekable, even when creating linearized
|
||||
files.) During construction, ``QPDF`` validates the PDF file's header,
|
||||
and then reads the cross reference tables and trailer dictionaries. The
|
||||
``QPDF`` class keeps only the first trailer dictionary though it does
|
||||
read all of them so it can check the ``/Prev`` key. ``QPDF`` class users
|
||||
may request the root object and the trailer dictionary specifically. The
|
||||
cross reference table is kept private. Objects may then be requested by
|
||||
number or by walking the object tree.
|
||||
|
||||
When a PDF file has a cross-reference stream instead of a
|
||||
cross-reference table and trailer, requesting the document's trailer
|
||||
dictionary returns the stream dictionary from the cross-reference stream
|
||||
instead.
|
||||
|
||||
There are some convenience routines for very common operations such as
|
||||
walking the page tree and returning a vector of all page objects. For
|
||||
full details, please see the header files
|
||||
:file:`QPDF.hh` and
|
||||
:file:`QPDFObjectHandle.hh`. There are also some
|
||||
additional helper classes that provide higher level API functions for
|
||||
certain document constructions. These are discussed in :ref:`ref.helper-classes`.
|
||||
|
||||
.. _ref.helper-classes:
|
||||
|
||||
Helper Classes
|
||||
--------------
|
||||
|
||||
QPDF version 8.1 introduced the concept of helper classes. Helper
|
||||
classes are intended to contain higher level APIs that allow developers
|
||||
to work with certain document constructs at an abstraction level above
|
||||
that of ``QPDFObjectHandle`` while staying true to qpdf's philosophy of
|
||||
not hiding document structure from the developer. As with qpdf in
|
||||
general, the goal is to take away some of the more tedious bookkeeping
|
||||
aspects of working with PDF files, not to remove the need for the
|
||||
developer to understand how the PDF construction in question works. The
|
||||
driving factor behind the creation of helper classes was to allow the
|
||||
evolution of higher level interfaces in qpdf without polluting the
|
||||
interfaces of the main top-level classes ``QPDF`` and
|
||||
``QPDFObjectHandle``.
|
||||
|
||||
There are two kinds of helper classes: *document* helpers and *object*
|
||||
helpers. Document helpers are constructed with a reference to a ``QPDF``
|
||||
object and provide methods for working with structures that are at the
|
||||
document level. Object helpers are constructed with an instance of a
|
||||
``QPDFObjectHandle`` and provide methods for working with specific types
|
||||
of objects.
|
||||
|
||||
Examples of document helpers include ``QPDFPageDocumentHelper``, which
|
||||
contains methods for operating on the document's page trees, such as
|
||||
enumerating all pages of a document and adding and removing pages; and
|
||||
``QPDFAcroFormDocumentHelper``, which contains document-level methods
|
||||
related to interactive forms, such as enumerating form fields and
|
||||
creating mappings between form fields and annotations.
|
||||
|
||||
Examples of object helpers include ``QPDFPageObjectHelper`` for
|
||||
performing operations on pages such as page rotation and some operations
|
||||
on content streams, ``QPDFFormFieldObjectHelper`` for performing
|
||||
operations related to interactive form fields, and
|
||||
``QPDFAnnotationObjectHelper`` for working with annotations.
|
||||
|
||||
It is always possible to retrieve the underlying ``QPDF`` reference from
|
||||
a document helper and the underlying ``QPDFObjectHandle`` reference from
|
||||
an object helper. Helpers are designed to be helpers, not wrappers. The
|
||||
intention is that, in general, it is safe to freely intermix operations
|
||||
that use helpers with operations that use the underlying objects.
|
||||
Document and object helpers do not attempt to provide a complete
|
||||
interface for working with the things they are helping with, nor do they
|
||||
attempt to encapsulate underlying structures. They just provide a few
|
||||
methods to help with error-prone, repetitive, or complex tasks. In some
|
||||
cases, a helper object may cache some information that is expensive to
|
||||
gather. In such cases, the helper classes are implemented so that their
|
||||
own methods keep the cache consistent, and the header file will provide
|
||||
a method to invalidate the cache and a description of what kinds of
|
||||
operations would make the cache invalid. If in doubt, you can always
|
||||
discard a helper class and create a new one with the same underlying
|
||||
objects, which will ensure that you have discarded any stale
|
||||
information.
|
||||
|
||||
By convention, document helpers are called
|
||||
``QPDFSomethingDocumentHelper`` and are derived from
|
||||
``QPDFDocumentHelper``, and object helpers are called
|
||||
``QPDFSomethingObjectHelper`` and are derived from ``QPDFObjectHelper``.
|
||||
For details on specific helpers, please see their header files. You can
|
||||
find them by looking at
|
||||
:file:`include/qpdf/QPDF*DocumentHelper.hh` and
|
||||
:file:`include/qpdf/QPDF*ObjectHelper.hh`.
|
||||
|
||||
In order to avoid creation of circular dependencies, the following
|
||||
general guidelines are followed with helper classes:
|
||||
|
||||
- Core class interfaces do not know about helper classes. For example,
|
||||
no methods of ``QPDF`` or ``QPDFObjectHandle`` will include helper
|
||||
classes in their interfaces.
|
||||
|
||||
- Interfaces of object helpers will usually not use document helpers in
|
||||
their interfaces. This is because it is much more useful for document
|
||||
helpers to have methods that return object helpers. Most operations
|
||||
in PDF files start at the document level and go from there to the
|
||||
object level rather than the other way around. It can sometimes be
|
||||
useful to map back from object-level structures to document-level
|
||||
structures. If there is a desire to do this, it will generally be
|
||||
provided by a method in the document helper class.
|
||||
|
||||
- Most of the time, object helpers don't know about other object
|
||||
helpers. However, in some cases, one type of object may be a
|
||||
container for another type of object, in which case it may make sense
|
||||
for the outer object to know about the inner object. For example,
|
||||
there are methods in ``QPDFPageObjectHelper`` that know about
|
||||
``QPDFAnnotationObjectHelper`` because references to annotations are
|
||||
contained in page dictionaries.
|
||||
|
||||
- Any helper or core library class may use helpers in their
|
||||
implementations.
|
||||
|
||||
Prior to qpdf version 8.1, higher level interfaces were added as
|
||||
"convenience functions" in either ``QPDF`` or ``QPDFObjectHandle``. For
|
||||
compatibility, older convenience functions for operating with pages will
|
||||
remain in those classes even as alternatives are provided in helper
|
||||
classes. Going forward, new higher level interfaces will be provided
|
||||
using helper classes.
|
||||
|
||||
.. _ref.implementation-notes:
|
||||
|
||||
Implementation Notes
|
||||
--------------------
|
||||
|
||||
This section contains a few notes about QPDF's internal implementation,
|
||||
particularly around what it does when it first processes a file. This
|
||||
section is a bit of a simplification of what it actually does, but it
|
||||
could serve as a starting point to someone trying to understand the
|
||||
implementation. There is nothing in this section that you need to know
|
||||
to use the qpdf library.
|
||||
|
||||
``QPDFObject`` is the basic PDF Object class. It is an abstract base
|
||||
class from which are derived classes for each type of PDF object.
|
||||
Clients do not interact with Objects directly but instead interact with
|
||||
``QPDFObjectHandle``.
|
||||
|
||||
When the ``QPDF`` class creates a new object, it dynamically allocates
|
||||
the appropriate type of ``QPDFObject`` and immediately hands the pointer
|
||||
to an instance of ``QPDFObjectHandle``. The parser reads a token from
|
||||
the current file position. If the token is not either a dictionary or
|
||||
array opener, an object is immediately constructed from the single token
|
||||
and the parser returns. Otherwise, the parser iterates in a special mode
|
||||
in which it accumulates objects until it finds a balancing closer.
|
||||
During this process, the "``R``" keyword is recognized and an indirect
|
||||
``QPDFObjectHandle`` may be constructed.
|
||||
|
||||
The ``QPDF::resolve()`` method, which is used to resolve an indirect
|
||||
object, may be invoked from the ``QPDFObjectHandle`` class. It first
|
||||
checks a cache to see whether this object has already been read. If not,
|
||||
it reads the object from the PDF file and caches it. It then returns the
|
||||
resulting ``QPDFObjectHandle``. The calling object handle then replaces
|
||||
its ``PointerHolder<QPDFObject>`` with the one from the newly returned
|
||||
``QPDFObjectHandle``. In this way, only a single copy of any direct
|
||||
object need exist and clients can access objects transparently without
|
||||
knowing or caring whether they are direct or indirect objects.
|
||||
Additionally, no object is ever read from the file more than once. That
|
||||
means that only the portions of the PDF file that are actually needed
|
||||
are ever read from the input file, thus allowing the qpdf package to
|
||||
take advantage of this important design goal of PDF files.
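
The caching behavior described above can be sketched as follows. This is a
simplified model, not the real ``QPDF::resolve()``: the ``Document`` class, its
``file`` map, and the ``reads`` counter are assumptions made for illustration.
The point is that a second request for the same object ID and generation is
served from the cache and returns the same shared object.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Object
{
    std::string value;
};

// Object ID and generation number together identify an indirect object.
using ObjGen = std::pair<int, int>;

class Document
{
  public:
    // Stand-in for the input file: object text indexed by ID/generation.
    std::map<ObjGen, std::string> file;
    int reads = 0; // counts simulated reads from the input file

    std::shared_ptr<Object>
    resolve(int objid, int generation)
    {
        ObjGen key(objid, generation);
        auto cached = cache.find(key);
        if (cached != cache.end()) {
            return cached->second; // already read: no file access
        }
        ++reads; // simulate seeking to the offset and parsing the object
        auto obj = std::make_shared<Object>(Object{file.at(key)});
        cache[key] = obj;
        return obj;
    }

  private:
    std::map<ObjGen, std::shared_ptr<Object>> cache;
};
```

Objects that are never requested are never read, so the cost of opening a
file in this model depends on what the client touches, not on the size of the
file.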
|
||||
|
||||
If the requested object is inside of an object stream, the object stream
|
||||
itself is first read into memory. Then the tokenizer reads objects from
|
||||
the memory stream based on the offset information stored in the stream.
|
||||
Those individual objects are cached, after which the temporary buffer
|
||||
holding the object stream contents is discarded. In this way, the first
|
||||
time an object in an object stream is requested, all objects in the
|
||||
stream are cached.
|
||||
|
||||
The following example should clarify how ``QPDF`` processes a simple
|
||||
file.
|
||||
|
||||
- Client constructs ``QPDF`` ``pdf`` and calls
|
||||
``pdf.processFile("a.pdf");``.
|
||||
|
||||
- The ``QPDF`` class checks the beginning of
|
||||
:file:`a.pdf` for a PDF header. It then reads the
|
||||
cross reference table mentioned at the end of the file, ensuring that
|
||||
it is looking before the last ``%%EOF``. After getting to the ``trailer``
|
||||
keyword, it invokes the parser.
|
||||
|
||||
- The parser sees "``<<``", so it calls itself recursively in
|
||||
dictionary creation mode.
|
||||
|
||||
- In dictionary creation mode, the parser keeps accumulating objects
|
||||
until it encounters "``>>``". Each object that is read is pushed onto
|
||||
a stack. If "``R``" is read, the last two objects on the stack are
|
||||
inspected. If they are integers, they are popped off the stack and
|
||||
their values are used to construct an indirect object handle which is
|
||||
then pushed onto the stack. When "``>>``" is finally read, the stack
|
||||
is converted into a ``QPDF_Dictionary`` which is placed in a
|
||||
``QPDFObjectHandle`` and returned.
|
||||
|
||||
- The resulting dictionary is saved as the trailer dictionary.
|
||||
|
||||
- The ``/Prev`` key is searched. If present, ``QPDF`` seeks to that
|
||||
point and repeats except that the new trailer dictionary is not
|
||||
saved. If ``/Prev`` is not present, the initial parsing process is
|
||||
complete.
|
||||
|
||||
If there is an encryption dictionary, the document's encryption
|
||||
parameters are initialized.
|
||||
|
||||
- The client requests the root object. The ``QPDF`` class gets the value of
|
||||
the root key from the trailer dictionary and returns it. It is an unresolved
|
||||
indirect ``QPDFObjectHandle``.
|
||||
|
||||
- The client requests the ``/Pages`` key from the root
|
||||
``QPDFObjectHandle``. The ``QPDFObjectHandle`` notices that it is
|
||||
indirect so it asks ``QPDF`` to resolve it. ``QPDF`` looks in the
|
||||
object cache for an object with the root dictionary's object ID and
|
||||
generation number. Upon not seeing it, it checks the cross reference
|
||||
table, gets the offset, and reads the object present at that offset.
|
||||
It stores the result in the object cache and returns the cached
|
||||
result. The calling ``QPDFObjectHandle`` replaces its object pointer
|
||||
with the one from the resolved ``QPDFObjectHandle``, verifies that it
|
||||
is a valid dictionary object, and returns the (unresolved indirect)
|
||||
``QPDFObjectHandle`` for the top of the Pages hierarchy.
|
||||
|
||||
As the client continues to request objects, the same process is
|
||||
followed for each new requested object.
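
The dictionary-creation step in the walk-through above can be sketched as a
small stack machine. This is an illustrative simplification, not qpdf's
parser: the ``Item`` type and the ``parseDictionaryBody`` function are
hypothetical, tokens are assumed to be already split, and nesting, arrays, and
error handling are left out. It shows how ``<int> <int> R`` collapses into a
single indirect reference on the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified parsed item: an integer, an indirect reference, or any
// other token kept as text (names, strings, and so on).
struct Item
{
    enum Kind { integer, indirect, other };
    Kind kind = other;
    std::string text;   // original token text for integer/other
    int objid = 0;      // set for indirect references
    int generation = 0; // set for indirect references
};

// Accumulate items until ">>", collapsing "<int> <int> R" into one
// indirect reference.
inline std::vector<Item>
parseDictionaryBody(std::vector<std::string> const& tokens)
{
    std::vector<Item> stack;
    for (auto const& tok: tokens) {
        if (tok == ">>") {
            break;
        }
        bool is_int = !tok.empty() &&
            tok.find_first_not_of("0123456789") == std::string::npos;
        std::size_t n = stack.size();
        if (tok == "R" && n >= 2 && stack[n - 1].kind == Item::integer &&
            stack[n - 2].kind == Item::integer) {
            // The last two integers are the object ID and generation.
            Item ref;
            ref.kind = Item::indirect;
            ref.objid = std::stoi(stack[n - 2].text);
            ref.generation = std::stoi(stack[n - 1].text);
            stack.pop_back();
            stack.pop_back();
            stack.push_back(ref);
        } else {
            Item item;
            item.kind = is_int ? Item::integer : Item::other;
            item.text = tok;
            stack.push_back(item);
        }
    }
    return stack;
}
```

For the trailer fragment ``/Root 1 0 R /Size 7 >>``, the resulting stack holds
four items: the name ``/Root``, an indirect reference to object 1 generation
0, the name ``/Size``, and the integer 7. Pairing them up as keys and values
would then yield the dictionary.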
|
||||
|
||||
.. _ref.casting:
|
||||
|
||||
Casting Policy
|
||||
--------------
|
||||
|
||||
This section describes the casting policy followed by qpdf's
|
||||
implementation. This is of no concern to qpdf's end users and largely of no
|
||||
concern to people writing code that uses qpdf, but it could be of
|
||||
interest to people who are porting qpdf to a new platform or who are
|
||||
making modifications to the code.
|
||||
|
||||
The C++ code in qpdf is free of old-style casts except where unavoidable
|
||||
(e.g. where the old-style cast is in a macro provided by a third-party
|
||||
header file). When there is a need for a cast, it is handled, in order
|
||||
of preference, by rewriting the code to avoid the need for a cast,
|
||||
calling ``const_cast``, calling ``static_cast``, calling
|
||||
``reinterpret_cast``, or calling some combination of the above. As a
|
||||
last resort, a compiler-specific ``#pragma`` may be used to suppress a
|
||||
warning that we don't want to fix. Examples may include suppressing
|
||||
warnings about the use of old-style casts in code that is shared between
|
||||
C and C++ code.
|
||||
|
||||
The ``QIntC`` namespace, provided by
|
||||
:file:`include/qpdf/QIntC.hh`, implements safe
|
||||
functions for converting between integer types. These functions do range
|
||||
checking and throw a ``std::range_error``, which is a subclass of
|
||||
``std::runtime_error``, if conversion from one integer type to another
|
||||
results in loss of information. There are many cases in which we have to
|
||||
move between different integer types because of incompatible integer
|
||||
types used in interoperable interfaces. Some are unavoidable, such as
|
||||
moving between sizes and offsets, and others are there because of old
|
||||
code that is too entrenched to be fixable without breaking source
|
||||
compatibility and causing pain for users. QPDF is compiled with extra
|
||||
warnings to detect conversions with potential data loss, and all such
|
||||
cases should be fixed by either using a function from ``QIntC`` or a
|
||||
``static_cast``.
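
To make the idea of a range-checked conversion concrete, here is a minimal
sketch. It is not the code in :file:`include/qpdf/QIntC.hh`; the function name
``checked_cast`` is hypothetical. It converts between integer types and throws
``std::range_error`` when the value would not survive the conversion.

```cpp
#include <cassert>
#include <limits>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical helper, not qpdf's implementation: convert between
// integer types, throwing std::range_error if the value does not fit.
template <typename To, typename From>
To
checked_cast(From value)
{
    static_assert(
        std::is_integral<To>::value && std::is_integral<From>::value,
        "checked_cast works on integer types only");
    // Convert, then verify that the value survives a round trip and
    // that its sign did not change.
    To result = static_cast<To>(value);
    bool round_trips = (static_cast<From>(result) == value);
    bool same_sign = ((result < To(0)) == (value < From(0)));
    if (!(round_trips && same_sign)) {
        throw std::range_error(
            "integer out of range converting " + std::to_string(value));
    }
    return result;
}
```

With this helper, ``checked_cast<int>(some_size)`` either yields the same
numerical value or throws, whereas a bare ``static_cast`` would silently
produce a different number when the value is out of range.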
|
||||
|
||||
When the intention is just to switch the type because of exchanging data
|
||||
between incompatible interfaces, use ``QIntC``. This is the usual case.
|
||||
However, there are some cases in which we are explicitly intending to
|
||||
use the exact same bit pattern with a different type. This is most
|
||||
common when switching between signed and unsigned characters. A lot of
|
||||
qpdf's code uses unsigned characters internally, but ``std::string`` and
|
||||
``char`` are typically signed. Using ``QIntC::to_char`` would be wrong for
|
||||
converting from unsigned to signed characters because a negative
|
||||
``char`` value and the corresponding ``unsigned char`` value greater
|
||||
than 127 *mean the same thing*. There are also
|
||||
cases in which we use ``static_cast`` when working with bit fields where
|
||||
we are not representing a numerical value but rather a bunch of bits
|
||||
packed together in some integer type. Also note that ``size_t`` and
|
||||
``long`` both typically differ between 32-bit and 64-bit environments,
|
||||
so sometimes an explicit cast may not be needed to avoid warnings on one
|
||||
platform but may be needed on another. A conversion with ``QIntC``
|
||||
should always be used when the types are different even if the
|
||||
underlying size is the same. QPDF's CI build builds on 32-bit and 64-bit
|
||||
platforms, and the test suite is very thorough, so it is hard to make
|
||||
any of the potential errors here without being caught in build or test.
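
The signed/unsigned character case can be illustrated with a short sketch. The
function names ``bytesToString`` and ``byteAt`` are hypothetical and exist only
for this example. Here ``static_cast`` is the right tool because the goal is to
keep the same bit pattern, not the same numerical value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append raw bytes to a std::string. static_cast keeps the bit pattern:
// byte 0xE9 becomes whatever char value has that same representation.
inline std::string
bytesToString(unsigned char const* data, std::size_t len)
{
    std::string result;
    for (std::size_t i = 0; i < len; ++i) {
        result += static_cast<char>(data[i]);
    }
    return result;
}

// Read a byte back out as an unsigned value in the range 0..255.
inline unsigned int
byteAt(std::string const& s, std::size_t i)
{
    return static_cast<unsigned char>(s.at(i));
}
```

A range-checked conversion would reject byte values above 127 on platforms
where ``char`` is signed, even though storing them in a ``std::string`` is
exactly what is intended.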
|
||||
|
||||
Non-const ``unsigned char*`` is used in the ``Pipeline`` interface. The
|
||||
pipeline interface has a ``write`` call that uses ``unsigned char*``
|
||||
without a ``const`` qualifier. The main reason for this is
|
||||
to support pipelines that make calls to third-party libraries, such as
|
||||
zlib, that don't include ``const`` in their interfaces. Unfortunately,
|
||||
there are many places in the code where it is desirable to have
|
||||
``const char*`` with pipelines. None of the pipeline implementations
|
||||
in qpdf
|
||||
currently modify the data passed to write, and doing so would be counter
|
||||
to the intent of ``Pipeline``, but there is nothing in the code to
|
||||
prevent this from being done. There are places in the code where
|
||||
``const_cast`` is used to remove the const-ness of pointers going into
|
||||
``Pipeline``\ s. This could theoretically be unsafe, but there is
|
||||
adequate testing to assert that it is safe and will remain safe in
|
||||
qpdf's code.
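
The ``const_cast`` situation described above looks roughly like this. The
sketch uses hypothetical classes ``Sink`` and ``StringSink`` and a hypothetical
wrapper ``writeString`` rather than qpdf's actual ``Pipeline`` classes; only the
shape of the ``write`` signature is taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified stand-in for a pipeline stage. As in the interface
// described above, write() takes a non-const pointer.
class Sink
{
  public:
    virtual ~Sink() = default;
    virtual void write(unsigned char* data, std::size_t len) = 0;
};

// A sink that only reads its input and appends it to a string.
class StringSink: public Sink
{
  public:
    void
    write(unsigned char* data, std::size_t len) override
    {
        result.append(reinterpret_cast<char*>(data), len);
    }

    std::string result;
};

// Hypothetical convenience wrapper: accept const data and strip the
// const qualifier at the boundary. This is only safe because the sink
// never modifies the bytes it is given.
inline void
writeString(Sink& sink, std::string const& s)
{
    auto bytes = reinterpret_cast<unsigned char const*>(s.data());
    sink.write(const_cast<unsigned char*>(bytes), s.size());
}
```

Nothing in the type system stops a ``Sink`` subclass from writing through the
pointer, which is why the safety of this pattern rests on convention and
testing rather than on the compiler.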
|
||||
|
||||
.. _ref.encryption:
|
||||
|
||||
Encryption
|
||||
----------
|
||||
|
||||
Encryption is supported transparently by qpdf. When opening a PDF file,
|
||||
if an encryption dictionary exists, the ``QPDF`` object processes this
|
||||
dictionary using the password (if any) provided. The primary decryption
|
||||
key is computed and cached. No further access is made to the encryption
|
||||
dictionary after that time. When an object is read from a file, the
|
||||
object ID and generation of the object in which it is contained are
|
||||
always known. Using this information along with the stored encryption
|
||||
key, all stream and string objects are transparently decrypted. Raw
|
||||
encrypted objects are never stored in memory. This way, nothing in the
|
||||
library ever has to know or care whether it is reading an encrypted
|
||||
file.
|
||||
|
||||
An interface is also provided for writing encrypted streams and strings
|
||||
given an encryption key. This is used by ``QPDFWriter`` when it rewrites
|
||||
encrypted files.
|
||||
|
||||
When copying encrypted files, unless otherwise directed, qpdf will
|
||||
preserve any encryption in force in the original file. qpdf can do this
|
||||
with either the user or the owner password. There is no difference in
|
||||
capability based on which password is used. When 40- or 128-bit
|
||||
encryption keys are used, the user password can be recovered with the
|
||||
owner password. With 256-bit keys, the user and owner passwords are used
|
||||
independently to encrypt the actual encryption key, so while either can
|
||||
be used, the owner password can no longer be used to recover the user
|
||||
password.
|
||||
|
||||
Starting with version 4.0.0, qpdf can read files that are not encrypted
|
||||
but that contain encrypted attachments, but it cannot write such files.
|
||||
qpdf also requires the password to be specified in order to open the
|
||||
file, not just to extract attachments, since once the file is open, all
|
||||
decryption is handled transparently. When copying files like this while
|
||||
preserving encryption, qpdf will apply the file's encryption to
|
||||
everything in the file, not just to the attachments. When decrypting the
|
||||
file, qpdf will decrypt the attachments. In general, when copying PDF
|
||||
files with multiple encryption formats, qpdf will choose the newest
|
||||
format. The only exception to this is that clear-text metadata will be
|
||||
preserved as clear-text if it is that way in the original file.
|
||||
|
||||
One point of confusion some people have about encrypted PDF files is
|
||||
that encryption is not the same as password protection. Password
|
||||
protected files are always encrypted, but it is also possible to create
|
||||
encrypted files that do not have passwords. Internally, such files use
|
||||
the empty string as a password, and most readers try the empty string
|
||||
first to see if it works and prompt for a password only if the empty
|
||||
string doesn't work. Normally such files have an empty user password and
|
||||
a non-empty owner password. In that way, if the file is opened by an
|
||||
ordinary reader without specification of password, the restrictions
|
||||
specified in the encryption dictionary can be enforced. Most users
|
||||
wouldn't even realize such a file was encrypted. Since qpdf always
|
||||
ignores the restrictions (except for the purpose of reporting what they
|
||||
are), qpdf doesn't care which password you use. QPDF will allow you to
|
||||
create PDF files with non-empty user passwords and empty owner
|
||||
passwords. Some readers will require a password when you open these
|
||||
files, and others will open the files without a password and not enforce
|
||||
restrictions. Having a non-empty user password and an empty owner
|
||||
password doesn't really make sense because it would mean that opening
|
||||
the file with the user password would be more restrictive than not
|
||||
supplying a password at all. QPDF also allows you to create PDF files
|
||||
with the same password as both the user and owner password. Some readers
|
||||
will not ever allow such files to be accessed without restrictions
|
||||
because they never try the password as the owner password if it works as
|
||||
the user password. Nonetheless, one of the powerful aspects of qpdf is
|
||||
that it allows you to finely specify the way encrypted files are
|
||||
created, even if the results are not useful to some readers. One use
|
||||
case for this would be for testing a PDF reader to ensure that it
|
||||
handles odd configurations of input files.
|
||||
|
||||
.. _ref.random-numbers:
|
||||
|
||||
Random Number Generation
|
||||
------------------------
|
||||
|
||||
QPDF generates random numbers to support generation of encrypted data.
|
||||
Starting in qpdf 10.0.0, qpdf uses the crypto provider as its source of
|
||||
random numbers. Older versions used the OS-provided source of secure
|
||||
random numbers or, if allowed at build time, insecure random numbers
|
||||
from stdlib. Starting with version 5.1.0, you can disable use of
|
||||
OS-provided secure random numbers at build time. This is especially
|
||||
useful on Windows if you want to avoid a dependency on Microsoft's
|
||||
cryptography API. You can also supply your own random data provider. For
|
||||
details on how to do this, please refer to the top-level README.md file
|
||||
in the source distribution and to comments in
|
||||
:file:`QUtil.hh`.
|
||||
|
||||
.. _ref.adding-and-remove-pages:
|
||||
|
||||
Adding and Removing Pages
|
||||
-------------------------
|
||||
|
||||
While qpdf's API has supported adding and modifying objects for some
|
||||
time, version 3.0 introduces specific methods for adding and removing
|
||||
pages. These are largely convenience routines that handle two tricky
|
||||
issues: pushing inheritable resources from the ``/Pages`` tree down to
|
||||
individual pages and manipulation of the ``/Pages`` tree itself. For
|
||||
details, see ``addPage`` and surrounding methods in
|
||||
:file:`QPDF.hh`.
|
||||
|
||||
.. _ref.reserved-objects:
|
||||
|
||||
Reserving Object Numbers
|
||||
------------------------
|
||||
|
||||
Version 3.0 of qpdf introduced the concept of reserved objects. These
|
||||
are seldom needed for ordinary operations, but there are cases in which
|
||||
you may want to add a series of indirect objects with references to each
|
||||
other to a ``QPDF`` object. This causes a problem because you can't
|
||||
determine the object ID that a new indirect object will have until you
|
||||
add it to the ``QPDF`` object with ``QPDF::makeIndirectObject``. The
|
||||
only way to add two mutually referential objects to a ``QPDF`` object
|
||||
prior to version 3.0 would be to add the new objects first and then make
|
||||
them refer to each other after adding them. Now it is possible to create
|
||||
a *reserved object* using
|
||||
``QPDFObjectHandle::newReserved``. This is an indirect object that stays
|
||||
"unresolved" even if it is queried for its type. So now, if you want to
|
||||
create a set of mutually referential objects, you can create
|
||||
reservations for each one of them and use those reservations to
|
||||
construct the references. When finished, you can call
|
||||
``QPDF::replaceReserved`` to replace the reserved objects with the real
|
||||
ones. This functionality will never be needed by most applications, but
|
||||
it is used internally by QPDF when copying objects from other PDF files,
|
||||
as discussed in :ref:`ref.foreign-objects`. For an example of how to use reserved
|
||||
objects, search for ``newReserved`` in
|
||||
:file:`test_driver.cc` in qpdf's sources.
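
The reservation idea can be illustrated without qpdf itself. The sketch below is a toy model and not qpdf's API: ``ToyDocument``, ``reserve``, ``replaceReserved``, ``ref``, and ``makeMutualPair`` are hypothetical names chosen only to mirror the roles of ``QPDFObjectHandle::newReserved`` and ``QPDF::replaceReserved``, and objects are represented as plain strings. It shows why knowing the object IDs up front makes mutually referential objects easy to build.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Toy stand-in for a document's object table (hypothetical, not qpdf code).
// Each object is a string that may mention other objects by ID.
class ToyDocument
{
  public:
    // Hand out the next object ID without supplying any contents yet.
    int reserve()
    {
        int id = next_id_++;
        reserved_.insert(id);
        return id;
    }

    // Fill in a previously reserved slot with the real object.
    void replaceReserved(int id, std::string const& contents)
    {
        if (reserved_.erase(id) == 0) {
            throw std::logic_error("object was not reserved");
        }
        objects_[id] = contents;
    }

    std::string const& get(int id) const { return objects_.at(id); }

  private:
    int next_id_ = 1;
    std::set<int> reserved_;
    std::map<int, std::string> objects_;
};

// Build an indirect reference of the form "N 0 R".
inline std::string ref(int id)
{
    return std::to_string(id) + " 0 R";
}

// Create two objects that refer to each other.
inline ToyDocument makeMutualPair()
{
    ToyDocument doc;
    int a = doc.reserve(); // both IDs are known before either object exists
    int b = doc.reserve();
    doc.replaceReserved(a, "<< /Peer " + ref(b) + " >>");
    doc.replaceReserved(b, "<< /Peer " + ref(a) + " >>");
    return doc;
}
```

For real usage with the library, the ``newReserved`` calls in :file:`test_driver.cc` remain the authoritative example.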
|
||||
|
||||
.. _ref.foreign-objects:
|
||||
|
||||
Copying Objects From Other PDF Files
|
||||
------------------------------------
|
||||
|
||||
Version 3.0 of qpdf introduced the ability to copy objects into a
|
||||
``QPDF`` object from a different ``QPDF`` object, which we refer to as
|
||||
*foreign objects*. This allows arbitrary
|
||||
merging of PDF files. The "from" ``QPDF`` object must remain valid after
|
||||
the copy as discussed in the note below. The
|
||||
:command:`qpdf` command-line tool provides limited
|
||||
support for basic page selection, including merging in pages from other
|
||||
files, but the library's API makes it possible to implement arbitrarily
|
||||
complex merging operations. The main method for copying foreign objects
|
||||
is ``QPDF::copyForeignObject``. This takes an indirect object from
|
||||
another ``QPDF`` and copies it recursively into this object while
|
||||
preserving all object structure, including circular references. This
|
||||
means you can add a direct object that you create from scratch to a
|
||||
``QPDF`` object with ``QPDF::makeIndirectObject``, and you can add an
|
||||
indirect object from another file with ``QPDF::copyForeignObject``. The
|
||||
fact that ``QPDF::makeIndirectObject`` does not automatically detect a
|
||||
foreign object and copy it is an explicit design decision. Copying a
|
||||
foreign object seems like a sufficiently significant thing to do that it
|
||||
should be done explicitly.
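
To make "copies it recursively ... including circular references" concrete, here is a minimal sketch of one way such a copy can work. It is an illustration under simplifying assumptions, not qpdf's implementation: ``ToyDoc`` and ``copyForeign`` are hypothetical names, and an object is reduced to the list of object IDs it references. The key step is recording the destination ID before recursing, which plays the same role as a reserved object.

```cpp
#include <map>
#include <vector>

// Toy document (hypothetical): object ID -> IDs of the objects it references.
struct ToyDoc
{
    std::map<int, std::vector<int>> objects;
    int next_id = 1;
};

// Copy `foreign_id` and everything reachable from it from `src` into `dst`.
// `copied` maps source IDs to destination IDs and persists across calls so
// that shared and circular references end up pointing at a single copy.
inline int
copyForeign(ToyDoc& dst, ToyDoc const& src, int foreign_id, std::map<int, int>& copied)
{
    auto found = copied.find(foreign_id);
    if (found != copied.end()) {
        return found->second; // already copied, or copy in progress: reuse it
    }
    int local_id = dst.next_id++;  // claim the destination ID first ...
    copied[foreign_id] = local_id; // ... so cycles resolve to this ID
    std::vector<int> refs;
    for (int child : src.objects.at(foreign_id)) {
        refs.push_back(copyForeign(dst, src, child, copied));
    }
    dst.objects[local_id] = refs;
    return local_id;
}
```

Copying object 5 from a source in which 5 refers to 6 and 6 refers back to 5 terminates, and the two copies refer to each other in the destination.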
|
||||
|
||||
The other way to copy foreign objects is by passing a page from one
|
||||
``QPDF`` to another by calling ``QPDF::addPage``. In contrast to
|
||||
``QPDF::makeIndirectObject``, this method automatically distinguishes
|
||||
between indirect objects in the current file, foreign objects, and
|
||||
direct objects.
|
||||
|
||||
Please note: when you copy objects from one ``QPDF`` to another, the
|
||||
source ``QPDF`` object must remain valid until you have finished with
|
||||
the destination object. This is because the original object is still
|
||||
used to retrieve any referenced stream data from the copied object.
|
||||
|
||||
.. _ref.rewriting:
|
||||
|
||||
Writing PDF Files
|
||||
-----------------
|
||||
|
||||
The qpdf library supports file writing of ``QPDF`` objects to PDF files
|
||||
through the ``QPDFWriter`` class. The ``QPDFWriter`` class has two
|
||||
writing modes: one for non-linearized files, and one for linearized
|
||||
files. See :ref:`ref.linearization` for a description of how
|
||||
linearization is implemented. This section describes how we write
|
||||
non-linearized files, including the creation of QDF files (see :ref:`ref.qdf`).
|
||||
|
||||
This outline was written prior to implementation and is not exactly
|
||||
accurate, but it provides a correct "notional" idea of how writing
|
||||
works. Look at the code in ``QPDFWriter`` for exact details.
|
||||
|
||||
- Initialize state:
|
||||
|
||||
- next object number = 1
|
||||
|
||||
- object queue = empty
|
||||
|
||||
- renumber table: old object id/generation to new id/0 = empty
|
||||
|
||||
- xref table: new id -> offset = empty
|
||||
|
||||
- Create a QPDF object from a file.
|
||||
|
||||
- Write header for new PDF file.
|
||||
|
||||
- Request the trailer dictionary.
|
||||
|
||||
- For each value that is an indirect object, grab the next object
|
||||
number (via an operation that returns and increments the number). Map
|
||||
object to new number in renumber table. Push object onto queue.
|
||||
|
||||
- While there are more objects on the queue:
|
||||
|
||||
- Pop queue.
|
||||
|
||||
- Look up object's new number *n* in the renumbering table.
|
||||
|
||||
- Store current offset into xref table.
|
||||
|
||||
- Write :samp:`{n} 0 obj`.
|
||||
|
||||
- If object is null, whether direct or indirect, write out null,
|
||||
thus eliminating unresolvable indirect object references.
|
||||
|
||||
- If the object is a stream, write stream contents, piped
|
||||
through any filters as required, to a memory buffer. Use this
|
||||
buffer to determine the stream length.
|
||||
|
||||
- If object is not a stream, array, or dictionary, write out its
|
||||
contents.
|
||||
|
||||
- If object is an array or dictionary (including stream), traverse
|
||||
its elements (for array) or values (for dictionaries), handling
|
||||
recursive dictionaries and arrays, looking for indirect objects.
|
||||
When an indirect object is found, if it is not resolvable, ignore.
|
||||
(This case is handled when writing it out.) Otherwise, look it up
|
||||
in the renumbering table. If not found, grab the next available
|
||||
object number, assign to the referenced object in the renumbering
|
||||
table, and push the referenced object onto the queue. As a special
|
||||
case, when writing out a stream dictionary, replace length,
|
||||
filters, and decode parameters as required.
|
||||
|
||||
Write out dictionary or array, replacing any unresolvable indirect
|
||||
object references with null (the PDF spec says a reference to a
|
||||
non-existent object is legal and resolves to null) and any
|
||||
resolvable ones with references to the renumbered objects.
|
||||
|
||||
- If the object is a stream, write ``stream\n``, the stream contents
|
||||
(from the memory buffer), and ``\nendstream\n``.
|
||||
|
||||
- When done, write ``endobj``.
|
||||
|
||||
Once we have finished the queue, all referenced objects will have been
|
||||
written out and all deleted objects or unreferenced objects will have
|
||||
been skipped. The new cross-reference table will contain an offset for
|
||||
every new object number from 1 up to the number of objects written. This
|
||||
can be used to write out a new xref table. Finally we can write out the
|
||||
trailer dictionary with appropriately computed /ID (see spec, 8.3, File
|
||||
Identifiers), the cross reference table offset, and ``%%EOF``.
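
The queue-and-renumber portion of the outline above can be sketched in a few lines. This is a notional model only, in the same spirit as the outline, and is not the code in ``QPDFWriter``: ``ToyGraph`` and ``renumber`` are hypothetical names, objects are reduced to the list of object IDs they reference, and nothing is actually written to a file.

```cpp
#include <deque>
#include <map>
#include <vector>

// Toy object graph (hypothetical): old object ID -> old IDs it references.
// A referenced ID that is missing from the map is "unresolvable".
using ToyGraph = std::map<int, std::vector<int>>;

// Return the renumber table (old ID -> new ID) produced by walking the
// graph from the objects that the trailer dictionary refers to.
inline std::map<int, int>
renumber(ToyGraph const& objects, std::vector<int> const& trailer_refs)
{
    int next_id = 1;
    std::deque<int> queue;
    std::map<int, int> renumber_table;

    // Assign a new number the first time a resolvable object is seen.
    auto enqueue = [&](int old_id) {
        if (objects.count(old_id) == 0) {
            return; // unresolvable: a writer would emit null instead
        }
        if (renumber_table.count(old_id) == 0) {
            renumber_table[old_id] = next_id++;
            queue.push_back(old_id);
        }
    };

    for (int old_id : trailer_refs) {
        enqueue(old_id);
    }
    while (!queue.empty()) {
        int old_id = queue.front();
        queue.pop_front();
        // A real writer would record the current offset in the xref table
        // and write the object's "n 0 obj ... endobj" text at this point.
        for (int child : objects.at(old_id)) {
            enqueue(child);
        }
    }
    return renumber_table;
}
```

Objects that are never reached from the trailer never receive a new number, which is how unreferenced objects get dropped from the output.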
|
||||
|
||||
.. _ref.filtered-streams:
|
||||
|
||||
Filtered Streams
|
||||
----------------
|
||||
|
||||
Support for streams is implemented through the ``Pipeline`` interface
|
||||
which was designed for this package.
|
||||
|
||||
When reading streams, create a series of ``Pipeline`` objects. The
|
||||
``Pipeline`` abstract base class requires implementation of ``write()`` and
|
||||
``finish()`` and provides an implementation of ``getNext()``. Each
|
||||
pipeline object, upon receiving data, does whatever it is going to do
|
||||
and then writes the data (possibly modified) to its successor.
|
||||
Alternatively, a pipeline may be an end-of-the-line pipeline that does
|
||||
something like store its output to a file or a memory buffer ignoring a
|
||||
successor. For additional details, look at
|
||||
:file:`Pipeline.hh`.
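
The chaining described above can be sketched with a simplified model. The classes below are illustrative assumptions and deliberately do not reproduce the real signatures in :file:`Pipeline.hh`: ``ToyPipeline``, ``ToyUpcase``, and ``ToyBuffer`` are hypothetical names. One stage transforms the data and hands it to its successor, and the last stage is an end-of-the-line pipeline that stores its input in memory.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Simplified model of a pipeline stage (hypothetical, not qpdf's Pipeline).
class ToyPipeline
{
  public:
    explicit ToyPipeline(ToyPipeline* next = nullptr) : next_(next) {}
    virtual ~ToyPipeline() = default;
    virtual void write(char const* data, std::size_t len) = 0;
    virtual void finish() = 0;

  protected:
    ToyPipeline* getNext() const { return next_; }

  private:
    ToyPipeline* next_;
};

// Middle stage: upper-cases its input and passes it to its successor.
class ToyUpcase : public ToyPipeline
{
  public:
    explicit ToyUpcase(ToyPipeline* next) : ToyPipeline(next) {}
    void write(char const* data, std::size_t len) override
    {
        std::string out(data, len);
        for (char& c : out) {
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        }
        getNext()->write(out.data(), out.size());
    }
    void finish() override { getNext()->finish(); }
};

// End-of-the-line stage: collects everything into a memory buffer.
class ToyBuffer : public ToyPipeline
{
  public:
    void write(char const* data, std::size_t len) override
    {
        buffer_.append(data, len);
    }
    void finish() override { finished_ = true; }
    std::string const& contents() const { return buffer_; }
    bool finished() const { return finished_; }

  private:
    std::string buffer_;
    bool finished_ = false;
};
```

A caller constructs the sink first, passes its address to the stage in front of it, writes to the head of the chain, and calls ``finish()`` once at the end.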
|
||||
|
||||
``QPDF`` can read raw or filtered streams. When reading a filtered
|
||||
stream, the ``QPDF`` class creates one ``Pipeline`` object for each
|
||||
appropriate filter object and chains them together. The last filter
|
||||
should write to whatever type of output is required. The ``QPDF`` class
|
||||
has an interface to write raw or filtered stream contents to a given
|
||||
pipeline.
|
||||
|
||||
.. _ref.object-accessors:
|
||||
|
||||
Object Accessor Methods
|
||||
-----------------------
|
||||
|
||||
..
|
||||
This section is referenced in QPDFObjectHandle.hh
|
||||
|
||||
For general information about how to access instances of
|
||||
``QPDFObjectHandle``, please see the comments in
|
||||
:file:`QPDFObjectHandle.hh`. Search for "Accessor
|
||||
methods". This section provides a more in-depth discussion of the
|
||||
behavior and the rationale for the behavior.
|
||||
|
||||
*Why were type errors made into warnings?* When type checks were
|
||||
introduced into qpdf in the early days, it was expected that type errors
|
||||
would only occur as a result of programmer error. However, in practice,
|
||||
type errors would occur with malformed PDF files because of assumptions
|
||||
made in code, including code within the qpdf library and code written by
|
||||
library users. The most common case would be chaining calls to
|
||||
``getKey()`` to access keys deep within a dictionary. In many cases,
|
||||
qpdf would be able to recover from these situations, but the old
|
||||
behavior often resulted in crashes rather than graceful recovery. For
|
||||
this reason, the errors were changed to warnings.
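
The effect of turning type errors into warnings can be shown with a toy handle. This is a sketch under stated assumptions, not qpdf's implementation: ``ToyHandle`` and its methods are hypothetical and only imitate the idea that a type-specific accessor called on the wrong kind of object records a warning and returns a harmless fallback, so chained ``getKey()`` calls never crash.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy object handle (hypothetical). Warnings go to a caller-owned list.
class ToyHandle
{
    enum class Kind { null_obj, integer, dictionary };

  public:
    static ToyHandle newNull(std::vector<std::string>* warnings)
    {
        return ToyHandle(Kind::null_obj, warnings);
    }
    static ToyHandle newInteger(std::vector<std::string>* warnings, int value)
    {
        ToyHandle h(Kind::integer, warnings);
        h.value_ = value;
        return h;
    }
    static ToyHandle newDictionary(std::vector<std::string>* warnings)
    {
        return ToyHandle(Kind::dictionary, warnings);
    }

    bool isNull() const { return kind_ == Kind::null_obj; }

    // Wrong type: warn and return 0 instead of failing.
    int getIntValue() const
    {
        if (kind_ != Kind::integer) {
            warn("getIntValue called on non-integer; returning 0");
            return 0;
        }
        return value_;
    }

    void replaceKey(std::string const& key, ToyHandle const& value)
    {
        if (kind_ != Kind::dictionary) {
            warn("replaceKey called on non-dictionary; ignored");
            return;
        }
        items_->erase(key);
        items_->emplace(key, value);
    }

    // Wrong type: warn and return null so that chained calls keep working.
    ToyHandle getKey(std::string const& key) const
    {
        if (kind_ != Kind::dictionary) {
            warn("getKey called on non-dictionary; returning null");
            return newNull(warnings_);
        }
        auto it = items_->find(key);
        return it == items_->end() ? newNull(warnings_) : it->second;
    }

  private:
    using Map = std::map<std::string, ToyHandle>;

    ToyHandle(Kind kind, std::vector<std::string>* warnings) :
        kind_(kind),
        warnings_(warnings)
    {
        if (kind == Kind::dictionary) {
            items_ = std::make_shared<Map>();
        }
    }

    void warn(std::string const& msg) const { warnings_->push_back(msg); }

    Kind kind_;
    std::vector<std::string>* warnings_;
    int value_ = 0;
    std::shared_ptr<Map> items_;
};
```

Drilling through a value of the wrong type, as malformed files often force code to do, yields a null handle plus a list of warnings that can be inspected afterwards.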
|
||||
|
||||
*Why even warn about type errors when the user can't usually do anything
|
||||
about them?* Type warnings are extremely valuable during development.
|
||||
Since it's impossible to catch at compile time things like typos in
|
||||
dictionary key names or logic errors around what the structure of a PDF
|
||||
file might be, the presence of type warnings can save lots of developer
|
||||
time. They have also proven useful in exposing issues in qpdf itself
|
||||
that would have otherwise gone undetected.
|
||||
|
||||
*Can there be a type-safe ``QPDFObjectHandle``?* It would be great if
|
||||
``QPDFObjectHandle`` could be more strongly typed so that you'd have to
|
||||
check that something was of a particular type before calling
|
||||
type-specific accessor methods. However, implementing this at this stage
|
||||
of the library's history would be quite difficult, and it would make
|
||||
the common pattern of drilling into an object no longer work. While it
|
||||
would be possible to have a parallel interface, it would create a lot of
|
||||
extra code. If qpdf were written in a language like Rust, an interface
|
||||
like this would make a lot of sense, but, for a variety of reasons, the
|
||||
qpdf API is consistent with other APIs of its time, relying on exception
|
||||
handling to catch errors. The underlying PDF objects are inherently not
|
||||
type-safe. Forcing stronger type safety in ``QPDFObjectHandle`` would
|
||||
ultimately cause a lot more code to have to be written and would likely
|
||||
make software that uses qpdf more brittle, and even so, checks would
|
||||
have to occur at runtime.
|
||||
|
||||
*Why do type errors sometimes raise exceptions?* The way warnings work
|
||||
in qpdf requires a ``QPDF`` object to be associated with an object
|
||||
handle for a warning to be issued. It would be nice if this could be
|
||||
fixed, but it would require major changes to the API. Rather than
|
||||
throwing away these conditions, we convert them to exceptions. It's not
|
||||
that bad though. Since any object handle that was read from a file has
|
||||
an associated ``QPDF`` object, it would only be type errors on objects
|
||||
that were created explicitly that would cause exceptions, and in that
|
||||
case, type errors are much more likely to be the result of a coding
|
||||
error than invalid input.
|
||||
|
||||
*Why does the behavior of a type exception differ between the C and C++
|
||||
API?* There is no way to throw and catch exceptions in C short of
|
||||
something like ``setjmp`` and ``longjmp``, and that approach is not
|
||||
portable across language barriers. Since the C API is often used from
|
||||
other languages, it's important to keep things as simple as possible.
|
||||
Starting in qpdf 10.5, exceptions that used to crash code using the C
|
||||
API will be written to stderr by default, and it is possible to register
|
||||
an error handler. There's no reason that the error handler can't
|
||||
simulate exception handling in some way, such as by using ``setjmp`` and
|
||||
``longjmp`` or by setting some variable that can be checked after
|
||||
library calls are made. In retrospect, it might have been better if the
|
||||
C API object handle methods returned error codes like the other methods
|
||||
and set return values in passed-in pointers, but this would complicate
|
||||
both the implementation and the use of the library for a case that is
|
||||
actually quite rare and largely avoidable.
|
6271
manual/index.rst
File diff suppressed because it is too large
|
@ -0,0 +1,342 @@
|
|||
.. _ref.installing:
|
||||
|
||||
Building and Installing QPDF
|
||||
============================
|
||||
|
||||
This chapter describes how to build and install qpdf. Please see also
|
||||
the :file:`README.md` and
|
||||
:file:`INSTALL` files in the source distribution.
|
||||
|
||||
.. _ref.prerequisites:
|
||||
|
||||
System Requirements
|
||||
-------------------
|
||||
|
||||
The qpdf package has few external dependencies. In order to build qpdf,
|
||||
the following packages are required:
|
||||
|
||||
- A C++ compiler that supports C++14.
|
||||
|
||||
- zlib: http://www.zlib.net/
|
||||
|
||||
- jpeg: http://www.ijg.org/files/ or https://libjpeg-turbo.org/
|
||||
|
||||
- *Recommended but not required:* gnutls: https://www.gnutls.org/ to be
|
||||
able to use the gnutls crypto provider, and/or openssl:
|
||||
https://openssl.org/ to be able to use the openssl crypto provider.
|
||||
|
||||
- gnu make 3.81 or newer: http://www.gnu.org/software/make
|
||||
|
||||
- perl version 5.8 or newer: http://www.perl.org/; required for running
|
||||
the test suite. Starting with qpdf version 9.1.1, perl is no longer
|
||||
required at runtime.
|
||||
|
||||
- GNU diffutils (any version): http://www.gnu.org/software/diffutils/
|
||||
is required to run the test suite. Note that this is the version of
|
||||
diff present on virtually all GNU/Linux systems. This is required
|
||||
because the test suite uses :command:`diff -u`.
|
||||
|
||||
Part of qpdf's test suite compares the contents of PDF files by
|
||||
converting them to images and comparing the images. The image comparison
|
||||
tests are disabled by default. Those tests are not required for
|
||||
determining correctness of a qpdf build if you have not modified the
|
||||
code since the test suite also contains expected output files that are
|
||||
compared literally. The image comparison tests provide an extra check to
|
||||
make sure that any content transformations don't break the rendering of
|
||||
pages. Transformations that affect the content streams themselves are
|
||||
off by default and are only provided to help developers look into the
|
||||
contents of PDF files. If you are making deep changes to the library
|
||||
that cause changes in the contents of the files that qpdf generates,
|
||||
then you should enable the image comparison tests. Enable them by
|
||||
running :command:`configure` with the
|
||||
:samp:`--enable-test-compare-images` flag. If you enable
|
||||
this, the following additional packages are required by the test
|
||||
suite. Note that in no case are these items required to use qpdf.
|
||||
|
||||
- libtiff: http://www.remotesensing.org/libtiff/
|
||||
|
||||
- Ghostscript version 8.60 or newer: http://www.ghostscript.com
|
||||
|
||||
If you do not enable this, then you do not need to have tiff and
|
||||
ghostscript.
|
||||
|
||||
Pre-built documentation is distributed with qpdf, so you should
|
||||
generally not need to rebuild the documentation. In order to build the
|
||||
documentation from source, you need to install `Sphinx
|
||||
<https://sphinx-doc.org>`__. To build the PDF version of the
|
||||
documentation, you need :command:`pdflatex`, :command:`latexmk`, and a fairly complete
|
||||
LaTeX installation. Detailed requirements can be found in the Sphinx
|
||||
documentation.
|
||||
|
||||
.. _ref.building:
|
||||
|
||||
Build Instructions
|
||||
------------------
|
||||
|
||||
Building qpdf on UNIX is generally just a matter of running
|
||||
|
||||
::

   ./configure
   make
|
||||
|
||||
You can also run :command:`make check` to run the test
|
||||
suite and :command:`make install` to install. Please run
|
||||
:command:`./configure --help` for options on what can be
|
||||
configured. You can also set the value of ``DESTDIR`` during
|
||||
installation to install to a temporary location, as is common with many
|
||||
open source packages. Please see also the
|
||||
:file:`README.md` and
|
||||
:file:`INSTALL` files in the source distribution.
|
||||
|
||||
Building on Windows is a little bit more complicated. For details,
|
||||
please see :file:`README-windows.md` in the source
|
||||
distribution. You can also download a binary distribution for Windows.
|
||||
There is a port of qpdf to Visual C++ version 6 in the
|
||||
:file:`contrib` area generously contributed by Jian
|
||||
Ma. This is also discussed in more detail in
|
||||
:file:`README-windows.md`.
|
||||
|
||||
While ``wchar_t`` is part of the C++ standard, qpdf uses it in only one
|
||||
place in the public API, and it's just in a helper function. It is
|
||||
possible to build qpdf on a system that doesn't have ``wchar_t``, and
|
||||
it's also possible to compile a program that uses qpdf on a system
|
||||
without ``wchar_t`` as long as you don't call that one method. This is a
|
||||
very unusual situation. For a detailed discussion, please see the
|
||||
top-level README.md file in qpdf's source distribution.
|
||||
|
||||
There are some other things you can do with the build. Although qpdf
|
||||
uses :command:`autoconf`, it does not use
|
||||
:command:`automake` but instead uses a
|
||||
hand-crafted non-recursive Makefile that requires gnu make. If you're
|
||||
really interested, please read the comments in the top-level
|
||||
:file:`Makefile`.
|
||||
|
||||
.. _ref.crypto:
|
||||
|
||||
Crypto Providers
|
||||
----------------
|
||||
|
||||
Starting with qpdf 9.1.0, the qpdf library can be built with multiple
|
||||
implementations of providers of cryptographic functions, which we refer
|
||||
to as "crypto providers." At the time of writing, a crypto
|
||||
implementation must provide MD5 and SHA2 (256, 384, and 512-bit) hashes
|
||||
and RC4 and AES256 with and without CBC encryption. In the future, if
|
||||
digital signature is added to qpdf, there may be additional requirements
|
||||
beyond this.
|
||||
|
||||
Starting with qpdf version 9.1.0, the available implementations are
|
||||
``native`` and ``gnutls``. In qpdf 10.0.0, ``openssl`` was added.
|
||||
Additional implementations may be added if needed. It is also possible
|
||||
for a developer to provide their own implementation without modifying
|
||||
the qpdf library.
|
||||
|
||||
.. _ref.crypto.build:
|
||||
|
||||
Build Support For Crypto Providers
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
When building with qpdf's build system, crypto providers can be enabled
|
||||
at build time using various :command:`./configure`
|
||||
options. The default behavior is for
|
||||
:command:`./configure` to discover which crypto providers
|
||||
can be supported based on available external libraries, to build all
|
||||
available crypto providers, and to use an external provider as the
|
||||
default over the native one. This behavior can be changed with the
|
||||
following flags to :command:`./configure`:
|
||||
|
||||
- :samp:`--enable-crypto-{x}`
|
||||
(where :samp:`{x}` is a supported crypto
|
||||
provider): enable the :samp:`{x}` crypto
|
||||
provider, requiring any external dependencies it needs
|
||||
|
||||
- :samp:`--disable-crypto-{x}`:
|
||||
disable the :samp:`{x}` provider, and do not
|
||||
link against its dependencies even if they are available
|
||||
|
||||
- :samp:`--with-default-crypto={x}`:
|
||||
make :samp:`{x}` the default provider even if
|
||||
a higher priority one is available
|
||||
|
||||
- :samp:`--disable-implicit-crypto`: only build crypto
|
||||
providers that are explicitly requested with an
|
||||
:samp:`--enable-crypto-{x}`
|
||||
option
|
||||
|
||||
For example, if you want to guarantee that the gnutls crypto provider is
|
||||
used and that the native provider is not built, you could run
|
||||
:command:`./configure --enable-crypto-gnutls
|
||||
--disable-implicit-crypto`.
|
||||
|
||||
If you build qpdf using your own build system, in order for qpdf to work
|
||||
at all, you need to enable at least one crypto provider. The file
|
||||
:file:`libqpdf/qpdf/qpdf-config.h.in` provides
|
||||
macros ``DEFAULT_CRYPTO``, whose value must be a string naming the
|
||||
default crypto provider, and various symbols starting with
|
||||
``USE_CRYPTO_``, at least one of which has to be enabled. Additionally,
|
||||
you must compile the source files that implement a crypto provider. To
|
||||
get a list of those files, look at
|
||||
:file:`libqpdf/build.mk`. If you want to omit a
|
||||
particular crypto provider, as long as its ``USE_CRYPTO_`` symbol is
|
||||
undefined, you can completely ignore the source files that belong to a
|
||||
particular crypto provider. Additionally, crypto providers may have
|
||||
their own external dependencies that can be omitted if the crypto
|
||||
provider is not used. For example, if you are building qpdf yourself and
|
||||
are using an environment that does not support gnutls or openssl, you
|
||||
can ensure that ``USE_CRYPTO_NATIVE`` is defined, ``USE_CRYPTO_GNUTLS``
|
||||
is not defined, and ``DEFAULT_CRYPTO`` is defined to ``"native"``. Then
|
||||
you must include the source files used in the native implementation,
|
||||
some of which were added or renamed from earlier versions, to your
|
||||
build, and you can ignore
|
||||
:file:`QPDFCrypto_gnutls.cc`. Always consult
|
||||
:file:`libqpdf/build.mk` to get the list of source
|
||||
files you need to build.
|
||||
|
||||
.. _ref.crypto.runtime:
|
||||
|
||||
Runtime Crypto Provider Selection
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
You can use the :samp:`--show-crypto` option to
|
||||
:command:`qpdf` to get a list of available crypto
|
||||
providers. The default provider is always listed first, and the rest are
|
||||
listed in lexical order. Each crypto provider is listed on a line by
|
||||
itself with no other text, enabling the output of this command to be
|
||||
used easily in scripts.
|
||||
|
||||
You can override which crypto provider is used by setting the
|
||||
``QPDF_CRYPTO_PROVIDER`` environment variable. There are few reasons to
|
||||
ever do this, but you might want to do it if you were explicitly trying
|
||||
to compare behavior of two different crypto providers while testing
|
||||
performance or reproducing a bug. It could also be useful for people who
|
||||
are implementing their own crypto providers.
|
||||
|
||||
.. _ref.crypto.develop:
|
||||
|
||||
Crypto Provider Information for Developers
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If you are writing code that uses libqpdf and you want to force a
|
||||
certain crypto provider to be used, you can call the method
|
||||
``QPDFCryptoProvider::setDefaultProvider``. The argument is the name of
|
||||
a built-in or developer-supplied provider. To add your own crypto
|
||||
provider, you have to create a class derived from ``QPDFCryptoImpl`` and
|
||||
register it with ``QPDFCryptoProvider``. For additional information, see
|
||||
comments in :file:`include/qpdf/QPDFCryptoImpl.hh`.
|
||||
|
||||
.. _ref.crypto.design:
|
||||
|
||||
Crypto Provider Design Notes
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
This section describes a few bits of rationale for why the crypto
|
||||
provider interface was set up the way it was. You don't need to know any
|
||||
of this information, but it's provided for the record and in case it's
|
||||
interesting.
|
||||
|
||||
As a general rule, I want to avoid as much as possible including large
|
||||
blocks of code that are conditionally compiled such that, in most
|
||||
builds, some code is never built. This is dangerous because it makes it
|
||||
very easy for invalid code to creep in unnoticed. As such, I want it to
|
||||
be possible to build qpdf with all available crypto providers, and this
|
||||
is the way I build qpdf for local development. At the same time, if a
|
||||
particular packager feels that it is a security liability for qpdf to
|
||||
use crypto functionality from other than a library that gets
|
||||
considerable scrutiny for this specific purpose (such as gnutls,
|
||||
openssl, or nettle), then I want to give that packager the ability to
|
||||
completely disable qpdf's native implementation. Or if someone wants to
|
||||
avoid adding a dependency on one of the external crypto providers, I
|
||||
don't want the availability of the provider to impose additional
|
||||
external dependencies within that environment. Both of these are
|
||||
situations that I know to be true for some users of qpdf.
|
||||
|
||||
I want registration and selection of crypto providers to be thread-safe,
|
||||
and I want it to work deterministically for a developer to provide their
|
||||
own crypto provider and be able to set it up as the default. This was
|
||||
the primary motivation behind requiring C++11, as doing so enabled me to
|
||||
exploit the guaranteed thread safety of local block static
|
||||
initialization. The ``QPDFCryptoProvider`` class uses a singleton
|
||||
pattern with thread-safe initialization to create the singleton instance
|
||||
of ``QPDFCryptoProvider`` and exposes only static methods in its public
|
||||
interface. In this way, if a developer wants to call any
|
||||
``QPDFCryptoProvider`` methods, the library guarantees the
|
||||
``QPDFCryptoProvider`` is fully initialized and all built-in crypto
|
||||
providers are registered. Making ``QPDFCryptoProvider`` actually know
|
||||
about all the built-in providers may seem a bit sad at first, but this
|
||||
choice makes it extremely clear exactly what the initialization behavior
|
||||
is. There's no question about provider implementations automatically
|
||||
registering themselves in a nondeterministic order. It also means that
|
||||
implementations do not need to know anything about the provider
|
||||
interface, which makes them easier to test in isolation. Another
|
||||
advantage of this approach is that a developer who wants to develop
|
||||
their own crypto provider can do so in complete isolation from the qpdf
|
||||
library and, with just two calls, can make qpdf use their provider in
|
||||
their application. If they decided to contribute their code, plugging it
|
||||
into the qpdf library would require a very small change to qpdf's source
|
||||
code.
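
The singleton arrangement described above can be sketched as follows. This is a minimal illustration under stated assumptions and is not the actual ``QPDFCryptoProvider`` source: ``ToyCryptoImpl``, ``ToyNative``, ``ToyCryptoProvider``, and ``registerImpl`` are hypothetical names, the interface is reduced to a single ``name()`` method, and the mutex is an assumption of this sketch. It relies on the C++11 guarantee that a block-scope static is initialized exactly once even under concurrent calls, and it registers the built-in provider in the constructor so that the order is deterministic.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical minimal crypto interface for this sketch.
class ToyCryptoImpl
{
  public:
    virtual ~ToyCryptoImpl() = default;
    virtual std::string name() const = 0;
};

class ToyNative : public ToyCryptoImpl
{
  public:
    std::string name() const override { return "native"; }
};

class ToyCryptoProvider
{
  public:
    using Factory = std::function<std::shared_ptr<ToyCryptoImpl>()>;

    // Only static methods are public; each one goes through getInstance().
    static void registerImpl(std::string const& name, Factory f)
    {
        ToyCryptoProvider& self = getInstance();
        std::lock_guard<std::mutex> lock(self.mutex_);
        self.factories_[name] = std::move(f);
    }

    static void setDefaultProvider(std::string const& name)
    {
        ToyCryptoProvider& self = getInstance();
        std::lock_guard<std::mutex> lock(self.mutex_);
        if (self.factories_.count(name) == 0) {
            throw std::logic_error("unknown crypto provider: " + name);
        }
        self.default_ = name;
    }

    static std::shared_ptr<ToyCryptoImpl> getImpl()
    {
        ToyCryptoProvider& self = getInstance();
        std::lock_guard<std::mutex> lock(self.mutex_);
        return self.factories_.at(self.default_)();
    }

  private:
    ToyCryptoProvider()
    {
        // Built-in providers are registered here, in a fixed order.
        factories_["native"] = [] { return std::make_shared<ToyNative>(); };
        default_ = "native";
    }

    static ToyCryptoProvider& getInstance()
    {
        // Initialized exactly once, even if several threads arrive together.
        static ToyCryptoProvider instance;
        return instance;
    }

    std::mutex mutex_;
    std::map<std::string, Factory> factories_;
    std::string default_;
};
```

In this model, an application supplies its own provider with one ``registerImpl`` call followed by one ``setDefaultProvider`` call, matching the "just two calls" described above.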
|
||||
|
||||
The decision to make the crypto provider selectable at runtime was one I
|
||||
struggled with a little, but I decided to do it for various reasons.
|
||||
Allowing an end user to switch crypto providers easily could be very
|
||||
useful for reproducing a potential bug. If a user reports a bug that
|
||||
some cryptographic thing is broken, I can easily ask that person to try
|
||||
with the ``QPDF_CRYPTO_PROVIDER`` variable set to different values. The
|
||||
same could apply in the event of a performance problem. This also makes
|
||||
it easier for qpdf's own test suite to exercise code with different
|
||||
providers without having to make every program that links with qpdf
|
||||
aware of the possibility of multiple providers. In qpdf's continuous
|
||||
integration environment, the entire test suite is run for each supported
|
||||
crypto provider. This is made simple by being able to select the
|
||||
provider using an environment variable.
|
||||
|
||||
Finally, making crypto providers selectable in this way establishes a
|
||||
pattern that I may follow again in the future for stream filter
|
||||
providers. One could imagine a future enhancement where someone could
|
||||
provide their own implementations for basic filters like
|
||||
``/FlateDecode`` or for other filters that qpdf doesn't support.
|
||||
Implementing the registration functions and internal storage of
|
||||
registered providers was also easier using C++11's functional
|
||||
interfaces, which was another reason to require C++11 at this time.
|
||||
|
||||
.. _ref.packaging:
|
||||
|
||||
Notes for Packagers
|
||||
-------------------
|
||||
|
||||
If you are packaging qpdf for an operating system distribution, here are
|
||||
some things you may want to keep in mind:
|
||||
|
||||
- Starting in qpdf version 9.1.1, qpdf no longer has a runtime
|
||||
dependency on perl. This is because fix-qdf was rewritten in C++.
|
||||
However, qpdf still has a build-time dependency on perl.
|
||||
|
||||
- Make sure you are getting the intended behavior with regard to crypto
|
||||
providers. Read :ref:`ref.crypto.build` for details.
|
||||
|
||||
- Passing :samp:`--enable-show-failed-test-output` to
|
||||
:command:`./configure` will cause any failed test
|
||||
output to be written to the console. This can be very useful for
|
||||
seeing test failures generated by autobuilders where you can't access
|
||||
qtest.log after the fact.
|
||||
|
||||
- If qpdf's build environment detects the presence of autoconf and
|
||||
related tools, it will check to ensure that automatically generated
|
||||
files are up-to-date with recorded checksums and fail if it detects a
|
||||
discrepancy. This feature is intended to prevent you from
|
||||
accidentally forgetting to regenerate automatic files after modifying
|
||||
their sources. If your packaging environment automatically refreshes
|
||||
automatic files, it can cause this check to fail. Suppress qpdf's
|
||||
checks by passing :samp:`--disable-check-autofiles`
|
||||
to :command:`./configure`. This is safe since qpdf's
|
||||
:command:`autogen.sh` just runs autotools in the
|
||||
normal way.
|
||||
|
||||
- QPDF's :command:`make install` does not install
|
||||
completion files by default, but as a packager, it's good if you
|
||||
install them wherever your distribution expects such files to go. You
|
||||
can find completion files to install in the
|
||||
:file:`completions` directory.
|
||||
|
||||
- Packagers are encouraged to install the source files from the
|
||||
:file:`examples` directory along with qpdf
|
||||
development packages.
|
|
@ -0,0 +1,177 @@
|
|||
.. _ref.json:
|
||||
|
||||
QPDF JSON
|
||||
=========
|
||||
|
||||
.. _ref.json-overview:
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Beginning with qpdf version 8.3.0, the :command:`qpdf`
|
||||
command-line program can produce a JSON representation of the
|
||||
non-content data in a PDF file. It includes a dump in JSON format of all
|
||||
objects in the PDF file excluding the content of streams. This JSON
|
||||
representation makes it very easy to look in detail at the structure of
|
||||
a given PDF file, and it also provides a great way to work with PDF
|
||||
files programmatically from the command-line in languages that can't
|
||||
call or link with the qpdf library directly. Note that stream data can
|
||||
be extracted from PDF files using other qpdf command-line options.
|
||||
|
||||
.. _ref.json-guarantees:
|
||||
|
||||
JSON Guarantees
|
||||
---------------
|
||||
|
||||
The qpdf JSON representation includes a JSON serialization of the raw
|
||||
objects in the PDF file as well as some computed information in a more
|
||||
easily extracted format. QPDF provides some guarantees about its JSON
|
||||
format. These guarantees are designed to simplify the experience of a
|
||||
developer working with the JSON format.
|
||||
|
||||
Compatibility
|
||||
The top-level JSON object output is a dictionary. The JSON output
|
||||
contains various nested dictionaries and arrays. With the exception
|
||||
of dictionaries that are populated by the fields of objects from the
|
||||
file, all instances of a dictionary are guaranteed to have exactly
|
||||
the same keys. Future versions of qpdf are free to add additional
|
||||
keys but not to remove keys or change the type of object that a key
|
||||
points to. The qpdf program validates this guarantee, and in the
|
||||
unlikely event that a bug in qpdf should cause it to generate data
|
||||
that doesn't conform to this rule, it will ask you to file a bug
|
||||
report.
|
||||
|
||||
The top-level JSON structure contains a "``version``" key whose value
|
||||
is a simple integer. The value of the ``version`` key will be
|
||||
incremented if a non-compatible change is made. A non-compatible
|
||||
change would be any change that involves removal of a key, a change
|
||||
to the format of data pointed to by a key, or a semantic change that
|
||||
requires a different interpretation of a previously existing key. A
|
||||
strong effort will be made to avoid breaking compatibility.
|
||||
|
||||
Documentation
|
||||
The :command:`qpdf` command can be invoked with the
|
||||
:samp:`--json-help` option. This will output a JSON
|
||||
structure that has the same structure as the JSON output that qpdf
|
||||
generates, except that each field in the help output is a description
|
||||
of the corresponding field in the JSON output. The specific
|
||||
guarantees are as follows:
|
||||
|
||||
- A dictionary in the help output means that the corresponding
|
||||
location in the actual JSON output is also a dictionary with
|
||||
exactly the same keys; that is, no keys present in help are absent
|
||||
in the real output, and no keys will be present in the real output
|
||||
that are not in help. As a special case, if the dictionary has a
|
||||
single key whose name starts with ``<`` and ends with ``>``, it
|
||||
means that the JSON output is a dictionary that can have any keys,
|
||||
each of which conforms to the value of the special key. This is
|
||||
used for cases in which the keys of the dictionary are things like
|
||||
object IDs.
|
||||
|
||||
- A string in the help output is a description of the item that
|
||||
appears in the corresponding location of the actual output. The
|
||||
corresponding output can have any format.
|
||||
|
||||
- An array in the help output always contains a single element. It
|
||||
indicates that the corresponding location in the actual output is
|
||||
also an array, and that each element of the array has whatever
|
||||
format is implied by the single element of the help output's
|
||||
array.
|
||||
|
||||
For example, the help output includes a "``pagelabels``"
|
||||
key whose value is an array of one element. That element is a
|
||||
dictionary with keys "``index``" and "``label``". In addition to
|
||||
describing the meaning of those keys, this tells you that the actual
|
||||
JSON output will contain a ``pagelabels`` array, each of whose
|
||||
elements is a dictionary that contains an ``index`` key, a ``label``
|
||||
key, and no other keys.
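The three rules above can be sketched as a small recursive check. This is only an illustration of the stated guarantees, not code taken from qpdf; the function name ``conforms`` is made up for this example. It is deliberately strict and does not account for values that are null or for keys omitted because of :samp:`--json-keys`.

```python
def conforms(help_node, actual):
    """Return True if `actual` has the shape described by `help_node`."""
    if isinstance(help_node, str):
        # A string is only a description; the real value can have any format.
        return True
    if isinstance(help_node, list):
        # The help array holds a single template for every real element.
        if len(help_node) != 1 or not isinstance(actual, list):
            return False
        return all(conforms(help_node[0], item) for item in actual)
    if isinstance(help_node, dict):
        if not isinstance(actual, dict):
            return False
        keys = list(help_node)
        if len(keys) == 1 and keys[0].startswith("<") and keys[0].endswith(">"):
            # Special key: any keys are allowed, each value follows the template.
            template = help_node[keys[0]]
            return all(conforms(template, value) for value in actual.values())
        if set(keys) != set(actual):
            return False
        return all(conforms(help_node[key], actual[key]) for key in keys)
    return False
```

In practice, ``help_node`` would be the parsed output of :command:`qpdf --json-help` and ``actual`` the parsed output of :command:`qpdf --json`.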
|
||||
|
||||
Directness and Simplicity
|
||||
The JSON output contains the value of every object in the file, but
|
||||
it also contains some processed data. This is analogous to how qpdf's
|
||||
library interface works. The processed data is similar to the helper
|
||||
functions in that it allows you to look at certain aspects of the PDF
|
||||
file without having to understand all the nuances of the PDF
|
||||
specification, while the raw objects allow you to mine the PDF for
|
||||
anything that the higher-level interfaces are lacking.
|
||||
|
||||
.. _json.limitations:
|
||||
|
||||
Limitations of JSON Representation
|
||||
----------------------------------
|
||||
|
||||
There are a few limitations to be aware of with the JSON structure:
|
||||
|
||||
- Strings, names, and indirect object references in the original PDF
|
||||
file are all converted to strings in the JSON representation. In the
|
||||
case of a "normal" PDF file, you can tell the difference because a
|
||||
name starts with a slash (``/``), and an indirect object reference
|
||||
looks like ``n n R``, but if there were to be a string that looked
|
||||
like a name or indirect object reference, there would be no way to
|
||||
tell this from the JSON output. Note that there are certain cases
|
||||
where you know for sure what something is, such as knowing that
|
||||
dictionary keys in objects are always names and that certain things
|
||||
in the higher-level computed data are known to contain indirect
|
||||
object references.
|
||||
|
||||
- The JSON format doesn't support binary data very well. Mostly the
|
||||
details are not important, but they are presented here for
|
||||
information. When qpdf outputs a string in the JSON representation,
|
||||
it converts the string to UTF-8, assuming usual PDF string semantics.
|
||||
Specifically, if the original string is UTF-16, it is converted to
|
||||
UTF-8. Otherwise, it is assumed to have PDF doc encoding, and is
|
||||
converted to UTF-8 with that assumption. This causes strange things
|
||||
to happen to binary strings. For example, if you had the binary
|
||||
string ``<038051>``, this would be output to the JSON as ``\u0003•Q``
|
||||
because ``03`` is not a printable character and ``80`` is the bullet
|
||||
character in PDF doc encoding and is mapped to the Unicode value
|
||||
``2022``. Since ``51`` is ``Q``, it is output as is. If you wanted to
|
||||
convert back from here to a binary string, you would have to recognize
|
||||
Unicode values whose code points are higher than ``0xFF`` and map
|
||||
those back to their corresponding PDF doc encoding characters. There
|
||||
is no way to tell the difference between a Unicode string that was
|
||||
originally encoded as UTF-16 or one that was converted from PDF doc
|
||||
encoding. In other words, it's best if you don't try to use the JSON
|
||||
format to extract binary strings from the PDF file, but if you really
|
||||
had to, it could be done. Note that qpdf's
|
||||
:samp:`--show-object` option does not have this
|
||||
limitation and will reveal the string as encoded in the original
|
||||
file.
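As a rough illustration of the reverse mapping described in the last item, the following sketch turns a decoded JSON string back into bytes. It assumes the string was converted from PDF doc encoding, which, as noted, cannot be known from the JSON alone. The lookup table is intentionally partial: it contains only the bullet character from the example above, and a real implementation would need the full PDF doc encoding table. The names used here are hypothetical.

```python
import json

# Partial table (assumption: only the entry from the example in the text).
# Maps Unicode code points back to PDF doc encoding byte values.
PDFDOC_FROM_UNICODE = {0x2022: 0x80}


def json_string_to_bytes(text):
    """Best-effort conversion of a qpdf JSON string back to binary."""
    out = bytearray()
    for char in text:
        code_point = ord(char)
        if code_point <= 0xFF:
            # Treated as a byte value as-is, following the text above.
            out.append(code_point)
        elif code_point in PDFDOC_FROM_UNICODE:
            out.append(PDFDOC_FROM_UNICODE[code_point])
        else:
            raise ValueError("no PDF doc encoding byte for U+%04X" % code_point)
    return bytes(out)


decoded = json.loads('"\\u0003\\u2022Q"')
binary = json_string_to_bytes(decoded)
```

For anything beyond experimentation, :samp:`--show-object` remains the more reliable route.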
|
||||
|
||||
.. _json.considerations:
|
||||
|
||||
JSON: Special Considerations
|
||||
----------------------------
|
||||
|
||||
For the most part, the built-in JSON help tells you everything you need
|
||||
to know about the JSON format, but there are a few non-obvious things to
|
||||
be aware of:
|
||||
|
||||
- While qpdf guarantees that keys present in the help will be present
|
||||
in the output, those fields may be null or empty if the information
|
||||
is not known or absent in the file. Also, if you specify
|
||||
:samp:`--json-keys`, the keys that are not listed
|
||||
will be excluded entirely except for those that
|
||||
:samp:`--json-help` says are always present.
|
||||
|
||||
- In a few places, there are keys with names containing
|
||||
``pageposfrom1``. The values of these keys are null or an integer. If
|
||||
an integer, they point to a page index within the file numbering from
|
||||
1. Note that JSON indexes from 0, and you would also use 0-based
|
||||
indexing using the API. However, 1-based indexing is easier in this
|
||||
case because the command-line syntax for specifying page ranges is
|
||||
1-based. If you were going to write a program that looked through the
|
||||
JSON for information about specific pages and then use the
|
||||
command-line to extract those pages, 1-based indexing is easier.
|
||||
Besides, it's more convenient to subtract 1 in a real programming
|
||||
language than it is to add 1 in shell code.
|
||||
|
||||
- The image information included in the ``page`` section of the JSON
|
||||
output includes the key "``filterable``". Note that the value of this
|
||||
field may depend on the :samp:`--decode-level` that
|
||||
you invoke qpdf with. The JSON output includes a top-level key
|
||||
"``parameters``" that indicates the decode level used for computing
|
||||
whether a stream was filterable. For example, JPEG images will be
|
||||
shown as not filterable by default, but they will be shown as
|
||||
filterable if you run :command:`qpdf --json
|
||||
--decode-level=all`.
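To illustrate the ``pageposfrom1`` item above, here is a minimal sketch that collects every integer stored under a key whose name contains ``pageposfrom1`` and builds a command line that extracts those pages. The helper names are made up, and the sample key names in the usage note are only illustrative; check :samp:`--json-help` for the real structure.

```python
def collect_page_positions(node, found=None):
    """Gather integer values of all keys whose name contains 'pageposfrom1'."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if "pageposfrom1" in key and isinstance(value, int):
                found.append(value)
            else:
                # Null values and unrelated keys are simply walked or skipped.
                collect_page_positions(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_page_positions(item, found)
    return found


def qpdf_page_args(infile, outfile, positions):
    """Build an argument list; the 1-based values are used unchanged."""
    page_range = ",".join(str(p) for p in sorted(set(positions)))
    return ["qpdf", "--empty", "--pages", infile, page_range, "--", outfile]


def to_zero_based(positions):
    """Convert to 0-based indexes for use with an API or a JSON array."""
    return [p - 1 for p in positions]
```

The resulting list can be handed to ``subprocess.run``. Because the values are already 1-based, no arithmetic is needed on the command-line side.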
|
|
@ -0,0 +1,91 @@
|
|||
.. _ref.using-library:
|
||||
|
||||
Using the QPDF Library
|
||||
======================
|
||||
|
||||
.. _ref.using.from-cxx:
|
||||
|
||||
Using QPDF from C++
|
||||
-------------------
|
||||
|
||||
The source tree for the qpdf package has an
|
||||
:file:`examples` directory that contains a few
|
||||
example programs. The :file:`qpdf/qpdf.cc` source
|
||||
file also serves as a useful example since it exercises almost all of
|
||||
the qpdf library's public interface. The best source of documentation on
|
||||
the library itself is reading comments in
|
||||
:file:`include/qpdf/QPDF.hh`,
|
||||
:file:`include/qpdf/QPDFWriter.hh`, and
|
||||
:file:`include/qpdf/QPDFObjectHandle.hh`.
|
||||
|
||||
All header files are installed in the
|
||||
:file:`include/qpdf` directory. It is recommended that
|
||||
you use ``#include <qpdf/QPDF.hh>`` rather than adding
|
||||
:file:`include/qpdf` to your include path.
|
||||
|
||||
When linking against the qpdf static library, you may also need to
|
||||
specify ``-lz -ljpeg`` on your link command. If your system understands
|
||||
how to read libtool :file:`.la` files, this may not
|
||||
be necessary.
|
||||
|
||||
The qpdf library is safe to use in a multithreaded program, but no
|
||||
individual ``QPDF`` object instance (including ``QPDF``,
|
||||
``QPDFObjectHandle``, or ``QPDFWriter``) can be used in more than one
|
||||
thread at a time. Multiple threads may simultaneously work with
|
||||
different instances of these and all other QPDF objects.
|
||||
|
||||
.. _ref.using.other-languages:
|
||||
|
||||
Using QPDF from other languages
|
||||
-------------------------------
|
||||
|
||||
The qpdf library is implemented in C++, which makes it hard to use
|
||||
directly in other languages. There are a few things that can help.
|
||||
|
||||
"C"
|
||||
The qpdf library includes a "C" language interface that provides a
|
||||
subset of the overall capabilities. The header file
|
||||
:file:`qpdf/qpdf-c.h` includes information about
|
||||
its use. As long as you use a C++ linker, you can link C programs
|
||||
with qpdf and use the C API. For languages that can directly load
|
||||
methods from a shared library, the C API can also be useful. People
|
||||
have reported success using the C API from other languages on Windows
|
||||
by directly calling functions in the DLL.
|
||||
|
||||
Python
|
||||
A Python module called
|
||||
`pikepdf <https://pypi.org/project/pikepdf/>`__ provides a clean and
|
||||
highly functional set of Python bindings to the qpdf library. Using
|
||||
pikepdf, you can work with PDF files in a natural way and combine
|
||||
qpdf's capabilities with other functionality provided by Python's
|
||||
rich standard library and available modules.
|
||||
|
||||
Other Languages
|
||||
Starting with version 8.3.0, the :command:`qpdf`
|
||||
command-line tool can produce a JSON representation of the PDF file's
|
||||
non-content data. This can facilitate interacting programmatically
|
||||
with PDF files through qpdf's command line interface. For more
|
||||
information, please see :ref:`ref.json`.
|
||||
|
||||
.. _ref.unicode-files:
|
||||
|
||||
A Note About Unicode File Names
|
||||
-------------------------------
|
||||
|
||||
When strings are passed to qpdf library routines either as ``char*`` or
|
||||
as ``std::string``, they are treated as byte arrays except where
|
||||
otherwise noted. When Unicode is desired, qpdf wants UTF-8 unless
|
||||
otherwise noted in comments in header files. In modern UNIX/Linux
|
||||
environments, this generally does the right thing. In Windows, it's a
|
||||
bit more complicated. Starting in qpdf 8.4.0, passwords that contain
|
||||
Unicode characters are handled much better, and starting in qpdf 8.4.1,
|
||||
the library attempts to properly handle Unicode characters in filenames.
|
||||
In particular, in Windows, if a UTF-8 encoded string is used as a
|
||||
filename in either ``QPDF`` or ``QPDFWriter``, it is internally
|
||||
converted to ``wchar_t*``, and Unicode-aware Windows APIs are used. As
|
||||
such, qpdf will generally operate properly on files with non-ASCII
|
||||
characters in their names as long as the filenames are UTF-8 encoded for
|
||||
passing into the qpdf library API, but there are still some rough edges,
|
||||
such as the encoding of the filenames in error messages or CLI output
|
||||
messages. Patches or bug reports are welcome for any continuing issues
|
||||
with Unicode file names in Windows.
|
|
@ -0,0 +1,12 @@
|
|||
.. _ref.license:
|
||||
|
||||
License
|
||||
=======
|
||||
|
||||
QPDF is licensed under `the Apache License, Version 2.0
|
||||
<http://www.apache.org/licenses/LICENSE-2.0>`__ (the "License").
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
||||
implied. See the License for the specific language governing
|
||||
permissions and limitations under the License.
|
|
@ -0,0 +1,197 @@
|
|||
.. _ref.linearization:
|
||||
|
||||
Linearization
|
||||
=============
|
||||
|
||||
This chapter describes how ``QPDF`` and ``QPDFWriter`` implement
|
||||
creation and processing of linearized PDFs.
|
||||
|
||||
.. _ref.linearization-strategy:
|
||||
|
||||
Basic Strategy for Linearization
|
||||
--------------------------------
|
||||
|
||||
To avoid the incestuous problem of having the qpdf library validate its
|
||||
own linearized files, we have a special linearized file checking mode
|
||||
which can be invoked via :command:`qpdf
|
||||
--check-linearization` (or :command:`qpdf
|
||||
--check`). This mode reads the linearization parameter
|
||||
dictionary and the hint streams and validates that object ordering,
|
||||
parameters, and hint stream contents are correct. The validation code
|
||||
was first tested against linearized files created by external tools
|
||||
(Acrobat and pdlin) and then used to validate files created by
|
||||
``QPDFWriter`` itself.
|
||||
|
||||
.. _ref.linearized.preparation:
|
||||
|
||||
Preparing For Linearization
|
||||
---------------------------
|
||||
|
||||
Before creating a linearized PDF file from any other PDF file, the PDF
|
||||
file must be altered such that all page attributes are propagated down
|
||||
to the page level (and not inherited from parents in the ``/Pages``
|
||||
tree). We also have to know which objects refer to which other objects,
|
||||
being concerned with page boundaries and a few other cases. We refer to
|
||||
this part of preparing the PDF file as
|
||||
*optimization*, discussed in
|
||||
:ref:`ref.optimization`. Note that, in this context, the
|
||||
term *optimization* is a qpdf term, and the
|
||||
term *linearization* is a term from the PDF
|
||||
specification. Do not be confused by the fact that many applications
|
||||
refer to linearization as optimization or web optimization.
|
||||
|
||||
When creating linearized PDF files from optimized PDF files, there are
|
||||
really only a few issues that need to be dealt with:
|
||||
|
||||
- Creation of hints tables
|
||||
|
||||
- Placing objects in the correct order
|
||||
|
||||
- Filling in offsets and byte sizes
|
||||
|
||||
.. _ref.optimization:
|
||||
|
||||
Optimization
|
||||
------------
|
||||
|
||||
In order to perform various operations such as linearization and
|
||||
splitting files into pages, it is necessary to know which objects are
|
||||
referenced by which pages, page thumbnails, and root and trailer
|
||||
dictionary keys. It is also necessary to ensure that all page-level
|
||||
attributes appear directly at the page level and are not inherited from
|
||||
parents in the pages tree.
|
||||
|
||||
We refer to the process of enforcing these constraints as
|
||||
*optimization*. As mentioned above, note
|
||||
that some applications refer to linearization as optimization. Although
|
||||
this optimization was initially motivated by the need to create
|
||||
linearized files, we are using these terms separately.
|
||||
|
||||
PDF file optimization is implemented in the
|
||||
:file:`QPDF_optimization.cc` source file. That file
|
||||
is richly commented and serves as the primary reference for the
|
||||
optimization process.
|
||||
|
||||
After optimization has been completed, the private member variables
|
||||
``obj_user_to_objects`` and ``object_to_obj_users`` in ``QPDF`` have
|
||||
been populated. Any object that has more than one value in the
|
||||
``object_to_obj_users`` table is shared. Any object that has exactly one
|
||||
value in the ``object_to_obj_users`` table is private. To find all the
|
||||
private objects in a page or a trailer or root dictionary key, one
|
||||
merely has to make this determination for each element in the
|
||||
``obj_user_to_objects`` table for the given page or key.
|
||||
|
||||
Note that pages and thumbnails have different object user types, so the
|
||||
above test on a page will not include objects referenced by the page's
|
||||
thumbnail dictionary and nothing else.
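The shared/private test described above can be sketched with plain dictionaries. This is not the C++ implementation; it only borrows the two table names, and the way object users are represented here (tuples such as ``("page", 0)``) is an assumption made for illustration.

```python
from collections import defaultdict


def invert(obj_user_to_objects):
    """Build object_to_obj_users from obj_user_to_objects."""
    object_to_obj_users = defaultdict(set)
    for user, objects in obj_user_to_objects.items():
        for obj in objects:
            object_to_obj_users[obj].add(user)
    return dict(object_to_obj_users)


def private_objects(user, obj_user_to_objects, object_to_obj_users):
    """Objects reachable from `user` that have exactly one user."""
    return {
        obj
        for obj in obj_user_to_objects[user]
        if len(object_to_obj_users[obj]) == 1
    }


# Hypothetical data: object 5 is used by two pages, so it is shared.
users = {
    ("page", 0): {1, 2, 5},
    ("page", 1): {3, 5},
    ("thumb", 0): {4},
}
by_object = invert(users)
page0_private = private_objects(("page", 0), users, by_object)
```

Because the thumbnail is a separate user, object 4 is private to ``("thumb", 0)`` and does not show up in the result for page 0.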
|
||||
|
||||
.. _ref.linearization.writing:
|
||||
|
||||
Writing Linearized Files
|
||||
------------------------
|
||||
|
||||
We will create files with only primary hint streams. We will never write
|
||||
overflow hint streams. (As of PDF version 1.4, Acrobat doesn't either,
|
||||
and they are never necessary.) The hint streams contain offset
|
||||
information to objects that point to where they would be if the hint
|
||||
stream were not present. This means that we have to calculate all object
|
||||
positions before we can generate and write the hint table. This means
|
||||
that we have to generate the file in two passes. To make this reliable,
|
||||
``QPDFWriter`` in linearization mode invokes exactly the same code twice
|
||||
to write the file to a pipeline.
|
||||
|
||||
In the first pass, the target pipeline is a count pipeline chained to a
|
||||
discard pipeline. The count pipeline simply passes its data through to
|
||||
the next pipeline in the chain but can return the number of bytes passed
|
||||
through it at any intermediate point. The discard pipeline is an end of
|
||||
line pipeline that just throws its data away. The hint stream is not
|
||||
written and dummy values with adequate padding are stored in the first
|
||||
cross reference table, linearization parameter dictionary, and /Prev key
|
||||
of the first trailer dictionary. All the offset, length, object
|
||||
renumbering information, and anything else we need for the second pass
|
||||
is stored.
|
||||
|
||||
At the end of the first pass, this information is passed to the ``QPDF``
|
||||
class which constructs a compressed hint stream in a memory buffer and
|
||||
returns it. ``QPDFWriter`` uses this information to write a complete
|
||||
hint stream object into a memory buffer. At this point, the length of
|
||||
the hint stream is known.
|
||||
|
||||
In the second pass, the end of the pipeline chain is a regular file
|
||||
instead of a discard pipeline, and we have known values for all the
|
||||
offsets and lengths that we didn't have in the first pass. We have to
|
||||
adjust offsets that appear after the start of the hint stream by the
|
||||
length of the hint stream, which is known. Anything that is of variable
|
||||
length is padded, with the padding code surrounding any writing code
|
||||
that differs in the two passes. This ensures that changes to the way
|
||||
things are represented never result in offsets that were gathered
|
||||
during the first pass becoming incorrect for the second pass.
|
||||
|
||||
Using this strategy, we can write linearized files to a non-seekable
|
||||
output stream with only a single pass to disk or wherever the output is
|
||||
going.
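The two-pass idea can be shown with a toy file format. The sketch below is not how ``QPDFWriter`` is written; the class names, the file layout, and the hint format are all invented for illustration. What it does preserve is the shape of the technique: the same writing code runs twice, the first pass goes through a counting pipeline into a discarding one, variable values are written in padded fixed-width fields, and offsets after the hint data are shifted by its length.

```python
import io


class Discard:
    """Terminal pipeline that throws its data away."""

    def write(self, data):
        pass


class Count:
    """Pass data to the next pipeline while counting bytes written."""

    def __init__(self, next_pipeline):
        self.next_pipeline = next_pipeline
        self.count = 0

    def write(self, data):
        self.count += len(data)
        self.next_pipeline.write(data)


def write_pass(sink, objects, offsets, hint):
    """Write the whole toy file once and return each object's offset."""
    pipe = Count(sink)
    pipe.write(b"%toy-file\n")
    for offset in offsets:
        # Fixed-width, zero-padded fields keep later offsets stable.
        pipe.write(b"%010d\n" % offset)
    pipe.write(hint)
    seen = []
    for obj in objects:
        seen.append(pipe.count)
        pipe.write(obj)
    return seen


def write_two_pass(sink, objects):
    # Pass 1: dummy offsets, no hint data, output discarded.
    first = write_pass(Discard(), objects, [0] * len(objects), b"")
    # The hint records where objects would be if the hint were absent.
    hint = b"hint " + b",".join(b"%d" % off for off in first) + b"\n"
    # Everything written after the hint shifts by the hint's length.
    final = [off + len(hint) for off in first]
    # Pass 2: same code, real values, real destination.
    second = write_pass(sink, objects, final, hint)
    assert second == final
    return final
```

Only the second pass touches the real destination, which is why a non-seekable output works.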
|
||||
|
||||
.. _ref.linearization-data:
|
||||
|
||||
Calculating Linearization Data
|
||||
------------------------------
|
||||
|
||||
Once a file is optimized, we have information about which objects access
|
||||
which other objects. We can then process these tables to decide which
|
||||
part (as described in "Linearized PDF Document Structure" in the PDF
|
||||
specification) each object is contained within. This tells us the exact
|
||||
order in which objects are written. The ``QPDFWriter`` class asks for
|
||||
this information and enqueues objects for writing in the proper order.
|
||||
It also turns on a check that causes an exception to be thrown if an
|
||||
object is encountered that has not already been queued. (This could
|
||||
happen only if there were a bug in the traversal code used to calculate
|
||||
the linearization data.)
|
||||
|
||||
.. _ref.linearization-issues:
|
||||
|
||||
Known Issues with Linearization
|
||||
-------------------------------
|
||||
|
||||
There are a handful of known issues with this linearization code. These
|
||||
issues do not appear to impact the behavior of linearized files which
|
||||
still work as intended: it is possible for a web browser to begin to
|
||||
display them before they are fully downloaded. In fact, it seems that
|
||||
various other programs that create linearized files have many of these
|
||||
same issues. These items make reference to terminology used in the
|
||||
linearization appendix of the PDF specification.
|
||||
|
||||
- Thread Dictionary information keys appear in part 4 with the rest of
|
||||
Threads instead of in part 9. Objects in part 9 are not grouped
|
||||
together functionally.
|
||||
|
||||
- We are not calculating numerators for shared object positions within
|
||||
content streams or interleaving them within content streams.
|
||||
|
||||
- We generate only page offset, shared object, and outline hint tables.
|
||||
It would be relatively easy to add some additional tables. We gather
|
||||
most of the information needed to create thumbnail hint tables. There
|
||||
are comments in the code about this.
|
||||
|
||||
.. _ref.linearization-debugging:
|
||||
|
||||
Debugging Note
|
||||
--------------
|
||||
|
||||
The :command:`qpdf --show-linearization` command can show
|
||||
the complete contents of linearization hint streams. To look at the raw
|
||||
data, you can extract the filtered contents of the linearization hint
|
||||
tables using :command:`qpdf --show-object=n
|
||||
--filtered-stream-data`. Then, to convert this into a bit
|
||||
stream (since linearization tables are bit streams written without
|
||||
regard to byte boundaries), you can pipe the resulting data through the
|
||||
following perl code:
|
||||
|
||||
.. code-block:: perl
|
||||
|
||||
use bytes;
|
||||
binmode STDIN;
|
||||
undef $/;
|
||||
my $a = <STDIN>;
|
||||
my @ch = split(//, $a);
|
||||
map { printf("%08b", ord($_)) } @ch;
|
||||
print "\n";
|
|
@ -0,0 +1,186 @@
|
|||
.. _ref.object-and-xref-streams:
|
||||
|
||||
Object and Cross-Reference Streams
|
||||
==================================
|
||||
|
||||
This chapter provides information about the implementation of object
|
||||
stream and cross-reference stream support in qpdf.
|
||||
|
||||
.. _ref.object-streams:
|
||||
|
||||
Object Streams
|
||||
--------------
|
||||
|
||||
Object streams can contain any regular object except the following:
|
||||
|
||||
- stream objects
|
||||
|
||||
- objects with generation > 0
|
||||
|
||||
- the encryption dictionary
|
||||
|
||||
- objects containing the /Length of another stream
|
||||
|
||||
In addition, Adobe Reader (at least as of version 8.0.0) appears to not
|
||||
be able to handle having the document catalog appear in an object stream
|
||||
if the file is encrypted, though this is not specifically disallowed by
|
||||
the specification.
|
||||
|
||||
There are additional restrictions for linearized files. See
|
||||
:ref:`ref.object-streams-linearization` for details.
|
||||
|
||||
The PDF specification refers to objects in object streams as "compressed
|
||||
objects" regardless of whether the object stream is compressed.
|
||||
|
||||
The generation number of every object in an object stream must be zero.
|
||||
It is possible to delete and replace an object in an object stream with
|
||||
a regular object.
|
||||
|
||||
The object stream dictionary has the following keys:
|
||||
|
||||
- ``/N``: number of objects
|
||||
|
||||
- ``/First``: byte offset of first object
|
||||
|
||||
- ``/Extends``: indirect reference to stream that this extends
|
||||
|
||||
Stream collections are formed with ``/Extends``. They must form a
|
||||
directed acyclic graph. These can be used for semantic information and
|
||||
are not meaningful to the PDF document's syntactic structure. Although
|
||||
qpdf preserves stream collections, it never generates them and doesn't
|
||||
make use of this information in any way.
|
||||
|
||||
The specification recommends limiting the number of objects in an object
|
||||
stream for efficiency in reading and decoding. Acrobat 6 uses no more
|
||||
than 100 objects per object stream for linearized files and no more than 200
|
||||
objects per stream for non-linearized files. ``QPDFWriter``, in object
|
||||
stream generation mode, never puts more than 100 objects in an object
|
||||
stream.
|
||||
|
||||
Object stream contents consist of *N* pairs of integers, each of which
|
||||
is the object number and the byte offset of the object relative to the
|
||||
first object in the stream, followed by the objects themselves,
|
||||
concatenated.
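As an illustration of this layout, the following sketch splits already-decoded object stream data into its objects, given the ``/N`` and ``/First`` values from the stream dictionary. The function name is hypothetical, and the sketch assumes that offsets appear in increasing order so that each object ends where the next one begins.

```python
def split_object_stream(data, n, first):
    """Return {object number: raw object bytes} for decoded stream data.

    data  -- the stream contents after any filters have been applied
    n     -- value of /N (number of objects)
    first -- value of /First (byte offset of the first object)
    """
    numbers = [int(token) for token in data[:first].split()[: 2 * n]]
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    result = {}
    for position, (object_number, offset) in enumerate(pairs):
        start = first + offset
        if position + 1 < len(pairs):
            # Assumption: the next object's offset marks the end of this one.
            end = first + pairs[position + 1][1]
        else:
            end = len(data)
        result[object_number] = data[start:end].strip()
    return result


header = b"11 0 12 9 "
sample = header + b"<</A 1>> [1 2 3]"
parsed = split_object_stream(sample, n=2, first=len(header))
```

The generation number does not appear anywhere in the data because it is always zero for objects in object streams.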
|
||||
|
||||
.. _ref.xref-streams:
|
||||
|
||||
Cross-Reference Streams
|
||||
-----------------------
|
||||
|
||||
For non-hybrid files, the value following ``startxref`` is the byte
|
||||
offset to the xref stream rather than the word ``xref``.
|
||||
|
||||
For hybrid files (files containing both xref tables and cross-reference
|
||||
streams), the xref table's trailer dictionary contains the key
|
||||
``/XRefStm`` whose value is the byte offset to a cross-reference stream
|
||||
that supplements the xref table. A PDF 1.5-compliant application should
|
||||
read the xref table first. Then it should replace any object that it has
|
||||
already seen with any defined in the xref stream. Then it should follow
|
||||
any ``/Prev`` pointer in the original xref table's trailer dictionary.
|
||||
The specification is not clear about what should be done, if anything,
|
||||
with a ``/Prev`` pointer in the xref stream referenced by an xref table.
|
||||
The ``QPDF`` class ignores it, which is probably reasonable since, if
|
||||
this case were to appear for any sensible PDF file, the previous xref
|
||||
table would probably have a corresponding ``/XRefStm`` pointer of its
|
||||
own. For example, if a hybrid file were appended, the appended section
|
||||
would have its own xref table and ``/XRefStm``. The appended xref table
|
||||
would point to the previous xref table which would point to the
|
||||
``/XRefStm``, meaning that the new ``/XRefStm`` doesn't have to point to
|
||||
it.
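The reading order for hybrid files described above can be sketched as follows. This is a simplified model and not the ``QPDF`` class's actual code: each section is represented as a dictionary with a ``table`` mapping and an optional ``xrefstm`` mapping, and the list is assumed to be ordered newest first, as if obtained by following ``/Prev`` from the last xref table. All names are invented for illustration.

```python
def resolve_hybrid(sections):
    """Merge xref information from a chain of hybrid-file sections.

    sections -- list ordered newest first; each item has a 'table' dict
                and optionally an 'xrefstm' dict, both mapping object
                numbers to entries.
    """
    resolved = {}
    for section in sections:
        # Read the xref table first, then let the stream named by
        # /XRefStm replace anything already seen in that table.
        merged = dict(section["table"])
        merged.update(section.get("xrefstm", {}))
        # Entries from newer sections win over ones reached via /Prev.
        for object_number, entry in merged.items():
            resolved.setdefault(object_number, entry)
    return resolved
```

Any ``/Prev`` inside the cross-reference stream itself is ignored in this model, matching the behavior described for the ``QPDF`` class.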
|
||||
|
||||
Since xref streams must be read very early, they may not be encrypted,
|
||||
and they may not contain indirect objects for keys required to read them,
|
||||
which are these:
|
||||
|
||||
- ``/Type``: value ``/XRef``
|
||||
|
||||
- ``/Size``: value *n+1*: where *n* is highest object number (same as
|
||||
``/Size`` in the trailer dictionary)
|
||||
|
||||
- ``/Index`` (optional): value
|
||||
``[:samp:`{n count}` ...]`` used to determine
|
||||
which objects' information is stored in this stream. The default is
|
||||
``[0 /Size]``.
|
||||
|
||||
- ``/Prev``: value :samp:`{offset}`: byte
|
||||
offset of previous xref stream (same as ``/Prev`` in the trailer
|
||||
dictionary)
|
||||
|
||||
- ``/W [...]``: sizes of each field in the xref table
|
||||
|
||||
The other fields in the xref stream, which may be indirect if desired,
|
||||
are the union of those from the xref table's trailer dictionary.
|
||||
|
||||
.. _ref.xref-stream-data:
|
||||
|
||||
Cross-Reference Stream Data
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The stream data is binary and encoded in big-endian byte order. Entries
|
||||
are concatenated, and each entry has a length equal to the total of the
|
||||
entries in ``/W`` above. Each entry consists of one or more fields, the
|
||||
first of which is the type of the field. The number of bytes for each
|
||||
field is given by ``/W`` above. A 0 in ``/W`` indicates that the field
|
||||
is omitted and has the default value. The default value for the field
|
||||
type is "``1``". All other default values are "``0``".
|
||||
|
||||
PDF 1.5 has three field types:
|
||||
|
||||
- 0: for free objects. Format: ``0 obj next-generation``, same as the
|
||||
free table in a traditional cross-reference table
|
||||
|
||||
- 1: regular non-compressed object. Format: ``1 offset generation``
|
||||
|
||||
- 2: for objects in object streams. Format: ``2 object-stream-number
|
||||
index``, the number of object stream containing the object and the
|
||||
index within the object stream of the object.
|
||||
|
||||
It seems standard to have the first entry in the table be ``0 0 0``
|
||||
instead of ``0 0 ffff`` if there are no deleted objects.
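The encoding just described can be decoded in a few lines. The sketch below is only an illustration with a made-up function name; it takes the already-decoded stream data and the ``/W`` array and returns one tuple of three fields per entry, applying the defaults for omitted fields.

```python
def decode_xref_stream(data, widths):
    """Decode cross-reference stream data into (type, field2, field3) tuples.

    data   -- stream contents after filters have been applied
    widths -- the /W array, e.g. [1, 2, 1]
    """
    entry_length = sum(widths)
    if entry_length == 0 or len(data) % entry_length != 0:
        raise ValueError("data length is not a multiple of the entry length")
    entries = []
    for start in range(0, len(data), entry_length):
        fields = []
        position = start
        for index, width in enumerate(widths):
            if width == 0:
                # Omitted field: the type defaults to 1, the others to 0.
                fields.append(1 if index == 0 else 0)
            else:
                chunk = data[position:position + width]
                fields.append(int.from_bytes(chunk, "big"))
            position += width
        entries.append(tuple(fields))
    return entries


raw = bytes([0, 0, 0, 0, 1, 0, 0x11, 0, 2, 0, 5, 3])
decoded_entries = decode_xref_stream(raw, [1, 2, 1])
```

Mapping entries to object numbers additionally requires ``/Index``, or ``[0 /Size]`` when it is absent; that step is left out here.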
|
||||
|
||||
.. _ref.object-streams-linearization:
|
||||
|
||||
Implications for Linearized Files
|
||||
---------------------------------
|
||||
|
||||
For linearized files, the linearization dictionary, document catalog,
|
||||
and page objects may not be contained in object streams.
|
||||
|
||||
Objects stored within object streams are given the highest range of
|
||||
object numbers within the main and first-page cross-reference sections.
|
||||
|
||||
It is okay to use cross-reference streams in place of regular xref
|
||||
tables. There are no special considerations.
|
||||
|
||||
Hint data refers to object streams themselves, not the objects in the
|
||||
streams. Shared object references should also be made to the object
|
||||
streams. There are no references in any hint tables to the object numbers
|
||||
of compressed objects (objects within object streams).
|
||||
|
||||
When numbering objects, all shared objects within both the first and
|
||||
second halves of the linearized files must be numbered consecutively
|
||||
after all normal uncompressed objects in that half.
|
||||
|
||||
.. _ref.object-stream-implementation:
|
||||
|
||||
Implementation Notes
|
||||
--------------------
|
||||
|
||||
There are three modes for writing object streams:
|
||||
:samp:`disable`, :samp:`preserve`, and
|
||||
:samp:`generate`. In disable mode, we do not generate
|
||||
any object streams, and we also generate an xref table rather than xref
|
||||
streams. This can be used to generate PDF files that are viewable with
|
||||
older readers. In preserve mode, we write object streams such that
|
||||
written object streams contain the same objects and ``/Extends``
|
||||
relationships as in the original file. This is equal to disable if the
|
||||
file has no object streams. In generate, we create object streams
|
||||
ourselves by grouping objects that are allowed in object streams
|
||||
together in sets of no more than 100 objects. We also ensure that the
|
||||
PDF version is at least 1.5 in generate mode, but we preserve the
|
||||
version header in the other modes. The default is
|
||||
:samp:`preserve`.
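The grouping done in generate mode might look roughly like the sketch below. The text only says that eligible objects are grouped in sets of no more than 100; the simple sequential batching here, the dictionary-based object records, and the field names are all assumptions made for illustration and do not reflect how ``QPDFWriter`` represents objects.

```python
def eligible(obj):
    """Apply the object stream restrictions listed earlier in this chapter."""
    return not (
        obj.get("is_stream", False)
        or obj.get("generation", 0) > 0
        or obj.get("is_encryption_dict", False)
        or obj.get("is_stream_length", False)
    )


def group_into_object_streams(objects, limit=100):
    """Batch eligible object ids into groups of at most `limit`."""
    candidates = [obj["id"] for obj in objects if eligible(obj)]
    return [
        candidates[start:start + limit]
        for start in range(0, len(candidates), limit)
    ]


# Hypothetical input: 250 ordinary objects plus one stream object.
sample_objects = [{"id": number} for number in range(1, 251)]
sample_objects.append({"id": 251, "is_stream": True})
groups = group_into_object_streams(sample_objects)
```

Objects that fail the test are written as regular objects outside any object stream.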
|
||||
|
||||
We do not support creation of hybrid files. When we write files, even in
|
||||
preserve mode, we will lose any xref tables and merge any appended
|
||||
sections.
|
|
@ -0,0 +1,33 @@
|
|||
.. _ref.overview:
|
||||
|
||||
What is QPDF?
|
||||
=============
|
||||
|
||||
QPDF is a program and C++ library for structural, content-preserving
|
||||
transformations on PDF files. QPDF's website is located at
|
||||
https://qpdf.sourceforge.io/. QPDF's source code is hosted on GitHub
|
||||
at https://github.com/qpdf/qpdf.
|
||||
|
||||
QPDF provides many useful capabilities to developers of PDF-producing
|
||||
software or for people who just want to look at the innards of a PDF
|
||||
file to learn more about how PDF files work. With QPDF, it is possible to
|
||||
copy objects from one PDF file into another and to manipulate the list
|
||||
of pages in a PDF file. This makes it possible to merge and split PDF
|
||||
files. The QPDF library also makes it possible for you to create PDF
|
||||
files from scratch. In this mode, you are responsible for supplying
|
||||
all the contents of the file, while the QPDF library takes care of all
|
||||
the syntactical representation of the objects, creation of
|
||||
cross-reference tables and, if you use them, object streams, encryption,
|
||||
linearization, and other syntactic details. You are still responsible
|
||||
for generating PDF content on your own.
|
||||
|
||||
QPDF has been designed with very few external dependencies, and it is
|
||||
intentionally very lightweight. QPDF is *not* a PDF content creation
|
||||
library, a PDF viewer, or a program capable of converting PDF into other
|
||||
formats. In particular, QPDF knows nothing about the semantics of PDF
|
||||
content streams. If you are looking for something that can do that, you
|
||||
should look elsewhere. However, once you have a valid PDF file, QPDF can
|
||||
be used to transform that file in ways that perhaps your original PDF
|
||||
creation tool can't handle. For example, many programs generate simple PDF
|
||||
files but can't password-protect them, web-optimize them, or perform
|
||||
other transformations of that type.
|
|
@ -0,0 +1,96 @@
|
|||
.. _ref.qdf:
|
||||
|
||||
QDF Mode
|
||||
========
|
||||
|
||||
In QDF mode, qpdf creates PDF files in what we call *QDF
|
||||
form*. A PDF file in QDF form, sometimes called a QDF
|
||||
file, is a completely valid PDF file that has ``%QDF-1.0`` as its third
|
||||
line (after the PDF header and binary characters) and has certain other
|
||||
characteristics. The purpose of QDF form is to make it possible to edit
|
||||
PDF files, with some restrictions, in an ordinary text editor. This can
|
||||
be very useful for experimenting with different PDF constructs or for
|
||||
making one-off edits to PDF files (though there are other reasons why
|
||||
this may not always work). Note that QDF mode does not support
|
||||
linearized files. If you enable linearization, QDF mode is automatically
|
||||
disabled.
|
||||
|
||||
It is ordinarily very difficult to edit PDF files in a text editor for
|
||||
two reasons: most meaningful data in PDF files is compressed, and PDF
|
||||
files are full of offset and length information that makes it hard to
|
||||
add or remove data. A QDF file is organized in a manner such that, if
|
||||
edits are kept within certain constraints, the
|
||||
:command:`fix-qdf` program, distributed with qpdf, is
|
||||
able to restore edited files to a correct state. The
|
||||
:command:`fix-qdf` program takes no command-line
|
||||
arguments. It reads a possibly edited QDF file from standard input and
|
||||
writes a repaired file to standard output.
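
Because :command:`fix-qdf` is described here as a pure standard-input to standard-output filter, it can be driven from a script. The following is a minimal sketch, assuming :command:`fix-qdf` is installed and on the search path. The helper name ``repair_qdf`` and its ``command`` parameter are hypothetical and exist only for this illustration.

```python
import subprocess
from typing import Sequence


def repair_qdf(edited: bytes, command: Sequence[str] = ("fix-qdf",)) -> bytes:
    """Pipe an edited QDF file through a filter and return its output.

    The default command is fix-qdf, which is assumed to be on PATH.
    The data is passed on standard input and the repaired file is read
    from standard output; a non-zero exit status raises
    subprocess.CalledProcessError.
    """
    result = subprocess.run(
        list(command),
        input=edited,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

A caller would read the edited file in binary mode, pass the bytes to ``repair_qdf``, and write the returned bytes to a new file rather than overwriting the edited copy.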
|
||||
|
||||
The following attributes characterize a QDF file:
|
||||
|
||||
- All objects appear in numerical order in the PDF file, including when
|
||||
objects appear in object streams.
|
||||
|
||||
- Objects are printed in an easy-to-read format, and all line endings
|
||||
are normalized to UNIX line endings.
|
||||
|
||||
- Unless specifically overridden, streams appear uncompressed (when
|
||||
qpdf supports the filters and they are compressed with a non-lossy
|
||||
compression scheme), and most content streams are normalized (line
|
||||
endings are converted to just UNIX-style linefeeds).
|
||||
|
||||
- All stream lengths are represented as indirect objects, and the
|
||||
stream length object is always the next object after the stream. If
|
||||
the stream data does not end with a newline, an extra newline is
|
||||
inserted, and a special comment appears after the stream indicating
|
||||
that this has been done.
|
||||
|
||||
- If the PDF file contains object streams, for each object stream *n* that
|
||||
contains *k* objects, those objects are numbered from *n+1* through
|
||||
*n+k*, and the object number/offset pairs appear on a separate line
|
||||
for each object. Additionally, each object in the object stream is
|
||||
preceded by a comment indicating its object number and index. This
|
||||
makes it very easy to find objects in object streams.
|
||||
|
||||
- All beginnings of objects, ``stream`` tokens, ``endstream`` tokens,
|
||||
and ``endobj`` tokens appear on lines by themselves. A blank line
|
||||
follows every ``endobj`` token.
|
||||
|
||||
- If there is a cross-reference stream, it is unfiltered.
|
||||
|
||||
- Page dictionaries and page content streams are marked with special
|
||||
comments that make them easy to find.
|
||||
|
||||
- Comments precede each object indicating the object number of the
|
||||
corresponding object in the original file.
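
The object stream numbering rule in the list above (object stream *n* with *k* objects holds objects *n+1* through *n+k*) can be checked mechanically. The sketch below is illustrative only, and both function names are hypothetical.

```python
from typing import Iterable


def qdf_member_numbers(stream_number: int, member_count: int) -> range:
    """Object numbers expected inside object stream n with k members.

    In QDF form these are n+1 through n+k, in that order.
    """
    return range(stream_number + 1, stream_number + member_count + 1)


def follows_qdf_numbering(stream_number: int, member_numbers: Iterable[int]) -> bool:
    """Return True if the members are numbered n+1..n+k consecutively."""
    members = list(member_numbers)
    return members == list(qdf_member_numbers(stream_number, len(members)))


# Object stream 10 with three members should contain objects 11, 12, 13.
print(list(qdf_member_numbers(10, 3)))
print(follows_qdf_numbering(10, [11, 12, 13]))
print(follows_qdf_numbering(10, [11, 13, 14]))
```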
|
||||
|
||||
When editing a QDF file, any edits can be made as long as the above
|
||||
constraints are maintained. This means that you can freely edit a page's
|
||||
content without worrying about messing up the QDF file. It is also
|
||||
possible to add new objects so long as those objects are added after the
|
||||
last object in the file or subsequent objects are renumbered. If a QDF
|
||||
file has object streams in it, you can always add the new objects before
|
||||
the xref stream and then change the number of the xref stream, since
|
||||
nothing generally ever references it by number.
|
||||
|
||||
It is not generally practical to remove objects from QDF files without
|
||||
messing up object numbering, but if you remove all references to an
|
||||
object, you can run qpdf on the file (after running
|
||||
:command:`fix-qdf`), and qpdf will omit the now-orphaned
|
||||
object.
|
||||
|
||||
When :command:`fix-qdf` is run, it goes through the file
|
||||
and recomputes the following parts of the file:
|
||||
|
||||
- the ``/N``, ``/W``, and ``/First`` keys of all object stream
|
||||
dictionaries
|
||||
|
||||
- the pairs of numbers representing object numbers and offsets of
|
||||
objects in object streams
|
||||
|
||||
- all stream lengths
|
||||
|
||||
- the cross-reference table or cross-reference stream
|
||||
|
||||
- the offset to the cross-reference table or cross-reference stream
|
||||
following the ``startxref`` token
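
To give a feel for why these values can be recomputed after an edit, here is a minimal sketch that rebuilds a classic cross-reference table from object positions. It is not how :command:`fix-qdf` is implemented. It assumes UNIX line endings, generation 0 for every object, objects numbered consecutively from 1, and no stream data that happens to contain a line looking like an object header. The names ``object_offsets`` and ``xref_table`` are hypothetical.

```python
import re
from typing import Dict

# An object begins on its own line as "<number> <generation> obj".
OBJ_START = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)


def object_offsets(data: bytes) -> Dict[int, int]:
    """Map each object number to the byte offset of its header line."""
    return {int(m.group(1)): m.start() for m in OBJ_START.finditer(data)}


def xref_table(offsets: Dict[int, int]) -> bytes:
    """Build a classic xref table; every entry is exactly 20 bytes."""
    size = max(offsets) + 1
    lines = [b"xref\n", b"0 %d\n" % size, b"0000000000 65535 f \n"]
    for number in range(1, size):
        lines.append(b"%010d 00000 n \n" % offsets[number])
    return b"".join(lines)


body = (
    b"%PDF-1.3\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n\n"
    b"2 0 obj\n(hello)\nendobj\n\n"
)
offsets = object_offsets(body)
table = xref_table(offsets)
startxref = len(body)  # the table would be written right after the body
```

Editing the text of object 1 shifts the position of object 2, which is why the offsets have to be recomputed from the edited file rather than patched by hand.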
|
|
@ -0,0 +1,33 @@
|
|||
.. _ref.weak-crypto:
|
||||
|
||||
Weak Cryptography
|
||||
=================
|
||||
|
||||
Starting with version 10.4, qpdf is taking steps to reduce the likelihood
|
||||
of a user *accidentally* creating PDF files with insecure cryptography
|
||||
but will continue to allow creation of such files indefinitely with
|
||||
explicit acknowledgment.
|
||||
|
||||
The PDF file format makes use of RC4, which is known to be a weak
|
||||
cryptography algorithm, and MD5, which is a weak hashing algorithm. In
|
||||
version 10.4, qpdf generates warnings for some (but not all) cases of
|
||||
writing files with weak cryptography when invoked from the command line.
|
||||
These warnings can be suppressed using the
|
||||
:samp:`--allow-weak-crypto` option.
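
As an illustration, the sketch below assembles a command line that requests 40-bit encryption, which uses RC4, and acknowledges the weak algorithm explicitly. The helper name ``qpdf_rc4_encrypt_args`` is hypothetical, and the file names and passwords in the example are placeholders. The sketch only builds the argument list and does not run qpdf.

```python
from typing import List


def qpdf_rc4_encrypt_args(
    infile: str,
    outfile: str,
    user_password: str,
    owner_password: str,
    allow_weak_crypto: bool = True,
) -> List[str]:
    """Build a qpdf argument list for 40-bit (RC4) encryption.

    When allow_weak_crypto is True, --allow-weak-crypto is included so
    that the use of a weak algorithm is acknowledged explicitly.
    """
    args = ["qpdf"]
    if allow_weak_crypto:
        args.append("--allow-weak-crypto")
    args += ["--encrypt", user_password, owner_password, "40", "--", infile, outfile]
    return args


print(" ".join(qpdf_rc4_encrypt_args("in.pdf", "out.pdf", "user", "owner")))
```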
|
||||
|
||||
It is planned for qpdf version 11 to be stricter, making it an error to
|
||||
write files with insecure cryptography from the command-line tool in
|
||||
most cases without specifying the
|
||||
:samp:`--allow-weak-crypto` flag and also to require
|
||||
explicit steps when using the C++ library to enable use of insecure
|
||||
cryptography.
|
||||
|
||||
Note that qpdf must always retain support for weak cryptographic
|
||||
algorithms since this is required for reading older PDF files that use
|
||||
them. Additionally, qpdf will always retain the ability to create files
|
||||
using weak cryptographic algorithms since, as a development tool, qpdf
|
||||
explicitly supports creating older or deprecated types of PDF files
|
||||
since these are sometimes needed to test or work with older versions of
|
||||
software. Even if other cryptography libraries drop support for RC4 or
|
||||
MD5, qpdf can always fall back to its internal implementations of those
|
||||
algorithms, so they are not going to disappear from qpdf.
|