mirror of https://github.com/qpdf/qpdf.git synced 2025-01-05 08:02:11 +00:00

Update internals documentation to reflect QPDFObject split

Jay Berkenbilt 2022-09-05 10:30:27 -04:00
parent 55cc2ab680
commit ed04b80caf


@ -67,17 +67,20 @@ files.
The primary class for interacting with PDF objects is The primary class for interacting with PDF objects is
``QPDFObjectHandle``. Instances of this class can be passed around by ``QPDFObjectHandle``. Instances of this class can be passed around by
value, copied, stored in containers, etc. with very low overhead. The value, copied, stored in containers, etc. with very low overhead. The
``QPDFObjectHandle`` object contains an internal shared pointer to an ``QPDFObjectHandle`` object contains an internal shared pointer to the
underlying ``QPDFObject``. Instances of ``QPDFObjectHandle`` created underlying object. Instances of ``QPDFObjectHandle`` created by
by reading from a file will always contain a reference back to the reading from a file will always contain a reference back to the
``QPDF`` object from which they were created. A ``QPDFObjectHandle`` ``QPDF`` object from which they were created. A ``QPDFObjectHandle``
may be direct or indirect. If indirect, the ``QPDFObject`` shared may be direct or indirect. If indirect, the object is initially
pointer is initially null. In this case, the first attempt to access *unresolved*. In this case, the first attempt to access the underlying
the underlying ``QPDFObject`` will result in the ``QPDFObject`` being object will result in the object being resolved via a call to the
resolved via a call to the referenced ``QPDF`` instance. This makes it referenced ``QPDF`` instance. This makes it essentially impossible to
essentially impossible to make coding errors in which certain things make coding errors in which certain things will work for some PDF
will work for some PDF files and not for others based on which objects files and not for others based on which objects are direct and which
are direct and which objects are indirect. objects are indirect. In cases where it is necessary to know whether
an object is indirect or not, this information can be obtained from
the ``QPDFObjectHandle``. It is also possible to convert direct
objects to indirect objects and vice versa.
Instances of ``QPDFObjectHandle`` can be directly created and modified Instances of ``QPDFObjectHandle`` can be directly created and modified
using static factory methods in the ``QPDFObjectHandle`` class. There using static factory methods in the ``QPDFObjectHandle`` class. There
@ -230,43 +233,46 @@ could serve as a starting point to someone trying to understand the
implementation. There is nothing in this section that you need to know implementation. There is nothing in this section that you need to know
to use the qpdf library. to use the qpdf library.
``QPDFObject`` is the basic PDF Object class. It is an abstract base In a PDF file, objects may be direct or indirect. Direct objects are
class from which are derived classes for each type of PDF object. objects whose representations appear directly in PDF syntax. Indirect
Clients do not interact with Objects directly but instead interact with objects are references to objects by their ID. The qpdf library uses
``QPDFObjectHandle``. the ``QPDFObjectHandle`` type to hold onto objects and to abstract
away in most cases whether the object is direct or indirect.
When the ``QPDF`` class creates a new object, it dynamically allocates Internally, ``QPDFObjectHandle`` holds onto a shared pointer to the
the appropriate type of ``QPDFObject`` and immediately hands the pointer underlying object value. When a direct object is created, the
to an instance of ``QPDFObjectHandle``. The parser reads a token from ``QPDFObjectHandle`` that holds it is not associated with a ``QPDF``
the current file position. If the token is not either a dictionary or object. When an indirect object reference is created, it starts off in
array opener, an object is immediately constructed from the single token an *unresolved* state and must be associated with a ``QPDF`` object,
and the parser returns. Otherwise, the parser iterates in a special mode which is considered its *owner*. To access the actual value of the
in which it accumulates objects until it finds a balancing closer. object, the object must be *resolved*. This happens automatically when
During this process, the ``R`` keyword is recognized and an indirect the object is accessed in any way.
``QPDFObjectHandle`` may be constructed.
The ``QPDF::resolve()`` method, which is used to resolve an indirect To resolve an object, qpdf checks its object cache. If not found in
object, may be invoked from the ``QPDFObjectHandle`` class. It first the cache, it attempts to read the object from the input source
checks a cache to see whether this object has already been read. If associated with the ``QPDF`` object. If it is not found, a ``null``
not, it reads the object from the PDF file and caches it. It the object is returned. A ``null`` object is an object type, just like
returns the resulting ``QPDFObjectHandle``. The calling object handle boolean, string, number, etc. It is not a null pointer. The PDF
then replaces its ``std::shared_ptr<QPDFObject>`` with the one from the specification states that an indirect reference to an object that
newly returned ``QPDFObjectHandle``. In this way, only a single copy doesn't exist is to be treated as a ``null``. The resulting object,
of any direct object need exist and clients can access objects whether a ``null`` or the actual object that was read, is stored in
transparently without knowing or caring whether they are direct or the cache. If the object is later replaced or swapped, the underlying
indirect objects. Additionally, no object is ever read from the file object remains the same, but its value is replaced. This way, if you
more than once. That means that only the portions of the PDF file that have a ``QPDFObjectHandle`` to an indirect object and the object by
are actually needed are ever read from the input file, thus allowing that number is replaced (by calling ``QPDF::replaceObject`` or
the qpdf package to take advantage of this important design goal of ``QPDF::swapObjects``), your ``QPDFObjectHandle`` will reflect the new
PDF files. value of the object. This is consistent with what would happen to PDF
objects if you were to replace the definition of an object in the
file.
If the requested object is inside of an object stream, the object stream When reading an object from the input source, if the requested object
itself is first read into memory. Then the tokenizer reads objects from is inside of an object stream, the object stream itself is first read
the memory stream based on the offset information stored in the stream. into memory. Then the tokenizer reads objects from the memory stream
Those individual objects are cached, after which the temporary buffer based on the offset information stored in the stream. Those individual
holding the object stream contents is discarded. In this way, the first objects are cached, after which the temporary buffer holding the
time an object in an object stream is requested, all objects in the object stream contents is discarded. In this way, the first time an
stream are cached. object in an object stream is requested, all objects in the stream are
cached.
The following example should clarify how ``QPDF`` processes a simple The following example should clarify how ``QPDF`` processes a simple
file. file.
@ -287,9 +293,10 @@ file.
until it encounters ``>>``. Each object that is read is pushed onto until it encounters ``>>``. Each object that is read is pushed onto
a stack. If ``R`` is read, the last two objects on the stack are a stack. If ``R`` is read, the last two objects on the stack are
inspected. If they are integers, they are popped off the stack and inspected. If they are integers, they are popped off the stack and
their values are used to construct an indirect object handle which is their values are used to construct an indirect object handle which
then pushed onto the stack. When ``>>`` is finally read, the stack is then pushed onto the stack. When ``>>`` is finally read, the
is converted into a ``QPDF_Dictionary`` which is placed in a stack is converted into a ``QPDF_Dictionary`` (not directly
accessible through the API) which is placed in a
``QPDFObjectHandle`` and returned. ``QPDFObjectHandle`` and returned.
- The resulting dictionary is saved as the trailer dictionary. - The resulting dictionary is saved as the trailer dictionary.
@ -299,7 +306,7 @@ file.
saved. If ``/Prev`` is not present, the initial parsing process is saved. If ``/Prev`` is not present, the initial parsing process is
complete. complete.
If there is an encryption dictionary, the document's encryption - If there is an encryption dictionary, the document's encryption
parameters are initialized. parameters are initialized.
- The client requests the root object. The ``QPDF`` class gets the value of - The client requests the root object. The ``QPDF`` class gets the value of
@ -312,14 +319,103 @@ file.
object cache for an object with the root dictionary's object ID and object cache for an object with the root dictionary's object ID and
generation number. Upon not seeing it, it checks the cross reference generation number. Upon not seeing it, it checks the cross reference
table, gets the offset, and reads the object present at that offset. table, gets the offset, and reads the object present at that offset.
It stores the result in the object cache and returns the cached It stores the result in the object cache. The cache entry's value is
result. The calling ``QPDFObjectHandle`` replaces its object pointer replaced by the actual value, which causes any previously unresolved
with the one from the resolved ``QPDFObjectHandle``, verifies that it is ``QPDFObjectHandle`` objects that pointed there to now have a
a valid dictionary object, and returns the (unresolved indirect) shared copy of the actual object. Modifications through any such
``QPDFObject`` handle to the top of the Pages hierarchy. ``QPDFObjectHandle`` will be reflected in all of them. As the client
continues to request objects, the same process is followed for each
new requested object.
As the client continues to request objects, the same process is .. _object_internals:
followed for each new requested object.
QPDF Object Internals
---------------------
The internals of ``QPDFObjectHandle`` and how qpdf stores objects were
significantly rewritten for qpdf 11. Here are some additional details.
Object Internals
~~~~~~~~~~~~~~~~
The ``QPDF`` object has an object cache which contains a shared
pointer to each object that was read from the file. Changes can be
made to any of those objects through ``QPDFObjectHandle`` methods. Any
such changes are visible to all ``QPDFObjectHandle`` instances that
point to the same object. When a ``QPDF`` object is written by
``QPDFWriter`` or serialized to JSON, any changes are reflected.
Objects in qpdf 11 and Newer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The object cache in ``QPDF`` contains a shared pointer to
``QPDFValueProxy``. Any ``QPDFObjectHandle`` resolved from an indirect
reference to that object has a copy of that shared pointer. Each
``QPDFValueProxy`` object contains a shared pointer to an object of
type ``QPDFValue``. The ``QPDFValue`` type is an abstract base class.
There is an implementation for each of the basic object types (array,
dictionary, null, boolean, string, number, etc.) as well as a few
special ones including ``uninitialized``, ``unresolved``, and
``reserved``. When an object is first referenced, its underlying
``QPDFValue`` has type ``unresolved``. When the object is first
resolved, the ``QPDFValueProxy`` in the cache has its internal
``QPDFValue`` replaced with the object as read from the file. Since it
is the ``QPDFValueProxy`` object that is shared by all referencing
``QPDFObjectHandle`` objects as well as by the owning ``QPDF`` object,
this ensures that any future changes to the object, including
replacing the object with a completely different one, will be
reflected across all ``QPDFObjectHandle`` objects that reference it.
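
The sharing behavior described above can be illustrated with a small,
self-contained model. This is only a sketch and not qpdf's actual
code: ``Value``, ``ValueProxy``, ``Handle``, ``Document``, and
``setValue`` are hypothetical stand-ins for ``QPDFValue``,
``QPDFValueProxy``, ``QPDFObjectHandle``, ``QPDF``, and the act of
resolving or replacing an object. It assumes only what the paragraph
states: every handle and the cache share one proxy, and only the
proxy's inner value is ever swapped.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for QPDFValue: abstract base for all value types.
struct Value {
    virtual ~Value() = default;
    virtual std::string type() const = 0;
};
struct Unresolved : Value {
    std::string type() const override { return "unresolved"; }
};
struct Null : Value {
    std::string type() const override { return "null"; }
};
struct Integer : Value {
    long long v;
    explicit Integer(long long value) : v(value) {}
    std::string type() const override { return "integer"; }
};

// Hypothetical stand-in for QPDFValueProxy: starts out unresolved.
struct ValueProxy {
    std::shared_ptr<Value> value = std::make_shared<Unresolved>();
};

// Hypothetical stand-in for QPDFObjectHandle: cheap to copy, shares the proxy.
class Handle {
  public:
    explicit Handle(std::shared_ptr<ValueProxy> p) : proxy(std::move(p)) {}
    std::string type() const { return proxy->value->type(); }

  private:
    std::shared_ptr<ValueProxy> proxy;
};

// Hypothetical stand-in for the object cache owned by QPDF.
class Document {
  public:
    // Return a handle to object `id`, creating an unresolved entry if needed.
    Handle reference(int id) {
        auto& slot = cache[id];
        if (!slot) {
            slot = std::make_shared<ValueProxy>();
        }
        return Handle(slot);
    }
    // Swap the inner value; the proxy itself (and every handle to it) stays.
    void setValue(int id, std::shared_ptr<Value> v) {
        reference(id);
        cache[id]->value = std::move(v);
    }

  private:
    std::map<int, std::shared_ptr<ValueProxy>> cache;
};

// Records what two handles to the same object observe over time.
std::string demo_shared_replacement() {
    Document doc;
    Handle a = doc.reference(1);
    Handle b = doc.reference(1);
    std::string seen = a.type();                     // before resolution
    doc.setValue(1, std::make_shared<Integer>(42));  // "resolve"
    seen += "," + a.type() + "," + b.type();
    doc.setValue(1, std::make_shared<Null>());       // "replace"
    seen += "," + b.type();
    return seen;
}
```

In this model, handles created before resolution observe both the
resolved value and any later replacement without being notified,
because they never hold the value directly.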
A ``QPDFValue`` that originated from a PDF input source maintains a
pointer to the ``QPDF`` object that read it (its *owner*). When that
``QPDF`` object is destroyed, it replaces the value of each
``QPDFValueProxy`` in its cache with a direct ``null`` object and
clears the pointer to the owning ``QPDF``. This means that, if there
are still any referencing ``QPDFObjectHandle`` objects floating
around, requesting their owning ``QPDF`` will return a null pointer
rather than a pointer to a ``QPDF`` object that is either invalid or
points to something else. This operation also has the effect of
breaking any circular references (which are common and, in some cases,
required by the PDF specification), thus preventing memory leaks when
``QPDF`` objects are destroyed.
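
The effect of owner destruction can be sketched in the same spirit.
Again, this is a simplified model under stated assumptions rather than
qpdf's implementation: ``Owner``, ``Proxy``, ``Value``,
``ArrayValue``, and ``NullValue`` are hypothetical names, and the
owner pointer is stored on the value as the paragraph above describes.
The sketch builds two arrays that refer to each other, destroys the
owner, and checks that a surviving proxy no longer reports an owner
and that the cycle was released without traversing the object graph.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct Owner;
struct Proxy;

// Hypothetical stand-in for QPDFValue; remembers which owner read it.
struct Value {
    Owner* owner = nullptr;
    virtual ~Value() = default;
};
struct NullValue : Value {};
struct ArrayValue : Value {
    // Items are proxies, so an array can refer back to its own ancestors.
    std::vector<std::shared_ptr<Proxy>> items;
};

// Hypothetical stand-in for QPDFValueProxy.
struct Proxy {
    std::shared_ptr<Value> value;
};

// Hypothetical stand-in for QPDF and its object cache.
struct Owner {
    std::vector<std::shared_ptr<Proxy>> cache;

    std::shared_ptr<Proxy> add(std::shared_ptr<Value> v) {
        v->owner = this;
        auto p = std::make_shared<Proxy>();
        p->value = std::move(v);
        cache.push_back(p);
        return p;
    }

    ~Owner() {
        // One flat pass over the cache: each value becomes an ownerless null.
        // Dropping the old values is what breaks reference cycles.
        for (auto& p : cache) {
            p->value = std::make_shared<NullValue>();
        }
    }
};

struct Result {
    bool owner_cleared;
    bool cycle_freed;
};

Result demo_owner_destruction() {
    std::shared_ptr<Proxy> survivor;
    std::weak_ptr<Value> watch;
    {
        Owner doc;
        auto a_val = std::make_shared<ArrayValue>();
        auto b_val = std::make_shared<ArrayValue>();
        auto a = doc.add(a_val);
        auto b = doc.add(b_val);
        a_val->items.push_back(b);
        b_val->items.push_back(a);  // a and b now form a cycle
        watch = a_val;
        survivor = a;               // outlives the owner
    }
    return {survivor->value->owner == nullptr, watch.expired()};
}
```

Because the proxies, not the values, are what other objects point to,
replacing each value once is enough; no recursive walk of arrays and
dictionaries is needed in this model.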
Objects Prior to qpdf 11
~~~~~~~~~~~~~~~~~~~~~~~~
Prior to qpdf 11, the functionality of the ``QPDFValue`` and
``QPDFValueProxy`` classes was combined into a single ``QPDFObject``
class, which served the dual purpose of being the cache entry for
``QPDF`` and being the abstract base class for all the different PDF
object types. The behavior was nearly the same, but there were a few
problems:
- While changes to a ``QPDFObjectHandle`` through mutation were
visible across all referencing ``QPDFObjectHandle`` objects,
*replacing* an object with ``QPDF::replaceObject`` or
``QPDF::swapObjects`` would leave ``QPDF`` with no way of notifying
``QPDFObjectHandle`` objects that pointed to the old ``QPDFObject``.
To work around this, every attempt to access the underlying object
that a ``QPDFObjectHandle`` pointed to had to ask the owning
``QPDF`` whether the object had changed, and if so, it had to
replace its internal ``QPDFObject`` pointer. This added overhead to
every indirect object access even if no objects were ever changed.
- When a ``QPDF`` object was destroyed, it was necessary to
recursively traverse the structure of every object in the file to
break any circular references. For complex files, this significantly
increased the cost of destroying ``QPDF`` objects.
- When a ``QPDF`` object was destroyed, any ``QPDFObjectHandle``
objects that referenced it would maintain a potentially invalid
pointer as the owning ``QPDF``. In practice, this wasn't usually a
problem since generally people would have no need to maintain copies
of a ``QPDFObjectHandle`` from a destroyed ``QPDF`` object, but
in cases where this was possible, it was necessary for other
software to do its own bookkeeping to ensure that an object's owner
was still valid.
All of these problems were effectively solved by splitting
``QPDFObject`` into ``QPDFValueProxy`` and ``QPDFValue``.
.. _casting: .. _casting: