mirror of
https://github.com/qpdf/qpdf.git
synced 2025-01-05 08:02:11 +00:00
Update internals documentation to reflect QPDFObject split
parent
55cc2ab680
commit
ed04b80caf
@ -67,17 +67,20 @@ files.
|
|||||||
The primary class for interacting with PDF objects is
|
The primary class for interacting with PDF objects is
|
||||||
``QPDFObjectHandle``. Instances of this class can be passed around by
|
``QPDFObjectHandle``. Instances of this class can be passed around by
|
||||||
value, copied, stored in containers, etc. with very low overhead. The
|
value, copied, stored in containers, etc. with very low overhead. The
|
||||||
``QPDFObjectHandle`` object contains an internal shared pointer to an
|
``QPDFObjectHandle`` object contains an internal shared pointer to the
|
||||||
underlying ``QPDFObject``. Instances of ``QPDFObjectHandle`` created
|
underlying object. Instances of ``QPDFObjectHandle`` created by
|
||||||
by reading from a file will always contain a reference back to the
|
reading from a file will always contain a reference back to the
|
||||||
``QPDF`` object from which they were created. A ``QPDFObjectHandle``
|
``QPDF`` object from which they were created. A ``QPDFObjectHandle``
|
||||||
may be direct or indirect. If indirect, the ``QPDFObject`` shared
|
may be direct or indirect. If indirect, the object is initially
|
||||||
pointer is initially null. In this case, the first attempt to access
|
*unresolved*. In this case, the first attempt to access the underlying
|
||||||
the underlying ``QPDFObject`` will result in the ``QPDFObject`` being
|
object will result in the object being resolved via a call to the
|
||||||
resolved via a call to the referenced ``QPDF`` instance. This makes it
|
referenced ``QPDF`` instance. This makes it essentially impossible to
|
||||||
essentially impossible to make coding errors in which certain things
|
make coding errors in which certain things will work for some PDF
|
||||||
will work for some PDF files and not for others based on which objects
|
files and not for others based on which objects are direct and which
|
||||||
are direct and which objects are indirect.
|
objects are indirect. In cases where it is necessary to know whether
|
||||||
|
an object is indirect or not, this information can be obtained from
|
||||||
|
the ``QPDFObjectHandle``. It is also possible to convert direct
|
||||||
|
objects to indirect objects and vice versa.
|
||||||
|
|
||||||
Instances of ``QPDFObjectHandle`` can be directly created and modified
|
Instances of ``QPDFObjectHandle`` can be directly created and modified
|
||||||
using static factory methods in the ``QPDFObjectHandle`` class. There
|
using static factory methods in the ``QPDFObjectHandle`` class. There
|
||||||
@ -230,43 +233,46 @@ could serve as a starting point to someone trying to understand the
|
|||||||
implementation. There is nothing in this section that you need to know
|
implementation. There is nothing in this section that you need to know
|
||||||
to use the qpdf library.
|
to use the qpdf library.
|
||||||
|
|
||||||
``QPDFObject`` is the basic PDF Object class. It is an abstract base
|
In a PDF file, objects may be direct or indirect. Direct objects are
|
||||||
class from which are derived classes for each type of PDF object.
|
objects whose representations appear directly in PDF syntax. Indirect
|
||||||
Clients do not interact with Objects directly but instead interact with
|
objects are references to objects by their ID. The qpdf library uses
|
||||||
``QPDFObjectHandle``.
|
the ``QPDFObjectHandle`` type to hold onto objects and to abstract
|
||||||
|
away in most cases whether the object is direct or indirect.
|
||||||
|
|
||||||
When the ``QPDF`` class creates a new object, it dynamically allocates
|
Internally, ``QPDFObjectHandle`` holds onto a shared pointer to the
|
||||||
the appropriate type of ``QPDFObject`` and immediately hands the pointer
|
underlying object value. When a direct object is created, the
|
||||||
to an instance of ``QPDFObjectHandle``. The parser reads a token from
|
``QPDFObjectHandle`` that holds it is not associated with a ``QPDF``
|
||||||
the current file position. If the token is a not either a dictionary or
|
object. When an indirect object reference is created, it starts off in
|
||||||
array opener, an object is immediately constructed from the single token
|
an *unresolved* state and must be associated with a ``QPDF`` object,
|
||||||
and the parser returns. Otherwise, the parser iterates in a special mode
|
which is considered its *owner*. To access the actual value of the
|
||||||
in which it accumulates objects until it finds a balancing closer.
|
object, the object must be *resolved*. This happens automatically when
|
||||||
During this process, the ``R`` keyword is recognized and an indirect
|
the object is accessed in any way.
|
||||||
``QPDFObjectHandle`` may be constructed.
|
|
||||||
|
|
||||||
The ``QPDF::resolve()`` method, which is used to resolve an indirect
|
To resolve an object, qpdf checks its object cache. If not found in
|
||||||
object, may be invoked from the ``QPDFObjectHandle`` class. It first
|
the cache, it attempts to read the object from the input source
|
||||||
checks a cache to see whether this object has already been read. If
|
associated with the ``QPDF`` object. If it is not found, a ``null``
|
||||||
not, it reads the object from the PDF file and caches it. It the
|
object is returned. A ``null`` object is an object type, just like
|
||||||
returns the resulting ``QPDFObjectHandle``. The calling object handle
|
boolean, string, number, etc. It is not a null pointer. The PDF
|
||||||
then replaces its ``std::shared_ptr<QDFObject>`` with the one from the
|
specification states that an indirect reference to an object that
|
||||||
newly returned ``QPDFObjectHandle``. In this way, only a single copy
|
doesn't exist is to be treated as a ``null``. The resulting object,
|
||||||
of any direct object need exist and clients can access objects
|
whether a ``null`` or the actual object that was read, is stored in
|
||||||
transparently without knowing or caring whether they are direct or
|
the cache. If the object is later replaced or swapped, the underlying
|
||||||
indirect objects. Additionally, no object is ever read from the file
|
object remains the same, but its value is replaced. This way, if you
|
||||||
more than once. That means that only the portions of the PDF file that
|
have a ``QPDFObjectHandle`` to an indirect object and the object by
|
||||||
are actually needed are ever read from the input file, thus allowing
|
that number is replaced (by calling ``QPDF::replaceObject`` or
|
||||||
the qpdf package to take advantage of this important design goal of
|
``QPDF::swapObjects``), your ``QPDFObjectHandle`` will reflect the new
|
||||||
PDF files.
|
value of the object. This is consistent with what would happen to PDF
|
||||||
|
objects if you were to replace the definition of an object in the
|
||||||
|
file.
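
The resolution behavior described above can be illustrated with a small,
self-contained toy model. This is only a sketch under stated assumptions:
``ToyDocument``, ``ToyEntry``, and their members are hypothetical names
invented for this illustration and are not part of the qpdf API. The "input
source" is just a map from object ID to text, and the string ``"null"``
stands in for the PDF ``null`` object.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Toy cache entry (hypothetical): shared by every holder of a reference.
struct ToyEntry {
    bool resolved = false;
    std::string value;
};

// Toy document (hypothetical): models lazy resolution through a cache.
class ToyDocument {
  public:
    explicit ToyDocument(std::map<int, std::string> input)
        : input_(std::move(input)) {}

    // Hand out a reference to object `id` without reading it yet.
    std::shared_ptr<ToyEntry> reference(int id) {
        auto& slot = cache_[id];
        if (!slot) {
            slot = std::make_shared<ToyEntry>();
        }
        return slot;
    }

    // Resolve on first access; an ID missing from the input becomes "null".
    const std::string& value(int id) {
        auto entry = reference(id);
        if (!entry->resolved) {
            auto found = input_.find(id);
            entry->value = (found == input_.end()) ? "null" : found->second;
            entry->resolved = true;
            ++reads_;
        }
        return entry->value;
    }

    // Replace the value in place; the cache entry itself stays the same.
    void replace(int id, std::string new_value) {
        auto entry = reference(id);
        entry->value = std::move(new_value);
        entry->resolved = true;
    }

    // Number of times the toy input source was consulted.
    int reads() const { return reads_; }

  private:
    std::map<int, std::string> input_;
    std::map<int, std::shared_ptr<ToyEntry>> cache_;
    int reads_ = 0;
};
```

Because every reference to a given ID shares one ``ToyEntry``, a reference
obtained before ``replace`` is called observes the new value afterward,
which mirrors the behavior the text attributes to ``QPDF::replaceObject``
and ``QPDF::swapObjects``.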
|
||||||
|
|
||||||
If the requested object is inside of an object stream, the object stream
|
When reading an object from the input source, if the requested object
|
||||||
itself is first read into memory. Then the tokenizer reads objects from
|
is inside of an object stream, the object stream itself is first read
|
||||||
the memory stream based on the offset information stored in the stream.
|
into memory. Then the tokenizer reads objects from the memory stream
|
||||||
Those individual objects are cached, after which the temporary buffer
|
based on the offset information stored in the stream. Those individual
|
||||||
holding the object stream contents is discarded. In this way, the first
|
objects are cached, after which the temporary buffer holding the
|
||||||
time an object in an object stream is requested, all objects in the
|
object stream contents is discarded. In this way, the first time an
|
||||||
stream are cached.
|
object in an object stream is requested, all objects in the stream are
|
||||||
|
cached.
|
||||||
|
|
||||||
The following example should clarify how ``QPDF`` processes a simple
|
The following example should clarify how ``QPDF`` processes a simple
|
||||||
file.
|
file.
|
||||||
@ -287,9 +293,10 @@ file.
|
|||||||
until it encounters ``>>``. Each object that is read is pushed onto
|
until it encounters ``>>``. Each object that is read is pushed onto
|
||||||
a stack. If ``R`` is read, the last two objects on the stack are
|
a stack. If ``R`` is read, the last two objects on the stack are
|
||||||
inspected. If they are integers, they are popped off the stack and
|
inspected. If they are integers, they are popped off the stack and
|
||||||
their values are used to construct an indirect object handle which is
|
their values are used to construct an indirect object handle which
|
||||||
then pushed onto the stack. When ``>>`` is finally read, the stack
|
is then pushed onto the stack. When ``>>`` is finally read, the
|
||||||
is converted into a ``QPDF_Dictionary`` which is placed in a
|
stack is converted into a ``QPDF_Dictionary`` (not directly
|
||||||
|
accessible through the API) which is placed in a
|
||||||
``QPDFObjectHandle`` and returned.
|
``QPDFObjectHandle`` and returned.
|
||||||
|
|
||||||
- The resulting dictionary is saved as the trailer dictionary.
|
- The resulting dictionary is saved as the trailer dictionary.
|
||||||
@ -299,7 +306,7 @@ file.
|
|||||||
saved. If ``/Prev`` is not present, the initial parsing process is
|
saved. If ``/Prev`` is not present, the initial parsing process is
|
||||||
complete.
|
complete.
|
||||||
|
|
||||||
If there is an encryption dictionary, the document's encryption
|
- If there is an encryption dictionary, the document's encryption
|
||||||
parameters are initialized.
|
parameters are initialized.
|
||||||
|
|
||||||
- The client requests the root object. The ``QPDF`` class gets the value of
|
- The client requests the root object. The ``QPDF`` class gets the value of
|
||||||
@ -312,14 +319,103 @@ file.
|
|||||||
object cache for an object with the root dictionary's object ID and
|
object cache for an object with the root dictionary's object ID and
|
||||||
generation number. Upon not seeing it, it checks the cross reference
|
generation number. Upon not seeing it, it checks the cross reference
|
||||||
table, gets the offset, and reads the object present at that offset.
|
table, gets the offset, and reads the object present at that offset.
|
||||||
It stores the result in the object cache and returns the cached
|
It stores the result in the object cache. The cache entry's value is
|
||||||
result. The calling ``QPDFObjectHandle`` replaces its object pointer
|
replaced by the actual value, which causes any previously unresolved
|
||||||
with the one from the resolved ``QPDFObjectHandle``, verifies that it
|
``QPDFObjectHandle`` objects that pointed there to now have a
|
||||||
a valid dictionary object, and returns the (unresolved indirect)
|
shared copy of the actual object. Modifications through any such
|
||||||
``QPDFObject`` handle to the top of the Pages hierarchy.
|
``QPDFObjectHandle`` will be reflected in all of them. As the client
|
||||||
|
continues to request objects, the same process is followed for each
|
||||||
|
new requested object.
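
The stack-based handling of ``R`` described in the walkthrough above can be
sketched as follows. This is a simplified illustration, not qpdf's parser:
``ToyItem`` and ``parseItems`` are hypothetical names, tokens are assumed to
be separated by whitespace, and nested containers, strings, and error
handling are ignored.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical item type for this sketch.
struct ToyItem {
    enum Kind { Integer, Other, IndirectRef };
    Kind kind = Other;
    std::string text; // raw token, used for Integer and Other
    int id = 0;       // object ID, used for IndirectRef
    int gen = 0;      // generation number, used for IndirectRef
};

// True if the token consists only of decimal digits.
static bool isInteger(const std::string& tok) {
    if (tok.empty()) {
        return false;
    }
    for (unsigned char c : tok) {
        if (!std::isdigit(c)) {
            return false;
        }
    }
    return true;
}

// Push each token onto a stack; when "R" follows two integers, pop them
// and push a single indirect reference in their place.
std::vector<ToyItem> parseItems(const std::string& body) {
    std::vector<ToyItem> stack;
    std::istringstream in(body);
    std::string tok;
    while (in >> tok) {
        std::size_t n = stack.size();
        if (tok == "R" && n >= 2 &&
            stack[n - 1].kind == ToyItem::Integer &&
            stack[n - 2].kind == ToyItem::Integer) {
            ToyItem ref;
            ref.kind = ToyItem::IndirectRef;
            ref.id = std::stoi(stack[n - 2].text);
            ref.gen = std::stoi(stack[n - 1].text);
            stack.pop_back();
            stack.pop_back();
            stack.push_back(ref);
        } else {
            ToyItem item;
            item.kind = isInteger(tok) ? ToyItem::Integer : ToyItem::Other;
            item.text = tok;
            stack.push_back(item);
        }
    }
    return stack;
}
```

For the input ``/Root 1 0 R /Size 7``, the two integers before ``R`` collapse
into one indirect reference, while the trailing ``7`` remains an ordinary
integer.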
|
||||||
|
|
||||||
As the client continues to request objects, the same process is
|
.. _object_internals:
|
||||||
followed for each new requested object.
|
|
||||||
|
QPDF Object Internals
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
The internals of ``QPDFObjectHandle`` and how qpdf stores objects were
|
||||||
|
significantly rewritten for qpdf 11. Here are some additional details.
|
||||||
|
|
||||||
|
Object Internals
|
||||||
|
~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
The ``QPDF`` object has an object cache which contains a shared
|
||||||
|
pointer to each object that was read from the file. Changes can be
|
||||||
|
made to any of those objects through ``QPDFObjectHandle`` methods. Any
|
||||||
|
such changes are visible to all ``QPDFObjectHandle`` instances that
|
||||||
|
point to the same object. When a ``QPDF`` object is written by
|
||||||
|
``QPDFWriter`` or serialized to JSON, any changes are reflected.
|
||||||
|
|
||||||
|
Objects in qpdf 11 and Newer
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
The object cache in ``QPDF`` contains a shared pointer to
|
||||||
|
``QPDFValueProxy``. Any ``QPDFObjectHandle`` resolved from an indirect
|
||||||
|
reference to that object has a copy of that shared pointer. Each
|
||||||
|
``QPDFValueProxy`` object contains a shared pointer to an object of
|
||||||
|
type ``QPDFValue``. The ``QPDFValue`` type is an abstract base class.
|
||||||
|
There is an implementation for each of the basic object types (array,
|
||||||
|
dictionary, null, boolean, string, number, etc.) as well as a few
|
||||||
|
special ones including ``uninitialized``, ``unresolved``, and
|
||||||
|
``reserved``. When an object is first referenced, its underlying
|
||||||
|
``QPDFValue`` has type ``unresolved``. When the object is first
|
||||||
|
resolved, the ``QPDFValueProxy`` in the cache has its internal
|
||||||
|
``QPDFValue`` replaced with the object as read from the file. Since it
|
||||||
|
is the ``QPDFValueProxy`` object that is shared by all referencing
|
||||||
|
``QPDFObjectHandle`` objects as well as by the owning ``QPDF`` object,
|
||||||
|
this ensures that any future changes to the object, including
|
||||||
|
replacing the object with a completely different one, will be
|
||||||
|
reflected across all ``QPDFObjectHandle`` objects that reference it.
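
The two-level sharing described above can be modeled in a few lines. The
sketch below is a toy, not qpdf's implementation: ``ToyValue``, ``ToyProxy``,
``ToyHandle``, and the concrete value types are hypothetical stand-ins for
the roles played by ``QPDFValue``, ``QPDFValueProxy``, and
``QPDFObjectHandle``.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Abstract base for all value types (role of the value class).
struct ToyValue {
    virtual ~ToyValue() = default;
    virtual std::string unparse() const = 0;
};

// Placeholder value held before the object has been read.
struct ToyUnresolved : ToyValue {
    std::string unparse() const override { return "<unresolved>"; }
};

struct ToyInteger : ToyValue {
    explicit ToyInteger(long long v) : val(v) {}
    std::string unparse() const override { return std::to_string(val); }
    long long val;
};

struct ToyName : ToyValue {
    explicit ToyName(std::string n) : name(std::move(n)) {}
    std::string unparse() const override { return name; }
    std::string name;
};

// One proxy per object (role of the cache entry). Handles share the
// proxy, so swapping the value inside it is visible to all of them.
struct ToyProxy {
    std::shared_ptr<ToyValue> value = std::make_shared<ToyUnresolved>();
};

// Cheap-to-copy handle: it only holds a shared pointer to the proxy.
class ToyHandle {
  public:
    explicit ToyHandle(std::shared_ptr<ToyProxy> p) : proxy_(std::move(p)) {}
    std::string unparse() const { return proxy_->value->unparse(); }

  private:
    std::shared_ptr<ToyProxy> proxy_;
};
```

The extra level of indirection is the point of the design: the handles never
need to be told that a value changed, because they always read through the
shared proxy.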
|
||||||
|
|
||||||
|
A ``QPDFValue`` that originated from a PDF input source maintains a
|
||||||
|
pointer to the ``QPDF`` object that read it (its *owner*). When that
|
||||||
|
``QPDF`` object is destroyed, it replaces the value of each
|
||||||
|
``QPDFValueProxy`` in its cache with a direct ``null`` object and
|
||||||
|
clears the pointer to the owning ``QPDF``. This means that, if there
|
||||||
|
are still any referencing ``QPDFObjectHandle`` objects floating
|
||||||
|
around, requesting their owning ``QPDF`` will return a null pointer
|
||||||
|
rather than a pointer to a ``QPDF`` object that is either invalid or
|
||||||
|
points to something else. This operation also has the effect of
|
||||||
|
breaking any circular references (which are common and, in some cases,
|
||||||
|
required by the PDF specification), thus preventing memory leaks when
|
||||||
|
``QPDF`` objects are destroyed.
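
The cleanup performed when the owner is destroyed can be sketched with
another toy model. ``ToyOwner`` and ``ToySlot`` are hypothetical names, the
string ``"null"`` stands in for a direct ``null`` object, and the ``link``
member stands in for references that a real value would hold to other
objects (for example, a page pointing to its parent and the parent pointing
back to the page).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class ToyOwner;

// Toy cache entry (hypothetical): a value, a back pointer to the owner,
// and one reference to another entry so that cycles can be formed.
struct ToySlot {
    std::string value;
    ToyOwner* owner = nullptr;
    std::shared_ptr<ToySlot> link;
};

// Toy owner (hypothetical): keeps every entry it creates in a cache.
class ToyOwner {
  public:
    std::shared_ptr<ToySlot> make(std::string value) {
        auto slot = std::make_shared<ToySlot>();
        slot->value = std::move(value);
        slot->owner = this;
        cache_.push_back(slot);
        return slot;
    }

    // On destruction, null out every cached entry, clear its owner, and
    // drop the references it held, which breaks any cycles.
    ~ToyOwner() {
        for (auto& slot : cache_) {
            slot->value = "null";
            slot->owner = nullptr;
            slot->link.reset();
        }
    }

  private:
    std::vector<std::shared_ptr<ToySlot>> cache_;
};
```

A single pass over the cache is enough here; no recursive traversal of the
object graph is needed, and any entry that outlives the owner reports a null
owner instead of a dangling pointer.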
|
||||||
|
|
||||||
|
Objects prior to qpdf 11
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Prior to qpdf 11, the functionality of the ``QPDFValue`` and
|
||||||
|
``QPDFValueProxy`` classes was combined into a single ``QPDFObject``
|
||||||
|
class, which served the dual purpose of being the cache entry for
|
||||||
|
``QPDF`` and being the abstract base class for all the different PDF
|
||||||
|
object types. The behavior was nearly the same, but there were a few
|
||||||
|
problems:
|
||||||
|
|
||||||
|
- While changes to a ``QPDFObjectHandle`` through mutation were
|
||||||
|
visible across all referencing ``QPDFObjectHandle`` objects,
|
||||||
|
*replacing* an object with ``QPDF::replaceObject`` or
|
||||||
|
``QPDF::swapObjects`` would leave ``QPDF`` with no way of notifying
|
||||||
|
``QPDFObjectHandle`` objects that pointed to the old ``QPDFObject``.
|
||||||
|
To work around this, every attempt to access the underlying object
|
||||||
|
that a ``QPDFObjectHandle`` pointed to had to ask the owning
|
||||||
|
``QPDF`` whether the object had changed, and if so, it had to
|
||||||
|
replace its internal ``QPDFObject`` pointer. This added overhead to
|
||||||
|
every indirect object access even if no objects were ever changed.
|
||||||
|
|
||||||
|
- When a ``QPDF`` object was destroyed, it was necessary to
|
||||||
|
recursively traverse the structure of every object in the file to
|
||||||
|
break any circular references. For complex files, this significantly
|
||||||
|
increased the cost of destroying ``QPDF`` objects.
|
||||||
|
|
||||||
|
- When a ``QPDF`` object was destroyed, any ``QPDFObjectHandle``
|
||||||
|
objects that referenced it would maintain a potentially invalid
|
||||||
|
pointer as the owning ``QPDF``. In practice, this wasn't usually a
|
||||||
|
problem since generally people would have no need to maintain copies
|
||||||
|
of a ``QPDFObjectHandle`` from a destroyed ``QPDF`` object, but
|
||||||
|
in cases where this was possible, it was necessary for other
|
||||||
|
software to do its own bookkeeping to ensure that an object's owner
|
||||||
|
was still valid.
|
||||||
|
|
||||||
|
All of these problems were effectively solved by splitting
|
||||||
|
``QPDFObject`` into ``QPDFValueProxy`` and ``QPDFValue``.
|
||||||
|
|
||||||
.. _casting:
|
.. _casting:
|
||||||
|
|
||||||
|