.. _linearization:

Linearization
=============

This chapter describes how ``QPDF`` and ``QPDFWriter`` implement
creation and processing of linearized PDFs.

.. _linearization-strategy:

Basic Strategy for Linearization
--------------------------------

To avoid the circular problem of having the qpdf library validate its
own linearized files, we have a special linearized file checking mode
which can be invoked via :command:`qpdf --check-linearization` (or
:command:`qpdf --check`). This mode reads the linearization parameter
dictionary and the hint streams and validates that object ordering,
parameters, and hint stream contents are correct. The validation code
was first tested against linearized files created by external tools
(Acrobat and pdlin) and then used to validate files created by
``QPDFWriter`` itself.

.. _linearized.preparation:

Preparing For Linearization
---------------------------

Before creating a linearized PDF file from any other PDF file, the PDF
file must be altered such that all page attributes are propagated down
to the page level (and not inherited from parents in the ``/Pages``
tree). We also have to know which objects refer to which other objects,
being concerned with page boundaries and a few other cases. We refer to
this part of preparing the PDF file as *optimization*, discussed in
:ref:`optimization`. Note that, in this context, the term
*optimization* is a qpdf term, and the term *linearization* is a term
from the PDF specification. Do not be confused by the fact that many
applications refer to linearization as optimization or web optimization.

When creating linearized PDF files from optimized PDF files, there are
really only a few issues that need to be dealt with:

- Creation of hint tables

- Placing objects in the correct order

- Filling in offsets and byte sizes

.. _optimization:

Optimization
------------

In order to perform various operations such as linearization and
splitting files into pages, it is necessary to know which objects are
referenced by which pages, page thumbnails, and root and trailer
dictionary keys. It is also necessary to ensure that all page-level
attributes appear directly at the page level and are not inherited from
parents in the pages tree.

We refer to the process of enforcing these constraints as
*optimization*. As mentioned above, some applications refer to
linearization as optimization. Although this optimization was initially
motivated by the need to create linearized files, we use the two terms
separately.

PDF file optimization is implemented in the
:file:`QPDF_optimization.cc` source file. That file is richly commented
and serves as the primary reference for the optimization process.

After optimization has been completed, the private member variables
``obj_user_to_objects`` and ``object_to_obj_users`` in ``QPDF`` have
been populated. Any object that has more than one value in the
``object_to_obj_users`` table is shared. Any object that has exactly one
value in the ``object_to_obj_users`` table is private. To find all the
private objects in a page or a trailer or root dictionary key, one
merely has to make this determination for each element in the
``obj_user_to_objects`` table for the given page or key.

Note that pages and thumbnails have different object user types, so the
above test on a page will not include objects that are referenced only
by the page's thumbnail dictionary.

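The shared/private test described above can be sketched in a few lines
of Python. This is only an illustration of the idea, not qpdf's C++
implementation: the ``(kind, identifier)`` tuples used as object users,
the object numbers, and the helper names ``invert`` and
``private_objects`` are all hypothetical.

.. code-block:: python

   from collections import defaultdict

   # Hypothetical object users: (kind, identifier) tuples standing in for
   # pages, thumbnails, and root/trailer dictionary keys.
   obj_user_to_objects = {
       ("page", 1): {10, 11, 20},
       ("page", 2): {12, 20},
       ("thumb", 1): {30},
       ("root", "/Outlines"): {40},
   }


   def invert(user_to_objects):
       """Build the reverse table: object -> set of users."""
       object_to_users = defaultdict(set)
       for user, objects in user_to_objects.items():
           for obj in objects:
               object_to_users[obj].add(user)
       return dict(object_to_users)


   def private_objects(user, user_to_objects, object_to_users):
       """Objects of `user` that no other user references."""
       return {
           obj
           for obj in user_to_objects[user]
           if len(object_to_users[obj]) == 1
       }


   object_to_obj_users = invert(obj_user_to_objects)
   page1_private = private_objects(
       ("page", 1), obj_user_to_objects, object_to_obj_users
   )
   # An object with more than one user is shared.
   shared = {o for o, users in object_to_obj_users.items() if len(users) > 1}

In this toy data, object 20 is shared by both pages, and object 30
belongs to the thumbnail user of page 1, so it does not show up among
the private objects of the page itself.
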
.. _linearization.writing:

Writing Linearized Files
------------------------

We will create files with only primary hint streams. We will never write
overflow hint streams. (As of PDF version 1.4, Acrobat doesn't either,
and they are never necessary.) The hint streams contain offset
information for objects, and those offsets point to where the objects
would be if the hint stream were not present. This means that we have to
calculate all object positions before we can generate and write the hint
table, so we have to generate the file in two passes. To make this
reliable, ``QPDFWriter`` in linearization mode invokes exactly the same
code twice to write the file to a pipeline.

In the first pass, the target pipeline is a count pipeline chained to a
discard pipeline. The count pipeline simply passes its data through to
the next pipeline in the chain but can return the number of bytes passed
through it at any intermediate point. The discard pipeline is an
end-of-the-line pipeline that just throws its data away. The hint stream
is not written, and dummy values with adequate padding are stored in the
first cross reference table, linearization parameter dictionary, and
``/Prev`` key of the first trailer dictionary. All the offset, length,
and object renumbering information, and anything else we need for the
second pass, is stored.

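The count-then-discard arrangement can be illustrated with a minimal
Python sketch. The class and function names here (``Count``,
``Discard``, ``write_objects``) and the object bodies are hypothetical
and chosen for the example; qpdf's own pipelines are C++ classes.

.. code-block:: python

   class Discard:
       """End-of-the-line pipeline: accepts data and throws it away."""

       def write(self, data: bytes) -> None:
           pass


   class Count:
       """Passes data to the next pipeline and counts the bytes seen."""

       def __init__(self, next_pipeline):
           self.next = next_pipeline
           self.count = 0

       def write(self, data: bytes) -> None:
           self.count += len(data)
           self.next.write(data)


   def write_objects(pipeline, objects):
       """Write each object, recording the offset at which it starts."""
       offsets = {}
       for obj_id, body in objects:
           offsets[obj_id] = pipeline.count  # bytes written so far
           pipeline.write(body)
       return offsets


   objects = [
       (1, b"1 0 obj\n<< >>\nendobj\n"),
       (2, b"2 0 obj\n[ ]\nendobj\n"),
   ]
   counter = Count(Discard())
   offsets = write_objects(counter, objects)

Because the counter can be queried at any intermediate point, the
writing code learns every object offset without keeping any of the
output.
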
At the end of the first pass, this information is passed to the ``QPDF``
class which constructs a compressed hint stream in a memory buffer and
returns it. ``QPDFWriter`` uses this information to write a complete
hint stream object into a memory buffer. At this point, the length of
the hint stream is known.

In the second pass, the end of the pipeline chain is a regular file
instead of a discard pipeline, and we have known values for all the
offsets and lengths that we didn't have in the first pass. We have to
adjust offsets that appear after the start of the hint stream by the
length of the hint stream, which is known. Anything that is of variable
length is padded, with the padding code surrounding any writing code
that differs in the two passes. This ensures that changes to the way
things are represented never result in offsets that were gathered
during the first pass becoming incorrect for the second pass.

Using this strategy, we can write linearized files to a non-seekable
output stream with only a single pass to disk or wherever the output is
going.

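As a rough illustration of the two-pass idea, the following Python
sketch writes a toy file twice with the same function. The file layout
(a header of padded offsets, then a hint block, then the bodies), the
field width ``WIDTH``, and the name ``write_file`` are assumptions made
for the example and do not reflect the actual layout of a linearized
PDF.

.. code-block:: python

   import io
   import zlib

   WIDTH = 10  # assumed field width, wide enough for any offset here


   def write_file(out, bodies, offsets, hint):
       """Write header, hint block, and bodies; return body positions."""
       pos = 0
       positions = []
       for value in offsets:
           # Zero-padded, so the field has the same size in both passes.
           field = f"{value:0{WIDTH}d}\n".encode()
           out.write(field)
           pos += len(field)
       out.write(hint)
       pos += len(hint)
       for body in bodies:
           positions.append(pos)
           out.write(body)
           pos += len(body)
       return positions


   bodies = [b"first object\n", b"second object\n"]

   # Pass 1: dummy offsets, no hint block, output thrown away.
   first_pass = write_file(io.BytesIO(), bodies, [0] * len(bodies), b"")

   # Build the hint block from first-pass data; its length is now known.
   hint = zlib.compress(repr(first_pass).encode()) + b"\n"

   # Pass 2: everything after the hint block shifts by the hint's length.
   final_offsets = [p + len(hint) for p in first_pass]
   out = io.BytesIO()
   second_pass = write_file(out, bodies, final_offsets, hint)
   data = out.getvalue()

The second pass only ever appends to ``out``, which is why the same
approach works for a non-seekable output stream. The padded header
fields are what keep the first-pass positions valid once real values
replace the dummy ones.
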
.. _linearization-data:

Calculating Linearization Data
------------------------------

Once a file is optimized, we have information about which objects access
which other objects. We can then process these tables to decide which
part (as described in "Linearized PDF Document Structure" in the PDF
specification) each object is contained within. This tells us the exact
order in which objects are written. The ``QPDFWriter`` class asks for
this information and enqueues objects for writing in the proper order.
It also turns on a check that causes an exception to be thrown if an
object is encountered that has not already been queued. (This could
happen only if there were a bug in the traversal code used to calculate
the linearization data.)

.. _linearization-issues:

Known Issues with Linearization
-------------------------------

There are a handful of known issues with this linearization code. These
issues do not appear to impact the behavior of linearized files, which
still work as intended: it is possible for a web browser to begin to
display them before they are fully downloaded. In fact, it seems that
various other programs that create linearized files have many of these
same issues. These items make reference to terminology used in the
linearization appendix of the PDF specification.

- Thread Dictionary information keys appear in part 4 with the rest of
  Threads instead of in part 9. Objects in part 9 are not grouped
  together functionally.

- We are not calculating numerators for shared object positions within
  content streams or interleaving them within content streams.

- We generate only page offset, shared object, and outline hint tables.
  It would be relatively easy to add some additional tables. We gather
  most of the information needed to create thumbnail hint tables. There
  are comments in the code about this.

.. _linearization-debugging:

Debugging Note
--------------

The :command:`qpdf --show-linearization` command can show the complete
contents of linearization hint streams. To look at the raw data, you can
extract the filtered contents of the linearization hint tables using
:command:`qpdf --show-object=n --filtered-stream-data`. Then, to convert
this into a bit stream (since linearization tables are bit streams
written without regard to byte boundaries), you can pipe the resulting
data through the following Perl code:

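.. code-block:: perl

   use bytes;       # treat the input as raw bytes, not characters
   binmode STDIN;
   undef $/;        # slurp all of standard input in one read
   my $data = <STDIN>;
   # Print each byte as eight binary digits, high-order bit first.
   printf("%08b", ord($_)) for split(//, $data);
   print "\n";
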
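Once the data is viewed as a bit stream, individual fields can be read
at arbitrary bit widths. The Python sketch below shows one way to do
that, assuming fields are packed with the high-order bit first, which is
the same order in which the Perl filter prints digits. The ``BitReader``
class, the sample bytes, and the field widths are hypothetical; the real
widths come from the hint table headers.

.. code-block:: python

   class BitReader:
       """Read unsigned integers of arbitrary bit width from bytes."""

       def __init__(self, data: bytes):
           # Same expansion as the Perl filter: eight digits per byte.
           self.bits = "".join(f"{byte:08b}" for byte in data)
           self.pos = 0

       def read(self, nbits: int) -> int:
           if nbits == 0:
               return 0  # a zero-width field carries no data
           chunk = self.bits[self.pos:self.pos + nbits]
           if len(chunk) < nbits:
               raise EOFError("not enough bits left")
           self.pos += nbits
           return int(chunk, 2)


   # Two sample bytes, split into fields of 3, 7, and 6 bits.
   reader = BitReader(bytes([0b10110100, 0b11000001]))
   fields = [reader.read(3), reader.read(7), reader.read(6)]

The second field straddles the boundary between the two bytes, which is
the situation that makes a byte-oriented dump hard to read by eye.
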