Prior to this fix, if there was a loop detected in following /Prev
pointers in xref streams/tables, it would cause qpdf to lose data.
Note that this condition causes many PDF readers to hang or fail.
Significant enhancements to the lexer to improve EOF handling and to
support comments and spaces as tokens. Various other minor issues were
fixed as well.
While scanning the file looking for objects, limit the length of
tokens we allow. This prevents us from getting caught up in reading a
file character by character while digging through large streams.
There is no need for a --precheck-streams option. We can do the
precheck without imposing any penalty, only re-encoding the stream if
it fails the first time.
This commit adds several API methods that enable control over which
types of filters QPDF will attempt to decode. It also adds support for
/RunLengthDecode and /DCTDecode filters for both encoding and
decoding.
When requested, QPDFWriter will do more aggress prechecking of streams
to make sure it can actually succeed in decoding them before
attempting to do so. This will allow preservation of raw data even
when the raw data is corrupted relative to the specified filters.
QPDFObjectHandle::parseInternal now issues warnings instead of
throwing exceptions for all error conditions that it finds (except
internal logic errors) and has stronger recovery for things like
invalid tokens and malformed dictionaries. This should improve qpdf's
ability to recover from a wide range of broken files that currently
cause it to fail.
During parsing of an object, sometimes parts of the object have to be
resolved. An example is stream lengths. If such an object directly or
indirectly points to the object being parsed, it can cause an infinite
loop. Guard against all cases of re-entrant resolution of objects.
This is CVE-2017-9208.
The QPDF library uses object ID 0 internally as a sentinel to
represent a direct object, but prior to this fix, was not blocking
handling of 0 0 obj or 0 0 R as a special case. Creating an object in
the file with 0 0 obj could cause various infinite loops. The PDF spec
doesn't allow for object 0. Having qpdf handle object 0 might be a
better fix, but changing all the places in the code that assumes objid
== 0 means direct would be risky.
For non-encrypted files, determinstic ID generation uses file contents
instead of timestamp and file name. At a small runtime cost, this
enables generation of the same /ID if the same inputs are converted in
the same way multiple times.
Space rather than newline after xref, missing /ID in trailer for
encrypted file. This enables qpdf to handle some files that xpdf can
handle. Adobe reader can't necessarily handle them.
Rework QPDFWriter to always track old object IDs and QPDFObjGen
instead of int, thus not discarding the generation number. Switch to
QPDF::getCompressibleObjGen() to properly handle the case of an old
object eligible for compression that has a generation of other than
zero.
Original code was written before we could shallow copy objects, so all
the filtering was done by suppressing the output of certain keys and
replacing them with other keys. Now we can simplify the code greatly
by modifying shallow copies of dictionaries in place.
Read and write support is implemented for /V=5 with /R=5 as well as
/R=6. /R=5 is the deprecated encryption method used by Acrobat IX.
/R=6 is the encryption method used by PDF 2.0 from ISO 32000-2.