Instead of calling assert for problems found during checking
linearization data, throw an exception which is later caught and
issued as an error. Ideally we would handle errors more robustly, but
this is still a significant improvement.
On certain operations, such as iterating through all objects and
adding new indirect objects, walk through the entire object structure
and explicitly resolve any indirect references to non-existent
objects. That prevents new objects from springing into existence and
causing the previously dangling references to point to them.
Instead of directly putting the contents of the annotation appearance
streams into the page's content stream, add commands to render the
form xobjects directly. This is a more robust way to do it than the
original solution as it works properly with patterns and avoids
problems with resource name clashes between the pages and the form
xobjects.
Flatten annotations by integrating their appearance streams into the
content stream of the containing page. In the case of form fields,
only flatten if /NeedAppearance is false (or equivalently absent). If
flattening form fields, also remove /AcroForm from the document
catalog.
Unparse is admittedly strange, but I'd rather be strange and
consistent, and everything else in the qpdf library uses unparse to
serialize. (If you're reading this, the convention of using "unparse"
comes from the "clu" programming language.)
Rather than keeping a list of buffers for every write, accumulate
bytes in a single buffer, doubling the size of the buffer when needed
to accommodate new data.
This is not the best possible implementation, but the change was
implemented in this way to avoid changing the shape of Pl_Buffer and
thus breaking backward compatibility.
There were a few places in the code that were checking that a pointer
wasn't null before deleting it, even though C++ has always allowed
delete 0. Most of the code did not perform these checks.
If we are unable to filter a page's content streams, don't attempt to
remove objects from the page's resource dictionary. Also provide a
command line option to suppress resource removal in case we ever need
this as a workaround for some bug or broken PDF files.
If parsing content streams is treated as a warning, there is no way
for a caller to know if a parsing operation has failed. This is very
dangerous and will likely result in data loss when token filters are
parser callbacks are in use.
It's not really a shallow copy. It just doesn't cross indirect object
boundaries. The old implementation had a bug that would cause multiple
shallow copies of the same object to share memory, which was not the
intention.
This is the beginning of higher-level API support using helper
classes. The goal is to be able to add more helpers without continuing
to pollute QPDF's and QPDFObjectHandle's public interfaces.
The special case around name token was not reachable. This would only
affect constructors of name tokens that were represented in
non-canonical form such as with a hex substitution for a printable
character. The error was harmless but still a bug.
Remove calls to assertPageObject(). All cases in the library that
called assertPageObject() work fine if you don't call
assertPageObject() because nothing assumes anything that was being
checked by that call. Removing the calls enables more files to be
successfully processed.
Prior to this fix, if there was a loop detected in following /Prev
pointers in xref streams/tables, it would cause qpdf to lose data.
Note that this condition causes many PDF readers to hang or fail.
Bump to an alpha release. This version is not being widely released
but is being used to push the new shared library version through the
debian packaging system and to test out github releases.
The QPDF_String::getUTF8Val() method was not treating strings that
weren't explicitly Unicode as PDF Doc Encoded. This only affects
characters in the range 0x80 through 0xa0.
Implement a TokenFilter class and refactor Pl_QPDFTokenizer to use a
TokenFilter class called ContentNormalizer. Pl_QPDFTokenizer is now a
general filter that passes data through a TokenFilter.
Adding a trailing newline in content normalization damages files whose
contents are split across streams in the middle of tokens. Let
QPDFWriter add the newline with the indicator to ignore the newline,
which it already does. This changes the way some qdf files look.
Remove a redundant method that was equal to another one with
additional arguments. This breaks binary compatibility, but there are
other ABI breaking changes in the upcoming release, so now is the time
to do it.
Significant enhancements to the lexer to improve EOF handling and to
support comments and spaces as tokens. Various other minor issues were
fixed as well.
Add options to enable the raw encryption key to be directly shown or
specified. Thanks to Didier Stevens <didier.stevens@gmail.com> for the
idea and contribution of one implementation of this idea.
If the stream isn't filterable but we call getStreamData, throw a
regular exception instead of a logic error so that normal error
handling and reporting mechanisms will be used.
Avoid calling jpeg_mem_src and jpeg_mem_dest. The custom destination
manager writes to the pipeline in smaller chunks to avoid having the
whole image in memory at once. The source manager works directly with
the Buffer object. Using customer managers avoids use of memory source
and destination managers, which are not present in older versions of
libjpeg still in use by some Linux distributions.
While scanning the file looking for objects, limit the length of
tokens we allow. This prevents us from getting caught up in reading a
file character by character while digging through large streams.
* Add support for PCLm using setPCLm() and writePCLm() methods in
QPDFWriter.hh and QPDFWriter.cc
* Add a function writePCLmHeader() for PCLm header in QPDFWriter
There is no need for a --precheck-streams option. We can do the
precheck without imposing any penalty, only re-encoding the stream if
it fails the first time.
This commit adds several API methods that enable control over which
types of filters QPDF will attempt to decode. It also adds support for
/RunLengthDecode and /DCTDecode filters for both encoding and
decoding.
Very badly corrupted files may not have a retrievable root dictionary.
Handle that as a special case so that a more helpful error message can
be provided.
When requested, QPDFWriter will do more aggress prechecking of streams
to make sure it can actually succeed in decoding them before
attempting to do so. This will allow preservation of raw data even
when the raw data is corrupted relative to the specified filters.
QPDFObjectHandle::parseInternal now issues warnings instead of
throwing exceptions for all error conditions that it finds (except
internal logic errors) and has stronger recovery for things like
invalid tokens and malformed dictionaries. This should improve qpdf's
ability to recover from a wide range of broken files that currently
cause it to fail.
During parsing of an object, sometimes parts of the object have to be
resolved. An example is stream lengths. If such an object directly or
indirectly points to the object being parsed, it can cause an infinite
loop. Guard against all cases of re-entrant resolution of objects.
This is CVE-2017-9208.
The QPDF library uses object ID 0 internally as a sentinel to
represent a direct object, but prior to this fix, was not blocking
handling of 0 0 obj or 0 0 R as a special case. Creating an object in
the file with 0 0 obj could cause various infinite loops. The PDF spec
doesn't allow for object 0. Having qpdf handle object 0 might be a
better fix, but changing all the places in the code that assumes objid
== 0 means direct would be risky.
This is CVE-2017-9210.
The description string for an error message included unparsing an
object, which is too complex of a thing to try to do while throwing an
exception. There was only one example of this in the entire codebase,
so it is not a pervasive problem. Fixing this eliminated one class of
infinite loop errors.
The 64 Bit file functions are supported by C++-Builder as well and
need to be used, else fseek will error out on larger files than 4 GB
like used in the large file test.
For non-encrypted files, determinstic ID generation uses file contents
instead of timestamp and file name. At a small runtime cost, this
enables generation of the same /ID if the same inputs are converted in
the same way multiple times.
As reported in issue #40, a call to CryptAcquireContext in
SecureRandomDataProvider fails in a fresh windows install prior to any
user keys being created in AppData\Roaming\Microsoft\Crypto\RSA.
Thanks michalrames.
Pushing inherited objects to pages and getting all pages were both
prone to stack overflow infinite loops if there were loops in the
Pages dictionary. There is a general weakness in the code in that any
part of the code that traverses the Pages structure would be prone to
this and would have to implement its own loop detection. A more robust
fix may provide some general method for handling the Pages structure,
but it's probably not worth doing.
Note: addition of *Internal2 private functions was done rather than
changing signatures of existing methods to avoid breaking
compatibility.
Converting a password to an encryption key is supposed to copy up to a
certain number of bytes from a digest. Make sure never to copy more
than the size of the digest.
When checking two objects preceding R while parsing, ensure that the
objects are direct. This avoids stuff like 1 0 obj containing 1 0 R 0 R
from causing an infinite loop in object resolution.
Original reported here:
https://bugs.launchpad.net/ubuntu/+source/qpdf/+bug/1397413
The PDF specification says that the /Type key for nodes in the pages
dictionary (both /Page and /Pages) is required, but some PDF files
omit them. Use the presence of other keys to determine the type of
pages tree node this is if the type key is not found.
QPDFWriter was trying to make /Filter and /DecodeParms direct in all
cases, but there are some cases where /DecodeParms may refer to a
stream, which can't be direct. QPDFWriter doesn't actually need
/DecodeParms to be direct in that case because it won't be able to
filter the stream. Until we can handle this type of stream, just don't
make /Filter and /DecodeParms direct if we can't filter the stream
anyway.
Fixes #34
Fix problem: if the last object in the first part of a linearized file
had an offset that was below 65536 by less than the size of the hint
stream, the xref stream was invalid and the resulting file is not
usable.
Add new RandomDataProvider object and implement existing random number
generation in terms of that. This enables end users to supply their
own random data providers.
If NO_GET_ENVIRONMENT is #defined at compile time on Windows, do not
call GetEnvironmentVariable. QUtil::get_env will always return
false. This option is not available through configure. This was
added to support a specific user's requirements to avoid calling
GetEnvironmentVariable from the Windows API. Nothing in qpdf outside
the test coverage system in qtest relies on QUtil::get_env.
Ideally, the library should never call assert outside of test code,
but it does in several places. For some cases where the assertion
might conceivably fail because of a problem with the input data,
replace assertions with exceptions so that they can be trapped by the
calling application. This commit surely misses some cases and
replaced some cases unnecessarily, but it should still be an
improvement.
In places where std::vector<T>(size_t) was used, either validate that
the size parameter is sane or refactor code to avoid the need to
pre-allocate the vector.
The /W array was not sanitized, possibly causing an integer overflow
in a multiplication. An analysis of the code suggests that there were
no possible exploits based on this since the problems were in checking
expected values but bounds checks were performed on actual values.
4.2.0 was binary incompatible in spite of there being no deletions or
changes to any public methods. As such, we have to bump the ABI and
are fixing some API breakage while we're at it.
Previous 4.3.0 target is now 5.1.0.
Space rather than newline after xref, missing /ID in trailer for
encrypted file. This enables qpdf to handle some files that xpdf can
handle. Adobe reader can't necessarily handle them.
Rework QPDFWriter to always track old object IDs and QPDFObjGen
instead of int, thus not discarding the generation number. Switch to
QPDF::getCompressibleObjGen() to properly handle the case of an old
object eligible for compression that has a generation of other than
zero.
Versions prior to 4.6 didn't allow gcc diagnostic pragmas with push
and pop and to appear anywhere in the file. Just let the warning be
there for those versions.
Remove const qualifier from getTypeCode and get getTypeName methods of
QPDFObjectHandle, make them work properly for indirect objects, and
exercise them much better in the test suite.
Put a specific comment marker next to every piece of code that MSVC
gives warning 4996 for. This warning is generated for calls to
functions that Microsoft considers insecure or deprecated. This
change is in preparation for fixing all these cases even though none
of them are actually incorrect or insecure as used in qpdf. The
comment marker makes them easier to find so they can be fixed in
subsequent commits.
Fix exit status for case of errors without warnings, continue after
errors when possible, add test case for parsing a file with content
stream errors on some but not all pages.
Change object type Keyword to Operator, and place the order of the
object types in object_type_e in the same order as they are mentioned
in the PDF specification.
Note that this change only breaks backward compatibility with code
that has not yet been released.
The upcoming 3.1 release contains non-compatible API changes, though
they only affect parts of the interface that are extremely unlikely to
have been used outside of qpdf itself. The methods and data types
affected were used for communication between QPDFWriter and QPDF and
would have had no real use in end user code.
Original code was written before we could shallow copy objects, so all
the filtering was done by suppressing the output of certain keys and
replacing them with other keys. Now we can simplify the code greatly
by modifying shallow copies of dictionaries in place.
Read and write support is implemented for /V=5 with /R=5 as well as
/R=6. /R=5 is the deprecated encryption method used by Acrobat IX.
/R=6 is the encryption method used by PDF 2.0 from ISO 32000-2.
Changes from upstream are limited to change #include paths so that I
can place header files and included "c" files in a subdirectory. I
didn't keep the unit tests from sphlib but instead verified them by
running them manually. I will implement the same tests using the
Pl_SHA2 pipeline except that sphlib's sha2 implementation supports
partial bytes, which I will not exercise in qpdf or our tests.