This takes pages from the file in groups of n, with a default of 1.
This partially addresses the enhancement requested in issue #505 but
doesn't implement the entire suggestion.
Also fix a bug in checking consistency of length for stream data
providers. Length should not be checked or recorded if the provider
says it failed to generate the data.
I thought /EFF was supposed to be used as a default for decrypting
embedded file streams, but actually it's supposed to be advice to a
conforming writer about handling new ones. This makes sense since the
findAttachmentStreams code, which is not actually needed, was never
right.
When removing unreferenced resources, notice if a page (recursively)
contains a form XObject with unreferenced resources, and count any
such resources as referenced by the page.
Keep a std::pair internal to the iterators so that operator* can
return a reference and operator-> can work, and each can work without
copying pairs of objects around.
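As a rough illustration (a generic sketch, not qpdf's actual iterator
classes), keeping the pair as a member lets dereferencing hand out a
reference instead of a temporary:

    #include <string>
    #include <utility>

    // Hypothetical iterator sketch: the key/value pair lives inside the
    // iterator, so operator* can return a reference and operator-> can
    // return a pointer without copying the pair on every access.
    class TreeIterator
    {
      public:
        using value_type = std::pair<std::string, int>;

        value_type const& operator*() const { return this->ivalue; }
        value_type const* operator->() const { return &this->ivalue; }

        TreeIterator& operator++()
        {
            // advance within the tree, then refresh the cached pair
            updateIValue();
            return *this;
        }

      private:
        void updateIValue(); // repopulate ivalue from the current position
        value_type ivalue;   // pair kept internal to the iterator
    };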
Create a computationally and memory efficient implementation of name
and number trees that does binary searches as intended by the data
structure rather than loading into a map, which can use a great deal
of memory and can be very slow.
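For context, the keys in a name tree node are stored in sorted order,
so a lookup can binary search one node's /Names array instead of
materializing the whole tree into a map. A minimal sketch of the
per-node search (hypothetical helper using plain standard types):

    #include <string>
    #include <utility>
    #include <vector>

    // Binary search over one leaf node's sorted (key, value) entries, as
    // the name tree data structure intends. Only the nodes along the
    // search path ever need to be examined, so memory use stays small.
    bool
    lookupInLeaf(std::vector<std::pair<std::string, int>> const& names,
                 std::string const& key, int& value)
    {
        size_t lo = 0;
        size_t hi = names.size();
        while (lo < hi) {
            size_t mid = lo + ((hi - lo) / 2);
            if (names[mid].first < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        if ((lo < names.size()) && (names[lo].first == key)) {
            value = names[lo].second;
            return true;
        }
        return false;
    }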
Avoid calling finish() multiple times on the pipeline passed to
pipeContentStreams. This commit also fixes a bug in which qpdf was not
exiting with the proper exit status if warnings were found while splitting
pages; this was exposed by a test case that changed.
Make some more methods in QPDFPageObjectHelper work with form
XObjects, and provide forEach methods to walk through nested form
XObjects, possibly recursively. This should make it easier to work
with form XObjects from user code.
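For a sense of what such a walk involves, here is a rough sketch of a
recursive traversal over the /XObject entries of a resource dictionary
using basic QPDFObjectHandle calls. walkFormXObjects is a hypothetical
helper, not one of the new methods, and real code would also need to
guard against loops in the resource structure:

    #include <qpdf/QPDFObjectHandle.hh>

    #include <functional>
    #include <string>

    // Visit every form XObject reachable from a resource dictionary,
    // recursing into each form XObject's own /Resources.
    void
    walkFormXObjects(QPDFObjectHandle resources,
                     std::function<void(QPDFObjectHandle)> action)
    {
        if (!(resources.isDictionary() && resources.hasKey("/XObject"))) {
            return;
        }
        QPDFObjectHandle xobjects = resources.getKey("/XObject");
        if (!xobjects.isDictionary()) {
            return;
        }
        for (auto const& key : xobjects.getKeys()) {
            QPDFObjectHandle xobj = xobjects.getKey(key);
            if (xobj.isStream() &&
                xobj.getDict().getKey("/Subtype").isName() &&
                (xobj.getDict().getKey("/Subtype").getName() == "/Form")) {
                action(xobj);
                walkFormXObjects(xobj.getDict().getKey("/Resources"), action);
            }
        }
    }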
Also removes preclusion of stream references in stream parameters of
filterable streams and reduces write times by about 8% by eliminating
an extra traversal of the objects.
This reverts an incorrect fix to #449 and codes it properly. The real
problem was that we were looking at the local dictionaries rather than
the foreign dictionaries when saving the foreign stream data. In the
case of direct objects, these happened to be the same, but in the case
of indirect objects, the object references could be pointing anywhere
since object numbers don't match up between the old and new files.
Specifically, if a stream had its stream data replaced and had
indirect /Filter or /DecodeParms, it would result in non-silent loss
of data and/or internal error.
Wildcard expansion is different in Windows from non-Windows and
sometimes requires special link options to work. Add tests that fail
if we link incorrectly.
Issue #399 mentioned a use case for which qpdf has support, but the
fact that it is supported was not documented or covered in the test
suite, making it vulnerable to accidental breakage.
If the value of /CS in the inline image dictionary is a key in the
page's /Resources -> /ColorSpace dictionary, properly resolve it by
referencing the actual color space, and not just the name, in the
external image dictionary.
Allow exit status-based checking of whether a file is encrypted or
requires a password without necessarily supplying the correct
password. Useful for scripting.
For wildcard expansion to work properly with the msvc binary, it is
necessary to link with setargv.obj or wsetargv.obj, depending on
whether wmain is in use.
Various PDF digital signing tools do not encrypt the /Contents value
in the signature dictionary. Adobe Acrobat Reader DC can handle a PDF
with the /Contents value left unencrypted.
Write /Contents in the signature dictionary without encryption.
Tests ensure that /Contents strings are not handled specially when
they do not appear in signature dictionaries.
It seems better not to compress signature dictionaries. Various PDF
digital signing tools, including Adobe Acrobat Reader DC, do not
compress signature dictionaries.
Table 8.93, "Entries in a signature dictionary," in the PDF 1.5
reference states that /ByteRange in the signature dictionary shall be
used to describe a digest that does not include the signature value
(/Contents) itself.
The byte ranges cannot be determined if the dictionary is compressed.
Table 8.93, "Entries in a signature dictionary," in the PDF 1.5
reference states that the value of the /Contents entry is a
hexadecimal string representation when /ByteRange is specified.
This commit makes QPDF always use a hexadecimal string representation
instead of a literal string for it.
It's detected in QPDFWriter instead of at parse time because I can't
figure out how to construct a test case in a reasonable time. This
commit moves the fuzz file into the regular test suite for a QTC
coverage case.
For some reason, qpdf from the beginning was replacing indirect
references to null with literal null in arrays even after removing the
old behavior of flattening scalar references. This seems like a bad
idea.
This message used to appear only for PDF >= 1.2. A name that is
invalid under PDF 1.2 rules is still valid for PDF 1.0 and 1.1.
However, since QPDFWriter may write a newer
version, it's better to detect and warn in all cases. Therefore make
the warning more informative.
This change works around STL problems with Embarcadero C++ Builder
version 10.2, but std::vector is more common than std::list in qpdf,
and this is a relatively new API, so an API change is tolerable.
Thanks to Thorsten Schöning <6223655+ams-tschoening@users.noreply.github.com>
for the fix.
This also reverts the addition of a new checkLinearization that
distinguishes errors from warnings. There's no practical distinction
between what was considered an error and what was considered a
warning.
* Several assertions in linearization were not always true; change
them to run time errors
* Handle a few cases of uninitialized objects
* Handle pages with no contents when doing form operations
* Handle invalid page tree nodes when traversing pages
This makes all integer type conversions that have potential data loss
explicit with calls that do range checks and raise an exception. After
this commit, qpdf builds with no warnings when -Wsign-conversion
-Wconversion is used with gcc or clang or when -W3 -Wd4800 is used
with MSVC. This significantly reduces the likelihood of potential
crashes from bogus integer values.
There are some parts of the code that take int when they should take
size_t or an offset. Such places would make qpdf not support files
with more than 2^31 of something that usually wouldn't be so large. In
the event that such a file shows up and is valid, at least qpdf would
raise an error in the right spot so the issue could be legitimately
addressed rather than failing in some weird way because of a silent
overflow condition.
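As a generic sketch of the kind of range-checked conversion described
above (not the actual helpers added to qpdf), a narrowing conversion
can round-trip the value and compare signs, throwing if either check
fails:

    #include <stdexcept>

    // Generic illustration of a range-checked integer narrowing
    // conversion: convert only if the value is representable in the
    // target type, otherwise raise an exception rather than silently
    // truncating or changing sign.
    template <typename To, typename From>
    To
    checked_cast(From value)
    {
        To result = static_cast<To>(value);
        // Round-tripping back to the source type detects truncation;
        // the sign comparison detects signed/unsigned mismatches that
        // happen to round-trip cleanly.
        if ((static_cast<From>(result) != value) ||
            ((result < To(0)) != (value < From(0)))) {
            throw std::range_error("integer conversion out of range");
        }
        return result;
    }

With something like this, a silent wraparound such as converting a
size_t larger than INT_MAX to int becomes a detectable error instead.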
Bounding box X coordinates could be truncated, causing them to be off
by a fraction of a point. This was most likely not visible, but it was
still wrong.
On read, ignore /DecodeParms when it is an empty list; on write, delete it.
Some files have been found that include an empty list for
/DecodeParms, but this is not technically compliant with the spec, and
the only sensible interpretation is to treat it as if there are no
decode parameters.
The preservation of outlines didn't provide very useful behavior
anyway as it copied all outlines but most didn't work. This
implementation also caused a very significant performance hit and so
is being reverted until a proper solution can be coded. The eventual
solution will not be compatible with the reverted solution anyway, so
it's best not to leave this in.
Embarcadero C++Builder doesn't support more than 50 files open at the
same time for legacy 32-bit apps, which makes a test fail when it
tries to open more than that many files. This changes the number of
open files for that test to far fewer so that the test succeeds.
Alternatively, one could reduce the hard-coded limit of 200 in QPDF
itself, which I didn't do for now because it requires updating the
manuals etc. and needs to be discussed with the author of QPDF. I
guess chances are better to get the test changed upstream.
This fixes #288: https://github.com/qpdf/qpdf/issues/288
There have been issues reported where exceptions are not thrown
properly across shared library/DLL boundaries, so add a test
specifically to ensure that exceptions are caught as thrown.
We've actually seen a PDF file in the wild that contained EI
surrounded by delimiters inside the image data, which confused qpdf's
naive code. This significantly improves EI detection.
Add a version of expectInlineImage that takes an input source and
searches for EI. This is in preparation for improving the way EI is
found. This commit just refactors the code without changing the
functionality and adds tests to make sure the old and new code behave
identically.
When qpdf can't optimize an image because of an unsupported color
space, state this specifically. Recognize that many valid colorspaces
are not represented as name objects.
When linearizing a file or getting the list of all pages in a file,
detect if the pages tree contains a duplicated page object and, if so,
shallow copy it. This makes it possible to have a one to one mapping
of page positions to page objects.
When generating appearance streams for variable text annotations,
properly handle the cases of there being no appearance dictionary, no
appearance stream, or an appearance stream with no BMC..EMC marker.
With the exception of form field annotations when /NeedAppearances is
true, remove annotations that don't have appearance streams when
flattening. There is no reason to keep these when flattening since
they are invisible. This may include unchecked checkboxes, unshown
popup windows, etc.
Allow fine control over how passwords are encoded for writing, and
allow the password for reading to be given as a hexadecimal-encoded
string. Allow suppression of password recovery as a means to ensure
that the password you specify is actually the right one.
Previously, setting encryption permissions for R >= 3 set permission
bits in groups corresponding to menu options in Acrobat 5. The new API
allows the bits to be set individually.
Explicitly abandon removal of unreferenced resources if there are any
lexical errors in the page's contents. This case always generated a
warning, but it now also prevents removal of unreferenced resources,
thus strongly decreasing the likelihood of data loss.
The original QPDF is only required now when the source
QPDFObjectHandle is a stream that gets its stream data from a
QPDFObjectHandle::StreamDataProvider.
Some of the images were supposed to have no filter, but somewhere
along the line, they ended up with /FlateDecode, most likely because
qpdf rewrote the file without having --compress-streams=n specified.
If this error is repeated, it will cause a test failure.
On certain operations, such as iterating through all objects and
adding new indirect objects, walk through the entire object structure
and explicitly resolve any indirect references to non-existent
objects. That prevents new objects from springing into existence and
causing the previously dangling references to point to them.
Instead of directly putting the contents of the annotation appearance
streams into the page's content stream, add commands to render the
form XObjects directly. This is a more robust way to do it than the
original solution as it works properly with patterns and avoids
problems with resource name clashes between the pages and the form
XObjects.
Flatten annotations by integrating their appearance streams into the
content stream of the containing page. In the case of form fields,
only flatten if /NeedAppearances is false (or, equivalently, absent). If
flattening form fields, also remove /AcroForm from the document
catalog.
Unparse is admittedly strange, but I'd rather be strange and
consistent, and everything else in the qpdf library uses unparse to
serialize. (If you're reading this, the convention of using "unparse"
comes from the "clu" programming language.)
Some files in the test suite trigger antivirus warnings. These are
not infected files with malicious intent. They are test files to
ensure that qpdf does not crash when it encounters the files. This
change enables those files to be obfuscated in the source repository
so that checking out qpdf from version control or extracting the
source code doesn't trigger antivirus warnings.
If we are unable to filter a page's content streams, don't attempt to
remove objects from the page's resource dictionary. Also provide a
command line option to suppress resource removal in case we ever need
this as a workaround for some bug or broken PDF files.
If a failure to parse content streams is treated as a warning, there
is no way for a caller to know that a parsing operation has failed.
This is very dangerous and will likely result in data loss when token
filters or parser callbacks are in use.
It's not really a shallow copy. It just doesn't cross indirect object
boundaries. The old implementation had a bug that would cause multiple
shallow copies of the same object to share memory, which was not the
intention.
Remove calls to assertPageObject(). All cases in the library that
called assertPageObject() work fine if you don't call
assertPageObject() because nothing assumes anything that was being
checked by that call. Removing the calls enables more files to be
successfully processed.
Prior to this fix, if there was a loop detected in following /Prev
pointers in xref streams/tables, it would cause qpdf to lose data.
Note that this condition causes many PDF readers to hang or fail.
The QPDF_String::getUTF8Val() method was not treating strings that
weren't explicitly Unicode as PDF Doc Encoded. This only affects
characters in the range 0x80 through 0xa0.
Implement a TokenFilter class and refactor Pl_QPDFTokenizer to use a
TokenFilter class called ContentNormalizer. Pl_QPDFTokenizer is now a
general filter that passes data through a TokenFilter.
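As a rough sketch of what a token filter looks like (the method and
type names here follow qpdf's public QPDFObjectHandle::TokenFilter
interface as I understand it and should be treated as assumptions, not
as part of this change), a filter overrides handleToken and forwards
whatever it wants to keep:

    #include <qpdf/QPDFObjectHandle.hh>
    #include <qpdf/QPDFTokenizer.hh>

    // Sketch of a token filter that drops comment tokens and passes
    // everything else through unchanged. The override/pass-through
    // structure is the point; this is not the ContentNormalizer that
    // Pl_QPDFTokenizer uses internally.
    class StripComments: public QPDFObjectHandle::TokenFilter
    {
      public:
        virtual ~StripComments() = default;
        virtual void
        handleToken(QPDFTokenizer::Token const& token) override
        {
            if (token.getType() != QPDFTokenizer::tt_comment) {
                // forward the token unmodified to the downstream output
                writeToken(token);
            }
        }
    };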
Adding a trailing newline in content normalization damages files whose
contents are split across streams in the middle of tokens. Let
QPDFWriter add the newline with the indicator to ignore the newline,
which it already does. This changes the way some qdf files look.
Significant enhancements to the lexer to improve EOF handling and to
support comments and spaces as tokens. Various other minor issues were
fixed as well.
This tokenizes outer parts of the file, page content streams, and
object streams. It is for exercising the tokenizer in isolation and is
being introduced before reworking the lexical layer of qpdf.
Add options to enable the raw encryption key to be directly shown or
specified. Thanks to Didier Stevens <didier.stevens@gmail.com> for the
idea and contribution of one implementation of this idea.
Make sure to link from the source tree before linking from the system.
In many environments, this is necessary to allow a newly built qpdf to
link properly instead of trying to link or resolve libraries from an
older installed version.
If the stream isn't filterable but we call getStreamData, throw a
regular exception instead of a logic error so that normal error
handling and reporting mechanisms will be used.
While scanning the file looking for objects, limit the length of
tokens we allow. This prevents us from getting caught up in reading a
file character by character while digging through large streams.
Files written in PCLm mode have to be created in a very specific way.
qpdf doesn't know how to create PCLm files from scratch. All it knows
how to do is to write an already valid file in a suitable way.
Therefore there is no command-line support for PCLm.
There is no need for a --precheck-streams option. We can do the
precheck without imposing any penalty, only re-encoding the stream if
it fails the first time.
This commit adds several API methods that enable control over which
types of filters QPDF will attempt to decode. It also adds support for
/RunLengthDecode and /DCTDecode filters for both encoding and
decoding.
Very badly corrupted files may not have a retrievable root dictionary.
Handle that as a special case so that a more helpful error message can
be provided.
When requested, QPDFWriter will do more aggressive prechecking of streams
to make sure it can actually succeed in decoding them before
attempting to do so. This will allow preservation of raw data even
when the raw data is corrupted relative to the specified filters.
QPDFObjectHandle::parseInternal now issues warnings instead of
throwing exceptions for all error conditions that it finds (except
internal logic errors) and has stronger recovery for things like
invalid tokens and malformed dictionaries. This should improve qpdf's
ability to recover from a wide range of broken files that currently
cause it to fail.
fixes #117
fixes #118
fixes #119
fixes #120
Several other infinite loop bugs were fixed by previous changes.
Include their test files in the test suite.
During parsing of an object, sometimes parts of the object have to be
resolved. An example is stream lengths. If such an object directly or
indirectly points to the object being parsed, it can cause an infinite
loop. Guard against all cases of re-entrant resolution of objects.
This is CVE-2017-9208.
The QPDF library uses object ID 0 internally as a sentinel to
represent a direct object, but prior to this fix, it was not handling
0 0 obj or 0 0 R as a special case. Creating an object in the file
with 0 0 obj could cause various infinite loops. The PDF spec doesn't
allow for object 0. Having qpdf handle object 0 might be a better fix,
but changing all the places in the code that assume objid == 0 means
direct would be risky.
This is CVE-2017-9210.
The description string for an error message included unparsing an
object, which is too complex of a thing to try to do while throwing an
exception. There was only one example of this in the entire codebase,
so it is not a pervasive problem. Fixing this eliminated one class of
infinite loop errors.
/dev/null is not portable, so use File::Spec instead, which provides
portable "paths" and in particular "nul" on Windows. I changed all
places with hard-coded /dev/null to be safe, though I think it is only
a problem in direct system calls, because the other commands are run
through sh.exe from MSYS, which should itself map /dev/null to NUL.
The tests still pass, so this shouldn't have done any harm.
For non-encrypted files, deterministic ID generation uses file contents
instead of timestamp and file name. At a small runtime cost, this
enables generation of the same /ID if the same inputs are converted in
the same way multiple times.
fix-qdf was previously hard-coding the number of bytes for the f2
field of the xref stream entry. This addresses issue #37. Thanks to
aluebcke for reporting.
Pushing inherited objects to pages and getting all pages were both
prone to stack overflow infinite loops if there were loops in the
Pages dictionary. There is a general weakness in the code in that any
part of the code that traverses the Pages structure would be prone to
this and would have to implement its own loop detection. A more robust
fix may provide some general method for handling the Pages structure,
but it's probably not worth doing.
Note: addition of *Internal2 private functions was done rather than
changing signatures of existing methods to avoid breaking
compatibility.
When checking the two objects preceding R while parsing, ensure that
the objects are direct. This prevents constructs like 1 0 obj
containing 1 0 R 0 R from causing an infinite loop in object
resolution.
Original reported here:
https://bugs.launchpad.net/ubuntu/+source/qpdf/+bug/1397413
The PDF specification says that the /Type key for nodes in the pages
dictionary (both /Page and /Pages) is required, but some PDF files
omit them. Use the presence of other keys to determine the type of a
pages tree node when the /Type key is not found.
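A minimal sketch of that kind of heuristic (a hypothetical helper, not
qpdf's actual code): trust /Type when it is present and a name;
otherwise fall back to the presence of /Kids to decide whether the
node is an intermediate /Pages node or a leaf /Page.

    #include <qpdf/QPDFObjectHandle.hh>

    #include <string>

    // Guess the page tree node type when /Type is missing or malformed.
    std::string
    guessPageTreeNodeType(QPDFObjectHandle node)
    {
        if (node.isDictionary() && node.getKey("/Type").isName()) {
            // /Type is present and well-formed; use it as given
            return node.getKey("/Type").getName();
        }
        // Intermediate nodes carry /Kids; leaf pages do not.
        return (node.isDictionary() && node.hasKey("/Kids")) ? "/Pages"
                                                             : "/Page";
    }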