Create a computationally and memory efficient implementation of name
and number trees that does binary searches as intended by the data
structure rather than loading into a map, which can use a great deal
of memory and can be very slow.
If we ever had an encrypted file with different filters for
attachments and either the /EmbeddedFiles name tree was deep or some
of the file specs didn't have /Type, we would have overlooked those as
attachment streams. The code now properly handles /EmbeddedFiles as a
name tree.
Avoid calling finish() multiple times on the pipeline passed to
pipeContentStreams. This commit also fixes a bug in which qpdf was not
exiting with the proper exit status if warnings found while splitting
pages; this was exposed by a test case that changed.
Make some more methods in QPDFPageObjectHelper work with form
XObjects, provide forEach methods to walk through nested form
XObjects, possibly recursively. This should make it easier to work
with form XObjects from user code.
Also removes preclusion of stream references in stream parameters of
filterable streams and reduces write times by about 8% by eliminating
an extra traversal of the objects.
This reverts an incorrect fix to #449 and codes it properly. The real
problem was that we were looking at the local dictionaries rather than
the foreign dictionaries when saving the foreign stream data. In the
case of direct objects, these happened to be the same, but in the case
of indirect objects, the object references could be pointing anywhere
since object numbers don't match up between the old and new files.
The jpeg library has some assembly code that is missed by the compiler
instrumentation used by memory sanitization. There is a runtime
environment variable that is used to work around this issue.
Specifically, if a stream had its stream data replaced and had
indirect /Filter or /DecodeParms, it would result in non-silent loss
of data and/or internal error.
OPENSSL_IS_BORINGSSL is not actually set by configure, so it will be
undefined until a BoringSSL header is included. Hence the #ifdef logic
in QPDFCrypto_openssl.h would usually never apply.
This still worked because evp.h transitively included BoringSSL's
cipher.h and digest.h, but the latter are the correct (documented)
headers.
By re-ordering the includes, we can ensure the macro is defined when we
use it.
Also: fix case in the header guards.
On large files with predominantly \n line endings, memchr(..'\r'..)
seems to waste a considerable amount of time searching for a line
ending candidate that we don't need.
On the Adobe PDF Reference Manual 1.7, this commit is 8x faster at
QPDF::processMemoryFile().
If the value of /CS in the inline image dictionary was is key in the
page's /Resource -> /ColorSpace dictionary, properly resolve it by
referencing the proper colorspace, and not just the name, in the
external image dictionary.
Includes updates to m4/ax_cxx_compile_stdcxx.m4 to make it work with
msvc, which supports C++-11 with no flags but doesn't set __cplusplus
to a recent value.
Various PDF digital signing tools do not encrypt /Contents value in
signature dictionary. Adobe Acrobat Reader DC can handle a PDF with
the /Contents value not encrypted.
Write Contents in signature dictionary without encryption
Tests ensure that string /Contents are not handled specially when not
found in sig dicts.
It seems better not to compress signature dictionaries. Various PDF
digital signing tools, including Adobe Acrobat Reader DC, do not
compress signature dictionaries.
Table 8.93 "Entries in a signature dictionary" in PDF 1.5 reference
describes that /ByteRange in the signature dictionary shall be used to
describe a digest that does not include the signature value
(/Contents) itself.
The byte ranges cannot be determined if the dictionary is compressed.
Table 8.93 "Entries in a signature dictionary" in PDF 1.5 reference
describes that the value of Contents entry is a hexadecimal string
representation when ByteRange is specified.
This commit makes QPDF always uses hexadecimal strings representation
instead of literal strings for it.
Ordinarily the trailer doesn't contain any strings, so this is usually
a non-issue, but if the trailer contains strings, linearizing and
encrypting with object streams would include encrypted strings in the
trailer, which would blow out the padding because encrypted strings
are longer than their cleartext counterparts.
It's detected in QPDFWriter instead of at parse time because I can't
figure out how to construct a test case in a reasonable time. This
commit moves the fuzz file into the regular test suite for a QTC
coverage case.
When seeing to a position based on a value read from the input, we are
prone to integer overflow (fuzz issue 15442). Seek in two stages to
move the overflow check into the input source code.
For some reason, qpdf from the beginning was replacing indirect
references to null with literal null in arrays even after removing the
old behavior of flattening scalar references. This seems like a bad
idea.
This message used to only appear for PDF >= 1.2. The invalid name is
valid for PDF 1.0 and 1.1. However, since QPDFWriter may write a newer
version, it's better to detect and warn in all cases. Therefore make
the warning more informative.
This change works around STL problems with Embarcadero C++ Builder
version 10.2, but std::vector is more common than std::list in qpdf,
and this is a relatively new API, so an API change is tolerable.
Thanks to Thorsten Schöning <6223655+ams-tschoening@users.noreply.github.com>
for the fix.
This also reverts the addition of a new checkLinearization that
distinguishes errors from warnings. There's no practical distinction
between what was considered an error and what was considered a
warning.
Use PointerHolder in several places where manually memory allocation
and deallocation were being used. This helps to protect against memory
leaks when exceptions are thrown in surprising places.
In a small number of cases, it makes sense to replace an overloaded
function with a function that takes a default argument. We can do this
now because we've already broken binary compatibility since the last
release.
Have classes contain only a single private member of type
PointerHolder<Members>. This makes it safe to change the structure of
the Members class without breaking binary compatibility. Many of the
classes already follow this pattern quite successfully. This brings in
the rest of the class that are part of the public API.
* Several assertions in linearization were not always true; change
them to run time errors
* Handle a few cases of uninitialized objects
* Handle pages with no contents when doing form operations
* Handle invalid page tree nodes when traversing pages
This makes all integer type conversions that have potential data loss
explicit with calls that do range checks and raise an exception. After
this commit, qpdf builds with no warnings when -Wsign-conversion
-Wconversion is used with gcc or clang or when -W3 -Wd4800 is used
with MSVC. This significantly reduces the likelihood of potential
crashes from bogus integer values.
There are some parts of the code that take int when they should take
size_t or an offset. Such places would make qpdf not support files
with more than 2^31 of something that usually wouldn't be so large. In
the event that such a file shows up and is valid, at least qpdf would
raise an error in the right spot so the issue could be legitimately
addressed rather than failing in some weird way because of a silent
overflow condition.
Bounding box X coordinates could be truncated, causing them to be off
by a fraction of a point. This was most likely not visible, but it was
still wrong.
On read, ignore /DecodeParms when empty list; on write, delete it.
Some files have been found that include an empty list for
/DecodeParms, but this is not technically compliant with the spec, and
the only sensible interpretation is to treat it as if there are no
decode parameters.
* [bcc32 Error] QPDF.cc(375): E2268 Call to undefined function 'atof'
Full parser context
QPDF.cc(358): parsing: void QPDF::parse(const char *)
* [bcc32 Error] QPDFTokenizer.cc(183): E2268 Call to undefined function 'strtol'
Full parser context
QPDFTokenizer.cc(163): parsing: void QPDFTokenizer::resolveLiteral()
* [bcc32 Error] pdf-split-pages.cc(52): E2268 Call to undefined function 'exit'
Full parser context
pdf-split-pages.cc(50): parsing: void usage()
* PR #295: Including "cstdlib" should be replaced with "stdlib.h" to be more consistent. At the same time I changed the order of the surrounding includes to reflect alphabetical order, because at some files this already have been the case.
We've actually seen a PDF file in the wild that contained EI
surrounded by delimiters inside the image data, which confused qpdf's
naive code. This significantly improves EI detection.
Add a version of expectInlineImage that takes an input source and
searches for EI. This is in preparation for improving the way EI is
found. This commit just refactors the code without changing the
functionality and adds tests to make sure the old and new code behave
identically.
When linearizing a file or getting the list of all pages in a file,
detect if the pages tree contains a duplicated page object and, if so,
shallow copy it. This makes it possible to have a one to one mapping
of page positions to page objects.
When generating appearance streams for variable text annotations,
properly handle the cases of there being no appearance dictionary, no
appearance stream, or an appearance stream with no BMC..EMC marker.
With the exception of form field annotations when /NeedAppearances is
true, remove annotations that don't have appearance streams when
flattening. There is no reason to keep these when flattening since
they are invisible. This may include unchecked checkboxes, unshown
popup windows, etc.
Setting encryption permissions for R >= 3 set permission bits in
groups corresponding to menu options in Acrobat 5. The new API allows
the bits to be set individually.
Explicitly abandon removal of unreferenced resources if there are any
lexical errors in the page's contents. This case always generated a
warning, but it now also prevents removal of unreferenced resources,
this strongly decreasing the likelihood of data loss.
When removing unreferenced resources, the code was copying the overall
resource dictionaries but not the subdictionaries being modified. This
was a "typo" in the code -- the comment clearly stated the need to do
this, but the code replaced the dictionary with itself rather than
with a shallow copy of itself.
If set, we avoid using Windows I/O HANDLE, which is disallowed in some
versions of the Windows SDK, such as for Windows phones.
QUtil::same_file will always return false in this case. Only applies
to Windows builds.
The original QPDF is only required now when the source
QPDFObjectHandle is a stream that gets its stream data from a
QPDFObjectHandle::StreamDataProvider.
Instead of calling assert for problems found during checking
linearization data, throw an exception which is later caught and
issued as an error. Ideally we would handle errors more robustly, but
this is still a significant improvement.