This includes the output PDF, streams from --show-object and
attachments from --save-attachment. This also enables --verbose and
--progress to work with saving to stdout.
* Replace --create-from-json=file with --json-input, which causes the
regular input to be treated as json.
* Eliminate --to-json
* In --json=2, bring back "objects" and eliminate "objectinfo". Stream
data is never present.
* In --json-output=2, write "qpdf-v2" with "objects" and include
stream data.
These mean to leave the original values alone. This is needed for
reconstructing streams from JSON given that the stream data and stream
dictionary may appear in any order in the JSON.
When an empty string was passed to replaceStreamData, the code was
passing a null pointer to memcpy. Since a 0 size was also passed, this
was harmless, but it triggers sanitizer errors. The code properly
handles a null pointer as the buffer in other places.
Call the parent container's item method before calling the child
item's start method so we can easily know the current nesting level
when nested items are added.
moddify -> modify. Also carefully spell checked all remaining keys by
splitting them into words and running a spell checker, not just
relying on visual proofreading. That was the only one.
This script was used on test data:
----------
#!/usr/bin/env python3
import json
import sys
import re
def json_dumps(data):
return json.dumps(data, ensure_ascii=False,
indent=2, separators=(',', ': '))
for filename in sys.argv[1:]:
with open(filename, 'r') as f:
data = json.loads(f.read())
if 'objectinfo' not in data:
continue
trailer = None
to_sort = []
for k, v in data['objectinfo'].items():
if k == 'trailer':
trailer = v
else:
m = re.match(r'^(\d+) \d+ R', k)
if m:
to_sort.append([int(m.group(1)), k, v])
newobjectinfo = {x[1]: x[2] for x in sorted(to_sort)}
if trailer is not None:
newobjectinfo['trailer'] = trailer
data['objectinfo'] = newobjectinfo
print(json_dumps(data))
----------
The following script was used to adjust test data:
----------
#!/usr/bin/env python3
import json
import sys
import re
def json_dumps(data):
return json.dumps(data, ensure_ascii=False,
indent=2, separators=(',', ': '))
for filename in sys.argv[1:]:
with open(filename, 'r') as f:
data = json.loads(f.read())
if 'objects' not in data:
continue
trailer = None
to_sort = []
for k, v in data['objects'].items():
if k == 'trailer':
trailer = v
else:
m = re.match(r'^(\d+) \d+ R', k)
if m:
to_sort.append([int(m.group(1)), k, v])
newobjects = {x[1]: x[2] for x in sorted(to_sort)}
if trailer is not None:
newobjects['trailer'] = trailer
data['objects'] = newobjects
print(json_dumps(data))
----------
This commit just changes the order in which fields are written to the
json without changing their content. All the json files in the test
suite were modified with this script to ensure that we didn't get any
changes other than ordering.
----------
#!/usr/bin/env python3
import json
import sys
def json_dumps(data):
return json.dumps(data, ensure_ascii=False,
indent=2, separators=(',', ': '))
for filename in sys.argv[1:]:
with open(filename, 'r') as f:
data = json.loads(f.read())
newdata = {}
for i in ('version', 'parameters', 'pages', 'pagelabels',
'acroform', 'attachments', 'encrypt', 'outlines',
'objects', 'objectinfo'):
if i in data:
newdata[i] = data[i]
print(json_dumps(newdata))
----------
Where not possible, use "auto" to get the iterator type.
Editorial note: I have avoid this change for a long time because of
not wanting to make gratuitous changes to version history, which can
obscure when certain changes were made, but with having recently
touched every single file to apply automatic code formatting and with
making several broad changes to the API, I decided it was time to take
the plunge and get rid of the older (pre-C++11) verbose iterator
syntax. The new code is just easier to read and understand, and in
many cases, it will be more effecient as fewer temporary copies are
being made.
m-holger, if you're reading, you can see that I've finally come
around. :-)
Character transcoding from Unicode to single-byte characters used
hard-coded switch statements because the code predated our adoption of
C++11. Now we have thread-safe, static initialization of map literals,
so use that instead.
* Change DLL_EXPORT to libqpdf_EXPORTS (internal to the build). The
new name is cmake's default, is more conventional, and is less
likely to clash with other symbols.
* Add QPDF_DLL_PRIVATE for non-Windows
* Make logic around when to define QPDF_DLL et al more explicit
* Add detailed comments
Prior to the cmake conversion, several private classes had methods
that were exported into the shared library so they could be tested
with libtests. With cmake, we build libtests using an object library,
so this is no longer necessary. The methods that are disappearing from
the ABI were never exposed through public headers, so no code should
be using them. Removal had to wait until the window for ABI-breaking
changes was open.
Add comments to force line breaks, parenthesize function arguments
that are contatenated strings, etc. -- these kinds of changes improve
clang-format's results and also cause emacs cc-mode to match
clang-format. After this type of change, most of the time, when
clang-format and emacs disagree, clang-format is better.
The executables that libtool built invoked the underlying binary with
an "lt-" prefix. The code contained numerous workarounds for testing,
which can now be removed.
There are codepoints in PDFDoc that are not valid UTF-8 but map to
valid UTF-8. We were handling those correctly with bidirectional
mapping.
However, if those same code points appeared in UTF-8, where they have
no meaning, they were left as fixed points when converting to PDFDoc,
where they do have meaning. This change recognizes them as errors.
Remove test for type == /XObject in QPDFObjectHandle::isFormXObject
as type value is optional (as per spec 8.10.2).
Replace code to test for /Form in QPDFJob::shouldRemoveUnreferencedResources
with a call to isFormXObject.
This comment expands all tabs using an 8-character tab-width. You
should ignore this commit when using git blame or use git blame -w.
In the early days, I used to use tabs where possible for indentation,
since emacs did this automatically. In recent years, I have switched
to only using spaces, which means qpdf source code has been a mixture
of spaces and tabs. I have avoided cleaning this up because of not
wanting gratuitous whitespaces change to cloud the output of git
blame, but I changed my mind after discussing with users who view qpdf
source code in editors/IDEs that have other tab widths by default and
in light of the fact that I am planning to start applying automatic
code formatting soon.
* Use unique_ptr in place of shared_ptr in some cases
* unique_ptr for arrays does not require a custom deleter
* use std::make_unique (c++14) where possible
Use get() and use_count() instead. Add #define
NO_POINTERHOLDER_DEPRECATION to remove deprecation markers for these
only.
This commit also removes all deprecated PointerHolder API calls from
qpdf's code except in PointerHolder's test suite, which must continue
to test the deprecated APIs.
All the coverage cases that used to be in qpdf.cc are now in
QPDFJob*.cc. It doesn't really matter, but better to follow the
convention of starting with the class that includes the coverage call.
For optional parameter/choices, generate an overloaded config method
that takes no arguments. This makes it possible to convert from a bare
argument to one that takes an optional parameter without breaking
binary compatibility.
Changing from bool requiring true to string requiring the empty string
is more consistent with the CLI and makes it possible to add an
optional parameter or choices later without breaking compatibility.
Flatten everything to make it easier to map command-line flags to
json. The old structure was an illusion anyway because there was no
mechanism to enforce that things were in the right place. This also
helps with future flexibility.
The code was assuming everything was happening inside dictionaries.
Instead, make the dictionary key handler creatino explicit only when
iterating through dictionary keys.
If some keys depend on others, we have to check up front since there
is no control of what order key handlers will be called. Anyway, keys
are unordered in json, so we don't want to depend on ordering.
Since the functionality of argument parsing has moved into QPDFJob,
these classes no longer need to be public. Their methods still have to
be in the library's binary interface so they can be tested in libtests.
Why? The main methods that create them return smart pointers so that
users can initialize them when needed, which you can't do with
references. Returning pointers instead of references makes for a more
uniform interface.
Create QPDFJob_options.cc to hold API implementation functions.
Reorganize a little in preparation for moving public member variables
private and creating the real QPDFJob API that will be used by callers
as well as the argv/json initialization methods.
The previous commits have removed all references to memory from
QPDFArgParser from QPDFJob. This commit removes the constraint that
QPDFArgParser remain in scope. This is a prerequisite to allowing JSON
as an alternative way to initialize QPDFJob and to initialize it
directly using a public API.
This is used to generate a schema for the job json, which can't
contain `)"` because it breaks the R"(...)" syntax in C++. While C++
accepts R"anything(...)anything" to avoid this, as of this writing,
MSVC 2019 doesn't understand that. For now, just avoid it by removing
parentheses from the end of short help.
This is a massive rewrite of the help text and cli.rst section of the
manual. All command-line flags now have their own help and are
specifically index. qpdf --help is completely redone.
Handle optional choices in addition to required choices. Refactor the
way help options are added to completion to make it work with optional
help choices.
Move ArgParser from qpdf.cc into QPDFJob.cc. It still works with
millions of public member variables, but now qpdf.cc is minimal and
just calls stable library functions.
Convert remaining static functions that take QPDFJob& as a parameter
to member functions. Utility functions that don't take QPDFJob& remain
static functions and can probably just stay that way since the keep
extra complexity out of QPDFJob.hh.
Remove remaining temporary duplication of hard-coded values and direct
access to std::cout, std::cerr, and whoami in favor of parameters in
QPDFJob. This moves a few more static methods into QPDFJob member
functions.
Remove all calls to exit() from QPDFJob. Handle code that runs in
verbose mode to enable it to make use of output streams and message
prefix (whoami) from QPDFJob. This removes temporarily duplicated exit
code logic and most access to whoami/std::cout outside of QPDFJob
proper.
Move most of the methods called from qpdf.cc after argument parsing
into QPDFJob. In this increment, enough QPDFJob API has been added to
handle the branch of QPDFJob::run() that creates output with an
appropriate division between qpdf.cc and QPDFJob.
There are temporary bits of code to enable everything to compile and
pass the test suite, including some duplication and hard-coded values.
They have to be ot_* rather than qpdf_ot_* for compatibility.
* Different enumerated types are not assignment-compatible in C++, at
least with strict compiler settings
* While you can do `constexpr ot_xyz = ::qpdf_ot_xyz` in QPDFObject.hh to
make QPDFObject::ot_xyz work, QPDFObject::object_type_e::ot_xyz will
only work if the enumerated type names are the same.
* Handle error conditions that occur when using the object handle
interfaces. In the past, some exceptions were not correctly
converted to errors or warnings.
* Add more detailed information to qpdf-c.h
* Make it possible to work more explicitly with uninitialized objects
Don't assume endobj is at the beginning of the line. This means we are
looking at tokens for every line, but the odds of n n obj appearing in
the middle of the object are likely much lower than endobj not being
at the beginning of the line or missing entirely. This will probably
have a negative impact on recovery time for very large files.
Hopefully it will be worth it.
When making resources indirect in from_dr, the code was using the
wrong owning QPDF, forgetting that from_dr had already been copied
using CopyForeignObject.
When adding a QPDFObjectHandle to an array or dictionary, if possible,
check if the new object belongs to the same QPDF. This makes it much
easier to find incorrect code than waiting for the situation to be
detected when the file is written.
This was originally not public because I wanted to get rid fo the
pages cache, but I recently realized there were deep reasons not to do
that, and the author of pikepdf wanted this, so I decided to make it
public.
Operations that add the same object to multiple places in the pages
tree are throwing exceptions and then later causing assertion
failures. The assert calls shouldn't be there.
Converted ResourceFinder to ParserCallbacks so we can better detect
the name that precedes various operators and use the operators to sort
the names into resource types. This enables us to be smarter about
detecting unreferenced resources in pages and also sets the stage for
reconciling differences in /DR across documents.
If not found in the field hierarchy, /Q and /DA are supposed to be
looked up in the document-level form dictionary. /DR is supposed to
only come from the document dictionary.
This results in a performance penalty of 1% to 2% when replaceObject
and swapObjects are never called and a somewhat larger penalty if they
are called, but it's worth it to avoid very confusing behavior as
discussed in depth in qpdf#507.
Also fix a bug in checking consistency of length for stream data
providers. Length should not be checked or recorded if the provider
says it failed to generate the data.
I thought /EFF was supposed to be used as a default for decrypting
embedded file streams, but actually it's supposed to be advice to a
conforming writer about handling new ones. This makes sense since the
findAttachmentStreams code, which is not actually needed, was never
right.
When removing unreferenced resources, notice if a page (recursively)
contains a form XObject with unreferenced resources, and count any
such resources as referenced by the page.
Keep a std::pair internal to the iterators so that operator* can
return a reference and operator-> can work, and each can work without
copying pairs of objects around.
Create a computationally and memory efficient implementation of name
and number trees that does binary searches as intended by the data
structure rather than loading into a map, which can use a great deal
of memory and can be very slow.
If we ever had an encrypted file with different filters for
attachments and either the /EmbeddedFiles name tree was deep or some
of the file specs didn't have /Type, we would have overlooked those as
attachment streams. The code now properly handles /EmbeddedFiles as a
name tree.
Avoid calling finish() multiple times on the pipeline passed to
pipeContentStreams. This commit also fixes a bug in which qpdf was not
exiting with the proper exit status if warnings found while splitting
pages; this was exposed by a test case that changed.
Make some more methods in QPDFPageObjectHelper work with form
XObjects, provide forEach methods to walk through nested form
XObjects, possibly recursively. This should make it easier to work
with form XObjects from user code.
Also removes preclusion of stream references in stream parameters of
filterable streams and reduces write times by about 8% by eliminating
an extra traversal of the objects.
This reverts an incorrect fix to #449 and codes it properly. The real
problem was that we were looking at the local dictionaries rather than
the foreign dictionaries when saving the foreign stream data. In the
case of direct objects, these happened to be the same, but in the case
of indirect objects, the object references could be pointing anywhere
since object numbers don't match up between the old and new files.
The jpeg library has some assembly code that is missed by the compiler
instrumentation used by memory sanitization. There is a runtime
environment variable that is used to work around this issue.
Specifically, if a stream had its stream data replaced and had
indirect /Filter or /DecodeParms, it would result in non-silent loss
of data and/or internal error.
OPENSSL_IS_BORINGSSL is not actually set by configure, so it will be
undefined until a BoringSSL header is included. Hence the #ifdef logic
in QPDFCrypto_openssl.h would usually never apply.
This still worked because evp.h transitively included BoringSSL's
cipher.h and digest.h, but the latter are the correct (documented)
headers.
By re-ordering the includes, we can ensure the macro is defined when we
use it.
Also: fix case in the header guards.
On large files with predominantly \n line endings, memchr(..'\r'..)
seems to waste a considerable amount of time searching for a line
ending candidate that we don't need.
On the Adobe PDF Reference Manual 1.7, this commit is 8x faster at
QPDF::processMemoryFile().
If the value of /CS in the inline image dictionary was is key in the
page's /Resource -> /ColorSpace dictionary, properly resolve it by
referencing the proper colorspace, and not just the name, in the
external image dictionary.
Includes updates to m4/ax_cxx_compile_stdcxx.m4 to make it work with
msvc, which supports C++-11 with no flags but doesn't set __cplusplus
to a recent value.
Various PDF digital signing tools do not encrypt /Contents value in
signature dictionary. Adobe Acrobat Reader DC can handle a PDF with
the /Contents value not encrypted.
Write Contents in signature dictionary without encryption
Tests ensure that string /Contents are not handled specially when not
found in sig dicts.
It seems better not to compress signature dictionaries. Various PDF
digital signing tools, including Adobe Acrobat Reader DC, do not
compress signature dictionaries.
Table 8.93 "Entries in a signature dictionary" in PDF 1.5 reference
describes that /ByteRange in the signature dictionary shall be used to
describe a digest that does not include the signature value
(/Contents) itself.
The byte ranges cannot be determined if the dictionary is compressed.
Table 8.93 "Entries in a signature dictionary" in PDF 1.5 reference
describes that the value of Contents entry is a hexadecimal string
representation when ByteRange is specified.
This commit makes QPDF always uses hexadecimal strings representation
instead of literal strings for it.
Ordinarily the trailer doesn't contain any strings, so this is usually
a non-issue, but if the trailer contains strings, linearizing and
encrypting with object streams would include encrypted strings in the
trailer, which would blow out the padding because encrypted strings
are longer than their cleartext counterparts.
It's detected in QPDFWriter instead of at parse time because I can't
figure out how to construct a test case in a reasonable time. This
commit moves the fuzz file into the regular test suite for a QTC
coverage case.
When seeing to a position based on a value read from the input, we are
prone to integer overflow (fuzz issue 15442). Seek in two stages to
move the overflow check into the input source code.
For some reason, qpdf from the beginning was replacing indirect
references to null with literal null in arrays even after removing the
old behavior of flattening scalar references. This seems like a bad
idea.
This message used to only appear for PDF >= 1.2. The invalid name is
valid for PDF 1.0 and 1.1. However, since QPDFWriter may write a newer
version, it's better to detect and warn in all cases. Therefore make
the warning more informative.
This change works around STL problems with Embarcadero C++ Builder
version 10.2, but std::vector is more common than std::list in qpdf,
and this is a relatively new API, so an API change is tolerable.
Thanks to Thorsten Schöning <6223655+ams-tschoening@users.noreply.github.com>
for the fix.
This also reverts the addition of a new checkLinearization that
distinguishes errors from warnings. There's no practical distinction
between what was considered an error and what was considered a
warning.
Use PointerHolder in several places where manually memory allocation
and deallocation were being used. This helps to protect against memory
leaks when exceptions are thrown in surprising places.
In a small number of cases, it makes sense to replace an overloaded
function with a function that takes a default argument. We can do this
now because we've already broken binary compatibility since the last
release.
Have classes contain only a single private member of type
PointerHolder<Members>. This makes it safe to change the structure of
the Members class without breaking binary compatibility. Many of the
classes already follow this pattern quite successfully. This brings in
the rest of the class that are part of the public API.
* Several assertions in linearization were not always true; change
them to run time errors
* Handle a few cases of uninitialized objects
* Handle pages with no contents when doing form operations
* Handle invalid page tree nodes when traversing pages
This makes all integer type conversions that have potential data loss
explicit with calls that do range checks and raise an exception. After
this commit, qpdf builds with no warnings when -Wsign-conversion
-Wconversion is used with gcc or clang or when -W3 -Wd4800 is used
with MSVC. This significantly reduces the likelihood of potential
crashes from bogus integer values.
There are some parts of the code that take int when they should take
size_t or an offset. Such places would make qpdf not support files
with more than 2^31 of something that usually wouldn't be so large. In
the event that such a file shows up and is valid, at least qpdf would
raise an error in the right spot so the issue could be legitimately
addressed rather than failing in some weird way because of a silent
overflow condition.
Bounding box X coordinates could be truncated, causing them to be off
by a fraction of a point. This was most likely not visible, but it was
still wrong.
On read, ignore /DecodeParms when empty list; on write, delete it.
Some files have been found that include an empty list for
/DecodeParms, but this is not technically compliant with the spec, and
the only sensible interpretation is to treat it as if there are no
decode parameters.
* [bcc32 Error] QPDF.cc(375): E2268 Call to undefined function 'atof'
Full parser context
QPDF.cc(358): parsing: void QPDF::parse(const char *)
* [bcc32 Error] QPDFTokenizer.cc(183): E2268 Call to undefined function 'strtol'
Full parser context
QPDFTokenizer.cc(163): parsing: void QPDFTokenizer::resolveLiteral()
* [bcc32 Error] pdf-split-pages.cc(52): E2268 Call to undefined function 'exit'
Full parser context
pdf-split-pages.cc(50): parsing: void usage()
* PR #295: Including "cstdlib" should be replaced with "stdlib.h" to be more consistent. At the same time I changed the order of the surrounding includes to reflect alphabetical order, because at some files this already have been the case.
We've actually seen a PDF file in the wild that contained EI
surrounded by delimiters inside the image data, which confused qpdf's
naive code. This significantly improves EI detection.
Add a version of expectInlineImage that takes an input source and
searches for EI. This is in preparation for improving the way EI is
found. This commit just refactors the code without changing the
functionality and adds tests to make sure the old and new code behave
identically.
When linearizing a file or getting the list of all pages in a file,
detect if the pages tree contains a duplicated page object and, if so,
shallow copy it. This makes it possible to have a one to one mapping
of page positions to page objects.
When generating appearance streams for variable text annotations,
properly handle the cases of there being no appearance dictionary, no
appearance stream, or an appearance stream with no BMC..EMC marker.
With the exception of form field annotations when /NeedAppearances is
true, remove annotations that don't have appearance streams when
flattening. There is no reason to keep these when flattening since
they are invisible. This may include unchecked checkboxes, unshown
popup windows, etc.
Setting encryption permissions for R >= 3 set permission bits in
groups corresponding to menu options in Acrobat 5. The new API allows
the bits to be set individually.
Explicitly abandon removal of unreferenced resources if there are any
lexical errors in the page's contents. This case always generated a
warning, but it now also prevents removal of unreferenced resources,
this strongly decreasing the likelihood of data loss.
When removing unreferenced resources, the code was copying the overall
resource dictionaries but not the subdictionaries being modified. This
was a "typo" in the code -- the comment clearly stated the need to do
this, but the code replaced the dictionary with itself rather than
with a shallow copy of itself.
If set, we avoid using Windows I/O HANDLE, which is disallowed in some
versions of the Windows SDK, such as for Windows phones.
QUtil::same_file will always return false in this case. Only applies
to Windows builds.
The original QPDF is only required now when the source
QPDFObjectHandle is a stream that gets its stream data from a
QPDFObjectHandle::StreamDataProvider.