mirror of
https://github.com/qpdf/qpdf.git
synced 2025-01-06 00:30:40 +00:00
479 lines
22 KiB
Plaintext
479 lines
22 KiB
Plaintext
Fuzz Errors
|
|
===========
|
|
|
|
* https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=<N>
|
|
|
|
* To fix:
|
|
* 19253 - indirect leaks. Not sure of the cause, but it might have
|
|
something to do with multiple instances of the same object being
|
|
read and discarded during file recovery. Maybe there's a missing
|
|
call to releaseResolved.
|
|
|
|
* Ignoring these:
|
|
* Problems inside the jpeg library: 15470, 15751, 18633, 18732,
|
|
18745, 20391
|
|
* Timeout: 15471, 17630
|
|
|
|
Windows Build
|
|
=============
|
|
|
|
Update the Windows build so that it uses current versions of external
|
|
libraries and uses either gnutls or openssl as its crypto provider.
|
|
|
|
ABI Changes
|
|
===========
|
|
|
|
This is a list of changes to make next time there is an ABI change.
|
|
Comments appear in the code prefixed by "ABI"
|
|
|
|
C++-11
|
|
======
|
|
|
|
* Search for ::iterator and ::const_iterator and replace with either
|
|
auto or foreach-style iteration.
|
|
|
|
* There may be some places where std::function and lambdas can
|
|
simplify handlers rather than using classes with apply methods.
|
|
|
|
* My c++11 branch adds re-implements PointerHolder so that it is
|
|
interchangeable with std::shared_ptr. We may not actually want to
|
|
ever do this because it turns out PointerHolder is slightly more
|
|
performant than std::shared_ptr, at least as of g++ 9.2.1. It is not
|
|
actually possible to just replace PointerHolder with std::shared_ptr
|
|
for two reasons: there is no automatic creation of
|
|
std::shared_ptr<T> from T* like there is for PointerHolder, which
|
|
breaks some code, and also there is no automatic conversion from
|
|
something like std::vector<PointerHolder<T>> to
|
|
std::vector<std::shared_ptr<T>>. It may be a good idea to replace
|
|
PointerHolder with std::shared_ptr in the API even if it requires
|
|
some work for the developer, but even if that isn't worth it, we
|
|
should find all occurrences of PointerHolder within the code and
|
|
replace with std::shared_ptr or std::unique_ptr as needed. This will
|
|
definitely break binary compatibility as the PointerHolder<Members>
|
|
pattern is part of the ABI for almost every class.
|
|
|
|
Page splitting/merging
|
|
======================
|
|
|
|
* Update page splitting and merging to handle document-level
|
|
constructs with page impact such as interactive forms and article
|
|
threading. Check keys in the document catalog for others, such as
|
|
outlines, page labels, thumbnails, and zones. For threads,
|
|
Subramanyam provided a test file; see ../misc/article-threads.pdf.
|
|
Email Q-Count: 431864 from 2009-11-03.
|
|
|
|
* bookmarks (outlines) 12.3.3
|
|
* support bookmarks when merging
|
|
* prune bookmarks that don't point to a surviving page when merging
|
|
or splitting
|
|
* make sure conflicting named destinations work possibly test by
|
|
including the same file by two paths in a merge
|
|
* see also comments in issue 343
|
|
|
|
Note: original implementation of bookmark preservation for split
|
|
pages caused a very high performance hit. The problem was
|
|
introduced in 313ba081265f69ac9a0324f9fe87087c72918191 and reverted
|
|
in the commit that adds this paragraph. The revert includes marking
|
|
a few tests cases as $td->EXPECT_FAILURE. When properly coded, the
|
|
test cases will need to be adjusted to only include the parts of
|
|
the outlines that are actually copied. The tests in question are
|
|
"split page with outlines". When implementing properly, ensure that
|
|
the performance is not adversely affected by timing split-pages on
|
|
a large file with complex outlines such as the PDF specification.
|
|
|
|
When pruning outlines, keep all outlines in the hierarchy that are
|
|
above an outline for a page we care about. If one of the ancestor
|
|
outlines points to a non-existent page, clear its dest. If an
|
|
outline does not have any children that point to pages in the
|
|
document, just omit it.
|
|
|
|
Possible strategy:
|
|
* resolve all named destinations to explicit destinations
|
|
* concatenate top-level outlines
|
|
* prune outlines whose dests don't point to a valid page
|
|
* recompute all /Count fields
|
|
|
|
Test files
|
|
* page-labels-and-outlines.pdf: old file with both page labels and
|
|
outlines. All destinations are explicit destinations. Each page
|
|
has Potato and a number. All titles are feline names.
|
|
* outlines-with-actions.pdf: mixture of explicit destinations,
|
|
named destinations, goto actions with explicit destinations, and
|
|
goto actions with named destinations; uses /Dests key in names
|
|
dictionary. Each page has Salad and a number. All titles are
|
|
silly words. One destination is an indirect object.
|
|
* outlines-with-old-root-dests.pdf: like outlines-with-actions
|
|
except it uses the PDF-1.1 /Dests dictionary for named
|
|
destinations, and each page has Soup and a number. Also pages are
|
|
numbered with upper-case Roman numerals starting with 0. All
|
|
titles are silly words preceded by a bullet.
|
|
|
|
If outline handling is significantly improved, see
|
|
../misc/bad-outlines/bad-outlines.pdf and email:
|
|
https://mail.google.com/mail/u/0/#search/rfc822msgid%3A02aa01d3d013%249f766990%24de633cb0%24%40mono.hr)
|
|
|
|
* Form fields: should be similar to outlines.
|
|
|
|
MSVC Wildcard Expansion
|
|
=======================
|
|
|
|
(This section is referenced in azure_pipelines.yml.)
|
|
|
|
The qpdf executable built with msvc is linked with setargv.obj or
|
|
wsetargv.obj so that wildcard expansion works. It doesn't work exactly
|
|
the way a UNIX system would work in that the program itself does the
|
|
expansion (rather than the shell), which means that invoking qpdf.exe
|
|
as built by msvc will expand "*.pdf" just as it will expand *.pdf. In
|
|
some earlier versions, wildcard expansion didn't work with the msvc
|
|
executable. The way to make this work appears to be different in some
|
|
versions of MSVC than in others. As such, if upgrading MSVC or
|
|
changing the build environment, the wildcard expansion behavior of the
|
|
qpdf executable in Windows should be retested manually.
|
|
|
|
Unfortunately, there is no automated test for wildcard expansion with
|
|
MSVC because I can't figure out how to prevent qtest from expanding
|
|
the wildcards before passing them in, and explicitly running "cmd /c
|
|
..." from qtest doesn't seem to work in Azure Pipelines, though I can
|
|
make it work locally.
|
|
|
|
Ideally, we should figure out a way to test this in CI by having a
|
|
test that fails if wildcard expansion is broken. In the absence of
|
|
this, it will be necessary to test the behavior manually in both mingw
|
|
and msvc when run from cmd and from msys bash.
|
|
|
|
Performance
|
|
===========
|
|
|
|
As described in https://github.com/qpdf/qpdf/issues/401, there was
|
|
great performance degradation between qpdf 7.1.1 and 9.1.1. Doing a
|
|
bisect between dac65a21fb4fa5f871e31c314280b75adde89a6c and
|
|
release-qpdf-7.1.1, I found several commits that damaged performance.
|
|
I fixed some of them to improve performance by about 70% (as measured
|
|
by saying that old times were 170% of new times). The remaining
|
|
commits that broke performance either can't be correct because they
|
|
would re-introduce an old bug or aren't worth correcting because of
|
|
the high value they offer relative to a relatively low penalty. For
|
|
historical reference, here are the commits. The numbers are the time
|
|
in seconds on the machine I happened to be using of splitting the
|
|
first 100 pages of PDF32000_2008.pdf 20 times and taking an average
|
|
duration.
|
|
|
|
Commits that broke performance:
|
|
|
|
* d0e99f195a987c483bbb6c5449cf39bee34e08a1 -- object description and
|
|
context: 0.39 -> 0.45
|
|
* a01359189b32c60c2d55b039f7aefd6c3ce0ebde (minus 313ba08) -- fix
|
|
dangling references: 0.55 -> 0.6
|
|
* e5f504b6c5dc34337cc0b316b4a7b1fca7e614b1 -- sparse array: 0.6 -> 0.62
|
|
|
|
Other intermediate steps that were previously fixed:
|
|
|
|
* 313ba081265f69ac9a0324f9fe87087c72918191 -- copy outlines into
|
|
split: 0.55 -> 4.0
|
|
* a01359189b32c60c2d55b039f7aefd6c3ce0ebde -- fix dangling references:
|
|
4.0 -> 9.0
|
|
|
|
This commit fixed the awful problem introduced in 313ba081:
|
|
|
|
* a5a016cdd26a8e5c99e5f019bc30d1bdf6c050a2 -- revert outline
|
|
preservation: 9.0 -> 0.6
|
|
|
|
Note that the fix dangling references commit had a much worse impact
|
|
prior to removing the outline preservation, so I also measured its
|
|
impact in isolation.
|
|
|
|
A few important lessons:
|
|
|
|
* Indirection through PointerHolder<Members> is expensive, and should
|
|
not be used for things that are created and destroyed frequently
|
|
such as QPDFObjectHandle and QPDFObject.
|
|
* Traversal of objects is expensive and should be avoided where
|
|
possible.
|
|
|
|
Future ideas:
|
|
|
|
* Look at places in the code where object traversal is being done and,
|
|
where possible, try to avoid it entirely or at least avoid ever
|
|
traversing the same objects multiple times.
|
|
* Avoid attaching too much metadata to objects and object handles
|
|
since those have to get copied around a lot.
|
|
|
|
Also, it turns out that PointerHolder is more performant than
|
|
std::shared_ptr.
|
|
|
|
Analytics
|
|
=========
|
|
|
|
Consider features that make it easier to detect certain patterns in
|
|
PDF files. The information below could be computed using an external
|
|
program that reads the existing json, but if it's useful enough, we
|
|
could add it directly to the json output.
|
|
|
|
* Add to "pages" in the json:
|
|
* "inheritsresources": bool; whether there are any inherited
|
|
attributes from ancestor page tree nodes
|
|
* "sharedresources": a list of indirect objects that are
|
|
"/Resources" dictionaries or "XObject" resource dictionary subkeys
|
|
of either the page itself or of any form XObject referenced by the
|
|
page.
|
|
|
|
* Add to "objectinfo" in json: "directpagerefcount": the number of
|
|
pages that directly reference this object (i.e., you can find an
|
|
indirect reference to the object in the page dictionary without
|
|
traversing over any indirect objects)
|
|
|
|
General
|
|
=======
|
|
|
|
NOTE: Some items in this list refer to files in my personal home
|
|
directory or that are otherwise not publicly accessible. This includes
|
|
things sent to me by email that are specifically not public. Even so,
|
|
I find it useful to make reference to them in this list
|
|
|
|
* Add support for writing name and number trees
|
|
|
|
* Figure out how to render Gajić correctly in the PDF version of the
|
|
qpdf manual.
|
|
|
|
* Consider creating a PPA for Ubuntu
|
|
|
|
* Investigate whether there is a way to automate the memory checker
|
|
tests for Windows.
|
|
|
|
* Support user-pluggable stream filters. This would enable external
|
|
code to provide interpretation for filters that are missing from
|
|
qpdf. Make it possible for user-provided filters to override
|
|
built-in filters. Make sure that the pluggable filters can be
|
|
prioritized so that we can poll all registered filters to see
|
|
whether they are capable of filtering a particular stream.
|
|
|
|
* If possible, consider adding CCITT3, CCITT4, or any other easy
|
|
filters. For some reference code that we probably can't use but may
|
|
be handy anyway, see
|
|
http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
|
|
|
|
* If possible, support the following types of broken files:
|
|
|
|
- Files that have no whitespace token after "endobj" such that
|
|
endobj collides with the start of the next object
|
|
|
|
- See ../misc/broken-files
|
|
|
|
* Additional form features
|
|
* set value from CLI? Specify title, and provide way to
|
|
disambiguate, probably by giving objgen of field
|
|
|
|
* replace mode: --replace-object, --replace-stream-raw,
|
|
--replace-stream-filtered
|
|
* update first paragraph of QPDF JSON in the manual to mention this
|
|
* object numbers are not preserved by write, so object ID lookup
|
|
has to be done separately for each invocation
|
|
* you don't have to specify length for streams
|
|
* you only have to specify filtering for streams if providing raw data
|
|
|
|
* Pl_TIFFPredictor is pretty slow.
|
|
|
|
* Support for handling file names with Unicode characters in Windows
|
|
is incomplete. qpdf seems to support them okay from a functionality
|
|
standpoint, and the right thing happens if you pass in UTF-8
|
|
encoded filenames to QPDF library routines in Windows (they are
|
|
converted internally to wchar_t*), but file names are encoded in
|
|
UTF-8 on output, which doesn't produce nice error messages or
|
|
output on Windows in some cases.
|
|
|
|
* If we ever wanted to do anything more with character encoding, see
|
|
../misc/character-encoding/, which includes machine-readable dump
|
|
of table D.2 in the ISO-32000 PDF spec. This shows the mapping
|
|
between Unicode, StandardEncoding, WinAnsiEncoding,
|
|
MacRomanEncoding, and PDFDocEncoding.
|
|
|
|
* Some test cases on bad files fail because qpdf is unable to find
|
|
the root dictionary when it fails to read the trailer. Recovery
|
|
could find the root dictionary and even the info dictionary in
|
|
other ways. In particular, issue-202.pdf can be opened by evince,
|
|
and there's no real reason that qpdf couldn't be made to be able to
|
|
recover that file as well.
|
|
|
|
* Audit every place where qpdf allocates memory to see whether there
|
|
are cases where malicious inputs could cause qpdf to attempt to
|
|
grab very large amounts of memory. Certainly there are cases like
|
|
this, such as if a very highly compressed, very large image stream
|
|
is requested in a buffer. Hopefully normal input to output
|
|
filtering doesn't ever try to do this. QPDFWriter should be checked
|
|
carefully too. See also bugs/private/from-email-663916/
|
|
|
|
* Interactive form modification:
|
|
https://github.com/qpdf/qpdf/issues/213 contains a good discussion
|
|
of some ideas for adding methods to modify annotations and form
|
|
fields if we want to make it easier to support modifications to
|
|
interactive forms. Some of the ideas have been implemented, and
|
|
some of the probably never will be implemented, but it's worth a
|
|
read if there is an intention to work on this. In the issue, search
|
|
for "Regarding write functionality", and read that comment and the
|
|
responses to it.
|
|
|
|
* Look at ~/Q/pdf-collection/forms-from-appian/
|
|
|
|
* Consider adding "uninstall" target to makefile. It should only
|
|
uninstall what it installed, which means that you must run
|
|
uninstall from the version you ran install with. It would only be
|
|
supported for the toolchains that support the install target
|
|
(libtool).
|
|
|
|
* Provide support in QPDFWriter for writing incremental updates.
|
|
Provide support in qpdf for preserving incremental updates. The
|
|
goal should be that QDF mode should be fully functional for files
|
|
with incremental updates including fix_qdf.
|
|
|
|
Note that there's nothing that says an indirect object in one
|
|
update can't refer to an object that doesn't appear until a later
|
|
update. This means that QPDF has to treat indirect null objects
|
|
differently from how it does now. QPDF drops indirect null objects
|
|
that appear as members of arrays or dictionaries. For arrays, it's
|
|
handled in QPDFWriter where we make indirect nulls direct. This is
|
|
in a single if block, and nothing else in the code cares about it.
|
|
We could just remove that if block and not break anything except a
|
|
few test cases that exercise the current behavior. For
|
|
dictionaries, it's more complicated. In this case,
|
|
QPDF_Dictionary::getKeys() ignores all keys with null values, and
|
|
hasKey() returns false for keys that have null values. We would
|
|
probably want to make QPDF_Dictionary able to handle the special
|
|
case of keys that are indirect nulls and basically never have it
|
|
drop any keys that are indirect objects.
|
|
|
|
If we make a change to have qpdf preserve indirect references to
|
|
null objects, we have to note this in ChangeLog and in the release
|
|
notes since this will change output files. We did this before when
|
|
we stopped flattening scalar references, so this is probably not a
|
|
big deal. We also have to make sure that the testing for this
|
|
handles non-trivial cases of the targets of indirect nulls being
|
|
replaced by real objects in an update. I'm not sure how this plays
|
|
with linearization, if at all. For cases where incremental updates
|
|
are not being preserved as incremental updates and where the data
|
|
is being folded in (as is always the case with qpdf now), none of
|
|
this should make any difference in the actual semantics of the
|
|
files.
|
|
|
|
* When decrypting files with /R=6, hash_V5 is called more than once
|
|
with the same inputs. Caching the results or refactoring to reduce
|
|
the number of identical calls could improve performance for
|
|
workloads that involve processing large numbers of small files.
|
|
|
|
* Consider adding a method to balance the pages tree. It would call
|
|
pushInheritedAttributesToPage, construct a pages tree from scratch,
|
|
and replace the /Pages key of the root dictionary with the new
|
|
tree.
|
|
|
|
* Secure random number generation could be made more efficient by
|
|
using a local static to ensure a single random device or crypt
|
|
provider as long as this can be done in a thread-safe fashion. In
|
|
the initial implementation, this is being skipped to avoid having
|
|
to add any dependencies on threading libraries.
|
|
|
|
* Study what's required to support savable forms that can be saved by
|
|
Adobe Reader. Does this require actually signing the document with
|
|
an Adobe private key? Search for "Digital signatures" in the PDF
|
|
spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
|
|
came from Adobe's example site. See also
|
|
../misc/digital-sign-from-trueroad/. If digital signatures are
|
|
implemented, update the docs on crypto providers, which mention
|
|
that this may happen in the future.
|
|
|
|
* See if we can avoid preserving unreferenced objects in object
|
|
streams even when preserving the object streams.
|
|
|
|
* Provide APIs for embedded files. See *attachments*.pdf in test
|
|
suite. The private method findAttachmentStreams finds at least
|
|
cases for modern versions of Adobe Reader (>= 1.7, maybe earlier).
|
|
PDF Reference 1.7 section 3.10, "File Specifications", discusses
|
|
this.
|
|
|
|
A sourceforge user asks if qpdf can handle extracting and embedded
|
|
resources and references these tools, which may be useful as a
|
|
reference.
|
|
|
|
http://multivalent.sourceforge.net/Tools/pdf/Extract.html
|
|
http://multivalent.sourceforge.net/Tools/pdf/Embed.html
|
|
|
|
* The description of Crypt filters is unclear with respect to how to
|
|
use them to override /StmF for specific streams. I'm not sure
|
|
whether qpdf will do the right thing for any specific individual
|
|
streams that might have crypt filters, but I believe it does based
|
|
on my testing of a limited subset. The specification seems to imply
|
|
that only embedded file streams and metadata streams can have crypt
|
|
filters, and there are already special cases in the code to handle
|
|
those. Most likely, it won't be a problem, but someday someone may
|
|
find a file that qpdf doesn't work on because of crypt filters.
|
|
There is an example in the spec of using a crypt filter on a
|
|
metadata stream.
|
|
|
|
For now, we notice /Crypt filters and decode parameters consistent
|
|
with the example in the PDF specification, and the right thing
|
|
happens for metadata filters that happen to be uncompressed or
|
|
otherwise compressed in a way we can filter. This should handle
|
|
all normal cases, but it's more or less just a guess since I don't
|
|
have any test files that actually use stream-specific crypt filters
|
|
in them.
|
|
|
|
* The second xref stream for linearized files has to be padded only
|
|
because we need file_size as computed in pass 1 to be accurate. If
|
|
we were not allowing writing to a pipe, we could seek back to the
|
|
beginning and fill in the value of /L in the linearization
|
|
dictionary as an optimization to alleviate the need for this
|
|
padding. Doing so would require us to pad the /L value
|
|
individually and also to save the file descriptor and determine
|
|
whether it's seekable. This is probably not worth bothering with.
|
|
|
|
* The whole xref handling code in the QPDF object allows the same
|
|
object with more than one generation to coexist, but a lot of logic
|
|
assumes this isn't the case. Anything that creates mappings only
|
|
with the object number and not the generation is this way,
|
|
including most of the interaction between QPDFWriter and QPDF. If
|
|
we wanted to allow the same object with more than one generation to
|
|
coexist, which I'm not sure is allowed, we could fix this by
|
|
changing xref_table. Alternatively, we could detect and disallow
|
|
that case. In fact, it appears that Adobe reader and other PDF
|
|
viewing software silently ignores objects of this type, so this is
|
|
probably not a big deal.
|
|
|
|
* Based on an idea suggested by user "Atom Smasher", consider
|
|
providing some mechanism to recover earlier versions of a file
|
|
embedded prior to appended sections.
|
|
|
|
* From a suggestion in bug 3152169, consider having an option to
|
|
re-encode inline images with an ASCII encoding.
|
|
|
|
* From github issue 2, provide more in-depth output for examining
|
|
hint stream contents. Consider adding on option to provide a
|
|
human-readable dump of linearization hint tables. This should
|
|
include improving the 'overflow reading bit stream' message as
|
|
reported in issue #2. There are multiple calls to stopOnError in
|
|
the linearization checking code. Ideally, these should not
|
|
terminate checking. It would require re-acquiring an understanding
|
|
of all that code to make the checks more robust. In particular,
|
|
it's hard to look at the code and quickly determine what is a true
|
|
logic error and what could happen because of malformed user input.
|
|
See also ../misc/linearization-errors.
|
|
|
|
* If I ever decide to make appearance stream-generation aware of
|
|
fonts or font metrics, see email from Tobias with Message-ID
|
|
<5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
|
|
|
|
* Consider creating a sanitizer to make it easier for people to send
|
|
broken files. Now that we have json mode, this is probably no
|
|
longer worth doing. Here is the previous idea, possibly implemented
|
|
by making it possible to run the lexer (tokenizer) over a whole
|
|
file. Make it possible to replace all strings in a file lexically
|
|
even on badly broken files. Ideally this should work files that are
|
|
lacking xref, have broken links, etc., and ideally it should work
|
|
with encrypted files if possible. This should go through the
|
|
streams and strings and replace them with fixed or random
|
|
characters, preferably, but not necessarily, in a manner that works
|
|
with fonts. One possibility would be to detect whether a string
|
|
contains characters with normal encoding, and if so, use 0x41. If
|
|
the string uses character maps, use 0x01. The output should
|
|
otherwise be unrelated to the input. This could be built after the
|
|
filtering and tokenizer rewrite and should be done in a manner that
|
|
takes advantage of the other lexical features. This sanitizer
|
|
should also clear metadata and replace images.
|