mirror of
https://github.com/qpdf/qpdf.git
synced 2025-01-03 15:17:29 +00:00
TODO: rescope some items
parent 433be3718a
commit 48dfae6443
331
TODO
@@ -21,31 +21,15 @@ Pending changes:
|
|||||||
appimage build specifically is setting the runpath, which is
|
appimage build specifically is setting the runpath, which is
|
||||||
actually desirable in this case. Make sure to understand and
|
actually desirable in this case. Make sure to understand and
|
||||||
document this. Maybe add a check for it in the build.
|
document this. Maybe add a check for it in the build.
|
||||||
* Decide what to do about #664 (get*Box)
|
|
||||||
* Add an option --ignore-encryption to ignore encryption information
|
|
||||||
and treat encrypted files as if they weren't encrypted. This should
|
|
||||||
make it possible to solve #598 (--show-encryption without a
|
|
||||||
password). We'll need to make sure we don't try to filter any
|
|
||||||
streams in this mode. Ideally we should be able to combine this with
|
|
||||||
--json so we can look at the raw encrypted strings and streams if we
|
|
||||||
want to, though be sure to document that the resulting JSON won't be
|
|
||||||
convertible back to a valid PDF. Since providing the password may
|
|
||||||
reveal additional details, --show-encryption could potentially retry
|
|
||||||
with this option if the first time doesn't work. Then, with the file
|
|
||||||
open, we can read the encryption dictionary normally.
|
|
||||||
* In libtests, separate executables that need the object library
|
|
||||||
from those that strictly use public API. Move as many of the test
|
|
||||||
drivers from the qpdf directory into the latter category as long
|
|
||||||
as doing so isn't too troublesome from a coverage standpoint.
|
|
||||||
* Consider adding fuzzer code for JSON
|
|
||||||
* Consider generating a non-flat pages tree before creating output to
|
|
||||||
better handle files with lots of pages. If there are more than 256
|
|
||||||
pages, add a second layer with the second layer nodes having no more
|
|
||||||
than 256 nodes and being as evenly sizes as possible. Don't worry
|
|
||||||
about the case of more than 65,536 pages. If the top node has more
|
|
||||||
than 256 children, we'll live with it.
|
|
||||||
|
|
||||||
Parent pointer idea:
|
Soon: Break ground on "Document-level work"
|
||||||
|
|
||||||
|
Fix Multiple Direct Object Owner Issue
|
||||||
|
======================================
|
||||||
|
|
||||||
|
These are some ideas I've had, but I'm parking them until I fully
|
||||||
|
understand m-holger's proposal to split QPDFObject into QPDFObject and
|
||||||
|
QPDFValue.
|
||||||
|
|
||||||
* Add std::weak_ptr<QPDFObject> parent to QPDFObject. When adding a
|
* Add std::weak_ptr<QPDFObject> parent to QPDFObject. When adding a
|
||||||
direct object to an array or dictionary, set its parent. When
|
direct object to an array or dictionary, set its parent. When
|
||||||
@@ -65,8 +49,6 @@ Note that arrays and dictionaries still need to contain
|
|||||||
QPDFObjectHandle because of indirect objects. This only pertains to
|
QPDFObjectHandle because of indirect objects. This only pertains to
|
||||||
direct objects, which are always "resolved" in QPDFObjectHandle.
|
direct objects, which are always "resolved" in QPDFObjectHandle.
|
||||||
|
|
||||||
Soon: Break ground on "Document-level work"
|
|
||||||
|
|
||||||
Possible future JSON enhancements
|
Possible future JSON enhancements
|
||||||
=================================
|
=================================
|
||||||
|
|
||||||
@@ -376,169 +358,196 @@ directory or that are otherwise not publicly accessible. This includes
|
|||||||
things sent to me by email that are specifically not public. Even so,
|
things sent to me by email that are specifically not public. Even so,
|
||||||
I find it useful to make reference to them in this list.
|
I find it useful to make reference to them in this list.
|
||||||
|
|
||||||
* Look at https://bestpractices.coreinfrastructure.org/en
|
* Add an option --ignore-encryption to ignore encryption information
|
||||||
|
and treat encrypted files as if they weren't encrypted. This should
|
||||||
|
make it possible to solve #598 (--show-encryption without a
|
||||||
|
password). We'll need to make sure we don't try to filter any
|
||||||
|
streams in this mode. Ideally we should be able to combine this with
|
||||||
|
--json so we can look at the raw encrypted strings and streams if we
|
||||||
|
want to, though be sure to document that the resulting JSON won't be
|
||||||
|
convertible back to a valid PDF. Since providing the password may
|
||||||
|
reveal additional details, --show-encryption could potentially retry
|
||||||
|
with this option if the first time doesn't work. Then, with the file
|
||||||
|
open, we can read the encryption dictionary normally.
|
||||||
|
|
||||||
* Rework tests so that nothing is written into the source directory.
|
* In libtests, separate executables that need the object library
|
||||||
Ideally then the entire build could be done with a read-only
|
from those that strictly use public API. Move as many of the test
|
||||||
source tree.
|
drivers from the qpdf directory into the latter category as long
|
||||||
|
as doing so isn't too troublesome from a coverage standpoint.
|
||||||
|
|
||||||
* Large file tests fail with linux32 before and after cmake. This was
|
* Consider generating a non-flat pages tree before creating output to
|
||||||
first noticed after 10.6.3. I don't think it's worth fixing.
|
better handle files with lots of pages. If there are more than 256
|
||||||
|
pages, add a second layer with the second layer nodes having no more
|
||||||
|
than 256 nodes and being as evenly sizes as possible. Don't worry
|
||||||
|
about the case of more than 65,536 pages. If the top node has more
|
||||||
|
than 256 children, we'll live with it. This is only safe if all
|
||||||
|
intermediate page nodes have only /Kids, /Parent, /Type, and /Count.
|
||||||
|
|
||||||
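The sizing rule in the pages-tree item above (at most 256 children per second-layer node, sized as evenly as possible) can be sketched roughly as follows. This is only an illustration of the arithmetic: `plan_second_layer` and its `max_kids` parameter are hypothetical names, not part of qpdf's API, and the sketch does not touch any PDF objects or check the /Kids, /Parent, /Type, /Count restriction.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not a qpdf function. Returns the number of pages
// to place under each second-layer node, or an empty vector when the
// flat tree is already small enough.
std::vector<std::size_t>
plan_second_layer(std::size_t n_pages, std::size_t max_kids = 256)
{
    if (n_pages <= max_kids) {
        return {}; // no second layer needed
    }
    // Smallest number of nodes that keeps every node at or below max_kids.
    std::size_t n_nodes = (n_pages + max_kids - 1) / max_kids;
    std::size_t base = n_pages / n_nodes;
    std::size_t extra = n_pages % n_nodes;
    std::vector<std::size_t> sizes(n_nodes, base);
    // Spread the remainder one page at a time over the first nodes so
    // that node sizes differ by at most one.
    for (std::size_t i = 0; i < extra; ++i) {
        ++sizes[i];
    }
    return sizes;
}
```

With more than 65,536 pages the returned vector simply has more than 256 entries, which matches the item's decision to live with an oversized top node.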
* Consider updating the fuzzer with code that exercises
|
* Look at https://bestpractices.coreinfrastructure.org/en
|
||||||
copyAnnotations, file attachments, and name and number trees. Check
|
|
||||||
fuzzer coverage.
|
|
||||||
|
|
||||||
* Add code for creation of a file attachment annotation. It should
|
* Consider adding fuzzer code for JSON
|
||||||
also be possible to create a widget annotation and a form field.
|
|
||||||
Update the pdf-attach-file.cc example with new APIs when ready.
|
|
||||||
|
|
||||||
* Flattening of form XObjects seems like something that would be
|
* Rework tests so that nothing is written into the source directory.
|
||||||
useful in the library. We are seeing more cases of completely valid
|
Ideally then the entire build could be done with a read-only
|
||||||
PDF files with form XObjects that cause problems in other software.
|
source tree.
|
||||||
Flattening of form XObjects could be a useful way to work around
|
|
||||||
those issues or to prepare files for additional processing, making
|
|
||||||
it possible for users of the qpdf library to not be concerned about
|
|
||||||
form XObjects. This could be done recursively; i.e., we could have a
|
|
||||||
method to embed a form XObject into whatever contains it, whether
|
|
||||||
that is a form XObject or a page. This would require more
|
|
||||||
significant interpretation of the content stream. We would need a
|
|
||||||
test file in which the placement of the form XObject has to be in
|
|
||||||
the right place, e.g., the form XObject partially obscures earlier
|
|
||||||
code and is partially obscured by later code. Keys in the resource
|
|
||||||
dictionary may need to be changed -- create test cases with lots of
|
|
||||||
duplicated/overlapping keys.
|
|
||||||
|
|
||||||
* Part of closed_file_input_source.cc is disabled on Windows because
|
* Large file tests fail with linux32 before and after cmake. This was
|
||||||
of odd failures. It might be worth investigating so we can fully
|
first noticed after 10.6.3. I don't think it's worth fixing.
|
||||||
exercise this in the test suite. That said, ClosedFileInputSource
|
|
||||||
is exercised elsewhere in qpdf's test suite, so this is not that
|
|
||||||
pressing.
|
|
||||||
|
|
||||||
* If possible, consider adding CCITT3, CCITT4, or any other easy
|
* Consider updating the fuzzer with code that exercises
|
||||||
filters. For some reference code that we probably can't use but may
|
copyAnnotations, file attachments, and name and number trees. Check
|
||||||
be handy anyway, see
|
fuzzer coverage.
|
||||||
http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
|
|
||||||
|
|
||||||
* If possible, support the following types of broken files:
|
* Add code for creation of a file attachment annotation. It should
|
||||||
|
also be possible to create a widget annotation and a form field.
|
||||||
|
Update the pdf-attach-file.cc example with new APIs when ready.
|
||||||
|
|
||||||
- Files that have no whitespace token after "endobj" such that
|
* Flattening of form XObjects seems like something that would be
|
||||||
endobj collides with the start of the next object
|
useful in the library. We are seeing more cases of completely valid
|
||||||
|
PDF files with form XObjects that cause problems in other software.
|
||||||
|
Flattening of form XObjects could be a useful way to work around
|
||||||
|
those issues or to prepare files for additional processing, making
|
||||||
|
it possible for users of the qpdf library to not be concerned about
|
||||||
|
form XObjects. This could be done recursively; i.e., we could have a
|
||||||
|
method to embed a form XObject into whatever contains it, whether
|
||||||
|
that is a form XObject or a page. This would require more
|
||||||
|
significant interpretation of the content stream. We would need a
|
||||||
|
test file in which the placement of the form XObject has to be in
|
||||||
|
the right place, e.g., the form XObject partially obscures earlier
|
||||||
|
code and is partially obscured by later code. Keys in the resource
|
||||||
|
dictionary may need to be changed -- create test cases with lots of
|
||||||
|
duplicated/overlapping keys.
|
||||||
|
|
||||||
- See ../misc/broken-files
|
* Part of closed_file_input_source.cc is disabled on Windows because
|
||||||
|
of odd failures. It might be worth investigating so we can fully
|
||||||
|
exercise this in the test suite. That said, ClosedFileInputSource
|
||||||
|
is exercised elsewhere in qpdf's test suite, so this is not that
|
||||||
|
pressing.
|
||||||
|
|
||||||
- See ../misc/bad-files-issue-476. This directory contains a
|
* If possible, consider adding CCITT3, CCITT4, or any other easy
|
||||||
snapshot of the google doc and linked PDF files from issue #476.
|
filters. For some reference code that we probably can't use but may
|
||||||
Please see the issue for details.
|
be handy anyway, see
|
||||||
|
http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
|
||||||
|
|
||||||
* Additional form features
|
* If possible, support the following types of broken files:
|
||||||
* set value from CLI? Specify title, and provide way to
|
|
||||||
disambiguate, probably by giving objgen of field
|
|
||||||
|
|
||||||
* Pl_TIFFPredictor is pretty slow.
|
- Files that have no whitespace token after "endobj" such that
|
||||||
|
endobj collides with the start of the next object
|
||||||
|
|
||||||
* Support for handling file names with Unicode characters in Windows
|
- See ../misc/broken-files
|
||||||
is incomplete. qpdf seems to support them okay from a functionality
|
|
||||||
standpoint, and the right thing happens if you pass in UTF-8
|
|
||||||
encoded filenames to QPDF library routines in Windows (they are
|
|
||||||
converted internally to wchar_t*), but file names are encoded in
|
|
||||||
UTF-8 on output, which doesn't produce nice error messages or
|
|
||||||
output on Windows in some cases.
|
|
||||||
|
|
||||||
* If we ever wanted to do anything more with character encoding, see
|
- See ../misc/bad-files-issue-476. This directory contains a
|
||||||
../misc/character-encoding/, which includes machine-readable dump
|
snapshot of the google doc and linked PDF files from issue #476.
|
||||||
of table D.2 in the ISO-32000 PDF spec. This shows the mapping
|
Please see the issue for details.
|
||||||
between Unicode, StandardEncoding, WinAnsiEncoding,
|
|
||||||
MacRomanEncoding, and PDFDocEncoding.
|
|
||||||
|
|
||||||
* Some test cases on bad files fail because qpdf is unable to find
|
* Additional form features
|
||||||
the root dictionary when it fails to read the trailer. Recovery
|
* set value from CLI? Specify title, and provide way to
|
||||||
could find the root dictionary and even the info dictionary in
|
disambiguate, probably by giving objgen of field
|
||||||
other ways. In particular, issue-202.pdf can be opened by evince,
|
|
||||||
and there's no real reason that qpdf couldn't be made to be able to
|
|
||||||
recover that file as well.
|
|
||||||
|
|
||||||
* Audit every place where qpdf allocates memory to see whether there
|
* Pl_TIFFPredictor is pretty slow.
|
||||||
are cases where malicious inputs could cause qpdf to attempt to
|
|
||||||
grab very large amounts of memory. Certainly there are cases like
|
|
||||||
this, such as if a very highly compressed, very large image stream
|
|
||||||
is requested in a buffer. Hopefully normal input to output
|
|
||||||
filtering doesn't ever try to do this. QPDFWriter should be checked
|
|
||||||
carefully too. See also bugs/private/from-email-663916/
|
|
||||||
|
|
||||||
* Interactive form modification:
|
* Support for handling file names with Unicode characters in Windows
|
||||||
https://github.com/qpdf/qpdf/issues/213 contains a good discussion
|
is incomplete. qpdf seems to support them okay from a functionality
|
||||||
of some ideas for adding methods to modify annotations and form
|
standpoint, and the right thing happens if you pass in UTF-8
|
||||||
fields if we want to make it easier to support modifications to
|
encoded filenames to QPDF library routines in Windows (they are
|
||||||
interactive forms. Some of the ideas have been implemented, and
|
converted internally to wchar_t*), but file names are encoded in
|
||||||
some of the probably never will be implemented, but it's worth a
|
UTF-8 on output, which doesn't produce nice error messages or
|
||||||
read if there is an intention to work on this. In the issue, search
|
output on Windows in some cases.
|
||||||
for "Regarding write functionality", and read that comment and the
|
|
||||||
responses to it.
|
|
||||||
|
|
||||||
* Look at ~/Q/pdf-collection/forms-from-appian/
|
* If we ever wanted to do anything more with character encoding, see
|
||||||
|
../misc/character-encoding/, which includes machine-readable dump
|
||||||
|
of table D.2 in the ISO-32000 PDF spec. This shows the mapping
|
||||||
|
between Unicode, StandardEncoding, WinAnsiEncoding,
|
||||||
|
MacRomanEncoding, and PDFDocEncoding.
|
||||||
|
|
||||||
* When decrypting files with /R=6, hash_V5 is called more than once
|
* Some test cases on bad files fail because qpdf is unable to find
|
||||||
with the same inputs. Caching the results or refactoring to reduce
|
the root dictionary when it fails to read the trailer. Recovery
|
||||||
the number of identical calls could improve performance for
|
could find the root dictionary and even the info dictionary in
|
||||||
workloads that involve processing large numbers of small files.
|
other ways. In particular, issue-202.pdf can be opened by evince,
|
||||||
|
and there's no real reason that qpdf couldn't be made to be able to
|
||||||
|
recover that file as well.
|
||||||
|
|
||||||
* Consider adding a method to balance the pages tree. It would call
|
* Audit every place where qpdf allocates memory to see whether there
|
||||||
pushInheritedAttributesToPage, construct a pages tree from scratch,
|
are cases where malicious inputs could cause qpdf to attempt to
|
||||||
and replace the /Pages key of the root dictionary with the new
|
grab very large amounts of memory. Certainly there are cases like
|
||||||
tree.
|
this, such as if a very highly compressed, very large image stream
|
||||||
|
is requested in a buffer. Hopefully normal input to output
|
||||||
|
filtering doesn't ever try to do this. QPDFWriter should be checked
|
||||||
|
carefully too. See also bugs/private/from-email-663916/
|
||||||
|
|
||||||
* Study what's required to support savable forms that can be saved by
|
* Interactive form modification:
|
||||||
Adobe Reader. Does this require actually signing the document with
|
https://github.com/qpdf/qpdf/issues/213 contains a good discussion
|
||||||
an Adobe private key? Search for "Digital signatures" in the PDF
|
of some ideas for adding methods to modify annotations and form
|
||||||
spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
|
fields if we want to make it easier to support modifications to
|
||||||
came from Adobe's example site. See also
|
interactive forms. Some of the ideas have been implemented, and
|
||||||
../misc/digital-sign-from-trueroad/. If digital signatures are
|
some of the probably never will be implemented, but it's worth a
|
||||||
implemented, update the docs on crypto providers, which mention
|
read if there is an intention to work on this. In the issue, search
|
||||||
that this may happen in the future.
|
for "Regarding write functionality", and read that comment and the
|
||||||
|
responses to it.
|
||||||
|
|
||||||
* Qpdf does not honor /EFF when adding new file attachments. When it
|
* Look at ~/Q/pdf-collection/forms-from-appian/
|
||||||
encrypts, it never generates streams with explicit crypt filters.
|
|
||||||
Prior to 10.2, there was an incorrect attempt to treat /EFF as a
|
|
||||||
default value for decrypting file attachment streams, but it is not
|
|
||||||
supposed to mean that. Instead, it is intended for conforming
|
|
||||||
writers to obey this when adding new attachments. Qpdf is not a
|
|
||||||
conforming writer in that respect.
|
|
||||||
|
|
||||||
* The whole xref handling code in the QPDF object allows the same
|
* When decrypting files with /R=6, hash_V5 is called more than once
|
||||||
object with more than one generation to coexist, but a lot of logic
|
with the same inputs. Caching the results or refactoring to reduce
|
||||||
assumes this isn't the case. Anything that creates mappings only
|
the number of identical calls could improve performance for
|
||||||
with the object number and not the generation is this way,
|
workloads that involve processing large numbers of small files.
|
||||||
including most of the interaction between QPDFWriter and QPDF. If
|
|
||||||
we wanted to allow the same object with more than one generation to
|
|
||||||
coexist, which I'm not sure is allowed, we could fix this by
|
|
||||||
changing xref_table. Alternatively, we could detect and disallow
|
|
||||||
that case. In fact, it appears that Adobe reader and other PDF
|
|
||||||
viewing software silently ignores objects of this type, so this is
|
|
||||||
probably not a big deal.
|
|
||||||
|
|
||||||
* From a suggestion in bug 3152169, consider having an option to
|
* Consider adding a method to balance the pages tree. It would call
|
||||||
re-encode inline images with an ASCII encoding.
|
pushInheritedAttributesToPage, construct a pages tree from scratch,
|
||||||
|
and replace the /Pages key of the root dictionary with the new
|
||||||
|
tree.
|
||||||
|
|
||||||
* From github issue 2, provide more in-depth output for examining
|
* Study what's required to support savable forms that can be saved by
|
||||||
hint stream contents. Consider adding on option to provide a
|
Adobe Reader. Does this require actually signing the document with
|
||||||
human-readable dump of linearization hint tables. This should
|
an Adobe private key? Search for "Digital signatures" in the PDF
|
||||||
include improving the 'overflow reading bit stream' message as
|
spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
|
||||||
reported in issue #2. There are multiple calls to stopOnError in
|
came from Adobe's example site. See also
|
||||||
the linearization checking code. Ideally, these should not
|
../misc/digital-sign-from-trueroad/. If digital signatures are
|
||||||
terminate checking. It would require re-acquiring an understanding
|
implemented, update the docs on crypto providers, which mention
|
||||||
of all that code to make the checks more robust. In particular,
|
that this may happen in the future.
|
||||||
it's hard to look at the code and quickly determine what is a true
|
|
||||||
logic error and what could happen because of malformed user input.
|
|
||||||
See also ../misc/linearization-errors.
|
|
||||||
|
|
||||||
* If I ever decide to make appearance stream-generation aware of
|
* Qpdf does not honor /EFF when adding new file attachments. When it
|
||||||
fonts or font metrics, see email from Tobias with Message-ID
|
encrypts, it never generates streams with explicit crypt filters.
|
||||||
<5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
|
Prior to 10.2, there was an incorrect attempt to treat /EFF as a
|
||||||
|
default value for decrypting file attachment streams, but it is not
|
||||||
|
supposed to mean that. Instead, it is intended for conforming
|
||||||
|
writers to obey this when adding new attachments. Qpdf is not a
|
||||||
|
conforming writer in that respect.
|
||||||
|
|
||||||
* Look at places in the code where object traversal is being done and,
|
* The whole xref handling code in the QPDF object allows the same
|
||||||
where possible, try to avoid it entirely or at least avoid ever
|
object with more than one generation to coexist, but a lot of logic
|
||||||
traversing the same objects multiple times.
|
assumes this isn't the case. Anything that creates mappings only
|
||||||
|
with the object number and not the generation is this way,
|
||||||
|
including most of the interaction between QPDFWriter and QPDF. If
|
||||||
|
we wanted to allow the same object with more than one generation to
|
||||||
|
coexist, which I'm not sure is allowed, we could fix this by
|
||||||
|
changing xref_table. Alternatively, we could detect and disallow
|
||||||
|
that case. In fact, it appears that Adobe reader and other PDF
|
||||||
|
viewing software silently ignores objects of this type, so this is
|
||||||
|
probably not a big deal.
|
||||||
|
|
||||||
|
* From a suggestion in bug 3152169, consider having an option to
|
||||||
|
re-encode inline images with an ASCII encoding.
|
||||||
|
|
||||||
|
* From github issue 2, provide more in-depth output for examining
|
||||||
|
hint stream contents. Consider adding on option to provide a
|
||||||
|
human-readable dump of linearization hint tables. This should
|
||||||
|
include improving the 'overflow reading bit stream' message as
|
||||||
|
reported in issue #2. There are multiple calls to stopOnError in
|
||||||
|
the linearization checking code. Ideally, these should not
|
||||||
|
terminate checking. It would require re-acquiring an understanding
|
||||||
|
of all that code to make the checks more robust. In particular,
|
||||||
|
it's hard to look at the code and quickly determine what is a true
|
||||||
|
logic error and what could happen because of malformed user input.
|
||||||
|
See also ../misc/linearization-errors.
|
||||||
|
|
||||||
|
* If I ever decide to make appearance stream-generation aware of
|
||||||
|
fonts or font metrics, see email from Tobias with Message-ID
|
||||||
|
<5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
|
||||||
|
|
||||||
|
* Look at places in the code where object traversal is being done and,
|
||||||
|
where possible, try to avoid it entirely or at least avoid ever
|
||||||
|
traversing the same objects multiple times.
|
||||||
|
|
||||||
----------------------------------------------------------------------
|
----------------------------------------------------------------------
|
||||||
|
|
||||||
|
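For the hash_V5 item in the TODO (repeated calls with the same inputs when decrypting /R=6 files), one possible shape for the caching idea is a small memoizing wrapper. This is a sketch under assumptions: `HashCache`, its three-string key, and the `misses` counter are hypothetical, and the wrapped function stands in for whatever signature hash_V5 actually has in qpdf.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical wrapper, not qpdf code: caches the result of a pure
// function of three strings so identical calls are computed only once.
class HashCache
{
  public:
    using Fn = std::function<std::string(
        std::string const&, std::string const&, std::string const&)>;

    explicit HashCache(Fn f) :
        fn(std::move(f))
    {
    }

    std::string const&
    get(std::string const& password, std::string const& salt, std::string const& udata)
    {
        auto key = std::make_tuple(password, salt, udata);
        auto it = cache.find(key);
        if (it == cache.end()) {
            // First time these inputs are seen: compute and remember.
            ++misses;
            it = cache.emplace(std::move(key), fn(password, salt, udata)).first;
        }
        return it->second;
    }

    std::size_t misses{0}; // number of times the wrapped function ran

  private:
    Fn fn;
    std::map<std::tuple<std::string, std::string, std::string>, std::string> cache;
};
```

Note that a cache like this keeps password material in memory for as long as it lives, so it would presumably be scoped to a single file open rather than shared across files.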