TODO: rescope some items

Jay Berkenbilt 2022-08-06 16:35:40 -04:00
parent 433be3718a
commit 48dfae6443
1 changed files with 170 additions and 161 deletions

TODO

@@ -21,31 +21,15 @@ Pending changes:
   appimage build specifically is setting the runpath, which is
   actually desirable in this case. Make sure to understand and
   document this. Maybe add a check for it in the build.
-* Decide what to do about #664 (get*Box)
-* Add an option --ignore-encryption to ignore encryption information
-  and treat encrypted files as if they weren't encrypted. This should
-  make it possible to solve #598 (--show-encryption without a
-  password). We'll need to make sure we don't try to filter any
-  streams in this mode. Ideally we should be able to combine this with
-  --json so we can look at the raw encrypted strings and streams if we
-  want to, though be sure to document that the resulting JSON won't be
-  convertible back to a valid PDF. Since providing the password may
-  reveal additional details, --show-encryption could potentially retry
-  with this option if the first time doesn't work. Then, with the file
-  open, we can read the encryption dictionary normally.
-* In libtests, separate executables that need the object library
-  from those that strictly use public API. Move as many of the test
-  drivers from the qpdf directory into the latter category as long
-  as doing so isn't too troublesome from a coverage standpoint.
-* Consider adding fuzzer code for JSON
-* Consider generating a non-flat pages tree before creating output to
-  better handle files with lots of pages. If there are more than 256
-  pages, add a second layer with the second layer nodes having no more
-  than 256 nodes and being as evenly sized as possible. Don't worry
-  about the case of more than 65,536 pages. If the top node has more
-  than 256 children, we'll live with it.

-Parent pointer idea:
+Soon: Break ground on "Document-level work"
+
+Fix Multiple Direct Object Owner Issue
+======================================
+
+These are some ideas I've had, but I'm parking them until I fully
+understand m-holger's proposal to split QPDFObject into QPDFObject and
+QPDFValue.

 * Add std::weak_ptr<QPDFObject> parent to QPDFObject. When adding a
   direct object to an array or dictionary, set its parent. When
@@ -65,8 +49,6 @@ Note that arrays and dictionaries still need to contain
 QPDFObjectHandle because of indirect objects. This only pertains to
 direct objects, which are always "resolved" in QPDFObjectHandle.

-Soon: Break ground on "Document-level work"
-
 Possible future JSON enhancements
 =================================
@@ -376,169 +358,196 @@ directory or that are otherwise not publicly accessible. This includes
 things sent to me by email that are specifically not public. Even so,
 I find it useful to make reference to them in this list.

- * Look at https://bestpractices.coreinfrastructure.org/en
+* Add an option --ignore-encryption to ignore encryption information
+  and treat encrypted files as if they weren't encrypted. This should
+  make it possible to solve #598 (--show-encryption without a
+  password). We'll need to make sure we don't try to filter any
+  streams in this mode. Ideally we should be able to combine this with
+  --json so we can look at the raw encrypted strings and streams if we
+  want to, though be sure to document that the resulting JSON won't be
+  convertible back to a valid PDF. Since providing the password may
+  reveal additional details, --show-encryption could potentially retry
+  with this option if the first time doesn't work. Then, with the file
+  open, we can read the encryption dictionary normally.

- * Rework tests so that nothing is written into the source directory.
-   Ideally then the entire build could be done with a read-only
-   source tree.
+* In libtests, separate executables that need the object library
+  from those that strictly use public API. Move as many of the test
+  drivers from the qpdf directory into the latter category as long
+  as doing so isn't too troublesome from a coverage standpoint.

- * Large file tests fail with linux32 before and after cmake. This was
-   first noticed after 10.6.3. I don't think it's worth fixing.
+* Consider generating a non-flat pages tree before creating output to
+  better handle files with lots of pages. If there are more than 256
+  pages, add a second layer with the second layer nodes having no more
+  than 256 nodes and being as evenly sized as possible. Don't worry
+  about the case of more than 65,536 pages. If the top node has more
+  than 256 children, we'll live with it. This is only safe if all
+  intermediate page nodes have only /Kids, /Parent, /Type, and /Count.

- * Consider updating the fuzzer with code that exercises
-   copyAnnotations, file attachments, and name and number trees. Check
-   fuzzer coverage.
+* Look at https://bestpractices.coreinfrastructure.org/en

- * Add code for creation of a file attachment annotation. It should
-   also be possible to create a widget annotation and a form field.
-   Update the pdf-attach-file.cc example with new APIs when ready.
+* Consider adding fuzzer code for JSON

- * Flattening of form XObjects seems like something that would be
-   useful in the library. We are seeing more cases of completely valid
-   PDF files with form XObjects that cause problems in other software.
-   Flattening of form XObjects could be a useful way to work around
-   those issues or to prepare files for additional processing, making
-   it possible for users of the qpdf library to not be concerned about
-   form XObjects. This could be done recursively; i.e., we could have a
-   method to embed a form XObject into whatever contains it, whether
-   that is a form XObject or a page. This would require more
-   significant interpretation of the content stream. We would need a
-   test file in which the placement of the form XObject has to be in
-   the right place, e.g., the form XObject partially obscures earlier
-   code and is partially obscured by later code. Keys in the resource
-   dictionary may need to be changed -- create test cases with lots of
-   duplicated/overlapping keys.
+* Rework tests so that nothing is written into the source directory.
+  Ideally then the entire build could be done with a read-only
+  source tree.

- * Part of closed_file_input_source.cc is disabled on Windows because
-   of odd failures. It might be worth investigating so we can fully
-   exercise this in the test suite. That said, ClosedFileInputSource
-   is exercised elsewhere in qpdf's test suite, so this is not that
-   pressing.
+* Large file tests fail with linux32 before and after cmake. This was
+  first noticed after 10.6.3. I don't think it's worth fixing.

- * If possible, consider adding CCITT3, CCITT4, or any other easy
-   filters. For some reference code that we probably can't use but may
-   be handy anyway, see
-   http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
+* Consider updating the fuzzer with code that exercises
+  copyAnnotations, file attachments, and name and number trees. Check
+  fuzzer coverage.

- * If possible, support the following types of broken files:
+* Add code for creation of a file attachment annotation. It should
+  also be possible to create a widget annotation and a form field.
+  Update the pdf-attach-file.cc example with new APIs when ready.

-   - Files that have no whitespace token after "endobj" such that
-     endobj collides with the start of the next object
+* Flattening of form XObjects seems like something that would be
+  useful in the library. We are seeing more cases of completely valid
+  PDF files with form XObjects that cause problems in other software.
+  Flattening of form XObjects could be a useful way to work around
+  those issues or to prepare files for additional processing, making
+  it possible for users of the qpdf library to not be concerned about
+  form XObjects. This could be done recursively; i.e., we could have a
+  method to embed a form XObject into whatever contains it, whether
+  that is a form XObject or a page. This would require more
+  significant interpretation of the content stream. We would need a
+  test file in which the placement of the form XObject has to be in
+  the right place, e.g., the form XObject partially obscures earlier
+  code and is partially obscured by later code. Keys in the resource
+  dictionary may need to be changed -- create test cases with lots of
+  duplicated/overlapping keys.

-   - See ../misc/broken-files
+* Part of closed_file_input_source.cc is disabled on Windows because
+  of odd failures. It might be worth investigating so we can fully
+  exercise this in the test suite. That said, ClosedFileInputSource
+  is exercised elsewhere in qpdf's test suite, so this is not that
+  pressing.

-   - See ../misc/bad-files-issue-476. This directory contains a
-     snapshot of the google doc and linked PDF files from issue #476.
-     Please see the issue for details.
+* If possible, consider adding CCITT3, CCITT4, or any other easy
+  filters. For some reference code that we probably can't use but may
+  be handy anyway, see
+  http://partners.adobe.com/public/developer/ps/sdk/index_archive.html

- * Additional form features
-   * set value from CLI? Specify title, and provide way to
-     disambiguate, probably by giving objgen of field
+* If possible, support the following types of broken files:

- * Pl_TIFFPredictor is pretty slow.
+  - Files that have no whitespace token after "endobj" such that
+    endobj collides with the start of the next object

- * Support for handling file names with Unicode characters in Windows
-   is incomplete. qpdf seems to support them okay from a functionality
-   standpoint, and the right thing happens if you pass in UTF-8
-   encoded filenames to QPDF library routines in Windows (they are
-   converted internally to wchar_t*), but file names are encoded in
-   UTF-8 on output, which doesn't produce nice error messages or
-   output on Windows in some cases.
+  - See ../misc/broken-files

- * If we ever wanted to do anything more with character encoding, see
-   ../misc/character-encoding/, which includes machine-readable dump
-   of table D.2 in the ISO-32000 PDF spec. This shows the mapping
-   between Unicode, StandardEncoding, WinAnsiEncoding,
-   MacRomanEncoding, and PDFDocEncoding.
+  - See ../misc/bad-files-issue-476. This directory contains a
+    snapshot of the google doc and linked PDF files from issue #476.
+    Please see the issue for details.

- * Some test cases on bad files fail because qpdf is unable to find
-   the root dictionary when it fails to read the trailer. Recovery
-   could find the root dictionary and even the info dictionary in
-   other ways. In particular, issue-202.pdf can be opened by evince,
-   and there's no real reason that qpdf couldn't be made to be able to
-   recover that file as well.
+* Additional form features
+  * set value from CLI? Specify title, and provide way to
+    disambiguate, probably by giving objgen of field

- * Audit every place where qpdf allocates memory to see whether there
-   are cases where malicious inputs could cause qpdf to attempt to
-   grab very large amounts of memory. Certainly there are cases like
-   this, such as if a very highly compressed, very large image stream
-   is requested in a buffer. Hopefully normal input to output
-   filtering doesn't ever try to do this. QPDFWriter should be checked
-   carefully too. See also bugs/private/from-email-663916/
+* Pl_TIFFPredictor is pretty slow.

- * Interactive form modification:
-   https://github.com/qpdf/qpdf/issues/213 contains a good discussion
-   of some ideas for adding methods to modify annotations and form
-   fields if we want to make it easier to support modifications to
-   interactive forms. Some of the ideas have been implemented, and
-   some of them probably never will be implemented, but it's worth a
-   read if there is an intention to work on this. In the issue, search
-   for "Regarding write functionality", and read that comment and the
-   responses to it.
+* Support for handling file names with Unicode characters in Windows
+  is incomplete. qpdf seems to support them okay from a functionality
+  standpoint, and the right thing happens if you pass in UTF-8
+  encoded filenames to QPDF library routines in Windows (they are
+  converted internally to wchar_t*), but file names are encoded in
+  UTF-8 on output, which doesn't produce nice error messages or
+  output on Windows in some cases.

- * Look at ~/Q/pdf-collection/forms-from-appian/
+* If we ever wanted to do anything more with character encoding, see
+  ../misc/character-encoding/, which includes machine-readable dump
+  of table D.2 in the ISO-32000 PDF spec. This shows the mapping
+  between Unicode, StandardEncoding, WinAnsiEncoding,
+  MacRomanEncoding, and PDFDocEncoding.

- * When decrypting files with /R=6, hash_V5 is called more than once
-   with the same inputs. Caching the results or refactoring to reduce
-   the number of identical calls could improve performance for
-   workloads that involve processing large numbers of small files.
+* Some test cases on bad files fail because qpdf is unable to find
+  the root dictionary when it fails to read the trailer. Recovery
+  could find the root dictionary and even the info dictionary in
+  other ways. In particular, issue-202.pdf can be opened by evince,
+  and there's no real reason that qpdf couldn't be made to be able to
+  recover that file as well.

- * Consider adding a method to balance the pages tree. It would call
-   pushInheritedAttributesToPage, construct a pages tree from scratch,
-   and replace the /Pages key of the root dictionary with the new
-   tree.
+* Audit every place where qpdf allocates memory to see whether there
+  are cases where malicious inputs could cause qpdf to attempt to
+  grab very large amounts of memory. Certainly there are cases like
+  this, such as if a very highly compressed, very large image stream
+  is requested in a buffer. Hopefully normal input to output
+  filtering doesn't ever try to do this. QPDFWriter should be checked
+  carefully too. See also bugs/private/from-email-663916/

- * Study what's required to support savable forms that can be saved by
-   Adobe Reader. Does this require actually signing the document with
-   an Adobe private key? Search for "Digital signatures" in the PDF
-   spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
-   came from Adobe's example site. See also
-   ../misc/digital-sign-from-trueroad/. If digital signatures are
-   implemented, update the docs on crypto providers, which mention
-   that this may happen in the future.
+* Interactive form modification:
+  https://github.com/qpdf/qpdf/issues/213 contains a good discussion
+  of some ideas for adding methods to modify annotations and form
+  fields if we want to make it easier to support modifications to
+  interactive forms. Some of the ideas have been implemented, and
+  some of them probably never will be implemented, but it's worth a
+  read if there is an intention to work on this. In the issue, search
+  for "Regarding write functionality", and read that comment and the
+  responses to it.

- * Qpdf does not honor /EFF when adding new file attachments. When it
-   encrypts, it never generates streams with explicit crypt filters.
-   Prior to 10.2, there was an incorrect attempt to treat /EFF as a
-   default value for decrypting file attachment streams, but it is not
-   supposed to mean that. Instead, it is intended for conforming
-   writers to obey this when adding new attachments. Qpdf is not a
-   conforming writer in that respect.
+* Look at ~/Q/pdf-collection/forms-from-appian/

- * The whole xref handling code in the QPDF object allows the same
-   object with more than one generation to coexist, but a lot of logic
-   assumes this isn't the case. Anything that creates mappings only
-   with the object number and not the generation is this way,
-   including most of the interaction between QPDFWriter and QPDF. If
-   we wanted to allow the same object with more than one generation to
-   coexist, which I'm not sure is allowed, we could fix this by
-   changing xref_table. Alternatively, we could detect and disallow
-   that case. In fact, it appears that Adobe reader and other PDF
-   viewing software silently ignores objects of this type, so this is
-   probably not a big deal.
+* When decrypting files with /R=6, hash_V5 is called more than once
+  with the same inputs. Caching the results or refactoring to reduce
+  the number of identical calls could improve performance for
+  workloads that involve processing large numbers of small files.

- * From a suggestion in bug 3152169, consider having an option to
-   re-encode inline images with an ASCII encoding.
+* Consider adding a method to balance the pages tree. It would call
+  pushInheritedAttributesToPage, construct a pages tree from scratch,
+  and replace the /Pages key of the root dictionary with the new
+  tree.

- * From github issue 2, provide more in-depth output for examining
-   hint stream contents. Consider adding an option to provide a
-   human-readable dump of linearization hint tables. This should
-   include improving the 'overflow reading bit stream' message as
-   reported in issue #2. There are multiple calls to stopOnError in
-   the linearization checking code. Ideally, these should not
-   terminate checking. It would require re-acquiring an understanding
-   of all that code to make the checks more robust. In particular,
-   it's hard to look at the code and quickly determine what is a true
-   logic error and what could happen because of malformed user input.
-   See also ../misc/linearization-errors.
+* Study what's required to support savable forms that can be saved by
+  Adobe Reader. Does this require actually signing the document with
+  an Adobe private key? Search for "Digital signatures" in the PDF
+  spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
+  came from Adobe's example site. See also
+  ../misc/digital-sign-from-trueroad/. If digital signatures are
+  implemented, update the docs on crypto providers, which mention
+  that this may happen in the future.

- * If I ever decide to make appearance stream-generation aware of
-   fonts or font metrics, see email from Tobias with Message-ID
-   <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
+* Qpdf does not honor /EFF when adding new file attachments. When it
+  encrypts, it never generates streams with explicit crypt filters.
+  Prior to 10.2, there was an incorrect attempt to treat /EFF as a
+  default value for decrypting file attachment streams, but it is not
+  supposed to mean that. Instead, it is intended for conforming
+  writers to obey this when adding new attachments. Qpdf is not a
+  conforming writer in that respect.

- * Look at places in the code where object traversal is being done and,
-   where possible, try to avoid it entirely or at least avoid ever
-   traversing the same objects multiple times.
+* The whole xref handling code in the QPDF object allows the same
+  object with more than one generation to coexist, but a lot of logic
+  assumes this isn't the case. Anything that creates mappings only
+  with the object number and not the generation is this way,
+  including most of the interaction between QPDFWriter and QPDF. If
+  we wanted to allow the same object with more than one generation to
+  coexist, which I'm not sure is allowed, we could fix this by
+  changing xref_table. Alternatively, we could detect and disallow
+  that case. In fact, it appears that Adobe reader and other PDF
+  viewing software silently ignores objects of this type, so this is
+  probably not a big deal.
+
+* From a suggestion in bug 3152169, consider having an option to
+  re-encode inline images with an ASCII encoding.
+
+* From github issue 2, provide more in-depth output for examining
+  hint stream contents. Consider adding an option to provide a
+  human-readable dump of linearization hint tables. This should
+  include improving the 'overflow reading bit stream' message as
+  reported in issue #2. There are multiple calls to stopOnError in
+  the linearization checking code. Ideally, these should not
+  terminate checking. It would require re-acquiring an understanding
+  of all that code to make the checks more robust. In particular,
+  it's hard to look at the code and quickly determine what is a true
+  logic error and what could happen because of malformed user input.
+  See also ../misc/linearization-errors.
+
+* If I ever decide to make appearance stream-generation aware of
+  fonts or font metrics, see email from Tobias with Message-ID
+  <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
+
+* Look at places in the code where object traversal is being done and,
+  where possible, try to avoid it entirely or at least avoid ever
+  traversing the same objects multiple times.

 ----------------------------------------------------------------------
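
The non-flat pages tree item in this diff asks for a second layer whose nodes hold no more than 256 children each and are as evenly sized as possible. The arithmetic behind that can be sketched as follows. This is only an illustration under assumptions: second_layer_sizes is a hypothetical helper, not part of the qpdf API, and it only computes node sizes rather than touching any PDF objects. As the TODO item itself notes, restructuring is only safe if intermediate page nodes carry nothing beyond /Kids, /Parent, /Type, and /Count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a qpdf function): given a page count, return how
// many pages each second-layer /Pages node would hold. An empty result means
// the tree can stay flat.
std::vector<std::size_t>
second_layer_sizes(std::size_t n_pages, std::size_t max_kids = 256)
{
    assert(max_kids > 0);
    std::vector<std::size_t> sizes;
    if (n_pages <= max_kids) {
        return sizes; // small enough for a single flat node
    }
    // Fewest intermediate nodes such that none holds more than max_kids.
    std::size_t n_nodes = (n_pages + max_kids - 1) / max_kids;
    // Spread pages evenly: the first `extra` nodes get one page more.
    std::size_t base = n_pages / n_nodes;
    std::size_t extra = n_pages % n_nodes;
    sizes.reserve(n_nodes);
    for (std::size_t i = 0; i < n_nodes; ++i) {
        sizes.push_back(base + (i < extra ? 1 : 0));
    }
    // Past max_kids * max_kids pages the top node ends up with more than
    // max_kids children, which the TODO item explicitly accepts.
    return sizes;
}
```

With the default limit, 1000 pages would be split into four nodes of 250 pages each, and 257 pages into nodes of 129 and 128 pages.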