TODO: rescope some items

2022-08-06 16:35:40 -04:00 · 2022-08-06 16:35:40 -04:00 · 48dfae6443
parent 433be3718a
commit 48dfae6443
1 changed files with 170 additions and 161 deletions
--- a/331
+++ b/331
@@ -21,31 +21,15 @@ Pending changes:
  appimage build specifically is setting the runpath, which is
  actually desirable in this case. Make sure to understand and
  document this. Maybe add a check for it in the build.
 * Decide what to do about #664 (get*Box)
 * Add an option --ignore-encryption to ignore encryption information
  and treat encrypted files as if they weren't encrypted. This should
  make it possible to solve #598 (--show-encryption without a
  password). We'll need to make sure we don't try to filter any
  streams in this mode. Ideally we should be able to combine this with
  --json so we can look at the raw encrypted strings and streams if we
  want to, though be sure to document that the resulting JSON won't be
  convertible back to a valid PDF. Since providing the password may
  reveal additional details, --show-encryption could potentially retry
  with this option if the first time doesn't work. Then, with the file
  open, we can read the encryption dictionary normally.
 * In libtests, separate executables that need the object library
  from those that strictly use public API. Move as many of the test
  drivers from the qpdf directory into the latter category as long
  as doing so isn't too troublesome from a coverage standpoint.
 * Consider adding fuzzer code for JSON
 * Consider generating a non-flat pages tree before creating output to
  better handle files with lots of pages. If there are more than 256
  pages, add a second layer with the second layer nodes having no more
  than 256 nodes and being as evenly sized as possible. Don't worry
  about the case of more than 65,536 pages. If the top node has more
  than 256 children, we'll live with it.
-Parent pointer idea:
+Soon: Break ground on "Document-level work"
 Fix Multiple Direct Object Owner Issue
 ======================================
 These are some ideas I've had, but I'm parking them until I fully
 understand m-holger's proposal to split QPDFObject into QPDFObject and
 QPDFValue.
 * Add std::weak_ptr<QPDFObject> parent to QPDFObject. When adding a
  direct object to an array or dictionary, set its parent. When
@@ -65,8 +49,6 @@ Note that arrays and dictionaries still need to contain
 QPDFObjectHandle because of indirect objects. This only pertains to
 direct objects, which are always "resolved" in QPDFObjectHandle.
 Soon: Break ground on "Document-level work"
 Possible future JSON enhancements
 =================================
@@ -376,169 +358,196 @@ directory or that are otherwise not publicly accessible. This includes
 things sent to me by email that are specifically not public. Even so,
 I find it useful to make reference to them in this list.
- * Look at https://bestpractices.coreinfrastructure.org/en
+* Add an option --ignore-encryption to ignore encryption information
  and treat encrypted files as if they weren't encrypted. This should
  make it possible to solve #598 (--show-encryption without a
  password). We'll need to make sure we don't try to filter any
  streams in this mode. Ideally we should be able to combine this with
  --json so we can look at the raw encrypted strings and streams if we
  want to, though be sure to document that the resulting JSON won't be
  convertible back to a valid PDF. Since providing the password may
  reveal additional details, --show-encryption could potentially retry
  with this option if the first time doesn't work. Then, with the file
  open, we can read the encryption dictionary normally.
- * Rework tests so that nothing is written into the source directory.
+* In libtests, separate executables that need the object library
-   Ideally then the entire build could be done with a read-only
+  from those that strictly use public API. Move as many of the test
-   source tree.
+  drivers from the qpdf directory into the latter category as long
  as doing so isn't too troublesome from a coverage standpoint.
- * Large file tests fail with linux32 before and after cmake. This was
+* Consider generating a non-flat pages tree before creating output to
-   first noticed after 10.6.3. I don't think it's worth fixing.
+  better handle files with lots of pages. If there are more than 256
  pages, add a second layer with the second layer nodes having no more
  than 256 nodes and being as evenly sized as possible. Don't worry
  about the case of more than 65,536 pages. If the top node has more
  than 256 children, we'll live with it. This is only safe if all
  intermediate page nodes have only /Kids, /Parent, /Type, and /Count.
- * Consider updating the fuzzer with code that exercises
+* Look at https://bestpractices.coreinfrastructure.org/en
   copyAnnotations, file attachments, and name and number trees. Check
   fuzzer coverage.
- * Add code for creation of a file attachment annotation. It should
+* Consider adding fuzzer code for JSON
   also be possible to create a widget annotation and a form field.
   Update the pdf-attach-file.cc example with new APIs when ready.
- * Flattening of form XObjects seems like something that would be
+* Rework tests so that nothing is written into the source directory.
-   useful in the library. We are seeing more cases of completely valid
+  Ideally then the entire build could be done with a read-only
-   PDF files with form XObjects that cause problems in other software.
+  source tree.
   Flattening of form XObjects could be a useful way to work around
   those issues or to prepare files for additional processing, making
   it possible for users of the qpdf library to not be concerned about
   form XObjects. This could be done recursively; i.e., we could have a
   method to embed a form XObject into whatever contains it, whether
   that is a form XObject or a page. This would require more
   significant interpretation of the content stream. We would need a
   test file in which the placement of the form XObject has to be in
   the right place, e.g., the form XObject partially obscures earlier
   code and is partially obscured by later code. Keys in the resource
   dictionary may need to be changed -- create test cases with lots of
   duplicated/overlapping keys.
- * Part of closed_file_input_source.cc is disabled on Windows because
+* Large file tests fail with linux32 before and after cmake. This was
-   of odd failures. It might be worth investigating so we can fully
+  first noticed after 10.6.3. I don't think it's worth fixing.
   exercise this in the test suite. That said, ClosedFileInputSource
   is exercised elsewhere in qpdf's test suite, so this is not that
   pressing.
- * If possible, consider adding CCITT3, CCITT4, or any other easy
+* Consider updating the fuzzer with code that exercises
-   filters. For some reference code that we probably can't use but may
+  copyAnnotations, file attachments, and name and number trees. Check
-   be handy anyway, see
+  fuzzer coverage.
   http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
- * If possible, support the following types of broken files:
+* Add code for creation of a file attachment annotation. It should
  also be possible to create a widget annotation and a form field.
  Update the pdf-attach-file.cc example with new APIs when ready.
-    - Files that have no whitespace token after "endobj" such that
+* Flattening of form XObjects seems like something that would be
-      endobj collides with the start of the next object
+  useful in the library. We are seeing more cases of completely valid
  PDF files with form XObjects that cause problems in other software.
  Flattening of form XObjects could be a useful way to work around
  those issues or to prepare files for additional processing, making
  it possible for users of the qpdf library to not be concerned about
  form XObjects. This could be done recursively; i.e., we could have a
  method to embed a form XObject into whatever contains it, whether
  that is a form XObject or a page. This would require more
  significant interpretation of the content stream. We would need a
  test file in which the placement of the form XObject has to be in
  the right place, e.g., the form XObject partially obscures earlier
  code and is partially obscured by later code. Keys in the resource
  dictionary may need to be changed -- create test cases with lots of
  duplicated/overlapping keys.
-    - See ../misc/broken-files
+* Part of closed_file_input_source.cc is disabled on Windows because
  of odd failures. It might be worth investigating so we can fully
  exercise this in the test suite. That said, ClosedFileInputSource
  is exercised elsewhere in qpdf's test suite, so this is not that
  pressing.
-    - See ../misc/bad-files-issue-476. This directory contains a
+* If possible, consider adding CCITT3, CCITT4, or any other easy
-      snapshot of the google doc and linked PDF files from issue #476.
+  filters. For some reference code that we probably can't use but may
-      Please see the issue for details.
+  be handy anyway, see
  http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
- * Additional form features
+* If possible, support the following types of broken files:
   * set value from CLI? Specify title, and provide way to
     disambiguate, probably by giving objgen of field
- * Pl_TIFFPredictor is pretty slow.
+   - Files that have no whitespace token after "endobj" such that
     endobj collides with the start of the next object
- * Support for handling file names with Unicode characters in Windows
+   - See ../misc/broken-files
   is incomplete. qpdf seems to support them okay from a functionality
   standpoint, and the right thing happens if you pass in UTF-8
   encoded filenames to QPDF library routines in Windows (they are
   converted internally to wchar_t*), but file names are encoded in
   UTF-8 on output, which doesn't produce nice error messages or
   output on Windows in some cases.
- * If we ever wanted to do anything more with character encoding, see
+   - See ../misc/bad-files-issue-476. This directory contains a
-   ../misc/character-encoding/, which includes a machine-readable dump
+     snapshot of the google doc and linked PDF files from issue #476.
-   of table D.2 in the ISO-32000 PDF spec. This shows the mapping
+     Please see the issue for details.
   between Unicode, StandardEncoding, WinAnsiEncoding,
   MacRomanEncoding, and PDFDocEncoding.
- * Some test cases on bad files fail because qpdf is unable to find
+* Additional form features
-   the root dictionary when it fails to read the trailer. Recovery
+  * set value from CLI? Specify title, and provide way to
-   could find the root dictionary and even the info dictionary in
+    disambiguate, probably by giving objgen of field
   other ways. In particular, issue-202.pdf can be opened by evince,
   and there's no real reason that qpdf couldn't be made to be able to
   recover that file as well.
- * Audit every place where qpdf allocates memory to see whether there
+* Pl_TIFFPredictor is pretty slow.
   are cases where malicious inputs could cause qpdf to attempt to
   grab very large amounts of memory. Certainly there are cases like
   this, such as if a very highly compressed, very large image stream
   is requested in a buffer. Hopefully normal input to output
   filtering doesn't ever try to do this. QPDFWriter should be checked
   carefully too. See also bugs/private/from-email-663916/
- * Interactive form modification:
+* Support for handling file names with Unicode characters in Windows
-   https://github.com/qpdf/qpdf/issues/213 contains a good discussion
+  is incomplete. qpdf seems to support them okay from a functionality
-   of some ideas for adding methods to modify annotations and form
+  standpoint, and the right thing happens if you pass in UTF-8
-   fields if we want to make it easier to support modifications to
+  encoded filenames to QPDF library routines in Windows (they are
-   interactive forms. Some of the ideas have been implemented, and
+  converted internally to wchar_t*), but file names are encoded in
-   some of them probably never will be implemented, but it's worth a
+  UTF-8 on output, which doesn't produce nice error messages or
-   read if there is an intention to work on this. In the issue, search
+  output on Windows in some cases.
   for "Regarding write functionality", and read that comment and the
   responses to it.
- * Look at ~/Q/pdf-collection/forms-from-appian/
+* If we ever wanted to do anything more with character encoding, see
  ../misc/character-encoding/, which includes a machine-readable dump
  of table D.2 in the ISO-32000 PDF spec. This shows the mapping
  between Unicode, StandardEncoding, WinAnsiEncoding,
  MacRomanEncoding, and PDFDocEncoding.
- * When decrypting files with /R=6, hash_V5 is called more than once
+* Some test cases on bad files fail because qpdf is unable to find
-   with the same inputs.  Caching the results or refactoring to reduce
+  the root dictionary when it fails to read the trailer. Recovery
-   the number of identical calls could improve performance for
+  could find the root dictionary and even the info dictionary in
-   workloads that involve processing large numbers of small files.
+  other ways. In particular, issue-202.pdf can be opened by evince,
  and there's no real reason that qpdf couldn't be made to be able to
  recover that file as well.
- * Consider adding a method to balance the pages tree.  It would call
+* Audit every place where qpdf allocates memory to see whether there
-   pushInheritedAttributesToPage, construct a pages tree from scratch,
+  are cases where malicious inputs could cause qpdf to attempt to
-   and replace the /Pages key of the root dictionary with the new
+  grab very large amounts of memory. Certainly there are cases like
-   tree.
+  this, such as if a very highly compressed, very large image stream
  is requested in a buffer. Hopefully normal input to output
  filtering doesn't ever try to do this. QPDFWriter should be checked
  carefully too. See also bugs/private/from-email-663916/
- * Study what's required to support savable forms that can be saved by
+* Interactive form modification:
-   Adobe Reader. Does this require actually signing the document with
+  https://github.com/qpdf/qpdf/issues/213 contains a good discussion
-   an Adobe private key? Search for "Digital signatures" in the PDF
+  of some ideas for adding methods to modify annotations and form
-   spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
+  fields if we want to make it easier to support modifications to
-   came from Adobe's example site. See also
+  interactive forms. Some of the ideas have been implemented, and
-   ../misc/digital-sign-from-trueroad/. If digital signatures are
+  some of them probably never will be implemented, but it's worth a
-   implemented, update the docs on crypto providers, which mention
+  read if there is an intention to work on this. In the issue, search
-   that this may happen in the future.
+  for "Regarding write functionality", and read that comment and the
  responses to it.
- * Qpdf does not honor /EFF when adding new file attachments. When it
+* Look at ~/Q/pdf-collection/forms-from-appian/
   encrypts, it never generates streams with explicit crypt filters.
   Prior to 10.2, there was an incorrect attempt to treat /EFF as a
   default value for decrypting file attachment streams, but it is not
   supposed to mean that. Instead, it is intended for conforming
   writers to obey this when adding new attachments. Qpdf is not a
   conforming writer in that respect.
- * The whole xref handling code in the QPDF object allows the same
+* When decrypting files with /R=6, hash_V5 is called more than once
-   object with more than one generation to coexist, but a lot of logic
+  with the same inputs.  Caching the results or refactoring to reduce
-   assumes this isn't the case.  Anything that creates mappings only
+  the number of identical calls could improve performance for
-   with the object number and not the generation is this way,
+  workloads that involve processing large numbers of small files.
   including most of the interaction between QPDFWriter and QPDF.  If
   we wanted to allow the same object with more than one generation to
   coexist, which I'm not sure is allowed, we could fix this by
   changing xref_table.  Alternatively, we could detect and disallow
   that case.  In fact, it appears that Adobe Reader and other PDF
   viewing software silently ignores objects of this type, so this is
   probably not a big deal.
- * From a suggestion in bug 3152169, consider having an option to
+* Consider adding a method to balance the pages tree.  It would call
-   re-encode inline images with an ASCII encoding.
+  pushInheritedAttributesToPage, construct a pages tree from scratch,
  and replace the /Pages key of the root dictionary with the new
  tree.
- * From github issue 2, provide more in-depth output for examining
+* Study what's required to support savable forms that can be saved by
-   hint stream contents. Consider adding an option to provide a
+  Adobe Reader. Does this require actually signing the document with
-   human-readable dump of linearization hint tables. This should
+  an Adobe private key? Search for "Digital signatures" in the PDF
-   include improving the 'overflow reading bit stream' message as
+  spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
-   reported in issue #2. There are multiple calls to stopOnError in
+  came from Adobe's example site. See also
-   the linearization checking code. Ideally, these should not
+  ../misc/digital-sign-from-trueroad/. If digital signatures are
-   terminate checking. It would require re-acquiring an understanding
+  implemented, update the docs on crypto providers, which mention
-   of all that code to make the checks more robust. In particular,
+  that this may happen in the future.
   it's hard to look at the code and quickly determine what is a true
   logic error and what could happen because of malformed user input.
   See also ../misc/linearization-errors.
- * If I ever decide to make appearance stream-generation aware of
+* Qpdf does not honor /EFF when adding new file attachments. When it
-   fonts or font metrics, see email from Tobias with Message-ID
+  encrypts, it never generates streams with explicit crypt filters.
-   <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
+  Prior to 10.2, there was an incorrect attempt to treat /EFF as a
  default value for decrypting file attachment streams, but it is not
  supposed to mean that. Instead, it is intended for conforming
  writers to obey this when adding new attachments. Qpdf is not a
  conforming writer in that respect.
- * Look at places in the code where object traversal is being done and,
+* The whole xref handling code in the QPDF object allows the same
-   where possible, try to avoid it entirely or at least avoid ever
+  object with more than one generation to coexist, but a lot of logic
-   traversing the same objects multiple times.
+  assumes this isn't the case.  Anything that creates mappings only
  with the object number and not the generation is this way,
  including most of the interaction between QPDFWriter and QPDF.  If
  we wanted to allow the same object with more than one generation to
  coexist, which I'm not sure is allowed, we could fix this by
  changing xref_table.  Alternatively, we could detect and disallow
  that case.  In fact, it appears that Adobe Reader and other PDF
  viewing software silently ignores objects of this type, so this is
  probably not a big deal.
 * From a suggestion in bug 3152169, consider having an option to
  re-encode inline images with an ASCII encoding.
 * From github issue 2, provide more in-depth output for examining
  hint stream contents. Consider adding an option to provide a
  human-readable dump of linearization hint tables. This should
  include improving the 'overflow reading bit stream' message as
  reported in issue #2. There are multiple calls to stopOnError in
  the linearization checking code. Ideally, these should not
  terminate checking. It would require re-acquiring an understanding
  of all that code to make the checks more robust. In particular,
  it's hard to look at the code and quickly determine what is a true
  logic error and what could happen because of malformed user input.
  See also ../misc/linearization-errors.
 * If I ever decide to make appearance stream-generation aware of
  fonts or font metrics, see email from Tobias with Message-ID
  <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
 * Look at places in the code where object traversal is being done and,
  where possible, try to avoid it entirely or at least avoid ever
  traversing the same objects multiple times.
 ----------------------------------------------------------------------
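Note on the non-flat pages tree item above: the "second layer with no
more than 256 nodes, as evenly sized as possible" rule can be sketched
as a small sizing function. This is only an illustration of the
arithmetic, not qpdf code: the name second_layer_sizes and the
max_kids parameter are hypothetical, and the sketch deliberately does
nothing about the case of more than 65,536 pages, where the top node
simply ends up with more than 256 children.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of qpdf's API). Given a page count,
// return the number of pages to place under each second-layer node.
// An empty result means the tree can stay flat.
std::vector<std::size_t>
second_layer_sizes(std::size_t n_pages, std::size_t max_kids = 256)
{
    assert(max_kids > 0);
    std::vector<std::size_t> sizes;
    if (n_pages <= max_kids) {
        return sizes; // few enough pages: keep a single flat node
    }
    // Smallest number of second-layer nodes that keeps each at or
    // below max_kids children.
    std::size_t n_groups = (n_pages + max_kids - 1) / max_kids;
    // Spread pages evenly: the first `extra` groups get one more page.
    std::size_t base = n_pages / n_groups;
    std::size_t extra = n_pages % n_groups;
    for (std::size_t i = 0; i < n_groups; ++i) {
        sizes.push_back(base + (i < extra ? 1 : 0));
    }
    return sizes;
}
```

With the default limit, 257 pages would be split into two nodes of 129
and 128 pages rather than 256 and 1. If n_pages exceeds 65,536, the
returned vector has more than 256 entries, which matches the "we'll
live with it" decision in the item.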