diff --git a/TODO b/TODO
index 9128e3ad..52fcf61a 100644
--- a/TODO
+++ b/TODO
@@ -21,31 +21,15 @@ Pending changes:
   appimage build specifically is setting the runpath, which is
   actually desirable in this case. Make sure to understand and
   document this. Maybe add a check for it in the build.
-* Decide what to do about #664 (get*Box)
-* Add an option --ignore-encryption to ignore encryption information
-  and treat encrypted files as if they weren't encrypted. This should
-  make it possible to solve #598 (--show-encryption without a
-  password). We'll need to make sure we don't try to filter any
-  streams in this mode. Ideally we should be able to combine this with
-  --json so we can look at the raw encrypted strings and streams if we
-  want to, though be sure to document that the resulting JSON won't be
-  convertible back to a valid PDF. Since providing the password may
-  reveal additional details, --show-encryption could potentially retry
-  with this option if the first time doesn't work. Then, with the file
-  open, we can read the encryption dictionary normally.
-* In libtests, separate executables that need the object library
-  from those that strictly use public API. Move as many of the test
-  drivers from the qpdf directory into the latter category as long
-  as doing so isn't too troublesome from a coverage standpoint.
-* Consider adding fuzzer code for JSON
-* Consider generating a non-flat pages tree before creating output to
-  better handle files with lots of pages. If there are more than 256
-  pages, add a second layer with the second layer nodes having no more
-  than 256 nodes and being as evenly sizes as possible. Don't worry
-  about the case of more than 65,536 pages. If the top node has more
-  than 256 children, we'll live with it.
 
-Parent pointer idea:
+Soon: Break ground on "Document-level work"
+
+Fix Multiple Direct Object Owner Issue
+======================================
+
+These are some ideas I've had, but I'm parking them until I fully
+understand m-holger's proposal to split QPDFObject into QPDFObject and
+QPDFValue.
 
 * Add std::weak_ptr parent to QPDFObject. When adding a
   direct object to an array or dictionary, set its parent. When
@@ -65,8 +49,6 @@ Note that arrays and dictionaries still need to contain
 QPDFObjectHandle because of indirect objects. This only pertains to
 direct objects, which are always "resolved" in QPDFObjectHandle.
 
-Soon: Break ground on "Document-level work"
-
 Possible future JSON enhancements
 =================================
 
@@ -376,169 +358,196 @@ directory or that are otherwise not publicly accessible. This includes
 things sent to me by email that are specifically not public. Even so,
 I find it useful to make reference to them in this list.
 
- * Look at https://bestpractices.coreinfrastructure.org/en
+* Add an option --ignore-encryption to ignore encryption information
+  and treat encrypted files as if they weren't encrypted. This should
+  make it possible to solve #598 (--show-encryption without a
+  password). We'll need to make sure we don't try to filter any
+  streams in this mode. Ideally we should be able to combine this with
+  --json so we can look at the raw encrypted strings and streams if we
+  want to, though be sure to document that the resulting JSON won't be
+  convertible back to a valid PDF. Since providing the password may
+  reveal additional details, --show-encryption could potentially retry
+  with this option if the first time doesn't work. Then, with the file
+  open, we can read the encryption dictionary normally.
 
- * Rework tests so that nothing is written into the source directory.
-   Ideally then the entire build could be done with a read-only
-   source tree.
+* In libtests, separate executables that need the object library
+  from those that strictly use public API. Move as many of the test
+  drivers from the qpdf directory into the latter category as long
+  as doing so isn't too troublesome from a coverage standpoint.
 
- * Large file tests fail with linux32 before and after cmake. This was
-   first noticed after 10.6.3. I don't think it's worth fixing.
+* Consider generating a non-flat pages tree before creating output to
+  better handle files with lots of pages. If there are more than 256
+  pages, add a second layer with the second layer nodes having no more
+  than 256 nodes and being as evenly sizes as possible. Don't worry
+  about the case of more than 65,536 pages. If the top node has more
+  than 256 children, we'll live with it. This is only safe if all
+  intermediate page nodes have only /Kids, /Parent, /Type, and /Count.
 
- * Consider updating the fuzzer with code that exercises
-   copyAnnotations, file attachments, and name and number trees. Check
-   fuzzer coverage.
+* Look at https://bestpractices.coreinfrastructure.org/en
 
- * Add code for creation of a file attachment annotation. It should
-   also be possible to create a widget annotation and a form field.
-   Update the pdf-attach-file.cc example with new APIs when ready.
+* Consider adding fuzzer code for JSON
 
- * Flattening of form XObjects seems like something that would be
-   useful in the library. We are seeing more cases of completely valid
-   PDF files with form XObjects that cause problems in other software.
-   Flattening of form XObjects could be a useful way to work around
-   those issues or to prepare files for additional processing, making
-   it possible for users of the qpdf library to not be concerned about
-   form XObjects. This could be done recursively; i.e., we could have a
-   method to embed a form XObject into whatever contains it, whether
-   that is a form XObject or a page. This would require more
-   significant interpretation of the content stream. We would need a
-   test file in which the placement of the form XObject has to be in
-   the right place, e.g., the form XObject partially obscures earlier
-   code and is partially obscured by later code. Keys in the resource
-   dictionary may need to be changed -- create test cases with lots of
-   duplicated/overlapping keys.
+* Rework tests so that nothing is written into the source directory.
+  Ideally then the entire build could be done with a read-only
+  source tree.
 
- * Part of closed_file_input_source.cc is disabled on Windows because
-   of odd failures. It might be worth investigating so we can fully
-   exercise this in the test suite. That said, ClosedFileInputSource
-   is exercised elsewhere in qpdf's test suite, so this is not that
-   pressing.
+* Large file tests fail with linux32 before and after cmake. This was
+  first noticed after 10.6.3. I don't think it's worth fixing.
 
- * If possible, consider adding CCITT3, CCITT4, or any other easy
-   filters. For some reference code that we probably can't use but may
-   be handy anyway, see
-   http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
+* Consider updating the fuzzer with code that exercises
+  copyAnnotations, file attachments, and name and number trees. Check
+  fuzzer coverage.
 
- * If possible, support the following types of broken files:
+* Add code for creation of a file attachment annotation. It should
+  also be possible to create a widget annotation and a form field.
+  Update the pdf-attach-file.cc example with new APIs when ready.
 
-   - Files that have no whitespace token after "endobj" such that
-     endobj collides with the start of the next object
+* Flattening of form XObjects seems like something that would be
+  useful in the library. We are seeing more cases of completely valid
+  PDF files with form XObjects that cause problems in other software.
+  Flattening of form XObjects could be a useful way to work around
+  those issues or to prepare files for additional processing, making
+  it possible for users of the qpdf library to not be concerned about
+  form XObjects. This could be done recursively; i.e., we could have a
+  method to embed a form XObject into whatever contains it, whether
+  that is a form XObject or a page. This would require more
+  significant interpretation of the content stream. We would need a
+  test file in which the placement of the form XObject has to be in
+  the right place, e.g., the form XObject partially obscures earlier
+  code and is partially obscured by later code. Keys in the resource
+  dictionary may need to be changed -- create test cases with lots of
+  duplicated/overlapping keys.
 
-   - See ../misc/broken-files
+* Part of closed_file_input_source.cc is disabled on Windows because
+  of odd failures. It might be worth investigating so we can fully
+  exercise this in the test suite. That said, ClosedFileInputSource
+  is exercised elsewhere in qpdf's test suite, so this is not that
+  pressing.
 
-   - See ../misc/bad-files-issue-476. This directory contains a
-     snapshot of the google doc and linked PDF files from issue #476.
-     Please see the issue for details.
+* If possible, consider adding CCITT3, CCITT4, or any other easy
+  filters. For some reference code that we probably can't use but may
+  be handy anyway, see
+  http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
 
- * Additional form features
-   * set value from CLI? Specify title, and provide way to
-     disambiguate, probably by giving objgen of field
+* If possible, support the following types of broken files:
 
- * Pl_TIFFPredictor is pretty slow.
+  - Files that have no whitespace token after "endobj" such that
+    endobj collides with the start of the next object
 
- * Support for handling file names with Unicode characters in Windows
-   is incomplete. qpdf seems to support them okay from a functionality
-   standpoint, and the right thing happens if you pass in UTF-8
-   encoded filenames to QPDF library routines in Windows (they are
-   converted internally to wchar_t*), but file names are encoded in
-   UTF-8 on output, which doesn't produce nice error messages or
-   output on Windows in some cases.
+  - See ../misc/broken-files
 
- * If we ever wanted to do anything more with character encoding, see
-   ../misc/character-encoding/, which includes machine-readable dump
-   of table D.2 in the ISO-32000 PDF spec. This shows the mapping
-   between Unicode, StandardEncoding, WinAnsiEncoding,
-   MacRomanEncoding, and PDFDocEncoding.
+  - See ../misc/bad-files-issue-476. This directory contains a
+    snapshot of the google doc and linked PDF files from issue #476.
+    Please see the issue for details.
 
- * Some test cases on bad files fail because qpdf is unable to find
-   the root dictionary when it fails to read the trailer. Recovery
-   could find the root dictionary and even the info dictionary in
-   other ways. In particular, issue-202.pdf can be opened by evince,
-   and there's no real reason that qpdf couldn't be made to be able to
-   recover that file as well.
+* Additional form features
+  * set value from CLI? Specify title, and provide way to
+    disambiguate, probably by giving objgen of field
 
- * Audit every place where qpdf allocates memory to see whether there
-   are cases where malicious inputs could cause qpdf to attempt to
-   grab very large amounts of memory. Certainly there are cases like
-   this, such as if a very highly compressed, very large image stream
-   is requested in a buffer. Hopefully normal input to output
-   filtering doesn't ever try to do this. QPDFWriter should be checked
-   carefully too. See also bugs/private/from-email-663916/
+* Pl_TIFFPredictor is pretty slow.
 
- * Interactive form modification:
-   https://github.com/qpdf/qpdf/issues/213 contains a good discussion
-   of some ideas for adding methods to modify annotations and form
-   fields if we want to make it easier to support modifications to
-   interactive forms. Some of the ideas have been implemented, and
-   some of the probably never will be implemented, but it's worth a
-   read if there is an intention to work on this. In the issue, search
-   for "Regarding write functionality", and read that comment and the
-   responses to it.
+* Support for handling file names with Unicode characters in Windows
+  is incomplete. qpdf seems to support them okay from a functionality
+  standpoint, and the right thing happens if you pass in UTF-8
+  encoded filenames to QPDF library routines in Windows (they are
+  converted internally to wchar_t*), but file names are encoded in
+  UTF-8 on output, which doesn't produce nice error messages or
+  output on Windows in some cases.
 
- * Look at ~/Q/pdf-collection/forms-from-appian/
+* If we ever wanted to do anything more with character encoding, see
+  ../misc/character-encoding/, which includes machine-readable dump
+  of table D.2 in the ISO-32000 PDF spec. This shows the mapping
+  between Unicode, StandardEncoding, WinAnsiEncoding,
+  MacRomanEncoding, and PDFDocEncoding.
 
- * When decrypting files with /R=6, hash_V5 is called more than once
-   with the same inputs. Caching the results or refactoring to reduce
-   the number of identical calls could improve performance for
-   workloads that involve processing large numbers of small files.
+* Some test cases on bad files fail because qpdf is unable to find
+  the root dictionary when it fails to read the trailer. Recovery
+  could find the root dictionary and even the info dictionary in
+  other ways. In particular, issue-202.pdf can be opened by evince,
+  and there's no real reason that qpdf couldn't be made to be able to
+  recover that file as well.
 
- * Consider adding a method to balance the pages tree. It would call
-   pushInheritedAttributesToPage, construct a pages tree from scratch,
-   and replace the /Pages key of the root dictionary with the new
-   tree.
+* Audit every place where qpdf allocates memory to see whether there
+  are cases where malicious inputs could cause qpdf to attempt to
+  grab very large amounts of memory. Certainly there are cases like
+  this, such as if a very highly compressed, very large image stream
+  is requested in a buffer. Hopefully normal input to output
+  filtering doesn't ever try to do this. QPDFWriter should be checked
+  carefully too. See also bugs/private/from-email-663916/
 
- * Study what's required to support savable forms that can be saved by
-   Adobe Reader. Does this require actually signing the document with
-   an Adobe private key? Search for "Digital signatures" in the PDF
-   spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
-   came from Adobe's example site. See also
-   ../misc/digital-sign-from-trueroad/. If digital signatures are
-   implemented, update the docs on crypto providers, which mention
-   that this may happen in the future.
+* Interactive form modification:
+  https://github.com/qpdf/qpdf/issues/213 contains a good discussion
+  of some ideas for adding methods to modify annotations and form
+  fields if we want to make it easier to support modifications to
+  interactive forms. Some of the ideas have been implemented, and
+  some of the probably never will be implemented, but it's worth a
+  read if there is an intention to work on this. In the issue, search
+  for "Regarding write functionality", and read that comment and the
+  responses to it.
 
- * Qpdf does not honor /EFF when adding new file attachments. When it
-   encrypts, it never generates streams with explicit crypt filters.
-   Prior to 10.2, there was an incorrect attempt to treat /EFF as a
-   default value for decrypting file attachment streams, but it is not
-   supposed to mean that. Instead, it is intended for conforming
-   writers to obey this when adding new attachments. Qpdf is not a
-   conforming writer in that respect.
+* Look at ~/Q/pdf-collection/forms-from-appian/
 
- * The whole xref handling code in the QPDF object allows the same
-   object with more than one generation to coexist, but a lot of logic
-   assumes this isn't the case. Anything that creates mappings only
-   with the object number and not the generation is this way,
-   including most of the interaction between QPDFWriter and QPDF. If
-   we wanted to allow the same object with more than one generation to
-   coexist, which I'm not sure is allowed, we could fix this by
-   changing xref_table. Alternatively, we could detect and disallow
-   that case. In fact, it appears that Adobe reader and other PDF
-   viewing software silently ignores objects of this type, so this is
-   probably not a big deal.
+* When decrypting files with /R=6, hash_V5 is called more than once
+  with the same inputs. Caching the results or refactoring to reduce
+  the number of identical calls could improve performance for
+  workloads that involve processing large numbers of small files.
 
- * From a suggestion in bug 3152169, consider having an option to
-   re-encode inline images with an ASCII encoding.
+* Consider adding a method to balance the pages tree. It would call
+  pushInheritedAttributesToPage, construct a pages tree from scratch,
+  and replace the /Pages key of the root dictionary with the new
+  tree.
 
- * From github issue 2, provide more in-depth output for examining
-   hint stream contents. Consider adding on option to provide a
-   human-readable dump of linearization hint tables. This should
-   include improving the 'overflow reading bit stream' message as
-   reported in issue #2. There are multiple calls to stopOnError in
-   the linearization checking code. Ideally, these should not
-   terminate checking. It would require re-acquiring an understanding
-   of all that code to make the checks more robust. In particular,
-   it's hard to look at the code and quickly determine what is a true
-   logic error and what could happen because of malformed user input.
-   See also ../misc/linearization-errors.
+* Study what's required to support savable forms that can be saved by
+  Adobe Reader. Does this require actually signing the document with
+  an Adobe private key? Search for "Digital signatures" in the PDF
+  spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
+  came from Adobe's example site. See also
+  ../misc/digital-sign-from-trueroad/. If digital signatures are
+  implemented, update the docs on crypto providers, which mention
+  that this may happen in the future.
 
- * If I ever decide to make appearance stream-generation aware of
-   fonts or font metrics, see email from Tobias with Message-ID
-   <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
+* Qpdf does not honor /EFF when adding new file attachments. When it
+  encrypts, it never generates streams with explicit crypt filters.
+  Prior to 10.2, there was an incorrect attempt to treat /EFF as a
+  default value for decrypting file attachment streams, but it is not
+  supposed to mean that. Instead, it is intended for conforming
+  writers to obey this when adding new attachments. Qpdf is not a
+  conforming writer in that respect.
 
- * Look at places in the code where object traversal is being done and,
-   where possible, try to avoid it entirely or at least avoid ever
-   traversing the same objects multiple times.
+* The whole xref handling code in the QPDF object allows the same
+  object with more than one generation to coexist, but a lot of logic
+  assumes this isn't the case. Anything that creates mappings only
+  with the object number and not the generation is this way,
+  including most of the interaction between QPDFWriter and QPDF. If
+  we wanted to allow the same object with more than one generation to
+  coexist, which I'm not sure is allowed, we could fix this by
+  changing xref_table. Alternatively, we could detect and disallow
+  that case. In fact, it appears that Adobe reader and other PDF
+  viewing software silently ignores objects of this type, so this is
+  probably not a big deal.
+
+* From a suggestion in bug 3152169, consider having an option to
+  re-encode inline images with an ASCII encoding.
+
+* From github issue 2, provide more in-depth output for examining
+  hint stream contents. Consider adding on option to provide a
+  human-readable dump of linearization hint tables. This should
+  include improving the 'overflow reading bit stream' message as
+  reported in issue #2. There are multiple calls to stopOnError in
+  the linearization checking code. Ideally, these should not
+  terminate checking. It would require re-acquiring an understanding
+  of all that code to make the checks more robust. In particular,
+  it's hard to look at the code and quickly determine what is a true
+  logic error and what could happen because of malformed user input.
+  See also ../misc/linearization-errors.
+
+* If I ever decide to make appearance stream-generation aware of
+  fonts or font metrics, see email from Tobias with Message-ID
+  <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
+
+* Look at places in the code where object traversal is being done and,
+  where possible, try to avoid it entirely or at least avoid ever
+  traversing the same objects multiple times.
 
 ----------------------------------------------------------------------