mirror of
https://github.com/qpdf/qpdf.git
synced 2024-12-22 19:08:59 +00:00
260 lines
12 KiB
Plaintext
260 lines
12 KiB
Plaintext
5.2.0
|
|
=====
|
|
|
|
* Figure out what about a3576a73593987b26cd3eff346f8f7c11f713cbd
|
|
broke binary compatibility.
|
|
|
|
* Implement automated testing for binary compatibility and add to
|
|
release checklist.
|
|
|
|
* Add method to push inheritable resources to a single page by
|
|
walking up and copying without overwrite. Above logic will also be
|
|
sufficient to fix the limitation in
|
|
QPDFObjectHandle::getPageImages(). Maybe add a method to get the
|
|
effective resources for a page without modifying the page and then
|
|
implement both changes in terms of that method.
|
|
|
|
* Provide an option for QPDFWriter to preserve unreferenced objects
|
|
when writing out a file.
|
|
|
|
* Look at all the exceptions and error conditions in QPDF_stream and
|
|
figure out which ones should be converted to warnings and treating
|
|
the stream as not filterable.
|
|
|
|
* Support user-pluggable stream filters. This would enable external
|
|
code to provide interpretation for filters that are missing from
|
|
qpdf. Make it possible for user-provided fitlers to override
|
|
built-in filters. Make sure that the pluggable filters can be
|
|
prioritized so that we can poll all registered filters to see
|
|
whether they are capable of filtering a particular stream.
|
|
|
|
* If possible, consider adding RLE, CCITT3, CCITT4, or any other easy
|
|
filters. For some reference code that we probably can't use but
|
|
may be handy anyway, see
|
|
http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
|
|
|
|
* If possible, support the following types of broken files:
|
|
|
|
- Files that lack %%EOF at the end but otherwise have a valid
|
|
startxref near the end
|
|
|
|
- Files that have no whitespace token after "endobj" such that
|
|
endobj collides with the start of the next object
|
|
|
|
- Files with individual corrupted streams. Just leave the streams
|
|
unfiltered after giving a warning, or maybe do something else
|
|
like applying as many of the filters as possible, etc.
|
|
QPDFWriter can have some kind of retry mechanism on streams
|
|
where filtering fails after filterable returns true.
|
|
|
|
- Files whose PDF header is malformed, perhaps with no version
|
|
number (as literally %PDF-a.b). Maybe keep track of features to
|
|
try to infer a version based on encryption formats and object
|
|
streams.
|
|
|
|
- For really hard errors like corrupted streams where there is
|
|
virtually guaranteed to be loss, maybe require an additional
|
|
option to tell qpdf that it's okay to continue and treat those
|
|
as warnings. Probably need separate options for each type of
|
|
error plus a generic tryReallyHard kind of method that enables
|
|
them all. Then the qpdf command-line tool can have a single
|
|
flag that enables all supported aggressive recovery techniques.
|
|
|
|
- See ../misc/broken-files
|
|
|
|
|
|
|
|
General
|
|
=======
|
|
|
|
* Provide support in QPDFWriter for writing incremental updates.
|
|
Provide support in qpdf for preserving incremental updates. The
|
|
goal should be that QDF mode should be fully functional for files
|
|
with incremental updates including fix_qdf.
|
|
|
|
Note that there's nothing that says an indirect object in one
|
|
update can't refer to an object that doesn't appear until a later
|
|
update. This means that QPDF has to treat indirect null objects
|
|
differently from how it does now. QPDF drops indirect null objects
|
|
that appear as members of arrays or dictionaries. For arrays, it's
|
|
handled in QPDFWriter where we make indirect nulls direct. This is
|
|
in a single if block, and nothing else in the code cares about it.
|
|
We could just remove that if block and not break anything except a
|
|
few test cases that exercise the current behavior. For
|
|
dictionaries, it's more complicated. In this case,
|
|
QPDF_Dictionary::getKeys() ignores all keys with null values, and
|
|
hasKey() returns false for keys that have null values. We would
|
|
probably want to make QPDF_Dictionary able to handle the special
|
|
case of keys that are indirect nulls and basically never have it
|
|
drop any keys that are indirect objects.
|
|
|
|
If we make a change to have qpdf preserve indirect references to
|
|
null objects, we have to note this in ChangeLog and in the release
|
|
notes since this will change output files. We did this before when
|
|
we stopped flattening scalar references, so this is probably not a
|
|
big deal. We also have to make sure that the testing for this
|
|
handles non-trivial cases of the targets of indirect nulls being
|
|
replaced by real objects in an update. I'm not sure how this plays
|
|
with linearization, if at all. For cases where incremental updates
|
|
are not being preserved as incremental updates and where the data
|
|
is being folded in (as is always the case with qpdf now), none of
|
|
this should make any difference in the actual semantics of the
|
|
files.
|
|
|
|
* When decrypting files with /R=6, hash_V5 is called more than once
|
|
with the same inputs. Caching the results or refactoring to reduce
|
|
the number of identical calls could improve performance for
|
|
workloads that involve processing large numbers of small files.
|
|
|
|
* Consider providing a Windows installer for qpdf using NSIS.
|
|
|
|
* Consider adding a method to balance the pages tree. It would call
|
|
pushInheritedAttributesToPage, construct a pages tree from scratch,
|
|
and replace the /Pages key of the root dictionary with the new
|
|
tree.
|
|
|
|
* Secure random number generation could be made more efficient by
|
|
using a local static to ensure a single random device or crypt
|
|
provider as long as this can be done in a thread-safe fashion. In
|
|
the initial implementation, this is being skipped to avoid having
|
|
to add any dependencies on threading libraries.
|
|
|
|
* Study what's required to support savable forms that can be saved by
|
|
Adobe Reader. Does this require actually signing the document with
|
|
an Adobe private key? Search for "Digital signatures" in the PDF
|
|
spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
|
|
came from Adobe's example site.
|
|
|
|
* Consider the possibility of doing something locale-aware to support
|
|
non-ASCII passwords. Update documentation if this is done.
|
|
Consider implementing full Unicode password algorithms from newer
|
|
encryption formats.
|
|
|
|
* Consider impact of article threads on page splitting/merging.
|
|
Subramanyam provided a test file; see ../misc/article-threads.pdf.
|
|
Email Q-Count: 431864 from 2009-11-03. Other things to consider:
|
|
outlines, page labels, thumbnails, zones. There are probably
|
|
others.
|
|
|
|
* See if we can avoid preserving unreferenced objects in object
|
|
streams even when preserving the object streams.
|
|
|
|
* For debugging linearization bugs, consider adding an option to save
|
|
pass 1 of linearization. This code is sufficient. Change the
|
|
interface to allow specification of a pass1 file, which would
|
|
change the behavior as in this patch.
|
|
|
|
------------------------------
|
|
Index: QPDFWriter.cc
|
|
===================================================================
|
|
--- QPDFWriter.cc (revision 932)
|
|
+++ QPDFWriter.cc (working copy)
|
|
@@ -1965,11 +1965,15 @@
|
|
|
|
// Write file in two passes. Part numbers refer to PDF spec 1.4.
|
|
|
|
+ FILE* XXX = 0;
|
|
for (int pass = 1; pass <= 2; ++pass)
|
|
{
|
|
if (pass == 1)
|
|
{
|
|
- pushDiscardFilter();
|
|
+// pushDiscardFilter();
|
|
+ XXX = QUtil::safe_fopen("/tmp/pass1.pdf", "w");
|
|
+ pushPipeline(new Pl_StdioFile("pass1", XXX));
|
|
+ activatePipelineStack();
|
|
}
|
|
|
|
// Part 1: header
|
|
@@ -2204,6 +2208,8 @@
|
|
|
|
// Restore hint offset
|
|
this->xref[hint_id] = QPDFXRefEntry(1, hint_offset, 0);
|
|
+ fclose(XXX);
|
|
+ XXX = 0;
|
|
}
|
|
}
|
|
}
|
|
------------------------------
|
|
|
|
* Provide APIs for embedded files. See *attachments*.pdf in test
|
|
suite. The private method findAttachmentStreams finds at least
|
|
cases for modern versions of Adobe Reader (>= 1.7, maybe earlier).
|
|
PDF Reference 1.7 section 3.10, "File Specifications", discusses
|
|
this.
|
|
|
|
A sourceforge user asks if qpdf can handle extracting and embedded
|
|
resources and references these tools, which may be useful as a
|
|
reference.
|
|
|
|
http://multivalent.sourceforge.net/Tools/pdf/Extract.html
|
|
http://multivalent.sourceforge.net/Tools/pdf/Embed.html
|
|
|
|
* The description of Crypt filters is unclear with respect to how to
|
|
use them to override /StmF for specific streams. I'm not sure
|
|
whether qpdf will do the right thing for any specific individual
|
|
streams that might have crypt filters, but I believe it does based
|
|
on my testing of a limited subset. The specification seems to imply
|
|
that only embedded file streams and metadata streams can have crypt
|
|
filters, and there are already special cases in the code to handle
|
|
those. Most likely, it won't be a problem, but someday someone may
|
|
find a file that qpdf doesn't work on because of crypt filters.
|
|
There is an example in the spec of using a crypt filter on a
|
|
metadata stream.
|
|
|
|
For now, we notice /Crypt filters and decode parameters consistent
|
|
with the example in the PDF specification, and the right thing
|
|
happens for metadata filters that happen to be uncompressed or
|
|
otherwise compressed in a way we can filter. This should handle
|
|
all normal cases, but it's more or less just a guess since I don't
|
|
have any test files that actually use stream-specific crypt filters
|
|
in them.
|
|
|
|
* The second xref stream for linearized files has to be padded only
|
|
because we need file_size as computed in pass 1 to be accurate. If
|
|
we were not allowing writing to a pipe, we could seek back to the
|
|
beginning and fill in the value of /L in the linearization
|
|
dictionary as an optimization to alleviate the need for this
|
|
padding. Doing so would require us to pad the /L value
|
|
individually and also to save the file descriptor and determine
|
|
whether it's seekable. This is probably not worth bothering with.
|
|
|
|
* The whole xref handling code in the QPDF object allows the same
|
|
object with more than one generation to coexist, but a lot of logic
|
|
assumes this isn't the case. Anything that creates mappings only
|
|
with the object number and not the generation is this way,
|
|
including most of the interaction between QPDFWriter and QPDF. If
|
|
we wanted to allow the same object with more than one generation to
|
|
coexist, which I'm not sure is allowed, we could fix this by
|
|
changing xref_table. Alternatively, we could detect and disallow
|
|
that case. In fact, it appears that Adobe reader and other PDF
|
|
viewing software silently ignores objects of this type, so this is
|
|
probably not a big deal.
|
|
|
|
* Pl_PNGFilter is only partially implemented. If we ever decoded
|
|
images, we'd have to finish implementing it along with the other
|
|
filter decode parameters and types. For just handling xref
|
|
streams, there's really no need as it wouldn't make sense to use
|
|
any kind of predictor other than 12 (PNG UP filter).
|
|
|
|
* If we ever want to have check mode check the integrity of the free
|
|
list, this can be done by looking at the code from prior to the
|
|
object stream support of 4/5/2008. It's in an if (0) block and
|
|
there's a comment about it. There's also something about it in
|
|
qpdf.test -- search for "free table". On the other hand, the value
|
|
of doing this seems very low since no viewer seems to care, so it's
|
|
probably not worth it.
|
|
|
|
* QPDFObjectHandle::getPageImages() doesn't notice images in
|
|
inherited resource dictionaries. See comments in that function.
|
|
|
|
* Based on an idea suggested by user "Atom Smasher", consider
|
|
providing some mechanism to recover earlier versions of a file
|
|
embedded prior to appended sections.
|
|
|
|
* From a suggestion in bug 3152169, consider having an option to
|
|
re-encode inline images with an ASCII encoding.
|
|
|
|
* From github issue 2, provide more in-depth output for examining
|
|
hint stream contents.
|