mirror of
https://github.com/qpdf/qpdf.git
synced 2024-11-19 10:55:10 +00:00
b8bdef0ad1
For non-encrypted files, determinstic ID generation uses file contents instead of timestamp and file name. At a small runtime cost, this enables generation of the same /ID if the same inputs are converted in the same way multiple times.
323 lines
14 KiB
Plaintext
323 lines
14 KiB
Plaintext
Small, command-line tool only enhancements to do soon
|
|
=====================================================
|
|
|
|
* Handle input file = output file as a special case. See issue 29.
|
|
Behavior: detect if output file is the same as one of the input
|
|
files. If so, refuse to operate unless --allow-overwrite is
|
|
specified. In that case, write to a temporary file and, if there
|
|
are no errors or warnings, rename the temporary output file over
|
|
the input file. If rename fails, delete the temporary file.
|
|
|
|
* Consider providing alternative methods for specifying passwords.
|
|
The methods should be general enough to use for both encryption and
|
|
decryption passwords. Example methods could be reading the password
|
|
from a file, a file descriptor, or prompting. Prompting should
|
|
never be done with being specifically requested though; we don't
|
|
want to create a situation where running qpdf might block waiting
|
|
for input where it previously did not. Test case: encrypt an
|
|
encrypted file with the output file having different user/owner
|
|
passwords. Make sure we have a predictable way to read all three
|
|
passwords (input, output user, output owner). Maybe we have
|
|
something like --password-source=<method>:<which>,... where method could
|
|
be file=/path, fd=n, or prompt and which could be one of input,
|
|
user, owner. If a password source is provided for input, it takes
|
|
precedence over --password if specified later on the command line.
|
|
If a password source is specified for output passwords, the
|
|
corresponding passwords must be '-'. If more than one password is
|
|
read from the same source, passwords are newline separated.
|
|
Trailing newlines are ignored. Example:
|
|
|
|
qpdf --password-source=fd=3:input,owner a.pdf b.pdf
|
|
|
|
would read two lines from file descriptor 3. The first would the
|
|
password for reading a.pdf, and the second would be the owner
|
|
password for b.pdf. The encryption arguments would specify the
|
|
actual user password for b.pdf and - as the owner password.
|
|
|
|
qpdf --password-source=file=/tmp/a:input --password=source=prompt:user,owner
|
|
|
|
would read the input file from /tmp/a and would prompt twice: one
|
|
for the user password and once for the owner password.
|
|
|
|
* Consider adding "uninstall" target to makefile. It should only
|
|
uninstall what it installed, which means that you must run
|
|
uninstall from the version you ran install with. It would only be
|
|
supported for the toolchains that support the install target
|
|
(libtool).
|
|
|
|
|
|
Next ABI change
|
|
===============
|
|
|
|
Remove private methods that are there only for ABI compatibility
|
|
including extra QPDFWriter writeTrailer, writeXRefTable,
|
|
writeXRefStream.
|
|
|
|
|
|
5.2.0
|
|
=====
|
|
|
|
* Before release: remember to bump minor shared library version since
|
|
new methods were added (even though private).
|
|
|
|
* Figure out what about a3576a73593987b26cd3eff346f8f7c11f713cbd
|
|
broke binary compatibility.
|
|
|
|
* Implement automated testing for binary compatibility and add to
|
|
release checklist.
|
|
|
|
* Add method to push inheritable resources to a single page by
|
|
walking up and copying without overwrite. Above logic will also be
|
|
sufficient to fix the limitation in
|
|
QPDFObjectHandle::getPageImages(). Maybe add a method to get the
|
|
effective resources for a page without modifying the page and then
|
|
implement both changes in terms of that method.
|
|
|
|
* Provide an option for QPDFWriter to preserve unreferenced objects
|
|
when writing out a file.
|
|
|
|
* Look at all the exceptions and error conditions in QPDF_stream and
|
|
figure out which ones should be converted to warnings and treating
|
|
the stream as not filterable.
|
|
|
|
* Support user-pluggable stream filters. This would enable external
|
|
code to provide interpretation for filters that are missing from
|
|
qpdf. Make it possible for user-provided fitlers to override
|
|
built-in filters. Make sure that the pluggable filters can be
|
|
prioritized so that we can poll all registered filters to see
|
|
whether they are capable of filtering a particular stream.
|
|
|
|
* If possible, consider adding RLE, CCITT3, CCITT4, or any other easy
|
|
filters. For some reference code that we probably can't use but
|
|
may be handy anyway, see
|
|
http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
|
|
|
|
* If possible, support the following types of broken files:
|
|
|
|
- Files that lack %%EOF at the end but otherwise have a valid
|
|
startxref near the end
|
|
|
|
- Files that have no whitespace token after "endobj" such that
|
|
endobj collides with the start of the next object
|
|
|
|
- Files with individual corrupted streams. Just leave the streams
|
|
unfiltered after giving a warning, or maybe do something else
|
|
like applying as many of the filters as possible, etc.
|
|
QPDFWriter can have some kind of retry mechanism on streams
|
|
where filtering fails after filterable returns true.
|
|
|
|
- Files whose PDF header is malformed, perhaps with no version
|
|
number (as literally %PDF-a.b). Maybe keep track of features to
|
|
try to infer a version based on encryption formats and object
|
|
streams.
|
|
|
|
- For really hard errors like corrupted streams where there is
|
|
virtually guaranteed to be loss, maybe require an additional
|
|
option to tell qpdf that it's okay to continue and treat those
|
|
as warnings. Probably need separate options for each type of
|
|
error plus a generic tryReallyHard kind of method that enables
|
|
them all. Then the qpdf command-line tool can have a single
|
|
flag that enables all supported aggressive recovery techniques.
|
|
|
|
- See ../misc/broken-files
|
|
|
|
|
|
|
|
General
|
|
=======
|
|
|
|
* Figure out how to find Visual Studio in Windows registry and see if
|
|
I can get it to work with make so I can simplify creation of
|
|
Windows releases.
|
|
|
|
* Provide support in QPDFWriter for writing incremental updates.
|
|
Provide support in qpdf for preserving incremental updates. The
|
|
goal should be that QDF mode should be fully functional for files
|
|
with incremental updates including fix_qdf.
|
|
|
|
Note that there's nothing that says an indirect object in one
|
|
update can't refer to an object that doesn't appear until a later
|
|
update. This means that QPDF has to treat indirect null objects
|
|
differently from how it does now. QPDF drops indirect null objects
|
|
that appear as members of arrays or dictionaries. For arrays, it's
|
|
handled in QPDFWriter where we make indirect nulls direct. This is
|
|
in a single if block, and nothing else in the code cares about it.
|
|
We could just remove that if block and not break anything except a
|
|
few test cases that exercise the current behavior. For
|
|
dictionaries, it's more complicated. In this case,
|
|
QPDF_Dictionary::getKeys() ignores all keys with null values, and
|
|
hasKey() returns false for keys that have null values. We would
|
|
probably want to make QPDF_Dictionary able to handle the special
|
|
case of keys that are indirect nulls and basically never have it
|
|
drop any keys that are indirect objects.
|
|
|
|
If we make a change to have qpdf preserve indirect references to
|
|
null objects, we have to note this in ChangeLog and in the release
|
|
notes since this will change output files. We did this before when
|
|
we stopped flattening scalar references, so this is probably not a
|
|
big deal. We also have to make sure that the testing for this
|
|
handles non-trivial cases of the targets of indirect nulls being
|
|
replaced by real objects in an update. I'm not sure how this plays
|
|
with linearization, if at all. For cases where incremental updates
|
|
are not being preserved as incremental updates and where the data
|
|
is being folded in (as is always the case with qpdf now), none of
|
|
this should make any difference in the actual semantics of the
|
|
files.
|
|
|
|
* When decrypting files with /R=6, hash_V5 is called more than once
|
|
with the same inputs. Caching the results or refactoring to reduce
|
|
the number of identical calls could improve performance for
|
|
workloads that involve processing large numbers of small files.
|
|
|
|
* Consider providing a Windows installer for qpdf using NSIS.
|
|
|
|
* Consider adding a method to balance the pages tree. It would call
|
|
pushInheritedAttributesToPage, construct a pages tree from scratch,
|
|
and replace the /Pages key of the root dictionary with the new
|
|
tree.
|
|
|
|
* Secure random number generation could be made more efficient by
|
|
using a local static to ensure a single random device or crypt
|
|
provider as long as this can be done in a thread-safe fashion. In
|
|
the initial implementation, this is being skipped to avoid having
|
|
to add any dependencies on threading libraries.
|
|
|
|
* Study what's required to support savable forms that can be saved by
|
|
Adobe Reader. Does this require actually signing the document with
|
|
an Adobe private key? Search for "Digital signatures" in the PDF
|
|
spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
|
|
came from Adobe's example site.
|
|
|
|
* Consider the possibility of doing something locale-aware to support
|
|
non-ASCII passwords. Update documentation if this is done.
|
|
Consider implementing full Unicode password algorithms from newer
|
|
encryption formats.
|
|
|
|
* Consider impact of article threads on page splitting/merging.
|
|
Subramanyam provided a test file; see ../misc/article-threads.pdf.
|
|
Email Q-Count: 431864 from 2009-11-03. Other things to consider:
|
|
outlines, page labels, thumbnails, zones. There are probably
|
|
others.
|
|
|
|
* See if we can avoid preserving unreferenced objects in object
|
|
streams even when preserving the object streams.
|
|
|
|
* For debugging linearization bugs, consider adding an option to save
|
|
pass 1 of linearization. This code is sufficient. Change the
|
|
interface to allow specification of a pass1 file, which would
|
|
change the behavior as in this patch.
|
|
|
|
------------------------------
|
|
Index: QPDFWriter.cc
|
|
===================================================================
|
|
--- QPDFWriter.cc (revision 932)
|
|
+++ QPDFWriter.cc (working copy)
|
|
@@ -1965,11 +1965,15 @@
|
|
|
|
// Write file in two passes. Part numbers refer to PDF spec 1.4.
|
|
|
|
+ FILE* XXX = 0;
|
|
for (int pass = 1; pass <= 2; ++pass)
|
|
{
|
|
if (pass == 1)
|
|
{
|
|
- pushDiscardFilter();
|
|
+// pushDiscardFilter();
|
|
+ XXX = QUtil::safe_fopen("/tmp/pass1.pdf", "w");
|
|
+ pushPipeline(new Pl_StdioFile("pass1", XXX));
|
|
+ activatePipelineStack();
|
|
}
|
|
|
|
// Part 1: header
|
|
@@ -2204,6 +2208,8 @@
|
|
|
|
// Restore hint offset
|
|
this->xref[hint_id] = QPDFXRefEntry(1, hint_offset, 0);
|
|
+ fclose(XXX);
|
|
+ XXX = 0;
|
|
}
|
|
}
|
|
}
|
|
------------------------------
|
|
|
|
* Provide APIs for embedded files. See *attachments*.pdf in test
|
|
suite. The private method findAttachmentStreams finds at least
|
|
cases for modern versions of Adobe Reader (>= 1.7, maybe earlier).
|
|
PDF Reference 1.7 section 3.10, "File Specifications", discusses
|
|
this.
|
|
|
|
A sourceforge user asks if qpdf can handle extracting and embedded
|
|
resources and references these tools, which may be useful as a
|
|
reference.
|
|
|
|
http://multivalent.sourceforge.net/Tools/pdf/Extract.html
|
|
http://multivalent.sourceforge.net/Tools/pdf/Embed.html
|
|
|
|
* The description of Crypt filters is unclear with respect to how to
|
|
use them to override /StmF for specific streams. I'm not sure
|
|
whether qpdf will do the right thing for any specific individual
|
|
streams that might have crypt filters, but I believe it does based
|
|
on my testing of a limited subset. The specification seems to imply
|
|
that only embedded file streams and metadata streams can have crypt
|
|
filters, and there are already special cases in the code to handle
|
|
those. Most likely, it won't be a problem, but someday someone may
|
|
find a file that qpdf doesn't work on because of crypt filters.
|
|
There is an example in the spec of using a crypt filter on a
|
|
metadata stream.
|
|
|
|
For now, we notice /Crypt filters and decode parameters consistent
|
|
with the example in the PDF specification, and the right thing
|
|
happens for metadata filters that happen to be uncompressed or
|
|
otherwise compressed in a way we can filter. This should handle
|
|
all normal cases, but it's more or less just a guess since I don't
|
|
have any test files that actually use stream-specific crypt filters
|
|
in them.
|
|
|
|
* The second xref stream for linearized files has to be padded only
|
|
because we need file_size as computed in pass 1 to be accurate. If
|
|
we were not allowing writing to a pipe, we could seek back to the
|
|
beginning and fill in the value of /L in the linearization
|
|
dictionary as an optimization to alleviate the need for this
|
|
padding. Doing so would require us to pad the /L value
|
|
individually and also to save the file descriptor and determine
|
|
whether it's seekable. This is probably not worth bothering with.
|
|
|
|
* The whole xref handling code in the QPDF object allows the same
|
|
object with more than one generation to coexist, but a lot of logic
|
|
assumes this isn't the case. Anything that creates mappings only
|
|
with the object number and not the generation is this way,
|
|
including most of the interaction between QPDFWriter and QPDF. If
|
|
we wanted to allow the same object with more than one generation to
|
|
coexist, which I'm not sure is allowed, we could fix this by
|
|
changing xref_table. Alternatively, we could detect and disallow
|
|
that case. In fact, it appears that Adobe reader and other PDF
|
|
viewing software silently ignores objects of this type, so this is
|
|
probably not a big deal.
|
|
|
|
* Pl_PNGFilter is only partially implemented. If we ever decoded
|
|
images, we'd have to finish implementing it along with the other
|
|
filter decode parameters and types. For just handling xref
|
|
streams, there's really no need as it wouldn't make sense to use
|
|
any kind of predictor other than 12 (PNG UP filter).
|
|
|
|
* If we ever want to have check mode check the integrity of the free
|
|
list, this can be done by looking at the code from prior to the
|
|
object stream support of 4/5/2008. It's in an if (0) block and
|
|
there's a comment about it. There's also something about it in
|
|
qpdf.test -- search for "free table". On the other hand, the value
|
|
of doing this seems very low since no viewer seems to care, so it's
|
|
probably not worth it.
|
|
|
|
* QPDFObjectHandle::getPageImages() doesn't notice images in
|
|
inherited resource dictionaries. See comments in that function.
|
|
|
|
* Based on an idea suggested by user "Atom Smasher", consider
|
|
providing some mechanism to recover earlier versions of a file
|
|
embedded prior to appended sections.
|
|
|
|
* From a suggestion in bug 3152169, consider having an option to
|
|
re-encode inline images with an ASCII encoding.
|
|
|
|
* From github issue 2, provide more in-depth output for examining
|
|
hint stream contents.
|