mirror of
https://github.com/qpdf/qpdf.git
synced 2025-01-08 17:24:06 +00:00
318 lines
14 KiB
Plaintext
Next
====

*** ABI changes have been made. build.mk has been updated.

 * 64-bit windows build, remaining steps

   - new external-libs have been built and copied into
     ~/Q/storage/releases/qpdf/external-libs. Release is done in
     git. Just need to upload when ready. Remember to document that
     this version is needed for > 2.3.1.

   - update README-windows.txt docs to indicate that MSVC 2010 is the
     supported version and to update the information about mingw.

 * Testing for files > 4GB

   - Create a PDF from scratch. Each page has a page number as text
     and an image. The image can be 5000x5000 pixels using 8-bit
     gray scale. It will be divided into 10 stripes of 500 pixels
     each. The left and right 500 pixels of each stripe will
     alternate black and white. The remaining part of the image will
     have white stripes indicating 1 and black stripes indicating 0
     with the most-significant bit on top to indicate the page
     number. In this way, every page will be unique and will consume
     approximately 25 megabytes. Creating 200 pages like this will
     make a file that is 5 GB.

   - The file will have to have object streams since a regular xref
     table won't be able to support offsets that large.

   - A separate test program can create this file and do various
     manipulations on it. This can be enabled with an environment
     variable controlled by configure in much the same way image
     comparison tests are enabled now. The argument to
     --enable-large-file-test should be a path that has enough disk
     space to do the tests, probably enough space for two copies of
     the file. The test program should also have an interactive mode
     so we can generate the large file and then look at it with a
     PDF viewer like Adobe Reader.

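A minimal sketch of the image layout described for the large-file
test, in standalone C++ with no qpdf dependency. The name pixelValue
and the constants are made up for illustration. Having the border
columns start black on the top stripe and flip on every stripe is an
assumption; the notes only say that they alternate.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout constants taken from the description above.
constexpr int kImageSize = 5000;                      // width and height
constexpr int kStripes = 10;                          // one bit per stripe
constexpr int kStripeHeight = kImageSize / kStripes;  // 500 pixels
constexpr int kBorder = 500;                          // marker columns

// Return the 8-bit gray value (0 = black, 255 = white) of pixel (x, y)
// for the given page number. Stripe 0 is the top stripe and carries
// the most-significant of the 10 page-number bits.
inline std::uint8_t pixelValue(unsigned int page, int x, int y)
{
    assert(x >= 0 && x < kImageSize && y >= 0 && y < kImageSize);
    int stripe = y / kStripeHeight;
    if (x < kBorder || x >= kImageSize - kBorder)
    {
        // Assumption: border columns flip from stripe to stripe so
        // stripe boundaries stay visible for any page number.
        return (stripe % 2 == 0) ? 0 : 255;
    }
    unsigned int bit = (page >> (kStripes - 1 - stripe)) & 1u;
    return bit ? 255 : 0;
}

// Uncompressed size of one image and of all 200 images together.
constexpr long long kBytesPerImage = 1LL * kImageSize * kImageSize;
constexpr long long kTotalBytes = 200 * kBytesPerImage;
static_assert(kBytesPerImage == 25000000LL, "one image is 25 MB");
static_assert(kTotalBytes > 4294967296LL, "200 images exceed 2^32 bytes");
```

With this layout, page 5 (binary 0000000101) is white in the middle
region only in stripes 7 and 9, counting from 0 at the top.
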
 * Consider adding an example that uses the page APIs, or update the
   documentation to refer the user to the test suite.

Soon
====

 * Provide an option to copy encryption parameters from another file.
   This would make it possible to decrypt a file, manually work with
   it, and then re-encrypt it using the original encryption parameters
   including a possibly unknown owner password.

 * See if I can support the new encryption formats mentioned in the
   open bug on sourceforge. Check other sourceforge bugs.

 * Splitting/merging concepts

   newPDF() could create a PDF with just a trailer, no pages, and a
   minimal info. Then the page routines could be used to add pages to
   it.

   Starting with any pdf, you should be able to copy objects from
   another pdf. The copy should be smart about never traversing into
   a /Page or /Pages.

   We could provide a method of copying objects from one PDF into
   another. This would do whatever optimization is necessary (maybe
   just optimizePagesTree) and then traverse the set of objects
   specified to find all objects referenced by the set. Each of those
   would be copied over with a table mapping old ID to new ID. This
   would be done from bottom up most likely disallowing cycles or
   handling them sanely.

   Command line could be something like

     --pages [ --new ] { file [password] numeric-range ... } ... --

   The first file referenced would be the one whose other data would
   be preserved (like trailer, info, encryption, outlines, etc.).
   --new as first file would just use an empty file as the starting
   point.

   Example: to grab pages 1-5 from file1 and 11-15 from file2

     --pages file1.pdf 1-5 file2.pdf 11-15 --

   To implement this, we would remove all pages from file1 except
   pages 1 through 5. Then we would take pages 11 through 15 from
   file2 and add them to a set for transfer. This would end up
   generating a list of indirect objects. We would copy those objects
   shallowly to the new PDF keeping track of the mapping and replacing
   any indirect object keys as appropriate, much like QPDFWriter does.

   When all the objects are registered, we would add those pages to
   the result.

   This approach could work for both splitting and merging. It's
   possible it could be implemented now without any new APIs, but most
   of the work should be doable by the library with only a small set
   of additions.

   newPDF()
   QPDFObjectCopier c(qpdf1, qpdf2)
   QPDFObjectHandle obj = c.copyObject(<object from qpdf1>)
     Without traversing pages, copies all indirect objects referenced
     by <object from qpdf1> preserving referential integrity and
     returns an object handle in qpdf2 of the same object. If called
     multiple times on the same object, retraverses in case there were
     changes.

   QPDFObjectHandle obj = c.getMapping(<object from qpdf1>)
     find the object in qpdf2 corresponding to the object from qpdf1.
     Return the null object if none.

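A toy sketch of the old-ID-to-new-ID mapping idea behind the proposed
object copier. This is not the qpdf API: ToyPdf and ToyObjectCopier
are hypothetical names, a "PDF" here is only a map from object ID to
the IDs it references, and 0 stands in for the null object. Unlike
the proposal, the sketch does not retraverse on repeated calls and
does not model stopping at /Page or /Pages. It handles cycles by
reserving the new ID before recursing.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy stand-in for a PDF file: object ID -> IDs of referenced objects.
using ToyPdf = std::map<int, std::vector<int>>;

class ToyObjectCopier
{
  public:
    ToyObjectCopier(ToyPdf const& from, ToyPdf& to) :
        src(from),
        dst(to),
        next_id(to.empty() ? 1 : to.rbegin()->first + 1)
    {
    }

    // Copy old_id and everything reachable from it into the
    // destination; return the ID of the copy.
    int copyObject(int old_id)
    {
        auto found = mapping.find(old_id);
        if (found != mapping.end())
        {
            return found->second;  // already copied (or in progress)
        }
        auto source = src.find(old_id);
        assert(source != src.end());
        // Reserve the new ID before recursing so cycles terminate.
        int new_id = next_id++;
        mapping[old_id] = new_id;
        std::vector<int> refs;
        for (int child : source->second)
        {
            refs.push_back(copyObject(child));
        }
        dst[new_id] = refs;
        return new_id;
    }

    // Return the destination ID for old_id, or 0 if it was not copied.
    int getMapping(int old_id) const
    {
        auto found = mapping.find(old_id);
        return (found == mapping.end()) ? 0 : found->second;
    }

  private:
    ToyPdf const& src;
    ToyPdf& dst;
    std::map<int, int> mapping;
    int next_id;
};
```

Copying object 1 from a source whose objects 1, 2 and 3 reference
each other in a cycle brings all three across with rewritten
references, while an unreferenced object 4 stays behind.
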
General
=======

 * Look for %PDF header somewhere within the first 1024 bytes of the
   file. Also accept headers of the form "%!PS-Adobe-N.n PDF-M.m".
   See Implementation notes 13 and 14 in appendix H of the PDF 1.7
   specification. This is bug 3267974.

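A minimal sketch of the header search, operating on an in-memory
buffer rather than on qpdf's input source. findPdfHeader is a
hypothetical name. The sketch requires the whole "%PDF-" marker to
fall inside the first 1024 bytes, which is one reading of the
implementation note, and it does not handle the "%!PS-Adobe-N.n
PDF-M.m" form.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Return the offset of "%PDF-" within the first 1024 bytes of data,
// or std::string::npos if it does not appear there.
inline std::size_t findPdfHeader(std::string const& data)
{
    static std::string const marker = "%PDF-";
    std::size_t const window = std::min<std::size_t>(data.size(), 1024);
    // Search only the leading window so later occurrences are ignored.
    std::size_t pos = data.substr(0, window).find(marker);
    assert(pos == std::string::npos || pos + marker.size() <= window);
    return pos;
}
```

A file with 100 bytes of junk before the header yields offset 100; a
file whose header starts after byte 1024 is rejected.
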
 * Update qpdf docs about non-ascii passwords. See thread from
   2010-12-07,08 for details.

 * Look at page splitting. Subramanyam provided a test file; see
   ../misc/article-threads.pdf. Email Q-Count: 431864 from
   2009-11-03. See also "Splitting by Pages" below.

 * Consider writing a PDF merge utility. With 2.2, it would be
   possible to have a StreamDataProvider that would allow stream data
   to be directly copied from one PDF file to another. One possible
   strategy would be to have a program that adds all the pages of one
   file to the end of another file. The basic
   strategy would be to create a table that adds new streams to the
   original file, mapping the new streams' obj/gen to a stream in the
   file whose pages are being appended. The StreamDataProvider, when
   asked, could simply pipe the streams of the file being appended to
   the provided pipeline and could copy the filter and decode
   parameters from the original file. Being able to do this requires
   a lot of the same logic as being able to do splitting, so a general
   split/merge program would be a great addition.

 * See whether it's possible to remove the call to
   flattenScalarReferences. I can't easily figure out why I do it,
   but removing it causes strange test failures in linearization. I
   would have to study the optimization and linearization code to
   figure out why I added this to begin with and what in the code
   assumes it's the case. For enqueueObject and unparseChild in
   QPDFWriter, simply removing the checks for indirect scalars seems
   sufficient. Looking back at the branch in the apex epub
   repository, before flattening scalar references, there was special
   case code in QPDFWriter to avoid writing out indirect nulls. It's
   still not obvious to me why I did it though.

   To pursue this, remove the call to flattenScalarReferences in
   QPDFWriter.cc and disable the logic_error exceptions for indirect
   scalars. Just search for flattenScalarReferences in QPDFWriter.cc
   since the logic errors have comments that mention
   flattenScalarReferences. Then run the test suite. Several files
   that explicitly test flattening of scalar references fail, but the
   indirect scalars are properly preserved and written. But then
   there are some linearized files that have a bunch of unreferenced
   objects that contain scalars. Need to figure out what these are
   and why they're there. Maybe they're objects that used to be
   stream lengths. Probably we just need to make sure we don't
   traverse through a stream's /Length key when enqueueing stream
   dictionaries. This could potentially happen with any object that
   QPDFWriter replaces when writing out files. Such objects would be
   orphaned in the newly written file. This could be fixed, but it
   may not be worth fixing.

   If flattenScalarReferences is removed, a new method will be needed
   for checking PDF files.

 * See if we can avoid preserving unreferenced objects in object
   streams even when preserving the object streams.

 * For debugging linearization bugs, consider adding an option to save
   pass 1 of linearization. This code is sufficient. Change the
   interface to allow specification of a pass1 file, which would
   change the behavior as in this patch.

------------------------------
Index: QPDFWriter.cc
===================================================================
--- QPDFWriter.cc (revision 932)
+++ QPDFWriter.cc (working copy)
@@ -1965,11 +1965,15 @@
 
     // Write file in two passes. Part numbers refer to PDF spec 1.4.
 
+    FILE* XXX = 0;
     for (int pass = 1; pass <= 2; ++pass)
     {
         if (pass == 1)
         {
-            pushDiscardFilter();
+//            pushDiscardFilter();
+            XXX = fopen("/tmp/pass1.pdf", "w");
+            pushPipeline(new Pl_StdioFile("pass1", XXX));
+            activatePipelineStack();
         }
 
         // Part 1: header
@@ -2204,6 +2208,8 @@
 
             // Restore hint offset
             this->xref[hint_id] = QPDFXRefEntry(1, hint_offset, 0);
+            fclose(XXX);
+            XXX = 0;
         }
     }
 }
------------------------------

 * Handle embedded files. PDF Reference 1.7 section 3.10, "File
   Specifications", discusses this. Once we can definitely recognize
   all embedded files in a document, we can update the encryption
   code to handle it properly. In QPDF_encryption.cc, search for
   cf_file. Remove exception thrown if cf_file is different from
   cf_stream, and write code in the stream decryption section to use
   cf_file instead of cf_stream. In general, add interfaces to get
   the list of embedded files and to extract them. To handle general
   embedded files associated with the whole document, follow root ->
   /Names -> /EmbeddedFiles -> /Names to get to the file specification
   dictionaries. Then, in each file specification dictionary, follow
   /EF -> /F to the actual stream. There may be other places file
   specification dictionaries may appear, and there are also /RF keys
   with related files, so reread section 3.10 carefully.

 * The description of Crypt filters is unclear with respect to how to
   use them to override /StmF for specific streams. I'm not sure
   whether qpdf will do the right thing for any specific individual
   streams that might have crypt filters. The specification seems to
   imply that only embedded file streams and metadata streams can have
   crypt filters, and there are already special cases in the code to
   handle those. Most likely, it won't be a problem, but someday
   someone may find a file that qpdf doesn't work on because of crypt
   filters. There is an example in the spec of using a crypt filter
   on a metadata stream.

   For now, we notice /Crypt filters and decode parameters consistent
   with the example in the PDF specification, and the right thing
   happens for metadata streams that happen to be uncompressed or
   otherwise compressed in a way we can filter. This should handle
   all normal cases, but it's more or less just a guess since I don't
   have any test files that actually use stream-specific crypt filters
   in them.

 * The second xref stream for linearized files has to be padded only
   because we need file_size as computed in pass 1 to be accurate. If
   we were not allowing writing to a pipe, we could seek back to the
   beginning and fill in the value of /L in the linearization
   dictionary as an optimization to alleviate the need for this
   padding. Doing so would require us to pad the /L value
   individually and also to save the file descriptor and determine
   whether it's seekable. This is probably not worth bothering with.

 * The whole xref handling code in the QPDF object allows the same
   object with more than one generation to coexist, but a lot of logic
   assumes this isn't the case. Anything that creates mappings only
   with the object number and not the generation is this way,
   including most of the interaction between QPDFWriter and QPDF. If
   we wanted to allow the same object with more than one generation to
   coexist, which I'm not sure is allowed, we could fix this by
   changing xref_table. Alternatively, we could detect and disallow
   that case. In fact, it appears that Adobe Reader and other PDF
   viewing software silently ignore objects of this type, so this is
   probably not a big deal.

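A toy illustration of the object number versus object/generation
issue. ToyXref is a hypothetical type and is not how qpdf's
xref_table is declared. It only shows that a map keyed by object
number alone silently keeps one entry when two generations of the
same object are present, while a map keyed by the (object,
generation) pair keeps both.

```cpp
#include <cassert>
#include <map>
#include <utility>

struct ToyXref
{
    // Keyed by object number only: a later generation overwrites an
    // earlier one.
    std::map<int, long long> by_number;
    // Keyed by (object number, generation): both entries survive.
    std::map<std::pair<int, int>, long long> by_objgen;

    void add(int obj, int gen, long long offset)
    {
        assert(obj > 0 && gen >= 0);
        by_number[obj] = offset;
        by_objgen[{obj, gen}] = offset;
    }
};
```

Detecting the case that should be disallowed amounts to noticing
that the two maps end up with different sizes.
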
 * Pl_PNGFilter is only partially implemented. If we ever decoded
   images, we'd have to finish implementing it along with the other
   filter decode parameters and types. For just handling xref
   streams, there's really no need as it wouldn't make sense to use
   any kind of predictor other than 12 (PNG UP filter).

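For reference, a standalone sketch of undoing the PNG Up filter, the
only case the note above says matters for xref streams. This is not
the Pl_PNGFilter code: decodePngUp is a hypothetical name, and the
sketch assumes each row is one filter-type byte followed by
`columns` data bytes and rejects any filter type other than 2 (Up).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Undo PNG "Up" filtering. Each input row is a filter-type byte
// followed by `columns` data bytes; the output has no type bytes.
inline std::vector<unsigned char>
decodePngUp(std::vector<unsigned char> const& in, std::size_t columns)
{
    assert(columns > 0);
    std::size_t const row_size = columns + 1;
    if (in.size() % row_size != 0)
    {
        throw std::runtime_error("input is not a whole number of rows");
    }
    std::vector<unsigned char> out;
    // The row above the first row is treated as all zeros.
    std::vector<unsigned char> prior(columns, 0);
    for (std::size_t row = 0; row < in.size(); row += row_size)
    {
        if (in[row] != 2)
        {
            throw std::runtime_error("only the PNG Up filter is handled");
        }
        for (std::size_t i = 0; i < columns; ++i)
        {
            // Up stores each byte as the difference from the byte
            // directly above it, modulo 256.
            prior[i] = static_cast<unsigned char>(in[row + 1 + i] + prior[i]);
            out.push_back(prior[i]);
        }
    }
    return out;
}
```

Two filtered rows (2, 1, 2, 3) and (2, 1, 1, 1) with three columns
decode to 1 2 3 followed by 2 3 4.
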
 * If we ever want to have check mode check the integrity of the free
   list, this can be done by looking at the code from prior to the
   object stream support of 4/5/2008. It's in an if (0) block and
   there's a comment about it. There's also something about it in
   qpdf.test -- search for "free table". On the other hand, the value
   of doing this seems very low since no viewer seems to care, so it's
   probably not worth it.

 * QPDFObjectHandle::getPageImages() doesn't notice images in
   inherited resource dictionaries. See comments in that function.

 * Based on an idea suggested by user "Atom Smasher", consider
   providing some mechanism to recover earlier versions of a file
   embedded prior to appended sections.

 * From a suggestion in bug 3152169, consider having an option to
   re-encode inline images with an ASCII encoding.


Splitting by Pages
==================

Although qpdf does not currently support splitting a file into pages,
the work done for linearization covers almost all the work needed to
do page splitting. If this functionality is needed, study
obj_user_to_objects and object_to_obj_users created in
QPDF_optimization for ideas. It's quite possible that the information
computed by calculateLinearizationData is actually sufficient to do
page splitting in many circumstances. That code knows which objects
are used by which pages, though it doesn't do anything page-specific
with outlines, thumbnails, page labels, or anything else.

Another approach would be to traverse only pages that are being output
taking care not to traverse into the pages tree, and then to fabricate
a new pages tree.

Either way, care must be taken to handle other things such as
outlines, page labels, thumbnails, threads, zones, etc. in a sensible
way. This may include simply omitting information other than page
content.