Eliminate flattenScalarReferences

Jay Berkenbilt 2012-12-27 11:39:01 -05:00
parent b4b8b28ed2
commit 04c203ae06
18 changed files with 260 additions and 305 deletions


@ -1,3 +1,21 @@
2012-12-27 Jay Berkenbilt <ejb@ql.org>
* Removed public method QPDF::flattenScalarReferences. Instead,
just flatten the scalar references we actually need to flatten.
Flattening scalar references was a wrong decision years ago and
has occasionally caused other problems. Among them, it caused
qpdf to visit otherwise unreferenced and possibly erroneous
objects in the file when it didn't have to.
* Removed public method QPDF::decodeStreams, which was previously
used by qpdf --check but is no longer used. The decodeStreams
method could generate false positives since it would attempt to
access all objects in the file including those that were not
referenced.
* Removed public method QPDF::trimTrailerForWrite, which was only
intended for use by QPDFWriter and which is no longer used.
2012-12-25 Jay Berkenbilt <ejb@ql.org>
* Allow PDF header to appear anywhere in the first 1024 bytes of

TODO

@ -41,38 +41,6 @@ General
outlines, page labels, thumbnails, zones. There are probably
others.
* See whether it's possible to remove the call to
flattenScalarReferences. I can't easily figure out why I do it,
but removing it causes strange test failures in linearization. I
would have to study the optimization and linearization code to
figure out why I added this to begin with and what in the code
assumes it's the case. For enqueueObject and unparseChild in
QPDFWriter, simply removing the checks for indirect scalars seems
sufficient. Looking back at the branch in the apex epub
repository, before flattening scalar references, there was special
case code in QPDFWriter to avoid writing out indirect nulls. It's
still not obvious to me why I did it though.
To pursue this, remove the call to flattenScalarReferences in
QPDFWriter.cc and disable the logic_error exceptions for indirect
scalars. Just search for flattenScalarReferences in QPDFWriter.cc
since the logic errors have comments that mention
flattenScalarReferences. Then run the test suite. Several files
that explicitly test flattening of scalar references fail, but the
indirect scalars are properly preserved and written. But then
there are some linearized files that have a bunch of unreferenced
objects that contain scalars. Need to figure out what these are
and why they're there. Maybe they're objects that used to be
stream lengths. Probably we just need to make sure we don't traverse
through a stream's /Length object when enqueueing stream
dictionaries. This could potentially happen with any object that
QPDFWriter replaces when writing out files. Such objects would be
orphaned in the newly written file. This could be fixed, but it
may not be worth fixing.
If flattenScalarReferences is removed, a new method will be needed
for checking PDF files.
* See if we can avoid preserving unreferenced objects in object
streams even when preserving the object streams.


@ -352,24 +352,8 @@ class QPDF
void optimize(std::map<int, int> const& object_stream_data,
bool allow_changes = true);
// Replace all references to indirect objects that are "scalars"
// (i.e., things that don't have children: not arrays, streams, or
// dictionaries) with direct objects.
QPDF_DLL
void flattenScalarReferences();
// Decode all streams, discarding the output. Used to check
// correctness of stream encoding.
QPDF_DLL
void decodeStreams();
// For QPDFWriter:
// Remove /ID, /Encrypt, and /Prev keys from the trailer
// dictionary since these are regenerated during write.
QPDF_DLL
void trimTrailerForWrite();
// Get lists of all objects in order according to the part of a
// linearized file that they belong to.
QPDF_DLL


@ -299,6 +299,7 @@ class QPDFWriter
void setDataKey(int objid);
int openObject(int objid = 0);
void closeObject(int objid);
void prepareFileForWrite();
void writeStandard();
void writeLinearized();
void enqueuePart(std::vector<QPDFObjectHandle>& part);


@ -206,7 +206,6 @@ debian
declspec
DecodeParms
decodeRow
decodeStreams
decrypt
decrypted
decrypter
@ -335,7 +334,6 @@ fl
flate
FlateDecode
flattenPagesTree
flattenScalarReferences
fn
fname
fo


@ -1860,28 +1860,6 @@ QPDF::swapObjects(int objid1, int generation1, int objid2, int generation2)
this->obj_cache[og2] = t;
}
void
QPDF::trimTrailerForWrite()
{
// Note that removing the encryption dictionary does not interfere
// with reading encrypted files. QPDF loads all the information
// it needs from the encryption dictionary at the beginning and
// never looks at it again.
this->trailer.removeKey("/ID");
this->trailer.removeKey("/Encrypt");
this->trailer.removeKey("/Prev");
// Remove all trailer keys that potentially come from a
// cross-reference stream
this->trailer.removeKey("/Index");
this->trailer.removeKey("/W");
this->trailer.removeKey("/Length");
this->trailer.removeKey("/Filter");
this->trailer.removeKey("/DecodeParms");
this->trailer.removeKey("/Type");
this->trailer.removeKey("/XRefStm");
}
std::string
QPDF::getFilename() const
{
@ -2067,20 +2045,3 @@ QPDF::pipeStreamData(int objid, int generation,
}
pipeline->finish();
}
void
QPDF::decodeStreams()
{
for (std::map<ObjGen, QPDFXRefEntry>::iterator iter =
this->xref_table.begin();
iter != this->xref_table.end(); ++iter)
{
ObjGen const& og = (*iter).first;
QPDFObjectHandle obj = getObjectByID(og.obj, og.gen);
if (obj.isStream())
{
Pl_Discard pl;
obj.pipeStreamData(&pl, true, false, false);
}
}
}
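
The ChangeLog notes that decodeStreams could report false positives because it walked every entry in the cross-reference table, including objects nothing refers to. The sketch below contrasts that strategy with a walk that starts from a root and follows references. It is a toy model, not qpdf code: `ObjectGraph`, `visitAllEntries`, and `visitReachable` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <vector>

// Hypothetical toy model: each object ID maps to the IDs it references.
using ObjectGraph = std::map<int, std::vector<int>>;

// Visit every object in the table, referenced or not. This mirrors
// the strategy of iterating over the whole cross-reference table.
std::set<int> visitAllEntries(ObjectGraph const& objects)
{
    std::set<int> visited;
    for (auto const& entry : objects)
    {
        visited.insert(entry.first);
    }
    return visited;
}

// Visit only the objects reachable from root, breadth first.
std::set<int> visitReachable(ObjectGraph const& objects, int root)
{
    std::set<int> visited;
    std::queue<int> pending;
    pending.push(root);
    while (! pending.empty())
    {
        int id = pending.front();
        pending.pop();
        auto it = objects.find(id);
        if (it == objects.end())
        {
            continue; // dangling reference: nothing to visit
        }
        if (! visited.insert(id).second)
        {
            continue; // already visited
        }
        for (int child : it->second)
        {
            pending.push(child);
        }
    }
    return visited;
}
```

With objects 1, 2, and 3 linked from object 1 and an orphaned object 9 in the table, `visitAllEntries` touches all four while `visitReachable` starting at 1 never sees object 9, so a damaged orphan cannot trigger an error.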


@ -834,16 +834,6 @@ QPDFWriter::enqueueObject(QPDFObjectHandle object)
{
// This is a place-holder object for an object stream
}
else if (object.isScalar())
{
// flattenScalarReferences is supposed to have removed all
// indirect scalars.
throw std::logic_error(
"INTERNAL ERROR: QPDFWriter::enqueueObject: indirect scalar: " +
std::string(this->filename) + " " +
QUtil::int_to_string(object.getObjectID()) + " " +
QUtil::int_to_string(object.getGeneration()));
}
int objid = object.getObjectID();
if (obj_renumber.count(objid) == 0)
@ -916,15 +906,6 @@ QPDFWriter::unparseChild(QPDFObjectHandle child, int level, int flags)
}
if (child.isIndirect())
{
if (child.isScalar())
{
// flattenScalarReferences is supposed to have removed all
// indirect scalars.
throw std::logic_error(
"INTERNAL ERROR: QPDFWriter::unparseChild: indirect scalar: " +
QUtil::int_to_string(child.getObjectID()) + " " +
QUtil::int_to_string(child.getGeneration()));
}
int old_id = child.getObjectID();
int new_id = obj_renumber[old_id];
writeString(QUtil::int_to_string(new_id));
@ -1647,6 +1628,117 @@ QPDFWriter::generateObjectStreams()
}
}
void
QPDFWriter::prepareFileForWrite()
{
// Remove keys from the trailer that necessarily have to be
// replaced when writing the file.
QPDFObjectHandle trailer = pdf.getTrailer();
// Note that removing the encryption dictionary does not interfere
// with reading encrypted files. QPDF loads all the information
// it needs from the encryption dictionary at the beginning and
// never looks at it again.
trailer.removeKey("/ID");
trailer.removeKey("/Encrypt");
trailer.removeKey("/Prev");
// Remove all trailer keys that potentially come from a
// cross-reference stream
trailer.removeKey("/Index");
trailer.removeKey("/W");
trailer.removeKey("/Length");
trailer.removeKey("/Filter");
trailer.removeKey("/DecodeParms");
trailer.removeKey("/Type");
trailer.removeKey("/XRefStm");
// Do a traversal of the entire PDF file structure replacing all
// indirect objects that QPDFWriter wants to be direct. This
// includes stream lengths, stream filtering parameters, and
// document extension level information. Also replace all
// indirect null references with direct nulls. This way, the only
// indirect nulls queued for output will be object stream place
// holders.
std::list<QPDFObjectHandle> queue;
queue.push_back(pdf.getTrailer());
std::set<int> visited;
while (! queue.empty())
{
QPDFObjectHandle node = queue.front();
queue.pop_front();
if (node.isIndirect())
{
if (visited.count(node.getObjectID()) > 0)
{
continue;
}
visited.insert(node.getObjectID());
}
if (node.isArray())
{
int nitems = node.getArrayNItems();
for (int i = 0; i < nitems; ++i)
{
QPDFObjectHandle oh = node.getArrayItem(i);
if (oh.isIndirect() && oh.isNull())
{
QTC::TC("qpdf", "QPDFWriter flatten array null");
oh.makeDirect();
node.setArrayItem(i, oh);
}
else if (! oh.isScalar())
{
queue.push_back(oh);
}
}
}
else if (node.isDictionary() || node.isStream())
{
bool is_stream = false;
QPDFObjectHandle dict = node;
if (node.isStream())
{
is_stream = true;
dict = node.getDict();
}
std::set<std::string> keys = dict.getKeys();
for (std::set<std::string>::iterator iter = keys.begin();
iter != keys.end(); ++iter)
{
std::string const& key = *iter;
QPDFObjectHandle oh = dict.getKey(key);
bool add_to_queue = true;
if (oh.isIndirect())
{
if (is_stream)
{
if ((key == "/Length") ||
(key == "/Filter") ||
(key == "/DecodeParms"))
{
QTC::TC("qpdf", "QPDF make stream key direct");
add_to_queue = false;
oh.makeDirect();
dict.replaceKey(key, oh);
}
}
}
if (add_to_queue)
{
queue.push_back(oh);
}
}
}
}
}
void
QPDFWriter::write()
{
@ -1785,8 +1877,7 @@ QPDFWriter::write()
generateID();
-pdf.trimTrailerForWrite();
-pdf.flattenScalarReferences();
+prepareFileForWrite();
if (this->linearized)
{
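
The traversal in prepareFileForWrite can be sketched on a toy object model. This is a simplified illustration under stated assumptions, not the qpdf implementation: `Node`, `makeNode`, `makeDirect`, `isStreamKeyToFlatten`, and `prepareForWrite` are hypothetical names, an object counts as indirect when its `objid` is nonzero, and `makeDirect` makes only a shallow copy. As in the diff, only indirect nulls inside arrays and the stream keys /Length, /Filter, and /DecodeParms are made direct. Other indirect scalars are left alone.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy object model; none of these names come from qpdf.
struct Node;
using NodePtr = std::shared_ptr<Node>;

struct Node
{
    enum Kind { Null, Integer, Array, Dictionary, Stream };
    Kind kind = Null;
    int objid = 0;                        // 0 = direct, nonzero = indirect
    int value = 0;                        // payload for Integer
    std::vector<NodePtr> items;           // children of an Array
    std::map<std::string, NodePtr> keys;  // entries of a Dictionary or Stream
};

NodePtr makeNode(Node::Kind kind, int objid = 0, int value = 0)
{
    auto n = std::make_shared<Node>();
    n->kind = kind;
    n->objid = objid;
    n->value = value;
    return n;
}

// Shallow copy with the object ID cleared, i.e. a direct object.
NodePtr makeDirect(NodePtr const& n)
{
    auto copy = std::make_shared<Node>(*n);
    copy->objid = 0;
    return copy;
}

bool isScalar(Node const& n)
{
    return (n.kind == Node::Null) || (n.kind == Node::Integer);
}

bool isStreamKeyToFlatten(std::string const& key)
{
    return (key == "/Length") || (key == "/Filter") ||
        (key == "/DecodeParms");
}

// Breadth-first walk from the trailer, visiting each indirect object once.
void prepareForWrite(NodePtr const& trailer)
{
    std::list<NodePtr> queue;
    queue.push_back(trailer);
    std::set<int> visited;
    while (! queue.empty())
    {
        NodePtr node = queue.front();
        queue.pop_front();
        if ((node->objid != 0) && (! visited.insert(node->objid).second))
        {
            continue; // indirect object already processed
        }
        if (node->kind == Node::Array)
        {
            for (auto& item : node->items)
            {
                if ((item->objid != 0) && (item->kind == Node::Null))
                {
                    item = makeDirect(item); // indirect null becomes direct
                }
                else if (! isScalar(*item))
                {
                    queue.push_back(item);
                }
            }
        }
        else if ((node->kind == Node::Dictionary) ||
                 (node->kind == Node::Stream))
        {
            bool is_stream = (node->kind == Node::Stream);
            for (auto& kv : node->keys)
            {
                if (is_stream && (kv.second->objid != 0) &&
                    isStreamKeyToFlatten(kv.first))
                {
                    kv.second = makeDirect(kv.second); // not traversed further
                }
                else
                {
                    queue.push_back(kv.second);
                }
            }
        }
    }
}
```

Because the walk starts at the trailer and only follows references, objects that nothing points to are never loaded, which is the behavior the ChangeLog entry describes.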


@ -58,103 +58,6 @@ QPDF::ObjUser::operator<(ObjUser const& rhs) const
return false;
}
void
QPDF::flattenScalarReferences()
{
// Do a traversal of the entire PDF file structure replacing all
// indirect objects that are not arrays, streams, or dictionaries
// with direct objects.
std::list<QPDFObjectHandle> queue;
queue.push_back(this->trailer);
std::set<ObjGen> visited;
// Add every object in the xref table to the queue. This ensures
// that we flatten scalar references in unreferenced objects.
// This becomes important if we are preserving object streams in a
// file that has unreferenced objects in its object streams. (See
// QPDF bug 2974522 at SourceForge.)
for (std::map<ObjGen, QPDFXRefEntry>::iterator iter =
this->xref_table.begin();
iter != this->xref_table.end(); ++iter)
{
ObjGen const& og = (*iter).first;
queue.push_back(getObjectByID(og.obj, og.gen));
}
while (! queue.empty())
{
QPDFObjectHandle node = queue.front();
queue.pop_front();
if (node.isIndirect())
{
ObjGen og(node.getObjectID(), node.getGeneration());
if (visited.count(og) > 0)
{
continue;
}
visited.insert(og);
}
if (node.isArray())
{
int nitems = node.getArrayNItems();
for (int i = 0; i < nitems; ++i)
{
QPDFObjectHandle oh = node.getArrayItem(i);
if (oh.isScalar())
{
if (oh.isIndirect())
{
QTC::TC("qpdf", "QPDF opt flatten array scalar");
oh.makeDirect();
node.setArrayItem(i, oh);
}
}
else
{
queue.push_back(oh);
}
}
}
else if (node.isDictionary() || node.isStream())
{
QPDFObjectHandle dict = node;
if (node.isStream())
{
dict = node.getDict();
}
std::set<std::string> keys = dict.getKeys();
for (std::set<std::string>::iterator iter = keys.begin();
iter != keys.end(); ++iter)
{
std::string const& key = *iter;
QPDFObjectHandle oh = dict.getKey(key);
if (oh.isNull())
{
// QPDF_Dictionary.getKeys() never returns null
// keys.
throw std::logic_error(
"INTERNAL ERROR: dictionary with null key found");
}
else if (oh.isScalar())
{
if (oh.isIndirect())
{
QTC::TC("qpdf", "QPDF opt flatten dict scalar");
oh.makeDirect();
dict.replaceKey(key, oh);
}
}
else
{
queue.push_back(oh);
}
}
}
}
}
void
QPDF::optimize(std::map<int, int> const& object_stream_data,
bool allow_changes)
@ -304,9 +207,7 @@ QPDF::pushInheritedAttributesToPageInternal(
}
else
{
-// Don't defeat flattenScalarReferences which
-// would have already been called by this
-// time.
+// It's okay to copy scalars.
QTC::TC("qpdf", "QPDF opt inherited scalar");
}
}


@ -8,6 +8,7 @@
#include <qpdf/QUtil.hh>
#include <qpdf/QTC.hh>
#include <qpdf/Pl_StdioFile.hh>
#include <qpdf/Pl_Discard.hh>
#include <qpdf/PointerHolder.hh>
#include <qpdf/QPDF.hh>
@ -1381,12 +1382,14 @@ int main(int argc, char* argv[])
else
{
std::cout << "File is not linearized\n";
-// calling flattenScalarReferences causes full
-// traversal of file, so any structural errors
-// would be exposed.
-pdf.flattenScalarReferences();
-// Also explicitly decode all streams.
-pdf.decodeStreams();
+// Write the file to nowhere, uncompressing
+// streams. This causes full file traversal
+// and decoding of all streams we can decode.
+QPDFWriter w(pdf);
+Pl_Discard discard;
+w.setOutputPipeline(&discard);
+w.setStreamDataMode(qpdf_s_uncompress);
+w.write();
okay = true;
}
}
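
The new check works by running the normal write path against an output that throws everything away, so that only the side effects of traversal and decoding matter. A minimal sketch of that pattern follows. It does not use the qpdf API: `Sink`, `DiscardSink`, `ToyStream`, `decode`, and `writeAll` are hypothetical stand-ins, and the "decoder" simply throws for streams flagged as corrupt.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an output pipeline stage.
class Sink
{
  public:
    virtual ~Sink() = default;
    virtual void write(std::string const& data) = 0;
};

// Accepts any data and drops it.
class DiscardSink: public Sink
{
  public:
    void write(std::string const&) override {}
};

// Hypothetical toy stream: payload plus a flag simulating bad encoding.
struct ToyStream
{
    std::string encoded;
    bool corrupt;
};

// Hypothetical decoder: throws when the stream cannot be decoded.
std::string decode(ToyStream const& s)
{
    if (s.corrupt)
    {
        throw std::runtime_error("cannot decode stream");
    }
    return s.encoded;
}

// Decode every stream and send the result to the sink. Returns the
// number of streams that failed to decode.
std::size_t writeAll(std::vector<ToyStream> const& streams, Sink& sink)
{
    std::size_t failures = 0;
    for (auto const& s : streams)
    {
        try
        {
            sink.write(decode(s));
        }
        catch (std::runtime_error const&)
        {
            ++failures;
        }
    }
    return failures;
}
```

Passing a `DiscardSink` to `writeAll` exercises every decode step without producing a file. Reusing the writer this way means the check sees exactly the objects a real write would see, and no others.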


@ -29,8 +29,7 @@ QPDF lin outlines in part 1
QPDF lin nshared_total > nshared_first_page 1
QPDF lin part 8 empty 1
QPDF lin check shared past first page 0
QPDF opt flatten array scalar 0
QPDF opt flatten dict scalar 0
QPDFWriter flatten array null 0
main QTest implicit 0
main QTest indirect 1
main QTest null 0
@ -244,3 +243,4 @@ QPDFWriter extra header text no newline 0
QPDFWriter extra header text add newline 0
QPDF bogus 0 offset 0
QPDF global offset 0
QPDF make stream key direct 0


@ -18,7 +18,7 @@ endobj
<01020300040560>
(AB)
]
/indirect (hello)
/indirect 4 0 R
/nesting <<
/a [
1
@ -58,17 +58,22 @@ endobj
<<
/Count 1
/Kids [
4 0 R
5 0 R
]
/Type /Pages
>>
endobj
%% Original object ID: 8 0
4 0 obj
(hello)
endobj
%% Page 1
%% Original object ID: 3 0
4 0 obj
5 0 obj
<<
/Contents 5 0 R
/Contents 6 0 R
/MediaBox [
0
0
@ -78,9 +83,9 @@ endobj
/Parent 3 0 R
/Resources <<
/Font <<
/F1 7 0 R
/F1 8 0 R
>>
/ProcSet 8 0 R
/ProcSet 9 0 R
>>
/Type /Page
>>
@ -88,9 +93,9 @@ endobj
%% Contents for page 1
%% Original object ID: 4 0
5 0 obj
6 0 obj
<<
/Length 6 0 R
/Length 7 0 R
>>
stream
BT
@ -101,12 +106,12 @@ ET
endstream
endobj
6 0 obj
7 0 obj
44
endobj
%% Original object ID: 6 0
7 0 obj
8 0 obj
<<
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
@ -117,7 +122,7 @@ endobj
endobj
%% Original object ID: 5 0
8 0 obj
9 0 obj
[
/PDF
/Text
@ -125,22 +130,23 @@ endobj
endobj
xref
0 9
0 10
0000000000 65535 f
0000000052 00000 n
0000000133 00000 n
0000000578 00000 n
0000000687 00000 n
0000000929 00000 n
0000001028 00000 n
0000001074 00000 n
0000001219 00000 n
0000000576 00000 n
0000000675 00000 n
0000000736 00000 n
0000000978 00000 n
0000001077 00000 n
0000001123 00000 n
0000001268 00000 n
trailer <<
/QTest 2 0 R
/Root 1 0 R
/Size 9
/Size 10
/ID [<31415926535897932384626433832795><31415926535897932384626433832795>]
>>
startxref
1254
1303
%%EOF


@ -5,17 +5,22 @@
%% Original object ID: 1 0
1 0 obj
<<
/Pages 2 0 R
/Pages 3 0 R
/Type /Catalog
>>
endobj
%% Original object ID: 2 0
%% Original object ID: 7 0
2 0 obj
true
endobj
%% Original object ID: 2 0
3 0 obj
<<
/Count 1
/Kids [
3 0 R
4 0 R
]
/Type /Pages
>>
@ -23,21 +28,21 @@ endobj
%% Page 1
%% Original object ID: 3 0
3 0 obj
4 0 obj
<<
/Contents 4 0 R
/Contents 5 0 R
/MediaBox [
0
0
612
792
]
/Parent 2 0 R
/Parent 3 0 R
/Resources <<
/Font <<
/F1 6 0 R
/F1 7 0 R
>>
/ProcSet 7 0 R
/ProcSet 8 0 R
>>
/Type /Page
>>
@ -45,9 +50,9 @@ endobj
%% Contents for page 1
%% Original object ID: 4 0
4 0 obj
5 0 obj
<<
/Length 5 0 R
/Length 6 0 R
>>
stream
BT
@ -58,12 +63,12 @@ ET
endstream
endobj
5 0 obj
6 0 obj
44
endobj
%% Original object ID: 6 0
6 0 obj
7 0 obj
<<
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
@ -74,7 +79,7 @@ endobj
endobj
%% Original object ID: 5 0
7 0 obj
8 0 obj
[
/PDF
/Text
@ -82,21 +87,22 @@ endobj
endobj
xref
0 8
0 9
0000000000 65535 f
0000000052 00000 n
0000000133 00000 n
0000000242 00000 n
0000000484 00000 n
0000000583 00000 n
0000000629 00000 n
0000000774 00000 n
0000000181 00000 n
0000000290 00000 n
0000000532 00000 n
0000000631 00000 n
0000000677 00000 n
0000000822 00000 n
trailer <<
/QTest true
/QTest 2 0 R
/Root 1 0 R
/Size 8
/Size 9
/ID [<31415926535897932384626433832795><31415926535897932384626433832795>]
>>
startxref
809
857
%%EOF


@ -5,17 +5,22 @@
%% Original object ID: 1 0
1 0 obj
<<
/Pages 2 0 R
/Pages 3 0 R
/Type /Catalog
>>
endobj
%% Original object ID: 2 0
%% Original object ID: 7 0
2 0 obj
3.14159
endobj
%% Original object ID: 2 0
3 0 obj
<<
/Count 1
/Kids [
3 0 R
4 0 R
]
/Type /Pages
>>
@ -23,21 +28,21 @@ endobj
%% Page 1
%% Original object ID: 3 0
3 0 obj
4 0 obj
<<
/Contents 4 0 R
/Contents 5 0 R
/MediaBox [
0
0
612
792
]
/Parent 2 0 R
/Parent 3 0 R
/Resources <<
/Font <<
/F1 6 0 R
/F1 7 0 R
>>
/ProcSet 7 0 R
/ProcSet 8 0 R
>>
/Type /Page
>>
@ -45,9 +50,9 @@ endobj
%% Contents for page 1
%% Original object ID: 4 0
4 0 obj
5 0 obj
<<
/Length 5 0 R
/Length 6 0 R
>>
stream
BT
@ -58,12 +63,12 @@ ET
endstream
endobj
5 0 obj
6 0 obj
44
endobj
%% Original object ID: 6 0
6 0 obj
7 0 obj
<<
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
@ -74,7 +79,7 @@ endobj
endobj
%% Original object ID: 5 0
7 0 obj
8 0 obj
[
/PDF
/Text
@ -82,21 +87,22 @@ endobj
endobj
xref
0 8
0 9
0000000000 65535 f
0000000052 00000 n
0000000133 00000 n
0000000242 00000 n
0000000484 00000 n
0000000583 00000 n
0000000629 00000 n
0000000774 00000 n
0000000184 00000 n
0000000293 00000 n
0000000535 00000 n
0000000634 00000 n
0000000680 00000 n
0000000825 00000 n
trailer <<
/QTest 3.14159
/QTest 2 0 R
/Root 1 0 R
/Size 8
/Size 9
/ID [<31415926535897932384626433832795><31415926535897932384626433832795>]
>>
startxref
809
860
%%EOF


@ -39,8 +39,8 @@ endobj
<<
/A 5 0 R
/B 6 0 R
/Subject (Subject)
/Title (Some Title Is Here)
/Subject 7 0 R
/Title 8 0 R
>>
endobj
@ -49,7 +49,7 @@ endobj
<<
/Count 1
/Kids [
7 0 R
9 0 R
]
/Type /Pages
>>
@ -72,11 +72,21 @@ endobj
>>
endobj
%% Original object ID: 10 0
7 0 obj
(Subject)
endobj
%% Original object ID: 9 0
8 0 obj
(Some Title Is Here)
endobj
%% Page 1
%% Original object ID: 3 0
7 0 obj
9 0 obj
<<
/Contents 8 0 R
/Contents 10 0 R
/MediaBox [
0
0
@ -86,9 +96,9 @@ endobj
/Parent 4 0 R
/Resources <<
/Font <<
/F1 10 0 R
/F1 12 0 R
>>
/ProcSet 11 0 R
/ProcSet 13 0 R
>>
/Type /Page
>>
@ -96,9 +106,9 @@ endobj
%% Contents for page 1
%% Original object ID: 4 0
8 0 obj
10 0 obj
<<
/Length 9 0 R
/Length 11 0 R
>>
stream
BT
@ -109,12 +119,12 @@ ET
endstream
endobj
9 0 obj
11 0 obj
44
endobj
%% Original object ID: 6 0
10 0 obj
12 0 obj
<<
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
@ -125,7 +135,7 @@ endobj
endobj
%% Original object ID: 7 0
11 0 obj
13 0 obj
[
/PDF
/Text
@ -133,26 +143,28 @@ endobj
endobj
xref
0 12
0 14
0000000000 65535 f
0000000052 00000 n
0000000134 00000 n
0000000353 00000 n
0000000475 00000 n
0000000575 00000 n
0000000635 00000 n
0000000714 00000 n
0000000958 00000 n
0000001057 00000 n
0000001103 00000 n
0000001249 00000 n
0000000456 00000 n
0000000556 00000 n
0000000616 00000 n
0000000686 00000 n
0000000739 00000 n
0000000813 00000 n
0000001058 00000 n
0000001159 00000 n
0000001206 00000 n
0000001352 00000 n
trailer <<
/Info 2 0 R
/QTest 3 0 R
/Root 1 0 R
/Size 12
/Size 14
/ID [<c61bd35bada064f61e0a56aa9588064e><31415926535897932384626433832795>]
>>
startxref
1285
1388
%%EOF