* native UTF-8 strings
* names whose PDF and canonical syntax differ in both dictionary key
positions and other positions
For json, names are converted both as names and directly when used as
dictionary keys.
* Replace --create-from-json=file with --json-input, which causes the
regular input to be treated as json.
* Eliminate --to-json
* In --json=2, bring back "objects" and eliminate "objectinfo". Stream
data is never present.
* In --json-output=2, write "qpdf-v2" with "objects" and include
stream data.
There is one unexpected pass in this commit. This script was applied
to the files changed in this commit:
----------
#!/usr/bin/env python3
import json
import sys
def json_dumps(data):
return json.dumps(data, ensure_ascii=False,
indent=2, separators=(',', ': '))
for filename in sys.argv[1:]:
with open(filename, 'r') as f:
data = json.loads(f.read())
data['version'] = 2
objectinfo = {}
if 'objectinfo' in data:
objectinfo = data['objectinfo']
del data['objectinfo']
if 'objects' not in data:
continue
qpdf = {'jsonversion': 2, 'pdfversion': '1.3', 'objects': {}}
for k, v in data['objects'].items():
is_stream = objectinfo.get(k, {}).get('stream', {}).get('is', False)
if k.endswith(' R'):
k = 'obj:' + k
if is_stream:
v = {'stream': {'dict': v}}
else:
v = {'value': v}
qpdf['objects'][k] = v
data['qpdf'] = qpdf
del data['objects']
print(json_dumps(data))
----------
moddify -> modify. Also carefully spell checked all remaining keys by
splitting them into words and running a spell checker, not just
relying on visual proofreading. That was the only one.
This script was used on test data:
----------
#!/usr/bin/env python3
import json
import sys
import re
def json_dumps(data):
return json.dumps(data, ensure_ascii=False,
indent=2, separators=(',', ': '))
for filename in sys.argv[1:]:
with open(filename, 'r') as f:
data = json.loads(f.read())
if 'objectinfo' not in data:
continue
trailer = None
to_sort = []
for k, v in data['objectinfo'].items():
if k == 'trailer':
trailer = v
else:
m = re.match(r'^(\d+) \d+ R', k)
if m:
to_sort.append([int(m.group(1)), k, v])
newobjectinfo = {x[1]: x[2] for x in sorted(to_sort)}
if trailer is not None:
newobjectinfo['trailer'] = trailer
data['objectinfo'] = newobjectinfo
print(json_dumps(data))
----------
The following script was used to adjust test data:
----------
#!/usr/bin/env python3
import json
import sys
import re
def json_dumps(data):
return json.dumps(data, ensure_ascii=False,
indent=2, separators=(',', ': '))
for filename in sys.argv[1:]:
with open(filename, 'r') as f:
data = json.loads(f.read())
if 'objects' not in data:
continue
trailer = None
to_sort = []
for k, v in data['objects'].items():
if k == 'trailer':
trailer = v
else:
m = re.match(r'^(\d+) \d+ R', k)
if m:
to_sort.append([int(m.group(1)), k, v])
newobjects = {x[1]: x[2] for x in sorted(to_sort)}
if trailer is not None:
newobjects['trailer'] = trailer
data['objects'] = newobjects
print(json_dumps(data))
----------
This commit just changes the order in which fields are written to the
json without changing their content. All the json files in the test
suite were modified with this script to ensure that we didn't get any
changes other than ordering.
----------
#!/usr/bin/env python3
import json
import sys
def json_dumps(data):
return json.dumps(data, ensure_ascii=False,
indent=2, separators=(',', ': '))
for filename in sys.argv[1:]:
with open(filename, 'r') as f:
data = json.loads(f.read())
newdata = {}
for i in ('version', 'parameters', 'pages', 'pagelabels',
'acroform', 'attachments', 'encrypt', 'outlines',
'objects', 'objectinfo'):
if i in data:
newdata[i] = data[i]
print(json_dumps(newdata))
----------
Where not possible, use "auto" to get the iterator type.
Editorial note: I have avoid this change for a long time because of
not wanting to make gratuitous changes to version history, which can
obscure when certain changes were made, but with having recently
touched every single file to apply automatic code formatting and with
making several broad changes to the API, I decided it was time to take
the plunge and get rid of the older (pre-C++11) verbose iterator
syntax. The new code is just easier to read and understand, and in
many cases, it will be more effecient as fewer temporary copies are
being made.
m-holger, if you're reading, you can see that I've finally come
around. :-)
Prior to the cmake conversion, several private classes had methods
that were exported into the shared library so they could be tested
with libtests. With cmake, we build libtests using an object library,
so this is no longer necessary. The methods that are disappearing from
the ABI were never exposed through public headers, so no code should
be using them. Removal had to wait until the window for ABI-breaking
changes was open.
The executables that libtool built invoked the underlying binary with
an "lt-" prefix. The code contained numerous workarounds for testing,
which can now be removed.
This comment expands all tabs using an 8-character tab-width. You
should ignore this commit when using git blame or use git blame -w.
In the early days, I used to use tabs where possible for indentation,
since emacs did this automatically. In recent years, I have switched
to only using spaces, which means qpdf source code has been a mixture
of spaces and tabs. I have avoided cleaning this up because of not
wanting gratuitous whitespaces change to cloud the output of git
blame, but I changed my mind after discussing with users who view qpdf
source code in editors/IDEs that have other tab widths by default and
in light of the fact that I am planning to start applying automatic
code formatting soon.
Use get() and use_count() instead. Add #define
NO_POINTERHOLDER_DEPRECATION to remove deprecation markers for these
only.
This commit also removes all deprecated PointerHolder API calls from
qpdf's code except in PointerHolder's test suite, which must continue
to test the deprecated APIs.
All the coverage cases that used to be in qpdf.cc are now in
QPDFJob*.cc. It doesn't really matter, but better to follow the
convention of starting with the class that includes the coverage call.
Changing from bool requiring true to string requiring the empty string
is more consistent with the CLI and makes it possible to add an
optional parameter or choices later without breaking compatibility.
Flatten everything to make it easier to map command-line flags to
json. The old structure was an illusion anyway because there was no
mechanism to enforce that things were in the right place. This also
helps with future flexibility.
The previous commits have removed all references to memory from
QPDFArgParser from QPDFJob. This commit removes the constraint that
QPDFArgParser remain in scope. This is a prerequisite to allowing JSON
as an alternative way to initialize QPDFJob and to initialize it
directly using a public API.
Move ArgParser from qpdf.cc into QPDFJob.cc. It still works with
millions of public member variables, but now qpdf.cc is minimal and
just calls stable library functions.
Remove all calls to exit() from QPDFJob. Handle code that runs in
verbose mode to enable it to make use of output streams and message
prefix (whoami) from QPDFJob. This removes temporarily duplicated exit
code logic and most access to whoami/std::cout outside of QPDFJob
proper.
Move most of the methods called from qpdf.cc after argument parsing
into QPDFJob. In this increment, enough QPDFJob API has been added to
handle the branch of QPDFJob::run() that creates output with an
appropriate division between qpdf.cc and QPDFJob.
There are temporary bits of code to enable everything to compile and
pass the test suite, including some duplication and hard-coded values.
They have to be ot_* rather than qpdf_ot_* for compatibility.
* Different enumerated types are not assignment-compatible in C++, at
least with strict compiler settings
* While you can do `constexpr ot_xyz = ::qpdf_ot_xyz` in QPDFObject.hh to
make QPDFObject::ot_xyz work, QPDFObject::object_type_e::ot_xyz will
only work if the enumerated type names are the same.
* Handle error conditions that occur when using the object handle
interfaces. In the past, some exceptions were not correctly
converted to errors or warnings.
* Add more detailed information to qpdf-c.h
* Make it possible to work more explicitly with uninitialized objects
Don't assume endobj is at the beginning of the line. This means we are
looking at tokens for every line, but the odds of n n obj appearing in
the middle of the object are likely much lower than endobj not being
at the beginning of the line or missing entirely. This will probably
have a negative impact on recovery time for very large files.
Hopefully it will be worth it.
When making resources indirect in from_dr, the code was using the
wrong owning QPDF, forgetting that from_dr had already been copied
using CopyForeignObject.
When adding a QPDFObjectHandle to an array or dictionary, if possible,
check if the new object belongs to the same QPDF. This makes it much
easier to find incorrect code than waiting for the situation to be
detected when the file is written.
This was originally not public because I wanted to get rid fo the
pages cache, but I recently realized there were deep reasons not to do
that, and the author of pikepdf wanted this, so I decided to make it
public.
Converted ResourceFinder to ParserCallbacks so we can better detect
the name that precedes various operators and use the operators to sort
the names into resource types. This enables us to be smarter about
detecting unreferenced resources in pages and also sets the stage for
reconciling differences in /DR across documents.
If not found in the field hierarchy, /Q and /DA are supposed to be
looked up in the document-level form dictionary. /DR is supposed to
only come from the document dictionary.
This results in a performance penalty of 1% to 2% when replaceObject
and swapObjects are never called and a somewhat larger penalty if they
are called, but it's worth it to avoid very confusing behavior as
discussed in depth in qpdf#507.