mirror of
https://github.com/qpdf/qpdf.git
synced 2024-12-22 02:49:00 +00:00
Update documentation for qpdf JSON v2
parent
b7bbf12e85
commit
0bd908b550
201
TODO
@ -2,14 +2,13 @@
|
||||
Next
|
||||
====
|
||||
|
||||
Before Release:
|
||||
|
||||
* At next release, hide release-qpdf-10.6.3.0cmake* versions at readthedocs
|
||||
* Stay on top of https://github.com/pikepdf/pikepdf/pull/315
|
||||
* Release qtest with updates to qtest-driver and copy back into qpdf
|
||||
|
||||
In order:
|
||||
* json v2
|
||||
|
||||
Other (do in any order):
|
||||
Pending changes:
|
||||
|
||||
* Good C API for json v2
|
||||
* QPDFPagesTree -- avoid ever flattening the pages tree.
|
||||
@ -50,180 +49,10 @@ Other (do in any order):
|
||||
* Rework tests so that nothing is written into the source directory.
|
||||
Ideally then the entire build could be done with a read-only
|
||||
source tree.
|
||||
* Consider adding fuzzer code for JSON
|
||||
|
||||
Soon: Break ground on "Document-level work"
|
||||
|
||||
Output JSON v2
|
||||
==============
|
||||
|
||||
Remaining work:
|
||||
|
||||
* Make sure all the information from informational options is
|
||||
available in the json output.
|
||||
|
||||
* --check: add but maybe not by default?
|
||||
|
||||
* --show-linearization: add but maybe not by default? Also figure
|
||||
out whether warnings reported for some of the PDF specs (1.7) are
|
||||
qpdf problems. This may not be worth adding in the first
|
||||
increment.
|
||||
|
||||
* --show-xref: add
|
||||
|
||||
* Consider having --check, --show-encryption, etc., just select the
|
||||
right keys when in json mode. I don't think I want check on by
|
||||
default, so that might be different.
|
||||
|
||||
* Consider having warnings be included in the json in a "warnings" key
|
||||
in json mode.
|
||||
|
||||
Notes for documentation:
|
||||
|
||||
* Find all mentions of json in the manual and update.
|
||||
|
||||
* Document typo fix in encrypt in release notes along with any other
|
||||
non-compatible json 2 changes. Scrutinize all the output to decide
|
||||
what should change.
|
||||
|
||||
* Keys other than "qpdf-v2" are ignored so people can stash their own
|
||||
stuff. Unknown keys are ignored at other places for future
|
||||
compatibility. Readers of qpdf json should continue to ignore keys
|
||||
they don't recognize.
|
||||
|
||||
* Change: names are written in canonical form with a leading slash
|
||||
just as they are treated in the code. In v1, they were written in
|
||||
PDF syntax in the json file. Example: /text#2fplain in pdf will be
|
||||
written as /text/plain in json v2 and as /text#2fplain in json v1.
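To illustrate the difference between the two name forms, here is a minimal Python sketch that converts a name from PDF syntax to the canonical form described above. The helper name `pdf_name_to_canonical` is invented for this example and is not part of qpdf; the sketch also assumes each `#xx` escape decodes to a single character.

```python
import re


def pdf_name_to_canonical(name: str) -> str:
    """Decode #xx hex escapes in a PDF-syntax name (hypothetical helper)."""
    if not name.startswith("/"):
        raise ValueError("a PDF name must start with a slash")
    # Each "#" followed by two hex digits stands for one byte value.
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        name,
    )


# The example from the note above: "#2f" is the escaped form of "/".
canonical = pdf_name_to_canonical("/text#2fplain")
print(canonical)
```

A reader of v1 JSON would need a step like this before comparing names; a reader of v2 JSON would not.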
|
||||
|
||||
* Document changes to strings, objects, streams, object keys.
|
||||
|
||||
* CLI: --json-input, --json-output[=version], --update-from-json. With
|
||||
--json-input, the input file is a JSON file instead of a PDF file.
|
||||
It must be complete, meaning that a PDF version must be given, all
|
||||
streams must have exactly one of data or datafile, and a trailer
|
||||
dictionary must be present, even if empty.
|
||||
|
||||
With --update-from-json, the JSON file updates objects in place. If
|
||||
updating an old stream, if stream data is omitted, the data remains
|
||||
untouched. The dictionary is always required. Remember that
|
||||
QPDFWriter does not preserve object numbers, though --json-output
|
||||
does. Therefore, if you want to update a PDF with a JSON, the input
|
||||
to --update-from-json must be the same PDF as the one that
|
||||
--json-output was run on previously. Otherwise, object numbers won't
|
||||
match. Show this with an example. When updating,
|
||||
|
||||
* Certain fields are ignored when reading the JSON. This includes
|
||||
maxobjectid, any computed fields in trailer (such as /Size), and all
|
||||
/Length keys in stream dictionaries. There is no need for the user
|
||||
to correct, remove, or otherwise worry about any values those keys
|
||||
might have. The maxobjectid field is present in the original output
|
||||
to assist with adding new objects to the file.
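As a rough sketch of how `maxobjectid` could help when adding an object, the following Python picks the next unused object number. It assumes the layout in these notes (a dictionary with `"maxobjectid"` and `"objects"` keys, and non-stream objects wrapped in a `"value"` key); the helper name `add_object` is hypothetical.

```python
def add_object(qpdf_v2: dict, value) -> str:
    """Add a non-stream object under the next free object number."""
    # Assumption: maxobjectid is the highest object number currently in use.
    next_id = qpdf_v2["maxobjectid"] + 1
    key = f"obj:{next_id} 0 R"
    qpdf_v2["objects"][key] = {"value": value}
    # This update is bookkeeping for our own later calls only; per the
    # notes, the field is ignored when the JSON is read back.
    qpdf_v2["maxobjectid"] = next_id
    return key


doc = {
    "pdfversion": "1.7",
    "maxobjectid": 3,
    "objects": {"obj:3 0 R": {"value": "u:existing"}},
}
new_key = add_object(doc, {"/Type": "/Example"})
print(new_key)
```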
|
||||
|
||||
* JSON strings within PDF objects:
|
||||
|
||||
* "n n R" is an indirect object
|
||||
|
||||
* "/Name" is a name in canonical form with a leading slash (like
|
||||
"/text/plain"), not PDF syntax (like "/text#2fplain").
|
||||
|
||||
* "b:hex-digits" is a binary string ("b:feff03c0"). Hex digits may be
|
||||
mixed case. There must be an even number of digits.
|
||||
|
||||
* "u:utf-8" is a UTF-8 encoded string ("u:π", "u:\u03c0"). UTF-16
|
||||
surrogate pairs are allowed. These are all equivalent: "u:🥔",
|
||||
"u:\ud83e\udd54", "b:FEFFD83EDD54", "b:efbbbff09fa594".
|
||||
|
||||
* Both "b:" and "u:" are valid representations of the empty string.
|
||||
|
||||
* Anything else is an error
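The string rules above can be sketched in Python as follows. This is only an illustration of the rules as written in these notes, not qpdf's parser; the names `classify_json_string` and `decode_text` are invented for the example, and `decode_text` handles only strings that start with a Unicode byte order mark.

```python
import json
import re

_REF_RE = re.compile(r"([0-9]+) ([0-9]+) R")
_HEX_RE = re.compile(r"(?:[0-9A-Fa-f]{2})*")


def classify_json_string(s: str):
    """Return a (kind, payload) pair for a JSON string inside a PDF object."""
    m = _REF_RE.fullmatch(s)
    if m:
        return ("reference", (int(m.group(1)), int(m.group(2))))
    if s.startswith("/"):
        return ("name", s)
    if s.startswith("b:"):
        digits = s[2:]
        # Mixed case is fine, but the number of digits must be even.
        if not _HEX_RE.fullmatch(digits):
            raise ValueError("binary string needs an even number of hex digits")
        return ("binary", bytes.fromhex(digits))
    if s.startswith("u:"):
        return ("unicode", s[2:])
    raise ValueError(f"not a valid string value: {s!r}")


def decode_text(data: bytes) -> str:
    """Decode a binary string that starts with a UTF-16BE or UTF-8 BOM."""
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    if data.startswith(b"\xef\xbb\xbf"):
        return data[3:].decode("utf-8")
    raise ValueError("no byte order mark; leave the data as binary")


# The JSON parser resolves \u escapes, including surrogate pairs, before
# the "u:" prefix is examined.
escaped = json.loads('"u:\\ud83e\\udd54"')
potato = classify_json_string(escaped)[1]
utf16 = decode_text(classify_json_string("b:FEFFD83EDD54")[1])
utf8 = decode_text(classify_json_string("b:efbbbff09fa594")[1])
```

With these definitions, `potato`, `utf16`, and `utf8` all hold the same single character, matching the equivalence stated in the list.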
|
||||
|
||||
* Document use of --json-input and --json-output together to show
|
||||
preservation of object numbers. Draw attention to "original object
|
||||
ID" comments in qdf as another way to show it.
|
||||
|
||||
* Document top-level keys of "qpdf-v2" ("pdfversion", "objects",
|
||||
"maxobjectid") noting that "maxobjectid" is ignored when reading.
|
||||
|
||||
* Stream data: "data" is base64-encoded stream data. "datafile" is the
|
||||
path to a file (relative path recommended but not required)
|
||||
containing the binary data. As with any PDF representation, the data
|
||||
must be consistent with the filters. --decode-level is honored by
|
||||
--json-output.
|
||||
|
||||
* Other changes from v1:
|
||||
|
||||
* in "objects", keys are "obj:o g R" or "trailer"
|
||||
|
||||
* Non-stream objects are dictionaries with a "value" key whose value
|
||||
is the object. Stream objects are dictionaries with a "stream" key
|
||||
whose value is {"dict": stream-dictionary}. The "/Length" key is
|
||||
omitted from the stream dictionary.
|
||||
|
||||
* "objectinfo" is gone as it is now possible to tell a stream from a
|
||||
non-stream directly. To get stream data, use the --json-output
|
||||
option. Note about how "pages" may cause the pages tree to be
|
||||
corrected.
|
||||
|
||||
For non-streams:
|
||||
|
||||
"obj:o g R": {
|
||||
"value": ...
|
||||
}
|
||||
|
||||
For streams:
|
||||
|
||||
"obj:o g R": {
|
||||
"stream": {
|
||||
"dict": { ... stream dictionary ... },
|
||||
"data": "base64-encoded data",
|
||||
"datafile": "path to base64-encoded data"
|
||||
}
|
||||
}
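A stream entry of the shape shown above could be built and read back roughly like this in Python. The helper names are hypothetical. The sketch follows the notes in treating `"data"` as base64 text and `"datafile"` as the path to a file holding the binary data, and it enforces the "exactly one of data or datafile" rule that applies to complete input.

```python
import base64


def make_stream_entry(stream_dict: dict, raw: bytes) -> dict:
    """Build a stream object with inline base64 data (hypothetical helper)."""
    # /Length is omitted from the dictionary, as the notes describe.
    clean = {k: v for k, v in stream_dict.items() if k != "/Length"}
    return {
        "stream": {
            "dict": clean,
            "data": base64.b64encode(raw).decode("ascii"),
        }
    }


def read_stream_data(entry: dict) -> bytes:
    """Return the raw bytes of a stream given inline or in a data file."""
    stream = entry["stream"]
    if ("data" in stream) == ("datafile" in stream):
        raise ValueError('expected exactly one of "data" or "datafile"')
    if "data" in stream:
        return base64.b64decode(stream["data"])
    # "datafile" names a file that holds the binary data itself.
    with open(stream["datafile"], "rb") as f:
        return f.read()


entry = make_stream_entry({"/Length": 5}, b"hello")
round_trip = read_stream_data(entry)
```

As the notes point out, the bytes must still be consistent with any filters named in the stream dictionary; this sketch does no filtering of its own.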
|
||||
|
||||
Rationale of "obj:o g R" is that indirect object references are just
|
||||
"o g R", and so code that wants to resolve one can do so easily by
|
||||
just prepending "obj:" and not having to parse or split the string.
|
||||
Having a prefix rather than making the key just "o g R" makes it much
|
||||
easier to search in the JSON for the definition of an object.
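The lookup described in this rationale is a single string concatenation. A minimal Python sketch, using a small hand-written `objects` dictionary (the sample objects and the helper names `resolve` and `is_stream` are made up for illustration):

```python
def resolve(objects: dict, ref: str) -> dict:
    """Look up the object that an "o g R" reference string points to."""
    # No parsing or splitting: the key is the reference with a prefix.
    return objects["obj:" + ref]


def is_stream(entry: dict) -> bool:
    """Streams carry a "stream" key; other objects carry "value"."""
    return "stream" in entry


objects = {
    "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
    "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
}

catalog = resolve(objects, "1 0 R")
pages = resolve(objects, catalog["value"]["/Pages"])
```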
|
||||
|
||||
CLI:
|
||||
|
||||
Example workflow:
|
||||
* qpdf in.pdf --json-output pdf.json
|
||||
* edit pdf.json
|
||||
* qpdf --json-input pdf.json out.pdf
|
||||
|
||||
* qpdf in.pdf --json-output pdf.json
|
||||
* edit pdf.json keeping only objects that need to be changed
|
||||
* qpdf in.pdf --update-from-json=pdf.json out.pdf
|
||||
|
||||
To modify a single object:
|
||||
|
||||
* qpdf in.pdf --json-output pdf.json --json-object=o,g
|
||||
* edit pdf.json
|
||||
* qpdf in.pdf --update-from-json=pdf.json out.pdf
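The "edit pdf.json keeping only objects that need to be changed" step could be scripted along these lines. This is a sketch under the assumption that the file has a top-level `"qpdf-v2"` key containing `"objects"`, as these notes describe; the helper name `keep_only` and the sample data are invented, and the layout of a released qpdf may differ.

```python
import json


def keep_only(document: dict, wanted: set) -> dict:
    """Return a copy of the document with only the wanted object keys."""
    qpdf = document["qpdf-v2"]
    trimmed = dict(qpdf)
    trimmed["objects"] = {
        key: value for key, value in qpdf["objects"].items() if key in wanted
    }
    result = dict(document)
    result["qpdf-v2"] = trimmed
    return result


document = {
    "qpdf-v2": {
        "pdfversion": "1.7",
        "maxobjectid": 2,
        "objects": {
            "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
            "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
        },
    }
}

update = keep_only(document, {"obj:2 0 R"})
update_text = json.dumps(update, indent=2)
```

In the workflow above, `update_text` would be written to the file later passed to `--update-from-json`; objects left out of the file stay as they are in the input PDF.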
|
||||
|
||||
Historical note: you can't create a PDF from v1 json because
|
||||
|
||||
* The PDF version header is not recorded
|
||||
|
||||
* Strings cannot be unambiguously encoded/decoded
|
||||
|
||||
* Can't tell string from name from indirect object
|
||||
|
||||
* Strings are treated as PDF doc encoding and output as UTF-8, which
|
||||
doesn't work since multiple PDF doc code points are undefined and
|
||||
is absurd for binary strings
|
||||
|
||||
* There is no representation of stream data
|
||||
|
||||
* You can't tell a stream from a dictionary except by looking in both
|
||||
"object" and "objectinfo".
|
||||
|
||||
* Using "n n R" as a key in "objects" and "objectinfo" makes it hard
|
||||
to search for things when viewing the JSON file in an editor.
|
||||
|
||||
|
||||
QPDFPagesTree
|
||||
=============
|
||||
|
||||
@ -256,6 +85,28 @@ sure /Count and /Parent are correct.
|
||||
refs/attic/QPDFPagesTree-old -- original, abandoned branch -- clean up
|
||||
when done.
|
||||
|
||||
Possible future JSON enhancements
|
||||
=================================
|
||||
|
||||
* Add to JSON output the information available from a few additional
|
||||
informational options:
|
||||
|
||||
* --check: add but maybe not by default?
|
||||
|
||||
* --show-linearization: add but maybe not by default? Also figure
|
||||
out whether warnings reported for some of the PDF specs (1.7) are
|
||||
qpdf problems. This may not be worth adding in the first
|
||||
increment.
|
||||
|
||||
* --show-xref: add
|
||||
|
||||
* Consider having --check, --show-encryption, etc., just select the
|
||||
right keys when in json mode. I don't think I want check on by
|
||||
default, so that might be different.
|
||||
|
||||
* Consider having warnings be included in the json in a "warnings" key
|
||||
in json mode.
|
||||
|
||||
QPDFJob
|
||||
=======
|
||||
|
||||
|
@ -271,6 +271,7 @@
|
||||
"mkinstalldirs",
|
||||
"mklink",
|
||||
"moddate",
|
||||
"modifyannotations",
|
||||
"monoseq",
|
||||
"msvc",
|
||||
"msvcrt",
|
||||
|
@ -112,8 +112,11 @@ class QPDF
|
||||
|
||||
// Create a PDF from an input source that contains JSON as written
|
||||
// by writeJSON (or qpdf --json-output, version 2 or higher). The
|
||||
// JSON must be a complete representation of a PDF. See "QPDF JSON
|
||||
// Format" in the manual for details.
|
||||
// JSON must be a complete representation of a PDF. See "qpdf
|
||||
// JSON" in the manual for details. The input JSON may be
|
||||
// arbitrarily large. QPDF does not load stream data into memory
|
||||
// for more than one stream at a time, even if the stream data is
|
||||
// specified inline.
|
||||
QPDF_DLL
|
||||
void createFromJSON(std::string const& json_file);
|
||||
QPDF_DLL
|
||||
@ -122,24 +125,40 @@ class QPDF
|
||||
// Update a PDF from an input source that contains JSON in the
|
||||
// same format as is written by writeJSON (or qpdf --json-output,
|
||||
// version 2 or higher). Objects in the PDF and not in the JSON
|
||||
// are not modified. See "QPDF JSON Format" in the manual for
|
||||
// details.
|
||||
// are not modified. See "qpdf JSON" in the manual for details. As
|
||||
// with createFromJSON, the input JSON may be arbitrarily large.
|
||||
QPDF_DLL
|
||||
void updateFromJSON(std::string const& json_file);
|
||||
QPDF_DLL
|
||||
void updateFromJSON(std::shared_ptr<InputSource>);
|
||||
|
||||
// Write qpdf json format. The only supported version is 2. If
|
||||
// wanted_objects is empty, write all objects. Otherwise, write
|
||||
// only objects whose keys are in wanted_objects. Keys may be
|
||||
// either "trailer" or of the form "obj:n n R". Invalid keys are
|
||||
// ignored.
|
||||
// Write qpdf json format to the pipeline "p". The only supported
|
||||
// version is 2. The finish() method is called on the pipeline at
|
||||
// the end. The decode_level parameter controls which streams are
|
||||
// uncompressed in the JSON. Use qpdf_dl_none to preserve all
|
||||
// stream data exactly as it appears in the input. The possible
|
||||
// values for json_stream_data can be found in qpdf/Constants.h
|
||||
// and correspond to the --json-stream-data command-line argument.
|
||||
// If json_stream_data is qpdf_sj_file, file_prefix must be
|
||||
// specified. Each stream will be written to a file whose path is
|
||||
// constructed by appending "-nnn" to file_prefix, where "nnn" is
|
||||
// the object number (not zero-filled). If wanted_objects is
|
||||
// empty, write all objects. Otherwise, write only objects whose
|
||||
// keys are in wanted_objects. Keys may be either "trailer" or of
|
||||
// the form "obj:n n R". Invalid keys are ignored. This
|
||||
// corresponds to the --json-object command-line argument.
|
||||
//
|
||||
// QPDF is efficient with regard to memory when writing, allowing
|
||||
// you to write arbitrarily large PDF files to a pipeline. You can
|
||||
// use a pipeline like Pl_Buffer or Pl_String to capture the JSON
|
||||
// output in memory, but do so with caution as this will allocate
|
||||
// enough memory to hold the entire PDF file.
|
||||
QPDF_DLL
|
||||
void writeJSON(
|
||||
int version,
|
||||
Pipeline*,
|
||||
qpdf_stream_decode_level_e,
|
||||
qpdf_json_stream_data_e,
|
||||
Pipeline* p,
|
||||
qpdf_stream_decode_level_e decode_level,
|
||||
qpdf_json_stream_data_e json_stream_data,
|
||||
std::string const& file_prefix,
|
||||
std::set<std::string> wanted_objects);
|
||||
|
||||
|
4
job.sums
@ -8,10 +8,10 @@ include/qpdf/auto_job_c_pages.hh b3cc0f21029f6d89efa043dcdbfa183cb59325b6506001c
|
||||
include/qpdf/auto_job_c_uo.hh ae21b69a1efa9333050f4833d465f6daff87e5b38e5106e49bbef5d4132e4ed1
|
||||
job.yml 3b2b3c6f92b48f6c76109711cbfdd74669fa31a80cd17379548b09f8e76be05d
|
||||
libqpdf/qpdf/auto_job_decl.hh 74df4d7fdbdf51ecd0d58ce1e9844bb5525b9adac5a45f7c9a787ecdda2868df
|
||||
libqpdf/qpdf/auto_job_help.hh c1cc99f6fe17285ee5e40730f6280e37d17da1a5f408086ce34e01af121df7ad
|
||||
libqpdf/qpdf/auto_job_help.hh 3aaae4cde004e5314d3ac6d554da575e40209c0f0611f6a308957986f9c7967b
|
||||
libqpdf/qpdf/auto_job_init.hh 7ea8e0641dc26fdfba6e283e14dbbff0c016654e174cdace8054f8bef53750fd
|
||||
libqpdf/qpdf/auto_job_json_decl.hh 06caa46eaf71db8a50c046f91866baa8087745a9474319fb7c86d92634cc8297
|
||||
libqpdf/qpdf/auto_job_json_init.hh 5f6b53e3c81d4b54ce5c4cf9c3f52d0c02f987c53bf8841c0280367bad23e335
|
||||
libqpdf/qpdf/auto_job_schema.hh 9d543cd4a43eafffc2c4b8a6fee29e399c271c52cb6f7d417ae5497b3c1127dc
|
||||
manual/_ext/qpdf.py 6add6321666031d55ed4aedf7c00e5662bba856dfcd66ccb526563bffefbb580
|
||||
manual/cli.rst 82ead389c03bbf5e0498bd0571a11dc06544d591f4e4454c00322e3473fc556d
|
||||
manual/cli.rst e3f4331befa17450e0d0fff87569722a5aab42ea619ef64f0a3a04e1f99ed65c
|
||||
|
@ -817,4 +817,5 @@ QPDF::writeJSON(
|
||||
JSON::writeDictionaryClose(p, first_qpdf, 1);
|
||||
JSON::writeDictionaryClose(p, first, 0);
|
||||
*p << "\n";
|
||||
p->finish();
|
||||
}
|
||||
|
@ -70,6 +70,9 @@ ap.addOptionHelp("--copyright", "help", "show copyright information", R"(Display
|
||||
ap.addOptionHelp("--show-crypto", "help", "show available crypto providers", R"(Show a list of available crypto providers, one per line. The
|
||||
default provider is shown first.
|
||||
)");
|
||||
ap.addOptionHelp("--job-json-help", "help", "show format of job JSON", R"(Describe the format of the QPDFJob JSON input used by
|
||||
--job-json-file.
|
||||
)");
|
||||
ap.addHelpTopic("general", "general options", R"(General options control qpdf's behavior in ways that are not
|
||||
directly related to the operation it is performing.
|
||||
)");
|
||||
@ -87,11 +90,11 @@ ap.addOptionHelp("--verbose", "general", "print additional information", R"(Outp
|
||||
doing, including information about files created and operations
|
||||
performed.
|
||||
)");
|
||||
ap.addOptionHelp("--progress", "general", "show progress when writing", R"(Indicate progress when writing files.
|
||||
)");
|
||||
}
|
||||
static void add_help_2(QPDFArgParser& ap)
|
||||
{
|
||||
ap.addOptionHelp("--progress", "general", "show progress when writing", R"(Indicate progress when writing files.
|
||||
)");
|
||||
ap.addOptionHelp("--no-warn", "general", "suppress printing of warning messages", R"(Suppress printing of warning messages. If warnings were
|
||||
encountered, qpdf still exits with exit status 3.
|
||||
Use --warning-exit-0 with --no-warn to completely ignore
|
||||
@ -172,12 +175,12 @@ companion tool "fix-qdf" can be used to repair hand-edited QDF
|
||||
files. QDF is a feature specific to the qpdf tool. Please see
|
||||
the "QDF Mode" chapter in the manual.
|
||||
)");
|
||||
ap.addOptionHelp("--no-original-object-ids", "transformation", "omit original object IDs in qdf", R"(Omit comments in a QDF file indicating the object ID an object
|
||||
had in the original file.
|
||||
)");
|
||||
}
|
||||
static void add_help_3(QPDFArgParser& ap)
|
||||
{
|
||||
ap.addOptionHelp("--no-original-object-ids", "transformation", "omit original object IDs in qdf", R"(Omit comments in a QDF file indicating the object ID an object
|
||||
had in the original file.
|
||||
)");
|
||||
ap.addOptionHelp("--compress-streams", "transformation", "compress uncompressed streams", R"(--compress-streams=[y|n]
|
||||
|
||||
Setting --compress-streams=n prevents qpdf from compressing
|
||||
@ -188,9 +191,11 @@ ap.addOptionHelp("--decode-level", "transformation", "control which streams to u
|
||||
|
||||
When uncompressing streams, control which types of compression
|
||||
schemes should be uncompressed:
|
||||
- none: don't uncompress anything. This is the default with --json-output.
|
||||
- none: don't uncompress anything. This is the default with
|
||||
--json-output.
|
||||
- generalized: uncompress streams compressed with a
|
||||
general-purpose compression algorithm. This is the default.
|
||||
general-purpose compression algorithm. This is the default
|
||||
except when --json-output is given.
|
||||
- specialized: in addition to generalized, also uncompress
|
||||
streams compressed with a special-purpose but non-lossy
|
||||
compression scheme
|
||||
@ -290,13 +295,13 @@ from the resulting set, not based on the original page numbers.
|
||||
ap.addHelpTopic("modification", "change parts of the PDF", R"(Modification options make systematic changes to certain parts of
|
||||
the PDF, causing the PDF to render differently from the original.
|
||||
)");
|
||||
}
|
||||
static void add_help_4(QPDFArgParser& ap)
|
||||
{
|
||||
ap.addOptionHelp("--pages", "modification", "begin page selection", R"(--pages file [--password=password] [page-range] [...] --
|
||||
|
||||
Run qpdf --help=page-selection for details.
|
||||
)");
|
||||
}
|
||||
static void add_help_4(QPDFArgParser& ap)
|
||||
{
|
||||
ap.addOptionHelp("--collate", "modification", "collate with --pages", R"(--collate[=n]
|
||||
|
||||
Collate rather than concatenate pages specified with --pages.
|
||||
@ -460,14 +465,14 @@ ap.addOptionHelp("--assemble", "encryption", "restrict document assembly", R"(--
|
||||
Enable/disable document assembly (rotation and reordering of
|
||||
pages). This option is not available with 40-bit encryption.
|
||||
)");
|
||||
}
|
||||
static void add_help_5(QPDFArgParser& ap)
|
||||
{
|
||||
ap.addOptionHelp("--extract", "encryption", "restrict text/graphic extraction", R"(--extract=[y|n]
|
||||
|
||||
Enable/disable text/graphic extraction for purposes other than
|
||||
accessibility.
|
||||
)");
|
||||
}
|
||||
static void add_help_5(QPDFArgParser& ap)
|
||||
{
|
||||
ap.addOptionHelp("--form", "encryption", "restrict form filling", R"(--form=[y|n]
|
||||
|
||||
Enable/disable whether filling form fields is allowed even if
|
||||
@ -638,6 +643,9 @@ ap.addOptionHelp("--remove-attachment", "attachments", "remove an embedded file"
|
||||
Remove an embedded file using its key. Get the key with
|
||||
--list-attachments.
|
||||
)");
|
||||
}
|
||||
static void add_help_6(QPDFArgParser& ap)
|
||||
{
|
||||
ap.addHelpTopic("pdf-dates", "PDF date format", R"(When a date is required, the date should conform to the PDF date
|
||||
format specification, which is "D:yyyymmddhhmmssz" where "z" is
|
||||
either literally upper case "Z" for UTC or a timezone offset in
|
||||
@ -650,9 +658,6 @@ Examples:
|
||||
- D:20210207161528-05'00' February 7, 2021 at 4:15:28 p.m.
|
||||
- D:20210207211528Z February 7, 2021 at 21:15:28 UTC
|
||||
)");
|
||||
}
|
||||
static void add_help_6(QPDFArgParser& ap)
|
||||
{
|
||||
ap.addHelpTopic("add-attachment", "attach (embed) files", R"(The options listed below appear between --add-attachment and its
|
||||
terminating "--".
|
||||
)");
|
||||
@ -747,14 +752,14 @@ the linearization hint tables are correct.
|
||||
)");
|
||||
ap.addOptionHelp("--show-linearization", "inspection", "show linearization hint tables", R"(Check and display all data in the linearization hint tables.
|
||||
)");
|
||||
}
|
||||
static void add_help_7(QPDFArgParser& ap)
|
||||
{
|
||||
ap.addOptionHelp("--show-xref", "inspection", "show cross reference data", R"(Show the contents of the cross-reference table or stream (object
|
||||
locations in the file) in a human-readable form. This is
|
||||
especially useful for files with cross-reference streams, which
|
||||
are stored in a binary format.
|
||||
)");
|
||||
}
|
||||
static void add_help_7(QPDFArgParser& ap)
|
||||
{
|
||||
ap.addOptionHelp("--show-object", "inspection", "show contents of an object", R"(--show-object={trailer|obj[,gen]}
|
||||
|
||||
Show the contents of the given object. This is especially useful
|
||||
@ -814,21 +819,20 @@ This option is repeatable. If given, only specified objects will
|
||||
be shown in the "objects" key of the JSON output. Otherwise, all
|
||||
objects will be shown.
|
||||
)");
|
||||
ap.addOptionHelp("--job-json-help", "json", "show format of job JSON", R"(Describe the format of the QPDFJob JSON input used by
|
||||
--job-json-file.
|
||||
)");
|
||||
ap.addOptionHelp("--json-stream-data", "json", "how to handle streams in json output", R"(--json-stream-data={none|inline|file}
|
||||
|
||||
Control whether streams in json output should be omitted,
|
||||
written inline (base64-encoded) or written to a file. If "file"
|
||||
is chosen, the file will be the name of the input file appended
|
||||
with -nnn where nnn is the object number. The prefix can be
|
||||
overridden with --json-stream-prefix.
|
||||
When used with --json-output, this option controls whether
|
||||
streams in json output should be omitted, written inline
|
||||
(base64-encoded) or written to a file. If "file" is chosen, the
|
||||
file will be the name of the output file appended with -nnn where
|
||||
nnn is the object number. The prefix can be overridden with
|
||||
--json-stream-prefix.
|
||||
)");
|
||||
ap.addOptionHelp("--json-stream-prefix", "json", "prefix for json stream data files", R"(--json-stream-prefix=file-prefix
|
||||
|
||||
When --json-stream-data=file is given, override the input file
|
||||
name as the prefix for stream data files. Whatever is given here
|
||||
When used with --json-output, --json-stream-prefix=file-prefix
|
||||
sets the prefix for stream data files, overriding the default,
|
||||
which is to use the output file name. Whatever is given here
|
||||
will be appended with -nnn to create the name of the file that
|
||||
will contain the data for the stream in object nnn.
|
||||
)");
|
||||
@ -836,19 +840,19 @@ ap.addOptionHelp("--json-output", "json", "serialize to JSON", R"(--json-output[
|
||||
|
||||
The output file will be qpdf JSON format at the given version.
|
||||
"version" may be a specific version or "latest" (the default).
|
||||
Version 1 is not supported. See also --json-stream-data,
|
||||
The only supported version is 2. See also --json-stream-data,
|
||||
--json-stream-prefix, and --decode-level.
|
||||
)");
|
||||
ap.addOptionHelp("--json-input", "json", "input file is qpdf JSON", R"(Treat the input file as a JSON file in qpdf JSON format as
|
||||
written by qpdf --json-output. See the "QPDF JSON Format"
|
||||
written by qpdf --json-output. See the "qpdf JSON Format"
|
||||
section of the manual for information about how to use this
|
||||
option.
|
||||
)");
|
||||
ap.addOptionHelp("--update-from-json", "json", "update a PDF from qpdf JSON", R"(--update-from-json=qpdf-json-file
|
||||
|
||||
Update a PDF file from a JSON file. Please see the "QPDF JSON
|
||||
Format" section of the manual for information about how to use
|
||||
this option.
|
||||
Update a PDF file from a JSON file. Please see the "qpdf JSON"
|
||||
chapter of the manual for information about how to use this
|
||||
option.
|
||||
)");
|
||||
}
|
||||
static void add_help_8(QPDFArgParser& ap)
|
||||
|
154
manual/cli.rst
@ -171,7 +171,9 @@ Related Options
|
||||
equivalent command-line arguments were supplied. It can be repeated
|
||||
and mixed freely with other options. Run ``qpdf`` with
|
||||
:qpdf:ref:`--job-json-help` for a description of the job JSON input
|
||||
file format. For more information, see :ref:`qpdf-job`.
|
||||
file format. For more information, see :ref:`qpdf-job`. Note that
|
||||
this is unrelated to :qpdf:ref:`--json` but may be combined with
|
||||
it.
|
||||
|
||||
.. _exit-status:
|
||||
|
||||
@ -341,6 +343,17 @@ Related Options
|
||||
itself. The default provider is always listed first. See
|
||||
:ref:`crypto` for more information about crypto providers.
|
||||
|
||||
.. qpdf:option:: --job-json-help
|
||||
|
||||
.. help: show format of job JSON
|
||||
|
||||
Describe the format of the QPDFJob JSON input used by
|
||||
--job-json-file.
|
||||
|
||||
Describe the format of the QPDFJob JSON input used by
|
||||
:qpdf:ref:`--job-json-file`. For more information about QPDFJob,
|
||||
see :ref:`qpdf-job`.
|
||||
|
||||
.. _general-options:
|
||||
|
||||
General Options
|
||||
@ -852,9 +865,11 @@ Related Options
|
||||
|
||||
When uncompressing streams, control which types of compression
|
||||
schemes should be uncompressed:
|
||||
- none: don't uncompress anything. This is the default with --json-output.
|
||||
- none: don't uncompress anything. This is the default with
|
||||
--json-output.
|
||||
- generalized: uncompress streams compressed with a
|
||||
general-purpose compression algorithm. This is the default.
|
||||
general-purpose compression algorithm. This is the default
|
||||
except when --json-output is given.
|
||||
- specialized: in addition to generalized, also uncompress
|
||||
streams compressed with a special-purpose but non-lossy
|
||||
compression scheme
|
||||
@ -875,7 +890,8 @@ Related Options
|
||||
``/ASCII85Decode``, and ``/ASCIIHexDecode``. We define
|
||||
generalized filters as those to be used for general-purpose
|
||||
compression or encoding, as opposed to filters specifically
|
||||
designed for image data. This is the default.
|
||||
designed for image data. This is the default except when
|
||||
:qpdf:ref:`--json-output` is given.
|
||||
|
||||
- :samp:`specialized`: in addition to generalized, decode streams
|
||||
with supported non-lossy specialized filters; currently this is
|
||||
@ -3126,8 +3142,9 @@ Related Options
|
||||
is usually but not always equal to the file name and is needed by
|
||||
some of the other options. See also :ref:`attachments`. Note that
|
||||
this option displays dates in PDF timestamp syntax. When attachment
|
||||
information is included in json output (see :ref:`--json`), dates
|
||||
are shown in ISO-8601 format.
|
||||
information is included in json output in the ``"attachments"`` key
|
||||
(see :ref:`--json`), dates are shown (just within that object) in
|
||||
ISO-8601 format.
|
||||
|
||||
.. qpdf:option:: --show-attachment=key
|
||||
|
||||
@ -3169,14 +3186,11 @@ Related Options
|
||||
|
||||
Generate a JSON representation of the file. This is described in
|
||||
depth in :ref:`json`. The version parameter can be used to specify
|
||||
which version of the qpdf JSON format should be output. The only
|
||||
supported value is ``1``, but it's possible that a new JSON output
|
||||
version will be added in a future version. You can also specify
|
||||
``latest`` to use the latest JSON version. For backward
|
||||
compatibility, the default value will remain ``1`` until qpdf
|
||||
version 11, after which point it will become ``latest``. In all
|
||||
cases, you can tell what version of the JSON output you have from
|
||||
the ``"version"`` key in the output. Use the
|
||||
which version of the qpdf JSON format should be output. The version
|
||||
number may be a number or ``latest``. The default is ``latest``. As of
|
||||
qpdf 11, the latest version is ``2``. If you have code that reads
|
||||
qpdf JSON output, you can tell what version of the JSON output you
|
||||
have from the ``"version"`` key in the output. Use the
|
||||
:qpdf:ref:`--json-help` option to get a description of the JSON
|
||||
object.
|
||||
|
||||
@ -3189,11 +3203,11 @@ Related Options
|
||||
containing descriptive text.
|
||||
|
||||
Describe the format of the JSON output by writing to standard
|
||||
output a JSON object with the same structure with the same keys as
|
||||
the JSON generated by qpdf. In the output written by
|
||||
``--json-help``, each key's value is a description of the key. The
|
||||
specific contract guaranteed by qpdf in its JSON representation is
|
||||
explained in more detail in the :ref:`json`.
|
||||
output a JSON object with the same structure as the JSON generated
|
||||
by qpdf. In the output written by ``--json-help``, each key's value
|
||||
is a description of the key. The specific contract guaranteed by
|
||||
qpdf in its JSON representation is explained in more detail in the
|
||||
:ref:`json`.
|
||||
|
||||
.. qpdf:option:: --json-key=key
|
||||
|
||||
@ -3216,53 +3230,50 @@ Related Options
|
||||
be shown in the "objects" key of the JSON output. Otherwise, all
|
||||
objects will be shown.
|
||||
|
||||
This option is repeatable. If given, only specified objects will
|
||||
be shown in the "``objects``" key of the JSON output. Otherwise, all
|
||||
objects will be shown.
|
||||
|
||||
.. qpdf:option:: --job-json-help
|
||||
|
||||
.. help: show format of job JSON
|
||||
|
||||
Describe the format of the QPDFJob JSON input used by
|
||||
--job-json-file.
|
||||
|
||||
Describe the format of the QPDFJob JSON input used by
|
||||
:qpdf:ref:`--job-json-file`. For more information about QPDFJob,
|
||||
see :ref:`qpdf-job`.
|
||||
This option is repeatable. If given, only specified objects will be
|
||||
shown in the ``"objects"`` key of the JSON output. Otherwise, all
|
||||
objects will be shown. For qpdf JSON version 1, this also affects
|
||||
the ``"objectinfo"`` key, which is not present in version 2. This
|
||||
option may be used with :qpdf:ref:`--json` and also with
|
||||
:qpdf:ref:`--json-output`.
|
||||
|
||||
.. qpdf:option:: --json-stream-data={none|inline|file}
|
||||
|
||||
.. help: how to handle streams in json output
|
||||
|
||||
Control whether streams in json output should be omitted,
|
||||
written inline (base64-encoded) or written to a file. If "file"
|
||||
is chosen, the file will be the name of the input file appended
|
||||
with -nnn where nnn is the object number. The prefix can be
|
||||
overridden with --json-stream-prefix.
|
||||
When used with --json-output, this option controls whether
|
||||
streams in json output should be omitted, written inline
|
||||
(base64-encoded) or written to a file. If "file" is chosen, the
|
||||
file will be the name of the output file appended with -nnn where
|
||||
nnn is the object number. The prefix can be overridden with
|
||||
--json-stream-prefix.
|
||||
|
||||
Control whether streams in json output should be omitted, written
|
||||
inline (base64-encoded) or written to a file. If ``file`` is
|
||||
chosen, the file will be the name of the input file appended with
|
||||
:samp:`-{nnn}` where :samp:`{nnn}` is the object number. The prefix
|
||||
can be overridden with :qpdf:ref:`--json-stream-prefix`. This
|
||||
option only applies when used with :qpdf:ref:`--json-output`.
|
||||
When used with :qpdf:ref:`--json-output`, this option controls
|
||||
whether streams in JSON output should be omitted, written inline
|
||||
(base64-encoded) or written to a file. If ``file`` is chosen, the
|
||||
file will be the name of the output file appended with
|
||||
:samp:`-{nnn}` where :samp:`{nnn}` is the object number. The stream
|
||||
data file prefix can be overridden with
|
||||
:qpdf:ref:`--json-stream-prefix`. This option only applies when
|
||||
used with :qpdf:ref:`--json-output`.
|
||||
|
||||
.. qpdf:option:: --json-stream-prefix=file-prefix
|
||||
|
||||
.. help: prefix for json stream data files
|
||||
|
||||
When --json-stream-data=file is given, override the input file
|
||||
name as the prefix for stream data files. Whatever is given here
|
||||
When used with --json-output, --json-stream-prefix=file-prefix
|
||||
sets the prefix for stream data files, overriding the default,
|
||||
which is to use the output file name. Whatever is given here
|
||||
will be appended with -nnn to create the name of the file that
|
||||
will contain the data for the stream in object nnn.
|
||||
|
||||
When :qpdf:ref:`--json-stream-data` is given with the value
|
||||
``file``, override the input file name as the prefix for stream
|
||||
data files. Whatever is given here will be appended with
|
||||
:samp:`-{nnn}` to create the name of the file that will contain the
|
||||
data for the stream in object :samp:`{nnn}`. This
|
||||
option only applies when used with :qpdf:ref:`--json-output`.
|
||||
When used with :qpdf:ref:`--json-output`,
|
||||
``--json-stream-prefix=file-prefix`` sets the prefix for stream data
|
||||
files, overriding the default, which is to use the output file
|
||||
name. Whatever is given here will be appended with :samp:`-{nnn}`
|
||||
to create the name of the file that will contain the data for the
|
||||
stream in object :samp:`{nnn}`. This option only applies
|
||||
when used with :qpdf:ref:`--json-output`.
|
||||
|
||||
.. qpdf:option:: --json-output[=version]
|
||||
|
||||
@ -3270,44 +3281,45 @@ Related Options
|
||||
|
||||
The output file will be qpdf JSON format at the given version.
|
||||
"version" may be a specific version or "latest" (the default).
|
||||
Version 1 is not supported. See also --json-stream-data,
|
||||
The only supported version is 2. See also --json-stream-data,
|
||||
--json-stream-prefix, and --decode-level.
|
||||
|
||||
The output file will be qpdf JSON format at the given version.
|
||||
``version`` may be a specific version or ``latest`` (the default).
|
||||
Version 1 is not supported. See also :qpdf:ref:`--json-stream-data`
|
||||
and :qpdf:ref:`--json-stream-prefix`. The default decode level is
|
||||
``none``, but you can override it with :qpdf:ref:`--decode-level`.
|
||||
If you want to look at the contents of streams easily as you would
|
||||
in QDF mode (see :ref:`qdf`), you can use
|
||||
``--decode-level=generalized`` and ``--json-stream-data=file`` for
|
||||
a convenient way to do that.
|
||||
The output file, instead of being a PDF file, will be a JSON file
|
||||
in qpdf JSON format at the given version. ``version`` may be a
|
||||
specific version or ``latest`` (the default). The only supported
|
||||
version is 2. See also :qpdf:ref:`--json-stream-data` and
|
||||
:qpdf:ref:`--json-stream-prefix`. When this option is specified,
|
||||
the default decode level for stream data is ``none``, but you can
|
||||
override it with :qpdf:ref:`--decode-level`. If you want to look at
|
||||
the contents of streams easily as you would in QDF mode (see
|
||||
:ref:`qdf`), you can use ``--decode-level=generalized`` and
|
||||
``--json-stream-data=file`` for a convenient way to do that.
|
||||
|
||||
.. qpdf:option:: --json-input
|
||||
|
||||
.. help: input file is qpdf JSON
|
||||
|
||||
Treat the input file as a JSON file in qpdf JSON format as
|
||||
written by qpdf --json-output. See the "QPDF JSON Format"
|
||||
written by qpdf --json-output. See the "qpdf JSON Format"
|
||||
section of the manual for information about how to use this
|
||||
option.
|
||||
|
||||
Treat the input file as a JSON file in qpdf JSON format as written
|
||||
by ``qpdf --json-output``. The input file must be complete and
|
||||
include all stream data. For information about converting between
|
||||
PDF and JSON, please see :ref:`qpdf-json`.
|
||||
PDF and JSON, please see :ref:`json`.
|
||||
|
||||
.. qpdf:option:: --update-from-json=qpdf-json-file
|
||||
|
||||
.. help: update a PDF from qpdf JSON
|
||||
|
||||
Update a PDF file from a JSON file. Please see the "QPDF JSON
|
||||
Format" section of the manual for information about how to use
|
||||
this option.
|
||||
Update a PDF file from a JSON file. Please see the "qpdf JSON"
|
||||
chapter of the manual for information about how to use this
|
||||
option.
|
||||
|
||||
This option updates a PDF file from a qpdf JSON file. For
|
||||
information about how to use this option, please see
|
||||
:ref:`qpdf-json`.
|
||||
This option updates a PDF file from the specified qpdf JSON file.
|
||||
For information about how to use this option, please see
|
||||
:ref:`json`.
|
||||
|
||||
.. _test-options:
|
||||
|
||||
@ -3420,7 +3432,7 @@ Related Options
|
||||
|
||||
This is used by qpdf's test suite to check consistency between the
|
||||
output of ``qpdf --json`` and the output of ``qpdf --json-help``.
|
||||
This option causes an extra copy of the generated json to appear in
|
||||
This option causes an extra copy of the generated JSON to appear in
|
||||
memory and is therefore unsuitable for use with large files. This
|
||||
is why it's also not on by default.
|
||||
|
||||
|
@ -242,7 +242,7 @@ the current file position. If the token is not either a dictionary or
|
||||
array opener, an object is immediately constructed from the single token
|
||||
and the parser returns. Otherwise, the parser iterates in a special mode
|
||||
in which it accumulates objects until it finds a balancing closer.
|
||||
During this process, the "``R``" keyword is recognized and an indirect
|
||||
During this process, the ``R`` keyword is recognized and an indirect
|
||||
``QPDFObjectHandle`` may be constructed.
|
||||
|
||||
The ``QPDF::resolve()`` method, which is used to resolve an indirect
|
||||
@ -280,15 +280,15 @@ file.
|
||||
it is looking before the last ``%%EOF``. After getting to ``trailer``
|
||||
keyword, it invokes the parser.
|
||||
|
||||
- The parser sees "``<<``", so it calls itself recursively in
|
||||
- The parser sees ``<<``, so it calls itself recursively in
|
||||
dictionary creation mode.
|
||||
|
||||
- In dictionary creation mode, the parser keeps accumulating objects
|
||||
until it encounters "``>>``". Each object that is read is pushed onto
|
||||
a stack. If "``R``" is read, the last two objects on the stack are
|
||||
until it encounters ``>>``. Each object that is read is pushed onto
|
||||
a stack. If ``R`` is read, the last two objects on the stack are
|
||||
inspected. If they are integers, they are popped off the stack and
|
||||
their values are used to construct an indirect object handle which is
|
||||
then pushed onto the stack. When "``>>``" is finally read, the stack
|
||||
then pushed onto the stack. When ``>>`` is finally read, the stack
|
||||
is converted into a ``QPDF_Dictionary`` which is placed in a
|
||||
``QPDFObjectHandle`` and returned.
|
||||
|
||||
|
796
manual/json.rst
@ -1,6 +1,9 @@
|
||||
.. cSpell:ignore moddifyannotations
|
||||
.. cSpell:ignore feff
|
||||
|
||||
.. _json:
|
||||
|
||||
QPDF JSON
|
||||
qpdf JSON
|
||||
=========
|
||||
|
||||
.. _json-overview:
|
||||
@ -8,27 +11,540 @@ QPDF JSON
|
||||
Overview
|
||||
--------
|
||||
|
||||
Beginning with qpdf version 8.3.0, the :command:`qpdf`
|
||||
command-line program can produce a JSON representation of the
|
||||
non-content data in a PDF file. It includes a dump in JSON format of all
|
||||
objects in the PDF file excluding the content of streams. This JSON
|
||||
representation makes it very easy to look in detail at the structure of
|
||||
a given PDF file, and it also provides a great way to work with PDF
|
||||
files programmatically from the command-line in languages that can't
|
||||
call or link with the qpdf library directly. Note that stream data can
|
||||
be extracted from PDF files using other qpdf command-line options.
|
||||
Beginning with qpdf version 11.0.0, the qpdf library and command-line
|
||||
program can produce a JSON representation of a PDF file. qpdf
|
||||
version 11 introduces JSON format version 2. Prior to qpdf 11,
|
||||
versions 8.3.0 onward had a more limited JSON representation
|
||||
accessible only from the command-line. For details on what changed,
|
||||
see :ref:`json-v2-changes`. The rest of this chapter documents qpdf
|
||||
JSON version 2.
|
||||
|
||||
Please note: this chapter discusses *qpdf JSON format*, which
|
||||
represents the contents of a PDF file. This is distinct from the
|
||||
*QPDFJob JSON format* which provides a higher-level interface
|
||||
for interacting with qpdf the way the command-line tool does. For
|
||||
information about that, see :ref:`qpdf-job`.
|
||||
|
||||
The qpdf JSON format is specific to qpdf. There are two ways to use
|
||||
qpdf JSON:
|
||||
|
||||
- The :qpdf:ref:`--json` command-line flag causes creation of a JSON
|
||||
representation of all the objects in a PDF file, excluding stream
|
||||
data. This includes an unambiguous representation of the PDF object
|
||||
structure and also provides JSON-formatted summaries of other
|
||||
information about the file. This functionality is built into
|
||||
``QPDFJob`` and can be accessed from the ``qpdf`` command-line tool
|
||||
or from the ``QPDFJob`` C or C++ API.
|
||||
|
||||
- qpdf can create a JSON file that completely represents a PDF file.
|
||||
You can think of this as using JSON as an *alternative syntax* for
|
||||
representing a PDF file. Using qpdf JSON, it is possible to
|
||||
convert a PDF file to JSON, manipulate the structure or contents of
|
||||
the objects at a low level, and convert the results back to a PDF
|
||||
file. This functionality can be accessed from the command-line with
|
||||
the :qpdf:ref:`--json-output`, :qpdf:ref:`--json-input`, and
|
||||
:qpdf:ref:`--update-from-json` flags, or from the API using the
|
||||
``QPDF::writeJSON``, ``QPDF::createFromJSON``, and
|
||||
``QPDF::updateFromJSON`` methods.
|
||||
|
||||
.. _json-terminology:
|
||||
|
||||
JSON Terminology
|
||||
----------------
|
||||
|
||||
Notes about terminology:
|
||||
|
||||
- In JavaScript and JSON, that thing that has keys and values is
|
||||
typically called an *object*.
|
||||
|
||||
- In PDF, that thing that has keys and values is typically called a
|
||||
*dictionary*. An *object* is a PDF object such as integer, real,
|
||||
boolean, null, string, array, dictionary, or stream.
|
||||
|
||||
- Some languages that use JSON call an *object* a *dictionary*, a
|
||||
*map*, or a *hash*.
|
||||
|
||||
- Sometimes, it's called an *object* if it has fixed keys and a
|
||||
*dictionary* if it has variable keys.
|
||||
|
||||
This manual is not entirely consistent about its use of *dictionary*
|
||||
vs. *object* because sometimes one term or another is clearer in
|
||||
context. Just be aware of the ambiguity when reading the manual. We
|
||||
frequently use the term *dictionary* to refer to a JSON object because
|
||||
of the consistency with PDF terminology.
|
||||
|
||||
.. _what-qpdf-json-is-not:
|
||||
|
||||
What qpdf JSON is not
|
||||
---------------------
|
||||
|
||||
Please note that qpdf JSON offers a convenient syntax for manipulating
|
||||
PDF files at a low level using JSON syntax. JSON syntax is much easier
|
||||
to work with than native PDF syntax, and there are good JSON libraries
|
||||
in virtually every commonly used programming language. Working with
|
||||
PDF objects in JSON removes the need to worry about stream lengths,
|
||||
cross reference tables, and PDF-specific representations of Unicode or
|
||||
binary strings that appear outside of content streams. It does not
|
||||
eliminate the need to understand the semantic structure of PDF files.
|
||||
Working with qpdf JSON still requires familiarity with the PDF
|
||||
specification.
|
||||
|
||||
In particular, qpdf JSON *does not* provide any of the following
|
||||
capabilities:
|
||||
|
||||
- Text extraction. While you could use qpdf JSON syntax to navigate to
|
||||
a page's content streams and font structures, text within pages is
|
||||
still encoded using PDF syntax within content streams, and there is
|
||||
no assistance for text extraction.
|
||||
|
||||
- Reflowing text, document structure. qpdf JSON does not add any new
|
||||
information or insight into the content of PDF files. If you have a
|
||||
PDF file that lacks any structural information, qpdf JSON won't help
|
||||
you solve any of those problems.
|
||||
|
||||
This is what we mean when we say that JSON provides an *alternative
|
||||
syntax* for working with PDF data. Semantically, it is identical to
|
||||
native PDF.
|
||||
|
||||
.. _qpdf-json:
|
||||
|
||||
QPDF JSON Format
|
||||
qpdf JSON Format
|
||||
----------------
|
||||
|
||||
XXX Write this.
|
||||
This section describes how qpdf represents PDF objects in JSON format.
|
||||
It also describes how to work with qpdf JSON to create or
|
||||
modify PDF files.
|
||||
|
||||
.. _json.objects:
|
||||
|
||||
qpdf JSON Object Representation
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
This section describes the representation of PDF objects in qpdf JSON
|
||||
version 2. PDF objects are represented within the ``"objects"``
|
||||
dictionary of a qpdf JSON file. This is true both for PDF serialized
|
||||
to JSON (:qpdf:ref:`--json-output`, ``QPDF::writeJSON``) and for objects as
|
||||
they appear in the output of ``qpdf`` with the :qpdf:ref:`--json`
|
||||
option.
|
||||
|
||||
Each key in the ``"objects"`` dictionary is either ``"trailer"`` or a
|
||||
string of the form ``"obj:O G R"`` where ``O`` and ``G`` are the
|
||||
object and generation numbers and ``R`` is the literal string ``R``.
|
||||
This is the PDF syntax for the indirect object reference prepended by
|
||||
``obj:``. The value, representing the object itself, is a JSON object
|
||||
whose structure is described below.
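
The key format described above can be taken apart with a few lines of code. The following is a minimal sketch in Python, not part of qpdf; the function name ``parse_object_key`` is an assumption made for illustration.

```python
import re

# Matches keys such as "obj:3 0 R"; the special key "trailer" is handled separately.
_OBJ_KEY = re.compile(r"obj:(\d+) (\d+) R")


def parse_object_key(key):
    """Return (object_number, generation) for an "obj:O G R" key.

    Returns None for the "trailer" key and raises ValueError for anything else.
    """
    if key == "trailer":
        return None
    match = _OBJ_KEY.fullmatch(key)
    if match is None:
        raise ValueError(f"not a qpdf JSON object key: {key!r}")
    return int(match.group(1)), int(match.group(2))
```

For example, ``parse_object_key("obj:3 0 R")`` yields the pair of object number 3 and generation 0.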
|
||||
|
||||
Top-level Stream Objects
|
||||
Stream objects are represented as a JSON object with the single key
|
||||
``"stream"``. The stream object has a key called ``"dict"`` whose
|
||||
value is the stream dictionary as an object value (described below)
|
||||
with the ``"/Length"`` key omitted. Other keys are determined by the
|
||||
value for json stream data (:qpdf:ref:`--json-stream-data`, or a
|
||||
parameter of type ``qpdf_json_stream_data_e``) as follows:
|
||||
|
||||
- ``none``: stream data is not represented; no other keys are
|
||||
present
|
||||
|
||||
- ``inline``: the stream data appears as a base64-encoded string as
|
||||
the value of the ``"data"`` key
|
||||
|
||||
- ``file``: the stream data is written to a file, and the path to
|
||||
the file is stored in the ``"datafile"`` key. A relative path is
|
||||
interpreted as relative to the current directory when qpdf is
|
||||
invoked.
|
||||
|
||||
Keys other than ``"dict"``, ``"data"``, and ``"datafile"`` are
|
||||
ignored. This is primarily for future compatibility in case a newer
|
||||
version of qpdf includes additional information.
|
||||
|
||||
As with the native PDF representation, the stream data must be
|
||||
consistent with whatever filters and decode parameters are specified
|
||||
in the stream dictionary.
|
||||
|
||||
Top-level Non-stream Objects
|
||||
Non-stream objects are represented as a dictionary with the single
|
||||
key ``"value"``. Other keys are ignored for future compatibility.
|
||||
The value's structure is described in "Object Values" below.
|
||||
|
||||
Note: in files that use object streams, the trailer "dictionary" is
|
||||
actually a stream, but in the JSON representation, the value of the
|
||||
``"trailer"`` key is always written as a dictionary (with a
|
||||
``"value"`` key like other non-stream objects). There will also be a
|
||||
stream object whose key is the object ID of the cross-reference
|
||||
stream, even though this stream will generally be unreferenced. This
|
||||
makes it possible to assume ``"trailer"`` points to a dictionary
|
||||
without having to consider whether the file uses object streams or
|
||||
not. It is also consistent with how ``QPDF::getTrailer`` behaves in
|
||||
the C++ API.
|
||||
|
||||
Object Values
|
||||
Within ``"value"`` or ``"stream"."dict"``, PDF objects are
|
||||
represented as follows:
|
||||
|
||||
- Objects of type Boolean or null are represented as JSON objects of
|
||||
the same type.
|
||||
|
||||
- Objects that are numeric are represented as numeric in the JSON
|
||||
without regard to precision. Internally, qpdf stores numeric
|
||||
values as strings, so qpdf will preserve arbitrary precision
|
||||
numerical values when reading and writing JSON. It is likely that
|
||||
other JSON readers and writers will have implementation-dependent
|
||||
ways of handling numerical values that are out of range.
|
||||
|
||||
- Name objects are represented as JSON strings that start with ``/``
|
||||
and are followed by the PDF name in canonical form with all PDF
|
||||
syntax resolved. For example, the name whose canonical form (per
|
||||
the PDF specification) is ``text/plain`` would be represented in
|
||||
JSON as ``"/text/plain"`` and in PDF as ``/text#2fplain``.
|
||||
|
||||
- Indirect object references are represented as JSON strings that
|
||||
look like a PDF indirect object reference and have the form ``"O G
|
||||
R"`` where ``O`` and ``G`` are the object and generation numbers
|
||||
and ``R`` is the literal string ``R``. For example, ``"3 0 R"``
|
||||
would represent a reference to the object with object ID 3 and
|
||||
generation 0.
|
||||
|
||||
- PDF strings are represented as JSON strings in one of two ways:
|
||||
|
||||
- ``"u:utf8-encoded-string"``: this format is used when the PDF
|
||||
string can be unambiguously represented as a Unicode string and
|
||||
contains no unprintable characters. This is the case whether the
|
||||
input string is encoded as UTF-16, UTF-8 (as allowed by PDF
|
||||
2.0), or PDF doc encoding. Strings are only represented this way
|
||||
if they can be encoded without loss of information.
|
||||
|
||||
- ``"b:hex-string"``: this format is used to represent any binary
|
||||
string value that can't be represented as a Unicode string.
|
||||
``hex-string`` must have an even number of characters that range
|
||||
from ``a`` through ``f``, ``A`` through ``F``, or ``0`` through
|
||||
``9``.
|
||||
|
||||
qpdf writes empty strings as ``"u:"``, but both ``"b:"`` and
|
||||
``"u:"`` are valid representations of the empty string.
|
||||
|
||||
There is full support for UTF-16 surrogate pairs. Binary strings
|
||||
encoded with ``"b:..."`` are the internal PDF representations.
|
||||
As such, the following are equivalent:
|
||||
|
||||
- ``"u:\ud83e\udd54"`` -- representation of U+1F954 as a surrogate
|
||||
pair in JSON syntax
|
||||
|
||||
- ``"b:FEFFD83EDD54"`` -- representation of U+1F954 as the bytes
|
||||
of a UTF-16 string in PDF syntax with the leading ``FEFF``
|
||||
indicating UTF-16
|
||||
|
||||
- ``"b:efbbbff09fa594"`` -- representation of U+1F954 as the
|
||||
bytes of a UTF-8 string in PDF syntax (as allowed by PDF 2.0)
|
||||
with the leading ``EF``, ``BB``, ``BF`` sequence (which is just
|
||||
UTF-8 encoding of ``FEFF``).
|
||||
|
||||
- A JSON string whose contents are ``u:`` followed by the UTF-8
|
||||
representation of U+1F954. This is the potato emoji.
|
||||
Unfortunately, I am not able to render it in the PDF version
|
||||
of this manual.
|
||||
|
||||
- PDF arrays are represented as JSON arrays of objects as described
|
||||
above
|
||||
|
||||
- PDF dictionaries are represented as JSON objects whose keys are
|
||||
the string representations of names and whose values are
|
||||
representations of PDF objects.
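
The equivalence of the ``"u:"`` and ``"b:"`` string forms described above can be checked with a short script. This is a sketch in Python under stated assumptions, not qpdf's own logic: the helper name ``decode_pdf_string`` is hypothetical, and binary strings that carry neither a UTF-16 nor a UTF-8 marker are simply returned as bytes rather than being interpreted as PDF doc encoding.

```python
def decode_pdf_string(value):
    """Decode a qpdf JSON string value that starts with "u:" or "b:".

    "u:" values are returned as text. "b:" values are hex-decoded; if the
    bytes start with FEFF (UTF-16) or EF BB BF (UTF-8), they are decoded to
    text, otherwise the raw bytes are returned unchanged.
    """
    if value.startswith("u:"):
        return value[2:]
    if value.startswith("b:"):
        raw = bytes.fromhex(value[2:])  # accepts upper- and lower-case hex digits
        if raw.startswith(b"\xfe\xff"):
            return raw[2:].decode("utf-16-be")
        if raw.startswith(b"\xef\xbb\xbf"):
            return raw[3:].decode("utf-8")
        return raw
    raise ValueError(f"not a qpdf JSON string value: {value!r}")
```

With this helper, ``"b:FEFFD83EDD54"`` and ``"b:efbbbff09fa594"`` both decode to the single character U+1F954, matching the ``"u:"`` form once a JSON parser has combined the surrogate pair.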
|
||||
|
||||
.. _json.output:
|
||||
|
||||
qpdf JSON Output
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
The format of the JSON written by qpdf's :qpdf:ref:`--json-output`
|
||||
flag or the ``QPDF::writeJSON`` API call is a JSON object consisting
|
||||
of a single key: ``"qpdf-v2"``. Any other top-level keys are ignored.
|
||||
While unknown keys in other places are ignored for future
|
||||
compatibility, in this case, ignoring other top-level keys is an
|
||||
explicit decision to allow users to include other keys for their own
|
||||
use. No new top-level keys will be added in JSON version 2.
|
||||
|
||||
The ``"qpdf-v2"`` key points to a JSON object with the following keys:
|
||||
|
||||
- ``"pdfversion"`` -- a string containing PDF version as indicated in
|
||||
the PDF header (e.g. ``"1.7"``, ``"2.0"``)
|
||||
|
||||
- ``"maxobjectid"`` -- a number indicating the object ID of the
|
||||
highest numbered object in the file. This is provided to make it
|
||||
easier for software that wants to add new objects to the file as you
|
||||
can safely start with one above that number when creating new
|
||||
objects. Note that the value of ``"maxobjectid"`` may be higher than
|
||||
the actual maximum object that appears in the input PDF since it
|
||||
takes into consideration any dangling indirect object references
|
||||
from the original file. This prevents you from unwittingly creating
|
||||
an object that doesn't exist but that is referenced, which may have
|
||||
unintended side effects. (The PDF specification explicitly allows
|
||||
dangling references and says to treat them as nulls. This can happen
|
||||
if objects are removed from a PDF file.)
|
||||
|
||||
- ``"objects"`` -- the actual PDF objects as described in
|
||||
:ref:`json.objects`.
|
||||
|
||||
Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``.
|
||||
As such, none of the things ``QPDFWriter`` does apply. This includes
|
||||
recompression of streams, renumbering of objects, anything to do with
|
||||
object streams (which are not represented by qpdf JSON at all since
|
||||
they are PDF syntax, not semantics), encryption, decryption,
|
||||
linearization, QDF mode, etc.
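
As an illustration of how ``"maxobjectid"`` might be used when adding objects, here is a minimal Python sketch. The function name ``add_object`` is an assumption for illustration, new objects are given generation 0, and the code also scans existing keys so that it can be called more than once on the same in-memory document.

```python
import re


def add_object(doc, value):
    """Add a non-stream object to a qpdf JSON document held in a dict.

    Returns the indirect reference string (for example "6 0 R") that other
    objects can use to point at the new object.
    """
    qpdf = doc["qpdf-v2"]
    highest = qpdf.get("maxobjectid", 0)
    # Also account for objects already added to this in-memory document.
    for key in qpdf["objects"]:
        match = re.fullmatch(r"obj:(\d+) \d+ R", key)
        if match:
            highest = max(highest, int(match.group(1)))
    new_id = highest + 1
    qpdf["objects"][f"obj:{new_id} 0 R"] = {"value": value}
    return f"{new_id} 0 R"
```

The returned string can be stored as a value elsewhere in the document, for example in a dictionary entry that should refer to the new object.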
|
||||
|
||||
.. _json.example:
|
||||
|
||||
qpdf JSON Example
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
The JSON below shows an example of a simple PDF file represented in
|
||||
qpdf JSON format.
|
||||
|
||||
.. code-block:: json
|
||||
|
||||
{
|
||||
"qpdf-v2": {
|
||||
"pdfversion": "1.3",
|
||||
"maxobjectid": 5,
|
||||
"objects": {
|
||||
"obj:1 0 R": {
|
||||
"value": {
|
||||
"/Pages": "2 0 R",
|
||||
"/Type": "/Catalog"
|
||||
}
|
||||
},
|
||||
"obj:2 0 R": {
|
||||
"value": {
|
||||
"/Count": 1,
|
||||
"/Kids": [ "3 0 R" ],
|
||||
"/Type": "/Pages"
|
||||
}
|
||||
},
|
||||
"obj:3 0 R": {
|
||||
"value": {
|
||||
"/Contents": "4 0 R",
|
||||
"/MediaBox": [ 0, 0, 612, 792 ],
|
||||
"/Parent": "2 0 R",
|
||||
"/Resources": {
|
||||
"/Font": {
|
||||
"/F1": "5 0 R"
|
||||
}
|
||||
},
|
||||
"/Type": "/Page"
|
||||
}
|
||||
},
|
||||
"obj:4 0 R": {
|
||||
"stream": {
|
||||
"data": "eJxzCuFSUNB3M1QwMlEISQOyzY2AyEAhJAXI1gjIL0ksyddUCMnicg3hAgDLAQnI",
|
||||
"dict": {
|
||||
"/Filter": "/FlateDecode"
|
||||
}
|
||||
}
|
||||
},
|
||||
"obj:5 0 R": {
|
||||
"value": {
|
||||
"/BaseFont": "/Helvetica",
|
||||
"/Encoding": "/WinAnsiEncoding",
|
||||
"/Subtype": "/Type1",
|
||||
"/Type": "/Font"
|
||||
}
|
||||
},
|
||||
"trailer": {
|
||||
"value": {
|
||||
"/ID": [
|
||||
"b:98b5a26966fba4d3a769b715b2558da6",
|
||||
"b:98b5a26966fba4d3a769b715b2558da6"
|
||||
],
|
||||
"/Root": "1 0 R",
|
||||
"/Size": 6
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
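
The ``"data"`` value of a stream object is base64-encoded and must agree with the filters named in ``"dict"``. The following Python sketch shows one way to build and read such an object. It is an illustration under stated assumptions: the function names are hypothetical, and only a single ``/FlateDecode`` filter with no decode parameters is handled.

```python
import base64
import zlib


def make_inline_stream(raw):
    """Build a stream object with Flate-compressed, base64-encoded data."""
    compressed = zlib.compress(raw)
    return {
        "stream": {
            "data": base64.b64encode(compressed).decode("ascii"),
            "dict": {"/Filter": "/FlateDecode"},
        }
    }


def read_inline_stream(obj):
    """Return the decoded bytes of a stream object that has inline data."""
    stream = obj["stream"]
    data = base64.b64decode(stream["data"])
    # Only the simple case is handled: one /FlateDecode filter, no parameters.
    if stream["dict"].get("/Filter") == "/FlateDecode":
        data = zlib.decompress(data)
    return data
```

Other filters, filter arrays, and decode parameters would need additional handling that this sketch leaves out.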
|
||||
|
||||
.. _json.input:
|
||||
|
||||
qpdf JSON Input
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
Output in the JSON output format described in :ref:`json.output` can
|
||||
be used in two different ways:
|
||||
|
||||
- By using the :qpdf:ref:`--json-input` flag or calling
|
||||
``QPDF::createFromJSON`` in place of ``QPDF::processFile``, a qpdf
|
||||
JSON file can be used in place of a PDF file as the input to qpdf.
|
||||
|
||||
- By using the :qpdf:ref:`--update-from-json` flag or calling
|
||||
``QPDF::updateFromJSON`` on an initialized ``QPDF`` object, a qpdf
|
||||
JSON file can be used to apply changes to an existing ``QPDF``
|
||||
object. That ``QPDF`` object can have come from any source including
|
||||
a PDF file, a qpdf JSON file, or the result of any other process
|
||||
that results in a valid, initialized ``QPDF`` object.
|
||||
|
||||
Here are some important things to know about qpdf JSON input.
|
||||
|
||||
- When a qpdf JSON file is used as the primary input file, it must be
|
||||
complete. This means
|
||||
|
||||
- A PDF version number must be specified with the ``"pdfversion"``
|
||||
key
|
||||
|
||||
- Stream data must be present for all streams
|
||||
|
||||
- The trailer dictionary must be present, though only the
|
||||
``"/Root"`` key is required.
|
||||
|
||||
- Certain fields from the input are ignored whether creating or
|
||||
updating from a JSON file:
|
||||
|
||||
- ``"maxobjectid"`` is ignored, so it is not necessary to update it
|
||||
when adding new objects.
|
||||
|
||||
- ``"/Length"`` is ignored in all stream dictionaries. qpdf doesn't
|
||||
put it there when it creates JSON output, and it is not necessary
|
||||
to add it.
|
||||
|
||||
- ``"/Size"`` is ignored if it appears in a trailer dictionary as
|
||||
that is always recomputed by ``QPDFWriter``.
|
||||
|
||||
- Unknown keys at the top level of the file, within ``"objects"``,
|
||||
at the top level of each individual object (inside the object that
|
||||
has the ``"value"`` or ``"stream"`` key) and directly within
|
||||
``"stream"`` are ignored for future compatibility. You should
|
||||
avoid putting your own values in those places if you wish to avoid
|
||||
risking that your JSON files will not work in future versions of
|
||||
qpdf. The exception to this advice is at the top level of the
|
||||
overall file where it is explicitly supported for you to add your
|
||||
own keys. For example, you could add your own metadata at the top
|
||||
level, and qpdf will ignore it. Note that extra top-level keys are
|
||||
not preserved when qpdf reads your JSON file.
|
||||
|
||||
- When qpdf reads a PDF file, the internal object numbers are always
|
||||
preserved. However, when qpdf writes a file using ``QPDFWriter``,
|
||||
``QPDFWriter`` does its own numbering and, in general, does not
|
||||
preserve input object numbers. That means that a qpdf JSON file that
|
||||
is used to update an existing PDF must have object numbers that
|
||||
match the input file it is modifying. In practical terms, this means
|
||||
that you can't use a JSON file created from one PDF file to modify
|
||||
the *output of running qpdf on that file*.
|
||||
|
||||
To put this more concretely, the following is valid:
|
||||
|
||||
::
|
||||
|
||||
qpdf --json-output in.pdf pdf.json
|
||||
# edit pdf.json
|
||||
qpdf in.pdf out.pdf --update-from-json=pdf.json
|
||||
|
||||
The following will not produce predictable results because
|
||||
``out.pdf`` won't have the same object numbers as ``pdf.json`` and
|
||||
``in.pdf``.
|
||||
|
||||
::
|
||||
|
||||
qpdf --json-output in.pdf pdf.json
|
||||
# edit pdf.json
|
||||
qpdf in.pdf out.pdf --update-from-json=pdf.json
|
||||
# edit pdf.json again
|
||||
# Don't do this
|
||||
qpdf out.pdf out2.pdf --update-from-json=pdf.json
|
||||
|
||||
- When updating from a JSON file (:qpdf:ref:`--update-from-json`,
|
||||
``QPDF::updateFromJSON``), existing objects are updated in place.
|
||||
This has the following implications:
|
||||
|
||||
- You may omit both ``"data"`` and ``"datafile"`` if the object you
|
||||
are updating is already a stream. In that case the original stream
|
||||
data is preserved. You must always provide a stream dictionary,
|
||||
but it may be empty. Note that an empty stream dictionary will
|
||||
clear the old dictionary. There is no way to indicate that an old
|
||||
stream dictionary should be left alone, so if your intention is to
|
||||
replace the stream data and preserve the dictionary, the
|
||||
original dictionary must appear in the JSON file.
|
||||
|
||||
- You can change one object type to another object type including
|
||||
replacing a stream with a non-stream or a non-stream with a
|
||||
stream. If you replace a non-stream with a stream, you must
|
||||
provide data for the stream.
|
||||
|
||||
- Objects that you do not wish to modify can be omitted from the
|
||||
JSON. That includes the trailer. That means you can use a qpdf JSON
file that was written using :qpdf:ref:`--json-object` so that it
includes only the objects you intend to modify.
|
||||
|
||||
- You can omit the ``"pdfversion"`` key. The input PDF version will
|
||||
be preserved.
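
Because unmodified objects and ``"pdfversion"`` can be omitted when updating, an update file can be much smaller than the full JSON. The Python sketch below trims a full document down to selected objects. The function name ``make_update`` is hypothetical, and the sketch assumes that the remaining keys can simply be left out, as the list above suggests.

```python
def make_update(full_doc, keys):
    """Return a reduced qpdf JSON document containing only the given objects.

    ``keys`` are entries of the "objects" dictionary such as "obj:5 0 R".
    "pdfversion" and "maxobjectid" are deliberately left out of the result.
    """
    objects = full_doc["qpdf-v2"]["objects"]
    return {"qpdf-v2": {"objects": {key: objects[key] for key in keys}}}
```

The result could be edited, written out with ``json.dump``, and then passed to :qpdf:ref:`--update-from-json` together with the original PDF file.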
|
||||
|
||||
.. _json.workflow-cli:
|
||||
|
||||
qpdf JSON Workflow: CLI
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
This section includes a few examples of using qpdf JSON.
|
||||
|
||||
- Convert a PDF file to JSON format, edit the JSON, and convert back
|
||||
to PDF. This is an alternative to using QDF mode (see :ref:`qdf`) to
|
||||
modify PDF files in a text editor. Each method has its own
|
||||
advantages and disadvantages.
|
||||
|
||||
::
|
||||
|
||||
qpdf --json-output in.pdf pdf.json
|
||||
# edit pdf.json
|
||||
qpdf --json-input pdf.json out.pdf
|
||||
|
||||
- Extract only a specific object into a JSON file, modify the object
|
||||
in JSON, and use the modified object to update the original PDF. In
|
||||
this case, we're editing object 4, whatever that may happen to be.
|
||||
You would have to know through some other means which object you
|
||||
wanted to edit, such as by looking at other JSON output or using a
|
||||
tool (possibly but not necessarily qpdf) to identify the object.
|
||||
|
||||
::
|
||||
|
||||
qpdf --json-output in.pdf pdf.json --json-object=4,0
|
||||
# edit pdf.json
|
||||
qpdf in.pdf --update-from-json=pdf.json out.pdf
|
||||
|
||||
Rather than using :qpdf:ref:`--json-object` as in the above example,
|
||||
you could edit the JSON file to remove the objects you didn't need.
|
||||
You could also just leave them there, though the update process
|
||||
would be slower.
|
||||
|
||||
You could also add new objects to a file by adding them to
|
||||
``pdf.json``. Just be sure the object number doesn't conflict with
|
||||
an existing object. The ``"maxobjectid"`` field in the original
|
||||
output can help with this. You don't have to update it if you add
|
||||
objects as it is ignored when the file is read back in.
|
||||
|
||||
- Use :qpdf:ref:`--json-input` and :qpdf:ref:`--json-output` together
|
||||
to demonstrate preservation of object numbers. In this example,
|
||||
``a.json`` and ``b.json`` will have the same objects and object
|
||||
numbers. The files may not be identical since strings may be
|
||||
normalized, fields may appear in a different order, etc. However,
|
||||
``b.json`` and ``c.json`` are probably identical.
|
||||
|
||||
::
|
||||
|
||||
qpdf --json-output in.pdf a.json
|
||||
qpdf --json-input --json-output a.json b.json
|
||||
qpdf --json-input --json-output b.json c.json
|
||||
|
||||
|
||||
.. _json.workflow-api:
|
||||
|
||||
qpdf JSON Workflow: API
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Everything that can be done using the qpdf CLI can be done using the
|
||||
C++ API. See comments in :file:`QPDF.hh` for ``writeJSON``,
|
||||
``createFromJSON``, and ``updateFromJSON`` for details.
|
||||
|
||||
.. _json-guarantees:
|
||||
|
||||
JSON Guarantees
|
||||
---------------
|
||||
JSON Compatibility Guarantees
|
||||
-----------------------------
|
||||
|
||||
The qpdf JSON representation includes a JSON serialization of the raw
|
||||
objects in the PDF file as well as some computed information in a more
|
||||
@ -37,24 +553,23 @@ format. These guarantees are designed to simplify the experience of a
|
||||
developer working with the JSON format.
|
||||
|
||||
Compatibility
|
||||
The top-level JSON object output is a dictionary. The JSON output
|
||||
contains various nested dictionaries and arrays. With the exception
|
||||
of dictionaries that are populated by the fields of objects from the
|
||||
file, all instances of a dictionary are guaranteed to have exactly
|
||||
the same keys. Future versions of qpdf are free to add additional
|
||||
keys but not to remove keys or change the type of object that a key
|
||||
points to. The qpdf program validates this guarantee, and in the
|
||||
unlikely event that a bug in qpdf should cause it to generate data
|
||||
that doesn't conform to this rule, it will ask you to file a bug
|
||||
report.
|
||||
The top-level JSON object is a dictionary (JSON "object"). The JSON
|
||||
output contains various nested dictionaries and arrays. With the
|
||||
exception of dictionaries that are populated by the fields of
|
||||
PDF objects from the file, all instances of a dictionary are
|
||||
guaranteed to have exactly the same keys.
|
||||
|
||||
The top-level JSON structure contains a "``version``" key whose value
|
||||
is a simple integer. The value of the ``version`` key will be
|
||||
The top-level JSON structure contains a ``"version"`` key whose
|
||||
value is a simple integer. The value of the ``version`` key will be
|
||||
incremented if a non-compatible change is made. A non-compatible
|
||||
change would be any change that involves removal of a key, a change
|
||||
to the format of data pointed to by a key, or a semantic change that
|
||||
requires a different interpretation of a previously existing key. A
|
||||
strong effort will be made to avoid breaking compatibility.
|
||||
to the format of data pointed to by a key, or a semantic change
|
||||
that requires a different interpretation of a previously existing
|
||||
key.
|
||||
|
||||
With a specific qpdf JSON version, future versions of qpdf are free
|
||||
to add additional keys but not to remove keys or change the type of
|
||||
object that a key points to.
|
||||
|
||||
Documentation
|
||||
The :command:`qpdf` command can be invoked with the
|
||||
@ -66,28 +581,29 @@ Documentation
|
||||
|
||||
- A dictionary in the help output means that the corresponding
|
||||
location in the actual JSON output is also a dictionary with
|
||||
exactly the same keys; that is, no keys present in help are absent
|
||||
in the real output, and no keys will be present in the real output
|
||||
that are not in help. As a special case, if the dictionary has a
|
||||
single key whose name starts with ``<`` and ends with ``>``, it
|
||||
means that the JSON output is a dictionary that can have any keys,
|
||||
each of which conforms to the value of the special key. This is
|
||||
used for cases in which the keys of the dictionary are things like
|
||||
object IDs.
|
||||
exactly the same keys; that is, no keys present in help are
|
||||
absent in the real output, and no keys will be present in the
|
||||
real output that are not in help. It is possible for a key to be
|
||||
present and have a value that is explicitly ``null``. As a
|
||||
special case, if the dictionary has a single key whose name
|
||||
starts with ``<`` and ends with ``>``, it means that the JSON
|
||||
output is a dictionary that can have any value as a key. This is
|
||||
used for cases in which the keys of the dictionary are things
|
||||
like object IDs.
|
||||
|
||||
- A string in the help output is a description of the item that
|
||||
appears in the corresponding location of the actual output. The
|
||||
corresponding output can have any format.
|
||||
corresponding output can have any value including ``null``.
|
||||
|
||||
- An array in the help output always contains a single element. It
|
||||
indicates that the corresponding location in the actual output is
|
||||
also an array, and that each element of the array has whatever
|
||||
format is implied by the single element of the help output's
|
||||
array.
|
||||
an array of any length, and that each element of the array has
|
||||
whatever format is implied by the single element of the help
|
||||
output's array.
|
||||
|
||||
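
The three rules above can be read as a small recursive check. The
following Python sketch is one possible interpretation, not qpdf's own
validation code: the function name ``conforms`` and the sample data are
hypothetical, and it assumes that values under a ``<...>`` key must each
match the value of that special key.

.. code-block:: python

   def conforms(schema, actual):
       """Return True if `actual` has the shape described by `schema`."""
       if isinstance(schema, str):
           # A string is only a description; any value, including None, is fine.
           return True
       if isinstance(schema, list):
           # A help array has one element that describes every output element.
           return (
               len(schema) == 1
               and isinstance(actual, list)
               and all(conforms(schema[0], item) for item in actual)
           )
       if isinstance(schema, dict):
           if not isinstance(actual, dict):
               return False
           keys = list(schema)
           if len(keys) == 1 and keys[0].startswith("<") and keys[0].endswith(">"):
               # Special case: arbitrary keys, each value checked against the
               # value of the single "<...>" key.
               return all(conforms(schema[keys[0]], v) for v in actual.values())
           # Ordinary case: exactly the same keys on both sides.
           return set(actual) == set(schema) and all(
               conforms(schema[k], actual[k]) for k in schema
           )
       return False

   # Hypothetical fragment modeled on the "pagelabels" description.
   help_fragment = {"pagelabels": [{"index": "first page", "label": "label dict"}]}
   good = {"pagelabels": [{"index": 0, "label": None}, {"index": 4, "label": {}}]}
   bad = {"pagelabels": [{"index": 0}]}
   good_ok = conforms(help_fragment, good)
   bad_ok = conforms(help_fragment, bad)
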
For example, the help output includes a "``pagelabels``"
|
||||
For example, the help output includes a ``"pagelabels"``
|
||||
key whose value is an array of one element. That element is a
|
||||
dictionary with keys "``index``" and "``label``". In addition to
|
||||
dictionary with keys ``"index"`` and ``"label"``. In addition to
|
||||
describing the meaning of those keys, this tells you that the actual
|
||||
JSON output will contain a ``pagelabels`` array, each of whose
|
||||
elements is a dictionary that contains an ``index`` key, a ``label``
|
||||
@ -95,56 +611,13 @@ Documentation
|
||||
|
||||
Directness and Simplicity
|
||||
The JSON output contains the value of every object in the file, but
|
||||
it also contains some processed data. This is analogous to how qpdf's
|
||||
library interface works. The processed data is similar to the helper
|
||||
functions in that it allows you to look at certain aspects of the PDF
|
||||
file without having to understand all the nuances of the PDF
|
||||
it also contains some summary data. This is analogous to how qpdf's
|
||||
library interface works. The summary data is similar to the helper
|
||||
functions in that it allows you to look at certain aspects of the
|
||||
PDF file without having to understand all the nuances of the PDF
|
||||
specification, while the raw objects allow you to mine the PDF for
|
||||
anything that the higher-level interfaces are lacking.
|
||||
|
||||
.. _json.limitations:
|
||||
|
||||
Limitations of JSON Representation
|
||||
----------------------------------
|
||||
|
||||
There are a few limitations to be aware of with the JSON structure:
|
||||
|
||||
- Strings, names, and indirect object references in the original PDF
|
||||
file are all converted to strings in the JSON representation. In the
|
||||
case of a "normal" PDF file, you can tell the difference because a
|
||||
name starts with a slash (``/``), and an indirect object reference
|
||||
looks like ``n n R``, but if there were to be a string that looked
|
||||
like a name or indirect object reference, there would be no way to
|
||||
tell this from the JSON output. Note that there are certain cases
|
||||
where you know for sure what something is, such as knowing that
|
||||
dictionary keys in objects are always names and that certain things
|
||||
in the higher-level computed data are known to contain indirect
|
||||
object references.
|
||||
|
||||
- The JSON format doesn't support binary data very well. Mostly the
|
||||
details are not important, but they are presented here for
|
||||
information. When qpdf outputs a string in the JSON representation,
|
||||
it converts the string to UTF-8, assuming usual PDF string semantics.
|
||||
Specifically, if the original string is UTF-16, it is converted to
|
||||
UTF-8. Otherwise, it is assumed to have PDF doc encoding, and is
|
||||
converted to UTF-8 with that assumption. This causes strange things
|
||||
to happen to binary strings. For example, if you had the binary
|
||||
string ``<038051>``, this would be output to the JSON as ``\u0003•Q``
|
||||
because ``03`` is not a printable character and ``80`` is the bullet
|
||||
character in PDF doc encoding and is mapped to the Unicode value
|
||||
``2022``. Since ``51`` is ``Q``, it is output as is. If you wanted to
|
||||
convert back from here to a binary string, would have to recognize
|
||||
Unicode values whose code points are higher than ``0xFF`` and map
|
||||
those back to their corresponding PDF doc encoding characters. There
|
||||
is no way to tell the difference between a Unicode string that was
|
||||
originally encoded as UTF-16 or one that was converted from PDF doc
|
||||
encoding. In other words, it's best if you don't try to use the JSON
|
||||
format to extract binary strings from the PDF file, but if you really
|
||||
had to, it could be done. Note that qpdf's
|
||||
:qpdf:ref:`--show-object` option does not have this
|
||||
limitation and will reveal the string as encoded in the original
|
||||
file.
|
||||
|
||||
.. _json.considerations:
|
||||
|
||||
JSON: Special Considerations
|
||||
@ -157,12 +630,15 @@ be aware of:
|
||||
- If a PDF file has certain types of errors in its pages tree (such as
|
||||
page objects that are direct or multiple pages sharing the same
|
||||
object ID), qpdf will automatically repair the pages tree. If you
|
||||
specify ``"objects"`` and/or ``"objectinfo"`` without any other
|
||||
keys, you will see the original pages tree without any corrections.
|
||||
If you specify any of the keys that require page tree traversal (for
|
||||
example, ``"pages"``, ``"outlines"``, or ``"pagelabels"``), then
|
||||
``"objects"`` and ``"objectinfo"`` will show the repaired page tree
|
||||
so that object references will be consistent throughout the file.
|
||||
specify ``"objects"`` (and, with qpdf JSON version 1, also
|
||||
``"objectinfo"``) without any other keys, you will see the original
|
||||
pages tree without any corrections. If you specify any of the keys that
|
||||
require page tree traversal (for example, ``"pages"``,
|
||||
``"outlines"``, or ``"pagelabels"``), then ``"objects"`` (and
|
||||
``"objectinfo"``) will show the repaired page tree so that object
|
||||
references will be consistent throughout the file. This is not an
|
||||
issue with :qpdf:ref:`--json-output`, which doesn't repair the pages
|
||||
tree.
|
||||
|
||||
- While qpdf guarantees that keys present in the help will be present
|
||||
in the output, those fields may be null or empty if the information
|
||||
@ -177,22 +653,128 @@ be aware of:
|
||||
1. Note that JSON indexes from 0, and you would also use 0-based
|
||||
indexing using the API. However, 1-based indexing is easier in this
|
||||
case because the command-line syntax for specifying page ranges is
|
||||
1-based. If you were going to write a program that looked through the
|
||||
JSON for information about specific pages and then use the
|
||||
1-based. If you were going to write a program that looked through
|
||||
the JSON for information about specific pages and then use the
|
||||
command-line to extract those pages, 1-based indexing is easier.
|
||||
Besides, it's more convenient to subtract 1 from a program in a real
|
||||
programming language than it is to add 1 from shell code.
|
||||
Besides, it's more convenient to subtract 1 in a real programming
|
||||
language than it is to add 1 in shell code.
|
||||
|
||||
- The image information included in the ``page`` section of the JSON
|
||||
output includes the key "``filterable``". Note that the value of this
|
||||
field may depend on the :qpdf:ref:`--decode-level` that
|
||||
you invoke qpdf with. The JSON output includes a top-level key
|
||||
"``parameters``" that indicates the decode level used for computing
|
||||
whether a stream was filterable. For example, jpeg images will be
|
||||
shown as not filterable by default, but they will be shown as
|
||||
filterable if you run :command:`qpdf --json
|
||||
output includes the key ``"filterable"``. Note that the value of
|
||||
this field may depend on the :qpdf:ref:`--decode-level` that you
|
||||
invoke qpdf with. The JSON output includes a top-level key
|
||||
``"parameters"`` that indicates the decode level that was used for
|
||||
computing whether a stream was filterable. For example, JPEG images
|
||||
will be shown as not filterable by default, but they will be shown
|
||||
as filterable if you run :command:`qpdf --json
|
||||
--decode-level=all`.
|
||||
|
||||
- The ``encrypt`` key's values will be populated for non-encrypted
|
||||
files. Some values will be null, and others will have values that
|
||||
apply to unencrypted files.
|
||||
|
||||
- The qpdf library itself never loads an entire PDF into memory. This
|
||||
remains true for PDF files represented in JSON format. In general,
|
||||
qpdf will hold the entire object structure in memory once a file has
|
||||
been fully read (objects are loaded into memory lazily but stay
|
||||
there once loaded), but it will never have more than two copies of a
|
||||
stream in memory at once. That said, if you ask qpdf to write JSON
|
||||
to memory, it will do so, so be careful about this if you are
|
||||
working with very large PDF files. There is nothing in the qpdf
|
||||
library itself that prevents working with PDF files much larger than
|
||||
available system memory. qpdf can both read and write such files in
|
||||
JSON format. If you need to work with a PDF file's JSON
|
||||
representation in memory, it is recommended that you use either
|
||||
``none`` or ``file`` as the argument to
|
||||
:qpdf:ref:`--json-stream-data`, or if using the API, use
|
||||
``qpdf_sj_none`` or ``qpdf_sj_file`` as the json stream data value.
|
||||
If using ``none``, you can use other means to obtain the stream
|
||||
data.
|
||||
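
To make the point about 1-based page positions concrete, here is a
minimal Python sketch that turns matching entries of the ``"pages"``
array into a page range usable on the command line. The helper name
``page_range`` and the sample data are hypothetical, and the sketch
assumes that each page entry carries a 1-based ``"pageposfrom1"`` key
and an ``"images"`` array.

.. code-block:: python

   def page_range(pages, wanted):
       """Build a comma-separated, 1-based page range from matching pages."""
       positions = [page["pageposfrom1"] for page in pages if wanted(page)]
       return ",".join(str(pos) for pos in positions)

   # Hypothetical, heavily trimmed "pages" array.
   sample_pages = [
       {"pageposfrom1": 1, "images": []},
       {"pageposfrom1": 2, "images": [{"name": "/Im1"}]},
       {"pageposfrom1": 3, "images": [{"name": "/Im2"}]},
   ]

   # Pages that contain at least one image, as a command-line range.
   with_images = page_range(sample_pages, lambda page: page["images"])

   # Subtracting 1 gives 0-based indexes for use with an API.
   zero_based = [page["pageposfrom1"] - 1 for page in sample_pages if page["images"]]

The resulting string can be passed unchanged as a page range, while the
0-based list only requires a subtraction in the calling program.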
|
||||
.. _json-v2-changes:
|
||||
|
||||
Changes from JSON v1 to v2
|
||||
--------------------------
|
||||
|
||||
The following changes were made to qpdf's JSON output format for
|
||||
version 2.
|
||||
|
||||
- The representation of objects has changed. For details, see
|
||||
:ref:`json.objects`.
|
||||
|
||||
- The representation of strings is now unambiguous for all strings.
|
||||
Strings are prefixed with either ``u:`` for Unicode strings or
|
||||
``b:`` for byte strings.
|
||||
|
||||
- Names are shown in qpdf's canonical form rather than in PDF
|
||||
syntax. (Example: the PDF-syntax name ``/text#2fplain`` appeared
|
||||
as ``"/text#2fplain"`` in v1 but appears as ``"/text/plain"`` in
|
||||
v2.)
|
||||
|
||||
- The top-level representation of an object in ``"objects"`` is a
|
||||
dictionary containing either a ``"value"`` key or a ``"stream"``
|
||||
key, making it possible to distinguish streams from other objects.
|
||||
|
||||
- The ``"objectinfo"`` key has been removed in favor of a
|
||||
representation in ``"objects"`` that differentiates between a stream
|
||||
and other kinds of objects. In v1, it was not possible to tell a
|
||||
stream from a dictionary within ``"objects"``.
|
||||
|
||||
- Within the ``"objects"`` dictionary, keys are now ``"obj:O G R"``
|
||||
where ``O`` and ``G`` are the object and generation number.
|
||||
``"trailer"`` remains the key for the trailer dictionary. In v1, the
|
||||
``obj:`` prefix was not present. The rationale for this change is as
|
||||
follows:
|
||||
|
||||
- Having a unique prefix (``obj:``) makes it much easier to search
|
||||
in the JSON file for the definition of an object
|
||||
|
||||
- Having the key still contain ``O G R`` makes it much easier to
|
||||
construct the key from an indirect reference. You just have to
|
||||
prepend ``obj:``. There is no need to parse the indirect object
|
||||
reference.
|
||||
|
||||
- In the ``"encrypt"`` object, the ``"modifyannotations"`` key was
|
||||
misspelled as ``"moddifyannotations"`` in v1. This has been
|
||||
corrected.
|
||||
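
A few of these changes are easy to show in code. The following Python
sketch uses hypothetical helper names and is not part of qpdf. It
assumes that the text after ``b:`` is a hexadecimal rendering of the
bytes, which is not stated in the list above, and its name conversion
handles only ASCII ``#xx`` escapes.

.. code-block:: python

   import re

   def object_key(reference):
       """Turn an indirect reference such as "3 0 R" into its objects key."""
       return "obj:" + reference

   def decode_string(value):
       """Decode a v2 string: "u:" yields text, "b:" yields raw bytes."""
       if value.startswith("u:"):
           return value[2:]
       if value.startswith("b:"):
           # Assumption: the payload after "b:" is hexadecimal.
           return bytes.fromhex(value[2:])
       raise ValueError("not a u: or b: string: " + repr(value))

   def canonical_name(pdf_name):
       """Expand #xx escapes of a PDF-syntax name (ASCII escapes only)."""
       return re.sub(
           r"#([0-9A-Fa-f]{2})",
           lambda m: chr(int(m.group(1), 16)),
           pdf_name,
       )

   key = object_key("3 0 R")
   text = decode_string("u:Potato")
   raw = decode_string("b:038051")
   name = canonical_name("/text#2fplain")

Because the key is built by simple concatenation, code that follows
references through the objects dictionary never has to parse the
reference itself.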
|
||||
Motivation for qpdf JSON version 2
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
qpdf JSON version 2 was created to make it possible to manipulate PDF
|
||||
files using JSON syntax instead of native PDF syntax. This makes it
|
||||
possible to make low-level updates to PDF files from just about any
|
||||
programming language or even to do so from the command-line using
|
||||
tools like ``jq`` or any editor that's capable of working with JSON
|
||||
files. There were several limitations of JSON format version 1 that
|
||||
made this impossible:
|
||||
|
||||
- Strings, names, and indirect object references in the original PDF
|
||||
file were all converted to strings in the JSON representation. For
|
||||
casual human inspection, this was fine, but in the general case,
|
||||
there was no way to tell the difference between a string that looked
|
||||
like a name or indirect object reference and an actual name or
|
||||
indirect object reference.
|
||||
|
||||
- PDF strings were not unambiguously represented in the JSON format.
|
||||
The way qpdf JSON v1 represented a string was to try to convert the
|
||||
string to UTF-8. This was done by assuming a string that was not
|
||||
explicitly marked as Unicode was encoded in PDF doc encoding. The
|
||||
problem is that there is not a perfect bidirectional mapping between
|
||||
Unicode and PDF doc encoding, so if a binary string happened to
|
||||
contain characters that couldn't be bidirectionally mapped, there
|
||||
would be no way to get back to the original PDF string. Even when
|
||||
possible, trying to map from the JSON representation of a binary
|
||||
string back to the original string required knowledge of the mapping
|
||||
between PDF doc encoding and Unicode.
|
||||
|
||||
- There was no representation of stream data. If you wanted to extract
|
||||
stream data, you could use :qpdf:ref:`--show-object`, so this wasn't
|
||||
that important for inspection, but it was a blocker for being able
|
||||
to go from JSON back to PDF. qpdf JSON version 2 allows stream data
|
||||
to be included inline as base64-encoded data. There is also an
|
||||
option to write all stream data to external files, which makes it
|
||||
possible to work with very large PDF files in JSON format even with
|
||||
tools that try to read the entire JSON structure into memory.
|
||||
|
||||
- The PDF version from the PDF header was not represented in qpdf JSON v1.
|
||||
|
@ -70,12 +70,14 @@ Python
|
||||
qpdf's capabilities with other functionality provided by Python's
|
||||
rich standard library and available modules.
|
||||
|
||||
Other Languages
|
||||
Starting with version 8.3.0, the :command:`qpdf`
|
||||
command-line tool can produce a JSON representation of the PDF file's
|
||||
non-content data. This can facilitate interacting programmatically
|
||||
with PDF files through qpdf's command line interface. For more
|
||||
information, please see :ref:`json`.
|
||||
Starting with version 11.0.0, the :command:`qpdf`
|
||||
command-line tool can produce an unambiguous JSON representation of
|
||||
a PDF file and can also create or update PDF files using this JSON
|
||||
representation. qpdf versions from 8.3.0 through 10.6.3 had a more
|
||||
limited JSON output format. The qpdf JSON format makes it possible
|
||||
to inspect and modify the structure of a PDF file down to the
|
||||
object level from the command-line or from any language that can
|
||||
handle JSON data. Please see :ref:`json` for details.
|
||||
|
||||
Wrappers
|
||||
The `qpdf Wiki <https://github.com/qpdf/qpdf/wiki>`__ contains a
|
||||
|
@ -122,7 +122,7 @@ entries in ``/W`` above. Each entry consists of one or more fields, the
|
||||
first of which is the type of the field. The number of bytes for each
|
||||
field is given by ``/W`` above. A 0 in ``/W`` indicates that the field
|
||||
is omitted and has the default value. The default value for the field
|
||||
type is "``1``". All other default values are "``0``".
|
||||
type is ``1``. All other default values are ``0``.
|
||||
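
The handling of omitted fields can be sketched in a few lines of Python.
This is an illustration rather than qpdf's implementation; the function
name ``parse_xref_entry`` and the sample bytes are hypothetical. It
reads each field as a big-endian integer of the width given in ``/W``
and substitutes the default when the width is 0.

.. code-block:: python

   def parse_xref_entry(data, widths):
       """Split one cross-reference stream entry into three integer fields."""
       defaults = (1, 0, 0)  # field type defaults to 1, the others to 0
       fields = []
       offset = 0
       for width, default in zip(widths, defaults):
           if width == 0:
               # A zero width means the field is omitted from the data.
               fields.append(default)
           else:
               chunk = data[offset:offset + width]
               fields.append(int.from_bytes(chunk, "big"))
               offset += width
       return tuple(fields)

   # /W [1 2 1]: one type byte, two bytes for field 2, one byte for field 3.
   full_entry = parse_xref_entry(bytes([0x01, 0x00, 0x10, 0x00]), (1, 2, 1))

   # /W [0 2 0]: only field 2 is stored; the others take their defaults.
   short_entry = parse_xref_entry(bytes([0x02, 0x2C]), (0, 2, 0))
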
|
||||
PDF 1.5 has three field types:
|
||||
|
||||
|
@ -28,6 +28,13 @@ able to restore edited files to a correct state. The
|
||||
arguments. It reads a possibly edited QDF file from standard input and
|
||||
writes a repaired file to standard output.
|
||||
|
||||
For another way to work with PDF files in an editor, see :ref:`json`.
|
||||
Using qpdf JSON format allows you to edit the PDF file semantically
|
||||
without having to be concerned about PDF syntax. However, QDF files
|
||||
are actually valid PDF files, so the feedback cycle may be faster if
|
||||
previewing with a PDF reader. Also, since QDF files are valid PDF, you
|
||||
can experiment with all aspects of the PDF file, including syntax.
|
||||
|
||||
The following attributes characterize a QDF file:
|
||||
|
||||
- All objects appear in numerical order in the PDF file, including when
|
||||
|
@ -27,6 +27,10 @@ executable is available from inside the C++ library using the
|
||||
|
||||
- Use from the C API with ``qpdfjob_run_from_json`` from :file:`qpdfjob-c.h`
|
||||
|
||||
- Note: this is unrelated to :qpdf:ref:`--json` but can be combined
|
||||
with it. For more information on qpdf JSON (vs. QPDFJob JSON), see
|
||||
:ref:`json`.
|
||||
|
||||
- The ``QPDFJob`` C++ API
|
||||
|
||||
If you can understand how to use the :command:`qpdf` CLI, you can
|
||||
|
@ -60,7 +60,8 @@ For a detailed list of changes, please see the file
|
||||
- CLI: breaking changes
|
||||
|
||||
- The default json output version when :qpdf:ref:`--json` is
|
||||
specified has been changed from ``1`` to ``latest``.
|
||||
specified has been changed from ``1`` to ``latest``, which is
|
||||
now ``2``.
|
||||
|
||||
- The :qpdf:ref:`--allow-weak-crypto` flag is now mandatory when
|
||||
explicitly creating files with weak cryptographic algorithms.
|
||||
@ -100,7 +101,7 @@ For a detailed list of changes, please see the file
|
||||
|
||||
- ``qpdf --list-attachments --verbose`` includes some additional
|
||||
information about attachments. Additional information about
|
||||
attachments is also included in the ``attachments`` json key
|
||||
attachments is also included in the ``attachments`` JSON key
|
||||
with ``--json``.
|
||||
|
||||
- For encrypted files, ``qpdf --json`` reveals the user password
|
||||
@ -647,8 +648,8 @@ For a detailed list of changes, please see the file
|
||||
passwords from files or standard input than using
|
||||
:samp:`@file` for this purpose.
|
||||
|
||||
- Add some information about attachments to the json output, and
|
||||
added ``attachments`` as an additional json key. The
|
||||
- Add some information about attachments to the JSON output, and
|
||||
added ``attachments`` as an additional JSON key. The
|
||||
information included here is limited to the preferred name and
|
||||
content stream and a reference to the file spec object. This is
|
||||
enough detail for clients to avoid the hassle of navigating a
|
||||
|