TODO: flesh out JSON v2 details

This commit is contained in:
Jay Berkenbilt 2022-02-25 14:54:25 -05:00
parent 36794a60cf
commit 905e99a314
1 changed file with 152 additions and 36 deletions

TODO

@@ -1,3 +1,4 @@
Next
====
@@ -9,6 +10,7 @@ Priorities for 11:
* cmake
* PointerHolder -> shared_ptr
* ABI
* --json default is latest
Misc
* Get rid of "ugly switch statements" in QUtil.cc -- replace with
@@ -17,6 +19,16 @@ Misc
* Consider exposing get_next_utf8_codepoint in QUtil
* Add QUtil::is_explicit_utf8 that does what QPDF_String::getUTF8Val
does to detect UTF-8 encoded strings per PDF 2.0 spec.
* Add an option --ignore-encryption to ignore encryption information
and treat encrypted files as if they weren't encrypted. This should
make it possible to solve #598 (--show-encryption without a
password). We'll need to make sure we don't try to filter any
streams in this mode. Ideally we should be able to combine this with
--json so we can look at the raw encrypted strings and streams if we
want to. Since providing the password may reveal additional details,
--show-encryption could potentially retry with this option if the
first time doesn't work. Then, with the file open, we can read the
encryption dictionary normally.
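A rough sketch of how this might look on the command line once the
option exists (--ignore-encryption is the proposed name from above;
--show-encryption and --json already exist):
* qpdf --ignore-encryption --show-encryption encrypted.pdf
* qpdf --ignore-encryption --json encrypted.pdf > encrypted.json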
Soon: Break ground on "Document-level work"
@@ -82,21 +94,17 @@ A .clang-format file can be created at the top of the repository.
Output JSON v2
==============
This is not an ABI change as long as the default --json version is 1.
Output JSON v2 will contain enough information to completely recreate
a PDF file. In other words, qpdf will have full, bidirectional,
lossless json serialization/deserialization of PDF.
If this is done, update --json option in cli.rst to mention v2. Also
update QPDFJob::Config::json and of course other parts of the docs
(json.rst).
You can't create a PDF from v1 json because
* The PDF version header is not recorded
* Strings cannot be unambiguously encoded/decoded
@@ -110,36 +118,83 @@ Fix the following problems:
* You can't tell a stream from a dictionary except by looking in both
"objects" and "objectinfo". Fix this, and then remove "objectinfo".
Additionally, using "n n R" as a key in "objects" and "objectinfo"
messes up searching for things.
For json v2:
* Make sure it is possible to serialize and deserialize a PDF to JSON
without loading the whole thing into memory. This is substantial. It
means we need sax-style parsing and handling so we can
handle/generate objects as we go (a hypothetical interface sketch
appears after this list). We'll have to be able to keep track of
keys for dictionary error checking. May want to add json to large
file tests.
* Resolve differences between information shown in the json format vs.
information shown with options like --check, --list-attachments,
etc. The json format should be able to completely replace things
that write to stdout. Be sure getAllPages() and other top-level
convenience routines are there so people don't need to parse the
pages tree themselves. For many workflows, it should be possible for
someone to work in the json file based on json metadata rather than
calling the QPDF API. (Of course, you still need the QPDF API for
higher level helper objects.)
* Consider using camelCase in multi-word key names to be consistent
with job JSON and with how JSON is often represented in languages
that use it more natively.
* Consider changing the contract to allow fields to be absent even
when present in the schema. It's reasonable for people to check for
presence of a key. Most languages make this easy to do.
* If we allow --json to be mixed with --ignore-encryption, we must
emphasize that the resulting json can't be turned back into a valid
PDF.
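As a purely hypothetical sketch of what sax-style handling could look
like (this is not an existing qpdf interface; the names and shape are
invented for illustration), the parser/serializer would emit events
instead of building a whole JSON tree in memory:
  // Hypothetical event interface; not part of qpdf.
  #include <string>
  class JSONEventHandler
  {
    public:
      virtual ~JSONEventHandler() = default;
      // The producer calls these as it walks its input, so only the
      // current nesting path has to be kept in memory.
      virtual void beginDictionary() = 0;
      virtual void key(std::string const& name) = 0;
      virtual void endDictionary() = 0;
      virtual void beginArray() = 0;
      virtual void endArray() = 0;
      virtual void scalar(std::string const& json_literal) = 0;
  };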
Most things that are informational can stay the same. We will have to
go through every item to decide for sure, especially when camelCase is
taken into consideration.
New APIs:
QPDFObjectHandle::parseJSON(QPDF* context, JSON);
QPDFObjectHandle::parseJSON(QPDF* context, std::string const&);
operator ""_qpdf_json
C API to create a QPDFObjectHandle from a json string
* "/Name" -- if it starts with /, it's a name
* "n n R" -- if it is "n n R", it's an indirect object
* "u:utf8-encoded" -- a utf8-encoded string
* "b:<12ab34>" -- a binary string
JSON::parseFile
QPDF::parseJSON(JSON) (like parseFile, etc. -- deserializes json)
QPDF::updateFromJSON(JSON)
In "objects", the key is "obj:o,g", and the value is a dictionary with
exactly one of "value" or "stream" as its single key.
CLI: --infile-is-json -- indicate that the input is a qpdf json file
rather than a PDF file
CLI: --update-from-json=file.json
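To make the intent concrete, usage of the proposed calls might look
roughly like this (none of these functions exist yet; this is only a
sketch of the signatures listed above and will not compile against
current qpdf):
  #include <qpdf/QPDF.hh>
  #include <qpdf/QPDFObjectHandle.hh>
  void sketch(QPDF& pdf)
  {
      // Build a direct object from its json representation.
      QPDFObjectHandle mediabox =
          QPDFObjectHandle::parseJSON(&pdf, "[0, 0, 612, 792]");
      // Strings follow the disambiguation rules described below, so a
      // name is written as "/Name" inside a json string. The proposed
      // ""_qpdf_json literal would be shorthand for the same thing.
      QPDFObjectHandle filter =
          QPDFObjectHandle::parseJSON(&pdf, "\"/FlateDecode\"");
  }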
Have a "qpdf" key in the output that contains "jsonVersion",
"pdfVersion", and "objects". This replaces the "objects" field at the
top level. "objects" and "objectinfo" disappear from the top level.
".version" and ".qpdf.jsonVersion" will match. The input to parseJSON
and updateFromJSON will have to have the "qpdf" key in it. All other
keys are ignored.
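As an illustrative sketch only (all values, and any keys not
mentioned above, are made up), the top level might end up looking
like:
  {
    "version": 2,
    "parameters": { ... },
    "qpdf": {
      "jsonVersion": 2,
      "pdfVersion": "1.7",
      "objects": {
        "obj:1,0": { "value": { ... } },
        "obj:2,0": { "stream": { ... } },
        "obj:trailer": { "value": { ... } }
      }
    }
  }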
When creating from a JSON file, the JSON must be complete with data
for all streams, a trailer, and a pdfVersion. When updating from a
JSON file:
* Any object whose value is null (not "value": null, but just null) is
deleted.
* For any stream that appears without stream data, the stream data is
left alone.
* Otherwise, the object from the JSON completely replaces the input
object. No dictionary merges or anything like that are performed.
It will call replaceObject.
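To illustrate those three rules (object numbers and values are made
up): in the update file below, object 10 is deleted, object 11 gets
the new stream dictionary while its existing stream data is left
alone, and object 12 is replaced outright via replaceObject.
  {
    "qpdf": {
      "jsonVersion": 2,
      "objects": {
        "obj:10,0": null,
        "obj:11,0": { "stream": { "dict": { "/Type": "/XObject" } } },
        "obj:12,0": { "value": [1, 2, 3] }
      }
    }
  }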
Within .qpdf.objects, the key is "obj:o,g" or "obj:trailer", and the
value is a dictionary with exactly one of "value" or "stream" as its
single key.
For non-streams:
  {
    "obj:o,g": {
@@ -149,7 +204,6 @@ For non-streams, the value of "value" is as described above.
For streams:
  {
    "obj:o,g": {
      "stream": {
        "dict": { ... stream dictionary ... },
@@ -160,27 +214,89 @@ For streams:
}
}
Wherever a PDF object appears in the JSON output, including "value"
and "stream"."dict" above as well as other places where they might
appear, objects are represented as follows:
* Always include "dict".
* Arrays, dictionaries, booleans, nulls, integers, and real numbers
with no more than six decimal places are represented as their native
JSON type.
* Real numbers with more than six decimal places are represented as
"r:{real-value}".
* Names: "/Name" -- internal/canonical representation (e.g.
"/Text/Plain", not #xx quoted)
* Indirect objects: "n n R"
* Strings: one of
"s:json string treated as Unicode"
"b:json string treated as bytes; character > \u00ff is an error"
"e:base64-encoded bytes"
Test cases: these are the same:
* "b:\u00c8\u0080", "s:π", "s:\u03c0", and "e:z4A="
* "b:\u00d8\u003e\u00dd\u0054", "s:🥔", "s:\ud83e\udd54", and "e:8J+llA=="
When creating output from a string:
* If the string is explicitly unicode (UTF-8 or UTF-16), encode as
"s:" without the leading U+FEFF
* Else if the string can be bidirectionally mapped between pdf-doc and
unicode, transcode to unicode and encode as "s:"
* Else if the string would be decoded as binary, encode as "e:"
* Else encode as "b:"
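For instance (illustrative values): a string stored in the file as
UTF-16BE with a BOM, FE FF 00 48 00 69, is explicitly unicode and
would be written as "s:Hi"; a plain (Hello) round-trips between
pdf-doc and unicode and also becomes "s:Hello"; an effectively binary
string such as a 16-byte /ID entry would be written as "e:" plus its
base64, with "b:" as the remaining fallback.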
When reading a string, any string that doesn't follow the above rules
is an error. This includes "r:" strings not parseable as a real
number, "/Name" strings containing a NUL character, "s:" or "b:"
strings that are not valid JSON strings, "b:" strings containing
character values > 0xff, or "e:" values that are not valid base64.
Once the string is read in, if the "s:" string can be bidirectionally
mapped between pdf-doc and unicode, store as PDFDoc. Otherwise store
as UTF-16BE. "b:" strings are stored as bytes, and "e:" strings are
decoded and stored as bytes.
Implementing this will require some refactoring of things between
QUtil and QPDF_String, plus we will need to implement a base64
encoder/decoder.
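The encoding half is small. A throwaway sketch (standard base64, not
proposed qpdf code); for the two bytes 0xCF 0x80 from the test case
above it produces "z4A=":
  #include <string>
  static std::string base64_encode_sketch(std::string const& in)
  {
      static char const* chars =
          "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
      std::string out;
      std::string::size_type i = 0;
      while (i < in.size()) {
          // Pack up to three input bytes into a 24-bit group.
          unsigned long v = static_cast<unsigned char>(in[i]) << 16;
          int n = 1;
          if (i + 1 < in.size()) {
              v |= static_cast<unsigned char>(in[i + 1]) << 8;
              ++n;
          }
          if (i + 2 < in.size()) {
              v |= static_cast<unsigned char>(in[i + 2]);
              ++n;
          }
          // Emit four 6-bit digits, padding with '=' as needed.
          out.append(1, chars[(v >> 18) & 0x3f]);
          out.append(1, chars[(v >> 12) & 0x3f]);
          out.append(1, (n > 1) ? chars[(v >> 6) & 0x3f] : '=');
          out.append(1, (n > 2) ? chars[v & 0x3f] : '=');
          i += 3;
      }
      return out;
  }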
This enables a workflow like this:
* qpdf --json=latest infile.pdf > pdf.json
* modify pdf.json
* qpdf infile.pdf --update-from-json=pdf.json out.pdf
or
* qpdf --json=latest --json-stream-data=raw|filtered infile.pdf > pdf.json
* modify pdf.json
* qpdf pdf.json --infile-is-json out.pdf
Notes about streams and stream data:
* Always include "dict". "/Length" is removed from the stream
dictionary.
* Add new flag --json-stream-data={raw,filtered,none}. At most one of
"raw" and "filtered" will appear for each stream. If "filtered"
appears, "/Filter" and "/DecodeParms" are removed from the stream
dictionary. This makes the stream data and dictionary match for when
the file is read back in.
* Always include "filterable" regardless of value of
--json-stream-data. The value of filterable is influenced by
--decode-level, which is already in parameters.
* Add to parameters: value of json-stream-data, default is none
* If --json-stream-data=none, omit stream data entirely
* If --json-stream-data=raw, include raw stream data as base64. Show
the data even for unfiltered streams in "raw".
* If --json-stream-data=filtered, include the base64-encoded filtered
stream data if we can and should decode it based on decode-level.
Otherwise, include the base64-encoded raw data. See if we can honor
--normalize-content. If a stream appears unfiltered in the input,
still show it as filtered. Remove /DecodeParms and /Filter if
filtering.
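Putting these notes together, a single stream entry with
--json-stream-data=filtered might look roughly like this (values are
made up; the original stream dictionary contained only /Length and
/Filter, both of which are removed per the notes above, leaving an
empty "dict"):
  "obj:5,0": {
    "stream": {
      "dict": { },
      "filterable": true,
      "filtered": "...base64 of the uncompressed stream data..."
    }
  }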
Note that --json-stream-data=filtered is different from
--filtered-stream-data in that --filtered-stream-data implies