mirror of https://github.com/qpdf/qpdf.git synced 2025-01-03 07:12:28 +00:00

TODO: clean up remaining work for json v2

Jay Berkenbilt 2022-05-21 17:58:30 -04:00
parent 27a42c16c7
commit f1a9ba0c62
3 changed files with 89 additions and 109 deletions

TODO

@ -55,11 +55,7 @@ Soon: Break ground on "Document-level work"
Output JSON v2 Output JSON v2
============== ==============
Some of this documentation has drifted from the actual implementation. Remaining work:
* Document that /Length is ignored in stream dictionary replacements
General things to remember:
* Make sure all the information from --check and other informational * Make sure all the information from --check and other informational
options (--show-linearization, --show-encryption, --show-xref, options (--show-linearization, --show-encryption, --show-xref,
@ -68,106 +64,98 @@ General things to remember:
right keys when in json mode. I don't think I want check on by right keys when in json mode. I don't think I want check on by
default, so that might be different. default, so that might be different.
* Consider changing the contract to allow fields to be absent even Notes for documentation:
when present in the schema. It's reasonable for people to check for
presence of a key. Most languages make this easy to do. * Find all mentions of json in the manual and update.
* Document typo fix in encrypt in release notes along with any other * Document typo fix in encrypt in release notes along with any other
non-compatible json 2 changes. Scrutinize all the output to decide non-compatible json 2 changes. Scrutinize all the output to decide
what should change. what should change.
* Document that keys other than "qpdf-v2" are ignored so people can * Keys other than "qpdf-v2" are ignored so people can stash their own
stash their own stuff. stuff. Unknown keys are ignored at other places for future
compatibility. Readers of qpdf json should continue to ignore keys
they don't recognize.
JSON to PDF: * Change: names are written in canonical form with a leading slash
just as they are treated in the code. In v1, they were written in
PDF syntax in the json file. Example: /text#2fplain in pdf will be
written as /text/plain in json v2 and as /text#2fplain in json v1.
Have --json-input and --update-from-json. With --json-input, the json * Document changes to strings, objects, streams, object keys.
file must be complete, meaning all stream data, the trailer, and the
PDF version must be present. For streams with no stream data, the
dictionary is updated but the data is left untouched. Other things
that are omitted are left alone. Make sure document that, when writing
a PDF file from QPDF, there is no expectation of object numbers being
preserved. As such, --update-from-json can only be used to update the
exact file that the json was created from. You can put multiple
objects in the update file, but you can't use a json from one file to
update the output of a previous update since the object numbers will
have changed. Note that, when creating from a JSON, object numbers are
preserved in the resulting QPDF object but still modified by
QPDFWriter for the output. This would be visible by combining
--json-output and --json-input. Also using --qdf with
--create-from-json would show original object IDs in comments. It will
be important to capture this in the documentation.
When reading a JSON string, any string that doesn't look like a name * CLI: --json-input, --json-output[=version], --update-from-json. With
or indirect object or start with "b:" or "u:" should be considered an --json-input, the input file is a JSON file instead of a PDF file.
error. Just use newUnicodeString on "u:" strings. For "b:" strings, It must be complete, meaning that a PDF version must be given, all
decode the bytes with hex_decode and use newString. streams must have exactly one of data or datafile, and a trailer
dictionary must be present, even if empty.
Test case: combine --json-input and --json-output to show preservation With --update-from-json, the JSON file updates objects in place. If
of object numbers. QPDFWriter won't show that although --qdf with the updating an old stream, if stream data is omitted, the data remains
original object ID comments would. untouched. The dictionary is always required. Remember that
QPDFWriter does not preserve object numbers, though --json-output
does. Therefore, if you want to update a PDF with a JSON, the input
to --update-from-json must be the same PDF as the one that
--json-output was run on previously. Otherwise, object numbers won't
match. Show this with an example. When updating,
The backing input source for createFromJSON is this memory block: * Certain fields are ignored when reading the JSON. This includes
maxobjectid, any computed fields in trailer (such as /Size), and all
/Length keys in stream dictionaries. There is no need for the user
to correct, remove, or otherwise worry about any values those keys
might have. The maxobjectid field is present in the original output
to assist with adding new objects to the file.
``` * JSON strings within PDF objects:
%PDF-1.3
xref
0 1
0000000000 65535 f
trailer << /Size 1 >>
startxref
9
%%EOF
```
* Ignore all keys except .qpdf-v2. * "n n R" is an indirect object
* Set this->m->pdf_version based on the .qpdf.pdfVersion key
* For each object in .qpdf.objects:
* Walk through the object detecting any indirect objects. For each
one that is not already known, reserve the object. We can also
validate but we should try to do the best we can with invalid JSON
so people can get good error messages.
* Construct a QPDFObjectHandle from the JSON
* If the object is the trailer, update the trailer
* Else if the object doesn't exist, reserve it
* If the object is reserved, call replaceReserved()
* Else the object already exists; this is an error.
For streams, have a stream data provider that, for inline streams, * "/Name" is a name in canonical form with a leading slash (like
does a base64 from the file offsets and for file-based streams, reads "/text/plain"), not PDF syntax (like "/text#2fplain").
the file. For the inline case, we have to keep the json InputSource
around. Otherwise, we don't. It is an error if there is no stream
data. For files, we can have a stream data provider that just reads
the file. Remember QUtil::file_provider.
Documentation: * "b:hex-digits" is a binary string ("b:feff03c0"). Hex digits may be
mixed case. There must be an even number of digits.
Serialized PDF: * "u:utf-8" is a UTF-8 encoded string ("u:π", "u:\u03c0"). UTF-16
surrogate pairs are allowed. These are all equivalent: "u:🥔",
"u:\ud83e\udd54", "b:FEFFD83EDD54", "b:efbbbff09fa594".
The JSON output will have a "qpdf-v2" key containing * Both "b:" and "u:" are valid representations of the empty string.
* pdfversion
* maxobjectid
* objects
In regular json mode, "objectinfo" is gone. * Anything else is an error
Within .objects, the key is "obj:o g R" or "trailer", and the * Document use of --json-input and --json-output together to show
value is a dictionary with exactly one of "value" or "stream" as its preservation of object numbers. Draw attention to "original object
single key. ID" comments in qdf as another way to show it.
Rationale of "obj:o g R" is that indirect object references are just * Document top-level keys of "qpdf-v2" ("pdfversion", "objects",
"o g R", and so code that wants to resolve one can do so easily by "maxobjectid") noting that "maxobjectid" is ignored when reading.
just prepending "obj:" and not having to parse or split the string.
Having a prefix rather than making the key just "o g R" makes it much * Stream data: "data" is base64-encoded stream data. "datafile" is the
easier to search in the JSON for the definition of an object. path to a file (relative path recommended but not required)
containing the binary data. As with any PDF representation, the data
must be consistent with the filters. --decode-level is honored by
--json-output.
* Other changes from v1:
* in "objects", keys are "obj:o g R" or "trailer"
* Non-stream objects are dictionaries with a "value" key whose value
is the object. Stream objects are dictionaries with a "stream" key
whose value is {"dict": stream-dictionary}. The "/Length" key is
omitted from the stream dictionary.
* "objectinfo" is gone as it is now possible to tell a stream from a
non-stream directly. To get stream data, use the --json-output
option. Note about how "pages" may cause the pages tree to be
corrected.
For non-streams: For non-streams:
{
"obj:o g R": { "obj:o g R": {
"value": ... "value": ...
} }
}
For streams: For streams:
@ -178,41 +166,31 @@ For streams:
"datafile": "path to base64-encoded data" "datafile": "path to base64-encoded data"
} }
} }
}
At most one of "data" or "datafile" will be present. When serializing, Rationale of "obj:o g R" is that indirect object references are just
stream decode parameters will be obeyed, and the stream dictionary "o g R", and so code that wants to resolve one can do so easily by
will reflect the result. There will be the option to omit stream data. just prepending "obj:" and not having to parse or split the string.
Having a prefix rather than making the key just "o g R" makes it much
When data is included, "/Length" is removed from the stream easier to search in the JSON for the definition of an object.
dictionary.
Streams are filtered or not based on the --decode-level parameter. If
a stream is filtered, "/Filter" and "/DecodeParms" are removed from
the stream dictionary. This makes the stream data and dictionary match
for when the file is read back in.
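The "obj:o g R" rationale above can be sketched in Python: an indirect reference "o g R" is resolved by prepending "obj:" and using the result as a key, with no parsing or splitting. This is a minimal illustration, not qpdf code. The sample document, the resolve helper, and the regular expression used to recognize "o g R" are all assumptions made for the example.

```python
import re

# Hypothetical sample shaped like the "qpdf-v2" layout described in these
# notes; the object contents are invented for illustration.
sample = {
    "qpdf-v2": {
        "pdfversion": "1.3",
        "maxobjectid": 2,
        "objects": {
            "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
            "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
            "trailer": {"value": {"/Root": "1 0 R"}},
        },
    }
}

# Assumed shape of an indirect reference: "<object> <generation> R".
INDIRECT_REF = re.compile(r"[0-9]+ [0-9]+ R")


def resolve(objects, ref):
    """Look up an indirect reference such as "1 0 R" by prepending "obj:"."""
    if not INDIRECT_REF.fullmatch(ref):
        raise ValueError(f"not an indirect reference: {ref!r}")
    return objects["obj:" + ref]


objects = sample["qpdf-v2"]["objects"]
root = resolve(objects, objects["trailer"]["value"]["/Root"])
pages = resolve(objects, root["value"]["/Pages"])
```

Because every definition key carries the "obj:" prefix, searching the JSON text for "obj:2 0 R" lands on the definition of object 2 rather than on one of its references.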
CLI: CLI:
Example workflow: Example workflow:
* qpdf in.pdf --json-output=2 pdf.json * qpdf in.pdf --json-output pdf.json
* edit pdf.json * edit pdf.json
* qpdf --json-input pdf.json out.pdf * qpdf --json-input pdf.json out.pdf
* qpdf in.pdf --json-output=2 pdf.json * qpdf in.pdf --json-output pdf.json
* edit pdf.json keeping only objects that need to be changed * edit pdf.json keeping only objects that need to be changed
* qpdf in.pdf --update-from-json=pdf.json out.pdf * qpdf in.pdf --update-from-json=pdf.json out.pdf
Update --json option in cli.rst to mention v2 and update json.rst. To modify a single object:
Other documentation fodder: * qpdf in.pdf --json-output pdf.json --json-object=o,g
* edit pdf.json
* qpdf in.pdf --update-from-json=pdf.json out.pdf
You can't create a PDF from v1 json because Historical note: you can't create a PDF from v1 json because
* Change: names are written in canonical form with a leading slash
just as they are treated in the code. In v1, they were written in
PDF syntax in the json file. Example: /text#2fplain in pdf will be
written as /text/plain in json v2 and as /text#2fplain in json v1.
* The PDF version header is not recorded * The PDF version header is not recorded
@ -221,15 +199,16 @@ You can't create a PDF from v1 json because
* Can't tell string from name from indirect object * Can't tell string from name from indirect object
* Strings are treated as PDF doc encoding and output as UTF-8, which * Strings are treated as PDF doc encoding and output as UTF-8, which
doesn't work since multiple PDF doc code points are undefined doesn't work since multiple PDF doc code points are undefined and
is absurd for binary strings
* There is no representation of stream data * There is no representation of stream data
* You can't tell a stream from a dictionary except by looking in both * You can't tell a stream from a dictionary except by looking in both
"object" and "objectinfo". Fix this, and then remove "objectinfo". "object" and "objectinfo".
Additionally, using "n n R" as a key in "objects" and "objectinfo" * Using "n n R" as a key in "objects" and "objectinfo" makes it hard
messes up searching for things. to search for things when viewing the JSON file in an editor.
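The string conventions listed in the notes above ("n n R", "/Name", "b:hex-digits", "u:utf-8", anything else is an error) can be sketched as a small Python classifier. This is a hedged illustration of the rules as written in the notes, not qpdf's implementation. The names classify and binary_as_text are hypothetical, and treating a leading byte order mark as the signal for UTF-16 or UTF-8 text is an assumption used only to show why the four potato examples are equivalent.

```python
import json
import re

# Assumed shape of an indirect reference: "<object> <generation> R".
INDIRECT_REF = re.compile(r"[0-9]+ [0-9]+ R")
# Zero or more pairs of hex digits, mixed case allowed.
HEX_PAIRS = re.compile(r"(?:[0-9A-Fa-f]{2})*")


def classify(s):
    """Classify a JSON string found inside a PDF object, per the notes."""
    if INDIRECT_REF.fullmatch(s):
        num, gen, _ = s.split(" ")
        return ("indirect", (int(num), int(gen)))
    if s.startswith("/"):
        return ("name", s)  # canonical form, e.g. "/text/plain"
    if s.startswith("u:"):
        return ("unicode", s[2:])
    if s.startswith("b:"):
        digits = s[2:]
        if not HEX_PAIRS.fullmatch(digits):
            raise ValueError(f"bad hex digits in binary string: {s!r}")
        return ("binary", bytes.fromhex(digits))
    raise ValueError(f"unrecognized string: {s!r}")


def binary_as_text(data):
    """Decode bytes that start with a UTF-16 BE or UTF-8 byte order mark."""
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    if data.startswith(b"\xef\xbb\xbf"):
        return data[3:].decode("utf-8")
    raise ValueError("no Unicode byte order mark")


potato = "\U0001F954"
# The JSON escape form uses a UTF-16 surrogate pair; json.loads joins it.
from_escape = classify(json.loads('"u:\\ud83e\\udd54"'))[1]
from_utf16 = binary_as_text(classify("b:FEFFD83EDD54")[1])
from_utf8 = binary_as_text(classify("b:efbbbff09fa594")[1])
```

Both classify("b:") and classify("u:") return an empty payload, matching the note that either prefix is a valid representation of the empty string.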
QPDFPagesTree QPDFPagesTree
@ -249,7 +228,7 @@ I'm thinking we will want to keep a pages cache for efficient
insertion. There's no reason we can't keep a vector of page objects up insertion. There's no reason we can't keep a vector of page objects up
to date and just do a traversal the first time we do getAllPages just to date and just do a traversal the first time we do getAllPages just
like we do now. The difference is that we would not flatten the pages like we do now. The difference is that we would not flatten the pages
tree. It would be useful to go through QPDF_pages and re-reimplement tree. It would be useful to go through QPDF_pages and reimplement
everything without calling flattenPagesTree. Then we can remove everything without calling flattenPagesTree. Then we can remove
flattenPagesTree, which is private. flattenPagesTree, which is private.
@ -261,7 +240,7 @@ isPagesObject and isPageObject are reliable and can be made more
reliable. Maybe add a validate or repair function? It should also make reliable. Maybe add a validate or repair function? It should also make
sure /Count and /Parent are correct. sure /Count and /Parent are correct.
refs/attic/QPDFPagesTree-old -- original, abndoned branch -- clean up refs/attic/QPDFPagesTree-old -- original, abandoned branch -- clean up
when done. when done.
QPDFJob QPDFJob


@ -429,6 +429,7 @@
"rdpp", "rdpp",
"rdquo", "rdquo",
"refcount", "refcount",
"reimplement",
"resave", "resave",
"retargeted", "retargeted",
"rfont", "rfont",


@ -23,7 +23,7 @@ be extracted from PDF files using other qpdf command-line options.
QPDF JSON Format QPDF JSON Format
---------------- ----------------
QXXXQ Write this. XXX Write this.
.. _json-guarantees: .. _json-guarantees: