TODO: clean up remaining work for json v2

2025-01-03 07:12:28 +00:00 · 2022-05-21 17:58:30 -04:00 · 2022-05-21 17:58:30 -04:00 · f1a9ba0c62
commit f1a9ba0c62
parent 27a42c16c7
3 changed files with 89 additions and 109 deletions
--- a/195
+++ b/195
@@ -55,11 +55,7 @@ Soon: Break ground on "Document-level work"
 Output JSON v2
 ==============
-Some of this documentation has drifted from the actual implementation.
+Remaining work:
 * Document that /Length is ignored in stream dictionary replacements
 General things to remember:
 * Make sure all the information from --check and other informational
  options (--show-linearization, --show-encryption, --show-xref,
@@ -68,106 +64,98 @@ General things to remember:
  right keys when in json mode. I don't think I want check on by
  default, so that might be different.
-* Consider changing the contract to allow fields to be absent even
+Notes for documentation:
-  when present in the schema. It's reasonable for people to check for
+
-  presence of a key. Most languages make this easy to do.
+* Find all mentions of json in the manual and update.
 * Document typo fix in encrypt in release notes along with any other
  non-compatible json 2 changes. Scrutinize all the output to decide
  what should change.
-* Document that keys other than "qpdf-v2" are ignored so people can
+* Keys other than "qpdf-v2" are ignored so people can stash their own
-  stash their own stuff.
+  stuff. Unknown keys are ignored at other places for future
  compatibility. Readers of qpdf json should continue to ignore keys
  they don't recognize.
-JSON to PDF:
+* Change: names are written in canonical form with a leading slash
  just as they are treated in the code. In v1, they were written in
  PDF syntax in the json file. Example: /text#2fplain in pdf will be
  written as /text/plain in json v2 and as /text#2fplain in json v1.
-Have --json-input and --update-from-json. With --json-input, the json
+* Document changes to strings, objects, streams, object keys.
 file must be complete, meaning all stream data, the trailer, and the
 PDF version must be present. For streams with no stream data, the
 dictionary is updated but the data is left untouched. Other things
 that are omitted are left alone. Make sure document that, when writing
 a PDF file from QPDF, there is no expectation of object numbers being
 preserved. As such, --update-from-json can only be used to update the
 exact file that the json was created from. You can put multiple
 objects in the update file, but you can't use a json from one file to
 update the output of a previous update since the object numbers will
 have changed. Note that, when creating from a JSON, object numbers are
 preserved in the resulting QPDF object but still modified by
 QPDFWriter for the output. This would be visible by combining
 --json-output and --json-input. Also using --qdf with
 --create-from-json would show original object IDs in comments. It will
 be important to capture this in the documentation.
-When reading a JSON string, any string that doesn't look like a name
+* CLI: --json-input, --json-output[=version], --update-from-json. With
-or indirect object or start with "b:" or "u:" should be considered an
+  --json-input, the input file is a JSON file instead of a PDF file.
-error. Just use newUnicodeString on "u:" strings. For "b:" strings,
+  It must be complete, meaning that a PDF version must be given, all
-decode the bytes with hex_decode and use newString.
+  streams must have exactly one of data or datafile, and a trailer
  dictionary must be present, even if empty.
-Test case: combine --json-input and --json-output to show preservation
+  With --update-from-json, the JSON file updates objects in place. If
-of object numbers. QPDFWriter won't show that although --qdf with the
+  updating an old stream, if stream data is omitted, the data remains
-original object ID comments would.
+  untouched. The dictionary is always required. Remember that
  QPDFWriter does not preserve object numbers, though --json-output
  does. Therefore, if you want to update a PDF with a JSON, the input
  to --update-from-json must be the same PDF as the one that
  --json-output was run on previously. Otherwise, object numbers won't
  match. Show this with an example. When updating,
-The backing input source for createFromJSON is this memory block:
+* Certain fields are ignored when reading the JSON. This includes
  maxobjectid, any computed fields in trailer (such as /Size), and all
  /Length keys in stream dictionaries. There is no need for the user
  to correct, remove, or otherwise worry about any values those keys
  might have. The maxobjectid field is present in the original output
  to assist with adding new objects to the file.
-```
+* JSON strings within PDF objects:
 %PDF-1.3
 xref
 0 1
 0000000000 65535 f 
 trailer << /Size 1 >>
 startxref
 9
 %%EOF
 ```
-* Ignore all keys except .qpdf-v2.
+  * "n n R" is an indirect object
 * Set this->m->pdf_version based on the .qpdf.pdfVersion key
 * For each object in .qpdf.objects:
  * Walk through the object detecting any indirect objects. For each
    one that is not already known, reserve the object. We can also
    validate but we should try to do the best we can with invalid JSON
    so people can get good error messages.
  * Construct a QPDFObjectHandle from the JSON
  * If the object is the trailer, update the trailer
  * Else if the object doesn't exist, reserve it
  * If the object is reserved, call replaceReserved()
  * Else the object already exists; this is an error.
-For streams, have a stream data provider that, for inline streams,
+  * "/Name" is a name in canonical form with a leading slash (like
-does a base64 from the file offsets and for file-based streams, reads
+    "/text/plain"), not PDF syntax (like "/text#2fplain").
 the file. For the inline case, we have to keep the json InputSource
 around. Otherwise, we don't. It is an error if there is no stream
 data. For files, we can have a stream data provider that just reads
 the file. Remember QUtil::file_provider.
-Documentation:
+  * "b:hex-digits" is a binary string ("b:feff03c0"). Hex digits may be
    mixed case. There must be an even number of digits.
-Serialized PDF:
+  * "u:utf-8" is a UTF-8 encoded string ("u:π", "u:\u03c0"). UTF-16
    surrogate pairs are allowed. These are all equivalent: "u:🥔",
    "u:\ud83e\udd54", "b:FEFFD83EDD54", "b:efbbbff09fa594".
-The JSON output will have a "qpdf-v2" key containing
+  * Both "b:" and "u:" are valid representations of the empty string.
 * pdfversion
 * maxobjectid
 * objects
-In regular json mode, "objectinfo" is gone.
+  * Anything else is an error
-Within .objects, the key is "obj:o g R" or "trailer", and the
+* Document use of --json-input and --json-output together to show
-value is a dictionary with exactly one of "value" or "stream" as its
+  preservation of object numbers. Draw attention to "original object
-single key.
+  ID" comments in qdf as another way to show it.
-Rationale of "obj:o g R" is that indirect object references are just
+* Document top-level keys of "qpdf-v2" ("pdfversion", "objects",
-"o g R", and so code that wants to resolve one can do so easily by
+  "maxobjectid") noting that "maxobjectid" is ignored when reading.
-just prepending "obj:" and not having to parse or split the string.
+
-Having a prefix rather than making the key just "o g R" makes it much
+* Stream data: "data" is base64-encoded stream data. "datafile" is the
-easier to search in the JSON for the definition of an object.
+  path to a file (relative path recommended but not required)
  containing the binary data. As with any PDF representation, the data
  must be consistent with the filters. --decode-level is honored by
  --json-output.
 * Other changes from v1:
  * in "objects", keys are "obj:o g R" or "trailer"
  * Non-stream objects are dictionaries with a "value" key whose value
    is the object. Stream objects are dictionaries with a "stream" key
    whose value is {"dict": stream-dictionary}. The "/Length" key is
    omitted from the stream dictionary.
  * "objectinfo" is gone as it is now possible to tell a stream from a
    non-stream directly. To get stream data, use the --json-output
    option. Note about how "pages" may cause the pages tree to be
    corrected.
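
The string rules in the notes above ("n n R", "/Name", "b:", "u:", anything else is an error) can be sketched in Python as follows. This is an illustration of the rules as written in these notes, not qpdf's implementation; `parse_pdf_json_string` and its return tags are hypothetical names chosen for the example.

```python
import re

# Hypothetical helper, not part of qpdf: classify a JSON string found
# inside a PDF object according to the rules in the notes above.
_INDIRECT = re.compile(r"([0-9]+) ([0-9]+) R")
_HEX_PAIRS = re.compile(r"(?:[0-9A-Fa-f]{2})*")


def parse_pdf_json_string(s):
    """Return a (kind, value) tuple or raise ValueError."""
    m = _INDIRECT.fullmatch(s)
    if m:
        # "n n R" is an indirect object reference.
        return ("indirect", (int(m.group(1)), int(m.group(2))))
    if s.startswith("/"):
        # Names are in canonical form, e.g. "/text/plain".
        return ("name", s)
    if s.startswith("b:"):
        digits = s[2:]
        # Mixed case is allowed; the digit count must be even.
        if not _HEX_PAIRS.fullmatch(digits):
            raise ValueError("b: needs an even number of hex digits")
        return ("binary", bytes.fromhex(digits))
    if s.startswith("u:"):
        # The JSON parser has already decoded any \uXXXX escapes.
        return ("unicode", s[2:])
    raise ValueError("not a name, indirect object, b: or u: string: %r" % s)
```

Because the JSON parser resolves `\u` escapes (including surrogate pairs) before this function sees the string, "u:🥔" and "u:\ud83e\udd54" arrive as the same value.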
 For non-streams:
 {
  "obj:o g R": {
    "value": ...
  }
 }
 For streams:
@@ -178,41 +166,31 @@ For streams:
      "datafile": "path to base64-encoded data"
    }
  }
 }
-At most one of "data" or "datafile" will be present. When serializing,
+Rationale of "obj:o g R" is that indirect object references are just
-stream decode parameters will be obeyed, and the stream dictionary
+"o g R", and so code that wants to resolve one can do so easily by
-will reflect the result. There will be the option to omit stream data.
+just prepending "obj:" and not having to parse or split the string.
-
+Having a prefix rather than making the key just "o g R" makes it much
-When data is included, "/Length" is removed from the stream
+easier to search in the JSON for the definition of an object.
 dictionary.
 Streams are filtered or not based on the --decode-level parameter. If
 a stream is filtered, "/Filter" and "/DecodeParms" are removed from
 the stream dictionary. This makes the stream data and dictionary match
 for when the file is read back in.
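
A small Python sketch of the "obj:" rationale above: resolving an indirect reference is a single dictionary lookup after prepending the prefix. The sample `objects` dictionary and the helper names `resolve` and `inline_stream_data` are made up for illustration, and only the inline "data" case is handled; reading "datafile" is left out.

```python
import base64

# Hypothetical sample shaped like the "objects" dictionary described above.
objects = {
    "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
    "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
    "obj:3 0 R": {"stream": {"dict": {}, "data": "aGVsbG8="}},
}


def resolve(objects, ref):
    """Look up an "o g R" reference by prepending "obj:"; no parsing needed."""
    return objects["obj:" + ref]


def inline_stream_data(entry):
    """Return decoded bytes for an inline stream, or None if there are none."""
    stream = entry.get("stream")
    if stream is None or "data" not in stream:
        return None
    return base64.b64decode(stream["data"])


catalog = resolve(objects, "1 0 R")["value"]
pages = resolve(objects, catalog["/Pages"])["value"]
```

The same prefix is what makes searching for `"obj:2 0 R"` in an editor land on the definition rather than on every reference to the object.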
 CLI:
 Example workflow:
-* qpdf in.pdf --json-output=2 pdf.json
+* qpdf in.pdf --json-output pdf.json
 * edit pdf.json
 * qpdf --json-input pdf.json out.pdf
-* qpdf in.pdf --json-output=2 pdf.json
+* qpdf in.pdf --json-output pdf.json
 * edit pdf.json keeping only objects that need to be changed
 * qpdf in.pdf --update-from-json=pdf.json out.pdf
-Update --json option in cli.rst to mention v2 and update json.rst.
+To modify a single object:
-Other documentation fodder:
+* qpdf in.pdf --json-output pdf.json --json-object=o,g
 * edit pdf.json
 * qpdf in.pdf --update-from-json=pdf.json out.pdf
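
The "edit pdf.json keeping only objects that need to be changed" step could be scripted along these lines. This is a sketch that assumes the top-level layout described in these notes (a "qpdf-v2" key holding "pdfversion", "maxobjectid", and "objects"); `make_update` is a hypothetical helper, not a qpdf API, and the sample document is invented.

```python
import copy


def make_update(doc, keep):
    """Return a copy of doc whose "objects" holds only the keys in keep.

    Assumes the layout described in these notes: a top-level "qpdf-v2"
    key containing "objects". Other keys are copied through unchanged.
    """
    v2 = doc["qpdf-v2"]
    missing = [key for key in keep if key not in v2["objects"]]
    if missing:
        raise KeyError("objects not in source JSON: %s" % ", ".join(missing))
    trimmed = {k: copy.deepcopy(v) for k, v in v2.items() if k != "objects"}
    trimmed["objects"] = {key: copy.deepcopy(v2["objects"][key]) for key in keep}
    return {"qpdf-v2": trimmed}


# Invented sample input standing in for the contents of pdf.json.
doc = {
    "qpdf-v2": {
        "pdfversion": "1.3",
        "maxobjectid": 2,
        "objects": {
            "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
            "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
            "trailer": {"value": {"/Root": "1 0 R"}},
        },
    }
}

update = make_update(doc, ["obj:2 0 R"])
update["qpdf-v2"]["objects"]["obj:2 0 R"]["value"]["/Count"] = 1
```

Since object numbers must match, the resulting file is only meaningful as an update to the same PDF the JSON was generated from.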
-You can't create a PDF from v1 json because
+Historical note: you can't create a PDF from v1 json because
 * Change: names are written in canonical form with a leading slash
  just as they are treated in the code. In v1, they were written in
  PDF syntax in the json file. Example: /text#2fplain in pdf will be
  written as /text/plain in json v2 and as /text#2fplain in json v1.
 * The PDF version header is not recorded
@@ -221,15 +199,16 @@ You can't create a PDF from v1 json because
  * Can't tell string from name from indirect object
  * Strings are treated as PDF doc encoding and output as UTF-8, which
-    doesn't work since multiple PDF doc code points are undefined
+    doesn't work since multiple PDF doc code points are undefined and
    is absurd for binary strings
 * There is no representation of stream data
 * You can't tell a stream from a dictionary except by looking in both
-  "object" and "objectinfo". Fix this, and then remove "objectinfo".
+  "object" and "objectinfo".
-Additionally, using "n n R" as a key in "objects" and "objectinfo"
+* Using "n n R" as a key in "objects" and "objectinfo" makes it hard
-messes up searching for things.
+  to search for things when viewing the JSON file in an editor.
 QPDFPagesTree
@@ -249,7 +228,7 @@ I'm thinking we will want to keep a pages cache for efficient
 insertion. There's no reason we can't keep a vector of page objects up
 to date and just do a traversal the first time we do getAllPages just
 like we do now. The difference is that we would not flatten the pages
-tree. It would be useful to go through QPDF_pages and re-reimplement
+tree. It would be useful to go through QPDF_pages and reimplement
 everything without calling flattenPagesTree. Then we can remove
 flattenPagesTree, which is private.
@@ -261,7 +240,7 @@ isPagesObject and isPageObject are reliable and can be made more
 reliable. Maybe add a validate or repair function? It should also make
 sure /Count and /Parent are correct.
-refs/attic/QPDFPagesTree-old -- original, abndoned branch -- clean up
+refs/attic/QPDFPagesTree-old -- original, abandoned branch -- clean up
 when done.
 QPDFJob
--- a/cSpell.json
+++ b/cSpell.json
@@ -429,6 +429,7 @@
    "rdpp",
    "rdquo",
    "refcount",
    "reimplement",
    "resave",
    "retargeted",
    "rfont",
--- a/manual/json.rst
+++ b/manual/json.rst
@@ -23,7 +23,7 @@ be extracted from PDF files using other qpdf command-line options.
 QPDF JSON Format
 ----------------
-QXXXQ Write this.
+XXX Write this.
 .. _json-guarantees: