TODO: solidify work for JSON to PDF

2025-01-22 14:48:28 +00:00 · 2022-05-08 13:12:41 -04:00 · ed6130036c
commit ed6130036c
parent 9a0e9a1a9e
2 changed files with 56 additions and 14 deletions
--- a/65
+++ b/65
@@ -18,7 +18,9 @@ Other (do in any order):
 * See if I can change all output and error messages issued by the
  library, when context is available, to have a pipeline rather than a
  FILE* or std::ostream. This makes it possible for people to capture
-  output more flexibly.
+  output more flexibly. We could also add a generic pipeline that
+  takes std::function<void(char const*, size_t)> or even a
+  void(*)(char const*, unsigned long) for the C API.
 * Make job JSON accept a single element and treat as an array of one
  when an array is expected. This allows for making things repeatable
  in the future without breaking compatibility and is needed for the
@@ -62,31 +64,59 @@ General things to remember:
  when present in the schema. It's reasonable for people to check for
  presence of a key. Most languages make this easy to do.

+* Document typo fix in encrypt in release notes along with any other
+  non-compatible json 2 changes. Scrutinize all the output to decide
+  what should change.
+
 * When we get to full serialization, add json serialization
  performance test.

 * Add json to the large file tests.

-* We could consider arguments like --replace-object that would take a
-  JSON representation of the object and could include indirect
-  references, etc. We could also add --delete object.
-
 * Object representation tests
  * "b:cf80", "b:CF80", "u:π", "u:\u03c0"
  * "b:d83edd54", "u:🥔", "u:\ud83e\udd54"

 JSON to PDF:

-When reading a JSON string, any string that doesn't follow the above rules
-is an error. Just use newUnicodeString on "u:" strings. For "b:"
-strings, decode the bytes with hex_decode and use newString.
+Have --create-from-json and --update-from-json. With
+--create-from-json, the json file must be complete, meaning all stream
+data, the trailer, and the PDF version must be present. In
+--update-from-json, an object explicitly set to null (not "value":
+null) is deleted. For streams with no stream data, the dictionary is
+updated but the data is left untouched. Other things that are omitted
+are left alone. Make sure to document that, when writing a PDF file from
+QPDF, there is no expectation of object numbers being preserved. As
+such, --update-from-json can only be used to update the exact file
+that the json was created from. You can put multiple objects in the
+update file, but you can't use a json from one file to update the
+output of a previous update since the object numbers will have
+changed. Note that, when creating from a JSON, object numbers are
+preserved in the resulting QPDF object but still modified by
+QPDFWriter for the output. This would be visible by combining
+--to-json and --create-from-json. Also using --qdf with
+--create-from-json would show original object IDs in comments. It will
+be important to capture this in the documentation.
+
+When reading a JSON string, any string that doesn't look like a name
+or indirect object or start with "b:" or "u:" should be considered an
+error. Just use newUnicodeString on "u:" strings. For "b:" strings,
+decode the bytes with hex_decode and use newString.

 For going back from JSON to PDF, we can have
-QPDF::fromJSON(std::shared_ptr<InputSource> which will have logic
-similar to copyForeignObject. Note that this InputSource is not going
-to be this->file. We have to keep it separately.
+QPDF::createFromJSON(std::shared_ptr<InputSource>)
+which will have logic similar to copyForeignObject. Note that this
+InputSource is not going to be this->file. We have to keep it
+separately. There's also non-static QPDF::updateFromJSON. Both
+createFromJSON and updateFromJSON will call the same internal method
+with different options. That method will use a reactor that is a
+private QPDF class that just proxies to private QPDF methods.

-The backing input source is this memory block:
+Test case: combine --create-from-json and --to-json to show preservation of
+object numbers. QPDFWriter won't show that although --qdf with the
+original object ID comments would.
+
+The backing input source for createFromJSON is this memory block:

 ```
 %PDF-1.3
@@ -116,7 +146,9 @@ startxref
 For streams, have a stream data provider that, for inline streams,
 does a base64 from the file offsets and for file-based streams, reads
 the file. For the inline case, we have to keep the json InputSource
-around. Otherwise, we don't. It is an error if there is no stream data.
+around. Otherwise, we don't. It is an error if there is no stream
+data. For files, we can have a stream data provider that just reads
+the file. Remember QUtil::file_provider.

 Documentation:

@@ -125,6 +157,7 @@ Serialized PDF:
 The JSON output will have a "qpdf" key containing
 * jsonversion
 * pdfversion
+* maxobjectid
 * objects

 The "qpdf" key replaces "objects" and "objectinfo" in v1 JSON.
@@ -175,7 +208,11 @@ CLI:
 Example workflow:
 * qpdf in.pdf --to-json > pdf.json
 * edit pdf.json
-* qpdf --from-json=pdf.json out.pdf
+* qpdf --create-from-json=pdf.json out.pdf
+
+* qpdf in.pdf --to-json > pdf.json
+* edit pdf.json keeping only objects that need to be changed
+* qpdf in.pdf --update-from-json=pdf.json out.pdf

 Update --json option in cli.rst to mention v2 and update json.rst.

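Note on the string rule in the hunk above ("u:" strings become Unicode strings, "b:" strings are hex-decoded, anything else is an error): a minimal standalone sketch of that rule follows. It does not use the qpdf API. `StringKind`, `DecodedString`, `hex_value`, and `decode_json_string` are hypothetical names introduced only for illustration, and the sketch assumes names and indirect object references have already been recognized by the caller. Rejecting an odd number of hex digits is a choice made here, not something the TODO specifies.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical result type for illustration; not part of qpdf.
enum class StringKind { Unicode, Binary };

struct DecodedString {
    StringKind kind;
    std::string value; // UTF-8 text for Unicode, raw bytes for Binary
};

// Return the value of one hex digit, or -1 if c is not a hex digit.
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decode a JSON string that carries a "u:" or "b:" prefix.
// Assumption: names and indirect object references were handled earlier,
// so any other string reaching this function is an error.
DecodedString decode_json_string(std::string const& s)
{
    if (s.compare(0, 2, "u:") == 0) {
        // The remainder is taken as-is as UTF-8 text.
        return {StringKind::Unicode, s.substr(2)};
    }
    if (s.compare(0, 2, "b:") == 0) {
        std::string hex = s.substr(2);
        if (hex.size() % 2 != 0) {
            throw std::invalid_argument("odd number of hex digits");
        }
        std::string bytes;
        for (std::size_t i = 0; i < hex.size(); i += 2) {
            int hi = hex_value(hex[i]);
            int lo = hex_value(hex[i + 1]);
            if (hi < 0 || lo < 0) {
                throw std::invalid_argument("invalid hex digit");
            }
            bytes.push_back(static_cast<char>((hi << 4) | lo));
        }
        return {StringKind::Binary, bytes};
    }
    throw std::invalid_argument("string must start with \"b:\" or \"u:\"");
}
```

With this sketch, "b:cf80" and "b:CF80" decode to the same two bytes, which matches the object representation test pairs listed in the TODO.
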
--- a/cSpell.json
+++ b/cSpell.json
@@ -79,6 +79,7 @@
    "ctest",
    "cxxflags",
    "cygwin",
+    "datafile",
    "dbuild",
    "dcmake",
    "dctdecode",
@@ -216,6 +217,7 @@
    "jsample",
    "jsamprow",
    "jsimd",
+    "jsonversion",
    "jstr",
    "jurczyk",
    "kgdl",
@@ -262,6 +264,7 @@
    "masamichi",
    "mateusz",
    "maxdepth",
+    "maxobjectid",
    "mdash",
    "mindepth",
    "mkdir",
@@ -344,6 +347,7 @@
    "pcre",
    "pdflatex",
    "pdfs",
+    "pdfversion",
    "pdlin",
    "pfeifle",
    "pikepdf",
@@ -434,6 +438,7 @@
    "rpath",
    "rstream",
    "runlength",
+    "runpath",
    "runtest",
    "sahil",
    "samp",