mirror of https://github.com/qpdf/qpdf.git synced 2025-02-02 03:48:24 +00:00

TODO: more JSON notes

This commit is contained in:
Jay Berkenbilt 2022-05-02 09:41:43 -04:00
parent 3c4d2bfb21
commit 7882b85b06

TODO

@@ -39,6 +39,108 @@ Soon: Break ground on "Document-level work"
Output JSON v2
==============
----
notes from 5/2:
Need new pipelines:
* Pl_OStream(std::ostream) with semantics like Pl_StdioFile
* Pl_String to std::string with semantics like Pl_Buffer
* Pl_Base64
New Pipeline methods:
* writeString(std::string const&)
* writeCString(char*)
* writeChars(char*, size_t)
* Consider templated operator<< which could specialize for char* and
std::string and could use std::ostringstream otherwise
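A minimal sketch of how the proposed convenience methods could layer on the existing byte-oriented write(), together with a Pl_String-style string-collecting pipeline. The Pipeline class below is a stand-in for qpdf's real base class (include/qpdf/Pipeline.hh); the method names mirror the proposal, but the scaffolding is hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Stand-in for qpdf's Pipeline base class, just enough to sketch the
// proposed convenience methods layered on a raw byte-oriented write().
class Pipeline
{
  public:
    virtual ~Pipeline() = default;
    virtual void write(unsigned char const* data, size_t len) = 0;

    // Proposed helpers from the notes above:
    void writeString(std::string const& s)
    {
        write(reinterpret_cast<unsigned char const*>(s.data()), s.size());
    }
    void writeCString(char const* s)
    {
        write(reinterpret_cast<unsigned char const*>(s), std::strlen(s));
    }
    void writeChars(char const* s, size_t len)
    {
        write(reinterpret_cast<unsigned char const*>(s), len);
    }
};

// Sketch of Pl_String: collects everything written into a std::string,
// analogous to how Pl_Buffer collects into a Buffer.
class Pl_String: public Pipeline
{
  public:
    explicit Pl_String(std::string& out) :
        out(out)
    {
    }
    void write(unsigned char const* data, size_t len) override
    {
        out.append(reinterpret_cast<char const*>(data), len);
    }

  private:
    std::string& out;
};
```

A Pl_OStream would be the mirror image, forwarding each write to std::ostream::write the way Pl_StdioFile forwards to fwrite.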
See if I can change all output and error messages issued by the
library, when context is available, to have a pipeline rather than a
FILE* or std::ostream. This makes it possible for people to capture
output more flexibly.
JSON: rather than unparse() -> string, there should be write method
that takes a pipeline and a depth. Then rewrite all the unparse
methods to use it. This makes incremental write possible as well as
writing arbitrarily large amounts of output.
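The write-to-a-pipeline idea can be sketched with a toy value type and a plain std::string standing in for the pipeline: each nested value writes itself at a known depth instead of being unparsed into one big intermediate string. All names here are illustrative, not qpdf's JSON class.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy JSON value: either a scalar or an array. The point of write(sink,
// depth) is incremental output with depth-based indentation, so output
// size never has to be materialized in memory at once.
struct JsonValue
{
    std::string scalar;              // non-empty if this is a scalar
    std::vector<JsonValue> elements; // otherwise, treated as an array

    void write(std::string& sink, size_t depth) const
    {
        std::string indent(2 * depth, ' ');
        if (!scalar.empty()) {
            sink += scalar;
            return;
        }
        sink += "[\n";
        bool first = true;
        for (auto const& e: elements) {
            if (!first) {
                sink += ",\n";
            }
            first = false;
            sink += indent + "  ";
            e.write(sink, depth + 1); // children know their own depth
        }
        sink += "\n" + indent + "]";
    }
};
```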
JSON::parse should work from an InputSource. BufferInputSource can
already start with a std::string.
Have a json blob defined by a function that takes a pipeline and
writes data to the pipeline. Its writer should create a Pl_Base64 ->
Pl_Concatenate in front of the pipeline passed to write and call the
function with that.
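The blob idea above can be sketched as follows: the writer quotes, interposes a base64 encoder (playing the role of Pl_Base64 -> Pl_Concatenate), and hands the wrapped sink to the caller's function. A std::string stands in for the downstream pipeline, and the encoder is a minimal hand-rolled base64, not qpdf's Pl_Base64.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Minimal base64 encoder standing in for what Pl_Base64 would do.
static std::string base64Encode(std::string const& in)
{
    static char const* tab =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    size_t i = 0;
    while (i + 2 < in.size()) {
        unsigned v = (static_cast<unsigned char>(in[i]) << 16) |
            (static_cast<unsigned char>(in[i + 1]) << 8) |
            static_cast<unsigned char>(in[i + 2]);
        out += tab[(v >> 18) & 63];
        out += tab[(v >> 12) & 63];
        out += tab[(v >> 6) & 63];
        out += tab[v & 63];
        i += 3;
    }
    // Handle a 1- or 2-byte tail with '=' padding.
    if (i < in.size()) {
        unsigned v = static_cast<unsigned char>(in[i]) << 16;
        bool two = (i + 1 < in.size());
        if (two) {
            v |= static_cast<unsigned char>(in[i + 1]) << 8;
        }
        out += tab[(v >> 18) & 63];
        out += tab[(v >> 12) & 63];
        out += two ? tab[(v >> 6) & 63] : '=';
        out += '=';
    }
    return out;
}

// writeBlob: let the caller's function write raw bytes, then emit the
// base64-encoded result as a quoted JSON string.
void writeBlob(std::string& sink, std::function<void(std::string&)> fn)
{
    std::string raw;
    fn(raw); // the user-supplied function writes "to the pipeline"
    sink += '"';
    sink += base64Encode(raw);
    sink += '"';
}
```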
Add methods needed to do incremental writes. Basically we need to
expose the functionality of the array and dictionary unparse methods.
Maybe
we can have a DictionaryWriter and an ArrayWriter that deal with the
first/depth logic and have writeElement or writeEntry(key, value)
methods.
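A DictionaryWriter of the kind described could look like the sketch below, owning the first-element and depth bookkeeping so callers just emit entries one at a time. The sink is a plain std::string standing in for a Pipeline; the class name comes from the note, but the implementation is hypothetical.

```cpp
#include <cassert>
#include <string>

// Owns the "is this the first entry?" and indentation state so callers
// can stream writeEntry(key, value) calls without tracking commas.
class DictionaryWriter
{
  public:
    DictionaryWriter(std::string& sink, size_t depth) :
        sink(sink),
        indent(2 * (depth + 1), ' ')
    {
        sink += "{";
    }
    void writeEntry(std::string const& key, std::string const& value)
    {
        sink += first ? "\n" : ",\n";
        first = false;
        sink += indent + "\"" + key + "\": " + value;
    }
    ~DictionaryWriter()
    {
        // Close the dictionary; an empty one collapses to "{}".
        sink += first ? "}" : "\n" + indent.substr(2) + "}";
    }

  private:
    std::string& sink;
    std::string indent;
    bool first = true;
};
```

An ArrayWriter would be the same shape with a writeElement(value) method and no keys.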
For json output, do not unparse to string. Use the writers instead.
Write incrementally. This changes ordering only, but we should be able
to manually update the test output for those cases. Objects should be
written in numerical order, not lexically sorted. It probably makes
sense to put the trailer at the end since that's where it is in a
regular PDF.
When we get to full serialization, add json serialization performance
test.
Some if not all of the json output functionality for v2 should move
into QPDF proper rather than living in QPDFJob. There can be a
top-level QPDF method that takes a pipeline and writes the JSON
serialization to it.
Decide what the API/CLI will be for serializing to v2. Will it just be
part of --json or will it be its own separate thing? Probably we
should make it so that a serialized PDF is different but uses the same
object format as regular json mode.
For going back from JSON to PDF, a separate utility will be needed.
It's not practical for QPDFObjectHandle to be able to read JSON
because of the special handling that is required for indirect objects,
and QPDF can't just accept JSON because the way InputSource is used is
completely different. Instead, we will need a separate utility that has
logic similar to what copyForeignObject does. It will go something
like this:
* Create an empty QPDF (not emptyPDF, one with no objects in it at
all). This works:
```
%PDF-1.3
xref
0 1
0000000000 65535 f
trailer << /Size 1 >>
startxref
9
%%EOF
```
For each object:
* Walk through the object detecting any indirect objects. For each one
that is not already known, reserve the object. We can also validate
but we should try to do the best we can with invalid JSON so people
can get good error messages.
* Construct a QPDFObjectHandle from the JSON
* If the object is the trailer, update the trailer
* Else if the object doesn't exist, reserve it
* If the object is reserved, call replaceReserved()
* Else the object already exists; this is an error.
This can almost be done through public API. I think all we need is the
ability to create a reserved object with a specific object ID.
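The reserve-then-replace control flow above can be simulated with plain containers, with "reserving" modeled as inserting a placeholder keyed by object number and replaceReserved() modeled as a map update. None of this is qpdf API; it only demonstrates the two-pass logic that copyForeignObject-style code uses.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// A parsed JSON object: its object number, the object numbers of any
// indirect references found while walking it, and stand-in content.
struct ParsedObject
{
    int id;
    std::vector<int> refs;
    std::string value;
};

struct MiniQpdf
{
    std::map<int, std::string> objects; // object number -> content
    std::set<int> reserved;             // placeholders awaiting content

    // Returns false (an error) if the object already exists unreserved.
    bool addFromJson(ParsedObject const& obj)
    {
        // Pass 1: reserve every referenced object not already known.
        for (int r: obj.refs) {
            if (!objects.count(r)) {
                objects[r] = "<reserved>";
                reserved.insert(r);
            }
        }
        // Pass 2: install this object's content.
        if (reserved.count(obj.id)) {
            objects[obj.id] = obj.value; // replaceReserved() analogue
            reserved.erase(obj.id);
            return true;
        }
        if (objects.count(obj.id)) {
            return false; // object already exists: error
        }
        objects[obj.id] = obj.value;
        return true;
    }
};
```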
The choices for json_key (job.yml) will be different for v1 and v2.
That information is already duplicated in multiple places.
----
Remember typo: search for "Typo" in QPDFJob::doJSONEncrypt.
Remember to test interaction between generators and schemas.
@@ -173,21 +275,25 @@ JSON:
object. No dictionary merges or anything like that are performed.
It will call replaceObject.
-Within .qpdf.objects, the key is "obj:o,g" or "obj:trailer", and the
+Within .qpdf.objects, the key is "obj:o g R" or "obj:trailer", and the
value is a dictionary with exactly one of "value" or "stream" as its
single key.
Rationale of "obj:o g R" is that indirect object references are just
"o g R", and so code that wants to resolve one can do so easily by
just prepending "obj:" and not having to parse or split the string.
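That rationale in code form: turning a serialized indirect reference like "5 0 R" into the lookup key is plain concatenation, with no parsing or splitting. The helper name is illustrative.

```cpp
#include <cassert>
#include <string>

// Resolve a serialized indirect reference ("o g R") to its key in
// .qpdf.objects by prepending "obj:".
std::string objKey(std::string const& indirectRef)
{
    return "obj:" + indirectRef;
}
```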
For non-streams:
{
"obj:o,g": {
"obj:o g R": {
"value": ...
}
}
For streams:
"obj:o,g": {
"obj:o g R": {
"stream": {
"dict": { ... stream dictionary ... },
"filterable": bool,