TODO: add notes on json v2 and other post-QPDFJob activities/ideas

Jay Berkenbilt 2022-02-01 17:16:52 -05:00
parent 95e7d36b7a
commit 8b67ac494e
2 changed files with 177 additions and 22 deletions

TODO

@ -1,30 +1,13 @@
Next
10.6
====
* Add user-defined initializer `QPDFObjectHandle operator ""_qpdf` to
be like QPDFObjectHandle::parse: `auto oh = "<< /a (b) >>"_qpdf;`
* Close issue #556.
* Add QPDF_MAJOR_VERSION, QPDF_MINOR_VERSION to some header, possibly
dll.h since this is everywhere that there's API
* Take a fresh look at PointerHolder with a good plan for being able
to have developers phase it in using macros or something. Decide
about shared_ptr vs unique_ptr for each time make_shared_cstr is
called. For non-copiable classes, we can use unique_ptr instead of
shared_ptr as a replacement for PointerHolder. For performance
critical cases, we could potentially have a real pointer and a
shared pointer where the shared pointer's job is to clean up but we
use the real pointer for regular access.
Consider in the context of #593, possibly with a different
implementation
* replace mode: --replace-object, --replace-stream-raw,
--replace-stream-filtered
* update first paragraph of QPDF JSON in the manual to mention this
* object numbers are not preserved by write, so object ID lookup
has to be done separately for each invocation
* you don't have to specify length for streams
* you only have to specify filtering for streams if providing raw data
* Add user-defined initializer `QPDFObjectHandle operator ""_qpdf` to
be like QPDFObjectHandle::parse: `auto oh = "<< /a (b) >>"_qpdf;`
* See if this has been done or is trivial with C++11 local static
initializers: Secure random number generation could be made more
@ -43,6 +26,168 @@ implementation
* Completion: would be nice if --job-json-file=<TAB> would complete
files
* Remember for release notes: starting in qpdf 11, the default value
for the --json keyword will be "latest". If you are depending on
version 1, change your code to specify --json=1, which works
starting with 10.6.0.
* Try to put something in to ease future PointerHolder migration, such
as typedefs for containers of PointerHolders. Test to see whether
using auto or decltype in certain places may make containers of
PointerHolders switch over cleanly. Clearly document the deprecation
stuff.
Output JSON v2
==============
Output JSON v2 should contain enough information to completely
recreate a PDF file.
This is not an ABI change as long as the default --json version is 1.
If this is done, update --json option in cli.rst to mention v2. Also
update QPDFJob::Config::json and of course other parts of the docs
(json.rst).
Fix the following problems:
* Include the PDF version header somewhere.
* Using "n n R" as a key in "objects" and "objectinfo" messes up
searching for things
* Strings cannot be unambiguously encoded/decoded
* Can't tell string from name from indirect object
* Strings are treated as PDF doc encoding and output as UTF-8, which
doesn't work since multiple PDF doc code points are undefined
* There is no representation of stream data
* You can't tell a stream from a dictionary except by looking in both
"objects" and "objectinfo". Fix this, and then remove "objectinfo".
* There are differences between information shown in the json format
vs. information shown with options like --check, --list-attachments,
etc. The json format should be able to completely replace things
that write to stdout.
* Consider using camelCase in multi-word key names to be consistent
with job JSON and with how JSON is often represented in languages
that use it more natively
* Consider changing the contract to allow fields to be absent even
when present in the schema. It's reasonable for people to check for
presence of a key. Most languages make this easy to do.
Most things that are informational can stay the same. We will have to
go through every item to decide for sure.
To address ambiguity, consider the following:
Whenever a direct PDF object appears, disambiguate things represented
in JSON as strings as follows:
* "/Name" -- if it starts with /, it's a name
* "n n R" -- if it is "n n R", it's an indirect object
* "u:utf8-encoded" -- a utf8-encoded string
* "b:<12ab34>" -- a binary string
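The disambiguation rules above could be checked with something like the
following sketch. It is only an illustration of the proposal as written
here, not qpdf code. The function name `classify` is hypothetical, and it
assumes that the text between `<` and `>` in a "b:" string is hexadecimal.

```python
import re

# "n n R": object number, generation number, literal R.
_INDIRECT = re.compile(r"(\d+) (\d+) R")


def classify(s):
    """Classify a JSON string using the proposed prefixes.

    Returns a (kind, payload) tuple. The kinds and the tuple shape are
    illustrative choices, not part of the proposal.
    """
    if s.startswith("/"):
        return ("name", s)
    m = _INDIRECT.fullmatch(s)
    if m:
        return ("indirect", (int(m.group(1)), int(m.group(2))))
    if s.startswith("u:"):
        return ("string", s[2:])
    if s.startswith("b:<") and s.endswith(">"):
        # Assumption: the payload is hex digits, as in "b:<12ab34>".
        return ("binary", bytes.fromhex(s[3:-1]))
    raise ValueError("ambiguous string: " + repr(s))
```

Under these rules a bare string with no recognized form is rejected
rather than guessed at, which is the point of the prefixes.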
In "objects", the key is "obj:o,g", and the value is a dictionary with
exactly one of "value" or "stream" as its single key.
For non-streams, the value of "value" is as described above.
{
"obj:o,g": {
"value": ...
}
}
For streams:
{
"obj:o,g": {
"stream": {
"dict": { ... stream dictionary ... },
"filterable": bool,
"raw": "base64-encoded raw data",
"filtered": "base64-encoded filtered data"
}
}
}
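As a rough sketch of the two shapes above, the entries in "objects" could
be built like this. The helper names `value_entry` and `stream_entry` are
hypothetical and exist only for illustration. The key names come from
the layout described here.

```python
import base64


def value_entry(obj, gen, value):
    """Build the "objects" entry for a non-stream object."""
    return {"obj:%d,%d" % (obj, gen): {"value": value}}


def stream_entry(obj, gen, stream_dict, filterable, raw=None, filtered=None):
    """Build the "objects" entry for a stream.

    "dict" and "filterable" are always present. At most one of "raw"
    and "filtered" may be supplied, each as bytes.
    """
    if raw is not None and filtered is not None:
        raise ValueError("at most one of raw and filtered may be given")
    stream = {"dict": stream_dict, "filterable": filterable}
    if raw is not None:
        stream["raw"] = base64.b64encode(raw).decode("ascii")
    if filtered is not None:
        stream["filtered"] = base64.b64encode(filtered).decode("ascii")
    return {"obj:%d,%d" % (obj, gen): {"stream": stream}}
```

With this layout a reader can tell a stream from any other object by
checking which of "value" or "stream" is present, without consulting a
second table.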
Notes about stream data:
* Always include "dict".
* Always include "filterable" regardless of value of
--json-stream-data. The value of filterable is influenced by
--decode-level, which is already in parameters.
* Add new flag --json-stream-data={raw,filtered,none}. At most one of
"raw" and "filtered" will appear for each stream.
* Add to parameters: value of json-stream-data, default is none
* If none, omit stream data entirely
* If raw, include raw stream data as base64
* If filtered, include the base64-encoded filtered stream data if we
can and should decode it based on decode-level. Otherwise, include
the base64-encoded raw data. See if we can honor
--normalize-content.
Note that --json-stream-data=filtered is different from
--filtered-stream-data in that --filtered-stream-data implies
--decode-level=all while --json-stream-data=filtered does not. Make
sure this is mentioned in the help for both options.
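The selection rules for --json-stream-data described in the notes above
might look like the following sketch. The function `stream_data_fields`
and its parameters are hypothetical. In particular, `filterable` stands
in for "we can and should decode it based on decode-level", which this
sketch takes as an input rather than computing.

```python
import base64


def _b64(data):
    return base64.b64encode(data).decode("ascii")


def stream_data_fields(mode, raw, filtered, filterable):
    """Return the stream-data keys to emit for one stream.

    mode is the proposed --json-stream-data value: "none", "raw", or
    "filtered". raw and filtered are bytes. At most one key is returned.
    """
    if mode == "none":
        return {}
    if mode == "raw":
        return {"raw": _b64(raw)}
    if mode == "filtered":
        if filterable:
            return {"filtered": _b64(filtered)}
        # Fall back to raw data when the stream is not decoded.
        return {"raw": _b64(raw)}
    raise ValueError("unknown json-stream-data mode: " + repr(mode))
```

A consumer that sees "raw" when it asked for filtered data can check
"filterable" to learn that the fallback was taken.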
QPDFJob
=======
Here are some ideas for QPDFJob that didn't make it into 10.6. Not all
of these are necessarily good -- just things to consider.
* replace mode: --replace-object, --replace-stream-raw,
--replace-stream-filtered
* update first paragraph of QPDF JSON in the manual to mention this
* object numbers are not preserved by write, so object ID lookup
has to be done separately for each invocation
* you don't have to specify length for streams
* you only have to specify filtering for streams if providing raw data
* Allow users to supply a custom progress reporter for QPDFJob
* Better interoperability with json output:
* Make sure all the things that print stuff to stdout have json
equivalents (check, showLinearizationData, etc.)
* There should be a way to get json output other than having it
print to stdout. It should be multi-language friendly and allow
for large amounts of data, such as providing a callback that qpdf
can write to (like a pipeline)
* See also JSON v2
* How do we chain jobs? The idea would be that the input and/or output
of a QPDFJob could be a QPDF object rather than a file. For input,
it's pretty easy. For output, none of the output-specific options
(encrypt, compress-streams, object-streams, etc.) would have any
effect, so we would have to treat this like inspect for error
checking. The QPDF object in the state where it's ready to be sent
off to QPDFWriter would be used as the input to the next QPDFJob.
For the job json, I think we can have the output be an identifier
that can be used as the input for another QPDFJob. For a json file,
we could have the top level detect if it's an array with the convention
that exactly one has an output, or we could have a subkey with other
job definitions or something. Ideally, any input
(copy-attachments-from, pages, etc.) could use a QPDF object. It
wouldn't surprise me if this exposes bugs in qpdf around foreign
streams as this has been a relatively fragile area before.
Documentation
=============
@ -210,6 +355,15 @@ This is a list of changes to make next time there is an ABI change.
Comments appear in the code prefixed by "ABI"
* Search for ABI to find items not listed here.
* Switch default --json to latest
* Take a fresh look at PointerHolder with a good plan for being able
to have developers phase it in using macros or something. Decide
about shared_ptr vs unique_ptr for each time make_shared_cstr is
called. For non-copiable classes, we can use unique_ptr instead of
shared_ptr as a replacement for PointerHolder. For performance
critical cases, we could potentially have a real pointer and a
shared pointer where the shared pointer's job is to clean up but we
use the real pointer for regular access.
* See where anonymous namespaces can be used to keep things private to
a source file. Search for `(class|struct)` in **/*.cc.
* See if we can use constructor delegation instead of init() in


@ -411,6 +411,7 @@
"struct",
"stylesheet",
"subclassing",
"subkey",
"subkeys",
"subramanyam",
"swversion",