mirror of https://github.com/qpdf/qpdf.git synced 2025-01-05 08:02:11 +00:00

TODO-pages: introduce QPDFAssembler and QPDFSplitter

Jay Berkenbilt 2024-01-04 07:21:23 -05:00
parent e52b026db4
commit f7dd653d5f


@@ -24,18 +24,15 @@ of the following properties, among others:
 * Contains information used by pages (named destinations)
 As long as qpdf has had the ability to copy pages from one PDF to another, it has had robust
-handling of page-level data. Prior to the implementation of the pages epic, with the exception of
-page labels and form fields, qpdf has ignored document-level data during page copy operations.
-Specifically, when qpdf creates a new PDF file from existing PDF files, it always starts with a
-specific PDF, known as the _primary input_. The primary input may be a file or the built-in _empty
-PDF_. With the exception of page labels and form fields, document-level constructs that appear in
-the primary input are preserved, and document-level constructs from the other PDF files are ignored.
-With page labels, qpdf always ensures that any given page has the same label in the final output as
-it had in whichever input file it originated from, which is usually (but not always) the desired
-behavior. With form fields, qpdf has awareness and ensures that all form fields remain operational.
-The goal is to extend this document-level-awareness to other document-level constructs.
-Here are several examples of problems in qpdf prior to the implementation of the pages epic:
+handling of page-level data. When qpdf creates a new PDF file from existing PDF files, it starts
+with a specific PDF, known as the _primary input_. The primary input may be a file or the built-in
+_empty PDF_. Prior to the implementation of the pages epic, qpdf has ignored document-level data
+(except for page labels and interactive form fields) when merging and splitting files. Any
+document-level data in the primary input was preserved, and any document-level data other than form
+fields and page labels was discarded from the other files. After this work is complete, qpdf will
+handle other document-level data in a manner that preserves the functionality of all pages in the
+final PDF. Here are several examples of problems in qpdf prior to the implementation of the pages
+epic:
 * If two files with optional content (layers) are merged, all layers in all but the primary input
 will be visible in the combined file.
 * If two files with file attachments are merged, attachments will be retained on the primary input
@@ -46,9 +43,10 @@ Here are several examples of problems in qpdf prior to the implementation of the
 entirety, including outlines that point to pages that are no longer there, and outlines will be
 lost from all files except the primary input.
-With the above limitations, qpdf allows combining pages from arbitrary numbers of input PDFs to
-create an output PDF, or in the case of page splitting, multiple output PDFs. The API allows
-arbitrary combinations of input and output files. The command-line allows only the following:
+Regarding page assembly, prior to the pages epic, qpdf allows combining pages from arbitrary numbers
+of input PDFs to create an output PDF, or in the case of page splitting, multiple output PDFs. The
+API allows arbitrary combinations of input and output files. The command-line allows only the
+following:
 * Merge: creation of a single output file from a primary input and any number of other inputs by
 selecting pages by index from the beginning or end of the file
 * Split: creation of multiple output files from a single input or the result of a merge into files
@@ -79,10 +77,13 @@ Here are some examples of things that will become possible:
 The rest of this document describes the details of how these features will work and what needs
 to be done to make them possible to build.
-# Architectural Thoughts
+# Architecture
-Open question: if I do all the complex logic in `QPDFJob`, what are the implications for pikepdf or
-other wrappers? This will need to be discussed in the discussion ticket.
+Create a `QPDFAssembler` class to handle merging and a `QPDFSplitter` to handle splitting. The
+complex assembly logic can be handled by `QPDFAssembler`. `QPDFSplitter` can invoke `QPDFAssembler`
+with a previous `QPDFAssembler`'s output (or any `QPDF`) multiple times to create the split files.
+This will mostly involve moving code from `QPDFJob` to `QPDFAssembler` and `QPDFSplitter` and having
+`QPDFJob` invoke them.
 Prior to implementation of the pages epic, `QPDFJob` goes through the following stages:
@@ -123,8 +124,16 @@ Prior to implementation of the pages epic, `QPDFJob` goes through the following
 * Preserve form fields and page labels
 Broadly, the above has to be modified in the following ways:
-* From the C++ API, make it possible to use an arbitrary QPDF as an input rather than having to
-start with a file. That makes it possible to do arbitrary work on the PDF prior to submitting it.
+* The transformations step has to be pulled out as that will stay in `QPDFJob`.
+* Most of write QPDF will stay in `QPDFJob`, but the split logic will move to `QPDFSplitter`.
+* The entire create QPDF logic will move into `QPDFAssembler`.
+* `QPDFAssembler`'s API will allow using an arbitrary QPDF as an input rather than having to start
+with a file. That makes it possible to do arbitrary work on the PDF prior to passing it to
+`QPDFAssembler`.
+* `QPDFAssembler` and `QPDFSplitter` may need a C API, or perhaps C users will have to work through
+`QPDFJob`, which will expose nearly all of the functionality.
+Within `QPDFAssembler`, we will extend the create QPDF logic in the following ways:
 * Allow creation of blank pages as an additional input source
 * Generalize underlay/overlay
 * Enable controlling placement
@@ -132,17 +141,32 @@ Broadly, the above has to be modified in the following ways:
 * Add additional reordering options
 * We don't need to provide hooks for this. If someone is going to code a hook, they can just
 compute the page ordering directly.
-* Have a page composition phase after the overlay/underlay stage
+* Have a page composition stage after the overlay/underlay stage
 * Allow n-up, left-to-right (can reverse page order to get rtl), top-to-bottom, or modular
 composition like pstops
 * Add additional ways to select pages besides range (e.g. based on outlines)
-* Add additional ways to specify boundaries for splitting
 * Enhance existing logic to handle other document-level structures, preferably in a way that
 requires less duplication between split and merge.
 * We don't need to turn on and off most types of document constructs individually. People can
 preprocess using the API or qpdf JSON if they want fine-grained control.
 * For things like attachments and outlines, we can add additional flags.
+Within `QPDFSplitter`, we will add additional ways to specify boundaries for splitting.
+We must take care with the implementations and APIs for `QPDFSplitter`, `QPDFAssembler`, and
+`QPDFJob` to avoid excessive duplication. Perhaps `QPDFJob` can create and configure a
+`QPDFAssembler` and `QPDFSplitter` on the fly to avoid too much duplication of state.
+Much of the logic will actually reside in other helper classes. For example, `QPDFAssembler` will
+probably not operate with numeric ranges, leaving that to `QPDFJob` and `QUtil` but will instead
+have vectors of page numbers. The logic for creating page groups from outlines, threads, or
+structure will most likely live in the document helpers for those bits of functionality. This keeps
+needless clutter out of `QPDFAssembler` and also makes it possible for people to perform their own
+subset of functionality by calling lower-level interfaces. The main power of `QPDFAssembler` will be
+to manage sequencing and destination tracking as well as to provide a future-proof API that will
+allow developers to automatically benefit from additional document-level support as it is added to
+qpdf.
 ## Flexible Assembly
 This section discusses modifications to the command-line syntax to make it easier to add flexibility
@@ -189,10 +213,10 @@ are handled, specify placement options, etc. Given the above framework, it would
 additional features incrementally, without breaking compatibility, such as selecting or splitting
 pages based on tags, article threads, or outlines.
-It's tempting to allow assemblies to be nested, but this gets very complicated. From the C++ API, we
-could modify QPDFJob to allow the use any QPDF as an input, but supporting this from the CLI is hard
-because of the way JSON/arg parsing is set up. If people need to do that, they can just create
-intermediate files.
+It's tempting to allow assemblies to be nested, but this gets very complicated. From the C++ API,
+there is no problem using the output of one `QPDFAssembler` as the input to another, but supporting
+this from the CLI is hard because of the way JSON/arg parsing is set up. If people need to do that,
+they can just create intermediate files.
 Proposed CLI enhancements:
@@ -424,7 +448,8 @@ Last checked: 2023-12-29
 gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open
 ```
-* Allow an existing `QPDF` to be an input to a merge operation when using the QPDFJob C++ API
+* Allow an existing `QPDF` to be an input to a merge or underlay/overlay operation when using the
+`QPDFAssembler` C++ API
 * Issues: none
 * Generate a mapping from source to destination for all destinations
 * Issues: #1077
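
Note on the proposed architecture: the commit describes assembling pages once and then letting a splitter carve that result into several outputs, with page selections expressed as vectors of page numbers rather than numeric ranges. The following is a minimal C++ sketch of that division of labor only. It does not use qpdf, and `PageRef`, `assemble`, and `split` are hypothetical names made up for illustration; the real `QPDFAssembler` and `QPDFSplitter` interfaces are not defined by this commit.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for "a page from some input"; not a qpdf type.
struct PageRef
{
    int input; // index of the input document the page came from
    int page;  // 1-based page number within that input
};

// Hypothetical assembler step: take one vector of page numbers per input
// and concatenate the selected pages, in order, into a single sequence.
std::vector<PageRef>
assemble(std::vector<std::vector<int>> const& selections)
{
    std::vector<PageRef> out;
    for (std::size_t i = 0; i < selections.size(); ++i) {
        for (int p : selections[i]) {
            out.push_back({static_cast<int>(i), p});
        }
    }
    return out;
}

// Hypothetical splitter step: cut an assembled sequence into consecutive
// groups of at most n pages, each of which would become one output file.
std::vector<std::vector<PageRef>>
split(std::vector<PageRef> const& assembled, std::size_t n)
{
    if (n == 0) {
        throw std::invalid_argument("group size must be positive");
    }
    std::vector<std::vector<PageRef>> groups;
    for (std::size_t start = 0; start < assembled.size(); start += n) {
        std::size_t end = std::min(start + n, assembled.size());
        auto first = assembled.begin() + static_cast<std::ptrdiff_t>(start);
        auto last = assembled.begin() + static_cast<std::ptrdiff_t>(end);
        groups.emplace_back(first, last);
    }
    return groups;
}
```

For example, `split(assemble({{1, 2, 3}, {5}}), 3)` yields two groups: the three pages from the first input, then page 5 of the second. Keeping range parsing outside these functions mirrors the commit's stated intent to leave numeric ranges to `QPDFJob` and `QUtil`.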