qpdf/TODO-pages.md

# Pages

This document describes a project known as the _pages epic_. The goal of the pages epic is to enable
qpdf to properly preserve all functionality associated with a page as pages are copied from one PDF
to another (or back to the same PDF). A secondary goal is to add more flexiblity to the ways in
which documents can be split and combined (flexible assembly).

This is a work in progress. As implementation proceeds, details will become more solid. Comments are
encouraged. Please make comments in the [QPDF pages epic
discussion](https://github.com/qpdf/qpdf/discussions/1104). As ideas are refined, they will be
updated in this document.

# Tasks

This a breakdown of work. It is near the top of the file for easy access but doesn't make sense
without reading the rest of the file.

* Create `QPDFAssembler` and incrementally move functionality from `QPDFJob` into it.
* Create `QPDFSplitter` to use `QPDFAssembler`. Write a section in the manual describing how they
  work, but leave the detailed API documentation in the header files. Model after how `QPDFJob` is
  documented.
* Break down remaining work, to include coming up with a mechanism for tracking destinations and
  creating helper classes for other document-level operations, then incrementally adding support for
  each idea. Keep the manual updated as we go.

# Design and Background

Terminology:
* _Page-level data_: information that is contained within objects reachable from the page dictionary
  without traversing through any `/Parent` pointers
* _Document-level data_: information that is reachable from the document catalog (`/Root`) that is
  not reachable from a page dictionary as well as the `/Info` dictionary

PDF uses document-level data in a variety of ways. There is some document-level data that has each
of the following properties, among others:
* References pages by object ID (outlines, interactive forms)
* Doesn't reference any pages (embedded files)
* Doesn't reference any pages but influences page rendering (optional content/layers)
* Doesn't reference any pages but contains information about pages (page labels)
* Contains information used by pages (named destinations)

As long as qpdf has had the ability to copy pages from one PDF to another, it has had robust
handling of page-level data. When qpdf creates a new PDF file from existing PDF files, it starts
with a specific PDF, known as the _primary input_. The primary input may be a file or the built-in
_empty PDF_. Prior to the implementation of the pages epic, qpdf has ignored document-level data
(except for page labels and interactive form fields) when merging and splitting files. Any
document-level data in the primary input was preserved, and any document-level data other than form
fields and page labels was discarded from the other files. After this work is complete, qpdf will
handle other document-level data in a manner that preserves the functionality of all pages in the
final PDF. Here are several examples of problems in qpdf prior to the implementation of the pages
epic:
* If two files with optional content (layers) are merged, all layers in all but the primary input
  will be visible in the combined file.
* If two files with file attachments are merged, attachments will be retained on the primary input
  but dropped on the others. (qpdf has other ways to copy attachments from one file to another.)
* If two files with hyperlinks are merged, any hyperlink from other than primary input become
  non-functional.
* If two files with outlines are merged, the outlines from the original file will appear in their
  entirety, including outlines that point to pages that are no longer there, and outlines will be
  lost from all files except the primary input.

Regarding page assembly, prior to the pages epic, qpdf allows combining pages from arbitrary numbers
of input PDFs to create an output PDF, or in the case of page splitting, multiple output PDFs. The
API allows arbitrary combinations of input and output files. The command-line allows only the
following:
* Merge: creation of a single output file from a primary input and any number of other inputs by
  selecting pages by index from the beginning or end of the file
* Split: creation of multiple output files from a single input or the result of a merge into files
  whose primary input is the empty PDF and that contain a fixed number of pages per group
* Overlay/underlay: layering pages on top of each other with a maximum of one underlay and one
  overlay and with no ability to specify transformation of the pages (such as scaling, placing them
  in a particular spot).

The pages epic consists of two broad categories of work:
* Proper handling of document-level features when splitting and merging documents
* Flexible assembly: greatly increased flexibility in the ways in which pages can be selected from
  the various input files and combined for the output file. This includes creation of blank pages
  and composition of pages (n-up or other ways of combining multiple input pages into one output
  page)

Here are some examples of things that will become possible:

* Stacking arbitrary pages on top of each other with full control over transformation and cropping,
  including being able to access information about the various bounding boxes associated with the
  pages (generalization of underlay/overlay)
* Inserting blank pages
* Doing n-up page layouts
* Creating single very long or wide pages with output from other pages
* Re-ordering pages for printing booklets (also called signatures or printer spreads)
* Selecting pages based on the outline hierarchy, tags, or article threads
* Keeping only and all relevant parts of the outline hierarchies from all input files

The rest of this document describes the details of what how these features will work and what needs
to be done to make them possible to build.

# Architecture

Create a `QPDFAssembler` class to handle merging and a `QPDFSplitter` to handle splitting. The
complex assembly logic can be handled by `QPDFAssembler`. `QPDFSplitter` can invoke `QPDFAssembler`
with a previous `QPDFAssembler`'s output (or any `QPDF`) multiple times to create the split files.
This will mostly involve moving code from `QPDFJob` to `QPDFAssembler` and `QPDFSplitter` and having
`QPDFJob` invoke them.

Prior to implementation of the pages epic, `QPDFJob` goes through the following stages:

* create QPDF
  * update from JSON
  * page specs (`--pages`)
    * Create a QPDF for each input source
    * Figure out whether to keep files open
    * Remove unreferenced resources if needed
    * Remove pages from the pages tree
    * Handle collation
    * Copy or revive all final pages
      * When copying foreign pages, possibly remove unreferenced resources
      * Handle the same page copied more than once by cloning the page dictionary while sharing any
        indirect references
      * Preserve form fields and page labels
    * Delete pages from the primary input that were not used in the output
    * Delete unreferenced form fields
  * rotation
  * underlay/overlay
  * transformations
    * disable signatures
    * externalize images
    * optimize images
    * generate appearances
    * flatten annotations
    * coalesce contents
    * flatten rotation
    * remove page labels
    * remove attachments
    * add attachments
    * copy attachments
* write QPDF
  * One of:
    * Do inspections
    * Write single file
    * Split pages
      * Remove unreference resources if needed
      * Preserve form fields and page labels

Broadly, the above has to be modified in the following ways:
* The transformations step has to be pulled out as that will stay in `QPDFJob`.
* Most of write QPDF will stay in `QPDFJob`, but the split logic will move to `QPDFSplitter`.
* The entire create QPDF logic will move into `QPDFAssembler`.
* `QPDFAssembler`'s API will allow using an arbitrary QPDF as an input rather than having to start
  with a file. That makes it possible to do arbitrary work on the PDF prior to passing it to
  `QPDFAssembler`. The API contract will state that, if an existing `QPDF` is used as the primary
  input, the same `QPDF` will be returned. Effectively, `QPDFAssembler` will modify the `QPDF` in
  place.
* `QPDFAssembler` and `QPDFSplitter` may need a C API, or perhaps C users will have to work through
  `QPDFJob`, which will expose nearly all of the functionality.

Within `QPDFAssembler`, we will extend the create QPDF logic in the following ways:
* Allow creation of blank pages as an additional input source
* Generalize underlay/overlay
  * Enable controlling placement
  * Make repeatable (done)
* Add additional reordering options
  * We don't need to provide hooks for this. If someone is going to code a hook, they can just
    compute the page ordering directly.
* Have a page composition stage after the overlay/underlay stage
  * Allow n-up, left-to-right (can reverse page order to get rtl), top-to-bottom, or modular
    composition like pstops
* Add additional ways to select pages besides range (e.g. based on outlines)
* Enhance existing logic to handle other document-level structures, preferably in a way that
  requires less duplication between split and merge.
  * We don't need to turn on and off most types of document constructs individually. People can
    preprocess using the API or qpdf JSON if they want fine-grained control.
  * For things like attachments and outlines, we can add additional flags.

Within `QPDFSplitter`, we will add additional ways to specify boundaries for splitting.

We must take care with the implementations and APIs for `QPDFSplitter`, `QPDFAssembler`, and
`QPDFJob` to avoid excessive duplication. Perhaps `QPDFJob` can create and configure a
`QPDFAssembler` and `QPDFSplitter` on the fly to avoid too much duplication of state.

Much of the logic will actually reside in other helper classes. For example, `QPDFAssembler` will
probably not operate with numeric ranges, leaving that to `QPDFJob` and `QUtil` but will instead
have vectors of page numbers. The logic for creating page groups from outlines, threads, or
structure will most likely live in the document helpers for those bits of functionality. This keeps
needless clutter out of `QPDFAssembler` and also makes it possible for people to perform their own
subset of functionality by calling lower-level interfaces. The main power of `QPDFAssembler` will be
to manage sequencing and destination tracking as well as to provide a future-proof API that will
allow developers to automatically benefit from additional document-level support as it is added to
qpdf.

Other notes:
* Per jbarlow, raw pointers are hard to work with in pybind11. Use references or shared pointers
  instead for fluent interfaces.


## Flexible Assembly

This section discusses modifications to the command-line syntax to make it easier to add flexibility
going forward without breaking backward compatibility. In qpdf 11.9.0, we added non-positional
options to `--pages`, `--overlay`, `--underlay` and modifid configuration to make it easier to add
new options.

In several cases, we allow specification of transformations or placements. In this context:
* The origin is always lower-left corner.
* A _dimension_ may be absolute or relative.
  * An _absolute dimension_ is `{n}` (in points), `{n}in` (inches), `{n}cm` (centimeters),
  * A _relative dimension_ is expressed in terms of the corresponding dimension of one of a page's
    boxes. Which dimension is determined by context.
    * `{n}{M|C|B|T|A}` is `{n}` times the corresopnding dimension of the media, crop, bleed, trim,
      or art box. Example: `0.5M` would be half the width or height of the media box.
    * `{n}+{M|C|B|T|A}` is `{n}` plus the corresponding dimension. Example: `-0.5in+T` is half an
      inch (36 points) less than the width or height of the trim box.
* A _size_ is
  * `{w}x{h}`, where `{w}` and `{h}` are dimensions
  * `letter|a4` (potentially add other page sizes)
* A _position_ is `{x}x{y}` where `{x}` and `{y}` are dimensions offset from the origin
* A _rectangle_ is `{llx},{lly},{urx},{ury}` (lower|upper left|right x|y) with `llx` < `urx` and
  `lly` < `ury`
  * Examples:
    * `0.1M,0.1M,0.9M,0.9M` is a box whose llx is 10% of the media box width, lly is 10% of the
      height, urx is 90% of the width, and ury is 90% of the height
    * `0,0,612,792` is a box whose size is that of a US Letter page.
  * A rectangle may also be just one of `M|C|B|T|A` to refer to a page's media, crop, bleed, trim,
    or art box.

It's tempting to allow assemblies to be nested, but this gets very complicated. From the C++ API,
there is no problem using the output of one `QPDFAssembler` as the input to another, but supporting
this from the CLI is hard because of the way JSON/arg parsing is set up. If people need to do that,
they can just create intermediate files.

Proposed CLI enhancements:

```
# --pages: inputs
--file=x [ --password=x ]
--blank=n  [ --size={size} [ --size-from-page=n ] ]  # see below
# modifiers refer to most recent input
--range=...
--with-attachments={none|all|referenced}  # default = referenced
--with-outlines={none|all|referenced}  # default = referenced
--...  # future options to select pages based on outlines, article threads, tags, etc.
# placement (matrix transformation -- see notes below)
--rotate=[+-]angle[:page-range]  # existing
--scale=x,y[:page-range]
--translate=dx,dy[:page-range]  # dx and dy are dimensions
--flip={h|v}[:page-range]
--transform=a,b,c,d,e,f[:page-range]
--set-box={M|C|B|T|A}=rect[:page-range]  # change a bounding box
# stacking -- make --underlay and --overlay repeatable
--{underlay|overlay} ... --
--file=x [ --password=x ]
--from, --to, --repeat  # same as current --overlay, --underlay
--from-rect={rect} # default = T -- see notes
--to-rect={rect} # default = M -- see notes
# composition -- a new QPDFJob stage between stacking and transformation
--compose=...  # see notes
--n-up={2,4,6,9,16}
--concat={h|v}  # concatenate all pages to a single big page
# reordering
--collate=a,b,c  # exists
--booklet=...  # re-order pages for book signatures like psbook -- see notes
# split
--split-pages=n  # existing
--split-after=a,b,c  # split after each named page
--...  # future options to split based on outlines, article threads, tags, etc.
```

Notes:
* For `--blank`, `--size` specifies the size of the blank page. If any relative dimensions are used,
  `--size-from-page=n` must be used to specify the page (from n in the overall input) that relative
  dimensions should be taken from. It is an error to specify a relative size based on another blank
  page. (Let's not complicate things by doing a graph traversal to find an eventual absolute page.
  Just disallow a blank page to specified relative to another blank page.)
* For stacking, the default is to map the source page's trim box onto the destination page's
  mediabox. This is a weird default, but it's there for compatibility. The `--from-rect` and
  `--to-rect` may be used to map an arbitrary region of the over/underlay file into an arbitrary
  region of a page. With the defaults, an overlay or underlay page will be stretched or shrunk if
  pages are of variable size. Absolute rectangles can be used to avoid this. If a rectangle uses
  relative dimensions, they are relative to the page that has the rectangle. You can't create a
  `--to-rect` relative to the size of the from page or vice versa. If you need to do this, use
  external logic to compute the rectangles and then use absolute rectangles.
* `--compose`: XXX
* `--booklet`: XXX
* I'm not sure what impact composition should have on page labels. Most likely, we should drop page
  labels on composition. If someone wants them, they can use `--set-page-labels`.

### Compose, Booklet

This section needs to be fleshed out. It is probably lower priority than document-level work.

Here are some ideas from pstops. The following is an excerpt from the pstops manual page. Maybe we
can come up with something similar using our enhanced rectangle syntax.

This section contains some sample re‐arrangements. To put two pages on one sheet (of A4 paper),
the pagespec to use is:
```
2:0L@.7(21cm,0)+1L@.7(21cm,14.85cm)
```
To select all of the odd pages in reverse order, use:
```
2:‐0
```
To re‐arrange pages for printing 2‐up booklets, use
```
4:‐3L@.7(21cm,0)+0L@.7(21cm,14.85cm)
```
for the front sides, and
```
4:1L@.7(21cm,0)+‐2L@.7(21cm,14.85cm)
```
for the reverse sides (or join them with a comma for duplex printing).

From issue #493
```
 pdf2ps infile.pdf infile.ps
 ps2ps -pa4 "2:0R(4.5cm,26.85cm)+1R(4.5cm,14.85cm)" infile.ps outfile.ps
 ps2pdf outfile.ps outfile.pdf
 ```

Notes on signatures (psbook). For a signature of size 3, we have the following assuming a 2-up
configuration that is printed double-sided so that, when the whole stack is placed face-up and
folded in half, page 1 is on top.
* front: 6,7, back: 8,5
* front: 4,9, back: 10,3
* front: 2,11, back: 12,1

This is the same as duplex 2-up with pages in order 6, 7, 8, 5, 4, 9, 10, 3, 2, 11, 12, 1

n-up:
* For 2-up, calculate new w and h such that w/h maintains a fixed ratio and w and h are the largest
  values that can fit within 1/2 the page with specified margins.
* Can support 1, 2, 4, 6, 9, 16. 2 and 6 require rotation. The others don't. Will probably need to
  change getFormXObjectForPage to handle other boxes than trim box.
* Maybe define n-up a scale and rotate followed by fitting the result into a specified rectangle. I
  might already have this logic in QPDFAnnotationObjectHelper::getPageContentForAppearance.

## Destinations

We will have to keep track of destinations that point to a page when the page is moved or copied.
For example, if an outline has a destination that points to a particular rectangle on page 5 of the
second file, and we end up dropping a portion of that page into an n-up configuration on a specific
output page, we will have to keep track of enough information to replace the destination with a new
one that points to the new physical location of the same material. For another example, consider a
case in which the left side of page 3 of the primary input ends up as page 5 of the output and the
right side of page 3 ends up as page 6. We would have to map destinations from a single source page
to different destination pages based on which part of the page it was on. If part of the rectangle
points to one page and part to another, what do we do? I suggest we go with the top/center of the
rectangle.

A destination consists of a QPDF, page object, and rectangle in user coordinates. When `QPDFJob`
copies a page or converts it to a form XObject, possibly with transformations applied, it will have
to be able to map a destination to the same triple (QPDF, page object, rectangle) on all pages that
contain data from the original page. When writing the final output, any destination that no longer
points anywhere should be dropped, and any destination that points to multiple places will need to
be handled according to some specification.

Whenever we create any new thing from a page, we create _derived page data_. Examples of derived
page data would include a copy of the page and a form XObject created from a page. We will have to
keep a mapping from any source page to all of its derived objects along with any transformations or
clipping. When a derived page data object is placed on a final page, that information can be
combined with the position and any transformations onto the final page to be able to map any
destination to a new one or to determine that it points outside of the visible area. There is
already code in placeFormXObject and the code that places appearance streams that deals with these
kinds of mappings.

What do we do if a source page is copied multiple times? I think we will have to just make the new
destination point to the first place that the target appears with precedence going to the original
location. If we can detect this, we can give a warning.

# Document-level Behavior

Both merging and splitting contain logic, sometimes duplicated, to handle page labels, form fields,
and annotations. We will need to build logic for other things. This section is a rough breakdown of
the different things in the document catalog (plus the info dictionary, which is referenced from the
trailer) and how we may have to handle them. We will need to implement various ObjectHelper and
DocumentHelper classes.

7.7.2 contains the list of all keys in the document catalog.

Document-level structures to merge:
* Extensions
  * Must be combination of Extensions from all input files
* PageLabels
  * Ensure each page has its original label
  * Allow post-processing
* Names -- see below
  * Combine per tree
  * May require disambiguation
  * Page: TemplateInstantiated
* Dests
  * Keep referenced destinations across all files
  * May need to disambiguate or "flatten" or convert to named dests with the names tree
* Outlines
* Threads (easy)
  * Page: B
* AcroForm
* StructTreeRoot
  * Page: StructParents
  * See jbarlow's comments in #1104 for additional notes
* MarkInfo (see 14.7 - Logical Structure, 14.8 Tagged PDF)
* SpiderInfo
  * Page: ID
* OutputIntents
  * Page: OutputIntents
* PieceInfo
  * Page: PieceInfo
* OCProperties
* Requirements
* AF (file specification dictionaries)
  * Page: AF
* DPartRoot
  * Page: DPart
* Version
  * Maximum

Things that stay with the first document that has one and/or will not be supported
* AA (Additional Actions)
  * Would be possible to combine and let the first contributor win, but it probably wouldn't usually
    be what we want.
* Info (not part of document catalog)
* ViewerPreferences
* PageLayout
* PageMode
* OpenAction
* URI
* Metadata
* Lang
* NeedsRendering
* Collection
* Perms
* Legal
* DSS

Name dictionary (7.7.4)
* Dests
* AP (appearance streams)
* JavaScript
* Pages (named pages)
* Templates
  * Combine across all documents
  * Page: TemplateInstantiated points to a named page
* IDS
* URLS
* EmbeddedFiles
* AlternatePresentations
* Renditions

Most of chapter 12 applies. See Document-level navigation (12.3).

# Feature to Issue Mapping

Last checked: 2024-01-18

```
gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open
```

* Allow an existing `QPDF` to be an input to a merge or underly/overlay operation when using the
  `QPDFAssembler` C++ API
  * Issues: none
* Fixes to copying annotations
  * Issues: #1116
  * Notes:
    * This is a PR that includes some failing test cases
    * Fix `/P`
    * Allow copying of annotations from a region of a page (not sure I want to add that)
    * Allow selection of pages without annotations (not sure I want to do that)
* Generate a mapping from source to destination for all destinations
  * Issues: #1077
  * Notes:
    * Source can be an outline or link, either directly or via action. If link, it should include
      the page.
    * Destination can be a structure destination, which should map to a regular destination.
    * source: page X -> link -> action -> dest: page Y
    * source: page X -> link -> action -> dest: structure -> page Y
    * Consider something in json that dumps this.
    * We will need to associate this with a QPDF. It would be great if remote or embedded go-to
      actions could be handled, but that's ambitious.
    * It will be necessary to keep some global map that includes all QPDF objects that are part of
      the final file.
    * An interesting use case to consider would be to create a QPDF object from an embedded file and
      append the embedded file and make the embedded actions work. This would probably require some
      way to tell qpdf that a particular external file came from an embedded file.
* Control size of page and position/transformation of overlay/underlay
  * Issues: #1031, #811, #740, #559
  * Notes:
    * It should be possible to define a destination page from scratch or in terms of other pages and
      then place page contents onto it with arbitrary transformations applied.
    * It should be possible to compute the size of the destination page in terms of the source
      pages, e.g., to create one long or wide page from other pages.
    * Also allow specification of which page box to use
* Preserve hyperlinks when doing any page operations
  * See also "Generate a mapping from source to destination for all destinations"
  * Issues: #1003, #797, #94
  * Notes:
    * A link annotation that points to a destination rather than an external URL should continue to
      work when files are split or merged.
* Awareness of structured and tagged PDF (14.7, 14.8)
  * Issues: #957, #953, #490
  * Notes:
    * This looks complicated. It may be not be possible to do this fully in the first increment, but
      we have to keep it in mind and warn if we can't and we see /SD in an action.
    * #490 has some good analysis
* Interleave pages with ordering
  * Issues: #921
  * Notes:
    * From 921: interleave odd pages and reversed even pages. This might require different handling
      for even/odd numbers of pages. Make sure it's natural for the cases of len(odd) == len(even)
      or len(odd) == 1+len(even)
* Preserve all attachments when merging files
  * Issues: #856
  * Notes:
    * If all pages of a file are selected, keep all attachments
    * If some pages of a file are selected
      * Keep all attachments if there are any embedded file annotations
      * Otherwise, what? Do we have a keep-attachments flag of some sort? Or do we just make the
        user copy attachments from one file to another?
* Apply clipping to a page
  * Issues: #771
  * Notes:
    * Create a form xobject from a page, then apply a specific clipping region expressed in
      coordinates or as a percentage
* Ability to create a blank page
  * Issues: #753
  * Notes:
    * Create a blank page of a specific size or of the same size as another page
* Split groups with explicit boundaries
  * Issues: #741, #616
  * Notes:
    * Example: --split-after a,b,c
* Handle Optional Content (layers) (8.11)
  * Issues: #672, #9, #570
* Scale a page up or down to fit to a size
  * Issues: #611
* Place contents of pages adjacent horizontally or vertically on one page
  * Issues: #1040, #546
* nup, booklet
  * Issues: #493, #461, #152
  * Notes:
    * #461 may want the inverse of booklet and discusses reader and printer spreads
* Flexible multiplexing
  * Issues: #505 (already implemented with --collate)
* Split pages based on outlines
  * Issues: #477
* Keep relevant parts of outline hierarchy
  * Issues: #457, #356, #343, #323
  * Notes:
    * There is some helpful discussion in #343 including
      * Preserving open/closed status
      * Preserving javascript actions
* Split pages: write pages to memory
  * Issues: #1130

# Other use cases

* Other ways to specify pages besides numeric range
  * all pages reachable from a section of the outline hierarchy
  * something based on threads or document structure
  * selection based on page labels
* Placement for composition, overlay, underlay
  * Scale the smaller page up to the size of the larger page
  * Center the smaller page horizontally and bottom-align the trim boxes