mirror of https://github.com/qpdf/qpdf.git synced 2024-05-29 00:10:54 +00:00

Jay Berkenbilt de3094124d Include information from all pages issues in TODO-pages.md

2024-01-01 18:57:32 -05:00

17 KiB


Pages

THIS IS A WORK IN PROGRESS. THE ACTUAL IMPLEMENTATION MAY NOT LOOK ANYTHING LIKE THIS. When this gets to the stage where it is starting to congeal into an actual plan, I will remove this disclaimer and open a discussion ticket in GitHub to work out details.

This file contains plans and notes regarding implementation of the "pages epic." The pages epic consists of the following features:

Proper handling of document-level features when splitting and merging documents
Insertion of blank pages
More flexible ways of
- selecting pages from one or more documents
- composing pages out of other pages
  - underlay and overlay with control over position, transformation, and bounding box selection
- organizing pages
  - n-up
  - booklet generation ("signatures", as in what psbook does)
Possibly others pending analysis of open issues and public discussion

Feature to Issue Mapping

Last checked: 2023-12-29

Questions/ideas
- I have often wondered whether we need to be able to attach arbitrary metadata to a QPDFObjectHandle (or object or value) and to control whether it should be included in copies. For example, one could attach to a page which qpdf id and page number it came from, then carry that around as the page was converted to a form xobject, inserted into a foreign file, etc. It feels like something like that will be needed to support some of these features.
Generate a mapping from source to destination for all destinations
- Issues: #1077
- Notes:
  - Source can be an outline or link, either directly or via action. If link, it should include the page.
  - Destination can be a structure destination, which should map to a regular destination.
  - source: page X -> link -> action -> dest: page Y
  - source: page X -> link -> action -> dest: structure -> page Y
  - Consider something in json that dumps this.
  - We will need to associate this with a QPDF. It would be great if remote or embedded go-to actions could be handled, but that's ambitious.
  - It will be necessary to keep some global map that includes all QPDF objects that are part of the final file.
  - An interesting use case to consider would be to create a QPDF object from an embedded file and append the embedded file and make the embedded actions work. This would probably require some way to tell qpdf that a particular external file came from an embedded file.
Control size of page and position/transformation of overlay/underlay
- Issues: #1031, #811, #740, #559
- Notes:
  - It should be possible to define a destination page from scratch or in terms of other pages and then place page contents onto it with arbitrary transformations applied.
  - It should be possible to compute the size of the destination page in terms of the source pages, e.g., to create one long or wide page from other pages.
  - Also allow specification of which page box to use
Preserve hyperlinks when doing any page operations
- See also "Generate a mapping from source to destination for all destinations"
- Issues: #1003, #797, #94
- Notes:
  - A link annotation that points to a destination rather than an external URL should continue to work when files are split or merged.
Awareness of structured and tagged PDF (14.7, 14.8)
- Issues: #957, #953, #490
- Notes:
  - This looks complicated. It may not be possible to do this fully in the first increment, but we have to keep it in mind and warn if we can't and we see /SD in an action.
  - #490 has some good analysis
Assign page labels
- Issues: #939
- Notes:
  - #939 has a good proposal
  - This could be applied to page groups, and we could have an option to keep the labels as they are in a given group, which is what qpdf does now.
Interleave pages with ordering
- Issues: #921
- Notes:
  - From #921: interleave odd pages and reversed even pages. This might require different handling for even/odd numbers of pages. Make sure it's natural for the cases of len(odd) == len(even) or len(odd) == 1+len(even)
Preserve all attachments when merging files
- Issues: #856
- Notes:
  - If all pages of a file are selected, keep all attachments
  - If some pages of a file are selected
    - Keep all attachments if there are any embedded file annotations
    - Otherwise, what? Do we have a keep-attachments flag of some sort? Or do we just make the user copy attachments from one file to another?
Create page group by excluding pages
- Issues: #790, #564
- Notes:
  - Handle cases in PageSelector below
Apply clipping to a page
- Issues: #771
- Notes:
  - Create a form xobject from a page, then apply a specific clipping region expressed in coordinates or as a percentage
Ability to create a blank page
- Issues: #753
- Notes:
  - Create a blank page of a specific size or of the same size as another page
Split groups with explicit boundaries
- Issues: #741, #616
- Notes:
  - Example: --split-after a,b,c
Handle Optional Content (8.11)
- Issues: #672, #9
Scale a page up or down to fit to a size
- Issues: #611
Place contents of pages adjacent horizontally or vertically on one page
- Issues: #1040, #546
nup, booklet
- Issues: #493, #461, #152
- Notes:
  - #461 may want the inverse of booklet and discusses reader and printer spreads
Flexible multiplexing
- Issues: #505
Split pages based on outlines
- Issues: #477
Keep relevant parts of outline hierarchy
- Issues: #457, #356, #343, #323
- Notes:
  - There is some helpful discussion in #343 including
    - Preserving open/closed status
    - Preserving javascript actions
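As a rough illustration of the interleaving note under #921 above, here is a minimal Python sketch. It is only one reading of that note: it assumes two page groups are supplied, the odd pages in order and the even pages in reverse order (as happens when a double-sided stack is scanned one side at a time). The function name and the use of plain lists as page groups are assumptions made for illustration, not anything qpdf provides.

```python
def interleave_with_reversed(odd, even_reversed):
    """Collate odd pages with even pages that were supplied in reverse order.

    Both arguments are hypothetical page groups, modeled here as plain lists.
    """
    even = list(reversed(even_reversed))
    # Only two shapes are natural: equal lengths, or exactly one extra odd page.
    if len(odd) - len(even) not in (0, 1):
        raise ValueError(
            "expected len(odd) == len(even) or len(odd) == 1 + len(even)"
        )
    result = []
    for index, page in enumerate(odd):
        result.append(page)
        # The final odd page has no partner when there is one extra odd page.
        if index < len(even):
            result.append(even[index])
    return result


collated = interleave_with_reversed(["p1", "p3", "p5"], ["p6", "p4", "p2"])
```

Rejecting any other length difference keeps the two expected cases natural and makes a mismatched pair of inputs fail loudly instead of silently dropping pages.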

Architectural Thoughts

I want to encapsulate various aspects of the logic into interfaces that can be implemented by developers to add their own logic. It should be easy to contribute these. Here are some rough ideas.

A page group is just a group of pages.

PageSelector -- creates page groups from other page groups
PageTransformer -- selects a part of a page and possibly transforms it; applies to all pages of a group. Based on the page dictionary; does not look at the content stream
PageFilter -- apply arbitrary code to a page; may access the content stream
PageAssembler -- combines pages from groups into new groups whose pages are each assembled from corresponding pages of the input groups

These should be able to be composed in arbitrary ways. There should be a natural API for doing this, and there should be some specification, probably based on JSON, that can be provided on the command line or embedded in the job JSON format. I have been considering whether a lisp-like S-expression syntax may be less cumbersome to work with. I'll have to decide whether to support this or some other syntax in addition to a JSON representation.
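To make the idea of composition concrete, here is a small Python sketch. Everything in it is hypothetical: pages are modeled as strings, a page group as a list, a selector as a function from group to group, and an assembler as a function that combines corresponding pages of several groups. It is not a proposal for the real API.

```python
from typing import Callable, List, Sequence

Page = str                      # stand-in for a real page object
PageGroup = List[Page]
Selector = Callable[[PageGroup], PageGroup]


def odd_pages(group: PageGroup) -> PageGroup:
    """Select pages at 1-based odd positions."""
    return group[0::2]


def reverse(group: PageGroup) -> PageGroup:
    """Select the same pages in reverse order."""
    return group[::-1]


def compose(*steps: Selector) -> Selector:
    """Chain selectors left to right into a single selector."""
    def run(group: PageGroup) -> PageGroup:
        for step in steps:
            group = step(group)
        return group
    return run


def assemble(groups: Sequence[PageGroup],
             combine: Callable[[Sequence[Page]], Page]) -> PageGroup:
    """Build each output page from the corresponding pages of every input group."""
    return [combine(pages) for pages in zip(*groups)]


reversed_odds = compose(odd_pages, reverse)(["p1", "p2", "p3", "p4", "p5"])
overlaid = assemble([["a1", "a2"], ["b1", "b2"]], lambda pages: "+".join(pages))
```

Treating every stage as a plain function over page groups is what lets the pieces nest freely; a JSON or S-expression form would then only need to name the functions and their arguments.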

There also needs to be something to represent how document-level structures relate to this. I'm not sure exactly how this should work, but we need things like

what to do with page labels, especially when assembling pages from other pages
whether to preserve destinations (outlines, links, etc.), particularly when pages are duplicated
- If A refers to B and there is more than one copy of B, how do you decide which copies of A link to which copies of B?
what to do with pages that belong to more than one group, e.g., what happens if you used document structure or outlines to form page groups and a group boundary lies in the middle of the page

Maybe page groups can have arbitrary, user-defined tags so we can specify that links should only point to other pages with the same value of some tag. We can probably support many-to-one links if the source is duplicated.
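One way to picture the tag idea is the following Python sketch. The `PageCopy` type, its fields, and `resolve_link` are all invented for illustration; the sketch only shows how a tag value could disambiguate which copy of a duplicated page a link should target.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PageCopy:
    source: str  # hypothetical identifier of the original page
    tag: str     # hypothetical user-defined tag of the group holding this copy


def resolve_link(link_owner: PageCopy, target_source: str,
                 copies: Sequence[PageCopy]) -> Optional[PageCopy]:
    """Pick the copy of the target page that shares the link owner's tag."""
    for candidate in copies:
        if candidate.source == target_source and candidate.tag == link_owner.tag:
            return candidate
    return None  # no copy with a matching tag: the link cannot be preserved


copies = [
    PageCopy("A", "draft"), PageCopy("B", "draft"),
    PageCopy("A", "final"), PageCopy("B", "final"),
]
target = resolve_link(copies[2], "B", copies)
```

Returning `None` when no tagged copy matches leaves room for a policy decision, such as dropping the link or warning.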

We probably need to hold onto the concept of the primary input file. If there is a primary input file, there may need to be a way to specify what gets preserved from it. The behavior of qpdf prior to all of this is to preserve all document-level constructs from the primary input file and to try to preserve page labels from other input files when combining pages.

Here are some examples.

PageSelector
- all pages from an input file
- pages from a group using a NumericRange
- concatenate groups
- pages from a group in reverse order
- a group repeated as often as necessary until a specified number of pages is reached
- a group padded with blank pages to create a multiple of n pages
- odd or even pages from a group
- every nth page from a group
- pages interleaved from multiple groups
- the left-front (left-back, right-front, right-back) pages of a booklet with signatures of n pages
- all pages reachable from a section of the outline hierarchy or something based on threads or other structure
- selection based on page labels
- pages in a group except pages in another group
PageTransformer
- clip to media box (trim box, crop box, etc.)
- clip to specific absolute or relative size
- scale
- translate
- rotate
- apply transformation matrix
PageFilter
- optimize images
- flatten annotations
PageAssembler
- Overlay/underlay all pages from one group onto corresponding pages from another group
  - Control placement based on properties of all the groups, so higher order than a stand-alone transformer
  - Examples
    - Scale the smaller page up to the size of the larger page
    - Center the smaller page horizontally and bottom-align the trim boxes
- Generalized overlay/underlay allowing n pages in a given order with transformations.
- n-up -- application of generalized overlay/underlay
- make one long page with an arbitrary number of pages one after the other (#546)
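A few of the PageSelector examples above can be sketched in Python as follows. The function names, the `BLANK` placeholder, and the use of lists of strings as page groups are assumptions for illustration only.

```python
BLANK = "<blank>"  # hypothetical stand-in for a generated blank page


def pad_to_multiple(group, n):
    """Pad a group with blank pages so its length is a multiple of n."""
    missing = (n - len(group) % n) % n
    return list(group) + [BLANK] * missing


def every_nth(group, n, start=0):
    """Take every nth page, beginning at 0-based index start."""
    return list(group[start::n])


def except_pages(group, excluded):
    """Pages in a group except pages in another group."""
    excluded_set = set(excluded)
    return [page for page in group if page not in excluded_set]


def repeat_to_length(group, count):
    """Repeat a group as often as necessary until count pages are reached."""
    if not group:
        raise ValueError("cannot repeat an empty group")
    return [group[i % len(group)] for i in range(count)]


padded = pad_to_multiple(["a", "b", "c", "d", "e"], 4)
```

Each selector returns a new list and never mutates its input, which keeps selectors safe to reuse on the same group in more than one place.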

It should be possible to represent all of the existing qpdf operations using the above framework. It would be good to re-implement all of them in terms of this framework to exercise it. We will have to look through all the command-line arguments and make sure. Of course also make sure suggestions from issues can be implemented or at least supported by adding new selectors.

Here are a few bits of scratch work. The top-level call is a selector. This doesn't capture everything. Implementing this would be tedious and challenging. It could be done using JSON arrays, but it would be clunky. This feels over-designed and possibly in conflict with QPDFJob.

(concat
 (primary-input)
 (file "file2.pdf")
 (page-range (file "file3.pdf") "1-4,5-8")
)

(with
 ("a"
  (concat
   (primary-input)
   (file "file2.pdf")
   (page-range (file "file3.pdf") "1-4,5-8")
  )
 )
 (concat
  (even-pages (from "a"))
  (reverse (odd-pages (from "a")))
 )
)

(with
 ("a"
  (concat
   (primary-input)
   (file "file2.pdf")
   (page-range (file "file3.pdf") "1-4,5-8")
  )
  "b-even"
  (even-pages (from "a"))
  "b-odd"
  (reverse (odd-pages (from "a")))
 )
 (stack
  (repeat-range (from "a") "z")
  (pad-end (from "b"))
 )
)

Easier to parse but yuck:

["with",
 ["a",
  ["concat",
   ["primary-input"],
   ["file", "file2.pdf"],
   ["page-range", ["file", "file3.pdf"], "1-4,5-8"]
  ],
  "b-even",
  ["even-pages", ["from", "a"]],
  "b-odd",
  ["reverse", ["odd-pages", ["from", "a"]]]
 ],
 ["stack",
  ["repeat-range", ["from", "a"], "z"],
  ["pad-end", ["from", "b"]]
 ]
]
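To get a feel for how the JSON-array form could be evaluated, here is a minimal Python sketch of a recursive evaluator. It is purely illustrative: it covers only `primary-input`, `file`, `from`, `concat`, `reverse`, `odd-pages`, `even-pages`, and `with`, it treats the binding list of `with` as alternating names and expressions evaluated in order, and it models files as a dictionary of page lists. The `"<primary>"` key and the `evaluate` function are assumptions, and `page-range`, `stack`, `repeat-range`, and `pad-end` are deliberately left out.

```python
import json


def evaluate(expr, files, env=None):
    """Evaluate a JSON-array page expression into a list of pages."""
    env = dict(env or {})  # copy so bindings do not leak to the caller
    op, *args = expr
    if op == "primary-input":
        return list(files["<primary>"])
    if op == "file":
        return list(files[args[0]])
    if op == "from":
        return list(env[args[0]])
    if op == "concat":
        result = []
        for arg in args:
            result.extend(evaluate(arg, files, env))
        return result
    if op == "reverse":
        return evaluate(args[0], files, env)[::-1]
    if op == "odd-pages":
        return evaluate(args[0], files, env)[0::2]
    if op == "even-pages":
        return evaluate(args[0], files, env)[1::2]
    if op == "with":
        bindings, body = args
        # Bindings alternate name, expression; later ones may use earlier ones.
        for name, value in zip(bindings[0::2], bindings[1::2]):
            env[name] = evaluate(value, files, env)
        return evaluate(body, files, env)
    raise ValueError(f"unknown operator: {op}")


files = {"<primary>": ["p1", "p2", "p3"], "file2.pdf": ["q1"]}
expr = json.loads("""
["with",
 ["a", ["concat", ["primary-input"], ["file", "file2.pdf"]]],
 ["concat",
  ["even-pages", ["from", "a"]],
  ["reverse", ["odd-pages", ["from", "a"]]]]]
""")
pages = evaluate(expr, files)
```

The evaluator itself is short, which suggests that the clunkiness of the array form is mostly a matter of how it reads and writes, not of how hard it is to process.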

To-do list

Go through all issues marked with the pages label and ensure that any ideas are represented here. Keep a list with mappings back to the issue number.
- gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open
When ready, open a discussion ticket.
Flesh out an implementation plan.

Supporting Document-level Features

qpdf needs full support for document-level features like article threads, outlines, etc. There is no support for some things and partial support for others. See notes below for a comprehensive list.

Most likely, this will be done by creating DocumentHelper and ObjectHelper classes.

It will be necessary not only to read information about these structures from a single PDF file as the existing document helpers do but also to reconstruct or update these based on modifications to the pages in a file. I'm not sure how to do that, but one idea would be to allow a document helper to register a callback with QPDFPageDocumentHelper that notifies it when a page is added or removed. This may be able to take other parameters such as a document helper from a foreign file.
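The callback idea could look roughly like the following Python sketch. The `PageList` class is a hypothetical stand-in and does not reflect the actual QPDFPageDocumentHelper interface; the sketch only shows helpers registering a listener and being told when pages are added or removed, with an optional reference to where the page came from.

```python
class PageList:
    """Hypothetical page container that notifies registered listeners."""

    def __init__(self):
        self._pages = []
        self._listeners = []

    def register(self, listener):
        """Register a callable taking (event, page, source)."""
        self._listeners.append(listener)

    def add_page(self, page, source=None):
        # source could carry something like a helper from a foreign file.
        self._pages.append(page)
        self._notify("added", page, source)

    def remove_page(self, page):
        self._pages.remove(page)
        self._notify("removed", page, None)

    def _notify(self, event, page, source):
        for listener in self._listeners:
            listener(event, page, source)


events = []
pages = PageList()
pages.register(lambda event, page, source: events.append((event, page, source)))
pages.add_page("p1", source="other.pdf")
pages.remove_page("p1")
```

A helper for outlines or threads would replace the lambda with logic that updates its own structures as pages come and go.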

Since these operations can be expensive, there will need to be a way to opt in and out. The default (to be clearly documented) should be that all supported document-level constructs are preserved. That way, as new features are added, changes to the output of previous operations to include information that was previously omitted will not constitute a non-backward compatible change that requires a major version bump. This will be a default for the API when using the higher-level page assembly API (below) as well as the CLI.

There will also need to be some kind of support for features that are document-level and not tied to any pages, such as (sometimes) embedded files. When splitting/merging files, there needs to be a way to specify what should happen with those things. Perhaps the default here should be that these are preserved from files from which all pages are selected. For some things, like viewer preferences, it may make sense to take them from the first file.
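The suggested default for page-independent items such as embedded files can be sketched as below. This is an assumption-laden illustration: the function name and the dictionary inputs are invented, and the rule encoded is only the one floated above, namely keeping such items from files from which all pages are selected.

```python
def files_keeping_attachments(page_counts, selections):
    """Return the input files whose pages were all selected.

    page_counts maps a file name to its number of pages; selections maps a
    file name to the 1-based page numbers taken from it. Both are hypothetical.
    """
    keep = []
    for name, count in page_counts.items():
        selected = set(selections.get(name, ()))
        if selected == set(range(1, count + 1)):
            keep.append(name)
    return keep


kept = files_keeping_attachments(
    {"a.pdf": 3, "b.pdf": 2},
    {"a.pdf": [1, 2, 3], "b.pdf": [2]},
)
```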

Page Assembly (page selection)

In addition to the existing numeric ranges of page numbers, page selection could be driven by document-level features like the outlines hierarchy or article threads. There have been a lot of suggestions about this in various tickets. There will need to be some kind of page manipulation class with configuration options. I'm thinking something similar to QPDFJob, where you construct a class and then call a bunch of methods to configure it, including the ability to configure with JSON. Several suggestions have been made in issues, which I will go through and distill into a list. Offhand, some ideas include being able to split based on explicit chunks and being able to do all pages except a list of pages.
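Splitting on explicit chunks could behave like this minimal Python sketch. The function name and the interpretation of the boundaries (1-based page numbers after which a new group starts) are assumptions for illustration.

```python
def split_after(pages, boundaries):
    """Split a page list into groups, starting a new group after each boundary.

    boundaries are 1-based page numbers, assumed to lie within the document.
    """
    groups = []
    start = 0
    for boundary in sorted(set(boundaries)):
        if not 1 <= boundary <= len(pages):
            raise ValueError(f"boundary {boundary} is out of range")
        groups.append(pages[start:boundary])
        start = boundary
    # Whatever follows the last boundary forms the final group.
    if start < len(pages):
        groups.append(pages[start:])
    return groups


chunks = split_after(list(range(1, 11)), [3, 7])
```

A boundary equal to the last page number produces no empty trailing group, so the result always covers every page exactly once.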

For CLI, I'm probably going to have it take a JSON blob or JSON file on the CLI rather than having some absurd way of doing it with arguments (people have a lot of trouble with --pages as it is). See TODO for a feature on command-line/job JSON support for JSON specification arguments.

There are some other things, like allowing n-up and generalizing overlay/underlay to allow different placement and scaling options, that I think may also be in scope.

Scaling/Transforming Pages

Keep in mind that destinations, such as links and outlines, may need to be adjusted when a page is scaled or otherwise transformed.
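As an example of the adjustment involved, an /XYZ destination carries left, top, and zoom values in the coordinate system of the target page, so the same matrix applied to the page has to be applied to the point. The sketch below uses the usual PDF six-number matrix [a b c d e f]; the function names are hypothetical, it assumes both coordinates are present (not null), and it leaves the zoom untouched, which may or may not be the desired policy.

```python
def transform_point(matrix, x, y):
    """Apply a PDF transformation matrix (a, b, c, d, e, f) to a point."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


def adjust_xyz_destination(dest, matrix):
    """Move the (left, top) of an XYZ-style destination; zoom is kept as-is."""
    left, top, zoom = dest
    new_left, new_top = transform_point(matrix, left, top)
    return (new_left, new_top, zoom)


# Scale the page by one half, then shift it by (10, 20).
half_scale_and_shift = (0.5, 0, 0, 0.5, 10, 20)
adjusted = adjust_xyz_destination((100, 700, None), half_scale_and_shift)
```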

Notes

PDF document structure

7.7.2 contains the list of all keys in the document catalog.

Document-level structures:

Extensions
PageLabels
Names -- see below
- Page: TemplateInstantiated
Dests
Outlines
Threads (easy)
- Page: B
AA (Additional Actions)
URI
AcroForm
StructTreeRoot
- Page: StructParents
MarkInfo (see 14.7 - Logical Structure, 14.8 Tagged PDF)
SpiderInfo
- Page: ID
OutputIntents
- Page: OutputIntents
PieceInfo
- Page: PieceInfo
OCProperties
Requirements
Collection
AF (file specification dictionaries)
- Page: AF
DPartRoot
- Page: DPart

Things qpdf probably needs to drop

Version
Perms
Legal
DSS

Things that stay with the first document and/or will not be supported

ViewerPreferences
PageLayout
PageMode
OpenAction
Metadata
Lang
NeedsRendering

Name dictionary (7.7.4)

Dests
AP (appearance streams)
JavaScript
Pages (named pages)
Templates
IDS
URLS
EmbeddedFiles
AlternatePresentations
Renditions

Most of chapter 12 applies.

Document-level navigation (12.3)

QPDF will need a global way to reference a page. This will most likely be in the form of the QPDF uuid and a QPDFObjectHandle to the page. If this can just be a QPDFObjectHandle, that would be better. I need to make sure we can meaningfully interact with QPDFObjectHandle objects from multiple QPDFs in a safe fashion. Figure out how this works with immediateCopyFrom, etc. Better to avoid this whole thing and make sure that we just keep all the document-level stuff specific to a PDF, but we will need to have some internal representation that can be used to reconstruct the document-level dictionaries when writing. Making this work with structures (structure destinations) will require more indirection.

I imagine that there will be some internal representation of what document-level things come along for the ride when we take a page from a document. I wonder whether this needs to change the way linearization works.

There should be different ways to specify collections of pages. The existing one, which is using a numeric range, is just one. Other ideas include things related to document structure (all pages in an article thread, all pages in an outline hierarchy), page labels, book binding (Is that called folio? There's an issue for it.), all except, subgroups, or any number of things.

We will need to be able to start with document-level objects to get page groups and also to start with pages and reconstruct document level objects. For example, it should be possible to reconstruct article threads to omit beads that don't belong to any of the pages. Likewise with outlines.
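Reconstructing an article thread after pages are dropped might look like the following Python sketch. In PDF, the beads of a thread form a circular doubly linked list, with the last bead's next entry referring back to the first. Here beads are modeled as dictionaries and links as list indices; that representation and the function name are assumptions for illustration.

```python
def rebuild_thread(beads, kept_pages):
    """Drop beads on removed pages and relink the rest as a circular list.

    beads is the thread's beads in reading order, each a dict with "page"
    and "rect" entries. The returned beads use list indices for next/prev.
    """
    kept = [bead for bead in beads if bead["page"] in kept_pages]
    count = len(kept)
    return [
        {
            "page": bead["page"],
            "rect": bead["rect"],
            "next": (index + 1) % count,  # last bead wraps to the first
            "prev": (index - 1) % count,  # first bead wraps to the last
        }
        for index, bead in enumerate(kept)
    ]


thread = [
    {"page": "p1", "rect": (0, 0, 100, 100)},
    {"page": "p2", "rect": (0, 0, 100, 100)},
    {"page": "p3", "rect": (0, 0, 100, 100)},
]
rebuilt = rebuild_thread(thread, {"p1", "p3"})
```

An empty result means no bead survived, in which case the thread itself would presumably be omitted from the output.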
