2023-01-07 16:37:08 +00:00
|
|
|
|
# Pages
|
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
**THIS IS A WORK IN PROGRESS. THE ACTUAL IMPLEMENTATION MAY NOT LOOK ANYTHING LIKE THIS. When this
|
|
|
|
|
gets to the stage where it is starting to congeal into an actual plan, I will remove this disclaimer
|
|
|
|
|
and open a discussion ticket in GitHub to work out details.**
|
|
|
|
|
|
|
|
|
|
This document describes a project known as the _pages epic_. The goal of the pages epic is to enable
|
|
|
|
|
qpdf to properly preserve all functionality associated with a page as pages are copied from one PDF
|
|
|
|
|
to another (or back to the same PDF).
|
|
|
|
|
|
|
|
|
|
Terminology:
|
|
|
|
|
* _Page-level data_: information that is contained within objects reachable from the page dictionary
|
|
|
|
|
without traversing through any `/Parent` pointers
|
|
|
|
|
* _Document-level data_: information that is reachable from the document catalog (`/Root`) that is
|
|
|
|
|
not reachable from a page dictionary as well as the `/Info` dictionary
|
|
|
|
|
|
|
|
|
|
Some document-level data references specific pages by page object ID, such as outlines or
|
|
|
|
|
interactive forms. Some document-level data doesn't reference any pages, such as embedded files or
|
|
|
|
|
optional content (layers). Some document-level data contains information that pertains to a specific
|
|
|
|
|
page but does not reference the page, such as page labels (explicit page numbers). Some page-level
|
|
|
|
|
data may sometimes depend on document-level data. For example, a _named destination_ depends on the
|
|
|
|
|
document-level _names tree_.
|
|
|
|
|
|
|
|
|
|
As long as qpdf has had the ability to copy pages from one PDF to another, it has had robust
|
|
|
|
|
handling of page-level data. Prior to the implementation of the pages epic, with the exception of
|
|
|
|
|
page labels, qpdf has ignored document-level data during page copy operations. Specifically, when
|
|
|
|
|
qpdf creates a new PDF file from existing PDF files, it always starts with a specific PDF, known as
|
|
|
|
|
the _primary input_. The primary input may be the built-in _empty PDF_. With the exception of page
|
|
|
|
|
labels, document-level constructs that appear in the primary input are preserved, and document-level
|
|
|
|
|
constructs from the other PDF files are ignored. The exception to this is page labels. With page
|
|
|
|
|
labels, qpdf always ensures that any given page has the same label in the final output as it had in
|
|
|
|
|
whichever input file it originated from, which is usually (but not always) the desired behavior.
|
|
|
|
|
|
|
|
|
|
Here are several examples of problems in qpdf prior to the implementation of the pages epic:
|
|
|
|
|
* If two files with optional content (layers) are merged, all layers in all but the primary input
|
|
|
|
|
will be visible in the combined file.
|
|
|
|
|
* If two files with file attachments are merged, attachments will be retained on the primary input
|
|
|
|
|
but dropped on the others. (qpdf has other ways to copy attachments from one file to another.)
|
|
|
|
|
* If two files with hyperlinks are merged, any hyperlink from other than primary input whose
|
|
|
|
|
destination is a named destination will become non-functional.
|
|
|
|
|
* If two files with outlines are merged, the outlines from the original file will appear in their
|
|
|
|
|
entirety, including outlines that point to pages that are no longer there, and outlines will be
|
|
|
|
|
lost from all files except the primary input.
|
|
|
|
|
|
|
|
|
|
With the above limitations, qpdf allows combining pages from arbitrary numbers of input PDFs to
|
|
|
|
|
create an output PDF, or in the case of page splitting, multiple output PDFs. The API allows
|
|
|
|
|
arbitrary combinations of input and output files. The command-line allows only the following:
|
|
|
|
|
* Merge: creation of a single output file from a primary input and any number of other inputs by
|
|
|
|
|
selecting pages by index from the beginning or end of the file
|
|
|
|
|
* Split: creation of multiple output files from a single input or the result of a merge into files
|
|
|
|
|
whose primary input is the empty PDF and that contain a fixed number of pages per group
|
|
|
|
|
* Overlay/underlay: layering pages on top of each other with a maximum of one underlay and one
|
|
|
|
|
overlay and with no ability to specify transformation of the pages (such as scaling, placing them
|
|
|
|
|
in a particular spot).
|
|
|
|
|
|
|
|
|
|
The pages epic consists of two broad categories of work:
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* Proper handling of document-level features when splitting and merging documents
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Greatly increased flexibility in the ways in which pages can be selected from the various input
|
|
|
|
|
files and combined for the output file. This includes creation of blank pages.
|
|
|
|
|
|
|
|
|
|
Here are some examples of things that will become possible:
|
|
|
|
|
|
|
|
|
|
* Stacking arbitrary pages on top of each other with full control over transformation and cropping,
|
|
|
|
|
including being able to access information about the various bounding boxes associated with the
|
|
|
|
|
pages
|
|
|
|
|
* Inserting blank pages
|
|
|
|
|
* Doing n-up page layouts
|
|
|
|
|
* Re-ordering pages for printing booklets (also called signatures or printer spreads)
|
|
|
|
|
* Selecting pages based on the outline hierarchy, tags, or article threads
|
|
|
|
|
* Keeping only and all relevant parts of the outline hierarchies from all input files
|
|
|
|
|
* Creating single very long or wide pages with output from other pages
|
|
|
|
|
|
|
|
|
|
The rest of this document describes the details of what how these features will work and what needs
|
|
|
|
|
to be done to make them possible to build.
|
|
|
|
|
|
2024-01-02 22:32:51 +00:00
|
|
|
|
# QPDFJob Summary
|
|
|
|
|
|
|
|
|
|
`QPDFJob` goes through the following stages:
|
|
|
|
|
|
|
|
|
|
* create QPDF
|
|
|
|
|
* update from JSON
|
|
|
|
|
* page specs (`--pages`)
|
|
|
|
|
* Create a QPDF for each input source
|
|
|
|
|
* Figure out whether to keep files open
|
|
|
|
|
* Remove unreferenced resources if needed
|
|
|
|
|
* Remove pages from the pages tree
|
|
|
|
|
* Handle collation
|
|
|
|
|
* Copy or revive all final pages
|
|
|
|
|
* When copying foreign pages, possibly remove unreferenced resources
|
|
|
|
|
* Handle the same page copied more than once by doing a shallow copy
|
|
|
|
|
* Preserve form fields and page labels
|
|
|
|
|
* Delete pages from the primary input that were not used in the output
|
|
|
|
|
* Delete unreferenced form fields
|
|
|
|
|
* rotation
|
|
|
|
|
* underlay/overlay
|
|
|
|
|
* transformations
|
|
|
|
|
* disable signatures
|
|
|
|
|
* externalize images
|
|
|
|
|
* optimize images
|
|
|
|
|
* generate appearances
|
|
|
|
|
* flatten annotations
|
|
|
|
|
* coalesce contents
|
|
|
|
|
* flatten rotation
|
|
|
|
|
* remove page labels
|
|
|
|
|
* remove attachments
|
|
|
|
|
* add attachments
|
|
|
|
|
* copy attachments
|
|
|
|
|
* write QPDF
|
|
|
|
|
* One of:
|
|
|
|
|
* Do inspections
|
|
|
|
|
* Write single file
|
|
|
|
|
* Split pages
|
|
|
|
|
* Remove unreference resources if needed
|
|
|
|
|
* Preserve form fields and page labels
|
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
# Architectural Thoughts
|
|
|
|
|
|
2024-01-02 23:06:47 +00:00
|
|
|
|
XXX WORK IN: Dump `QPDFAssembler`. Instead, these are enhancements to `QPDFJob`. Don't try to
|
|
|
|
|
generalize this too much. There are actually only a few things we need to add to `QPDFJob`. Go
|
|
|
|
|
through and flesh out the list, but roughly:
|
|
|
|
|
|
|
|
|
|
* From the C++ API, make it possible to use an arbitrary QPDF as an input rather than having to
|
|
|
|
|
start with a file. That makes it possible to do arbitrary work on the PDF prior to submitting it.
|
|
|
|
|
* Allow specification of n blank pages of a given size, e.g. `--blank=5@612x792`. Maybe we can
|
|
|
|
|
support standard paper sizes, inches, centimeters, or sizes relative to other pages.
|
|
|
|
|
* Generalize underlay/overlay
|
|
|
|
|
* Maybe we can do it by adding flags and allowing them to be repeated
|
|
|
|
|
* Maybe we need a new syntax, like pstops, but with the ability to specify anchors and proportions
|
|
|
|
|
based on varoius boxes
|
|
|
|
|
* Maybe we need something like `--stack`
|
|
|
|
|
* It needs to be possible to stack arbitrary pages with arbitrary transformations and to have the
|
|
|
|
|
transformations be a function of the source or destination page; the rectangle mapping idea
|
|
|
|
|
discussed elsewhere may be a good basis
|
|
|
|
|
* Have a page composition phase after the overlay/underlay stage
|
|
|
|
|
* Allow n-up, left-to-right (can reverse page order to get rtl), top-to-bottom, or modular
|
|
|
|
|
composition like pstops
|
|
|
|
|
* Possible hook for page composition to allow custom compositions
|
|
|
|
|
* A few additional split options
|
|
|
|
|
|
|
|
|
|
Then, we need to make the existing logic handle other document-level structures, preferably in a way
|
|
|
|
|
that requires less duplication between split and merge. Maybe we can add a flag to disregard
|
|
|
|
|
document-level structures for speed, but I think being able to turn them on and off individually is
|
|
|
|
|
overkill, especially since people who are that sophisticated can tweak with JSON or just do it in
|
|
|
|
|
code.
|
|
|
|
|
|
|
|
|
|
The challenge will be able to come up with command-line syntax to do most things from the CLI and to
|
|
|
|
|
make the C++ API flexible enough for users to insert their own bits in key places, just as we can
|
|
|
|
|
now grab the QPDF before the write phase. This approach eliminates all the function stuff. We just
|
|
|
|
|
have to make sure we can support all these features and have a relatively easy way to add new ones
|
|
|
|
|
or to let developers extend. The documentation will have to explain the flow of QPDFJob so people
|
|
|
|
|
can know where to apply hooks.
|
|
|
|
|
|
|
|
|
|
----------
|
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
Create a new top-level class called `QPDFAssembler` that will be used to perform page-level
|
|
|
|
|
operations. Its implementation will use existing APIs, and it will add many new APIs. It should be
|
|
|
|
|
possible to perform all existing page splitting and merging operations using `QPDFAssembler` without
|
|
|
|
|
having to worry about details such as copying annotations, remapping destinations, and adjusting
|
|
|
|
|
document-level data.
|
|
|
|
|
|
|
|
|
|
Early strategy: keep `QPDFAssembler` private to the library, and start with a pure C++ API (no JSON
|
|
|
|
|
support). Migrate splitting and merging from `QPDFJob` into `QPDFAssembler`, then build in
|
|
|
|
|
document-level support. Also work the difference between normal write and split, which are two
|
|
|
|
|
separate ways to write output files.
|
|
|
|
|
|
|
|
|
|
One of the main responsibilities of `QPDFAssembler` will be to remap destinations as data from a
|
|
|
|
|
page is moved or copied. For example, if an outline has a destination that points to a particular
|
|
|
|
|
rectangle on page 5 of the second file, and we end up dropping a portion of that page into an n-up
|
|
|
|
|
configuration on a specific output page, we will have to keep track of enough information to replace
|
|
|
|
|
the destination with a new one that points to the new physical location of the same material. For
|
|
|
|
|
another example, consider a case in which the left side of page 3 of the primary input ends up as
|
|
|
|
|
page 5 of the output and the right side of page 3 ends up as page 6. We would have to map
|
|
|
|
|
destinations from a single source page to different destination pages based on which part of the
|
|
|
|
|
page it was on. If part of the rectangle points to one page and part to another, what do we do? I
|
|
|
|
|
suggest we go with the top/center of the rectangle.
|
|
|
|
|
|
|
|
|
|
A destination consists of a QPDF, page object, and rectangle in user coordinates. When
|
|
|
|
|
`QPDFAssembler` copies a page or converts it to a form XObject, possibly with transformations
|
|
|
|
|
applied, it will have to be able to map a destination to the same triple (QPDF, page object,
|
|
|
|
|
rectangle) on all pages that contain data from the original page. When writing the final output, any
|
|
|
|
|
destination that no longer points anywhere should be dropped, and any destination that points to
|
|
|
|
|
multiple places will need to be handled according to some specification.
|
|
|
|
|
|
|
|
|
|
Whenever we create any new thing from a page, we create _derived page data_. Examples of derived
|
|
|
|
|
page data would include a copy of the page and a form XObject created from a page. `QPDFAssembler`
|
|
|
|
|
will have to keep a mapping from any source page to all of its derived objects along with any
|
|
|
|
|
transformations or clipping. When a derived page data object is placed on a final page, that
|
|
|
|
|
information can be combined with the position and any transformations onto the final page to be able
|
|
|
|
|
to map any destination to a new one or to determine that it points outside of the visible area.
|
|
|
|
|
|
|
|
|
|
If a source page is copied multiple times, then if exactly one copy is explicitly marked as the
|
|
|
|
|
target, that becomes the target. Otherwise, the first derived object to be placed becomes the
|
|
|
|
|
target.
|
|
|
|
|
|
|
|
|
|
## Overall Structure
|
|
|
|
|
|
|
|
|
|
A single instance of `QPDFAssembler` creates a single assembly job. `QPDFJob` can create one
|
|
|
|
|
assembly job but does other things, such as setting writer options, inspection operations, etc. An
|
|
|
|
|
assembly job consists of the following:
|
|
|
|
|
* Global document-level data handling information
|
|
|
|
|
* Mode
|
|
|
|
|
* intelligent: try to combine everything using latest capabilities of qpdf; this is the default
|
|
|
|
|
* legacy: document-level features are kept from primary input; this is for compatibility and can
|
|
|
|
|
be selected from the CLI
|
|
|
|
|
* Input sources
|
|
|
|
|
* File/password
|
|
|
|
|
* Whether to keep attachments: yes, no, if-all-pages (default)
|
|
|
|
|
* Empty
|
|
|
|
|
* Output mode
|
|
|
|
|
* Single file
|
|
|
|
|
* Split -- this must include definitions of the split groups
|
|
|
|
|
* Description of the output in terms of the input sources and some series of transformations
|
|
|
|
|
|
|
|
|
|
## Cases to support
|
|
|
|
|
|
|
|
|
|
Here is a list of cases that need to be expressible.
|
|
|
|
|
|
|
|
|
|
* Create output by concatenating pages from page groups where each page group is pages specified by
|
|
|
|
|
a numeric range. This is what `--pages` does now.
|
|
|
|
|
* Collation, including different sized groups.
|
|
|
|
|
* Overlay/underlay, generalized to support a stack consisting of various underlays, the base page,
|
|
|
|
|
and various overlays, with flexibility around posititioning. It should be natural to express
|
|
|
|
|
exactly whate underlay and overlay do now.
|
|
|
|
|
* Split into groups of fixed size (what `--split-pages` does) with the ability to define split
|
|
|
|
|
groups based on other things, like outlines, article threads, and document structure
|
|
|
|
|
* Examples from the manual:
|
|
|
|
|
* `qpdf in.pdf --pages . a.pdf b.pdf:even -- out.pdf`
|
|
|
|
|
* `qpdf --empty --pages a.pdf b.pdf --password=x z-1 c.pdf 3,6`
|
|
|
|
|
* `qpdf --collate odd.pdf --pages . even.pdf -- all.pdf`
|
|
|
|
|
* `qpdf --collate --empty --pages odd.pdf even.pdf -- all.pdf`
|
|
|
|
|
* `qpdf --collate --empty --pages a.pdf 1-5 b.pdf 6-4 c.pdf r1 -- out.pdf`
|
|
|
|
|
* `qpdf --collate=2 --empty --pages a.pdf 1-5 b.pdf 6-4 c.pdf r1 -- out.pdf`
|
|
|
|
|
* `qpdf file2.pdf --pages file1.pdf 1-5 . 15-11 -- outfile.pdf`
|
|
|
|
|
*
|
|
|
|
|
```
|
|
|
|
|
qpdf --empty --copy-encryption=encrypted.pdf \
|
|
|
|
|
--encryption-file-password=pass \
|
|
|
|
|
--pages encrypted.pdf --password=pass 1 \
|
|
|
|
|
./encrypted.pdf --password=pass 1 -- \
|
|
|
|
|
outfile.pdf
|
|
|
|
|
```
|
|
|
|
|
* `qpdf --collate=2,6 a.pdf --pages . b.pdf -- all.pdf`
|
|
|
|
|
* Take A 1-2, B 1-6, A 3-4, C 7-12, A 5-6, B 13-18, ...
|
|
|
|
|
* Ideas from pstops. The following is an excerpt from the pstops manual page.
|
|
|
|
|
|
|
|
|
|
This section contains some sample re‐arrangements. To put two pages on one sheet (of A4 paper),
|
|
|
|
|
the pagespec to use is:
|
|
|
|
|
```
|
|
|
|
|
2:0L@.7(21cm,0)+1L@.7(21cm,14.85cm)
|
|
|
|
|
```
|
|
|
|
|
To select all of the odd pages in reverse order, use:
|
|
|
|
|
```
|
|
|
|
|
2:‐0
|
|
|
|
|
```
|
|
|
|
|
To re‐arrange pages for printing 2‐up booklets, use
|
|
|
|
|
```
|
|
|
|
|
4:‐3L@.7(21cm,0)+0L@.7(21cm,14.85cm)
|
|
|
|
|
```
|
|
|
|
|
for the front sides, and
|
|
|
|
|
```
|
|
|
|
|
4:1L@.7(21cm,0)+‐2L@.7(21cm,14.85cm)
|
|
|
|
|
```
|
|
|
|
|
for the reverse sides (or join them with a comma for duplex printing).
|
|
|
|
|
* From #493
|
|
|
|
|
```
|
|
|
|
|
pdf2ps infile.pdf infile.ps
|
|
|
|
|
ps2ps -pa4 "2:0R(4.5cm,26.85cm)+1R(4.5cm,14.85cm)" infile.ps outfile.ps
|
|
|
|
|
ps2pdf outfile.ps outfile.pdf
|
|
|
|
|
```
|
|
|
|
|
* Like psbook. Signature size n:
|
|
|
|
|
* take groups of 4n
|
|
|
|
|
* shown for n=3 in order such that, if printed so that the front of the first page is on top, the
|
|
|
|
|
whole stack can be folded in half.
|
|
|
|
|
* front: 6,7, back: 8,5
|
|
|
|
|
* front: 4,9, back: 10,3
|
|
|
|
|
* front: 2,11, back: 12,1
|
|
|
|
|
|
2024-01-02 22:32:51 +00:00
|
|
|
|
This is the same as duplex 2-up with pages in order 6, 7, 8, 5, 4, 9, 10, 3, 2, 11, 12, 1
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* n-up:
|
|
|
|
|
* For 2-up, calculate new w and h such that w/h maintains a fixed ratio and w and h are the
|
|
|
|
|
largest values that can fit within 1/2 the page with specified margins.
|
|
|
|
|
* Can support 1, 2, 4, 6, 9, 16. 2 and 6 require rotation. The others don't. Will probably need to
|
|
|
|
|
change getFormXObjectForPage to handle other boxes than trim box.
|
|
|
|
|
* Maybe define n-up a scale and rotate followed by fitting the result into a specified rectangle.
|
|
|
|
|
I might already have this logic in QPDFAnnotationObjectHelper::getPageContentForAppearance.
|
|
|
|
|
|
2023-12-30 14:47:29 +00:00
|
|
|
|
# Feature to Issue Mapping
|
|
|
|
|
|
|
|
|
|
Last checked: 2023-12-29
|
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
```
|
|
|
|
|
gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open
|
|
|
|
|
```
|
|
|
|
|
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Generate a mapping from source to destination for all destinations
|
|
|
|
|
* Issues: #1077
|
|
|
|
|
* Notes:
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Source can be an outline or link, either directly or via action. If link, it should include
|
|
|
|
|
the page.
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Destination can be a structure destination, which should map to a regular destination.
|
|
|
|
|
* source: page X -> link -> action -> dest: page Y
|
|
|
|
|
* source: page X -> link -> action -> dest: structure -> page Y
|
|
|
|
|
* Consider something in json that dumps this.
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* We will need to associate this with a QPDF. It would be great if remote or embedded go-to
|
|
|
|
|
actions could be handled, but that's ambitious.
|
|
|
|
|
* It will be necessary to keep some global map that includes all QPDF objects that are part of
|
|
|
|
|
the final file.
|
|
|
|
|
* An interesting use case to consider would be to create a QPDF object from an embedded file and
|
|
|
|
|
append the embedded file and make the embedded actions work. This would probably require some
|
|
|
|
|
way to tell qpdf that a particular external file came from an embedded file.
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Control size of page and position/transformation of overlay/underlay
|
|
|
|
|
* Issues: #1031, #811, #740, #559
|
|
|
|
|
* Notes:
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* It should be possible to define a destination page from scratch or in terms of other pages and
|
|
|
|
|
then place page contents onto it with arbitrary transformations applied.
|
|
|
|
|
* It should be possible to compute the size of the destination page in terms of the source
|
|
|
|
|
pages, e.g., to create one long or wide page from other pages.
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Also allow specification of which page box to use
|
|
|
|
|
* Preserve hyperlinks when doing any page operations
|
|
|
|
|
* See also "Generate a mapping from source to destination for all destinations"
|
|
|
|
|
* Issues: #1003, #797, #94
|
|
|
|
|
* Notes:
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* A link annotation that points to a destination rather than an external URL should continue to
|
|
|
|
|
work when files are split or merged.
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Awareness of structured and tagged PDF (14.7, 14.8)
|
|
|
|
|
* Issues: #957, #953, #490
|
|
|
|
|
* Notes:
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* This looks complicated. It may be not be possible to do this fully in the first increment, but
|
|
|
|
|
we have to keep it in mind and warn if we can't and we see /SD in an action.
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* #490 has some good analysis
|
|
|
|
|
* Assign page labels
|
|
|
|
|
* Issues: #939
|
|
|
|
|
* Notes:
|
|
|
|
|
* #939 has a good proposal
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* This could be applied to page groups, and we could have an option to keep the labels as they
|
|
|
|
|
are in a given group, which is what qpdf does now.
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Interleave pages with ordering
|
|
|
|
|
* Issues: #921
|
|
|
|
|
* Notes:
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* From 921: interleave odd pages and reversed even pages. This might require different handling
|
|
|
|
|
for even/odd numbers of pages. Make sure it's natural for the cases of len(odd) == len(even)
|
|
|
|
|
or len(odd) == 1+len(even)
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Preserve all attachments when merging files
|
|
|
|
|
* Issues: #856
|
|
|
|
|
* Notes:
|
|
|
|
|
* If all pages of a file are selected, keep all attachments
|
|
|
|
|
* If some pages of a file are selected
|
|
|
|
|
* Keep all attachments if there are any embedded file annotations
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Otherwise, what? Do we have a keep-attachments flag of some sort? Or do we just make the
|
|
|
|
|
user copy attachments from one file to another?
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Apply clipping to a page
|
|
|
|
|
* Issues: #771
|
|
|
|
|
* Notes:
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Create a form xobject from a page, then apply a specific clipping region expressed in
|
|
|
|
|
coordinates or as a percentage
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Ability to create a blank page
|
|
|
|
|
* Issues: #753
|
|
|
|
|
* Notes:
|
|
|
|
|
* Create a blank page of a specific size or of the same size as another page
|
|
|
|
|
* Split groups with explicit boundaries
|
|
|
|
|
* Issues: #741, #616
|
|
|
|
|
* Notes:
|
|
|
|
|
* Example: --split-after a,b,c
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Handle Optional Content (layers) (8.11)
|
|
|
|
|
* Issues: #672, #9, #570
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Scale a page up or down to fit to a size
|
|
|
|
|
* Issues: #611
|
|
|
|
|
* Place contents of pages adjacent horizontally or vertically on one page
|
|
|
|
|
* Issues: #1040, #546
|
|
|
|
|
* nup, booklet
|
|
|
|
|
* Issues: #493, #461, #152
|
|
|
|
|
* Notes:
|
|
|
|
|
* #461 may want the inverse of booklet and discusses reader and printer spreads
|
|
|
|
|
* Flexible multiplexing
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Issues: #505 (already implemented with --collate)
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Split pages based on outlines
|
|
|
|
|
* Issues: #477
|
|
|
|
|
* Keep relevant parts of outline hierarchy
|
|
|
|
|
* Issues: #457, #356, #343, #323
|
|
|
|
|
* Notes:
|
|
|
|
|
* There is some helpful discussion in #343 including
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Preserving open/closed status
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* Preserving javascript actions
|
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
# XXX OLD NOTES
|
|
|
|
|
|
|
|
|
|
I want to encapsulate various aspects of the logic into interfaces that can be implemented by
|
|
|
|
|
developers to add their own logic. It should be easy to contribute these. Here are some rough ideas.
|
2023-09-23 13:26:25 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
A source is an input file, the output of another operation, or a blank page. In the API, it can be
|
|
|
|
|
any QPDF object.
|
2023-09-23 13:26:25 +00:00
|
|
|
|
|
|
|
|
|
A page group is just a group of pages.
|
|
|
|
|
|
|
|
|
|
* PageSelector -- creates page groups from other page groups
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* PageTransformer -- selects a part of a page and possibly transforms it; applies to all pages of a
|
|
|
|
|
group. Based on the page dictionary; does not look at the content stream
|
2023-09-23 13:26:25 +00:00
|
|
|
|
* PageFilter -- apply arbitrary code to a page; may access the content stream
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* PageAssembler -- combines pages from groups into new groups whose pages are each assembled from
|
|
|
|
|
corresponding pages of the input groups
|
2023-09-23 13:26:25 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
These should be able to be composed in arbitrary ways. There should be a natural API for doing this,
|
|
|
|
|
and it there should be some specification, probably based on JSON, that can be provided on the
|
|
|
|
|
command line or embedded in the job JSON format. I have been considering whether a lisp-like
|
|
|
|
|
S-expression syntax may be less cumbersome to work with. I'll have to decide whether to support this
|
|
|
|
|
or some other syntax in addition to a JSON representation.
|
2023-09-23 13:26:25 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
There also needs to be something to represent how document-level structures relate to this. I'm not
|
|
|
|
|
sure exactly how this should work, but we need things like
|
2023-09-23 13:26:25 +00:00
|
|
|
|
* what to do with page labels, especially when assembling pages from other pages
|
|
|
|
|
* whether to preserve destinations (outlines, links, etc.), particularly when pages are duplicated
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* If A refers to B and there is more than one copy of B, how do you decide which copies of A link
|
|
|
|
|
to which copies of B?
|
|
|
|
|
* what to do with pages that belong to more than one group, e.g., what happens if you used document
|
|
|
|
|
structure or outlines to form page groups and a group boundary lies in the middle of the page
|
2023-09-23 13:26:25 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
Maybe pages groups can have arbitrary, user-defined tags so we can specify that links should only
|
|
|
|
|
point to other pages with the same value of some tag. We can probably many-to-one links if the
|
|
|
|
|
source is duplicated.
|
2023-09-23 13:26:25 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
We probably need to hold onto the concept of the primary input file. If there is a primary input
|
|
|
|
|
file, there may need to be a way to specify what gets preserved it. The behavior of qpdf prior to
|
|
|
|
|
all of this is to preserve all document-level constructs from the primary input file and to try to
|
|
|
|
|
preserve page labels from other input files when combining pages.
|
2023-09-23 13:26:25 +00:00
|
|
|
|
|
|
|
|
|
Here are some examples.
|
|
|
|
|
|
|
|
|
|
* PageSelector
|
|
|
|
|
* all pages from an input file
|
|
|
|
|
* pages from a group using a NumericRange
|
|
|
|
|
* concatenate groups
|
|
|
|
|
* pages from a group in reverse order
|
|
|
|
|
* a group repeated as often as necessary until a specified number of pages is reached
|
|
|
|
|
* a group padded with blank pages to create a multiple of n pages
|
|
|
|
|
* odd or even pages from a group
|
|
|
|
|
* every nth page from a group
|
|
|
|
|
* pages interleaved from multiple groups
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* the left-front (left-back, right-front, right-back) pages of a booklet with signatures of n
|
|
|
|
|
pages
|
|
|
|
|
* all pages reachable from a section of the outline hierarchy or something based on threads or
|
|
|
|
|
other structure
|
2023-09-23 13:26:25 +00:00
|
|
|
|
* selection based on page labels
|
|
|
|
|
* PageTransformer
|
|
|
|
|
* clip to media box (trim box, crop box, etc.)
|
|
|
|
|
* clip to specific absolute or relative size
|
|
|
|
|
* scale
|
|
|
|
|
* translate
|
|
|
|
|
* rotate
|
|
|
|
|
* apply transformation matrix
|
|
|
|
|
* PageFilter
|
|
|
|
|
* optimize images
|
|
|
|
|
* flatten annotations
|
|
|
|
|
* PageAssembler
|
|
|
|
|
* Overlay/underlay all pages from one group onto corresponding pages from another group
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Control placement based on properties of all the groups, so higher order than a stand-alone
|
|
|
|
|
transformer
|
2023-09-23 13:26:25 +00:00
|
|
|
|
* Examples
|
|
|
|
|
* Scale the smaller page up to the size of the larger page
|
|
|
|
|
* Center the smaller page horizontally and bottom-align the trim boxes
|
|
|
|
|
* Generalized overlay/underlay allowing n pages in a given order with transformations.
|
|
|
|
|
* n-up -- application of generalized overlay/underlay
|
2023-12-27 19:07:38 +00:00
|
|
|
|
* make one long page with an arbitrary number of pages one after the other (#546)
|
2023-09-23 13:26:25 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
It should be possible to represent all of the existing qpdf operations using the above framework. It
|
|
|
|
|
would be good to re-implement all of them in terms of this framework to exercise it. We will have to
|
|
|
|
|
look through all the command-line arguments and make sure. Of course also make sure suggestions from
|
|
|
|
|
issues can be implemented or at least supported by adding new selectors.
|
2023-09-23 13:26:25 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
Here are a few bits of scratch work. The top-level call is a selector. This doesn't capture
|
|
|
|
|
everything. Implementing this would be tedious and challenging. It could be done using JSON arrays,
|
|
|
|
|
but it would be clunky. This feels over-designed and possibly in conflict with QPDFJob.
|
2023-09-23 13:26:25 +00:00
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
(concat
|
|
|
|
|
(primary-input)
|
|
|
|
|
(file "file2.pdf")
|
|
|
|
|
(page-range (file "file3.pdf") "1-4,5-8")
|
|
|
|
|
)
|
|
|
|
|
|
|
|
|
|
(with
|
|
|
|
|
("a"
|
|
|
|
|
(concat
|
|
|
|
|
(primary-input)
|
|
|
|
|
(file "file2.pdf")
|
|
|
|
|
(page-range (file "file3.pdf") "1-4,5-8")
|
|
|
|
|
)
|
|
|
|
|
)
|
|
|
|
|
(concat
|
|
|
|
|
(even-pages (from "a"))
|
|
|
|
|
(reverse (odd-pages (from "a")))
|
|
|
|
|
)
|
|
|
|
|
)
|
|
|
|
|
|
|
|
|
|
(with
|
|
|
|
|
("a"
|
|
|
|
|
(concat
|
|
|
|
|
(primary-input)
|
|
|
|
|
(file "file2.pdf")
|
|
|
|
|
(page-range (file "file3.pdf") "1-4,5-8")
|
|
|
|
|
)
|
|
|
|
|
"b-even"
|
|
|
|
|
(even-pages (from "a"))
|
|
|
|
|
"b-odd"
|
|
|
|
|
(reverse (odd-pages (from "a")))
|
|
|
|
|
)
|
|
|
|
|
(stack
|
|
|
|
|
(repeat-range (from "a") "z")
|
|
|
|
|
(pad-end (from "b"))
|
|
|
|
|
)
|
|
|
|
|
)
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
```json
|
2023-01-07 16:37:08 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
```
|
2023-01-07 16:37:08 +00:00
|
|
|
|
|
|
|
|
|
# Supporting Document-level Features
|
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
qpdf needs full support for document-level features like article threads, outlines, etc. There is no
|
|
|
|
|
support for some things and partial support for others. See notes below for a comprehensive list.
|
2023-01-07 16:37:08 +00:00
|
|
|
|
|
|
|
|
|
Most likely, this will be done by creating DocumentHelper and ObjectHelper classes.
|
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
It will be necessary not only to read information about these structures from a single PDF file as
|
|
|
|
|
the existing document helpers do but also to reconstruct or update these based on modifications to
|
|
|
|
|
the pages in a file. I'm not sure how to do that, but one idea would be to allow a document helper
|
|
|
|
|
to register a callback with QPDFPageDocumentHelper that notifies it when a page is added or removed.
|
|
|
|
|
This may be able to take other parameters such as a document helper from a foreign file.
|
|
|
|
|
|
|
|
|
|
Since these operations can be expensive, there will need to be a way to opt in and out. The default
|
|
|
|
|
(to be clearly documented) should be that all supported document-level constructs are preserved.
|
|
|
|
|
That way, as new features are added, changes to the output of previous operations to include
|
|
|
|
|
information that was previously omitted will not constitute a non-backward compatible change that
|
|
|
|
|
requires a major version bump. This will be a default for the API when using the higher-level page
|
|
|
|
|
assemebly API (below) as well as the CLI.
|
|
|
|
|
|
|
|
|
|
There will also need to be some kind of support for features that are document-level and not tied to
|
|
|
|
|
any pages, such as (sometimes) embedded files. When splitting/merging files, there needs to be a way
|
|
|
|
|
to specify what should happen with those things. Perhaps the default here should be that these are
|
|
|
|
|
preserved from files from which all pages are selected. For some things, like viewer preferences, it
|
|
|
|
|
may make sense to take them from the first file.
|
2023-01-07 16:37:08 +00:00
|
|
|
|
|
|
|
|
|
# Page Assembly (page selection)
|
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
In addition to the existing numeric ranges of page numbers, page selection could be driven by
|
|
|
|
|
document-level features like the outlines hierarchy or article threads. There have been a lot of
|
|
|
|
|
suggestions about this in various tickets. There will need to be some kind of page manipulation
|
|
|
|
|
class with configuration options. I'm thinking something similar to QPDFJob, where you construct a
|
|
|
|
|
class and then call a bunch of methods to configure it, including the ability to configure with
|
|
|
|
|
JSON. Several suggestions have been made in issues, which I will go through and distill into a list.
|
|
|
|
|
Off hand, some ideas include being able to split based on explicit chunks and being able to do all
|
|
|
|
|
pages except a list of pages.
|
2023-01-07 16:37:08 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
For CLI, I'm probably going to have it take a JSON blob or JSON file on the CLI rather than having
|
|
|
|
|
some absurd way of doing it with arguments (people have a lot of trouble with --pages as it is). See
|
|
|
|
|
TODO for a feature on command-line/job JSON support for JSON specification arguments.
|
2023-01-07 16:37:08 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
There are some other things, like allowing n-up and genearlizing overlay/underlay to allow different
|
|
|
|
|
placement and scaling options, that I think may also be in scope.
|
2023-01-07 16:37:08 +00:00
|
|
|
|
|
|
|
|
|
# Scaling/Transforming Pages
|
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Keep in mind that destinations, such as links and outlines, may need to be adjusted when a page is
|
|
|
|
|
scaled or otherwise transformed.
|
2023-01-07 16:37:08 +00:00
|
|
|
|
|
|
|
|
|
# Notes
|
|
|
|
|
|
|
|
|
|
PDF document structure
|
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
The trailer contains the catalog and the Info dictionary. We probably need to do something
|
|
|
|
|
intelligent with the info dictionary.
|
|
|
|
|
|
2023-01-07 16:37:08 +00:00
|
|
|
|
7.7.2 contains the list of all keys in the document catalog.
|
|
|
|
|
|
2024-01-02 22:32:51 +00:00
|
|
|
|
Document-level structures to merge:
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* Extensions
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Must be combination of Extensions from all input files
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* PageLabels
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Ensure each page has its original label
|
|
|
|
|
* Allow post-processing
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* Names -- see below
|
2024-01-02 22:32:51 +00:00
|
|
|
|
* Combine per tree
|
|
|
|
|
* May require disambiguation
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* Page: TemplateInstantiated
|
|
|
|
|
* Dests
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Keep referenced destinations across all files
|
|
|
|
|
* May need to disambiguate or "flatten" or convert to named dests with the names tree
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* Outlines
|
|
|
|
|
* Threads (easy)
|
|
|
|
|
* Page: B
|
|
|
|
|
* AcroForm
|
|
|
|
|
* StructTreeRoot
|
|
|
|
|
* Page: StructParents
|
2023-12-30 14:47:29 +00:00
|
|
|
|
* MarkInfo (see 14.7 - Logical Structure, 14.8 Tagged PDF)
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* SpiderInfo
|
|
|
|
|
* Page: ID
|
|
|
|
|
* OutputIntents
|
|
|
|
|
* Page: OutputIntents
|
|
|
|
|
* PieceInfo
|
|
|
|
|
* Page: PieceInfo
|
|
|
|
|
* OCProperties
|
|
|
|
|
* Requirements
|
|
|
|
|
* AF (file specification dictionaries)
|
|
|
|
|
* Page: AF
|
|
|
|
|
* DPartRoot
|
|
|
|
|
* Page: DPart
|
|
|
|
|
* Version
|
2024-01-02 22:32:51 +00:00
|
|
|
|
* Maximum
|
2023-01-07 16:37:08 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
Things that stay with the first document that has one and/or will not be supported
|
2024-01-02 22:32:51 +00:00
|
|
|
|
* AA (Additional Actions)
|
|
|
|
|
* Would be possible to combine and let the first contributor win, but it probably wouldn't usually
|
|
|
|
|
be what we want.
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Info (not part of document catalog)
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* ViewerPreferences
|
|
|
|
|
* PageLayout
|
|
|
|
|
* PageMode
|
|
|
|
|
* OpenAction
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* URI
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* Metadata
|
|
|
|
|
* Lang
|
|
|
|
|
* NeedsRendering
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Collection
|
2024-01-02 22:32:51 +00:00
|
|
|
|
* Perms
|
|
|
|
|
* Legal
|
|
|
|
|
* DSS
|
2023-01-07 16:37:08 +00:00
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
Name dictionary (7.7.4)
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* Dests
|
2024-01-02 22:32:51 +00:00
|
|
|
|
* AP (appearance streams)
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* JavaScript
|
|
|
|
|
* Pages (named pages)
|
|
|
|
|
* Templates
|
2024-01-01 22:45:14 +00:00
|
|
|
|
* Combine across all documents
|
|
|
|
|
* Page: TemplateInstantiated points to a named page
|
2023-01-07 16:37:08 +00:00
|
|
|
|
* IDS
|
|
|
|
|
* URLS
|
|
|
|
|
* EmbeddedFiles
|
|
|
|
|
* AlternatePresentations
|
|
|
|
|
* Renditions
|
|
|
|
|
|
|
|
|
|
Most of chapter 12 applies.
|
|
|
|
|
|
|
|
|
|
Document-level navigation (12.3)
|
|
|
|
|
|
2024-01-01 22:45:14 +00:00
|
|
|
|
QPDF will need a global way to reference a page. This will most likely be in the form of the QPDF
|
|
|
|
|
uuid and a QPDFObjectHandle to the page. If this can just be a QPDFObjectHandle, that would be
|
|
|
|
|
better. I need to make sure we can meaningfully interact with QPDFObjectHandle objects from multiple
|
|
|
|
|
QPDFs in a safe fashion. Figure out how this works with immediateCopyFrom, etc. Better to avoid this
|
|
|
|
|
whole thing and make sure that we just keep all the document-level stuff specific to a PDF, but we
|
|
|
|
|
will need to have some internal representation that can be used to reconstruct the document-level
|
|
|
|
|
dictionaries when writing. Making this work with structures (structure destinations) will require
|
|
|
|
|
more indirection.
|
|
|
|
|
|
|
|
|
|
I imagine that there will be some internal repreentation of what document-level things come along
|
|
|
|
|
for the ride when we take a page from a document. I wonder whether this need to change the way
|
|
|
|
|
linearization works.
|
|
|
|
|
|
|
|
|
|
There should be different ways to specify collections of pages. The existing one, which is using a
|
|
|
|
|
numeric range, is just one. Other ideas include things related to document structure (all pages in
|
|
|
|
|
an article thread, all pages in an outline hierarchy), page labels, book binding (Is that called
|
|
|
|
|
folio? There's an issue for it.), subgroups, or any number of things.
|
|
|
|
|
|
|
|
|
|
We will need to be able to start with document-level objects to get page groups and also to start
|
|
|
|
|
with pages and reconstruct document level objects. For example, it should be possibe to reconstruct
|
|
|
|
|
article threads to omit beads that don't belong to any of the pages. Likewise with outlines.
|