Commit e52b026db4a7f23b79f21b4563d2b02d3b87fde4

Authored by Jay Berkenbilt
1 parent 379fc7e5

Major rework of TODO-pages.md

This is converging into something that will be possible to do.
Showing 1 changed file with 322 additions and 471 deletions
TODO-pages.md
1 1 # Pages
2 2  
3   -**THIS IS A WORK IN PROGRESS. THE ACTUAL IMPLEMENTATION MAY NOT LOOK ANYTHING LIKE THIS. When this
4   -gets to the stage where it is starting to congeal into an actual plan, I will remove this disclaimer
5   -and open a discussion ticket in GitHub to work out details.**
  3 +**This is a work in progress, but it's getting close. When this gets to the stage where it is
  4 +starting to congeal into an actual plan, I will remove this disclaimer and open a discussion ticket
  5 +in GitHub to work out details.**
6 6  
7 7 This document describes a project known as the _pages epic_. The goal of the pages epic is to enable
8 8 qpdf to properly preserve all functionality associated with a page as pages are copied from one PDF
9   -to another (or back to the same PDF).
  9 +to another (or back to the same PDF). A secondary goal is to add more flexibility to the ways in
  10 +which documents can be split and combined (flexible assembly).
10 11  
11 12 Terminology:
12 13 * _Page-level data_: information that is contained within objects reachable from the page dictionary
... ... @@ -14,30 +15,33 @@ Terminology:
14 15 * _Document-level data_: information that is reachable from the document catalog (`/Root`) that is
15 16 not reachable from a page dictionary as well as the `/Info` dictionary
16 17  
17   -Some document-level data references specific pages by page object ID, such as outlines or
18   -interactive forms. Some document-level data doesn't reference any pages, such as embedded files or
19   -optional content (layers). Some document-level data contains information that pertains to a specific
20   -page but does not reference the page, such as page labels (explicit page numbers). Some page-level
21   -data may sometimes depend on document-level data. For example, a _named destination_ depends on the
22   -document-level _names tree_.
  18 +PDF uses document-level data in a variety of ways. There is some document-level data that has each
  19 +of the following properties, among others:
  20 +* References pages by object ID (outlines, interactive forms)
  21 +* Doesn't reference any pages (embedded files)
  22 +* Doesn't reference any pages but influences page rendering (optional content/layers)
  23 +* Doesn't reference any pages but contains information about pages (page labels)
  24 +* Contains information used by pages (named destinations)
23 25  
24 26 As long as qpdf has had the ability to copy pages from one PDF to another, it has had robust
25 27 handling of page-level data. Prior to the implementation of the pages epic, with the exception of
26   -page labels, qpdf has ignored document-level data during page copy operations. Specifically, when
27   -qpdf creates a new PDF file from existing PDF files, it always starts with a specific PDF, known as
28   -the _primary input_. The primary input may be the built-in _empty PDF_. With the exception of page
29   -labels, document-level constructs that appear in the primary input are preserved, and document-level
30   -constructs from the other PDF files are ignored. The exception to this is page labels. With page
31   -labels, qpdf always ensures that any given page has the same label in the final output as it had in
32   -whichever input file it originated from, which is usually (but not always) the desired behavior.
  28 +page labels and form fields, qpdf has ignored document-level data during page copy operations.
  29 +Specifically, when qpdf creates a new PDF file from existing PDF files, it always starts with a
  30 +specific PDF, known as the _primary input_. The primary input may be a file or the built-in _empty
  31 +PDF_. With the exception of page labels and form fields, document-level constructs that appear in
  32 +the primary input are preserved, and document-level constructs from the other PDF files are ignored.
  33 +With page labels, qpdf always ensures that any given page has the same label in the final output as
  34 +it had in whichever input file it originated from, which is usually (but not always) the desired
  35 +behavior. With form fields, qpdf has awareness and ensures that all form fields remain operational.
  36 +The goal is to extend this document-level awareness to other document-level constructs.
33 37  
34 38 Here are several examples of problems in qpdf prior to the implementation of the pages epic:
35 39 * If two files with optional content (layers) are merged, all layers in all but the primary input
36 40 will be visible in the combined file.
37 41 * If two files with file attachments are merged, attachments will be retained on the primary input
38 42 but dropped on the others. (qpdf has other ways to copy attachments from one file to another.)
39   -* If two files with hyperlinks are merged, any hyperlink from other than primary input whose
40   - destination is a named destination will become non-functional.
  43 +* If two files with hyperlinks are merged, any hyperlink from a file other than the primary input
  44 + becomes non-functional.
41 45 * If two files with outlines are merged, the outlines from the original file will appear in their
42 46 entirety, including outlines that point to pages that are no longer there, and outlines will be
43 47 lost from all files except the primary input.
... ... @@ -55,27 +59,32 @@ arbitrary combinations of input and output files. The command-line allows only t
55 59  
56 60 The pages epic consists of two broad categories of work:
57 61 * Proper handling of document-level features when splitting and merging documents
58   -* Greatly increased flexibility in the ways in which pages can be selected from the various input
59   - files and combined for the output file. This includes creation of blank pages.
  62 +* Flexible assembly: greatly increased flexibility in the ways in which pages can be selected from
  63 + the various input files and combined for the output file. This includes creation of blank pages
  64 + and composition of pages (n-up or other ways of combining multiple input pages into one output
  65 + page)
60 66  
61 67 Here are some examples of things that will become possible:
62 68  
63 69 * Stacking arbitrary pages on top of each other with full control over transformation and cropping,
64 70 including being able to access information about the various bounding boxes associated with the
65   - pages
  71 + pages (generalization of underlay/overlay)
66 72 * Inserting blank pages
67 73 * Doing n-up page layouts
  74 +* Creating single very long or wide pages with output from other pages
68 75 * Re-ordering pages for printing booklets (also called signatures or printer spreads)
69 76 * Selecting pages based on the outline hierarchy, tags, or article threads
70 77 * Keeping only and all relevant parts of the outline hierarchies from all input files
71   -* Creating single very long or wide pages with output from other pages
72 78  
73 79 The rest of this document describes the details of how these features will work and what needs
74 80 to be done to make them possible to build.
75 81  
76   -# QPDFJob Summary
  82 +# Architectural Thoughts
  83 +
  84 +Open question: if I do all the complex logic in `QPDFJob`, what are the implications for pikepdf or
  85 +other wrappers? This will need to be discussed in the discussion ticket.
77 86  
78   -`QPDFJob` goes through the following stages:
  87 +Prior to implementation of the pages epic, `QPDFJob` goes through the following stages:
79 88  
80 89 * create QPDF
81 90 * update from JSON
... ... @@ -113,176 +122,299 @@ to be done to make them possible to build.
113 122 * Remove unreferenced resources if needed
114 123 * Preserve form fields and page labels
115 124  
116   -# Architectural Thoughts
117   -
118   -XXX WORK IN: Dump `QPDFAssembler`. Instead, these are enhancements to `QPDFJob`. Don't try to
119   -generalize this too much. There are actually only a few things we need to add to `QPDFJob`. Go
120   -through and flesh out the list, but roughly:
121   -
  125 +Broadly, the above has to be modified in the following ways:
122 126 * From the C++ API, make it possible to use an arbitrary QPDF as an input rather than having to
123 127 start with a file. That makes it possible to do arbitrary work on the PDF prior to submitting it.
124   -* Allow specification of n blank pages of a given size, e.g. `--blank=5@612x792`. Maybe we can
125   - support standard paper sizes, inches, centimeters, or sizes relative to other pages.
  128 +* Allow creation of blank pages as an additional input source
126 129 * Generalize underlay/overlay
127   - * Maybe we can do it by adding flags and allowing them to be repeated
128   - * Maybe we need a new syntax, like pstops, but with the ability to specify anchors and proportions
129   - based on varoius boxes
130   - * Maybe we need something like `--stack`
131   - * It needs to be possible to stack arbitrary pages with arbitrary transformations and to have the
132   - transformations be a function of the source or destination page; the rectangle mapping idea
133   - discussed elsewhere may be a good basis
  130 + * Enable controlling placement
  131 + * Make repeatable
  132 +* Add additional reordering options
  133 + * We don't need to provide hooks for this. If someone is going to code a hook, they can just
  134 + compute the page ordering directly.
134 135 * Have a page composition phase after the overlay/underlay stage
135 136 * Allow n-up, left-to-right (can reverse page order to get rtl), top-to-bottom, or modular
136 137 composition like pstops
137   - * Possible hook for page composition to allow custom compositions
138   -* A few additional split options
139   -
140   -Then, we need to make the existing logic handle other document-level structures, preferably in a way
141   -that requires less duplication between split and merge. Maybe we can add a flag to disregard
142   -document-level structures for speed, but I think being able to turn them on and off individually is
143   -overkill, especially since people who are that sophisticated can tweak with JSON or just do it in
144   -code.
145   -
146   -The challenge will be able to come up with command-line syntax to do most things from the CLI and to
147   -make the C++ API flexible enough for users to insert their own bits in key places, just as we can
148   -now grab the QPDF before the write phase. This approach eliminates all the function stuff. We just
149   -have to make sure we can support all these features and have a relatively easy way to add new ones
150   -or to let developers extend. The documentation will have to explain the flow of QPDFJob so people
151   -can know where to apply hooks.
152   -
153   -----------
154   -
155   -Create a new top-level class called `QPDFAssembler` that will be used to perform page-level
156   -operations. Its implementation will use existing APIs, and it will add many new APIs. It should be
157   -possible to perform all existing page splitting and merging operations using `QPDFAssembler` without
158   -having to worry about details such as copying annotations, remapping destinations, and adjusting
159   -document-level data.
160   -
161   -Early strategy: keep `QPDFAssembler` private to the library, and start with a pure C++ API (no JSON
162   -support). Migrate splitting and merging from `QPDFJob` into `QPDFAssembler`, then build in
163   -document-level support. Also work the difference between normal write and split, which are two
164   -separate ways to write output files.
165   -
166   -One of the main responsibilities of `QPDFAssembler` will be to remap destinations as data from a
167   -page is moved or copied. For example, if an outline has a destination that points to a particular
168   -rectangle on page 5 of the second file, and we end up dropping a portion of that page into an n-up
169   -configuration on a specific output page, we will have to keep track of enough information to replace
170   -the destination with a new one that points to the new physical location of the same material. For
171   -another example, consider a case in which the left side of page 3 of the primary input ends up as
172   -page 5 of the output and the right side of page 3 ends up as page 6. We would have to map
173   -destinations from a single source page to different destination pages based on which part of the
174   -page it was on. If part of the rectangle points to one page and part to another, what do we do? I
175   -suggest we go with the top/center of the rectangle.
176   -
177   -A destination consists of a QPDF, page object, and rectangle in user coordinates. When
178   -`QPDFAssembler` copies a page or converts it to a form XObject, possibly with transformations
179   -applied, it will have to be able to map a destination to the same triple (QPDF, page object,
180   -rectangle) on all pages that contain data from the original page. When writing the final output, any
181   -destination that no longer points anywhere should be dropped, and any destination that points to
182   -multiple places will need to be handled according to some specification.
  138 +* Add additional ways to select pages besides range (e.g. based on outlines)
  139 +* Add additional ways to specify boundaries for splitting
  140 +* Enhance existing logic to handle other document-level structures, preferably in a way that
  141 + requires less duplication between split and merge.
  142 + * We don't need to turn on and off most types of document constructs individually. People can
  143 + preprocess using the API or qpdf JSON if they want fine-grained control.
  144 + * For things like attachments and outlines, we can add additional flags.
  145 +
  146 +## Flexible Assembly
  147 +
  148 +This section discusses modifications to the command-line syntax to make it easier to add flexibility
  149 +going forward without breaking backward compatibility. The main thrust will be to create
  150 +non-positional alternatives to some things that currently use positional arguments (`--pages`,
  151 +`--overlay`, `--underlay`), as was done for `--encrypt` in 11.7.0, to make it possible to add
  152 +additional flags.
  153 +
  154 +In several cases, we allow specification of transformations or placements. In this context:
  155 +* The origin is always lower-left corner.
  156 +* A _dimension_ may be absolute or relative.
  157 + * An _absolute dimension_ is `{n}` (in points), `{n}in` (inches), or `{n}cm` (centimeters).
  158 + * A _relative dimension_ is expressed in terms of the corresponding dimension of one of a page's
  159 + boxes. Which dimension is determined by context.
  160 + * `{n}{M|C|B|T|A}` is `{n}` times the corresponding dimension of the media, crop, bleed, trim,
  161 + or art box. Example: `0.5M` would be half the width or height of the media box.
  162 + * `{n}+{M|C|B|T|A}` is `{n}` plus the corresponding dimension. Example: `-0.5in+T` is half an
  163 + inch (36 points) less than the width or height of the trim box.
  164 +* A _size_ is
  165 + * `{w}x{h}`, where `{w}` and `{h}` are dimensions
  166 + * `letter|a4` (potentially add other page sizes)
  167 +* A _position_ is `{x}x{y}` where `{x}` and `{y}` are dimensions offset from the origin
  168 +* A _rectangle_ is `{llx},{lly},{urx},{ury}` (lower|upper left|right x|y) with `llx` < `urx` and
  169 + `lly` < `ury`
  170 + * Examples:
  171 + * `0.1M,0.1M,0.9M,0.9M` is a box whose llx is 10% of the media box width, lly is 10% of the
  172 + height, urx is 90% of the width, and ury is 90% of the height
  173 + * `0,0,612,792` is a box whose size is that of a US Letter page.
  174 + * A rectangle may also be just one of `M|C|B|T|A` to refer to a page's media, crop, bleed, trim,
  175 + or art box.
  176 +
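The dimension syntax defined above can be illustrated with a small resolver. This is only a sketch under stated assumptions: `parse_absolute`, `resolve_dimension`, and the `box_extents` mapping are hypothetical names chosen for illustration and are not part of qpdf. The caller is assumed to pass the width or the height of each box, since the text says the relevant dimension is determined by context.

```python
import re

# Points per unit for absolute dimensions: bare numbers are points.
POINTS_PER_UNIT = {"": 1.0, "in": 72.0, "cm": 72.0 / 2.54}
BOX_LETTERS = "MCBTA"  # media, crop, bleed, trim, art


def parse_absolute(spec):
    """Convert '{n}', '{n}in', or '{n}cm' to points."""
    match = re.fullmatch(r"([+-]?\d*\.?\d+)(in|cm)?", spec)
    if match is None:
        raise ValueError(f"not an absolute dimension: {spec!r}")
    return float(match.group(1)) * POINTS_PER_UNIT[match.group(2) or ""]


def resolve_dimension(spec, box_extents):
    """Resolve an absolute or relative dimension to points.

    box_extents (hypothetical) maps a box letter to that box's width or
    height in points; the caller chooses which one by context.
    """
    if spec and spec[-1] in BOX_LETTERS:
        extent = box_extents[spec[-1]]
        rest = spec[:-1]
        if rest.endswith("+"):
            # '{n}+{box}': an absolute dimension added to the box extent
            return parse_absolute(rest[:-1]) + extent
        # '{n}{box}': a multiple of the box extent
        return float(rest) * extent
    return parse_absolute(spec)


print(resolve_dimension("0.5M", {"M": 612.0}))
print(resolve_dimension("-0.5in+T", {"T": 600.0}))
```

With a media box width of 612 points, `0.5M` resolves to half that width, and `-0.5in+T` resolves to 36 points less than the trim box extent, matching the two examples in the text.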
  177 +Tweak `--pages` similarly to `--encrypt`. As an alternative to `--pages file [--password=p] range
  178 +--`, support `--pages --file=x --password=y --range=z --`. This allows for a more flexible syntax.
  179 +If `--file` appears, positional arguments are disallowed. The same applies to `--overlay` and
  180 +`--underlay`.
  181 +
  182 +```
  183 +OLD: qpdf 2.pdf --pages 1.pdf --password=x . 3.pdf 1-z -- out.pdf
  184 +NEW: qpdf 2.pdf --pages --file=1.pdf --password=x --file=. --file=3.pdf --range=1-z -- out.pdf
  185 +```
  186 +
  187 +This makes it possible to add additional flags to do things like control how document-level features
  188 +are handled, specify placement options, etc. Given the above framework, it would be possible to add
  189 +additional features incrementally, without breaking compatibility, such as selecting or splitting
  190 +pages based on tags, article threads, or outlines.
  191 +
  192 +It's tempting to allow assemblies to be nested, but this gets very complicated. From the C++ API, we
  193 +could modify QPDFJob to allow the use of any QPDF as an input, but supporting this from the CLI is hard
  194 +because of the way JSON/arg parsing is set up. If people need to do that, they can just create
  195 +intermediate files.
  196 +
  197 +Proposed CLI enhancements:
  198 +
  199 +```
  200 +# --pages: inputs
  201 +--file=x [ --password=x ]
  202 +--blank=n [ --size={size} [ --size-from-page=n ] ] # see below
  203 +# modifiers refer to most recent input
  204 +--range=...
  205 +--with-attachments={none|all|referenced} # default = referenced
  206 +--with-outlines={none|all|referenced} # default = referenced
  207 +--... # future options to select pages based on outlines, article threads, tags, etc.
  208 +# placement (matrix transformation -- see notes below)
  209 +--rotate=[+-]angle[:page-range] # existing
  210 +--scale=x,y[:page-range]
  211 +--translate=dx,dy[:page-range] # dx and dy are dimensions
  212 +--flip={h|v}[:page-range]
  213 +--transform=a,b,c,d,e,f[:page-range]
  214 +--set-box={M|C|B|T|A}=rect[:page-range] # change a bounding box
  215 +# stacking -- make --underlay and --overlay repeatable
  216 +--{underlay|overlay} ... --
  217 +--file=x [ --password=x ]
  218 +--from, --to, --repeat # same as current --overlay, --underlay
  219 +--from-rect={rect} # default = T -- see notes
  220 +--to-rect={rect} # default = M -- see notes
  221 +# composition -- a new QPDFJob stage between stacking and transformation
  222 +--compose=... # see notes
  223 +--n-up={2,4,6,9,16}
  224 +--concat={h|v} # concatenate all pages to a single big page
  225 +# reordering
  226 +--collate=a,b,c # exists
  227 +--booklet=... # re-order pages for book signatures like psbook -- see notes
  228 +# split
  229 +--split-pages=n # existing
  230 +--split-after=a,b,c # split after each named page
  231 +--... # future options to split based on outlines, article threads, tags, etc.
  232 +# post-processing (with transformations like optimize images)
  233 +--set-page-labels ... # See issue #939
  234 +```
  235 +
  236 +Notes:
  237 +* For `--blank`, `--size` specifies the size of the blank page. If any relative dimensions are used,
  238 + `--size-from-page=n` must be used to specify the page (from n in the overall input) that relative
  239 + dimensions should be taken from. It is an error to specify a relative size based on another blank
  240 + page. (Let's not complicate things by doing a graph traversal to find an eventual absolute page.
  241 + Just disallow a blank page to be specified relative to another blank page.)
  242 +* For stacking, the default is to map the source page's trim box onto the destination page's
  243 + media box. This is a weird default, but it's there for compatibility. The `--from-rect` and
  244 + `--to-rect` may be used to map an arbitrary region of the over/underlay file into an arbitrary
  245 + region of a page. With the defaults, an overlay or underlay page will be stretched or shrunk if
  246 + pages are of variable size. Absolute rectangles can be used to avoid this. If a rectangle uses
  247 + relative dimensions, they are relative to the page that has the rectangle. You can't create a
  248 + `--to-rect` relative to the size of the from page or vice versa. If you need to do this, use
  249 + external logic to compute the rectangles and then use absolute rectangles.
  250 +* `--compose`: XXX
  251 +* `--booklet`: XXX
  252 +* The `--set-page-labels` option would be done at the very end and is actually not blocked by
  253 + anything else here. It can be done near removing page labels in `handleTransformations`.
  254 +* I'm not sure what impact composition should have on page labels. Most likely, we should drop page
  255 + labels on composition. If someone wants them, they can use `--set-page-labels`.
  256 +
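The stacking note above describes mapping a source rectangle (`--from-rect`) onto a destination rectangle (`--to-rect`). One way to picture this is as a scale-and-translate matrix in the usual PDF `a,b,c,d,e,f` form. The sketch below is an illustration only: `rect_to_rect_matrix` and `apply_matrix` are hypothetical helpers, not qpdf functions, and both rectangles are assumed to be already resolved to absolute points.

```python
def rect_to_rect_matrix(from_rect, to_rect):
    """Return (a, b, c, d, e, f) mapping from_rect onto to_rect.

    Each rectangle is (llx, lly, urx, ury) in points. Width and height are
    scaled independently, so the source is stretched or shrunk when the two
    rectangles have different aspect ratios.
    """
    f_llx, f_lly, f_urx, f_ury = from_rect
    t_llx, t_lly, t_urx, t_ury = to_rect
    sx = (t_urx - t_llx) / (f_urx - f_llx)
    sy = (t_ury - t_lly) / (f_ury - f_lly)
    # Translate so that the lower-left corners coincide after scaling.
    return (sx, 0.0, 0.0, sy, t_llx - sx * f_llx, t_lly - sy * f_lly)


def apply_matrix(matrix, x, y):
    """Apply a PDF-style transformation matrix to a point."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# Map a US Letter source region into a half-size region offset by 100 points.
m = rect_to_rect_matrix((0, 0, 612, 792), (100, 100, 406, 496))
print(m)
print(apply_matrix(m, 612, 792))
```

Using absolute rectangles for both arguments, as the note suggests, keeps the resulting matrix independent of the individual page sizes.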
  257 +### Compose, Booklet
  258 +
  259 +This section needs to be fleshed out. It is probably lower priority than document-level work.
  260 +
  261 +Here are some ideas from pstops. The following is an excerpt from the pstops manual page. Maybe we
  262 +can come up with something similar using our enhanced rectangle syntax.
  263 +
  264 +This section contains some sample re-arrangements. To put two pages on one sheet (of A4 paper),
  265 +the pagespec to use is:
  266 +```
  267 +2:0L@.7(21cm,0)+1L@.7(21cm,14.85cm)
  268 +```
  269 +To select all of the odd pages in reverse order, use:
  270 +```
  271 +2:-0
  272 +```
  273 +To re-arrange pages for printing 2-up booklets, use
  274 +```
  275 +4:-3L@.7(21cm,0)+0L@.7(21cm,14.85cm)
  276 +```
  277 +for the front sides, and
  278 +```
  279 +4:1L@.7(21cm,0)+-2L@.7(21cm,14.85cm)
  280 +```
  281 +for the reverse sides (or join them with a comma for duplex printing).
  282 +
  283 +From issue #493
  284 +```
  285 + pdf2ps infile.pdf infile.ps
  286 + ps2ps -pa4 "2:0R(4.5cm,26.85cm)+1R(4.5cm,14.85cm)" infile.ps outfile.ps
  287 + ps2pdf outfile.ps outfile.pdf
  288 + ```
  289 +
  290 +Notes on signatures (psbook). For a signature of size 3, we have the following assuming a 2-up
  291 +configuration that is printed double-sided so that, when the whole stack is placed face-up and
  292 +folded in half, page 1 is on top.
  293 +* front: 6,7, back: 8,5
  294 +* front: 4,9, back: 10,3
  295 +* front: 2,11, back: 12,1
  296 +
  297 +This is the same as duplex 2-up with pages in order 6, 7, 8, 5, 4, 9, 10, 3, 2, 11, 12, 1
  298 +
  299 +n-up:
  300 +* For 2-up, calculate new w and h such that w/h maintains a fixed ratio and w and h are the largest
  301 + values that can fit within 1/2 the page with specified margins.
  302 +* Can support 1, 2, 4, 6, 9, 16. 2 and 6 require rotation. The others don't. Will probably need to
  303 + change getFormXObjectForPage to handle other boxes than trim box.
  304 +* Maybe define n-up as a scale and rotate followed by fitting the result into a specified rectangle. I
  305 + might already have this logic in QPDFAnnotationObjectHelper::getPageContentForAppearance.
  306 +
  307 +## Destinations
  308 +
  309 +We will have to keep track of destinations that point to a page when the page is moved or copied.
  310 +For example, if an outline has a destination that points to a particular rectangle on page 5 of the
  311 +second file, and we end up dropping a portion of that page into an n-up configuration on a specific
  312 +output page, we will have to keep track of enough information to replace the destination with a new
  313 +one that points to the new physical location of the same material. For another example, consider a
  314 +case in which the left side of page 3 of the primary input ends up as page 5 of the output and the
  315 +right side of page 3 ends up as page 6. We would have to map destinations from a single source page
  316 +to different destination pages based on which part of the page it was on. If part of the rectangle
  317 +points to one page and part to another, what do we do? I suggest we go with the top/center of the
  318 +rectangle.
  319 +
  320 +A destination consists of a QPDF, page object, and rectangle in user coordinates. When `QPDFJob`
  321 +copies a page or converts it to a form XObject, possibly with transformations applied, it will have
  322 +to be able to map a destination to the same triple (QPDF, page object, rectangle) on all pages that
  323 +contain data from the original page. When writing the final output, any destination that no longer
  324 +points anywhere should be dropped, and any destination that points to multiple places will need to
  325 +be handled according to some specification.
183 326  
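The destination remapping described above can be sketched with a list of placements, each recording which part of the source page ended up on which output page and under which transformation. Everything here is an assumption made for illustration: `remap_destination`, the placement tuple layout, and the choice of the top/center point follow the suggestion in the text but are not an existing qpdf API.

```python
def transform_point(matrix, x, y):
    """Apply a PDF-style (a, b, c, d, e, f) matrix to a point."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


def remap_destination(dest_rect, placements):
    """Map a destination rectangle on a source page to an output location.

    placements is a list of (output_page, clip_rect, matrix) tuples, where
    clip_rect is the visible region in source-page coordinates. The
    top/center point of the destination decides which placement wins; the
    first matching placement takes precedence. Returns None when the point
    is not visible anywhere, meaning the destination should be dropped.
    """
    llx, lly, urx, ury = dest_rect
    px, py = (llx + urx) / 2, ury  # top/center of the rectangle
    for output_page, clip, matrix in placements:
        c_llx, c_lly, c_urx, c_ury = clip
        if c_llx <= px <= c_urx and c_lly <= py <= c_ury:
            return output_page, transform_point(matrix, px, py)
    return None


# Left half of a Letter page goes to output page 5, right half to page 6.
placements = [
    (5, (0, 0, 306, 792), (1, 0, 0, 1, 0, 0)),
    (6, (306, 0, 612, 792), (1, 0, 0, 1, -306, 0)),
]
print(remap_destination((400, 600, 500, 700), placements))
print(remap_destination((100, 100, 200, 200), placements))
```

A destination whose top/center point falls outside every clip rectangle returns `None`, which corresponds to the case where the destination no longer points anywhere and should be dropped from the output.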
184 327 Whenever we create any new thing from a page, we create _derived page data_. Examples of derived
185   -page data would include a copy of the page and a form XObject created from a page. `QPDFAssembler`
186   -will have to keep a mapping from any source page to all of its derived objects along with any
187   -transformations or clipping. When a derived page data object is placed on a final page, that
188   -information can be combined with the position and any transformations onto the final page to be able
189   -to map any destination to a new one or to determine that it points outside of the visible area.
190   -
191   -If a source page is copied multiple times, then if exactly one copy is explicitly marked as the
192   -target, that becomes the target. Otherwise, the first derived object to be placed becomes the
193   -target.
194   -
195   -## Overall Structure
196   -
197   -A single instance of `QPDFAssembler` creates a single assembly job. `QPDFJob` can create one
198   -assembly job but does other things, such as setting writer options, inspection operations, etc. An
199   -assembly job consists of the following:
200   -* Global document-level data handling information
201   - * Mode
202   - * intelligent: try to combine everything using latest capabilities of qpdf; this is the default
203   - * legacy: document-level features are kept from primary input; this is for compatibility and can
204   - be selected from the CLI
205   -* Input sources
206   - * File/password
207   - * Whether to keep attachments: yes, no, if-all-pages (default)
208   - * Empty
209   -* Output mode
210   - * Single file
211   - * Split -- this must include definitions of the split groups
212   -* Description of the output in terms of the input sources and some series of transformations
213   -
214   -## Cases to support
215   -
216   -Here is a list of cases that need to be expressible.
217   -
218   -* Create output by concatenating pages from page groups where each page group is pages specified by
219   - a numeric range. This is what `--pages` does now.
220   -* Collation, including different sized groups.
221   -* Overlay/underlay, generalized to support a stack consisting of various underlays, the base page,
222   - and various overlays, with flexibility around posititioning. It should be natural to express
223   - exactly whate underlay and overlay do now.
224   -* Split into groups of fixed size (what `--split-pages` does) with the ability to define split
225   - groups based on other things, like outlines, article threads, and document structure
226   -* Examples from the manual:
227   - * `qpdf in.pdf --pages . a.pdf b.pdf:even -- out.pdf`
228   - * `qpdf --empty --pages a.pdf b.pdf --password=x z-1 c.pdf 3,6`
229   - * `qpdf --collate odd.pdf --pages . even.pdf -- all.pdf`
230   - * `qpdf --collate --empty --pages odd.pdf even.pdf -- all.pdf`
231   - * `qpdf --collate --empty --pages a.pdf 1-5 b.pdf 6-4 c.pdf r1 -- out.pdf`
232   - * `qpdf --collate=2 --empty --pages a.pdf 1-5 b.pdf 6-4 c.pdf r1 -- out.pdf`
233   - * `qpdf file2.pdf --pages file1.pdf 1-5 . 15-11 -- outfile.pdf`
234   - *
235   - ```
236   - qpdf --empty --copy-encryption=encrypted.pdf \
237   - --encryption-file-password=pass \
238   - --pages encrypted.pdf --password=pass 1 \
239   - ./encrypted.pdf --password=pass 1 -- \
240   - outfile.pdf
241   - ```
242   - * `qpdf --collate=2,6 a.pdf --pages . b.pdf -- all.pdf`
243   - * Take A 1-2, B 1-6, A 3-4, C 7-12, A 5-6, B 13-18, ...
244   -* Ideas from pstops. The following is an excerpt from the pstops manual page.
245   -
246   - This section contains some sample re-arrangements. To put two pages on one sheet (of A4 paper),
247   - the pagespec to use is:
248   - ```
249   - 2:0L@.7(21cm,0)+1L@.7(21cm,14.85cm)
250   - ```
251   - To select all of the odd pages in reverse order, use:
252   - ```
253   - 2:-0
254   - ```
255   - To re-arrange pages for printing 2-up booklets, use
256   - ```
257   - 4:-3L@.7(21cm,0)+0L@.7(21cm,14.85cm)
258   - ```
259   - for the front sides, and
260   - ```
261   - 4:1L@.7(21cm,0)+-2L@.7(21cm,14.85cm)
262   - ```
263   - for the reverse sides (or join them with a comma for duplex printing).
264   -* From #493
265   - ```
266   - pdf2ps infile.pdf infile.ps
267   - ps2ps -pa4 "2:0R(4.5cm,26.85cm)+1R(4.5cm,14.85cm)" infile.ps outfile.ps
268   - ps2pdf outfile.ps outfile.pdf
269   - ```
270   -* Like psbook. Signature size n:
271   - * take groups of 4n
272   - * shown for n=3 in order such that, if printed so that the front of the first page is on top, the
273   - whole stack can be folded in half.
274   - * front: 6,7, back: 8,5
275   - * front: 4,9, back: 10,3
276   - * front: 2,11, back: 12,1
277   -
278   - This is the same as duplex 2-up with pages in order 6, 7, 8, 5, 4, 9, 10, 3, 2, 11, 12, 1
279   -* n-up:
280   - * For 2-up, calculate new w and h such that w/h maintains a fixed ratio and w and h are the
281   - largest values that can fit within 1/2 the page with specified margins.
282   - * Can support 1, 2, 4, 6, 9, 16. 2 and 6 require rotation. The others don't. Will probably need to
283   - change getFormXObjectForPage to handle other boxes than trim box.
284   - * Maybe define n-up a scale and rotate followed by fitting the result into a specified rectangle.
285   - I might already have this logic in QPDFAnnotationObjectHelper::getPageContentForAppearance.
  328 +page data would include a copy of the page and a form XObject created from a page. We will have to
  329 +keep a mapping from any source page to all of its derived objects along with any transformations or
  330 +clipping. When a derived page data object is placed on a final page, that information can be
  331 +combined with the position and any transformations onto the final page to be able to map any
  332 +destination to a new one or to determine that it points outside of the visible area. There is
  333 +already code in placeFormXObject and the code that places appearance streams that deals with these
  334 +kinds of mappings.
  335 +
  336 +What do we do if a source page is copied multiple times? I think we will have to just make the new
  337 +destination point to the first place that the target appears with precedence going to the original
  338 +location. If we can detect this, we can give a warning.
  339 +
  340 +# Document-level Behavior
  341 +
  342 +Both merging and splitting contain logic, sometimes duplicated, to handle page labels, form fields,
  343 +and annotations. We will need to build logic for other things. This section is a rough breakdown of
  344 +the different things in the document catalog (plus the info dictionary, which is referenced from the
  345 +trailer) and how we may have to handle them. We will need to implement various ObjectHelper and
  346 +DocumentHelper classes.
  347 +
  348 +7.7.2 contains the list of all keys in the document catalog.
  349 +
  350 +Document-level structures to merge:
  351 +* Extensions
  352 + * Must be combination of Extensions from all input files
  353 +* PageLabels
  354 + * Ensure each page has its original label
  355 + * Allow post-processing
  356 +* Names -- see below
  357 + * Combine per tree
  358 + * May require disambiguation
  359 + * Page: TemplateInstantiated
  360 +* Dests
  361 + * Keep referenced destinations across all files
  362 + * May need to disambiguate or "flatten" or convert to named dests with the names tree
  363 +* Outlines
  364 +* Threads (easy)
  365 + * Page: B
  366 +* AcroForm
  367 +* StructTreeRoot
  368 + * Page: StructParents
  369 +* MarkInfo (see 14.7 - Logical Structure, 14.8 Tagged PDF)
  370 +* SpiderInfo
  371 + * Page: ID
  372 +* OutputIntents
  373 + * Page: OutputIntents
  374 +* PieceInfo
  375 + * Page: PieceInfo
  376 +* OCProperties
  377 +* Requirements
  378 +* AF (file specification dictionaries)
  379 + * Page: AF
  380 +* DPartRoot
  381 + * Page: DPart
  382 +* Version
  383 + * Maximum
  384 +
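As an illustration of the "Version: Maximum" item, here is a small sketch that picks the highest of several version strings. `parse_version` and `max_version` are hypothetical helpers, not qpdf functions, and the sketch assumes each input is a well-formed `major.minor` string. It ignores extension levels, which would have to be handled with Extensions.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Split "major.minor" into numbers so that "1.10" compares above "1.9".
// Assumes well-formed input; std::stoi throws on anything else.
std::pair<int, int> parse_version(std::string const& v)
{
    auto dot = v.find('.');
    int major = std::stoi(v.substr(0, dot));
    int minor = (dot == std::string::npos) ? 0 : std::stoi(v.substr(dot + 1));
    return {major, minor};
}

// Return the highest version among the inputs, or an empty string if
// there are no inputs.
std::string max_version(std::vector<std::string> const& versions)
{
    if (versions.empty()) {
        return "";
    }
    return *std::max_element(
        versions.begin(),
        versions.end(),
        [](std::string const& a, std::string const& b) {
            return parse_version(a) < parse_version(b);
        });
}
```

Comparing parsed numbers rather than raw strings matters because a plain string comparison would rank "1.9" above "1.10".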
  385 +Things that stay with the first document that has one and/or will not be supported
  386 +* AA (Additional Actions)
  387 + * Would be possible to combine and let the first contributor win, but it probably wouldn't usually
  388 + be what we want.
  389 +* Info (not part of document catalog)
  390 +* ViewerPreferences
  391 +* PageLayout
  392 +* PageMode
  393 +* OpenAction
  394 +* URI
  395 +* Metadata
  396 +* Lang
  397 +* NeedsRendering
  398 +* Collection
  399 +* Perms
  400 +* Legal
  401 +* DSS
  402 +
  403 +Name dictionary (7.7.4)
  404 +* Dests
  405 +* AP (appearance streams)
  406 +* JavaScript
  407 +* Pages (named pages)
  408 +* Templates
  409 + * Combine across all documents
  410 + * Page: TemplateInstantiated points to a named page
  411 +* IDS
  412 +* URLS
  413 +* EmbeddedFiles
  414 +* AlternatePresentations
  415 +* Renditions
  416 +
  417 +Most of chapter 12 applies. See Document-level navigation (12.3).
286 418  
287 419 # Feature to Issue Mapping
288 420  
... ... @@ -292,6 +424,8 @@ Last checked: 2023-12-29
292 424 gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open
293 425 ```
294 426  
  427 +* Allow an existing `QPDF` to be an input to a merge operation when using the QPDFJob C++ API
  428 + * Issues: none
295 429 * Generate a mapping from source to destination for all destinations
296 430 * Issues: #1077
297 431 * Notes:
... ... @@ -328,7 +462,7 @@ gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open
328 462 * This looks complicated. It may not be possible to do this fully in the first increment, but
329 463 we have to keep it in mind and warn if we can't and we see /SD in an action.
330 464 * #490 has some good analysis
331   -* Assign page labels
  465 +* Assign page labels (renumber pages)
332 466 * Issues: #939
333 467 * Notes:
334 468 * #939 has a good proposal
... ... @@ -381,286 +515,3 @@ gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open
381 515 * There is some helpful discussion in #343 including
382 516 * Preserving open/closed status
383 517 * Preserving javascript actions
384   -
385   -# XXX OLD NOTES
386   -
387   -I want to encapsulate various aspects of the logic into interfaces that can be implemented by
388   -developers to add their own logic. It should be easy to contribute these. Here are some rough ideas.
389   -
390   -A source is an input file, the output of another operation, or a blank page. In the API, it can be
391   -any QPDF object.
392   -
393   -A page group is just a group of pages.
394   -
395   -* PageSelector -- creates page groups from other page groups
396   -* PageTransformer -- selects a part of a page and possibly transforms it; applies to all pages of a
397   - group. Based on the page dictionary; does not look at the content stream
398   -* PageFilter -- apply arbitrary code to a page; may access the content stream
399   -* PageAssembler -- combines pages from groups into new groups whose pages are each assembled from
400   - corresponding pages of the input groups
401   -
402   -These should be able to be composed in arbitrary ways. There should be a natural API for doing this,
403   -and there should be some specification, probably based on JSON, that can be provided on the
404   -command line or embedded in the job JSON format. I have been considering whether a lisp-like
405   -S-expression syntax may be less cumbersome to work with. I'll have to decide whether to support this
406   -or some other syntax in addition to a JSON representation.
407   -
408   -There also needs to be something to represent how document-level structures relate to this. I'm not
409   -sure exactly how this should work, but we need things like
410   -* what to do with page labels, especially when assembling pages from other pages
411   -* whether to preserve destinations (outlines, links, etc.), particularly when pages are duplicated
412   - * If A refers to B and there is more than one copy of B, how do you decide which copies of A link
413   - to which copies of B?
414   -* what to do with pages that belong to more than one group, e.g., what happens if you used document
415   - structure or outlines to form page groups and a group boundary lies in the middle of the page
416   -
417   -Maybe page groups can have arbitrary, user-defined tags so we can specify that links should only
418   -point to other pages with the same value of some tag. We can probably allow many-to-one links if the
419   -source is duplicated.
420   -
421   -We probably need to hold onto the concept of the primary input file. If there is a primary input
422   -file, there may need to be a way to specify what gets preserved from it. The behavior of qpdf prior to
423   -all of this is to preserve all document-level constructs from the primary input file and to try to
424   -preserve page labels from other input files when combining pages.
425   -
426   -Here are some examples.
427   -
428   -* PageSelector
429   - * all pages from an input file
430   - * pages from a group using a NumericRange
431   - * concatenate groups
432   - * pages from a group in reverse order
433   - * a group repeated as often as necessary until a specified number of pages is reached
434   - * a group padded with blank pages to create a multiple of n pages
435   - * odd or even pages from a group
436   - * every nth page from a group
437   - * pages interleaved from multiple groups
438   - * the left-front (left-back, right-front, right-back) pages of a booklet with signatures of n
439   - pages
440   - * all pages reachable from a section of the outline hierarchy or something based on threads or
441   - other structure
442   - * selection based on page labels
443   -* PageTransformer
444   - * clip to media box (trim box, crop box, etc.)
445   - * clip to specific absolute or relative size
446   - * scale
447   - * translate
448   - * rotate
449   - * apply transformation matrix
450   -* PageFilter
451   - * optimize images
452   - * flatten annotations
453   -* PageAssembler
454   - * Overlay/underlay all pages from one group onto corresponding pages from another group
455   - * Control placement based on properties of all the groups, so higher order than a stand-alone
456   - transformer
457   - * Examples
458   - * Scale the smaller page up to the size of the larger page
459   - * Center the smaller page horizontally and bottom-align the trim boxes
460   - * Generalized overlay/underlay allowing n pages in a given order with transformations.
461   - * n-up -- application of generalized overlay/underlay
462   - * make one long page with an arbitrary number of pages one after the other (#546)
463   -
464   -It should be possible to represent all of the existing qpdf operations using the above framework. It
465   -would be good to re-implement all of them in terms of this framework to exercise it. We will have to
466   -look through all the command-line arguments and make sure. Of course also make sure suggestions from
467   -issues can be implemented or at least supported by adding new selectors.
468   -
469   -Here are a few bits of scratch work. The top-level call is a selector. This doesn't capture
470   -everything. Implementing this would be tedious and challenging. It could be done using JSON arrays,
471   -but it would be clunky. This feels over-designed and possibly in conflict with QPDFJob.
472   -
473   -```
474   -(concat
475   - (primary-input)
476   - (file "file2.pdf")
477   - (page-range (file "file3.pdf") "1-4,5-8")
478   -)
479   -
480   -(with
481   - ("a"
482   - (concat
483   - (primary-input)
484   - (file "file2.pdf")
485   - (page-range (file "file3.pdf") "1-4,5-8")
486   - )
487   - )
488   - (concat
489   - (even-pages (from "a"))
490   - (reverse (odd-pages (from "a")))
491   - )
492   -)
493   -
494   -(with
495   - ("a"
496   - (concat
497   - (primary-input)
498   - (file "file2.pdf")
499   - (page-range (file "file3.pdf") "1-4,5-8")
500   - )
501   - "b-even"
502   - (even-pages (from "a"))
503   - "b-odd"
504   - (reverse (odd-pages (from "a")))
505   - )
506   - (stack
507   - (repeat-range (from "a") "z")
508   - (pad-end (from "b"))
509   - )
510   -)
511   -```
512   -
513   -```json
514   -
515   -```
516   -
517   -# Supporting Document-level Features
518   -
519   -qpdf needs full support for document-level features like article threads, outlines, etc. There is no
520   -support for some things and partial support for others. See notes below for a comprehensive list.
521   -
522   -Most likely, this will be done by creating DocumentHelper and ObjectHelper classes.
523   -
524   -It will be necessary not only to read information about these structures from a single PDF file as
525   -the existing document helpers do but also to reconstruct or update these based on modifications to
526   -the pages in a file. I'm not sure how to do that, but one idea would be to allow a document helper
527   -to register a callback with QPDFPageDocumentHelper that notifies it when a page is added or removed.
528   -This may be able to take other parameters such as a document helper from a foreign file.
529   -
530   -Since these operations can be expensive, there will need to be a way to opt in and out. The default
531   -(to be clearly documented) should be that all supported document-level constructs are preserved.
532   -That way, as new features are added, changes to the output of previous operations to include
533   -information that was previously omitted will not constitute a non-backward compatible change that
534   -requires a major version bump. This will be a default for the API when using the higher-level page
535   -assembly API (below) as well as the CLI.
536   -
537   -There will also need to be some kind of support for features that are document-level and not tied to
538   -any pages, such as (sometimes) embedded files. When splitting/merging files, there needs to be a way
539   -to specify what should happen with those things. Perhaps the default here should be that these are
540   -preserved from files from which all pages are selected. For some things, like viewer preferences, it
541   -may make sense to take them from the first file.
542   -
543   -# Page Assembly (page selection)
544   -
545   -In addition to the existing numeric ranges of page numbers, page selection could be driven by
546   -document-level features like the outlines hierarchy or article threads. There have been a lot of
547   -suggestions about this in various tickets. There will need to be some kind of page manipulation
548   -class with configuration options. I'm thinking something similar to QPDFJob, where you construct a
549   -class and then call a bunch of methods to configure it, including the ability to configure with
550   -JSON. Several suggestions have been made in issues, which I will go through and distill into a list.
551   -Off hand, some ideas include being able to split based on explicit chunks and being able to do all
552   -pages except a list of pages.
553   -
554   -For CLI, I'm probably going to have it take a JSON blob or JSON file on the CLI rather than having
555   -some absurd way of doing it with arguments (people have a lot of trouble with --pages as it is). See
556   -TODO for a feature on command-line/job JSON support for JSON specification arguments.
557   -
558   -There are some other things, like allowing n-up and generalizing overlay/underlay to allow different
559   -placement and scaling options, that I think may also be in scope.
560   -
561   -# Scaling/Transforming Pages
562   -
563   -* Keep in mind that destinations, such as links and outlines, may need to be adjusted when a page is
564   - scaled or otherwise transformed.
565   -
566   -# Notes
567   -
568   -PDF document structure
569   -
570   -The trailer contains the catalog and the Info dictionary. We probably need to do something
571   -intelligent with the info dictionary.
572   -
573   -7.7.2 contains the list of all keys in the document catalog.
574   -
575   -Document-level structures to merge:
576   -* Extensions
577   - * Must be combination of Extensions from all input files
578   -* PageLabels
579   - * Ensure each page has its original label
580   - * Allow post-processing
581   -* Names -- see below
582   - * Combine per tree
583   - * May require disambiguation
584   - * Page: TemplateInstantiated
585   -* Dests
586   - * Keep referenced destinations across all files
587   - * May need to disambiguate or "flatten" or convert to named dests with the names tree
588   -* Outlines
589   -* Threads (easy)
590   - * Page: B
591   -* AcroForm
592   -* StructTreeRoot
593   - * Page: StructParents
594   -* MarkInfo (see 14.7 - Logical Structure, 14.8 Tagged PDF)
595   -* SpiderInfo
596   - * Page: ID
597   -* OutputIntents
598   - * Page: OutputIntents
599   -* PieceInfo
600   - * Page: PieceInfo
601   -* OCProperties
602   -* Requirements
603   -* AF (file specification dictionaries)
604   - * Page: AF
605   -* DPartRoot
606   - * Page: DPart
607   -* Version
608   - * Maximum
609   -
610   -Things that stay with the first document that has one and/or will not be supported
611   -* AA (Additional Actions)
612   - * Would be possible to combine and let the first contributor win, but it probably wouldn't usually
613   - be what we want.
614   -* Info (not part of document catalog)
615   -* ViewerPreferences
616   -* PageLayout
617   -* PageMode
618   -* OpenAction
619   -* URI
620   -* Metadata
621   -* Lang
622   -* NeedsRendering
623   -* Collection
624   -* Perms
625   -* Legal
626   -* DSS
627   -
628   -Name dictionary (7.7.4)
629   -* Dests
630   -* AP (appearance streams)
631   -* JavaScript
632   -* Pages (named pages)
633   -* Templates
634   - * Combine across all documents
635   - * Page: TemplateInstantiated points to a named page
636   -* IDS
637   -* URLS
638   -* EmbeddedFiles
639   -* AlternatePresentations
640   -* Renditions
641   -
642   -Most of chapter 12 applies.
643   -
644   -Document-level navigation (12.3)
645   -
646   -QPDF will need a global way to reference a page. This will most likely be in the form of the QPDF
647   -uuid and a QPDFObjectHandle to the page. If this can just be a QPDFObjectHandle, that would be
648   -better. I need to make sure we can meaningfully interact with QPDFObjectHandle objects from multiple
649   -QPDFs in a safe fashion. Figure out how this works with immediateCopyFrom, etc. Better to avoid this
650   -whole thing and make sure that we just keep all the document-level stuff specific to a PDF, but we
651   -will need to have some internal representation that can be used to reconstruct the document-level
652   -dictionaries when writing. Making this work with structures (structure destinations) will require
653   -more indirection.
654   -
655   -I imagine that there will be some internal representation of what document-level things come along
656   -for the ride when we take a page from a document. I wonder whether this needs to change the way
657   -linearization works.
658   -
659   -There should be different ways to specify collections of pages. The existing one, which is using a
660   -numeric range, is just one. Other ideas include things related to document structure (all pages in
661   -an article thread, all pages in an outline hierarchy), page labels, book binding (Is that called
662   -folio? There's an issue for it.), subgroups, or any number of things.
663   -
664   -We will need to be able to start with document-level objects to get page groups and also to start
665   -with pages and reconstruct document level objects. For example, it should be possible to reconstruct
666   -article threads to omit beads that don't belong to any of the pages. Likewise with outlines.
... ...