Commit f970b058623c0c57064a047cd1c19ac405fc0cb3

Authored by Jay Berkenbilt
1 parent de309412

Reformat TODO-pages, clean up, flesh out some ideas

Showing 2 changed files with 374 additions and 97 deletions
.dir-locals.el
1 1 ((nil . ((indent-tabs-mode . nil)
  2 + (fill-column 100)
2 3 (qpdf-cc-style
3 4 .
4 5 ("qpdf"
... ... @@ -36,4 +37,9 @@
36 37 )
37 38 ))
38 39 )
  40 + (gfm-mode . ((eval . (progn
  41 + (setq fill-column 100)
  42 + )
  43 + ))
  44 + )
39 45 )
... ...
TODO-pages.md
1 1 # Pages
2 2  
3   -THIS IS A WORK IN PROGRESS. THE ACTUAL IMPLEMENTATION MAY NOT LOOK ANYTHING LIKE THIS. When this gets to the stage where it is starting to congeal into an actual plan, I will remove this disclaimer and open a discussion ticket in GitHub to work out details.
4   -
5   -This file contains plans and notes regarding implementing of the "pages epic." The pages epic consists of the following features:
  3 +**THIS IS A WORK IN PROGRESS. THE ACTUAL IMPLEMENTATION MAY NOT LOOK ANYTHING LIKE THIS. When this
  4 +gets to the stage where it is starting to congeal into an actual plan, I will remove this disclaimer
  5 +and open a discussion ticket in GitHub to work out details.**
  6 +
  7 +This document describes a project known as the _pages epic_. The goal of the pages epic is to enable
  8 +qpdf to properly preserve all functionality associated with a page as pages are copied from one PDF
  9 +to another (or back to the same PDF).
  10 +
  11 +Terminology:
  12 +* _Page-level data_: information that is contained within objects reachable from the page dictionary
  13 + without traversing through any `/Parent` pointers
  14 +* _Document-level data_: information that is reachable from the document catalog (`/Root`) that is
  15 + not reachable from a page dictionary as well as the `/Info` dictionary
  16 +
  17 +Some document-level data references specific pages by page object ID, such as outlines or
  18 +interactive forms. Some document-level data doesn't reference any pages, such as embedded files or
  19 +optional content (layers). Some document-level data contains information that pertains to a specific
  20 +page but does not reference the page, such as page labels (explicit page numbers). Some page-level
  21 +data may sometimes depend on document-level data. For example, a _named destination_ depends on the
  22 +document-level _names tree_.
  23 +
  24 +As long as qpdf has had the ability to copy pages from one PDF to another, it has had robust
  25 +handling of page-level data. Prior to the implementation of the pages epic, with the exception of
  26 +page labels, qpdf has ignored document-level data during page copy operations. Specifically, when
  27 +qpdf creates a new PDF file from existing PDF files, it always starts with a specific PDF, known as
  28 +the _primary input_. The primary input may be the built-in _empty PDF_. Document-level constructs
  29 +that appear in the primary input are preserved, and document-level constructs from the other PDF
  30 +files are ignored. The one exception is page labels: qpdf always ensures that any given page has
  31 +the same label in the final output as it had in whichever input file it originated from, which is
  32 +usually (but not always) the desired behavior.
  33 +
  34 +Here are several examples of problems in qpdf prior to the implementation of the pages epic:
  35 +* If two files with optional content (layers) are merged, all layers in all but the primary input
  36 + will be visible in the combined file.
  37 +* If two files with file attachments are merged, attachments will be retained on the primary input
  38 + but dropped on the others. (qpdf has other ways to copy attachments from one file to another.)
  39 +* If two files with hyperlinks are merged, any hyperlink from a file other than the primary input
  40 + whose destination is a named destination will become non-functional.
  41 +* If two files with outlines are merged, the outlines from the original file will appear in their
  42 + entirety, including outlines that point to pages that are no longer there, and outlines will be
  43 + lost from all files except the primary input.
  44 +
  45 +With the above limitations, qpdf allows combining pages from arbitrary numbers of input PDFs to
  46 +create an output PDF, or in the case of page splitting, multiple output PDFs. The API allows
  47 +arbitrary combinations of input and output files. The command-line allows only the following:
  48 +* Merge: creation of a single output file from a primary input and any number of other inputs by
  49 + selecting pages by index from the beginning or end of the file
  50 +* Split: creation of multiple output files from a single input or the result of a merge into files
  51 + whose primary input is the empty PDF and that contain a fixed number of pages per group
  52 +* Overlay/underlay: layering pages on top of each other with a maximum of one underlay and one
  53 + overlay and with no ability to specify transformation of the pages (such as scaling or placing
  54 + them in a particular spot).
  55 +
  56 +The pages epic consists of two broad categories of work:
6 57 * Proper handling of document-level features when splitting and merging documents
7   -* Insertion of blank pages
8   -* More flexible aways of
9   - * selecting pages from one or more documents
10   - * composing pages out of other pages
11   - * underlay and overlay with control over position, transformation, and bounding box selection
12   - * organizing pages
13   - * n-up
14   - * booklet generation ("signatures", as in what `psbook` does)
15   -* Possibly others pending analysis of open issues and public discussion
  58 +* Greatly increased flexibility in the ways in which pages can be selected from the various input
  59 + files and combined for the output file. This includes creation of blank pages.
  60 +
  61 +Here are some examples of things that will become possible:
  62 +
  63 +* Stacking arbitrary pages on top of each other with full control over transformation and cropping,
  64 + including being able to access information about the various bounding boxes associated with the
  65 + pages
  66 +* Inserting blank pages
  67 +* Doing n-up page layouts
  68 +* Re-ordering pages for printing booklets (also called signatures or printer spreads)
  69 +* Selecting pages based on the outline hierarchy, tags, or article threads
  70 +* Keeping only and all relevant parts of the outline hierarchies from all input files
  71 +* Creating single very long or wide pages with output from other pages
  72 +
  73 +The rest of this document describes the details of how these features will work and what needs
  74 +to be done to make them possible to build.
  75 +
  76 +# Architectural Thoughts
  77 +
  78 +Create a new top-level class called `QPDFAssembler` that will be used to perform page-level
  79 +operations. Its implementation will use existing APIs, and it will add many new APIs. It should be
  80 +possible to perform all existing page splitting and merging operations using `QPDFAssembler` without
  81 +having to worry about details such as copying annotations, remapping destinations, and adjusting
  82 +document-level data.
  83 +
  84 +Early strategy: keep `QPDFAssembler` private to the library, and start with a pure C++ API (no JSON
  85 +support). Migrate splitting and merging from `QPDFJob` into `QPDFAssembler`, then build in
  86 +document-level support. Also work out the difference between normal write and split, which are two
  87 +separate ways to write output files.
  88 +
  89 +One of the main responsibilities of `QPDFAssembler` will be to remap destinations as data from a
  90 +page is moved or copied. For example, if an outline has a destination that points to a particular
  91 +rectangle on page 5 of the second file, and we end up dropping a portion of that page into an n-up
  92 +configuration on a specific output page, we will have to keep track of enough information to replace
  93 +the destination with a new one that points to the new physical location of the same material. For
  94 +another example, consider a case in which the left side of page 3 of the primary input ends up as
  95 +page 5 of the output and the right side of page 3 ends up as page 6. We would have to map
  96 +destinations from a single source page to different destination pages based on which part of the
  97 +page it was on. If part of the rectangle points to one page and part to another, what do we do? I
  98 +suggest we go with the top/center of the rectangle.
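
The split-page case above can be sketched in a few lines of Python. This is an illustration only, not qpdf code: `Placement`, `remap_rect`, and the page geometry are hypothetical names and values invented for the example, and it assumes the tie-breaking rule suggested above (use the top/center point of the rectangle).

```python
from dataclasses import dataclass
from typing import Optional

# Rectangles are (llx, lly, urx, ury) in user coordinates of the source page.
Rect = tuple[float, float, float, float]
Point = tuple[float, float]


@dataclass(frozen=True)
class Placement:
    """Hypothetical record: a region of a source page that lands on an output page."""
    source_region: Rect
    output_page: int  # 1-based output page number


def top_center(rect: Rect) -> Point:
    llx, _lly, urx, ury = rect
    return ((llx + urx) / 2.0, ury)


def contains(region: Rect, point: Point) -> bool:
    llx, lly, urx, ury = region
    x, y = point
    return llx <= x <= urx and lly <= y <= ury


def remap_rect(rect: Rect, placements: list[Placement]) -> Optional[int]:
    """Return the output page for a destination rectangle, or None to drop it."""
    point = top_center(rect)
    for placement in placements:
        if contains(placement.source_region, point):
            return placement.output_page
    return None


# Page 3 of the primary input (assumed 612 x 792) split into left and right halves.
placements = [
    Placement((0, 0, 306, 792), output_page=5),
    Placement((306, 0, 612, 792), output_page=6),
]
```

A rectangle that straddles the split, such as `(250, 300, 450, 400)`, has its top/center point on the right half, so this sketch sends it to output page 6.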
  99 +
  100 +A destination consists of a QPDF, page object, and rectangle in user coordinates. When
  101 +`QPDFAssembler` copies a page or converts it to a form XObject, possibly with transformations
  102 +applied, it will have to be able to map a destination to the same triple (QPDF, page object,
  103 +rectangle) on all pages that contain data from the original page. When writing the final output, any
  104 +destination that no longer points anywhere should be dropped, and any destination that points to
  105 +multiple places will need to be handled according to some specification.
  106 +
  107 +Whenever we create any new thing from a page, we create _derived page data_. Examples of derived
  108 +page data would include a copy of the page and a form XObject created from a page. `QPDFAssembler`
  109 +will have to keep a mapping from any source page to all of its derived objects along with any
  110 +transformations or clipping. When a derived page data object is placed on a final page, that
  111 +information can be combined with the position and any transformations onto the final page to be able
  112 +to map any destination to a new one or to determine that it points outside of the visible area.
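
One possible way to picture the bookkeeping for derived page data is to store a clip rectangle and two PDF-style matrices per derived object and compose them when a destination has to be mapped. The Python below is a sketch under that assumption; `Derived`, `map_point`, and the sample values are hypothetical and are not part of qpdf.

```python
from dataclasses import dataclass
from typing import Optional

Matrix = tuple[float, float, float, float, float, float]  # PDF-style [a b c d e f]
Rect = tuple[float, float, float, float]                  # (llx, lly, urx, ury)
Point = tuple[float, float]


def apply(m: Matrix, p: Point) -> Point:
    """Apply a PDF transformation matrix to a point."""
    a, b, c, d, e, f = m
    x, y = p
    return (a * x + c * y + e, b * x + d * y + f)


def then(first: Matrix, second: Matrix) -> Matrix:
    """Matrix equivalent to applying `first`, then `second`."""
    a1, b1, c1, d1, e1, f1 = first
    a2, b2, c2, d2, e2, f2 = second
    return (
        a2 * a1 + c2 * b1,
        b2 * a1 + d2 * b1,
        a2 * c1 + c2 * d1,
        b2 * c1 + d2 * d1,
        a2 * e1 + c2 * f1 + e2,
        b2 * e1 + d2 * f1 + f2,
    )


@dataclass(frozen=True)
class Derived:
    """Hypothetical record for one piece of derived page data."""
    clip: Rect          # visible region, in source-page coordinates
    to_derived: Matrix  # transformation applied when the object was derived
    to_output: Matrix   # placement of the derived object on the final page


def map_point(d: Derived, p: Point) -> Optional[Point]:
    """Map a source-page point to the final page, or None if it was clipped away."""
    llx, lly, urx, ury = d.clip
    if not (llx <= p[0] <= urx and lly <= p[1] <= ury):
        return None
    return apply(then(d.to_derived, d.to_output), p)


# Lower half of a 612 x 792 page, scaled by one half and placed at x = 306.
half = Derived(
    clip=(0, 0, 612, 396),
    to_derived=(0.5, 0, 0, 0.5, 0, 0),
    to_output=(1, 0, 0, 1, 306, 0),
)
```

A destination whose point falls outside `clip` maps to `None`, which corresponds to the "points outside of the visible area" case.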
  113 +
  114 +If a source page is copied multiple times, then if exactly one copy is explicitly marked as the
  115 +target, that becomes the target. Otherwise, the first derived object to be placed becomes the
  116 +target.
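
The target-selection rule above is small enough to state as code. This Python sketch uses a hypothetical `DerivedCopy` record; the case where several copies are marked is treated the same as none being marked, which is one reading of "otherwise".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DerivedCopy:
    """Hypothetical record for one copy derived from a source page."""
    name: str
    placed_order: Optional[int]  # position in placement order; None if never placed
    is_target: bool = False      # explicitly marked as the target


def choose_target(copies: list[DerivedCopy]) -> Optional[DerivedCopy]:
    """Exactly one marked copy wins; otherwise the first copy placed wins."""
    marked = [c for c in copies if c.is_target]
    if len(marked) == 1:
        return marked[0]
    placed = [c for c in copies if c.placed_order is not None]
    return min(placed, key=lambda c: c.placed_order, default=None)
```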
  117 +
  118 +## Overall Structure
  119 +
  120 +A single instance of `QPDFAssembler` creates a single assembly job. `QPDFJob` can create one
  121 +assembly job but does other things, such as setting writer options, inspection operations, etc. An
  122 +assembly job consists of the following:
  123 +* Global document-level data handling information
  124 + * Mode
  125 + * intelligent: try to combine everything using latest capabilities of qpdf; this is the default
  126 + * legacy: document-level features are kept from primary input; this is for compatibility and can
  127 + be selected from the CLI
  128 +* Input sources
  129 + * File/password
  130 + * Whether to keep attachments: yes, no, if-all-pages (default)
  131 + * Empty
  132 +* Output mode
  133 + * Single file
  134 + * Split -- this must include definitions of the split groups
  135 +* Description of the output in terms of the input sources and some series of transformations
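
As a rough picture of the pieces listed above, an assembly job could be modeled as plain data. The Python below is only a sketch: every class and field name is hypothetical, the description of the output (sources plus transformations) is left out, and `keeps_attachments` shows one way the `if-all-pages` default might be evaluated.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Mode(Enum):
    INTELLIGENT = "intelligent"  # default: combine document-level data
    LEGACY = "legacy"            # keep document-level data from the primary input


class KeepAttachments(Enum):
    YES = "yes"
    NO = "no"
    IF_ALL_PAGES = "if-all-pages"  # default


@dataclass
class InputSource:
    filename: Optional[str] = None  # None stands for the built-in empty PDF
    password: Optional[str] = None
    keep_attachments: KeepAttachments = KeepAttachments.IF_ALL_PAGES


@dataclass
class AssemblyJob:
    inputs: list[InputSource] = field(default_factory=list)
    mode: Mode = Mode.INTELLIGENT
    # None means single-file output; otherwise each entry is one split group,
    # given as a list of 1-based output page numbers.
    split_groups: Optional[list[list[int]]] = None


def keeps_attachments(source: InputSource, selected: set[int], page_count: int) -> bool:
    """Decide whether attachments from `source` are carried into the output."""
    if source.keep_attachments is KeepAttachments.YES:
        return True
    if source.keep_attachments is KeepAttachments.NO:
        return False
    return selected == set(range(1, page_count + 1))
```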
  136 +
  137 +## Cases to support
  138 +
  139 +Here is a list of cases that need to be expressible.
  140 +
  141 +* Create output by concatenating pages from page groups where each page group is pages specified by
  142 + a numeric range. This is what `--pages` does now.
  143 +* Collation, including different sized groups.
  144 +* Overlay/underlay, generalized to support a stack consisting of various underlays, the base page,
  145 + and various overlays, with flexibility around positioning. It should be natural to express
  146 + exactly what underlay and overlay do now.
  147 +* Split into groups of fixed size (what `--split-pages` does) with the ability to define split
  148 + groups based on other things, like outlines, article threads, and document structure
  149 +* Examples from the manual:
  150 + * `qpdf in.pdf --pages . a.pdf b.pdf:even -- out.pdf`
  151 + * `qpdf --empty --pages a.pdf b.pdf --password=x z-1 c.pdf 3,6`
  152 + * `qpdf --collate odd.pdf --pages . even.pdf -- all.pdf`
  153 + * `qpdf --collate --empty --pages odd.pdf even.pdf -- all.pdf`
  154 + * `qpdf --collate --empty --pages a.pdf 1-5 b.pdf 6-4 c.pdf r1 -- out.pdf`
  155 + * `qpdf --collate=2 --empty --pages a.pdf 1-5 b.pdf 6-4 c.pdf r1 -- out.pdf`
  156 + * `qpdf file2.pdf --pages file1.pdf 1-5 . 15-11 -- outfile.pdf`
  157 + *
  158 + ```
  159 + qpdf --empty --copy-encryption=encrypted.pdf \
  160 + --encryption-file-password=pass \
  161 + --pages encrypted.pdf --password=pass 1 \
  162 + ./encrypted.pdf --password=pass 1 -- \
  163 + outfile.pdf
  164 + ```
  165 + * `qpdf --collate=2,6 a.pdf --pages . b.pdf -- all.pdf`
  166 + * Take A 1-2, B 1-6, A 3-4, C 7-12, A 5-6, B 13-18, ...
  167 +* Ideas from pstops. The following is an excerpt from the pstops manual page.
  168 +
  169 + This section contains some sample re-arrangements. To put two pages on one sheet (of A4 paper),
  170 + the pagespec to use is:
  171 + ```
  172 + 2:0L@.7(21cm,0)+1L@.7(21cm,14.85cm)
  173 + ```
  174 + To select all of the odd pages in reverse order, use:
  175 + ```
  176 + 2:-0
  177 + ```
  178 + To re-arrange pages for printing 2-up booklets, use
  179 + ```
  180 + 4:-3L@.7(21cm,0)+0L@.7(21cm,14.85cm)
  181 + ```
  182 + for the front sides, and
  183 + ```
  184 + 4:1L@.7(21cm,0)+-2L@.7(21cm,14.85cm)
  185 + ```
  186 + for the reverse sides (or join them with a comma for duplex printing).
  187 +* From #493
  188 + ```
  189 + pdf2ps infile.pdf infile.ps
  190 + ps2ps -pa4 "2:0R(4.5cm,26.85cm)+1R(4.5cm,14.85cm)" infile.ps outfile.ps
  191 + ps2pdf outfile.ps outfile.pdf
  192 + ```
  193 +* Like psbook. Signature size n:
  194 + * take groups of 4n
  195 + * shown for n=3 in order such that, if printed so that the front of the first page is on top, the
  196 + whole stack can be folded in half.
  197 + * front: 6,7, back: 8,5
  198 + * front: 4,9, back: 10,3
  199 + * front: 2,11, back: 12,1
  200 +
  201 + This is the same as duplex 2-up with pages in order 6, 7, 8, 5, 4, 9, 10, 3, 2, 11, 12, 1
  202 +* n-up:
  203 + * For 2-up, calculate new w and h such that w/h maintains a fixed ratio and w and h are the
  204 + largest values that can fit within 1/2 the page with specified margins.
  205 + * Can support 1, 2, 4, 6, 9, 16. 2 and 6 require rotation. The others don't. Will probably need to
  206 + change getFormXObjectForPage to handle other boxes than trim box.
  207 + * Maybe define n-up as a scale and rotate followed by fitting the result into a specified rectangle.
  208 + I might already have this logic in QPDFAnnotationObjectHelper::getPageContentForAppearance.
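
The 2-up size calculation in the n-up item can be sketched as a plain aspect-preserving fit. The Python below is a minimal illustration; `fit_size` and its parameters are hypothetical names, and rotation and the choice of page box are not covered.

```python
def fit_size(w: float, h: float, cell_w: float, cell_h: float,
             margin: float = 0.0) -> tuple[float, float]:
    """Largest (w', h') with w'/h' == w/h that fits in a cell with the given margin."""
    avail_w = cell_w - 2 * margin
    avail_h = cell_h - 2 * margin
    if w <= 0 or h <= 0 or avail_w <= 0 or avail_h <= 0:
        raise ValueError("sizes must be positive and margins must leave room")
    scale = min(avail_w / w, avail_h / h)
    return (w * scale, h * scale)


# A 612 x 792 page placed in one half (396 x 612) of a 792 x 612 sheet.
new_w, new_h = fit_size(612, 792, 396, 612)
```

Here the width is the limiting dimension, so the scaled page fills the cell horizontally and leaves space above or below.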
  209 +
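The signature ordering listed in the psbook item above generalizes to any signature size. The Python sketch below reproduces that ordering (innermost sheet first); `booklet_order` is a hypothetical helper, and using `None` to stand for a blank padding page is an assumption of the example.

```python
from typing import Optional


def booklet_order(page_count: int, n: int) -> list[Optional[int]]:
    """Page order for signatures of n sheets (4n pages each), innermost sheet first.

    Each sheet contributes four entries: front-left, front-right, back-left,
    back-right. None marks a blank page used to pad the last signature.
    """
    if n < 1:
        raise ValueError("signature size must be at least 1")
    group = 4 * n
    order: list[Optional[int]] = []
    for base in range(0, page_count, group):
        for i in range(n):
            sheet = (2 * n - 2 * i, 2 * n + 1 + 2 * i, 2 * n + 2 + 2 * i, 2 * n - 1 - 2 * i)
            for rel in sheet:
                page = base + rel
                order.append(page if page <= page_count else None)
    return order
```

For `n = 3` and 12 pages this yields 6, 7, 8, 5, 4, 9, 10, 3, 2, 11, 12, 1, matching the list above.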
16 210  
17 211 # Feature to Issue Mapping
18 212  
19 213 Last checked: 2023-12-29
20 214  
21   -* Questions/ideas
22   - * I have often wondered whether we need to be able to attach arbitrary metadata to a QPDFObjectHandle (or object or value) and to control whether it should be included in copies. For example, one could attach to a page which qpdf id and page number it came from, then carry that around as the page was converted to a form xobject, inserted into a foreign file, etc. It feels like something like that will be needed to support some of these features.
  215 +```
  216 +gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open
  217 +```
  218 +
23 219 * Generate a mapping from source to destination for all destinations
24 220 * Issues: #1077
25 221 * Notes:
26   - * Source can be an outline or link, either directly or via action. If link, it should include the page.
  222 + * Source can be an outline or link, either directly or via action. If link, it should include
  223 + the page.
27 224 * Destination can be a structure destination, which should map to a regular destination.
28 225 * source: page X -> link -> action -> dest: page Y
29 226 * source: page X -> link -> action -> dest: structure -> page Y
30 227 * Consider something in json that dumps this.
31   - * We will need to associate this with a QPDF. It would be great if remote or embedded go-to actions could be handled, but that's ambitious.
32   - * It will be necessary to keep some global map that includes all QPDF objects that are part of the final file.
33   - * An interesting use case to consider would be to create a QPDF object from an embedded file and append the embedded file and make the embedded actions work. This would probably require some way to tell qpdf that a particular external file came from an embedded file.
34   -
  228 + * We will need to associate this with a QPDF. It would be great if remote or embedded go-to
  229 + actions could be handled, but that's ambitious.
  230 + * It will be necessary to keep some global map that includes all QPDF objects that are part of
  231 + the final file.
  232 + * An interesting use case to consider would be to create a QPDF object from an embedded file and
  233 + append the embedded file and make the embedded actions work. This would probably require some
  234 + way to tell qpdf that a particular external file came from an embedded file.
35 235 * Control size of page and position/transformation of overlay/underlay
36 236 * Issues: #1031, #811, #740, #559
37 237 * Notes:
38   - * It should be possible to define a destination page from scratch or in terms of other pages and then place page contents onto it with arbitrary transformations applied.
39   - * It should be possible to compute the size of the destination page in terms of the source pages, e.g., to create one long or wide page from other pages.
  238 + * It should be possible to define a destination page from scratch or in terms of other pages and
  239 + then place page contents onto it with arbitrary transformations applied.
  240 + * It should be possible to compute the size of the destination page in terms of the source
  241 + pages, e.g., to create one long or wide page from other pages.
40 242 * Also allow specification of which page box to use
41 243 * Preserve hyperlinks when doing any page operations
42 244 * See also "Generate a mapping from source to destination for all destinations"
43 245 * Issues: #1003, #797, #94
44 246 * Notes:
45   - * A link annotation that points to a destination rather than an external URL should continue to work when files are split or merged.
  247 + * A link annotation that points to a destination rather than an external URL should continue to
  248 + work when files are split or merged.
46 249 * Awareness of structured and tagged PDF (14.7, 14.8)
47 250 * Issues: #957, #953, #490
48 251 * Notes:
49   - * This looks complicated. It may be not be possible to do this fully in the first increment, but we have to keep it in mind and warn if we can't and we see /SD in an action.
  252 + * This looks complicated. It may not be possible to do this fully in the first increment, but
  253 + we have to keep it in mind and warn if we can't and we see /SD in an action.
50 254 * #490 has some good analysis
51 255 * Assign page labels
52 256 * Issues: #939
53 257 * Notes:
54 258 * #939 has a good proposal
55   - * This could be applied to page groups, and we could have an option to keep the labels as they are in a given group, which is what qpdf does now.
  259 + * This could be applied to page groups, and we could have an option to keep the labels as they
  260 + are in a given group, which is what qpdf does now.
56 261 * Interleave pages with ordering
57 262 * Issues: #921
58 263 * Notes:
59   - * From 921: interleave odd pages and reversed even pages. This might require different handling for even/odd numbers of pages. Make sure it's natural for the cases of len(odd) == len(even) or len(odd) == 1+len(even)
  264 + * From 921: interleave odd pages and reversed even pages. This might require different handling
  265 + for even/odd numbers of pages. Make sure it's natural for the cases of len(odd) == len(even)
  266 + or len(odd) == 1+len(even)
60 267 * Preserve all attachments when merging files
61 268 * Issues: #856
62 269 * Notes:
63 270 * If all pages of a file are selected, keep all attachments
64 271 * If some pages of a file are selected
65 272 * Keep all attachments if there are any embedded file annotations
66   - * Otherwise, what? Do we have a keep-attachments flag of some sort? Or do we just make the user copy attachments from one file to another?
67   -* Create page group by excluding pages
68   - * Issues: #790, #564
69   - * Notes:
70   - * Handle cases in `PageSelector` below
  273 + * Otherwise, what? Do we have a keep-attachments flag of some sort? Or do we just make the
  274 + user copy attachments from one file to another?
71 275 * Apply clipping to a page
72 276 * Issues: #771
73 277 * Notes:
74   - * Create a form xobject from a page, then apply a specific clipping region expressed in coordinates or as a percentage
  278 + * Create a form xobject from a page, then apply a specific clipping region expressed in
  279 + coordinates or as a percentage
75 280 * Ability to create a blank page
76 281 * Issues: #753
77 282 * Notes:
... ... @@ -80,8 +285,8 @@ Last checked: 2023-12-29
80 285 * Issues: #741, #616
81 286 * Notes:
82 287 * Example: --split-after a,b,c
83   -* Handle Optional Content (8.11)
84   - * Issues: #672, #9
  288 +* Handle Optional Content (layers) (8.11)
  289 + * Issues: #672, #9, #570
85 290 * Scale a page up or down to fit to a size
86 291 * Issues: #611
87 292 * Place contents of pages adjacent horizontally or vertically on one page
... ... @@ -91,38 +296,56 @@ Last checked: 2023-12-29
91 296 * Notes:
92 297 * #461 may want the inverse of booklet and discusses reader and printer spreads
93 298 * Flexible multiplexing
94   - * Issues: #505
  299 + * Issues: #505 (already implemented with --collate)
95 300 * Split pages based on outlines
96 301 * Issues: #477
97 302 * Keep relevant parts of outline hierarchy
98 303 * Issues: #457, #356, #343, #323
99 304 * Notes:
100 305 * There is some helpful discussion in #343 including
101   - * Prserving open/closed status
  306 + * Preserving open/closed status
102 307 * Preserving javascript actions
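
For the "Interleave pages with ordering" item above (#921), the two length cases can be handled by one routine. This Python sketch is an illustration with hypothetical names; it assumes pages are represented by values other than `None`.

```python
from itertools import zip_longest


def interleave(*groups: list) -> list:
    """Take one page from each group in turn, skipping groups that have run out."""
    out = []
    for row in zip_longest(*groups):
        out.extend(page for page in row if page is not None)
    return out


# Odd pages in order; even pages supplied in reverse order (last even page first).
odd = [1, 3, 5]
even_reversed = [4, 2]
merged = interleave(odd, list(reversed(even_reversed)))
```

Because exhausted groups are skipped, the same call covers both `len(odd) == len(even)` and `len(odd) == 1 + len(even)`.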
103 308  
104   -# Architectural Thoughts
  309 +# XXX OLD NOTES
  310 +
  311 +I want to encapsulate various aspects of the logic into interfaces that can be implemented by
  312 +developers to add their own logic. It should be easy to contribute these. Here are some rough ideas.
105 313  
106   -I want to encapsulate various aspects of the logic into interfaces that can be implemented by developers to add their own logic. It should be easy to contribute these. Here are some rough ideas.
  314 +A source is an input file, the output of another operation, or a blank page. In the API, it can be
  315 +any QPDF object.
107 316  
108 317 A page group is just a group of pages.
109 318  
110 319 * PageSelector -- creates page groups from other page groups
111   -* PageTransformer -- selects a part of a page and possibly transforms it; applies to all pages of a group. Based on the page dictionary; does not look at the content stream
  320 +* PageTransformer -- selects a part of a page and possibly transforms it; applies to all pages of a
  321 + group. Based on the page dictionary; does not look at the content stream
112 322 * PageFilter -- apply arbitrary code to a page; may access the content stream
113   -* PageAssembler -- combines pages from groups into new groups whose pages are each assembled from corresponding pages of the input groups
  323 +* PageAssembler -- combines pages from groups into new groups whose pages are each assembled from
  324 + corresponding pages of the input groups
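
To make the composition idea concrete, here is a minimal Python sketch in which each concept is a plain function over lists of pages. All names (`concat`, `every_nth`, `transform`, `stack`, `rotate90`) and the dictionary stand-in for a page are hypothetical; PageFilter is omitted because it would need access to content streams.

```python
from typing import Callable

Page = dict       # stand-in for a real page object
PageGroup = list  # a page group is just a list of pages


def concat(*groups: PageGroup) -> PageGroup:
    """PageSelector: one group made of the pages of several groups, in order."""
    return [page for group in groups for page in group]


def every_nth(group: PageGroup, n: int, start: int = 0) -> PageGroup:
    """PageSelector: every nth page of a group, beginning at index `start`."""
    return group[start::n]


def transform(group: PageGroup, fn: Callable[[Page], Page]) -> PageGroup:
    """PageTransformer: apply the same per-page change to every page of a group."""
    return [fn(page) for page in group]


def stack(*groups: PageGroup) -> PageGroup:
    """PageAssembler: build each output page from corresponding input pages."""
    return [{"layers": list(pages)} for pages in zip(*groups)]


def rotate90(page: Page) -> Page:
    """Example per-page change based only on page-dictionary style data."""
    return {**page, "rotate": (page.get("rotate", 0) + 90) % 360}


a = [{"src": "a.pdf", "n": i} for i in range(1, 5)]
b = [{"src": "b.pdf", "n": i} for i in range(1, 3)]
odd = every_nth(concat(a, b), 2)  # 1st, 3rd, and 5th pages of the concatenation
result = stack(odd, transform(odd, rotate90))
```

Because every function takes and returns page groups, selectors, transformers, and assemblers nest freely, which is the property the notes ask for.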
114 325  
115   -These should be able to be composed in arbitrary ways. There should be a natural API for doing this, and it there should be some specification, probably based on JSON, that can be provided on the command line or embedded in the job JSON format. I have been considering whether a lisp-like S-expression syntax may be less cumbersome to work with. I'll have to decide whether to support this or some other syntax in addition to a JSON representation.
  326 +These should be able to be composed in arbitrary ways. There should be a natural API for doing this,
  327 +and there should be some specification, probably based on JSON, that can be provided on the
  328 +command line or embedded in the job JSON format. I have been considering whether a lisp-like
  329 +S-expression syntax may be less cumbersome to work with. I'll have to decide whether to support this
  330 +or some other syntax in addition to a JSON representation.
116 331  
117   -There also needs to be something to represent how document-level structures relate to this. I'm not sure exactly how this should work, but we need things like
  332 +There also needs to be something to represent how document-level structures relate to this. I'm not
  333 +sure exactly how this should work, but we need things like
118 334 * what to do with page labels, especially when assembling pages from other pages
119 335 * whether to preserve destinations (outlines, links, etc.), particularly when pages are duplicated
120   - * If A refers to B and there is more than one copy of B, how do you decide which copies of A link to which copies of B?
121   -* what to do with pages that belong to more than one group, e.g., what happens if you used document structure or outlines to form page groups and a group boundary lies in the middle of the page
  336 + * If A refers to B and there is more than one copy of B, how do you decide which copies of A link
  337 + to which copies of B?
  338 +* what to do with pages that belong to more than one group, e.g., what happens if you used document
  339 + structure or outlines to form page groups and a group boundary lies in the middle of the page
122 340  
123   -Maybe pages groups can have arbitrary, user-defined tags so we can specify that links should only point to other pages with the same value of some tag. We can probably many-to-one links if the source is duplicated.
  341 +Maybe page groups can have arbitrary, user-defined tags so we can specify that links should only
  342 +point to other pages with the same value of some tag. We can probably allow many-to-one links if
  343 +the source is duplicated.
124 344  
125   -We probably need to hold onto the concept of the primary input file. If there is a primary input file, there may need to be a way to specify what gets preserved it. The behavior of qpdf prior to all of this is to preserve all document-level constructs from the primary input file and to try to preserve page labels from other input files when combining pages.
  345 +We probably need to hold onto the concept of the primary input file. If there is a primary input
  346 +file, there may need to be a way to specify what gets preserved from it. The behavior of qpdf prior to
  347 +all of this is to preserve all document-level constructs from the primary input file and to try to
  348 +preserve page labels from other input files when combining pages.
126 349  
127 350 Here are some examples.
128 351  
... ... @@ -136,10 +359,11 @@ Here are some examples.
136 359 * odd or even pages from a group
137 360 * every nth page from a group
138 361 * pages interleaved from multiple groups
139   - * the left-front (left-back, right-front, right-back) pages of a booklet with signatures of n pages
140   - * all pages reachable from a section of the outline hierarchy or something based on threads or other structure
  362 + * the left-front (left-back, right-front, right-back) pages of a booklet with signatures of n
  363 + pages
  364 + * all pages reachable from a section of the outline hierarchy or something based on threads or
  365 + other structure
141 366 * selection based on page labels
142   - * pages in a group except pages in another group
143 367 * PageTransformer
144 368 * clip to media box (trim box, crop box, etc.)
145 369 * clip to specific absolute or relative size
... ... @@ -152,7 +376,8 @@ Here are some examples.
152 376 * flatten annotations
153 377 * PageAssembler
154 378 * Overlay/underlay all pages from one group onto corresponding pages from another group
155   - * Control placement based on properties of all the groups, so higher order than a stand-alone transformer
  379 + * Control placement based on properties of all the groups, so higher order than a stand-alone
  380 + transformer
156 381 * Examples
157 382 * Scale the smaller page up to the size of the larger page
158 383 * Center the smaller page horizontally and bottom-align the trim boxes
... ... @@ -160,9 +385,14 @@ Here are some examples.
160 385 * n-up -- application of generalized overlay/underlay
161 386 * make one long page with an arbitrary number of pages one after the other (#546)
162 387  
163   -It should be possible to represent all of the existing qpdf operations using the above framework. It would be good to re-implement all of them in terms of this framework to exercise it. We will have to look through all the command-line arguments and make sure. Of course also make sure suggestions from issues can be implemented or at least supported by adding new selectors.
  388 +It should be possible to represent all of the existing qpdf operations using the above framework. It
  389 +would be good to re-implement all of them in terms of this framework to exercise it. We will have to
  390 +look through all the command-line arguments and make sure. Of course also make sure suggestions from
  391 +issues can be implemented or at least supported by adding new selectors.
164 392  
165   -Here are a few bits of scratch work. The top-level call is a selector. This doesn't capture everything. Implementing this would be tedious and challenging. It could be done using JSON arrays, but it would be clunky. This feels over-designed and possibly in conflict with QPDFJob.
  393 +Here are a few bits of scratch work. The top-level call is a selector. This doesn't capture
  394 +everything. Implementing this would be tedious and challenging. It could be done using JSON arrays,
  395 +but it would be clunky. This feels over-designed and possibly in conflict with QPDFJob.
166 396  
167 397 ```
168 398 (concat
... ... @@ -204,91 +434,113 @@ Here are a few bits of scratch work. The top-level call is a selector. This does
204 434 )
205 435 ```
206 436  
207   -Easier to parse but yuck:
208 437 ```json
209   -["with",
210   - ["a",
211   - ["concat",
212   - ["primary-input"],
213   - ["file", "file2.pdf"],
214   - ["page-range", ["file", "file3.pdf"], "1-4,5-8"]
215   - ],
216   - "b-even",
217   - ["even-pages", ["from", "a"]],
218   - "b-odd",
219   - ["reverse", ["odd-pages", ["from", "a"]]]
220   - ],
221   - ["stack",
222   - ["repeat-range", ["from", "a"], "z"],
223   - ["pad-end", ["from", "b"]]
224   - ]
225   -]
226   -```
227   -
228   -# To-do list
229 438  
230   -* Go through all issues marked with the `pages` label and ensure that any ideas are represented here. Keep a list with mappings back to the issue number.
231   - * gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open
232   -* When ready, open a discussion ticket.
233   -* Flesh out an implementation plan.
  439 +```
234 440  
235 441 # Supporting Document-level Features
236 442  
237   -qpdf needs full support for document-level features like article threads, outlines, etc. There is no support for some things and partial support for others. See notes below for a comprehensive list.
  443 +qpdf needs full support for document-level features like article threads, outlines, etc. There is no
  444 +support for some things and partial support for others. See notes below for a comprehensive list.
238 445  
239 446 Most likely, this will be done by creating DocumentHelper and ObjectHelper classes.
240 447  
241   -It will be necessary not only to read information about these structures from a single PDF file as the existing document helpers do but also to reconstruct or update these based on modifications to the pages in a file. I'm not sure how to do that, but one idea would be to allow a document helper to register a callback with QPDFPageDocumentHelper that notifies it when a page is added or removed. This may be able to take other parameters such as a document helper from a foreign file.
242   -
243   -Since these operations can be expensive, there will need to be a way to opt in and out. The default (to be clearly documented) should be that all supported document-level constructs are preserved. That way, as new features are added, changes to the output of previous operations to include information that was previously omitted will not constitute a non-backward compatible change that requires a major version bump. This will be a default for the API when using the higher-level page assemebly API (below) as well as the CLI.
244   -
245   -There will also need to be some kind of support for features that are document-level and not tied to any pages, such as (sometimes) embedded files. When splitting/merging files, there needs to be a way to specify what should happen with those things. Perhaps the default here should be that these are preserved from files from which all pages are selected. For some things, like viewer preferences, it may make sense to take them from the first file.
  448 +It will be necessary not only to read information about these structures from a single PDF file as
  449 +the existing document helpers do but also to reconstruct or update these based on modifications to
  450 +the pages in a file. I'm not sure how to do that, but one idea would be to allow a document helper
  451 +to register a callback with QPDFPageDocumentHelper that notifies it when a page is added or removed.
  452 +This may be able to take other parameters such as a document helper from a foreign file.
  453 +
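A minimal sketch of the callback idea above, written in Python for brevity. Every name here (`PageRegistry`, `OutlineTracker`, `register`, the event strings) is hypothetical and is not part of qpdf's API; `PageRegistry` only stands in for the role the notes assign to QPDFPageDocumentHelper.

```python
from typing import Callable, List, Tuple

# Hypothetical callback shape: (event kind, page index). Not a qpdf type.
PageEvent = Callable[[str, int], None]


class PageRegistry:
    """Stand-in for the page helper that owns the page list."""

    def __init__(self) -> None:
        self.pages: List[str] = []
        self._listeners: List[PageEvent] = []

    def register(self, listener: PageEvent) -> None:
        # A document-level helper opts in by registering a callback.
        self._listeners.append(listener)

    def add_page(self, page: str) -> None:
        self.pages.append(page)
        self._notify("added", len(self.pages) - 1)

    def remove_page(self, index: int) -> None:
        del self.pages[index]
        self._notify("removed", index)

    def _notify(self, kind: str, index: int) -> None:
        for listener in self._listeners:
            listener(kind, index)


class OutlineTracker:
    """Stand-in for a document helper that must react to page changes."""

    def __init__(self, registry: PageRegistry) -> None:
        self.events: List[Tuple[str, int]] = []
        registry.register(self.on_page_event)

    def on_page_event(self, kind: str, index: int) -> None:
        # A real helper would update outlines here; this one just records.
        self.events.append((kind, index))
```

Opting out would then amount to not registering a given helper, which fits the opt-in/opt-out requirement discussed next.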
  454 +Since these operations can be expensive, there will need to be a way to opt in and out. The default
  455 +(to be clearly documented) should be that all supported document-level constructs are preserved.
  456 +That way, as new features are added, changes to the output of previous operations to include
  457 +information that was previously omitted will not constitute a non-backward compatible change that
  458 +requires a major version bump. This will be a default for the API when using the higher-level page
  459 +assembly API (below) as well as the CLI.
  460 +
  461 +There will also need to be some kind of support for features that are document-level and not tied to
  462 +any pages, such as (sometimes) embedded files. When splitting/merging files, there needs to be a way
  463 +to specify what should happen with those things. Perhaps the default here should be that these are
  464 +preserved from files from which all pages are selected. For some things, like viewer preferences, it
  465 +may make sense to take them from the first file.
246 466  
247 467 # Page Assembly (page selection)
248 468  
249   -In addition to the existing numeric ranges of page numbers, page selection could be driven by document-level features like the outlines hierarchy or article threads. There have been a lot of suggestions about this in various tickets. There will need to be some kind of page manipulation class with configuration options. I'm thinking something similar to QPDFJob, where you construct a class and then call a bunch of methods to configure it, including the ability to configure with JSON. Several suggestions have been made in issues, which I will go through and distill into a list. Off hand, some ideas include being able to split based on explicit chunks and being able to do all pages except a list of pages.
  469 +In addition to the existing numeric ranges of page numbers, page selection could be driven by
  470 +document-level features like the outlines hierarchy or article threads. There have been a lot of
  471 +suggestions about this in various tickets. There will need to be some kind of page manipulation
  472 +class with configuration options. I'm thinking something similar to QPDFJob, where you construct a
  473 +class and then call a bunch of methods to configure it, including the ability to configure with
  474 +JSON. Several suggestions have been made in issues, which I will go through and distill into a list.
  475 +Off hand, some ideas include being able to split based on explicit chunks and being able to do all
  476 +pages except a list of pages.
250 477  
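The two selection ideas mentioned above (all pages except a list, and splitting on explicit chunks) could look roughly like this. The function names and the 1-based page numbering are assumptions made for illustration; nothing here reflects an existing qpdf interface.

```python
from typing import Iterable, List, Sequence


def all_except(page_count: int, excluded: Iterable[int]) -> List[int]:
    """Return 1-based page numbers for every page not in `excluded`."""
    skip = set(excluded)
    return [p for p in range(1, page_count + 1) if p not in skip]


def split_chunks(pages: Sequence[int], sizes: Sequence[int]) -> List[List[int]]:
    """Split `pages` into explicit chunk sizes; leftovers form a final chunk."""
    chunks: List[List[int]] = []
    start = 0
    for size in sizes:
        if size <= 0:
            raise ValueError("chunk sizes must be positive")
        chunk = list(pages[start:start + size])
        if chunk:
            chunks.append(chunk)
        start += size
    if start < len(pages):
        chunks.append(list(pages[start:]))
    return chunks
```

Because both functions take and return plain page lists, they compose: the output of `all_except` can be fed directly to `split_chunks`.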
251   -For CLI, I'm probably going to have it take a JSON blob or JSON file on the CLI rather than having some absurd way of doing it with arguments (people have a lot of trouble with --pages as it is). See TODO for a feature on command-line/job JSON support for JSON specification arguments.
  478 +For CLI, I'm probably going to have it take a JSON blob or JSON file on the CLI rather than having
  479 +some absurd way of doing it with arguments (people have a lot of trouble with --pages as it is). See
  480 +TODO for a feature on command-line/job JSON support for JSON specification arguments.
252 481  
253   -There are some other things, like allowing n-up and genearlizing overlay/underlay to allow different placement and scaling options, that I think may also be in scope.
  482 +There are some other things, like allowing n-up and generalizing overlay/underlay to allow different
  483 +placement and scaling options, that I think may also be in scope.
254 484  
255 485 # Scaling/Transforming Pages
256 486  
257   -* Keep in mind that destinations, such as links and outlines, may need to be adjusted when a page is scaled or otherwise transformed.
  487 +* Keep in mind that destinations, such as links and outlines, may need to be adjusted when a page is
  488 + scaled or otherwise transformed.
258 489  
259 490 # Notes
260 491  
261 492 PDF document structure
262 493  
  494 +The trailer contains the catalog and the Info dictionary. We probably need to do something
  495 +intelligent with the info dictionary.
  496 +
  497 +
263 498 7.7.2 contains the list of all keys in the document catalog.
264 499  
265 500 Document-level structures:
266 501 * Extensions
  502 + * Must be combination of Extensions from all input files
267 503 * PageLabels
  504 + * Ensure each page has its original label
  505 + * Allow post-processing
268 506 * Names -- see below
  507 + * Combined and disambiguated
269 508 * Page: TemplateInstantiated
  509 + * Combine from all files
270 510 * Dests
  511 + * Keep referenced destinations across all files
  512 + * May need to disambiguate or "flatten" or convert to named dests with the names tree
271 513 * Outlines
272 514 * Threads (easy)
273 515 * Page: B
274 516 * AA (Additional Actions)
275   -* URI
  517 + * Merge from different files if possible
  518 + * If duplicate, first contributor wins
276 519 * AcroForm
  520 + * Merge
277 521 * StructTreeRoot
  522 + * Combine
278 523 * Page: StructParents
279 524 * MarkInfo (see 14.7 - Logical Structure, 14.8 Tagged PDF)
  525 + * Combine
280 526 * SpiderInfo
  527 + * Combine
281 528 * Page: ID
282 529 * OutputIntents
  530 + * Combine
283 531 * Page: OutputIntents
284 532 * PieceInfo
  533 + * Combine
285 534 * Page: PieceInfo
286 535 * OCProperties
  536 + * Combine across documents
287 537 * Requirements
288   -* Collection
  538 + * Combine
289 539 * AF (file specification dictionaries)
  540 + * Combine
290 541 * Page: AF
291 542 * DPartRoot
  543 + * Combine
292 544 * Page: DPart
293 545  
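Two of the merge policies in the list above, "first contributor wins" (noted for AA) and "combined and disambiguated" (noted for Names), can be sketched as follows. The helper names and the renaming scheme are hypothetical. A real implementation would also have to rewrite every reference to a renamed entry, which this sketch does not attempt.

```python
from typing import Any, Dict, Iterable, Tuple


def merge_first_wins(dicts: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge dictionaries in input order; the first file to set a key keeps it."""
    merged: Dict[str, Any] = {}
    for d in dicts:
        for key, value in d.items():
            merged.setdefault(key, value)
    return merged


def merge_disambiguated(
    labeled: Iterable[Tuple[str, Dict[str, Any]]]
) -> Dict[str, Any]:
    """Merge named entries, renaming on collision instead of dropping.

    `labeled` pairs a per-file label with that file's entries. The
    "<name>-<label>-<n>" pattern is an arbitrary choice for this sketch.
    """
    merged: Dict[str, Any] = {}
    for label, entries in labeled:
        for name, value in entries.items():
            candidate = name
            n = 1
            while candidate in merged:
                candidate = f"{name}-{label}-{n}"
                n += 1
            merged[candidate] = value
    return merged
```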
294 546 Things qpdf probably needs to drop
... ... @@ -297,21 +549,26 @@ Things qpdf probably needs to drop
297 549 * Legal
298 550 * DSS
299 551  
300   -Things that stay with the first document and/or will not be supported
  552 +Things that stay with the first document that has one and/or will not be supported
  553 +* Info (not part of document catalog)
301 554 * ViewerPreferences
302 555 * PageLayout
303 556 * PageMode
304 557 * OpenAction
  558 +* URI
305 559 * Metadata
306 560 * Lang
307 561 * NeedsRendering
  562 +* Collection
308 563  
309   -Name dictionary (7.4)
  564 +Name dictionary (7.7.4)
310 565 * Dests
311 566 * AP (appearance streams)
312 567 * JavaScript
313 568 * Pages (named pages)
314 569 * Templates
  570 + * Combine across all documents
  571 + * Page: TemplateInstantiated points to a named page
315 572 * IDS
316 573 * URLS
317 574 * EmbeddedFiles
... ... @@ -322,10 +579,24 @@ Most of chapter 12 applies.
322 579  
323 580 Document-level navigation (12.3)
324 581  
325   -QPDF will need a global way to reference a page. This will most likely be in the form of the QPDF uuid and a QPDFObjectHandle to the page. If this can just be a QPDFObjectHandle, that would be better. I need to make sure we can meaningfully interact with QPDFObjectHandle objects from multiple QPDFs in a safe fashion. Figure out how this works with immediateCopyFrom, etc. Better to avoid this whole thing and make sure that we just keep all the document-level stuff specific to a PDF, but we will need to have some internal representation that can be used to reconstruct the document-level dictionaries when writing. Making this work with structures (structure destinations) will require more indirection.
326   -
327   -I imagine that there will be some internal repreentation of what document-level things come along for the ride when we take a page from a document. I wonder whether this need to change the way linearization works.
328   -
329   -There should be different ways to specify collections of pages. The existing one, which is using a numeric range, is just one. Other ideas include things related to document structure (all pages in an article thread, all pages in an outline hierarchy), page labels, book binding (Is that called folio? There's an issue for it.), all except, subgroups, or any number of things.
330   -
331   -We will need to be able to start with document-level objects to get page groups and also to start with pages and reconstruct document level objects. For example, it should be possibe to reconstruct article threads to omit beads that don't belong to any of the pages. Likewise with outlines.
  582 +QPDF will need a global way to reference a page. This will most likely be in the form of the QPDF
  583 +uuid and a QPDFObjectHandle to the page. If this can just be a QPDFObjectHandle, that would be
  584 +better. I need to make sure we can meaningfully interact with QPDFObjectHandle objects from multiple
  585 +QPDFs in a safe fashion. Figure out how this works with immediateCopyFrom, etc. Better to avoid this
  586 +whole thing and make sure that we just keep all the document-level stuff specific to a PDF, but we
  587 +will need to have some internal representation that can be used to reconstruct the document-level
  588 +dictionaries when writing. Making this work with structures (structure destinations) will require
  589 +more indirection.
  590 +
  591 +I imagine that there will be some internal representation of what document-level things come along
  592 +for the ride when we take a page from a document. I wonder whether this needs to change the way
  593 +linearization works.
  594 +
  595 +There should be different ways to specify collections of pages. The existing one, which is using a
  596 +numeric range, is just one. Other ideas include things related to document structure (all pages in
  597 +an article thread, all pages in an outline hierarchy), page labels, book binding (Is that called
  598 +folio? There's an issue for it.), subgroups, or any number of things.
  599 +
  600 +We will need to be able to start with document-level objects to get page groups and also to start
  601 +with pages and reconstruct document-level objects. For example, it should be possible to reconstruct
  602 +article threads to omit beads that don't belong to any of the pages. Likewise with outlines.
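As a rough illustration of reconstructing article threads from a page selection, the sketch below drops beads whose page was not kept and discards threads left with no beads. `Bead` and `prune_threads` are hypothetical names. In an actual PDF the beads form a circular linked list through `/N` and `/V`, so the ordered list here stands in for that chain, and relinking on write is assumed rather than shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Bead:
    page_id: int                             # stand-in for the bead's /P page
    rect: Tuple[float, float, float, float]  # stand-in for the bead's /R


def prune_threads(
    threads: Dict[str, List[Bead]], kept_pages: Set[int]
) -> Dict[str, List[Bead]]:
    """Keep only beads on surviving pages; drop threads that become empty."""
    pruned: Dict[str, List[Bead]] = {}
    for title, beads in threads.items():
        remaining = [b for b in beads if b.page_id in kept_pages]
        if remaining:
            pruned[title] = remaining
    return pruned
```

The same filter-then-drop-empties shape would presumably apply to outlines, with the extra complication that outline items are hierarchical.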
... ...