Commit f7dd653d5fbe7d6c393a444eb3e8d841556ce085

Authored by Jay Berkenbilt
1 parent e52b026d

TODO-pages: introduce QPDFAssembler and QPDFSplitter

Showing 1 changed file with 52 additions and 27 deletions
TODO-pages.md
@@ -24,18 +24,15 @@ of the following properties, among others:
 * Contains information used by pages (named destinations)
 
 As long as qpdf has had the ability to copy pages from one PDF to another, it has had robust
-handling of page-level data. Prior to the implementation of the pages epic, with the exception of
-page labels and form fields, qpdf has ignored document-level data during page copy operations.
-Specifically, when qpdf creates a new PDF file from existing PDF files, it always starts with a
-specific PDF, known as the _primary input_. The primary input may be a file or the built-in _empty
-PDF_. With the exception of page labels and form fields, document-level constructs that appear in
-the primary input are preserved, and document-level constructs from the other PDF files are ignored.
-With page labels, qpdf always ensures that any given page has the same label in the final output as
-it had in whichever input file it originated from, which is usually (but not always) the desired
-behavior. With form fields, qpdf has awareness and ensures that all form fields remain operational.
-The goal is to extend this document-level-awareness to other document-level constructs.
-
-Here are several examples of problems in qpdf prior to the implementation of the pages epic:
+handling of page-level data. When qpdf creates a new PDF file from existing PDF files, it starts
+with a specific PDF, known as the _primary input_. The primary input may be a file or the built-in
+_empty PDF_. Prior to the implementation of the pages epic, qpdf has ignored document-level data
+(except for page labels and interactive form fields) when merging and splitting files. Any
+document-level data in the primary input was preserved, and any document-level data other than form
+fields and page labels was discarded from the other files. After this work is complete, qpdf will
+handle other document-level data in a manner that preserves the functionality of all pages in the
+final PDF. Here are several examples of problems in qpdf prior to the implementation of the pages
+epic:
 * If two files with optional content (layers) are merged, all layers in all but the primary input
   will be visible in the combined file.
 * If two files with file attachments are merged, attachments will be retained on the primary input
@@ -46,9 +43,10 @@ Here are several examples of problems in qpdf prior to the implementation of the
   entirety, including outlines that point to pages that are no longer there, and outlines will be
   lost from all files except the primary input.
 
-With the above limitations, qpdf allows combining pages from arbitrary numbers of input PDFs to
-create an output PDF, or in the case of page splitting, multiple output PDFs. The API allows
-arbitrary combinations of input and output files. The command-line allows only the following:
+Regarding page assembly, prior to the pages epic, qpdf allows combining pages from arbitrary numbers
+of input PDFs to create an output PDF, or in the case of page splitting, multiple output PDFs. The
+API allows arbitrary combinations of input and output files. The command-line allows only the
+following:
 * Merge: creation of a single output file from a primary input and any number of other inputs by
   selecting pages by index from the beginning or end of the file
 * Split: creation of multiple output files from a single input or the result of a merge into files
@@ -79,10 +77,13 @@ Here are some examples of things that will become possible:
 The rest of this document describes the details of how these features will work and what needs
 to be done to make them possible to build.
 
-# Architectural Thoughts
+# Architecture
 
-Open question: if I do all the complex logic in `QPDFJob`, what are the implications for pikepdf or
-other wrappers? This will need to be discussed in the discussion ticket.
+Create a `QPDFAssembler` class to handle merging and a `QPDFSplitter` to handle splitting. The
+complex assembly logic can be handled by `QPDFAssembler`. `QPDFSplitter` can invoke `QPDFAssembler`
+with a previous `QPDFAssembler`'s output (or any `QPDF`) multiple times to create the split files.
+This will mostly involve moving code from `QPDFJob` to `QPDFAssembler` and `QPDFSplitter` and having
+`QPDFJob` invoke them.
 
 Prior to implementation of the pages epic, `QPDFJob` goes through the following stages:
 
@@ -123,8 +124,16 @@ Prior to implementation of the pages epic, `QPDFJob` goes through the following
 * Preserve form fields and page labels
 
 Broadly, the above has to be modified in the following ways:
-* From the C++ API, make it possible to use an arbitrary QPDF as an input rather than having to
-  start with a file. That makes it possible to do arbitrary work on the PDF prior to submitting it.
+* The transformations step has to be pulled out as that will stay in `QPDFJob`.
+* Most of write QPDF will stay in `QPDFJob`, but the split logic will move to `QPDFSplitter`.
+* The entire create QPDF logic will move into `QPDFAssembler`.
+* `QPDFAssembler`'s API will allow using an arbitrary QPDF as an input rather than having to start
+  with a file. That makes it possible to do arbitrary work on the PDF prior to passing it to
+  `QPDFAssembler`.
+* `QPDFAssembler` and `QPDFSplitter` may need a C API, or perhaps C users will have to work through
+  `QPDFJob`, which will expose nearly all of the functionality.
+
+Within `QPDFAssembler`, we will extend the create QPDF logic in the following ways:
 * Allow creation of blank pages as an additional input source
 * Generalize underlay/overlay
 * Enable controlling placement
@@ -132,17 +141,32 @@ Broadly, the above has to be modified in the following ways:
 * Add additional reordering options
   * We don't need to provide hooks for this. If someone is going to code a hook, they can just
     compute the page ordering directly.
-* Have a page composition phase after the overlay/underlay stage
+* Have a page composition stage after the overlay/underlay stage
   * Allow n-up, left-to-right (can reverse page order to get rtl), top-to-bottom, or modular
     composition like pstops
 * Add additional ways to select pages besides range (e.g. based on outlines)
-* Add additional ways to specify boundaries for splitting
 * Enhance existing logic to handle other document-level structures, preferably in a way that
   requires less duplication between split and merge.
   * We don't need to turn on and off most types of document constructs individually. People can
     preprocess using the API or qpdf JSON if they want fine-grained control.
   * For things like attachments and outlines, we can add additional flags.
 
+Within `QPDFSplitter`, we will add additional ways to specify boundaries for splitting.
+
+We must take care with the implementations and APIs for `QPDFSplitter`, `QPDFAssembler`, and
+`QPDFJob` to avoid excessive duplication. Perhaps `QPDFJob` can create and configure a
+`QPDFAssembler` and `QPDFSplitter` on the fly to avoid too much duplication of state.
+
+Much of the logic will actually reside in other helper classes. For example, `QPDFAssembler` will
+probably not operate with numeric ranges, leaving that to `QPDFJob` and `QUtil` but will instead
+have vectors of page numbers. The logic for creating page groups from outlines, threads, or
+structure will most likely live in the document helpers for those bits of functionality. This keeps
+needless clutter out of `QPDFAssembler` and also makes it possible for people to perform their own
+subset of functionality by calling lower-level interfaces. The main power of `QPDFAssembler` will be
+to manage sequencing and destination tracking as well as to provide a future-proof API that will
+allow developers to automatically benefit from additional document-level support as it is added to
+qpdf.
+
 ## Flexible Assembly
 
 This section discusses modifications to the command-line syntax to make it easier to add flexibility
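
The division of labor described in the Architecture hunk above can be illustrated with a toy model. This is a minimal sketch, not qpdf code: `PageRef`, `Doc`, `ToyAssembler`, `splitEvery`, and `makeDoc` are hypothetical names invented for illustration, and the real `QPDFAssembler` and `QPDFSplitter` interfaces are not defined by this commit. It assumes only what the text states: the assembler works on vectors of page numbers rather than numeric ranges, and the splitter invokes an assembler once per output file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a page: the input it came from and its 1-based
// page number in that input. This is not a qpdf type.
struct PageRef
{
    std::string input;
    int page;
};

inline bool operator==(PageRef const& a, PageRef const& b)
{
    return a.input == b.input && a.page == b.page;
}

// Toy "document": an ordered list of pages.
using Doc = std::vector<PageRef>;

// Hypothetical assembler. It accepts explicit vectors of page numbers
// (range parsing is assumed to happen elsewhere) and appends the selected
// pages in the order the selections were added.
class ToyAssembler
{
  public:
    void addInput(Doc const& doc, std::vector<int> const& pages)
    {
        for (int p : pages) {
            assert(p >= 1 && static_cast<std::size_t>(p) <= doc.size());
            out_.push_back(doc[static_cast<std::size_t>(p) - 1]);
        }
    }

    Doc const& result() const { return out_; }

  private:
    Doc out_;
};

// Hypothetical splitter. It runs a fresh assembler once per group of n
// pages, using the same document as the input each time.
std::vector<Doc> splitEvery(Doc const& doc, std::size_t n)
{
    assert(n > 0);
    std::vector<Doc> outputs;
    for (std::size_t start = 0; start < doc.size(); start += n) {
        std::vector<int> pages;
        for (std::size_t i = start; i < doc.size() && i < start + n; ++i) {
            pages.push_back(static_cast<int>(i) + 1);
        }
        ToyAssembler assembler;
        assembler.addInput(doc, pages);
        outputs.push_back(assembler.result());
    }
    return outputs;
}

// Build a toy input whose pages are numbered 1..count.
Doc makeDoc(std::string const& name, int count)
{
    Doc d;
    for (int i = 1; i <= count; ++i) {
        d.push_back(PageRef{name, i});
    }
    return d;
}
```

In this model, merging pages 1 and 3 of one input with page 2 of another is two `addInput` calls, and feeding `result()` to `splitEvery` mirrors the idea of splitting the output of a previous assembly.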
@@ -189,10 +213,10 @@ are handled, specify placement options, etc. Given the above framework, it would
 additional features incrementally, without breaking compatibility, such as selecting or splitting
 pages based on tags, article threads, or outlines.
 
-It's tempting to allow assemblies to be nested, but this gets very complicated. From the C++ API, we
-could modify QPDFJob to allow the use any QPDF as an input, but supporting this from the CLI is hard
-because of the way JSON/arg parsing is set up. If people need to do that, they can just create
-intermediate files.
+It's tempting to allow assemblies to be nested, but this gets very complicated. From the C++ API,
+there is no problem using the output of one `QPDFAssembler` as the input to another, but supporting
+this from the CLI is hard because of the way JSON/arg parsing is set up. If people need to do that,
+they can just create intermediate files.
 
 Proposed CLI enhancements:
 
@@ -424,7 +448,8 @@ Last checked: 2023-12-29
 gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open
 ```
 
-* Allow an existing `QPDF` to be an input to a merge operation when using the QPDFJob C++ API
+* Allow an existing `QPDF` to be an input to a merge or underlay/overlay operation when using the
+  `QPDFAssembler` C++ API
   * Issues: none
 * Generate a mapping from source to destination for all destinations
   * Issues: #1077
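
One possible reading of the "mapping from source to destination" item is a lookup from each source page to the position or positions it occupies in the output. The sketch below assumes that reading. `SourcePage`, `mapSourceToDestination`, and `destinationsFor` are hypothetical names, not qpdf API, and the commit does not specify what the real mapping will look like.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical key: (input name, 1-based source page number).
using SourcePage = std::pair<std::string, int>;

// Build a map from each source page to the 1-based positions it occupies in
// the output. A page selected more than once maps to several positions.
std::map<SourcePage, std::vector<int>>
mapSourceToDestination(std::vector<SourcePage> const& output_order)
{
    std::map<SourcePage, std::vector<int>> result;
    int dest = 1;
    for (auto const& src : output_order) {
        assert(src.second >= 1); // source page numbers are 1-based
        result[src].push_back(dest++);
    }
    return result;
}

// Look up where a source page ended up. An empty vector means the page was
// not copied, so anything that pointed at it has no target in the output.
std::vector<int>
destinationsFor(std::map<SourcePage, std::vector<int>> const& m, SourcePage const& src)
{
    auto it = m.find(src);
    return it == m.end() ? std::vector<int>{} : it->second;
}
```

A structure like this is what lets something such as an outline entry be retargeted or dropped depending on whether the page it pointed to survived the assembly.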