Commit f7dd653d5fbe7d6c393a444eb3e8d841556ce085
1 parent
e52b026d
TODO-pages: introduce QPDFAssembler and QPDFSplitter
Showing 1 changed file with 52 additions and 27 deletions
TODO-pages.md
| ... | ... | @@ -24,18 +24,15 @@ of the following properties, among others: |
| 24 | 24 | * Contains information used by pages (named destinations) |
| 25 | 25 | |
| 26 | 26 | As long as qpdf has had the ability to copy pages from one PDF to another, it has had robust |
| 27 | -handling of page-level data. Prior to the implementation of the pages epic, with the exception of | |
| 28 | -page labels and form fields, qpdf has ignored document-level data during page copy operations. | |
| 29 | -Specifically, when qpdf creates a new PDF file from existing PDF files, it always starts with a | |
| 30 | -specific PDF, known as the _primary input_. The primary input may be a file or the built-in _empty | |
| 31 | -PDF_. With the exception of page labels and form fields, document-level constructs that appear in | |
| 32 | -the primary input are preserved, and document-level constructs from the other PDF files are ignored. | |
| 33 | -With page labels, qpdf always ensures that any given page has the same label in the final output as | |
| 34 | -it had in whichever input file it originated from, which is usually (but not always) the desired | |
| 35 | -behavior. With form fields, qpdf has awareness and ensures that all form fields remain operational. | |
| 36 | -The goal is to extend this document-level-awareness to other document-level constructs. | |
| 37 | - | |
| 38 | -Here are several examples of problems in qpdf prior to the implementation of the pages epic: | |
| 27 | +handling of page-level data. When qpdf creates a new PDF file from existing PDF files, it starts | |
| 28 | +with a specific PDF, known as the _primary input_. The primary input may be a file or the built-in | |
| 29 | +_empty PDF_. Prior to the implementation of the pages epic, qpdf has ignored document-level data | |
| 30 | +(except for page labels and interactive form fields) when merging and splitting files. Any | |
| 31 | +document-level data in the primary input was preserved, and any document-level data other than form | |
| 32 | +fields and page labels was discarded from the other files. After this work is complete, qpdf will | |
| 33 | +handle other document-level data in a manner that preserves the functionality of all pages in the | |
| 34 | +final PDF. Here are several examples of problems in qpdf prior to the implementation of the pages | |
| 35 | +epic: | |
| 39 | 36 | * If two files with optional content (layers) are merged, all layers in all but the primary input |
| 40 | 37 | will be visible in the combined file. |
| 41 | 38 | * If two files with file attachments are merged, attachments will be retained on the primary input |
| ... | ... | @@ -46,9 +43,10 @@ Here are several examples of problems in qpdf prior to the implementation of the |
| 46 | 43 | entirety, including outlines that point to pages that are no longer there, and outlines will be |
| 47 | 44 | lost from all files except the primary input. |
| 48 | 45 | |
| 49 | -With the above limitations, qpdf allows combining pages from arbitrary numbers of input PDFs to | |
| 50 | -create an output PDF, or in the case of page splitting, multiple output PDFs. The API allows | |
| 51 | -arbitrary combinations of input and output files. The command-line allows only the following: | |
| 46 | +Regarding page assembly, prior to the pages epic, qpdf allows combining pages from arbitrary numbers | |
| 47 | +of input PDFs to create an output PDF, or in the case of page splitting, multiple output PDFs. The | |
| 48 | +API allows arbitrary combinations of input and output files. The command-line allows only the | |
| 49 | +following: | |
| 52 | 50 | * Merge: creation of a single output file from a primary input and any number of other inputs by |
| 53 | 51 | selecting pages by index from the beginning or end of the file |
| 54 | 52 | * Split: creation of multiple output files from a single input or the result of a merge into files |
| ... | ... | @@ -79,10 +77,13 @@ Here are some examples of things that will become possible: |
| 79 | 77 | The rest of this document describes the details of how these features will work and what needs |
| 80 | 78 | to be done to make them possible to build. |
| 81 | 79 | |
| 82 | -# Architectural Thoughts | |
| 80 | +# Architecture | |
| 83 | 81 | |
| 84 | -Open question: if I do all the complex logic in `QPDFJob`, what are the implications for pikepdf or | |
| 85 | -other wrappers? This will need to be discussed in the discussion ticket. | |
| 82 | +Create a `QPDFAssembler` class to handle merging and a `QPDFSplitter` to handle splitting. The | |
| 83 | +complex assembly logic can be handled by `QPDFAssembler`. `QPDFSplitter` can invoke `QPDFAssembler` | |
| 84 | +with a previous `QPDFAssembler`'s output (or any `QPDF`) multiple times to create the split files. | |
| 85 | +This will mostly involve moving code from `QPDFJob` to `QPDFAssembler` and `QPDFSplitter` and having | |
| 86 | +`QPDFJob` invoke them. | |
| 86 | 87 | |
| 87 | 88 | Prior to implementation of the pages epic, `QPDFJob` goes through the following stages: |
| 88 | 89 | |
| ... | ... | @@ -123,8 +124,16 @@ Prior to implementation of the pages epic, `QPDFJob` goes through the following |
| 123 | 124 | * Preserve form fields and page labels |
| 124 | 125 | |
| 125 | 126 | Broadly, the above has to be modified in the following ways: |
| 126 | -* From the C++ API, make it possible to use an arbitrary QPDF as an input rather than having to | |
| 127 | - start with a file. That makes it possible to do arbitrary work on the PDF prior to submitting it. | |
| 127 | +* The transformations step has to be pulled out as that will stay in `QPDFJob`. | |
| 128 | +* Most of write QPDF will stay in `QPDFJob`, but the split logic will move to `QPDFSplitter`. | |
| 129 | +* The entire create QPDF logic will move into `QPDFAssembler`. | |
| 130 | +* `QPDFAssembler`'s API will allow using an arbitrary QPDF as an input rather than having to start | |
| 131 | + with a file. That makes it possible to do arbitrary work on the PDF prior to passing it to | |
| 132 | + `QPDFAssembler`. | |
| 133 | +* `QPDFAssembler` and `QPDFSplitter` may need a C API, or perhaps C users will have to work through | |
| 134 | + `QPDFJob`, which will expose nearly all of the functionality. | |
| 135 | + | |
| 136 | +Within `QPDFAssembler`, we will extend the create QPDF logic in the following ways: | |
| 128 | 137 | * Allow creation of blank pages as an additional input source |
| 129 | 138 | * Generalize underlay/overlay |
| 130 | 139 | * Enable controlling placement |
| ... | ... | @@ -132,17 +141,32 @@ Broadly, the above has to be modified in the following ways: |
| 132 | 141 | * Add additional reordering options |
| 133 | 142 | * We don't need to provide hooks for this. If someone is going to code a hook, they can just |
| 134 | 143 | compute the page ordering directly. |
| 135 | -* Have a page composition phase after the overlay/underlay stage | |
| 144 | +* Have a page composition stage after the overlay/underlay stage | |
| 136 | 145 | * Allow n-up, left-to-right (can reverse page order to get rtl), top-to-bottom, or modular |
| 137 | 146 | composition like pstops |
| 138 | 147 | * Add additional ways to select pages besides range (e.g. based on outlines) |
| 139 | -* Add additional ways to specify boundaries for splitting | |
| 140 | 148 | * Enhance existing logic to handle other document-level structures, preferably in a way that |
| 141 | 149 | requires less duplication between split and merge. |
| 142 | 150 | * We don't need to turn on and off most types of document constructs individually. People can |
| 143 | 151 | preprocess using the API or qpdf JSON if they want fine-grained control. |
| 144 | 152 | * For things like attachments and outlines, we can add additional flags. |
| 145 | 153 | |
| 154 | +Within `QPDFSplitter`, we will add additional ways to specify boundaries for splitting. | |
| 155 | + | |
| 156 | +We must take care with the implementations and APIs for `QPDFSplitter`, `QPDFAssembler`, and | |
| 157 | +`QPDFJob` to avoid excessive duplication. Perhaps `QPDFJob` can create and configure a | |
| 158 | +`QPDFAssembler` and `QPDFSplitter` on the fly to avoid too much duplication of state. | |
| 159 | + | |
| 160 | +Much of the logic will actually reside in other helper classes. For example, `QPDFAssembler` will | |
| 161 | +probably not operate with numeric ranges, leaving that to `QPDFJob` and `QUtil` but will instead | |
| 162 | +have vectors of page numbers. The logic for creating page groups from outlines, threads, or | |
| 163 | +structure will most likely live in the document helpers for those bits of functionality. This keeps | |
| 164 | +needless clutter out of `QPDFAssembler` and also makes it possible for people to perform their own | |
| 165 | +subset of functionality by calling lower-level interfaces. The main power of `QPDFAssembler` will be | |
| 166 | +to manage sequencing and destination tracking as well as to provide a future-proof API that will | |
| 167 | +allow developers to automatically benefit from additional document-level support as it is added to | |
| 168 | +qpdf. | |
| 169 | + | |
| 146 | 170 | ## Flexible Assembly |
| 147 | 171 | |
| 148 | 172 | This section discusses modifications to the command-line syntax to make it easier to add flexibility |
| ... | ... | @@ -189,10 +213,10 @@ are handled, specify placement options, etc. Given the above framework, it would |
| 189 | 213 | additional features incrementally, without breaking compatibility, such as selecting or splitting |
| 190 | 214 | pages based on tags, article threads, or outlines. |
| 191 | 215 | |
| 192 | -It's tempting to allow assemblies to be nested, but this gets very complicated. From the C++ API, we | |
| 193 | -could modify QPDFJob to allow the use of any QPDF as an input, but supporting this from the CLI is hard | |
| 194 | -because of the way JSON/arg parsing is set up. If people need to do that, they can just create | |
| 195 | -intermediate files. | |
| 216 | +It's tempting to allow assemblies to be nested, but this gets very complicated. From the C++ API, | |
| 217 | +there is no problem using the output of one `QPDFAssembler` as the input to another, but supporting | |
| 218 | +this from the CLI is hard because of the way JSON/arg parsing is set up. If people need to do that, | |
| 219 | +they can just create intermediate files. | |
| 196 | 220 | |
| 197 | 221 | Proposed CLI enhancements: |
| 198 | 222 | |
| ... | ... | @@ -424,7 +448,8 @@ Last checked: 2023-12-29 |
| 424 | 448 | gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open |
| 425 | 449 | ``` |
| 426 | 450 | |
| 427 | -* Allow an existing `QPDF` to be an input to a merge operation when using the QPDFJob C++ API | |
| 451 | +* Allow an existing `QPDF` to be an input to a merge or underlay/overlay operation when using the | |
| 452 | + `QPDFAssembler` C++ API | |
| 428 | 453 | * Issues: none |
| 429 | 454 | * Generate a mapping from source to destination for all destinations |
| 430 | 455 | * Issues: #1077 | ... | ... |