Commit f7dd653d5fbe7d6c393a444eb3e8d841556ce085
1 parent e52b026d
TODO-pages: introduce QPDFAssembler and QPDFSplitter
Showing 1 changed file with 52 additions and 27 deletions
TODO-pages.md
| @@ -24,18 +24,15 @@ of the following properties, among others: | @@ -24,18 +24,15 @@ of the following properties, among others: | ||
| 24 | * Contains information used by pages (named destinations) | 24 | * Contains information used by pages (named destinations) |
| 25 | 25 | ||
| 26 | As long as qpdf has had the ability to copy pages from one PDF to another, it has had robust | 26 | As long as qpdf has had the ability to copy pages from one PDF to another, it has had robust |
| 27 | -handling of page-level data. Prior to the implementation of the pages epic, with the exception of | ||
| 28 | -page labels and form fields, qpdf has ignored document-level data during page copy operations. | ||
| 29 | -Specifically, when qpdf creates a new PDF file from existing PDF files, it always starts with a | ||
| 30 | -specific PDF, known as the _primary input_. The primary input may be a file or the built-in _empty | ||
| 31 | -PDF_. With the exception of page labels and form fields, document-level constructs that appear in | ||
| 32 | -the primary input are preserved, and document-level constructs from the other PDF files are ignored. | ||
| 33 | -With page labels, qpdf always ensures that any given page has the same label in the final output as | ||
| 34 | -it had in whichever input file it originated from, which is usually (but not always) the desired | ||
| 35 | -behavior. With form fields, qpdf has awareness and ensures that all form fields remain operational. | ||
| 36 | -The goal is to extend this document-level-awareness to other document-level constructs. | ||
| 37 | - | ||
| 38 | -Here are several examples of problems in qpdf prior to the implementation of the pages epic: | 27 | +handling of page-level data. When qpdf creates a new PDF file from existing PDF files, it starts |
| 28 | +with a specific PDF, known as the _primary input_. The primary input may be a file or the built-in | ||
| 29 | +_empty PDF_. Prior to the implementation of the pages epic, qpdf has ignored document-level data | ||
| 30 | +(except for page labels and interactive form fields) when merging and splitting files. Any | ||
| 31 | +document-level data in the primary input was preserved, and any document-level data other than form | ||
| 32 | +fields and page labels was discarded from the other files. After this work is complete, qpdf will | ||
| 33 | +handle other document-level data in a manner that preserves the functionality of all pages in the | ||
| 34 | +final PDF. Here are several examples of problems in qpdf prior to the implementation of the pages | ||
| 35 | +epic: | ||
| 39 | * If two files with optional content (layers) are merged, all layers in all but the primary input | 36 | * If two files with optional content (layers) are merged, all layers in all but the primary input |
| 40 | will be visible in the combined file. | 37 | will be visible in the combined file. |
| 41 | * If two files with file attachments are merged, attachments will be retained on the primary input | 38 | * If two files with file attachments are merged, attachments will be retained on the primary input |
| @@ -46,9 +43,10 @@ Here are several examples of problems in qpdf prior to the implementation of the | @@ -46,9 +43,10 @@ Here are several examples of problems in qpdf prior to the implementation of the | ||
| 46 | entirety, including outlines that point to pages that are no longer there, and outlines will be | 43 | entirety, including outlines that point to pages that are no longer there, and outlines will be |
| 47 | lost from all files except the primary input. | 44 | lost from all files except the primary input. |
| 48 | 45 | ||
| 49 | -With the above limitations, qpdf allows combining pages from arbitrary numbers of input PDFs to | ||
| 50 | -create an output PDF, or in the case of page splitting, multiple output PDFs. The API allows | ||
| 51 | -arbitrary combinations of input and output files. The command-line allows only the following: | 46 | +Regarding page assembly, prior to the pages epic, qpdf allows combining pages from arbitrary numbers |
| 47 | +of input PDFs to create an output PDF, or in the case of page splitting, multiple output PDFs. The | ||
| 48 | +API allows arbitrary combinations of input and output files. The command-line allows only the | ||
| 49 | +following: | ||
| 52 | * Merge: creation of a single output file from a primary input and any number of other inputs by | 50 | * Merge: creation of a single output file from a primary input and any number of other inputs by |
| 53 | selecting pages by index from the beginning or end of the file | 51 | selecting pages by index from the beginning or end of the file |
| 54 | * Split: creation of multiple output files from a single input or the result of a merge into files | 52 | * Split: creation of multiple output files from a single input or the result of a merge into files |
| @@ -79,10 +77,13 @@ Here are some examples of things that will become possible: | @@ -79,10 +77,13 @@ Here are some examples of things that will become possible: | ||
| 79 | The rest of this document describes the details of how these features will work and what needs | 77 | The rest of this document describes the details of how these features will work and what needs |
| 80 | to be done to make them possible to build. | 78 | to be done to make them possible to build. |
| 81 | 79 | ||
| 82 | -# Architectural Thoughts | 80 | +# Architecture |
| 83 | 81 | ||
| 84 | -Open question: if I do all the complex logic in `QPDFJob`, what are the implications for pikepdf or | ||
| 85 | -other wrappers? This will need to be discussed in the discussion ticket. | 82 | +Create a `QPDFAssembler` class to handle merging and a `QPDFSplitter` to handle splitting. The |
| 83 | +complex assembly logic can be handled by `QPDFAssembler`. `QPDFSplitter` can invoke `QPDFAssembler` | ||
| 84 | +with a previous `QPDFAssembler`'s output (or any `QPDF`) multiple times to create the split files. | ||
| 85 | +This will mostly involve moving code from `QPDFJob` to `QPDFAssembler` and `QPDFSplitter` and having | ||
| 86 | +`QPDFJob` invoke them. | ||
| 86 | 87 | ||
| 87 | Prior to implementation of the pages epic, `QPDFJob` goes through the following stages: | 88 | Prior to implementation of the pages epic, `QPDFJob` goes through the following stages: |
| 88 | 89 | ||
| @@ -123,8 +124,16 @@ Prior to implementation of the pages epic, `QPDFJob` goes through the following | @@ -123,8 +124,16 @@ Prior to implementation of the pages epic, `QPDFJob` goes through the following | ||
| 123 | * Preserve form fields and page labels | 124 | * Preserve form fields and page labels |
| 124 | 125 | ||
| 125 | Broadly, the above has to be modified in the following ways: | 126 | Broadly, the above has to be modified in the following ways: |
| 126 | -* From the C++ API, make it possible to use an arbitrary QPDF as an input rather than having to | ||
| 127 | - start with a file. That makes it possible to do arbitrary work on the PDF prior to submitting it. | 127 | +* The transformations step has to be pulled out as that wil stay in `QPDFJob`. |
| 128 | +* Most of write QPDF will stay in `QPDFJob`, but the split logic will move to `QPDFSplitter`. | ||
| 129 | +* The entire create QPDF logic will move into `QPDFAssembler`. | ||
| 130 | +* `QPDFAssembler`'s API will allow using an arbitrary QPDF as an input rather than having to start | ||
| 131 | + with a file. That makes it possible to do arbitrary work on the PDF prior to passing it to | ||
| 132 | + `QPDFAssembler`. | ||
| 133 | +* `QPDFAssembler` and `QPDFSplitter` may need a C API, or perhaps C users will have to work through | ||
| 134 | + `QPDFJob`, which will expose nearly all of the functionality. | ||
| 135 | + | ||
| 136 | +Within `QPDFAssembler`, we will extend the create QPDF logic in the following ways: | ||
| 128 | * Allow creation of blank pages as an additional input source | 137 | * Allow creation of blank pages as an additional input source |
| 129 | * Generalize underlay/overlay | 138 | * Generalize underlay/overlay |
| 130 | * Enable controlling placement | 139 | * Enable controlling placement |
| @@ -132,17 +141,32 @@ Broadly, the above has to be modified in the following ways: | @@ -132,17 +141,32 @@ Broadly, the above has to be modified in the following ways: | ||
| 132 | * Add additional reordering options | 141 | * Add additional reordering options |
| 133 | * We don't need to provide hooks for this. If someone is going to code a hook, they can just | 142 | * We don't need to provide hooks for this. If someone is going to code a hook, they can just |
| 134 | compute the page ordering directly. | 143 | compute the page ordering directly. |
| 135 | -* Have a page composition phase after the overlay/underlay stage | 144 | +* Have a page composition stage after the overlay/underlay stage |
| 136 | * Allow n-up, left-to-right (can reverse page order to get rtl), top-to-bottom, or modular | 145 | * Allow n-up, left-to-right (can reverse page order to get rtl), top-to-bottom, or modular |
| 137 | composition like pstops | 146 | composition like pstops |
| 138 | * Add additional ways to select pages besides range (e.g. based on outlines) | 147 | * Add additional ways to select pages besides range (e.g. based on outlines) |
| 139 | -* Add additional ways to specify boundaries for splitting | ||
| 140 | * Enhance existing logic to handle other document-level structures, preferably in a way that | 148 | * Enhance existing logic to handle other document-level structures, preferably in a way that |
| 141 | requires less duplication between split and merge. | 149 | requires less duplication between split and merge. |
| 142 | * We don't need to turn on and off most types of document constructs individually. People can | 150 | * We don't need to turn on and off most types of document constructs individually. People can |
| 143 | preprocess using the API or qpdf JSON if they want fine-grained control. | 151 | preprocess using the API or qpdf JSON if they want fine-grained control. |
| 144 | * For things like attachments and outlines, we can add additional flags. | 152 | * For things like attachments and outlines, we can add additional flags. |
| 145 | 153 | ||
| 154 | +Within `QPDFSplitter`, we will add additional ways to specify boundaries for splitting. | ||
| 155 | + | ||
| 156 | +We must take care with the implementations and APIs for `QPDFSplitter`, `QPDFAssembler`, and | ||
| 157 | +`QPDFJob` to avoid excessive duplication. Perhaps `QPDFJob` can create and configure a | ||
| 158 | +`QPDFAssembler` and `QPDFSplitter` on the fly to avoid too much duplication of state. | ||
| 159 | + | ||
| 160 | +Much of the logic will actually reside in other helper classes. For example, `QPDFAssembler` will | ||
| 161 | +probably not operate with numeric ranges, leaving that to `QPDFJob` and `QUtil` but will instead | ||
| 162 | +have vectors of page numbers. The logic for creating page groups from outlines, threads, or | ||
| 163 | +structure will most likely live in the document helpers for those bits of functionality. This keeps | ||
| 164 | +needless clutter out of `QPDFAssembler` and also makes it possible for people to perform their own | ||
| 165 | +subset of functionality by calling lower-level interfaces. The main power of `QPDFAssembler` will be | ||
| 166 | +to manage sequencing and destination tracking as well as to provide a future-proof API that will | ||
| 167 | +allow developers to automatically benefit from additional document-level support as it is added to | ||
| 168 | +qpdf. | ||
| 169 | + | ||
| 146 | ## Flexible Assembly | 170 | ## Flexible Assembly |
| 147 | 171 | ||
| 148 | This section discusses modifications to the command-line syntax to make it easier to add flexibility | 172 | This section discusses modifications to the command-line syntax to make it easier to add flexibility |
| @@ -189,10 +213,10 @@ are handled, specify placement options, etc. Given the above framework, it would | @@ -189,10 +213,10 @@ are handled, specify placement options, etc. Given the above framework, it would | ||
| 189 | additional features incrementally, without breaking compatibility, such as selecting or splitting | 213 | additional features incrementally, without breaking compatibility, such as selecting or splitting |
| 190 | pages based on tags, article threads, or outlines. | 214 | pages based on tags, article threads, or outlines. |
| 191 | 215 | ||
| 192 | -It's tempting to allow assemblies to be nested, but this gets very complicated. From the C++ API, we | ||
| 193 | -could modify QPDFJob to allow the use any QPDF as an input, but supporting this from the CLI is hard | ||
| 194 | -because of the way JSON/arg parsing is set up. If people need to do that, they can just create | ||
| 195 | -intermediate files. | 216 | +It's tempting to allow assemblies to be nested, but this gets very complicated. From the C++ API, |
| 217 | +there is no problem using the output of one `QPDFAssembler` as the input to another, but supporting | ||
| 218 | +this from the CLI is hard because of the way JSON/arg parsing is set up. If people need to do that, | ||
| 219 | +they can just create intermediate files. | ||
| 196 | 220 | ||
| 197 | Proposed CLI enhancements: | 221 | Proposed CLI enhancements: |
| 198 | 222 | ||
| @@ -424,7 +448,8 @@ Last checked: 2023-12-29 | @@ -424,7 +448,8 @@ Last checked: 2023-12-29 | ||
| 424 | gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open | 448 | gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open |
| 425 | ``` | 449 | ``` |
| 426 | 450 | ||
| 427 | -* Allow an existing `QPDF` to be an input to a merge operation when using the QPDFJob C++ API | 451 | +* Allow an existing `QPDF` to be an input to a merge or underlay/overlay operation when using the |
| 452 | + `QPDFAssembler` C++ API | ||
| 428 | * Issues: none | 453 | * Issues: none |
| 429 | * Generate a mapping from source to destination for all destinations | 454 | * Generate a mapping from source to destination for all destinations |
| 430 | * Issues: #1077 | 455 | * Issues: #1077 |
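
The division of labor this commit proposes (an assembler that takes explicit vectors of page numbers, and a splitter that invokes the assembler repeatedly on a previous result) can be illustrated with a minimal sketch. Everything below is hypothetical: `Document`, `PageRef`, `Assembler`, `Splitter`, and `makeDocument` are stand-ins invented for illustration and are not part of the qpdf API, and the real `QPDFAssembler` and `QPDFSplitter` interfaces had not been designed at the time of this commit. Splitting by a fixed group size is also an assumption, chosen only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a page: records which input it came from
// and its zero-based index in that input.
struct PageRef {
    std::string source;
    std::size_t index;
};

// Hypothetical stand-in for a QPDF object: a name and an ordered page list.
struct Document {
    std::string name;
    std::vector<PageRef> pages;
};

// Build a document whose pages all originate from itself.
Document makeDocument(std::string const& name, std::size_t n_pages) {
    Document doc{name, {}};
    for (std::size_t i = 0; i < n_pages; ++i) {
        doc.pages.push_back(PageRef{name, i});
    }
    return doc;
}

// Hypothetical assembler: accepts any in-memory document as an input and
// takes explicit page numbers rather than parsing numeric range syntax.
class Assembler {
public:
    void addPages(Document const& doc, std::vector<std::size_t> const& page_numbers) {
        for (std::size_t n : page_numbers) {
            assert(n < doc.pages.size());
            selected_.push_back(doc.pages[n]);
        }
    }

    Document assemble(std::string const& name) const {
        return Document{name, selected_};
    }

private:
    std::vector<PageRef> selected_;
};

// Hypothetical splitter: creates each output by running a fresh Assembler
// over a slice of the input, so split and merge share one code path.
class Splitter {
public:
    explicit Splitter(std::size_t pages_per_file) : pages_per_file_(pages_per_file) {
        assert(pages_per_file_ > 0);
    }

    std::vector<Document> split(Document const& doc) const {
        std::vector<Document> outputs;
        for (std::size_t start = 0; start < doc.pages.size(); start += pages_per_file_) {
            std::vector<std::size_t> group;
            for (std::size_t i = start;
                 i < doc.pages.size() && i < start + pages_per_file_; ++i) {
                group.push_back(i);
            }
            Assembler assembler;
            assembler.addPages(doc, group);
            outputs.push_back(
                assembler.assemble(doc.name + "-" + std::to_string(outputs.size() + 1)));
        }
        return outputs;
    }

private:
    std::size_t pages_per_file_;
};
```

In this sketch a caller would merge by calling `addPages` once per input and then `assemble`, and would split by passing the assembled `Document` to `Splitter::split`. Because the splitter only ever goes through the assembler, any document-level handling added to the assembler later would apply to split outputs as well, which is the duplication-avoidance goal the diff describes.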