Commit e52b026db4a7f23b79f21b4563d2b02d3b87fde4
1 parent 379fc7e5
Major rework of TODO-pages.md
This is converging into something that will be possible to do.
Showing 1 changed file with 322 additions and 471 deletions
TODO-pages.md
| 1 | # Pages | 1 | # Pages |
| 2 | 2 | ||
| 3 | -**THIS IS A WORK IN PROGRESS. THE ACTUAL IMPLEMENTATION MAY NOT LOOK ANYTHING LIKE THIS. When this | ||
| 4 | -gets to the stage where it is starting to congeal into an actual plan, I will remove this disclaimer | ||
| 5 | -and open a discussion ticket in GitHub to work out details.** | 3 | +**This is a work in progress, but it's getting close. When this gets to the stage where it is |
| 4 | +starting to congeal into an actual plan, I will remove this disclaimer and open a discussion ticket | ||
| 5 | +in GitHub to work out details.** | ||
| 6 | 6 | ||
| 7 | This document describes a project known as the _pages epic_. The goal of the pages epic is to enable | 7 | This document describes a project known as the _pages epic_. The goal of the pages epic is to enable |
| 8 | qpdf to properly preserve all functionality associated with a page as pages are copied from one PDF | 8 | qpdf to properly preserve all functionality associated with a page as pages are copied from one PDF |
| 9 | -to another (or back to the same PDF). | 9 | +to another (or back to the same PDF). A secondary goal is to add more flexibility to the ways in |
| 10 | +which documents can be split and combined (flexible assembly). | ||
| 10 | 11 | ||
| 11 | Terminology: | 12 | Terminology: |
| 12 | * _Page-level data_: information that is contained within objects reachable from the page dictionary | 13 | * _Page-level data_: information that is contained within objects reachable from the page dictionary |
| @@ -14,30 +15,33 @@ Terminology: | @@ -14,30 +15,33 @@ Terminology: | ||
| 14 | * _Document-level data_: information that is reachable from the document catalog (`/Root`) that is | 15 | * _Document-level data_: information that is reachable from the document catalog (`/Root`) that is |
| 15 | not reachable from a page dictionary as well as the `/Info` dictionary | 16 | not reachable from a page dictionary as well as the `/Info` dictionary |
| 16 | 17 | ||
| 17 | -Some document-level data references specific pages by page object ID, such as outlines or | ||
| 18 | -interactive forms. Some document-level data doesn't reference any pages, such as embedded files or | ||
| 19 | -optional content (layers). Some document-level data contains information that pertains to a specific | ||
| 20 | -page but does not reference the page, such as page labels (explicit page numbers). Some page-level | ||
| 21 | -data may sometimes depend on document-level data. For example, a _named destination_ depends on the | ||
| 22 | -document-level _names tree_. | 18 | +PDF uses document-level data in a variety of ways. There is some document-level data that has each |
| 19 | +of the following properties, among others: | ||
| 20 | +* References pages by object ID (outlines, interactive forms) | ||
| 21 | +* Doesn't reference any pages (embedded files) | ||
| 22 | +* Doesn't reference any pages but influences page rendering (optional content/layers) | ||
| 23 | +* Doesn't reference any pages but contains information about pages (page labels) | ||
| 24 | +* Contains information used by pages (named destinations) | ||
| 23 | 25 | ||
| 24 | As long as qpdf has had the ability to copy pages from one PDF to another, it has had robust | 26 | As long as qpdf has had the ability to copy pages from one PDF to another, it has had robust |
| 25 | handling of page-level data. Prior to the implementation of the pages epic, with the exception of | 27 | handling of page-level data. Prior to the implementation of the pages epic, with the exception of |
| 26 | -page labels, qpdf has ignored document-level data during page copy operations. Specifically, when | ||
| 27 | -qpdf creates a new PDF file from existing PDF files, it always starts with a specific PDF, known as | ||
| 28 | -the _primary input_. The primary input may be the built-in _empty PDF_. With the exception of page | ||
| 29 | -labels, document-level constructs that appear in the primary input are preserved, and document-level | ||
| 30 | -constructs from the other PDF files are ignored. The exception to this is page labels. With page | ||
| 31 | -labels, qpdf always ensures that any given page has the same label in the final output as it had in | ||
| 32 | -whichever input file it originated from, which is usually (but not always) the desired behavior. | 28 | +page labels and form fields, qpdf has ignored document-level data during page copy operations. |
| 29 | +Specifically, when qpdf creates a new PDF file from existing PDF files, it always starts with a | ||
| 30 | +specific PDF, known as the _primary input_. The primary input may be a file or the built-in _empty | ||
| 31 | +PDF_. With the exception of page labels and form fields, document-level constructs that appear in | ||
| 32 | +the primary input are preserved, and document-level constructs from the other PDF files are ignored. | ||
| 33 | +With page labels, qpdf always ensures that any given page has the same label in the final output as | ||
| 34 | +it had in whichever input file it originated from, which is usually (but not always) the desired | ||
| 35 | +behavior. With form fields, qpdf has awareness and ensures that all form fields remain operational. | ||
| 36 | +The goal is to extend this document-level-awareness to other document-level constructs. | ||
| 33 | 37 | ||
| 34 | Here are several examples of problems in qpdf prior to the implementation of the pages epic: | 38 | Here are several examples of problems in qpdf prior to the implementation of the pages epic: |
| 35 | * If two files with optional content (layers) are merged, all layers in all but the primary input | 39 | * If two files with optional content (layers) are merged, all layers in all but the primary input |
| 36 | will be visible in the combined file. | 40 | will be visible in the combined file. |
| 37 | * If two files with file attachments are merged, attachments will be retained on the primary input | 41 | * If two files with file attachments are merged, attachments will be retained on the primary input |
| 38 | but dropped on the others. (qpdf has other ways to copy attachments from one file to another.) | 42 | but dropped on the others. (qpdf has other ways to copy attachments from one file to another.) |
| 39 | -* If two files with hyperlinks are merged, any hyperlink from other than primary input whose | ||
| 40 | destination is a named destination will become non-functional. | 43 | +* If two files with hyperlinks are merged, any hyperlink from other than the primary input becomes |
| 44 | + non-functional. | ||
| 41 | * If two files with outlines are merged, the outlines from the original file will appear in their | 45 | * If two files with outlines are merged, the outlines from the original file will appear in their |
| 42 | entirety, including outlines that point to pages that are no longer there, and outlines will be | 46 | entirety, including outlines that point to pages that are no longer there, and outlines will be |
| 43 | lost from all files except the primary input. | 47 | lost from all files except the primary input. |
| @@ -55,27 +59,32 @@ arbitrary combinations of input and output files. The command-line allows only t | @@ -55,27 +59,32 @@ arbitrary combinations of input and output files. The command-line allows only t | ||
| 55 | 59 | ||
| 56 | The pages epic consists of two broad categories of work: | 60 | The pages epic consists of two broad categories of work: |
| 57 | * Proper handling of document-level features when splitting and merging documents | 61 | * Proper handling of document-level features when splitting and merging documents |
| 58 | -* Greatly increased flexibility in the ways in which pages can be selected from the various input | ||
| 59 | - files and combined for the output file. This includes creation of blank pages. | 62 | +* Flexible assembly: greatly increased flexibility in the ways in which pages can be selected from |
| 63 | + the various input files and combined for the output file. This includes creation of blank pages | ||
| 64 | + and composition of pages (n-up or other ways of combining multiple input pages into one output | ||
| 65 | + page) | ||
| 60 | 66 | ||
| 61 | Here are some examples of things that will become possible: | 67 | Here are some examples of things that will become possible: |
| 62 | 68 | ||
| 63 | * Stacking arbitrary pages on top of each other with full control over transformation and cropping, | 69 | * Stacking arbitrary pages on top of each other with full control over transformation and cropping, |
| 64 | including being able to access information about the various bounding boxes associated with the | 70 | including being able to access information about the various bounding boxes associated with the |
| 65 | - pages | 71 | + pages (generalization of underlay/overlay) |
| 66 | * Inserting blank pages | 72 | * Inserting blank pages |
| 67 | * Doing n-up page layouts | 73 | * Doing n-up page layouts |
| 74 | +* Creating single very long or wide pages with output from other pages | ||
| 68 | * Re-ordering pages for printing booklets (also called signatures or printer spreads) | 75 | * Re-ordering pages for printing booklets (also called signatures or printer spreads) |
| 69 | * Selecting pages based on the outline hierarchy, tags, or article threads | 76 | * Selecting pages based on the outline hierarchy, tags, or article threads |
| 70 | * Keeping only and all relevant parts of the outline hierarchies from all input files | 77 | * Keeping only and all relevant parts of the outline hierarchies from all input files |
| 71 | -* Creating single very long or wide pages with output from other pages | ||
| 72 | 78 | ||
| 73 | The rest of this document describes the details of how these features will work and what needs | 79 | The rest of this document describes the details of how these features will work and what needs |
| 74 | to be done to make them possible to build. | 80 | to be done to make them possible to build. |
| 75 | 81 | ||
| 76 | -# QPDFJob Summary | 82 | +# Architectural Thoughts |
| 83 | + | ||
| 84 | +Open question: if I do all the complex logic in `QPDFJob`, what are the implications for pikepdf or | ||
| 85 | +other wrappers? This will need to be discussed in the discussion ticket. | ||
| 77 | 86 | ||
| 78 | -`QPDFJob` goes through the following stages: | 87 | +Prior to implementation of the pages epic, `QPDFJob` goes through the following stages: |
| 79 | 88 | ||
| 80 | * create QPDF | 89 | * create QPDF |
| 81 | * update from JSON | 90 | * update from JSON |
| @@ -113,176 +122,299 @@ to be done to make them possible to build. | @@ -113,176 +122,299 @@ to be done to make them possible to build. | ||
| 113 | * Remove unreferenced resources if needed | 122 | * Remove unreferenced resources if needed |
| 114 | * Preserve form fields and page labels | 123 | * Preserve form fields and page labels |
| 115 | 124 | ||
| 116 | -# Architectural Thoughts | ||
| 117 | - | ||
| 118 | -XXX WORK IN: Dump `QPDFAssembler`. Instead, these are enhancements to `QPDFJob`. Don't try to | ||
| 119 | -generalize this too much. There are actually only a few things we need to add to `QPDFJob`. Go | ||
| 120 | -through and flesh out the list, but roughly: | ||
| 121 | - | 125 | +Broadly, the above has to be modified in the following ways: |
| 122 | * From the C++ API, make it possible to use an arbitrary QPDF as an input rather than having to | 126 | * From the C++ API, make it possible to use an arbitrary QPDF as an input rather than having to |
| 123 | start with a file. That makes it possible to do arbitrary work on the PDF prior to submitting it. | 127 | start with a file. That makes it possible to do arbitrary work on the PDF prior to submitting it. |
| 124 | -* Allow specification of n blank pages of a given size, e.g. `--blank=5@612x792`. Maybe we can | ||
| 125 | - support standard paper sizes, inches, centimeters, or sizes relative to other pages. | 128 | +* Allow creation of blank pages as an additional input source |
| 126 | * Generalize underlay/overlay | 129 | * Generalize underlay/overlay |
| 127 | - * Maybe we can do it by adding flags and allowing them to be repeated | ||
| 128 | - * Maybe we need a new syntax, like pstops, but with the ability to specify anchors and proportions | ||
| 129 | - based on various boxes | ||
| 130 | - * Maybe we need something like `--stack` | ||
| 131 | - * It needs to be possible to stack arbitrary pages with arbitrary transformations and to have the | ||
| 132 | - transformations be a function of the source or destination page; the rectangle mapping idea | ||
| 133 | - discussed elsewhere may be a good basis | 130 | + * Enable controlling placement |
| 131 | + * Make repeatable | ||
| 132 | +* Add additional reordering options | ||
| 133 | + * We don't need to provide hooks for this. If someone is going to code a hook, they can just | ||
| 134 | + compute the page ordering directly. | ||
| 134 | * Have a page composition phase after the overlay/underlay stage | 135 | * Have a page composition phase after the overlay/underlay stage |
| 135 | * Allow n-up, left-to-right (can reverse page order to get rtl), top-to-bottom, or modular | 136 | * Allow n-up, left-to-right (can reverse page order to get rtl), top-to-bottom, or modular |
| 136 | composition like pstops | 137 | composition like pstops |
| 137 | - * Possible hook for page composition to allow custom compositions | ||
| 138 | -* A few additional split options | ||
| 139 | - | ||
| 140 | -Then, we need to make the existing logic handle other document-level structures, preferably in a way | ||
| 141 | -that requires less duplication between split and merge. Maybe we can add a flag to disregard | ||
| 142 | -document-level structures for speed, but I think being able to turn them on and off individually is | ||
| 143 | -overkill, especially since people who are that sophisticated can tweak with JSON or just do it in | ||
| 144 | -code. | ||
| 145 | - | ||
| 146 | -The challenge will be able to come up with command-line syntax to do most things from the CLI and to | ||
| 147 | -make the C++ API flexible enough for users to insert their own bits in key places, just as we can | ||
| 148 | -now grab the QPDF before the write phase. This approach eliminates all the function stuff. We just | ||
| 149 | -have to make sure we can support all these features and have a relatively easy way to add new ones | ||
| 150 | -or to let developers extend. The documentation will have to explain the flow of QPDFJob so people | ||
| 151 | -can know where to apply hooks. | ||
| 152 | - | ||
| 153 | ----------- | ||
| 154 | - | ||
| 155 | -Create a new top-level class called `QPDFAssembler` that will be used to perform page-level | ||
| 156 | -operations. Its implementation will use existing APIs, and it will add many new APIs. It should be | ||
| 157 | -possible to perform all existing page splitting and merging operations using `QPDFAssembler` without | ||
| 158 | -having to worry about details such as copying annotations, remapping destinations, and adjusting | ||
| 159 | -document-level data. | ||
| 160 | - | ||
| 161 | -Early strategy: keep `QPDFAssembler` private to the library, and start with a pure C++ API (no JSON | ||
| 162 | -support). Migrate splitting and merging from `QPDFJob` into `QPDFAssembler`, then build in | ||
| 163 | -document-level support. Also work the difference between normal write and split, which are two | ||
| 164 | -separate ways to write output files. | ||
| 165 | - | ||
| 166 | -One of the main responsibilities of `QPDFAssembler` will be to remap destinations as data from a | ||
| 167 | -page is moved or copied. For example, if an outline has a destination that points to a particular | ||
| 168 | -rectangle on page 5 of the second file, and we end up dropping a portion of that page into an n-up | ||
| 169 | -configuration on a specific output page, we will have to keep track of enough information to replace | ||
| 170 | -the destination with a new one that points to the new physical location of the same material. For | ||
| 171 | -another example, consider a case in which the left side of page 3 of the primary input ends up as | ||
| 172 | -page 5 of the output and the right side of page 3 ends up as page 6. We would have to map | ||
| 173 | -destinations from a single source page to different destination pages based on which part of the | ||
| 174 | -page it was on. If part of the rectangle points to one page and part to another, what do we do? I | ||
| 175 | -suggest we go with the top/center of the rectangle. | ||
| 176 | - | ||
| 177 | -A destination consists of a QPDF, page object, and rectangle in user coordinates. When | ||
| 178 | -`QPDFAssembler` copies a page or converts it to a form XObject, possibly with transformations | ||
| 179 | -applied, it will have to be able to map a destination to the same triple (QPDF, page object, | ||
| 180 | -rectangle) on all pages that contain data from the original page. When writing the final output, any | ||
| 181 | -destination that no longer points anywhere should be dropped, and any destination that points to | ||
| 182 | -multiple places will need to be handled according to some specification. | 138 | +* Add additional ways to select pages besides range (e.g. based on outlines) |
| 139 | +* Add additional ways to specify boundaries for splitting | ||
| 140 | +* Enhance existing logic to handle other document-level structures, preferably in a way that | ||
| 141 | + requires less duplication between split and merge. | ||
| 142 | + * We don't need to turn on and off most types of document constructs individually. People can | ||
| 143 | + preprocess using the API or qpdf JSON if they want fine-grained control. | ||
| 144 | + * For things like attachments and outlines, we can add additional flags. | ||
| 145 | + | ||
| 146 | +## Flexible Assembly | ||
| 147 | + | ||
| 148 | +This section discusses modifications to the command-line syntax to make it easier to add flexibility | ||
| 149 | +going forward without breaking backward compatibility. The main thrust will be to create | ||
| 150 | +non-positional alternatives to some things that currently use positional arguments (`--pages`, | ||
| 151 | +`--overlay`, `--underlay`), as was done for `--encrypt` in 11.7.0, to make it possible to add | ||
| 152 | +additional flags. | ||
| 153 | + | ||
| 154 | +In several cases, we allow specification of transformations or placements. In this context: | ||
| 155 | +* The origin is always lower-left corner. | ||
| 156 | +* A _dimension_ may be absolute or relative. | ||
| 157 | + * An _absolute dimension_ is `{n}` (in points), `{n}in` (inches), or `{n}cm` (centimeters). | ||
| 158 | + * A _relative dimension_ is expressed in terms of the corresponding dimension of one of a page's | ||
| 159 | + boxes. Which dimension is determined by context. | ||
| 160 | + * `{n}{M|C|B|T|A}` is `{n}` times the corresponding dimension of the media, crop, bleed, trim, | ||
| 161 | + or art box. Example: `0.5M` would be half the width or height of the media box. | ||
| 162 | + * `{n}+{M|C|B|T|A}` is `{n}` plus the corresponding dimension. Example: `-0.5in+T` is half an | ||
| 163 | + inch (36 points) less than the width or height of the trim box. | ||
| 164 | +* A _size_ is | ||
| 165 | + * `{w}x{h}`, where `{w}` and `{h}` are dimensions | ||
| 166 | + * `letter|a4` (potentially add other page sizes) | ||
| 167 | +* A _position_ is `{x}x{y}` where `{x}` and `{y}` are dimensions offset from the origin | ||
| 168 | +* A _rectangle_ is `{llx},{lly},{urx},{ury}` (lower|upper left|right x|y) with `llx` < `urx` and | ||
| 169 | + `lly` < `ury` | ||
| 170 | + * Examples: | ||
| 171 | + * `0.1M,0.1M,0.9M,0.9M` is a box whose llx is 10% of the media box width, lly is 10% of the | ||
| 172 | + height, urx is 90% of the width, and ury is 90% of the height | ||
| 173 | + * `0,0,612,792` is a box whose size is that of a US Letter page. | ||
| 174 | + * A rectangle may also be just one of `M|C|B|T|A` to refer to a page's media, crop, bleed, trim, | ||
| 175 | + or art box. | ||
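The dimension and rectangle grammar above might be resolved to points along these lines. This is a Python sketch for discussion, not qpdf code: `resolve_dimension`, `resolve_rectangle`, and the box dictionaries are hypothetical names. It assumes that a relative dimension means the box's width or height alone (no offset for the box's own origin), which the text above does not settle.

```python
import re

# Points per unit for absolute dimensions: bare numbers are already points.
POINTS_PER_UNIT = {"": 1.0, "in": 72.0, "cm": 72.0 / 2.54}

_NUM = r"[+-]?(?:\d+\.?\d*|\.\d+)"
_ABS = re.compile(rf"^({_NUM})(in|cm)?$")                    # {n}, {n}in, {n}cm
_REL_MUL = re.compile(rf"^({_NUM})([MCBTA])$")               # {n}{M|C|B|T|A}
_REL_ADD = re.compile(rf"^({_NUM}(?:in|cm)?)\+([MCBTA])$")   # {n}+{M|C|B|T|A}


def resolve_dimension(text, box_sizes=None):
    """Return a dimension in points.

    box_sizes maps a box letter to that box's width or height in points;
    the caller picks width or height according to context.
    """
    m = _ABS.match(text)
    if m:
        return float(m.group(1)) * POINTS_PER_UNIT[m.group(2) or ""]
    if box_sizes is None:
        raise ValueError(f"relative dimension needs a page: {text!r}")
    m = _REL_MUL.match(text)
    if m:
        return float(m.group(1)) * box_sizes[m.group(2)]
    m = _REL_ADD.match(text)
    if m:
        # The part before '+' is itself an absolute dimension.
        return resolve_dimension(m.group(1)) + box_sizes[m.group(2)]
    raise ValueError(f"not a dimension: {text!r}")


def resolve_rectangle(text, boxes):
    """Return (llx, lly, urx, ury) in points.

    boxes maps a box letter to that box's (llx, lly, urx, ury).
    """
    if text in boxes:  # a bare M, C, B, T, or A names the box itself
        return boxes[text]
    parts = text.split(",")
    if len(parts) != 4:
        raise ValueError(f"not a rectangle: {text!r}")
    widths = {k: b[2] - b[0] for k, b in boxes.items()}
    heights = {k: b[3] - b[1] for k, b in boxes.items()}
    llx = resolve_dimension(parts[0], widths)
    lly = resolve_dimension(parts[1], heights)
    urx = resolve_dimension(parts[2], widths)
    ury = resolve_dimension(parts[3], heights)
    if not (llx < urx and lly < ury):
        raise ValueError(f"empty or inverted rectangle: {text!r}")
    return (llx, lly, urx, ury)
```

With a US Letter media box, `resolve_rectangle("0.1M,0.1M,0.9M,0.9M", boxes)` yields the 10%-to-90% box from the example above, and `-0.5in+T` resolves to the trim box dimension minus 36 points.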
| 176 | + | ||
| 177 | +Tweak `--pages` similarly to `--encrypt`. As an alternative to `--pages file [--password=p] range | ||
| 178 | +--`, support `--pages --file=x --password=y --range=z --`. This allows for a more flexible syntax. | ||
| 179 | +If `--file` appears, positional arguments are disallowed. The same applies to `--overlay` and | ||
| 180 | +`--underlay`. | ||
| 181 | + | ||
| 182 | +``` | ||
| 183 | +OLD: qpdf 2.pdf --pages 1.pdf --password=x . 3.pdf 1-z -- out.pdf | ||
| 184 | +NEW: qpdf 2.pdf --pages --file=1.pdf --password=x --file=. --file=3.pdf --range=1-z -- out.pdf | ||
| 185 | +``` | ||
| 186 | + | ||
| 187 | +This makes it possible to add additional flags to do things like control how document-level features | ||
| 188 | +are handled, specify placement options, etc. Given the above framework, it would be possible to add | ||
| 189 | +additional features incrementally, without breaking compatibility, such as selecting or splitting | ||
| 190 | +pages based on tags, article threads, or outlines. | ||
| 191 | + | ||
| 192 | +It's tempting to allow assemblies to be nested, but this gets very complicated. From the C++ API, we | ||
| 193 | +could modify QPDFJob to allow the use of any QPDF as an input, but supporting this from the CLI is hard | ||
| 194 | +because of the way JSON/arg parsing is set up. If people need to do that, they can just create | ||
| 195 | +intermediate files. | ||
| 196 | + | ||
| 197 | +Proposed CLI enhancements: | ||
| 198 | + | ||
| 199 | +``` | ||
| 200 | +# --pages: inputs | ||
| 201 | +--file=x [ --password=x ] | ||
| 202 | +--blank=n [ --size={size} [ --size-from-page=n ] ] # see below | ||
| 203 | +# modifiers refer to most recent input | ||
| 204 | +--range=... | ||
| 205 | +--with-attachments={none|all|referenced} # default = referenced | ||
| 206 | +--with-outlines={none|all|referenced} # default = referenced | ||
| 207 | +--... # future options to select pages based on outlines, article threads, tags, etc. | ||
| 208 | +# placement (matrix transformation -- see notes below) | ||
| 209 | +--rotate=[+-]angle[:page-range] # existing | ||
| 210 | +--scale=x,y[:page-range] | ||
| 211 | +--translate=dx,dy[:page-range] # dx and dy are dimensions | ||
| 212 | +--flip={h|v}[:page-range] | ||
| 213 | +--transform=a,b,c,d,e,f[:page-range] | ||
| 214 | +--set-box={M|C|B|T|A}=rect[:page-range] # change a bounding box | ||
| 215 | +# stacking -- make --underlay and --overlay repeatable | ||
| 216 | +--{underlay|overlay} ... -- | ||
| 217 | +--file=x [ --password=x ] | ||
| 218 | +--from, --to, --repeat # same as current --overlay, --underlay | ||
| 219 | +--from-rect={rect} # default = T -- see notes | ||
| 220 | +--to-rect={rect} # default = M -- see notes | ||
| 221 | +# composition -- a new QPDFJob stage between stacking and transformation | ||
| 222 | +--compose=... # see notes | ||
| 223 | +--n-up={2,4,6,9,16} | ||
| 224 | +--concat={h|v} # concatenate all pages to a single big page | ||
| 225 | +# reordering | ||
| 226 | +--collate=a,b,c # exists | ||
| 227 | +--booklet=... # re-order pages for book signatures like psbook -- see notes | ||
| 228 | +# split | ||
| 229 | +--split-pages=n # existing | ||
| 230 | +--split-after=a,b,c # split after each named page | ||
| 231 | +--... # future options to split based on outlines, article threads, tags, etc. | ||
| 232 | +# post-processing (with transformations like optimize images) | ||
| 233 | +--set-page-labels ... # See issue #939 | ||
| 234 | +``` | ||
| 235 | + | ||
| 236 | +Notes: | ||
| 237 | +* For `--blank`, `--size` specifies the size of the blank page. If any relative dimensions are used, | ||
| 238 | + `--size-from-page=n` must be used to specify the page (from n in the overall input) that relative | ||
| 239 | + dimensions should be taken from. It is an error to specify a relative size based on another blank | ||
| 240 | + page. (Let's not complicate things by doing a graph traversal to find an eventual absolute page. | ||
| 241 | + Just disallow a blank page to be specified relative to another blank page.) | ||
| 242 | +* For stacking, the default is to map the source page's trim box onto the destination page's | ||
| 243 | + mediabox. This is a weird default, but it's there for compatibility. The `--from-rect` and | ||
| 244 | + `--to-rect` may be used to map an arbitrary region of the over/underlay file into an arbitrary | ||
| 245 | + region of a page. With the defaults, an overlay or underlay page will be stretched or shrunk if | ||
| 246 | + pages are of variable size. Absolute rectangles can be used to avoid this. If a rectangle uses | ||
| 247 | + relative dimensions, they are relative to the page that has the rectangle. You can't create a | ||
| 248 | + `--to-rect` relative to the size of the from page or vice versa. If you need to do this, use | ||
| 249 | + external logic to compute the rectangles and then use absolute rectangles. | ||
| 250 | +* `--compose`: XXX | ||
| 251 | +* `--booklet`: XXX | ||
| 252 | +* The `--set-page-labels` option would be done at the very end and is actually not blocked by | ||
| 253 | + anything else here. It can be done near removing page labels in `handleTransformations`. | ||
| 254 | +* I'm not sure what impact composition should have on page labels. Most likely, we should drop page | ||
| 255 | + labels on composition. If someone wants them, they can use `--set-page-labels`. | ||
| 256 | + | ||
| 257 | +### Compose, Booklet | ||
| 258 | + | ||
| 259 | +This section needs to be fleshed out. It is probably lower priority than document-level work. | ||
| 260 | + | ||
| 261 | +Here are some ideas from pstops. The following is an excerpt from the pstops manual page. Maybe we | ||
| 262 | +can come up with something similar using our enhanced rectangle syntax. | ||
| 263 | + | ||
| 264 | +This section contains some sample re-arrangements. To put two pages on one sheet (of A4 paper), | ||
| 265 | +the pagespec to use is: | ||
| 266 | +``` | ||
| 267 | +2:0L@.7(21cm,0)+1L@.7(21cm,14.85cm) | ||
| 268 | +``` | ||
| 269 | +To select all of the odd pages in reverse order, use: | ||
| 270 | +``` | ||
| 271 | +2:-0 | ||
| 272 | +``` | ||
| 273 | +To re-arrange pages for printing 2-up booklets, use | ||
| 274 | +``` | ||
| 275 | +4:-3L@.7(21cm,0)+0L@.7(21cm,14.85cm) | ||
| 276 | +``` | ||
| 277 | +for the front sides, and | ||
| 278 | +``` | ||
| 279 | +4:1L@.7(21cm,0)+-2L@.7(21cm,14.85cm) | ||
| 280 | +``` | ||
| 281 | +for the reverse sides (or join them with a comma for duplex printing). | ||
| 282 | + | ||
| 283 | +From issue #493 | ||
| 284 | +``` | ||
| 285 | + pdf2ps infile.pdf infile.ps | ||
| 286 | + ps2ps -pa4 "2:0R(4.5cm,26.85cm)+1R(4.5cm,14.85cm)" infile.ps outfile.ps | ||
| 287 | + ps2pdf outfile.ps outfile.pdf | ||
| 288 | + ``` | ||
| 289 | + | ||
| 290 | +Notes on signatures (psbook). For a signature of size 3, we have the following assuming a 2-up | ||
| 291 | +configuration that is printed double-sided so that, when the whole stack is placed face-up and | ||
| 292 | +folded in half, page 1 is on top. | ||
| 293 | +* front: 6,7, back: 8,5 | ||
| 294 | +* front: 4,9, back: 10,3 | ||
| 295 | +* front: 2,11, back: 12,1 | ||
| 296 | + | ||
| 297 | +This is the same as duplex 2-up with pages in order 6, 7, 8, 5, 4, 9, 10, 3, 2, 11, 12, 1 | ||
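The signature ordering above follows a simple pattern that can be sketched in Python. This is an illustration only, with a hypothetical function name `signature_order`. It assumes the page count is exactly four times the number of sheets and that sheets are listed innermost first, as in the size-3 example.

```python
def signature_order(sheets):
    """Return the page order for one signature printed duplex 2-up.

    Sheets are emitted innermost first; each sheet contributes
    front-left, front-right, back-left, back-right.
    """
    if sheets < 1:
        raise ValueError("a signature needs at least one sheet")
    n = 4 * sheets  # pages in the signature
    order = []
    for k in range(sheets, 0, -1):
        order += [2 * k, n + 1 - 2 * k, n + 2 - 2 * k, 2 * k - 1]
    return order


print(signature_order(3))  # [6, 7, 8, 5, 4, 9, 10, 3, 2, 11, 12, 1]
```

A real `--booklet` option would also need a rule for inputs whose page count is not a multiple of four, such as padding with blank pages. That rule is not decided here.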
| 298 | + | ||
| 299 | +n-up: | ||
| 300 | +* For 2-up, calculate new w and h such that w/h maintains a fixed ratio and w and h are the largest | ||
| 301 | + values that can fit within 1/2 the page with specified margins. | ||
| 302 | +* Can support 1, 2, 4, 6, 9, 16. 2 and 6 require rotation. The others don't. Will probably need to | ||
| 303 | + change getFormXObjectForPage to handle other boxes than trim box. | ||
| 304 | +* Maybe define n-up as a scale and rotate followed by fitting the result into a specified rectangle. I | ||
| 305 | + might already have this logic in QPDFAnnotationObjectHelper::getPageContentForAppearance. | ||
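The "largest size that fits while keeping the aspect ratio" calculation for n-up could look roughly like this in Python. `fit_in_cell` and its parameters are hypothetical, not an existing qpdf function. The sketch assumes uniform margins on all four sides of each cell and considers only 0 and 90 degree orientations.

```python
def fit_in_cell(src_w, src_h, cell_w, cell_h, margin=0.0, allow_rotate=True):
    """Return (scale, rotated) for placing a src_w x src_h page in a cell.

    The scale is uniform, so w/h is preserved. rotated is True when
    turning the page 90 degrees gives a strictly larger result.
    """
    avail_w = cell_w - 2 * margin
    avail_h = cell_h - 2 * margin
    if avail_w <= 0 or avail_h <= 0:
        raise ValueError("margins leave no room in the cell")
    upright = min(avail_w / src_w, avail_h / src_h)
    rotated = min(avail_w / src_h, avail_h / src_w) if allow_rotate else 0.0
    if rotated > upright:
        return rotated, True
    return upright, False


# 2-up on a portrait US Letter sheet: each cell is half the page height.
print(fit_in_cell(612, 792, 612, 396))
# 4-up on the same sheet: each cell is a quarter of the page.
print(fit_in_cell(612, 792, 306, 396))
```

For same-size input and output pages, the 2-up case picks the rotated orientation and the 4-up case does not, which matches the note above that 2 and 6 require rotation.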
| 306 | + | ||
| 307 | +## Destinations | ||
| 308 | + | ||
| 309 | +We will have to keep track of destinations that point to a page when the page is moved or copied. | ||
| 310 | +For example, if an outline has a destination that points to a particular rectangle on page 5 of the | ||
| 311 | +second file, and we end up dropping a portion of that page into an n-up configuration on a specific | ||
| 312 | +output page, we will have to keep track of enough information to replace the destination with a new | ||
| 313 | +one that points to the new physical location of the same material. For another example, consider a | ||
| 314 | +case in which the left side of page 3 of the primary input ends up as page 5 of the output and the | ||
| 315 | +right side of page 3 ends up as page 6. We would have to map destinations from a single source page | ||
| 316 | +to different destination pages based on which part of the page it was on. If part of the rectangle | ||
| 317 | +points to one page and part to another, what do we do? I suggest we go with the top/center of the | ||
| 318 | +rectangle. | ||
| 319 | + | ||
| 320 | +A destination consists of a QPDF, page object, and rectangle in user coordinates. When `QPDFJob` | ||
| 321 | +copies a page or converts it to a form XObject, possibly with transformations applied, it will have | ||
| 322 | +to be able to map a destination to the same triple (QPDF, page object, rectangle) on all pages that | ||
| 323 | +contain data from the original page. When writing the final output, any destination that no longer | ||
| 324 | +points anywhere should be dropped, and any destination that points to multiple places will need to | ||
| 325 | +be handled according to some specification. | ||
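One way to picture the destination remapping, using the "top/center of the rectangle" rule suggested above, is the following Python sketch. `Placement` and `map_destination` are hypothetical names, not qpdf API. The matrix follows the usual PDF `a,b,c,d,e,f` convention, and the sketch assumes each placement records the clipped region of the source page that it shows.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """Where a piece of a source page ended up in the output."""
    output_page: int
    clip: tuple    # (llx, lly, urx, ury) in source page coordinates
    matrix: tuple  # (a, b, c, d, e, f) mapping source to output coordinates


def map_destination(rect, placements):
    """Map a destination rectangle on a source page to the output.

    Returns (output_page, (x, y)) for the rectangle's top/center point,
    or None if that point is not visible anywhere, in which case the
    destination should be dropped.
    """
    llx, lly, urx, ury = rect
    x, y = (llx + urx) / 2, ury  # top/center anchor
    for p in placements:
        cllx, clly, curx, cury = p.clip
        if cllx <= x <= curx and clly <= y <= cury:
            a, b, c, d, e, f = p.matrix
            return p.output_page, (a * x + c * y + e, b * x + d * y + f)
    return None


# Left half of a source page becomes output page 5, right half page 6.
placements = [
    Placement(5, (0, 0, 306, 792), (1, 0, 0, 1, 0, 0)),
    Placement(6, (306, 0, 612, 792), (1, 0, 0, 1, -306, 0)),
]
print(map_destination((400, 500, 500, 600), placements))
```

When several placements contain the anchor point, this sketch takes the first one. How to handle destinations that legitimately map to multiple places remains an open question in the text above.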
| 183 | 326 | ||
| 184 | Whenever we create any new thing from a page, we create _derived page data_. Examples of derived | 327 | Whenever we create any new thing from a page, we create _derived page data_. Examples of derived |
| 185 | -page data would include a copy of the page and a form XObject created from a page. `QPDFAssembler` | ||
| 186 | -will have to keep a mapping from any source page to all of its derived objects along with any | ||
| 187 | -transformations or clipping. When a derived page data object is placed on a final page, that | ||
| 188 | -information can be combined with the position and any transformations onto the final page to be able | ||
| 189 | -to map any destination to a new one or to determine that it points outside of the visible area. | ||
| 190 | - | ||
| 191 | -If a source page is copied multiple times, then if exactly one copy is explicitly marked as the | ||
| 192 | -target, that becomes the target. Otherwise, the first derived object to be placed becomes the | ||
| 193 | -target. | ||
| 194 | - | ||
| 195 | -## Overall Structure | ||
| 196 | - | ||
| 197 | -A single instance of `QPDFAssembler` creates a single assembly job. `QPDFJob` can create one | ||
| 198 | -assembly job but does other things, such as setting writer options, inspection operations, etc. An | ||
| 199 | -assembly job consists of the following: | ||
| 200 | -* Global document-level data handling information | ||
| 201 | - * Mode | ||
| 202 | - * intelligent: try to combine everything using latest capabilities of qpdf; this is the default | ||
| 203 | - * legacy: document-level features are kept from primary input; this is for compatibility and can | ||
| 204 | - be selected from the CLI | ||
| 205 | -* Input sources | ||
| 206 | - * File/password | ||
| 207 | - * Whether to keep attachments: yes, no, if-all-pages (default) | ||
| 208 | - * Empty | ||
| 209 | -* Output mode | ||
| 210 | - * Single file | ||
| 211 | - * Split -- this must include definitions of the split groups | ||
| 212 | -* Description of the output in terms of the input sources and some series of transformations | ||
| 213 | - | ||
| 214 | -## Cases to support | ||
| 215 | - | ||
| 216 | -Here is a list of cases that need to be expressible. | ||
| 217 | - | ||
| 218 | -* Create output by concatenating pages from page groups where each page group is pages specified by | ||
| 219 | - a numeric range. This is what `--pages` does now. | ||
| 220 | -* Collation, including different sized groups. | ||
| 221 | -* Overlay/underlay, generalized to support a stack consisting of various underlays, the base page, | ||
| 222 | | - and various overlays, with flexibility around positioning. It should be natural to express | ||
| 223 | | - exactly what underlay and overlay do now. | ||
| 224 | -* Split into groups of fixed size (what `--split-pages` does) with the ability to define split | ||
| 225 | - groups based on other things, like outlines, article threads, and document structure | ||
| 226 | -* Examples from the manual: | ||
| 227 | - * `qpdf in.pdf --pages . a.pdf b.pdf:even -- out.pdf` | ||
| 228 | - * `qpdf --empty --pages a.pdf b.pdf --password=x z-1 c.pdf 3,6` | ||
| 229 | - * `qpdf --collate odd.pdf --pages . even.pdf -- all.pdf` | ||
| 230 | - * `qpdf --collate --empty --pages odd.pdf even.pdf -- all.pdf` | ||
| 231 | - * `qpdf --collate --empty --pages a.pdf 1-5 b.pdf 6-4 c.pdf r1 -- out.pdf` | ||
| 232 | - * `qpdf --collate=2 --empty --pages a.pdf 1-5 b.pdf 6-4 c.pdf r1 -- out.pdf` | ||
| 233 | - * `qpdf file2.pdf --pages file1.pdf 1-5 . 15-11 -- outfile.pdf` | ||
| 234 | - * | ||
| 235 | - ``` | ||
| 236 | - qpdf --empty --copy-encryption=encrypted.pdf \ | ||
| 237 | - --encryption-file-password=pass \ | ||
| 238 | - --pages encrypted.pdf --password=pass 1 \ | ||
| 239 | - ./encrypted.pdf --password=pass 1 -- \ | ||
| 240 | - outfile.pdf | ||
| 241 | - ``` | ||
| 242 | - * `qpdf --collate=2,6 a.pdf --pages . b.pdf -- all.pdf` | ||
| 243 | - * Take A 1-2, B 1-6, A 3-4, C 7-12, A 5-6, B 13-18, ... | ||
| 244 | -* Ideas from pstops. The following is an excerpt from the pstops manual page. | ||
| 245 | - | ||
| 246 | | - This section contains some sample re-arrangements. To put two pages on one sheet (of A4 paper), | ||
| 247 | - the pagespec to use is: | ||
| 248 | - ``` | ||
| 249 | - 2:0L@.7(21cm,0)+1L@.7(21cm,14.85cm) | ||
| 250 | - ``` | ||
| 251 | - To select all of the odd pages in reverse order, use: | ||
| 252 | - ``` | ||
| 253 | - 2:โ0 | ||
| 254 | - ``` | ||
| 255 | | - To re-arrange pages for printing 2-up booklets, use | ||
| 256 | - ``` | ||
| 257 | - 4:โ3L@.7(21cm,0)+0L@.7(21cm,14.85cm) | ||
| 258 | - ``` | ||
| 259 | - for the front sides, and | ||
| 260 | - ``` | ||
| 261 | | - 4:1L@.7(21cm,0)+-2L@.7(21cm,14.85cm) | ||
| 262 | - ``` | ||
| 263 | - for the reverse sides (or join them with a comma for duplex printing). | ||
| 264 | -* From #493 | ||
| 265 | - ``` | ||
| 266 | - pdf2ps infile.pdf infile.ps | ||
| 267 | - ps2ps -pa4 "2:0R(4.5cm,26.85cm)+1R(4.5cm,14.85cm)" infile.ps outfile.ps | ||
| 268 | - ps2pdf outfile.ps outfile.pdf | ||
| 269 | - ``` | ||
| 270 | -* Like psbook. Signature size n: | ||
| 271 | - * take groups of 4n | ||
| 272 | - * shown for n=3 in order such that, if printed so that the front of the first page is on top, the | ||
| 273 | - whole stack can be folded in half. | ||
| 274 | - * front: 6,7, back: 8,5 | ||
| 275 | - * front: 4,9, back: 10,3 | ||
| 276 | - * front: 2,11, back: 12,1 | ||
| 277 | - | ||
| 278 | - This is the same as duplex 2-up with pages in order 6, 7, 8, 5, 4, 9, 10, 3, 2, 11, 12, 1 | ||
| 279 | -* n-up: | ||
| 280 | - * For 2-up, calculate new w and h such that w/h maintains a fixed ratio and w and h are the | ||
| 281 | - largest values that can fit within 1/2 the page with specified margins. | ||
| 282 | - * Can support 1, 2, 4, 6, 9, 16. 2 and 6 require rotation. The others don't. Will probably need to | ||
| 283 | - change getFormXObjectForPage to handle other boxes than trim box. | ||
| 284 | | - * Maybe define n-up as a scale and rotate followed by fitting the result into a specified rectangle. | ||
| 285 | - I might already have this logic in QPDFAnnotationObjectHelper::getPageContentForAppearance. | 328 | +page data would include a copy of the page and a form XObject created from a page. We will have to |
| 329 | +keep a mapping from any source page to all of its derived objects along with any transformations or | ||
| 330 | +clipping. When a derived page data object is placed on a final page, that information can be | ||
| 331 | +combined with the position and any transformations onto the final page to be able to map any | ||
| 332 | +destination to a new one or to determine that it points outside of the visible area. There is | ||
| 333 | +already code in placeFormXObject and the code that places appearance streams that deals with these | ||
| 334 | +kinds of mappings. | ||
| 335 | + | ||
| 336 | +What do we do if a source page is copied multiple times? I think we will have to just make the new | ||
| 337 | +destination point to the first place that the target appears with precedence going to the original | ||
| 338 | +location. If we can detect this, we can give a warning. | ||
| 339 | + | ||
| 340 | +# Document-level Behavior | ||
| 341 | + | ||
| 342 | +Both merging and splitting contain logic, sometimes duplicated, to handle page labels, form fields, | ||
| 343 | +and annotations. We will need to build logic for other things. This section is a rough breakdown of | ||
| 344 | +the different things in the document catalog (plus the info dictionary, which is referenced from the | ||
| 345 | +trailer) and how we may have to handle them. We will need to implement various ObjectHelper and | ||
| 346 | +DocumentHelper classes. | ||
| 347 | + | ||
| 348 | +7.7.2 contains the list of all keys in the document catalog. | ||
| 349 | + | ||
| 350 | +Document-level structures to merge: | ||
| 351 | +* Extensions | ||
| 352 | + * Must be combination of Extensions from all input files | ||
| 353 | +* PageLabels | ||
| 354 | + * Ensure each page has its original label | ||
| 355 | + * Allow post-processing | ||
| 356 | +* Names -- see below | ||
| 357 | + * Combine per tree | ||
| 358 | + * May require disambiguation | ||
| 359 | + * Page: TemplateInstantiated | ||
| 360 | +* Dests | ||
| 361 | + * Keep referenced destinations across all files | ||
| 362 | + * May need to disambiguate or "flatten" or convert to named dests with the names tree | ||
| 363 | +* Outlines | ||
| 364 | +* Threads (easy) | ||
| 365 | + * Page: B | ||
| 366 | +* AcroForm | ||
| 367 | +* StructTreeRoot | ||
| 368 | + * Page: StructParents | ||
| 369 | +* MarkInfo (see 14.7 - Logical Structure, 14.8 Tagged PDF) | ||
| 370 | +* SpiderInfo | ||
| 371 | + * Page: ID | ||
| 372 | +* OutputIntents | ||
| 373 | + * Page: OutputIntents | ||
| 374 | +* PieceInfo | ||
| 375 | + * Page: PieceInfo | ||
| 376 | +* OCProperties | ||
| 377 | +* Requirements | ||
| 378 | +* AF (file specification dictionaries) | ||
| 379 | + * Page: AF | ||
| 380 | +* DPartRoot | ||
| 381 | + * Page: DPart | ||
| 382 | +* Version | ||
| 383 | + * Maximum | ||
| 384 | + | ||
| 385 | +Things that stay with the first document that has one and/or will not be supported | ||
| 386 | +* AA (Additional Actions) | ||
| 387 | + * Would be possible to combine and let the first contributor win, but it probably wouldn't usually | ||
| 388 | + be what we want. | ||
| 389 | +* Info (not part of document catalog) | ||
| 390 | +* ViewerPreferences | ||
| 391 | +* PageLayout | ||
| 392 | +* PageMode | ||
| 393 | +* OpenAction | ||
| 394 | +* URI | ||
| 395 | +* Metadata | ||
| 396 | +* Lang | ||
| 397 | +* NeedsRendering | ||
| 398 | +* Collection | ||
| 399 | +* Perms | ||
| 400 | +* Legal | ||
| 401 | +* DSS | ||
| 402 | + | ||
| 403 | +Name dictionary (7.7.4) | ||
| 404 | +* Dests | ||
| 405 | +* AP (appearance streams) | ||
| 406 | +* JavaScript | ||
| 407 | +* Pages (named pages) | ||
| 408 | +* Templates | ||
| 409 | + * Combine across all documents | ||
| 410 | + * Page: TemplateInstantiated points to a named page | ||
| 411 | +* IDS | ||
| 412 | +* URLS | ||
| 413 | +* EmbeddedFiles | ||
| 414 | +* AlternatePresentations | ||
| 415 | +* Renditions | ||
| 416 | + | ||
| 417 | +Most of chapter 12 applies. See Document-level navigation (12.3). | ||
| 286 | 418 | ||
| 287 | # Feature to Issue Mapping | 419 | # Feature to Issue Mapping |
| 288 | 420 | ||
| @@ -292,6 +424,8 @@ Last checked: 2023-12-29 | @@ -292,6 +424,8 @@ Last checked: 2023-12-29 | ||
| 292 | gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open | 424 | gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open |
| 293 | ``` | 425 | ``` |
| 294 | 426 | ||
| 427 | +* Allow an existing `QPDF` to be an input to a merge operation when using the QPDFJob C++ API | ||
| 428 | + * Issues: none | ||
| 295 | * Generate a mapping from source to destination for all destinations | 429 | * Generate a mapping from source to destination for all destinations |
| 296 | * Issues: #1077 | 430 | * Issues: #1077 |
| 297 | * Notes: | 431 | * Notes: |
| @@ -328,7 +462,7 @@ gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open | @@ -328,7 +462,7 @@ gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open | ||
| 328 | * This looks complicated. It may not be possible to do this fully in the first increment, but | 462 | * This looks complicated. It may not be possible to do this fully in the first increment, but | ||
| 329 | we have to keep it in mind and warn if we can't and we see /SD in an action. | 463 | we have to keep it in mind and warn if we can't and we see /SD in an action. |
| 330 | * #490 has some good analysis | 464 | * #490 has some good analysis |
| 331 | -* Assign page labels | 465 | +* Assign page labels (renumber pages) |
| 332 | * Issues: #939 | 466 | * Issues: #939 |
| 333 | * Notes: | 467 | * Notes: |
| 334 | * #939 has a good proposal | 468 | * #939 has a good proposal |
| @@ -381,286 +515,3 @@ gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open | @@ -381,286 +515,3 @@ gh search issues label:pages --repo qpdf/qpdf --limit 200 --state=open | ||
| 381 | * There is some helpful discussion in #343 including | 515 | * There is some helpful discussion in #343 including |
| 382 | * Preserving open/closed status | 516 | * Preserving open/closed status |
| 383 | * Preserving javascript actions | 517 | * Preserving javascript actions |
| 384 | - | ||
| 385 | -# XXX OLD NOTES | ||
| 386 | - | ||
| 387 | -I want to encapsulate various aspects of the logic into interfaces that can be implemented by | ||
| 388 | -developers to add their own logic. It should be easy to contribute these. Here are some rough ideas. | ||
| 389 | - | ||
| 390 | -A source is an input file, the output of another operation, or a blank page. In the API, it can be | ||
| 391 | -any QPDF object. | ||
| 392 | - | ||
| 393 | -A page group is just a group of pages. | ||
| 394 | - | ||
| 395 | -* PageSelector -- creates page groups from other page groups | ||
| 396 | -* PageTransformer -- selects a part of a page and possibly transforms it; applies to all pages of a | ||
| 397 | - group. Based on the page dictionary; does not look at the content stream | ||
| 398 | -* PageFilter -- apply arbitrary code to a page; may access the content stream | ||
| 399 | -* PageAssembler -- combines pages from groups into new groups whose pages are each assembled from | ||
| 400 | - corresponding pages of the input groups | ||
| 401 | - | ||
| 402 | -These should be able to be composed in arbitrary ways. There should be a natural API for doing this, | ||
| 403 | | -and there should be some specification, probably based on JSON, that can be provided on the | ||
| 404 | -command line or embedded in the job JSON format. I have been considering whether a lisp-like | ||
| 405 | -S-expression syntax may be less cumbersome to work with. I'll have to decide whether to support this | ||
| 406 | -or some other syntax in addition to a JSON representation. | ||
| 407 | - | ||
| 408 | -There also needs to be something to represent how document-level structures relate to this. I'm not | ||
| 409 | -sure exactly how this should work, but we need things like | ||
| 410 | -* what to do with page labels, especially when assembling pages from other pages | ||
| 411 | -* whether to preserve destinations (outlines, links, etc.), particularly when pages are duplicated | ||
| 412 | - * If A refers to B and there is more than one copy of B, how do you decide which copies of A link | ||
| 413 | - to which copies of B? | ||
| 414 | -* what to do with pages that belong to more than one group, e.g., what happens if you used document | ||
| 415 | - structure or outlines to form page groups and a group boundary lies in the middle of the page | ||
| 416 | - | ||
| 417 | | -Maybe page groups can have arbitrary, user-defined tags so we can specify that links should only | ||
| 418 | | -point to other pages with the same value of some tag. We can probably support many-to-one links if the | ||
| 419 | -source is duplicated. | ||
| 420 | - | ||
| 421 | -We probably need to hold onto the concept of the primary input file. If there is a primary input | ||
| 422 | | -file, there may need to be a way to specify what gets preserved from it. The behavior of qpdf prior to | ||
| 423 | -all of this is to preserve all document-level constructs from the primary input file and to try to | ||
| 424 | -preserve page labels from other input files when combining pages. | ||
| 425 | - | ||
| 426 | -Here are some examples. | ||
| 427 | - | ||
| 428 | -* PageSelector | ||
| 429 | - * all pages from an input file | ||
| 430 | - * pages from a group using a NumericRange | ||
| 431 | - * concatenate groups | ||
| 432 | - * pages from a group in reverse order | ||
| 433 | - * a group repeated as often as necessary until a specified number of pages is reached | ||
| 434 | - * a group padded with blank pages to create a multiple of n pages | ||
| 435 | - * odd or even pages from a group | ||
| 436 | - * every nth page from a group | ||
| 437 | - * pages interleaved from multiple groups | ||
| 438 | - * the left-front (left-back, right-front, right-back) pages of a booklet with signatures of n | ||
| 439 | - pages | ||
| 440 | - * all pages reachable from a section of the outline hierarchy or something based on threads or | ||
| 441 | - other structure | ||
| 442 | - * selection based on page labels | ||
| 443 | -* PageTransformer | ||
| 444 | - * clip to media box (trim box, crop box, etc.) | ||
| 445 | - * clip to specific absolute or relative size | ||
| 446 | - * scale | ||
| 447 | - * translate | ||
| 448 | - * rotate | ||
| 449 | - * apply transformation matrix | ||
| 450 | -* PageFilter | ||
| 451 | - * optimize images | ||
| 452 | - * flatten annotations | ||
| 453 | -* PageAssembler | ||
| 454 | - * Overlay/underlay all pages from one group onto corresponding pages from another group | ||
| 455 | - * Control placement based on properties of all the groups, so higher order than a stand-alone | ||
| 456 | - transformer | ||
| 457 | - * Examples | ||
| 458 | - * Scale the smaller page up to the size of the larger page | ||
| 459 | - * Center the smaller page horizontally and bottom-align the trim boxes | ||
| 460 | - * Generalized overlay/underlay allowing n pages in a given order with transformations. | ||
| 461 | - * n-up -- application of generalized overlay/underlay | ||
| 462 | - * make one long page with an arbitrary number of pages one after the other (#546) | ||
| 463 | - | ||
| 464 | -It should be possible to represent all of the existing qpdf operations using the above framework. It | ||
| 465 | -would be good to re-implement all of them in terms of this framework to exercise it. We will have to | ||
| 466 | -look through all the command-line arguments and make sure. Of course also make sure suggestions from | ||
| 467 | -issues can be implemented or at least supported by adding new selectors. | ||
| 468 | - | ||
| 469 | -Here are a few bits of scratch work. The top-level call is a selector. This doesn't capture | ||
| 470 | -everything. Implementing this would be tedious and challenging. It could be done using JSON arrays, | ||
| 471 | -but it would be clunky. This feels over-designed and possibly in conflict with QPDFJob. | ||
| 472 | - | ||
| 473 | -``` | ||
| 474 | -(concat | ||
| 475 | - (primary-input) | ||
| 476 | - (file "file2.pdf") | ||
| 477 | - (page-range (file "file3.pdf") "1-4,5-8") | ||
| 478 | -) | ||
| 479 | - | ||
| 480 | -(with | ||
| 481 | - ("a" | ||
| 482 | - (concat | ||
| 483 | - (primary-input) | ||
| 484 | - (file "file2.pdf") | ||
| 485 | - (page-range (file "file3.pdf") "1-4,5-8") | ||
| 486 | - ) | ||
| 487 | - ) | ||
| 488 | - (concat | ||
| 489 | - (even-pages (from "a")) | ||
| 490 | - (reverse (odd-pages (from "a"))) | ||
| 491 | - ) | ||
| 492 | -) | ||
| 493 | - | ||
| 494 | -(with | ||
| 495 | - ("a" | ||
| 496 | - (concat | ||
| 497 | - (primary-input) | ||
| 498 | - (file "file2.pdf") | ||
| 499 | - (page-range (file "file3.pdf") "1-4,5-8") | ||
| 500 | - ) | ||
| 501 | - "b-even" | ||
| 502 | - (even-pages (from "a")) | ||
| 503 | - "b-odd" | ||
| 504 | - (reverse (odd-pages (from "a"))) | ||
| 505 | - ) | ||
| 506 | - (stack | ||
| 507 | - (repeat-range (from "a") "z") | ||
| 508 | - (pad-end (from "b")) | ||
| 509 | - ) | ||
| 510 | -) | ||
| 511 | -``` | ||
| 512 | - | ||
| 513 | -```json | ||
| 514 | - | ||
| 515 | -``` | ||
| 516 | - | ||
| 517 | -# Supporting Document-level Features | ||
| 518 | - | ||
| 519 | -qpdf needs full support for document-level features like article threads, outlines, etc. There is no | ||
| 520 | -support for some things and partial support for others. See notes below for a comprehensive list. | ||
| 521 | - | ||
| 522 | -Most likely, this will be done by creating DocumentHelper and ObjectHelper classes. | ||
| 523 | - | ||
| 524 | -It will be necessary not only to read information about these structures from a single PDF file as | ||
| 525 | -the existing document helpers do but also to reconstruct or update these based on modifications to | ||
| 526 | -the pages in a file. I'm not sure how to do that, but one idea would be to allow a document helper | ||
| 527 | -to register a callback with QPDFPageDocumentHelper that notifies it when a page is added or removed. | ||
| 528 | -This may be able to take other parameters such as a document helper from a foreign file. | ||
| 529 | - | ||
| 530 | -Since these operations can be expensive, there will need to be a way to opt in and out. The default | ||
| 531 | -(to be clearly documented) should be that all supported document-level constructs are preserved. | ||
| 532 | -That way, as new features are added, changes to the output of previous operations to include | ||
| 533 | -information that was previously omitted will not constitute a non-backward compatible change that | ||
| 534 | -requires a major version bump. This will be a default for the API when using the higher-level page | ||
| 535 | | -assembly API (below) as well as the CLI. | ||
| 536 | - | ||
| 537 | -There will also need to be some kind of support for features that are document-level and not tied to | ||
| 538 | -any pages, such as (sometimes) embedded files. When splitting/merging files, there needs to be a way | ||
| 539 | -to specify what should happen with those things. Perhaps the default here should be that these are | ||
| 540 | -preserved from files from which all pages are selected. For some things, like viewer preferences, it | ||
| 541 | -may make sense to take them from the first file. | ||
| 542 | - | ||
| 543 | -# Page Assembly (page selection) | ||
| 544 | - | ||
| 545 | -In addition to the existing numeric ranges of page numbers, page selection could be driven by | ||
| 546 | -document-level features like the outlines hierarchy or article threads. There have been a lot of | ||
| 547 | -suggestions about this in various tickets. There will need to be some kind of page manipulation | ||
| 548 | -class with configuration options. I'm thinking something similar to QPDFJob, where you construct a | ||
| 549 | -class and then call a bunch of methods to configure it, including the ability to configure with | ||
| 550 | -JSON. Several suggestions have been made in issues, which I will go through and distill into a list. | ||
| 551 | -Off hand, some ideas include being able to split based on explicit chunks and being able to do all | ||
| 552 | -pages except a list of pages. | ||
| 553 | - | ||
| 554 | -For CLI, I'm probably going to have it take a JSON blob or JSON file on the CLI rather than having | ||
| 555 | -some absurd way of doing it with arguments (people have a lot of trouble with --pages as it is). See | ||
| 556 | -TODO for a feature on command-line/job JSON support for JSON specification arguments. | ||
| 557 | - | ||
| 558 | | -There are some other things, like allowing n-up and generalizing overlay/underlay to allow different | ||
| 559 | -placement and scaling options, that I think may also be in scope. | ||
| 560 | - | ||
| 561 | -# Scaling/Transforming Pages | ||
| 562 | - | ||
| 563 | -* Keep in mind that destinations, such as links and outlines, may need to be adjusted when a page is | ||
| 564 | - scaled or otherwise transformed. | ||
| 565 | - | ||
| 566 | -# Notes | ||
| 567 | - | ||
| 568 | -PDF document structure | ||
| 569 | - | ||
| 570 | -The trailer contains the catalog and the Info dictionary. We probably need to do something | ||
| 571 | -intelligent with the info dictionary. | ||
| 572 | - | ||
| 573 | -7.7.2 contains the list of all keys in the document catalog. | ||
| 574 | - | ||
| 575 | -Document-level structures to merge: | ||
| 576 | -* Extensions | ||
| 577 | - * Must be combination of Extensions from all input files | ||
| 578 | -* PageLabels | ||
| 579 | - * Ensure each page has its original label | ||
| 580 | - * Allow post-processing | ||
| 581 | -* Names -- see below | ||
| 582 | - * Combine per tree | ||
| 583 | - * May require disambiguation | ||
| 584 | - * Page: TemplateInstantiated | ||
| 585 | -* Dests | ||
| 586 | - * Keep referenced destinations across all files | ||
| 587 | - * May need to disambiguate or "flatten" or convert to named dests with the names tree | ||
| 588 | -* Outlines | ||
| 589 | -* Threads (easy) | ||
| 590 | - * Page: B | ||
| 591 | -* AcroForm | ||
| 592 | -* StructTreeRoot | ||
| 593 | - * Page: StructParents | ||
| 594 | -* MarkInfo (see 14.7 - Logical Structure, 14.8 Tagged PDF) | ||
| 595 | -* SpiderInfo | ||
| 596 | - * Page: ID | ||
| 597 | -* OutputIntents | ||
| 598 | - * Page: OutputIntents | ||
| 599 | -* PieceInfo | ||
| 600 | - * Page: PieceInfo | ||
| 601 | -* OCProperties | ||
| 602 | -* Requirements | ||
| 603 | -* AF (file specification dictionaries) | ||
| 604 | - * Page: AF | ||
| 605 | -* DPartRoot | ||
| 606 | - * Page: DPart | ||
| 607 | -* Version | ||
| 608 | - * Maximum | ||
| 609 | - | ||
| 610 | -Things that stay with the first document that has one and/or will not be supported | ||
| 611 | -* AA (Additional Actions) | ||
| 612 | - * Would be possible to combine and let the first contributor win, but it probably wouldn't usually | ||
| 613 | - be what we want. | ||
| 614 | -* Info (not part of document catalog) | ||
| 615 | -* ViewerPreferences | ||
| 616 | -* PageLayout | ||
| 617 | -* PageMode | ||
| 618 | -* OpenAction | ||
| 619 | -* URI | ||
| 620 | -* Metadata | ||
| 621 | -* Lang | ||
| 622 | -* NeedsRendering | ||
| 623 | -* Collection | ||
| 624 | -* Perms | ||
| 625 | -* Legal | ||
| 626 | -* DSS | ||
| 627 | - | ||
| 628 | -Name dictionary (7.7.4) | ||
| 629 | -* Dests | ||
| 630 | -* AP (appearance streams) | ||
| 631 | -* JavaScript | ||
| 632 | -* Pages (named pages) | ||
| 633 | -* Templates | ||
| 634 | - * Combine across all documents | ||
| 635 | - * Page: TemplateInstantiated points to a named page | ||
| 636 | -* IDS | ||
| 637 | -* URLS | ||
| 638 | -* EmbeddedFiles | ||
| 639 | -* AlternatePresentations | ||
| 640 | -* Renditions | ||
| 641 | - | ||
| 642 | -Most of chapter 12 applies. | ||
| 643 | - | ||
| 644 | -Document-level navigation (12.3) | ||
| 645 | - | ||
| 646 | -QPDF will need a global way to reference a page. This will most likely be in the form of the QPDF | ||
| 647 | -uuid and a QPDFObjectHandle to the page. If this can just be a QPDFObjectHandle, that would be | ||
| 648 | -better. I need to make sure we can meaningfully interact with QPDFObjectHandle objects from multiple | ||
| 649 | -QPDFs in a safe fashion. Figure out how this works with immediateCopyFrom, etc. Better to avoid this | ||
| 650 | -whole thing and make sure that we just keep all the document-level stuff specific to a PDF, but we | ||
| 651 | -will need to have some internal representation that can be used to reconstruct the document-level | ||
| 652 | -dictionaries when writing. Making this work with structures (structure destinations) will require | ||
| 653 | -more indirection. | ||
| 654 | - | ||
| 655 | | -I imagine that there will be some internal representation of what document-level things come along | ||
| 656 | | -for the ride when we take a page from a document. I wonder whether this needs to change the way | ||
| 657 | -linearization works. | ||
| 658 | - | ||
| 659 | -There should be different ways to specify collections of pages. The existing one, which is using a | ||
| 660 | -numeric range, is just one. Other ideas include things related to document structure (all pages in | ||
| 661 | -an article thread, all pages in an outline hierarchy), page labels, book binding (Is that called | ||
| 662 | -folio? There's an issue for it.), subgroups, or any number of things. | ||
| 663 | - | ||
| 664 | -We will need to be able to start with document-level objects to get page groups and also to start | ||
| 665 | | -with pages and reconstruct document level objects. For example, it should be possible to reconstruct | ||
| 666 | -article threads to omit beads that don't belong to any of the pages. Likewise with outlines. |