Commit 0675a3f61a465f282eba8e1f54bdda3920257959
1 parent cc889507
Decide not to allow stream data providers to modify dictionary
Showing 2 changed files with 68 additions and 12 deletions
TODO
| @@ -29,11 +29,6 @@ Candidates for upcoming release | @@ -29,11 +29,6 @@ Candidates for upcoming release | ||
| 29 | * big page even with --remove-unreferenced-resources=yes, even with --empty | 29 | * big page even with --remove-unreferenced-resources=yes, even with --empty |
| 30 | * optimize image failure because of colorspace | 30 | * optimize image failure because of colorspace |
| 31 | 31 | ||
| 32 | -* Make it possible for StreamDataProvider to modify the stream | ||
| 33 | - dictionary in addition to the stream data so it can calculate things | ||
| 34 | - about the dictionary at runtime. Will require a small change to | ||
| 35 | - QPDFWriter. | ||
| 36 | - | ||
| 37 | * Take flattenRotation code from pdf-split and do something with it, | 32 | * Take flattenRotation code from pdf-split and do something with it, |
| 38 | maybe adding it to the library. Once there, call it from pdf-split | 33 | maybe adding it to the library. Once there, call it from pdf-split |
| 39 | and bump up the required version of qpdf. | 34 | and bump up the required version of qpdf. |
| @@ -558,3 +553,49 @@ I find it useful to make reference to them in this list | @@ -558,3 +553,49 @@ I find it useful to make reference to them in this list | ||
| 558 | filtering and tokenizer rewrite and should be done in a manner that | 553 | filtering and tokenizer rewrite and should be done in a manner that |
| 559 | takes advantage of the other lexical features. This sanitizer | 554 | takes advantage of the other lexical features. This sanitizer |
| 560 | should also clear metadata and replace images. | 555 | should also clear metadata and replace images. |
| 556 | + | ||
| 557 | + * Here are some notes about having stream data providers modify | ||
| 558 | + stream dictionaries. I had wanted to add this functionality to make | ||
| 559 | + it more efficient to create stream data providers that may | ||
| 560 | + dynamically decide what kind of filters to use and that may end up | ||
| 561 | + modifying the dictionary conditionally depending on the original | ||
| 562 | + stream data. Ultimately I decided not to implement this feature. | ||
| 563 | + This paragraph describes why. | ||
| 564 | + | ||
| 565 | + * When writing, the way objects are placed into the queue for | ||
| 566 | + writing strongly precludes creation of any new indirect objects, | ||
| 567 | + or even changing which indirect objects are referenced from which | ||
| 568 | + other objects, because we sometimes write as we are traversing | ||
| 569 | + and enqueuing objects. For non-linearized files, there is a risk | ||
| 570 | + that an indirect object that used to be referenced would no | ||
| 571 | + longer be referenced, and whether it was already written to the | ||
| 572 | + output file would be based on an accident of where it was | ||
| 573 | + encountered when traversing the object structure. For linearized | ||
| 574 | + files, the situation is considerably worse. We decide which | ||
| 575 | + section of the file to write an object to based on a mapping of | ||
| 576 | + which objects are used by which other objects. Changing this | ||
| 577 | + mapping could cause an object to appear in the wrong section, to | ||
| 578 | + be written even though it is unreferenced, or to be entirely | ||
| 579 | + omitted since, during linearization, we don't enqueue new objects | ||
| 580 | + as we traverse for writing. | ||
| 581 | + | ||
| 582 | + * There are several places in QPDFWriter that query a stream's | ||
| 583 | + dictionary in order to prepare for writing or to make decisions | ||
| 584 | + about certain aspects of the writing process. If the stream data | ||
| 585 | + provider has the chance to modify the dictionary, every piece of | ||
| 586 | + code that gets stream data would have to be aware of this. This | ||
| 587 | + would potentially include end user code. For example, any code | ||
| 588 | + that called getDict() on a stream before installing a stream data | ||
| 589 | + provider and expected that dictionary to be valid would | ||
| 590 | + potentially be broken. As implemented right now, you must perform | ||
| 591 | + any modifications on the dictionary in advance and provide | ||
| 592 | + /Filter and /DecodeParms at the time you install the stream | ||
| 593 | + data provider. This means that some computations would have to be | ||
| 594 | + done more than once, but for linearized files, stream data | ||
| 595 | + providers are already called more than once. If the work done by | ||
| 596 | + a stream data provider is especially expensive, it can implement | ||
| 597 | + its own cache. | ||
| 598 | + | ||
| 599 | + The implementation of pluggable stream filters includes an example | ||
| 600 | + that illustrates how a program might handle making decisions about | ||
| 601 | + filters and decode parameters based on the input data. |
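The notes above say that any decision about /Filter has to be made before the stream data provider is installed. The following is a minimal, self-contained sketch of that idea and is not code from this commit or from qpdf: `StreamPlan`, `run_length_encode`, and `plan_stream` are hypothetical names introduced for illustration. It inspects the input once, chooses between raw data and the PDF RunLengthDecode encoding, and returns both the filter name and the exact bytes a provider would later replay unchanged.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result of the up-front decision. Not part of qpdf.
struct StreamPlan
{
    std::string filter;  // empty means "no /Filter entry"
    std::string encoded; // bytes a provider would write on every call
};

// Encode using the PDF RunLengthDecode scheme: a length byte L in
// 0..127 is followed by L+1 literal bytes; L in 129..255 is followed
// by one byte that is repeated 257-L times; 128 marks end of data.
std::string run_length_encode(std::string const& in)
{
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        // Measure the run of identical bytes starting at i (at most 128).
        std::size_t run = 1;
        while (i + run < in.size() && run < 128 && in[i + run] == in[i]) {
            ++run;
        }
        if (run >= 2) {
            out.push_back(static_cast<char>(static_cast<unsigned char>(257 - run)));
            out.push_back(in[i]);
            i += run;
        } else {
            // Collect literal bytes until a repeated pair starts (at most 128).
            std::size_t start = i;
            ++i;
            while (i < in.size() && (i - start) < 128 &&
                   !(i + 1 < in.size() && in[i + 1] == in[i])) {
                ++i;
            }
            out.push_back(static_cast<char>(static_cast<unsigned char>(i - start - 1)));
            out.append(in, start, i - start);
        }
    }
    out.push_back(static_cast<char>(static_cast<unsigned char>(128)));
    return out;
}

// Decide once, based on the input data, whether encoding is worthwhile.
StreamPlan plan_stream(std::string const& raw)
{
    std::string encoded = run_length_encode(raw);
    if (encoded.size() < raw.size()) {
        return StreamPlan{"/RunLengthDecode", encoded};
    }
    return StreamPlan{"", raw};
}
```

In a real program, the caller would presumably use the returned filter name when installing the provider and have the provider write `encoded` verbatim, so the dictionary never needs to change during writing.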
include/qpdf/QPDFObjectHandle.hh
| @@ -70,13 +70,28 @@ class QPDFObjectHandle | @@ -70,13 +70,28 @@ class QPDFObjectHandle | ||
| 70 | // QPDFWriter may, in some cases, add compression, but if it | 70 | // QPDFWriter may, in some cases, add compression, but if it |
| 71 | // does, it will update the filters as needed. Every call to | 71 | // does, it will update the filters as needed. Every call to |
| 72 | // provideStreamData for a given stream must write the same | 72 | // provideStreamData for a given stream must write the same |
| 73 | - // data. The object ID and generation passed to this method | ||
| 74 | - // are those that belong to the stream on behalf of which the | ||
| 75 | - // provider is called. They may be ignored or used by the | ||
| 76 | - // implementation for indexing or other purposes. This | ||
| 77 | - // information is made available just to make it more | ||
| 78 | - // convenient to use a single StreamDataProvider object to | ||
| 79 | - // provide data for multiple streams. | 73 | + // data. Note that, when writing linearized files, qpdf will |
| 74 | + // call your provideStreamData twice, and if it generates | ||
| 75 | + // different output, you risk generating invalid output or | ||
| 76 | + // having qpdf throw an exception. The object ID and | ||
| 77 | + // generation passed to this method are those that belong to | ||
| 78 | + // the stream on behalf of which the provider is called. They | ||
| 79 | + // may be ignored or used by the implementation for indexing | ||
| 80 | + // or other purposes. This information is made available just | ||
| 81 | + // to make it more convenient to use a single | ||
| 82 | + // StreamDataProvider object to provide data for multiple | ||
| 83 | + // streams. | ||
| 84 | + | ||
| 85 | + // A few things to keep in mind: | ||
| 86 | + // | ||
| 87 | + // * Stream data providers must not modify any objects since | ||
| 88 | + // they may be called after some parts of the file have | ||
| 89 | + // already been written. | ||
| 90 | + // | ||
| 91 | + // * Since qpdf may call provideStreamData multiple times when | ||
| 92 | + // writing linearized files, if the work done by your stream | ||
| 93 | + // data provider is slow or computationally intensive, you | ||
| 94 | + // might want to implement your own cache. | ||
| 80 | 95 | ||
| 81 | // Prior to qpdf 10.0.0, it was not possible to handle errors | 96 | // Prior to qpdf 10.0.0, it was not possible to handle errors |
| 82 | // the way pipeStreamData does or to pass back success. | 97 | // the way pipeStreamData does or to pass back success. |
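The new header comment suggests that a slow provider implement its own cache, since provideStreamData may be called more than once for the same stream and must write the same data each time. Below is a minimal sketch of such a cache, written against hypothetical stand-in types rather than the real qpdf classes: `Sink` takes the place of an output pipeline, and `CachingProvider` and its `provide` method are illustrative names, not qpdf API. The expensive computation runs once per (object ID, generation) pair and later calls replay the stored bytes.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for an output pipeline; not qpdf's Pipeline class.
struct Sink
{
    std::string received;
    void write(std::string const& bytes) { received += bytes; }
};

// Hypothetical provider that memoizes its output per stream.
class CachingProvider
{
  public:
    using Generator = std::function<std::string(int objid, int generation)>;

    explicit CachingProvider(Generator generate) :
        generate_(std::move(generate))
    {
    }

    // Write the stream's data to the sink, computing it only on first use.
    void provide(int objid, int generation, Sink& sink)
    {
        auto key = std::make_pair(objid, generation);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            it = cache_.emplace(key, generate_(objid, generation)).first;
        }
        // Every call for the same stream writes identical bytes.
        sink.write(it->second);
    }

  private:
    Generator generate_;
    std::map<std::pair<int, int>, std::string> cache_;
};
```

Keying the cache on the object ID and generation matches the comment's point that a single provider object may serve multiple streams. A provider for very large streams might prefer to cache to a temporary file instead of memory; that trade-off is outside this sketch.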