Commit 0675a3f61a465f282eba8e1f54bdda3920257959

Authored by Jay Berkenbilt
1 parent cc889507

Decide not to allow stream data providers to modify dictionary

... ... @@ -29,11 +29,6 @@ Candidates for upcoming release
29 29 * big page even with --remove-unreferenced-resources=yes, even with --empty
30 30 * optimize image failure because of colorspace
31 31  
32   -* Make it possible for StreamDataProvider to modify the stream
33   - dictionary in addition to the stream data so it can calculate things
34   - about the dictionary at runtime. Will require a small change to
35   - QPDFWriter.
36   -
37 32 * Take flattenRotation code from pdf-split and do something with it,
38 33 maybe adding it to the library. Once there, call it from pdf-split
39 34 and bump up the required version of qpdf.
... ... @@ -558,3 +553,49 @@ I find it useful to make reference to them in this list
558 553 filtering and tokenizer rewrite and should be done in a manner that
559 554 takes advantage of the other lexical features. This sanitizer
560 555 should also clear metadata and replace images.
  556 +
  557 + * Here are some notes about having stream data providers modify
  558 + stream dictionaries. I had wanted to add this functionality to make
  559 + it more efficient to create stream data providers that may
  560 + dynamically decide what kind of filters to use and that may end up
  561 + modifying the dictionary conditionally depending on the original
  562 + stream data. Ultimately I decided not to implement this feature.
  563 + The following notes describe why.
  564 +
  565 + * When writing, the way objects are placed into the queue for
  566 + writing strongly precludes creation of any new indirect objects,
  567 + or even changing which indirect objects are referenced from which
  568 + other objects, because we sometimes write as we are traversing
  569 + and enqueuing objects. For non-linearized files, there is a risk
  570 + that an indirect object that used to be referenced would no
  571 + longer be referenced, and whether it was already written to the
  572 + output file would be based on an accident of where it was
  573 + encountered when traversing the object structure. For linearized
  574 + files, the situation is considerably worse. We decide which
  575 + section of the file to write an object to based on a mapping of
  576 + which objects are used by which other objects. Changing this
  577 + mapping could cause an object to appear in the wrong section, to
  578 + be written even though it is unreferenced, or to be entirely
  579 + omitted since, during linearization, we don't enqueue new objects
  580 + as we traverse for writing.
  581 +
  582 + * There are several places in QPDFWriter that query a stream's
  583 + dictionary in order to prepare for writing or to make decisions
  584 + about certain aspects of the writing process. If the stream data
  585 + provider has the chance to modify the dictionary, every piece of
  586 + code that gets stream data would have to be aware of this. This
  587 + would potentially include end user code. For example, any code
  588 + that called getDict() on a stream before installing a stream data
  589 + provider and expected that dictionary to be valid would
  590 + potentially be broken. As implemented right now, you must perform
  591 + any modifications on the dictionary in advance and provide
  592 + /Filter and /DecodeParms at the time you install the stream
  593 + data provider. This means that some computations would have to be
  594 + done more than once, but for linearized files, stream data
  595 + providers are already called more than once. If the work done by
  596 + a stream data provider is especially expensive, it can implement
  597 + its own cache.
  598 +
  599 + The implementation of pluggable stream filters includes an example
  600 + that illustrates how a program might handle making decisions about
  601 + filters and decode parameters based on the input data.
include/qpdf/QPDFObjectHandle.hh
... ... @@ -70,13 +70,28 @@ class QPDFObjectHandle
70 70 // QPDFWriter may, in some cases, add compression, but if it
71 71 // does, it will update the filters as needed. Every call to
72 72 // provideStreamData for a given stream must write the same
73   - // data. The object ID and generation passed to this method
74   - // are those that belong to the stream on behalf of which the
75   - // provider is called. They may be ignored or used by the
76   - // implementation for indexing or other purposes. This
77   - // information is made available just to make it more
78   - // convenient to use a single StreamDataProvider object to
79   - // provide data for multiple streams.
  73 + // data. Note that, when writing linearized files, qpdf will
  74 + // call your provideStreamData twice, and if the two calls
  75 + // produce different data, you risk generating an invalid file
  76 + // or having qpdf throw an exception. The object ID and
  77 + // generation passed to this method are those that belong to
  78 + // the stream on behalf of which the provider is called. They
  79 + // may be ignored or used by the implementation for indexing
  80 + // or other purposes. This information is made available just
  81 + // to make it more convenient to use a single
  82 + // StreamDataProvider object to provide data for multiple
  83 + // streams.
  84 +
  85 + // A few things to keep in mind:
  86 + //
  87 + // * Stream data providers must not modify any objects since
  88 + // they may be called after some parts of the file have
  89 + // already been written.
  90 + //
  91 + // * Since qpdf may call provideStreamData multiple times when
  92 + // writing linearized files, if the work done by your stream
  93 + // data provider is slow or computationally intensive, you
  94 + // might want to implement your own cache.
80 95  
81 96 // Prior to qpdf 10.0.0, it was not possible to handle errors
82 97 // the way pipeStreamData does or to pass back success.
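
Both the TODO notes and the header comment above suggest that an expensive stream data provider can implement its own cache, since qpdf may ask for the same stream's data more than once when writing linearized files. The following is a minimal sketch of that pattern only. CachingProvider, Generator, provide, and cachedStreams are hypothetical names chosen for illustration, and the class deliberately does not derive from or reproduce qpdf's StreamDataProvider interface. It assumes the expensive work can be expressed as a function of the object ID and generation.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a stream data provider. This is NOT qpdf's
// StreamDataProvider interface; it only illustrates the caching pattern.
class CachingProvider
{
  public:
    // Assumed shape of the expensive computation: it produces the stream
    // bytes for the stream identified by (objid, generation).
    using Generator = std::function<std::string(int objid, int generation)>;

    explicit CachingProvider(Generator gen) :
        generate(std::move(gen))
    {
    }

    // Returns the stream data for (objid, generation). The generator runs
    // at most once per stream, so repeated calls return identical bytes.
    std::string const&
    provide(int objid, int generation)
    {
        auto key = std::make_pair(objid, generation);
        auto it = cache.find(key);
        if (it == cache.end()) {
            it = cache.emplace(key, generate(objid, generation)).first;
        }
        return it->second;
    }

    // Number of distinct streams whose data is currently cached.
    std::size_t
    cachedStreams() const
    {
        return cache.size();
    }

  private:
    Generator generate;
    std::map<std::pair<int, int>, std::string> cache;
};
```

Keying the cache on the object ID and generation matches the way a single provider object can serve several streams. Because the cached bytes are returned unchanged on later calls, the requirement that every call for a given stream writes the same data holds even if the underlying computation is not deterministic. A real provider would write the cached bytes to the pipeline it is given rather than return them.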