Decide not to allow stream data providers to modify dictionary

Jay Berkenbilt
1 parent cc889507
Showing 2 changed files with 68 additions and 12 deletions
TODO
include/qpdf/QPDFObjectHandle.hh
@@ -29,11 +29,6 @@ Candidates for upcoming release
   * big page even with --remove-unreferenced-resources=yes, even with --empty
   * optimize image failure because of colorspace
  
-* Make it possible for StreamDataProvider to modify the stream
-  dictionary in addition to the stream data so it can calculate things
-  about the dictionary at runtime. Will require a small change to
-  QPDFWriter.
-
 * Take flattenRotation code from pdf-split and do something with it,
   maybe adding it to the library. Once there, call it from pdf-split
   and bump up the required version of qpdf.
@@ -558,3 +553,49 @@ I find it useful to make reference to them in this list
    filtering and tokenizer rewrite and should be done in a manner that
    takes advantage of the other lexical features. This sanitizer
    should also clear metadata and replace images.
+
+ * Here are some notes about having stream data providers modify
+   stream dictionaries. I had wanted to add this functionality to make
+   it more efficient to create stream data providers that may
+   dynamically decide what kind of filters to use and that may end up
+   modifying the dictionary conditionally depending on the original
+   stream data. Ultimately I decided not to implement this feature.
+   The two points below describe why.
+
+   * When writing, the way objects are placed into the queue for
+     writing strongly precludes creation of any new indirect objects,
+     or even changing which indirect objects are referenced from which
+     other objects, because we sometimes write as we are traversing
+     and enqueuing objects. For non-linearized files, there is a risk
+     that an indirect object that used to be referenced would no
+     longer be referenced, and whether it was already written to the
+     output file would be based on an accident of where it was
+     encountered when traversing the object structure. For linearized
+     files, the situation is considerably worse. We decide which
+     section of the file to write an object to based on a mapping of
+     which objects are used by which other objects. Changing this
+     mapping could cause an object to appear in the wrong section, to
+     be written even though it is unreferenced, or to be entirely
+     omitted since, during linearization, we don't enqueue new objects
+     as we traverse for writing.
+
+   * There are several places in QPDFWriter that query a stream's
+     dictionary in order to prepare for writing or to make decisions
+     about certain aspects of the writing process. If the stream data
+     provider has the chance to modify the dictionary, every piece of
+     code that gets stream data would have to be aware of this. This
+     would potentially include end user code. For example, any code
+     that called getDict() on a stream before installing a stream data
+     provider and expected that dictionary to be valid would
+     potentially be broken. As implemented right now, you must perform
+     any modifications on the dictionary in advance and provide
+     /Filter and /DecodeParms at the time you install the stream
+     data provider. This means that some computations would have to be
+     done more than once, but for linearized files, stream data
+     providers are already called more than once. If the work done by
+     a stream data provider is especially expensive, it can implement
+     its own cache.
+
+   The implementation of pluggable stream filters includes an example
+   that illustrates how a program might handle making decisions about
+   filters and decode parameters based on the input data.
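The note above says that dictionary changes, including /Filter and /DecodeParms, have to be settled before the stream data provider is installed. A minimal sketch of that "decide first, provide later" pattern follows. It does not use qpdf's API: `StreamPlan`, `planStream`, and the caller-supplied `encode` function are hypothetical names chosen for illustration only.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical record of the decisions made before a provider is
// installed. Not part of qpdf.
struct StreamPlan
{
    std::string filter; // e.g. "/FlateDecode"; empty means no filter
    std::string data;   // exact bytes the provider will write every time
};

// Decide once, up front, whether encoding is worthwhile. `encode` is a
// caller-supplied encoder (an assumption of this sketch). The encoded
// form is kept only if it is strictly smaller than the raw data.
StreamPlan
planStream(
    std::string const& raw,
    std::string const& filter_name,
    std::function<std::string(std::string const&)> const& encode)
{
    std::string encoded = encode(raw);
    if (encoded.size() < raw.size()) {
        return StreamPlan{filter_name, std::move(encoded)};
    }
    return StreamPlan{"", raw};
}
```

With a plan like this in hand, a program would set the dictionary entries from `filter` and then install a provider that writes `data` unchanged on every call, so nothing about the dictionary needs to change while the file is being written.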
@@ -70,13 +70,28 @@ class QPDFObjectHandle
 	// QPDFWriter may, in some cases, add compression, but if it
 	// does, it will update the filters as needed. Every call to
 	// provideStreamData for a given stream must write the same
-	// data. The object ID and generation passed to this method
-	// are those that belong to the stream on behalf of which the
-	// provider is called. They may be ignored or used by the
-	// implementation for indexing or other purposes. This
-	// information is made available just to make it more
-	// convenient to use a single StreamDataProvider object to
-	// provide data for multiple streams.
+	// data. Note that, when writing linearized files, qpdf will
+	// call your provideStreamData twice, and if it generates
+	// different output, you risk generating invalid output or
+	// having qpdf throw an exception. The object ID and
+	// generation passed to this method are those that belong to
+	// the stream on behalf of which the provider is called. They
+	// may be ignored or used by the implementation for indexing
+	// or other purposes. This information is made available just
+	// to make it more convenient to use a single
+	// StreamDataProvider object to provide data for multiple
+	// streams.
+
+        // A few things to keep in mind:
+        //
+        // * Stream data providers must not modify any objects since
+        //   they may be called after some parts of the file have
+        //   already been written.
+        //
+        // * Since qpdf may call provideStreamData multiple times when
+        //   writing linearized files, if the work done by your stream
+        //   data provider is slow or computationally intensive, you
+        //   might want to implement your own cache.
  
         // Prior to qpdf 10.0.0, it was not possible to handle errors
         // the way pipeStreamData does or to pass back success.
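The new header comments suggest that an expensive provider keep its own cache, since it may be called more than once for the same stream and must write the same data each time. The sketch below shows one way such a cache might look, keyed by object ID and generation. It is a self-contained illustration, not qpdf's `StreamDataProvider` interface: `CachingProvider`, `provide`, and `generatorCalls` are hypothetical names.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a stream data provider with a cache. It
// does not derive from or imitate qpdf's actual provider class.
class CachingProvider
{
  public:
    using Generator = std::function<std::string(int objid, int generation)>;

    explicit CachingProvider(Generator g) :
        generate(std::move(g))
    {
    }

    // Return the bytes for the stream identified by (objid, generation).
    // The expensive generator runs at most once per stream; later calls
    // return the cached bytes, so every call yields identical data.
    std::string const&
    provide(int objid, int generation)
    {
        auto key = std::make_pair(objid, generation);
        auto iter = cache.find(key);
        if (iter == cache.end()) {
            ++generator_calls;
            iter = cache.emplace(key, generate(objid, generation)).first;
        }
        return iter->second;
    }

    // Number of times the underlying generator has actually run.
    std::size_t
    generatorCalls() const
    {
        return generator_calls;
    }

  private:
    Generator generate;
    std::map<std::pair<int, int>, std::string> cache;
    std::size_t generator_calls = 0;
};
```

Keying the cache on the object ID and generation matches the comment's point that one provider object may serve several streams. A real provider would write the cached bytes to whatever output it is handed instead of returning a string.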