Avoid merging adjacent tokens when concatenating contents (fixes #444)
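
The idea behind the fix can be sketched outside of qpdf. The following is a minimal illustration only, not the code from this commit: `concatenate_streams` is a hypothetical helper that works on plain, already-decoded strings instead of qpdf pipelines. It follows the same rule as the patched loop, assuming a newline is written before each subsequent stream whenever the previous stream did not end with one.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of qpdf: joins already-decoded content
// streams, inserting a newline between them only when it is needed to
// keep the last token of one stream apart from the first token of the next.
std::string
concatenate_streams(std::vector<std::string> const& streams)
{
    std::string result;
    bool need_newline = false;
    for (auto const& s : streams)
    {
        if (need_newline)
        {
            result += '\n';
        }
        result += s;
        // Mirrors the LastChar pipeline in the diff: an empty stream
        // leaves last_char at 0, which is not '\n', so a separator is
        // still requested before the next stream.
        unsigned char last_char =
            s.empty() ? 0 : static_cast<unsigned char>(s.back());
        need_newline = (last_char != static_cast<unsigned char>('\n'));
    }
    return result;
}
```

For the two streams `0.5 w 1` and `0 0 RG`, plain concatenation would give `0.5 w 10 0 RG`, merging the tokens `1` and `0` into `10`. The sketch instead returns the two streams separated by a newline, and it adds nothing when a stream already ends with one.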

Jay Berkenbilt
1 parent 0dea2769
Showing 16 changed files with 541 additions and 43 deletions
ChangeLog
TODO
libqpdf/QPDFObjectHandle.cc
manual/qpdf-manual.xml
qpdf/qpdf.testcov
qpdf/qtest/qpdf.test
qpdf/qtest/qpdf/coalesce-out.pdf
qpdf/qtest/qpdf/coalesce-out.qdf
qpdf/qtest/qpdf/coalesce.pdf
qpdf/qtest/qpdf/coalesce.qdf
qpdf/qtest/qpdf/normalize-warnings.out
qpdf/qtest/qpdf/coalesce-split-1-2.pdf → qpdf/qtest/qpdf/split-tokens-split-1-2.pdf
qpdf/qtest/qpdf/coalesce-split.out → qpdf/qtest/qpdf/split-tokens-split.out
qpdf/qtest/qpdf/split-tokens.pdf
qpdf/qtest/qpdf/split-tokens.qdf
qpdf/qtest/qpdf/token-filters-out.pdf
 2020-10-23  Jay Berkenbilt  <ejb@ql.org>
  
+	* Bug fix: when concatenating content streams, insert a newline if
+	needed to prevent the last token from the old stream from being
+	merged with the first token of the new stream. Qpdf was mistakenly
+	concatenating the streams without regard to the specification that
+	content streams are to be broken on token boundaries. Fixes #444.
+
 	* Bug fix: fix-qdf: properly handle empty streams with ignore
 	newline.
  
@@ -4,7 +4,6 @@ Candidates for upcoming release
 * Open "next" issues
   * bugs
     * #473: zsh completion with directories
-    * #444: concatenated stream/whitespace bug
   * Non-bugs
     * #446: recognize edited QDF files
     * #436: parsing of document with form xobject
@@ -165,6 +165,47 @@ QPDFObjectHandle::ParserCallbacks::terminateParsing()
     throw TerminateParsing();
 }
  
+class LastChar: public Pipeline
+{
+  public:
+    LastChar(Pipeline* next);
+    virtual ~LastChar() = default;
+    virtual void write(unsigned char* data, size_t len);
+    virtual void finish();
+    unsigned char getLastChar();
+
+  private:
+    unsigned char last_char;
+};
+
+LastChar::LastChar(Pipeline* next) :
+    Pipeline("lastchar", next),
+    last_char(0)
+{
+}
+
+void
+LastChar::write(unsigned char* data, size_t len)
+{
+    if (len > 0)
+    {
+        this->last_char = data[len - 1];
+    }
+    getNext()->write(data, len);
+}
+
+void
+LastChar::finish()
+{
+    getNext()->finish();
+}
+
+unsigned char
+LastChar::getLastChar()
+{
+    return this->last_char;
+}
+
 QPDFObjectHandle::QPDFObjectHandle() :
     initialized(false),
     qpdf(0),
@@ -1600,21 +1641,31 @@ QPDFObjectHandle::pipeContentStreams(
     std::vector<QPDFObjectHandle> streams =
         arrayOrStreamToStreamArray(
             description, all_description);
+    bool need_newline = false;
     for (std::vector<QPDFObjectHandle>::iterator iter = streams.begin();
          iter != streams.end(); ++iter)
     {
+        if (need_newline)
+        {
+            p->write(QUtil::unsigned_char_pointer("\n"), 1);
+        }
+        LastChar lc(p);
         QPDFObjectHandle stream = *iter;
         std::string og =
             QUtil::int_to_string(stream.getObjectID()) + " " +
             QUtil::int_to_string(stream.getGeneration());
         std::string w_description = "content stream object " + og;
-        if (! stream.pipeStreamData(p, 0, qpdf_dl_specialized))
+        if (! stream.pipeStreamData(&lc, 0, qpdf_dl_specialized))
         {
             QTC::TC("qpdf", "QPDFObjectHandle errors in parsecontent");
             throw QPDFExc(qpdf_e_damaged_pdf, "content stream",
                           w_description, 0,
                           "errors while decoding content stream");
         }
+        lc.finish();
+        need_newline = (lc.getLastChar() != static_cast<unsigned char>('\n'));
+        QTC::TC("qpdf", "QPDFObjectHandle need_newline",
+                need_newline ? 0 : 1);
     }
 }
  
@@ -2090,14 +2090,9 @@ outfile.pdf</option>
         option causes qpdf to combine them into a single stream. Use
         of this option is never necessary for ordinary usage, but it
         can help when working with some files in some cases. For
-        example, some PDF writers split page contents into small
-        streams at arbitrary points that may fall in the middle of
-        lexical tokens within the content, and some PDF readers may
-        get confused on such files. If you use qpdf to coalesce the
-        content streams, such readers may be able to work with the
-        file more easily. This can also be combined with QDF mode or
-        content normalization to make it easier to look at all of a
-        page's contents at once.
+        example, combining this option with QDF mode or content
+        normalization makes it easier to look at all of a page's
+        contents at once.
        </para>
       </listitem>
      </varlistentry>
@@ -2398,25 +2393,15 @@ outfile.pdf</option>
     You should not use this for &ldquo;production&rdquo; PDF files.
    </para>
    <para>
-    This paragraph discusses edge cases of content normalization that
-    are not of concern to most users and are not relevant when content
-    normalization is not enabled. When normalizing content, if qpdf
-    runs into any lexical errors, it will print a warning indicating
-    that content may be damaged. The only situation in which qpdf is
-    known to cause damage during content normalization is when a
-    page's contents are split across multiple streams and streams are
-    split in the middle of a lexical token such as a string, name, or
-    inline image. There may be some pathological cases in which qpdf
-    could damage content without noticing this, such as if the partial
-    tokens at the end of one stream and the beginning of the next
-    stream are both valid, but usually qpdf will be able to detect
-    this case. For slightly increased safety, you can specify
-    <option>--coalesce-contents</option> in addition to
-    <option>--normalize-content</option> or <option>--qdf</option>.
-    This will cause qpdf to combine all the content streams into one,
-    thus recombining any split tokens. However doing this will prevent
-    you from being able to see the original layout of the content
-    streams. If you must inspect the original content streams in an
+    When normalizing content, if qpdf runs into any lexical errors, it
+    will print a warning indicating that content may be damaged. The
+    only situation in which qpdf is known to cause damage during
+    content normalization is when a page's contents are split across
+    multiple streams and streams are split in the middle of a lexical
+    token such as a string, name, or inline image. Note that files
+    that do this are invalid since the PDF specification states that
+    content streams are not to be split in the middle of a token. If
+    you want to inspect the original content streams in an
     uncompressed format, you can always run with <option>--qdf
     --normalize-content=n</option> for a QDF file without content
     normalization, or alternatively
@@ -455,3 +455,4 @@ qpdf found shared resources in leaf 0
 qpdf found shared xobject in leaf 0
 QPDF copy foreign with data 1
 QPDF copy foreign with foreign_stream 1
+QPDFObjectHandle need_newline 1
@@ -1591,15 +1591,23 @@ $td->runtest("type checks with object streams",
  
 # ----------
 $td->notify("--- Coalesce contents ---");
-$n_tests += 6;
+$n_tests += 8;
  
 $td->runtest("qdf with normalize warnings",
              {$td->COMMAND =>
-                  "qpdf --qdf --static-id coalesce.pdf a.pdf"},
+                  "qpdf --qdf --static-id split-tokens.pdf a.pdf"},
              {$td->FILE => "normalize-warnings.out", $td->EXIT_STATUS => 3},
              $td->NORMALIZE_NEWLINES);
 $td->runtest("check output",
              {$td->FILE => "a.pdf"},
+             {$td->FILE => "split-tokens.qdf"});
+$td->runtest("coalesce to qdf",
+             {$td->COMMAND =>
+                  "qpdf --qdf --static-id coalesce.pdf a.pdf"},
+             {$td->STRING => "", $td->EXIT_STATUS => 0},
+             $td->NORMALIZE_NEWLINES);
+$td->runtest("check output",
+             {$td->FILE => "a.pdf"},
              {$td->FILE => "coalesce.qdf"});
 $td->runtest("coalesce contents with qdf",
              {$td->COMMAND =>
@@ -1831,12 +1839,12 @@ $td->runtest("unreferenced resources with bad token",
              {$td->COMMAND =>
                   "qpdf --qdf --static-id --split-pages=2" .
                   " --remove-unreferenced-resources=yes" .
-                  " coalesce.pdf split-out-bad-token.pdf"},
-             {$td->FILE => "coalesce-split.out", $td->EXIT_STATUS => 3},
+                  " split-tokens.pdf split-out-bad-token.pdf"},
+             {$td->FILE => "split-tokens-split.out", $td->EXIT_STATUS => 3},
              $td->NORMALIZE_NEWLINES);
 $td->runtest("check output",
              {$td->FILE => "split-out-bad-token-1-2.pdf"},
-             {$td->FILE => "coalesce-split-1-2.pdf"});
+             {$td->FILE => "split-tokens-split-1-2.pdf"});
  
 $td->runtest("shared images in form xobject",
              {$td->COMMAND => "qpdf --qdf --static-id --split-pages".
-WARNING: coalesce.pdf (offset 671): content normalization encountered bad tokens
-WARNING: coalesce.pdf (offset 671): normalized content ended with a bad token; you may be able to resolve this by coalescing content streams in combination with normalizing content. From the command line, specify --coalesce-contents
-WARNING: coalesce.pdf (offset 671): Resulting stream data may be corrupted but is may still useful for manual inspection. For more information on this warning, search for content normalization in the manual.
-WARNING: coalesce.pdf (offset 823): content normalization encountered bad tokens
-WARNING: coalesce.pdf (offset 823): Resulting stream data may be corrupted but is may still useful for manual inspection. For more information on this warning, search for content normalization in the manual.
-WARNING: coalesce.pdf (offset 962): content normalization encountered bad tokens
-WARNING: coalesce.pdf (offset 962): normalized content ended with a bad token; you may be able to resolve this by coalescing content streams in combination with normalizing content. From the command line, specify --coalesce-contents
-WARNING: coalesce.pdf (offset 962): Resulting stream data may be corrupted but is may still useful for manual inspection. For more information on this warning, search for content normalization in the manual.
+WARNING: split-tokens.pdf (offset 671): content normalization encountered bad tokens
+WARNING: split-tokens.pdf (offset 671): normalized content ended with a bad token; you may be able to resolve this by coalescing content streams in combination with normalizing content. From the command line, specify --coalesce-contents
+WARNING: split-tokens.pdf (offset 671): Resulting stream data may be corrupted but is may still useful for manual inspection. For more information on this warning, search for content normalization in the manual.
+WARNING: split-tokens.pdf (offset 823): content normalization encountered bad tokens
+WARNING: split-tokens.pdf (offset 823): Resulting stream data may be corrupted but is may still useful for manual inspection. For more information on this warning, search for content normalization in the manual.
+WARNING: split-tokens.pdf (offset 962): content normalization encountered bad tokens
+WARNING: split-tokens.pdf (offset 962): normalized content ended with a bad token; you may be able to resolve this by coalescing content streams in combination with normalizing content. From the command line, specify --coalesce-contents
+WARNING: split-tokens.pdf (offset 962): Resulting stream data may be corrupted but is may still useful for manual inspection. For more information on this warning, search for content normalization in the manual.
 qpdf: operation succeeded with warnings; resulting file may have some problems
-WARNING: coalesce.pdf, object 3 0 at offset 181: Bad token found while scanning content stream; not attempting to remove unreferenced objects from this page
+WARNING: split-tokens.pdf, object 3 0 at offset 181: Bad token found while scanning content stream; not attempting to remove unreferenced objects from this page
 WARNING: empty PDF: content normalization encountered bad tokens
 WARNING: empty PDF: normalized content ended with a bad token; you may be able to resolve this by coalescing content streams in combination with normalizing content. From the command line, specify --coalesce-contents
 WARNING: empty PDF: Resulting stream data may be corrupted but is may still useful for manual inspection. For more information on this warning, search for content normalization in the manual.