Additional checks for unreferenced resources

Explicitly abandon removal of unreferenced resources if there are any lexical errors in the page's contents. This case always generated a warning, but it now also prevents removal of unreferenced resources, thus strongly decreasing the likelihood of data loss.
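
The guard described above can be illustrated with a minimal, self-contained sketch. It does not use the qpdf API: `TokenType`, `Token`, `NameScan`, `scanTokens`, and `pruneUnreferenced` are hypothetical stand-ins that only mirror the idea of recording a bad token during the name scan and then skipping removal entirely.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for tokenizer output; these are NOT qpdf types.
enum class TokenType { Name, Word, Bad };

struct Token
{
    TokenType type;
    std::string value;
};

// Result of scanning a content stream: the names that were referenced,
// plus a flag recording whether any token failed to lex.
struct NameScan
{
    std::set<std::string> names;
    bool saw_bad = false;
};

NameScan
scanTokens(std::vector<Token> const& tokens)
{
    NameScan result;
    for (auto const& t : tokens) {
        if (t.type == TokenType::Name) {
            result.names.insert(t.value);
        } else if (t.type == TokenType::Bad) {
            // A lexical error means the name list may be incomplete.
            result.saw_bad = true;
        }
    }
    return result;
}

// Erase every resource key that the scan did not see, unless the scan hit
// a bad token, in which case nothing is touched. Returns true if pruning
// was attempted.
bool
pruneUnreferenced(std::set<std::string>& resources, NameScan const& scan)
{
    if (scan.saw_bad) {
        return false;
    }
    for (auto it = resources.begin(); it != resources.end();) {
        if (scan.names.count(*it) == 0) {
            it = resources.erase(it);
        } else {
            ++it;
        }
    }
    return true;
}
```

The reasoning behind the early return is that a bad token makes the collected name set untrustworthy, so treating "not seen" as "not referenced" could discard a resource the page still needs.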

Jay Berkenbilt
1 parent e09ae710
Showing 5 changed files with 272 additions and 3 deletions
libqpdf/QPDFPageObjectHelper.cc
qpdf/qpdf.testcov
qpdf/qtest/qpdf.test
qpdf/qtest/qpdf/coalesce-split-1-2.pdf
qpdf/qtest/qpdf/coalesce-split.out
@@ -99,11 +99,16 @@ QPDFPageObjectHelper::addContentTokenFilter(
 class NameWatcher: public QPDFObjectHandle::TokenFilter
 {
   public:
+    NameWatcher() :
+        saw_bad(false)
+    {
+    }
     virtual ~NameWatcher()
     {
     }
     virtual void handleToken(QPDFTokenizer::Token const&);
     std::set<std::string> names;
+    bool saw_bad;
 };
 void
@@ -116,6 +121,10 @@ NameWatcher::handleToken(QPDFTokenizer::Token const& token)
         this->names.insert(
             QPDFObjectHandle::newName(token.getValue()).getName());
     }
+    else if (token.getType() == QPDFTokenizer::tt_bad)
+    {
+        saw_bad = true;
+    }
     writeToken(token);
 }
@@ -134,6 +143,14 @@ QPDFPageObjectHelper::removeUnreferencedResources()
             "; not attempting to remove unreferenced objects from this page");
         return;
     }
+    if (nw.saw_bad)
+    {
+        QTC::TC("qpdf", "QPDFPageObjectHelper bad token finding names");
+        this->oh.warnIfPossible(
+            "Bad token found while scanning content stream; "
+            "not attempting to remove unreferenced objects from this page");
+        return;
+    }
     // Walk through /Font and /XObject dictionaries, removing any
     // resources that are not referenced. We must make copies of
     // resource dictionaries down into the dictionaries are mutating
@@ -412,3 +412,4 @@ QPDF copy foreign stream with provider 0
 QPDF copy foreign stream with buffer 0
 QPDF immediate copy stream data 0
 qpdf copy same page more than once 1
+QPDFPageObjectHelper bad token finding names 0
@@ -1384,7 +1384,7 @@ my @sp_cases = (
     [11, 'pdf extension', '', 'split-out.Pdf'],
     [4, 'fallback', '--pages 11-pages.pdf 1-3 minimal.pdf --', 'split-out'],
     );
-$n_tests += 21;
+$n_tests += 23;
 for (@sp_cases)
 {
     $n_tests += 1 + $_->[0];
@@ -1482,10 +1482,20 @@ $td->runtest("split shared font, xobject",
 foreach my $i (qw(1 2 3 4))
 {
     $td->runtest("check output ($i)",
-                 {$td->FILE => "shared-font-xobject-split-$i.pdf"},
-                 {$td->FILE => "split-out-shared-font-xobject-$i.pdf"});
+                 {$td->FILE => "split-out-shared-font-xobject-$i.pdf"},
+                 {$td->FILE => "shared-font-xobject-split-$i.pdf"});
 }
+$td->runtest("unreferenced resources with bad token",
+             {$td->COMMAND =>
+                  "qpdf --qdf --static-id --split-pages=2" .
+                  " coalesce.pdf split-out-bad-token.pdf"},
+             {$td->FILE => "coalesce-split.out", $td->EXIT_STATUS => 3},
+             $td->NORMALIZE_NEWLINES);
+$td->runtest("check output",
+             {$td->FILE => "split-out-bad-token-1-2.pdf"},
+             {$td->FILE => "coalesce-split-1-2.pdf"});
+
 show_ntests();
 # ----------
 $td->notify("--- Keep Files Open ---");
+WARNING: coalesce.pdf, object 3 0 at offset 181: Bad token found while scanning content stream; not attempting to remove unreferenced objects from this page
+WARNING: empty PDF: content normalization encountered bad tokens
+WARNING: empty PDF: normalized content ended with a bad token; you may be able to resolve this by coalescing content streams in combination with normalizing content. From the command line, specify --coalesce-contents
+WARNING: empty PDF: Resulting stream data may be corrupted but is may still useful for manual inspection. For more information on this warning, search for content normalization in the manual.
+WARNING: empty PDF: content normalization encountered bad tokens
+WARNING: empty PDF: Resulting stream data may be corrupted but is may still useful for manual inspection. For more information on this warning, search for content normalization in the manual.
+WARNING: empty PDF: content normalization encountered bad tokens
+WARNING: empty PDF: normalized content ended with a bad token; you may be able to resolve this by coalescing content streams in combination with normalizing content. From the command line, specify --coalesce-contents
+WARNING: empty PDF: Resulting stream data may be corrupted but is may still useful for manual inspection. For more information on this warning, search for content normalization in the manual.
+qpdf: operation succeeded with warnings; resulting file may have some problems