Commit e17585c2d2df9fea296364c0768c2ce5adbc4b91

Authored by Jay Berkenbilt
1 parent a15ec696

Remove unreferenced: ignore names that are not Fonts or XObjects

Converted ResourceFinder to ParserCallbacks so we can better detect
the name that precedes various operators and use the operators to sort
the names into resource types. This enables us to be smarter about
detecting unreferenced resources in pages and also sets the stage for
reconciling differences in /DR across documents.
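
As a rough illustration of the approach described above, the following self-contained sketch classifies names by the operator that follows them. It does not use qpdf itself: `Item`, `NamesByType`, and `classify` are hypothetical stand-ins for the parser callback, and treating any value that does not start with `/` as a possible operator is a simplifying assumption. The operator table is copied from this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one parsed content-stream object.
struct Item
{
    std::string value; // e.g. "/F1", "12", or "Tf"
    size_t offset;     // offset of the object within the stream
};

// resource type -> name -> offsets at which the name appeared
using NamesByType =
    std::map<std::string, std::map<std::string, std::set<size_t>>>;

NamesByType
classify(std::vector<Item> const& items)
{
    // Same operator table as ResourceFinder::handleObject in this commit.
    static std::map<std::string, std::string> const op_to_rtype = {
        {"CS", "/ColorSpace"},
        {"cs", "/ColorSpace"},
        {"gs", "/ExtGState"},
        {"Tf", "/Font"},
        {"SCN", "/Pattern"},
        {"scn", "/Pattern"},
        {"BDC", "/Properties"},
        {"DP", "/Properties"},
        {"sh", "/Shading"},
        {"Do", "/XObject"},
    };
    NamesByType result;
    std::string last_name;
    size_t last_name_offset = 0;
    for (auto const& item: items)
    {
        // Assumption: anything starting with '/' is a name.
        if ((! item.value.empty()) && (item.value[0] == '/'))
        {
            last_name = item.value;
            last_name_offset = item.offset;
        }
        else if (! last_name.empty())
        {
            // Numbers and unlisted operators simply miss the table.
            auto iter = op_to_rtype.find(item.value);
            if (iter != op_to_rtype.end())
            {
                result[iter->second][last_name].insert(last_name_offset);
            }
        }
    }
    return result;
}
```

With a stream such as `/Artifact BMC /F1 12 Tf /Im1 Do`, this sketch records `/F1` under `/Font` and `/Im1` under `/XObject`; `/Artifact` is not recorded because `BMC` is not in the table.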
ChangeLog
 2021-03-01 Jay Berkenbilt <ejb@ql.org>
 
+ * Improve code that finds unreferenced resources to ignore names
+ in the content stream that are not fonts or XObjects. This should
+ reduce the number of cases when qpdf needlessly decides not to
+ remove unreferenced resources. Hopefully it doesn't create any new
+ bugs where it removes unreferenced resources that it isn't
+ supposed to.
+
  * QPDFObjectHandle::ParserCallbacks: add virtual handleWarning
  method, and provide default (empty) implementation of it and
  handleEOF().
@@ -34,15 +34,6 @@ Document-level work
   --copy-attachments-from to preserve these. What will the strategy be
   for deduplicating in the automatic case?
 
-* When I get to tagged PDF, note that the presence of /Artifact and
-  /Standard (and maybe others?) causes a false positive on detection
-  of unresolved names. Example: form-fields-and-annotations.pdf. This
-  used to give a warning (never in a released version), but the
-  warning was removed. See comments about tagged pdf in
-  QPDFPageObjectHelper::removeUnreferencedResourcesHelper. Another
-  potential solution is to recognize names that refer to fonts and
-  xobjects but only looking at names used with Tf and Do operators.
-
 Fuzz Errors
 ===========
 
libqpdf/QPDFPageObjectHelper.cc
@@ -684,7 +684,7 @@ QPDFPageObjectHelper::removeUnreferencedResourcesHelper(
     ResourceFinder rf;
     try
     {
-        ph.filterContents(&rf);
+        ph.parseContents(&rf);
     }
     catch (std::exception& e)
     {
@@ -711,9 +711,9 @@ QPDFPageObjectHelper::removeUnreferencedResourcesHelper(
     QPDFObjectHandle resources = ph.getAttribute("/Resources", true);
     std::vector<QPDFObjectHandle> rdicts;
     std::set<std::string> known_names;
+    std::vector<std::string> to_filter = {"/Font", "/XObject"};
     if (resources.isDictionary())
     {
-        std::vector<std::string> to_filter = {"/Font", "/XObject"};
         for (auto const& iter: to_filter)
         {
             QPDFObjectHandle dict = resources.getKey(iter);
@@ -729,12 +729,17 @@ QPDFPageObjectHelper::removeUnreferencedResourcesHelper(
     }
 
     std::set<std::string> local_unresolved;
-    for (auto const& name: rf.getNames())
+    auto names_by_rtype = rf.getNamesByResourceType();
+    for (auto const& i1: to_filter)
     {
-        if (! known_names.count(name))
+        for (auto const& n_iter: names_by_rtype[i1])
         {
-            unresolved.insert(name);
-            local_unresolved.insert(name);
+            std::string const& name = n_iter.first;
+            if (! known_names.count(name))
+            {
+                unresolved.insert(name);
+                local_unresolved.insert(name);
+            }
         }
     }
     // Older versions of the PDF spec allowed form XObjects to omit
@@ -754,11 +759,17 @@ QPDFPageObjectHelper::removeUnreferencedResourcesHelper(
 
     if ((! local_unresolved.empty()) && resources.isDictionary())
     {
-        // Don't issue a warning for this case. There are some cases
-        // of names that aren't XObject references, for example,
-        // /Artifact in tagged PDF. Until we are certain that we know
-        // the meaning of every name in a content stream, we don't
-        // want to give warnings because they will be false positives.
+        // It's not worth issuing a warning for this case. From qpdf
+        // 10.3, we are hopefully only looking at names that are
+        // referencing fonts and XObjects, but until we're certain
+        // that we know the meaning of every name in a content stream,
+        // we don't want to give warnings that might be false
+        // positives. Also, this can happen in legitimate cases with
+        // older PDFs, and there's nothing to be done about it, so
+        // there's no good reason to issue a warning. The only sad
+        // thing is that it was a false positive that alerted me to a
+        // logic error in the code, and any future such errors would
+        // now be hidden.
         QTC::TC("qpdf", "QPDFPageObjectHelper unresolved names");
         return false;
     }
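
The unresolved-name check in the hunks above can be sketched on its own. This is a minimal illustration under stated assumptions, not the qpdf implementation: `findUnresolved` and `NamesByType` are hypothetical names, and the sketch uses `find` on a const reference where the commit indexes a copy of the map with `operator[]`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// resource type -> name -> offsets, mirroring the shape used in this commit
using NamesByType =
    std::map<std::string, std::map<std::string, std::set<size_t>>>;

// Hypothetical helper: return the names used as fonts or XObjects in
// the content stream that do not appear in known_names.
std::set<std::string>
findUnresolved(NamesByType const& names_by_rtype,
               std::set<std::string> const& known_names)
{
    static std::vector<std::string> const to_filter = {"/Font", "/XObject"};
    std::set<std::string> unresolved;
    for (auto const& rtype: to_filter)
    {
        auto found = names_by_rtype.find(rtype);
        if (found == names_by_rtype.end())
        {
            continue;
        }
        for (auto const& n_iter: found->second)
        {
            std::string const& name = n_iter.first;
            if (! known_names.count(name))
            {
                unresolved.insert(name);
            }
        }
    }
    return unresolved;
}
```

Names recorded under other resource types, such as `/Properties`, are never examined, which is what keeps tagged-PDF names like those mentioned in the removed TODO entry from being reported as unresolved.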
libqpdf/ResourceFinder.cc
 #include <qpdf/ResourceFinder.hh>
 
 ResourceFinder::ResourceFinder() :
+    last_name_offset(0),
     saw_bad(false)
 {
 }
 
 void
-ResourceFinder::handleToken(QPDFTokenizer::Token const& token)
+ResourceFinder::handleObject(QPDFObjectHandle obj, size_t offset, size_t)
 {
-    if ((token.getType() == QPDFTokenizer::tt_word) &&
-        (! this->last_name.empty()))
+    if (obj.isOperator() && (! this->last_name.empty()))
     {
-        this->names.insert(this->last_name);
+        static std::map<std::string, std::string> op_to_rtype = {
+            {"CS", "/ColorSpace"},
+            {"cs", "/ColorSpace"},
+            {"gs", "/ExtGState"},
+            {"Tf", "/Font"},
+            {"SCN", "/Pattern"},
+            {"scn", "/Pattern"},
+            {"BDC", "/Properties"},
+            {"DP", "/Properties"},
+            {"sh", "/Shading"},
+            {"Do", "/XObject"},
+        };
+        std::string op = obj.getOperatorValue();
+        std::string resource_type;
+        auto iter = op_to_rtype.find(op);
+        if (iter != op_to_rtype.end())
+        {
+            resource_type = iter->second;
+        }
+        if (! resource_type.empty())
+        {
+            this->names.insert(this->last_name);
+            this->names_by_resource_type[
+                resource_type][this->last_name].insert(this->last_name_offset);
+        }
     }
-    else if (token.getType() == QPDFTokenizer::tt_name)
+    else if (obj.isName())
     {
-        this->last_name =
-            QPDFObjectHandle::newName(token.getValue()).getName();
+        this->last_name = obj.getName();
+        this->last_name_offset = offset;
     }
-    else if (token.getType() == QPDFTokenizer::tt_bad)
-    {
-        saw_bad = true;
-    }
-    writeToken(token);
+}
+
+void
+ResourceFinder::handleWarning()
+{
+    this->saw_bad = true;
 }
 
 std::set<std::string> const&
@@ -31,6 +56,12 @@ ResourceFinder::getNames() const
     return this->names;
 }
 
+std::map<std::string, std::map<std::string, std::set<size_t>>> const&
+ResourceFinder::getNamesByResourceType() const
+{
+    return this->names_by_resource_type;
+}
+
 bool
 ResourceFinder::sawBad() const
 {
libqpdf/qpdf/ResourceFinder.hh
@@ -3,19 +3,26 @@
 
 #include <qpdf/QPDFObjectHandle.hh>
 
-class ResourceFinder: public QPDFObjectHandle::TokenFilter
+class ResourceFinder: public QPDFObjectHandle::ParserCallbacks
 {
   public:
     ResourceFinder();
     virtual ~ResourceFinder() = default;
-    virtual void handleToken(QPDFTokenizer::Token const&) override;
+    virtual void handleObject(QPDFObjectHandle, size_t, size_t) override;
+    virtual void handleWarning() override;
     std::set<std::string> const& getNames() const;
+    std::map<std::string,
+             std::map<std::string,
+                      std::set<size_t>>> const& getNamesByResourceType() const;
     bool sawBad() const;
 
   private:
     std::string last_name;
+    size_t last_name_offset;
     std::set<std::string> names;
-    std::map<std::string, std::set<std::string>> names_by_resource_type;
+    std::map<std::string,
+             std::map<std::string,
+                      std::set<size_t>>> names_by_resource_type;
     bool saw_bad;
 };
 
qpdf/qtest/qpdf/split-tokens-split.out
+WARNING: page object 3 0 stream 5 0, stream 7 0, stream 9 0, stream 11 0 (content, offset 375): null character not allowed in name token
 WARNING: split-tokens.pdf, object 3 0 at offset 181: Bad token found while scanning content stream; not attempting to remove unreferenced objects from this object
 WARNING: empty PDF: content normalization encountered bad tokens
 WARNING: empty PDF: normalized content ended with a bad token; you may be able to resolve this by coalescing content streams in combination with normalizing content. From the command line, specify --coalesce-contents