Commit 419949574df4525c61ffe060ad1c63daf66e806c

Authored by Jay Berkenbilt
1 parent 0b05111d

Add information about helper classes to the documentation

Showing 1 changed file with 239 additions and 92 deletions
manual/qpdf-manual.xml
@@ -1751,53 +1751,54 @@ outfile.pdf</option> @@ -1751,53 +1751,54 @@ outfile.pdf</option>
1751 </para> 1751 </para>
1752 <para> 1752 <para>
1753 In general, one should adhere strictly to a specification when 1753 In general, one should adhere strictly to a specification when
1754 - writing but be liberal in reading. This way, the product of our 1754 + writing but be liberal in reading. This way, the product of our
1755 software will be accepted by the widest range of other programs, 1755 software will be accepted by the widest range of other programs,
1756 - and we will accept the widest range of input files. This library 1756 + and we will accept the widest range of input files. This library
1757 attempts to conform to that philosophy whenever possible but also 1757 attempts to conform to that philosophy whenever possible but also
1758 aims to provide strict checking for people who want to validate 1758 aims to provide strict checking for people who want to validate
1759 - PDF files. If you don't want to see warnings and are trying to 1759 + PDF files. If you don't want to see warnings and are trying to
1760 write something that is tolerant, you can call 1760 write something that is tolerant, you can call
1761 - <literal>setSuppressWarnings(true)</literal>. If you want to fail 1761 + <literal>setSuppressWarnings(true)</literal>. If you want to fail
1762 on the first error, you can call 1762 on the first error, you can call
1763 - <literal>setAttemptRecovery(false)</literal>. The default  
1764 - behavior is to generating warnings for recoverable problems. Note  
1765 - that recovery will not always produce the desired results even if  
1766 - it is able to get through the file. Unlike most other PDF files  
1767 - that produce generic warnings such as &ldquo;This file is 1763 + <literal>setAttemptRecovery(false)</literal>. The default behavior
  1764 + is to generate warnings for recoverable problems. Note that
  1765 + recovery will not always produce the desired results even if it is
  1766 + able to get through the file. Unlike most other PDF files that
  1767 + produce generic warnings such as &ldquo;This file is
1768 damaged,&rdquo;, qpdf generally issues a detailed error message 1768 damaged,&rdquo;, qpdf generally issues a detailed error message
1769 - that would be most useful to a PDF developer. This is by design  
1770 - as there seems to be a shortage of PDF validation tools out  
1771 - there. (This was, in fact, one of the major motivations behind  
1772 - the initial creation of qpdf.) 1769 + that would be most useful to a PDF developer. This is by design as
  1770 + there seems to be a shortage of PDF validation tools out there.
  1771 + This was, in fact, one of the major motivations behind the initial
  1772 + creation of qpdf.
1773 </para> 1773 </para>
1774 </sect1> 1774 </sect1>
1775 <sect1 id="ref.design-goals"> 1775 <sect1 id="ref.design-goals">
1776 <title>Design Goals</title> 1776 <title>Design Goals</title>
1777 <para> 1777 <para>
1778 The QPDF package includes support for reading and rewriting PDF 1778 The QPDF package includes support for reading and rewriting PDF
1779 - files. It aims to hide from the user details involving object 1779 + files. It aims to hide from the user details involving object
1780 locations, modified (appended) PDF files, the 1780 locations, modified (appended) PDF files, the
1781 directness/indirectness of objects, and stream filters including 1781 directness/indirectness of objects, and stream filters including
1782 - encryption. It does not aim to hide knowledge of the object  
1783 - hierarchy or content stream contents. Put another way, a user of 1782 + encryption. It does not aim to hide knowledge of the object
  1783 + hierarchy or content stream contents. Put another way, a user of
1784 the qpdf library is expected to have knowledge about how PDF files 1784 the qpdf library is expected to have knowledge about how PDF files
1785 work, but is not expected to have to keep track of bookkeeping 1785 work, but is not expected to have to keep track of bookkeeping
1786 details such as file positions. 1786 details such as file positions.
1787 </para> 1787 </para>
1788 <para> 1788 <para>
1789 A user of the library never has to care whether an object is 1789 A user of the library never has to care whether an object is
1790 - direct or indirect. All access to objects deals with this  
1791 - transparently. All memory management details are also handled by  
1792 - the library. 1790 + direct or indirect, though it is possible to determine whether an
  1791 + object is direct or not if this information is needed. All access
  1792 + to objects deals with this transparently. All memory management
  1793 + details are also handled by the library.
1793 </para> 1794 </para>
1794 <para> 1795 <para>
1795 The <classname>PointerHolder</classname> object is used internally 1796 The <classname>PointerHolder</classname> object is used internally
1796 - by the library to deal with memory management. This is basically  
1797 - a smart pointer object very similar in spirit to the Boost  
1798 - library's <classname>shared_ptr</classname> object, but predating  
1799 - it by several years. This library also makes use of a technique  
1800 - for giving fine-grained access to methods in one class to other 1797 + by the library to deal with memory management. This is basically a
  1798 + smart pointer object very similar in spirit to C++11's
  1799 + <classname>std::shared_ptr</classname> object, but predating it by
  1800 + several years. This library also makes use of a technique for
  1801 + giving fine-grained access to methods in one class to other
1801 classes by using public subclasses with friends and only private 1802 classes by using public subclasses with friends and only private
1802 members that in turn call private methods of the containing class. 1803 members that in turn call private methods of the containing class.
1803 See <classname>QPDFObjectHandle::Factory</classname> as an 1804 See <classname>QPDFObjectHandle::Factory</classname> as an
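Note on the hunk above: the access-control technique it describes (a public nested class whose members are all private, with selected friends, forwarding to private methods of the containing class) can be sketched as follows. This is only an illustration under stated assumptions. Handle, Factory, and Parser are hypothetical names chosen for the sketch and do not reproduce the real QPDFObjectHandle::Factory.

```cpp
#include <cassert>

class Parser;  // hypothetical: the only class allowed to create Handles

class Handle {
public:
    int id() const { return id_; }

    // Public nested class with only private members. Its friend list
    // decides which outside classes may reach Handle's private constructor.
    class Factory {
        friend class Parser;
        // Private by default: callable only by Parser. As a member of
        // Handle, Factory may use Handle's private constructor.
        static Handle make(int id) { return Handle(id); }
    };

private:
    explicit Handle(int id) : id_(id) {}
    int id_;
};

class Parser {
public:
    // Allowed because Parser is a friend of Handle::Factory.
    Handle parse(int id) { return Handle::Factory::make(id); }
};
```

Any other class that tried to call Handle::Factory::make() or the Handle constructor directly would fail to compile, so Parser gains exactly one capability rather than friendship with all of Handle's private members.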
@@ -1810,29 +1811,20 @@ outfile.pdf&lt;/option&gt; @@ -1810,29 +1811,20 @@ outfile.pdf&lt;/option&gt;
1810 files. 1811 files.
1811 </para> 1812 </para>
1812 <para> 1813 <para>
1813 - <classname>QPDFObject</classname> is the basic PDF Object class.  
1814 - It is an abstract base class from which are derived classes for  
1815 - each type of PDF object. Clients do not interact with Objects  
1816 - directly but instead interact with  
1817 - <classname>QPDFObjectHandle</classname>.  
1818 - </para>  
1819 - <para>  
1820 - <classname>QPDFObjectHandle</classname> contains  
1821 - <classname>PointerHolder&lt;QPDFObject&gt;</classname> and  
1822 - includes accessor methods that are type-safe proxies to the  
1823 - methods of the derived object classes as well as methods for  
1824 - querying object types. They can be passed around by value,  
1825 - copied, stored in containers, etc. with very low overhead.  
1826 - Instances of <classname>QPDFObjectHandle</classname> always  
1827 - contain a reference back to the <classname>QPDF</classname> object  
1828 - from which they were created. A 1814 + The primary class for interacting with PDF objects is
  1815 + <classname>QPDFObjectHandle</classname>. Instances of this class
  1816 + can be passed around by value, copied, stored in containers, etc.
  1817 + with very low overhead. Instances of
  1818 + <classname>QPDFObjectHandle</classname> created by reading from a
  1819 + file will always contain a reference back to the
  1820 + <classname>QPDF</classname> object from which they were created. A
1829 <classname>QPDFObjectHandle</classname> may be direct or indirect. 1821 <classname>QPDFObjectHandle</classname> may be direct or indirect.
1830 If indirect, the <classname>QPDFObject</classname> the 1822 If indirect, the <classname>QPDFObject</classname> the
1831 <classname>PointerHolder</classname> initially points to is a null 1823 <classname>PointerHolder</classname> initially points to is a null
1832 - pointer. In this case, the first attempt to access the underlying 1824 + pointer. In this case, the first attempt to access the underlying
1833 <classname>QPDFObject</classname> will result in the 1825 <classname>QPDFObject</classname> will result in the
1834 <classname>QPDFObject</classname> being resolved via a call to the 1826 <classname>QPDFObject</classname> being resolved via a call to the
1835 - referenced <classname>QPDF</classname> instance. This makes it 1827 + referenced <classname>QPDF</classname> instance. This makes it
1836 essentially impossible to make coding errors in which certain 1828 essentially impossible to make coding errors in which certain
1837 things will work for some PDF files and not for others based on 1829 things will work for some PDF files and not for others based on
1838 which objects are direct and which objects are indirect. 1830 which objects are direct and which objects are indirect.
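Note on the hunk above: the lazy resolution of indirect handles can be sketched in standalone C++. This is a minimal illustration, not qpdf's implementation. Document, Handle, and Object are hypothetical stand-ins for QPDF, QPDFObjectHandle, and QPDFObject, and std::shared_ptr is assumed in place of PointerHolder.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins; none of these are qpdf's real classes.
struct Object {
    std::string value;
};

class Document {
public:
    // Simulates reading object `id` from the file. The result is cached,
    // so each object is "read" at most once.
    std::shared_ptr<Object> resolve(int id) {
        auto it = cache_.find(id);
        if (it != cache_.end()) {
            return it->second;
        }
        ++reads_;
        auto obj = std::make_shared<Object>(Object{"object " + std::to_string(id)});
        cache_[id] = obj;
        return obj;
    }
    int reads() const { return reads_; }

private:
    std::map<int, std::shared_ptr<Object>> cache_;
    int reads_ = 0;
};

class Handle {
public:
    // Direct handle: holds its object from the start.
    explicit Handle(std::shared_ptr<Object> obj) : obj_(std::move(obj)) {}
    // Indirect handle: starts with a null pointer and a back-reference
    // to the owning document.
    Handle(Document& doc, int id) : doc_(&doc), id_(id) {}

    bool isIndirect() const { return doc_ != nullptr; }

    // Callers use the same accessor for direct and indirect handles;
    // an indirect handle resolves itself on first access.
    const std::string& value() {
        if (!obj_) {
            obj_ = doc_->resolve(id_);
        }
        return obj_->value;
    }

private:
    std::shared_ptr<Object> obj_;
    Document* doc_ = nullptr;
    int id_ = 0;
};
```

In this sketch, two indirect handles that refer to the same object id trigger a single simulated read, and calling code never branches on whether a handle is direct or indirect.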
@@ -1849,48 +1841,6 @@ outfile.pdf&lt;/option&gt; @@ -1849,48 +1841,6 @@ outfile.pdf&lt;/option&gt;
1849 <filename>QPDFObjectHandle.hh</filename> for details. 1841 <filename>QPDFObjectHandle.hh</filename> for details.
1850 </para> 1842 </para>
1851 <para> 1843 <para>
1852 - When the <classname>QPDF</classname> class creates a new object,  
1853 - it dynamically allocates the appropriate type of  
1854 - <classname>QPDFObject</classname> and immediately hands the  
1855 - pointer to an instance of <classname>QPDFObjectHandle</classname>.  
1856 - The parser reads a token from the current file position. If the  
1857 - token is a not either a dictionary or array opener, an object is  
1858 - immediately constructed from the single token and the parser  
1859 - returns. Otherwise, the parser is invoked recursively in a  
1860 - special mode in which it accumulates objects until it finds a  
1861 - balancing closer. During this process, the  
1862 - &ldquo;<literal>R</literal>&rdquo; keyword is recognized and an  
1863 - indirect <classname>QPDFObjectHandle</classname> may be  
1864 - constructed.  
1865 - </para>  
1866 - <para>  
1867 - The <function>QPDF::resolve()</function> method, which is used to  
1868 - resolve an indirect object, may be invoked from the  
1869 - <classname>QPDFObjectHandle</classname> class. It first checks a  
1870 - cache to see whether this object has already been read. If not,  
1871 - it reads the object from the PDF file and caches it. It the  
1872 - returns the resulting <classname>QPDFObjectHandle</classname>.  
1873 - The calling object handle then replaces its  
1874 - <classname>PointerHolder&lt;QDFObject&gt;</classname> with the one  
1875 - from the newly returned <classname>QPDFObjectHandle</classname>.  
1876 - In this way, only a single copy of any direct object need exist  
1877 - and clients can access objects transparently without knowing  
1878 - caring whether they are direct or indirect objects. Additionally,  
1879 - no object is ever read from the file more than once. That means  
1880 - that only the portions of the PDF file that are actually needed  
1881 - are ever read from the input file, thus allowing the qpdf package  
1882 - to take advantage of this important design goal of PDF files.  
1883 - </para>  
1884 - <para>  
1885 - If the requested object is inside of an object stream, the object  
1886 - stream itself is first read into memory. Then the tokenizer reads  
1887 - objects from the memory stream based on the offset information  
1888 - stored in the stream. Those individual objects are cached, after  
1889 - which the temporary buffer holding the object stream contents are  
1890 - discarded. In this way, the first time an object in an object  
1891 - stream is requested, all objects in the stream are cached.  
1892 - </para>  
1893 - <para>  
1894 An instance of <classname>QPDF</classname> is constructed by using 1844 An instance of <classname>QPDF</classname> is constructed by using
1895 the class's default constructor. If desired, the 1845 the class's default constructor. If desired, the
1896 <classname>QPDF</classname> object may be configured with various 1846 <classname>QPDF</classname> object may be configured with various
@@ -1934,8 +1884,206 @@ outfile.pdf&lt;/option&gt; @@ -1934,8 +1884,206 @@ outfile.pdf&lt;/option&gt;
1934 <para> 1884 <para>
1935 There are some convenience routines for very common operations 1885 There are some convenience routines for very common operations
1936 such as walking the page tree and returning a vector of all page 1886 such as walking the page tree and returning a vector of all page
1937 - objects. For full details, please see the header file  
1938 - <filename>QPDF.hh</filename>. 1887 + objects. For full details, please see the header files
  1888 + <filename>QPDF.hh</filename> and
  1889 + <filename>QPDFObjectHandle.hh</filename>. There are also some
  1890 + additional helper classes that provide higher level API functions
  1891 + for certain document constructions. These are discussed in <xref
  1892 + linkend="ref.helper-classes"/>.
  1893 + </para>
  1894 + </sect1>
  1895 + <sect1 id="ref.helper-classes">
  1896 + <title>Helper Classes</title>
  1897 + <para>
  1898 + QPDF version 8.1 introduced the concept of helper classes. Helper
  1899 + classes are intended to contain higher level APIs that allow
  1900 + developers to work with certain document constructs at an
  1901 + abstraction level above that of
  1902 + <classname>QPDFObjectHandle</classname> while staying true to
  1903 + qpdf's philosophy of not hiding document structure from the
  1904 + developer. As with qpdf in general, the goal is to take away some of
  1905 + the more tedious bookkeeping aspects of working with PDF files,
  1906 + not to remove the need for the developer to understand how the PDF
  1907 + construction in question works. The driving factor behind the
  1908 + creation of helper classes was to allow the evolution of higher
  1909 + level interfaces in qpdf without polluting the interfaces of the
  1910 + main top-level classes <classname>QPDF</classname> and
  1911 + <classname>QPDFObjectHandle</classname>.
  1912 + </para>
  1913 + <para>
  1914 + There are two kinds of helper classes:
  1915 + <emphasis>document</emphasis> helpers and
  1916 + <emphasis>object</emphasis> helpers. Document helpers are
  1917 + constructed with a reference to a <classname>QPDF</classname>
  1918 + object and provide methods for working with structures that are at
  1919 + the document level. Object helpers are constructed with an
  1920 + instance of a <classname>QPDFObjectHandle</classname> and provide
  1921 + methods for working with specific types of objects.
  1922 + </para>
  1923 + <para>
  1924 + Examples of document helpers include
  1925 + <classname>QPDFPageDocumentHelper</classname>, which contains
  1926 + methods for operating on the document's page trees, such as
  1927 + enumerating all pages of a document and adding and removing pages;
  1928 + and <classname>QPDFAcroFormDocumentHelper</classname>, which
  1929 + contains document-level methods related to interactive forms, such
  1930 + as enumerating form fields and creating mappings between form
  1931 + fields and annotations.
  1932 + </para>
  1933 + <para>
  1934 + Examples of object helpers include
  1935 + <classname>QPDFPageObjectHelper</classname> for performing
  1936 + operations on pages such as page rotation and some operations on
  1937 + content streams, <classname>QPDFFormFieldObjectHelper</classname>
  1938 + for performing operations related to interactive form fields, and
  1939 + <classname>QPDFAnnotationObjectHelper</classname> for working with
  1940 + annotations.
  1941 + </para>
  1942 + <para>
  1943 + It is always possible to retrieve the underlying
  1944 + <classname>QPDF</classname> reference from a document helper and
  1945 + the underlying <classname>QPDFObjectHandle</classname> reference
  1946 + from an object helper. Helpers are designed to be helpers, not
  1947 + wrappers. The intention is that, in general, it is safe to freely
  1948 + intermix operations that use helpers with operations that use the
  1949 + underlying objects. Document and object helpers do not attempt to
  1950 + provide a complete interface for working with the things they are
  1951 + helping with, nor do they attempt to encapsulate underlying
  1952 + structures. They just provide a few methods to help with
  1953 + error-prone, repetitive, or complex tasks. In some cases, a helper
  1954 + object may cache some information that is expensive to gather. In
  1955 + such cases, the helper classes are implemented so that their own
  1956 + methods keep the cache consistent, and the header file will
  1957 + provide a method to invalidate the cache and a description of what
  1958 + kinds of operations would make the cache invalid. If in doubt, you
  1959 + can always discard a helper class and create a new one with the
  1960 + same underlying objects, which will ensure that you have discarded
  1961 + any stale information.
  1962 + </para>
  1963 + <para>
  1964 + By convention, document helpers are called
  1965 + <classname>QPDFSomethingDocumentHelper</classname> and are derived
  1966 + from <classname>QPDFDocumentHelper</classname>, and object helpers
  1967 + are called <classname>QPDFSomethingObjectHelper</classname> and
  1968 + are derived from <classname>QPDFObjectHelper</classname>. For
  1969 + details on specific helpers, please see their header files. You
  1970 + can find them by looking at
  1971 + <filename>include/qpdf/QPDF*DocumentHelper.hh</filename> and
  1972 + <filename>include/qpdf/QPDF*ObjectHelper.hh</filename>.
  1973 + </para>
  1974 + <para>
  1975 + In order to avoid creation of circular dependencies, the following
  1976 + general guidelines are followed with helper classes:
  1977 + <itemizedlist>
  1978 + <listitem>
  1979 + <para>
  1980 + Core class interfaces do not know about helper classes. For
  1981 + example, no methods of <classname>QPDF</classname> or
  1982 + <classname>QPDFObjectHandle</classname> will include helper
  1983 + classes in their interfaces.
  1984 + </para>
  1985 + </listitem>
  1986 + <listitem>
  1987 + <para>
  1988 + Interfaces of object helpers will usually not use document
  1989 + helpers in their interfaces. This is because it is much more
  1990 + useful for document helpers to have methods that return object
  1991 + helpers. Most operations in PDF files start at the document
  1992 + level and go from there to the object level rather than the
  1993 + other way around. It can sometimes be useful to map back from
  1994 + object-level structures to document-level structures. If there
  1995 + is a desire to do this, it will generally be provided by a
  1996 + method in the document helper class.
  1997 + </para>
  1998 + </listitem>
  1999 + <listitem>
  2000 + <para>
  2001 + Most of the time, object helpers don't know about other object
  2002 + helpers. However, in some cases, one type of object may be a
  2003 + container for another type of object, in which case it may make
  2004 + sense for the outer object to know about the inner object. For
  2005 + example, there are methods in the
  2006 + <classname>QPDFPageObjectHelper</classname> that know
  2007 + <classname>QPDFAnnotationObjectHelper</classname> because
  2008 + references to annotations are contained in page dictionaries.
  2009 + </para>
  2010 + </listitem>
  2011 + <listitem>
  2012 + <para>
  2013 + Any helper or core library class may use helpers in their
  2014 + implementations.
  2015 + </para>
  2016 + </listitem>
  2017 + </itemizedlist>
  2018 + </para>
  2019 + <para>
  2020 + Prior to qpdf version 8.1, higher level interfaces were added as
  2021 + &ldquo;convenience functions&rdquo; in either
  2022 + <classname>QPDF</classname> or
  2023 + <classname>QPDFObjectHandle</classname>. For compatibility, older
  2024 + convenience functions for operating with pages will remain in
  2025 + those classes even as alternatives are provided in helper classes.
  2026 + Going forward, new higher level interfaces will be provided using
  2027 + helper classes.
  2028 + </para>
  2029 + </sect1>
  2030 + <sect1 id="ref.implementation-notes">
  2031 + <title>Implementation Notes</title>
  2032 + <para>
  2033 + This section contains a few notes about QPDF's internal
  2034 + implementation, particularly around what it does when it first
  2035 + processes a file. This section is a bit of a simplification of
  2036 + what it actually does, but it could serve as a starting point to
  2037 + someone trying to understand the implementation. There is nothing
  2038 + in this section that you need to know to use the qpdf library.
  2039 + </para>
  2040 + <para>
  2041 + <classname>QPDFObject</classname> is the basic PDF Object class.
  2042 + It is an abstract base class from which are derived classes for
  2043 + each type of PDF object. Clients do not interact with Objects
  2044 + directly but instead interact with
  2045 + <classname>QPDFObjectHandle</classname>.
  2046 + </para>
  2047 + <para>
  2048 + When the <classname>QPDF</classname> class creates a new object,
  2049 + it dynamically allocates the appropriate type of
  2050 + <classname>QPDFObject</classname> and immediately hands the
  2051 + pointer to an instance of <classname>QPDFObjectHandle</classname>.
  2052 + The parser reads a token from the current file position. If the
  2053 + token is neither a dictionary nor an array opener, an object is
  2054 + immediately constructed from the single token and the parser
  2055 + returns. Otherwise, the parser iterates in a special mode in which
  2056 + it accumulates objects until it finds a balancing closer. During
  2057 + this process, the &ldquo;<literal>R</literal>&rdquo; keyword is
  2058 + recognized and an indirect <classname>QPDFObjectHandle</classname>
  2059 + may be constructed.
  2060 + </para>
  2061 + <para>
  2062 + The <function>QPDF::resolve()</function> method, which is used to
  2063 + resolve an indirect object, may be invoked from the
  2064 + <classname>QPDFObjectHandle</classname> class. It first checks a
  2065 + cache to see whether this object has already been read. If not,
  2066 + it reads the object from the PDF file and caches it. It then
  2067 + returns the resulting <classname>QPDFObjectHandle</classname>.
  2068 + The calling object handle then replaces its
  2069 + <classname>PointerHolder&lt;QPDFObject&gt;</classname> with the one
  2070 + from the newly returned <classname>QPDFObjectHandle</classname>.
  2071 + In this way, only a single copy of any direct object need exist
  2072 + and clients can access objects transparently without knowing or
  2073 + caring whether they are direct or indirect objects. Additionally,
  2074 + no object is ever read from the file more than once. That means
  2075 + that only the portions of the PDF file that are actually needed
  2076 + are ever read from the input file, thus allowing the qpdf package
  2077 + to take advantage of this important design goal of PDF files.
  2078 + </para>
  2079 + <para>
  2080 + If the requested object is inside of an object stream, the object
  2081 + stream itself is first read into memory. Then the tokenizer reads
  2082 + objects from the memory stream based on the offset information
  2083 + stored in the stream. Those individual objects are cached, after
  2084 + which the temporary buffer holding the object stream contents is
  2085 + discarded. In this way, the first time an object in an object
  2086 + stream is requested, all objects in the stream are cached.
1939 </para> 2087 </para>
1940 <para> 2088 <para>
1941 The following example should clarify how 2089 The following example should clarify how
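Note on the hunk above: the division of labor between document helpers and object helpers in the new Helper Classes section can be sketched as follows. This is an illustration under stated assumptions, not the qpdf API. Doc, PageObject, PagesHelper, and PageHelper are hypothetical stand-ins for QPDF, QPDFObjectHandle, QPDFPageDocumentHelper, and QPDFPageObjectHelper, and the method names are invented for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical core classes. They know nothing about helpers.
struct PageObject {
    int rotation = 0;
};
struct Doc {
    std::vector<PageObject> pages;
};

// Object helper: constructed from a single object. It adds a convenience
// method but does not hide the object it helps with.
class PageHelper {
public:
    explicit PageHelper(PageObject& page) : page_(page) {}
    PageObject& object() { return page_; }  // underlying object stays reachable
    void rotate(int degrees) {
        // Normalize to the range [0, 360), including for negative input.
        page_.rotation = ((page_.rotation + degrees) % 360 + 360) % 360;
    }

private:
    PageObject& page_;
};

// Document helper: constructed from a document reference. Its methods
// return object helpers, never the other way around.
class PagesHelper {
public:
    explicit PagesHelper(Doc& doc) : doc_(doc) {}
    Doc& document() { return doc_; }  // underlying document stays reachable
    std::vector<PageHelper> allPages() {
        std::vector<PageHelper> out;
        for (auto& page : doc_.pages) {
            out.emplace_back(page);
        }
        return out;
    }

private:
    Doc& doc_;
};
```

Because the helpers in this sketch hold references and cache nothing, changes made directly to the underlying objects are visible through the helpers and vice versa, which mirrors the stated intent that helper operations and direct operations can be freely intermixed.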
@@ -1951,12 +2099,11 @@ outfile.pdf&lt;/option&gt; @@ -1951,12 +2099,11 @@ outfile.pdf&lt;/option&gt;
1951 <listitem> 2099 <listitem>
1952 <para> 2100 <para>
1953 The <classname>QPDF</classname> class checks the beginning of 2101 The <classname>QPDF</classname> class checks the beginning of
1954 - <filename>a.pdf</filename> for  
1955 - <literal>%!PDF-1.[0-9]+</literal>. It then reads the cross  
1956 - reference table mentioned at the end of the file, ensuring that  
1957 - it is looking before the last <literal>%%EOF</literal>. After  
1958 - getting to <literal>trailer</literal> keyword, it invokes the  
1959 - parser. 2102 + <filename>a.pdf</filename> for a PDF header. It then reads the
  2103 + cross reference table mentioned at the end of the file,
  2104 + ensuring that it is looking before the last
  2105 + <literal>%%EOF</literal>. After getting to the
  2106 + <literal>trailer</literal> keyword, it invokes the parser.
1960 </para> 2107 </para>
1961 </listitem> 2108 </listitem>
1962 <listitem> 2109 <listitem>