Commit 419949574df4525c61ffe060ad1c63daf66e806c

Authored by Jay Berkenbilt
1 parent 0b05111d

Add information about helper classes to the documentation

Showing 1 changed file with 239 additions and 92 deletions
manual/qpdf-manual.xml
... ... @@ -1751,53 +1751,54 @@ outfile.pdf</option>
1751 1751 </para>
1752 1752 <para>
1753 1753 In general, one should adhere strictly to a specification when
1754   - writing but be liberal in reading. This way, the product of our
  1754 + writing but be liberal in reading. This way, the product of our
1755 1755 software will be accepted by the widest range of other programs,
1756   - and we will accept the widest range of input files. This library
  1756 + and we will accept the widest range of input files. This library
1757 1757 attempts to conform to that philosophy whenever possible but also
1758 1758 aims to provide strict checking for people who want to validate
1759   - PDF files. If you don't want to see warnings and are trying to
  1759 + PDF files. If you don't want to see warnings and are trying to
1760 1760 write something that is tolerant, you can call
1761   - <literal>setSuppressWarnings(true)</literal>. If you want to fail
  1761 + <literal>setSuppressWarnings(true)</literal>. If you want to fail
1762 1762 on the first error, you can call
1763   - <literal>setAttemptRecovery(false)</literal>. The default
1764   - behavior is to generate warnings for recoverable problems. Note
1765   - that recovery will not always produce the desired results even if
1766   - it is able to get through the file. Unlike most other PDF software
1767   - that produce generic warnings such as &ldquo;This file is
  1763 + <literal>setAttemptRecovery(false)</literal>. The default behavior
  1764 + is to generate warnings for recoverable problems. Note that
  1765 + recovery will not always produce the desired results even if it is
  1766 + able to get through the file. Unlike most other PDF software that
  1767 + produce generic warnings such as &ldquo;This file is
1768 1768 damaged,&rdquo; qpdf generally issues a detailed error message
1769   - that would be most useful to a PDF developer. This is by design
1770   - as there seems to be a shortage of PDF validation tools out
1771   - there. (This was, in fact, one of the major motivations behind
1772   - the initial creation of qpdf.)
  1769 + that would be most useful to a PDF developer. This is by design as
  1770 + there seems to be a shortage of PDF validation tools out there.
  1771 + This was, in fact, one of the major motivations behind the initial
  1772 + creation of qpdf.
1773 1773 </para>
1774 1774 </sect1>
1775 1775 <sect1 id="ref.design-goals">
1776 1776 <title>Design Goals</title>
1777 1777 <para>
1778 1778 The QPDF package includes support for reading and rewriting PDF
1779   - files. It aims to hide from the user details involving object
  1779 + files. It aims to hide from the user details involving object
1780 1780 locations, modified (appended) PDF files, the
1781 1781 directness/indirectness of objects, and stream filters including
1782   - encryption. It does not aim to hide knowledge of the object
1783   - hierarchy or content stream contents. Put another way, a user of
  1782 + encryption. It does not aim to hide knowledge of the object
  1783 + hierarchy or content stream contents. Put another way, a user of
1784 1784 the qpdf library is expected to have knowledge about how PDF files
1785 1785 work, but is not expected to have to keep track of bookkeeping
1786 1786 details such as file positions.
1787 1787 </para>
1788 1788 <para>
1789 1789 A user of the library never has to care whether an object is
1790   - direct or indirect. All access to objects deals with this
1791   - transparently. All memory management details are also handled by
1792   - the library.
  1790 + direct or indirect, though it is possible to determine whether an
  1791 + object is direct or not if this information is needed. All access
  1792 + to objects deals with this transparently. All memory management
  1793 + details are also handled by the library.
1793 1794 </para>
1794 1795 <para>
1795 1796 The <classname>PointerHolder</classname> object is used internally
1796   - by the library to deal with memory management. This is basically
1797   - a smart pointer object very similar in spirit to the Boost
1798   - library's <classname>shared_ptr</classname> object, but predating
1799   - it by several years. This library also makes use of a technique
1800   - for giving fine-grained access to methods in one class to other
  1797 + by the library to deal with memory management. This is basically a
  1798 + smart pointer object very similar in spirit to C++11's
  1799 + <classname>std::shared_ptr</classname> object, but predating it by
  1800 + several years. This library also makes use of a technique for
  1801 + giving fine-grained access to methods in one class to other
1801 1802 classes by using public subclasses with friends and only private
1802 1803 members that in turn call private methods of the containing class.
1803 1804 See <classname>QPDFObjectHandle::Factory</classname> as an
... ... @@ -1810,29 +1811,20 @@ outfile.pdf</option>
1810 1811 files.
1811 1812 </para>
1812 1813 <para>
1813   - <classname>QPDFObject</classname> is the basic PDF Object class.
1814   - It is an abstract base class from which are derived classes for
1815   - each type of PDF object. Clients do not interact with Objects
1816   - directly but instead interact with
1817   - <classname>QPDFObjectHandle</classname>.
1818   - </para>
1819   - <para>
1820   - <classname>QPDFObjectHandle</classname> contains
1821   - <classname>PointerHolder&lt;QPDFObject&gt;</classname> and
1822   - includes accessor methods that are type-safe proxies to the
1823   - methods of the derived object classes as well as methods for
1824   - querying object types. They can be passed around by value,
1825   - copied, stored in containers, etc. with very low overhead.
1826   - Instances of <classname>QPDFObjectHandle</classname> always
1827   - contain a reference back to the <classname>QPDF</classname> object
1828   - from which they were created. A
  1814 + The primary class for interacting with PDF objects is
  1815 + <classname>QPDFObjectHandle</classname>. Instances of this class
  1816 + can be passed around by value, copied, stored in containers, etc.
  1817 + with very low overhead. Instances of
  1818 + <classname>QPDFObjectHandle</classname> created by reading from a
  1819 + file will always contain a reference back to the
  1820 + <classname>QPDF</classname> object from which they were created. A
1829 1821 <classname>QPDFObjectHandle</classname> may be direct or indirect.
1830 1822 If indirect, the <classname>QPDFObject</classname> the
1831 1823 <classname>PointerHolder</classname> initially points to is a null
1832   - pointer. In this case, the first attempt to access the underlying
  1824 + pointer. In this case, the first attempt to access the underlying
1833 1825 <classname>QPDFObject</classname> will result in the
1834 1826 <classname>QPDFObject</classname> being resolved via a call to the
1835   - referenced <classname>QPDF</classname> instance. This makes it
  1827 + referenced <classname>QPDF</classname> instance. This makes it
1836 1828 essentially impossible to make coding errors in which certain
1837 1829 things will work for some PDF files and not for others based on
1838 1830 which objects are direct and which objects are indirect.
... ... @@ -1849,48 +1841,6 @@ outfile.pdf</option>
1849 1841 <filename>QPDFObjectHandle.hh</filename> for details.
1850 1842 </para>
1851 1843 <para>
1852   - When the <classname>QPDF</classname> class creates a new object,
1853   - it dynamically allocates the appropriate type of
1854   - <classname>QPDFObject</classname> and immediately hands the
1855   - pointer to an instance of <classname>QPDFObjectHandle</classname>.
1856   - The parser reads a token from the current file position. If the
1857   - token is neither a dictionary nor an array opener, an object is
1858   - immediately constructed from the single token and the parser
1859   - returns. Otherwise, the parser is invoked recursively in a
1860   - special mode in which it accumulates objects until it finds a
1861   - balancing closer. During this process, the
1862   - &ldquo;<literal>R</literal>&rdquo; keyword is recognized and an
1863   - indirect <classname>QPDFObjectHandle</classname> may be
1864   - constructed.
1865   - </para>
1866   - <para>
1867   - The <function>QPDF::resolve()</function> method, which is used to
1868   - resolve an indirect object, may be invoked from the
1869   - <classname>QPDFObjectHandle</classname> class. It first checks a
1870   - cache to see whether this object has already been read. If not,
1871   - it reads the object from the PDF file and caches it. It then
1872   - returns the resulting <classname>QPDFObjectHandle</classname>.
1873   - The calling object handle then replaces its
1874   - <classname>PointerHolder&lt;QPDFObject&gt;</classname> with the one
1875   - from the newly returned <classname>QPDFObjectHandle</classname>.
1876   - In this way, only a single copy of any direct object need exist
1877   - and clients can access objects transparently without knowing
1878   - or caring whether they are direct or indirect objects. Additionally,
1879   - no object is ever read from the file more than once. That means
1880   - that only the portions of the PDF file that are actually needed
1881   - are ever read from the input file, thus allowing the qpdf package
1882   - to take advantage of this important design goal of PDF files.
1883   - </para>
1884   - <para>
1885   - If the requested object is inside of an object stream, the object
1886   - stream itself is first read into memory. Then the tokenizer reads
1887   - objects from the memory stream based on the offset information
1888   - stored in the stream. Those individual objects are cached, after
1889   - which the temporary buffer holding the object stream contents is
1890   - discarded. In this way, the first time an object in an object
1891   - stream is requested, all objects in the stream are cached.
1892   - </para>
1893   - <para>
1894 1844 An instance of <classname>QPDF</classname> is constructed by using
1895 1845 the class's default constructor. If desired, the
1896 1846 <classname>QPDF</classname> object may be configured with various
... ... @@ -1934,8 +1884,206 @@ outfile.pdf</option>
1934 1884 <para>
1935 1885 There are some convenience routines for very common operations
1936 1886 such as walking the page tree and returning a vector of all page
1937   - objects. For full details, please see the header file
1938   - <filename>QPDF.hh</filename>.
  1887 + objects. For full details, please see the header files
  1888 + <filename>QPDF.hh</filename> and
  1889 + <filename>QPDFObjectHandle.hh</filename>. There are also some
  1890 + additional helper classes that provide higher level API functions
  1891 + for certain document constructions. These are discussed in <xref
  1892 + linkend="ref.helper-classes"/>.
  1893 + </para>
  1894 + </sect1>
  1895 + <sect1 id="ref.helper-classes">
  1896 + <title>Helper Classes</title>
  1897 + <para>
  1898 + QPDF version 8.1 introduced the concept of helper classes. Helper
  1899 + classes are intended to contain higher level APIs that allow
  1900 + developers to work with certain document constructs at an
  1901 + abstraction level above that of
  1902 + <classname>QPDFObjectHandle</classname> while staying true to
  1903 + qpdf's philosophy of not hiding document structure from the
  1904 + developer. As with qpdf in general, the goal is to take away some of
  1905 + the more tedious bookkeeping aspects of working with PDF files,
  1906 + not to remove the need for the developer to understand how the PDF
  1907 + construction in question works. The driving factor behind the
  1908 + creation of helper classes was to allow the evolution of higher
  1909 + level interfaces in qpdf without polluting the interfaces of the
  1910 + main top-level classes <classname>QPDF</classname> and
  1911 + <classname>QPDFObjectHandle</classname>.
  1912 + </para>
  1913 + <para>
  1914 + There are two kinds of helper classes:
  1915 + <emphasis>document</emphasis> helpers and
  1916 + <emphasis>object</emphasis> helpers. Document helpers are
  1917 + constructed with a reference to a <classname>QPDF</classname>
  1918 + object and provide methods for working with structures that are at
  1919 + the document level. Object helpers are constructed with an
  1920 + instance of a <classname>QPDFObjectHandle</classname> and provide
  1921 + methods for working with specific types of objects.
  1922 + </para>
  1923 + <para>
  1924 + Examples of document helpers include
  1925 + <classname>QPDFPageDocumentHelper</classname>, which contains
  1926 + methods for operating on the document's page trees, such as
  1927 + enumerating all pages of a document and adding and removing pages;
  1928 + and <classname>QPDFAcroFormDocumentHelper</classname>, which
  1929 + contains document-level methods related to interactive forms, such
  1930 + as enumerating form fields and creating mappings between form
  1931 + fields and annotations.
  1932 + </para>
  1933 + <para>
  1934 + Examples of object helpers include
  1935 + <classname>QPDFPageObjectHelper</classname> for performing
  1936 + operations on pages such as page rotation and some operations on
  1937 + content streams, <classname>QPDFFormFieldObjectHelper</classname>
  1938 + for performing operations related to interactive form fields, and
  1939 + <classname>QPDFAnnotationObjectHelper</classname> for working with
  1940 + annotations.
  1941 + </para>
  1942 + <para>
  1943 + It is always possible to retrieve the underlying
  1944 + <classname>QPDF</classname> reference from a document helper and
  1945 + the underlying <classname>QPDFObjectHandle</classname> reference
  1946 + from an object helper. Helpers are designed to be helpers, not
  1947 + wrappers. The intention is that, in general, it is safe to freely
  1948 + intermix operations that use helpers with operations that use the
  1949 + underlying objects. Document and object helpers do not attempt to
  1950 + provide a complete interface for working with the things they are
  1951 + helping with, nor do they attempt to encapsulate underlying
  1952 + structures. They just provide a few methods to help with
  1953 + error-prone, repetitive, or complex tasks. In some cases, a helper
  1954 + object may cache some information that is expensive to gather. In
  1955 + such cases, the helper classes are implemented so that their own
  1956 + methods keep the cache consistent, and the header file will
  1957 + provide a method to invalidate the cache and a description of what
  1958 + kinds of operations would make the cache invalid. If in doubt, you
  1959 + can always discard a helper class and create a new one with the
  1960 + same underlying objects, which will ensure that you have discarded
  1961 + any stale information.
  1962 + </para>
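
Review note: the caching contract described in this paragraph can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical (`ToyDoc`, `ToyPageHelper`, `labels`, `invalidateCache` are toy names invented for illustration, not qpdf classes or methods); it only shows a helper that keeps its underlying object reachable, keeps its own cache consistent, and offers an explicit way to drop stale data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy document; this is not a qpdf class.
struct ToyDoc {
    std::vector<std::string> pages;
};

// Hypothetical document-level helper built around a reference to ToyDoc.
class ToyPageHelper {
public:
    explicit ToyPageHelper(ToyDoc& doc) : doc_(doc) {}

    // The underlying document stays reachable: a helper, not a wrapper.
    ToyDoc& getDoc() { return doc_; }

    // Gathers "number:name" labels once and caches the result.
    const std::vector<std::string>& labels() {
        if (!cached_) {
            cache_.clear();
            for (std::size_t i = 0; i < doc_.pages.size(); ++i) {
                cache_.push_back(std::to_string(i + 1) + ":" + doc_.pages[i]);
            }
            assert(cache_.size() == doc_.pages.size());
            cached_ = true;
        }
        return cache_;
    }

    // The helper's own mutators keep the cache consistent.
    void addPage(const std::string& name) {
        doc_.pages.push_back(name);
        cached_ = false;
    }

    // Call this after changing the document behind the helper's back.
    void invalidateCache() { cached_ = false; }

private:
    ToyDoc& doc_;
    std::vector<std::string> cache_;
    bool cached_ = false;
};
```

In this sketch, pushing a page directly through `getDoc()` leaves the cached labels stale until `invalidateCache()` is called or a fresh helper is constructed over the same `ToyDoc`, which mirrors the "if in doubt, discard the helper" advice in the text.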
  1963 + <para>
  1964 + By convention, document helpers are called
  1965 + <classname>QPDFSomethingDocumentHelper</classname> and are derived
  1966 + from <classname>QPDFDocumentHelper</classname>, and object helpers
  1967 + are called <classname>QPDFSomethingObjectHelper</classname> and
  1968 + are derived from <classname>QPDFObjectHelper</classname>. For
  1969 + details on specific helpers, please see their header files. You
  1970 + can find them by looking at
  1971 + <filename>include/qpdf/QPDF*DocumentHelper.hh</filename> and
  1972 + <filename>include/qpdf/QPDF*ObjectHelper.hh</filename>.
  1973 + </para>
  1974 + <para>
  1975 + In order to avoid creation of circular dependencies, the following
  1976 + general guidelines are followed with helper classes:
  1977 + <itemizedlist>
  1978 + <listitem>
  1979 + <para>
  1980 + Core class interfaces do not know about helper classes. For
  1981 + example, no methods of <classname>QPDF</classname> or
  1982 + <classname>QPDFObjectHandle</classname> will include helper
  1983 + classes in their interfaces.
  1984 + </para>
  1985 + </listitem>
  1986 + <listitem>
  1987 + <para>
  1988 + Interfaces of object helpers will usually not use document
  1989 + helpers in their interfaces. This is because it is much more
  1990 + useful for document helpers to have methods that return object
  1991 + helpers. Most operations in PDF files start at the document
  1992 + level and go from there to the object level rather than the
  1993 + other way around. It can sometimes be useful to map back from
  1994 + object-level structures to document-level structures. If there
  1995 + is a desire to do this, it will generally be provided by a
  1996 + method in the document helper class.
  1997 + </para>
  1998 + </listitem>
  1999 + <listitem>
  2000 + <para>
  2001 + Most of the time, object helpers don't know about other object
  2002 + helpers. However, in some cases, one type of object may be a
  2003 + container for another type of object, in which case it may make
  2004 + sense for the outer object to know about the inner object. For
  2005 + example, there are methods in the
  2006 + <classname>QPDFPageObjectHelper</classname> that know
  2007 + <classname>QPDFAnnotationObjectHelper</classname> because
  2008 + references to annotations are contained in page dictionaries.
  2009 + </para>
  2010 + </listitem>
  2011 + <listitem>
  2012 + <para>
  2013 + Any helper or core library class may use helpers in their
  2014 + implementations.
  2015 + </para>
  2016 + </listitem>
  2017 + </itemizedlist>
  2018 + </para>
  2019 + <para>
  2020 + Prior to qpdf version 8.1, higher level interfaces were added as
  2021 + &ldquo;convenience functions&rdquo; in either
  2022 + <classname>QPDF</classname> or
  2023 + <classname>QPDFObjectHandle</classname>. For compatibility, older
  2024 + convenience functions for operating with pages will remain in
  2025 + those classes even as alternatives are provided in helper classes.
  2026 + Going forward, new higher level interfaces will be provided using
  2027 + helper classes.
  2028 + </para>
  2029 + </sect1>
  2030 + <sect1 id="ref.implementation-notes">
  2031 + <title>Implementation Notes</title>
  2032 + <para>
  2033 + This section contains a few notes about QPDF's internal
  2034 + implementation, particularly around what it does when it first
  2035 + processes a file. This section is a bit of a simplification of
  2036 + what it actually does, but it could serve as a starting point to
  2037 + someone trying to understand the implementation. There is nothing
  2038 + in this section that you need to know to use the qpdf library.
  2039 + </para>
  2040 + <para>
  2041 + <classname>QPDFObject</classname> is the basic PDF Object class.
  2042 + It is an abstract base class from which are derived classes for
  2043 + each type of PDF object. Clients do not interact with Objects
  2044 + directly but instead interact with
  2045 + <classname>QPDFObjectHandle</classname>.
  2046 + </para>
  2047 + <para>
  2048 + When the <classname>QPDF</classname> class creates a new object,
  2049 + it dynamically allocates the appropriate type of
  2050 + <classname>QPDFObject</classname> and immediately hands the
  2051 + pointer to an instance of <classname>QPDFObjectHandle</classname>.
  2052 + The parser reads a token from the current file position. If the
  2053 + token is neither a dictionary nor an array opener, an object is
  2054 + immediately constructed from the single token and the parser
  2055 + returns. Otherwise, the parser iterates in a special mode in which
  2056 + it accumulates objects until it finds a balancing closer. During
  2057 + this process, the &ldquo;<literal>R</literal>&rdquo; keyword is
  2058 + recognized and an indirect <classname>QPDFObjectHandle</classname>
  2059 + may be constructed.
  2060 + </para>
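
Review note: as a rough illustration of the iterative "accumulate until a balancing closer" mode described here, the following sketch parses a pre-tokenized input with an explicit stack. It is a simplification under stated assumptions: only array openers are handled (dictionaries are omitted), tokens arrive as plain strings, and `Node`, `makeNode`, and `parseObject` are hypothetical names, not qpdf's parser.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical toy model: a node is a scalar token, an indirect
// reference, or an array of nodes. Not qpdf's real data structures.
struct Node {
    enum Kind { Scalar, Reference, Array };
    Kind kind = Scalar;
    std::string text;            // scalar text, or "obj gen" for a reference
    std::vector<Node> children;  // array members
};

Node makeNode(Node::Kind kind, const std::string& text) {
    Node n;
    n.kind = kind;
    n.text = text;
    return n;
}

// Parses one object starting at tokens[pos] and advances pos past it.
Node parseObject(const std::vector<std::string>& tokens, std::size_t& pos) {
    if (pos >= tokens.size()) {
        throw std::runtime_error("unexpected end of input");
    }
    if (tokens[pos] != "[") {
        // Not an opener: build the object from this single token.
        return makeNode(Node::Scalar, tokens[pos++]);
    }
    // Opener: accumulate objects iteratively until the balancing closer.
    std::vector<Node> stack;
    stack.push_back(makeNode(Node::Array, ""));
    ++pos;
    while (pos < tokens.size()) {
        const std::string& tok = tokens[pos++];
        assert(!stack.empty());
        if (tok == "[") {
            stack.push_back(makeNode(Node::Array, ""));
        } else if (tok == "]") {
            Node done = stack.back();
            stack.pop_back();
            if (stack.empty()) {
                return done;  // outermost closer reached
            }
            stack.back().children.push_back(done);
        } else if (tok == "R") {
            // "obj gen R": fold the two preceding scalars into a reference.
            std::vector<Node>& items = stack.back().children;
            if (items.size() < 2) {
                throw std::runtime_error("R without object and generation");
            }
            std::string id =
                items[items.size() - 2].text + " " + items.back().text;
            items.pop_back();
            items.pop_back();
            items.push_back(makeNode(Node::Reference, id));
        } else {
            stack.back().children.push_back(makeNode(Node::Scalar, tok));
        }
    }
    throw std::runtime_error("missing balancing closer");
}
```

The explicit stack stands in for the recursion that the removed text described; either form yields the same tree, but the iterative form does not grow the call stack with nesting depth.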
  2061 + <para>
  2062 + The <function>QPDF::resolve()</function> method, which is used to
  2063 + resolve an indirect object, may be invoked from the
  2064 + <classname>QPDFObjectHandle</classname> class. It first checks a
  2065 + cache to see whether this object has already been read. If not,
  2066 + it reads the object from the PDF file and caches it. It then
  2067 + returns the resulting <classname>QPDFObjectHandle</classname>.
  2068 + The calling object handle then replaces its
  2069 + <classname>PointerHolder&lt;QPDFObject&gt;</classname> with the one
  2070 + from the newly returned <classname>QPDFObjectHandle</classname>.
  2071 + In this way, only a single copy of any direct object need exist
  2072 + and clients can access objects transparently without knowing
  2073 + or caring whether they are direct or indirect objects. Additionally,
  2074 + no object is ever read from the file more than once. That means
  2075 + that only the portions of the PDF file that are actually needed
  2076 + are ever read from the input file, thus allowing the qpdf package
  2077 + to take advantage of this important design goal of PDF files.
  2078 + </para>
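
Review note: the resolve-and-cache behavior described in this paragraph can be sketched in a few lines. This is a toy model under stated assumptions, not qpdf's implementation: `ToyDocument`, `ToyHandle`, and `ToyObject` are hypothetical names, `std::shared_ptr` stands in for `PointerHolder`, and the "file" is an in-memory map.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a parsed object; not a qpdf type.
struct ToyObject {
    std::string value;
};

class ToyDocument {
public:
    // Simulated file contents keyed by object number.
    void addRaw(int id, std::string text) { raw_[id] = std::move(text); }

    // Returns the cached object, reading it from the "file" at most once.
    std::shared_ptr<ToyObject> resolve(int id) {
        auto hit = cache_.find(id);
        if (hit != cache_.end()) {
            return hit->second;
        }
        auto found = raw_.find(id);
        assert(found != raw_.end());  // unknown ids are out of scope here
        ++reads_;
        auto obj = std::make_shared<ToyObject>(ToyObject{found->second});
        cache_[id] = obj;
        return obj;
    }

    // Number of times an object was actually read from the "file".
    int reads() const { return reads_; }

private:
    std::map<int, std::string> raw_;
    std::map<int, std::shared_ptr<ToyObject>> cache_;
    int reads_ = 0;
};

class ToyHandle {
public:
    ToyHandle(ToyDocument& doc, int id) : doc_(&doc), id_(id) {}

    // The first access swaps the null pointer for the resolved object.
    const std::string& value() {
        if (!obj_) {
            obj_ = doc_->resolve(id_);
        }
        return obj_->value;
    }

private:
    ToyDocument* doc_;
    int id_;
    std::shared_ptr<ToyObject> obj_;  // null until first access
};
```

Two handles for the same object number end up sharing one `ToyObject`, and `reads()` stays at 1 however many times either handle is accessed, which is the "never read more than once" property the paragraph describes.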
  2079 + <para>
  2080 + If the requested object is inside of an object stream, the object
  2081 + stream itself is first read into memory. Then the tokenizer reads
  2082 + objects from the memory stream based on the offset information
  2083 + stored in the stream. Those individual objects are cached, after
  2084 + which the temporary buffer holding the object stream contents is
  2085 + discarded. In this way, the first time an object in an object
  2086 + stream is requested, all objects in the stream are cached.
1939 2087 </para>
1940 2088 <para>
1941 2089 The following example should clarify how
... ... @@ -1951,12 +2099,11 @@ outfile.pdf</option>
1951 2099 <listitem>
1952 2100 <para>
1953 2101 The <classname>QPDF</classname> class checks the beginning of
1954   - <filename>a.pdf</filename> for
1955   - <literal>%!PDF-1.[0-9]+</literal>. It then reads the cross
1956   - reference table mentioned at the end of the file, ensuring that
1957   - it is looking before the last <literal>%%EOF</literal>. After
1958   - getting to the <literal>trailer</literal> keyword, it invokes the
1959   - parser.
  2102 + <filename>a.pdf</filename> for a PDF header. It then reads the
  2103 + cross reference table mentioned at the end of the file,
  2104 + ensuring that it is looking before the last
  2105 + <literal>%%EOF</literal>. After getting to the
  2106 + <literal>trailer</literal> keyword, it invokes the parser.
1960 2107 </para>
1961 2108 </listitem>
1962 2109 <listitem>
... ...