Commit 419949574df4525c61ffe060ad1c63daf66e806c
1 parent 0b05111d
Add information about helper classes to the documentation
Showing 1 changed file with 239 additions and 92 deletions
manual/qpdf-manual.xml
| ... | ... | @@ -1751,53 +1751,54 @@ outfile.pdf</option> |
| 1751 | 1751 | </para> |
| 1752 | 1752 | <para> |
| 1753 | 1753 | In general, one should adhere strictly to a specification when |
| 1754 | - writing but be liberal in reading. This way, the product of our | |
| 1754 | + writing but be liberal in reading. This way, the product of our | |
| 1755 | 1755 | software will be accepted by the widest range of other programs, |
| 1756 | - and we will accept the widest range of input files. This library | |
| 1756 | + and we will accept the widest range of input files. This library | |
| 1757 | 1757 | attempts to conform to that philosophy whenever possible but also |
| 1758 | 1758 | aims to provide strict checking for people who want to validate |
| 1759 | - PDF files. If you don't want to see warnings and are trying to | |
| 1759 | + PDF files. If you don't want to see warnings and are trying to | |
| 1760 | 1760 | write something that is tolerant, you can call |
| 1761 | - <literal>setSuppressWarnings(true)</literal>. If you want to fail | |
| 1761 | + <literal>setSuppressWarnings(true)</literal>. If you want to fail | |
| 1762 | 1762 | on the first error, you can call |
| 1763 | - <literal>setAttemptRecovery(false)</literal>. The default | |
| 1764 | - behavior is to generating warnings for recoverable problems. Note | |
| 1765 | - that recovery will not always produce the desired results even if | |
| 1766 | - it is able to get through the file. Unlike most other PDF files | |
| 1767 | - that produce generic warnings such as “This file is | |
| 1763 | + <literal>setAttemptRecovery(false)</literal>. The default behavior | |
| 1764 | + is to generate warnings for recoverable problems. Note that | |
| 1765 | + recovery will not always produce the desired results even if it is | |
| 1766 | + able to get through the file. Unlike most other PDF tools, which | |
| 1767 | + produce generic warnings such as “This file is | |
| 1768 | 1768 | damaged,” qpdf generally issues a detailed error message |
| 1769 | - that would be most useful to a PDF developer. This is by design | |
| 1770 | - as there seems to be a shortage of PDF validation tools out | |
| 1771 | - there. (This was, in fact, one of the major motivations behind | |
| 1772 | - the initial creation of qpdf.) | |
| 1769 | + that would be most useful to a PDF developer. This is by design as | |
| 1770 | + there seems to be a shortage of PDF validation tools out there. | |
| 1771 | + This was, in fact, one of the major motivations behind the initial | |
| 1772 | + creation of qpdf. | |
| 1773 | 1773 | </para> |
| 1774 | 1774 | </sect1> |
| 1775 | 1775 | <sect1 id="ref.design-goals"> |
| 1776 | 1776 | <title>Design Goals</title> |
| 1777 | 1777 | <para> |
| 1778 | 1778 | The QPDF package includes support for reading and rewriting PDF |
| 1779 | - files. It aims to hide from the user details involving object | |
| 1779 | + files. It aims to hide from the user details involving object | |
| 1780 | 1780 | locations, modified (appended) PDF files, the |
| 1781 | 1781 | directness/indirectness of objects, and stream filters including |
| 1782 | - encryption. It does not aim to hide knowledge of the object | |
| 1783 | - hierarchy or content stream contents. Put another way, a user of | |
| 1782 | + encryption. It does not aim to hide knowledge of the object | |
| 1783 | + hierarchy or content stream contents. Put another way, a user of | |
| 1784 | 1784 | the qpdf library is expected to have knowledge about how PDF files |
| 1785 | 1785 | work, but is not expected to have to keep track of bookkeeping |
| 1786 | 1786 | details such as file positions. |
| 1787 | 1787 | </para> |
| 1788 | 1788 | <para> |
| 1789 | 1789 | A user of the library never has to care whether an object is |
| 1790 | - direct or indirect. All access to objects deals with this | |
| 1791 | - transparently. All memory management details are also handled by | |
| 1792 | - the library. | |
| 1790 | + direct or indirect, though it is possible to determine whether an | |
| 1791 | + object is direct or not if this information is needed. All access | |
| 1792 | + to objects deals with this transparently. All memory management | |
| 1793 | + details are also handled by the library. | |
| 1793 | 1794 | </para> |
| 1794 | 1795 | <para> |
| 1795 | 1796 | The <classname>PointerHolder</classname> object is used internally |
| 1796 | - by the library to deal with memory management. This is basically | |
| 1797 | - a smart pointer object very similar in spirit to the Boost | |
| 1798 | - library's <classname>shared_ptr</classname> object, but predating | |
| 1799 | - it by several years. This library also makes use of a technique | |
| 1800 | - for giving fine-grained access to methods in one class to other | |
| 1797 | + by the library to deal with memory management. This is basically a | |
| 1798 | + smart pointer object very similar in spirit to C++11's | |
| 1799 | + <classname>std::shared_ptr</classname> object, but predating it by | |
| 1800 | + several years. This library also makes use of a technique for | |
| 1801 | + giving fine-grained access to methods in one class to other | |
| 1801 | 1802 | classes by using public subclasses with friends and only private |
| 1802 | 1803 | members that in turn call private methods of the containing class. |
| 1803 | 1804 | See <classname>QPDFObjectHandle::Factory</classname> as an |
| ... | ... | @@ -1810,29 +1811,20 @@ outfile.pdf</option> |
| 1810 | 1811 | files. |
| 1811 | 1812 | </para> |
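The friend-based access technique described earlier in this section (a public nested class that has only private members and names its permitted callers as friends) can be sketched as follows. This is a minimal illustration, not qpdf's code: `Handle` and `Document` are hypothetical stand-ins, and the only thing taken from the manual is the shape of the pattern.

```cpp
#include <cassert>

class Document;  // hypothetical: the one class allowed to create handles

class Handle
{
  public:
    // Public nested class with only private members. Because Document is
    // its sole friend, only Document can reach Handle's private constructor
    // through it. Nothing else in Handle is exposed to Document.
    class Factory
    {
        friend class Document;

      private:
        static Handle
        make(int id)
        {
            return Handle(id);  // nested classes may use private members
        }
    };

    int
    id() const
    {
        return id_;
    }

  private:
    explicit Handle(int id) : id_(id) {}
    int id_;
};

class Document
{
  public:
    Handle
    get(int id)
    {
        return Handle::Factory::make(id);
    }
};
```

Compared with declaring `Document` a friend of `Handle` directly, this grants access to one method rather than to every private member.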
| 1812 | 1813 | <para> |
| 1813 | - <classname>QPDFObject</classname> is the basic PDF Object class. | |
| 1814 | - It is an abstract base class from which are derived classes for | |
| 1815 | - each type of PDF object. Clients do not interact with Objects | |
| 1816 | - directly but instead interact with | |
| 1817 | - <classname>QPDFObjectHandle</classname>. | |
| 1818 | - </para> | |
| 1819 | - <para> | |
| 1820 | - <classname>QPDFObjectHandle</classname> contains | |
| 1821 | - <classname>PointerHolder<QPDFObject></classname> and | |
| 1822 | - includes accessor methods that are type-safe proxies to the | |
| 1823 | - methods of the derived object classes as well as methods for | |
| 1824 | - querying object types. They can be passed around by value, | |
| 1825 | - copied, stored in containers, etc. with very low overhead. | |
| 1826 | - Instances of <classname>QPDFObjectHandle</classname> always | |
| 1827 | - contain a reference back to the <classname>QPDF</classname> object | |
| 1828 | - from which they were created. A | |
| 1814 | + The primary class for interacting with PDF objects is | |
| 1815 | + <classname>QPDFObjectHandle</classname>. Instances of this class | |
| 1816 | + can be passed around by value, copied, stored in containers, etc. | |
| 1817 | + with very low overhead. Instances of | |
| 1818 | + <classname>QPDFObjectHandle</classname> created by reading from a | |
| 1819 | + file will always contain a reference back to the | |
| 1820 | + <classname>QPDF</classname> object from which they were created. A | |
| 1829 | 1821 | <classname>QPDFObjectHandle</classname> may be direct or indirect. |
| 1830 | 1822 | If indirect, the <classname>QPDFObject</classname> the |
| 1831 | 1823 | <classname>PointerHolder</classname> initially points to is a null |
| 1832 | - pointer. In this case, the first attempt to access the underlying | |
| 1824 | + pointer. In this case, the first attempt to access the underlying | |
| 1833 | 1825 | <classname>QPDFObject</classname> will result in the |
| 1834 | 1826 | <classname>QPDFObject</classname> being resolved via a call to the |
| 1835 | - referenced <classname>QPDF</classname> instance. This makes it | |
| 1827 | + referenced <classname>QPDF</classname> instance. This makes it | |
| 1836 | 1828 | essentially impossible to make coding errors in which certain |
| 1837 | 1829 | things will work for some PDF files and not for others based on |
| 1838 | 1830 | which objects are direct and which objects are indirect. |
| ... | ... | @@ -1849,48 +1841,6 @@ outfile.pdf</option> |
| 1849 | 1841 | <filename>QPDFObjectHandle.hh</filename> for details. |
| 1850 | 1842 | </para> |
| 1851 | 1843 | <para> |
| 1852 | - When the <classname>QPDF</classname> class creates a new object, | |
| 1853 | - it dynamically allocates the appropriate type of | |
| 1854 | - <classname>QPDFObject</classname> and immediately hands the | |
| 1855 | - pointer to an instance of <classname>QPDFObjectHandle</classname>. | |
| 1856 | - The parser reads a token from the current file position. If the | |
| 1857 | - token is a not either a dictionary or array opener, an object is | |
| 1858 | - immediately constructed from the single token and the parser | |
| 1859 | - returns. Otherwise, the parser is invoked recursively in a | |
| 1860 | - special mode in which it accumulates objects until it finds a | |
| 1861 | - balancing closer. During this process, the | |
| 1862 | - “<literal>R</literal>” keyword is recognized and an | |
| 1863 | - indirect <classname>QPDFObjectHandle</classname> may be | |
| 1864 | - constructed. | |
| 1865 | - </para> | |
| 1866 | - <para> | |
| 1867 | - The <function>QPDF::resolve()</function> method, which is used to | |
| 1868 | - resolve an indirect object, may be invoked from the | |
| 1869 | - <classname>QPDFObjectHandle</classname> class. It first checks a | |
| 1870 | - cache to see whether this object has already been read. If not, | |
| 1871 | - it reads the object from the PDF file and caches it. It the | |
| 1872 | - returns the resulting <classname>QPDFObjectHandle</classname>. | |
| 1873 | - The calling object handle then replaces its | |
| 1874 | - <classname>PointerHolder<QDFObject></classname> with the one | |
| 1875 | - from the newly returned <classname>QPDFObjectHandle</classname>. | |
| 1876 | - In this way, only a single copy of any direct object need exist | |
| 1877 | - and clients can access objects transparently without knowing | |
| 1878 | - caring whether they are direct or indirect objects. Additionally, | |
| 1879 | - no object is ever read from the file more than once. That means | |
| 1880 | - that only the portions of the PDF file that are actually needed | |
| 1881 | - are ever read from the input file, thus allowing the qpdf package | |
| 1882 | - to take advantage of this important design goal of PDF files. | |
| 1883 | - </para> | |
| 1884 | - <para> | |
| 1885 | - If the requested object is inside of an object stream, the object | |
| 1886 | - stream itself is first read into memory. Then the tokenizer reads | |
| 1887 | - objects from the memory stream based on the offset information | |
| 1888 | - stored in the stream. Those individual objects are cached, after | |
| 1889 | - which the temporary buffer holding the object stream contents are | |
| 1890 | - discarded. In this way, the first time an object in an object | |
| 1891 | - stream is requested, all objects in the stream are cached. | |
| 1892 | - </para> | |
| 1893 | - <para> | |
| 1894 | 1844 | An instance of <classname>QPDF</classname> is constructed by using |
| 1895 | 1845 | the class's default constructor. If desired, the |
| 1896 | 1846 | <classname>QPDF</classname> object may be configured with various |
| ... | ... | @@ -1934,8 +1884,206 @@ outfile.pdf</option> |
| 1934 | 1884 | <para> |
| 1935 | 1885 | There are some convenience routines for very common operations |
| 1936 | 1886 | such as walking the page tree and returning a vector of all page |
| 1937 | - objects. For full details, please see the header file | |
| 1938 | - <filename>QPDF.hh</filename>. | |
| 1887 | + objects. For full details, please see the header files | |
| 1888 | + <filename>QPDF.hh</filename> and | |
| 1889 | + <filename>QPDFObjectHandle.hh</filename>. There are also some | |
| 1890 | + additional helper classes that provide higher level API functions | |
| 1891 | + for certain document constructions. These are discussed in <xref | |
| 1892 | + linkend="ref.helper-classes"/>. | |
| 1893 | + </para> | |
| 1894 | + </sect1> | |
| 1895 | + <sect1 id="ref.helper-classes"> | |
| 1896 | + <title>Helper Classes</title> | |
| 1897 | + <para> | |
| 1898 | + QPDF version 8.1 introduced the concept of helper classes. Helper | |
| 1899 | + classes are intended to contain higher level APIs that allow | |
| 1900 | + developers to work with certain document constructs at an | |
| 1901 | + abstraction level above that of | |
| 1902 | + <classname>QPDFObjectHandle</classname> while staying true to | |
| 1903 | + qpdf's philosophy of not hiding document structure from the | |
| 1904 | + developer. As with qpdf in general, the goal is to take away some of | |
| 1905 | + the more tedious bookkeeping aspects of working with PDF files, | |
| 1906 | + not to remove the need for the developer to understand how the PDF | |
| 1907 | + construction in question works. The driving factor behind the | |
| 1908 | + creation of helper classes was to allow the evolution of higher | |
| 1909 | + level interfaces in qpdf without polluting the interfaces of the | |
| 1910 | + main top-level classes <classname>QPDF</classname> and | |
| 1911 | + <classname>QPDFObjectHandle</classname>. | |
| 1912 | + </para> | |
| 1913 | + <para> | |
| 1914 | + There are two kinds of helper classes: | |
| 1915 | + <emphasis>document</emphasis> helpers and | |
| 1916 | + <emphasis>object</emphasis> helpers. Document helpers are | |
| 1917 | + constructed with a reference to a <classname>QPDF</classname> | |
| 1918 | + object and provide methods for working with structures that are at | |
| 1919 | + the document level. Object helpers are constructed with an | |
| 1920 | + instance of a <classname>QPDFObjectHandle</classname> and provide | |
| 1921 | + methods for working with specific types of objects. | |
| 1922 | + </para> | |
| 1923 | + <para> | |
| 1924 | + Examples of document helpers include | |
| 1925 | + <classname>QPDFPageDocumentHelper</classname>, which contains | |
| 1926 | + methods for operating on the document's page trees, such as | |
| 1927 | + enumerating all pages of a document and adding and removing pages; | |
| 1928 | + and <classname>QPDFAcroFormDocumentHelper</classname>, which | |
| 1929 | + contains document-level methods related to interactive forms, such | |
| 1930 | + as enumerating form fields and creating mappings between form | |
| 1931 | + fields and annotations. | |
| 1932 | + </para> | |
| 1933 | + <para> | |
| 1934 | + Examples of object helpers include | |
| 1935 | + <classname>QPDFPageObjectHelper</classname> for performing | |
| 1936 | + operations on pages such as page rotation and some operations on | |
| 1937 | + content streams, <classname>QPDFFormFieldObjectHelper</classname> | |
| 1938 | + for performing operations related to interactive form fields, and | |
| 1939 | + <classname>QPDFAnnotationObjectHelper</classname> for working with | |
| 1940 | + annotations. | |
| 1941 | + </para> | |
| 1942 | + <para> | |
| 1943 | + It is always possible to retrieve the underlying | |
| 1944 | + <classname>QPDF</classname> reference from a document helper and | |
| 1945 | + the underlying <classname>QPDFObjectHandle</classname> reference | |
| 1946 | + from an object helper. Helpers are designed to be helpers, not | |
| 1947 | + wrappers. The intention is that, in general, it is safe to freely | |
| 1948 | + intermix operations that use helpers with operations that use the | |
| 1949 | + underlying objects. Document and object helpers do not attempt to | |
| 1950 | + provide a complete interface for working with the things they are | |
| 1951 | + helping with, nor do they attempt to encapsulate underlying | |
| 1952 | + structures. They just provide a few methods to help with | |
| 1953 | + error-prone, repetitive, or complex tasks. In some cases, a helper | |
| 1954 | + object may cache some information that is expensive to gather. In | |
| 1955 | + such cases, the helper classes are implemented so that their own | |
| 1956 | + methods keep the cache consistent, and the header file will | |
| 1957 | + provide a method to invalidate the cache and a description of what | |
| 1958 | + kinds of operations would make the cache invalid. If in doubt, you | |
| 1959 | + can always discard a helper class and create a new one with the | |
| 1960 | + same underlying objects, which will ensure that you have discarded | |
| 1961 | + any stale information. | |
| 1962 | + </para> | |
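The "helpers, not wrappers" idea above can be illustrated with a toy model. Everything here is an assumption made for the sketch: `Object`, `Document`, `PageHelper`, and `PagesHelper` are hypothetical names, not qpdf classes, and the real helper interfaces are in the header files. The sketch shows only the shape: a document helper built from a document, an object helper built from an object, both exposing what they were built from, and helper calls mixing freely with direct edits.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for a PDF object and a document.
struct Object
{
    std::map<std::string, int> dict;
};

struct Document
{
    std::vector<Object> pages;
};

// Object helper: constructed from an object, adds one convenience
// method, and still exposes the underlying object.
class PageHelper
{
  public:
    explicit PageHelper(Object& obj) : obj_(obj) {}

    // Rotate relative to the current value, normalized to [0, 360).
    void
    rotate(int degrees)
    {
        int& r = obj_.dict["/Rotate"];
        r = (((r + degrees) % 360) + 360) % 360;
    }

    Object&
    object()
    {
        return obj_;
    }

  private:
    Object& obj_;
};

// Document helper: constructed from a document, returns object helpers.
class PagesHelper
{
  public:
    explicit PagesHelper(Document& doc) : doc_(doc) {}

    std::vector<PageHelper>
    allPages()
    {
        std::vector<PageHelper> result;
        for (auto& page : doc_.pages) {
            result.emplace_back(page);
        }
        return result;
    }

    Document&
    document()
    {
        return doc_;
    }

  private:
    Document& doc_;
};
```

Because the helpers hold references rather than copies, a change made directly to the underlying object is visible to the helper, and the reverse is also true. A helper that cached derived data would additionally need the invalidation method the text describes.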
| 1963 | + <para> | |
| 1964 | + By convention, document helpers are called | |
| 1965 | + <classname>QPDFSomethingDocumentHelper</classname> and are derived | |
| 1966 | + from <classname>QPDFDocumentHelper</classname>, and object helpers | |
| 1967 | + are called <classname>QPDFSomethingObjectHelper</classname> and | |
| 1968 | + are derived from <classname>QPDFObjectHelper</classname>. For | |
| 1969 | + details on specific helpers, please see their header files. You | |
| 1970 | + can find them by looking at | |
| 1971 | + <filename>include/qpdf/QPDF*DocumentHelper.hh</filename> and | |
| 1972 | + <filename>include/qpdf/QPDF*ObjectHelper.hh</filename>. | |
| 1973 | + </para> | |
| 1974 | + <para> | |
| 1975 | + In order to avoid creation of circular dependencies, the following | |
| 1976 | + general guidelines are followed with helper classes: | |
| 1977 | + <itemizedlist> | |
| 1978 | + <listitem> | |
| 1979 | + <para> | |
| 1980 | + Core class interfaces do not know about helper classes. For | |
| 1981 | + example, no methods of <classname>QPDF</classname> or | |
| 1982 | + <classname>QPDFObjectHandle</classname> will include helper | |
| 1983 | + classes in their interfaces. | |
| 1984 | + </para> | |
| 1985 | + </listitem> | |
| 1986 | + <listitem> | |
| 1987 | + <para> | |
| 1988 | + Object helpers will usually not use document | |
| 1989 | + helpers in their interfaces. This is because it is much more | |
| 1990 | + useful for document helpers to have methods that return object | |
| 1991 | + helpers. Most operations in PDF files start at the document | |
| 1992 | + level and go from there to the object level rather than the | |
| 1993 | + other way around. It can sometimes be useful to map back from | |
| 1994 | + object-level structures to document-level structures. If there | |
| 1995 | + is a desire to do this, it will generally be provided by a | |
| 1996 | + method in the document helper class. | |
| 1997 | + </para> | |
| 1998 | + </listitem> | |
| 1999 | + <listitem> | |
| 2000 | + <para> | |
| 2001 | + Most of the time, object helpers don't know about other object | |
| 2002 | + helpers. However, in some cases, one type of object may be a | |
| 2003 | + container for another type of object, in which case it may make | |
| 2004 | + sense for the outer object to know about the inner object. For | |
| 2005 | + example, there are methods in the | |
| 2006 | + <classname>QPDFPageObjectHelper</classname> that know about | |
| 2007 | + <classname>QPDFAnnotationObjectHelper</classname> because | |
| 2008 | + references to annotations are contained in page dictionaries. | |
| 2009 | + </para> | |
| 2010 | + </listitem> | |
| 2011 | + <listitem> | |
| 2012 | + <para> | |
| 2013 | + Any helper or core library class may use helpers in their | |
| 2014 | + implementations. | |
| 2015 | + </para> | |
| 2016 | + </listitem> | |
| 2017 | + </itemizedlist> | |
| 2018 | + </para> | |
| 2019 | + <para> | |
| 2020 | + Prior to qpdf version 8.1, higher level interfaces were added as | |
| 2021 | + “convenience functions” in either | |
| 2022 | + <classname>QPDF</classname> or | |
| 2023 | + <classname>QPDFObjectHandle</classname>. For compatibility, older | |
| 2024 | + convenience functions for operating with pages will remain in | |
| 2025 | + those classes even as alternatives are provided in helper classes. | |
| 2026 | + Going forward, new higher level interfaces will be provided using | |
| 2027 | + helper classes. | |
| 2028 | + </para> | |
| 2029 | + </sect1> | |
| 2030 | + <sect1 id="ref.implementation-notes"> | |
| 2031 | + <title>Implementation Notes</title> | |
| 2032 | + <para> | |
| 2033 | + This section contains a few notes about QPDF's internal | |
| 2034 | + implementation, particularly around what it does when it first | |
| 2035 | + processes a file. This section is a bit of a simplification of | |
| 2036 | + what it actually does, but it could serve as a starting point to | |
| 2037 | + someone trying to understand the implementation. There is nothing | |
| 2038 | + in this section that you need to know to use the qpdf library. | |
| 2039 | + </para> | |
| 2040 | + <para> | |
| 2041 | + <classname>QPDFObject</classname> is the basic PDF Object class. | |
| 2042 | + It is an abstract base class from which are derived classes for | |
| 2043 | + each type of PDF object. Clients do not interact with Objects | |
| 2044 | + directly but instead interact with | |
| 2045 | + <classname>QPDFObjectHandle</classname>. | |
| 2046 | + </para> | |
| 2047 | + <para> | |
| 2048 | + When the <classname>QPDF</classname> class creates a new object, | |
| 2049 | + it dynamically allocates the appropriate type of | |
| 2050 | + <classname>QPDFObject</classname> and immediately hands the | |
| 2051 | + pointer to an instance of <classname>QPDFObjectHandle</classname>. | |
| 2052 | + The parser reads a token from the current file position. If the | |
| 2053 | + token is neither a dictionary nor an array opener, an object is | |
| 2054 | + immediately constructed from the single token and the parser | |
| 2055 | + returns. Otherwise, the parser iterates in a special mode in which | |
| 2056 | + it accumulates objects until it finds a balancing closer. During | |
| 2057 | + this process, the “<literal>R</literal>” keyword is | |
| 2058 | + recognized and an indirect <classname>QPDFObjectHandle</classname> | |
| 2059 | + may be constructed. | |
| 2060 | + </para> | |
| 2061 | + <para> | |
| 2062 | + The <function>QPDF::resolve()</function> method, which is used to | |
| 2063 | + resolve an indirect object, may be invoked from the | |
| 2064 | + <classname>QPDFObjectHandle</classname> class. It first checks a | |
| 2065 | + cache to see whether this object has already been read. If not, | |
| 2066 | + it reads the object from the PDF file and caches it. It then | |
| 2067 | + returns the resulting <classname>QPDFObjectHandle</classname>. | |
| 2068 | + The calling object handle then replaces its | |
| 2069 | + <classname>PointerHolder<QPDFObject></classname> with the one | |
| 2070 | + from the newly returned <classname>QPDFObjectHandle</classname>. | |
| 2071 | + In this way, only a single copy of any direct object need exist | |
| 2072 | + and clients can access objects transparently without knowing or | |
| 2073 | + caring whether they are direct or indirect objects. Additionally, | |
| 2074 | + no object is ever read from the file more than once. That means | |
| 2075 | + that only the portions of the PDF file that are actually needed | |
| 2076 | + are ever read from the input file, thus allowing the qpdf package | |
| 2077 | + to take advantage of this important design goal of PDF files. | |
| 2078 | + </para> | |
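The resolve-and-cache behavior described in the paragraph above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not qpdf's implementation: `Store`, `Handle`, and `Object` are hypothetical names, `std::shared_ptr` stands in for `PointerHolder`, and a counter stands in for actually reading from the file.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a parsed PDF object.
struct Object
{
    std::string value;
};

// Plays the role of the owning document: resolves ids and caches results.
class Store
{
  public:
    std::shared_ptr<Object>
    resolve(int id)
    {
        auto it = cache_.find(id);
        if (it != cache_.end()) {
            return it->second;  // already read once; reuse it
        }
        ++reads_;  // stands in for seeking to and parsing the object
        auto obj = std::make_shared<Object>(Object{"object " + std::to_string(id)});
        cache_[id] = obj;
        return obj;
    }

    int
    reads() const
    {
        return reads_;
    }

  private:
    std::map<int, std::shared_ptr<Object>> cache_;
    int reads_ = 0;
};

// Plays the role of an indirect handle: holds a null pointer until
// the first access, then keeps the resolved pointer.
class Handle
{
  public:
    Handle(Store& store, int id) : store_(&store), id_(id) {}

    const std::string&
    value()
    {
        if (!obj_) {
            obj_ = store_->resolve(id_);  // first access triggers resolution
        }
        return obj_->value;
    }

  private:
    Store* store_;
    int id_;
    std::shared_ptr<Object> obj_;  // null until resolved
};
```

Two handles that refer to the same id end up sharing one object, and the simulated read happens at most once per id no matter how many handles or accesses there are.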
| 2079 | + <para> | |
| 2080 | + If the requested object is inside of an object stream, the object | |
| 2081 | + stream itself is first read into memory. Then the tokenizer reads | |
| 2082 | + objects from the memory stream based on the offset information | |
| 2083 | + stored in the stream. Those individual objects are cached, after | |
| 2084 | + which the temporary buffer holding the object stream contents is | |
| 2085 | + discarded. In this way, the first time an object in an object | |
| 2086 | + stream is requested, all objects in the stream are cached. | |
| 1939 | 2087 | </para> |
| 1940 | 2088 | <para> |
| 1941 | 2089 | The following example should clarify how |
| ... | ... | @@ -1951,12 +2099,11 @@ outfile.pdf</option> |
| 1951 | 2099 | <listitem> |
| 1952 | 2100 | <para> |
| 1953 | 2101 | The <classname>QPDF</classname> class checks the beginning of |
| 1954 | - <filename>a.pdf</filename> for | |
| 1955 | - <literal>%!PDF-1.[0-9]+</literal>. It then reads the cross | |
| 1956 | - reference table mentioned at the end of the file, ensuring that | |
| 1957 | - it is looking before the last <literal>%%EOF</literal>. After | |
| 1958 | - getting to <literal>trailer</literal> keyword, it invokes the | |
| 1959 | - parser. | |
| 2102 | + <filename>a.pdf</filename> for a PDF header. It then reads the | |
| 2103 | + cross reference table mentioned at the end of the file, | |
| 2104 | + ensuring that it is looking before the last | |
| 2105 | + <literal>%%EOF</literal>. After getting to the | |
| 2106 | + <literal>trailer</literal> keyword, it invokes the parser. | |
| 1960 | 2107 | </para> |
| 1961 | 2108 | </listitem> |
| 1962 | 2109 | <listitem> | ... | ... |