Commit 24aeb9ae2227c6b55297d9a946bf82f31656a685

Authored by Jay Berkenbilt
1 parent 86f9b4c4

Document json support

ChangeLog
  1 +2018-12-22 Jay Berkenbilt <ejb@ql.org>
  2 +
  3 + * Add new options --json, --json-key, and --json-object to
  4 + generate a json representation of the PDF file. This is described
  5 + in more depth in the manual. You can also run qpdf --json-help to
  6 + get a description of the json format.
  7 +
1 8 2018-12-21 Jay Berkenbilt <ejb@ql.org>
2 9  
3 10 * Allow --show-object=trailer for showing the document trailer.
... ...
manual/qpdf-manual.xml
... ... @@ -1515,7 +1515,7 @@ outfile.pdf&lt;/option&gt;
1515 1515 </listitem>
1516 1516 </varlistentry>
1517 1517 <varlistentry>
1518   - <term><option>--show-object=obj[,gen]</option></term>
  1518 + <term><option>--show-object=trailer|obj[,gen]</option></term>
1519 1519 <listitem>
1520 1520 <para>
1521 1521 Show the contents of the given object. This is especially
... ... @@ -1581,6 +1581,44 @@ outfile.pdf&lt;/option&gt;
1581 1581 </listitem>
1582 1582 </varlistentry>
1583 1583 <varlistentry>
  1584 + <term><option>--json</option></term>
  1585 + <listitem>
  1586 + <para>
  1587 + Generate a json representation of the file. This is described
  1588 + in depth in <xref linkend="ref.json"/>.
  1589 + </para>
  1590 + </listitem>
  1591 + </varlistentry>
  1592 + <varlistentry>
  1593 + <term><option>--json-help</option></term>
  1594 + <listitem>
  1595 + <para>
  1596 + Describe the format of the json output.
  1597 + </para>
  1598 + </listitem>
  1599 + </varlistentry>
  1600 + <varlistentry>
  1601 + <term><option>--json-key=key</option></term>
  1602 + <listitem>
  1603 + <para>
  1604 + This option is repeatable. If specified, only top-level keys
  1605 + specified will be included in the json output. If not
  1606 + specified, all keys will be shown.
  1607 + </para>
  1608 + </listitem>
  1609 + </varlistentry>
  1610 + <varlistentry>
  1611 + <term><option>--json-object=trailer|obj[,gen]</option></term>
  1612 + <listitem>
  1613 + <para>
  1614 + This option is repeatable. If specified, only specified
  1615 + objects will be shown in the
  1616 + &ldquo;<literal>objects</literal>&rdquo; key of the json
  1617 + output. If absent, all objects will be shown.
  1618 + </para>
  1619 + </listitem>
  1620 + </varlistentry>
  1621 + <varlistentry>
1584 1622 <term><option>--check</option></term>
1585 1623 <listitem>
1586 1624 <para>
... ... @@ -1777,6 +1815,8 @@ outfile.pdf&lt;/option&gt;
1777 1815 </chapter>
1778 1816 <chapter id="ref.using-library">
1779 1817 <title>Using the QPDF Library</title>
  1818 + <sect1 id="ref.using.from-cxx">
  1819 + <title>Using QPDF from C++</title>
1780 1820 <para>
1781 1821 The source tree for the qpdf package has an
1782 1822 <filename>examples</filename> directory that contains a few
... ... @@ -1808,6 +1848,234 @@ outfile.pdf&lt;/option&gt;
1808 1848 time. Multiple threads may simultaneously work with different
1809 1849 instances of these and all other QPDF objects.
1810 1850 </para>
  1851 + </sect1>
  1852 + <sect1 id="ref.using.other-languages">
  1853 + <title>Using QPDF from other languages</title>
  1854 + <para>
  1855 + The qpdf library is implemented in C++, which makes it hard to use
  1856 + directly in other languages. There are a few things that can help.
  1857 + </para>
  1858 + <variablelist>
  1859 + <varlistentry>
  1860 + <term>&ldquo;C&rdquo;</term>
  1861 + <listitem>
  1862 + <para>
  1863 + The qpdf library includes a &ldquo;C&rdquo; language interface
  1864 + that provides a subset of the overall capabilities. The header
  1865 + file <filename>qpdf/qpdf-c.h</filename> includes information
  1866 + about its use. As long as you use a C++ linker, you can link C
  1867 + programs with qpdf and use the C API. For languages that can
  1868 + directly load methods from a shared library, the C API can also
  1869 + be useful. People have reported success using the C API from
  1870 + other languages on Windows by directly calling functions in the
  1871 + DLL.
  1872 + </para>
  1873 + </listitem>
  1874 + </varlistentry>
  1875 + <varlistentry>
  1876 + <term>Python</term>
  1877 + <listitem>
  1878 + <para>
  1879 + A Python module called <ulink
  1880 + url="https://pypi.org/project/pikepdf/">pikepdf</ulink>
  1881 + provides a clean and highly functional set of Python bindings
  1882 + to the qpdf library. Using pikepdf, you can work with PDF files
  1883 + in a natural way and combine qpdf's capabilities with other
  1884 + functionality provided by Python's rich standard library and
  1885 + available modules.
  1886 + </para>
  1887 + </listitem>
  1888 + </varlistentry>
  1889 + <varlistentry>
  1890 + <term>Other Languages</term>
  1891 + <listitem>
  1892 + <para>
  1893 + Starting with version 8.3.0, the <command>qpdf</command>
  1894 + command-line tool can produce a json representation of the PDF
  1895 + file's non-content data. This can facilitate interacting
  1896 + programmatically with PDF files through qpdf's command line
  1897 + interface. For more information, please see <xref
  1898 + linkend="ref.json"/>.
  1899 + </para>
  1900 + </listitem>
  1901 + </varlistentry>
  1902 + </variablelist>
  1903 + </sect1>
  1904 + </chapter>
  1905 + <chapter id="ref.json">
  1906 + <title>QPDF JSON</title>
  1907 + <para>
  1908 + Beginning with qpdf version 8.3.0, the <command>qpdf</command>
  1909 + command-line program can produce a json representation of the
  1910 + non-content data in a PDF file. It includes a dump in json format
  1911 + of all objects in the PDF file excluding the content of streams.
  1912 + This json representation makes it very easy to look in detail at
  1913 + the structure of a given PDF file, and it also provides a great way
  1914 + to work with PDF files programmatically from the command-line in
  1915 + languages that can't call or link with the qpdf library directly.
  1916 + Note that stream data can be extracted from PDF files using other
  1917 + qpdf command-line options.
  1918 + </para>
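As a rough illustration of the workflow described above, the following Python sketch runs the command-line tool and parses its output. It uses only the `--json`, `--json-key`, and `--json-object` options documented in this commit. The helper names `build_qpdf_json_command` and `load_qpdf_json` are hypothetical, and the sketch assumes that `qpdf` is on the `PATH` and writes the json to standard output.

```python
import json
import subprocess


def build_qpdf_json_command(pdf_path, keys=(), objects=()):
    """Build the argument list for a qpdf --json invocation (hypothetical helper)."""
    cmd = ["qpdf", "--json"]
    # Both --json-key and --json-object are repeatable options.
    cmd += [f"--json-key={key}" for key in keys]
    cmd += [f"--json-object={obj}" for obj in objects]
    cmd.append(pdf_path)
    return cmd


def load_qpdf_json(pdf_path, keys=(), objects=()):
    """Run qpdf and parse whatever it prints on standard output as json."""
    result = subprocess.run(
        build_qpdf_json_command(pdf_path, keys, objects),
        capture_output=True,
        # Assumption: a non-zero exit status (for example, warnings) may still
        # come with usable output, so it is not treated as fatal here.
        check=False,
    )
    # json.loads accepts UTF-8 encoded bytes directly.
    return json.loads(result.stdout)
```

A caller might then use `load_qpdf_json("in.pdf", keys=["pagelabels"])` and read the `version` key of the result before relying on any other field.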
  1919 + <para>
  1920 + The qpdf json representation includes a json serialization of the
  1921 + raw objects in the PDF file as well as some computed information in
  1922 + a more easily extracted format. QPDF provides some guarantees about
  1923 + its json format. These guarantees are designed to simplify the
  1924 + experience of a developer working with the JSON format.
  1925 + <variablelist>
  1926 + <varlistentry>
  1927 + <term>Compatibility</term>
  1928 + <listitem>
  1929 + <para>
  1930 + The top-level json object output is a dictionary. The json
  1931 + output contains various nested dictionaries and arrays. With
  1932 + the exception of dictionaries that are populated by the fields
  1933 + of objects from the file, all instances of a dictionary are
  1934 + guaranteed to have exactly the same keys. Future versions of
  1935 + qpdf are free to add additional keys but not to remove keys or
  1936 + change the type of object that a key points to. The qpdf
  1937 + program validates this guarantee, and in the unlikely event
  1938 + that a bug in qpdf should cause it to generate data that
  1939 + doesn't conform to this rule, it will ask you to file a bug
  1940 + report.
  1941 + </para>
  1942 + <para>
  1943 + The top-level json structure contains a
  1944 + &ldquo;<literal>version</literal>&rdquo; key whose value is
  1945 + a simple integer. The value of the <literal>version</literal> key
  1946 + will be incremented if a non-compatible change is made. A
  1947 + non-compatible change would be any change that involves removal
  1948 + of a key, a change to the format of data pointed to by a key,
  1949 + or a semantic change that requires a different interpretation
  1950 + of a previously existing key. A strong effort will be made to
  1951 + avoid breaking compatibility.
  1952 + </para>
  1953 + </listitem>
  1954 + </varlistentry>
  1955 + <varlistentry>
  1956 + <term>Documentation</term>
  1957 + <listitem>
  1958 + <para>
  1959 + The <command>qpdf</command> command can be invoked with the
  1960 + <option>--json-help</option> option. This will output a json
  1961 + structure that has the same structure as the json output that
  1962 + qpdf generates, except that each field in the help output is a
  1963 + description of the corresponding field in the json output. The
  1964 + specific guarantees are as follows:
  1965 + <itemizedlist>
  1966 + <listitem>
  1967 + <para>
  1968 + A dictionary in the help output means that the corresponding
  1969 + location in the actual json output is also a dictionary with
  1970 + exactly the same keys; that is, no keys present in help are
  1971 + absent in the real output, and no keys will be present in
  1972 + the real output that are not in help.
  1973 + </para>
  1974 + </listitem>
  1975 + <listitem>
  1976 + <para>
  1977 + A string in the help output is a description of the item
  1978 + that appears in the corresponding location of the actual
  1979 + output. The corresponding output can have any format.
  1980 + </para>
  1981 + </listitem>
  1982 + <listitem>
  1983 + <para>
  1984 + An array in the help output always contains a single
  1985 + element. It indicates that the corresponding location in the
  1986 + actual output is also an array, and that each element of the
  1987 + array has whatever format is implied by the single element
  1988 + of the help output's array.
  1989 + </para>
  1990 + </listitem>
  1991 + </itemizedlist>
  1992 + For example, the help output includes a
  1993 + &ldquo;<literal>pagelabels</literal>&rdquo; key whose value is
  1994 + an array of one element. That element is a dictionary with keys
  1995 + &ldquo;<literal>index</literal>&rdquo; and
  1996 + &ldquo;<literal>label</literal>&rdquo;. In addition to
  1997 + describing the meaning of those keys, this tells you that the
  1998 + actual json output will contain a <literal>pagelabels</literal>
  1999 + array, each of whose elements is a dictionary that contains an
  2000 + <literal>index</literal> key, a <literal>label</literal> key,
  2001 + and no other keys.
  2002 + </para>
  2003 + </listitem>
  2004 + </varlistentry>
  2005 + <varlistentry>
  2006 + <term>Directness and Simplicity</term>
  2007 + <listitem>
  2008 + <para>
  2009 + The json output contains the value of every object in the file,
  2010 + but it also contains some processed data. This is analogous to
  2011 + how qpdf's library interface works. The processed data is
  2012 + similar to the helper functions in that it allows you to look
  2013 + at certain aspects of the PDF file without having to understand
  2014 + all the nuances of the PDF specification, while the raw objects
  2015 + allow you to mine the PDF for anything that the higher-level
  2016 + interfaces are lacking.
  2017 + </para>
  2018 + </listitem>
  2019 + </varlistentry>
  2020 + </variablelist>
  2021 + </para>
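The three rules listed under "Documentation" can be read as a small recursive check. The sketch below is one possible reading, not qpdf's own validation code. The function name `conforms` is hypothetical, and the sketch ignores the stated exception for dictionaries that are populated by the fields of objects from the file, so it would reject those.

```python
def conforms(help_node, actual_node):
    """Check actual json output against the shape of the --json-help output."""
    if isinstance(help_node, dict):
        # Rule 1: a dictionary in help means a dictionary with exactly
        # the same keys in the actual output.
        if not isinstance(actual_node, dict):
            return False
        if set(help_node) != set(actual_node):
            return False
        return all(conforms(help_node[k], actual_node[k]) for k in help_node)
    if isinstance(help_node, list):
        # Rule 3: an array in help holds a single element that describes
        # every element of the corresponding array in the actual output.
        if len(help_node) != 1 or not isinstance(actual_node, list):
            return False
        return all(conforms(help_node[0], item) for item in actual_node)
    # Rule 2: a string in help is only a description; the corresponding
    # output can have any format.
    return True


# Shape taken from the pagelabels example in the text; the descriptions
# and values are made up for illustration.
help_shape = {"pagelabels": [{"index": "description", "label": "description"}]}
good = {"pagelabels": [{"index": 0, "label": {"/S": "/D"}}]}
bad = {"pagelabels": [{"index": 0, "label": {}, "extra": 1}]}
```

With these values, `conforms(help_shape, good)` is true, while `conforms(help_shape, bad)` is false because of the extra key.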
  2022 + <para>
  2023 + There are a few limitations to be aware of with the json structure:
  2024 + <itemizedlist>
  2025 + <listitem>
  2026 + <para>
  2027 + Strings, names, and indirect object references in the original
  2028 + PDF file are all converted to strings in the json
  2029 + representation. In the case of a &ldquo;normal&rdquo; PDF file,
  2030 + you can tell the difference because a name starts with a slash
  2031 + (<literal>/</literal>), and an indirect object reference looks
  2032 + like <literal>n n R</literal>, but if there were to be a string
  2033 + that looked like a name or indirect object reference, there
  2034 + would be no way to tell this from the json output. Note that
  2035 + there are certain cases where you know for sure what something
  2036 + is, such as knowing that dictionary keys in objects are always
  2037 + names and that certain things in the higher-level computed data
  2038 + are known to contain indirect object references.
  2039 + </para>
  2040 + </listitem>
  2041 + <listitem>
  2042 + <para>
  2043 + The json format doesn't support binary data very well. Mostly
  2044 + the details are not important, but they are presented here for
  2045 + information. When qpdf outputs a string in the json
  2046 + representation, it converts the string to UTF-8, assuming usual
  2047 + PDF string semantics. Specifically, if the original string is
  2048 + UTF-16, it is converted to UTF-8. Otherwise, it is assumed to
  2049 + have PDF doc encoding, and is converted to UTF-8 with that
  2050 + assumption. This causes strange things to happen to binary
  2051 + strings. For example, if you had the binary string
  2052 + <literal>&lt;038051&gt;</literal>, this would be output to the
  2053 + json as <literal>\u0003•Q</literal> because
  2054 + <literal>03</literal> is not a printable character and
  2055 + <literal>80</literal> is the bullet character in PDF doc
  2056 + encoding and is mapped to the Unicode value
  2057 + <literal>2022</literal>. Since <literal>51</literal> is
  2058 + <literal>Q</literal>, it is output as is. If you wanted to
  2059 + convert back from here to a binary string, you would have to
  2060 + recognize Unicode values whose code points are higher than
  2061 + <literal>0xFF</literal> and map those back to their
  2062 + corresponding PDF doc encoding characters. There is no way to
  2063 + tell the difference between a Unicode string that was originally
  2064 + encoded as UTF-16 and one that was converted from PDF doc
  2065 + encoding. In other words, it's best if you don't try to use the
  2066 + json format to extract binary strings from the PDF file, but if
  2067 + you really had to, it could be done. Note that qpdf's
  2068 + <option>--show-object</option> option does not have this
  2069 + limitation and will reveal the string as encoded in the original
  2070 + file.
  2071 + </para>
  2072 + </listitem>
  2073 + </itemizedlist>
  2074 + </para>
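Both limitations can be made concrete with a short Python sketch. The names `guess_kind` and `pdfdoc_bytes_to_text` are hypothetical, and the PDF doc encoding table is deliberately partial: it holds only the `80` to bullet mapping named in the text, and passing every other byte through unchanged is a simplifying assumption that does not hold for the whole encoding.

```python
import json
import re

# Shape of an indirect object reference as it appears in the json: "n n R".
_INDIRECT_REF = re.compile(r"\d+ \d+ R")


def guess_kind(value):
    """Guess what a json string stood for in the PDF; this is only a heuristic."""
    if _INDIRECT_REF.fullmatch(value):
        return "reference"
    if value.startswith("/"):
        return "name"
    return "string"


# Partial table (assumption): only the entry mentioned in the text.
_PDFDOC_TO_UNICODE = {0x80: "\u2022"}


def pdfdoc_bytes_to_text(data):
    """Map bytes to text the way the text describes for a non-UTF-16 string."""
    return "".join(_PDFDOC_TO_UNICODE.get(byte, chr(byte)) for byte in data)


# The binary string <038051> from the text, as it would be serialized.
encoded = json.dumps(pdfdoc_bytes_to_text(bytes.fromhex("038051")), ensure_ascii=False)
```

Note that `guess_kind` also reports "reference" for a genuine PDF string whose text happens to be `3 0 R`, which is exactly the ambiguity described in the first limitation.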
  2075 + <para>
  2076 + For specific details on the information provided in the json
  2077 + output, please run <command>qpdf --json-help</command>.
  2078 + </para>
1811 2079 </chapter>
1812 2080 <chapter id="ref.design">
1813 2081 <title>Design and Library Notes</title>
... ...