Commit 24aeb9ae2227c6b55297d9a946bf82f31656a685

Authored by Jay Berkenbilt
1 parent 86f9b4c4

Document json support

ChangeLog
  1 +2018-12-22 Jay Berkenbilt <ejb@ql.org>
  2 +
  3 + * Add new options --json, --json-key, and --json-object to
  4 + generate a json representation of the PDF file. This is described
  5 + in more depth in the manual. You can also run qpdf --json-help to
  6 + get a description of the json format.
  7 +
1 2018-12-21 Jay Berkenbilt <ejb@ql.org> 8 2018-12-21 Jay Berkenbilt <ejb@ql.org>
2 9
3 * Allow --show-object=trailer for showing the document trailer. 10 * Allow --show-object=trailer for showing the document trailer.
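A minimal sketch of how a script in another language might drive the options named in this ChangeLog entry. It uses only the flags listed above (--json, --json-key, --json-object). The helper names `build_qpdf_json_command` and `run_qpdf_json` are hypothetical, and the sketch assumes that qpdf is on PATH, that the input file follows the options, and that the json is written to standard output.

```python
import json
import subprocess


def build_qpdf_json_command(pdf_path, keys=(), objects=()):
    """Build an argv list for `qpdf --json`, repeating the filter options.

    Hypothetical helper: both --json-key and --json-object are documented
    as repeatable, so each requested value gets its own option.
    """
    argv = ["qpdf", "--json"]
    for key in keys:
        argv.append(f"--json-key={key}")
    for obj in objects:
        argv.append(f"--json-object={obj}")
    argv.append(pdf_path)
    return argv


def run_qpdf_json(pdf_path, keys=(), objects=()):
    """Run qpdf (assumed to be on PATH) and parse its standard output.

    check=True raises CalledProcessError whenever qpdf exits non-zero.
    """
    argv = build_qpdf_json_command(pdf_path, keys, objects)
    result = subprocess.run(
        argv, check=True, capture_output=True, encoding="utf-8"
    )
    return json.loads(result.stdout)
```

For example, `run_qpdf_json("in.pdf", keys=["objects"], objects=["trailer"])` would request only the "objects" key, limited to the trailer.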
manual/qpdf-manual.xml
@@ -1515,7 +1515,7 @@ outfile.pdf&lt;/option&gt; @@ -1515,7 +1515,7 @@ outfile.pdf&lt;/option&gt;
1515 </listitem> 1515 </listitem>
1516 </varlistentry> 1516 </varlistentry>
1517 <varlistentry> 1517 <varlistentry>
1518 - <term><option>--show-object=obj[,gen]</option></term> 1518 + <term><option>--show-object=trailer|obj[,gen]</option></term>
1519 <listitem> 1519 <listitem>
1520 <para> 1520 <para>
1521 Show the contents of the given object. This is especially 1521 Show the contents of the given object. This is especially
@@ -1581,6 +1581,44 @@ outfile.pdf&lt;/option&gt; @@ -1581,6 +1581,44 @@ outfile.pdf&lt;/option&gt;
1581 </listitem> 1581 </listitem>
1582 </varlistentry> 1582 </varlistentry>
1583 <varlistentry> 1583 <varlistentry>
  1584 + <term><option>--json</option></term>
  1585 + <listitem>
  1586 + <para>
  1587 + Generate a json representation of the file. This is described
  1588 + in depth in <xref linkend="ref.json"/>.
  1589 + </para>
  1590 + </listitem>
  1591 + </varlistentry>
  1592 + <varlistentry>
  1593 + <term><option>--json-help</option></term>
  1594 + <listitem>
  1595 + <para>
  1596 + Describe the format of the json output.
  1597 + </para>
  1598 + </listitem>
  1599 + </varlistentry>
  1600 + <varlistentry>
  1601 + <term><option>--json-key=key</option></term>
  1602 + <listitem>
  1603 + <para>
  1604 + This option is repeatable. If specified, only top-level keys
  1605 + specified will be included in the json output. If not
  1606 + specified, all keys will be shown.
  1607 + </para>
  1608 + </listitem>
  1609 + </varlistentry>
  1610 + <varlistentry>
  1611 + <term><option>--json-object=trailer|obj[,gen]</option></term>
  1612 + <listitem>
  1613 + <para>
  1614 + This option is repeatable. If specified, only specified
  1615 + objects will be shown in the
  1616 + &ldquo;<literal>objects</literal>&rdquo; key of the json
  1617 + output. If absent, all objects will be shown.
  1618 + </para>
  1619 + </listitem>
  1620 + </varlistentry>
  1621 + <varlistentry>
1584 <term><option>--check</option></term> 1622 <term><option>--check</option></term>
1585 <listitem> 1623 <listitem>
1586 <para> 1624 <para>
@@ -1777,6 +1815,8 @@ outfile.pdf&lt;/option&gt; @@ -1777,6 +1815,8 @@ outfile.pdf&lt;/option&gt;
1777 </chapter> 1815 </chapter>
1778 <chapter id="ref.using-library"> 1816 <chapter id="ref.using-library">
1779 <title>Using the QPDF Library</title> 1817 <title>Using the QPDF Library</title>
  1818 + <sect1 id="ref.using.from-cxx">
  1819 + <title>Using QPDF from C++</title>
1780 <para> 1820 <para>
1781 The source tree for the qpdf package has an 1821 The source tree for the qpdf package has an
1782 <filename>examples</filename> directory that contains a few 1822 <filename>examples</filename> directory that contains a few
@@ -1808,6 +1848,234 @@ outfile.pdf&lt;/option&gt; @@ -1808,6 +1848,234 @@ outfile.pdf&lt;/option&gt;
1808 time. Multiple threads may simultaneously work with different 1848 time. Multiple threads may simultaneously work with different
1809 instances of these and all other QPDF objects. 1849 instances of these and all other QPDF objects.
1810 </para> 1850 </para>
  1851 + </sect1>
  1852 + <sect1 id="ref.using.other-languages">
  1853 + <title>Using QPDF from other languages</title>
  1854 + <para>
  1855 + The qpdf library is implemented in C++, which makes it hard to use
  1856 + directly in other languages. There are a few things that can help.
  1857 + </para>
  1858 + <variablelist>
  1859 + <varlistentry>
  1860 + <term>&ldquo;C&rdquo;</term>
  1861 + <listitem>
  1862 + <para>
  1863 + The qpdf library includes a &ldquo;C&rdquo; language interface
  1864 + that provides a subset of the overall capabilities. The header
  1865 + file <filename>qpdf/qpdf-c.h</filename> includes information
  1866 + about its use. As long as you use a C++ linker, you can link C
  1867 + programs with qpdf and use the C API. For languages that can
  1868 + directly load methods from a shared library, the C API can also
  1869 + be useful. People have reported success using the C API from
  1870 + other languages on Windows by directly calling functions in the
  1871 + DLL.
  1872 + </para>
  1873 + </listitem>
  1874 + </varlistentry>
  1875 + <varlistentry>
  1876 + <term>Python</term>
  1877 + <listitem>
  1878 + <para>
  1879 + A Python module called <ulink
  1880 + url="https://pypi.org/project/pikepdf/">pikepdf</ulink>
  1881 + provides a clean and highly functional set of Python bindings
  1882 + to the qpdf library. Using pikepdf, you can work with PDF files
  1883 + in a natural way and combine qpdf's capabilities with other
  1884 + functionality provided by Python's rich standard library and
  1885 + available modules.
  1886 + </para>
  1887 + </listitem>
  1888 + </varlistentry>
  1889 + <varlistentry>
  1890 + <term>Other Languages</term>
  1891 + <listitem>
  1892 + <para>
  1893 + Starting with version 8.3.0, the <command>qpdf</command>
  1894 + command-line tool can produce a json representation of the PDF
  1895 + file's non-content data. This can facilitate interacting
  1896 + programmatically with PDF files through qpdf's command line
  1897 + interface. For more information, please see <xref
  1898 + linkend="ref.json"/>.
  1899 + </para>
  1900 + </listitem>
  1901 + </varlistentry>
  1902 + </variablelist>
  1903 + </sect1>
  1904 + </chapter>
  1905 + <chapter id="ref.json">
  1906 + <title>QPDF JSON</title>
  1907 + <para>
  1908 + Beginning with qpdf version 8.3.0, the <command>qpdf</command>
  1909 + command-line program can produce a json representation of the
  1910 + non-content data in a PDF file. It includes a dump in json format
  1911 + of all objects in the PDF file excluding the content of streams.
  1912 + This json representation makes it very easy to look in detail at
  1913 + the structure of a given PDF file, and it also provides a great way
  1914 + to work with PDF files programmatically from the command-line in
  1915 + languages that can't call or link with the qpdf library directly.
  1916 + Note that stream data can be extracted from PDF files using other
  1917 + qpdf command-line options.
  1918 + </para>
  1919 + <para>
  1920 + The qpdf json representation includes a json serialization of the
  1921 + raw objects in the PDF file as well as some computed information in
  1922 + a more easily extracted format. QPDF provides some guarantees about
  1923 + its json format. These guarantees are designed to simplify the
  1924 + experience of a developer working with the JSON format.
  1925 + <variablelist>
  1926 + <varlistentry>
  1927 + <term>Compatibility</term>
  1928 + <listitem>
  1929 + <para>
  1930 + The top-level json object output is a dictionary. The json
  1931 + output contains various nested dictionaries and arrays. With
  1932 + the exception of dictionaries that are populated by the fields
  1933 + of objects from the file, all instances of a dictionary are
  1934 + guaranteed to have exactly the same keys. Future versions of
  1935 + qpdf are free to add additional keys but not to remove keys or
  1936 + change the type of object that a key points to. The qpdf
  1937 + program validates this guarantee, and in the unlikely event
  1938 + that a bug in qpdf should cause it to generate data that
  1939 + doesn't conform to this rule, it will ask you to file a bug
  1940 + report.
  1941 + </para>
  1942 + <para>
  1943 + The top-level json structure contains a
  1944 + &ldquo;<literal>version</literal>&rdquo; key whose value is
  1945 + a simple integer. The value of the <literal>version</literal> key
  1946 + will be incremented if a non-compatible change is made. A
  1947 + non-compatible change would be any change that involves removal
  1948 + of a key, a change to the format of data pointed to by a key,
  1949 + or a semantic change that requires a different interpretation
  1950 + of a previously existing key. A strong effort will be made to
  1951 + avoid breaking compatibility.
  1952 + </para>
  1953 + </listitem>
  1954 + </varlistentry>
  1955 + <varlistentry>
  1956 + <term>Documentation</term>
  1957 + <listitem>
  1958 + <para>
  1959 + The <command>qpdf</command> command can be invoked with the
  1960 + <option>--json-help</option> option. This will output a json
  1961 + structure that has the same structure as the json output that
  1962 + qpdf generates, except that each field in the help output is a
  1963 + description of the corresponding field in the json output. The
  1964 + specific guarantees are as follows:
  1965 + <itemizedlist>
  1966 + <listitem>
  1967 + <para>
  1968 + A dictionary in the help output means that the corresponding
  1969 + location in the actual json output is also a dictionary with
  1970 + exactly the same keys; that is, no keys present in help are
  1971 + absent in the real output, and no keys will be present in
  1972 + the real output that are not in help.
  1973 + </para>
  1974 + </listitem>
  1975 + <listitem>
  1976 + <para>
  1977 + A string in the help output is a description of the item
  1978 + that appears in the corresponding location of the actual
  1979 + output. The corresponding output can have any format.
  1980 + </para>
  1981 + </listitem>
  1982 + <listitem>
  1983 + <para>
  1984 + An array in the help output always contains a single
  1985 + element. It indicates that the corresponding location in the
  1986 + actual output is also an array, and that each element of the
  1987 + array has whatever format is implied by the single element
  1988 + of the help output's array.
  1989 + </para>
  1990 + </listitem>
  1991 + </itemizedlist>
  1992 + For example, the help output includes a
  1993 + &ldquo;<literal>pagelabels</literal>&rdquo; key whose value is
  1994 + an array of one element. That element is a dictionary with keys
  1995 + &ldquo;<literal>index</literal>&rdquo; and
  1996 + &ldquo;<literal>label</literal>&rdquo;. In addition to
  1997 + describing the meaning of those keys, this tells you that the
  1998 + actual json output will contain a <literal>pagelabels</literal>
  1999 + array, each of whose elements is a dictionary that contains an
  2000 + <literal>index</literal> key, a <literal>label</literal> key,
  2001 + and no other keys.
  2002 + </para>
  2003 + </listitem>
  2004 + </varlistentry>
  2005 + <varlistentry>
  2006 + <term>Directness and Simplicity</term>
  2007 + <listitem>
  2008 + <para>
  2009 + The json output contains the value of every object in the file,
  2010 + but it also contains some processed data. This is analogous to
  2011 + how qpdf's library interface works. The processed data is
  2012 + similar to the helper functions in that it allows you to look
  2013 + at certain aspects of the PDF file without having to understand
  2014 + all the nuances of the PDF specification, while the raw objects
  2015 + allow you to mine the PDF for anything that the higher-level
  2016 + interfaces are lacking.
  2017 + </para>
  2018 + </listitem>
  2019 + </varlistentry>
  2020 + </variablelist>
  2021 + </para>
  2022 + <para>
  2023 + There are a few limitations to be aware of with the json structure:
  2024 + <itemizedlist>
  2025 + <listitem>
  2026 + <para>
  2027 + Strings, names, and indirect object references in the original
  2028 + PDF file are all converted to strings in the json
  2029 + representation. In the case of a &ldquo;normal&rdquo; PDF file,
  2030 + you can tell the difference because a name starts with a slash
  2031 + (<literal>/</literal>), and an indirect object reference looks
  2032 + like <literal>n n R</literal>, but if there were to be a string
  2033 + that looked like a name or indirect object reference, there
  2034 + would be no way to tell this from the json output. Note that
  2035 + there are certain cases where you know for sure what something
  2036 + is, such as knowing that dictionary keys in objects are always
  2037 + names and that certain things in the higher-level computed data
  2038 + are known to contain indirect object references.
  2039 + </para>
  2040 + </listitem>
  2041 + <listitem>
  2042 + <para>
  2043 + The json format doesn't support binary data very well. Mostly
  2044 + the details are not important, but they are presented here for
  2045 + information. When qpdf outputs a string in the json
  2046 + representation, it converts the string to UTF-8, assuming usual
  2047 + PDF string semantics. Specifically, if the original string is
  2048 + UTF-16, it is converted to UTF-8. Otherwise, it is assumed to
  2049 + have PDF doc encoding, and is converted to UTF-8 with that
  2050 + assumption. This causes strange things to happen to binary
  2051 + strings. For example, if you had the binary string
  2052 + <literal>&lt;038051&gt;</literal>, this would be output to the
  2053 + json as <literal>\u0003•Q</literal> because
  2054 + <literal>03</literal> is not a printable character and
  2055 + <literal>80</literal> is the bullet character in PDF doc
  2056 + encoding and is mapped to the Unicode value
  2057 + <literal>2022</literal>. Since <literal>51</literal> is
  2058 + <literal>Q</literal>, it is output as is. If you wanted to
  2059 + convert back from here to a binary string, you would have to
  2060 + recognize Unicode values whose code points are higher than
  2061 + <literal>0xFF</literal> and map those back to their
  2062 + corresponding PDF doc encoding characters. There is no way to
  2063 + tell the difference between a Unicode string that was originally
  2064 + encoded as UTF-16 and one that was converted from PDF doc
  2065 + encoding. In other words, it's best if you don't try to use the
  2066 + json format to extract binary strings from the PDF file, but if
  2067 + you really had to, it could be done. Note that qpdf's
  2068 + <option>--show-object</option> option does not have this
  2069 + limitation and will reveal the string as encoded in the original
  2070 + file.
  2071 + </para>
  2072 + </listitem>
  2073 + </itemizedlist>
  2074 + </para>
  2075 + <para>
  2076 + For specific details on the information provided in the json
  2077 + output, please run <command>qpdf --json-help</command>.
  2078 + </para>
1811 </chapter> 2079 </chapter>
1812 <chapter id="ref.design"> 2080 <chapter id="ref.design">
1813 <title>Design and Library Notes</title> 2081 <title>Design and Library Notes</title>
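The guarantees described in the new QPDF JSON chapter can be sketched in a few lines of Python. This is an illustration under stated assumptions, not qpdf's own validation code. `conforms_to_help` and `classify_pdf_value` are hypothetical names. The checker applies the three documented rules (dictionary, string, single-element array) to unfiltered output and deliberately ignores the documented exception for dictionaries populated from the fields of objects in the file. The classifier is only the heuristic the chapter describes for a "normal" PDF file and cannot recognize a string that merely looks like a name or reference.

```python
import re


def conforms_to_help(help_node, actual):
    """Check `actual` against the shape described by `help_node`.

    Assumes unfiltered output; does not handle dictionaries whose keys
    come from objects in the PDF file (the documented exception).
    """
    if isinstance(help_node, dict):
        # Same keys on both sides: none missing, none extra.
        if not isinstance(actual, dict) or set(help_node) != set(actual):
            return False
        return all(conforms_to_help(help_node[k], actual[k]) for k in help_node)
    if isinstance(help_node, list):
        # The help array holds one element describing every actual element.
        if len(help_node) != 1 or not isinstance(actual, list):
            return False
        return all(conforms_to_help(help_node[0], item) for item in actual)
    # A help string is a description; the actual value may have any format.
    return isinstance(help_node, str)


def classify_pdf_value(value):
    """Guess whether a json string came from a name, a reference, or a string."""
    if re.fullmatch(r"\d+ \d+ R", value):
        return "reference"
    if value.startswith("/"):
        return "name"
    return "string"
```

A caller could load the output of `qpdf --json-help` and of `qpdf --json` separately, then pass the two parsed structures to `conforms_to_help`.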