Commit 24aeb9ae2227c6b55297d9a946bf82f31656a685
1 parent
86f9b4c4
Document json support
Showing 2 changed files with 276 additions and 1 deletion
ChangeLog
| 1 | +2018-12-22 Jay Berkenbilt <ejb@ql.org> | ||
| 2 | + | ||
| 3 | + * Add new options --json, --json-key, and --json-object to | ||
| 4 | + generate a json representation of the PDF file. This is described | ||
| 5 | + in more depth in the manual. You can also run qpdf --json-help to | ||
| 6 | + get a description of the json format. | ||
| 7 | + | ||
| 1 | 2018-12-21 Jay Berkenbilt <ejb@ql.org> | 8 | 2018-12-21 Jay Berkenbilt <ejb@ql.org> |
| 2 | 9 | ||
| 3 | * Allow --show-object=trailer for showing the document trailer. | 10 | * Allow --show-object=trailer for showing the document trailer. |
manual/qpdf-manual.xml
| @@ -1515,7 +1515,7 @@ outfile.pdf</option> | @@ -1515,7 +1515,7 @@ outfile.pdf</option> | ||
| 1515 | </listitem> | 1515 | </listitem> |
| 1516 | </varlistentry> | 1516 | </varlistentry> |
| 1517 | <varlistentry> | 1517 | <varlistentry> |
| 1518 | - <term><option>--show-object=obj[,gen]</option></term> | 1518 | + <term><option>--show-object=trailer|obj[,gen]</option></term> |
| 1519 | <listitem> | 1519 | <listitem> |
| 1520 | <para> | 1520 | <para> |
| 1521 | Show the contents of the given object. This is especially | 1521 | Show the contents of the given object. This is especially |
| @@ -1581,6 +1581,44 @@ outfile.pdf</option> | @@ -1581,6 +1581,44 @@ outfile.pdf</option> | ||
| 1581 | </listitem> | 1581 | </listitem> |
| 1582 | </varlistentry> | 1582 | </varlistentry> |
| 1583 | <varlistentry> | 1583 | <varlistentry> |
| 1584 | + <term><option>--json</option></term> | ||
| 1585 | + <listitem> | ||
| 1586 | + <para> | ||
| 1587 | + Generate a json representation of the file. This is described | ||
| 1588 | + in depth in <xref linkend="ref.json"/>. | ||
| 1589 | + </para> | ||
| 1590 | + </listitem> | ||
| 1591 | + </varlistentry> | ||
| 1592 | + <varlistentry> | ||
| 1593 | + <term><option>--json-help</option></term> | ||
| 1594 | + <listitem> | ||
| 1595 | + <para> | ||
| 1596 | + Describe the format of the json output. | ||
| 1597 | + </para> | ||
| 1598 | + </listitem> | ||
| 1599 | + </varlistentry> | ||
| 1600 | + <varlistentry> | ||
| 1601 | + <term><option>--json-key=key</option></term> | ||
| 1602 | + <listitem> | ||
| 1603 | + <para> | ||
| 1604 | + This option is repeatable. If specified, only the specified | ||
| 1605 | + top-level keys will be included in the json output. If not | ||
| 1606 | + specified, all keys will be shown. | ||
| 1607 | + </para> | ||
| 1608 | + </listitem> | ||
| 1609 | + </varlistentry> | ||
| 1610 | + <varlistentry> | ||
| 1611 | + <term><option>--json-object=trailer|obj[,gen]</option></term> | ||
| 1612 | + <listitem> | ||
| 1613 | + <para> | ||
| 1614 | + This option is repeatable. If specified, only specified | ||
| 1615 | + objects will be shown in the | ||
| 1616 | + “<literal>objects</literal>” key of the json | ||
| 1617 | + output. If absent, all objects will be shown. | ||
| 1618 | + </para> | ||
| 1619 | + </listitem> | ||
| 1620 | + </varlistentry> | ||
| 1621 | + <varlistentry> | ||
| 1584 | <term><option>--check</option></term> | 1622 | <term><option>--check</option></term> |
| 1585 | <listitem> | 1623 | <listitem> |
| 1586 | <para> | 1624 | <para> |
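The options documented in this hunk compose on the command line. The following is a minimal Python sketch, not part of the commit, of how a caller might build and run such a command. It assumes that the `qpdf` executable is on `PATH`, that the input file follows the options, and that the json is written to standard output. The helper names `build_json_command` and `load_qpdf_json` are hypothetical and used only for illustration.

```python
import json
import subprocess


def build_json_command(pdf_path, keys=(), objects=()):
    """Build an argument list for `qpdf --json` (hypothetical helper).

    --json-key and --json-object are repeatable, so one option is
    emitted per requested top-level key or object.
    """
    cmd = ["qpdf", "--json"]
    cmd += [f"--json-key={key}" for key in keys]
    cmd += [f"--json-object={obj}" for obj in objects]
    cmd.append(pdf_path)  # assumption: the input file follows the options
    return cmd


def load_qpdf_json(pdf_path, keys=(), objects=()):
    """Run qpdf and parse its standard output as json.

    Assumes qpdf is installed; check=True raises CalledProcessError
    on any non-zero exit status.
    """
    cmd = build_json_command(pdf_path, keys, objects)
    result = subprocess.run(cmd, check=True, capture_output=True)
    return json.loads(result.stdout.decode("utf-8"))


if __name__ == "__main__":
    # Only prints the command, so this runs without qpdf installed.
    print(build_json_command("in.pdf", keys=["objects"], objects=["trailer", "1,0"]))
```

Keeping command construction separate from execution lets the argument list be inspected or logged without invoking qpdf.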
| @@ -1777,6 +1815,8 @@ outfile.pdf</option> | @@ -1777,6 +1815,8 @@ outfile.pdf</option> | ||
| 1777 | </chapter> | 1815 | </chapter> |
| 1778 | <chapter id="ref.using-library"> | 1816 | <chapter id="ref.using-library"> |
| 1779 | <title>Using the QPDF Library</title> | 1817 | <title>Using the QPDF Library</title> |
| 1818 | + <sect1 id="ref.using.from-cxx"> | ||
| 1819 | + <title>Using QPDF from C++</title> | ||
| 1780 | <para> | 1820 | <para> |
| 1781 | The source tree for the qpdf package has an | 1821 | The source tree for the qpdf package has an |
| 1782 | <filename>examples</filename> directory that contains a few | 1822 | <filename>examples</filename> directory that contains a few |
| @@ -1808,6 +1848,234 @@ outfile.pdf</option> | @@ -1808,6 +1848,234 @@ outfile.pdf</option> | ||
| 1808 | time. Multiple threads may simultaneously work with different | 1848 | time. Multiple threads may simultaneously work with different |
| 1809 | instances of these and all other QPDF objects. | 1849 | instances of these and all other QPDF objects. |
| 1810 | </para> | 1850 | </para> |
| 1851 | + </sect1> | ||
| 1852 | + <sect1 id="ref.using.other-languages"> | ||
| 1853 | + <title>Using QPDF from other languages</title> | ||
| 1854 | + <para> | ||
| 1855 | + The qpdf library is implemented in C++, which makes it hard to use | ||
| 1856 | + directly in other languages. There are a few things that can help. | ||
| 1857 | + </para> | ||
| 1858 | + <variablelist> | ||
| 1859 | + <varlistentry> | ||
| 1860 | + <term>“C”</term> | ||
| 1861 | + <listitem> | ||
| 1862 | + <para> | ||
| 1863 | + The qpdf library includes a “C” language interface | ||
| 1864 | + that provides a subset of the overall capabilities. The header | ||
| 1865 | + file <filename>qpdf/qpdf-c.h</filename> includes information | ||
| 1866 | + about its use. As long as you use a C++ linker, you can link C | ||
| 1867 | + programs with qpdf and use the C API. For languages that can | ||
| 1868 | + directly load methods from a shared library, the C API can also | ||
| 1869 | + be useful. People have reported success using the C API from | ||
| 1870 | + other languages on Windows by directly calling functions in the | ||
| 1871 | + DLL. | ||
| 1872 | + </para> | ||
| 1873 | + </listitem> | ||
| 1874 | + </varlistentry> | ||
| 1875 | + <varlistentry> | ||
| 1876 | + <term>Python</term> | ||
| 1877 | + <listitem> | ||
| 1878 | + <para> | ||
| 1879 | + A Python module called <ulink | ||
| 1880 | + url="https://pypi.org/project/pikepdf/">pikepdf</ulink> | ||
| 1881 | + provides a clean and highly functional set of Python bindings | ||
| 1882 | + to the qpdf library. Using pikepdf, you can work with PDF files | ||
| 1883 | + in a natural way and combine qpdf's capabilities with other | ||
| 1884 | + functionality provided by Python's rich standard library and | ||
| 1885 | + available modules. | ||
| 1886 | + </para> | ||
| 1887 | + </listitem> | ||
| 1888 | + </varlistentry> | ||
| 1889 | + <varlistentry> | ||
| 1890 | + <term>Other Languages</term> | ||
| 1891 | + <listitem> | ||
| 1892 | + <para> | ||
| 1893 | + Starting with version 8.3.0, the <command>qpdf</command> | ||
| 1894 | + command-line tool can produce a json representation of the PDF | ||
| 1895 | + file's non-content data. This can facilitate interacting | ||
| 1896 | + programmatically with PDF files through qpdf's command line | ||
| 1897 | + interface. For more information, please see <xref | ||
| 1898 | + linkend="ref.json"/>. | ||
| 1899 | + </para> | ||
| 1900 | + </listitem> | ||
| 1901 | + </varlistentry> | ||
| 1902 | + </variablelist> | ||
| 1903 | + </sect1> | ||
| 1904 | + </chapter> | ||
| 1905 | + <chapter id="ref.json"> | ||
| 1906 | + <title>QPDF JSON</title> | ||
| 1907 | + <para> | ||
| 1908 | + Beginning with qpdf version 8.3.0, the <command>qpdf</command> | ||
| 1909 | + command-line program can produce a json representation of the | ||
| 1910 | + non-content data in a PDF file. It includes a dump in json format | ||
| 1911 | + of all objects in the PDF file excluding the content of streams. | ||
| 1912 | + This json representation makes it very easy to look in detail at | ||
| 1913 | + the structure of a given PDF file, and it also provides a great way | ||
| 1914 | + to work with PDF files programmatically from the command-line in | ||
| 1915 | + languages that can't call or link with the qpdf library directly. | ||
| 1916 | + Note that stream data can be extracted from PDF files using other | ||
| 1917 | + qpdf command-line options. | ||
| 1918 | + </para> | ||
| 1919 | + <para> | ||
| 1920 | + The qpdf json representation includes a json serialization of the | ||
| 1921 | + raw objects in the PDF file as well as some computed information in | ||
| 1922 | + a more easily extracted format. QPDF provides some guarantees about | ||
| 1923 | + its json format. These guarantees are designed to simplify the | ||
| 1924 | + experience of a developer working with the JSON format. | ||
| 1925 | + <variablelist> | ||
| 1926 | + <varlistentry> | ||
| 1927 | + <term>Compatibility</term> | ||
| 1928 | + <listitem> | ||
| 1929 | + <para> | ||
| 1930 | + The top-level json object output is a dictionary. The json | ||
| 1931 | + output contains various nested dictionaries and arrays. With | ||
| 1932 | + the exception of dictionaries that are populated by the fields | ||
| 1933 | + of objects from the file, all instances of a dictionary are | ||
| 1934 | + guaranteed to have exactly the same keys. Future versions of | ||
| 1935 | + qpdf are free to add additional keys but not to remove keys or | ||
| 1936 | + change the type of object that a key points to. The qpdf | ||
| 1937 | + program validates this guarantee, and in the unlikely event | ||
| 1938 | + that a bug in qpdf should cause it to generate data that | ||
| 1939 | + doesn't conform to this rule, it will ask you to file a bug | ||
| 1940 | + report. | ||
| 1941 | + </para> | ||
| 1942 | + <para> | ||
| 1943 | + The top-level json structure contains a | ||
| 1944 | + “<literal>version</literal>” key whose value is | ||
| 1945 | + a simple integer. The value of the <literal>version</literal> key | ||
| 1946 | + will be incremented if a non-compatible change is made. A | ||
| 1947 | + non-compatible change would be any change that involves removal | ||
| 1948 | + of a key, a change to the format of data pointed to by a key, | ||
| 1949 | + or a semantic change that requires a different interpretation | ||
| 1950 | + of a previously existing key. A strong effort will be made to | ||
| 1951 | + avoid breaking compatibility. | ||
| 1952 | + </para> | ||
| 1953 | + </listitem> | ||
| 1954 | + </varlistentry> | ||
| 1955 | + <varlistentry> | ||
| 1956 | + <term>Documentation</term> | ||
| 1957 | + <listitem> | ||
| 1958 | + <para> | ||
| 1959 | + The <command>qpdf</command> command can be invoked with the | ||
| 1960 | + <option>--json-help</option> option. This will output a json | ||
| 1961 | + structure that has the same structure as the json output that | ||
| 1962 | + qpdf generates, except that each field in the help output is a | ||
| 1963 | + description of the corresponding field in the json output. The | ||
| 1964 | + specific guarantees are as follows: | ||
| 1965 | + <itemizedlist> | ||
| 1966 | + <listitem> | ||
| 1967 | + <para> | ||
| 1968 | + A dictionary in the help output means that the corresponding | ||
| 1969 | + location in the actual json output is also a dictionary with | ||
| 1970 | + exactly the same keys; that is, no keys present in help are | ||
| 1971 | + absent in the real output, and no keys will be present in | ||
| 1972 | + the real output that are not in help. | ||
| 1973 | + </para> | ||
| 1974 | + </listitem> | ||
| 1975 | + <listitem> | ||
| 1976 | + <para> | ||
| 1977 | + A string in the help output is a description of the item | ||
| 1978 | + that appears in the corresponding location of the actual | ||
| 1979 | + output. The corresponding output can have any format. | ||
| 1980 | + </para> | ||
| 1981 | + </listitem> | ||
| 1982 | + <listitem> | ||
| 1983 | + <para> | ||
| 1984 | + An array in the help output always contains a single | ||
| 1985 | + element. It indicates that the corresponding location in the | ||
| 1986 | + actual output is also an array, and that each element of the | ||
| 1987 | + array has whatever format is implied by the single element | ||
| 1988 | + of the help output's array. | ||
| 1989 | + </para> | ||
| 1990 | + </listitem> | ||
| 1991 | + </itemizedlist> | ||
| 1992 | + For example, the help output includes a | ||
| 1993 | + “<literal>pagelabels</literal>” key whose value is | ||
| 1994 | + an array of one element. That element is a dictionary with keys | ||
| 1995 | + “<literal>index</literal>” and | ||
| 1996 | + “<literal>label</literal>”. In addition to | ||
| 1997 | + describing the meaning of those keys, this tells you that the | ||
| 1998 | + actual json output will contain a <literal>pagelabels</literal> | ||
| 1999 | + array, each of whose elements is a dictionary that contains an | ||
| 2000 | + <literal>index</literal> key, a <literal>label</literal> key, | ||
| 2001 | + and no other keys. | ||
| 2002 | + </para> | ||
| 2003 | + </listitem> | ||
| 2004 | + </varlistentry> | ||
| 2005 | + <varlistentry> | ||
| 2006 | + <term>Directness and Simplicity</term> | ||
| 2007 | + <listitem> | ||
| 2008 | + <para> | ||
| 2009 | + The json output contains the value of every object in the file, | ||
| 2010 | + but it also contains some processed data. This is analogous to | ||
| 2011 | + how qpdf's library interface works. The processed data is | ||
| 2012 | + similar to the helper functions in that it allows you to look | ||
| 2013 | + at certain aspects of the PDF file without having to understand | ||
| 2014 | + all the nuances of the PDF specification, while the raw objects | ||
| 2015 | + allow you to mine the PDF for anything that the higher-level | ||
| 2016 | + interfaces are lacking. | ||
| 2017 | + </para> | ||
| 2018 | + </listitem> | ||
| 2019 | + </varlistentry> | ||
| 2020 | + </variablelist> | ||
| 2021 | + </para> | ||
| 2022 | + <para> | ||
| 2023 | + There are a few limitations to be aware of with the json structure: | ||
| 2024 | + <itemizedlist> | ||
| 2025 | + <listitem> | ||
| 2026 | + <para> | ||
| 2027 | + Strings, names, and indirect object references in the original | ||
| 2028 | + PDF file are all converted to strings in the json | ||
| 2029 | + representation. In the case of a “normal” PDF file, | ||
| 2030 | + you can tell the difference because a name starts with a slash | ||
| 2031 | + (<literal>/</literal>), and an indirect object reference looks | ||
| 2032 | + like <literal>n n R</literal>, but if there were to be a string | ||
| 2033 | + that looked like a name or indirect object reference, there | ||
| 2034 | + would be no way to tell this from the json output. Note that | ||
| 2035 | + there are certain cases where you know for sure what something | ||
| 2036 | + is, such as knowing that dictionary keys in objects are always | ||
| 2037 | + names and that certain things in the higher-level computed data | ||
| 2038 | + are known to contain indirect object references. | ||
| 2039 | + </para> | ||
| 2040 | + </listitem> | ||
| 2041 | + <listitem> | ||
| 2042 | + <para> | ||
| 2043 | + The json format doesn't support binary data very well. Mostly | ||
| 2044 | + the details are not important, but they are presented here for | ||
| 2045 | + information. When qpdf outputs a string in the json | ||
| 2046 | + representation, it converts the string to UTF-8, assuming usual | ||
| 2047 | + PDF string semantics. Specifically, if the original string is | ||
| 2048 | + UTF-16, it is converted to UTF-8. Otherwise, it is assumed to | ||
| 2049 | + have PDF doc encoding, and is converted to UTF-8 with that | ||
| 2050 | + assumption. This causes strange things to happen to binary | ||
| 2051 | + strings. For example, if you had the binary string | ||
| 2052 | + <literal>&lt;038051&gt;</literal>, this would be output to the | ||
| 2053 | + json as <literal>\u0003•Q</literal> because | ||
| 2054 | + <literal>03</literal> is not a printable character and | ||
| 2055 | + <literal>80</literal> is the bullet character in PDF doc | ||
| 2056 | + encoding and is mapped to the Unicode value | ||
| 2057 | + <literal>2022</literal>. Since <literal>51</literal> is | ||
| 2058 | + <literal>Q</literal>, it is output as is. If you wanted to | ||
| 2059 | + convert back from here to a binary string, you would have to | ||
| 2060 | + recognize Unicode values whose code points are higher than | ||
| 2061 | + <literal>0xFF</literal> and map those back to their | ||
| 2062 | + corresponding PDF doc encoding characters. There is no way to | ||
| 2063 | + tell the difference between a Unicode string that was originally | ||
| 2064 | + encoded as UTF-16 and one that was converted from PDF doc | ||
| 2065 | + encoding. In other words, it's best if you don't try to use the | ||
| 2066 | + json format to extract binary strings from the PDF file, but if | ||
| 2067 | + you really had to, it could be done. Note that qpdf's | ||
| 2068 | + <option>--show-object</option> option does not have this | ||
| 2069 | + limitation and will reveal the string as encoded in the original | ||
| 2070 | + file. | ||
| 2071 | + </para> | ||
| 2072 | + </listitem> | ||
| 2073 | + </itemizedlist> | ||
| 2074 | + </para> | ||
| 2075 | + <para> | ||
| 2076 | + For specific details on the information provided in the json | ||
| 2077 | + output, please run <command>qpdf --json-help</command>. | ||
| 2078 | + </para> | ||
| 1811 | </chapter> | 2079 | </chapter> |
| 1812 | <chapter id="ref.design"> | 2080 | <chapter id="ref.design"> |
| 1813 | <title>Design and Library Notes</title> | 2081 | <title>Design and Library Notes</title> |
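The three guarantees that the new "QPDF JSON" chapter gives for `--json-help` (dictionaries have exactly the same keys, strings describe a value of any format, arrays hold a single element describing every output element) can be checked mechanically. The following Python sketch is one possible reading of those rules, not qpdf's own validator. The function name `conforms` and the sample descriptions are hypothetical, and the sketch deliberately ignores the documented exception for dictionaries populated from the fields of objects in the file.

```python
def conforms(help_node, actual):
    """Return True if `actual` has the shape described by `help_node`.

    Hypothetical helper; it does not handle dictionaries whose keys
    come from objects in the PDF file.
    """
    if isinstance(help_node, dict):
        # Dictionary: the output must be a dictionary with exactly the same keys.
        if not isinstance(actual, dict) or set(help_node) != set(actual):
            return False
        return all(conforms(help_node[key], actual[key]) for key in help_node)
    if isinstance(help_node, list):
        # Array: the single help element describes every output element.
        if len(help_node) != 1 or not isinstance(actual, list):
            return False
        return all(conforms(help_node[0], item) for item in actual)
    # String: a description; the corresponding output can have any format.
    return isinstance(help_node, str)


# Illustrative fragment modeled on the manual's pagelabels example.
# The description strings and the sample values are made up.
example_help = {
    "pagelabels": [
        {"index": "description of index", "label": "description of label"}
    ]
}
example_output = {"pagelabels": [{"index": 0, "label": {"/S": "/D"}}]}

if __name__ == "__main__":
    print(conforms(example_help, example_output))
```

A caller could run such a check once against the real `qpdf --json-help` output before relying on a key, which turns the documented compatibility promise into an early failure rather than a `KeyError` deep in application code.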