Add page position information to json

Jay Berkenbilt
1 parent 52a0b767
Showing 2 changed files with 249 additions and 161 deletions
manual/qpdf-manual.xml
qpdf/qpdf.cc
@@ -1940,178 +1940,235 @@ outfile.pdf&lt;/option&gt;
  </chapter>
  <chapter id="ref.json">
   <title>QPDF JSON</title>
-  <para>
-   Beginning with qpdf version 8.3.0, the <command>qpdf</command>
-   command-line program can produce a json representation of the
-   non-content data in a PDF file. It includes a dump in json format
-   of all objects in the PDF file excluding the content of streams.
-   This json representation makes it very easy to look in detail at
-   the structure of a given PDF file, and it also provides a great way
-   to work with PDF files programmatically from the command-line in
-   languages that can't call or link with the qpdf library directly.
-   Note that stream data can be extracted from PDF files using other
-   qpdf command-line options.
-  </para>
-  <para>
-   The qpdf json representation includes a json serialization of the
-   raw objects in the PDF file as well as some computed information in
-   a more easily extracted format. QPDF provides some guarantees about
-   its json format. These guarantees are designed to simplify the
-   experience of a developer working with the JSON format.
-   <variablelist>
-    <varlistentry>
-     <term>Compatibility</term>
+  <sect1 id="ref.json-overview">
+   <title>Overview</title>
+   <para>
+    Beginning with qpdf version 8.3.0, the <command>qpdf</command>
+    command-line program can produce a json representation of the
+    non-content data in a PDF file. It includes a dump in json format
+    of all objects in the PDF file excluding the content of streams.
+    This json representation makes it very easy to look in detail at
+    the structure of a given PDF file, and it also provides a great way
+    to work with PDF files programmatically from the command-line in
+    languages that can't call or link with the qpdf library directly.
+    Note that stream data can be extracted from PDF files using other
+    qpdf command-line options.
+   </para>
+  </sect1>
+  <sect1 id="ref.json-guarantees">
+   <title>JSON Guarantees</title>
+   <para>
+    The qpdf json representation includes a json serialization of the
+    raw objects in the PDF file as well as some computed information in
+    a more easily extracted format. QPDF provides some guarantees about
+    its json format. These guarantees are designed to simplify the
+    experience of a developer working with the JSON format.
+    <variablelist>
+     <varlistentry>
+      <term>Compatibility</term>
+      <listitem>
+       <para>
+        The top-level json object output is a dictionary. The json
+        output contains various nested dictionaries and arrays. With
+        the exception of dictionaries that are populated by the fields
+        of objects from the file, all instances of a dictionary are
+        guaranteed to have exactly the same keys. Future versions of
+        qpdf are free to add additional keys but not to remove keys or
+        change the type of object that a key points to. The qpdf
+        program validates this guarantee, and in the unlikely event
+        that a bug in qpdf should cause it to generate data that
+        doesn't conform to this rule, it will ask you to file a bug
+        report.
+       </para>
+       <para>
+        The top-level json structure contains a
+        &ldquo;<literal>version</literal>&rdquo; key whose value is
+        simple integer. The value of the <literal>version</literal> key
+        will be incremented if a non-compatible change is made. A
+        non-compatible change would be any change that involves removal
+        of a key, a change to the format of data pointed to by a key,
+        or a semantic change that requires a different interpretation
+        of a previously existing key. A strong effort will be made to
+        avoid breaking compatibility.
+       </para>
+      </listitem>
+     </varlistentry>
+     <varlistentry>
+      <term>Documentation</term>
+      <listitem>
+       <para>
+        The <command>qpdf</command> command can be invoked with the
+        <option>--json-help</option> option. This will output a json
+        structure that has the same structure as the json output that
+        qpdf generates, except that each field in the help output is a
+        description of the corresponding field in the json output. The
+        specific guarantees are as follows:
+        <itemizedlist>
+         <listitem>
+          <para>
+           A dictionary in the help output means that the corresponding
+           location in the actual json output is also a dictionary with
+           exactly the same keys; that is, no keys present in help are
+           absent in the real output, and no keys will be present in
+           the real output that are not in help.
+          </para>
+         </listitem>
+         <listitem>
+          <para>
+           A string in the help output is a description of the item
+           that appears in the corresponding location of the actual
+           output. The corresponding output can have any format.
+          </para>
+         </listitem>
+         <listitem>
+          <para>
+           An array in the help output always contains a single
+           element. It indicates that the corresponding location in the
+           actual output is also an array, and that each element of the
+           array has whatever format is implied by the single element
+           of the help output's array.
+          </para>
+         </listitem>
+        </itemizedlist>
+        For example, the help output indicates includes a
+        &ldquo;<literal>pagelabels</literal>&rdquo; key whose value is
+        an array of one element. That element is a dictionary with keys
+        &ldquo;<literal>index</literal>&rdquo; and
+        &ldquo;<literal>label</literal>&rdquo;. In addition to
+        describing the meaning of those keys, this tells you that the
+        actual json output will contain a <literal>pagelabels</literal>
+        array, each of whose elements is a dictionary that contains an
+        <literal>index</literal> key, a <literal>label</literal> key,
+        and no other keys.
+       </para>
+      </listitem>
+     </varlistentry>
+     <varlistentry>
+      <term>Directness and Simplicity</term>
+      <listitem>
+       <para>
+        The json output contains the value of every object in the file,
+        but it also contains some processed data. This is analogous to
+        how qpdf's library interface works. The processed data is
+        similar to the helper functions in that it allows you to look
+        at certain aspects of the PDF file without having to understand
+        all the nuances of the PDF specification, while the raw objects
+        allow you to mine the PDF for anything that the higher-level
+        interfaces are lacking.
+       </para>
+      </listitem>
+     </varlistentry>
+    </variablelist>
+   </para>
+  </sect1>
+  <sect1 id="json.limitations">
+   <title>Limitations of JSON Representation</title>
+   <para>
+    There are a few limitations to be aware of with the json structure:
+    <itemizedlist>
      <listitem>
       <para>
-       The top-level json object output is a dictionary. The json
-       output contains various nested dictionaries and arrays. With
-       the exception of dictionaries that are populated by the fields
-       of objects from the file, all instances of a dictionary are
-       guaranteed to have exactly the same keys. Future versions of
-       qpdf are free to add additional keys but not to remove keys or
-       change the type of object that a key points to. The qpdf
-       program validates this guarantee, and in the unlikely event
-       that a bug in qpdf should cause it to generate data that
-       doesn't conform to this rule, it will ask you to file a bug
-       report.
+       Strings, names, and indirect object references in the original
+       PDF file are all converted to strings in the json
+       representation. In the case of a &ldquo;normal&rdquo; PDF file,
+       you can tell the difference because a name starts with a slash
+       (<literal>/</literal>), and an indirect object reference looks
+       like <literal>n n R</literal>, but if there were to be a string
+       that looked like a name or indirect object reference, there
+       would be no way to tell this from the json output. Note that
+       there are certain cases where you know for sure what something
+       is, such as knowing that dictionary keys in objects are always
+       names and that certain things in the higher-level computed data
+       are known to contain indirect object references.
       </para>
+     </listitem>
+     <listitem>
       <para>
-       The top-level json structure contains a
-       &ldquo;<literal>version</literal>&rdquo; key whose value is
-       simple integer. The value of the <literal>version</literal> key
-       will be incremented if a non-compatible change is made. A
-       non-compatible change would be any change that involves removal
-       of a key, a change to the format of data pointed to by a key,
-       or a semantic change that requires a different interpretation
-       of a previously existing key. A strong effort will be made to
-       avoid breaking compatibility.
+       The json format doesn't support binary data very well. Mostly
+       the details are not important, but they are presented here for
+       information. When qpdf outputs a string in the json
+       representation, it converts the string to UTF-8, assuming usual
+       PDF string semantics. Specifically, if the original string is
+       UTF-16, it is converted to UTF-8. Otherwise, it is assumed to
+       have PDF doc encoding, and is converted to UTF-8 with that
+       assumption. This causes strange things to happen to binary
+       strings. For example, if you had the binary string
+       <literal>&lt;038051&gt;</literal>, this would be output to the
+       json as <literal>\u0003•Q</literal> because
+       <literal>03</literal> is not a printable character and
+       <literal>80</literal> is the bullet character in PDF doc
+       encoding and is mapped to the Unicode value
+       <literal>2022</literal>. Since <literal>51</literal> is
+       <literal>Q</literal>, it is output as is. If you wanted to
+       convert back from here to a binary string, would have to
+       recognize Unicode values whose code points are higher than
+       <literal>0xFF</literal> and map those back to their
+       corresponding PDF doc encoding characters. There is no way to
+       tell the difference between a Unicode string that was originally
+       encoded as UTF-16 or one that was converted from PDF doc
+       encoding. In other words, it's best if you don't try to use the
+       json format to extract binary strings from the PDF file, but if
+       you really had to, it could be done. Note that qpdf's
+       <option>--show-object</option> option does not have this
+       limitation and will reveal the string as encoded in the original
+       file.
       </para>
      </listitem>
-    </varlistentry>
-    <varlistentry>
-     <term>Documentation</term>
+    </itemizedlist>
+   </para>
+  </sect1>
+  <sect1 id="json.considerations">
+   <title>JSON: Special Considerations</title>
+   <para>
+    For the most part, the built-in JSON help tells you everything you
+    need to know about the JSON format, but there are a few
+    non-obvious things to be aware of:
+    <itemizedlist>
      <listitem>
       <para>
-       The <command>qpdf</command> command can be invoked with the
-       <option>--json-help</option> option. This will output a json
-       structure that has the same structure as the json output that
-       qpdf generates, except that each field in the help output is a
-       description of the corresponding field in the json output. The
-       specific guarantees are as follows:
-       <itemizedlist>
-        <listitem>
-         <para>
-          A dictionary in the help output means that the corresponding
-          location in the actual json output is also a dictionary with
-          exactly the same keys; that is, no keys present in help are
-          absent in the real output, and no keys will be present in
-          the real output that are not in help.
-         </para>
-        </listitem>
-        <listitem>
-         <para>
-          A string in the help output is a description of the item
-          that appears in the corresponding location of the actual
-          output. The corresponding output can have any format.
-         </para>
-        </listitem>
-        <listitem>
-         <para>
-          An array in the help output always contains a single
-          element. It indicates that the corresponding location in the
-          actual output is also an array, and that each element of the
-          array has whatever format is implied by the single element
-          of the help output's array.
-         </para>
-        </listitem>
-       </itemizedlist>
-       For example, the help output indicates includes a
-       &ldquo;<literal>pagelabels</literal>&rdquo; key whose value is
-       an array of one element. That element is a dictionary with keys
-       &ldquo;<literal>index</literal>&rdquo; and
-       &ldquo;<literal>label</literal>&rdquo;. In addition to
-       describing the meaning of those keys, this tells you that the
-       actual json output will contain a <literal>pagelabels</literal>
-       array, each of whose elements is a dictionary that contains an
-       <literal>index</literal> key, a <literal>label</literal> key,
-       and no other keys.
+       While qpdf guarantees that keys present in the help will be
+       present in the output, those fields may be null or empty if the
+       information is not known or absent in the file. Also, if you
+       specify <option>--json-keys</option>, the keys that are not
+       listed will be excluded entirely except for those that
+       <option>--json-help</option> says are always present.
       </para>
      </listitem>
-    </varlistentry>
-    <varlistentry>
-     <term>Directness and Simplicity</term>
      <listitem>
       <para>
-       The json output contains the value of every object in the file,
-       but it also contains some processed data. This is analogous to
-       how qpdf's library interface works. The processed data is
-       similar to the helper functions in that it allows you to look
-       at certain aspects of the PDF file without having to understand
-       all the nuances of the PDF specification, while the raw objects
-       allow you to mine the PDF for anything that the higher-level
-       interfaces are lacking.
+       In a few places, there are keys with names containing
+       <literal>pageposfrom1</literal>. The values of these keys are
+       null or an integer. If an integer, they point to a page index
+       within the file numbering from 1. Note that json indexes from
+       0, and you would also use 0-based indexing using the API.
+       However, 1-based indexing is easier in this case because the
+       command-line syntax for specifying page ranges is 1-based. If
+       you were going to write a program that looked through the json
+       for information about specific pages and then use the
+       command-line to extract those pages, 1-based indexing is
+       easier. Besides, it's more convenient to subtract 1 from a
+       program in a real programming language than it is to add 1 from
+       shell code.
       </para>
      </listitem>
-    </varlistentry>
-   </variablelist>
-  </para>
-  <para>
-   There are a few limitations to be aware of with the json structure:
-   <itemizedlist>
-    <listitem>
-     <para>
-      Strings, names, and indirect object references in the original
-      PDF file are all converted to strings in the json
-      representation. In the case of a &ldquo;normal&rdquo; PDF file,
-      you can tell the difference because a name starts with a slash
-      (<literal>/</literal>), and an indirect object reference looks
-      like <literal>n n R</literal>, but if there were to be a string
-      that looked like a name or indirect object reference, there
-      would be no way to tell this from the json output. Note that
-      there are certain cases where you know for sure what something
-      is, such as knowing that dictionary keys in objects are always
-      names and that certain things in the higher-level computed data
-      are known to contain indirect object references.
-     </para>
-    </listitem>
-    <listitem>
-     <para>
-      The json format doesn't support binary data very well. Mostly
-      the details are not important, but they are presented here for
-      information. When qpdf outputs a string in the json
-      representation, it converts the string to UTF-8, assuming usual
-      PDF string semantics. Specifically, if the original string is
-      UTF-16, it is converted to UTF-8. Otherwise, it is assumed to
-      have PDF doc encoding, and is converted to UTF-8 with that
-      assumption. This causes strange things to happen to binary
-      strings. For example, if you had the binary string
-      <literal>&lt;038051&gt;</literal>, this would be output to the
-      json as <literal>\u0003•Q</literal> because
-      <literal>03</literal> is not a printable character and
-      <literal>80</literal> is the bullet character in PDF doc
-      encoding and is mapped to the Unicode value
-      <literal>2022</literal>. Since <literal>51</literal> is
-      <literal>Q</literal>, it is output as is. If you wanted to
-      convert back from here to a binary string, would have to
-      recognize Unicode values whose code points are higher than
-      <literal>0xFF</literal> and map those back to their
-      corresponding PDF doc encoding characters. There is no way to
-      tell the difference between a Unicode string that was originally
-      encoded as UTF-16 or one that was converted from PDF doc
-      encoding. In other words, it's best if you don't try to use the
-      json format to extract binary strings from the PDF file, but if
-      you really had to, it could be done. Note that qpdf's
-      <option>--show-object</option> option does not have this
-      limitation and will reveal the string as encoded in the original
-      file.
-     </para>
-    </listitem>
-   </itemizedlist>
-  </para>
-  <para>
-   For specific details on the information provided in the json
-   output, please run <command>qpdf --json-help</command>.
-  </para>
+     <listitem>
+      <para>
+       The image information included in the <literal>page</literal>
+       section of the json output includes the key
+       &ldquo;<literal>filterable</literal>&rdquo;. Note that the
+       value of this field may depend on the
+       <option>--decode-level</option> that you invoke qpdf with. The
+       json output includes a top-level key
+       &ldquo;<literal>parameters</literal>&rdquo; that indicates the
+       decode level used for computing whether a stream was
+       filterable. For example, jpeg images will be shown as not
+       filterable by default, but they will be shown as filterable if
+       you run <command>qpdf --json --decode-level=all</command>.
+      </para>
+     </listitem>
+    </itemizedlist>
+   </para>
+  </sect1>
  </chapter>
  <chapter id="ref.design">
   <title>Design and Library Notes</title>
@@ -338,6 +338,9 @@ static JSON json_schema(std::set&lt;std::string&gt;* keys = 0)
         outline.addDictionaryMember(
             "dest",
             JSON::makeString("outline destination dictionary"));
+        page.addDictionaryMember(
+            "pageposfrom1",
+            JSON::makeString("position of page in document numbering from 1"));
     }
     if (all_keys || keys->count("pagelabels"))
     {
@@ -371,6 +374,10 @@ static JSON json_schema(std::set&lt;std::string&gt;* keys = 0)
         outlines.addDictionaryMember(
             "open",
             JSON::makeString("whether the outline is displayed expanded"));
+        outlines.addDictionaryMember(
+            "destpageposfrom1",
+            JSON::makeString("position of destination page in document"
+                             " numbered from 1; null if not known"));
     }
     return schema;
 }
@@ -2813,6 +2820,7 @@ static void do_json_pages(QPDF&amp; pdf, Options&amp; o, JSON&amp; j)
             j_outline.addDictionaryMember(
                 "dest", (*oiter).getDest().getJSON(true));
         }
+        j_page.addDictionaryMember("pageposfrom1", JSON::makeInt(1 + pageno));
     }
 }
@@ -2847,7 +2855,8 @@ static void do_json_page_labels(QPDF&amp; pdf, Options&amp; o, JSON&amp; j)
 }
 static void add_outlines_to_json(
-    std::list<QPDFOutlineObjectHelper> outlines, JSON& j)
+    std::list<QPDFOutlineObjectHelper> outlines, JSON& j,
+    std::map<QPDFObjGen, int>& page_numbers)
 {
     for (std::list<QPDFOutlineObjectHelper>::iterator iter = outlines.begin();
          iter != outlines.end(); ++iter)
@@ -2858,17 +2867,39 @@ static void add_outlines_to_json(
         jo.addDictionaryMember("title", JSON::makeString(ol.getTitle()));
         jo.addDictionaryMember("dest", ol.getDest().getJSON(true));
         jo.addDictionaryMember("open", JSON::makeBool(ol.getCount() >= 0));
+        QPDFObjectHandle page = ol.getDestPage();
+        JSON j_destpage = JSON::makeNull();
+        if (page.isIndirect())
+        {
+            QPDFObjGen og = page.getObjGen();
+            if (page_numbers.count(og))
+            {
+                j_destpage = JSON::makeInt(page_numbers[og]);
+            }
+        }
+        jo.addDictionaryMember("destpageposfrom1", j_destpage);
         JSON j_kids = jo.addDictionaryMember("kids", JSON::makeArray());
-        add_outlines_to_json(ol.getKids(), j_kids);
+        add_outlines_to_json(ol.getKids(), j_kids, page_numbers);
     }
 }
 static void do_json_outlines(QPDF& pdf, Options& o, JSON& j)
 {
+    std::map<QPDFObjGen, int> page_numbers;
+    QPDFPageDocumentHelper dh(pdf);
+    std::vector<QPDFPageObjectHelper> pages = dh.getAllPages();
+    int n = 0;
+    for (std::vector<QPDFPageObjectHelper>::iterator iter = pages.begin();
+	 iter != pages.end(); ++iter)
+    {
+	QPDFObjectHandle oh = (*iter).getObjectHandle();
+	page_numbers[oh.getObjGen()] = ++n;
+    }
+
     JSON j_outlines = j.addDictionaryMember(
         "outlines", JSON::makeArray());
     QPDFOutlineDocumentHelper odh(pdf);
-    add_outlines_to_json(odh.getTopLevelOutlines(), j_outlines);
+    add_outlines_to_json(odh.getTopLevelOutlines(), j_outlines, page_numbers);
 }
 static void do_json(QPDF& pdf, Options& o)