Commit f95e0549cc6d402ab29f64306560e5677e528dad

Authored by Jay Berkenbilt
1 parent ed04b80c

Update documentation to clarify some limitations of qpdf JSON

Showing 2 changed files with 72 additions and 7 deletions
... ... @@ -11,8 +11,6 @@ Next
11 11 Before Release:
12 12  
13 13 * Stay on top of https://github.com/pikepdf/pikepdf/pull/315
14   -* Consider whether otherwise unreferenced object streams should be
15   - included in json output. Probably not. Or maybe optionally.
16 14 * Support json v2 in the C API. At a minimum, write_json,
17 15 create_from_json, and update_from_json need to be there and should
18 16 take the same kinds of functions as the C API for logger.
... ... @@ -56,6 +54,20 @@ direct objects, which are always "resolved" in QPDFObjectHandle.
56 54 Possible future JSON enhancements
57 55 =================================
58 56  
  57 +* Consider not including unreferenced objects and trimming the trailer
  58 + in the same way that QPDFWriter does (except don't remove `/ID`).
  59 + This means excluding the linearization dictionary and hint stream,
  60 + the encryption dictionary, all keys from trailer that are removed by
  61 + QPDFWriter::getTrimmedTrailer except `/ID`, any object streams, and
  62 + the xref stream as long as all those objects are unreferenced. (They
  63 + always should be, but there could be some bizarre case of someone
  64 + creating a PDF file that has an indirect reference to one of those,
  65 + in which case we need to preserve it.) If this is done, make
  66 + `--preserve-unreferenced` preserve unreferenced objects and also
  67 + those extra keys. Search for "linear" and "trailer" in json.rst to
  68 + update the various places in the documentation that discuss this.
  69 + Also update the help for --json and --preserve-unreferenced.
  70 +
59 71 * Add to JSON output the information available from a few additional
60 72 informational options:
61 73  
... ... @@ -376,7 +388,8 @@ I find it useful to make reference to them in this list.
376 388 convertible back to a valid PDF. Since providing the password may
377 389 reveal additional details, --show-encryption could potentially retry
378 390 with this option if the first time doesn't work. Then, with the file
379   - open, we can read the encryption dictionary normally.
  391 + open, we can read the encryption dictionary normally. If this is
  392 + done, search for "raw, encrypted" in json.rst.
380 393  
381 394 * In libtests, separate executables that need the object library
382 395 from those that strictly use public API. Move as many of the test
... ...
manual/json.rst
... ... @@ -52,6 +52,22 @@ changes a handful of defaults so that the resulting JSON is as close
52 52 as possible to the original input and is ready for being converted
53 53 back to PDF.
54 54  
  55 +The qpdf JSON data includes unreferenced objects. This may be
  56 +addressed in a future version of qpdf. For now, that means that
  57 +certain objects that are not useful in the JSON representation are
  58 +included. This includes linearization and encryption dictionaries,
  59 +linearization hint streams, object streams, and the cross-reference
  60 +(xref) stream associated with the trailer dictionary where applicable.
  61 +For the best experience with qpdf JSON, you can run the file through
  62 +qpdf first to remove encryption, linearization, and object streams.
  63 +For example:
  64 +
  65 +::
  66 +
  67 + qpdf --decrypt --object-streams=disable in.pdf out.pdf
  68 + qpdf --json-output out.pdf out.json
  69 +
  70 +
55 71 .. _json-terminology:
56 72  
57 73 JSON Terminology
... ... @@ -299,10 +315,46 @@ Object Values
299 315 Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``.
300 316 As such, none of the things ``QPDFWriter`` does apply. This includes
301 317 recompression of streams, renumbering of objects, removal of
302   -unreferenced objects, anything to do with object streams (which are
303   -not represented by qpdf JSON at all since they are PDF syntax, not
304   -semantics), encryption, decryption, linearization, QDF mode, etc. See
305   -:ref:`rewriting` for a more in-depth discussion.
  318 +unreferenced objects, encryption, decryption, linearization, QDF
  319 +mode, etc. See :ref:`rewriting` for a more in-depth discussion. This
  320 +has a few noteworthy implications:
  321 +
  322 +- Decryption is handled transparently by qpdf. As there are no QPDF
  323 + APIs, even internal to the library, that allow retrieval of
  324 + encrypted data in its raw, encrypted form, qpdf JSON always includes
  325 + decrypted data. It is possible that a future version of qpdf may
  326 + allow access to raw, encrypted string and stream data.
  327 +
  328 +- Objects that are related to a PDF file's structure, rather than its
  329 + content, are included in the JSON output, even though they are not
  330 + particularly useful. In a future version of qpdf, this may be fixed,
  331 + and the :qpdf:ref:`--preserve-unreferenced` flag may then be used
  332 + to get the existing behavior. For now, to avoid this, run the
  333 + file through ``qpdf --decrypt --object-streams=disable in.pdf
  334 + out.pdf`` to generate a new PDF file that contains no unreferenced
  335 + or structural objects.
  336 +
  337 + - Linearized PDF files include a linearization dictionary which is not
  338 + referenced from any other object and which references the
  339 + linearization hint stream by offset. The JSON from a linearized PDF
  340 + file contains both of these objects, even though they are not useful
  341 + in the JSON. Offset information is not represented in the JSON, so
  342 + there's no way to find the linearization hint stream from the
  343 + JSON. If a new PDF is created from JSON that was written, the
  344 + objects will be read back in but will just be unreferenced objects
  345 + that will be ignored by ``QPDFWriter`` when the file is rewritten.
  346 +
  347 + - The JSON from a file with object streams will include the original
  348 + object stream and will also include all the objects in the stream
  349 + as top-level objects.
  350 +
  351 + - In files with object streams, the trailer "dictionary" is a
  352 + stream. In qpdf JSON files, the ``"trailer"`` key will contain a
  353 + dictionary with all the keys in it relating to the stream, and the
  354 + stream will also appear as an unreferenced object.
  355 +
  356 + - Encrypted files are decrypted, but the encryption dictionary still
  357 + appears in the JSON output.
306 358  
307 359 .. _json.example:
308 360  
... ...
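
Illustration (not part of the commit): the json.rst text added above says that object streams, the xref stream, and the linearization dictionary show up in qpdf JSON as ordinary objects. Below is a minimal Python sketch of how a consumer might spot them. It assumes the JSON v2 layout in which the ``"qpdf"`` key holds a two-element array whose second element maps ``"obj:N G R"`` keys to either ``{"value": ...}`` or ``{"stream": {"dict": ...}}``. The helper name ``find_structural_objects`` and the sample data are hypothetical. The linearization hint stream is deliberately not detected, since, as noted above, it is only reachable by offset.

```python
# /Type names that mark structural streams in the PDF format.
STRUCTURAL_STREAM_TYPES = {
    "/ObjStm": "object stream",
    "/XRef": "cross-reference stream",
}


def find_structural_objects(qpdf_json):
    """Return {object key: description} for structure-only objects.

    Assumes qpdf JSON v2: qpdf_json["qpdf"][1] maps "obj:N G R" keys
    to {"value": ...} or {"stream": {"dict": ...}} entries.
    """
    objects = qpdf_json["qpdf"][1]
    found = {}
    for key, obj in objects.items():
        if not key.startswith("obj:"):
            continue  # skip the "trailer" entry
        if "stream" in obj:
            stream_type = obj["stream"].get("dict", {}).get("/Type")
            if isinstance(stream_type, str) and stream_type in STRUCTURAL_STREAM_TYPES:
                found[key] = STRUCTURAL_STREAM_TYPES[stream_type]
        elif isinstance(obj.get("value"), dict) and "/Linearized" in obj["value"]:
            found[key] = "linearization dictionary"
    return found


# Hand-built sample in the assumed layout (not real qpdf output).
sample = {
    "qpdf": [
        {"jsonversion": 2, "pdfversion": "1.5"},
        {
            "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
            "obj:5 0 R": {"stream": {"dict": {"/Type": "/ObjStm", "/N": 2, "/First": 10}}},
            "obj:6 0 R": {"value": {"/Linearized": 1, "/L": 1234}},
            "obj:7 0 R": {"stream": {"dict": {"/Type": "/XRef", "/Size": 8}}},
            "trailer": {"value": {"/Root": "1 0 R", "/Size": 8}},
        },
    ]
}

structural = find_structural_objects(sample)
for key, description in sorted(structural.items()):
    print(key, "->", description)
```

For real input, the dictionary would come from ``json.load`` on a file written by ``qpdf --json-output``. Running the file through ``qpdf --decrypt --object-streams=disable`` first, as the documentation above recommends, should leave none of these objects to find.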