Commit f95e0549cc6d402ab29f64306560e5677e528dad
1 parent ed04b80c
Update documentation to clarify some limitations of qpdf JSON
Showing 2 changed files with 72 additions and 7 deletions
TODO
| @@ -11,8 +11,6 @@ Next | @@ -11,8 +11,6 @@ Next | ||
| 11 | Before Release: | 11 | Before Release: |
| 12 | 12 | ||
| 13 | * Stay on top of https://github.com/pikepdf/pikepdf/pull/315 | 13 | * Stay on top of https://github.com/pikepdf/pikepdf/pull/315 |
| 14 | -* Consider whether otherwise unreferenced object streams should be | ||
| 15 | - included in json output. Probably not. Or maybe optionally. | ||
| 16 | * Support json v2 in the C API. At a minimum, write_json, | 14 | * Support json v2 in the C API. At a minimum, write_json, |
| 17 | create_from_json, and update_from_json need to be there and should | 15 | create_from_json, and update_from_json need to be there and should |
| 18 | take the same kinds of functions as the C API for logger. | 16 | take the same kinds of functions as the C API for logger. |
| @@ -56,6 +54,20 @@ direct objects, which are always "resolved" in QPDFObjectHandle. | @@ -56,6 +54,20 @@ direct objects, which are always "resolved" in QPDFObjectHandle. | ||
| 56 | Possible future JSON enhancements | 54 | Possible future JSON enhancements |
| 57 | ================================= | 55 | ================================= |
| 58 | 56 | ||
| 57 | +* Consider not including unreferenced objects and trimming the trailer | ||
| 58 | + in the same way that QPDFWriter does (except don't remove `/ID`). | ||
| 59 | + This means excluding the linearization dictionary and hint stream, | ||
| 60 | + the encryption dictionary, all keys from trailer that are removed by | ||
| 61 | + QPDFWriter::getTrimmedTrailer except `/ID`, any object streams, and | ||
| 62 | + the xref stream as long as all those objects are unreferenced. (They | ||
| 63 | + always should be, but there could be some bizarre case of someone | ||
| 64 | + creating a PDF file that has an indirect reference to one of those, | ||
| 65 | + in which case we need to preserve it.) If this is done, make | ||
| 66 | + `--preserve-unreferenced` preserve unreferenced objects and also | ||
| 67 | + those extra keys. Search for "linear" and "trailer" in json.rst to | ||
| 68 | + update the various places in the documentation that discuss this. | ||
| 69 | + Also update the help for --json and --preserve-unreferenced. | ||
| 70 | + | ||
| 59 | * Add to JSON output the information available from a few additional | 71 | * Add to JSON output the information available from a few additional |
| 60 | informational options: | 72 | informational options: |
| 61 | 73 | ||
| @@ -376,7 +388,8 @@ I find it useful to make reference to them in this list. | @@ -376,7 +388,8 @@ I find it useful to make reference to them in this list. | ||
| 376 | convertible back to a valid PDF. Since providing the password may | 388 | convertible back to a valid PDF. Since providing the password may |
| 377 | reveal additional details, --show-encryption could potentially retry | 389 | reveal additional details, --show-encryption could potentially retry |
| 378 | with this option if the first time doesn't work. Then, with the file | 390 | with this option if the first time doesn't work. Then, with the file |
| 379 | - open, we can read the encryption dictionary normally. | 391 | + open, we can read the encryption dictionary normally. If this is |
| 392 | + done, search for "raw, encrypted" in json.rst. | ||
| 380 | 393 | ||
| 381 | * In libtests, separate executables that need the object library | 394 | * In libtests, separate executables that need the object library |
| 382 | from those that strictly use public API. Move as many of the test | 395 | from those that strictly use public API. Move as many of the test |
manual/json.rst
| @@ -52,6 +52,22 @@ changes a handful of defaults so that the resulting JSON is as close | @@ -52,6 +52,22 @@ changes a handful of defaults so that the resulting JSON is as close | ||
| 52 | as possible to the original input and is ready for being converted | 52 | as possible to the original input and is ready for being converted |
| 53 | back to PDF. | 53 | back to PDF. |
| 54 | 54 | ||
| 55 | +The qpdf JSON data includes unreferenced objects. This may be | ||
| 56 | +addressed in a future version of qpdf. For now, that means that | ||
| 57 | +certain objects that are not useful in the JSON representation are | ||
| 58 | +included. This includes linearization and encryption dictionaries, | ||
| 59 | +linearization hint streams, object streams, and the cross-reference | ||
| 60 | +(xref) stream associated with the trailer dictionary where applicable. | ||
| 61 | +For the best experience with qpdf JSON, you can run the file through | ||
| 62 | +qpdf first to remove encryption, linearization, and object streams. | ||
| 63 | +For example: | ||
| 64 | + | ||
| 65 | +:: | ||
| 66 | + | ||
| 67 | + qpdf --decrypt --object-streams=disable in.pdf out.pdf | ||
| 68 | + qpdf --json-output out.pdf out.json | ||
| 69 | + | ||
| 70 | + | ||
| 55 | .. _json-terminology: | 71 | .. _json-terminology: |
| 56 | 72 | ||
| 57 | JSON Terminology | 73 | JSON Terminology |
| @@ -299,10 +315,46 @@ Object Values | @@ -299,10 +315,46 @@ Object Values | ||
| 299 | Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``. | 315 | Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``. |
| 300 | As such, none of the things ``QPDFWriter`` does apply. This includes | 316 | As such, none of the things ``QPDFWriter`` does apply. This includes |
| 301 | recompression of streams, renumbering of objects, removal of | 317 | recompression of streams, renumbering of objects, removal of |
| 302 | -unreferenced objects, anything to do with object streams (which are | ||
| 303 | -not represented by qpdf JSON at all since they are PDF syntax, not | ||
| 304 | -semantics), encryption, decryption, linearization, QDF mode, etc. See | ||
| 305 | -:ref:`rewriting` for a more in-depth discussion. | 318 | +unreferenced objects, encryption, decryption, linearization, QDF |
| 319 | +mode, etc. See :ref:`rewriting` for a more in-depth discussion. This | ||
| 320 | +has a few noteworthy implications: | ||
| 321 | + | ||
| 322 | +- Decryption is handled transparently by qpdf. As there are no QPDF | ||
| 323 | + APIs, even internal to the library, that allow retrieval of | ||
| 324 | + encrypted data in its raw, encrypted form, qpdf JSON always includes | ||
| 325 | + decrypted data. It is possible that a future version of qpdf may | ||
| 326 | + allow access to raw, encrypted string and stream data. | ||
| 327 | + | ||
| 328 | +- Objects that are related to a PDF file's structure, rather than its | ||
| 329 | + content, are included in the JSON output, even though they are not | ||
| 330 | + particularly useful. In a future version of qpdf, this may be fixed, | ||
| 331 | + and the :qpdf:ref:`--preserve-unreferenced` flag may be able to be | ||
| 332 | + used to get the existing behavior. For now, to avoid this, run the | ||
| 333 | + file through ``qpdf --decrypt --object-streams=disable in.pdf | ||
| 334 | + out.pdf`` to generate a new PDF file that contains no unreferenced | ||
| 335 | + or structural objects. | ||
| 336 | + | ||
| 337 | + - Linearized PDF files include a linearization dictionary which is not | ||
| 338 | + referenced from any other object and which references the | ||
| 339 | + linearization hint stream by offset. The JSON from a linearized PDF | ||
| 340 | + file contains both of these objects, even though they are not useful | ||
| 341 | + in the JSON. Offset information is not represented in the JSON, so | ||
| 342 | + there's no way to find the linearization hint stream from the | ||
| 343 | + JSON. If a new PDF is created from JSON that was written, the | ||
| 344 | + objects will be read back in but will just be unreferenced objects | ||
| 345 | + that will be ignored by ``QPDFWriter`` when the file is rewritten. | ||
| 346 | + | ||
| 347 | + - The JSON from a file with object streams will include the original | ||
| 348 | + object stream and will also include all the objects in the stream | ||
| 349 | + as top-level objects. | ||
| 350 | + | ||
| 351 | + - In files with object streams, the trailer "dictionary" is a | ||
| 352 | + stream. In qpdf JSON files, the ``"trailer"`` key will contain a | ||
| 353 | + dictionary with all the keys in it relating to the stream, and the | ||
| 354 | + stream will also appear as an unreferenced object. | ||
| 355 | + | ||
| 356 | + - Encrypted files are decrypted, but the encryption dictionary still | ||
| 357 | + appears in the JSON output. | ||
| 306 | 358 | ||
| 307 | .. _json.example: | 359 | .. _json.example: |
| 308 | 360 |
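
Note on the json.rst change above: the new text says that structural objects (object streams, the xref stream, the linearization dictionary) still appear in qpdf JSON. A minimal sketch of how a reader might spot them in version 2 output follows. It is not part of this commit. It assumes the JSON v2 layout in which `"qpdf"` is a two-element array whose second element maps keys such as `"obj:5 0 R"` to either a `"value"` or a `"stream"` entry. The function name and the `SAMPLE` data are hypothetical. The linearization hint stream is not detected, since, as the documentation notes, it is referenced only by offset.

```python
# Stream /Type values that mark PDF-structure objects. These names come from
# the PDF specification; treating them as "structural" is this sketch's choice.
STRUCTURAL_STREAM_TYPES = {"/ObjStm", "/XRef"}


def find_structural_objects(qpdf_json):
    """Map object keys to a reason for objects that look structural.

    Assumes qpdf JSON v2: qpdf_json["qpdf"][1] holds the objects.
    """
    objects = qpdf_json["qpdf"][1]
    found = {}
    for key, entry in objects.items():
        if not key.startswith("obj:"):
            continue  # skip the "trailer" entry
        if "stream" in entry:
            # Object streams and xref streams carry a /Type in the stream dict.
            stream_type = entry["stream"].get("dict", {}).get("/Type")
            if stream_type in STRUCTURAL_STREAM_TYPES:
                found[key] = stream_type
        else:
            # A linearization dictionary is identified by its /Linearized key.
            value = entry.get("value")
            if isinstance(value, dict) and "/Linearized" in value:
                found[key] = "/Linearized"
    return found


# Hand-written, hypothetical data shaped like qpdf JSON v2 output.
SAMPLE = {
    "qpdf": [
        {"jsonversion": 2},
        {
            "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
            "obj:5 0 R": {"stream": {"dict": {"/Type": "/ObjStm", "/N": 3}}},
            "obj:6 0 R": {"value": {"/Linearized": 1}},
            "obj:7 0 R": {"stream": {"dict": {"/Type": "/XRef"}}},
            "trailer": {"value": {"/Root": "1 0 R"}},
        },
    ]
}

structural = find_structural_objects(SAMPLE)
```

For a real file, the dictionary would come from `json.load` on the output of `qpdf --json-output`. Running the file through `qpdf --decrypt --object-streams=disable` first, as the updated documentation suggests, should leave nothing for this check to find.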