Commit f95e0549cc6d402ab29f64306560e5677e528dad
1 parent ed04b80c
Update documentation to clarify some limitations of qpdf JSON
Showing 2 changed files with 72 additions and 7 deletions
TODO
| ... | ... | @@ -11,8 +11,6 @@ Next |
| 11 | 11 | Before Release: |
| 12 | 12 | |
| 13 | 13 | * Stay on top of https://github.com/pikepdf/pikepdf/pull/315 |
| 14 | -* Consider whether otherwise unreferenced object streams should be | |
| 15 | - included in json output. Probably not. Or maybe optionally. | |
| 16 | 14 | * Support json v2 in the C API. At a minimum, write_json, |
| 17 | 15 | create_from_json, and update_from_json need to be there and should |
| 18 | 16 | take the same kinds of functions as the C API for logger. |
| ... | ... | @@ -56,6 +54,20 @@ direct objects, which are always "resolved" in QPDFObjectHandle. |
| 56 | 54 | Possible future JSON enhancements |
| 57 | 55 | ================================= |
| 58 | 56 | |
| 57 | +* Consider not including unreferenced objects and trimming the trailer | |
| 58 | + in the same way that QPDFWriter does (except don't remove `/ID`). | |
| 59 | + This means excluding the linearization dictionary and hint stream, | |
| 60 | + the encryption dictionary, all keys from trailer that are removed by | |
| 61 | + QPDFWriter::getTrimmedTrailer except `/ID`, any object streams, and | |
| 62 | + the xref stream as long as all those objects are unreferenced. (They | |
| 63 | + always should be, but there could be some bizarre case of someone | |
| 64 | + creating a PDF file that has an indirect reference to one of those, | |
| 65 | + in which case we need to preserve it.) If this is done, make | |
| 66 | + `--preserve-unreferenced` preserve unreferenced objects and also | |
| 67 | + those extra keys. Search for "linear" and "trailer" in json.rst to | |
| 68 | + update the various places in the documentation that discuss this. | |
| 69 | + Also update the help for --json and --preserve-unreferenced. | |
| 70 | + | |
| 59 | 71 | * Add to JSON output the information available from a few additional |
| 60 | 72 | informational options: |
| 61 | 73 | |
| ... | ... | @@ -376,7 +388,8 @@ I find it useful to make reference to them in this list. |
| 376 | 388 | convertible back to a valid PDF. Since providing the password may |
| 377 | 389 | reveal additional details, --show-encryption could potentially retry |
| 378 | 390 | with this option if the first time doesn't work. Then, with the file |
| 379 | - open, we can read the encryption dictionary normally. | |
| 391 | + open, we can read the encryption dictionary normally. If this is | |
| 392 | + done, search for "raw, encrypted" in json.rst. | |
| 380 | 393 | |
| 381 | 394 | * In libtests, separate executables that need the object library |
| 382 | 395 | from those that strictly use public API. Move as many of the test | ... | ... |
manual/json.rst
| ... | ... | @@ -52,6 +52,22 @@ changes a handful of defaults so that the resulting JSON is as close |
| 52 | 52 | as possible to the original input and is ready for being converted |
| 53 | 53 | back to PDF. |
| 54 | 54 | |
| 55 | +The qpdf JSON data includes unreferenced objects. This may be | |
| 56 | +addressed in a future version of qpdf. For now, that means that | |
| 57 | +certain objects that are not useful in the JSON representation are | |
| 58 | +included. This includes linearization and encryption dictionaries, | |
| 59 | +linearization hint streams, object streams, and the cross-reference | |
| 60 | +(xref) stream associated with the trailer dictionary where applicable. | |
| 61 | +For the best experience with qpdf JSON, you can run the file through | |
| 62 | +qpdf first to remove encryption, linearization, and object streams. | |
| 63 | +For example: | |
| 64 | + | |
| 65 | +:: | |
| 66 | + | |
| 67 | + qpdf --decrypt --object-streams=disable in.pdf out.pdf | |
| 68 | + qpdf --json-output out.pdf out.json | |
| 69 | + | |
| 70 | + | |
| 55 | 71 | .. _json-terminology: |
| 56 | 72 | |
| 57 | 73 | JSON Terminology |
| ... | ... | @@ -299,10 +315,46 @@ Object Values |
| 299 | 315 | Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``. |
| 300 | 316 | As such, none of the things ``QPDFWriter`` does apply. This includes |
| 301 | 317 | recompression of streams, renumbering of objects, removal of |
| 302 | -unreferenced objects, anything to do with object streams (which are | |
| 303 | -not represented by qpdf JSON at all since they are PDF syntax, not | |
| 304 | -semantics), encryption, decryption, linearization, QDF mode, etc. See | |
| 305 | -:ref:`rewriting` for a more in-depth discussion. | |
| 318 | +unreferenced objects, encryption, decryption, linearization, QDF | |
| 319 | +mode, etc. See :ref:`rewriting` for a more in-depth discussion. This | |
| 320 | +has a few noteworthy implications: | |
| 321 | + | |
| 322 | +- Decryption is handled transparently by qpdf. As there are no QPDF | |
| 323 | + APIs, even internal to the library, that allow retrieval of | |
| 324 | + encrypted data in its raw, encrypted form, qpdf JSON always includes | |
| 325 | + decrypted data. It is possible that a future version of qpdf may | |
| 326 | + allow access to raw, encrypted string and stream data. | |
| 327 | + | |
| 328 | +- Objects that are related to a PDF file's structure, rather than its | |
| 329 | + content, are included in the JSON output, even though they are not | |
| 330 | + particularly useful. A future version of qpdf may omit them, in | |
| 331 | + which case the :qpdf:ref:`--preserve-unreferenced` flag might | |
| 332 | + restore the existing behavior. For now, to avoid this, run the | |
| 333 | + file through ``qpdf --decrypt --object-streams=disable in.pdf | |
| 334 | + out.pdf`` to generate a new PDF file that contains no unreferenced | |
| 335 | + or structural objects. | |
| 336 | + | |
| 337 | + - Linearized PDF files include a linearization dictionary which is not | |
| 338 | + referenced from any other object and which references the | |
| 339 | + linearization hint stream by offset. The JSON from a linearized PDF | |
| 340 | + file contains both of these objects, even though they are not useful | |
| 341 | + in the JSON. Offset information is not represented in the JSON, so | |
| 342 | + there's no way to find the linearization hint stream from the | |
| 343 | + JSON. If a new PDF is created from JSON that was written, the | |
| 344 | + objects will be read back in but will just be unreferenced objects | |
| 345 | + that will be ignored by ``QPDFWriter`` when the file is rewritten. | |
| 346 | + | |
| 347 | + - The JSON from a file with object streams will include the original | |
| 348 | + object stream and will also include all the objects in the stream | |
| 349 | + as top-level objects. | |
| 350 | + | |
| 351 | + - In files with object streams, the trailer "dictionary" is a | |
| 352 | + stream. In qpdf JSON files, the ``"trailer"`` key will contain a | |
| 353 | + dictionary with all the keys in it relating to the stream, and the | |
| 354 | + stream will also appear as an unreferenced object. | |
| 355 | + | |
| 356 | + - Encrypted files are decrypted, but the encryption dictionary still | |
| 357 | + appears in the JSON output. | |
| 306 | 358 | |
| 307 | 359 | .. _json.example: |
| 308 | 360 | ... | ... |
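
The structural objects described in the json.rst changes above can be spotted in qpdf JSON output with a short script. The following is a minimal sketch, not part of the commit. It assumes the JSON v2 layout in which the top-level ``"qpdf"`` key holds a two-element array (a header followed by an object map keyed by ``"obj:N G R"``), with streams stored as ``{"stream": {"dict": ...}}`` and other objects as ``{"value": ...}``. The function name ``find_structural_objects`` and the ``SAMPLE`` data are hypothetical. As the documentation notes, the linearization hint stream cannot be identified this way because offsets are not represented in the JSON.

```python
import json


def find_structural_objects(doc):
    """Return {object key: description} for objects that describe file structure.

    Assumption: doc["qpdf"] is [header, objects] as in qpdf JSON v2.
    """
    objects = doc["qpdf"][1]
    found = {}
    for key, entry in objects.items():
        if not key.startswith("obj:"):
            continue  # skip the "trailer" entry
        if "stream" in entry:
            obj_dict = entry["stream"].get("dict", {})
        else:
            obj_dict = entry.get("value")
        if not isinstance(obj_dict, dict):
            continue  # arrays, numbers, strings, etc.
        if "/Linearized" in obj_dict:
            found[key] = "linearization dictionary"
        elif obj_dict.get("/Type") == "/ObjStm":
            found[key] = "object stream"
        elif obj_dict.get("/Type") == "/XRef":
            found[key] = "xref stream"
    return found


# Hypothetical, hand-written data in the assumed layout (not real qpdf output).
SAMPLE = {
    "qpdf": [
        {"jsonversion": 2},
        {
            "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
            "obj:2 0 R": {"value": {"/Type": "/Pages", "/Count": 0, "/Kids": []}},
            "obj:3 0 R": {"value": {"/Linearized": 1, "/N": 1}},
            "obj:4 0 R": {"stream": {"dict": {"/Type": "/ObjStm", "/N": 2}}},
            "obj:5 0 R": {"stream": {"dict": {"/Type": "/XRef", "/Root": "1 0 R"}}},
            "trailer": {"value": {"/Root": "1 0 R"}},
        },
    ]
}

if __name__ == "__main__":
    for key, reason in find_structural_objects(SAMPLE).items():
        print(key, reason)
```

To try this on real output, load the file written by ``qpdf --json-output`` with ``json.load`` and pass the result to the function. If the layout assumption holds, running it before and after the ``qpdf --decrypt --object-streams=disable`` step should show the object stream and xref stream entries disappearing.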