Commit 0bd908b550603a6bcc399a825a170a1263378b22
1 parent b7bbf12e
Update documentation for qpdf JSON v2
Showing 14 changed files with 903 additions and 419 deletions
TODO
| @@ -2,14 +2,13 @@ | @@ -2,14 +2,13 @@ | ||
| 2 | Next | 2 | Next |
| 3 | ==== | 3 | ==== |
| 4 | 4 | ||
| 5 | +Before Release: | ||
| 6 | + | ||
| 5 | * At next release, hide release-qpdf-10.6.3.0cmake* versions at readthedocs | 7 | * At next release, hide release-qpdf-10.6.3.0cmake* versions at readthedocs |
| 6 | * Stay on top of https://github.com/pikepdf/pikepdf/pull/315 | 8 | * Stay on top of https://github.com/pikepdf/pikepdf/pull/315 |
| 7 | * Release qtest with updates to qtest-driver and copy back into qpdf | 9 | * Release qtest with updates to qtest-driver and copy back into qpdf |
| 8 | 10 | ||
| 9 | -In order: | ||
| 10 | -* json v2 | ||
| 11 | - | ||
| 12 | -Other (do in any order): | 11 | +Pending changes: |
| 13 | 12 | ||
| 14 | * Good C API for json v2 | 13 | * Good C API for json v2 |
| 15 | * QPDFPagesTree -- avoid ever flattening the pages tree. | 14 | * QPDFPagesTree -- avoid ever flattening the pages tree. |
| @@ -50,180 +49,10 @@ Other (do in any order): | @@ -50,180 +49,10 @@ Other (do in any order): | ||
| 50 | * Rework tests so that nothing is written into the source directory. | 49 | * Rework tests so that nothing is written into the source directory. |
| 51 | Ideally then the entire build could be done with a read-only | 50 | Ideally then the entire build could be done with a read-only |
| 52 | source tree. | 51 | source tree. |
| 52 | +* Consider adding fuzzer code for JSON | ||
| 53 | 53 | ||
| 54 | Soon: Break ground on "Document-level work" | 54 | Soon: Break ground on "Document-level work" |
| 55 | 55 | ||
| 56 | -Output JSON v2 | ||
| 57 | -============== | ||
| 58 | - | ||
| 59 | -Remaining work: | ||
| 60 | - | ||
| 61 | -* Make sure all the information from informational options is | ||
| 62 | - available in the json output. | ||
| 63 | - | ||
| 64 | - * --check: add but maybe not by default? | ||
| 65 | - | ||
| 66 | - * --show-linearization: add but maybe not by default? Also figure | ||
| 67 | - out whether warnings reported for some of the PDF specs (1.7) are | ||
| 68 | - qpdf problems. This may not be worth adding in the first | ||
| 69 | - increment. | ||
| 70 | - | ||
| 71 | - * --show-xref: add | ||
| 72 | - | ||
| 73 | -* Consider having --check, --show-encryption, etc., just select the | ||
| 74 | - right keys when in json mode. I don't think I want check on by | ||
| 75 | - default, so that might be different. | ||
| 76 | - | ||
| 77 | -* Consider having warnings be included in the json in a "warnings" key | ||
| 78 | - in json mode. | ||
| 79 | - | ||
| 80 | -Notes for documentation: | ||
| 81 | - | ||
| 82 | -* Find all mentions of json in the manual and update. | ||
| 83 | - | ||
| 84 | -* Document typo fix in encrypt in release notes along with any other | ||
| 85 | - non-compatible json 2 changes. Scrutinize all the output to decide | ||
| 86 | - what should change. | ||
| 87 | - | ||
| 88 | -* Keys other than "qpdf-v2" are ignored so people can stash their own | ||
| 89 | - stuff. Unknown keys are ignored at other places for future | ||
| 90 | - compatibility. Readers of qpdf json should continue to ignore keys | ||
| 91 | - they don't recognize. | ||
| 92 | - | ||
| 93 | -* Change: names are written in canonical form with a leading slash | ||
| 94 | - just as they are treated in the code. In v1, they were written in | ||
| 95 | - PDF syntax in the json file. Example: /text#2fplain in pdf will be | ||
| 96 | - written as /text/plain in json v2 and as /text#2fplain in json v1. | ||
| 97 | - | ||
| 98 | -* Document changes to strings, objects, streams, object keys. | ||
| 99 | - | ||
| 100 | -* CLI: --json-input, --json-output[=version], --update-from-json. With | ||
| 101 | - --json-input, the input file is a JSON file instead of a PDF file. | ||
| 102 | - It must be complete, meaning that a PDF version must be given, all | ||
| 103 | - streams must have exactly one of data or datafile, and a trailer | ||
| 104 | - dictionary must be present, even if empty. | ||
| 105 | - | ||
| 106 | - With --update-from-json, the JSON file updates objects in place. If | ||
| 107 | - updating an old stream, if stream data is omitted, the data remains | ||
| 108 | - untouched. The dictionary is always required. Remember that | ||
| 109 | - QPDFWriter does not preserve object numbers, though --json-output | ||
| 110 | - does. Therefore, if you want to update a PDF with a JSON, the input | ||
| 111 | - to --update-from-json must be the same PDF as the one that | ||
| 112 | - --json-output was run on previously. Otherwise, object numbers won't | ||
| 113 | - match. Show this with an example. When updating, | ||
| 114 | - | ||
| 115 | -* Certain fields are ignored when reading the JSON. This includes | ||
| 116 | - maxobjectid, any computed fields in trailer (such as /Size), and all | ||
| 117 | - /Length keys in stream dictionaries. There is no need for the user | ||
| 118 | - to correct, remove, or otherwise worry about any values those keys | ||
| 119 | - might have. The maxobjectid field is present in the original output | ||
| 120 | - to assist with adding new objects to the file. | ||
| 121 | - | ||
| 122 | -* JSON strings within PDF objects: | ||
| 123 | - | ||
| 124 | - * "n n R" is an indirect object | ||
| 125 | - | ||
| 126 | - * "/Name" is a name in canonical form with a leading slash (like | ||
| 127 | - "/text/plain"), not PDF syntax (like "/text#2fplain"). | ||
| 128 | - | ||
| 129 | - * "b:hex-digits" is a binary string ("b:feff03c0"). Hex digits may be | ||
| 130 | - mixed case. There must be an even number of digits. | ||
| 131 | - | ||
| 132 | - * "u:utf-8" is a UTF-8 encoded string ("u:π", "u:\u03c0"). UTF-16 | ||
| 133 | - surrogate pairs are allowed. These are all equivalent: "u:🥔", | ||
| 134 | - "u:\ud83e\udd54", "b:FEFFD83EDD54", "b:efbbbff09fa594". | ||
| 135 | - | ||
| 136 | - * Both "b:" and "u:" are valid representations of the empty string. | ||
| 137 | - | ||
| 138 | - * Anything else is an error | ||
| 139 | - | ||
| 140 | -* Document use of --json-input and --json-output together to show | ||
| 141 | - preservation of object numbers. Draw attention to "original object | ||
| 142 | - ID" comments in qdf as another way to show it. | ||
| 143 | - | ||
| 144 | -* Document top-level keys of "qpdf-v2" ("pdfversion", "objects", | ||
| 145 | - "maxobjectid") noting that "maxobjectid" is ignored when reading. | ||
| 146 | - | ||
| 147 | -* Stream data: "data" is base64-encoded stream data. "datafile" is the | ||
| 148 | - path to a file (relative path recommended but not required) | ||
| 149 | - containing the binary data. As with any PDF representation, the data | ||
| 150 | - must be consistent with the filters. --decode-level is honored by | ||
| 151 | - --json-output. | ||
| 152 | - | ||
| 153 | -* Other changes from v1: | ||
| 154 | - | ||
| 155 | - * in "objects", keys are "obj:o g R" or "trailer" | ||
| 156 | - | ||
| 157 | - * Non-stream objects are dictionaries with a "value" key whose value | ||
| 158 | - is the object. Stream objects are dictionaries with a "stream" key | ||
| 159 | - whose value is {"dict": stream-dictionary}. The "/Length" key is | ||
| 160 | - omitted from the stream dictionary. | ||
| 161 | - | ||
| 162 | - * "objectinfo" is gone as it is now possible to tell a stream from a | ||
| 163 | - non-stream directly. To get stream data, use the --json-output | ||
| 164 | - option. Note about how "pages" may cause the pages tree to be | ||
| 165 | - corrected. | ||
| 166 | - | ||
| 167 | -For non-streams: | ||
| 168 | - | ||
| 169 | - "obj:o g R": { | ||
| 170 | - "value": ... | ||
| 171 | - } | ||
| 172 | - | ||
| 173 | -For streams: | ||
| 174 | - | ||
| 175 | - "obj:o g R": { | ||
| 176 | - "stream": { | ||
| 177 | - "dict": { ... stream dictionary ... }, | ||
| 178 | - "data": "base64-encoded data", | ||
| 179 | - "datafile": "path to file containing binary data" | ||
| 180 | - } | ||
| 181 | - } | ||
| 182 | - | ||
| 183 | -Rationale of "obj:o g R" is that indirect object references are just | ||
| 184 | -"o g R", and so code that wants to resolve one can do so easily by | ||
| 185 | -just prepending "obj:" and not having to parse or split the string. | ||
| 186 | -Having a prefix rather than making the key just "o g R" makes it much | ||
| 187 | -easier to search in the JSON for the definition of an object. | ||
| 188 | - | ||
| 189 | -CLI: | ||
| 190 | - | ||
| 191 | -Example workflow: | ||
| 192 | -* qpdf in.pdf --json-output pdf.json | ||
| 193 | -* edit pdf.json | ||
| 194 | -* qpdf --json-input pdf.json out.pdf | ||
| 195 | - | ||
| 196 | -* qpdf in.pdf --json-output pdf.json | ||
| 197 | -* edit pdf.json keeping only objects that need to be changed | ||
| 198 | -* qpdf in.pdf --update-from-json=pdf.json out.pdf | ||
| 199 | - | ||
| 200 | -To modify a single object: | ||
| 201 | - | ||
| 202 | -* qpdf in.pdf --json-output pdf.json --json-object=o,g | ||
| 203 | -* edit pdf.json | ||
| 204 | -* qpdf in.pdf --update-from-json=pdf.json out.pdf | ||
| 205 | - | ||
| 206 | -Historical note: you can't create a PDF from v1 json because | ||
| 207 | - | ||
| 208 | -* The PDF version header is not recorded | ||
| 209 | - | ||
| 210 | -* Strings cannot be unambiguously encoded/decoded | ||
| 211 | - | ||
| 212 | - * Can't tell string from name from indirect object | ||
| 213 | - | ||
| 214 | - * Strings are treated as PDF doc encoding and output as UTF-8, which | ||
| 215 | - doesn't work since multiple PDF doc code points are undefined and | ||
| 216 | - is absurd for binary strings | ||
| 217 | - | ||
| 218 | -* There is no representation of stream data | ||
| 219 | - | ||
| 220 | -* You can't tell a stream from a dictionary except by looking in both | ||
| 221 | - "object" and "objectinfo". | ||
| 222 | - | ||
| 223 | -* Using "n n R" as a key in "objects" and "objectinfo" makes it hard | ||
| 224 | - to search for things when viewing the JSON file in an editor. | ||
| 225 | - | ||
| 226 | - | ||
| 227 | QPDFPagesTree | 56 | QPDFPagesTree |
| 228 | ============= | 57 | ============= |
| 229 | 58 | ||
| @@ -256,6 +85,28 @@ sure /Count and /Parent are correct. | @@ -256,6 +85,28 @@ sure /Count and /Parent are correct. | ||
| 256 | refs/attic/QPDFPagesTree-old -- original, abandoned branch -- clean up | 85 | refs/attic/QPDFPagesTree-old -- original, abandoned branch -- clean up |
| 257 | when done. | 86 | when done. |
| 258 | 87 | ||
| 88 | +Possible future JSON enhancements | ||
| 89 | +================================= | ||
| 90 | + | ||
| 91 | +* Add to JSON output the information available from a few additional | ||
| 92 | + informational options: | ||
| 93 | + | ||
| 94 | + * --check: add but maybe not by default? | ||
| 95 | + | ||
| 96 | + * --show-linearization: add but maybe not by default? Also figure | ||
| 97 | + out whether warnings reported for some of the PDF specs (1.7) are | ||
| 98 | + qpdf problems. This may not be worth adding in the first | ||
| 99 | + increment. | ||
| 100 | + | ||
| 101 | + * --show-xref: add | ||
| 102 | + | ||
| 103 | +* Consider having --check, --show-encryption, etc., just select the | ||
| 104 | + right keys when in json mode. I don't think I want check on by | ||
| 105 | + default, so that might be different. | ||
| 106 | + | ||
| 107 | +* Consider having warnings be included in the json in a "warnings" key | ||
| 108 | + in json mode. | ||
| 109 | + | ||
| 259 | QPDFJob | 110 | QPDFJob |
| 260 | ======= | 111 | ======= |
| 261 | 112 |
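The string conventions listed in the removed TODO notes ("n n R" for indirect references, "/Name" for names, "b:" for binary strings, "u:" for UTF-8 strings, and the "obj:" key prefix) can be illustrated with a minimal C++ sketch. This is not qpdf's implementation: `JsonStringKind`, `classify`, `decode_binary`, and `object_key` are hypothetical names, and requiring at least one character after the slash in a name is an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <regex>
#include <string>

// Hypothetical classification of a JSON string found inside a PDF object
// in qpdf JSON v2, following the rules in the notes above.
enum class JsonStringKind { IndirectRef, Name, Binary, Unicode, Invalid };

JsonStringKind classify(std::string const& s)
{
    // "n n R" is an indirect object reference.
    static std::regex const ref_re("[0-9]+ [0-9]+ R");
    if (std::regex_match(s, ref_re)) {
        return JsonStringKind::IndirectRef;
    }
    // "/Name" is a name in canonical form (assumption: at least one
    // character must follow the slash).
    if (s.size() > 1 && s[0] == '/') {
        return JsonStringKind::Name;
    }
    // "u:" introduces a UTF-8 string; "u:" alone is the empty string.
    if (s.compare(0, 2, "u:") == 0) {
        return JsonStringKind::Unicode;
    }
    // "b:" introduces hex digits: mixed case allowed, even count required.
    if (s.compare(0, 2, "b:") == 0) {
        if ((s.size() - 2) % 2 != 0) {
            return JsonStringKind::Invalid;
        }
        for (std::size_t i = 2; i < s.size(); ++i) {
            if (!std::isxdigit(static_cast<unsigned char>(s[i]))) {
                return JsonStringKind::Invalid;
            }
        }
        return JsonStringKind::Binary;
    }
    // Anything else is an error.
    return JsonStringKind::Invalid;
}

// Decode a "b:hex-digits" string into raw bytes; nullopt if not binary.
std::optional<std::string> decode_binary(std::string const& s)
{
    if (classify(s) != JsonStringKind::Binary) {
        return std::nullopt;
    }
    std::string bytes;
    for (std::size_t i = 2; i < s.size(); i += 2) {
        bytes.push_back(
            static_cast<char>(std::stoi(s.substr(i, 2), nullptr, 16)));
    }
    return bytes;
}

// Resolve an indirect reference to its key in "objects" by prepending
// "obj:", with no parsing or splitting of the reference itself.
std::string object_key(std::string const& indirect_ref)
{
    return "obj:" + indirect_ref;
}
```

Under these assumptions, `object_key("3 0 R")` yields the key `"obj:3 0 R"`, which is the lookup convenience the rationale paragraph describes.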
cSpell.json
include/qpdf/QPDF.hh
| @@ -112,8 +112,11 @@ class QPDF | @@ -112,8 +112,11 @@ class QPDF | ||
| 112 | 112 | ||
| 113 | // Create a PDF from an input source that contains JSON as written | 113 | // Create a PDF from an input source that contains JSON as written |
| 114 | // by writeJSON (or qpdf --json-output, version 2 or higher). The | 114 | // by writeJSON (or qpdf --json-output, version 2 or higher). The |
| 115 | - // JSON must be a complete representation of a PDF. See "QPDF JSON | ||
| 116 | - // Format" in the manual for details. | 115 | + // JSON must be a complete representation of a PDF. See "qpdf |
| 116 | + // JSON" in the manual for details. The input JSON may be | ||
| 117 | + // arbitrarily large. QPDF does not load stream data into memory | ||
| 118 | + // for more than one stream at a time, even if the stream data is | ||
| 119 | + // specified inline. | ||
| 117 | QPDF_DLL | 120 | QPDF_DLL |
| 118 | void createFromJSON(std::string const& json_file); | 121 | void createFromJSON(std::string const& json_file); |
| 119 | QPDF_DLL | 122 | QPDF_DLL |
| @@ -122,24 +125,40 @@ class QPDF | @@ -122,24 +125,40 @@ class QPDF | ||
| 122 | // Update a PDF from an input source that contains JSON in the | 125 | // Update a PDF from an input source that contains JSON in the |
| 123 | // same format as is written by writeJSON (or qpdf --json-output, | 126 | // same format as is written by writeJSON (or qpdf --json-output, |
| 124 | // version 2 or higher). Objects in the PDF and not in the JSON | 127 | // version 2 or higher). Objects in the PDF and not in the JSON |
| 125 | - // are not modified. See "QPDF JSON Format" in the manual for | ||
| 126 | - // details. | 128 | + // are not modified. See "qpdf JSON" in the manual for details. As |
| 129 | + // with createFromJSON, the input JSON may be arbitrarily large. | ||
| 127 | QPDF_DLL | 130 | QPDF_DLL |
| 128 | void updateFromJSON(std::string const& json_file); | 131 | void updateFromJSON(std::string const& json_file); |
| 129 | QPDF_DLL | 132 | QPDF_DLL |
| 130 | void updateFromJSON(std::shared_ptr<InputSource>); | 133 | void updateFromJSON(std::shared_ptr<InputSource>); |
| 131 | 134 | ||
| 132 | - // Write qpdf json format. The only supported version is 2. If | ||
| 133 | - // wanted_objects is empty, write all objects. Otherwise, write | ||
| 134 | - // only objects whose keys are in wanted_objects. Keys may be | ||
| 135 | - // either "trailer" or of the form "obj:n n R". Invalid keys are | ||
| 136 | - // ignored. | 135 | + // Write qpdf json format to the pipeline "p". The only supported |
| 136 | + // version is 2. The finish() method is called on the pipeline at | ||
| 137 | + // the end. The decode_level parameter controls which streams are | ||
| 138 | + // uncompressed in the JSON. Use qpdf_dl_none to preserve all | ||
| 139 | + // stream data exactly as it appears in the input. The possible | ||
| 140 | + // values for json_stream_data can be found in qpdf/Constants.h | ||
| 141 | + // and correspond to the --json-stream-data command-line argument. | ||
| 142 | + // If json_stream_data is qpdf_sj_file, file_prefix must be | ||
| 143 | + // specified. Each stream will be written to a file whose path is | ||
| 144 | + // constructed by appending "-nnn" to file_prefix, where "nnn" is | ||
| 145 | + // the object number (not zero-filled). If wanted_objects is | ||
| 146 | + // empty, write all objects. Otherwise, write only objects whose | ||
| 147 | + // keys are in wanted_objects. Keys may be either "trailer" or of | ||
| 148 | + // the form "obj:n n R". Invalid keys are ignored. This | ||
| 149 | + // corresponds to the --json-object command-line argument. | ||
| 150 | + // | ||
| 151 | + // QPDF is efficient with regard to memory when writing, allowing | ||
| 152 | + // you to write arbitrarily large PDF files to a pipeline. You can | ||
| 153 | + // use a pipeline like Pl_Buffer or Pl_String to capture the JSON | ||
| 154 | + // output in memory, but do so with caution as this will allocate | ||
| 155 | + // enough memory to hold the entire PDF file. | ||
| 137 | QPDF_DLL | 156 | QPDF_DLL |
| 138 | void writeJSON( | 157 | void writeJSON( |
| 139 | int version, | 158 | int version, |
| 140 | - Pipeline*, | ||
| 141 | - qpdf_stream_decode_level_e, | ||
| 142 | - qpdf_json_stream_data_e, | 159 | + Pipeline* p, |
| 160 | + qpdf_stream_decode_level_e decode_level, | ||
| 161 | + qpdf_json_stream_data_e json_stream_data, | ||
| 143 | std::string const& file_prefix, | 162 | std::string const& file_prefix, |
| 144 | std::set<std::string> wanted_objects); | 163 | std::set<std::string> wanted_objects); |
| 145 | 164 |
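The new writeJSON comment describes two conventions that are easy to sketch in isolation: stream data files are named by appending "-nnn" (the object number, not zero-filled) to file_prefix, and wanted_objects keys must be "trailer" or of the form "obj:n n R", with invalid keys ignored. The following is a minimal illustration only; `stream_data_path`, `is_valid_wanted_key`, and `keep_valid_keys` are hypothetical helpers, not part of the qpdf API.

```cpp
#include <cassert>
#include <regex>
#include <set>
#include <string>

// Hypothetical helper: path of the file that would hold the data for the
// stream in object `objid` when stream data is written to files.
// The object number is appended as "-nnn" without zero padding.
std::string stream_data_path(std::string const& file_prefix, int objid)
{
    return file_prefix + "-" + std::to_string(objid);
}

// Hypothetical helper: a wanted_objects key is valid if it is "trailer"
// or has the form "obj:n n R".
bool is_valid_wanted_key(std::string const& key)
{
    static std::regex const obj_re("obj:[0-9]+ [0-9]+ R");
    return key == "trailer" || std::regex_match(key, obj_re);
}

// Drop invalid keys, mirroring the documented behavior that invalid keys
// are ignored rather than treated as errors.
std::set<std::string> keep_valid_keys(std::set<std::string> const& wanted)
{
    std::set<std::string> result;
    for (auto const& key : wanted) {
        if (is_valid_wanted_key(key)) {
            result.insert(key);
        }
    }
    return result;
}
```

With these assumptions, a prefix of `out.json` and object 7 give `out.json-7`, and a key such as `12 0 R` (missing the `obj:` prefix) is dropped.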
job.sums
| @@ -8,10 +8,10 @@ include/qpdf/auto_job_c_pages.hh b3cc0f21029f6d89efa043dcdbfa183cb59325b6506001c | @@ -8,10 +8,10 @@ include/qpdf/auto_job_c_pages.hh b3cc0f21029f6d89efa043dcdbfa183cb59325b6506001c | ||
| 8 | include/qpdf/auto_job_c_uo.hh ae21b69a1efa9333050f4833d465f6daff87e5b38e5106e49bbef5d4132e4ed1 | 8 | include/qpdf/auto_job_c_uo.hh ae21b69a1efa9333050f4833d465f6daff87e5b38e5106e49bbef5d4132e4ed1 |
| 9 | job.yml 3b2b3c6f92b48f6c76109711cbfdd74669fa31a80cd17379548b09f8e76be05d | 9 | job.yml 3b2b3c6f92b48f6c76109711cbfdd74669fa31a80cd17379548b09f8e76be05d |
| 10 | libqpdf/qpdf/auto_job_decl.hh 74df4d7fdbdf51ecd0d58ce1e9844bb5525b9adac5a45f7c9a787ecdda2868df | 10 | libqpdf/qpdf/auto_job_decl.hh 74df4d7fdbdf51ecd0d58ce1e9844bb5525b9adac5a45f7c9a787ecdda2868df |
| 11 | -libqpdf/qpdf/auto_job_help.hh c1cc99f6fe17285ee5e40730f6280e37d17da1a5f408086ce34e01af121df7ad | 11 | +libqpdf/qpdf/auto_job_help.hh 3aaae4cde004e5314d3ac6d554da575e40209c0f0611f6a308957986f9c7967b |
| 12 | libqpdf/qpdf/auto_job_init.hh 7ea8e0641dc26fdfba6e283e14dbbff0c016654e174cdace8054f8bef53750fd | 12 | libqpdf/qpdf/auto_job_init.hh 7ea8e0641dc26fdfba6e283e14dbbff0c016654e174cdace8054f8bef53750fd |
| 13 | libqpdf/qpdf/auto_job_json_decl.hh 06caa46eaf71db8a50c046f91866baa8087745a9474319fb7c86d92634cc8297 | 13 | libqpdf/qpdf/auto_job_json_decl.hh 06caa46eaf71db8a50c046f91866baa8087745a9474319fb7c86d92634cc8297 |
| 14 | libqpdf/qpdf/auto_job_json_init.hh 5f6b53e3c81d4b54ce5c4cf9c3f52d0c02f987c53bf8841c0280367bad23e335 | 14 | libqpdf/qpdf/auto_job_json_init.hh 5f6b53e3c81d4b54ce5c4cf9c3f52d0c02f987c53bf8841c0280367bad23e335 |
| 15 | libqpdf/qpdf/auto_job_schema.hh 9d543cd4a43eafffc2c4b8a6fee29e399c271c52cb6f7d417ae5497b3c1127dc | 15 | libqpdf/qpdf/auto_job_schema.hh 9d543cd4a43eafffc2c4b8a6fee29e399c271c52cb6f7d417ae5497b3c1127dc |
| 16 | manual/_ext/qpdf.py 6add6321666031d55ed4aedf7c00e5662bba856dfcd66ccb526563bffefbb580 | 16 | manual/_ext/qpdf.py 6add6321666031d55ed4aedf7c00e5662bba856dfcd66ccb526563bffefbb580 |
| 17 | -manual/cli.rst 82ead389c03bbf5e0498bd0571a11dc06544d591f4e4454c00322e3473fc556d | 17 | +manual/cli.rst e3f4331befa17450e0d0fff87569722a5aab42ea619ef64f0a3a04e1f99ed65c |
libqpdf/QPDF_json.cc
| @@ -817,4 +817,5 @@ QPDF::writeJSON( | @@ -817,4 +817,5 @@ QPDF::writeJSON( | ||
| 817 | JSON::writeDictionaryClose(p, first_qpdf, 1); | 817 | JSON::writeDictionaryClose(p, first_qpdf, 1); |
| 818 | JSON::writeDictionaryClose(p, first, 0); | 818 | JSON::writeDictionaryClose(p, first, 0); |
| 819 | *p << "\n"; | 819 | *p << "\n"; |
| 820 | + p->finish(); | ||
| 820 | } | 821 | } |
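The added `p->finish()` call means the writer, not the caller, signals the end of output. A toy sketch of why that matters for a buffering sink follows. `ToySink` and `write_document` are hypothetical stand-ins for illustration and do not reproduce qpdf's Pipeline class.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for a pipeline stage that buffers its input and only
// releases it downstream when finish() is called. NOT qpdf's Pipeline.
class ToySink
{
  public:
    void write(std::string const& data)
    {
        pending_ += data;
    }

    void finish()
    {
        output_ += pending_;
        pending_.clear();
        finished_ = true;
    }

    std::string const& output() const
    {
        return output_;
    }

    bool finished() const
    {
        return finished_;
    }

  private:
    std::string pending_;
    std::string output_;
    bool finished_ = false;
};

// Hypothetical writer following the pattern in the hunk above: emit the
// closing text, a trailing newline, and then finish the sink so the
// caller does not have to.
void write_document(ToySink* p)
{
    p->write("{}");
    p->write("\n");
    p->finish();
}
```

In this sketch, the sink's output stays empty until `finish()` runs, so a writer that forgot the call would leave a buffering sink with nothing visible downstream.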
libqpdf/qpdf/auto_job_help.hh
| @@ -70,6 +70,9 @@ ap.addOptionHelp("--copyright", "help", "show copyright information", R"(Display | @@ -70,6 +70,9 @@ ap.addOptionHelp("--copyright", "help", "show copyright information", R"(Display | ||
| 70 | ap.addOptionHelp("--show-crypto", "help", "show available crypto providers", R"(Show a list of available crypto providers, one per line. The | 70 | ap.addOptionHelp("--show-crypto", "help", "show available crypto providers", R"(Show a list of available crypto providers, one per line. The |
| 71 | default provider is shown first. | 71 | default provider is shown first. |
| 72 | )"); | 72 | )"); |
| 73 | +ap.addOptionHelp("--job-json-help", "help", "show format of job JSON", R"(Describe the format of the QPDFJob JSON input used by | ||
| 74 | +--job-json-file. | ||
| 75 | +)"); | ||
| 73 | ap.addHelpTopic("general", "general options", R"(General options control qpdf's behavior in ways that are not | 76 | ap.addHelpTopic("general", "general options", R"(General options control qpdf's behavior in ways that are not |
| 74 | directly related to the operation it is performing. | 77 | directly related to the operation it is performing. |
| 75 | )"); | 78 | )"); |
| @@ -87,11 +90,11 @@ ap.addOptionHelp("--verbose", "general", "print additional information", R"(Outp | @@ -87,11 +90,11 @@ ap.addOptionHelp("--verbose", "general", "print additional information", R"(Outp | ||
| 87 | doing, including information about files created and operations | 90 | doing, including information about files created and operations |
| 88 | performed. | 91 | performed. |
| 89 | )"); | 92 | )"); |
| 90 | -ap.addOptionHelp("--progress", "general", "show progress when writing", R"(Indicate progress when writing files. | ||
| 91 | -)"); | ||
| 92 | } | 93 | } |
| 93 | static void add_help_2(QPDFArgParser& ap) | 94 | static void add_help_2(QPDFArgParser& ap) |
| 94 | { | 95 | { |
| 96 | +ap.addOptionHelp("--progress", "general", "show progress when writing", R"(Indicate progress when writing files. | ||
| 97 | +)"); | ||
| 95 | ap.addOptionHelp("--no-warn", "general", "suppress printing of warning messages", R"(Suppress printing of warning messages. If warnings were | 98 | ap.addOptionHelp("--no-warn", "general", "suppress printing of warning messages", R"(Suppress printing of warning messages. If warnings were |
| 96 | encountered, qpdf still exits with exit status 3. | 99 | encountered, qpdf still exits with exit status 3. |
| 97 | Use --warning-exit-0 with --no-warn to completely ignore | 100 | Use --warning-exit-0 with --no-warn to completely ignore |
| @@ -172,12 +175,12 @@ companion tool "fix-qdf" can be used to repair hand-edited QDF | @@ -172,12 +175,12 @@ companion tool "fix-qdf" can be used to repair hand-edited QDF | ||
| 172 | files. QDF is a feature specific to the qpdf tool. Please see | 175 | files. QDF is a feature specific to the qpdf tool. Please see |
| 173 | the "QDF Mode" chapter in the manual. | 176 | the "QDF Mode" chapter in the manual. |
| 174 | )"); | 177 | )"); |
| 175 | -ap.addOptionHelp("--no-original-object-ids", "transformation", "omit original object IDs in qdf", R"(Omit comments in a QDF file indicating the object ID an object | ||
| 176 | -had in the original file. | ||
| 177 | -)"); | ||
| 178 | } | 178 | } |
| 179 | static void add_help_3(QPDFArgParser& ap) | 179 | static void add_help_3(QPDFArgParser& ap) |
| 180 | { | 180 | { |
| 181 | +ap.addOptionHelp("--no-original-object-ids", "transformation", "omit original object IDs in qdf", R"(Omit comments in a QDF file indicating the object ID an object | ||
| 182 | +had in the original file. | ||
| 183 | +)"); | ||
| 181 | ap.addOptionHelp("--compress-streams", "transformation", "compress uncompressed streams", R"(--compress-streams=[y|n] | 184 | ap.addOptionHelp("--compress-streams", "transformation", "compress uncompressed streams", R"(--compress-streams=[y|n] |
| 182 | 185 | ||
| 183 | Setting --compress-streams=n prevents qpdf from compressing | 186 | Setting --compress-streams=n prevents qpdf from compressing |
| @@ -188,9 +191,11 @@ ap.addOptionHelp("--decode-level", "transformation", "control which streams to u | @@ -188,9 +191,11 @@ ap.addOptionHelp("--decode-level", "transformation", "control which streams to u | ||
| 188 | 191 | ||
| 189 | When uncompressing streams, control which types of compression | 192 | When uncompressing streams, control which types of compression |
| 190 | schemes should be uncompressed: | 193 | schemes should be uncompressed: |
| 191 | -- none: don't uncompress anything. This is the default with --json-output. | 194 | +- none: don't uncompress anything. This is the default with |
| 195 | + --json-output. | ||
| 192 | - generalized: uncompress streams compressed with a | 196 | - generalized: uncompress streams compressed with a |
| 193 | - general-purpose compression algorithm. This is the default. | 197 | + general-purpose compression algorithm. This is the default |
| 198 | + except when --json-output is given. | ||
| 194 | - specialized: in addition to generalized, also uncompress | 199 | - specialized: in addition to generalized, also uncompress |
| 195 | streams compressed with a special-purpose but non-lossy | 200 | streams compressed with a special-purpose but non-lossy |
| 196 | compression scheme | 201 | compression scheme |
| @@ -290,13 +295,13 @@ from the resulting set, not based on the original page numbers. | @@ -290,13 +295,13 @@ from the resulting set, not based on the original page numbers. | ||
| 290 | ap.addHelpTopic("modification", "change parts of the PDF", R"(Modification options make systematic changes to certain parts of | 295 | ap.addHelpTopic("modification", "change parts of the PDF", R"(Modification options make systematic changes to certain parts of |
| 291 | the PDF, causing the PDF to render differently from the original. | 296 | the PDF, causing the PDF to render differently from the original. |
| 292 | )"); | 297 | )"); |
| 298 | +} | ||
| 299 | +static void add_help_4(QPDFArgParser& ap) | ||
| 300 | +{ | ||
| 293 | ap.addOptionHelp("--pages", "modification", "begin page selection", R"(--pages file [--password=password] [page-range] [...] -- | 301 | ap.addOptionHelp("--pages", "modification", "begin page selection", R"(--pages file [--password=password] [page-range] [...] -- |
| 294 | 302 | ||
| 295 | Run qpdf --help=page-selection for details. | 303 | Run qpdf --help=page-selection for details. |
| 296 | )"); | 304 | )"); |
| 297 | -} | ||
| 298 | -static void add_help_4(QPDFArgParser& ap) | ||
| 299 | -{ | ||
| 300 | ap.addOptionHelp("--collate", "modification", "collate with --pages", R"(--collate[=n] | 305 | ap.addOptionHelp("--collate", "modification", "collate with --pages", R"(--collate[=n] |
| 301 | 306 | ||
| 302 | Collate rather than concatenate pages specified with --pages. | 307 | Collate rather than concatenate pages specified with --pages. |
| @@ -460,14 +465,14 @@ ap.addOptionHelp("--assemble", "encryption", "restrict document assembly", R"(-- | @@ -460,14 +465,14 @@ ap.addOptionHelp("--assemble", "encryption", "restrict document assembly", R"(-- | ||
| 460 | Enable/disable document assembly (rotation and reordering of | 465 | Enable/disable document assembly (rotation and reordering of |
| 461 | pages). This option is not available with 40-bit encryption. | 466 | pages). This option is not available with 40-bit encryption. |
| 462 | )"); | 467 | )"); |
| 468 | +} | ||
| 469 | +static void add_help_5(QPDFArgParser& ap) | ||
| 470 | +{ | ||
| 463 | ap.addOptionHelp("--extract", "encryption", "restrict text/graphic extraction", R"(--extract=[y|n] | 471 | ap.addOptionHelp("--extract", "encryption", "restrict text/graphic extraction", R"(--extract=[y|n] |
| 464 | 472 | ||
| 465 | Enable/disable text/graphic extraction for purposes other than | 473 | Enable/disable text/graphic extraction for purposes other than |
| 466 | accessibility. | 474 | accessibility. |
| 467 | )"); | 475 | )"); |
| 468 | -} | ||
| 469 | -static void add_help_5(QPDFArgParser& ap) | ||
| 470 | -{ | ||
| 471 | ap.addOptionHelp("--form", "encryption", "restrict form filling", R"(--form=[y|n] | 476 | ap.addOptionHelp("--form", "encryption", "restrict form filling", R"(--form=[y|n] |
| 472 | 477 | ||
| 473 | Enable/disable whether filling form fields is allowed even if | 478 | Enable/disable whether filling form fields is allowed even if |
| @@ -638,6 +643,9 @@ ap.addOptionHelp("--remove-attachment", "attachments", "remove an embedded file" | @@ -638,6 +643,9 @@ ap.addOptionHelp("--remove-attachment", "attachments", "remove an embedded file" | ||
| 638 | Remove an embedded file using its key. Get the key with | 643 | Remove an embedded file using its key. Get the key with |
| 639 | --list-attachments. | 644 | --list-attachments. |
| 640 | )"); | 645 | )"); |
| 646 | +} | ||
| 647 | +static void add_help_6(QPDFArgParser& ap) | ||
| 648 | +{ | ||
| 641 | ap.addHelpTopic("pdf-dates", "PDF date format", R"(When a date is required, the date should conform to the PDF date | 649 | ap.addHelpTopic("pdf-dates", "PDF date format", R"(When a date is required, the date should conform to the PDF date |
| 642 | format specification, which is "D:yyyymmddhhmmssz" where "z" is | 650 | format specification, which is "D:yyyymmddhhmmssz" where "z" is |
| 643 | either literally upper case "Z" for UTC or a timezone offset in | 651 | either literally upper case "Z" for UTC or a timezone offset in |
| @@ -650,9 +658,6 @@ Examples: | @@ -650,9 +658,6 @@ Examples: | ||
| 650 | - D:20210207161528-05'00' February 7, 2021 at 4:15:28 p.m. | 658 | - D:20210207161528-05'00' February 7, 2021 at 4:15:28 p.m. |
| 651 | - D:20210207211528Z February 7, 2021 at 21:15:28 UTC | 659 | - D:20210207211528Z February 7, 2021 at 21:15:28 UTC |
| 652 | )"); | 660 | )"); |
| 653 | -} | ||
| 654 | -static void add_help_6(QPDFArgParser& ap) | ||
| 655 | -{ | ||
| 656 | ap.addHelpTopic("add-attachment", "attach (embed) files", R"(The options listed below appear between --add-attachment and its | 661 | ap.addHelpTopic("add-attachment", "attach (embed) files", R"(The options listed below appear between --add-attachment and its |
| 657 | terminating "--". | 662 | terminating "--". |
| 658 | )"); | 663 | )"); |
| @@ -747,14 +752,14 @@ the linearization hint tables are correct. | @@ -747,14 +752,14 @@ the linearization hint tables are correct. | ||
| 747 | )"); | 752 | )"); |
| 748 | ap.addOptionHelp("--show-linearization", "inspection", "show linearization hint tables", R"(Check and display all data in the linearization hint tables. | 753 | ap.addOptionHelp("--show-linearization", "inspection", "show linearization hint tables", R"(Check and display all data in the linearization hint tables. |
| 749 | )"); | 754 | )"); |
| 755 | +} | ||
| 756 | +static void add_help_7(QPDFArgParser& ap) | ||
| 757 | +{ | ||
| 750 | ap.addOptionHelp("--show-xref", "inspection", "show cross reference data", R"(Show the contents of the cross-reference table or stream (object | 758 | ap.addOptionHelp("--show-xref", "inspection", "show cross reference data", R"(Show the contents of the cross-reference table or stream (object |
| 751 | locations in the file) in a human-readable form. This is | 759 | locations in the file) in a human-readable form. This is |
| 752 | especially useful for files with cross-reference streams, which | 760 | especially useful for files with cross-reference streams, which |
| 753 | are stored in a binary format. | 761 | are stored in a binary format. |
| 754 | )"); | 762 | )"); |
| 755 | -} | ||
| 756 | -static void add_help_7(QPDFArgParser& ap) | ||
| 757 | -{ | ||
| 758 | ap.addOptionHelp("--show-object", "inspection", "show contents of an object", R"(--show-object={trailer|obj[,gen]} | 763 | ap.addOptionHelp("--show-object", "inspection", "show contents of an object", R"(--show-object={trailer|obj[,gen]} |
| 759 | 764 | ||
| 760 | Show the contents of the given object. This is especially useful | 765 | Show the contents of the given object. This is especially useful |
| @@ -814,21 +819,20 @@ This option is repeatable. If given, only specified objects will | @@ -814,21 +819,20 @@ This option is repeatable. If given, only specified objects will | ||
| 814 | be shown in the "objects" key of the JSON output. Otherwise, all | 819 | be shown in the "objects" key of the JSON output. Otherwise, all |
| 815 | objects will be shown. | 820 | objects will be shown. |
| 816 | )"); | 821 | )"); |
| 817 | -ap.addOptionHelp("--job-json-help", "json", "show format of job JSON", R"(Describe the format of the QPDFJob JSON input used by | ||
| 818 | ---job-json-file. | ||
| 819 | -)"); | ||
| 820 | ap.addOptionHelp("--json-stream-data", "json", "how to handle streams in json output", R"(--json-stream-data={none|inline|file} | 822 | ap.addOptionHelp("--json-stream-data", "json", "how to handle streams in json output", R"(--json-stream-data={none|inline|file} |
| 821 | 823 | ||
| 822 | -Control whether streams in json output should be omitted, | ||
| 823 | -written inline (base64-encoded) or written to a file. If "file" | ||
| 824 | -is chosen, the file will be the name of the input file appended | ||
| 825 | -with -nnn where nnn is the object number. The prefix can be | ||
| 826 | -overridden with --json-stream-prefix. | 824 | +When used with --json-output, this option controls whether |
| 825 | +streams in json output should be omitted, written inline | ||
| 826 | +(base64-encoded) or written to a file. If "file" is chosen, the | ||
| 827 | +file will be the name of the output file appended with -nnn where | ||
| 828 | +nnn is the object number. The prefix can be overridden with | ||
| 829 | +--json-stream-prefix. | ||
| 827 | )"); | 830 | )"); |
| 828 | ap.addOptionHelp("--json-stream-prefix", "json", "prefix for json stream data files", R"(--json-stream-prefix=file-prefix | 831 | ap.addOptionHelp("--json-stream-prefix", "json", "prefix for json stream data files", R"(--json-stream-prefix=file-prefix |
| 829 | 832 | ||
| 830 | -When --json-stream-data=file is given, override the input file | ||
| 831 | -name as the prefix for stream data files. Whatever is given here | 833 | +When used with --json-output, --json-stream-prefix=file-prefix |
| 834 | +sets the prefix for stream data files, overriding the default, | ||
| 835 | +which is to use the output file name. Whatever is given here | ||
| 832 | will be appended with -nnn to create the name of the file that | 836 | will be appended with -nnn to create the name of the file that |
| 833 | will contain the data for the stream in object nnn. | 837 | will contain the data for the stream in object nnn. |
| 834 | )"); | 838 | )"); |
| @@ -836,19 +840,19 @@ ap.addOptionHelp("--json-output", "json", "serialize to JSON", R"(--json-output[ | @@ -836,19 +840,19 @@ ap.addOptionHelp("--json-output", "json", "serialize to JSON", R"(--json-output[ | ||
| 836 | 840 | ||
| 837 | The output file will be qpdf JSON format at the given version. | 841 | The output file will be qpdf JSON format at the given version. |
| 838 | "version" may be a specific version or "latest" (the default). | 842 | "version" may be a specific version or "latest" (the default). |
| 839 | -Version 1 is not supported. See also --json-stream-data, | 843 | +The only supported version is 2. See also --json-stream-data, |
| 840 | --json-stream-prefix, and --decode-level. | 844 | --json-stream-prefix, and --decode-level. |
| 841 | )"); | 845 | )"); |
| 842 | ap.addOptionHelp("--json-input", "json", "input file is qpdf JSON", R"(Treat the input file as a JSON file in qpdf JSON format as | 846 | ap.addOptionHelp("--json-input", "json", "input file is qpdf JSON", R"(Treat the input file as a JSON file in qpdf JSON format as |
| 843 | -written by qpdf --json-output. See the "QPDF JSON Format" | 847 | +written by qpdf --json-output. See the "qpdf JSON Format" |
| 844 | section of the manual for information about how to use this | 848 | section of the manual for information about how to use this |
| 845 | option. | 849 | option. |
| 846 | )"); | 850 | )"); |
| 847 | ap.addOptionHelp("--update-from-json", "json", "update a PDF from qpdf JSON", R"(--update-from-json=qpdf-json-file | 851 | ap.addOptionHelp("--update-from-json", "json", "update a PDF from qpdf JSON", R"(--update-from-json=qpdf-json-file |
| 848 | 852 | ||
| 849 | -Update a PDF file from a JSON file. Please see the "QPDF JSON | ||
| 850 | -Format" section of the manual for information about how to use | ||
| 851 | -this option. | 853 | +Update a PDF file from a JSON file. Please see the "qpdf JSON" |
| 854 | +chapter of the manual for information about how to use this | ||
| 855 | +option. | ||
| 852 | )"); | 856 | )"); |
| 853 | } | 857 | } |
| 854 | static void add_help_8(QPDFArgParser& ap) | 858 | static void add_help_8(QPDFArgParser& ap) |
manual/cli.rst
| @@ -171,7 +171,9 @@ Related Options | @@ -171,7 +171,9 @@ Related Options | ||
| 171 | equivalent command-line arguments were supplied. It can be repeated | 171 | equivalent command-line arguments were supplied. It can be repeated |
| 172 | and mixed freely with other options. Run ``qpdf`` with | 172 | and mixed freely with other options. Run ``qpdf`` with |
| 173 | :qpdf:ref:`--job-json-help` for a description of the job JSON input | 173 | :qpdf:ref:`--job-json-help` for a description of the job JSON input |
| 174 | - file format. For more information, see :ref:`qpdf-job`. | 174 | + file format. For more information, see :ref:`qpdf-job`. Note that |
| 175 | + this is unrelated to :qpdf:ref:`--json` but may be combined with | ||
| 176 | + it. | ||
| 175 | 177 | ||
| 176 | .. _exit-status: | 178 | .. _exit-status: |
| 177 | 179 | ||
| @@ -341,6 +343,17 @@ Related Options | @@ -341,6 +343,17 @@ Related Options | ||
| 341 | itself. The default provider is always listed first. See | 343 | itself. The default provider is always listed first. See |
| 342 | :ref:`crypto` for more information about crypto providers. | 344 | :ref:`crypto` for more information about crypto providers. |
| 343 | 345 | ||
| 346 | +.. qpdf:option:: --job-json-help | ||
| 347 | + | ||
| 348 | + .. help: show format of job JSON | ||
| 349 | + | ||
| 350 | + Describe the format of the QPDFJob JSON input used by | ||
| 351 | + --job-json-file. | ||
| 352 | + | ||
| 353 | + Describe the format of the QPDFJob JSON input used by | ||
| 354 | + :qpdf:ref:`--job-json-file`. For more information about QPDFJob, | ||
| 355 | + see :ref:`qpdf-job`. | ||
| 356 | + | ||
| 344 | .. _general-options: | 357 | .. _general-options: |
| 345 | 358 | ||
| 346 | General Options | 359 | General Options |
| @@ -852,9 +865,11 @@ Related Options | @@ -852,9 +865,11 @@ Related Options | ||
| 852 | 865 | ||
| 853 | When uncompressing streams, control which types of compression | 866 | When uncompressing streams, control which types of compression |
| 854 | schemes should be uncompressed: | 867 | schemes should be uncompressed: |
| 855 | - - none: don't uncompress anything. This is the default with --json-output. | 868 | + - none: don't uncompress anything. This is the default with |
| 869 | + --json-output. | ||
| 856 | - generalized: uncompress streams compressed with a | 870 | - generalized: uncompress streams compressed with a |
| 857 | - general-purpose compression algorithm. This is the default. | 871 | + general-purpose compression algorithm. This is the default |
| 872 | + except when --json-output is given. | ||
| 858 | - specialized: in addition to generalized, also uncompress | 873 | - specialized: in addition to generalized, also uncompress |
| 859 | streams compressed with a special-purpose but non-lossy | 874 | streams compressed with a special-purpose but non-lossy |
| 860 | compression scheme | 875 | compression scheme |
| @@ -875,7 +890,8 @@ Related Options | @@ -875,7 +890,8 @@ Related Options | ||
| 875 | ``/ASCII85Decode``, and ``/ASCIIHexDecode``. We define | 890 | ``/ASCII85Decode``, and ``/ASCIIHexDecode``. We define |
| 876 | generalized filters as those to be used for general-purpose | 891 | generalized filters as those to be used for general-purpose |
| 877 | compression or encoding, as opposed to filters specifically | 892 | compression or encoding, as opposed to filters specifically |
| 878 | - designed for image data. This is the default. | 893 | + designed for image data. This is the default except when |
| 894 | + :qpdf:ref:`--json-output` is given. | ||
| 879 | 895 | ||
| 880 | - :samp:`specialized`: in addition to generalized, decode streams | 896 | - :samp:`specialized`: in addition to generalized, decode streams |
| 881 | with supported non-lossy specialized filters; currently this is | 897 | with supported non-lossy specialized filters; currently this is |
| @@ -3126,8 +3142,9 @@ Related Options | @@ -3126,8 +3142,9 @@ Related Options | ||
| 3126 | is usually but not always equal to the file name and is needed by | 3142 | is usually but not always equal to the file name and is needed by |
| 3127 | some of the other options. See also :ref:`attachments`. Note that | 3143 | some of the other options. See also :ref:`attachments`. Note that |
| 3128 | this option displays dates in PDF timestamp syntax. When attachment | 3144 | this option displays dates in PDF timestamp syntax. When attachment |
| 3129 | - information is included in json output (see :ref:`--json`), dates | ||
| 3130 | - are shown in ISO-8601 format. | 3145 | + information is included in json output in the ``"attachments"`` key |
| 3146 | + (see :ref:`--json`), dates are shown (just within that object) in | ||
| 3147 | + ISO-8601 format. | ||
| 3131 | 3148 | ||
| 3132 | .. qpdf:option:: --show-attachment=key | 3149 | .. qpdf:option:: --show-attachment=key |
| 3133 | 3150 | ||
| @@ -3169,14 +3186,11 @@ Related Options | @@ -3169,14 +3186,11 @@ Related Options | ||
| 3169 | 3186 | ||
| 3170 | Generate a JSON representation of the file. This is described in | 3187 | Generate a JSON representation of the file. This is described in |
| 3171 | depth in :ref:`json`. The version parameter can be used to specify | 3188 | depth in :ref:`json`. The version parameter can be used to specify |
| 3172 | - which version of the qpdf JSON format should be output. The only | ||
| 3173 | - supported value is ``1``, but it's possible that a new JSON output | ||
| 3174 | - version will be added in a future version. You can also specify | ||
| 3175 | - ``latest`` to use the latest JSON version. For backward | ||
| 3176 | - compatibility, the default value will remain ``1`` until qpdf | ||
| 3177 | - version 11, after which point it will become ``latest``. In all | ||
| 3178 | - case, you can tell what version of the JSON output you have from | ||
| 3179 | - the ``"version"`` key in the output. Use the | 3189 | + which version of the qpdf JSON format should be output. The version |
| 3190 | + number be a number or ``latest``. The default is ``latest``. As of | ||
| 3191 | + qpdf 11, the latest version is ``2``. If you have code that reads | ||
| 3192 | + qpdf JSON output, you can tell what version of the JSON output you | ||
| 3193 | + have from the ``"version"`` key in the output. Use the | ||
| 3180 | :qpdf:ref:`--json-help` option to get a description of the JSON | 3194 | :qpdf:ref:`--json-help` option to get a description of the JSON |
| 3181 | object. | 3195 | object. |
| 3182 | 3196 | ||
| @@ -3189,11 +3203,11 @@ Related Options | @@ -3189,11 +3203,11 @@ Related Options | ||
| 3189 | containing descriptive text. | 3203 | containing descriptive text. |
| 3190 | 3204 | ||
| 3191 | Describe the format of the JSON output by writing to standard | 3205 | Describe the format of the JSON output by writing to standard |
| 3192 | - output a JSON object with the same structure with the same keys as | ||
| 3193 | - the JSON generated by qpdf. In the output written by | ||
| 3194 | - ``--json-help``, each key's value is a description of the key. The | ||
| 3195 | - specific contract guaranteed by qpdf in its JSON representation is | ||
| 3196 | - explained in more detail in the :ref:`json`. | 3206 | + output a JSON object with the same structure as the JSON generated |
| 3207 | + by qpdf. In the output written by ``--json-help``, each key's value | ||
| 3208 | + is a description of the key. The specific contract guaranteed by | ||
| 3209 | + qpdf in its JSON representation is explained in more detail in the | ||
| 3210 | + :ref:`json`. | ||
| 3197 | 3211 | ||
| 3198 | .. qpdf:option:: --json-key=key | 3212 | .. qpdf:option:: --json-key=key |
| 3199 | 3213 | ||
| @@ -3216,53 +3230,50 @@ Related Options | @@ -3216,53 +3230,50 @@ Related Options | ||
| 3216 | be shown in the "objects" key of the JSON output. Otherwise, all | 3230 | be shown in the "objects" key of the JSON output. Otherwise, all |
| 3217 | objects will be shown. | 3231 | objects will be shown. |
| 3218 | 3232 | ||
| 3219 | - This option is repeatable. If given, only specified objects will | ||
| 3220 | - be shown in the "``objects``" key of the JSON output. Otherwise, all | ||
| 3221 | - objects will be shown. | ||
| 3222 | - | ||
| 3223 | -.. qpdf:option:: --job-json-help | ||
| 3224 | - | ||
| 3225 | - .. help: show format of job JSON | ||
| 3226 | - | ||
| 3227 | - Describe the format of the QPDFJob JSON input used by | ||
| 3228 | - --job-json-file. | ||
| 3229 | - | ||
| 3230 | - Describe the format of the QPDFJob JSON input used by | ||
| 3231 | - :qpdf:ref:`--job-json-file`. For more information about QPDFJob, | ||
| 3232 | - see :ref:`qpdf-job`. | 3233 | + This option is repeatable. If given, only specified objects will be |
| 3234 | + shown in the ``"objects"`` key of the JSON output. Otherwise, all | ||
| 3235 | + objects will be shown. For qpdf JSON version 1, this also affects | ||
| 3236 | + the ``"objectinfo"`` key, which is not present in version 2. This | ||
| 3237 | + option may be used with :qpdf:ref:`--json` and also with | ||
| 3238 | + :qpdf:ref:`--json-output`. | ||
| 3233 | 3239 | ||
| 3234 | .. qpdf:option:: --json-stream-data={none|inline|file} | 3240 | .. qpdf:option:: --json-stream-data={none|inline|file} |
| 3235 | 3241 | ||
| 3236 | .. help: how to handle streams in json output | 3242 | .. help: how to handle streams in json output |
| 3237 | 3243 | ||
| 3238 | - Control whether streams in json output should be omitted, | ||
| 3239 | - written inline (base64-encoded) or written to a file. If "file" | ||
| 3240 | - is chosen, the file will be the name of the input file appended | ||
| 3241 | - with -nnn where nnn is the object number. The prefix can be | ||
| 3242 | - overridden with --json-stream-prefix. | ||
| 3243 | - | ||
| 3244 | - Control whether streams in json output should be omitted, written | ||
| 3245 | - inline (base64-encoded) or written to a file. If ``file`` is | ||
| 3246 | - chosen, the file will be the name of the input file appended with | ||
| 3247 | - :samp:`-{nnn}` where :samp:`{nnn}` is the object number. The prefix | ||
| 3248 | - can be overridden with :qpdf:ref:`--json-stream-prefix`. This | ||
| 3249 | - option only applies when used with :qpdf:ref:`--json-output`. | 3244 | + When used with --json-output, this option controls whether |
| 3245 | + streams in json output should be omitted, written inline | ||
| 3246 | + (base64-encoded) or written to a file. If "file" is chosen, the | ||
| 3247 | + file will be the name of the output file appended with -nnn where | ||
| 3248 | + nnn is the object number. The prefix can be overridden with | ||
| 3249 | + --json-stream-prefix. | ||
| 3250 | + | ||
| 3251 | + When used with :qpdf:ref:`--json-output`, this option controls | ||
| 3252 | + whether streams in JSON output should be omitted, written inline | ||
| 3253 | + (base64-encoded) or written to a file. If ``file`` is chosen, the | ||
| 3254 | + file will be the name of the output file appended with | ||
| 3255 | + :samp:`-{nnn}` where :samp:`{nnn}` is the object number. The stream | ||
| 3256 | + data file prefix can be overridden with | ||
| 3257 | + :qpdf:ref:`--json-stream-prefix`. This option only applies when | ||
| 3258 | + used with :qpdf:ref:`--json-output`. | ||
| 3250 | 3259 | ||
| 3251 | .. qpdf:option:: --json-stream-prefix=file-prefix | 3260 | .. qpdf:option:: --json-stream-prefix=file-prefix |
| 3252 | 3261 | ||
| 3253 | .. help: prefix for json stream data files | 3262 | .. help: prefix for json stream data files |
| 3254 | 3263 | ||
| 3255 | - When --json-stream-data=file is given, override the input file | ||
| 3256 | - name as the prefix for stream data files. Whatever is given here | 3264 | + When used with --json-output, --json-stream-prefix=file-prefix |
| 3265 | + sets the prefix for stream data files, overriding the default, | ||
| 3266 | + which is to use the output file name. Whatever is given here | ||
| 3257 | will be appended with -nnn to create the name of the file that | 3267 | will be appended with -nnn to create the name of the file that |
| 3258 | will contain the data for the stream in object nnn. | 3268 | will contain the data for the stream in object nnn. |
| 3259 | 3269 | ||
| 3260 | - When :qpdf:ref:`--json-stream-data` is given with the value | ||
| 3261 | - ``file``, override the input file name as the prefix for stream | ||
| 3262 | - data files. Whatever is given here will be appended with | ||
| 3263 | - :samp:`-{nnn}` to create the name of the file that will contain the | ||
| 3264 | - data for the stream stream in object :samp:`{nnn}`. This | ||
| 3265 | - option only applies when used with :qpdf:ref:`--json-output`. | 3270 | + When used with :qpdf:ref:`--json-output`, |
| 3271 | + ``--json-stream-data=file-prefix`` sets the prefix for stream data | ||
| 3272 | + files, overriding the default, which is to use the output file | ||
| 3273 | + name. Whatever is given here will be appended with :samp:`-{nnn}` | ||
| 3274 | + to create the name of the file that will contain the data for the | ||
| 3275 | + stream stream in object :samp:`{nnn}`. This option only applies | ||
| 3276 | + when used with :qpdf:ref:`--json-output`. | ||
| 3266 | 3277 | ||
| 3267 | .. qpdf:option:: --json-output[=version] | 3278 | .. qpdf:option:: --json-output[=version] |
| 3268 | 3279 | ||
| @@ -3270,44 +3281,45 @@ Related Options | @@ -3270,44 +3281,45 @@ Related Options | ||
| 3270 | 3281 | ||
| 3271 | The output file will be qpdf JSON format at the given version. | 3282 | The output file will be qpdf JSON format at the given version. |
| 3272 | "version" may be a specific version or "latest" (the default). | 3283 | "version" may be a specific version or "latest" (the default). |
| 3273 | - Version 1 is not supported. See also --json-stream-data, | 3284 | + The only supported version is 2. See also --json-stream-data, |
| 3274 | --json-stream-prefix, and --decode-level. | 3285 | --json-stream-prefix, and --decode-level. |
| 3275 | 3286 | ||
| 3276 | - The output file will be qpdf JSON format at the given version. | ||
| 3277 | - ``version`` may be a specific version or ``latest`` (the default). | ||
| 3278 | - Version 1 is not supported. See also :qpdf:ref:`--json-stream-data` | ||
| 3279 | - and :qpdf:ref:`--json-stream-prefix`. The default decode level is | ||
| 3280 | - ``none``, but you can override it with :qpdf:ref:`--decode-level`. | ||
| 3281 | - If you want to look at the contents of streams easily as you would | ||
| 3282 | - in QDF mode (see :ref:`qdf`), you can use | ||
| 3283 | - ``--decode-level=generalized`` and ``--json-stream-data=file`` for | ||
| 3284 | - a convenient way to do that. | 3287 | + The output file, instead of being a PDF file, will be a JSON file |
| 3288 | + in qpdf JSON format at the given version. ``version`` may be a | ||
| 3289 | + specific version or ``latest`` (the default). The only supported | ||
| 3290 | + version is 2. See also :qpdf:ref:`--json-stream-data` and | ||
| 3291 | + :qpdf:ref:`--json-stream-prefix`. When this option is specified, | ||
| 3292 | + the default decode level for stream data is ``none``, but you can | ||
| 3293 | + override it with :qpdf:ref:`--decode-level`. If you want to look at | ||
| 3294 | + the contents of streams easily as you would in QDF mode (see | ||
| 3295 | + :ref:`qdf`), you can use ``--decode-level=generalized`` and | ||
| 3296 | + ``--json-stream-data=file`` for a convenient way to do that. | ||
| 3285 | 3297 | ||
| 3286 | .. qpdf:option:: --json-input | 3298 | .. qpdf:option:: --json-input |
| 3287 | 3299 | ||
| 3288 | .. help: input file is qpdf JSON | 3300 | .. help: input file is qpdf JSON |
| 3289 | 3301 | ||
| 3290 | Treat the input file as a JSON file in qpdf JSON format as | 3302 | Treat the input file as a JSON file in qpdf JSON format as |
| 3291 | - written by qpdf --json-output. See the "QPDF JSON Format" | 3303 | + written by qpdf --json-output. See the "qpdf JSON Format" |
| 3292 | section of the manual for information about how to use this | 3304 | section of the manual for information about how to use this |
| 3293 | option. | 3305 | option. |
| 3294 | 3306 | ||
| 3295 | Treat the input file as a JSON file in qpdf JSON format as written | 3307 | Treat the input file as a JSON file in qpdf JSON format as written |
| 3296 | by ``qpdf --json-output``. The input file must be complete and | 3308 | by ``qpdf --json-output``. The input file must be complete and |
| 3297 | include all stream data. For information about converting between | 3309 | include all stream data. For information about converting between |
| 3298 | - PDF and JSON, please see :ref:`qpdf-json`. | 3310 | + PDF and JSON, please see :ref:`json`. |
| 3299 | 3311 | ||
| 3300 | .. qpdf:option:: --update-from-json=qpdf-json-file | 3312 | .. qpdf:option:: --update-from-json=qpdf-json-file |
| 3301 | 3313 | ||
| 3302 | .. help: update a PDF from qpdf JSON | 3314 | .. help: update a PDF from qpdf JSON |
| 3303 | 3315 | ||
| 3304 | - Update a PDF file from a JSON file. Please see the "QPDF JSON | ||
| 3305 | - Format" section of the manual for information about how to use | ||
| 3306 | - this option. | 3316 | + Update a PDF file from a JSON file. Please see the "qpdf JSON" |
| 3317 | + chapter of the manual for information about how to use this | ||
| 3318 | + option. | ||
| 3307 | 3319 | ||
| 3308 | - This option updates a PDF file from a qpdf JSON file. For a | ||
| 3309 | - information about how to use this option, please see | ||
| 3310 | - :ref:`qpdf-json`. | 3320 | + This option updates a PDF file from the specified qpdf JSON file. |
| 3321 | + For a information about how to use this option, please see | ||
| 3322 | + :ref:`json`. | ||
| 3311 | 3323 | ||
| 3312 | .. _test-options: | 3324 | .. _test-options: |
| 3313 | 3325 | ||
| @@ -3420,7 +3432,7 @@ Related Options | @@ -3420,7 +3432,7 @@ Related Options | ||
| 3420 | 3432 | ||
| 3421 | This is used by qpdf's test suite to check consistency between the | 3433 | This is used by qpdf's test suite to check consistency between the |
| 3422 | output of ``qpdf --json`` and the output of ``qpdf --json-help``. | 3434 | output of ``qpdf --json`` and the output of ``qpdf --json-help``. |
| 3423 | - This option causes an extra copy of the generated json to appear in | 3435 | + This option causes an extra copy of the generated JSON to appear in |
| 3424 | memory and is therefore unsuitable for use with large files. This | 3436 | memory and is therefore unsuitable for use with large files. This |
| 3425 | is why it's also not on by default. | 3437 | is why it's also not on by default. |
| 3426 | 3438 |
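
The ``--json`` description in manual/cli.rst above says that code reading qpdf JSON output can tell which format version it has from the ``"version"`` key. A minimal sketch of that check follows. The function name and the sample text are hypothetical and only illustrate the idea; the sample is a heavily abbreviated stand-in, not real qpdf output.

```python
import json


def qpdf_json_version(text):
    """Return the integer stored under the top-level "version" key.

    Hypothetical helper for illustration; it is not part of qpdf.
    """
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    version = data.get("version")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(version, bool) or not isinstance(version, int):
        raise ValueError('no integer "version" key found')
    return version


# Abbreviated, hand-written stand-in for the output of `qpdf --json`.
sample = '{"version": 2, "objects": {}}'

version = qpdf_json_version(sample)
if version == 2:
    print("qpdf JSON version 2")
else:
    print(f"other qpdf JSON version: {version}")
```

Checking the version before touching any other key lets a consumer fail early on a format it was not written for, such as version 1 output with its ``"objectinfo"`` key.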
manual/design.rst
| @@ -242,7 +242,7 @@ the current file position. If the token is a not either a dictionary or | @@ -242,7 +242,7 @@ the current file position. If the token is a not either a dictionary or | ||
| 242 | array opener, an object is immediately constructed from the single token | 242 | array opener, an object is immediately constructed from the single token |
| 243 | and the parser returns. Otherwise, the parser iterates in a special mode | 243 | and the parser returns. Otherwise, the parser iterates in a special mode |
| 244 | in which it accumulates objects until it finds a balancing closer. | 244 | in which it accumulates objects until it finds a balancing closer. |
| 245 | -During this process, the "``R``" keyword is recognized and an indirect | 245 | +During this process, the ``R`` keyword is recognized and an indirect |
| 246 | ``QPDFObjectHandle`` may be constructed. | 246 | ``QPDFObjectHandle`` may be constructed. |
| 247 | 247 | ||
| 248 | The ``QPDF::resolve()`` method, which is used to resolve an indirect | 248 | The ``QPDF::resolve()`` method, which is used to resolve an indirect |
| @@ -280,15 +280,15 @@ file. | @@ -280,15 +280,15 @@ file. | ||
| 280 | it is looking before the last ``%%EOF``. After getting to ``trailer`` | 280 | it is looking before the last ``%%EOF``. After getting to ``trailer`` |
| 281 | keyword, it invokes the parser. | 281 | keyword, it invokes the parser. |
| 282 | 282 | ||
| 283 | -- The parser sees "``<<``", so it calls itself recursively in | 283 | +- The parser sees ``<<``, so it calls itself recursively in |
| 284 | dictionary creation mode. | 284 | dictionary creation mode. |
| 285 | 285 | ||
| 286 | - In dictionary creation mode, the parser keeps accumulating objects | 286 | - In dictionary creation mode, the parser keeps accumulating objects |
| 287 | - until it encounters "``>>``". Each object that is read is pushed onto | ||
| 288 | - a stack. If "``R``" is read, the last two objects on the stack are | 287 | + until it encounters ``>>``. Each object that is read is pushed onto |
| 288 | + a stack. If ``R`` is read, the last two objects on the stack are | ||
| 289 | inspected. If they are integers, they are popped off the stack and | 289 | inspected. If they are integers, they are popped off the stack and |
| 290 | their values are used to construct an indirect object handle which is | 290 | their values are used to construct an indirect object handle which is |
| 291 | - then pushed onto the stack. When "``>>``" is finally read, the stack | 291 | + then pushed onto the stack. When ``>>`` is finally read, the stack |
| 292 | is converted into a ``QPDF_Dictionary`` which is placed in a | 292 | is converted into a ``QPDF_Dictionary`` which is placed in a |
| 293 | ``QPDFObjectHandle`` and returned. | 293 | ``QPDFObjectHandle`` and returned. |
| 294 | 294 |
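
The dictionary creation mode described in manual/design.rst above can be sketched as a small stack machine. This is a simplified illustration in Python, not qpdf's C++ parser: it assumes the input is already split into whitespace-separated tokens, handles only integers, names, and ``R``, does not support nested dictionaries or arrays, and raises an error when ``R`` is not preceded by two integers (a choice made for this sketch, not a statement about qpdf's behavior). ``IndirectRef`` and ``parse_flat_dictionary`` are hypothetical names.

```python
import re
from typing import NamedTuple


class IndirectRef(NamedTuple):
    """Stand-in for an indirect object handle (object and generation)."""
    obj: int
    gen: int


def parse_flat_dictionary(tokens):
    """Parse the tokens that follow an opening '<<' up to the closing '>>'."""
    stack = []
    for token in tokens:
        if token == ">>":
            if len(stack) % 2 != 0:
                raise ValueError("dictionary has a key without a value")
            # Entries alternate key, value, key, value, ...
            return dict(zip(stack[0::2], stack[1::2]))
        if token == "R":
            # Inspect the last two objects on the stack; both must be integers.
            if len(stack) >= 2 and all(type(x) is int for x in stack[-2:]):
                gen = stack.pop()
                obj = stack.pop()
                stack.append(IndirectRef(obj, gen))
            else:
                raise ValueError("'R' is not preceded by two integers")
        elif re.fullmatch(r"[+-]?[0-9]+", token):
            stack.append(int(token))
        else:
            # Names such as /Root are kept as plain strings in this sketch.
            stack.append(token)
    raise ValueError("missing '>>'")


trailer = parse_flat_dictionary("/Root 1 0 R /Size 7 >>".split())
print(trailer)
```

Because ``1`` and ``0`` are pushed as ordinary integers first, the parser needs no lookahead: it only reinterprets them as an indirect reference once ``R`` arrives.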
manual/json.rst
| 1 | +.. cSpell:ignore moddifyannotations | ||
| 2 | +.. cSpell:ignore feff | ||
| 3 | + | ||
| 1 | .. _json: | 4 | .. _json: |
| 2 | 5 | ||
| 3 | -QPDF JSON | 6 | +qpdf JSON |
| 4 | ========= | 7 | ========= |
| 5 | 8 | ||
| 6 | .. _json-overview: | 9 | .. _json-overview: |
| @@ -8,27 +11,540 @@ QPDF JSON | @@ -8,27 +11,540 @@ QPDF JSON | ||
| 8 | Overview | 11 | Overview |
| 9 | -------- | 12 | -------- |
| 10 | 13 | ||
| 11 | -Beginning with qpdf version 8.3.0, the :command:`qpdf` | ||
| 12 | -command-line program can produce a JSON representation of the | ||
| 13 | -non-content data in a PDF file. It includes a dump in JSON format of all | ||
| 14 | -objects in the PDF file excluding the content of streams. This JSON | ||
| 15 | -representation makes it very easy to look in detail at the structure of | ||
| 16 | -a given PDF file, and it also provides a great way to work with PDF | ||
| 17 | -files programmatically from the command-line in languages that can't | ||
| 18 | -call or link with the qpdf library directly. Note that stream data can | ||
| 19 | -be extracted from PDF files using other qpdf command-line options. | 14 | +Beginning with qpdf version 11.0.0, the qpdf library and command-line |
| 15 | +program can produce a JSON representation of the in a PDF file. qpdf | ||
| 16 | +version 11 introduces JSON format version 2. Prior to qpdf 11, | ||
| 17 | +versions 8.3.0 onward had a more limited JSON representation | ||
| 18 | +accessible only from the command-line. For details on what changed, | ||
| 19 | +see :ref:`json-v2-changes`. The rest of this chapter documents qpdf | ||
| 20 | +JSON version 2. | ||
| 21 | + | ||
| 22 | +Please note: this chapter discusses *qpdf JSON format*, which | ||
| 23 | +represents the contents of a PDF file. This is distinct from the | ||
| 24 | +*QPDFJob JSON format* which provides a higher-level interface | ||
| 25 | +interacting with qpdf the way the command-line tool does. For | ||
| 26 | +information about that, see :ref:`qpdf-job`. | ||
| 27 | + | ||
| 28 | +The qpdf JSON format is specific to qpdf. There are two ways to use | ||
| 29 | +qpdf JSON: | ||
| 30 | + | ||
| 31 | +- The :qpdf:ref:`--json` command-line flag causes creation of a JSON | ||
| 32 | + representation of all the objects in a PDF file, excluding stream | ||
| 33 | + data. This includes an unambiguous representation of the PDF object | ||
| 34 | + structure and also provides JSON-formatted summaries of other | ||
| 35 | + information about the file. This functionality is built into | ||
| 36 | + ``QPDFJob`` and can be accessed from the ``qpdf`` command-line tool | ||
| 37 | + or from the ``QPDFJob`` C or C++ API. | ||
| 38 | + | ||
| 39 | +- qpdf can create a JSON file that completely represents a PDF file. | ||
| 40 | + You can think of this as using JSON as an *alternative syntax* for | ||
| 41 | + representing a PDF file. Using qpdf JSON, it is possible to | ||
| 42 | + convert a PDF file to JSON, manipulate the structure or contents of | ||
| 43 | + the objects at a low level, and convert the results back to a PDF | ||
| 44 | + file. This functionality can be accessed from the command-line with | ||
| 45 | + the :qpdf:ref:`--json-output`, :qpdf:ref:`--json-input`, and | ||
| 46 | + :qpdf:ref:`--update-from-json` flags, or from the API using the | ||
| 47 | + ``QPDF::writeJSON``, ``QPDF::createFromJSON``, and | ||
| 48 | + ``QPDF::updateFromJSON`` methods. | ||
| 49 | + | ||
| 50 | +.. _json-terminology: | ||
| 51 | + | ||
| 52 | +JSON Terminology | ||
| 53 | +---------------- | ||
| 54 | + | ||
| 55 | +Notes about terminology: | ||
| 56 | + | ||
| 57 | +- In JavaScript and JSON, that thing that has keys and values is | ||
| 58 | + typically called an *object*. | ||
| 59 | + | ||
| 60 | +- In PDF, that thing that has keys and values is typically called a | ||
| 61 | + *dictionary*. An *object* is a PDF object such as integer, real, | ||
| 62 | + boolean, null, string, array, dictionary, or stream. | ||
| 63 | + | ||
| 64 | +- Some languages that use JSON call an *object* a *dictionary*, a | ||
| 65 | + *map*, or a *hash*. | ||
| 66 | + | ||
| 67 | +- Sometimes, it's called an *object* if it has fixed keys and a | ||
| 68 | + *dictionary* if it has variable keys. | ||
| 69 | + | ||
| 70 | +This manual is not entirely consistent about its use of *dictionary* | ||
| 71 | +vs. *object* because sometimes one term or another is clearer in | ||
| 72 | +context. Just be aware of the ambiguity when reading the manual. We | ||
| 73 | +frequently use the term *dictionary* to refer to a JSON object because | ||
| 74 | +of the consistency with PDF terminology. | ||
| 75 | + | ||
| 76 | +.. _what-qpdf-json-is-not: | ||
| 77 | + | ||
| 78 | +What qpdf JSON is not | ||
| 79 | +--------------------- | ||
| 80 | + | ||
| 81 | +Please note that qpdf JSON offers a convenient syntax for manipulating | ||
| 82 | +PDF files at a low level using JSON syntax. JSON syntax is much easier | ||
| 83 | +to work with than native PDF syntax, and there are good JSON libraries | ||
| 84 | +in virtually every commonly used programming language. Working with | ||
| 85 | +PDF objects in JSON removes the need to worry about stream lengths, | ||
| 86 | +cross reference tables, and PDF-specific representations of Unicode or | ||
| 87 | +binary strings that appear outside of content streams. It does not | ||
| 88 | +eliminate the need to understand the semantic structure of PDF files. | ||
| 89 | +Working with qpdf JSON still requires familiarity with the PDF | ||
| 90 | +specification. | ||
| 91 | + | ||
| 92 | +In particular, qpdf JSON *does not* provide any of the following | ||
| 93 | +capabilities: | ||
| 94 | + | ||
| 95 | +- Text extraction. While you could use qpdf JSON syntax to navigate to | ||
| 96 | + a page's content streams and font structures, text within pages is | ||
| 97 | + still encoded using PDF syntax within content streams, and there is | ||
| 98 | + no assistance for text extraction. | ||
| 99 | + | ||
| 100 | +- Reflowing text, document structure. qpdf JSON does not add any new | ||
| 101 | + information or insight into the content of PDF files. If you have a | ||
| 102 | + PDF file that lacks any structural information, qpdf JSON won't help | ||
| 103 | + you solve any of those problems. | ||
| 104 | + | ||
| 105 | +This is what we mean when we say that JSON provides an *alternative | ||
| 106 | +syntax* for working with PDF data. Semantically, it is identical to | ||
| 107 | +native PDF. | ||
| 20 | 108 | ||
| 21 | .. _qpdf-json: | 109 | .. _qpdf-json: |
| 22 | 110 | ||
| 23 | -QPDF JSON Format | 111 | +qpdf JSON Format |
| 24 | ---------------- | 112 | ---------------- |
| 25 | 113 | ||
| 26 | -XXX Write this. | 114 | +This section describes how qpdf represents PDF objects in JSON format. |
| 115 | +It also describes how to work with qpdf JSON to create or | ||
| 116 | +modify PDF files. | ||
| 117 | + | ||
| 118 | +.. _json.objects: | ||
| 119 | + | ||
| 120 | +qpdf JSON Object Representation | ||
| 121 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 122 | + | ||
| 123 | +This section describes the representation of PDF objects in qpdf JSON | ||
| 124 | +version 2. PDF objects are represented within the ``"objects"`` | ||
| 125 | +dictionary of a qpdf JSON file. This is true both for PDF serialized | ||
| 126 | +to JSON (:qpdf:ref:`--json-output`, ``QPDF::writeJSON``) and for objects as | ||
| 127 | +they appear in the output of ``qpdf`` with the :qpdf:ref:`--json` | ||
| 128 | +option. | ||
| 129 | + | ||
| 130 | +Each key in the ``"objects"`` dictionary is either ``"trailer"`` or a | ||
| 131 | +string of the form ``"obj:O G R"`` where ``O`` and ``G`` are the | ||
| 132 | +object and generation numbers and ``R`` is the literal string ``R``. | ||
| 133 | +This is the PDF syntax for the indirect object reference prepended by | ||
| 134 | +``obj:``. The value, representing the object itself, is a JSON object | ||
| 135 | +whose structure is described below. | ||
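
The key format described above can be handled with a few lines of code. The following is a minimal Python sketch, not part of qpdf; the helper names ``parse_object_key`` and ``make_object_key`` are hypothetical and chosen only for illustration.

```python
import re

# Matches keys such as "obj:12 0 R". The "trailer" key is handled separately.
_OBJ_KEY = re.compile(r"obj:(\d+) (\d+) R")


def parse_object_key(key):
    """Return (object_number, generation) for an "obj:O G R" key.

    Returns None for the special "trailer" key and raises ValueError
    for anything else.
    """
    if key == "trailer":
        return None
    match = _OBJ_KEY.fullmatch(key)
    if match is None:
        raise ValueError(f"not a qpdf JSON object key: {key!r}")
    return int(match.group(1)), int(match.group(2))


def make_object_key(obj, gen=0):
    """Build the "objects" dictionary key for an object and generation number."""
    return f"obj:{obj} {gen} R"
```

Stripping the ``obj:`` prefix from a key yields the same ``O G R`` string that is used for indirect references inside object values.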
| 136 | + | ||
| 137 | +Top-level Stream Objects | ||
| 138 | + Stream objects are represented as a JSON object with the single key | ||
| 139 | + ``"stream"``. The stream object has a key called ``"dict"`` whose | ||
| 140 | + value is the stream dictionary as an object value (described below) | ||
| 141 | + with the ``"/Length"`` key omitted. Other keys are determined by the | ||
| 142 | + value for json stream data (:qpdf:ref:`--json-stream-data`, or a | ||
| 143 | + parameter of type ``qpdf_json_stream_data_e``) as follows: | ||
| 144 | + | ||
| 145 | + - ``none``: stream data is not represented; no other keys are | ||
| 146 | + present | ||
| 147 | + | ||
| 148 | + - ``inline``: the stream data appears as a base64-encoded string as | ||
| 149 | + the value of the ``"data"`` key | ||
| 150 | + | ||
| 151 | + - ``file``: the stream data is written to a file, and the path to | ||
| 152 | + the file is stored in the ``"datafile"`` key. A relative path is | ||
| 153 | + interpreted as relative to the current directory when qpdf is | ||
| 154 | + invoked. | ||
| 155 | + | ||
| 156 | + Keys other than ``"dict"``, ``"data"``, and ``"datafile"`` are | ||
| 157 | + ignored. This is primarily for future compatibility in case a newer | ||
| 158 | + version of qpdf includes additional information. | ||
| 159 | + | ||
| 160 | + As with the native PDF representation, the stream data must be | ||
| 161 | + consistent with whatever filters and decode parameters are specified | ||
| 162 | + in the stream dictionary. | ||
| 163 | + | ||
| 164 | +Top-level Non-stream Objects | ||
| 165 | + Non-stream objects are represented as a dictionary with the single | ||
| 166 | + key ``"value"``. Other keys are ignored for future compatibility. | ||
| 167 | + The value's structure is described in "Object Values" below. | ||
| 168 | + | ||
| 169 | + Note: in files that use object streams, the trailer "dictionary" is | ||
| 170 | + actually a stream, but in the JSON representation, the value of the | ||
| 171 | + ``"trailer"`` key is always written as a dictionary (with a | ||
| 172 | + ``"value"`` key like other non-stream objects). There will also be | ||
| 173 | + a stream object whose key is the object ID of the cross-reference | ||
| 174 | + stream, even though this stream will generally be unreferenced. This | ||
| 175 | + makes it possible to assume ``"trailer"`` points to a dictionary | ||
| 176 | + without having to consider whether the file uses object streams or | ||
| 177 | + not. It is also consistent with how ``QPDF::getTrailer`` behaves in | ||
| 178 | + the C++ API. | ||
| 179 | + | ||
| 180 | +Object Values | ||
| 181 | + Within ``"value"`` or ``"stream"."dict"``, PDF objects are | ||
| 182 | + represented as follows: | ||
| 183 | + | ||
| 184 | + - Objects of type Boolean or null are represented as JSON objects of | ||
| 185 | + the same type. | ||
| 186 | + | ||
| 187 | + - Objects that are numeric are represented as numeric in the JSON | ||
| 188 | + without regard to precision. Internally, qpdf stores numeric | ||
| 189 | + values as strings, so qpdf will preserve arbitrary precision | ||
| 190 | + numerical values when reading and writing JSON. It is likely that | ||
| 191 | + other JSON readers and writers will have implementation-dependent | ||
| 192 | + ways of handling numerical values that are out of range. | ||
| 193 | + | ||
| 194 | + - Name objects are represented as JSON strings that start with ``/`` | ||
| 195 | + and are followed by the PDF name in canonical form with all PDF | ||
| 196 | + syntax resolved. For example, the name whose canonical form (per | ||
| 197 | + the PDF specification) is ``text/plain`` would be represented in | ||
| 198 | + JSON as ``"/text/plain"`` and in PDF as ``/text#2fplain``. | ||
| 199 | + | ||
| 200 | + - Indirect object references are represented as JSON strings that | ||
| 201 | + look like a PDF indirect object reference and have the form ``"O G | ||
| 202 | + R"`` where ``O`` and ``G`` are the object and generation numbers | ||
| 203 | + and ``R`` is the literal string ``R``. For example, ``"3 0 R"`` | ||
| 204 | + would represent a reference to the object with object ID 3 and | ||
| 205 | + generation 0. | ||
| 206 | + | ||
| 207 | + - PDF strings are represented as JSON strings in one of two ways: | ||
| 208 | + | ||
| 209 | + - ``"u:utf8-encoded-string"``: this format is used when the PDF | ||
| 210 | + string can be unambiguously represented as a Unicode string and | ||
| 211 | + contains no unprintable characters. This is the case whether the | ||
| 212 | + input string is encoded as UTF-16, UTF-8 (as allowed by PDF | ||
| 213 | + 2.0), or PDF doc encoding. Strings are only represented this way | ||
| 214 | + if they can be encoded without loss of information. | ||
| 215 | + | ||
| 216 | + - ``"b:hex-string"``: this format is used to represent any binary | ||
| 217 | + string value that can't be represented as a Unicode string. | ||
| 218 | + ``hex-string`` must have an even number of characters that range | ||
| 219 | + from ``a`` through ``f``, ``A`` through ``F``, or ``0`` through | ||
| 220 | + ``9``. | ||
| 221 | + | ||
| 222 | + qpdf writes empty strings as ``"u:"``, but both ``"b:"`` and | ||
| 223 | + ``"u:"`` are valid representations of the empty string. | ||
| 224 | + | ||
| 225 | + There is full support for UTF-16 surrogate pairs. Binary strings | ||
| 226 | + encoded with ``"b:..."`` are the internal PDF representations. | ||
| 227 | + As such, the following are equivalent: | ||
| 228 | + | ||
| 229 | + - ``"u:\ud83e\udd54"`` -- representation of U+1F954 as a surrogate | ||
| 230 | + pair in JSON syntax | ||
| 231 | + | ||
| 232 | + - ``"b:FEFFD83EDD54"`` -- representation of U+1F954 as the bytes | ||
| 233 | + of a UTF-16 string in PDF syntax with the leading ``FEFF`` | ||
| 234 | + indicating UTF-16 | ||
| 235 | + | ||
| 236 | + - ``"b:efbbbff09fa594"`` -- representation of U+1F954 as the | ||
| 237 | + bytes of a UTF-8 string in PDF syntax (as allowed by PDF 2.0) | ||
| 238 | + with the leading ``EF``, ``BB``, ``BF`` sequence (which is just | ||
| 239 | + UTF-8 encoding of ``FEFF``). | ||
| 240 | + | ||
| 241 | + - A JSON string whose contents are ``u:`` followed by the UTF-8 | ||
| 242 | + representation of U+1F954. This is the potato emoji. | ||
| 243 | + Unfortunately, I am not able to render it in the PDF version | ||
| 244 | + of this manual. | ||
| 245 | + | ||
| 246 | + - PDF arrays are represented as JSON arrays of objects as described | ||
| 247 | + above | ||
| 248 | + | ||
| 249 | + - PDF dictionaries are represented as JSON objects whose keys are | ||
| 250 | + the string representations of names and whose values are | ||
| 251 | + representations of PDF objects. | ||
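
Because every PDF string carries a ``u:`` or ``b:`` prefix, a JSON string inside an object value can be classified without ambiguity. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, and ``binary_as_text`` only recognizes the two byte-order markers mentioned above (it does not attempt PDF doc encoding).

```python
import re

_REF = re.compile(r"\d+ \d+ R")


def classify(value):
    """Name the PDF type that a JSON string inside an object value stands for."""
    if value.startswith(("u:", "b:")):
        return "string"
    if value.startswith("/"):
        return "name"
    if _REF.fullmatch(value):
        return "reference"
    raise ValueError(f"unrecognized string value: {value!r}")


def decode_pdf_string(value):
    """Return ("unicode", str) for "u:..." or ("binary", bytes) for "b:..."."""
    if value.startswith("u:"):
        return "unicode", value[2:]
    if value.startswith("b:"):
        # bytes.fromhex accepts upper- and lower-case hex digits.
        return "binary", bytes.fromhex(value[2:])
    raise ValueError(f"not a PDF string value: {value!r}")


def binary_as_text(raw):
    """Decode raw PDF string bytes that start with a UTF-16 or UTF-8 marker.

    Returns None when no marker is present; this sketch does not handle
    PDF doc encoding or genuinely binary data.
    """
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    if raw.startswith(b"\xef\xbb\xbf"):
        return raw[3:].decode("utf-8")
    return None
```

Applied to the equivalent forms listed above, each of the ``b:`` values decodes to the same single code point, U+1F954, that the ``u:`` form carries.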
| 252 | + | ||
| 253 | +.. _json.output: | ||
| 254 | + | ||
| 255 | +qpdf JSON Output | ||
| 256 | +~~~~~~~~~~~~~~~~ | ||
| 257 | + | ||
| 258 | +The format of the JSON written by qpdf's :qpdf:ref:`--json-output` | ||
| 259 | +flag or the ``QPDF::writeJSON`` API call is a JSON object consisting | ||
| 260 | +of a single key: ``"qpdf-v2"``. Any other top-level keys are ignored. | ||
| 261 | +While unknown keys in other places are ignored for future | ||
| 262 | +compatibility, in this case, ignoring other top-level keys is an | ||
| 263 | +explicit decision to allow users to include other keys for their own | ||
| 264 | +use. No new top-level keys will be added in JSON version 2. | ||
| 265 | + | ||
| 266 | +The ``"qpdf-v2"`` key points to a JSON object with the following keys: | ||
| 267 | + | ||
| 268 | +- ``"pdfversion"`` -- a string containing PDF version as indicated in | ||
| 269 | + the PDF header (e.g. ``"1.7"``, ``"2.0"``) | ||
| 270 | + | ||
| 271 | +- ``"maxobjectid"`` -- a number indicating the object ID of the | ||
| 272 | + highest numbered object in the file. This is provided to make it | ||
| 273 | + easier for software that wants to add new objects to the file as you | ||
| 274 | + can safely start with one above that number when creating new | ||
| 275 | + objects. Note that the value of ``"maxobjectid"`` may be higher than | ||
| 276 | + the actual maximum object that appears in the input PDF since it | ||
| 277 | + takes into consideration any dangling indirect object references | ||
| 278 | + from the original file. This prevents you from unwittingly creating | ||
| 279 | + an object that doesn't exist but that is referenced, which may have | ||
| 280 | + unintended side effects. (The PDF specification explicitly allows | ||
| 281 | + dangling references and says to treat them as nulls. This can happen | ||
| 282 | + if objects are removed from a PDF file.) | ||
| 283 | + | ||
| 284 | +- ``"objects"`` -- the actual PDF objects as described in | ||
| 285 | + :ref:`json.objects`. | ||
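
As an illustration of how ``"maxobjectid"`` might be used, here is a minimal Python sketch that adds a non-stream object to an already parsed qpdf JSON document. The function name ``add_object`` is hypothetical, and the sketch assumes generation 0 for the new object.

```python
def add_object(doc, value):
    """Add a non-stream object to a parsed qpdf JSON v2 document.

    Returns the indirect reference string (for example "6 0 R") that
    other objects can use to point at the new object.
    """
    qpdf = doc["qpdf-v2"]
    new_id = qpdf["maxobjectid"] + 1
    key = f"obj:{new_id} 0 R"
    if key in qpdf["objects"]:
        raise ValueError(f"{key} already exists")
    qpdf["objects"][key] = {"value": value}
    # Updating the counter keeps repeated calls from colliding; qpdf itself
    # ignores "maxobjectid" when the file is read back in.
    qpdf["maxobjectid"] = new_id
    return f"{new_id} 0 R"
```

Starting from ``"maxobjectid"`` rather than from the highest key in ``"objects"`` avoids accidentally giving a definition to a dangling reference.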
| 286 | + | ||
| 287 | +Note that writing JSON output is done by ``QPDF``, not ``QPDFWriter``. | ||
| 288 | +As such, none of the things ``QPDFWriter`` does apply. This includes | ||
| 289 | +recompression of streams, renumbering of objects, anything to do with | ||
| 290 | +object streams (which are not represented by qpdf JSON at all since | ||
| 291 | +they are PDF syntax, not semantics), encryption, decryption, | ||
| 292 | +linearization, QDF mode, etc. | ||
| 293 | + | ||
| 294 | +.. _json.example: | ||
| 295 | + | ||
| 296 | +qpdf JSON Example | ||
| 297 | +~~~~~~~~~~~~~~~~~ | ||
| 298 | + | ||
| 299 | +The JSON below shows an example of a simple PDF file represented in | ||
| 300 | +qpdf JSON format. | ||
| 301 | + | ||
| 302 | +.. code-block:: json | ||
| 303 | + | ||
| 304 | + { | ||
| 305 | + "qpdf-v2": { | ||
| 306 | + "pdfversion": "1.3", | ||
| 307 | + "maxobjectid": 5, | ||
| 308 | + "objects": { | ||
| 309 | + "obj:1 0 R": { | ||
| 310 | + "value": { | ||
| 311 | + "/Pages": "2 0 R", | ||
| 312 | + "/Type": "/Catalog" | ||
| 313 | + } | ||
| 314 | + }, | ||
| 315 | + "obj:2 0 R": { | ||
| 316 | + "value": { | ||
| 317 | + "/Count": 1, | ||
| 318 | + "/Kids": [ "3 0 R" ], | ||
| 319 | + "/Type": "/Pages" | ||
| 320 | + } | ||
| 321 | + }, | ||
| 322 | + "obj:3 0 R": { | ||
| 323 | + "value": { | ||
| 324 | + "/Contents": "4 0 R", | ||
| 325 | + "/MediaBox": [ 0, 0, 612, 792 ], | ||
| 326 | + "/Parent": "2 0 R", | ||
| 327 | + "/Resources": { | ||
| 328 | + "/Font": { | ||
| 329 | + "/F1": "5 0 R" | ||
| 330 | + } | ||
| 331 | + }, | ||
| 332 | + "/Type": "/Page" | ||
| 333 | + } | ||
| 334 | + }, | ||
| 335 | + "obj:4 0 R": { | ||
| 336 | + "stream": { | ||
| 337 | + "data": "eJxzCuFSUNB3M1QwMlEISQOyzY2AyEAhJAXI1gjIL0ksyddUCMnicg3hAgDLAQnI", | ||
| 338 | + "dict": { | ||
| 339 | + "/Filter": "/FlateDecode" | ||
| 340 | + } | ||
| 341 | + } | ||
| 342 | + }, | ||
| 343 | + "obj:5 0 R": { | ||
| 344 | + "value": { | ||
| 345 | + "/BaseFont": "/Helvetica", | ||
| 346 | + "/Encoding": "/WinAnsiEncoding", | ||
| 347 | + "/Subtype": "/Type1", | ||
| 348 | + "/Type": "/Font" | ||
| 349 | + } | ||
| 350 | + }, | ||
| 351 | + "trailer": { | ||
| 352 | + "value": { | ||
| 353 | + "/ID": [ | ||
| 354 | + "b:98b5a26966fba4d3a769b715b2558da6", | ||
| 355 | + "b:98b5a26966fba4d3a769b715b2558da6" | ||
| 356 | + ], | ||
| 357 | + "/Root": "1 0 R", | ||
| 358 | + "/Size": 6 | ||
| 359 | + } | ||
| 360 | + } | ||
| 361 | + } | ||
| 362 | + } | ||
| 363 | + } | ||
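
Object 4 in the example is a stream with inline data. The Python sketch below shows one way such an object could be built and read back. It is an illustration only: the function names are hypothetical, and it assumes inline ``"data"`` and at most a single ``/FlateDecode`` filter with no decode parameters.

```python
import base64
import zlib


def make_flate_stream(content):
    """Build a stream object whose inline data is Flate-compressed."""
    data = base64.b64encode(zlib.compress(content)).decode("ascii")
    return {"stream": {"data": data, "dict": {"/Filter": "/FlateDecode"}}}


def read_stream_data(obj):
    """Return the decoded bytes of a stream object with inline data.

    The base64 layer is always removed. The Flate layer is removed only
    when the stream dictionary names /FlateDecode as its single filter.
    """
    stream = obj["stream"]
    raw = base64.b64decode(stream["data"])
    if stream["dict"].get("/Filter") == "/FlateDecode":
        return zlib.decompress(raw)
    return raw
```

The ``"data"`` value holds the stream bytes as they are stored in the PDF, which is why the data has to stay consistent with the filters named in the stream dictionary.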
| 364 | + | ||
| 365 | +.. _json.input: | ||
| 366 | + | ||
| 367 | +qpdf JSON Input | ||
| 368 | +~~~~~~~~~~~~~~~ | ||
| 369 | + | ||
| 370 | +Output in the JSON output format described in :ref:`json.output` can | ||
| 371 | +be used in two different ways: | ||
| 372 | + | ||
| 373 | +- By using the :qpdf:ref:`--json-input` flag or calling | ||
| 374 | + ``QPDF::createFromJSON`` in place of ``QPDF::processFile``, a qpdf | ||
| 375 | + JSON file can be used in place of a PDF file as the input to qpdf. | ||
| 376 | + | ||
| 377 | +- By using the :qpdf:ref:`--update-from-json` flag or calling | ||
| 378 | + ``QPDF::updateFromJSON`` on an initialized ``QPDF`` object, a qpdf | ||
| 379 | + JSON file can be used to apply changes to an existing ``QPDF`` | ||
| 380 | + object. That ``QPDF`` object can have come from any source including | ||
| 381 | + a PDF file, a qpdf JSON file, or the result of any other process | ||
| 382 | + that results in a valid, initialized ``QPDF`` object. | ||
| 383 | + | ||
| 384 | +Here are some important things to know about qpdf JSON input. | ||
| 385 | + | ||
| 386 | +- When a qpdf JSON file is used as the primary input file, it must be | ||
| 387 | + complete. This means: | ||
| 388 | + | ||
| 389 | + - A PDF version number must be specified with the ``"pdfversion"`` | ||
| 390 | + key | ||
| 391 | + | ||
| 392 | + - Stream data must be present for all streams | ||
| 393 | + | ||
| 394 | + - The trailer dictionary must be present, though only the | ||
| 395 | + ``"/Root"`` key is required. | ||
| 396 | + | ||
| 397 | +- Certain fields from the input are ignored whether creating or | ||
| 398 | + updating from a JSON file: | ||
| 399 | + | ||
| 400 | + - ``"maxobjectid"`` is ignored, so it is not necessary to update it | ||
| 401 | + when adding new objects. | ||
| 402 | + | ||
| 403 | + - ``"/Length"`` is ignored in all stream dictionaries. qpdf doesn't | ||
| 404 | + put it there when it creates JSON output, and it is not necessary | ||
| 405 | + to add it. | ||
| 406 | + | ||
| 407 | + - ``"/Size"`` is ignored if it appears in a trailer dictionary as | ||
| 408 | + that is always recomputed by ``QPDFWriter``. | ||
| 409 | + | ||
| 410 | + - Unknown keys at the top level of the file, within ``objects``, | ||
| 411 | + at the top level of each individual object (inside the object that | ||
| 412 | + has the ``"value"`` or ``"stream"`` key) and directly within | ||
| 413 | + ``"stream"`` are ignored for future compatibility. You should | ||
| 414 | + avoid putting your own values in those places if you wish to avoid | ||
| 415 | + risking that your JSON files will not work in future versions of | ||
| 416 | + qpdf. The exception to this advice is at the top level of the | ||
| 417 | + overall file where it is explicitly supported for you to add your | ||
| 418 | + own keys. For example, you could add your own metadata at the top | ||
| 419 | + level, and qpdf will ignore it. Note that extra top-level keys are | ||
| 420 | + not preserved when qpdf reads your JSON file. | ||
| 421 | + | ||
| 422 | +- When qpdf reads a PDF file, the internal object numbers are always | ||
| 423 | + preserved. However, when qpdf writes a file using ``QPDFWriter``, | ||
| 424 | + ``QPDFWriter`` does its own numbering and, in general, does not | ||
| 425 | + preserve input object numbers. That means that a qpdf JSON file that | ||
| 426 | + is used to update an existing PDF must have object numbers that | ||
| 427 | + match the input file it is modifying. In practical terms, this means | ||
| 428 | + that you can't use a JSON file created from one PDF file to modify | ||
| 429 | + the *output of running qpdf on that file*. | ||
| 430 | + | ||
| 431 | + To put this more concretely, the following is valid: | ||
| 432 | + | ||
| 433 | + :: | ||
| 434 | + | ||
| 435 | + qpdf --json-output in.pdf pdf.json | ||
| 436 | + # edit pdf.json | ||
| 437 | + qpdf in.pdf out.pdf --update-from-json=pdf.json | ||
| 438 | + | ||
| 439 | + The following will not produce predictable results because | ||
| 440 | + ``out.pdf`` won't have the same object numbers as ``pdf.json`` and | ||
| 441 | + ``in.pdf``. | ||
| 442 | + | ||
| 443 | + :: | ||
| 444 | + | ||
| 445 | + qpdf --json-output in.pdf pdf.json | ||
| 446 | + # edit pdf.json | ||
| 447 | + qpdf in.pdf out.pdf --update-from-json=pdf.json | ||
| 448 | + # edit pdf.json again | ||
| 449 | + # Don't do this | ||
| 450 | + qpdf out.pdf out2.pdf --update-from-json=pdf.json | ||
| 451 | + | ||
| 452 | +- When updating from a JSON file (:qpdf:ref:`--update-from-json`, | ||
| 453 | + ``QPDF::updateFromJSON``), existing objects are updated in place. | ||
| 454 | + This has the following implications: | ||
| 455 | + | ||
| 456 | + - You may omit both ``"data"`` and ``"datafile"`` if the object you | ||
| 457 | + are updating is already a stream. In that case the original stream | ||
| 458 | + data is preserved. You must always provide a stream dictionary, | ||
| 459 | + but it may be empty. Note that an empty stream dictionary will | ||
| 460 | + clear the old dictionary. There is no way to indicate that an old | ||
| 461 | + stream dictionary should be left alone, so if your intention is to | ||
| 462 | + replace the stream data and preserve the dictionary, the | ||
| 463 | + original dictionary must appear in the JSON file. | ||
| 464 | + | ||
| 465 | + - You can change one object type to another object type including | ||
| 466 | + replacing a stream with a non-stream or a non-stream with a | ||
| 467 | + stream. If you replace a non-stream with a stream, you must | ||
| 468 | + provide data for the stream. | ||
| 469 | + | ||
| 470 | + - Objects that you do not wish to modify can be omitted from the | ||
| 471 | + JSON. That includes the trailer. That means you can use a qpdf | ||
| 472 | + JSON file that was written using | ||
| 473 | + :qpdf:ref:`--json-object` so that it includes only the objects you | ||
| 474 | + intend to modify. | ||
| 475 | + | ||
| 476 | + - You can omit the ``"pdfversion"`` key. The input PDF version will | ||
| 477 | + be preserved. | ||
| 478 | + | ||
| 479 | +.. _json.workflow-cli: | ||
| 480 | + | ||
| 481 | +qpdf JSON Workflow: CLI | ||
| 482 | +~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 483 | + | ||
| 484 | +This section includes a few examples of using qpdf JSON. | ||
| 485 | + | ||
| 486 | +- Convert a PDF file to JSON format, edit the JSON, and convert back | ||
| 487 | + to PDF. This is an alternative to using QDF mode (see :ref:`qdf`) to | ||
| 488 | + modify PDF files in a text editor. Each method has its own | ||
| 489 | + advantages and disadvantages. | ||
| 490 | + | ||
| 491 | + :: | ||
| 492 | + | ||
| 493 | + qpdf --json-output in.pdf pdf.json | ||
| 494 | + # edit pdf.json | ||
| 495 | + qpdf --json-input pdf.json out.pdf | ||
| 496 | + | ||
| 497 | +- Extract only a specific object into a JSON file, modify the object | ||
| 498 | + in JSON, and use the modified object to update the original PDF. In | ||
| 499 | + this case, we're editing object 4, whatever that may happen to be. | ||
| 500 | + You would have to know through some other means which object you | ||
| 501 | + wanted to edit, such as by looking at other JSON output or using a | ||
| 502 | + tool (possibly but not necessarily qpdf) to identify the object. | ||
| 503 | + | ||
| 504 | + :: | ||
| 505 | + | ||
| 506 | + qpdf --json-output in.pdf pdf.json --json-object=4,0 | ||
| 507 | + # edit pdf.json | ||
| 508 | + qpdf in.pdf --update-from-json=pdf.json out.pdf | ||
| 509 | + | ||
| 510 | + Rather than using :qpdf:ref:`--json-object` as in the above example, | ||
| 511 | + you could edit the JSON file to remove the objects you didn't need. | ||
| 512 | + You could also just leave them there, though the update process | ||
| 513 | + would be slower. | ||
| 514 | + | ||
| 515 | + You could also add new objects to a file by adding them to | ||
| 516 | + ``pdf.json``. Just be sure the object number doesn't conflict with | ||
| 517 | + an existing object. The ``"maxobjectid"`` field in the original | ||
| 518 | + output can help with this. You don't have to update it if you add | ||
| 519 | + objects as it is ignored when the file is read back in. | ||
| 520 | + | ||
| 521 | +- Use :qpdf:ref:`--json-input` and :qpdf:ref:`--json-output` together | ||
| 522 | + to demonstrate preservation of object numbers. In this example, | ||
| 523 | + ``a.json`` and ``b.json`` will have the same objects and object | ||
| 524 | + numbers. The files may not be identical since strings may be | ||
| 525 | + normalized, fields may appear in a different order, etc. However | ||
| 526 | + ``b.json`` and ``c.json`` are probably identical. | ||
| 527 | + | ||
| 528 | + :: | ||
| 529 | + | ||
| 530 | + qpdf --json-output in.pdf a.json | ||
| 531 | + qpdf --json-input --json-output a.json b.json | ||
| 532 | + qpdf --json-input --json-output b.json c.json | ||
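
The second workflow above suggests trimming a full JSON file down to the objects being changed. That step could be scripted as in the Python sketch below. The function name ``subset_for_update`` is hypothetical, and the sketch only prepares the JSON; it does not invoke qpdf.

```python
def subset_for_update(doc, object_keys):
    """Return a copy of a parsed qpdf JSON document limited to some objects.

    Other keys under "qpdf-v2" (such as "pdfversion") are carried over
    unchanged. The input document is not modified.
    """
    qpdf = doc["qpdf-v2"]
    trimmed = {k: v for k, v in qpdf.items() if k != "objects"}
    trimmed["objects"] = {k: qpdf["objects"][k] for k in object_keys}
    return {"qpdf-v2": trimmed}
```

After editing the selected objects, the result can be written with ``json.dump`` and passed to ``--update-from-json`` together with the same PDF the JSON was created from, so that object numbers match.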
| 533 | + | ||
| 534 | + | ||
| 535 | +.. _json.workflow-api: | ||
| 536 | + | ||
| 537 | +qpdf JSON Workflow: API | ||
| 538 | +~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 539 | + | ||
| 540 | +Everything that can be done using the qpdf CLI can be done using the | ||
| 541 | +C++ API. See comments in :file:`QPDF.hh` for ``writeJSON``, | ||
| 542 | +``createFromJSON``, and ``updateFromJSON`` for details. | ||
| 27 | 543 | ||
| 28 | .. _json-guarantees: | 544 | .. _json-guarantees: |
| 29 | 545 | ||
| 30 | -JSON Guarantees | ||
| 31 | ---------------- | 546 | +JSON Compatibility Guarantees |
| 547 | +----------------------------- | ||
| 32 | 548 | ||
| 33 | The qpdf JSON representation includes a JSON serialization of the raw | 549 | The qpdf JSON representation includes a JSON serialization of the raw |
| 34 | objects in the PDF file as well as some computed information in a more | 550 | objects in the PDF file as well as some computed information in a more |
| @@ -37,24 +553,23 @@ format. These guarantees are designed to simplify the experience of a | @@ -37,24 +553,23 @@ format. These guarantees are designed to simplify the experience of a | ||
| 37 | developer working with the JSON format. | 553 | developer working with the JSON format. |
| 38 | 554 | ||
| 39 | Compatibility | 555 | Compatibility |
| 40 | - The top-level JSON object output is a dictionary. The JSON output | ||
| 41 | - contains various nested dictionaries and arrays. With the exception | ||
| 42 | - of dictionaries that are populated by the fields of objects from the | ||
| 43 | - file, all instances of a dictionary are guaranteed to have exactly | ||
| 44 | - the same keys. Future versions of qpdf are free to add additional | ||
| 45 | - keys but not to remove keys or change the type of object that a key | ||
| 46 | - points to. The qpdf program validates this guarantee, and in the | ||
| 47 | - unlikely event that a bug in qpdf should cause it to generate data | ||
| 48 | - that doesn't conform to this rule, it will ask you to file a bug | ||
| 49 | - report. | ||
| 50 | - | ||
| 51 | - The top-level JSON structure contains a "``version``" key whose value | ||
| 52 | - is simple integer. The value of the ``version`` key will be | 556 | + The top-level JSON object is a dictionary (JSON "object"). The JSON |
| 557 | + output contains various nested dictionaries and arrays. With the | ||
| 558 | + exception of dictionaries that are populated by the fields of | ||
| 559 | + PDF objects from the file, all instances of a dictionary are | ||
| 560 | + guaranteed to have exactly the same keys. | ||
| 561 | + | ||
| 562 | + The top-level JSON structure contains a ``"version"`` key whose | ||
| 563 | + value is a simple integer. The value of the ``version`` key will be | ||
| 53 | incremented if a non-compatible change is made. A non-compatible | 564 | incremented if a non-compatible change is made. A non-compatible |
| 54 | change would be any change that involves removal of a key, a change | 565 | change would be any change that involves removal of a key, a change |
| 55 | - to the format of data pointed to by a key, or a semantic change that | ||
| 56 | - requires a different interpretation of a previously existing key. A | ||
| 57 | - strong effort will be made to avoid breaking compatibility. | 566 | + to the format of data pointed to by a key, or a semantic change |
| 567 | + that requires a different interpretation of a previously existing | ||
| 568 | + key. | ||
| 569 | + | ||
| 570 | + Within a specific qpdf JSON version, future versions of qpdf are free | ||
| 571 | + to add additional keys but not to remove keys or change the type of | ||
| 572 | + object that a key points to. | ||
| 58 | 573 | ||
| 59 | Documentation | 574 | Documentation |
| 60 | The :command:`qpdf` command can be invoked with the | 575 | The :command:`qpdf` command can be invoked with the |
| @@ -66,28 +581,29 @@ Documentation | @@ -66,28 +581,29 @@ Documentation | ||
| 66 | 581 | ||
| 67 | - A dictionary in the help output means that the corresponding | 582 | - A dictionary in the help output means that the corresponding |
| 68 | location in the actual JSON output is also a dictionary with | 583 | location in the actual JSON output is also a dictionary with |
| 69 | - exactly the same keys; that is, no keys present in help are absent | ||
| 70 | - in the real output, and no keys will be present in the real output | ||
| 71 | - that are not in help. As a special case, if the dictionary has a | ||
| 72 | - single key whose name starts with ``<`` and ends with ``>``, it | ||
| 73 | - means that the JSON output is a dictionary that can have any keys, | ||
| 74 | - each of which conforms to the value of the special key. This is | ||
| 75 | - used for cases in which the keys of the dictionary are things like | ||
| 76 | - object IDs. | 584 | + exactly the same keys; that is, no keys present in help are |
| 585 | + absent in the real output, and no keys will be present in the | ||
| 586 | + real output that are not in help. It is possible for a key to be | ||
| 587 | + present and have a value that is explicitly ``null``. As a | ||
| 588 | + special case, if the dictionary has a single key whose name | ||
| 589 | + starts with ``<`` and ends with ``>``, it means that the JSON | ||
| 590 | + output is a dictionary that can have any value as a key. This is | ||
| 591 | + used for cases in which the keys of the dictionary are things | ||
| 592 | + like object IDs. | ||
| 77 | 593 | ||
| 78 | - A string in the help output is a description of the item that | 594 | - A string in the help output is a description of the item that |
| 79 | appears in the corresponding location of the actual output. The | 595 | appears in the corresponding location of the actual output. The |
| 80 | - corresponding output can have any format. | 596 | + corresponding output can have any value including ``null``. |
| 81 | 597 | ||
| 82 | - An array in the help output always contains a single element. It | 598 | - An array in the help output always contains a single element. It |
| 83 | indicates that the corresponding location in the actual output is | 599 | indicates that the corresponding location in the actual output is |
| 84 | - also an array, and that each element of the array has whatever | ||
| 85 | - format is implied by the single element of the help output's | ||
| 86 | - array. | 600 | + an array of any length, and that each element of the array has |
| 601 | + whatever format is implied by the single element of the help | ||
| 602 | + output's array. | ||
| 87 | 603 | ||
| 88 | - For example, the help output indicates includes a "``pagelabels``" | 604 | + For example, the help output indicates includes a ``"pagelabels"`` |
| 89 | key whose value is an array of one element. That element is a | 605 | key whose value is an array of one element. That element is a |
| 90 | - dictionary with keys "``index``" and "``label``". In addition to | 606 | + dictionary with keys ``"index"`` and ``"label"``. In addition to |
| 91 | describing the meaning of those keys, this tells you that the actual | 607 | describing the meaning of those keys, this tells you that the actual |
| 92 | JSON output will contain a ``pagelabels`` array, each of whose | 608 | JSON output will contain a ``pagelabels`` array, each of whose |
| 93 | elements is a dictionary that contains an ``index`` key, a ``label`` | 609 | elements is a dictionary that contains an ``index`` key, a ``label`` |
| @@ -95,56 +611,13 @@ Documentation | @@ -95,56 +611,13 @@ Documentation | ||
| 95 | 611 | ||
| 96 | Directness and Simplicity | 612 | Directness and Simplicity |
| 97 | The JSON output contains the value of every object in the file, but | 613 | The JSON output contains the value of every object in the file, but |
| 98 | - it also contains some processed data. This is analogous to how qpdf's | ||
| 99 | - library interface works. The processed data is similar to the helper | ||
| 100 | - functions in that it allows you to look at certain aspects of the PDF | ||
| 101 | - file without having to understand all the nuances of the PDF | 614 | + it also contains some summary data. This is analogous to how qpdf's |
| 615 | + library interface works. The summary data is similar to the helper | ||
| 616 | + functions in that it allows you to look at certain aspects of the | ||
| 617 | + PDF file without having to understand all the nuances of the PDF | ||
| 102 | specification, while the raw objects allow you to mine the PDF for | 618 | specification, while the raw objects allow you to mine the PDF for |
| 103 | anything that the higher-level interfaces are lacking. | 619 | anything that the higher-level interfaces are lacking. |
| 104 | 620 | ||
| 105 | -.. _json.limitations: | ||
| 106 | - | ||
| 107 | -Limitations of JSON Representation | ||
| 108 | ----------------------------------- | ||
| 109 | - | ||
| 110 | -There are a few limitations to be aware of with the JSON structure: | ||
| 111 | - | ||
| 112 | -- Strings, names, and indirect object references in the original PDF | ||
| 113 | - file are all converted to strings in the JSON representation. In the | ||
| 114 | - case of a "normal" PDF file, you can tell the difference because a | ||
| 115 | - name starts with a slash (``/``), and an indirect object reference | ||
| 116 | - looks like ``n n R``, but if there were to be a string that looked | ||
| 117 | - like a name or indirect object reference, there would be no way to | ||
| 118 | - tell this from the JSON output. Note that there are certain cases | ||
| 119 | - where you know for sure what something is, such as knowing that | ||
| 120 | - dictionary keys in objects are always names and that certain things | ||
| 121 | - in the higher-level computed data are known to contain indirect | ||
| 122 | - object references. | ||
| 123 | - | ||
| 124 | -- The JSON format doesn't support binary data very well. Mostly the | ||
| 125 | - details are not important, but they are presented here for | ||
| 126 | - information. When qpdf outputs a string in the JSON representation, | ||
| 127 | - it converts the string to UTF-8, assuming usual PDF string semantics. | ||
| 128 | - Specifically, if the original string is UTF-16, it is converted to | ||
| 129 | - UTF-8. Otherwise, it is assumed to have PDF doc encoding, and is | ||
| 130 | - converted to UTF-8 with that assumption. This causes strange things | ||
| 131 | - to happen to binary strings. For example, if you had the binary | ||
| 132 | - string ``<038051>``, this would be output to the JSON as ``\u0003โขQ`` | ||
| 133 | - because ``03`` is not a printable character and ``80`` is the bullet | ||
| 134 | - character in PDF doc encoding and is mapped to the Unicode value | ||
| 135 | - ``2022``. Since ``51`` is ``Q``, it is output as is. If you wanted to | ||
| 136 | - convert back from here to a binary string, would have to recognize | ||
| 137 | - Unicode values whose code points are higher than ``0xFF`` and map | ||
| 138 | - those back to their corresponding PDF doc encoding characters. There | ||
| 139 | - is no way to tell the difference between a Unicode string that was | ||
| 140 | - originally encoded as UTF-16 or one that was converted from PDF doc | ||
| 141 | - encoding. In other words, it's best if you don't try to use the JSON | ||
| 142 | - format to extract binary strings from the PDF file, but if you really | ||
| 143 | - had to, it could be done. Note that qpdf's | ||
| 144 | - :qpdf:ref:`--show-object` option does not have this | ||
| 145 | - limitation and will reveal the string as encoded in the original | ||
| 146 | - file. | ||
| 147 | - | ||
| 148 | .. _json.considerations: | 621 | .. _json.considerations: |
| 149 | 622 | ||
| 150 | JSON: Special Considerations | 623 | JSON: Special Considerations |
| @@ -157,12 +630,15 @@ be aware of: | @@ -157,12 +630,15 @@ be aware of: | ||
| 157 | - If a PDF file has certain types of errors in its pages tree (such as | 630 | - If a PDF file has certain types of errors in its pages tree (such as |
| 158 | page objects that are direct or multiple pages sharing the same | 631 | page objects that are direct or multiple pages sharing the same |
| 159 | object ID), qpdf will automatically repair the pages tree. If you | 632 | object ID), qpdf will automatically repair the pages tree. If you |
| 160 | - specify ``"objects"`` and/or ``"objectinfo"`` without any other | ||
| 161 | - keys, you will see the original pages tree without any corrections. | ||
| 162 | - If you specify any of keys that require page tree traversal (for | ||
| 163 | - example, ``"pages"``, ``"outlines"``, or ``"pagelabel"``), then | ||
| 164 | - ``"objects"`` and ``"objectinfo"`` will show the repaired page tree | ||
| 165 | - so that object references will be consistent throughout the file. | 633 | + specify ``"objects"`` (and, with qpdf JSON version 1, also |
| 634 | + ``"objectinfo"``) without any other keys, you will see the original | ||
| 635 | + pages tree without any corrections. If you specify any of the keys that | ||
| 636 | + require page tree traversal (for example, ``"pages"``, | ||
| 637 | + ``"outlines"``, or ``"pagelabel"``), then ``"objects"`` (and | ||
| 638 | + ``"objectinfo"``) will show the repaired page tree so that object | ||
| 639 | + references will be consistent throughout the file. This is not an | ||
| 640 | + issue with :qpdf:ref:`--json-output`, which doesn't repair the pages | ||
| 641 | + tree. | ||
| 166 | 642 | ||
| 167 | - While qpdf guarantees that keys present in the help will be present | 643 | - While qpdf guarantees that keys present in the help will be present |
| 168 | in the output, those fields may be null or empty if the information | 644 | in the output, those fields may be null or empty if the information |
| @@ -177,22 +653,128 @@ be aware of: | @@ -177,22 +653,128 @@ be aware of: | ||
| 177 | 1. Note that JSON indexes from 0, and you would also use 0-based | 653 | 1. Note that JSON indexes from 0, and you would also use 0-based |
| 178 | indexing using the API. However, 1-based indexing is easier in this | 654 | indexing using the API. However, 1-based indexing is easier in this |
| 179 | case because the command-line syntax for specifying page ranges is | 655 | case because the command-line syntax for specifying page ranges is |
| 180 | - 1-based. If you were going to write a program that looked through the | ||
| 181 | - JSON for information about specific pages and then use the | 656 | + 1-based. If you were going to write a program that looked through |
| 657 | + the JSON for information about specific pages and then use the | ||
| 182 | command-line to extract those pages, 1-based indexing is easier. | 658 | command-line to extract those pages, 1-based indexing is easier. |
| 183 | - Besides, it's more convenient to subtract 1 from a program in a real | ||
| 184 | - programming language than it is to add 1 from shell code. | 659 | + Besides, it's more convenient to subtract 1 in a real programming |
| 660 | + language than it is to add 1 in shell code. | ||
| 185 | 661 | ||
| 186 | - The image information included in the ``page`` section of the JSON | 662 | - The image information included in the ``page`` section of the JSON |
| 187 | - output includes the key "``filterable``". Note that the value of this | ||
| 188 | - field may depend on the :qpdf:ref:`--decode-level` that | ||
| 189 | - you invoke qpdf with. The JSON output includes a top-level key | ||
| 190 | - "``parameters``" that indicates the decode level used for computing | ||
| 191 | - whether a stream was filterable. For example, jpeg images will be | ||
| 192 | - shown as not filterable by default, but they will be shown as | ||
| 193 | - filterable if you run :command:`qpdf --json | 663 | + output includes the key ``"filterable"``. Note that the value of |
| 664 | + this field may depend on the :qpdf:ref:`--decode-level` that you | ||
| 665 | + invoke qpdf with. The JSON output includes a top-level key | ||
| 666 | + ``"parameters"`` that indicates the decode level that was used for | ||
| 667 | + computing whether a stream was filterable. For example, jpeg images | ||
| 668 | + will be shown as not filterable by default, but they will be shown | ||
| 669 | + as filterable if you run :command:`qpdf --json | ||
| 194 | --decode-level=all`. | 670 | --decode-level=all`. |
| 195 | 671 | ||
| 196 | - The ``encrypt`` key's values will be populated for non-encrypted | 672 | - The ``encrypt`` key's values will be populated for non-encrypted |
| 197 | files. Some values will be null, and others will have values that | 673 | files. Some values will be null, and others will have values that |
| 198 | apply to unencrypted files. | 674 | apply to unencrypted files. |
| 675 | + | ||
| 676 | +- The qpdf library itself never loads an entire PDF into memory. This | ||
| 677 | + remains true for PDF files represented in JSON format. In general, | ||
| 678 | + qpdf will hold the entire object structure in memory once a file has | ||
| 679 | + been fully read (objects are loaded into memory lazily but stay | ||
| 680 | + there once loaded), but it will never have more than two copies of a | ||
| 681 | + stream in memory at once. That said, if you ask qpdf to write JSON | ||
| 682 | + to memory, it will do so, so be careful about this if you are | ||
| 683 | + working with very large PDF files. There is nothing in the qpdf | ||
| 684 | + library itself that prevents working with PDF files much larger than | ||
| 685 | + available system memory. qpdf can both read and write such files in | ||
| 686 | + JSON format. If you need to work with a PDF file's JSON | ||
| 687 | + representation in memory, it is recommended that you use either | ||
| 688 | + ``none`` or ``file`` as the argument to | ||
| 689 | + :qpdf:ref:`--json-stream-data`, or if using the API, use | ||
| 690 | + ``qpdf_sj_none`` or ``qpdf_sj_file`` as the JSON stream data value. | ||
| 691 | + If using ``none``, you can use other means to obtain the stream | ||
| 692 | + data. | ||
| 693 | + | ||
| 694 | +.. _json-v2-changes: | ||
| 695 | + | ||
| 696 | +Changes from JSON v1 to v2 | ||
| 697 | +-------------------------- | ||
| 698 | + | ||
| 699 | +The following changes were made to qpdf's JSON output format for | ||
| 700 | +version 2. | ||
| 701 | + | ||
| 702 | +- The representation of objects has changed. For details, see | ||
| 703 | + :ref:`json.objects`. | ||
| 704 | + | ||
| 705 | + - The representation of strings is now unambiguous for all strings. | ||
| 706 | + Strings are prefixed with either ``u:`` for Unicode strings or | ||
| 707 | + ``b:`` for byte strings. | ||
| 708 | + | ||
| 709 | + - Names are shown in qpdf's canonical form rather than in PDF | ||
| 710 | + syntax. (Example: the PDF-syntax name ``/text#2fplain`` appeared | ||
| 711 | + as ``"/text#2fplain"`` in v1 but appears as ``"/text/plain"`` in | ||
| 712 | + v2.) | ||
| 713 | + | ||
| 714 | + - The top-level representation of an object in ``"objects"`` is a | ||
| 715 | + dictionary containing either a ``"value"`` key or a ``"stream"`` | ||
| 716 | + key, making it possible to distinguish streams from other objects. | ||
| 717 | + | ||
| 718 | +- The ``"objectinfo"`` key has been removed in favor of a | ||
| 719 | + representation in ``"objects"`` that differentiates between a stream | ||
| 720 | + and other kinds of objects. In v1, it was not possible to tell a | ||
| 721 | + stream from a dictionary within ``"objects"``. | ||
| 722 | + | ||
| 723 | +- Within the ``"objects"`` dictionary, keys are now ``"obj:O G R"`` | ||
| 724 | + where ``O`` and ``G`` are the object and generation numbers. | ||
| 725 | + ``"trailer"`` remains the key for the trailer dictionary. In v1, the | ||
| 726 | + ``obj:`` prefix was not present. The rationale for this change is as | ||
| 727 | + follows: | ||
| 728 | + | ||
| 729 | + - Having a unique prefix (``obj:``) makes it much easier to search | ||
| 730 | + in the JSON file for the definition of an object. | ||
| 731 | + | ||
| 732 | + - Having the key still contain ``O G R`` makes it much easier to | ||
| 733 | + construct the key from an indirect reference. You just have to | ||
| 734 | + prepend ``obj:``. There is no need to parse the indirect object | ||
| 735 | + reference. | ||
| 736 | + | ||
| 737 | +- In the ``"encrypt"`` object, the ``"modifyannotations"`` key was | ||
| 738 | + misspelled as ``"moddifyannotations"`` in v1. This has been | ||
| 739 | + corrected. | ||
| 740 | + | ||
| 741 | +Motivation for qpdf JSON version 2 | ||
| 742 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 743 | + | ||
| 744 | +qpdf JSON version 2 was created to make it possible to manipulate PDF | ||
| 745 | +files using JSON syntax instead of native PDF syntax. This makes it | ||
| 746 | +possible to make low-level updates to PDF files from just about any | ||
| 747 | +programming language or even to do so from the command-line using | ||
| 748 | +tools like ``jq`` or any editor that's capable of working with JSON | ||
| 749 | +files. There were several limitations of JSON format version 1 that | ||
| 750 | +made this impossible: | ||
| 751 | + | ||
| 752 | +- Strings, names, and indirect object references in the original PDF | ||
| 753 | + file were all converted to strings in the JSON representation. For | ||
| 754 | + casual human inspection, this was fine, but in the general case, | ||
| 755 | + there was no way to tell the difference between a string that looked | ||
| 756 | + like a name or indirect object reference and an actual name or | ||
| 757 | + indirect object reference. | ||
| 758 | + | ||
| 759 | +- PDF strings were not unambiguously represented in the JSON format. | ||
| 760 | + The way qpdf JSON v1 represented a string was to try to convert the | ||
| 761 | + string to UTF-8. This was done by assuming a string that was not | ||
| 762 | + explicitly marked as Unicode was encoded in PDF doc encoding. The | ||
| 763 | + problem is that there is not a perfect bidirectional mapping between | ||
| 764 | + Unicode and PDF doc encoding, so if a binary string happened to | ||
| 765 | + contain characters that couldn't be bidirectionally mapped, there | ||
| 766 | + would be no way to get back to the original PDF string. Even when | ||
| 767 | + possible, trying to map from the JSON representation of a binary | ||
| 768 | + string back to the original string required knowledge of the mapping | ||
| 769 | + between PDF doc encoding and Unicode. | ||
| 770 | + | ||
| 771 | +- There was no representation of stream data. If you wanted to extract | ||
| 772 | + stream data, you could use :qpdf:ref:`--show-object`, so this wasn't | ||
| 773 | + that important for inspection, but it was a blocker for being able | ||
| 774 | + to go from JSON back to PDF. qpdf JSON version 2 allows stream data | ||
| 775 | + to be included inline as base64-encoded data. There is also an | ||
| 776 | + option to write all stream data to external files, which makes it | ||
| 777 | + possible to work with very large PDF files in JSON format even with | ||
| 778 | + tools that try to read the entire JSON structure into memory. | ||
| 779 | + | ||
| 780 | +- The PDF version from the PDF header was not represented in qpdf JSON v1. |
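As a rough illustration of the v2 conventions listed in the hunk above (the ``obj:`` key prefix, the ``u:``/``b:`` string prefixes, and the ``"value"``/``"stream"`` split), a minimal Python sketch might look like the following. The helper names are hypothetical and not part of qpdf. The sketch assumes that names start with ``/`` and that indirect references have the form ``O G R``. It leaves the payload after ``b:`` untouched, since the text above does not say how byte strings are encoded.

```python
import re
from typing import Tuple

# Assumption: an indirect reference is "<object> <generation> R".
_INDIRECT_REF = re.compile(r"\d+ \d+ R")


def object_key(ref: str) -> str:
    """Build the v2 "objects" key for an indirect reference.

    Per the change list, this is just a matter of prepending "obj:".
    """
    if not _INDIRECT_REF.fullmatch(ref):
        raise ValueError(f"not an indirect reference: {ref!r}")
    return "obj:" + ref


def classify(value: str) -> Tuple[str, str]:
    """Classify a JSON string taken from a v2 object (hypothetical helper).

    Returns a (kind, payload) pair. Prefixes are checked first, so a
    Unicode string whose text merely looks like a reference stays a string.
    """
    if value.startswith("u:"):
        return ("unicode-string", value[2:])
    if value.startswith("b:"):
        # Payload is returned undecoded; its encoding is not assumed here.
        return ("byte-string", value[2:])
    if value.startswith("/"):
        return ("name", value)
    if _INDIRECT_REF.fullmatch(value):
        return ("reference", value)
    return ("unknown", value)


def is_stream(obj: dict) -> bool:
    """True if a top-level entry in "objects" has a "stream" key."""
    return "stream" in obj
```

The prefix check comes before the reference check on purpose: it is what removes the v1 ambiguity between a string that looks like ``12 0 R`` and an actual indirect reference.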
manual/library.rst
| @@ -70,12 +70,14 @@ Python | @@ -70,12 +70,14 @@ Python | ||
| 70 | qpdf's capabilities with other functionality provided by Python's | 70 | qpdf's capabilities with other functionality provided by Python's |
| 71 | rich standard library and available modules. | 71 | rich standard library and available modules. |
| 72 | 72 | ||
| 73 | -Other Languages | ||
| 74 | - Starting with version 8.3.0, the :command:`qpdf` | ||
| 75 | - command-line tool can produce a JSON representation of the PDF file's | ||
| 76 | - non-content data. This can facilitate interacting programmatically | ||
| 77 | - with PDF files through qpdf's command line interface. For more | ||
| 78 | - information, please see :ref:`json`. | 73 | +Other Languages Starting with version 11.0.0, the :command:`qpdf` |
| 74 | + command-line tool can produce an unambiguous JSON representation of | ||
| 75 | + a PDF file and can also create or update PDF files using this JSON | ||
| 76 | + representation. qpdf versions from 8.3.0 through 10.6.3 had a more | ||
| 77 | + limited JSON output format. The qpdf JSON format makes it possible | ||
| 78 | + to inspect and modify the structure of a PDF file down to the | ||
| 79 | + object level from the command-line or from any language that can | ||
| 80 | + handle JSON data. Please see :ref:`json` for details. | ||
| 79 | 81 | ||
| 80 | Wrappers | 82 | Wrappers |
| 81 | The `qpdf Wiki <https://github.com/qpdf/qpdf/wiki>`__ contains a | 83 | The `qpdf Wiki <https://github.com/qpdf/qpdf/wiki>`__ contains a |
manual/object-streams.rst
| @@ -122,7 +122,7 @@ entries in ``/W`` above. Each entry consists of one or more fields, the | @@ -122,7 +122,7 @@ entries in ``/W`` above. Each entry consists of one or more fields, the | ||
| 122 | first of which is the type of the field. The number of bytes for each | 122 | first of which is the type of the field. The number of bytes for each |
| 123 | field is given by ``/W`` above. A 0 in ``/W`` indicates that the field | 123 | field is given by ``/W`` above. A 0 in ``/W`` indicates that the field |
| 124 | is omitted and has the default value. The default value for the field | 124 | is omitted and has the default value. The default value for the field |
| 125 | -type is "``1``". All other default values are "``0``". | 125 | +type is ``1``. All other default values are ``0``. |
| 126 | 126 | ||
| 127 | PDF 1.5 has three field types: | 127 | PDF 1.5 has three field types: |
| 128 | 128 |
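The rule for omitted fields in the object-streams.rst hunk above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, ``/W`` is taken as a plain list of byte widths, and fields are read as big-endian unsigned integers, as in the PDF specification. It is not qpdf's implementation.

```python
from typing import List, Sequence


def decode_xref_entry(data: bytes, widths: Sequence[int]) -> List[int]:
    """Decode one cross-reference stream entry (hypothetical helper).

    `widths` holds the /W values. A width of 0 means the field is
    omitted and takes its default: 1 for the first field (the type),
    0 for every other field.
    """
    fields = []
    offset = 0
    for index, width in enumerate(widths):
        if width == 0:
            # Omitted field: use the default value instead of reading bytes.
            fields.append(1 if index == 0 else 0)
        else:
            chunk = data[offset:offset + width]
            if len(chunk) != width:
                raise ValueError("entry is shorter than /W requires")
            fields.append(int.from_bytes(chunk, "big"))
            offset += width
    return fields
```

For example, with ``/W [0 1 1]`` the type field is absent from the data, so every entry decodes with type ``1`` and only the remaining two fields are read from the stream.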
manual/qdf.rst
| @@ -28,6 +28,13 @@ able to restore edited files to a correct state. The | @@ -28,6 +28,13 @@ able to restore edited files to a correct state. The | ||
| 28 | arguments. It reads a possibly edited QDF file from standard input and | 28 | arguments. It reads a possibly edited QDF file from standard input and |
| 29 | writes a repaired file to standard output. | 29 | writes a repaired file to standard output. |
| 30 | 30 | ||
| 31 | +For another way to work with PDF files in an editor, see :ref:`json`. | ||
| 32 | +Using qpdf JSON format allows you to edit the PDF file semantically | ||
| 33 | +without having to be concerned about PDF syntax. However, QDF files | ||
| 34 | +are actually valid PDF files, so the feedback cycle may be faster if | ||
| 35 | +previewing with a PDF reader. Also, since QDF files are valid PDF, you | ||
| 36 | +can experiment with all aspects of the PDF file, including syntax. | ||
| 37 | + | ||
| 31 | The following attributes characterize a QDF file: | 38 | The following attributes characterize a QDF file: |
| 32 | 39 | ||
| 33 | - All objects appear in numerical order in the PDF file, including when | 40 | - All objects appear in numerical order in the PDF file, including when |
manual/qpdf-job.rst
| @@ -27,6 +27,10 @@ executable is available from inside the C++ library using the | @@ -27,6 +27,10 @@ executable is available from inside the C++ library using the | ||
| 27 | 27 | ||
| 28 | - Use from the C API with ``qpdfjob_run_from_json`` from :file:`qpdfjob-c.h` | 28 | - Use from the C API with ``qpdfjob_run_from_json`` from :file:`qpdfjob-c.h` |
| 29 | 29 | ||
| 30 | + - Note: this is unrelated to :qpdf:ref:`--json` but can be combined | ||
| 31 | + with it. For more information on qpdf JSON (vs. QPDFJob JSON), see | ||
| 32 | + :ref:`json`. | ||
| 33 | + | ||
| 30 | - The ``QPDFJob`` C++ API | 34 | - The ``QPDFJob`` C++ API |
| 31 | 35 | ||
| 32 | If you can understand how to use the :command:`qpdf` CLI, you can | 36 | If you can understand how to use the :command:`qpdf` CLI, you can |
manual/release-notes.rst
| @@ -60,7 +60,8 @@ For a detailed list of changes, please see the file | @@ -60,7 +60,8 @@ For a detailed list of changes, please see the file | ||
| 60 | - CLI: breaking changes | 60 | - CLI: breaking changes |
| 61 | 61 | ||
| 62 | - The default json output version when :qpdf:ref:`--json` is | 62 | - The default json output version when :qpdf:ref:`--json` is |
| 63 | - specified has been changed from ``1`` to ``latest``. | 63 | + specified has been changed from ``1`` to ``latest``, which is |
| 64 | + now ``2``. | ||
| 64 | 65 | ||
| 65 | - The :qpdf:ref:`--allow-weak-crypto` flag is now mandatory when | 66 | - The :qpdf:ref:`--allow-weak-crypto` flag is now mandatory when |
| 66 | explicitly creating files with weak cryptographic algorithms. | 67 | explicitly creating files with weak cryptographic algorithms. |
| @@ -100,7 +101,7 @@ For a detailed list of changes, please see the file | @@ -100,7 +101,7 @@ For a detailed list of changes, please see the file | ||
| 100 | 101 | ||
| 101 | - ``qpdf --list-attachments --verbose`` includes some additional | 102 | - ``qpdf --list-attachments --verbose`` includes some additional |
| 102 | information about attachments. Additional information about | 103 | information about attachments. Additional information about |
| 103 | - attachments is also included in the ``attachments`` json key | 104 | + attachments is also included in the ``attachments`` JSON key |
| 104 | with ``--json``. | 105 | with ``--json``. |
| 105 | 106 | ||
| 106 | - For encrypted files, ``qpdf --json`` reveals the user password | 107 | - For encrypted files, ``qpdf --json`` reveals the user password |
| @@ -647,8 +648,8 @@ For a detailed list of changes, please see the file | @@ -647,8 +648,8 @@ For a detailed list of changes, please see the file | ||
| 647 | passwords from files or standard input than using | 648 | passwords from files or standard input than using |
| 648 | :samp:`@file` for this purpose. | 649 | :samp:`@file` for this purpose. |
| 649 | 650 | ||
| 650 | - - Add some information about attachments to the json output, and | ||
| 651 | - added ``attachments`` as an additional json key. The | 651 | + - Add some information about attachments to the JSON output, and |
| 652 | + added ``attachments`` as an additional JSON key. The | ||
| 652 | information included here is limited to the preferred name and | 653 | information included here is limited to the preferred name and |
| 653 | content stream and a reference to the file spec object. This is | 654 | content stream and a reference to the file spec object. This is |
| 654 | enough detail for clients to avoid the hassle of navigating a | 655 | enough detail for clients to avoid the hassle of navigating a |