Commit 905e99a3141edc7d6523e8da47e624b1c1e664a3
1 parent 36794a60
TODO: flesh out JSON v2 details
Showing 1 changed file with 152 additions and 36 deletions
TODO
| 1 | + | |
| 1 | 2 | Next |
| 2 | 3 | ==== |
| 3 | 4 | |
| ... | ... | @@ -9,6 +10,7 @@ Priorities for 11: |
| 9 | 10 | * cmake |
| 10 | 11 | * PointerHolder -> shared_ptr |
| 11 | 12 | * ABI |
| 13 | +* --json default is latest | |
| 12 | 14 | |
| 13 | 15 | Misc |
| 14 | 16 | * Get rid of "ugly switch statements" in QUtil.cc -- replace with |
| ... | ... | @@ -17,6 +19,16 @@ Misc |
| 17 | 19 | * Consider exposing get_next_utf8_codepoint in QUtil |
| 18 | 20 | * Add QUtil::is_explicit_utf8 that does what QPDF_String::getUTF8Val |
| 19 | 21 | does to detect UTF-8 encoded strings per PDF 2.0 spec. |
| 22 | +* Add an option --ignore-encryption to ignore encryption information | |
| 23 | + and treat encrypted files as if they weren't encrypted. This should | |
| 24 | + make it possible to solve #598 (--show-encryption without a | |
| 25 | + password). We'll need to make sure we don't try to filter any | |
| 26 | + streams in this mode. Ideally we should be able to combine this with | |
| 27 | + --json so we can look at the raw encrypted strings and streams if we | |
| 28 | + want to. Since providing the password may reveal additional details, | |
| 29 | + --show-encryption could potentially retry with this option if the | |
| 30 | + first time doesn't work. Then, with the file open, we can read the | |
| 31 | + encryption dictionary normally. | |
| 20 | 32 | |
| 21 | 33 | Soon: Break ground on "Document-level work" |
| 22 | 34 | |
| ... | ... | @@ -82,21 +94,17 @@ A .clang-format file can be created at the top of the repository. |
| 82 | 94 | Output JSON v2 |
| 83 | 95 | ============== |
| 84 | 96 | |
| 85 | -Output JSON v2 contain enough information to completely recreate a PDF | |
| 86 | -file. | |
| 87 | - | |
| 88 | -This is not an ABI change as long as the default --json version is 1. | |
| 97 | +Output JSON v2 will contain enough information to completely recreate | |
| 98 | +a PDF file. In other words, qpdf will have full, bidirectional, | |
| 99 | +lossless json serialization/deserialization of PDF. | |
| 89 | 100 | |
| 90 | 101 | If this is done, update --json option in cli.rst to mention v2. Also |
| 91 | 102 | update QPDFJob::Config::json and of course other parts of the docs |
| 92 | 103 | (json.rst). |
| 93 | 104 | |
| 94 | -Fix the following problems: | |
| 105 | +You can't create a PDF from v1 json because | |
| 95 | 106 | |
| 96 | -* Include the PDF version header somewhere. | |
| 97 | - | |
| 98 | -* Using "n n R" as a key in "objects" and "objectinfo" messes up | |
| 99 | - searching for things | |
| 107 | +* The PDF version header is not recorded | |
| 100 | 108 | |
| 101 | 109 | * Strings cannot be unambiguously encoded/decoded |
| 102 | 110 | |
| ... | ... | @@ -110,36 +118,83 @@ Fix the following problems: |
| 110 | 118 | * You can't tell a stream from a dictionary except by looking in both |
| 111 | 119 | "object" and "objectinfo". Fix this, and then remove "objectinfo". |
| 112 | 120 | |
| 113 | -* There are differences between information shown in the json format | |
| 114 | - vs. information shown with options like --check, --list-attachments, | |
| 121 | +Additionally, using "n n R" as a key in "objects" and "objectinfo" | |
| 122 | +messes up searching for things. | |
| 123 | + | |
| 124 | +For json v2: | |
| 125 | + | |
| 126 | +* Make sure it is possible to serialize and deserialize a PDF to JSON | |
| 127 | + without loading the whole thing into memory. This is substantial. It | |
| 128 | + means we need sax-style parsing and handling so we can | |
| 129 | + handle/generate objects as we go. We'll have to be able to keep | |
| 130 | + track of keys for dictionary error checking. May want to add json to | |
| 131 | + large file tests. | |
| 132 | + | |
| 133 | +* Resolve differences between information shown in the json format vs. | |
| 134 | + information shown with options like --check, --list-attachments, | |
| 115 | 135 | etc. The json format should be able to completely replace things |
| 116 | - that write to stdout. | |
| 136 | + that write to stdout. Be sure getAllPages() and other top-level | |
| 137 | + convenience routines are there so people don't need to parse the | |
| 138 | + pages tree themselves. For many workflows, it should be possible for | |
| 139 | + someone to work in the json file based on json metadata rather than | |
| 140 | + calling the QPDF API. (Of course, you still need the QPDF API for | |
| 141 | + higher level helper objects.) | |
| 117 | 142 | |
| 118 | 143 | * Consider using camelCase in multi-word key names to be consistent |
| 119 | 144 | with job JSON and with how JSON is often represented in languages |
| 120 | - that use it more natively | |
| 145 | + that use it more natively. | |
| 121 | 146 | |
| 122 | 147 | * Consider changing the contract to allow fields to be absent even |
| 123 | 148 | when present in the schema. It's reasonable for people to check for |
| 124 | 149 | presence of a key. Most languages make this easy to do. |
| 125 | 150 | |
| 151 | +* If we allow --json to be mixed with --ignore-encryption, we must | |
| 152 | + emphasize that the resulting json can't be turned back into a valid | |
| 153 | + PDF. | |
| 154 | + | |
| 126 | 155 | Most things that are informational can stay the same. We will have to |
| 127 | -go through every item to decide for sure. | |
| 156 | +go through every item to decide for sure, especially when camelCase is | |
| 157 | +taken into consideration. | |
| 158 | + | |
| 159 | +New APIs: | |
| 128 | 160 | |
| 129 | -To address ambiguity, consider the following: | |
| 161 | +QPDFObjectHandle::parseJSON(QPDF* context, JSON); | |
| 162 | +QPDFObjectHandle::parseJSON(QPDF* context, std::string const&); | |
| 163 | +operator ""_qpdf_json | |
| 164 | +C API to create a QPDFObjectHandle from a json string | |
| 130 | 165 | |
| 131 | -Whenever a direct PDF object appears, disambiguate things represented | |
| 132 | -in JSON as strings as follows: | |
| 166 | +JSON::parseFile | |
| 167 | +QPDF::parseJSON(JSON) (like parseFile, etc. -- deserializes json) | |
| 168 | +QPDF::updateFromJSON(JSON) | |
| 133 | 169 | |
| 134 | -* "/Name" -- if it starts with /, it's a name | |
| 135 | -* "n n R" -- if it is "n n R", it's an indirect object | |
| 136 | -* "u:utf8-encoded" -- a utf8-encoded string | |
| 137 | -* "b:<12ab34>" -- a binary string | |
| 170 | +CLI: --infile-is-json -- indicate that the input is a qpdf json file | |
| 171 | +rather than a PDF file | |
| 172 | +CLI: --update-from-json=file.json | |
| 138 | 173 | |
| 139 | -In "objects", the key is "obj:o,g", and the value is a dictionary with | |
| 140 | -exactly one of "value" or "stream" as its single key. | |
| 174 | +Have a "qpdf" key in the output that contains "jsonVersion", | |
| 175 | +"pdfVersion", and "objects". This replaces the "objects" field at the | |
| 176 | +top level. "objects" and "objectinfo" disappear from the top-level. | |
| 177 | +".version" and ".qpdf.jsonVersion" will match. The input to parseJSON | |
| 178 | +and updateFromJSON will have to have the "qpdf" key in it. All other | |
| 179 | +keys are ignored. | |
| 141 | 180 | |
| 142 | -For non-streams, the value of "value" is as described above. | |
| 181 | +When creating from a JSON file, the JSON must be complete with data | |
| 182 | +for all streams, a trailer, and a pdfVersion. When updating from a | |
| 183 | +JSON: | |
| 184 | + | |
| 185 | +* Any object whose value is null (not "value": null, but just null) is | |
| 186 | + deleted. | |
| 187 | +* For any stream that appears without stream data, the stream data is | |
| 188 | + left alone. | |
| 189 | +* Otherwise, the object from the JSON completely replaces the input | |
| 190 | + object. No dictionary merges or anything like that are performed. | |
| 191 | + It will call replaceObject. | |
| 192 | + | |
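Review note: the three update rules above can be modeled on plain dictionaries. This is a minimal sketch, not the planned implementation: apply_update is a hypothetical name, the in-memory model (a dict keyed by "obj:o,g") is an assumption, and the stream data key names "raw" and "filtered" are assumed from the stream notes later in this file.

```python
# Assumed names of the keys that carry stream data inside "stream".
STREAM_DATA_KEYS = ("raw", "filtered")


def apply_update(existing: dict, update: dict) -> dict:
    """Apply update-from-JSON rules to a dict of objects keyed by "obj:o,g"."""
    result = dict(existing)
    for key, new in update.items():
        if new is None:
            # Rule 1: an object whose value is null is deleted.
            result.pop(key, None)
            continue
        replacement = new
        new_stream = new.get("stream")
        old_stream = existing.get(key, {}).get("stream")
        has_data = new_stream is not None and any(
            k in new_stream for k in STREAM_DATA_KEYS
        )
        if new_stream is not None and old_stream is not None and not has_data:
            # Rule 2: a stream without data keeps the existing stream data.
            kept = {k: old_stream[k] for k in STREAM_DATA_KEYS if k in old_stream}
            replacement = {"stream": {**new_stream, **kept}}
        # Rule 3: otherwise the new object replaces the old one outright;
        # dictionaries are not merged.
        result[key] = replacement
    return result


existing = {
    "obj:1,0": {"value": {"/A": 1, "/B": 2}},
    "obj:2,0": {"stream": {"dict": {"/K": 1}, "raw": "AAAA"}},
    "obj:3,0": {"value": 5},
}
update = {
    "obj:1,0": {"value": {"/A": 9}},
    "obj:2,0": {"stream": {"dict": {"/K": 2}}},
    "obj:3,0": None,
}
updated = apply_update(existing, update)
```

In the example, "/B" disappears from object 1 because replacement is total, object 2 gets the new dictionary but keeps its old data, and object 3 is removed.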
| 193 | +Within .qpdf.objects, the key is "obj:o,g" or "obj:trailer", and the | |
| 194 | +value is a dictionary with exactly one of "value" or "stream" as its | |
| 195 | +single key. | |
| 196 | + | |
| 197 | +For non-streams: | |
| 143 | 198 | |
| 144 | 199 | { |
| 145 | 200 | "obj:o,g": { |
| ... | ... | @@ -149,7 +204,6 @@ For non-streams, the value of "value" is as described above. |
| 149 | 204 | |
| 150 | 205 | For streams: |
| 151 | 206 | |
| 152 | -{ | |
| 153 | 207 | "obj:o,g": { |
| 154 | 208 | "stream": { |
| 155 | 209 | "dict": { ... stream dictionary ... }, |
| ... | ... | @@ -160,27 +214,89 @@ For streams: |
| 160 | 214 | } |
| 161 | 215 | } |
| 162 | 216 | |
| 163 | -Notes about stream data: | |
| 217 | +Wherever a PDF object appears in the JSON output, including "value" | |
| 218 | +and "stream"."dict" above as well as other places where they might | |
| 219 | +appear, objects are represented as follows: | |
| 220 | + | |
| 221 | +* Arrays, dictionaries, booleans, nulls, integers, and real numbers | |
| 222 | + with no more than six decimal places are represented as their native | |
| 223 | + JSON type. | |
| 224 | +* Real numbers with more than six decimal places are represented as | |
| 225 | + "r:{real-value}". | |
| 226 | +* Names: "/Name" -- internal/canonical representation (e.g. | |
| 227 | + "/Text/Plain", not #xx quoted) | |
| 228 | +* Indirect objects: "n n R" | |
| 229 | +* Strings: one of | |
| 230 | + "s:json string treated as Unicode" | |
| 231 | + "b:json string treated as bytes; character > \u00ff is an error" | |
| 232 | + "e:base64-encoded bytes" | |
| 233 | + | |
| 234 | +Test cases: these are the same: | |
| 235 | +* "b:\u00cf\u0080", "s:π", "s:\u03c0", and "e:z4A=" | |
| 236 | +* "b:\u00f0\u009f\u00a5\u0094", "s:🥔", "s:\ud83e\udd54", and "e:8J+llA==" | |
| 237 | + | |
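Review note: the "e:" forms in the test cases above can be checked independently of qpdf. A small Python sketch, assuming the "e:" payload is standard base64 of the character's UTF-8 bytes (the notes do not state the byte encoding, so that part is an assumption); e_payload_bytes is a hypothetical helper name.

```python
import base64


def e_payload_bytes(tagged: str) -> bytes:
    """Decode the bytes carried by an "e:" string (hypothetical helper)."""
    if not tagged.startswith("e:"):
        raise ValueError('expected a string starting with "e:"')
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(tagged[2:], validate=True)


pi_bytes = e_payload_bytes("e:z4A=")
potato_bytes = e_payload_bytes("e:8J+llA==")

# Both payloads decode as UTF-8 to the characters used in the "s:" forms.
pi_text = pi_bytes.decode("utf-8")
potato_text = potato_bytes.decode("utf-8")
```

A check like this could back a regression test for the base64 encoder/decoder the notes say will be needed.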
| 238 | +When creating output from a string: | |
| 239 | +* If the string is explicitly unicode (UTF-8 or UTF-16), encode as | |
| 240 | + "s:" without the leading U+FEFF | |
| 241 | +* Else if the string can be bidirectionally mapped between pdf-doc and | |
| 242 | + unicode, transcode to unicode and encode as "s:" | |
| 243 | +* Else if the string would be decoded as binary, encode as "e:" | |
| 244 | +* Else encode as "b:" | |
| 245 | + | |
| 246 | +When reading a string, any string that doesn't follow the above rules | |
| 247 | +is an error. This includes "r:" strings not parseable as a real | |
| 248 | +number, "/Name" strings containing a NUL character, "s:" or "b:" | |
| 249 | +strings that are not valid JSON strings, "b:" strings containing | |
| 250 | +character values > 0xff, or "e:" values that are not valid base64. | |
| 251 | +Once the string is read in, if the "s:" string can be bidirectionally | |
| 252 | +mapped between pdf-doc and unicode, store as PDFDoc. Otherwise store | |
| 253 | +as UTF-16BE. "b:" strings are stored as bytes, and "e:" are decoded | |
| 254 | +and stored as bytes. | |
| 255 | + | |
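Review note: the reading rules above amount to a small classifier over JSON strings. The sketch below is an assumption-laden illustration, not qpdf code: classify is a hypothetical name, the regular expressions for "n n R" and real numbers are guesses at a reasonable syntax, and the PDFDoc versus UTF-16BE storage decision for "s:" strings is left out.

```python
import base64
import re

INDIRECT_RE = re.compile(r"\d+ \d+ R")
# Assumed syntax for real numbers: optional sign, digits, optional fraction.
REAL_RE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def classify(value: str):
    """Return (kind, payload) for a JSON string standing in for a PDF object."""
    if value.startswith("/"):
        if "\x00" in value:
            raise ValueError("name contains a NUL character")
        return ("name", value)
    if INDIRECT_RE.fullmatch(value):
        return ("indirect", value)
    if value.startswith("r:"):
        if not REAL_RE.fullmatch(value[2:]):
            raise ValueError("r: value is not parseable as a real number")
        return ("real", value[2:])
    if value.startswith("s:"):
        return ("unicode", value[2:])
    if value.startswith("b:"):
        text = value[2:]
        if any(ord(ch) > 0xFF for ch in text):
            raise ValueError("b: string contains a character > 0xff")
        # Each character maps to the byte with the same numeric value.
        return ("bytes", bytes(ord(ch) for ch in text))
    if value.startswith("e:"):
        # binascii.Error, raised on invalid base64, is a ValueError subclass.
        return ("bytes", base64.b64decode(value[2:], validate=True))
    raise ValueError("string does not follow any of the object rules")
```

For example, classify("b:\u00cf\u0080") and classify("e:z4A=") both yield the two bytes CF 80, while a "b:" string containing U+03C0 is rejected.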
| 256 | +Implementing this will require some refactoring of things between | |
| 257 | +QUtil and QPDF_String, plus we will need to implement a base64 | |
| 258 | +encoder/decoder. | |
| 259 | + | |
| 260 | +This enables a workflow like this: | |
| 261 | + | |
| 262 | +* qpdf --json=latest infile.pdf > pdf.json | |
| 263 | +* modify pdf.json | |
| 264 | +* qpdf infile.pdf --update-from-json=pdf.json out.pdf | |
| 265 | + | |
| 266 | +or | |
| 267 | + | |
| 268 | +* qpdf --json=latest --json-stream-data=raw|filtered infile.pdf > pdf.json | |
| 269 | +* modify pdf.json | |
| 270 | +* qpdf pdf.json --infile-is-json out.pdf | |
| 271 | + | |
| 272 | +Notes about streams and stream data: | |
| 273 | + | |
| 274 | +* Always include "dict". "/Length" is removed from the stream | |
| 275 | + dictionary. | |
| 164 | 276 | |
| 165 | -* Always include "dict". | |
| 277 | +* Add new flag --json-stream-data={raw,filtered,none}. At most one of | |
| 278 | + "raw" and "filtered" will appear for each stream. If "filtered" | |
| 279 | + appears, "/Filter" and "/DecodeParms" are removed from the stream | |
| 280 | + dictionary. This makes the stream data and dictionary match for when | |
| 281 | + the file is read back in. | |
| 166 | 282 | |
| 167 | 283 | * Always include "filterable" regardless of value of |
| 168 | 284 | --json-stream-data. The value of filterable is influenced by |
| 169 | 285 | --decode-level, which is already in parameters. |
| 170 | 286 | |
| 171 | -* Add new flag --json-stream-data={raw,filtered,none}. At most one of | |
| 172 | - "raw" and "filtered" will appear for each stream. | |
| 173 | - | |
| 174 | 287 | * Add to parameters: value of json-stream-data, default is none |
| 175 | 288 | |
| 176 | -* If none, omit stream data entirely | |
| 289 | +* If --json-stream-data=none, omit stream data entirely | |
| 177 | 290 | |
| 178 | -* If raw, include raw stream data as base64 | |
| 291 | +* If --json-stream-data=raw, include raw stream data as base64. Show | |
| 292 | + the data even for unfiltered streams in "raw". | |
| 179 | 293 | |
| 180 | -* If filtered, including the base64-encoded filtered stream data if we | |
| 181 | - can and should decode it based on decode-level. Otherwise, include | |
| 182 | - the base64-encoded raw data. See if we can honor | |
| 183 | - --normalize-content. | |
| 294 | +* If --json-stream-data=filtered, include the base64-encoded filtered | |
| 295 | + stream data if we can and should decode it based on decode-level. | |
| 296 | + Otherwise, include the base64-encoded raw data. See if we can honor | |
| 297 | + --normalize-content. If a stream appears unfiltered in the input, | |
| 298 | + still show it as filtered. Remove /DecodeParms and /Filter if | |
| 299 | + filtering. | |
| 184 | 300 | |
| 185 | 301 | Note that --json-stream-data=filtered is different from |
| 186 | 302 | --filtered-stream-data in that --filtered-stream-data implies | ... | ... |