Commit 2a92b1b0d6e389c9b033fffe1fc2821a63ca1621
1 parent 0500d434
TODO: solidify remaining json v2 work
Showing 1 changed file with 167 additions and 262 deletions
TODO
| @@ -10,6 +10,10 @@ In order: | @@ -10,6 +10,10 @@ In order: | ||
| 10 | 10 | ||
| 11 | Other (do in any order): | 11 | Other (do in any order): |
| 12 | 12 | ||
| 13 | +* See if I can change all output and error messages issued by the | ||
| 14 | + library, when context is available, to have a pipeline rather than a | ||
| 15 | + FILE* or std::ostream. This makes it possible for people to capture | ||
| 16 | + output more flexibly. | ||
| 13 | * Make job JSON accept a single element and treat as an array of one | 17 | * Make job JSON accept a single element and treat as an array of one |
| 14 | when an array is expected. This allows for making things repeatable | 18 | when an array is expected. This allows for making things repeatable |
| 15 | in the future without breaking compatibility and is needed for the | 19 | in the future without breaking compatibility and is needed for the |
| @@ -20,10 +24,11 @@ Other (do in any order): | @@ -20,10 +24,11 @@ Other (do in any order): | ||
| 20 | password). We'll need to make sure we don't try to filter any | 24 | password). We'll need to make sure we don't try to filter any |
| 21 | streams in this mode. Ideally we should be able to combine this with | 25 | streams in this mode. Ideally we should be able to combine this with |
| 22 | --json so we can look at the raw encrypted strings and streams if we | 26 | --json so we can look at the raw encrypted strings and streams if we |
| 23 | - want to. Since providing the password may reveal additional details, | ||
| 24 | - --show-encryption could potentially retry with this option if the | ||
| 25 | - first time doesn't work. Then, with the file open, we can read the | ||
| 26 | - encryption dictionary normally. | 27 | + want to, though be sure to document that the resulting JSON won't be |
| 28 | + convertible back to a valid PDF. Since providing the password may | ||
| 29 | + reveal additional details, --show-encryption could potentially retry | ||
| 30 | + with this option if the first time doesn't work. Then, with the file | ||
| 31 | + open, we can read the encryption dictionary normally. | ||
| 27 | * Find all places in the code that write to std::cout, std::cerr, | 32 | * Find all places in the code that write to std::cout, std::cerr, |
| 28 | stdout, or stderr to make sure they obey default output stream | 33 | stdout, or stderr to make sure they obey default output stream |
| 29 | settings for QPDF and QPDFJob. This probably includes adding a | 34 | settings for QPDF and QPDFJob. This probably includes adding a |
| @@ -43,209 +48,92 @@ Soon: Break ground on "Document-level work" | @@ -43,209 +48,92 @@ Soon: Break ground on "Document-level work" | ||
| 43 | Output JSON v2 | 48 | Output JSON v2 |
| 44 | ============== | 49 | ============== |
| 45 | 50 | ||
| 46 | ----- | ||
| 47 | -notes from 5/2: | ||
| 48 | - | ||
| 49 | -See if I can change all output and error messages issued by the | ||
| 50 | -library, when context is available, to have a pipeline rather than a | ||
| 51 | -FILE* or std::ostream. This makes it possible for people to capture | ||
| 52 | -output more flexibly. | ||
| 53 | - | ||
| 54 | -For json output, do not unparse to string. Use the writers instead. | ||
| 55 | -Write incrementally. This changes ordering only, but we should be able | ||
| 56 | -manually update the test output for those cases. Objects should be | ||
| 57 | -written in numerical order, not lexically sorted. It probably makes | ||
| 58 | -sense to put the trailer at the end since that's where it is in a | ||
| 59 | -regular PDF. | ||
| 60 | - | ||
| 61 | -When we get to full serialization, add json serialization performance | ||
| 62 | -test. | ||
| 63 | - | ||
| 64 | -Some if not all of the json output functionality for v2 should move | ||
| 65 | -into QPDF proper rather than living in QPDFJob. There can be a | ||
| 66 | -top-level QPDF method that takes a pipeline and writes the JSON | ||
| 67 | -serialization to it. | ||
| 68 | - | ||
| 69 | -Decide what the API/CLI will be for serializing to v2. Will it just be | ||
| 70 | -part of --json or will it be its own separate thing? Probably we | ||
| 71 | -should make it so that a serialized PDF is different but uses the same | ||
| 72 | -object format as regular json mode. | ||
| 73 | - | ||
| 74 | -For going back from JSON to PDF, a separate utility will be needed. | ||
| 75 | -It's not practical for QPDFObjectHandle to be able to read JSON | ||
| 76 | -because of the special handling that is required for indirect objects, | ||
| 77 | -and QPDF can't just accept JSON because the way InputSource is used is | ||
| 78 | -complete different. Instead, we will need a separate utility that has | ||
| 79 | -logic similar to what copyForeignObject does. It will go something | ||
| 80 | -like this: | ||
| 81 | - | ||
| 82 | -* Create an empty QPDF (not emptyPDF, one with no objects in it at | ||
| 83 | - all). This works: | ||
| 84 | - | ||
| 85 | -``` | ||
| 86 | -%PDF-1.3 | ||
| 87 | -xref | ||
| 88 | -0 1 | ||
| 89 | -0000000000 65535 f | ||
| 90 | -trailer << /Size 1 >> | ||
| 91 | -startxref | ||
| 92 | -9 | ||
| 93 | -%%EOF | ||
| 94 | -``` | ||
| 95 | - | ||
| 96 | -For each object: | ||
| 97 | - | ||
| 98 | -* Walk through the object detecting any indirect objects. For each one | ||
| 99 | - that is not already known, reserve the object. We can also validate | ||
| 100 | - but we should try to do the best we can with invalid JSON so people | ||
| 101 | - can get good error messages. | ||
| 102 | -* Construct a QPDFObjectHandle from the JSON | ||
| 103 | -* If the object is the trailer, update the trailer | ||
| 104 | -* Else if the object doesn't exist, reserve it | ||
| 105 | -* If the object is reserved, call replaceReserved() | ||
| 106 | -* Else the object already exists; this is an error. | ||
| 107 | - | ||
| 108 | -This can almost be done through public API. I think all we need is the | ||
| 109 | -ability to create a reserved object with a specific object ID. | ||
| 110 | - | ||
| 111 | -The choices for json_key (job.yml) will be different for v1 and v2. | ||
| 112 | -That information is already duplicated in multiple places. | ||
| 113 | - | ||
| 114 | ----- | ||
| 115 | - | ||
| 116 | -Remember typo: search for "Typo" In QPDFJob::doJSONEncrypt. | ||
| 117 | - | ||
| 118 | -Remember to test interaction between generators and schemas. | ||
| 119 | - | ||
| 120 | -Should I have allowed array and object generators? Or maybe just | ||
| 121 | -string generators for stream data? | ||
| 122 | - | ||
| 123 | -When switching to generators for output, it's going to be very | ||
| 124 | -important not to break the logic around having things that look at all | ||
| 125 | -objects going first. Right now, there are good tests for it -- if you | ||
| 126 | -either comment out pushInheritedAttributesToPage or do something that | ||
| 127 | -postpones serializing the objects from allObjects (or even getting | ||
| 128 | -them), you get test failures either way. However, if we were to | ||
| 129 | -blindly overwrite test files, we might accidentally lose this. We will | ||
| 130 | -have to try to get most of the logic working before trying to use | ||
| 131 | -generators. Or maybe we shouldn't use generators at all for the | ||
| 132 | -objects and only use it for the stream data. Or maybe we can use | ||
| 133 | -generators but write it out early by exposing the depth() parameter. | ||
| 134 | -That might actually the safest way to do it. But that will be hard | ||
| 135 | -with schemas. Another thing might be to not combine serializing with | ||
| 136 | -other kinds of metadata. | ||
| 137 | - | ||
| 138 | -Output JSON v2 will contain enough information to completely recreate | ||
| 139 | -a PDF file. In other words, qpdf will have full, bidirectional, | ||
| 140 | -lossless json serialization/deserialization of PDF. | ||
| 141 | - | ||
| 142 | -If this is done, update --json option in cli.rst to mention v2. Also | ||
| 143 | -update QPDFJob::Config::json and of course other parts of the docs | ||
| 144 | -(json.rst). | ||
| 145 | - | ||
| 146 | -You can't create a PDF from v1 json because | ||
| 147 | - | ||
| 148 | -* The PDF version header is not recorded | 51 | +Before starting on v2 format: |
| 52 | + | ||
| 53 | +* Some if not all of the json output functionality should move from | ||
| 54 | + QPDFJob to QPDF. There can be top-level QPDF methods that take a | ||
| 55 | + pipeline and write the JSON serialization to it. For things that | ||
| 56 | + generate smaller amounts of output (constant-size stuff, lists of | ||
| 57 | + attachments), we can also have a version that returns a string. For | ||
| 58 | + the benefit of users of other languages, we can have something that | ||
| 59 | + takes a FILE* or writes to stdout as well. This would be a good time | ||
| 60 | + to make sure all the information from --check and other | ||
| 61 | + informational options (--show-linearization, --show-encryption, | ||
| 62 | + --show-xref, --list-attachments, --show-npages) is available in the | ||
| 63 | + json output. | ||
| 64 | + | ||
| 65 | +* Writing objects should write in numerical order with the trailer at | ||
| 66 | + the end. | ||
| 67 | + | ||
| 68 | +* Having QPDFJob call these methods will change output ordering. We | ||
| 69 | + should fix the json test outputs manually (or programmatically from | ||
| 70 | + the input), not by overwriting, in case this has any unwanted side | ||
| 71 | + effects. | ||
| 72 | + | ||
| 73 | +* Figure out how/whether to do schema checks with incremental write. | ||
| 74 | + Consider changing the contract to allow fields to be absent even | ||
| 75 | + when present in the schema. It's reasonable for people to check for | ||
| 76 | + presence of a key. Most languages make this easy to do. | ||
| 149 | 77 | ||
| 150 | -* Strings cannot be unambiguously encoded/decoded | 78 | +General things to remember: |
| 151 | 79 | ||
| 152 | - * Can't tell string from name from indirect object | 80 | +* deprecate getJSON without a version |
| 153 | 81 | ||
| 154 | - * Strings are treated as PDF doc encoding and output as UTF-8, which | ||
| 155 | - doesn't work since multiple PDF doc code points are undefined | 82 | +* The choices for json_key (job.yml) will be different for v1 and v2. |
| 83 | + That information is already duplicated in multiple places. | ||
| 156 | 84 | ||
| 157 | -* There is no representation of stream data | ||
| 158 | - | ||
| 159 | -* You can't tell a stream from a dictionary except by looking in both | ||
| 160 | - "object" and "objectinfo". Fix this, and then remove "objectinfo". | ||
| 161 | - | ||
| 162 | -Additionally, using "n n R" as a key in "objects" and "objectinfo" | ||
| 163 | -messes up searching for things. | ||
| 164 | - | ||
| 165 | -For json v2: | ||
| 166 | - | ||
| 167 | -* Make sure it is possible to serialize and deserializes a PDF to JSON | ||
| 168 | - without loading the whole thing into memory. | ||
| 169 | - | ||
| 170 | - * As with a regular PDF, we can load everything into memory at once | ||
| 171 | - except stream data. | ||
| 172 | - | ||
| 173 | - * I think we can do this by having the concept of generated values, | ||
| 174 | - which we can make just be strings. We would have a JSON subclass | ||
| 175 | - whose value is a lambda that gets called to generate output. When | ||
| 176 | - we construct the JSON the stream values would be lambda functions | ||
| 177 | - that generate the stream data. | ||
| 178 | - | ||
| 179 | - * When we parse the file, we'll have to have a way for the parser to | ||
| 180 | - know that it should create a lambda that reads the data from the | ||
| 181 | - file. I think this means we want something that parses JSON from | ||
| 182 | - an input source. It would have to keep track of the offset and | ||
| 183 | - length of a value from the input source and have a (probably a | ||
| 184 | - lambda that it can call with a path) that would indicate whether | ||
| 185 | - to store the value or whether to create a lambda that retrieves | ||
| 186 | - it. We would have to keep a std::shared_ptr<InputSource> around. | ||
| 187 | - | ||
| 188 | - * Add json to the large file tests. | ||
| 189 | - | ||
| 190 | -* Resolve differences between information shown in the json format vs. | ||
| 191 | - information shown with options like --check, --list-attachments, | ||
| 192 | - etc. The json format should be able to completely replace things | ||
| 193 | - that write to stdout. Be sure getAllPages() and other top-level | ||
| 194 | - convenience routines are there so people don't need to parse the | ||
| 195 | - pages tree themselves. For many workflows, it should be possible for | ||
| 196 | - someone to work in the json file based on json metadata rather than | ||
| 197 | - calling the QPDF API. (Of course, you still need the QPDF API for | ||
| 198 | - higher level helper objects.) | 85 | +* Remember typo: search for "Typo" In QPDFJob::doJSONEncrypt. |
| 199 | 86 | ||
| 200 | * Consider using camelCase in multi-word key names to be consistent | 87 | * Consider using camelCase in multi-word key names to be consistent |
| 201 | with job JSON and with how JSON is often represented in languages | 88 | with job JSON and with how JSON is often represented in languages |
| 202 | that use it more natively. | 89 | that use it more natively. |
| 203 | 90 | ||
| 204 | -* Consider changing the contract to allow fields to be absent even | ||
| 205 | - when present in the schema. It's reasonable for people to check for | ||
| 206 | - presence of a key. Most languages make this easy to do. | 91 | +* When we get to full serialization, add json serialization |
| 92 | + performance test. | ||
| 207 | 93 | ||
| 208 | -* If we allow --json to be mixed with --ignore-encryption, we must | ||
| 209 | - emphasize that the resulting json can't be turned back into a valid | ||
| 210 | - PDF. | 94 | +* Add json to the large file tests. |
| 211 | 95 | ||
| 212 | -Most things that are informational can stay the same. We will have to | ||
| 213 | -go through every item to decide for sure, especially when camelCase is | ||
| 214 | -taken into consideration. | 96 | +* We could consider arguments like --replace-object that would take a |
| 97 | + JSON representation of the object and could include indirect | ||
| 98 | + references, etc. We could also add --delete object. | ||
| 215 | 99 | ||
| 216 | -New APIs: | 100 | +Object Representation: |
| 217 | 101 | ||
| 218 | -QPDFObjectHandle::parseJSON(QPDF* context, JSON); | ||
| 219 | -QPDFObjectHandle::parseJSON(QPDF* context, std::string const&); | ||
| 220 | -operator ""_qpdf_json | ||
| 221 | -C API to create a QPDFObjectHandle from a json string | 102 | +* Arrays, dictionaries, booleans, nulls, integers, and real numbers |
| 103 | + are represented as their native JSON type. Real numbers that are out | ||
| 104 | + of range will just be handled however the JSON parser that is | ||
| 105 | + in use handles them. Numbers like that shouldn't appear in PDF and, | ||
| 106 | + if they do, they won't work right for anything. QPDF's JSON | ||
| 107 | + representation allows for arbitrary precision. | ||
| 108 | +* Names: "/Name" -- internal/canonical representation (e.g. | ||
| 109 | + "/Text/Plain", not #xx quoted) | ||
| 110 | +* Indirect objects: "n n R" | ||
| 111 | +* Strings: one of | ||
| 112 | + "u:json utf-8-encoded string" | ||
| 113 | + "b:hex-encoded bytes" | ||
| 114 | + Test cases: these are the same: | ||
| 115 | + * "b:cf80", "b:CF80", "u:π", "u:\u03c0" | ||
| 116 | + * "b:d83edd54", "u:🥔", "u:\ud83e\udd54" | ||
| 222 | 117 | ||
| 223 | -JSON::parseFile | ||
| 224 | -QPDF::parseJSON(JSON) (like parseFile, etc. -- deserializes json) | ||
| 225 | -QPDF::updateFromJSON(JSON) | 118 | +When creating output from a string: |
| 119 | +* If the string is explicitly unicode (UTF-8 or UTF-16), encode as | ||
| 120 | + "u:" without the leading U+FEFF | ||
| 121 | +* Else if the string can be bidirectionally mapped between pdf-doc and | ||
| 122 | + unicode, transcode to unicode and encode as "u:" | ||
| 123 | +* Else encode as "b:" | ||
| 226 | 124 | ||
| 227 | -CLI: --infile-is-json -- indicate that the input is a qpdf json file | ||
| 228 | -rather than a PDF file | ||
| 229 | -CLI: --update-from-json=file.json | 125 | +When reading a JSON string, any string that doesn't follow the above rules |
| 126 | +is an error. Just use newUnicodeString on "u:" strings. For "b:" | ||
| 127 | +strings, decode the bytes with hex_decode and use newString. | ||
| 230 | 128 | ||
| 231 | -Have a "qpdf" key in the output that contains "jsonVersion", | ||
| 232 | -"pdfVersion", and "objects". This replaces the "objects" field at the | ||
| 233 | -top level. "objects" and "objectinfo" disappear from the top-level. | ||
| 234 | -".version" and ".qpdf.jsonVersion" will match. The input to parseJSON | ||
| 235 | -and updateFromJSON will have to have the "qpdf" key in it. All other | ||
| 236 | -keys are ignored. | 129 | +Serialized PDF: |
| 237 | 130 | ||
| 238 | -When creating from a JSON file, the JSON must be complete with data | ||
| 239 | -for all streams, a trailer, and a pdfVersion. When updating from a | ||
| 240 | -JSON: | 131 | +The JSON output will have a "qpdf" key containing |
| 132 | +* jsonVersion | ||
| 133 | +* pdfVersion | ||
| 134 | +* objects | ||
| 241 | 135 | ||
| 242 | -* Any object whose value is null (not "value": null, but just null) is | ||
| 243 | - deleted. | ||
| 244 | -* For any stream that appears without stream data, the stream data is | ||
| 245 | - left alone. | ||
| 246 | -* Otherwise, the object from the JSON completely replaces the input | ||
| 247 | - object. No dictionary merges or anything like that are performed. | ||
| 248 | - It will call replaceObject. | 136 | +The "qpdf" key replaces "objects" and "objectinfo" in v1 JSON. |
| 249 | 137 | ||
| 250 | Within .qpdf.objects, the key is "obj:o g R" or "obj:trailer", and the | 138 | Within .qpdf.objects, the key is "obj:o g R" or "obj:trailer", and the |
| 251 | value is a dictionary with exactly one of "value" or "stream" as its | 139 | value is a dictionary with exactly one of "value" or "stream" as its |
| @@ -254,6 +142,8 @@ single key. | @@ -254,6 +142,8 @@ single key. | ||
| 254 | Rationale of "obj:o g R" is that indirect object references are just | 142 | Rationale of "obj:o g R" is that indirect object references are just |
| 255 | "o g R", and so code that wants to resolve one can do so easily by | 143 | "o g R", and so code that wants to resolve one can do so easily by |
| 256 | just prepending "obj:" and not having to parse or split the string. | 144 | just prepending "obj:" and not having to parse or split the string. |
| 145 | +Having a prefix rather than making the key just "o g R" makes it much | ||
| 146 | +easier to search in the JSON for the definition of an object. | ||
| 257 | 147 | ||
| 258 | For non-streams: | 148 | For non-streams: |
| 259 | 149 | ||
| @@ -268,101 +158,116 @@ For streams: | @@ -268,101 +158,116 @@ For streams: | ||
| 268 | "obj:o g R": { | 158 | "obj:o g R": { |
| 269 | "stream": { | 159 | "stream": { |
| 270 | "dict": { ... stream dictionary ... }, | 160 | "dict": { ... stream dictionary ... }, |
| 271 | - "filterable": bool, | ||
| 272 | - "raw": "base64-encoded raw data", | ||
| 273 | - "filtered": "base64-encoded filtered data" | 161 | + "data": "base64-encoded data", |
| 162 | + "dataFile": "path to base64-encoded data" | ||
| 274 | } | 163 | } |
| 275 | } | 164 | } |
| 276 | } | 165 | } |
| 277 | 166 | ||
| 278 | -Wherever a PDF object appears in the JSON output, including "value" | ||
| 279 | -and "stream"."dict" above as well as other places where they might | ||
| 280 | -appear, objects are represented as follows: | 167 | +At most one of "data" or "dataFile" will be present. When serializing, |
| 168 | +stream decode parameters will be obeyed, and the stream dictionary | ||
| 169 | +will reflect the result. There will be the option to omit stream data. | ||
| 281 | 170 | ||
| 282 | -* Arrays, dictionaries, booleans, nulls, integers, and real numbers | ||
| 283 | - with no more than six decimal places are represented as their native | ||
| 284 | - JSON type. | ||
| 285 | -* Real numbers with more than six decimal places are represented as | ||
| 286 | - "r:{real-value}". | ||
| 287 | -* Names: "/Name" -- internal/canonical representation (e.g. | ||
| 288 | - "/Text/Plain", not #xx quoted) | ||
| 289 | -* Indirect objects: "n n R" | ||
| 290 | -* Strings: one of | ||
| 291 | - "s:json string treated as Unicode" | ||
| 292 | - "b:json string treated as bytes; character > \u00ff is an error" | ||
| 293 | - "e:base64-encoded bytes" | 171 | +In the stream dictionary, "/Length" is always removed. |
| 294 | 172 | ||
| 295 | -Test cases: these are the same: | ||
| 296 | -* "b:\u00c8\u0080", "s:π", "s:\u03c0", and "e:z4A=" | ||
| 297 | -* "b:\u00d8\u003e\u00dd\u0054", "s:🥔", "s:\ud83e\udd54", and "e:8J+llA==" | 173 | +Streams are filtered or not based on the --decode-level parameter. If |
| 174 | +a stream is filtered, "/Filter" and "/DecodeParms" are removed from | ||
| 175 | +the stream dictionary. This makes the stream data and dictionary match | ||
| 176 | +for when the file is read back in. | ||
| 298 | 177 | ||
| 299 | -When creating output from a string: | ||
| 300 | -* If the string is explicitly unicode (UTF-8 or UTF-16), encode as | ||
| 301 | - "s:" without the leading U+FEFF | ||
| 302 | -* Else if the string can be bidirectionally mapped between pdf-doc and | ||
| 303 | - unicode, transcode to unicode and encode as "s:" | ||
| 304 | -* Else if the string would be decoded as binary, encode as "e:" | ||
| 305 | -* Else encode as "b:" | 178 | +CLI: |
| 306 | 179 | ||
| 307 | -When reading a string, any string that doesn't follow the above rules | ||
| 308 | -is an error. This includes "r:" strings not parseable as a real | ||
| 309 | -number, "/Name" strings containing a NUL character, "s:" or "b:" | ||
| 310 | -strings that are not valid JSON strings, "b:" strings containing | ||
| 311 | -character values > 0xff, or "e:" values that are not valid base64. | ||
| 312 | -Once the string is read in, if the "s:" string can be bidirectionally | ||
| 313 | -mapped between pdf-doc and unicode, store as PDFDoc. Otherwise store | ||
| 314 | -as UTF-16BE. "b:" strings are stored as bytes, and "e:" are decoded | ||
| 315 | -and stored as bytes. | 180 | +* Add new flags |
| 316 | 181 | ||
| 317 | -Implementing this will require some refactoring of things between | ||
| 318 | -QUtil and QPDF_String, plus we will need to implement a base64 | ||
| 319 | -encoder/decoder. | 182 | + * --from-json=input.json -- signals reading from a JSON and counts |
| 183 | + as an input file. | ||
| 320 | 184 | ||
| 321 | -This enables a workflow like this: | 185 | + * --json-streams-omit -- stream data is omitted, the default |
| 322 | 186 | ||
| 323 | -* qpdf --json=latest infile.pdf > pdf.json | ||
| 324 | -* modify pdf.json | ||
| 325 | -* qpdf infile.pdf --update-from=pdf.json out.pdf | 187 | + * --json-streams-inline -- stream data is included in the "data" |
| 188 | + key as base64-encoded | ||
| 326 | 189 | ||
| 327 | -or | 190 | + * --json-streams-file-prefix=prefix -- stream is written to $prefix-$obj |
| 191 | + where $obj is the object number. The path to the file is stored | ||
| 192 | + in the "dataFile" key. A relative path is recommended and will be | ||
| 193 | + interpreted as relative to the current directory. If a relative | ||
| 194 | + prefix is given, a relative path will be stored in "dataFile". | ||
| 195 | + Example: | ||
| 196 | + mkdir in-streams | ||
| 197 | + qpdf in.pdf --json-streams-file-prefix=in-streams/ > out.json | ||
| 328 | 198 | ||
| 329 | -* qpdf --json=latest --json-stream-data=raw|filtered infile.pdf > pdf.json | ||
| 330 | -* modify pdf.json | ||
| 331 | -* qpdf pdf.json --infile-is-json out.pdf | 199 | + * --to-json -- changes default to --json-streams-inline implies |
| 200 | + --json-key=qpdf | ||
| 332 | 201 | ||
| 333 | -Notes about streams and stream data: | 202 | +Example workflow: |
| 203 | +* qpdf in.pdf --to-json > pdf.json | ||
| 204 | +* edit pdf.json | ||
| 205 | +* qpdf --from-json=pdf.json out.pdf | ||
| 334 | 206 | ||
| 335 | -* Always include "dict". "/Length" is removed from the stream | ||
| 336 | - dictionary. | 207 | +JSON to PDF: |
| 337 | 208 | ||
| 338 | -* Add new flag --json-stream-data={raw,filtered,none}. At most one of | ||
| 339 | - "raw" and "filtered" will appear for each stream. If "filtered" | ||
| 340 | - appears, "/Filter" and "/DecodeParms" are removed from the stream | ||
| 341 | - dictionary. This makes the stream data and dictionary match for when | ||
| 342 | - the file is read back in. | 209 | +For going back from JSON to PDF, we can have |
| 210 | +QPDF::fromJSON(std::shared_ptr<InputSource>), which will have logic | ||
| 211 | +similar to copyForeignObject. Note that this InputSource is not going | ||
| 212 | +to be this->file. We have to keep it separately. | ||
| 343 | 213 | ||
| 344 | -* Always include "filterable" regardless of value of | ||
| 345 | - --json-stream-data. The value of filterable is influenced by | ||
| 346 | - --decode-level, which is already in parameters. | 214 | +The backing input source is this memory block: |
| 347 | 215 | ||
| 348 | -* Add to parameters: value of json-stream-data, default is none | 216 | +``` |
| 217 | +%PDF-1.3 | ||
| 218 | +xref | ||
| 219 | +0 1 | ||
| 220 | +0000000000 65535 f | ||
| 221 | +trailer << /Size 1 >> | ||
| 222 | +startxref | ||
| 223 | +9 | ||
| 224 | +%%EOF | ||
| 225 | +``` | ||
| 226 | + | ||
| 227 | +* Ignore all keys except .qpdf. | ||
| 228 | +* Verify that .qpdf.jsonVersion is 2 | ||
| 229 | +* Set this->m->pdf_version based on the .qpdf.pdfVersion key | ||
| 230 | +* For each object in .qpdf.objects: | ||
| 231 | + * Walk through the object detecting any indirect objects. For each | ||
| 232 | + one that is not already known, reserve the object. We can also | ||
| 233 | + validate but we should try to do the best we can with invalid JSON | ||
| 234 | + so people can get good error messages. | ||
| 235 | + * Construct a QPDFObjectHandle from the JSON | ||
| 236 | + * If the object is the trailer, update the trailer | ||
| 237 | + * Else if the object doesn't exist, reserve it | ||
| 238 | + * If the object is reserved, call replaceReserved() | ||
| 239 | + * Else the object already exists; this is an error. | ||
| 240 | + | ||
| 241 | +For streams, have a stream data provider that, for inline streams, | ||
| 242 | +does a base64 from the file offsets and for file-based streams, reads | ||
| 243 | +the file. For the inline case, we have to keep the json InputSource | ||
| 244 | +around. Otherwise, we don't. It is an error if there is no stream data. | ||
| 245 | + | ||
| 246 | +Documentation: | ||
| 247 | + | ||
| 248 | +Update --json option in cli.rst to mention v2 and update json.rst. | ||
| 249 | + | ||
| 250 | +Other documentation fodder: | ||
| 251 | + | ||
| 252 | +You can't create a PDF from v1 json because | ||
| 253 | + | ||
| 254 | +* The PDF version header is not recorded | ||
| 255 | + | ||
| 256 | +* Strings cannot be unambiguously encoded/decoded | ||
| 257 | + | ||
| 258 | + * Can't tell string from name from indirect object | ||
| 349 | 259 | ||
| 350 | -* If --json-stream-data=none, omit stream data entirely | 260 | + * Strings are treated as PDF doc encoding and output as UTF-8, which |
| 261 | + doesn't work since multiple PDF doc code points are undefined | ||
| 351 | 262 | ||
| 352 | -* If --json-stream-data=raw, include raw stream data as base64. Show | ||
| 353 | - the data even for unfiltered streams in "raw". | 263 | +* There is no representation of stream data |
| 354 | 264 | ||
| 355 | -* If --json-stream-data=filtered, include the base64-encoded filtered | ||
| 356 | - stream data if we can and should decode it based on decode-level. | ||
| 357 | - Otherwise, include the base64-encoded raw data. See if we can honor | ||
| 358 | - --normalize-content. If a stream appears unfiltered in the input, | ||
| 359 | - still show it as filtered. Remove /DecodeParms and /Filter if | ||
| 360 | - filtering. | 265 | +* You can't tell a stream from a dictionary except by looking in both |
| 266 | + "object" and "objectinfo". Fix this, and then remove "objectinfo". | ||
| 267 | + | ||
| 268 | +Additionally, using "n n R" as a key in "objects" and "objectinfo" | ||
| 269 | +messes up searching for things. | ||
| 361 | 270 | ||
| 362 | -Note that --json-stream-data=filtered is different from | ||
| 363 | ---filtered-stream-data in that --filtered-stream-data implies | ||
| 364 | ---decode-level=all while --json-stream-data=filtered does not. Make | ||
| 365 | -sure this is mentioned in the help for both options. | ||
| 366 | 271 | ||
| 367 | QPDFJob | 272 | QPDFJob |
| 368 | ======= | 273 | ======= |