Commit 2a92b1b0d6e389c9b033fffe1fc2821a63ca1621
1 parent 0500d434
TODO: solidify remaining json v2 work
Showing 1 changed file with 167 additions and 262 deletions
TODO
| ... | ... | @@ -10,6 +10,10 @@ In order: |
| 10 | 10 | |
| 11 | 11 | Other (do in any order): |
| 12 | 12 | |
| 13 | +* See if I can change all output and error messages issued by the | |
| 14 | + library, when context is available, to have a pipeline rather than a | |
| 15 | + FILE* or std::ostream. This makes it possible for people to capture | |
| 16 | + output more flexibly. | |
| 13 | 17 | * Make job JSON accept a single element and treat as an array of one |
| 14 | 18 | when an array is expected. This allows for making things repeatable |
| 15 | 19 | in the future without breaking compatibility and is needed for the |
| ... | ... | @@ -20,10 +24,11 @@ Other (do in any order): |
| 20 | 24 | password). We'll need to make sure we don't try to filter any |
| 21 | 25 | streams in this mode. Ideally we should be able to combine this with |
| 22 | 26 | --json so we can look at the raw encrypted strings and streams if we |
| 23 | - want to. Since providing the password may reveal additional details, | |
| 24 | - --show-encryption could potentially retry with this option if the | |
| 25 | - first time doesn't work. Then, with the file open, we can read the | |
| 26 | - encryption dictionary normally. | |
| 27 | + want to, though be sure to document that the resulting JSON won't be | |
| 28 | + convertible back to a valid PDF. Since providing the password may | |
| 29 | + reveal additional details, --show-encryption could potentially retry | |
| 30 | + with this option if the first time doesn't work. Then, with the file | |
| 31 | + open, we can read the encryption dictionary normally. | |
| 27 | 32 | * Find all places in the code that write to std::cout, std::err, |
| 28 | 33 | stdout, or stderr to make sure they obey default output stream |
| 29 | 34 | settings for QPDF and QPDFJob. This probably includes adding a |
| ... | ... | @@ -43,209 +48,92 @@ Soon: Break ground on "Document-level work" |
| 43 | 48 | Output JSON v2 |
| 44 | 49 | ============== |
| 45 | 50 | |
| 46 | ----- | |
| 47 | -notes from 5/2: | |
| 48 | - | |
| 49 | -See if I can change all output and error messages issued by the | |
| 50 | -library, when context is available, to have a pipeline rather than a | |
| 51 | -FILE* or std::ostream. This makes it possible for people to capture | |
| 52 | -output more flexibly. | |
| 53 | - | |
| 54 | -For json output, do not unparse to string. Use the writers instead. | |
| 55 | -Write incrementally. This changes ordering only, but we should be able | |
| 56 | -to manually update the test output for those cases. Objects should be | |
| 57 | -written in numerical order, not lexically sorted. It probably makes | |
| 58 | -sense to put the trailer at the end since that's where it is in a | |
| 59 | -regular PDF. | |
| 60 | - | |
| 61 | -When we get to full serialization, add json serialization performance | |
| 62 | -test. | |
| 63 | - | |
| 64 | -Some if not all of the json output functionality for v2 should move | |
| 65 | -into QPDF proper rather than living in QPDFJob. There can be a | |
| 66 | -top-level QPDF method that takes a pipeline and writes the JSON | |
| 67 | -serialization to it. | |
| 68 | - | |
| 69 | -Decide what the API/CLI will be for serializing to v2. Will it just be | |
| 70 | -part of --json or will it be its own separate thing? Probably we | |
| 71 | -should make it so that a serialized PDF is different but uses the same | |
| 72 | -object format as regular json mode. | |
| 73 | - | |
| 74 | -For going back from JSON to PDF, a separate utility will be needed. | |
| 75 | -It's not practical for QPDFObjectHandle to be able to read JSON | |
| 76 | -because of the special handling that is required for indirect objects, | |
| 77 | -and QPDF can't just accept JSON because the way InputSource is used is | |
| 78 | -completely different. Instead, we will need a separate utility that has | |
| 79 | -logic similar to what copyForeignObject does. It will go something | |
| 80 | -like this: | |
| 81 | - | |
| 82 | -* Create an empty QPDF (not emptyPDF, one with no objects in it at | |
| 83 | - all). This works: | |
| 84 | - | |
| 85 | -``` | |
| 86 | -%PDF-1.3 | |
| 87 | -xref | |
| 88 | -0 1 | |
| 89 | -0000000000 65535 f | |
| 90 | -trailer << /Size 1 >> | |
| 91 | -startxref | |
| 92 | -9 | |
| 93 | -%%EOF | |
| 94 | -``` | |
| 95 | - | |
| 96 | -For each object: | |
| 97 | - | |
| 98 | -* Walk through the object detecting any indirect objects. For each one | |
| 99 | - that is not already known, reserve the object. We can also validate | |
| 100 | - but we should try to do the best we can with invalid JSON so people | |
| 101 | - can get good error messages. | |
| 102 | -* Construct a QPDFObjectHandle from the JSON | |
| 103 | -* If the object is the trailer, update the trailer | |
| 104 | -* Else if the object doesn't exist, reserve it | |
| 105 | -* If the object is reserved, call replaceReserved() | |
| 106 | -* Else the object already exists; this is an error. | |
| 107 | - | |
| 108 | -This can almost be done through public API. I think all we need is the | |
| 109 | -ability to create a reserved object with a specific object ID. | |
| 110 | - | |
| 111 | -The choices for json_key (job.yml) will be different for v1 and v2. | |
| 112 | -That information is already duplicated in multiple places. | |
| 113 | - | |
| 114 | ----- | |
| 115 | - | |
| 116 | -Remember typo: search for "Typo" In QPDFJob::doJSONEncrypt. | |
| 117 | - | |
| 118 | -Remember to test interaction between generators and schemas. | |
| 119 | - | |
| 120 | -Should I have allowed array and object generators? Or maybe just | |
| 121 | -string generators for stream data? | |
| 122 | - | |
| 123 | -When switching to generators for output, it's going to be very | |
| 124 | -important not to break the logic around having things that look at all | |
| 125 | -objects going first. Right now, there are good tests for it -- if you | |
| 126 | -either comment out pushInheritedAttributesToPage or do something that | |
| 127 | -postpones serializing the objects from allObjects (or even getting | |
| 128 | -them), you get test failures either way. However, if we were to | |
| 129 | -blindly overwrite test files, we might accidentally lose this. We will | |
| 130 | -have to try to get most of the logic working before trying to use | |
| 131 | -generators. Or maybe we shouldn't use generators at all for the | |
| 132 | -objects and only use it for the stream data. Or maybe we can use | |
| 133 | -generators but write it out early by exposing the depth() parameter. | |
| 134 | -That might actually be the safest way to do it. But that will be hard | |
| 135 | -with schemas. Another thing might be to not combine serializing with | |
| 136 | -other kinds of metadata. | |
| 137 | - | |
| 138 | -Output JSON v2 will contain enough information to completely recreate | |
| 139 | -a PDF file. In other words, qpdf will have full, bidirectional, | |
| 140 | -lossless json serialization/deserialization of PDF. | |
| 141 | - | |
| 142 | -If this is done, update --json option in cli.rst to mention v2. Also | |
| 143 | -update QPDFJob::Config::json and of course other parts of the docs | |
| 144 | -(json.rst). | |
| 145 | - | |
| 146 | -You can't create a PDF from v1 json because | |
| 147 | - | |
| 148 | -* The PDF version header is not recorded | |
| 51 | +Before starting on v2 format: | |
| 52 | + | |
| 53 | +* Some if not all of the json output functionality should move from | |
| 54 | + QPDFJob to QPDF. There can be top-level QPDF methods that take a | |
| 55 | + pipeline and write the JSON serialization to it. For things that | |
| 56 | + generate smaller amounts of output (constant-size stuff, lists of | |
| 57 | + attachments), we can also have a version that returns a string. For | |
| 58 | + the benefit of users of other languages, we can have something that | |
| 59 | + takes a FILE* or writes to stdout as well. This would be a good time | |
| 60 | + to make sure all the information from --check and other | |
| 61 | + informational options (--show-linearization, --show-encryption, | |
| 62 | + --show-xref, --list-attachments, --show-npages) is available in the | |
| 63 | + json output. | |
| 64 | + | |
| 65 | +* Writing objects should write in numerical order with the trailer at | |
| 66 | + the end. | |
| 67 | + | |
| 68 | +* Having QPDFJob call these methods will change output ordering. We | |
| 69 | + should fix the json test outputs manually (or programmatically from | |
| 70 | + the input), not by overwriting, in case this has any unwanted side | |
| 71 | + effects. | |
| 72 | + | |
| 73 | +* Figure out how/whether to do schema checks with incremental write. | |
| 74 | + Consider changing the contract to allow fields to be absent even | |
| 75 | + when present in the schema. It's reasonable for people to check for | |
| 76 | + presence of a key. Most languages make this easy to do. | |
| 149 | 77 | |
| 150 | -* Strings cannot be unambiguously encoded/decoded | |
| 78 | +General things to remember: | |
| 151 | 79 | |
| 152 | - * Can't tell string from name from indirect object | |
| 80 | +* deprecate getJSON without a version | |
| 153 | 81 | |
| 154 | - * Strings are treated as PDF doc encoding and output as UTF-8, which | |
| 155 | - doesn't work since multiple PDF doc code points are undefined | |
| 82 | +* The choices for json_key (job.yml) will be different for v1 and v2. | |
| 83 | + That information is already duplicated in multiple places. | |
| 156 | 84 | |
| 157 | -* There is no representation of stream data | |
| 158 | - | |
| 159 | -* You can't tell a stream from a dictionary except by looking in both | |
| 160 | - "object" and "objectinfo". Fix this, and then remove "objectinfo". | |
| 161 | - | |
| 162 | -Additionally, using "n n R" as a key in "objects" and "objectinfo" | |
| 163 | -messes up searching for things. | |
| 164 | - | |
| 165 | -For json v2: | |
| 166 | - | |
| 167 | -* Make sure it is possible to serialize and deserialize a PDF to JSON | |
| 168 | - without loading the whole thing into memory. | |
| 169 | - | |
| 170 | - * As with a regular PDF, we can load everything into memory at once | |
| 171 | - except stream data. | |
| 172 | - | |
| 173 | - * I think we can do this by having the concept of generated values, | |
| 174 | - which we can make just be strings. We would have a JSON subclass | |
| 175 | - whose value is a lambda that gets called to generate output. When | |
| 176 | - we construct the JSON the stream values would be lambda functions | |
| 177 | - that generate the stream data. | |
| 178 | - | |
| 179 | - * When we parse the file, we'll have to have a way for the parser to | |
| 180 | - know that it should create a lambda that reads the data from the | |
| 181 | - file. I think this means we want something that parses JSON from | |
| 182 | - an input source. It would have to keep track of the offset and | |
| 183 | - length of a value from the input source and have a (probably a | |
| 184 | - lambda that it can call with a path) that would indicate whether | |
| 185 | - to store the value or whether to create a lambda that retrieves | |
| 186 | - it. We would have to keep a std::shared_ptr<InputSource> around. | |
| 187 | - | |
| 188 | - * Add json to the large file tests. | |
| 189 | - | |
| 190 | -* Resolve differences between information shown in the json format vs. | |
| 191 | - information shown with options like --check, --list-attachments, | |
| 192 | - etc. The json format should be able to completely replace things | |
| 193 | - that write to stdout. Be sure getAllPages() and other top-level | |
| 194 | - convenience routines are there so people don't need to parse the | |
| 195 | - pages tree themselves. For many workflows, it should be possible for | |
| 196 | - someone to work in the json file based on json metadata rather than | |
| 197 | - calling the QPDF API. (Of course, you still need the QPDF API for | |
| 198 | - higher level helper objects.) | |
| 85 | +* Remember typo: search for "Typo" In QPDFJob::doJSONEncrypt. | |
| 199 | 86 | |
| 200 | 87 | * Consider using camelCase in multi-word key names to be consistent |
| 201 | 88 | with job JSON and with how JSON is often represented in languages |
| 202 | 89 | that use it more natively. |
| 203 | 90 | |
| 204 | -* Consider changing the contract to allow fields to be absent even | |
| 205 | - when present in the schema. It's reasonable for people to check for | |
| 206 | - presence of a key. Most languages make this easy to do. | |
| 91 | +* When we get to full serialization, add json serialization | |
| 92 | + performance test. | |
| 207 | 93 | |
| 208 | -* If we allow --json to be mixed with --ignore-encryption, we must | |
| 209 | - emphasize that the resulting json can't be turned back into a valid | |
| 210 | - PDF. | |
| 94 | +* Add json to the large file tests. | |
| 211 | 95 | |
| 212 | -Most things that are informational can stay the same. We will have to | |
| 213 | -go through every item to decide for sure, especially when camelCase is | |
| 214 | -taken into consideration. | |
| 96 | +* We could consider arguments like --replace-object that would take a | |
| 97 | + JSON representation of the object and could include indirect | |
| 98 | + references, etc. We could also add --delete object. | |
| 215 | 99 | |
| 216 | -New APIs: | |
| 100 | +Object Representation: | |
| 217 | 101 | |
| 218 | -QPDFObjectHandle::parseJSON(QPDF* context, JSON); | |
| 219 | -QPDFObjectHandle::parseJSON(QPDF* context, std::string const&); | |
| 220 | -operator ""_qpdf_json | |
| 221 | -C API to create a QPDFObjectHandle from a json string | |
| 102 | +* Arrays, dictionaries, booleans, nulls, integers, and real numbers | |
| 103 | + are represented as their native JSON type. Real numbers that are out | |
| 104 | + of range will just be dealt with however whatever JSON parser is | |
| 105 | + in use deals with them. Numbers like that shouldn't appear in PDF and, | |
| 106 | + if they do, they won't work right for anything. QPDF's JSON | |
| 107 | + representation allows for arbitrary precision. | |
| 108 | +* Names: "/Name" -- internal/canonical representation (e.g. | |
| 109 | + "/Text/Plain", not #xx quoted) | |
| 110 | +* Indirect objects: "n n R" | |
| 111 | +* Strings: one of | |
| 112 | + "u:json utf-8-encoded string" | |
| 113 | + "b:hex-encoded bytes" | |
| 114 | + Test cases: these are the same: | |
| 115 | + * "b:cf80", "b:CF80", "u:π", "u:\u03c0" | |
| 116 | + * "b:d83edd54", "u:🥔", "u:\ud83e\udd54" | |
| 222 | 117 | |
| 223 | -JSON::parseFile | |
| 224 | -QPDF::parseJSON(JSON) (like parseFile, etc. -- deserializes json) | |
| 225 | -QPDF::updateFromJSON(JSON) | |
| 118 | +When creating output from a string: | |
| 119 | +* If the string is explicitly unicode (UTF-8 or UTF-16), encode as | |
| 120 | + "u:" without the leading U+FEFF | |
| 121 | +* Else if the string can be bidirectionally mapped between pdf-doc and | |
| 122 | + unicode, transcode to unicode and encode as "u:" | |
| 123 | +* Else encode as "b:" | |
| 226 | 124 | |
| 227 | -CLI: --infile-is-json -- indicate that the input is a qpdf json file | |
| 228 | -rather than a PDF file | |
| 229 | -CLI: --update-from-json=file.json | |
| 125 | +When reading a JSON string, any string that doesn't follow the above rules | |
| 126 | +is an error. Just use newUnicodeString on "u:" strings. For "b:" | |
| 127 | +strings, decode the bytes with hex_decode and use newString. | |
| 230 | 128 | |
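A minimal sketch of the "u:"/"b:" string rules above, written in Python rather than against the qpdf API. The names encode_string and decode_string are hypothetical. One loud assumption: the "bidirectionally mapped between pdf-doc and unicode" check is approximated here by "printable ASCII only", since a real PDFDocEncoding table is out of scope for this illustration.

```python
import re

HEX = re.compile(r"(?:[0-9a-fA-F]{2})*")


def encode_string(raw: bytes) -> str:
    """Map raw PDF string bytes to the "u:" or "b:" JSON form."""
    # Explicitly unicode strings: drop the marker, emit as "u:".
    for bom, codec in ((b"\xfe\xff", "utf-16-be"), (b"\xef\xbb\xbf", "utf-8")):
        if raw.startswith(bom):
            try:
                return "u:" + raw[len(bom):].decode(codec)
            except UnicodeDecodeError:
                return "b:" + raw.hex()
    # ASSUMPTION: printable ASCII stands in for "maps cleanly to pdf-doc".
    if all(0x20 <= b < 0x7F for b in raw):
        return "u:" + raw.decode("ascii")
    # Everything else is kept as hex-encoded bytes.
    return "b:" + raw.hex()


def decode_string(s: str):
    """Return ("unicode", str) or ("bytes", bytes); anything else is an error."""
    if s.startswith("u:"):
        return ("unicode", s[2:])
    if s.startswith("b:"):
        if not HEX.fullmatch(s[2:]):
            raise ValueError("invalid hex in b: string")
        return ("bytes", bytes.fromhex(s[2:]))
    raise ValueError("string must start with u: or b:")
```

In this sketch the "unicode" branch corresponds to where the notes call newUnicodeString, and the "bytes" branch to hex_decode followed by newString.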
| 231 | -Have a "qpdf" key in the output that contains "jsonVersion", | |
| 232 | -"pdfVersion", and "objects". This replaces the "objects" field at the | |
| 233 | -top level. "objects" and "objectinfo" disappear from the top-level. | |
| 234 | -".version" and ".qpdf.jsonVersion" will match. The input to parseJSON | |
| 235 | -and updateFromJSON will have to have the "qpdf" key in it. All other | |
| 236 | -keys are ignored. | |
| 129 | +Serialized PDF: | |
| 237 | 130 | |
| 238 | -When creating from a JSON file, the JSON must be complete with data | |
| 239 | -for all streams, a trailer, and a pdfVersion. When updating from a | |
| 240 | -JSON: | |
| 131 | +The JSON output will have a "qpdf" key containing | |
| 132 | +* jsonVersion | |
| 133 | +* pdfVersion | |
| 134 | +* objects | |
| 241 | 135 | |
| 242 | -* Any object whose value is null (not "value": null, but just null) is | |
| 243 | - deleted. | |
| 244 | -* For any stream that appears without stream data, the stream data is | |
| 245 | - left alone. | |
| 246 | -* Otherwise, the object from the JSON completely replaces the input | |
| 247 | - object. No dictionary merges or anything like that are performed. | |
| 248 | - It will call replaceObject. | |
| 136 | +The "qpdf" key replaces "objects" and "objectinfo" in v1 JSON. | |
| 249 | 137 | |
| 250 | 138 | Within .qpdf.objects, the key is "obj:o g R" or "obj:trailer", and the |
| 251 | 139 | value is a dictionary with exactly one of "value" or "stream" as its |
| ... | ... | @@ -254,6 +142,8 @@ single key. |
| 254 | 142 | Rationale of "obj:o g R" is that indirect object references are just |
| 255 | 143 | "o g R", and so code that wants to resolve one can do so easily by |
| 256 | 144 | just prepending "obj:" and not having to parse or split the string. |
| 145 | +Having a prefix rather than making the key just "o g R" makes it much | |
| 146 | +easier to search in the JSON for the definition of an object. | |
| 257 | 147 | |
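To illustrate the rationale above, here is a small Python sketch of how a consumer might resolve an indirect reference by prepending "obj:". The function name resolve and the sample data are hypothetical; only the key format and the "exactly one of value or stream" rule come from the notes.

```python
def resolve(objects: dict, ref: str):
    """Look up a reference such as "3 0 R" in the .qpdf.objects dictionary.

    Returns (kind, payload) where kind is "value" or "stream".
    """
    entry = objects["obj:" + ref]  # no parsing or splitting of the reference
    if set(entry) not in ({"value"}, {"stream"}):
        raise ValueError("object must have exactly one of value or stream")
    kind = next(iter(entry))
    return kind, entry[kind]


# Hypothetical sample data in the shape described above.
sample = {
    "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
    "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
}
kind, catalog = resolve(sample, "1 0 R")
kind2, pages = resolve(sample, catalog["/Pages"])
```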
| 258 | 148 | For non-streams: |
| 259 | 149 | |
| ... | ... | @@ -268,101 +158,116 @@ For streams: |
| 268 | 158 | "obj:o g R": { |
| 269 | 159 | "stream": { |
| 270 | 160 | "dict": { ... stream dictionary ... }, |
| 271 | - "filterable": bool, | |
| 272 | - "raw": "base64-encoded raw data", | |
| 273 | - "filtered": "base64-encoded filtered data" | |
| 161 | + "data": "base64-encoded data", | |
| 162 | + "dataFile": "path to base64-encoded data" | |
| 274 | 163 | } |
| 275 | 164 | } |
| 276 | 165 | } |
| 277 | 166 | |
| 278 | -Wherever a PDF object appears in the JSON output, including "value" | |
| 279 | -and "stream"."dict" above as well as other places where they might | |
| 280 | -appear, objects are represented as follows: | |
| 167 | +At most one of "data" or "dataFile" will be present. When serializing, | |
| 168 | +stream decode parameters will be obeyed, and the stream dictionary | |
| 169 | +will reflect the result. There will be the option to omit stream data. | |
| 281 | 170 | |
| 282 | -* Arrays, dictionaries, booleans, nulls, integers, and real numbers | |
| 283 | - with no more than six decimal places are represented as their native | |
| 284 | - JSON type. | |
| 285 | -* Real numbers with more than six decimal places are represented as | |
| 286 | - "r:{real-value}". | |
| 287 | -* Names: "/Name" -- internal/canonical representation (e.g. | |
| 288 | - "/Text/Plain", not #xx quoted) | |
| 289 | -* Indirect objects: "n n R" | |
| 290 | -* Strings: one of | |
| 291 | - "s:json string treated as Unicode" | |
| 292 | - "b:json string treated as bytes; character > \u00ff is an error" | |
| 293 | - "e:base64-encoded bytes" | |
| 171 | +In the stream dictionary, "/Length" is always removed. | |
| 294 | 172 | |
| 295 | -Test cases: these are the same: | |
| 296 | -* "b:\u00c8\u0080", "s:π", "s:\u03c0", and "e:z4A=" | |
| 297 | -* "b:\u00d8\u003e\u00dd\u0054", "s:🥔", "s:\ud83e\udd54", and "e:8J+llA==" | |
| 173 | +Streams are filtered or not based on the --decode-level parameter. If | |
| 174 | +a stream is filtered, "/Filter" and "/DecodeParms" are removed from | |
| 175 | +the stream dictionary. This makes the stream data and dictionary match | |
| 176 | +for when the file is read back in. | |
| 298 | 177 | |
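The stream dictionary rules above can be sketched as a small Python function. The name serialized_stream_dict is hypothetical, and the dictionary is modeled as a plain dict keyed by PDF names; this is an illustration of the rule, not qpdf's implementation.

```python
def serialized_stream_dict(stream_dict: dict, filtered: bool) -> dict:
    """Return the dictionary as it would appear under "stream"."dict"."""
    drop = {"/Length"}  # always removed
    if filtered:
        # The data was decoded, so the dictionary must no longer claim filters.
        drop |= {"/Filter", "/DecodeParms"}
    return {k: v for k, v in stream_dict.items() if k not in drop}
```

Removing the filter keys only when the data was actually filtered keeps the stream data and dictionary consistent when the file is read back in.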
| 299 | -When creating output from a string: | |
| 300 | -* If the string is explicitly unicode (UTF-8 or UTF-16), encode as | |
| 301 | - "s:" without the leading U+FEFF | |
| 302 | -* Else if the string can be bidirectionally mapped between pdf-doc and | |
| 303 | - unicode, transcode to unicode and encode as "s:" | |
| 304 | -* Else if the string would be decoded as binary, encode as "e:" | |
| 305 | -* Else encode as "b:" | |
| 178 | +CLI: | |
| 306 | 179 | |
| 307 | -When reading a string, any string that doesn't follow the above rules | |
| 308 | -is an error. This includes "r:" strings not parseable as a real | |
| 309 | -number, "/Name" strings containing a NUL character, "s:" or "b:" | |
| 310 | -strings that are not valid JSON strings, "b:" strings containing | |
| 311 | -character values > 0xff, or "e:" values that are not valid base64. | |
| 312 | -Once the string is read in, if the "s:" string can be bidirectionally | |
| 313 | -mapped between pdf-doc and unicode, store as PDFDoc. Otherwise store | |
| 314 | -as UTF-16BE. "b:" strings are stored as bytes, and "e:" are decoded | |
| 315 | -and stored as bytes. | |
| 180 | +* Add new flags | |
| 316 | 181 | |
| 317 | -Implementing this will require some refactoring of things between | |
| 318 | -QUtil and QPDF_String, plus we will need to implement a base64 | |
| 319 | -encoder/decoder. | |
| 182 | + * --from-json=input.json -- signals reading from a JSON and counts | |
| 183 | + as an input file. | |
| 320 | 184 | |
| 321 | -This enables a workflow like this: | |
| 185 | + * --json-streams-omit -- stream data is omitted, the default | |
| 322 | 186 | |
| 323 | -* qpdf --json=latest infile.pdf > pdf.json | |
| 324 | -* modify pdf.json | |
| 325 | -* qpdf infile.pdf --update-from=pdf.json out.pdf | |
| 187 | + * --json-streams-inline -- stream data is included in the "data" | |
| 188 | + key as base64-encoded | |
| 326 | 189 | |
| 327 | -or | |
| 190 | + * --json-streams-file-prefix=prefix -- stream is written to $prefix-$obj | |
| 191 | + where $obj is the object number. The path to the file is stored | |
| 192 | + in the "dataFile" key. A relative path is recommended and will be | |
| 193 | + interpreted as relative to the current directory. If a relative | |
| 194 | + prefix is given, a relative path will be stored in "dataFile". | |
| 195 | + Example: | |
| 196 | + mkdir in-streams | |
| 197 | + qpdf in.pdf --json-streams-file-prefix=in-streams/ > out.json | |
| 328 | 198 | |
| 329 | -* qpdf --json=latest --json-stream-data=raw|filtered infile.pdf > pdf.json | |
| 330 | -* modify pdf.json | |
| 331 | -* qpdf pdf.json --infile-is-json out.pdf | |
| 199 | + * --to-json -- changes default to --json-streams-inline and implies | |
| 200 | + --json-key=qpdf | |
| 332 | 201 | |
| 333 | -Notes about streams and stream data: | |
| 202 | +Example workflow: | |
| 203 | +* qpdf in.pdf --to-json > pdf.json | |
| 204 | +* edit pdf.json | |
| 205 | +* qpdf --from-json=pdf.json out.pdf | |
| 334 | 206 | |
| 335 | -* Always include "dict". "/Length" is removed from the stream | |
| 336 | - dictionary. | |
| 207 | +JSON to PDF: | |
| 337 | 208 | |
| 338 | -* Add new flag --json-stream-data={raw,filtered,none}. At most one of | |
| 339 | - "raw" and "filtered" will appear for each stream. If "filtered" | |
| 340 | - appears, "/Filter" and "/DecodeParms" are removed from the stream | |
| 341 | - dictionary. This makes the stream data and dictionary match for when | |
| 342 | - the file is read back in. | |
| 209 | +For going back from JSON to PDF, we can have | |
| 210 | +QPDF::fromJSON(std::shared_ptr<InputSource>), which will have logic | |
| 211 | +similar to copyForeignObject. Note that this InputSource is not going | |
| 212 | +to be this->file. We have to keep it separately. | |
| 343 | 213 | |
| 344 | -* Always include "filterable" regardless of value of | |
| 345 | - --json-stream-data. The value of filterable is influenced by | |
| 346 | - --decode-level, which is already in parameters. | |
| 214 | +The backing input source is this memory block: | |
| 347 | 215 | |
| 348 | -* Add to parameters: value of json-stream-data, default is none | |
| 216 | +``` | |
| 217 | +%PDF-1.3 | |
| 218 | +xref | |
| 219 | +0 1 | |
| 220 | +0000000000 65535 f | |
| 221 | +trailer << /Size 1 >> | |
| 222 | +startxref | |
| 223 | +9 | |
| 224 | +%%EOF | |
| 225 | +``` | |
| 226 | + | |
| 227 | +* Ignore all keys except .qpdf. | |
| 228 | +* Verify that .qpdf.jsonVersion is 2 | |
| 229 | +* Set this->m->pdf_version based on the .qpdf.pdfVersion key | |
| 230 | +* For each object in .qpdf.objects: | |
| 231 | + * Walk through the object detecting any indirect objects. For each | |
| 232 | + one that is not already known, reserve the object. We can also | |
| 233 | + validate but we should try to do the best we can with invalid JSON | |
| 234 | + so people can get good error messages. | |
| 235 | + * Construct a QPDFObjectHandle from the JSON | |
| 236 | + * If the object is the trailer, update the trailer | |
| 237 | + * Else if the object doesn't exist, reserve it | |
| 238 | + * If the object is reserved, call replaceReserved() | |
| 239 | + * Else the object already exists; this is an error. | |
| 240 | + | |
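The per-object steps above can be sketched in Python with a toy object table. Everything here is hypothetical and stands in for the real QPDF machinery: RESERVED plays the role of a reserved object, assigning over it plays the role of replaceReserved(), and the input is assumed to be an iterable of (key, entry) pairs so that a repeated key can be detected.

```python
import re

REF = re.compile(r"\d+ \d+ R")
RESERVED = object()  # placeholder for a reserved-but-undefined object


def find_refs(node):
    """Yield every "n n R" string nested anywhere inside a JSON value."""
    if isinstance(node, str):
        if REF.fullmatch(node):
            yield node
    elif isinstance(node, list):
        for item in node:
            yield from find_refs(item)
    elif isinstance(node, dict):
        for item in node.values():
            yield from find_refs(item)


def load_objects(pairs):
    """Build (table, trailer) from (key, entry) pairs of .qpdf.objects."""
    table, trailer = {}, None
    for key, entry in pairs:
        body = entry.get("value", entry.get("stream"))
        # Reserve every indirect object referenced but not yet known.
        for ref in find_refs(body):
            table.setdefault(ref, RESERVED)
        if key == "obj:trailer":
            trailer = body
            continue
        ref = key[len("obj:"):]
        table.setdefault(ref, RESERVED)  # reserve if it does not exist
        if table[ref] is RESERVED:
            table[ref] = body  # analogous to replaceReserved()
        else:
            raise ValueError(f"{key} is already defined")
    return table, trailer
```

Any entry still equal to RESERVED after the loop was referenced but never defined, which is one place a loader could report a useful error for invalid JSON.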
| 241 | +For streams, have a stream data provider that, for inline streams, | |
| 242 | +does a base64 decode from the file offsets and, for file-based streams, | |
| 243 | +reads the file. For the inline case, we have to keep the json InputSource | |
| 244 | +around. Otherwise, we don't. It is an error if there is no stream data. | |
| 245 | + | |
| 246 | +Documentation: | |
| 247 | + | |
| 248 | +Update --json option in cli.rst to mention v2 and update json.rst. | |
| 249 | + | |
| 250 | +Other documentation fodder: | |
| 251 | + | |
| 252 | +You can't create a PDF from v1 json because | |
| 253 | + | |
| 254 | +* The PDF version header is not recorded | |
| 255 | + | |
| 256 | +* Strings cannot be unambiguously encoded/decoded | |
| 257 | + | |
| 258 | + * Can't tell string from name from indirect object | |
| 349 | 259 | |
| 350 | -* If --json-stream-data=none, omit stream data entirely | |
| 260 | + * Strings are treated as PDF doc encoding and output as UTF-8, which | |
| 261 | + doesn't work since multiple PDF doc code points are undefined | |
| 351 | 262 | |
| 352 | -* If --json-stream-data=raw, include raw stream data as base64. Show | |
| 353 | - the data even for unfiltered streams in "raw". | |
| 263 | +* There is no representation of stream data | |
| 354 | 264 | |
| 355 | -* If --json-stream-data=filtered, include the base64-encoded filtered | |
| 356 | - stream data if we can and should decode it based on decode-level. | |
| 357 | - Otherwise, include the base64-encoded raw data. See if we can honor | |
| 358 | - --normalize-content. If a stream appears unfiltered in the input, | |
| 359 | - still show it as filtered. Remove /DecodeParms and /Filter if | |
| 360 | - filtering. | |
| 265 | +* You can't tell a stream from a dictionary except by looking in both | |
| 266 | + "object" and "objectinfo". Fix this, and then remove "objectinfo". | |
| 267 | + | |
| 268 | +Additionally, using "n n R" as a key in "objects" and "objectinfo" | |
| 269 | +messes up searching for things. | |
| 361 | 270 | |
| 362 | -Note that --json-stream-data=filtered is different from | |
| 363 | ---filtered-stream-data in that --filtered-stream-data implies | |
| 364 | ---decode-level=all while --json-stream-data=filtered does not. Make | |
| 365 | -sure this is mentioned in the help for both options. | |
| 366 | 271 | |
| 367 | 272 | QPDFJob |
| 368 | 273 | ======= | ... | ... |