Commit f1a9ba0c622deee0ed05004949b34f0126b12b6a
1 parent: 27a42c16
TODO: clean up remaining work for json v2
Showing 3 changed files with 102 additions and 122 deletions
TODO
| @@ -55,11 +55,7 @@ Soon: Break ground on "Document-level work" | @@ -55,11 +55,7 @@ Soon: Break ground on "Document-level work" | ||
| 55 | Output JSON v2 | 55 | Output JSON v2 |
| 56 | ============== | 56 | ============== |
| 57 | 57 | ||
| 58 | -Some of this documentation has drifted from the actual implementation. | ||
| 59 | - | ||
| 60 | -* Document that /Length is ignored in stream dictionary replacements | ||
| 61 | - | ||
| 62 | -General things to remember: | 58 | +Remaining work: |
| 63 | 59 | ||
| 64 | * Make sure all the information from --check and other informational | 60 | * Make sure all the information from --check and other informational |
| 65 | options (--show-linearization, --show-encryption, --show-xref, | 61 | options (--show-linearization, --show-encryption, --show-xref, |
| @@ -68,106 +64,98 @@ General things to remember: | @@ -68,106 +64,98 @@ General things to remember: | ||
| 68 | right keys when in json mode. I don't think I want check on by | 64 | right keys when in json mode. I don't think I want check on by |
| 69 | default, so that might be different. | 65 | default, so that might be different. |
| 70 | 66 | ||
| 71 | -* Consider changing the contract to allow fields to be absent even | ||
| 72 | - when present in the schema. It's reasonable for people to check for | ||
| 73 | - presence of a key. Most languages make this easy to do. | 67 | +Notes for documentation: |
| 68 | + | ||
| 69 | +* Find all mentions of json in the manual and update. | ||
| 74 | 70 | ||
| 75 | * Document typo fix in encrypt in release notes along with any other | 71 | * Document typo fix in encrypt in release notes along with any other |
| 76 | non-compatible json 2 changes. Scrutinize all the output to decide | 72 | non-compatible json 2 changes. Scrutinize all the output to decide |
| 77 | what should change. | 73 | what should change. |
| 78 | 74 | ||
| 79 | -* Document that keys other than "qpdf-v2" are ignored so people can | ||
| 80 | - stash their own stuff. | ||
| 81 | - | ||
| 82 | -JSON to PDF: | ||
| 83 | - | ||
| 84 | -Have --json-input and --update-from-json. With --json-input, the json | ||
| 85 | -file must be complete, meaning all stream data, the trailer, and the | ||
| 86 | -PDF version must be present. For streams with no stream data, the | ||
| 87 | -dictionary is updated but the data is left untouched. Other things | ||
| 88 | -that are omitted are left alone. Make sure document that, when writing | ||
| 89 | -a PDF file from QPDF, there is no expectation of object numbers being | ||
| 90 | -preserved. As such, --update-from-json can only be used to update the | ||
| 91 | -exact file that the json was created from. You can put multiple | ||
| 92 | -objects in the update file, but you can't use a json from one file to | ||
| 93 | -update the output of a previous update since the object numbers will | ||
| 94 | -have changed. Note that, when creating from a JSON, object numbers are | ||
| 95 | -preserved in the resulting QPDF object but still modified by | ||
| 96 | -QPDFWriter for the output. This would be visible by combining | ||
| 97 | ---json-output and --json-input. Also using --qdf with | ||
| 98 | ---create-from-json would show original object IDs in comments. It will | ||
| 99 | -be important to capture this in the documentation. | ||
| 100 | - | ||
| 101 | -When reading a JSON string, any string that doesn't look like a name | ||
| 102 | -or indirect object or start with "b:" or "u:" should be considered an | ||
| 103 | -error. Just use newUnicodeString on "u:" strings. For "b:" strings, | ||
| 104 | -decode the bytes with hex_decode and use newString. | ||
| 105 | - | ||
| 106 | -Test case: combine --json-input and --json-output to show preservation | ||
| 107 | -of object numbers. QPDFWriter won't show that although --qdf with the | ||
| 108 | -original object ID comments would. | ||
| 109 | - | ||
| 110 | -The backing input source for createFromJSON is this memory block: | ||
| 111 | - | ||
| 112 | -``` | ||
| 113 | -%PDF-1.3 | ||
| 114 | -xref | ||
| 115 | -0 1 | ||
| 116 | -0000000000 65535 f | ||
| 117 | -trailer << /Size 1 >> | ||
| 118 | -startxref | ||
| 119 | -9 | ||
| 120 | -%%EOF | ||
| 121 | -``` | ||
| 122 | - | ||
| 123 | -* Ignore all keys except .qpdf-v2. | ||
| 124 | -* Set this->m->pdf_version based on the .qpdf.pdfVersion key | ||
| 125 | -* For each object in .qpdf.objects: | ||
| 126 | - * Walk through the object detecting any indirect objects. For each | ||
| 127 | - one that is not already known, reserve the object. We can also | ||
| 128 | - validate but we should try to do the best we can with invalid JSON | ||
| 129 | - so people can get good error messages. | ||
| 130 | - * Construct a QPDFObjectHandle from the JSON | ||
| 131 | - * If the object is the trailer, update the trailer | ||
| 132 | - * Else if the object doesn't exist, reserve it | ||
| 133 | - * If the object is reserved, call replaceReserved() | ||
| 134 | - * Else the object already exists; this is an error. | ||
| 135 | - | ||
| 136 | -For streams, have a stream data provider that, for inline streams, | ||
| 137 | -does a base64 from the file offsets and for file-based streams, reads | ||
| 138 | -the file. For the inline case, we have to keep the json InputSource | ||
| 139 | -around. Otherwise, we don't. It is an error if there is no stream | ||
| 140 | -data. For files, we can have a stream data provider that just reads | ||
| 141 | -the file. Remember QUtil::file_provider. | ||
| 142 | - | ||
| 143 | -Documentation: | ||
| 144 | - | ||
| 145 | -Serialized PDF: | ||
| 146 | - | ||
| 147 | -The JSON output will have a "qpdf-v2" key containing | ||
| 148 | -* pdfversion | ||
| 149 | -* maxobjectid | ||
| 150 | -* objects | ||
| 151 | - | ||
| 152 | -In regular json mode, "objectinfo" is gone. | ||
| 153 | - | ||
| 154 | -Within .objects, the key is "obj:o g R" or "trailer", and the | ||
| 155 | -value is a dictionary with exactly one of "value" or "stream" as its | ||
| 156 | -single key. | 75 | +* Keys other than "qpdf-v2" are ignored so people can stash their own |
| 76 | + stuff. Unknown keys are ignored at other places for future | ||
| 77 | + compatibility. Readers of qpdf json should continue to ignore keys | ||
| 78 | + they don't recognize. | ||
| 157 | 79 | ||
| 158 | -Rationale of "obj:o g R" is that indirect object references are just | ||
| 159 | -"o g R", and so code that wants to resolve one can do so easily by | ||
| 160 | -just prepending "obj:" and not having to parse or split the string. | ||
| 161 | -Having a prefix rather than making the key just "o g R" makes it much | ||
| 162 | -easier to search in the JSON for the definition of an object. | 80 | +* Change: names are written in canonical form with a leading slash |
| 81 | + just as they are treated in the code. In v1, they were written in | ||
| 82 | + PDF syntax in the json file. Example: /text#2fplain in pdf will be | ||
| 83 | + written as /text/plain in json v2 and as /text#2fplain in json v1. | ||
| 84 | + | ||
| 85 | +* Document changes to strings, objects, streams, object keys. | ||
| 86 | + | ||
| 87 | +* CLI: --json-input, --json-output[=version], --update-from-json. With | ||
| 88 | + --json-input, the input file is a JSON file instead of a PDF file. | ||
| 89 | + It must be complete, meaning that a PDF version must be given, all | ||
| 90 | + streams must have exactly one of data or datafile, and a trailer | ||
| 91 | + dictionary must be present, even if empty. | ||
| 92 | + | ||
| 93 | + With --update-from-json, the JSON file updates objects in place. If | ||
| 94 | + updating an old stream, if stream data is omitted, the data remains | ||
| 95 | + untouched. The dictionary is always required. Remember that | ||
| 96 | + QPDFWriter does not preserve object numbers, though --json-output | ||
| 97 | + does. Therefore, if you want to update a PDF with a JSON, the input | ||
| 98 | + to --update-from-json must be the same PDF as the one that | ||
| 99 | + --json-output was run on previously. Otherwise, object numbers won't | ||
| 100 | + match. Show this with an example. When updating, | ||
| 101 | + | ||
| 102 | +* Certain fields are ignored when reading the JSON. This includes | ||
| 103 | + maxobjectid, any computed fields in trailer (such as /Size), and all | ||
| 104 | + /Length keys in stream dictionaries. There is no need for the user | ||
| 105 | + to correct, remove, or otherwise worry about any values those keys | ||
| 106 | + might have. The maxobjectid field is present in the original output | ||
| 107 | + to assist with adding new objects to the file. | ||
| 108 | + | ||
| 109 | +* JSON strings within PDF objects: | ||
| 110 | + | ||
| 111 | + * "n n R" is an indirect object | ||
| 112 | + | ||
| 113 | + * "/Name" is a name in canonical form with a leading slash (like | ||
| 114 | + "/text/plain"), not PDF syntax (like "/text#2fplain"). | ||
| 115 | + | ||
| 116 | + * "b:hex-digits" is a binary string ("b:feff03c0"). Hex digits may be | ||
| 117 | + mixed case. There must be an even number of digits. | ||
| 118 | + | ||
| 119 | + * "u:utf-8" is a UTF-8 encoded string ("u:π", "u:\u03c0"). UTF-16 | ||
| 120 | + surrogate pairs are allowed. These are all equivalent: "u:🥔", | ||
| 121 | + "u:\ud83e\udd54", "b:FEFFD83EDD54", "b:efbbbff09fa594". | ||
| 122 | + | ||
| 123 | + * Both "b:" and "u:" are valid representations of the empty string. | ||
| 124 | + | ||
| 125 | + * Anything else is an error | ||
| 126 | + | ||
| 127 | +* Document use of --json-input and --json-output together to show | ||
| 128 | + preservation of object numbers. Draw attention to "original object | ||
| 129 | + ID" comments in qdf as another way to show it. | ||
| 130 | + | ||
| 131 | +* Document top-level keys of "qpdf-v2" ("pdfversion", "objects", | ||
| 132 | + "maxobjectid") noting that "maxobjectid" is ignored when reading. | ||
| 133 | + | ||
| 134 | +* Stream data: "data" is base64-encoded stream data. "datafile" is the | ||
| 135 | + path to a file (relative path recommended but not required) | ||
| 136 | + containing the binary data. As with any PDF representation, the data | ||
| 137 | + must be consistent with the filters. --decode-level is honored by | ||
| 138 | + --json-output. | ||
| 139 | + | ||
| 140 | +* Other changes from v1: | ||
| 141 | + | ||
| 142 | + * in "objects", keys are "obj:o g R" or "trailer" | ||
| 143 | + | ||
| 144 | + * Non-stream objects are dictionaries with a "value" key whose value | ||
| 145 | + is the object. Stream objects are dictionaries with a "stream" key | ||
| 146 | + whose value is {"dict": stream-dictionary}. The "/Length" key is | ||
| 147 | + omitted from the stream dictionary. | ||
| 148 | + | ||
| 149 | + * "objectinfo" is gone as it is now possible to tell a stream from a | ||
| 150 | + non-stream directly. To get stream data, use the --json-output | ||
| 151 | + option. Note about how "pages" may cause the pages tree to be | ||
| 152 | + corrected. | ||
| 163 | 153 | ||
| 164 | For non-streams: | 154 | For non-streams: |
| 165 | 155 | ||
| 166 | -{ | ||
| 167 | "obj:o g R": { | 156 | "obj:o g R": { |
| 168 | "value": ... | 157 | "value": ... |
| 169 | } | 158 | } |
| 170 | -} | ||
| 171 | 159 | ||
| 172 | For streams: | 160 | For streams: |
| 173 | 161 | ||
| @@ -178,41 +166,31 @@ For streams: | @@ -178,41 +166,31 @@ For streams: | ||
| 178 | "datafile": "path to base64-encoded data" | 166 | "datafile": "path to base64-encoded data" |
| 179 | } | 167 | } |
| 180 | } | 168 | } |
| 181 | -} | ||
| 182 | - | ||
| 183 | -At most one of "data" or "datafile" will be present. When serializing, | ||
| 184 | -stream decode parameters will be obeyed, and the stream dictionary | ||
| 185 | -will reflect the result. There will be the option to omit stream data. | ||
| 186 | 169 | ||
| 187 | -When data is included, "/Length" is removed from the stream | ||
| 188 | -dictionary. | ||
| 189 | - | ||
| 190 | -Streams are filtered or not based on the --decode-level parameter. If | ||
| 191 | -a stream is filtered, "/Filter" and "/DecodeParms" are removed from | ||
| 192 | -the stream dictionary. This makes the stream data and dictionary match | ||
| 193 | -for when the file is read back in. | 170 | +Rationale of "obj:o g R" is that indirect object references are just |
| 171 | +"o g R", and so code that wants to resolve one can do so easily by | ||
| 172 | +just prepending "obj:" and not having to parse or split the string. | ||
| 173 | +Having a prefix rather than making the key just "o g R" makes it much | ||
| 174 | +easier to search in the JSON for the definition of an object. | ||
| 194 | 175 | ||
| 195 | CLI: | 176 | CLI: |
| 196 | 177 | ||
| 197 | Example workflow: | 178 | Example workflow: |
| 198 | -* qpdf in.pdf --json-output=2 pdf.json | 179 | +* qpdf in.pdf --json-output pdf.json |
| 199 | * edit pdf.json | 180 | * edit pdf.json |
| 200 | * qpdf --json-input pdf.json out.pdf | 181 | * qpdf --json-input pdf.json out.pdf |
| 201 | 182 | ||
| 202 | -* qpdf in.pdf --json-output=2 pdf.json | 183 | +* qpdf in.pdf --json-output pdf.json |
| 203 | * edit pdf.json keeping only objects that need to be changed | 184 | * edit pdf.json keeping only objects that need to be changed |
| 204 | * qpdf in.pdf --update-from-json=pdf.json out.pdf | 185 | * qpdf in.pdf --update-from-json=pdf.json out.pdf |
| 205 | 186 | ||
| 206 | -Update --json option in cli.rst to mention v2 and update json.rst. | ||
| 207 | - | ||
| 208 | -Other documentation fodder: | 187 | +To modify a single object: |
| 209 | 188 | ||
| 210 | -You can't create a PDF from v1 json because | 189 | +* qpdf in.pdf --json-output pdf.json --json-object=o,g |
| 190 | +* edit pdf.json | ||
| 191 | +* qpdf in.pdf --update-from-json=pdf.json out.pdf | ||
| 211 | 192 | ||
| 212 | -* Change: names are written in canonical form with a leading slash | ||
| 213 | - just as they are treated in the code. In v1, they were written in | ||
| 214 | - PDF syntax in the json file. Example: /text#2fplain in pdf will be | ||
| 215 | - written as /text/plain in json v2 and as /text#2fplain in json v1. | 193 | +Historical note: you can't create a PDF from v1 json because |
| 216 | 194 | ||
| 217 | * The PDF version header is not recorded | 195 | * The PDF version header is not recorded |
| 218 | 196 | ||
| @@ -221,15 +199,16 @@ You can't create a PDF from v1 json because | @@ -221,15 +199,16 @@ You can't create a PDF from v1 json because | ||
| 221 | * Can't tell string from name from indirect object | 199 | * Can't tell string from name from indirect object |
| 222 | 200 | ||
| 223 | * Strings are treated as PDF doc encoding and output as UTF-8, which | 201 | * Strings are treated as PDF doc encoding and output as UTF-8, which |
| 224 | - doesn't work since multiple PDF doc code points are undefined | 202 | + doesn't work since multiple PDF doc code points are undefined and |
| 203 | + is absurd for binary strings | ||
| 225 | 204 | ||
| 226 | * There is no representation of stream data | 205 | * There is no representation of stream data |
| 227 | 206 | ||
| 228 | * You can't tell a stream from a dictionary except by looking in both | 207 | * You can't tell a stream from a dictionary except by looking in both |
| 229 | - "object" and "objectinfo". Fix this, and then remove "objectinfo". | 208 | + "object" and "objectinfo". |
| 230 | 209 | ||
| 231 | -Additionally, using "n n R" as a key in "objects" and "objectinfo" | ||
| 232 | -messes up searching for things. | 210 | +* Using "n n R" as a key in "objects" and "objectinfo" makes it hard |
| 211 | + to search for things when viewing the JSON file in an editor. | ||
| 233 | 212 | ||
| 234 | 213 | ||
| 235 | QPDFPagesTree | 214 | QPDFPagesTree |
| @@ -249,7 +228,7 @@ I'm thinking we will want to keep a pages cache for efficient | @@ -249,7 +228,7 @@ I'm thinking we will want to keep a pages cache for efficient | ||
| 249 | insertion. There's no reason we can't keep a vector of page objects up | 228 | insertion. There's no reason we can't keep a vector of page objects up |
| 250 | to date and just do a traversal the first time we do getAllPages just | 229 | to date and just do a traversal the first time we do getAllPages just |
| 251 | like we do now. The difference is that we would not flatten the pages | 230 | like we do now. The difference is that we would not flatten the pages |
| 252 | -tree. It would be useful to go through QPDF_pages and re-reimplement | 231 | +tree. It would be useful to go through QPDF_pages and reimplement |
| 253 | everything without calling flattenPagesTree. Then we can remove | 232 | everything without calling flattenPagesTree. Then we can remove |
| 254 | flattenPagesTree, which is private. | 233 | flattenPagesTree, which is private. |
| 255 | 234 | ||
| @@ -261,7 +240,7 @@ isPagesObject and isPageObject are reliable and can be made more | @@ -261,7 +240,7 @@ isPagesObject and isPageObject are reliable and can be made more | ||
| 261 | reliable. Maybe add a validate or repair function? It should also make | 240 | reliable. Maybe add a validate or repair function? It should also make |
| 262 | sure /Count and /Parent are correct. | 241 | sure /Count and /Parent are correct. |
| 263 | 242 | ||
| 264 | -refs/attic/QPDFPagesTree-old -- original, abndoned branch -- clean up | 243 | +refs/attic/QPDFPagesTree-old -- original, abandoned branch -- clean up |
| 265 | when done. | 244 | when done. |
| 266 | 245 | ||
| 267 | QPDFJob | 246 | QPDFJob |
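
Note on the "JSON strings within PDF objects" rules in the TODO diff above: the following is a minimal Python sketch of how a reader of qpdf JSON v2 might classify string values according to those rules. It is not qpdf's implementation. The names `classify` and `binary_as_text` are hypothetical, and the byte-order-mark handling assumes the usual PDF text string conventions (FE FF for UTF-16BE, EF BB BF for UTF-8); strings without a BOM would need PDFDocEncoding, which this sketch does not attempt.

```python
import json
import re

# "n n R": object number, generation, literal R
_INDIRECT = re.compile(r"([0-9]+) ([0-9]+) R")
# Zero or more pairs of hex digits, mixed case allowed
_HEX_PAIRS = re.compile(r"(?:[0-9A-Fa-f]{2})*")


def classify(s):
    """Classify a JSON string found inside a PDF object.

    Returns a (kind, value) pair, or raises ValueError for anything
    that is not an indirect reference, a name, or a "b:"/"u:" string.
    """
    m = _INDIRECT.fullmatch(s)
    if m:
        return ("indirect", (int(m.group(1)), int(m.group(2))))
    if s.startswith("/"):
        # Names are in canonical form, e.g. "/text/plain"
        return ("name", s)
    if s.startswith("u:"):
        return ("unicode", s[2:])
    if s.startswith("b:"):
        digits = s[2:]
        if not _HEX_PAIRS.fullmatch(digits):
            raise ValueError("b: requires an even number of hex digits")
        return ("binary", bytes.fromhex(digits))
    raise ValueError("unrecognized string: %r" % s)


def binary_as_text(data):
    """Interpret a binary string as text when it carries a BOM (assumption)."""
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    if data.startswith(b"\xef\xbb\xbf"):
        return data[3:].decode("utf-8")
    raise ValueError("no BOM; PDFDocEncoding is not handled in this sketch")


# The JSON parser resolves \u escapes, including surrogate pairs,
# before the string reaches classify().
kind, text = classify(json.loads('"u:\\ud83e\\udd54"'))
```

With these definitions, the "u:" form and both "b:" forms listed in the notes decode to the same single code point (U+1F954), which is the sense in which the notes call them equivalent.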
cSpell.json
manual/json.rst
| @@ -23,7 +23,7 @@ be extracted from PDF files using other qpdf command-line options. | @@ -23,7 +23,7 @@ be extracted from PDF files using other qpdf command-line options. | ||
| 23 | QPDF JSON Format | 23 | QPDF JSON Format |
| 24 | ---------------- | 24 | ---------------- |
| 25 | 25 | ||
| 26 | -QXXXQ Write this. | 26 | +XXX Write this. |
| 27 | 27 | ||
| 28 | .. _json-guarantees: | 28 | .. _json-guarantees: |
| 29 | 29 |
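
Note on the "obj:o g R" rationale in the TODO diff of this commit: the sketch below illustrates, in Python, why prefixing object keys with "obj:" makes reference resolution a plain dictionary lookup. The `resolve` helper and the sample `objects` dictionary are hypothetical and only follow the layout described in the notes ("value" for non-streams, "stream" containing "dict" for streams); the exact placement of "data" inside "stream" is an assumption.

```python
def resolve(objects, ref):
    """Look up an indirect reference such as "3 0 R" in "objects".

    No parsing or splitting of the reference is needed: the key is
    simply "obj:" followed by the reference string.
    """
    entry = objects["obj:" + ref]
    if "stream" in entry:
        return ("stream", entry["stream"]["dict"])
    return ("value", entry["value"])


# Hypothetical sample shaped like the "objects" value under "qpdf-v2"
objects = {
    "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
    "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
    "obj:3 0 R": {"stream": {"dict": {"/Filter": "/FlateDecode"}, "data": ""}},
    "trailer": {"value": {"/Root": "1 0 R"}},
}

root_kind, root = resolve(objects, objects["trailer"]["value"]["/Root"])
pages_kind, pages = resolve(objects, root["/Pages"])
stream_kind, stream_dict = resolve(objects, "3 0 R")
```

The same prefix also means that searching the JSON text for "obj:3 0 R" finds the definition of the object rather than every reference to it.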