Commit f1a9ba0c622deee0ed05004949b34f0126b12b6a
1 parent 27a42c16
TODO: clean up remaining work for json v2
Showing 3 changed files with 102 additions and 122 deletions
TODO
| ... | ... | @@ -55,11 +55,7 @@ Soon: Break ground on "Document-level work" |
| 55 | 55 | Output JSON v2 |
| 56 | 56 | ============== |
| 57 | 57 | |
| 58 | -Some of this documentation has drifted from the actual implementation. | |
| 59 | - | |
| 60 | -* Document that /Length is ignored in stream dictionary replacements | |
| 61 | - | |
| 62 | -General things to remember: | |
| 58 | +Remaining work: | |
| 63 | 59 | |
| 64 | 60 | * Make sure all the information from --check and other informational |
| 65 | 61 | options (--show-linearization, --show-encryption, --show-xref, |
| ... | ... | @@ -68,106 +64,98 @@ General things to remember: |
| 68 | 64 | right keys when in json mode. I don't think I want check on by |
| 69 | 65 | default, so that might be different. |
| 70 | 66 | |
| 71 | -* Consider changing the contract to allow fields to be absent even | |
| 72 | - when present in the schema. It's reasonable for people to check for | |
| 73 | - presence of a key. Most languages make this easy to do. | |
| 67 | +Notes for documentation: | |
| 68 | + | |
| 69 | +* Find all mentions of json in the manual and update. | |
| 74 | 70 | |
| 75 | 71 | * Document typo fix in encrypt in release notes along with any other |
| 76 | 72 | non-compatible json 2 changes. Scrutinize all the output to decide |
| 77 | 73 | what should change. |
| 78 | 74 | |
| 79 | -* Document that keys other than "qpdf-v2" are ignored so people can | |
| 80 | - stash their own stuff. | |
| 81 | - | |
| 82 | -JSON to PDF: | |
| 83 | - | |
| 84 | -Have --json-input and --update-from-json. With --json-input, the json | |
| 85 | -file must be complete, meaning all stream data, the trailer, and the | |
| 86 | -PDF version must be present. For streams with no stream data, the | |
| 87 | -dictionary is updated but the data is left untouched. Other things | |
| 88 | -that are omitted are left alone. Make sure document that, when writing | |
| 89 | -a PDF file from QPDF, there is no expectation of object numbers being | |
| 90 | -preserved. As such, --update-from-json can only be used to update the | |
| 91 | -exact file that the json was created from. You can put multiple | |
| 92 | -objects in the update file, but you can't use a json from one file to | |
| 93 | -update the output of a previous update since the object numbers will | |
| 94 | -have changed. Note that, when creating from a JSON, object numbers are | |
| 95 | -preserved in the resulting QPDF object but still modified by | |
| 96 | -QPDFWriter for the output. This would be visible by combining | |
| 97 | ---json-output and --json-input. Also using --qdf with | |
| 98 | ---create-from-json would show original object IDs in comments. It will | |
| 99 | -be important to capture this in the documentation. | |
| 100 | - | |
| 101 | -When reading a JSON string, any string that doesn't look like a name | |
| 102 | -or indirect object or start with "b:" or "u:" should be considered an | |
| 103 | -error. Just use newUnicodeString on "u:" strings. For "b:" strings, | |
| 104 | -decode the bytes with hex_decode and use newString. | |
| 105 | - | |
| 106 | -Test case: combine --json-input and --json-output to show preservation | |
| 107 | -of object numbers. QPDFWriter won't show that although --qdf with the | |
| 108 | -original object ID comments would. | |
| 109 | - | |
| 110 | -The backing input source for createFromJSON is this memory block: | |
| 111 | - | |
| 112 | -``` | |
| 113 | -%PDF-1.3 | |
| 114 | -xref | |
| 115 | -0 1 | |
| 116 | -0000000000 65535 f | |
| 117 | -trailer << /Size 1 >> | |
| 118 | -startxref | |
| 119 | -9 | |
| 120 | -%%EOF | |
| 121 | -``` | |
| 122 | - | |
| 123 | -* Ignore all keys except .qpdf-v2. | |
| 124 | -* Set this->m->pdf_version based on the .qpdf.pdfVersion key | |
| 125 | -* For each object in .qpdf.objects: | |
| 126 | - * Walk through the object detecting any indirect objects. For each | |
| 127 | - one that is not already known, reserve the object. We can also | |
| 128 | - validate but we should try to do the best we can with invalid JSON | |
| 129 | - so people can get good error messages. | |
| 130 | - * Construct a QPDFObjectHandle from the JSON | |
| 131 | - * If the object is the trailer, update the trailer | |
| 132 | - * Else if the object doesn't exist, reserve it | |
| 133 | - * If the object is reserved, call replaceReserved() | |
| 134 | - * Else the object already exists; this is an error. | |
| 135 | - | |
| 136 | -For streams, have a stream data provider that, for inline streams, | |
| 137 | -does a base64 from the file offsets and for file-based streams, reads | |
| 138 | -the file. For the inline case, we have to keep the json InputSource | |
| 139 | -around. Otherwise, we don't. It is an error if there is no stream | |
| 140 | -data. For files, we can have a stream data provider that just reads | |
| 141 | -the file. Remember QUtil::file_provider. | |
| 142 | - | |
| 143 | -Documentation: | |
| 144 | - | |
| 145 | -Serialized PDF: | |
| 146 | - | |
| 147 | -The JSON output will have a "qpdf-v2" key containing | |
| 148 | -* pdfversion | |
| 149 | -* maxobjectid | |
| 150 | -* objects | |
| 151 | - | |
| 152 | -In regular json mode, "objectinfo" is gone. | |
| 153 | - | |
| 154 | -Within .objects, the key is "obj:o g R" or "trailer", and the | |
| 155 | -value is a dictionary with exactly one of "value" or "stream" as its | |
| 156 | -single key. | |
| 75 | +* Keys other than "qpdf-v2" are ignored so people can stash their own | |
| 76 | + stuff. Unknown keys are ignored at other places for future | |
| 77 | + compatibility. Readers of qpdf json should continue to ignore keys | |
| 78 | + they don't recognize. | |
| 157 | 79 | |
| 158 | -Rationale of "obj:o g R" is that indirect object references are just | |
| 159 | -"o g R", and so code that wants to resolve one can do so easily by | |
| 160 | -just prepending "obj:" and not having to parse or split the string. | |
| 161 | -Having a prefix rather than making the key just "o g R" makes it much | |
| 162 | -easier to search in the JSON for the definition of an object. | |
| 80 | +* Change: names are written in canonical form with a leading slash | |
| 81 | + just as they are treated in the code. In v1, they were written in | |
| 82 | + PDF syntax in the json file. Example: /text#2fplain in pdf will be | |
| 83 | + written as /text/plain in json v2 and as /text#2fplain in json v1. | |
| 84 | + | |
| 85 | +* Document changes to strings, objects, streams, object keys. | |
| 86 | + | |
| 87 | +* CLI: --json-input, --json-output[=version], --update-from-json. With | |
| 88 | + --json-input, the input file is a JSON file instead of a PDF file. | |
| 89 | + It must be complete, meaning that a PDF version must be given, all | |
| 90 | + streams must have exactly one of data or datafile, and a trailer | |
| 91 | + dictionary must be present, even if empty. | |
| 92 | + | |
| 93 | + With --update-from-json, the JSON file updates objects in place. If | |
| 94 | + updating an old stream, if stream data is omitted, the data remains | |
| 95 | + untouched. The dictionary is always required. Remember that | |
| 96 | + QPDFWriter does not preserve object numbers, though --json-output | |
| 97 | + does. Therefore, if you want to update a PDF with a JSON, the input | |
| 98 | + to --update-from-json must be the same PDF as the one that | |
| 99 | + --json-output was run on previously. Otherwise, object numbers won't | |
| 100 | + match. Show this with an example. When updating, | |
| 101 | + | |
| 102 | +* Certain fields are ignored when reading the JSON. This includes | |
| 103 | + maxobjectid, any computed fields in trailer (such as /Size), and all | |
| 104 | + /Length keys in stream dictionaries. There is no need for the user | |
| 105 | + to correct, remove, or otherwise worry about any values those keys | |
| 106 | + might have. The maxobjectid field is present in the original output | |
| 107 | + to assist with adding new objects to the file. | |
| 108 | + | |
| 109 | +* JSON strings within PDF objects: | |
| 110 | + | |
| 111 | + * "n n R" is an indirect object | |
| 112 | + | |
| 113 | + * "/Name" is a name in canonical form with a leading slash (like | |
| 114 | + "/text/plain"), not PDF syntax (like "/text#2fplain"). | |
| 115 | + | |
| 116 | + * "b:hex-digits" is a binary string ("b:feff03c0"). Hex digits may be | |
| 117 | + mixed case. There must be an even number of digits. | |
| 118 | + | |
| 119 | + * "u:utf-8" is a UTF-8 encoded string ("u:π", "u:\u03c0"). UTF-16 | |
| 120 | + surrogate pairs are allowed. These are all equivalent: "u:🥔", | |
| 121 | + "u:\ud83e\udd54", "b:FEFFD83EDD54", "b:efbbbff09fa594". | |
| 122 | + | |
| 123 | + * Both "b:" and "u:" are valid representations of the empty string. | |
| 124 | + | |
| 125 | + * Anything else is an error | |
| 126 | + | |
| 127 | +* Document use of --json-input and --json-output together to show | |
| 128 | + preservation of object numbers. Draw attention to "original object | |
| 129 | + ID" comments in qdf as another way to show it. | |
| 130 | + | |
| 131 | +* Document top-level keys of "qpdf-v2" ("pdfversion", "objects", | |
| 132 | + "maxobjectid") noting that "maxobjectid" is ignored when reading. | |
| 133 | + | |
| 134 | +* Stream data: "data" is base64-encoded stream data. "datafile" is the | |
| 135 | + path to a file (relative path recommended but not required) | |
| 136 | + containing the binary data. As with any PDF representation, the data | |
| 137 | + must be consistent with the filters. --decode-level is honored by | |
| 138 | + --json-output. | |
| 139 | + | |
| 140 | +* Other changes from v1: | |
| 141 | + | |
| 142 | + * in "objects", keys are "obj:o g R" or "trailer" | |
| 143 | + | |
| 144 | + * Non-stream objects are dictionaries with a "value" key whose value | |
| 145 | + is the object. Stream objects are dictionaries with a "stream" key | |
| 146 | + whose value is {"dict": stream-dictionary}. The "/Length" key is | |
| 147 | + omitted from the stream dictionary. | |
| 148 | + | |
| 149 | + * "objectinfo" is gone as it is now possible to tell a stream from a | |
| 150 | + non-stream directly. To get stream data, use the --json-output | |
| 151 | + option. Note about how "pages" may cause the pages tree to be | |
| 152 | + corrected. | |
| 163 | 153 | |
| 164 | 154 | For non-streams: |
| 165 | 155 | |
| 166 | -{ | |
| 167 | 156 | "obj:o g R": { |
| 168 | 157 | "value": ... |
| 169 | 158 | } |
| 170 | -} | |
| 171 | 159 | |
| 172 | 160 | For streams: |
| 173 | 161 | |
| ... | ... | @@ -178,41 +166,31 @@ For streams: |
| 178 | 166 | "datafile": "path to base64-encoded data" |
| 179 | 167 | } |
| 180 | 168 | } |
| 181 | -} | |
| 182 | - | |
| 183 | -At most one of "data" or "datafile" will be present. When serializing, | |
| 184 | -stream decode parameters will be obeyed, and the stream dictionary | |
| 185 | -will reflect the result. There will be the option to omit stream data. | |
| 186 | 169 | |
| 187 | -When data is included, "/Length" is removed from the stream | |
| 188 | -dictionary. | |
| 189 | - | |
| 190 | -Streams are filtered or not based on the --decode-level parameter. If | |
| 191 | -a stream is filtered, "/Filter" and "/DecodeParms" are removed from | |
| 192 | -the stream dictionary. This makes the stream data and dictionary match | |
| 193 | -for when the file is read back in. | |
| 170 | +Rationale of "obj:o g R" is that indirect object references are just | |
| 171 | +"o g R", and so code that wants to resolve one can do so easily by | |
| 172 | +just prepending "obj:" and not having to parse or split the string. | |
| 173 | +Having a prefix rather than making the key just "o g R" makes it much | |
| 174 | +easier to search in the JSON for the definition of an object. | |
| 194 | 175 | |
| 195 | 176 | CLI: |
| 196 | 177 | |
| 197 | 178 | Example workflow: |
| 198 | -* qpdf in.pdf --json-output=2 pdf.json | |
| 179 | +* qpdf in.pdf --json-output pdf.json | |
| 199 | 180 | * edit pdf.json |
| 200 | 181 | * qpdf --json-input pdf.json out.pdf |
| 201 | 182 | |
| 202 | -* qpdf in.pdf --json-output=2 pdf.json | |
| 183 | +* qpdf in.pdf --json-output pdf.json | |
| 203 | 184 | * edit pdf.json keeping only objects that need to be changed |
| 204 | 185 | * qpdf in.pdf --update-from-json=pdf.json out.pdf |
| 205 | 186 | |
| 206 | -Update --json option in cli.rst to mention v2 and update json.rst. | |
| 207 | - | |
| 208 | -Other documentation fodder: | |
| 187 | +To modify a single object: | |
| 209 | 188 | |
| 210 | -You can't create a PDF from v1 json because | |
| 189 | +* qpdf in.pdf --json-output pdf.json --json-object=o,g | |
| 190 | +* edit pdf.json | |
| 191 | +* qpdf in.pdf --update-from-json=pdf.json out.pdf | |
| 211 | 192 | |
| 212 | -* Change: names are written in canonical form with a leading slash | |
| 213 | - just as they are treated in the code. In v1, they were written in | |
| 214 | - PDF syntax in the json file. Example: /text#2fplain in pdf will be | |
| 215 | - written as /text/plain in json v2 and as /text#2fplain in json v1. | |
| 193 | +Historical note: you can't create a PDF from v1 json because | |
| 216 | 194 | |
| 217 | 195 | * The PDF version header is not recorded |
| 218 | 196 | |
| ... | ... | @@ -221,15 +199,16 @@ You can't create a PDF from v1 json because |
| 221 | 199 | * Can't tell string from name from indirect object |
| 222 | 200 | |
| 223 | 201 | * Strings are treated as PDF doc encoding and output as UTF-8, which |
| 224 | - doesn't work since multiple PDF doc code points are undefined | |
| 202 | + doesn't work since multiple PDF doc code points are undefined and | |
| 203 | + is absurd for binary strings | |
| 225 | 204 | |
| 226 | 205 | * There is no representation of stream data |
| 227 | 206 | |
| 228 | 207 | * You can't tell a stream from a dictionary except by looking in both |
| 229 | - "object" and "objectinfo". Fix this, and then remove "objectinfo". | |
| 208 | + "object" and "objectinfo". | |
| 230 | 209 | |
| 231 | -Additionally, using "n n R" as a key in "objects" and "objectinfo" | |
| 232 | -messes up searching for things. | |
| 210 | +* Using "n n R" as a key in "objects" and "objectinfo" makes it hard | |
| 211 | + to search for things when viewing the JSON file in an editor. | |
| 233 | 212 | |
| 234 | 213 | |
| 235 | 214 | QPDFPagesTree |
| ... | ... | @@ -249,7 +228,7 @@ I'm thinking we will want to keep a pages cache for efficient |
| 249 | 228 | insertion. There's no reason we can't keep a vector of page objects up |
| 250 | 229 | to date and just do a traversal the first time we do getAllPages just |
| 251 | 230 | like we do now. The difference is that we would not flatten the pages |
| 252 | -tree. It would be useful to go through QPDF_pages and re-reimplement | |
| 231 | +tree. It would be useful to go through QPDF_pages and reimplement | |
| 253 | 232 | everything without calling flattenPagesTree. Then we can remove |
| 254 | 233 | flattenPagesTree, which is private. |
| 255 | 234 | |
| ... | ... | @@ -261,7 +240,7 @@ isPagesObject and isPageObject are reliable and can be made more |
| 261 | 240 | reliable. Maybe add a validate or repair function? It should also make |
| 262 | 241 | sure /Count and /Parent are correct. |
| 263 | 242 | |
| 264 | -refs/attic/QPDFPagesTree-old -- original, abndoned branch -- clean up | |
| 243 | +refs/attic/QPDFPagesTree-old -- original, abandoned branch -- clean up | |
| 265 | 244 | when done. |
| 266 | 245 | |
| 267 | 246 | QPDFJob | ... | ... |
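The rules listed under "JSON strings within PDF objects" in the TODO diff above, and the "obj:o g R" key rationale, can be illustrated with a minimal sketch. This is not qpdf's implementation. The function names `classify` and `object_key` are hypothetical, and two details are assumptions: "n n R" is matched with single spaces only, and any string starting with "/" is accepted as a name.

```python
import re

# "n n R": object number, generation, literal R.
# Strict single-space matching is an assumption of this sketch.
_REF = re.compile(r"([0-9]+) ([0-9]+) R")
# An even number of hex digits, mixed case allowed; may be empty.
_HEX = re.compile(r"(?:[0-9A-Fa-f]{2})*")


def classify(s: str):
    """Classify a JSON string found inside a PDF object; return (kind, value)."""
    m = _REF.fullmatch(s)
    if m:
        return ("indirect", (int(m.group(1)), int(m.group(2))))
    if s.startswith("/"):
        # Names are in canonical form ("/text/plain"), not PDF syntax.
        return ("name", s)
    if s.startswith("b:"):
        digits = s[2:]
        if not _HEX.fullmatch(digits):
            raise ValueError(f"bad binary string: {s!r}")
        return ("binary", bytes.fromhex(digits))
    if s.startswith("u:"):
        # The JSON parser has already decoded escapes such as \u03c0.
        return ("unicode", s[2:])
    # Anything else is an error.
    raise ValueError(f"unrecognized string: {s!r}")


def object_key(ref: str) -> str:
    """Turn an indirect reference "o g R" into its key in "objects"."""
    if not _REF.fullmatch(ref):
        raise ValueError(f"not an indirect reference: {ref!r}")
    return "obj:" + ref
```

For example, `object_key("3 0 R")` gives `"obj:3 0 R"`, which is the point of the prefix: resolving a reference needs no parsing or splitting, only a prepend.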
cSpell.json