Commit 905e99a3141edc7d6523e8da47e624b1c1e664a3

Authored by Jay Berkenbilt
1 parent 36794a60

TODO: flesh out JSON v2 details

Showing 1 changed file with 152 additions and 36 deletions
  1 +
1 Next 2 Next
2 ==== 3 ====
3 4
@@ -9,6 +10,7 @@ Priorities for 11: @@ -9,6 +10,7 @@ Priorities for 11:
9 * cmake 10 * cmake
10 * PointerHolder -> shared_ptr 11 * PointerHolder -> shared_ptr
11 * ABI 12 * ABI
  13 +* --json default is latest
12 14
13 Misc 15 Misc
14 * Get rid of "ugly switch statements" in QUtil.cc -- replace with 16 * Get rid of "ugly switch statements" in QUtil.cc -- replace with
@@ -17,6 +19,16 @@ Misc @@ -17,6 +19,16 @@ Misc
17 * Consider exposing get_next_utf8_codepoint in QUtil 19 * Consider exposing get_next_utf8_codepoint in QUtil
18 * Add QUtil::is_explicit_utf8 that does what QPDF_String::getUTF8Val 20 * Add QUtil::is_explicit_utf8 that does what QPDF_String::getUTF8Val
19 does to detect UTF-8 encoded strings per PDF 2.0 spec. 21 does to detect UTF-8 encoded strings per PDF 2.0 spec.
  22 +* Add an option --ignore-encryption to ignore encryption information
  23 + and treat encrypted files as if they weren't encrypted. This should
  24 + make it possible to solve #598 (--show-encryption without a
  25 + password). We'll need to make sure we don't try to filter any
  26 + streams in this mode. Ideally we should be able to combine this with
  27 + --json so we can look at the raw encrypted strings and streams if we
  28 + want to. Since providing the password may reveal additional details,
  29 + --show-encryption could potentially retry with this option if the
  30 + first time doesn't work. Then, with the file open, we can read the
  31 + encryption dictionary normally.
20 32
21 Soon: Break ground on "Document-level work" 33 Soon: Break ground on "Document-level work"
22 34
@@ -82,21 +94,17 @@ A .clang-format file can be created at the top of the repository. @@ -82,21 +94,17 @@ A .clang-format file can be created at the top of the repository.
82 Output JSON v2 94 Output JSON v2
83 ============== 95 ==============
84 96
85 -Output JSON v2 contain enough information to completely recreate a PDF  
86 -file.  
87 -  
88 -This is not an ABI change as long as the default --json version is 1. 97 +Output JSON v2 will contain enough information to completely recreate
  98 +a PDF file. In other words, qpdf will have full, bidirectional,
  99 +lossless json serialization/deserialization of PDF.
89 100
90 If this is done, update --json option in cli.rst to mention v2. Also 101 If this is done, update --json option in cli.rst to mention v2. Also
91 update QPDFJob::Config::json and of course other parts of the docs 102 update QPDFJob::Config::json and of course other parts of the docs
92 (json.rst). 103 (json.rst).
93 104
94 -Fix the following problems: 105 +You can't create a PDF from v1 json because
95 106
96 -* Include the PDF version header somewhere.  
97 -  
98 -* Using "n n R" as a key in "objects" and "objectinfo" messes up  
99 - searching for things 107 +* The PDF version header is not recorded
100 108
101 * Strings cannot be unambiguously encoded/decoded 109 * Strings cannot be unambiguously encoded/decoded
102 110
@@ -110,36 +118,83 @@ Fix the following problems: @@ -110,36 +118,83 @@ Fix the following problems:
110 * You can't tell a stream from a dictionary except by looking in both 118 * You can't tell a stream from a dictionary except by looking in both
111 "object" and "objectinfo". Fix this, and then remove "objectinfo". 119 "object" and "objectinfo". Fix this, and then remove "objectinfo".
112 120
113 -* There are differences between information shown in the json format  
114 - vs. information shown with options like --check, --list-attachments, 121 +Additionally, using "n n R" as a key in "objects" and "objectinfo"
  122 +messes up searching for things.
  123 +
  124 +For json v2:
  125 +
  126 +* Make sure it is possible to serialize and deserialize a PDF to JSON
  127 + without loading the whole thing into memory. This is substantial. It
  128 + means we need sax-style parsing and handling so we can
  129 + handle/generate objects as we go. We'll have to be able to keep
  130 + track of keys for dictionary error checking. May want to add json to
  131 + large file tests.
  132 +
  133 +* Resolve differences between information shown in the json format vs.
  134 + information shown with options like --check, --list-attachments,
115 etc. The json format should be able to completely replace things 135 etc. The json format should be able to completely replace things
116 - that write to stdout. 136 + that write to stdout. Be sure getAllPages() and other top-level
  137 + convenience routines are there so people don't need to parse the
  138 + pages tree themselves. For many workflows, it should be possible for
  139 + someone to work in the json file based on json metadata rather than
  140 + calling the QPDF API. (Of course, you still need the QPDF API for
  141 + higher level helper objects.)
117 142
118 * Consider using camelCase in multi-word key names to be consistent 143 * Consider using camelCase in multi-word key names to be consistent
119 with job JSON and with how JSON is often represented in languages 144 with job JSON and with how JSON is often represented in languages
120 - that use it more natively 145 + that use it more natively.
121 146
122 * Consider changing the contract to allow fields to be absent even 147 * Consider changing the contract to allow fields to be absent even
123 when present in the schema. It's reasonable for people to check for 148 when present in the schema. It's reasonable for people to check for
124 presence of a key. Most languages make this easy to do. 149 presence of a key. Most languages make this easy to do.
125 150
  151 +* If we allow --json to be mixed with --ignore-encryption, we must
  152 + emphasize that the resulting json can't be turned back into a valid
  153 + PDF.
  154 +
126 Most things that are informational can stay the same. We will have to 155 Most things that are informational can stay the same. We will have to
127 -go through every item to decide for sure. 156 +go through every item to decide for sure, especially when camelCase is
  157 +taken into consideration.
  158 +
  159 +New APIs:
128 160
129 -To address ambiguity, consider the following: 161 +QPDFObjectHandle::parseJSON(QPDF* context, JSON);
  162 +QPDFObjectHandle::parseJSON(QPDF* context, std::string const&);
  163 +operator ""_qpdf_json
  164 +C API to create a QPDFObjectHandle from a json string
130 165
131 -Whenever a direct PDF object appears, disambiguate things represented  
132 -in JSON as strings as follows: 166 +JSON::parseFile
  167 +QPDF::parseJSON(JSON) (like parseFile, etc. -- deserializes json)
  168 +QPDF::updateFromJSON(JSON)
133 169
134 -* "/Name" -- if it starts with /, it's a name  
135 -* "n n R" -- if it is "n n R", it's an indirect object  
136 -* "u:utf8-encoded" -- a utf8-encoded string  
137 -* "b:<12ab34>" -- a binary string 170 +CLI: --infile-is-json -- indicate that the input is a qpdf json file
  171 +rather than a PDF file
  172 +CLI: --update-from-json=file.json
138 173
139 -In "objects", the key is "obj:o,g", and the value is a dictionary with  
140 -exactly one of "value" or "stream" as its single key. 174 +Have a "qpdf" key in the output that contains "jsonVersion",
  175 +"pdfVersion", and "objects". This replaces the "objects" field at the
  176 +top level. "objects" and "objectinfo" disappear from the top-level.
  177 +".version" and ".qpdf.jsonVersion" will match. The input to parseJSON
  178 +and updateFromJSON will have to have the "qpdf" key in it. All other
  179 +keys are ignored.
141 180
142 -For non-streams, the value of "value" is as described above. 181 +When creating from a JSON file, the JSON must be complete with data
  182 +for all streams, a trailer, and a pdfVersion. When updating from a
  183 +JSON:
  184 +
  185 +* Any object whose value is null (not "value": null, but just null) is
  186 + deleted.
  187 +* For any stream that appears without stream data, the stream data is
  188 + left alone.
  189 +* Otherwise, the object from the JSON completely replaces the input
  190 + object. No dictionary merges or anything like that are performed.
  191 + It will call replaceObject.
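
The three update rules above can be sketched as follows. This is a minimal illustration in Python on plain dictionaries, not qpdf's implementation: `update_objects` is a hypothetical helper, and the `"data"` key is a stand-in for whichever key ends up holding stream data.

```python
import copy


def update_objects(existing, updates):
    """Apply update-from-JSON rules to a table of objects.

    Both arguments map keys such as "obj:1,0" to either {"value": ...}
    or {"stream": {"dict": ..., "data": ...}}. The "data" key is a
    hypothetical placeholder for the stream data field.
    """
    result = copy.deepcopy(existing)
    for key, new in updates.items():
        # Rule 1: a bare null (not {"value": null}) deletes the object.
        if new is None:
            result.pop(key, None)
            continue
        new = copy.deepcopy(new)
        stream = new.get("stream")
        old_stream = result.get(key, {}).get("stream")
        # Rule 2: a stream given without data keeps the existing data.
        if stream is not None and "data" not in stream and old_stream:
            if "data" in old_stream:
                stream["data"] = old_stream["data"]
        # Rule 3: otherwise the new object replaces the old one wholesale;
        # dictionaries are never merged.
        result[key] = new
    return result
```

Note that `{"value": None}` is kept as a PDF null object, while a bare `None` removes the entry, which mirrors the distinction drawn in the first rule.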
  192 +
  193 +Within .qpdf.objects, the key is "obj:o,g" or "obj:trailer", and the
  194 +value is a dictionary with exactly one of "value" or "stream" as its
  195 +single key.
  196 +
  197 +For non-streams:
143 198
144 { 199 {
145 "obj:o,g": { 200 "obj:o,g": {
@@ -149,7 +204,6 @@ For non-streams, the value of "value" is as described above. @@ -149,7 +204,6 @@ For non-streams, the value of "value" is as described above.
149 204
150 For streams: 205 For streams:
151 206
152 -{  
153 "obj:o,g": { 207 "obj:o,g": {
154 "stream": { 208 "stream": {
155 "dict": { ... stream dictionary ... }, 209 "dict": { ... stream dictionary ... },
@@ -160,27 +214,89 @@ For streams: @@ -160,27 +214,89 @@ For streams:
160 } 214 }
161 } 215 }
162 216
163 -Notes about stream data: 217 +Wherever a PDF object appears in the JSON output, including "value"
  218 +and "stream"."dict" above as well as other places where they might
  219 +appear, objects are represented as follows:
  220 +
  221 +* Arrays, dictionaries, booleans, nulls, integers, and real numbers
  222 + with no more than six decimal places are represented as their native
  223 + JSON type.
  224 +* Real numbers with more than six decimal places are represented as
  225 + "r:{real-value}".
  226 +* Names: "/Name" -- internal/canonical representation (e.g.
  227 + "/Text/Plain", not #xx quoted)
  228 +* Indirect objects: "n n R"
  229 +* Strings: one of
  230 + "s:json string treated as Unicode"
  231 + "b:json string treated as bytes; character > \u00ff is an error"
  232 + "e:base64-encoded bytes"
  233 +
  234 +Test cases: these are the same:
  235 +* "b:\u00cf\u0080", "s:π", "s:\u03c0", and "e:z4A="
  236 +* "b:\u00f0\u009f\u00a5\u0094", "s:🥔", "s:\ud83e\udd54", and "e:8J+llA=="
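
The relationship between the "b:" and "e:" forms can be checked mechanically. The sketch below is an illustration in Python, not part of qpdf; `b_form` and `e_form` are hypothetical helpers. It assumes the byte sequence being represented is the UTF-8 encoding of the character, which is what the "e:" values in the test cases decode to.

```python
import base64
import json


def b_form(data: bytes) -> str:
    # Each byte becomes the code point with the same value (0x00-0xff).
    return "b:" + data.decode("latin-1")


def e_form(data: bytes) -> str:
    # The same bytes, base64-encoded.
    return "e:" + base64.b64encode(data).decode("ascii")


pi_bytes = "\u03c0".encode("utf-8")          # U+03C0, GREEK SMALL LETTER PI
potato_bytes = "\U0001F954".encode("utf-8")  # U+1F954, POTATO

for data in (pi_bytes, potato_bytes):
    # json.dumps escapes non-ASCII characters as \uXXXX by default.
    print(json.dumps(b_form(data)), json.dumps(e_form(data)))

# A JSON parser already treats the escaped surrogate pair and the literal
# character as the same string, so both "s:" spellings are one value.
assert json.loads('"s:\\ud83e\\udd54"') == "s:\U0001F954"
```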
  237 +
  238 +When creating output from a string:
  239 +* If the string is explicitly unicode (UTF-8 or UTF-16), encode as
  240 + "s:" without the leading U+FEFF
  241 +* Else if the string can be bidirectionally mapped between pdf-doc and
  242 + unicode, transcode to unicode and encode as "s:"
  243 +* Else if the string would be decoded as binary, encode as "e:"
  244 +* Else encode as "b:"
  245 +
  246 +When reading a string, any string that doesn't follow the above rules
  247 +is an error. This includes "r:" strings not parsable as a real
  248 +number, "/Name" strings containing a NUL character, "s:" or "b:"
  249 +strings that are not valid JSON strings, "b:" strings containing
  250 +character values > 0xff, or "e:" values that are not valid base64.
  251 +Once the string is read in, if the "s:" string can be bidirectionally
  252 +mapped between pdf-doc and unicode, store as PDFDoc. Otherwise store
  253 +as UTF-16BE. "b:" strings are stored as bytes, and "e:" are decoded
  254 +and stored as bytes.
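
The reading rules could be sketched roughly as below. This is a hedged illustration in Python rather than the planned C++ code: `parse_json_string` is a hypothetical name, the regular expression for reals is an assumption about what counts as parsable, and the final storage step (PDFDoc versus UTF-16BE for "s:" strings) is left out.

```python
import base64
import binascii
import re

INDIRECT_RE = re.compile(r"^[0-9]+ [0-9]+ R$")
# Assumed syntax for a real: optional sign, digits, optional decimal point.
REAL_RE = re.compile(r"^[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)$")


def parse_json_string(s: str):
    """Classify an already-decoded JSON string as a PDF object.

    Returns a (kind, payload) tuple and raises ValueError for anything
    that does not follow the rules.
    """
    if INDIRECT_RE.match(s):
        obj, gen, _ = s.split(" ")
        return ("indirect", (int(obj), int(gen)))
    if s.startswith("/"):
        if "\x00" in s:
            raise ValueError("name contains a NUL character")
        return ("name", s)
    if s.startswith("r:"):
        if not REAL_RE.match(s[2:]):
            raise ValueError("r: value is not a real number")
        return ("real", s[2:])
    if s.startswith("s:"):
        return ("unicode", s[2:])
    if s.startswith("b:"):
        try:
            # latin-1 maps code points 0x00-0xff to single bytes and
            # rejects anything larger.
            return ("bytes", s[2:].encode("latin-1"))
        except UnicodeEncodeError:
            raise ValueError("b: string has a character > 0xff")
    if s.startswith("e:"):
        try:
            return ("bytes", base64.b64decode(s[2:], validate=True))
        except binascii.Error:
            raise ValueError("e: value is not valid base64")
    raise ValueError("string does not match any known form")
```

Checking "n n R" before the prefixed forms keeps the order of tests unambiguous, since none of the other forms can start with a digit.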
  255 +
  256 +Implementing this will require some refactoring of things between
  257 +QUtil and QPDF_String, plus we will need to implement a base64
  258 +encoder/decoder.
  259 +
  260 +This enables a workflow like this:
  261 +
  262 +* qpdf --json=latest infile.pdf > pdf.json
  263 +* modify pdf.json
  264 +* qpdf infile.pdf --update-from-json=pdf.json out.pdf
  265 +
  266 +or
  267 +
  268 +* qpdf --json=latest --json-stream-data=raw|filtered infile.pdf > pdf.json
  269 +* modify pdf.json
  270 +* qpdf pdf.json --infile-is-json out.pdf
  271 +
  272 +Notes about streams and stream data:
  273 +
  274 +* Always include "dict". "/Length" is removed from the stream
  275 + dictionary.
164 276
165 -* Always include "dict". 277 +* Add new flag --json-stream-data={raw,filtered,none}. At most one of
  278 + "raw" and "filtered" will appear for each stream. If "filtered"
  279 + appears, "/Filter" and "/DecodeParms" are removed from the stream
  280 + dictionary. This makes the stream data and dictionary match for when
  281 + the file is read back in.
166 282
167 * Always include "filterable" regardless of value of 283 * Always include "filterable" regardless of value of
168 --json-stream-data. The value of filterable is influenced by 284 --json-stream-data. The value of filterable is influenced by
169 --decode-level, which is already in parameters. 285 --decode-level, which is already in parameters.
170 286
171 -* Add new flag --json-stream-data={raw,filtered,none}. At most one of  
172 - "raw" and "filtered" will appear for each stream.  
173 -  
174 * Add to parameters: value of json-stream-data, default is none 287 * Add to parameters: value of json-stream-data, default is none
175 288
176 -* If none, omit stream data entirely 289 +* If --json-stream-data=none, omit stream data entirely
177 290
178 -* If raw, include raw stream data as base64 291 +* If --json-stream-data=raw, include raw stream data as base64. Show
  292 + the data even for unfiltered streams in "raw".
179 293
180 -* If filtered, including the base64-encoded filtered stream data if we  
181 - can and should decode it based on decode-level. Otherwise, include  
182 - the base64-encoded raw data. See if we can honor  
183 - --normalize-content. 294 +* If --json-stream-data=filtered, include the base64-encoded filtered
  295 + stream data if we can and should decode it based on decode-level.
  296 + Otherwise, include the base64-encoded raw data. See if we can honor
  297 + --normalize-content. If a stream appears unfiltered in the input,
  298 + still show it as filtered. Remove /DecodeParms and /Filter if
  299 + filtering.
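
The stream rules in this list can be summarized in a short sketch. It is written in Python for illustration only and is not qpdf code: `stream_to_json` is a hypothetical helper, the caller is assumed to pass `decoded_data=None` when the stream cannot or should not be decoded at the current decode level, and placing the fallback data under "raw" is an assumption, since the notes do not name the key used in that case.

```python
import base64


def _b64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")


def stream_to_json(stream_dict, raw_data, mode, decoded_data=None):
    """Build the "stream" value for one stream object.

    mode is the value of --json-stream-data: "none", "raw" or "filtered".
    For a stream with no filters, the caller passes decoded_data equal
    to raw_data so that it is still shown as filtered.
    """
    # "/Length" is always removed from the stream dictionary.
    d = {k: v for k, v in stream_dict.items() if k != "/Length"}
    # "dict" and "filterable" are always present, whatever the mode.
    out = {"dict": d, "filterable": decoded_data is not None}
    if mode == "none":
        return out
    if mode == "raw":
        out["raw"] = _b64(raw_data)
    elif mode == "filtered":
        if decoded_data is not None:
            # Data and dictionary must agree when the file is read back.
            d.pop("/Filter", None)
            d.pop("/DecodeParms", None)
            out["filtered"] = _b64(decoded_data)
        else:
            # Assumption: undecodable data falls back to the "raw" key.
            out["raw"] = _b64(raw_data)
    else:
        raise ValueError("unknown json-stream-data mode: " + mode)
    return out
```

With this shape, at most one of "raw" and "filtered" appears for any stream, and a reader can tell which by the key alone.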
184 300
185 Note that --json-stream-data=filtered is different from 301 Note that --json-stream-data=filtered is different from
186 --filtered-stream-data in that --filtered-stream-data implies 302 --filtered-stream-data in that --filtered-stream-data implies