Commit 905e99a3141edc7d6523e8da47e624b1c1e664a3

Authored by Jay Berkenbilt
1 parent 36794a60

TODO: flesh out JSON v2 details

Showing 1 changed file with 152 additions and 36 deletions
  1 +
1 2 Next
2 3 ====
3 4  
... ... @@ -9,6 +10,7 @@ Priorities for 11:
9 10 * cmake
10 11 * PointerHolder -> shared_ptr
11 12 * ABI
  13 +* --json default is latest
12 14  
13 15 Misc
14 16 * Get rid of "ugly switch statements" in QUtil.cc -- replace with
... ... @@ -17,6 +19,16 @@ Misc
17 19 * Consider exposing get_next_utf8_codepoint in QUtil
18 20 * Add QUtil::is_explicit_utf8 that does what QPDF_String::getUTF8Val
19 21 does to detect UTF-8 encoded strings per PDF 2.0 spec.
  22 +* Add an option --ignore-encryption to ignore encryption information
  23 + and treat encrypted files as if they weren't encrypted. This should
  24 + make it possible to solve #598 (--show-encryption without a
  25 + password). We'll need to make sure we don't try to filter any
  26 + streams in this mode. Ideally we should be able to combine this with
  27 + --json so we can look at the raw encrypted strings and streams if we
  28 + want to. Since providing the password may reveal additional details,
  29 + --show-encryption could potentially retry with this option if the
  30 + first time doesn't work. Then, with the file open, we can read the
  31 + encryption dictionary normally.
20 32  
21 33 Soon: Break ground on "Document-level work"
22 34  
... ... @@ -82,21 +94,17 @@ A .clang-format file can be created at the top of the repository.
82 94 Output JSON v2
83 95 ==============
84 96  
85   -Output JSON v2 contain enough information to completely recreate a PDF
86   -file.
87   -
88   -This is not an ABI change as long as the default --json version is 1.
  97 +Output JSON v2 will contain enough information to completely recreate
  98 +a PDF file. In other words, qpdf will have full, bidirectional,
  99 +lossless json serialization/deserialization of PDF.
89 100  
90 101 If this is done, update --json option in cli.rst to mention v2. Also
91 102 update QPDFJob::Config::json and of course other parts of the docs
92 103 (json.rst).
93 104  
94   -Fix the following problems:
  105 +You can't create a PDF from v1 json because
95 106  
96   -* Include the PDF version header somewhere.
97   -
98   -* Using "n n R" as a key in "objects" and "objectinfo" messes up
99   - searching for things
  107 +* The PDF version header is not recorded
100 108  
101 109 * Strings cannot be unambiguously encoded/decoded
102 110  
... ... @@ -110,36 +118,83 @@ Fix the following problems:
110 118 * You can't tell a stream from a dictionary except by looking in both
111 119 "object" and "objectinfo". Fix this, and then remove "objectinfo".
112 120  
113   -* There are differences between information shown in the json format
114   - vs. information shown with options like --check, --list-attachments,
  121 +Additionally, using "n n R" as a key in "objects" and "objectinfo"
  122 +messes up searching for things.
  123 +
  124 +For json v2:
  125 +
  126 +* Make sure it is possible to serialize and deserialize a PDF to JSON
  127 + without loading the whole thing into memory. This is substantial. It
  128 + means we need sax-style parsing and handling so we can
  129 + handle/generate objects as we go. We'll have to be able to keep
  130 + track of keys for dictionary error checking. May want to add json to
  131 + large file tests.
  132 +
  133 +* Resolve differences between information shown in the json format vs.
  134 + information shown with options like --check, --list-attachments,
115 135 etc. The json format should be able to completely replace things
116   - that write to stdout.
  136 + that write to stdout. Be sure getAllPages() and other top-level
  137 + convenience routines are there so people don't need to parse the
  138 + pages tree themselves. For many workflows, it should be possible for
  139 + someone to work in the json file based on json metadata rather than
  140 + calling the QPDF API. (Of course, you still need the QPDF API for
  141 + higher level helper objects.)
117 142  
118 143 * Consider using camelCase in multi-word key names to be consistent
119 144 with job JSON and with how JSON is often represented in languages
120   - that use it more natively
  145 + that use it more natively.
121 146  
122 147 * Consider changing the contract to allow fields to be absent even
123 148 when present in the schema. It's reasonable for people to check for
124 149 presence of a key. Most languages make this easy to do.
125 150  
  151 +* If we allow --json to be mixed with --ignore-encryption, we must
  152 + emphasize that the resulting json can't be turned back into a valid
  153 + PDF.
  154 +
126 155 Most things that are informational can stay the same. We will have to
127   -go through every item to decide for sure.
  156 +go through every item to decide for sure, especially when camelCase is
  157 +taken into consideration.
  158 +
  159 +New APIs:
128 160  
129   -To address ambiguity, consider the following:
  161 +QPDFObjectHandle::parseJSON(QPDF* context, JSON);
  162 +QPDFObjectHandle::parseJSON(QPDF* context, std::string const&);
  163 +operator ""_qpdf_json
  164 +C API to create a QPDFObjectHandle from a json string
130 165  
131   -Whenever a direct PDF object appears, disambiguate things represented
132   -in JSON as strings as follows:
  166 +JSON::parseFile
  167 +QPDF::parseJSON(JSON) (like parseFile, etc. -- deserializes json)
  168 +QPDF::updateFromJSON(JSON)
133 169  
134   -* "/Name" -- if it starts with /, it's a name
135   -* "n n R" -- if it is "n n R", it's an indirect object
136   -* "u:utf8-encoded" -- a utf8-encoded string
137   -* "b:<12ab34>" -- a binary string
  170 +CLI: --infile-is-json -- indicate that the input is a qpdf json file
  171 +rather than a PDF file
  172 +CLI: --update-from-json=file.json
138 173  
139   -In "objects", the key is "obj:o,g", and the value is a dictionary with
140   -exactly one of "value" or "stream" as its single key.
  174 +Have a "qpdf" key in the output that contains "jsonVersion",
  175 +"pdfVersion", and "objects". This replaces the "objects" field at the
  176 +top level. "objects" and "objectinfo" disappear from the top level.
  177 +".version" and ".qpdf.jsonVersion" will match. The input to parseJSON
  178 +and updateFromJSON will have to have the "qpdf" key in it. All other
  179 +keys are ignored.
141 180  
142   -For non-streams, the value of "value" is as described above.
  181 +When creating from a JSON file, the JSON must be complete with data
  182 +for all streams, a trailer, and a pdfVersion. When updating from a
  183 +JSON:
  184 +
  185 +* Any object whose value is null (not "value": null, but just null) is
  186 + deleted.
  187 +* For any stream that appears without stream data, the stream data is
  188 + left alone.
  189 +* Otherwise, the object from the JSON completely replaces the input
  190 + object. No dictionary merges or anything like that are performed.
  191 + It will call replaceObject.
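
The update rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not qpdf's implementation: `apply_update` is a hypothetical name, the object tables are plain dicts keyed by "obj:o,g", and stream data is assumed to live under a "raw" or "filtered" key inside "stream", as the stream notes in this TODO suggest.

```python
import copy

# Assumed names of the stream-data keys inside a "stream" entry.
DATA_KEYS = ("raw", "filtered")


def apply_update(objects, updates):
    """Return a new object table with `updates` applied (hypothetical helper)."""
    result = dict(objects)
    for key, new_obj in updates.items():
        if new_obj is None:
            # A bare null (not {"value": null}) deletes the object.
            result.pop(key, None)
            continue
        replacement = copy.deepcopy(new_obj)
        new_stream = replacement.get("stream")
        if new_stream is not None and not any(k in new_stream for k in DATA_KEYS):
            # Stream given without data: keep the existing stream data.
            old_stream = (result.get(key) or {}).get("stream", {})
            for k in DATA_KEYS:
                if k in old_stream:
                    new_stream[k] = old_stream[k]
        # Otherwise the new object replaces the old one outright; no merging.
        result[key] = replacement
    return result
```

Note that the stream dictionary is still replaced wholesale when the data is left alone; only the data itself carries over.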
  192 +
  193 +Within .qpdf.objects, the key is "obj:o,g" or "obj:trailer", and the
  194 +value is a dictionary with exactly one of "value" or "stream" as its
  195 +single key.
  196 +
  197 +For non-streams:
143 198  
144 199 {
145 200 "obj:o,g": {
... ... @@ -149,7 +204,6 @@ For non-streams, the value of "value" is as described above.
149 204  
150 205 For streams:
151 206  
152   -{
153 207 "obj:o,g": {
154 208 "stream": {
155 209 "dict": { ... stream dictionary ... },
... ... @@ -160,27 +214,89 @@ For streams:
160 214 }
161 215 }
162 216  
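
A consumer of this layout might validate keys and entries roughly as below. This is a sketch, not part of qpdf: the function names are hypothetical, and it assumes "o" and "g" are the decimal object and generation numbers.

```python
import re

# Assumes "o,g" means decimal object number and generation number.
_KEY_RE = re.compile(r"obj:(\d+),(\d+)")


def parse_object_key(key):
    """Return "trailer" or an (object, generation) tuple for a key."""
    if key == "obj:trailer":
        return "trailer"
    match = _KEY_RE.fullmatch(key)
    if match is None:
        raise ValueError(f"not a valid object key: {key!r}")
    return (int(match.group(1)), int(match.group(2)))


def object_kind(entry):
    """Return "value" or "stream"; the entry must have exactly that one key."""
    if len(entry) != 1 or next(iter(entry)) not in ("value", "stream"):
        raise ValueError('entry must have exactly one of "value" or "stream"')
    return next(iter(entry))
```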
163   -Notes about stream data:
  217 +Wherever a PDF object appears in the JSON output, including "value"
  218 +and "stream"."dict" above as well as other places where they might
  219 +appear, objects are represented as follows:
  220 +
  221 +* Arrays, dictionaries, booleans, nulls, integers, and real numbers
  222 + with no more than six decimal places are represented as their native
  223 + JSON type.
  224 +* Real numbers with more than six decimal places are represented as
  225 + "r:{real-value}".
  226 +* Names: "/Name" -- internal/canonical representation (e.g.
  227 + "/Text/Plain", not #xx quoted)
  228 +* Indirect objects: "n n R"
  229 +* Strings: one of
  230 + "s:json string treated as Unicode"
  231 + "b:json string treated as bytes; character > \u00ff is an error"
  232 + "e:base64-encoded bytes"
  233 +
  234 +Test cases: these are the same:
  235 +* "b:\u00cf\u0080", "s:π", "s:\u03c0", and "e:z4A="
  236 +* "b:\u00f0\u009f\u00a5\u0094", "s:🥔", "s:\ud83e\udd54", and "e:8J+llA=="
  237 +
  238 +When creating output from a string:
  239 +* If the string is explicitly unicode (UTF-8 or UTF-16), encode as
  240 + "s:" without the leading U+FEFF
  241 +* Else if the string can be bidirectionally mapped between pdf-doc and
  242 + unicode, transcode to unicode and encode as "s:"
  243 +* Else if the string would be decoded as binary, encode as "e:"
  244 +* Else encode as "b:"
  245 +
  246 +When reading a string, any string that doesn't follow the above rules
  247 +is an error. This includes "r:" strings not parseable as a real
  248 +number, "/Name" strings containing a NUL character, "s:" or "b:"
  249 +strings that are not valid JSON strings, "b:" strings containing
  250 +character values > 0xff, or "e:" values that are not valid base64.
  251 +Once the string is read in, if the "s:" string can be bidirectionally
  252 +mapped between pdf-doc and unicode, store as PDFDoc. Otherwise store
  253 +as UTF-16BE. "b:" strings are stored as bytes, and "e:" are decoded
  254 +and stored as bytes.
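
The reading rules could be prototyped along these lines. This is a rough sketch rather than the planned C++ code: `classify` is a hypothetical name, the real-number pattern is an assumption about what counts as parseable, and the PDFDoc/UTF-16BE storage decision for "s:" strings is left out.

```python
import base64
import re

_INDIRECT_RE = re.compile(r"\d+ \d+ R")
# Assumed syntax for "r:" values: optional sign, digits, optional fraction.
_REAL_RE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def classify(value):
    """Map an already-parsed JSON string to a (kind, payload) pair."""
    if value.startswith("/"):
        if "\x00" in value:
            raise ValueError("name contains a NUL character")
        return ("name", value)
    if _INDIRECT_RE.fullmatch(value):
        return ("indirect", value)
    prefix, body = value[:2], value[2:]
    if prefix == "r:":
        if not _REAL_RE.fullmatch(body):
            raise ValueError("r: value is not a real number")
        return ("real", body)
    if prefix == "s:":
        return ("string", body)  # Unicode text
    if prefix == "b:":
        if any(ord(ch) > 0xFF for ch in body):
            raise ValueError("b: string has a character above 0xff")
        return ("bytes", body.encode("latin-1"))
    if prefix == "e:":
        # validate=True rejects non-base64 input (binascii.Error is a ValueError).
        return ("bytes", base64.b64decode(body, validate=True))
    raise ValueError("unrecognized string form")
```

For example, `classify("e:z4A=")` yields the two bytes CF 80, which are the UTF-8 encoding of π.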
  255 +
  256 +Implementing this will require some refactoring of things between
  257 +QUtil and QPDF_String, plus we will need to implement a base64
  258 +encoder/decoder.
  259 +
  260 +This enables a workflow like this:
  261 +
  262 +* qpdf --json=latest infile.pdf > pdf.json
  263 +* modify pdf.json
  264 +* qpdf infile.pdf --update-from-json=pdf.json out.pdf
  265 +
  266 +or
  267 +
  268 +* qpdf --json=latest --json-stream-data=raw|filtered infile.pdf > pdf.json
  269 +* modify pdf.json
  270 +* qpdf pdf.json --infile-is-json out.pdf
  271 +
  272 +Notes about streams and stream data:
  273 +
  274 +* Always include "dict". "/Length" is removed from the stream
  275 + dictionary.
164 276  
165   -* Always include "dict".
  277 +* Add new flag --json-stream-data={raw,filtered,none}. At most one of
  278 + "raw" and "filtered" will appear for each stream. If "filtered"
  279 + appears, "/Filter" and "/DecodeParms" are removed from the stream
  280 + dictionary. This makes the stream data and dictionary match for when
  281 + the file is read back in.
166 282  
167 283 * Always include "filterable" regardless of value of
168 284 --json-stream-data. The value of filterable is influenced by
169 285 --decode-level, which is already in parameters.
170 286  
171   -* Add new flag --json-stream-data={raw,filtered,none}. At most one of
172   - "raw" and "filtered" will appear for each stream.
173   -
174 287 * Add to parameters: value of json-stream-data, default is none
175 288  
176   -* If none, omit stream data entirely
  289 +* If --json-stream-data=none, omit stream data entirely
177 290  
178   -* If raw, include raw stream data as base64
  291 +* If --json-stream-data=raw, include raw stream data as base64. Show
  292 + the data even for unfiltered streams in "raw".
179 293  
180   -* If filtered, including the base64-encoded filtered stream data if we
181   - can and should decode it based on decode-level. Otherwise, include
182   - the base64-encoded raw data. See if we can honor
183   - --normalize-content.
  294 +* If --json-stream-data=filtered, include the base64-encoded filtered
  295 + stream data if we can and should decode it based on decode-level.
  296 + Otherwise, include the base64-encoded raw data. See if we can honor
  297 + --normalize-content. If a stream appears unfiltered in the input,
  298 + still show it as filtered. Remove /DecodeParms and /Filter if
  299 + filtering.
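
To make the interaction between the three modes concrete, here is a small Python sketch of how one stream entry might be assembled. It is illustrative only: `stream_entry` is a hypothetical helper, `decoded` stands for the filtered bytes when the stream can and should be decoded at the current decode level (else None), and placing "filterable" next to "dict" is an assumption.

```python
import base64


def _b64(data):
    return base64.b64encode(data).decode("ascii")


def stream_entry(stream_dict, raw, decoded, mode):
    """Build the {"stream": ...} entry for mode "none", "raw", or "filtered"."""
    # "/Length" is always dropped; work on a copy of the dictionary.
    out_dict = {k: v for k, v in stream_dict.items() if k != "/Length"}
    entry = {"dict": out_dict, "filterable": decoded is not None}
    if mode == "raw":
        entry["raw"] = _b64(raw)
    elif mode == "filtered":
        if decoded is not None:
            # Dictionary must match the data when the file is read back in.
            out_dict.pop("/Filter", None)
            out_dict.pop("/DecodeParms", None)
            entry["filtered"] = _b64(decoded)
        else:
            # Fall back to raw data when the stream can't be decoded.
            entry["raw"] = _b64(raw)
    elif mode != "none":
        raise ValueError(f"unknown --json-stream-data value: {mode!r}")
    return {"stream": entry}
```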
184 300  
185 301 Note that --json-stream-data=filtered is different from
186 302 --filtered-stream-data in that --filtered-stream-data implies
... ...