Commit 2a92b1b0d6e389c9b033fffe1fc2821a63ca1621

Authored by Jay Berkenbilt
1 parent 0500d434

TODO: solidify remaining json v2 work

Showing 1 changed file with 167 additions and 262 deletions
... ... @@ -10,6 +10,10 @@ In order:
10 10  
11 11 Other (do in any order):
12 12  
  13 +* See if I can change all output and error messages issued by the
  14 + library, when context is available, to have a pipeline rather than a
  15 + FILE* or std::ostream. This makes it possible for people to capture
  16 + output more flexibly.
13 17 * Make job JSON accept a single element and treat as an array of one
14 18 when an array is expected. This allows for making things repeatable
15 19 in the future without breaking compatibility and is needed for the
... ... @@ -20,10 +24,11 @@ Other (do in any order):
20 24 password). We'll need to make sure we don't try to filter any
21 25 streams in this mode. Ideally we should be able to combine this with
22 26 --json so we can look at the raw encrypted strings and streams if we
23   - want to. Since providing the password may reveal additional details,
24   - --show-encryption could potentially retry with this option if the
25   - first time doesn't work. Then, with the file open, we can read the
26   - encryption dictionary normally.
  27 + want to, though be sure to document that the resulting JSON won't be
  28 + convertible back to a valid PDF. Since providing the password may
  29 + reveal additional details, --show-encryption could potentially retry
  30 + with this option if the first time doesn't work. Then, with the file
  31 + open, we can read the encryption dictionary normally.
27 32 * Find all places in the code that write to std::cout, std::cerr,
28 33 stdout, or stderr to make sure they obey default output stream
29 34 settings for QPDF and QPDFJob. This probably includes adding a
... ... @@ -43,209 +48,92 @@ Soon: Break ground on "Document-level work"
43 48 Output JSON v2
44 49 ==============
45 50  
46   -----
47   -notes from 5/2:
48   -
49   -See if I can change all output and error messages issued by the
50   -library, when context is available, to have a pipeline rather than a
51   -FILE* or std::ostream. This makes it possible for people to capture
52   -output more flexibly.
53   -
54   -For json output, do not unparse to string. Use the writers instead.
55   -Write incrementally. This changes ordering only, but we should be able
56   -to manually update the test output for those cases. Objects should be
57   -written in numerical order, not lexically sorted. It probably makes
58   -sense to put the trailer at the end since that's where it is in a
59   -regular PDF.
60   -
61   -When we get to full serialization, add json serialization performance
62   -test.
63   -
64   -Some if not all of the json output functionality for v2 should move
65   -into QPDF proper rather than living in QPDFJob. There can be a
66   -top-level QPDF method that takes a pipeline and writes the JSON
67   -serialization to it.
68   -
69   -Decide what the API/CLI will be for serializing to v2. Will it just be
70   -part of --json or will it be its own separate thing? Probably we
71   -should make it so that a serialized PDF is different but uses the same
72   -object format as regular json mode.
73   -
74   -For going back from JSON to PDF, a separate utility will be needed.
75   -It's not practical for QPDFObjectHandle to be able to read JSON
76   -because of the special handling that is required for indirect objects,
77   -and QPDF can't just accept JSON because the way InputSource is used is
78   -completely different. Instead, we will need a separate utility that has
79   -logic similar to what copyForeignObject does. It will go something
80   -like this:
81   -
82   -* Create an empty QPDF (not emptyPDF, one with no objects in it at
83   - all). This works:
84   -
85   -```
86   -%PDF-1.3
87   -xref
88   -0 1
89   -0000000000 65535 f
90   -trailer << /Size 1 >>
91   -startxref
92   -9
93   -%%EOF
94   -```
95   -
96   -For each object:
97   -
98   -* Walk through the object detecting any indirect objects. For each one
99   - that is not already known, reserve the object. We can also validate
100   - but we should try to do the best we can with invalid JSON so people
101   - can get good error messages.
102   -* Construct a QPDFObjectHandle from the JSON
103   -* If the object is the trailer, update the trailer
104   -* Else if the object doesn't exist, reserve it
105   -* If the object is reserved, call replaceReserved()
106   -* Else the object already exists; this is an error.
107   -
108   -This can almost be done through public API. I think all we need is the
109   -ability to create a reserved object with a specific object ID.
110   -
111   -The choices for json_key (job.yml) will be different for v1 and v2.
112   -That information is already duplicated in multiple places.
113   -
114   -----
115   -
116   -Remember typo: search for "Typo" In QPDFJob::doJSONEncrypt.
117   -
118   -Remember to test interaction between generators and schemas.
119   -
120   -Should I have allowed array and object generators? Or maybe just
121   -string generators for stream data?
122   -
123   -When switching to generators for output, it's going to be very
124   -important not to break the logic around having things that look at all
125   -objects going first. Right now, there are good tests for it -- if you
126   -either comment out pushInheritedAttributesToPage or do something that
127   -postpones serializing the objects from allObjects (or even getting
128   -them), you get test failures either way. However, if we were to
129   -blindly overwrite test files, we might accidentally lose this. We will
130   -have to try to get most of the logic working before trying to use
131   -generators. Or maybe we shouldn't use generators at all for the
132   -objects and only use it for the stream data. Or maybe we can use
133   -generators but write it out early by exposing the depth() parameter.
134   -That might actually be the safest way to do it. But that will be hard
135   -with schemas. Another thing might be to not combine serializing with
136   -other kinds of metadata.
137   -
138   -Output JSON v2 will contain enough information to completely recreate
139   -a PDF file. In other words, qpdf will have full, bidirectional,
140   -lossless json serialization/deserialization of PDF.
141   -
142   -If this is done, update --json option in cli.rst to mention v2. Also
143   -update QPDFJob::Config::json and of course other parts of the docs
144   -(json.rst).
145   -
146   -You can't create a PDF from v1 json because
147   -
148   -* The PDF version header is not recorded
  51 +Before starting on v2 format:
  52 +
  53 +* Some if not all of the json output functionality should move from
  54 + QPDFJob to QPDF. There can be top-level QPDF methods that take a
  55 + pipeline and write the JSON serialization to it. For things that
  56 + generate smaller amounts of output (constant-size stuff, lists of
  57 + attachments), we can also have a version that returns a string. For
  58 + the benefit of users of other languages, we can have something that
  59 + takes a FILE* or writes to stdout as well. This would be a good time
  60 + to make sure all the information from --check and other
  61 + informational options (--show-linearization, --show-encryption,
  62 + --show-xref, --list-attachments, --show-npages) is available in the
  63 + json output.
  64 +
  65 +* Writing objects should write in numerical order with the trailer at
  66 + the end.
  67 +
  68 +* Having QPDFJob call these methods will change output ordering. We
  69 + should fix the json test outputs manually (or programmatically from
  70 + the input), not by overwriting, in case this has any unwanted side
  71 + effects.
  72 +
  73 +* Figure out how/whether to do schema checks with incremental write.
  74 + Consider changing the contract to allow fields to be absent even
  75 + when present in the schema. It's reasonable for people to check for
  76 + presence of a key. Most languages make this easy to do.
149 77  
150   -* Strings cannot be unambiguously encoded/decoded
  78 +General things to remember:
151 79  
152   - * Can't tell string from name from indirect object
  80 +* deprecate getJSON without a version
153 81  
154   - * Strings are treated as PDF doc encoding and output as UTF-8, which
155   - doesn't work since multiple PDF doc code points are undefined
  82 +* The choices for json_key (job.yml) will be different for v1 and v2.
  83 + That information is already duplicated in multiple places.
156 84  
157   -* There is no representation of stream data
158   -
159   -* You can't tell a stream from a dictionary except by looking in both
160   - "object" and "objectinfo". Fix this, and then remove "objectinfo".
161   -
162   -Additionally, using "n n R" as a key in "objects" and "objectinfo"
163   -messes up searching for things.
164   -
165   -For json v2:
166   -
167   -* Make sure it is possible to serialize and deserialize a PDF to JSON
168   - without loading the whole thing into memory.
169   -
170   - * As with a regular PDF, we can load everything into memory at once
171   - except stream data.
172   -
173   - * I think we can do this by having the concept of generated values,
174   - which we can make just be strings. We would have a JSON subclass
175   - whose value is a lambda that gets called to generate output. When
176   - we construct the JSON the stream values would be lambda functions
177   - that generate the stream data.
178   -
179   - * When we parse the file, we'll have to have a way for the parser to
180   - know that it should create a lambda that reads the data from the
181   - file. I think this means we want something that parses JSON from
182   - an input source. It would have to keep track of the offset and
183   - length of a value from the input source and have a (probably a
184   - lambda that it can call with a path) that would indicate whether
185   - to store the value or whether to create a lambda that retrieves
186   - it. We would have to keep a std::shared_ptr<InputSource> around.
187   -
188   - * Add json to the large file tests.
189   -
190   -* Resolve differences between information shown in the json format vs.
191   - information shown with options like --check, --list-attachments,
192   - etc. The json format should be able to completely replace things
193   - that write to stdout. Be sure getAllPages() and other top-level
194   - convenience routines are there so people don't need to parse the
195   - pages tree themselves. For many workflows, it should be possible for
196   - someone to work in the json file based on json metadata rather than
197   - calling the QPDF API. (Of course, you still need the QPDF API for
198   - higher level helper objects.)
  85 +* Remember typo: search for "Typo" in QPDFJob::doJSONEncrypt.
199 86  
200 87 * Consider using camelCase in multi-word key names to be consistent
201 88 with job JSON and with how JSON is often represented in languages
202 89 that use it more natively.
203 90  
204   -* Consider changing the contract to allow fields to be absent even
205   - when present in the schema. It's reasonable for people to check for
206   - presence of a key. Most languages make this easy to do.
  91 +* When we get to full serialization, add json serialization
  92 + performance test.
207 93  
208   -* If we allow --json to be mixed with --ignore-encryption, we must
209   - emphasize that the resulting json can't be turned back into a valid
210   - PDF.
  94 +* Add json to the large file tests.
211 95  
212   -Most things that are informational can stay the same. We will have to
213   -go through every item to decide for sure, especially when camelCase is
214   -taken into consideration.
  96 +* We could consider arguments like --replace-object that would take a
  97 + JSON representation of the object and could include indirect
  98 + references, etc. We could also add --delete object.
215 99  
216   -New APIs:
  100 +Object Representation:
217 101  
218   -QPDFObjectHandle::parseJSON(QPDF* context, JSON);
219   -QPDFObjectHandle::parseJSON(QPDF* context, std::string const&);
220   -operator ""_qpdf_json
221   -C API to create a QPDFObjectHandle from a json string
  102 +* Arrays, dictionaries, booleans, nulls, integers, and real numbers
  103 + are represented as their native JSON type. Real numbers that are out
  104 + of range will just be dealt with however whatever JSON parser is
  105 + in use deals with them. Numbers like that shouldn't appear in PDF and,
  106 + if they do, they won't work right for anything. QPDF's JSON
  107 + representation allows for arbitrary precision.
  108 +* Names: "/Name" -- internal/canonical representation (e.g.
  109 + "/Text/Plain", not #xx quoted)
  110 +* Indirect objects: "n n R"
  111 +* Strings: one of
  112 + "u:json utf-8-encoded string"
  113 + "b:hex-encoded bytes"
  114 + Test cases: these are the same:
  115 + * "b:cf80", "b:CF80", "u:π", "u:\u03c0"
  116 + * "b:d83edd54", "u:🥔", "u:\ud83e\udd54"
222 117  
223   -JSON::parseFile
224   -QPDF::parseJSON(JSON) (like parseFile, etc. -- deserializes json)
225   -QPDF::updateFromJSON(JSON)
  118 +When creating output from a string:
  119 +* If the string is explicitly unicode (UTF-8 or UTF-16), encode as
  120 + "u:" without the leading U+FEFF
  121 +* Else if the string can be bidirectionally mapped between pdf-doc and
  122 + unicode, transcode to unicode and encode as "u:"
  123 +* Else encode as "b:"
226 124  
227   -CLI: --infile-is-json -- indicate that the input is a qpdf json file
228   -rather than a PDF file
229   -CLI: --update-from-json=file.json
  125 +When reading a JSON string, any string that doesn't follow the above rules
  126 +is an error. Just use newUnicodeString on "u:" strings. For "b:"
  127 +strings, decode the bytes with hex_decode and use newString.
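
The reading rules above can be sketched as a small standalone decoder. This is only an illustration under stated assumptions: `DecodedString`, `hexValue`, and `decodeJsonString` are hypothetical names, and the hex decoding is written out by hand here rather than calling the library's own hex_decode. In the real code the two branches would presumably feed newUnicodeString and newString as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical result type: which prefix was seen, plus the payload.
struct DecodedString {
    bool is_unicode;   // true for "u:", false for "b:"
    std::string bytes; // UTF-8 text for "u:", raw bytes for "b:"
};

// Convert one hex digit to its numeric value; anything else is an error.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("invalid hex digit in b: string");
}

// Decode a JSON string that represents a PDF string.
// Any string without a "u:" or "b:" prefix is rejected.
DecodedString decodeJsonString(std::string const& s) {
    if (s.compare(0, 2, "u:") == 0) {
        return {true, s.substr(2)};
    }
    if (s.compare(0, 2, "b:") == 0) {
        std::string hex = s.substr(2);
        if (hex.size() % 2 != 0) {
            throw std::invalid_argument("odd number of hex digits");
        }
        std::string out;
        for (std::size_t i = 0; i < hex.size(); i += 2) {
            int byte = hexValue(hex[i]) * 16 + hexValue(hex[i + 1]);
            out.push_back(static_cast<char>(byte));
        }
        assert(out.size() == hex.size() / 2);
        return {false, out};
    }
    throw std::invalid_argument("string must start with u: or b:");
}
```

With this sketch, "b:cf80" and "b:CF80" decode to the same two bytes, which are also the UTF-8 encoding of the "u:π" payload from the first test case.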
230 128  
231   -Have a "qpdf" key in the output that contains "jsonVersion",
232   -"pdfVersion", and "objects". This replaces the "objects" field at the
233   -top level. "objects" and "objectinfo" disappear from the top-level.
234   -".version" and ".qpdf.jsonVersion" will match. The input to parseJSON
235   -and updateFromJSON will have to have the "qpdf" key in it. All other
236   -keys are ignored.
  129 +Serialized PDF:
237 130  
238   -When creating from a JSON file, the JSON must be complete with data
239   -for all streams, a trailer, and a pdfVersion. When updating from a
240   -JSON:
  131 +The JSON output will have a "qpdf" key containing
  132 +* jsonVersion
  133 +* pdfVersion
  134 +* objects
241 135  
242   -* Any object whose value is null (not "value": null, but just null) is
243   - deleted.
244   -* For any stream that appears without stream data, the stream data is
245   - left alone.
246   -* Otherwise, the object from the JSON completely replaces the input
247   - object. No dictionary merges or anything like that are performed.
248   - It will call replaceObject.
  136 +The "qpdf" key replaces "objects" and "objectinfo" in v1 JSON.
249 137  
250 138 Within .qpdf.objects, the key is "obj:o g R" or "obj:trailer", and the
251 139 value is a dictionary with exactly one of "value" or "stream" as its
... ... @@ -254,6 +142,8 @@ single key.
254 142 Rationale of "obj:o g R" is that indirect object references are just
255 143 "o g R", and so code that wants to resolve one can do so easily by
256 144 just prepending "obj:" and not having to parse or split the string.
  145 +Having a prefix rather than making the key just "o g R" makes it much
  146 +easier to search in the JSON for the definition of an object.
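
A minimal sketch of the lookup this rationale describes, assuming the parsed ".qpdf.objects" dictionary is held in a plain map. `ObjectTable`, `objectKey`, and `resolve` are hypothetical names used for illustration only, not qpdf API.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the parsed ".qpdf.objects" dictionary:
// key is "obj:o g R" or "obj:trailer", value is the raw JSON text.
using ObjectTable = std::map<std::string, std::string>;

// An indirect reference "o g R" becomes a key by prepending "obj:".
// No parsing or splitting of the reference string is needed.
std::string objectKey(std::string const& ref) {
    return "obj:" + ref;
}

// Return a pointer to the object's JSON, or nullptr if it is not defined.
std::string const* resolve(ObjectTable const& objects, std::string const& ref) {
    auto it = objects.find(objectKey(ref));
    return it == objects.end() ? nullptr : &it->second;
}
```

The same prefix is what makes a text search for "obj:3 0 R" land on the definition of an object rather than on every reference to it.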
257 147  
258 148 For non-streams:
259 149  
... ... @@ -268,101 +158,116 @@ For streams:
268 158 "obj:o g R": {
269 159 "stream": {
270 160 "dict": { ... stream dictionary ... },
271   - "filterable": bool,
272   - "raw": "base64-encoded raw data",
273   - "filtered": "base64-encoded filtered data"
  161 + "data": "base64-encoded data",
  162 + "dataFile": "path to base64-encoded data"
274 163 }
275 164 }
276 165 }
277 166  
278   -Wherever a PDF object appears in the JSON output, including "value"
279   -and "stream"."dict" above as well as other places where they might
280   -appear, objects are represented as follows:
  167 +At most one of "data" or "dataFile" will be present. When serializing,
  168 +stream decode parameters will be obeyed, and the stream dictionary
  169 +will reflect the result. There will be the option to omit stream data.
281 170  
282   -* Arrays, dictionaries, booleans, nulls, integers, and real numbers
283   - with no more than six decimal places are represented as their native
284   - JSON type.
285   -* Real numbers with more than six decimal places are represented as
286   - "r:{real-value}".
287   -* Names: "/Name" -- internal/canonical representation (e.g.
288   - "/Text/Plain", not #xx quoted)
289   -* Indirect objects: "n n R"
290   -* Strings: one of
291   - "s:json string treated as Unicode"
292   - "b:json string treated as bytes; character > \u00ff is an error"
293   - "e:base64-encoded bytes"
  171 +In the stream dictionary, "/Length" is always removed.
294 172  
295   -Test cases: these are the same:
296   -* "b:\u00c8\u0080", "s:π", "s:\u03c0", and "e:z4A="
297   -* "b:\u00d8\u003e\u00dd\u0054", "s:🥔", "s:\ud83e\udd54", and "e:8J+llA=="
  173 +Streams are filtered or not based on the --decode-level parameter. If
  174 +a stream is filtered, "/Filter" and "/DecodeParms" are removed from
  175 +the stream dictionary. This makes the stream data and dictionary match
  176 +for when the file is read back in.
298 177  
299   -When creating output from a string:
300   -* If the string is explicitly unicode (UTF-8 or UTF-16), encode as
301   - "s:" without the leading U+FEFF
302   -* Else if the string can be bidirectionally mapped between pdf-doc and
303   - unicode, transcode to unicode and encode as "s:"
304   -* Else if the string would be decoded as binary, encode as "e:"
305   -* Else encode as "b:"
  178 +CLI:
306 179  
307   -When reading a string, any string that doesn't follow the above rules
308   -is an error. This includes "r:" strings not parseable as a real
309   -number, "/Name" strings containing a NUL character, "s:" or "b:"
310   -strings that are not valid JSON strings, "b:" strings containing
311   -character values > 0xff, or "e:" values that are not valid base64.
312   -Once the string is read in, if the "s:" string can be bidirectionally
313   -mapped between pdf-doc and unicode, store as PDFDoc. Otherwise store
314   -as UTF-16BE. "b:" strings are stored as bytes, and "e:" are decoded
315   -and stored as bytes.
  180 +* Add new flags
316 181  
317   -Implementing this will require some refactoring of things between
318   -QUtil and QPDF_String, plus we will need to implement a base64
319   -encoder/decoder.
  182 + * --from-json=input.json -- signals reading from a JSON and counts
  183 + as an input file.
320 184  
321   -This enables a workflow like this:
  185 + * --json-streams-omit -- stream data is omitted, the default
322 186  
323   -* qpdf --json=latest infile.pdf > pdf.json
324   -* modify pdf.json
325   -* qpdf infile.pdf --update-from=pdf.json out.pdf
  187 + * --json-streams-inline -- stream data is included in the "data"
  188 + key as base64-encoded
326 189  
327   -or
  190 + * --json-streams-file-prefix=prefix -- stream is written to $prefix-$obj
  191 + where $obj is the object number. The path to the file is stored
  192 + in the "dataFile" key. A relative path is recommended and will be
  193 + interpreted as relative to the current directory. If a relative
  194 + prefix is given, a relative path will be stored in "dataFile".
  195 + Example:
  196 + mkdir in-streams
  197 + qpdf in.pdf --json-streams-file-prefix=in-streams/ > out.json
328 198  
329   -* qpdf --json=latest --json-stream-data=raw|filtered infile.pdf > pdf.json
330   -* modify pdf.json
331   -* qpdf pdf.json --infile-is-json out.pdf
  199 + * --to-json -- changes default to --json-streams-inline and implies
  200 + --json-key=qpdf
332 201  
333   -Notes about streams and stream data:
  202 +Example workflow:
  203 +* qpdf in.pdf --to-json > pdf.json
  204 +* edit pdf.json
  205 +* qpdf --from-json=pdf.json out.pdf
334 206  
335   -* Always include "dict". "/Length" is removed from the stream
336   - dictionary.
  207 +JSON to PDF:
337 208  
338   -* Add new flag --json-stream-data={raw,filtered,none}. At most one of
339   - "raw" and "filtered" will appear for each stream. If "filtered"
340   - appears, "/Filter" and "/DecodeParms" are removed from the stream
341   - dictionary. This makes the stream data and dictionary match for when
342   - the file is read back in.
  209 +For going back from JSON to PDF, we can have
  210 +QPDF::fromJSON(std::shared_ptr<InputSource>), which will have logic
  211 +similar to copyForeignObject. Note that this InputSource is not going
  212 +to be this->file. We have to keep it separately.
343 213  
344   -* Always include "filterable" regardless of value of
345   - --json-stream-data. The value of filterable is influenced by
346   - --decode-level, which is already in parameters.
  214 +The backing input source is this memory block:
347 215  
348   -* Add to parameters: value of json-stream-data, default is none
  216 +```
  217 +%PDF-1.3
  218 +xref
  219 +0 1
  220 +0000000000 65535 f
  221 +trailer << /Size 1 >>
  222 +startxref
  223 +9
  224 +%%EOF
  225 +```
  226 +
  227 +* Ignore all keys except .qpdf.
  228 +* Verify that .qpdf.jsonVersion is 2
  229 +* Set this->m->pdf_version based on the .qpdf.pdfVersion key
  230 +* For each object in .qpdf.objects:
  231 + * Walk through the object detecting any indirect objects. For each
  232 + one that is not already known, reserve the object. We can also
  233 + validate but we should try to do the best we can with invalid JSON
  234 + so people can get good error messages.
  235 + * Construct a QPDFObjectHandle from the JSON
  236 + * If the object is the trailer, update the trailer
  237 + * Else if the object doesn't exist, reserve it
  238 + * If the object is reserved, call replaceReserved()
  239 + * Else the object already exists; this is an error.
  240 +
  241 +For streams, have a stream data provider that, for inline streams,
  242 +does a base64 decode from the file offsets and, for file-based streams, reads
  243 +the file. For the inline case, we have to keep the json InputSource
  244 +around. Otherwise, we don't. It is an error if there is no stream data.
  245 +
  246 +Documentation:
  247 +
  248 +Update --json option in cli.rst to mention v2 and update json.rst.
  249 +
  250 +Other documentation fodder:
  251 +
  252 +You can't create a PDF from v1 json because
  253 +
  254 +* The PDF version header is not recorded
  255 +
  256 +* Strings cannot be unambiguously encoded/decoded
  257 +
  258 + * Can't tell string from name from indirect object
349 259  
350   -* If --json-stream-data=none, omit stream data entirely
  260 + * Strings are treated as PDF doc encoding and output as UTF-8, which
  261 + doesn't work since multiple PDF doc code points are undefined
351 262  
352   -* If --json-stream-data=raw, include raw stream data as base64. Show
353   - the data even for unfiltered streams in "raw".
  263 +* There is no representation of stream data
354 264  
355   -* If --json-stream-data=filtered, include the base64-encoded filtered
356   - stream data if we can and should decode it based on decode-level.
357   - Otherwise, include the base64-encoded raw data. See if we can honor
358   - --normalize-content. If a stream appears unfiltered in the input,
359   - still show it as filtered. Remove /DecodeParms and /Filter if
360   - filtering.
  265 +* You can't tell a stream from a dictionary except by looking in both
  266 + "object" and "objectinfo". Fix this, and then remove "objectinfo".
  267 +
  268 +Additionally, using "n n R" as a key in "objects" and "objectinfo"
  269 +messes up searching for things.
361 270  
362   -Note that --json-stream-data=filtered is different from
363   ---filtered-stream-data in that --filtered-stream-data implies
364   ---decode-level=all while --json-stream-data=filtered does not. Make
365   -sure this is mentioned in the help for both options.
366 271  
367 272 QPDFJob
368 273 =======
... ...