Commit f1a9ba0c622deee0ed05004949b34f0126b12b6a

Authored by Jay Berkenbilt
1 parent 27a42c16

TODO: clean up remaining work for json v2

... ... @@ -55,11 +55,7 @@ Soon: Break ground on "Document-level work"
55 55 Output JSON v2
56 56 ==============
57 57  
58   -Some of this documentation has drifted from the actual implementation.
59   -
60   -* Document that /Length is ignored in stream dictionary replacements
61   -
62   -General things to remember:
  58 +Remaining work:
63 59  
64 60 * Make sure all the information from --check and other informational
65 61 options (--show-linearization, --show-encryption, --show-xref,
... ... @@ -68,106 +64,98 @@ General things to remember:
68 64 right keys when in json mode. I don't think I want check on by
69 65 default, so that might be different.
70 66  
71   -* Consider changing the contract to allow fields to be absent even
72   - when present in the schema. It's reasonable for people to check for
73   - presence of a key. Most languages make this easy to do.
  67 +Notes for documentation:
  68 +
  69 +* Find all mentions of json in the manual and update.
74 70  
75 71 * Document typo fix in encrypt in release notes along with any other
76 72 non-compatible json 2 changes. Scrutinize all the output to decide
77 73 what should change.
78 74  
79   -* Document that keys other than "qpdf-v2" are ignored so people can
80   - stash their own stuff.
81   -
82   -JSON to PDF:
83   -
84   -Have --json-input and --update-from-json. With --json-input, the json
85   -file must be complete, meaning all stream data, the trailer, and the
86   -PDF version must be present. For streams with no stream data, the
87   -dictionary is updated but the data is left untouched. Other things
88   -that are omitted are left alone. Make sure document that, when writing
89   -a PDF file from QPDF, there is no expectation of object numbers being
90   -preserved. As such, --update-from-json can only be used to update the
91   -exact file that the json was created from. You can put multiple
92   -objects in the update file, but you can't use a json from one file to
93   -update the output of a previous update since the object numbers will
94   -have changed. Note that, when creating from a JSON, object numbers are
95   -preserved in the resulting QPDF object but still modified by
96   -QPDFWriter for the output. This would be visible by combining
97   ---json-output and --json-input. Also using --qdf with
98   ---create-from-json would show original object IDs in comments. It will
99   -be important to capture this in the documentation.
100   -
101   -When reading a JSON string, any string that doesn't look like a name
102   -or indirect object or start with "b:" or "u:" should be considered an
103   -error. Just use newUnicodeString on "u:" strings. For "b:" strings,
104   -decode the bytes with hex_decode and use newString.
105   -
106   -Test case: combine --json-input and --json-output to show preservation
107   -of object numbers. QPDFWriter won't show that although --qdf with the
108   -original object ID comments would.
109   -
110   -The backing input source for createFromJSON is this memory block:
111   -
112   -```
113   -%PDF-1.3
114   -xref
115   -0 1
116   -0000000000 65535 f
117   -trailer << /Size 1 >>
118   -startxref
119   -9
120   -%%EOF
121   -```
122   -
123   -* Ignore all keys except .qpdf-v2.
124   -* Set this->m->pdf_version based on the .qpdf.pdfVersion key
125   -* For each object in .qpdf.objects:
126   - * Walk through the object detecting any indirect objects. For each
127   - one that is not already known, reserve the object. We can also
128   - validate but we should try to do the best we can with invalid JSON
129   - so people can get good error messages.
130   - * Construct a QPDFObjectHandle from the JSON
131   - * If the object is the trailer, update the trailer
132   - * Else if the object doesn't exist, reserve it
133   - * If the object is reserved, call replaceReserved()
134   - * Else the object already exists; this is an error.
135   -
136   -For streams, have a stream data provider that, for inline streams,
137   -does a base64 from the file offsets and for file-based streams, reads
138   -the file. For the inline case, we have to keep the json InputSource
139   -around. Otherwise, we don't. It is an error if there is no stream
140   -data. For files, we can have a stream data provider that just reads
141   -the file. Remember QUtil::file_provider.
142   -
143   -Documentation:
144   -
145   -Serialized PDF:
146   -
147   -The JSON output will have a "qpdf-v2" key containing
148   -* pdfversion
149   -* maxobjectid
150   -* objects
151   -
152   -In regular json mode, "objectinfo" is gone.
153   -
154   -Within .objects, the key is "obj:o g R" or "trailer", and the
155   -value is a dictionary with exactly one of "value" or "stream" as its
156   -single key.
  75 +* Keys other than "qpdf-v2" are ignored so people can stash their own
  76 + stuff. Unknown keys are ignored at other places for future
  77 + compatibility. Readers of qpdf json should continue to ignore keys
  78 + they don't recognize.
157 79  
158   -Rationale of "obj:o g R" is that indirect object references are just
159   -"o g R", and so code that wants to resolve one can do so easily by
160   -just prepending "obj:" and not having to parse or split the string.
161   -Having a prefix rather than making the key just "o g R" makes it much
162   -easier to search in the JSON for the definition of an object.
  80 +* Change: names are written in canonical form with a leading slash
  81 + just as they are treated in the code. In v1, they were written in
  82 + PDF syntax in the json file. Example: /text#2fplain in pdf will be
  83 + written as /text/plain in json v2 and as /text#2fplain in json v1.
  84 +
  85 +* Document changes to strings, objects, streams, object keys.
  86 +
  87 +* CLI: --json-input, --json-output[=version], --update-from-json. With
  88 + --json-input, the input file is a JSON file instead of a PDF file.
  89 + It must be complete, meaning that a PDF version must be given, all
  90 + streams must have exactly one of data or datafile, and a trailer
  91 + dictionary must be present, even if empty.
  92 +
  93 + With --update-from-json, the JSON file updates objects in place. If
  94 + updating an old stream, if stream data is omitted, the data remains
  95 + untouched. The dictionary is always required. Remember that
  96 + QPDFWriter does not preserve object numbers, though --json-output
  97 + does. Therefore, if you want to update a PDF with a JSON, the input
  98 + to --update-from-json must be the same PDF as the one that
  99 + --json-output was run on previously. Otherwise, object numbers won't
  100 + match. Show this with an example. When updating,
  101 +
  102 +* Certain fields are ignored when reading the JSON. This includes
  103 + maxobjectid, any computed fields in trailer (such as /Size), and all
  104 + /Length keys in stream dictionaries. There is no need for the user
  105 + to correct, remove, or otherwise worry about any values those keys
  106 + might have. The maxobjectid field is present in the original output
  107 + to assist with adding new objects to the file.
  108 +
  109 +* JSON strings within PDF objects:
  110 +
  111 + * "n n R" is an indirect object
  112 +
  113 + * "/Name" is a name in canonical form with a leading slash (like
  114 + "/text/plain"), not PDF syntax (like "/text#2fplain").
  115 +
  116 + * "b:hex-digits" is a binary string ("b:feff03c0"). Hex digits may be
  117 + mixed case. There must be an even number of digits.
  118 +
  119 + * "u:utf-8" is a UTF-8 encoded string ("u:π", "u:\u03c0"). UTF-16
  120 + surrogate pairs are allowed. These are all equivalent: "u:🥔",
  121 + "u:\ud83e\udd54", "b:FEFFD83EDD54", "b:efbbbff09fa594".
  122 +
  123 + * Both "b:" and "u:" are valid representations of the empty string.
  124 +
  125 + * Anything else is an error
  126 +
  127 +* Document use of --json-input and --json-output together to show
  128 + preservation of object numbers. Draw attention to "original object
  129 + ID" comments in qdf as another way to show it.
  130 +
  131 +* Document top-level keys of "qpdf-v2" ("pdfversion", "objects",
  132 + "maxobjectid") noting that "maxobjectid" is ignored when reading.
  133 +
  134 +* Stream data: "data" is base64-encoded stream data. "datafile" is the
  135 + path to a file (relative path recommended but not required)
  136 + containing the binary data. As with any PDF representation, the data
  137 + must be consistent with the filters. --decode-level is honored by
  138 + --json-output.
  139 +
  140 +* Other changes from v1:
  141 +
  142 + * in "objects", keys are "obj:o g R" or "trailer"
  143 +
  144 + * Non-stream objects are dictionaries with a "value" key whose value
  145 + is the object. Stream objects are dictionaries with a "stream" key
  146 + whose value is {"dict": stream-dictionary}. The "/Length" key is
  147 + omitted from the stream dictionary.
  148 +
  149 + * "objectinfo" is gone as it is now possible to tell a stream from a
  150 + non-stream directly. To get stream data, use the --json-output
  151 + option. Note about how "pages" may cause the pages tree to be
  152 + corrected.
163 153  
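Reviewer sketch for the "JSON strings within PDF objects" bullet above: a minimal Python classifier for the five cases listed there, written only from the rules in this TODO. It is not qpdf's implementation. The names `classify`, `is_valid`, and `binary_as_text` are hypothetical, and the byte-order-mark handling in `binary_as_text` is an assumption used only to check the "these are all equivalent" claim.

```python
import json
import re

# "n n R": object number, generation, literal R (ASCII digits only).
_INDIRECT = re.compile(r"[0-9]+ [0-9]+ R")
# Zero or more pairs of hex digits, mixed case allowed.
_HEX_PAIRS = re.compile(r"(?:[0-9a-fA-F]{2})*")


def classify(s):
    """Return (kind, value) for a JSON string inside a PDF object."""
    if _INDIRECT.fullmatch(s):
        obj, gen, _ = s.split(" ")
        return ("indirect", (int(obj), int(gen)))
    if s.startswith("/"):
        return ("name", s)  # canonical form, e.g. "/text/plain"
    if s.startswith("u:"):
        return ("unicode", s[2:])
    if s.startswith("b:"):
        digits = s[2:]
        if not _HEX_PAIRS.fullmatch(digits):
            raise ValueError("b: needs an even number of hex digits")
        return ("binary", bytes.fromhex(digits))
    raise ValueError("not a name, indirect object, b: or u: string")


def is_valid(s):
    """True if classify() accepts the string."""
    try:
        classify(s)
        return True
    except ValueError:
        return False


def binary_as_text(data):
    """Assumption: decode a binary string by its byte-order mark."""
    if data.startswith(b"\xfe\xff"):
        return data[2:].decode("utf-16-be")
    if data.startswith(b"\xef\xbb\xbf"):
        return data[3:].decode("utf-8")
    return None  # no BOM: PDF doc encoding, not handled in this sketch


# The JSON parser has already merged the surrogate pair by this point.
potato = json.loads('"u:\\ud83e\\udd54"')
```

With these definitions, `classify(potato)[1]`, `binary_as_text(classify("b:FEFFD83EDD54")[1])`, and `binary_as_text(classify("b:efbbbff09fa594")[1])` all yield the same one-character string, which matches the equivalence stated in the bullet.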
164 154 For non-streams:
165 155  
166   -{
167 156 "obj:o g R": {
168 157 "value": ...
169 158 }
170   -}
171 159  
172 160 For streams:
173 161  
... ... @@ -178,41 +166,31 @@ For streams:
178 166 "datafile": "path to base64-encoded data"
179 167 }
180 168 }
181   -}
182   -
183   -At most one of "data" or "datafile" will be present. When serializing,
184   -stream decode parameters will be obeyed, and the stream dictionary
185   -will reflect the result. There will be the option to omit stream data.
186 169  
187   -When data is included, "/Length" is removed from the stream
188   -dictionary.
189   -
190   -Streams are filtered or not based on the --decode-level parameter. If
191   -a stream is filtered, "/Filter" and "/DecodeParms" are removed from
192   -the stream dictionary. This makes the stream data and dictionary match
193   -for when the file is read back in.
  170 +Rationale of "obj:o g R" is that indirect object references are just
  171 +"o g R", and so code that wants to resolve one can do so easily by
  172 +just prepending "obj:" and not having to parse or split the string.
  173 +Having a prefix rather than making the key just "o g R" makes it much
  174 +easier to search in the JSON for the definition of an object.
194 175  
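Reviewer sketch for the rationale above: with "obj:o g R" keys, resolving a reference is a plain dictionary lookup after prepending "obj:". The sample objects and the `resolve` helper are made up for illustration; only the key and "value" layout come from this TODO.

```python
def resolve(objects, ref):
    """Look up the definition of an indirect reference such as "2 0 R"."""
    return objects["obj:" + ref]  # no parsing or splitting of the string


# Hypothetical contents of the "objects" dictionary.
objects = {
    "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
    "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
    "trailer": {"value": {"/Root": "1 0 R"}},
}

root = resolve(objects, objects["trailer"]["value"]["/Root"])
pages = resolve(objects, root["value"]["/Pages"])
```

Searching the file for `"obj:2 0 R"` then finds only the definition, while searching for `"2 0 R"` finds the references.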
195 176 CLI:
196 177  
197 178 Example workflow:
198   -* qpdf in.pdf --json-output=2 pdf.json
  179 +* qpdf in.pdf --json-output pdf.json
199 180 * edit pdf.json
200 181 * qpdf --json-input pdf.json out.pdf
201 182  
202   -* qpdf in.pdf --json-output=2 pdf.json
  183 +* qpdf in.pdf --json-output pdf.json
203 184 * edit pdf.json keeping only objects that need to be changed
204 185 * qpdf in.pdf --update-from-json=pdf.json out.pdf
205 186  
206   -Update --json option in cli.rst to mention v2 and update json.rst.
207   -
208   -Other documentation fodder:
  187 +To modify a single object:
209 188  
210   -You can't create a PDF from v1 json because
  189 +* qpdf in.pdf --json-output pdf.json --json-object=o,g
  190 +* edit pdf.json
  191 +* qpdf in.pdf --update-from-json=pdf.json out.pdf
211 192  
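Reviewer sketch for the "edit pdf.json keeping only objects that need to be changed" step above: a small Python filter that keeps selected entries of "objects" and copies every other key through unchanged. It assumes the top-level layout described in this TODO (a "qpdf-v2" key holding "pdfversion", "maxobjectid", and "objects"); `make_update` and the sample data are hypothetical, and whether --update-from-json needs the other keys is not stated here, so they are left in place.

```python
import json


def make_update(full, keep):
    """Return a copy of the parsed JSON with only the objects named in keep."""
    v2 = full["qpdf-v2"]
    kept = {key: obj for key, obj in v2["objects"].items() if key in keep}
    # Copy all other top-level and "qpdf-v2" keys through untouched.
    return {**full, "qpdf-v2": {**v2, "objects": kept}}


# Hypothetical stand-in for the parsed contents of pdf.json.
full = {
    "qpdf-v2": {
        "pdfversion": "1.7",
        "maxobjectid": 4,
        "objects": {
            "obj:1 0 R": {"value": {"/Type": "/Catalog"}},
            "obj:4 0 R": {"value": "u:old title"},
            "trailer": {"value": {"/Root": "1 0 R"}},
        },
    }
}

update = make_update(full, {"obj:4 0 R"})
update["qpdf-v2"]["objects"]["obj:4 0 R"]["value"] = "u:new title"
update_text = json.dumps(update, indent=2)
```

Because the kept object is shared with `full` rather than deep-copied, the edit on the second-to-last line is also visible through `full`; copy the object first if the original needs to stay intact.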
212   -* Change: names are written in canonical form with a leading slash
213   - just as they are treated in the code. In v1, they were written in
214   - PDF syntax in the json file. Example: /text#2fplain in pdf will be
215   - written as /text/plain in json v2 and as /text#2fplain in json v1.
  193 +Historical note: you can't create a PDF from v1 json because
216 194  
217 195 * The PDF version header is not recorded
218 196  
... ... @@ -221,15 +199,16 @@ You can't create a PDF from v1 json because
221 199 * Can't tell string from name from indirect object
222 200  
223 201 * Strings are treated as PDF doc encoding and output as UTF-8, which
224   - doesn't work since multiple PDF doc code points are undefined
  202 + doesn't work since multiple PDF doc code points are undefined and
  203 + is absurd for binary strings
225 204  
226 205 * There is no representation of stream data
227 206  
228 207 * You can't tell a stream from a dictionary except by looking in both
229   - "object" and "objectinfo". Fix this, and then remove "objectinfo".
  208 + "object" and "objectinfo".
230 209  
231   -Additionally, using "n n R" as a key in "objects" and "objectinfo"
232   -messes up searching for things.
  210 +* Using "n n R" as a key in "objects" and "objectinfo" makes it hard
  211 + to search for things when viewing the JSON file in an editor.
233 212  
234 213  
235 214 QPDFPagesTree
... ... @@ -249,7 +228,7 @@ I'm thinking we will want to keep a pages cache for efficient
249 228 insertion. There's no reason we can't keep a vector of page objects up
250 229 to date and just do a traversal the first time we do getAllPages just
251 230 like we do now. The difference is that we would not flatten the pages
252   -tree. It would be useful to go through QPDF_pages and re-reimplement
  231 +tree. It would be useful to go through QPDF_pages and reimplement
253 232 everything without calling flattenPagesTree. Then we can remove
254 233 flattenPagesTree, which is private.
255 234  
... ... @@ -261,7 +240,7 @@ isPagesObject and isPageObject are reliable and can be made more
261 240 reliable. Maybe add a validate or repair function? It should also make
262 241 sure /Count and /Parent are correct.
263 242  
264   -refs/attic/QPDFPagesTree-old -- original, abndoned branch -- clean up
  243 +refs/attic/QPDFPagesTree-old -- original, abandoned branch -- clean up
265 244 when done.
266 245  
267 246 QPDFJob
... ...
cSpell.json
... ... @@ -429,6 +429,7 @@
429 429 "rdpp",
430 430 "rdquo",
431 431 "refcount",
  432 + "reimplement",
432 433 "resave",
433 434 "retargeted",
434 435 "rfont",
... ...
manual/json.rst
... ... @@ -23,7 +23,7 @@ be extracted from PDF files using other qpdf command-line options.
23 23 QPDF JSON Format
24 24 ----------------
25 25  
26   -QXXXQ Write this.
  26 +XXX Write this.
27 27  
28 28 .. _json-guarantees:
29 29  
... ...