Commit f1a9ba0c622deee0ed05004949b34f0126b12b6a

Authored by Jay Berkenbilt
1 parent 27a42c16

TODO: clean up remaining work for json v2

@@ -55,11 +55,7 @@ Soon: Break ground on "Document-level work" @@ -55,11 +55,7 @@ Soon: Break ground on "Document-level work"
55 Output JSON v2 55 Output JSON v2
56 ============== 56 ==============
57 57
58 -Some of this documentation has drifted from the actual implementation.  
59 -  
60 -* Document that /Length is ignored in stream dictionary replacements  
61 -  
62 -General things to remember: 58 +Remaining work:
63 59
64 * Make sure all the information from --check and other informational 60 * Make sure all the information from --check and other informational
65 options (--show-linearization, --show-encryption, --show-xref, 61 options (--show-linearization, --show-encryption, --show-xref,
@@ -68,106 +64,98 @@ General things to remember: @@ -68,106 +64,98 @@ General things to remember:
68 right keys when in json mode. I don't think I want check on by 64 right keys when in json mode. I don't think I want check on by
69 default, so that might be different. 65 default, so that might be different.
70 66
71 -* Consider changing the contract to allow fields to be absent even  
72 - when present in the schema. It's reasonable for people to check for  
73 - presence of a key. Most languages make this easy to do. 67 +Notes for documentation:
  68 +
  69 +* Find all mentions of json in the manual and update.
74 70
75 * Document typo fix in encrypt in release notes along with any other 71 * Document typo fix in encrypt in release notes along with any other
76 non-compatible json 2 changes. Scrutinize all the output to decide 72 non-compatible json 2 changes. Scrutinize all the output to decide
77 what should change. 73 what should change.
78 74
79 -* Document that keys other than "qpdf-v2" are ignored so people can  
80 - stash their own stuff.  
81 -  
82 -JSON to PDF:  
83 -  
84 -Have --json-input and --update-from-json. With --json-input, the json  
85 -file must be complete, meaning all stream data, the trailer, and the  
86 -PDF version must be present. For streams with no stream data, the  
87 -dictionary is updated but the data is left untouched. Other things  
88 -that are omitted are left alone. Make sure document that, when writing  
89 -a PDF file from QPDF, there is no expectation of object numbers being  
90 -preserved. As such, --update-from-json can only be used to update the  
91 -exact file that the json was created from. You can put multiple  
92 -objects in the update file, but you can't use a json from one file to  
93 -update the output of a previous update since the object numbers will  
94 -have changed. Note that, when creating from a JSON, object numbers are  
95 -preserved in the resulting QPDF object but still modified by  
96 -QPDFWriter for the output. This would be visible by combining  
97 ---json-output and --json-input. Also using --qdf with  
98 ---create-from-json would show original object IDs in comments. It will  
99 -be important to capture this in the documentation.  
100 -  
101 -When reading a JSON string, any string that doesn't look like a name  
102 -or indirect object or start with "b:" or "u:" should be considered an  
103 -error. Just use newUnicodeString on "u:" strings. For "b:" strings,  
104 -decode the bytes with hex_decode and use newString.  
105 -  
106 -Test case: combine --json-input and --json-output to show preservation  
107 -of object numbers. QPDFWriter won't show that although --qdf with the  
108 -original object ID comments would.  
109 -  
110 -The backing input source for createFromJSON is this memory block:  
111 -  
112 -```  
113 -%PDF-1.3  
114 -xref  
115 -0 1  
116 -0000000000 65535 f  
117 -trailer << /Size 1 >>  
118 -startxref  
119 -9  
120 -%%EOF  
121 -```  
122 -  
123 -* Ignore all keys except .qpdf-v2.  
124 -* Set this->m->pdf_version based on the .qpdf.pdfVersion key  
125 -* For each object in .qpdf.objects:  
126 - * Walk through the object detecting any indirect objects. For each  
127 - one that is not already known, reserve the object. We can also  
128 - validate but we should try to do the best we can with invalid JSON  
129 - so people can get good error messages.  
130 - * Construct a QPDFObjectHandle from the JSON  
131 - * If the object is the trailer, update the trailer  
132 - * Else if the object doesn't exist, reserve it  
133 - * If the object is reserved, call replaceReserved()  
134 - * Else the object already exists; this is an error.  
135 -  
136 -For streams, have a stream data provider that, for inline streams,  
137 -does a base64 from the file offsets and for file-based streams, reads  
138 -the file. For the inline case, we have to keep the json InputSource  
139 -around. Otherwise, we don't. It is an error if there is no stream  
140 -data. For files, we can have a stream data provider that just reads  
141 -the file. Remember QUtil::file_provider.  
142 -  
143 -Documentation:  
144 -  
145 -Serialized PDF:  
146 -  
147 -The JSON output will have a "qpdf-v2" key containing  
148 -* pdfversion  
149 -* maxobjectid  
150 -* objects  
151 -  
152 -In regular json mode, "objectinfo" is gone.  
153 -  
154 -Within .objects, the key is "obj:o g R" or "trailer", and the  
155 -value is a dictionary with exactly one of "value" or "stream" as its  
156 -single key. 75 +* Keys other than "qpdf-v2" are ignored so people can stash their own
  76 + stuff. Unknown keys are ignored at other places for future
  77 + compatibility. Readers of qpdf json should continue to ignore keys
  78 + they don't recognize.
157 79
158 -Rationale of "obj:o g R" is that indirect object references are just  
159 -"o g R", and so code that wants to resolve one can do so easily by  
160 -just prepending "obj:" and not having to parse or split the string.  
161 -Having a prefix rather than making the key just "o g R" makes it much  
162 -easier to search in the JSON for the definition of an object. 80 +* Change: names are written in canonical form with a leading slash
  81 + just as they are treated in the code. In v1, they were written in
  82 + PDF syntax in the json file. Example: /text#2fplain in pdf will be
  83 + written as /text/plain in json v2 and as /text#2fplain in json v1.
  84 +
  85 +* Document changes to strings, objects, streams, object keys.
  86 +
  87 +* CLI: --json-input, --json-output[=version], --update-from-json. With
  88 + --json-input, the input file is a JSON file instead of a PDF file.
  89 + It must be complete, meaning that a PDF version must be given, all
  90 + streams must have exactly one of data or datafile, and a trailer
  91 + dictionary must be present, even if empty.
  92 +
  93 + With --update-from-json, the JSON file updates objects in place. If
  94 + updating an old stream, if stream data is omitted, the data remains
  95 + untouched. The dictionary is always required. Remember that
  96 + QPDFWriter does not preserve object numbers, though --json-output
  97 + does. Therefore, if you want to update a PDF with a JSON, the input
  98 + to --update-from-json must be the same PDF as the one that
  99 + --json-output was run on previously. Otherwise, object numbers won't
  100 + match. Show this with an example. When updating,
  101 +
  102 +* Certain fields are ignored when reading the JSON. This includes
  103 + maxobjectid, any computed fields in trailer (such as /Size), and all
  104 + /Length keys in stream dictionaries. There is no need for the user
  105 + to correct, remove, or otherwise worry about any values those keys
  106 + might have. The maxobjectid field is present in the original output
  107 + to assist with adding new objects to the file.
  108 +
  109 +* JSON strings within PDF objects:
  110 +
  111 + * "n n R" is an indirect object
  112 +
  113 + * "/Name" is a name in canonical form with a leading slash (like
  114 + "/text/plain"), not PDF syntax (like "/text#2fplain").
  115 +
  116 + * "b:hex-digits" is a binary string ("b:feff03c0"). Hex digits may be
  117 + mixed case. There must be an even number of digits.
  118 +
  119 + * "u:utf-8" is a UTF-8 encoded string ("u:π", "u:\u03c0"). UTF-16
  120 + surrogate pairs are allowed. These are all equivalent: "u:🥔",
  121 + "u:\ud83e\udd54", "b:FEFFD83EDD54", "b:efbbbff09fa594".
  122 +
  123 + * Both "b:" and "u:" are valid representations of the empty string.
  124 +
  125 + * Anything else is an error
  126 +
  127 +* Document use of --json-input and --json-output together to show
  128 + preservation of object numbers. Draw attention to "original object
  129 + ID" comments in qdf as another way to show it.
  130 +
  131 +* Document top-level keys of "qpdf-v2" ("pdfversion", "objects",
  132 + "maxobjectid") noting that "maxobjectid" is ignored when reading.
  133 +
  134 +* Stream data: "data" is base64-encoded stream data. "datafile" is the
  135 + path to a file (relative path recommended but not required)
  136 + containing the binary data. As with any PDF representation, the data
  137 + must be consistent with the filters. --decode-level is honored by
  138 + --json-output.
  139 +
  140 +* Other changes from v1:
  141 +
  142 + * in "objects", keys are "obj:o g R" or "trailer"
  143 +
  144 + * Non-stream objects are dictionaries with a "value" key whose value
  145 + is the object. Stream objects are dictionaries with a "stream" key
  146 + whose value is {"dict": stream-dictionary}. The "/Length" key is
  147 + omitted from the stream dictionary.
  148 +
  149 + * "objectinfo" is gone as it is now possible to tell a stream from a
  150 + non-stream directly. To get stream data, use the --json-output
  151 + option. Note about how "pages" may cause the pages tree to be
  152 + corrected.
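
Reviewer sketch (not part of the commit): the string forms listed under "JSON strings within PDF objects" could be classified roughly as below. The helper names `classify` and `as_text` are hypothetical, and the BOM handling in `as_text` is an assumption about how the four "equivalent" examples are meant to line up; qpdf's own reader may differ.

```python
import re

# Assumption: "n n R" means two non-negative decimal integers and a literal R.
_INDIRECT = re.compile(r"[0-9]+ [0-9]+ R")
_HEX_PAIRS = re.compile(r"(?:[0-9A-Fa-f]{2})*")


def classify(s):
    """Classify a JSON string found inside a PDF object (hypothetical helper)."""
    if _INDIRECT.fullmatch(s):
        obj, gen, _ = s.split(" ")
        return ("indirect", (int(obj), int(gen)))
    if s.startswith("/"):
        # Names are kept in canonical form, e.g. "/text/plain".
        return ("name", s)
    if s.startswith("b:"):
        digits = s[2:]
        if not _HEX_PAIRS.fullmatch(digits):
            raise ValueError("b: needs an even number of hex digits: %r" % s)
        return ("string", bytes.fromhex(digits))
    if s.startswith("u:"):
        return ("string", s[2:])
    raise ValueError("not a name, indirect object, b: or u: string: %r" % s)


def as_text(value):
    """Decode a classified string value to text when a BOM identifies it."""
    if isinstance(value, str):
        return value
    if value.startswith(b"\xfe\xff"):
        return value[2:].decode("utf-16-be")
    if value.startswith(b"\xef\xbb\xbf"):
        return value[3:].decode("utf-8")
    raise ValueError("no BOM; encoding is not decided by this sketch")
```

A JSON parser already combines `\ud83e\udd54` into a single code point, so the escaped and literal "u:" forms reach `classify` as the same Python string.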
163 153
164 For non-streams: 154 For non-streams:
165 155
166 -{  
167 "obj:o g R": { 156 "obj:o g R": {
168 "value": ... 157 "value": ...
169 } 158 }
170 -}  
171 159
172 For streams: 160 For streams:
173 161
@@ -178,41 +166,31 @@ For streams: @@ -178,41 +166,31 @@ For streams:
178 "datafile": "path to base64-encoded data" 166 "datafile": "path to base64-encoded data"
179 } 167 }
180 } 168 }
181 -}  
182 -  
183 -At most one of "data" or "datafile" will be present. When serializing,  
184 -stream decode parameters will be obeyed, and the stream dictionary  
185 -will reflect the result. There will be the option to omit stream data.  
186 169
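
Reviewer sketch (not part of the commit): the "exactly one of data or datafile" rule for --json-input could be checked as follows. `inline_stream_data` is a hypothetical name, and the sketch assumes "data" and "datafile" sit next to "dict" inside the "stream" value. It deliberately does not read "datafile", since these notes describe that file's contents in two different ways.

```python
import base64
from typing import Optional


def inline_stream_data(entry: dict) -> Optional[bytes]:
    """Return decoded inline stream data, or None when "datafile" is used."""
    stream = entry["stream"]
    has_data = "data" in stream
    has_file = "datafile" in stream
    if has_data == has_file:
        # Both present or both absent: not acceptable for a complete file.
        raise ValueError('stream needs exactly one of "data" or "datafile"')
    if has_file:
        return None
    # "data" is base64-encoded stream data; reject stray characters.
    return base64.b64decode(stream["data"], validate=True)
```

For --update-from-json the notes allow stream data to be omitted, so a reader in that mode would presumably relax the "both absent" case.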
187 -When data is included, "/Length" is removed from the stream  
188 -dictionary.  
189 -  
190 -Streams are filtered or not based on the --decode-level parameter. If  
191 -a stream is filtered, "/Filter" and "/DecodeParms" are removed from  
192 -the stream dictionary. This makes the stream data and dictionary match  
193 -for when the file is read back in. 170 +Rationale of "obj:o g R" is that indirect object references are just
  171 +"o g R", and so code that wants to resolve one can do so easily by
  172 +just prepending "obj:" and not having to parse or split the string.
  173 +Having a prefix rather than making the key just "o g R" makes it much
  174 +easier to search in the JSON for the definition of an object.
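
Reviewer sketch (not part of the commit): the rationale above amounts to a lookup with no parsing. `resolve` and `SAMPLE` are hypothetical, and the layout under "qpdf-v2" follows the keys named in these notes rather than any released format.

```python
# A tiny hand-written document using the key layout described in the notes.
SAMPLE = {
    "qpdf-v2": {
        "pdfversion": "1.3",
        "maxobjectid": 2,
        "objects": {
            "obj:1 0 R": {"value": {"/Type": "/Catalog", "/Pages": "2 0 R"}},
            "obj:2 0 R": {"value": {"/Type": "/Pages", "/Kids": [], "/Count": 0}},
            "trailer": {"value": {"/Root": "1 0 R"}},
        },
    }
}


def resolve(doc, ref):
    """Return the entry an "o g R" reference points at (hypothetical helper)."""
    # No splitting or parsing of the reference: just prepend "obj:".
    return doc["qpdf-v2"]["objects"]["obj:" + ref]


root_ref = SAMPLE["qpdf-v2"]["objects"]["trailer"]["value"]["/Root"]
catalog = resolve(SAMPLE, root_ref)["value"]
pages = resolve(SAMPLE, catalog["/Pages"])["value"]
```

The returned entry still carries its single "value" or "stream" key, so the caller can tell a stream from a non-stream directly.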
194 175
195 CLI: 176 CLI:
196 177
197 Example workflow: 178 Example workflow:
198 -* qpdf in.pdf --json-output=2 pdf.json 179 +* qpdf in.pdf --json-output pdf.json
199 * edit pdf.json 180 * edit pdf.json
200 * qpdf --json-input pdf.json out.pdf 181 * qpdf --json-input pdf.json out.pdf
201 182
202 -* qpdf in.pdf --json-output=2 pdf.json 183 +* qpdf in.pdf --json-output pdf.json
203 * edit pdf.json keeping only objects that need to be changed 184 * edit pdf.json keeping only objects that need to be changed
204 * qpdf in.pdf --update-from-json=pdf.json out.pdf 185 * qpdf in.pdf --update-from-json=pdf.json out.pdf
205 186
206 -Update --json option in cli.rst to mention v2 and update json.rst.  
207 -  
208 -Other documentation fodder: 187 +To modify a single object:
209 188
210 -You can't create a PDF from v1 json because 189 +* qpdf in.pdf --json-output pdf.json --json-object=o,g
  190 +* edit pdf.json
  191 +* qpdf in.pdf --update-from-json=pdf.json out.pdf
211 192
212 -* Change: names are written in canonical form with a leading slash  
213 - just as they are treated in the code. In v1, they were written in  
214 - PDF syntax in the json file. Example: /text#2fplain in pdf will be  
215 - written as /text/plain in json v2 and as /text#2fplain in json v1. 193 +Historical note: you can't create a PDF from v1 json because
216 194
217 * The PDF version header is not recorded 195 * The PDF version header is not recorded
218 196
@@ -221,15 +199,16 @@ You can't create a PDF from v1 json because @@ -221,15 +199,16 @@ You can't create a PDF from v1 json because
221 * Can't tell string from name from indirect object 199 * Can't tell string from name from indirect object
222 200
223 * Strings are treated as PDF doc encoding and output as UTF-8, which 201 * Strings are treated as PDF doc encoding and output as UTF-8, which
224 - doesn't work since multiple PDF doc code points are undefined 202 + doesn't work since multiple PDF doc code points are undefined and
  203 + is absurd for binary strings
225 204
226 * There is no representation of stream data 205 * There is no representation of stream data
227 206
228 * You can't tell a stream from a dictionary except by looking in both 207 * You can't tell a stream from a dictionary except by looking in both
229 - "object" and "objectinfo". Fix this, and then remove "objectinfo". 208 + "object" and "objectinfo".
230 209
231 -Additionally, using "n n R" as a key in "objects" and "objectinfo"  
232 -messes up searching for things. 210 +* Using "n n R" as a key in "objects" and "objectinfo" makes it hard
  211 + to search for things when viewing the JSON file in an editor.
233 212
234 213
235 QPDFPagesTree 214 QPDFPagesTree
@@ -249,7 +228,7 @@ I'm thinking we will want to keep a pages cache for efficient @@ -249,7 +228,7 @@ I'm thinking we will want to keep a pages cache for efficient
249 insertion. There's no reason we can't keep a vector of page objects up 228 insertion. There's no reason we can't keep a vector of page objects up
250 to date and just do a traversal the first time we do getAllPages just 229 to date and just do a traversal the first time we do getAllPages just
251 like we do now. The difference is that we would not flatten the pages 230 like we do now. The difference is that we would not flatten the pages
252 -tree. It would be useful to go through QPDF_pages and re-reimplement 231 +tree. It would be useful to go through QPDF_pages and reimplement
253 everything without calling flattenPagesTree. Then we can remove 232 everything without calling flattenPagesTree. Then we can remove
254 flattenPagesTree, which is private. 233 flattenPagesTree, which is private.
255 234
@@ -261,7 +240,7 @@ isPagesObject and isPageObject are reliable and can be made more @@ -261,7 +240,7 @@ isPagesObject and isPageObject are reliable and can be made more
261 reliable. Maybe add a validate or repair function? It should also make 240 reliable. Maybe add a validate or repair function? It should also make
262 sure /Count and /Parent are correct. 241 sure /Count and /Parent are correct.
263 242
264 -refs/attic/QPDFPagesTree-old -- original, abndoned branch -- clean up 243 +refs/attic/QPDFPagesTree-old -- original, abandoned branch -- clean up
265 when done. 244 when done.
266 245
267 QPDFJob 246 QPDFJob
cSpell.json
@@ -429,6 +429,7 @@ @@ -429,6 +429,7 @@
429 "rdpp", 429 "rdpp",
430 "rdquo", 430 "rdquo",
431 "refcount", 431 "refcount",
  432 + "reimplement",
432 "resave", 433 "resave",
433 "retargeted", 434 "retargeted",
434 "rfont", 435 "rfont",
manual/json.rst
@@ -23,7 +23,7 @@ be extracted from PDF files using other qpdf command-line options. @@ -23,7 +23,7 @@ be extracted from PDF files using other qpdf command-line options.
23 QPDF JSON Format 23 QPDF JSON Format
24 ---------------- 24 ----------------
25 25
26 -QXXXQ Write this. 26 +XXX Write this.
27 27
28 .. _json-guarantees: 28 .. _json-guarantees:
29 29