Commit 7882b85b0691d6a669cb0b2656f1e4c7438c552b (1 parent: 3c4d2bfb)
TODO: more JSON notes
Showing 1 changed file (TODO) with 109 additions and 3 deletions
| ... | ... | @@ -39,6 +39,108 @@ Soon: Break ground on "Document-level work" |
| 39 | 39 | Output JSON v2 |
| 40 | 40 | ============== |
| 41 | 41 | |
| 42 | +---- | |
| 43 | +notes from 5/2: | |
| 44 | + | |
| 45 | +Need new pipelines: | |
| 46 | +* Pl_OStream(std::ostream) with semantics like Pl_StdioFile | |
| 47 | +* Pl_String to std::string with semantics like Pl_Buffer | |
| 48 | +* Pl_Base64 | |
| 49 | + | |
| 50 | +New Pipeline methods: | |
| 51 | +* writeString(std::string const&) | |
| 52 | +* writeCString(char*) | |
| 53 | +* writeChars(char*, size_t) | |
| 54 | + | |
| 55 | +* Consider templated operator<< which could specialize for char* and | |
| 56 | + std::string and could use std::ostringstream otherwise | |
| 57 | + | |
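A minimal sketch of how these convenience methods and the templated operator<< could fit together. The Pipeline base class here is a stand-in for qpdf's (which differs in detail), and Pl_String follows the Pl_Buffer-like semantics described above; none of this is the actual qpdf API.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Stand-in for qpdf's Pipeline base class with the proposed helpers.
class Pipeline {
  public:
    virtual ~Pipeline() = default;
    virtual void write(unsigned char const* data, size_t len) = 0;
    void writeString(std::string const& s) {
        write(reinterpret_cast<unsigned char const*>(s.data()), s.size());
    }
    void writeCString(char const* s) { writeString(std::string(s)); }
    void writeChars(char const* s, size_t len) {
        write(reinterpret_cast<unsigned char const*>(s), len);
    }
};

// Templated operator<< falls back to std::ostringstream; non-template
// overloads specialize for the string-like cases.
template <typename T>
Pipeline& operator<<(Pipeline& p, T const& t) {
    std::ostringstream oss;
    oss << t;
    p.writeString(oss.str());
    return p;
}
inline Pipeline& operator<<(Pipeline& p, std::string const& s) {
    p.writeString(s);
    return p;
}
inline Pipeline& operator<<(Pipeline& p, char const* s) {
    p.writeCString(s);
    return p;
}

// Pl_String: collects output into a std::string, like Pl_Buffer.
class Pl_String: public Pipeline {
  public:
    explicit Pl_String(std::string& out) : out(out) {}
    void write(unsigned char const* data, size_t len) override {
        out.append(reinterpret_cast<char const*>(data), len);
    }
  private:
    std::string& out;
};
```

A Pl_OStream defined the same way against a std::ostream would give all three writeX methods and the operator<< chain for free.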
| 58 | +See if I can change all output and error messages issued by the | |
| 59 | +library, when context is available, to write to a pipeline rather than a | |
| 60 | +FILE* or std::ostream. This makes it possible for people to capture | |
| 61 | +output more flexibly. | |
| 62 | + | |
| 63 | +JSON: rather than unparse() -> string, there should be a write | |
| 64 | +method that takes a pipeline and a depth. Then rewrite all the | |
| 65 | +unparse methods to use it. This makes incremental write possible | |
| 66 | +as well as writing arbitrarily large amounts of output. | |
| 67 | + | |
| 68 | +JSON::parse should work from an InputSource. BufferInputSource can | |
| 69 | +already start with a std::string. | |
| 70 | + | |
| 71 | +Have a json blob defined by a function that takes a pipeline and | |
| 72 | +writes data to the pipeline. Its writer should create a Pl_Base64 -> | |
| 73 | +Pl_Concatenate in front of the pipeline passed to write and call the | |
| 74 | +function with that. | |
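A sketch of the blob idea: the blob is defined by a function that writes raw data, and the blob's writer encodes that data before it reaches the output, so arbitrary binary data becomes a valid JSON string. A toy base64 encoder stands in for the Pl_Base64 -> Pl_Concatenate stages; it is not qpdf code.

```cpp
#include <functional>
#include <string>

// Toy base64 encoder standing in for a Pl_Base64 pipeline stage.
std::string base64_encode(std::string const& in) {
    static char const* tab =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    size_t i = 0;
    while (i + 2 < in.size()) {
        // full 3-byte group -> 4 output characters
        unsigned v = (static_cast<unsigned char>(in[i]) << 16) |
                     (static_cast<unsigned char>(in[i + 1]) << 8) |
                     static_cast<unsigned char>(in[i + 2]);
        out += tab[(v >> 18) & 63];
        out += tab[(v >> 12) & 63];
        out += tab[(v >> 6) & 63];
        out += tab[v & 63];
        i += 3;
    }
    if (i < in.size()) {
        // 1 or 2 trailing bytes, padded with '='
        unsigned v = static_cast<unsigned char>(in[i]) << 16;
        bool two = (i + 1 < in.size());
        if (two) {
            v |= static_cast<unsigned char>(in[i + 1]) << 8;
        }
        out += tab[(v >> 18) & 63];
        out += tab[(v >> 12) & 63];
        out += two ? tab[(v >> 6) & 63] : '=';
        out += '=';
    }
    return out;
}

// The blob's writer: call the user function, encode whatever it wrote,
// and emit the result as a JSON string value.
std::string write_blob(std::function<void(std::string&)> const& fn) {
    std::string raw;
    fn(raw); // caller writes its data "to the pipeline"
    return "\"" + base64_encode(raw) + "\"";
}
```

In the real design the encoding would happen streamwise inside the pipeline chain rather than on an accumulated string.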
| 75 | + | |
| 76 | +Add methods needed to do incremental writes. Basically we need to | |
| 77 | +expose the functionality of the array and dictionary unparse | |
| 78 | +methods. Maybe we can have a DictionaryWriter and an ArrayWriter | |
| 79 | +that deal with the first/depth logic and have writeElement or | |
| 80 | +writeEntry(key, value) methods. | |
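The first/depth bookkeeping the writers would own can be sketched like this. This writes into a std::string for brevity where a real version would take a pipeline; the writeEntry/writeElement names are from the note above, and the rest is illustrative.

```cpp
#include <string>

// Each writer owns its open/close delimiters, the indentation implied
// by depth, and the "comma before every element except the first" rule.
class ArrayWriter {
  public:
    ArrayWriter(std::string& out, int depth) : out(out), depth(depth) { out += "["; }
    void writeElement(std::string const& json) {
        out += first ? "\n" : ",\n";
        first = false;
        out += std::string(2 * (depth + 1), ' ') + json;
    }
    ~ArrayWriter() { out += "\n" + std::string(2 * depth, ' ') + "]"; }
  private:
    std::string& out;
    int depth;
    bool first = true;
};

class DictionaryWriter {
  public:
    DictionaryWriter(std::string& out, int depth) : out(out), depth(depth) { out += "{"; }
    void writeEntry(std::string const& key, std::string const& json) {
        out += first ? "\n" : ",\n";
        first = false;
        out += std::string(2 * (depth + 1), ' ') + "\"" + key + "\": " + json;
    }
    ~DictionaryWriter() { out += "\n" + std::string(2 * depth, ' ') + "}"; }
  private:
    std::string& out;
    int depth;
    bool first = true;
};
```

An element's value can itself be produced by a nested writer at depth + 1, which is what makes fully incremental output possible.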
| 81 | + | |
| 82 | +For json output, do not unparse to string. Use the writers instead. | |
| 83 | +Write incrementally. This changes ordering only, but we should be | |
| 84 | +able to manually update the test output for those cases. Objects | |
| 85 | +should be written in numerical order, not lexically sorted. It | |
| 86 | +probably makes sense to put the trailer at the end since that's | |
| 87 | +where it is in a regular PDF. | |
| 88 | + | |
| 89 | +When we get to full serialization, add json serialization performance | |
| 90 | +test. | |
| 91 | + | |
| 92 | +Some, if not all, of the json output functionality for v2 should move | |
| 93 | +into QPDF proper rather than living in QPDFJob. There can be a | |
| 94 | +top-level QPDF method that takes a pipeline and writes the JSON | |
| 95 | +serialization to it. | |
| 96 | + | |
| 97 | +Decide what the API/CLI will be for serializing to v2. Will it just be | |
| 98 | +part of --json or will it be its own separate thing? Probably a | |
| 99 | +serialized PDF should be its own mode that uses the same object | |
| 100 | +format as regular json mode. | |
| 101 | + | |
| 102 | +For going back from JSON to PDF, a separate utility will be needed. | |
| 103 | +It's not practical for QPDFObjectHandle to be able to read JSON | |
| 104 | +because of the special handling that is required for indirect objects, | |
| 105 | +and QPDF can't just accept JSON because the way InputSource is used is | |
| 106 | +completely different. Instead, we will need a separate utility that has | |
| 107 | +logic similar to what copyForeignObject does. It will go something | |
| 108 | +like this: | |
| 109 | + | |
| 110 | +* Create an empty QPDF (not emptyPDF, one with no objects in it at | |
| 111 | + all). This works: | |
| 112 | + | |
| 113 | +``` | |
| 114 | +%PDF-1.3 | |
| 115 | +xref | |
| 116 | +0 1 | |
| 117 | +0000000000 65535 f | |
| 118 | +trailer << /Size 1 >> | |
| 119 | +startxref | |
| 120 | +9 | |
| 121 | +%%EOF | |
| 122 | +``` | |
| 123 | + | |
| 124 | +For each object: | |
| 125 | + | |
| 126 | +* Walk through the object detecting any indirect objects. For each one | |
| 127 | + that is not already known, reserve the object. We can also validate | |
| 128 | + but we should try to do the best we can with invalid JSON so people | |
| 129 | + can get good error messages. | |
| 130 | +* Construct a QPDFObjectHandle from the JSON | |
| 131 | +* If the object is the trailer, update the trailer | |
| 132 | +* Else if the object doesn't exist, reserve it | |
| 133 | +* If the object is reserved, call replaceReserved() | |
| 134 | +* Else the object already exists; this is an error. | |
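The per-object dispatch in the steps above, modeled against a toy object table. replaceReserved() is the real QPDF method the note names; the state table and Action enum here are illustrative only.

```cpp
#include <map>
#include <string>

enum class State { absent, reserved, resolved };
enum class Action { update_trailer, create, replace_reserved, error };

// Decide what to do with one incoming JSON object, per the steps above.
Action dispatch(std::string const& key, std::map<std::string, State>& table) {
    if (key == "obj:trailer") {
        return Action::update_trailer; // trailer: update in place
    }
    State& st = table[key]; // unknown keys default to State::absent
    if (st == State::absent) {
        st = State::resolved; // object didn't exist: create it directly
        return Action::create;
    }
    if (st == State::reserved) {
        st = State::resolved; // was reserved earlier: call replaceReserved()
        return Action::replace_reserved;
    }
    return Action::error; // already resolved: duplicate definition
}
```

Reserving during the walk-for-indirect-references pass is what populates State::reserved entries before their definitions arrive.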
| 135 | + | |
| 136 | +This can almost be done through the public API. I think all we need is the | |
| 137 | +ability to create a reserved object with a specific object ID. | |
| 138 | + | |
| 139 | +The choices for json_key (job.yml) will be different for v1 and v2. | |
| 140 | +That information is already duplicated in multiple places. | |
| 141 | + | |
| 142 | +---- | |
| 143 | + | |
| 42 | 144 | Remember typo: search for "Typo" in QPDFJob::doJSONEncrypt. |
| 43 | 145 | |
| 44 | 146 | Remember to test interaction between generators and schemas. |
| ... | ... | @@ -173,21 +275,25 @@ JSON: |
| 173 | 275 | object. No dictionary merges or anything like that are performed. |
| 174 | 276 | It will call replaceObject. |
| 175 | 277 | |
| 176 | -Within .qpdf.objects, the key is "obj:o,g" or "obj:trailer", and the | |
| 278 | +Within .qpdf.objects, the key is "obj:o g R" or "obj:trailer", and the | |
| 177 | 279 | value is a dictionary with exactly one of "value" or "stream" as its |
| 178 | 280 | single key. |
| 179 | 281 | |
| 282 | +The rationale for "obj:o g R" is that indirect object references are | |
| 283 | +"o g R", and so code that wants to resolve one can do so easily by | |
| 284 | +just prepending "obj:" and not having to parse or split the string. | |
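For illustration, the lookup this key format enables (hypothetical helper, not qpdf code):

```cpp
#include <map>
#include <string>

// Because the key is "obj:o g R", resolving an indirect reference such
// as "5 0 R" is a single concatenation plus a map lookup; no parsing
// or splitting of the reference string is needed.
std::string resolve(std::map<std::string, std::string> const& objects,
                    std::string const& ref) {
    auto it = objects.find("obj:" + ref);
    return it == objects.end() ? "" : it->second;
}
```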
| 285 | + | |
| 180 | 286 | For non-streams: |
| 181 | 287 | |
| 182 | 288 | { |
| 183 | - "obj:o,g": { | |
| 289 | + "obj:o g R": { | |
| 184 | 290 | "value": ... |
| 185 | 291 | } |
| 186 | 292 | } |
| 187 | 293 | |
| 188 | 294 | For streams: |
| 189 | 295 | |
| 190 | - "obj:o,g": { | |
| 296 | + "obj:o g R": { | |
| 191 | 297 | "stream": { |
| 192 | 298 | "dict": { ... stream dictionary ... }, |
| 193 | 299 | "filterable": bool, | ... | ... |