Commit 7882b85b0691d6a669cb0b2656f1e4c7438c552b
1 parent 3c4d2bfb
TODO: more JSON notes
Showing 1 changed file with 109 additions and 3 deletions
TODO
| @@ -39,6 +39,108 @@ Soon: Break ground on "Document-level work" | @@ -39,6 +39,108 @@ Soon: Break ground on "Document-level work" | ||
| 39 | Output JSON v2 | 39 | Output JSON v2 |
| 40 | ============== | 40 | ============== |
| 41 | 41 | ||
| 42 | +---- | ||
| 43 | +notes from 5/2: | ||
| 44 | + | ||
| 45 | +Need new pipelines: | ||
| 46 | +* Pl_OStream(std::ostream) with semantics like Pl_StdioFile | ||
| 47 | +* Pl_String to std::string with semantics like Pl_Buffer | ||
| 48 | +* Pl_Base64 | ||
| 49 | + | ||
| 50 | +New Pipeline methods: | ||
| 51 | +* writeString(std::string const&) | ||
| 52 | +* writeCString(char*) | ||
| 53 | +* writeChars(char*, size_t) | ||
| 54 | + | ||
| 55 | +* Consider templated operator<< which could specialize for char* and | ||
| 56 | + std::string and could use std::ostringstream otherwise | ||
| 57 | + | ||
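A minimal sketch of how the proposed write methods and the templated operator<< might fit together. `StringSink` is a hypothetical stand-in that accumulates into a std::string (roughly what the notes call Pl_String); it is not qpdf's Pipeline class, and the const-qualified pointer parameters are an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>

// Hypothetical string-backed sink; illustrative only.
class StringSink
{
  public:
    void writeChars(char const* data, size_t len) { buf_.append(data, len); }
    void writeCString(char const* cstr)
    {
        assert(cstr != nullptr);
        writeChars(cstr, std::strlen(cstr));
    }
    void writeString(std::string const& s) { writeChars(s.data(), s.size()); }
    std::string const& str() const { return buf_; }

  private:
    std::string buf_;
};

// Generic fallback: format the value through std::ostringstream.
template <typename T>
StringSink& operator<<(StringSink& p, T const& value)
{
    std::ostringstream os;
    os << value;
    p.writeString(os.str());
    return p;
}

// Non-template overloads for the two common cases; these skip the
// intermediate stream and are preferred over the template when both match.
inline StringSink& operator<<(StringSink& p, std::string const& s)
{
    p.writeString(s);
    return p;
}

inline StringSink& operator<<(StringSink& p, char const* s)
{
    p.writeCString(s);
    return p;
}
```

With this shape, `sink << "n=" << 42 << std::string("!")` sends the literal and the std::string straight to the sink and formats only the integer through the stream.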
| 58 | +See if I can change all output and error messages issued by the | ||
| 59 | +library, when context is available, to have a pipeline rather than a | ||
| 60 | +FILE* or std::ostream. This makes it possible for people to capture | ||
| 61 | +output more flexibly. | ||
| 62 | + | ||
| 63 | +JSON: rather than unparse() -> string, there should be a write method | ||
| 64 | +that takes a pipeline and a depth. Then rewrite all the unparse | ||
| 65 | +methods to use it. This makes incremental write possible as well as | ||
| 66 | +writing arbitrarily large amounts of output. | ||
| 67 | + | ||
| 68 | +JSON::parse should work from an InputSource. BufferInputSource can | ||
| 69 | +already start with a std::string. | ||
| 70 | + | ||
| 71 | +Have a json blob defined by a function that takes a pipeline and | ||
| 72 | +writes data to the pipeline. Its writer should create a Pl_Base64 -> | ||
| 73 | +Pl_Concatenate in front of the pipeline passed to write and call the | ||
| 74 | +function with that. | ||
| 75 | + | ||
| 76 | +Add methods needed to do incremental writes. Basically we need to | ||
| 77 | +expose the functionality of the array and dictionary unparse methods. Maybe | ||
| 78 | +we can have a DictionaryWriter and an ArrayWriter that deal with the | ||
| 79 | +first/depth logic and have writeElement or writeEntry(key, value) | ||
| 80 | +methods. | ||
| 81 | + | ||
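The DictionaryWriter and ArrayWriter idea in the paragraph above could look roughly like the following. This is only a sketch under assumptions: it writes to a std::ostream rather than a pipeline, takes values as already-serialized JSON text, does no key escaping, and the `ContainerWriter` base and `finish()` method are hypothetical names. The part it illustrates is the shared first/depth bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical base class holding the first-element and depth logic.
class ContainerWriter
{
  public:
    ContainerWriter(std::ostream& out, size_t depth, char open, char close) :
        out_(out),
        depth_(depth),
        close_(close)
    {
        out_ << open;
    }

    // Call exactly once, after the last element has been written.
    void finish()
    {
        assert(!finished_);
        finished_ = true;
        if (!first_) {
            out_ << "\n" << std::string(2 * depth_, ' ');
        }
        out_ << close_;
    }

  protected:
    // Emit the separator and indentation that precede the next element.
    void next()
    {
        assert(!finished_);
        out_ << (first_ ? "\n" : ",\n") << std::string(2 * (depth_ + 1), ' ');
        first_ = false;
    }

    std::ostream& out_;

  private:
    size_t depth_;
    char close_;
    bool first_{true};
    bool finished_{false};
};

class ArrayWriter: public ContainerWriter
{
  public:
    ArrayWriter(std::ostream& out, size_t depth) :
        ContainerWriter(out, depth, '[', ']')
    {
    }
    // json is assumed to be valid, already-serialized JSON.
    void writeElement(std::string const& json)
    {
        next();
        out_ << json;
    }
};

class DictionaryWriter: public ContainerWriter
{
  public:
    DictionaryWriter(std::ostream& out, size_t depth) :
        ContainerWriter(out, depth, '{', '}')
    {
    }
    // key is assumed to need no escaping; json as in ArrayWriter.
    void writeEntry(std::string const& key, std::string const& json)
    {
        next();
        out_ << '"' << key << "\": " << json;
    }
};

// Usage: entries go to the stream as they are written, never buffered whole.
inline std::string exampleDictionary()
{
    std::ostringstream os;
    DictionaryWriter d(os, 0);
    d.writeEntry("a", "1");
    d.writeEntry("b", "[]");
    d.finish();
    return os.str();
}
```

Because each entry is flushed to the stream as soon as it is written, a caller can serialize one object at a time without holding the whole document in memory.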
| 82 | +For json output, do not unparse to string. Use the writers instead. | ||
| 83 | +Write incrementally. This changes ordering only, but we should be able | ||
| 84 | +to manually update the test output for those cases. Objects should be | ||
| 85 | +written in numerical order, not lexically sorted. It probably makes | ||
| 86 | +sense to put the trailer at the end since that's where it is in a | ||
| 87 | +regular PDF. | ||
| 88 | + | ||
| 89 | +When we get to full serialization, add json serialization performance | ||
| 90 | +test. | ||
| 91 | + | ||
| 92 | +Some if not all of the json output functionality for v2 should move | ||
| 93 | +into QPDF proper rather than living in QPDFJob. There can be a | ||
| 94 | +top-level QPDF method that takes a pipeline and writes the JSON | ||
| 95 | +serialization to it. | ||
| 96 | + | ||
| 97 | +Decide what the API/CLI will be for serializing to v2. Will it just be | ||
| 98 | +part of --json or will it be its own separate thing? Probably we | ||
| 99 | +should make it so that a serialized PDF is different but uses the same | ||
| 100 | +object format as regular json mode. | ||
| 101 | + | ||
| 102 | +For going back from JSON to PDF, a separate utility will be needed. | ||
| 103 | +It's not practical for QPDFObjectHandle to be able to read JSON | ||
| 104 | +because of the special handling that is required for indirect objects, | ||
| 105 | +and QPDF can't just accept JSON because the way InputSource is used is | ||
| 106 | +completely different. Instead, we will need a separate utility that has | ||
| 107 | +logic similar to what copyForeignObject does. It will go something | ||
| 108 | +like this: | ||
| 109 | + | ||
| 110 | +* Create an empty QPDF (not emptyPDF, one with no objects in it at | ||
| 111 | + all). This works: | ||
| 112 | + | ||
| 113 | +``` | ||
| 114 | +%PDF-1.3 | ||
| 115 | +xref | ||
| 116 | +0 1 | ||
| 117 | +0000000000 65535 f | ||
| 118 | +trailer << /Size 1 >> | ||
| 119 | +startxref | ||
| 120 | +9 | ||
| 121 | +%%EOF | ||
| 122 | +``` | ||
| 123 | + | ||
| 124 | +For each object: | ||
| 125 | + | ||
| 126 | +* Walk through the object detecting any indirect objects. For each one | ||
| 127 | + that is not already known, reserve the object. We can also validate | ||
| 128 | + but we should try to do the best we can with invalid JSON so people | ||
| 129 | + can get good error messages. | ||
| 130 | +* Construct a QPDFObjectHandle from the JSON | ||
| 131 | +* If the object is the trailer, update the trailer | ||
| 132 | +* Else if the object doesn't exist, reserve it | ||
| 133 | +* If the object is reserved, call replaceReserved() | ||
| 134 | +* Else the object already exists; this is an error. | ||
| 135 | + | ||
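The per-object steps above can be sketched against a toy in-memory model. Everything here is hypothetical: `Rebuilder`, `JsonObject`, and `Slot` are illustrative names, objects are plain strings, references are assumed to be extracted already, and clearing the `reserved` flag stands in for the replaceReserved() call. No qpdf API is used.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

using ObjGen = std::pair<int, int>; // (object number, generation)

// Hypothetical pre-parsed input: one entry from the JSON "objects" table.
struct JsonObject
{
    bool is_trailer;
    ObjGen id;                // ignored when is_trailer is true
    std::string value;        // stand-in for the constructed object
    std::vector<ObjGen> refs; // indirect references found while walking value
};

class Rebuilder
{
    struct Slot
    {
        bool reserved{true};
        std::string value;
    };

  public:
    void add(JsonObject const& obj)
    {
        // Reserve every referenced object that is not already known.
        for (auto const& r: obj.refs) {
            reserveIfUnknown(r);
        }
        if (obj.is_trailer) {
            trailer_ = obj.value;
            return;
        }
        // Reserve the object itself if this is the first time we see it.
        reserveIfUnknown(obj.id);
        Slot& slot = objects_.at(obj.id);
        if (!slot.reserved) {
            throw std::runtime_error("object defined more than once");
        }
        // Stand-in for replacing the reserved object with the real one.
        slot.reserved = false;
        slot.value = obj.value;
    }

    bool isReserved(ObjGen const& og) const
    {
        auto it = objects_.find(og);
        return it != objects_.end() && it->second.reserved;
    }

    std::string const& trailer() const { return trailer_; }

  private:
    // emplace leaves an existing entry untouched, so this is idempotent.
    void reserveIfUnknown(ObjGen const& og) { objects_.emplace(og, Slot{}); }

    std::map<ObjGen, Slot> objects_;
    std::string trailer_;
};
```

In this model, an object that is still reserved after all input has been consumed would indicate a dangling reference, which is one place a good error message could be produced.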
| 136 | +This can almost be done through public API. I think all we need is the | ||
| 137 | +ability to create a reserved object with a specific object ID. | ||
| 138 | + | ||
| 139 | +The choices for json_key (job.yml) will be different for v1 and v2. | ||
| 140 | +That information is already duplicated in multiple places. | ||
| 141 | + | ||
| 142 | +---- | ||
| 143 | + | ||
| 42 | Remember typo: search for "Typo" In QPDFJob::doJSONEncrypt. | 144 | Remember typo: search for "Typo" In QPDFJob::doJSONEncrypt. |
| 43 | 145 | ||
| 44 | Remember to test interaction between generators and schemas. | 146 | Remember to test interaction between generators and schemas. |
| @@ -173,21 +275,25 @@ JSON: | @@ -173,21 +275,25 @@ JSON: | ||
| 173 | object. No dictionary merges or anything like that are performed. | 275 | object. No dictionary merges or anything like that are performed. |
| 174 | It will call replaceObject. | 276 | It will call replaceObject. |
| 175 | 277 | ||
| 176 | -Within .qpdf.objects, the key is "obj:o,g" or "obj:trailer", and the | 278 | +Within .qpdf.objects, the key is "obj:o g R" or "obj:trailer", and the |
| 177 | value is a dictionary with exactly one of "value" or "stream" as its | 279 | value is a dictionary with exactly one of "value" or "stream" as its |
| 178 | single key. | 280 | single key. |
| 179 | 281 | ||
| 282 | +Rationale of "obj:o g R" is that indirect object references are just | ||
| 283 | +"o g R", and so code that wants to resolve one can do so easily by | ||
| 284 | +just prepending "obj:" and not having to parse or split the string. | ||
| 285 | + | ||
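To illustrate the rationale: with keys of the form "obj:o g R", resolving a reference is a single string concatenation and lookup. The sketch below assumes a hypothetical `ObjectTable` that maps keys to serialized values; the names are illustrative and not part of qpdf.

```cpp
#include <map>
#include <string>

// Hypothetical view of .qpdf.objects: key -> serialized value.
using ObjectTable = std::map<std::string, std::string>;

// Resolve an indirect reference such as "3 0 R" by prepending "obj:".
// Returns nullptr when the table has no such object.
inline std::string const*
resolve(ObjectTable const& objects, std::string const& ref)
{
    auto it = objects.find("obj:" + ref);
    return it == objects.end() ? nullptr : &it->second;
}
```

With the earlier "obj:o,g" form, the same lookup would first have to split "o g R" into its numbers and reassemble the key.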
| 180 | For non-streams: | 286 | For non-streams: |
| 181 | 287 | ||
| 182 | { | 288 | { |
| 183 | - "obj:o,g": { | 289 | + "obj:o g R": { |
| 184 | "value": ... | 290 | "value": ... |
| 185 | } | 291 | } |
| 186 | } | 292 | } |
| 187 | 293 | ||
| 188 | For streams: | 294 | For streams: |
| 189 | 295 | ||
| 190 | - "obj:o,g": { | 296 | + "obj:o g R": { |
| 191 | "stream": { | 297 | "stream": { |
| 192 | "dict": { ... stream dictionary ... }, | 298 | "dict": { ... stream dictionary ... }, |
| 193 | "filterable": bool, | 299 | "filterable": bool, |