Commit 7882b85b0691d6a669cb0b2656f1e4c7438c552b

Authored by Jay Berkenbilt
1 parent 3c4d2bfb

TODO: more JSON notes

Showing 1 changed file with 109 additions and 3 deletions
@@ -39,6 +39,108 @@ Soon: Break ground on "Document-level work"
 Output JSON v2
 ==============
 
+----
+notes from 5/2:
+
+Need new pipelines:
+* Pl_OStream(std::ostream) with semantics like Pl_StdioFile
+* Pl_String to std::string with semantics like Pl_Buffer
+* Pl_Base64
+
+New Pipeline methods:
+* writeString(std::string const&)
+* writeCString(char const*)
+* writeChars(char const*, size_t)
+
+* Consider a templated operator<< which could specialize for char* and
+  std::string and could use std::ostringstream otherwise
+
+See if I can change all output and error messages issued by the
+library, when context is available, to have a pipeline rather than a
+FILE* or std::ostream. This makes it possible for people to capture
+output more flexibly.
+
+JSON: rather than unparse() -> string, there should be a write method
+that takes a pipeline and a depth. Then rewrite all the unparse
+methods to use it. This makes incremental writes possible as well as
+writing arbitrarily large amounts of output.
+
+JSON::parse should work from an InputSource. BufferInputSource can
+already start with a std::string.
+
+Have a json blob defined by a function that takes a pipeline and
+writes data to the pipeline. Its writer should create a Pl_Base64 ->
+Pl_Concatenate in front of the pipeline passed to write and call the
+function with that.
+
+Add methods needed to do incremental writes. Basically we need to
+expose the functionality of the array and dictionary unparse methods.
+Maybe we can have a DictionaryWriter and an ArrayWriter that deal
+with the first/depth logic and have writeElement or
+writeEntry(key, value) methods.
+
+For json output, do not unparse to string. Use the writers instead.
+Write incrementally. This changes ordering only, but we should be
+able to manually update the test output for those cases. Objects
+should be written in numerical order, not lexically sorted. It
+probably makes sense to put the trailer at the end since that's
+where it is in a regular PDF.
+
+When we get to full serialization, add a json serialization
+performance test.
+
+Some, if not all, of the json output functionality for v2 should
+move into QPDF proper rather than living in QPDFJob. There can be a
+top-level QPDF method that takes a pipeline and writes the JSON
+serialization to it.
+
+Decide what the API/CLI will be for serializing to v2. Will it just
+be part of --json, or will it be its own separate thing? Probably we
+should make it so that a serialized PDF is different but uses the
+same object format as regular json mode.
+
+For going back from JSON to PDF, a separate utility will be needed.
+It's not practical for QPDFObjectHandle to be able to read JSON
+because of the special handling that is required for indirect
+objects, and QPDF can't just accept JSON because the way InputSource
+is used is completely different. Instead, we will need a separate
+utility that has logic similar to what copyForeignObject does. It
+will go something like this:
+
+* Create an empty QPDF (not emptyPDF, one with no objects in it at
+  all). This works:
+
+```
+%PDF-1.3
+xref
+0 1
+0000000000 65535 f
+trailer << /Size 1 >>
+startxref
+9
+%%EOF
+```
+
+For each object:
+
+* Walk through the object detecting any indirect objects. For each
+  one that is not already known, reserve the object. We can also
+  validate, but we should try to do the best we can with invalid
+  JSON so people can get good error messages.
+* Construct a QPDFObjectHandle from the JSON.
+* If the object is the trailer, update the trailer.
+* Else, if the object doesn't exist, reserve it.
+* If the object is reserved, call replaceReserved().
+* Else, the object already exists; this is an error.
+
+This can almost be done through the public API. I think all we need
+is the ability to create a reserved object with a specific object ID.
+
+The choices for json_key (job.yml) will be different for v1 and v2.
+That information is already duplicated in multiple places.
+
+----
+
 Remember typo: search for "Typo" in QPDFJob::doJSONEncrypt.
 
 Remember to test interaction between generators and schemas.
@@ -173,21 +275,25 @@ JSON:
 object. No dictionary merges or anything like that are performed.
 It will call replaceObject.
 
-Within .qpdf.objects, the key is "obj:o,g" or "obj:trailer", and the
+Within .qpdf.objects, the key is "obj:o g R" or "obj:trailer", and the
 value is a dictionary with exactly one of "value" or "stream" as its
 single key.
 
+The rationale for "obj:o g R" is that indirect object references are
+just "o g R", so code that wants to resolve one can do so easily by
+just prepending "obj:" without having to parse or split the string.
+
 For non-streams:
 
 {
-  "obj:o,g": {
+  "obj:o g R": {
     "value": ...
   }
 }
 
 For streams:
 
-  "obj:o,g": {
+  "obj:o g R": {
     "stream": {
       "dict": { ... stream dictionary ... },
       "filterable": bool,