Commit 8b67ac494e57c74d8e762de9fc476133d7cc49db

Authored by Jay Berkenbilt
1 parent 95e7d36b

TODO: add notes on json v2 and other post-QPDFJob activities/ideas

Showing 2 changed files with 177 additions and 22 deletions
1 -Next 1 +10.6
2 ==== 2 ====
3 3
4 -* Add user-defined initializer `QPDFObjectHandle operator ""_qpdf` to  
5 - be like QPDFObjectHandle::parse: `auto oh = "<< /a (b) >>"_qpdf;` 4 +* Close issue #556.
  5 +
6 * Add QPDF_MAJOR_VERSION, QPDF_MINOR_VERSION to some header, possibly 6 * Add QPDF_MAJOR_VERSION, QPDF_MINOR_VERSION to some header, possibly
7 dll.h since this is everywhere that there's API 7 dll.h since this is everywhere that there's API
8 8
9 -* Take a fresh look at PointerHolder with a good plan for being able  
10 - to have developers phase it in using macros or something. Decide  
11 - about shared_ptr vs unique_ptr for each time make_shared_cstr is  
12 - called. For non-copiable classes, we can use unique_ptr instead of  
13 - shared_ptr as a replacement for PointerHolder. For performance  
14 - critical cases, we could potentially have a real pointer and a  
15 - shared pointer where the shared pointer's job is to clean up but we  
16 - use the real pointer for regular access.  
17 -  
18 -Consider in the context of #593, possibly with a different  
19 -implementation  
20 -  
21 -* replace mode: --replace-object, --replace-stream-raw,  
22 - --replace-stream-filtered  
23 - * update first paragraph of QPDF JSON in the manual to mention this  
24 - * object numbers are not preserved by write, so object ID lookup  
25 - has to be done separately for each invocation  
26 - * you don't have to specify length for streams  
27 - * you only have to specify filtering for streams if providing raw data 9 +* Add user-defined initializer `QPDFObjectHandle operator ""_qpdf` to
  10 + be like QPDFObjectHandle::parse: `auto oh = "<< /a (b) >>"_qpdf;`
28 11
29 * See if this has been done or is trivial with C++11 local static 12 * See if this has been done or is trivial with C++11 local static
30 initializers: Secure random number generation could be made more 13 initializers: Secure random number generation could be made more
@@ -43,6 +26,168 @@ implementation @@ -43,6 +26,168 @@ implementation
43 * Completion: would be nice if --job-json-file=<TAB> would complete 26 * Completion: would be nice if --job-json-file=<TAB> would complete
44 files 27 files
45 28
  29 +* Remember for release notes: starting in qpdf 11, the default value
  30 + for the --json keyword will be "latest". If you are depending on
  31 + version 1, change your code to specify --json=1, which works
  32 + starting with 10.6.0.
  33 +
  34 +* Try to put something in to ease future PointerHolder migration, such
  35 + as typedefs for containers of PointerHolders. Test to see whether
  36 + using auto or decltype in certain places may make containers of
  37 + PointerHolders switch over cleanly. Clearly document the deprecation
  38 + stuff.
  39 +
  40 +
  41 +Output JSON v2
  42 +==============
  43 +
  44 +Output JSON v2 should contain enough information to completely recreate a PDF
  45 +file.
  46 +
  47 +This is not an ABI change as long as the default --json version is 1.
  48 +
  49 +If this is done, update --json option in cli.rst to mention v2. Also
  50 +update QPDFJob::Config::json and of course other parts of the docs
  51 +(json.rst).
  52 +
  53 +Fix the following problems:
  54 +
  55 +* Include the PDF version header somewhere.
  56 +
  57 +* Using "n n R" as a key in "objects" and "objectinfo" messes up
  58 + searching for things
  59 +
  60 +* Strings cannot be unambiguously encoded/decoded
  61 +
  62 + * Can't tell string from name from indirect object
  63 +
  64 + * Strings are treated as PDF doc encoding and output as UTF-8, which
  65 + doesn't work since multiple PDF doc code points are undefined
  66 +
  67 +* There is no representation of stream data
  68 +
  69 +* You can't tell a stream from a dictionary except by looking in both
  70 + "object" and "objectinfo". Fix this, and then remove "objectinfo".
  71 +
  72 +* There are differences between information shown in the json format
  73 + vs. information shown with options like --check, --list-attachments,
  74 + etc. The json format should be able to completely replace things
  75 + that write to stdout.
  76 +
  77 +* Consider using camelCase in multi-word key names to be consistent
  78 + with job JSON and with how JSON is often represented in languages
  79 + that use it more natively
  80 +
  81 +* Consider changing the contract to allow fields to be absent even
  82 + when present in the schema. It's reasonable for people to check for
  83 + presence of a key. Most languages make this easy to do.
  84 +
  85 +Most things that are informational can stay the same. We will have to
  86 +go through every item to decide for sure.
  87 +
  88 +To address ambiguity, consider the following:
  89 +
  90 +Whenever a direct PDF object appears, disambiguate things represented
  91 +in JSON as strings as follows:
  92 +
  93 +* "/Name" -- if it starts with /, it's a name
  94 +* "n n R" -- if it is "n n R", it's an indirect object
  95 +* "u:utf8-encoded" -- a utf8-encoded string
  96 +* "b:<12ab34>" -- a binary string
  97 +
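The string disambiguation rules proposed above could be consumed roughly as follows. This is only a sketch of the proposal as written in these notes: `classify` is a hypothetical helper, not part of qpdf, and it assumes that any string matching none of the four forms should be rejected.

```python
import re

# "n n R": object number, generation number, literal R (per the notes above).
_INDIRECT = re.compile(r"([0-9]+) ([0-9]+) R")
# "b:<12ab34>": hex digits in angle brackets; whole bytes are assumed.
_BINARY = re.compile(r"b:<((?:[0-9a-fA-F]{2})*)>")


def classify(s):
    """Return (kind, value) for a JSON string representing a direct PDF object."""
    if s.startswith("/"):
        return ("name", s)
    m = _INDIRECT.fullmatch(s)
    if m:
        return ("indirect", (int(m.group(1)), int(m.group(2))))
    if s.startswith("u:"):
        # The JSON text is already decoded, so only the prefix is stripped.
        return ("string", s[2:])
    m = _BINARY.fullmatch(s)
    if m:
        return ("binary", bytes.fromhex(m.group(1)))
    # Assumption: anything else is ambiguous and treated as an error.
    raise ValueError("not a recognized v2 string form: %r" % s)
```

Checking the name prefix first means a name such as "/1 0 R" is never mistaken for an indirect reference.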
  98 +In "objects", the key is "obj:o,g", and the value is a dictionary with
  99 +exactly one of "value" or "stream" as its single key.
  100 +
  101 +For non-streams, the value of "value" is as described above.
  102 +
  103 +{
  104 + "obj:o,g": {
  105 + "value": ...
  106 + }
  107 +}
  108 +
  109 +For streams:
  110 +
  111 +{
  112 + "obj:o,g": {
  113 + "stream": {
  114 + "dict": { ... stream dictionary ... },
  115 + "filterable": bool,
  116 + "raw": "base64-encoded raw data",
  117 + "filtered": "base64-encoded filtered data"
  118 + }
  119 + }
  120 +}
  121 +
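A reader of the proposed "objects" layout might split each entry as sketched here. The function name `parse_object_entry` is hypothetical, and the sketch assumes the key is exactly "obj:" followed by the object and generation numbers separated by a comma, as in the examples above.

```python
def parse_object_entry(key, entry):
    """Split an "objects" entry into (obj, gen, kind, payload).

    kind is "value" or "stream"; payload is whatever that key holds.
    """
    if not key.startswith("obj:"):
        raise ValueError("unexpected object key: %r" % key)
    # Raises ValueError if there are not exactly two comma-separated parts.
    obj_str, gen_str = key[len("obj:"):].split(",")
    kinds = [k for k in ("value", "stream") if k in entry]
    # The notes require exactly one of "value" or "stream" as the single key.
    if len(entry) != 1 or len(kinds) != 1:
        raise ValueError("entry must have exactly one of 'value' or 'stream'")
    kind = kinds[0]
    return int(obj_str), int(gen_str), kind, entry[kind]
```

With this layout a stream can be told apart from a dictionary by the entry's single key alone, without consulting "objectinfo".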
  122 +Notes about stream data:
  123 +
  124 +* Always include "dict".
  125 +
  126 +* Always include "filterable" regardless of value of
  127 + --json-stream-data. The value of filterable is influenced by
  128 + --decode-level, which is already in parameters.
  129 +
  130 +* Add new flag --json-stream-data={raw,filtered,none}. At most one of
  131 + "raw" and "filtered" will appear for each stream.
  132 +
  133 +* Add to parameters: value of json-stream-data, default is none
  134 +
  135 +* If none, omit stream data entirely
  136 +
  137 +* If raw, include raw stream data as base64
  138 +
  139 +* If filtered, include the base64-encoded filtered stream data if we
  140 + can and should decode it based on decode-level. Otherwise, include
  141 + the base64-encoded raw data. See if we can honor
  142 + --normalize-content.
  143 +
  144 +Note that --json-stream-data=filtered is different from
  145 +--filtered-stream-data in that --filtered-stream-data implies
  146 +--decode-level=all while --json-stream-data=filtered does not. Make
  147 +sure this is mentioned in the help for both options.
  148 +
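The notes about stream data can be read as the following selection logic. This is a sketch under stated assumptions, not qpdf code: `stream_entry` is a hypothetical name, and `filtered` being `None` stands in for "the stream cannot or should not be decoded at the current decode level".

```python
import base64


def _b64(data):
    return base64.b64encode(data).decode("ascii")


def stream_entry(stream_dict, raw, filtered, mode):
    """Build the value of "stream" for one object.

    raw      -- raw stream bytes
    filtered -- decoded bytes, or None if not filterable (assumption)
    mode     -- "none", "raw", or "filtered" (the --json-stream-data value)
    """
    filterable = filtered is not None
    # "dict" and "filterable" are always present, whatever the mode.
    entry = {"dict": stream_dict, "filterable": filterable}
    if mode == "raw":
        entry["raw"] = _b64(raw)
    elif mode == "filtered":
        if filterable:
            entry["filtered"] = _b64(filtered)
        else:
            # Fall back to raw data when the stream can't be filtered.
            entry["raw"] = _b64(raw)
    elif mode != "none":
        raise ValueError("unknown json-stream-data mode: %r" % mode)
    return entry
```

In every mode at most one of "raw" and "filtered" ends up in the result, which matches the constraint stated in the notes.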
  149 +QPDFJob
  150 +=======
  151 +
  152 +Here are some ideas for QPDFJob that didn't make it into 10.6. Not all
  153 +of these are necessarily good -- just things to consider.
  154 +
  155 +* replace mode: --replace-object, --replace-stream-raw,
  156 + --replace-stream-filtered
  157 + * update first paragraph of QPDF JSON in the manual to mention this
  158 + * object numbers are not preserved by write, so object ID lookup
  159 + has to be done separately for each invocation
  160 + * you don't have to specify length for streams
  161 + * you only have to specify filtering for streams if providing raw data
  162 +
  163 +* Allow users to supply a custom progress reporter for QPDFJob
  164 +
  165 +* Better interoperability with json output:
  166 +
  167 + * Make sure all the things that print stuff to stdout have json
  168 + equivalents (check, showLinearizationData, etc.)
  169 + * There should be a way to get json output other than having it
  170 + print to stdout. It should be multi-language friendly and allow
  171 + for large amounts of data, such as providing a callback that qpdf
  172 + can write to (like a pipeline)
  173 + * See also JSON v2
  174 +
  175 +* How do we chain jobs? The idea would be that the input and/or output
  176 + of a QPDFJob could be a QPDF object rather than a file. For input,
  177 + it's pretty easy. For output, none of the output-specific options
  178 + (encrypt, compress-streams, objects-streams, etc.) would have any
  179 + effect, so we would have to treat this like inspect for error
  180 + checking. The QPDF object in the state where it's ready to be sent
  181 + off to QPDFWriter would be used as the input to the next QPDFJob.
  182 + For the job json, I think we can have the output be an identifier
  183 + that can be used as the input for another QPDFJob. For a json file,
  184 + we could have the top level detect if it's an array with the convention
  185 + that exactly one has an output, or we could have a subkey with other
  186 + job definitions or something. Ideally, any input
  187 + (copy-attachments-from, pages, etc.) could use a QPDF object. It
  188 + wouldn't surprise me if this exposes bugs in qpdf around foreign
  189 + streams as this has been a relatively fragile area before.
  190 +
46 Documentation 191 Documentation
47 ============= 192 =============
48 193
@@ -210,6 +355,15 @@ This is a list of changes to make next time there is an ABI change. @@ -210,6 +355,15 @@ This is a list of changes to make next time there is an ABI change.
210 Comments appear in the code prefixed by "ABI" 355 Comments appear in the code prefixed by "ABI"
211 356
212 * Search for ABI to find items not listed here. 357 * Search for ABI to find items not listed here.
  358 +* Switch default --json to latest
  359 +* Take a fresh look at PointerHolder with a good plan for being able
  360 + to have developers phase it in using macros or something. Decide
  361 + about shared_ptr vs unique_ptr for each time make_shared_cstr is
  362 + called. For non-copiable classes, we can use unique_ptr instead of
  363 + shared_ptr as a replacement for PointerHolder. For performance
  364 + critical cases, we could potentially have a real pointer and a
  365 + shared pointer where the shared pointer's job is to clean up but we
  366 + use the real pointer for regular access.
213 * See where anonymous namespaces can be used to keep things private to 367 * See where anonymous namespaces can be used to keep things private to
214 a source file. Search for `(class|struct)` in **/*.cc. 368 a source file. Search for `(class|struct)` in **/*.cc.
215 * See if we can use constructor delegation instead of init() in 369 * See if we can use constructor delegation instead of init() in
cSpell.json
@@ -411,6 +411,7 @@ @@ -411,6 +411,7 @@
411 "struct", 411 "struct",
412 "stylesheet", 412 "stylesheet",
413 "subclassing", 413 "subclassing",
  414 + "subkey",
414 "subkeys", 415 "subkeys",
415 "subramanyam", 416 "subramanyam",
416 "swversion", 417 "swversion",