Commit 8b67ac494e57c74d8e762de9fc476133d7cc49db

Authored by Jay Berkenbilt
1 parent 95e7d36b

TODO: add notes on json v2 and other post-QPDFJob activities/ideas

Showing 2 changed files with 177 additions and 22 deletions
1   -Next
  1 +10.6
2 2 ====
3 3  
4   -* Add user-defined initializer `QPDFObjectHandle operator ""_qpdf` to
5   - be like QPDFObjectHandle::parse: `auto oh = "<< /a (b) >>"_qpdf;`
  4 +* Close issue #556.
  5 +
6 6 * Add QPDF_MAJOR_VERSION, QPDF_MINOR_VERSION to some header, possibly
7 7 dll.h since this is everywhere that there's API
8 8  
9   -* Take a fresh look at PointerHolder with a good plan for being able
10   - to have developers phase it in using macros or something. Decide
11   - about shared_ptr vs unique_ptr for each time make_shared_cstr is
12   - called. For non-copiable classes, we can use unique_ptr instead of
13   - shared_ptr as a replacement for PointerHolder. For performance
14   - critical cases, we could potentially have a real pointer and a
15   - shared pointer where the shared pointer's job is to clean up but we
16   - use the real pointer for regular access.
17   -
18   -Consider in the context of #593, possibly with a different
19   -implementation
20   -
21   -* replace mode: --replace-object, --replace-stream-raw,
22   - --replace-stream-filtered
23   - * update first paragraph of QPDF JSON in the manual to mention this
24   - * object numbers are not preserved by write, so object ID lookup
25   - has to be done separately for each invocation
26   - * you don't have to specify length for streams
27   - * you only have to specify filtering for streams if providing raw data
  9 +* Add user-defined initializer `QPDFObjectHandle operator ""_qpdf` to
  10 + be like QPDFObjectHandle::parse: `auto oh = "<< /a (b) >>"_qpdf;`
28 11  
29 12 * See if this has been done or is trivial with C++11 local static
30 13 initializers: Secure random number generation could be made more
... ... @@ -43,6 +26,168 @@ implementation
43 26 * Completion: would be nice if --job-json-file=<TAB> would complete
44 27 files
45 28  
  29 +* Remember for release notes: starting in qpdf 11, the default value
  30 + for the --json keyword will be "latest". If you are depending on
  31 + version 1, change your code to specify --json=1, which works
  32 + starting with 10.6.0.
  33 +
  34 +* Try to put something in to ease future PointerHolder migration, such
  35 + as typedefs for containers of PointerHolders. Test to see whether
  36 + using auto or decltype in certain places may make containers of
  37 + PointerHolders switch over cleanly. Clearly document the deprecation
  38 + stuff.
  39 +
  40 +
  41 +Output JSON v2
  42 +==============
  43 +
  44 +Output JSON v2 should contain enough information to completely recreate
  45 +a PDF file.
  46 +
  47 +This is not an ABI change as long as the default --json version is 1.
  48 +
  49 +If this is done, update --json option in cli.rst to mention v2. Also
  50 +update QPDFJob::Config::json and of course other parts of the docs
  51 +(json.rst).
  52 +
  53 +Fix the following problems:
  54 +
  55 +* Include the PDF version header somewhere.
  56 +
  57 +* Using "n n R" as a key in "objects" and "objectinfo" messes up
  58 + searching for things
  59 +
  60 +* Strings cannot be unambiguously encoded/decoded
  61 +
  62 + * Can't tell string from name from indirect object
  63 +
  64 + * Strings are treated as PDF doc encoding and output as UTF-8, which
  65 + doesn't work since multiple PDF doc code points are undefined
  66 +
  67 +* There is no representation of stream data
  68 +
  69 +* You can't tell a stream from a dictionary except by looking in both
  70 + "object" and "objectinfo". Fix this, and then remove "objectinfo".
  71 +
  72 +* There are differences between information shown in the json format
  73 + vs. information shown with options like --check, --list-attachments,
  74 + etc. The json format should be able to completely replace things
  75 + that write to stdout.
  76 +
  77 +* Consider using camelCase in multi-word key names to be consistent
  78 + with job JSON and with how JSON is often represented in languages
  79 + that use it more natively
  80 +
  81 +* Consider changing the contract to allow fields to be absent even
  82 + when present in the schema. It's reasonable for people to check for
  83 + presence of a key. Most languages make this easy to do.
  84 +
  85 +Most things that are informational can stay the same. We will have to
  86 +go through every item to decide for sure.
  87 +
  88 +To address ambiguity, consider the following:
  89 +
  90 +Whenever a direct PDF object appears, disambiguate things represented
  91 +in JSON as strings as follows:
  92 +
  93 +* "/Name" -- if it starts with /, it's a name
  94 +* "n n R" -- if it is "n n R", it's an indirect object
  95 +* "u:utf8-encoded" -- a utf8-encoded string
  96 +* "b:<12ab34>" -- a binary string
  97 +
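Review note: the four string conventions listed above can be checked with a small decoder. This is only a sketch of the proposal, not qpdf code. The function name `classify` is hypothetical, and it assumes that the text between `b:<` and `>` is hexadecimal, which the notes suggest but do not state.

```python
import re

# Hypothetical helper illustrating the proposed v2 string conventions.
# Nothing here is part of qpdf's API.
_INDIRECT = re.compile(r"(\d+) (\d+) R")
_BINARY = re.compile(r"b:<((?:[0-9a-fA-F]{2})*)>")  # assumption: hex payload


def classify(s):
    """Return (kind, value) for a JSON string standing for a direct PDF object."""
    if s.startswith("/"):
        return ("name", s)
    m = _INDIRECT.fullmatch(s)
    if m:
        # "n n R": object number and generation
        return ("indirect", (int(m.group(1)), int(m.group(2))))
    if s.startswith("u:"):
        return ("string", s[2:])
    m = _BINARY.fullmatch(s)
    if m:
        return ("binary", bytes.fromhex(m.group(1)))
    raise ValueError("not a valid v2 string form: %r" % s)
```

Because every text string carries the `u:` prefix, a string whose content happens to look like `1 0 R` or `/Name` is still decoded as a string.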
  98 +In "objects", the key is "obj:o,g", and the value is a dictionary with
  99 +exactly one of "value" or "stream" as its single key.
  100 +
  101 +For non-streams, the value of "value" is as described above.
  102 +
  103 +{
  104 + "obj:o,g": {
  105 + "value": ...
  106 + }
  107 +}
  108 +
  109 +For streams:
  110 +
  111 +{
  112 + "obj:o,g": {
  113 + "stream": {
  114 + "dict": { ... stream dictionary ... },
  115 + "filterable": bool,
  116 + "raw": "base64-encoded raw data",
  117 + "filtered": "base64-encoded filtered data"
  118 + }
  119 + }
  120 +}
  121 +
  122 +Notes about stream data:
  123 +
  124 +* Always include "dict".
  125 +
  126 +* Always include "filterable" regardless of value of
  127 + --json-stream-data. The value of filterable is influenced by
  128 + --decode-level, which is already in parameters.
  129 +
  130 +* Add new flag --json-stream-data={raw,filtered,none}. At most one of
  131 + "raw" and "filtered" will appear for each stream.
  132 +
  133 +* Add to parameters: value of json-stream-data, default is none
  134 +
  135 +* If none, omit stream data entirely
  136 +
  137 +* If raw, include raw stream data as base64
  138 +
  139 +* If filtered, include the base64-encoded filtered stream data if we
  140 + can and should decode it based on decode-level. Otherwise, include
  141 + the base64-encoded raw data. See if we can honor
  142 + --normalize-content.
  143 +
  144 +Note that --json-stream-data=filtered is different from
  145 +--filtered-stream-data in that --filtered-stream-data implies
  146 +--decode-level=all while --json-stream-data=filtered does not. Make
  147 +sure this is mentioned in the help for both options.
  148 +
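Review note: a minimal sketch of how one stream entry could be assembled under the rules above. The function `stream_entry` and its parameters are hypothetical. The sketch assumes the caller has already decided, based on decode-level, whether filtered data is available, and passes `None` when it is not.

```python
import base64


def _b64(data):
    """Base64-encode bytes and return the result as an ASCII string."""
    return base64.b64encode(data).decode("ascii")


def stream_entry(obj, gen, stream_dict, raw, filtered, filterable, mode="none"):
    """Build one "objects" entry for a stream.

    mode mirrors the proposed --json-stream-data={raw,filtered,none}.
    filtered is the decoded data, or None if it could not or should not
    be decoded (an assumption of this sketch).
    """
    if mode not in ("none", "raw", "filtered"):
        raise ValueError("unknown json-stream-data mode: %r" % mode)
    # "dict" and "filterable" are always present.
    stream = {"dict": stream_dict, "filterable": filterable}
    if mode == "raw":
        stream["raw"] = _b64(raw)
    elif mode == "filtered":
        if filterable and filtered is not None:
            stream["filtered"] = _b64(filtered)
        else:
            # Fall back to raw data when the stream is not decoded.
            stream["raw"] = _b64(raw)
    return {"obj:%d,%d" % (obj, gen): {"stream": stream}}
```

With this shape, at most one of "raw" and "filtered" appears per stream, and a reader can tell which one it received from the key name alone.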
  149 +QPDFJob
  150 +=======
  151 +
  152 +Here are some ideas for QPDFJob that didn't make it into 10.6. Not all
  153 +of these are necessarily good -- just things to consider.
  154 +
  155 +* replace mode: --replace-object, --replace-stream-raw,
  156 + --replace-stream-filtered
  157 + * update first paragraph of QPDF JSON in the manual to mention this
  158 + * object numbers are not preserved by write, so object ID lookup
  159 + has to be done separately for each invocation
  160 + * you don't have to specify length for streams
  161 + * you only have to specify filtering for streams if providing raw data
  162 +
  163 +* Allow users to supply a custom progress reporter for QPDFJob
  164 +
  165 +* Better interoperability with json output:
  166 +
  167 + * Make sure all the things that print stuff to stdout have json
  168 + equivalents (check, showLinearizationData, etc.)
  169 + * There should be a way to get json output other than having it
  170 + print to stdout. It should be multi-language friendly and allow
  171 + for large amounts of data, such as providing a callback that qpdf
  172 + can write to (like a pipeline)
  173 + * See also JSON v2
  174 +
  175 +* How do we chain jobs? The idea would be that the input and/or output
  176 + of a QPDFJob could be a QPDF object rather than a file. For input,
  177 + it's pretty easy. For output, none of the output-specific options
  178 + (encrypt, compress-streams, object-streams, etc.) would have any
  179 + effect, so we would have to treat this like inspect for error
  180 + checking. The QPDF object in the state where it's ready to be sent
  181 + off to QPDFWriter would be used as the input to the next QPDFJob.
  182 + For the job json, I think we can have the output be an identifier
  183 + that can be used as the input for another QPDFJob. For a json file,
  184 + we could have the top level detect if it's an array with the convention
  185 + that exactly one has an output, or we could have a subkey with other
  186 + job definitions or something. Ideally, any input
  187 + (copy-attachments-from, pages, etc.) could use a QPDF object. It
  188 + wouldn't surprise me if this exposes bugs in qpdf around foreign
  189 + streams as this has been a relatively fragile area before.
  190 +
46 191 Documentation
47 192 =============
48 193  
... ... @@ -210,6 +355,15 @@ This is a list of changes to make next time there is an ABI change.
210 355 Comments appear in the code prefixed by "ABI"
211 356  
212 357 * Search for ABI to find items not listed here.
  358 +* Switch default --json to latest
  359 +* Take a fresh look at PointerHolder with a good plan for being able
  360 + to have developers phase it in using macros or something. Decide
  361 + about shared_ptr vs unique_ptr for each time make_shared_cstr is
  362 + called. For non-copiable classes, we can use unique_ptr instead of
  363 + shared_ptr as a replacement for PointerHolder. For performance
  364 + critical cases, we could potentially have a real pointer and a
  365 + shared pointer where the shared pointer's job is to clean up but we
  366 + use the real pointer for regular access.
213 367 * See where anonymous namespaces can be used to keep things private to
214 368 a source file. Search for `(class|struct)` in **/*.cc.
215 369 * See if we can use constructor delegation instead of init() in
... ...
cSpell.json
... ... @@ -411,6 +411,7 @@
411 411 "struct",
412 412 "stylesheet",
413 413 "subclassing",
  414 + "subkey",
414 415 "subkeys",
415 416 "subramanyam",
416 417 "swversion",
... ...