Commit 8b67ac494e57c74d8e762de9fc476133d7cc49db
1 parent 95e7d36b
TODO: add notes on json v2 and other post-QPDFJob activities/ideas
Showing 2 changed files with 177 additions and 22 deletions
TODO
| 1 | -Next | |
| 1 | +10.6 | |
| 2 | 2 | ==== |
| 3 | 3 | |
| 4 | -* Add user-defined initializer `QPDFObjectHandle operator ""_qpdf` to | |
| 5 | - be like QPDFObjectHandle::parse: `auto oh = "<< /a (b) >>"_qpdf;` | |
| 4 | +* Close issue #556. | |
| 5 | + | |
| 6 | 6 | * Add QPDF_MAJOR_VERSION, QPDF_MINOR_VERSION to some header, possibly |
| 7 | 7 | dll.h since this is everywhere that there's API |
| 8 | 8 | |
| 9 | -* Take a fresh look at PointerHolder with a good plan for being able | |
| 10 | - to have developers phase it in using macros or something. Decide | |
| 11 | - about shared_ptr vs unique_ptr for each time make_shared_cstr is | |
| 12 | - called. For non-copiable classes, we can use unique_ptr instead of | |
| 13 | - shared_ptr as a replacement for PointerHolder. For performance | |
| 14 | - critical cases, we could potentially have a real pointer and a | |
| 15 | - shared pointer where the shared pointer's job is to clean up but we | |
| 16 | - use the real pointer for regular access. | |
| 17 | - | |
| 18 | -Consider in the context of #593, possibly with a different | |
| 19 | -implementation | |
| 20 | - | |
| 21 | -* replace mode: --replace-object, --replace-stream-raw, | |
| 22 | - --replace-stream-filtered | |
| 23 | - * update first paragraph of QPDF JSON in the manual to mention this | |
| 24 | - * object numbers are not preserved by write, so object ID lookup | |
| 25 | - has to be done separately for each invocation | |
| 26 | - * you don't have to specify length for streams | |
| 27 | - * you only have to specify filtering for streams if providing raw data | |
| 9 | +* Add user-defined initializer `QPDFObjectHandle operator ""_qpdf` to | |
| 10 | + be like QPDFObjectHandle::parse: `auto oh = "<< /a (b) >>"_qpdf;` | |
| 28 | 11 | |
| 29 | 12 | * See if this has been done or is trivial with C++11 local static |
| 30 | 13 | initializers: Secure random number generation could be made more |
| ... | ... | @@ -43,6 +26,168 @@ implementation |
| 43 | 26 | * Completion: would be nice if --job-json-file=<TAB> would complete |
| 44 | 27 | files |
| 45 | 28 | |
| 29 | +* Remember for release notes: starting in qpdf 11, the default value | |
| 30 | + for the --json keyword will be "latest". If you are depending on | |
| 31 | + version 1, change your code to specify --json=1, which works | |
| 32 | + starting with 10.6.0. | |
| 33 | + | |
| 34 | +* Try to put something in to ease future PointerHolder migration, such | |
| 35 | + as typedefs for containers of PointerHolders. Test to see whether | |
| 36 | + using auto or decltype in certain places may make containers of | |
| 37 | + PointerHolders switch over cleanly. Clearly document the deprecation | |
| 38 | + stuff. | |
| 39 | + | |
| 40 | + | |
| 41 | +Output JSON v2 | |
| 42 | +============== | |
| 43 | + | |
| 44 | +Output JSON v2 should contain enough information to completely recreate a PDF | |
| 45 | +file. | |
| 46 | + | |
| 47 | +This is not an ABI change as long as the default --json version is 1. | |
| 48 | + | |
| 49 | +If this is done, update --json option in cli.rst to mention v2. Also | |
| 50 | +update QPDFJob::Config::json and of course other parts of the docs | |
| 51 | +(json.rst). | |
| 52 | + | |
| 53 | +Fix the following problems: | |
| 54 | + | |
| 55 | +* Include the PDF version header somewhere. | |
| 56 | + | |
| 57 | +* Using "n n R" as a key in "objects" and "objectinfo" messes up | |
| 58 | + searching for things | |
| 59 | + | |
| 60 | +* Strings cannot be unambiguously encoded/decoded | |
| 61 | + | |
| 62 | + * Can't tell string from name from indirect object | |
| 63 | + | |
| 64 | + * Strings are treated as PDF doc encoding and output as UTF-8, which | |
| 65 | + doesn't work since multiple PDF doc code points are undefined | |
| 66 | + | |
| 67 | +* There is no representation of stream data | |
| 68 | + | |
| 69 | +* You can't tell a stream from a dictionary except by looking in both | |
| 70 | + "objects" and "objectinfo". Fix this, and then remove "objectinfo". | |
| 71 | + | |
| 72 | +* There are differences between information shown in the json format | |
| 73 | + vs. information shown with options like --check, --list-attachments, | |
| 74 | + etc. The json format should be able to completely replace things | |
| 75 | + that write to stdout. | |
| 76 | + | |
| 77 | +* Consider using camelCase in multi-word key names to be consistent | |
| 78 | + with job JSON and with how JSON is often represented in languages | |
| 79 | + that use it more natively | |
| 80 | + | |
| 81 | +* Consider changing the contract to allow fields to be absent even | |
| 82 | + when present in the schema. It's reasonable for people to check for | |
| 83 | + presence of a key. Most languages make this easy to do. | |
| 84 | + | |
| 85 | +Most things that are informational can stay the same. We will have to | |
| 86 | +go through every item to decide for sure. | |
| 87 | + | |
| 88 | +To address ambiguity, consider the following: | |
| 89 | + | |
| 90 | +Whenever a direct PDF object appears, disambiguate things represented | |
| 91 | +in JSON as strings as follows: | |
| 92 | + | |
| 93 | +* "/Name" -- if it starts with /, it's a name | |
| 94 | +* "n n R" -- if it is "n n R", it's an indirect object | |
| 95 | +* "u:utf8-encoded" -- a utf8-encoded string | |
| 96 | +* "b:<12ab34>" -- a binary string | |
| 97 | + | |
| 98 | +In "objects", the key is "obj:o,g", and the value is a dictionary with | |
| 99 | +exactly one of "value" or "stream" as its single key. | |
| 100 | + | |
| 101 | +For non-streams, the value of "value" is as described above. | |
| 102 | + | |
| 103 | +{ | |
| 104 | + "obj:o,g": { | |
| 105 | + "value": ... | |
| 106 | + } | |
| 107 | +} | |
| 108 | + | |
| 109 | +For streams: | |
| 110 | + | |
| 111 | +{ | |
| 112 | + "obj:o,g": { | |
| 113 | + "stream": { | |
| 114 | + "dict": { ... stream dictionary ... }, | |
| 115 | + "filterable": bool, | |
| 116 | + "raw": "base64-encoded raw data", | |
| 117 | + "filtered": "base64-encoded filtered data" | |
| 118 | + } | |
| 119 | + } | |
| 120 | +} | |
| 121 | + | |
| 122 | +Notes about stream data: | |
| 123 | + | |
| 124 | +* Always include "dict". | |
| 125 | + | |
| 126 | +* Always include "filterable" regardless of value of | |
| 127 | + --json-stream-data. The value of filterable is influenced by | |
| 128 | + --decode-level, which is already in parameters. | |
| 129 | + | |
| 130 | +* Add new flag --json-stream-data={raw,filtered,none}. At most one of | |
| 131 | + "raw" and "filtered" will appear for each stream. | |
| 132 | + | |
| 133 | +* Add to parameters: value of json-stream-data, default is none | |
| 134 | + | |
| 135 | +* If none, omit stream data entirely | |
| 136 | + | |
| 137 | +* If raw, include raw stream data as base64 | |
| 138 | + | |
| 139 | +* If filtered, include the base64-encoded filtered stream data if we | |
| 140 | + can and should decode it based on decode-level. Otherwise, include | |
| 141 | + the base64-encoded raw data. See if we can honor | |
| 142 | + --normalize-content. | |
| 143 | + | |
| 144 | +Note that --json-stream-data=filtered is different from | |
| 145 | +--filtered-stream-data in that --filtered-stream-data implies | |
| 146 | +--decode-level=all while --json-stream-data=filtered does not. Make | |
| 147 | +sure this is mentioned in the help for both options. | |
| 148 | + | |
| 149 | +QPDFJob | |
| 150 | +======= | |
| 151 | + | |
| 152 | +Here are some ideas for QPDFJob that didn't make it into 10.6. Not all | |
| 153 | +of these are necessarily good -- just things to consider. | |
| 154 | + | |
| 155 | +* replace mode: --replace-object, --replace-stream-raw, | |
| 156 | + --replace-stream-filtered | |
| 157 | + * update first paragraph of QPDF JSON in the manual to mention this | |
| 158 | + * object numbers are not preserved by write, so object ID lookup | |
| 159 | + has to be done separately for each invocation | |
| 160 | + * you don't have to specify length for streams | |
| 161 | + * you only have to specify filtering for streams if providing raw data | |
| 162 | + | |
| 163 | +* Allow users to supply a custom progress reporter for QPDFJob | |
| 164 | + | |
| 165 | +* Better interoperability with json output: | |
| 166 | + | |
| 167 | + * Make sure all the things that print stuff to stdout have json | |
| 168 | + equivalents (check, showLinearizationData, etc.) | |
| 169 | + * There should be a way to get json output other than having it | |
| 170 | + print to stdout. It should be multi-language friendly and allow | |
| 171 | + for large amounts of data, such as providing a callback that qpdf | |
| 172 | + can write to (like a pipeline) | |
| 173 | + * See also JSON v2 | |
| 174 | + | |
| 175 | +* How do we chain jobs? The idea would be that the input and/or output | |
| 176 | + of a QPDFJob could be a QPDF object rather than a file. For input, | |
| 177 | + it's pretty easy. For output, none of the output-specific options | |
| 178 | + (encrypt, compress-streams, object-streams, etc.) would have any | |
| 179 | + effect, so we would have to treat this like inspect for error | |
| 180 | + checking. The QPDF object in the state where it's ready to be sent | |
| 181 | + off to QPDFWriter would be used as the input to the next QPDFJob. | |
| 182 | + For the job json, I think we can have the output be an identifier | |
| 183 | + that can be used as the input for another QPDFJob. For a json file, | |
| 184 | + we could have the top level detect if it's an array with the convention | |
| 185 | + that exactly one has an output, or we could have a subkey with other | |
| 186 | + job definitions or something. Ideally, any input | |
| 187 | + (copy-attachments-from, pages, etc.) could use a QPDF object. It | |
| 188 | + wouldn't surprise me if this exposes bugs in qpdf around foreign | |
| 189 | + streams as this has been a relatively fragile area before. | |
| 190 | + | |
| 46 | 191 | Documentation |
| 47 | 192 | ============= |
| 48 | 193 | |
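The string disambiguation rules proposed in the "Output JSON v2" notes above can be sketched from the consumer's side. This is only an illustration of the proposal as written in the TODO, not a released qpdf format. The function name `classify` is hypothetical, and it assumes that "n n R" stands for an object number and a generation number, and that the "b:<...>" form carries an even number of hex digits.

```python
import re

# Patterns for the proposed forms. Both are assumptions based on the TODO
# text: "n n R" is read as "<object number> <generation> R", and the binary
# form is read as "b:<hex digits>".
_INDIRECT = re.compile(r"(\d+) (\d+) R")
_BINARY = re.compile(r"b:<((?:[0-9a-fA-F]{2})*)>")


def classify(s):
    """Return a (kind, value) pair for a JSON string in the proposed scheme."""
    if s.startswith("/"):
        # A leading slash marks a PDF name.
        return ("name", s)
    m = _INDIRECT.fullmatch(s)
    if m:
        # An indirect object reference: (object number, generation).
        return ("indirect", (int(m.group(1)), int(m.group(2))))
    if s.startswith("u:"):
        # A UTF-8 encoded string; strip the prefix.
        return ("string", s[2:])
    m = _BINARY.fullmatch(s)
    if m:
        # A binary string given as hex digits.
        return ("binary", bytes.fromhex(m.group(1)))
    raise ValueError(f"unrecognized string form: {s!r}")
```

Under these rules every JSON string falls into exactly one category, which is the property the current output lacks. A string that matches none of the four forms is rejected instead of being guessed at.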
| ... | ... | @@ -210,6 +355,15 @@ This is a list of changes to make next time there is an ABI change. |
| 210 | 355 | Comments appear in the code prefixed by "ABI" |
| 211 | 356 | |
| 212 | 357 | * Search for ABI to find items not listed here. |
| 358 | +* Switch default --json to latest | |
| 359 | +* Take a fresh look at PointerHolder with a good plan for being able | |
| 360 | + to have developers phase it in using macros or something. Decide | |
| 361 | + about shared_ptr vs unique_ptr for each time make_shared_cstr is | |
| 362 | + called. For non-copiable classes, we can use unique_ptr instead of | |
| 363 | + shared_ptr as a replacement for PointerHolder. For performance | |
| 364 | + critical cases, we could potentially have a real pointer and a | |
| 365 | + shared pointer where the shared pointer's job is to clean up but we | |
| 366 | + use the real pointer for regular access. | |
| 213 | 367 | * See where anonymous namespaces can be used to keep things private to |
| 214 | 368 | a source file. Search for `(class|struct)` in **/*.cc. |
| 215 | 369 | * See if we can use constructor delegation instead of init() in | ... | ... |
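The "objects" layout proposed in the "Output JSON v2" section of this TODO (an "obj:o,g" key whose value holds exactly one of "value" or "stream") could be read along the following lines. This is a minimal sketch of the proposal, not of any released qpdf output. The helper names `parse_object_key` and `read_entry` and the sample data are hypothetical.

```python
import base64
import json


def parse_object_key(key):
    """Split a proposed "obj:o,g" key into (object number, generation)."""
    if not key.startswith("obj:"):
        raise ValueError(f"not an object key: {key!r}")
    obj, gen = key[len("obj:"):].split(",")
    return int(obj), int(gen)


def read_entry(entry):
    """Return ("value", v) or ("stream", dict, data_kind, data_bytes)."""
    keys = set(entry)
    if keys == {"value"}:
        return ("value", entry["value"])
    if keys != {"stream"}:
        raise ValueError("entry must have exactly one of 'value' or 'stream'")
    stream = entry["stream"]
    # The notes say at most one of "raw" and "filtered" appears per stream.
    present = [k for k in ("raw", "filtered") if k in stream]
    if len(present) > 1:
        raise ValueError("both 'raw' and 'filtered' are present")
    kind = present[0] if present else None
    data = base64.b64decode(stream[kind]) if kind else None
    return ("stream", stream["dict"], kind, data)


# Made-up sample in the proposed shape; "aGVsbG8=" is base64 for b"hello".
sample = json.loads("""
{
  "obj:1,0": {"value": {"/Type": "/Catalog"}},
  "obj:2,0": {"stream": {"dict": {"/Length": 5},
                         "filterable": true,
                         "raw": "aGVsbG8="}}
}
""")

for key, entry in sample.items():
    print(parse_object_key(key), read_entry(entry)[0])
```

Because a stream is identified by the "stream" key alone, a reader no longer needs to cross-check "objectinfo" to tell a stream from a dictionary.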