Commit 8b67ac494e57c74d8e762de9fc476133d7cc49db
1 parent 95e7d36b
TODO: add notes on json v2 and other post-QPDFJob activities/ideas
Showing 2 changed files with 177 additions and 22 deletions
TODO
| 1 | -Next | 1 | +10.6 |
| 2 | ==== | 2 | ==== |
| 3 | 3 | ||
| 4 | -* Add user-defined initializer `QPDFObjectHandle operator ""_qpdf` to | ||
| 5 | - be like QPDFObjectHandle::parse: `auto oh = "<< /a (b) >>"_qpdf;` | 4 | +* Close issue #556. |
| 5 | + | ||
| 6 | * Add QPDF_MAJOR_VERSION, QPDF_MINOR_VERSION to some header, possibly | 6 | * Add QPDF_MAJOR_VERSION, QPDF_MINOR_VERSION to some header, possibly |
| 7 | dll.h since this is everywhere that there's API | 7 | dll.h since this is everywhere that there's API |
| 8 | 8 | ||
| 9 | -* Take a fresh look at PointerHolder with a good plan for being able | ||
| 10 | - to have developers phase it in using macros or something. Decide | ||
| 11 | - about shared_ptr vs unique_ptr for each time make_shared_cstr is | ||
| 12 | - called. For non-copiable classes, we can use unique_ptr instead of | ||
| 13 | - shared_ptr as a replacement for PointerHolder. For performance | ||
| 14 | - critical cases, we could potentially have a real pointer and a | ||
| 15 | - shared pointer where the shared pointer's job is to clean up but we | ||
| 16 | - use the real pointer for regular access. | ||
| 17 | - | ||
| 18 | -Consider in the context of #593, possibly with a different | ||
| 19 | -implementation | ||
| 20 | - | ||
| 21 | -* replace mode: --replace-object, --replace-stream-raw, | ||
| 22 | - --replace-stream-filtered | ||
| 23 | - * update first paragraph of QPDF JSON in the manual to mention this | ||
| 24 | - * object numbers are not preserved by write, so object ID lookup | ||
| 25 | - has to be done separately for each invocation | ||
| 26 | - * you don't have to specify length for streams | ||
| 27 | - * you only have to specify filtering for streams if providing raw data | 9 | +* Add user-defined initializer `QPDFObjectHandle operator ""_qpdf` to |
| 10 | + be like QPDFObjectHandle::parse: `auto oh = "<< /a (b) >>"_qpdf;` | ||
| 28 | 11 | ||
| 29 | * See if this has been done or is trivial with C++11 local static | 12 | * See if this has been done or is trivial with C++11 local static |
| 30 | initializers: Secure random number generation could be made more | 13 | initializers: Secure random number generation could be made more |
| @@ -43,6 +26,168 @@ implementation | @@ -43,6 +26,168 @@ implementation | ||
| 43 | * Completion: would be nice if --job-json-file=<TAB> would complete | 26 | * Completion: would be nice if --job-json-file=<TAB> would complete |
| 44 | files | 27 | files |
| 45 | 28 | ||
| 29 | +* Remember for release notes: starting in qpdf 11, the default value | ||
| 30 | + for the --json keyword will be "latest". If you are depending on | ||
| 31 | + version 1, change your code to specify --json=1, which works | ||
| 32 | + starting with 10.6.0. | ||
| 33 | + | ||
| 34 | +* Try to put something in to ease future PointerHolder migration, such | ||
| 35 | + as typedefs for containers of PointerHolders. Test to see whether | ||
| 36 | + using auto or decltype in certain places may make containers of | ||
| 37 | + PointerHolders switch over cleanly. Clearly document the deprecation | ||
| 38 | + stuff. | ||
| 39 | + | ||
| 40 | + | ||
| 41 | +Output JSON v2 | ||
| 42 | +============== | ||
| 43 | + | ||
| 44 | +Output JSON v2 contains enough information to completely recreate a PDF | ||
| 45 | +file. | ||
| 46 | + | ||
| 47 | +This is not an ABI change as long as the default --json version is 1. | ||
| 48 | + | ||
| 49 | +If this is done, update --json option in cli.rst to mention v2. Also | ||
| 50 | +update QPDFJob::Config::json and of course other parts of the docs | ||
| 51 | +(json.rst). | ||
| 52 | + | ||
| 53 | +Fix the following problems: | ||
| 54 | + | ||
| 55 | +* Include the PDF version header somewhere. | ||
| 56 | + | ||
| 57 | +* Using "n n R" as a key in "objects" and "objectinfo" messes up | ||
| 58 | + searching for things | ||
| 59 | + | ||
| 60 | +* Strings cannot be unambiguously encoded/decoded | ||
| 61 | + | ||
| 62 | + * Can't tell string from name from indirect object | ||
| 63 | + | ||
| 64 | + * Strings are treated as PDF doc encoding and output as UTF-8, which | ||
| 65 | + doesn't work since multiple PDF doc code points are undefined | ||
| 66 | + | ||
| 67 | +* There is no representation of stream data | ||
| 68 | + | ||
| 69 | +* You can't tell a stream from a dictionary except by looking in both | ||
| 70 | + "objects" and "objectinfo". Fix this, and then remove "objectinfo". | ||
| 71 | + | ||
| 72 | +* There are differences between information shown in the json format | ||
| 73 | + vs. information shown with options like --check, --list-attachments, | ||
| 74 | + etc. The json format should be able to completely replace things | ||
| 75 | + that write to stdout. | ||
| 76 | + | ||
| 77 | +* Consider using camelCase in multi-word key names to be consistent | ||
| 78 | + with job JSON and with how JSON is often represented in languages | ||
| 79 | + that use it more natively | ||
| 80 | + | ||
| 81 | +* Consider changing the contract to allow fields to be absent even | ||
| 82 | + when present in the schema. It's reasonable for people to check for | ||
| 83 | + presence of a key. Most languages make this easy to do. | ||
| 84 | + | ||
| 85 | +Most things that are informational can stay the same. We will have to | ||
| 86 | +go through every item to decide for sure. | ||
| 87 | + | ||
| 88 | +To address ambiguity, consider the following: | ||
| 89 | + | ||
| 90 | +Whenever a direct PDF object appears, disambiguate things represented | ||
| 91 | +in JSON as strings as follows: | ||
| 92 | + | ||
| 93 | +* "/Name" -- if it starts with /, it's a name | ||
| 94 | +* "n n R" -- if it is "n n R", it's an indirect object | ||
| 95 | +* "u:utf8-encoded" -- a utf8-encoded string | ||
| 96 | +* "b:<12ab34>" -- a binary string | ||
| 97 | + | ||
| 98 | +In "objects", the key is "obj:o,g", and the value is a dictionary with | ||
| 99 | +exactly one of "value" or "stream" as its single key. | ||
| 100 | + | ||
| 101 | +For non-streams, the value of "value" is as described above. | ||
| 102 | + | ||
| 103 | +{ | ||
| 104 | + "obj:o,g": { | ||
| 105 | + "value": ... | ||
| 106 | + } | ||
| 107 | +} | ||
| 108 | + | ||
| 109 | +For streams: | ||
| 110 | + | ||
| 111 | +{ | ||
| 112 | + "obj:o,g": { | ||
| 113 | + "stream": { | ||
| 114 | + "dict": { ... stream dictionary ... }, | ||
| 115 | + "filterable": bool, | ||
| 116 | + "raw": "base64-encoded raw data", | ||
| 117 | + "filtered": "base64-encoded filtered data" | ||
| 118 | + } | ||
| 119 | + } | ||
| 120 | +} | ||
| 121 | + | ||
| 122 | +Notes about stream data: | ||
| 123 | + | ||
| 124 | +* Always include "dict". | ||
| 125 | + | ||
| 126 | +* Always include "filterable" regardless of value of | ||
| 127 | + --json-stream-data. The value of filterable is influenced by | ||
| 128 | + --decode-level, which is already in parameters. | ||
| 129 | + | ||
| 130 | +* Add new flag --json-stream-data={raw,filtered,none}. At most one of | ||
| 131 | + "raw" and "filtered" will appear for each stream. | ||
| 132 | + | ||
| 133 | +* Add to parameters: value of json-stream-data, default is none | ||
| 134 | + | ||
| 135 | +* If none, omit stream data entirely | ||
| 136 | + | ||
| 137 | +* If raw, include raw stream data as base64 | ||
| 138 | + | ||
| 139 | +* If filtered, include the base64-encoded filtered stream data if we | ||
| 140 | + can and should decode it based on decode-level. Otherwise, include | ||
| 141 | + the base64-encoded raw data. See if we can honor | ||
| 142 | + --normalize-content. | ||
| 143 | + | ||
| 144 | +Note that --json-stream-data=filtered is different from | ||
| 145 | +--filtered-stream-data in that --filtered-stream-data implies | ||
| 146 | +--decode-level=all while --json-stream-data=filtered does not. Make | ||
| 147 | +sure this is mentioned in the help for both options. | ||
| 148 | + | ||
| 149 | +QPDFJob | ||
| 150 | +======= | ||
| 151 | + | ||
| 152 | +Here are some ideas for QPDFJob that didn't make it into 10.6. Not all | ||
| 153 | +of these are necessarily good -- just things to consider. | ||
| 154 | + | ||
| 155 | +* replace mode: --replace-object, --replace-stream-raw, | ||
| 156 | + --replace-stream-filtered | ||
| 157 | + * update first paragraph of QPDF JSON in the manual to mention this | ||
| 158 | + * object numbers are not preserved by write, so object ID lookup | ||
| 159 | + has to be done separately for each invocation | ||
| 160 | + * you don't have to specify length for streams | ||
| 161 | + * you only have to specify filtering for streams if providing raw data | ||
| 162 | + | ||
| 163 | +* Allow users to supply a custom progress reporter for QPDFJob | ||
| 164 | + | ||
| 165 | +* Better interoperability with json output: | ||
| 166 | + | ||
| 167 | + * Make sure all the things that print stuff to stdout have json | ||
| 168 | + equivalents (check, showLinearizationData, etc.) | ||
| 169 | + * There should be a way to get json output other than having it | ||
| 170 | + print to stdout. It should be multi-language friendly and allow | ||
| 171 | + for large amounts of data, such as providing a callback that qpdf | ||
| 172 | + can write to (like a pipeline) | ||
| 173 | + * See also JSON v2 | ||
| 174 | + | ||
| 175 | +* How do we chain jobs? The idea would be that the input and/or output | ||
| 176 | + of a QPDFJob could be a QPDF object rather than a file. For input, | ||
| 177 | + it's pretty easy. For output, none of the output-specific options | ||
| 178 | + (encrypt, compress-streams, object-streams, etc.) would have any | ||
| 179 | + effect, so we would have to treat this like inspect for error | ||
| 180 | + checking. The QPDF object in the state where it's ready to be sent | ||
| 181 | + off to QPDFWriter would be used as the input to the next QPDFJob. | ||
| 182 | + For the job json, I think we can have the output be an identifier | ||
| 183 | + that can be used as the input for another QPDFJob. For a json file, | ||
| 184 | + we could detect at the top level if it's an array with the convention | ||
| 185 | + that exactly one has an output, or we could have a subkey with other | ||
| 186 | + job definitions or something. Ideally, any input | ||
| 187 | + (copy-attachments-from, pages, etc.) could use a QPDF object. It | ||
| 188 | + wouldn't surprise me if this exposes bugs in qpdf around foreign | ||
| 189 | + streams as this has been a relatively fragile area before. | ||
| 190 | + | ||
| 46 | Documentation | 191 | Documentation |
| 47 | ============= | 192 | ============= |
| 48 | 193 | ||
| @@ -210,6 +355,15 @@ This is a list of changes to make next time there is an ABI change. | @@ -210,6 +355,15 @@ This is a list of changes to make next time there is an ABI change. | ||
| 210 | Comments appear in the code prefixed by "ABI" | 355 | Comments appear in the code prefixed by "ABI" |
| 211 | 356 | ||
| 212 | * Search for ABI to find items not listed here. | 357 | * Search for ABI to find items not listed here. |
| 358 | +* Switch default --json to latest | ||
| 359 | +* Take a fresh look at PointerHolder with a good plan for being able | ||
| 360 | + to have developers phase it in using macros or something. Decide | ||
| 361 | + about shared_ptr vs unique_ptr for each time make_shared_cstr is | ||
| 362 | + called. For non-copiable classes, we can use unique_ptr instead of | ||
| 363 | + shared_ptr as a replacement for PointerHolder. For performance | ||
| 364 | + critical cases, we could potentially have a real pointer and a | ||
| 365 | + shared pointer where the shared pointer's job is to clean up but we | ||
| 366 | + use the real pointer for regular access. | ||
| 213 | * See where anonymous namespaces can be used to keep things private to | 367 | * See where anonymous namespaces can be used to keep things private to |
| 214 | a source file. Search for `(class|struct)` in **/*.cc. | 368 | a source file. Search for `(class|struct)` in **/*.cc. |
| 215 | * See if we can use constructor delegation instead of init() in | 369 | * See if we can use constructor delegation instead of init() in |