Commit 5b6cf45f8365562d805547b6ca774f2ba2a0c90f
1 parent eff911a9

Cleanup TODO

Showing 2 changed files with 216 additions and 224 deletions
README-maintainer
| ... | ... | @@ -110,7 +110,16 @@ CODING RULES |
| 110 | 110 | the shared library boundary. |
| 111 | 111 | |
| 112 | 112 | * Put private member variables in PointerHolder<Members> for all |
| 113 | - public classes. Remember to use QPDF_DLL on ~Members(). | |
| 113 | + public classes. Remember to use QPDF_DLL on ~Members(). Exception: | |
| 114 | + indirection through PointerHolder<Members> is expensive, so don't do | |
| 115 | + it for classes that are copied a lot, like QPDFObjectHandle and | |
| 116 | + QPDFObject. | |
| 117 | + | |
| 118 | +* Traversal of objects is expensive. It's worth adding some complexity | |
| 119 | + to avoid needless traversals of objects. | |
| 120 | + | |
| 121 | +* Avoid attaching too much metadata to objects and object handles | |
| 122 | + since those have to get copied around a lot. | |
| 114 | 123 | |
| 115 | 124 | |
| 116 | 125 | RELEASE PREPARATION | ... | ... |
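The PointerHolder<Members> rule above can be sketched as follows. This is a minimal illustration with hypothetical class and member names, using std::shared_ptr in place of qpdf's PointerHolder and omitting the QPDF_DLL export macro; it is not qpdf's actual code:

```cpp
#include <memory>

// Sketch of the Members idiom: all private state lives in a nested
// Members class held through a smart pointer, so the public class's
// layout stays stable across library versions. Names are hypothetical.
class Example
{
  public:
    Example() :
        m(new Members())
    {
    }
    int getValue() const
    {
        return m->value;
    }

  private:
    class Members
    {
      public:
        // In qpdf, ~Members() is declared with QPDF_DLL so the
        // destructor is instantiated inside the shared library.
        ~Members() = default;
        int value{42};
    };
    // qpdf uses PointerHolder<Members> here; per the exception above,
    // this indirection is too expensive for heavily copied classes
    // such as QPDFObjectHandle and QPDFObject.
    std::shared_ptr<Members> m;
};
```

The cost noted in the exception comes from the extra pointer dereference and the reference-count traffic on every copy, which is why heavily copied handle classes are exempt.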
TODO
| ... | ... | @@ -3,6 +3,23 @@ Next |
| 3 | 3 | |
| 4 | 4 | * High-level API/doc overhaul: #593 |
| 5 | 5 | |
| 6 | +Consider in the context of #593, possibly with a different | |
| 7 | +implementation: | |
| 8 | + | |
| 9 | +* replace mode: --replace-object, --replace-stream-raw, | |
| 10 | + --replace-stream-filtered | |
| 11 | + * update first paragraph of QPDF JSON in the manual to mention this | |
| 12 | + * object numbers are not preserved by write, so object ID lookup | |
| 13 | + has to be done separately for each invocation | |
| 14 | + * you don't have to specify length for streams | |
| 15 | + * you only have to specify filtering for streams if providing raw data | |
| 16 | + | |
| 17 | +* See if this has been done or is trivial with C++11 local static | |
| 18 | + initializers: Secure random number generation could be made more | |
| 19 | + efficient by using a local static to ensure a single random device | |
| 20 | + or crypt provider as long as this can be done in a thread-safe | |
| 21 | + fashion. | |
| 22 | + | |
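The C++11 feature the local-static item above refers to is thread-safe initialization of block-scope statics ("magic statics"): the first caller constructs the object and concurrent callers block until construction finishes, so a single random device can be shared without an explicit mutex. A minimal sketch, with hypothetical function names rather than qpdf's actual API:

```cpp
#include <random>

// Since C++11, initialization of a function-local static is
// guaranteed to be thread-safe, so this random_device is constructed
// exactly once even if many threads call in concurrently.
// Function names here are hypothetical.
static std::random_device& sharedRandomDevice()
{
    static std::random_device dev; // constructed once, thread-safely
    return dev;
}

unsigned int secureRandom()
{
    return sharedRandomDevice()();
}
```

Every call returns a reference to the same device, avoiding the cost of re-opening a random device or crypt provider on each request.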
| 6 | 23 | Documentation |
| 7 | 24 | ============= |
| 8 | 25 | |
| ... | ... | @@ -21,6 +38,8 @@ Documentation |
| 21 | 38 | Document-level work |
| 22 | 39 | =================== |
| 23 | 40 | |
| 41 | +* Ideas here may be superseded by #593. | |
| 42 | + | |
| 24 | 43 | * QPDFPageCopier -- object for moving pages around within files or |
| 25 | 44 | between files and performing various transformations |
| 26 | 45 | |
| ... | ... | @@ -97,19 +116,6 @@ Fuzz Errors |
| 97 | 116 | * Ignoring these: |
| 98 | 117 | * Out of memory in dct: 35001, 32516 |
| 99 | 118 | |
| 100 | -GitHub Actions | |
| 101 | -============== | |
| 102 | - | |
| 103 | -* Actions are triggered on push to main and master. When we eventually | |
| 104 | - rename master to main, make sure the reference to master is removed | |
| 105 | - from .github/workflows/*.yml. | |
| 106 | - | |
| 107 | -* At the time of migrating from Azure Pipelines to GitHub Actions | |
| 108 | - (2020-10), there was no standard test result publisher (to replace | |
| 109 | - the PublishTestResults@2 task). There are some third-party actions, | |
| 110 | - but I'd rather not depend on them. Keep an eye open for this coming | |
| 111 | - to GitHub Actions. | |
| 112 | - | |
| 113 | 119 | External Libraries |
| 114 | 120 | ================== |
| 115 | 121 | |
| ... | ... | @@ -288,66 +294,6 @@ Page splitting/merging |
| 288 | 294 | |
| 289 | 295 | * Form fields: should be similar to outlines. |
| 290 | 296 | |
| 291 | -Performance | |
| 292 | -=========== | |
| 293 | - | |
| 294 | -As described in https://github.com/qpdf/qpdf/issues/401, there was | |
| 295 | -great performance degradation between qpdf 7.1.1 and 9.1.1. Doing a | |
| 296 | -bisect between dac65a21fb4fa5f871e31c314280b75adde89a6c and | |
| 297 | -release-qpdf-7.1.1, I found several commits that damaged performance. | |
| 298 | -I fixed some of them to improve performance by about 70% (as measured | |
| 299 | -by saying that old times were 170% of new times). The remaining | |
| 300 | -commits that broke performance either can't be correct because they | |
| 301 | -would re-introduce an old bug or aren't worth correcting because of | |
| 302 | -the high value they offer relative to a relatively low penalty. For | |
| 303 | -historical reference, here are the commits. The numbers are the time | |
| 304 | -in seconds on the machine I happened to be using of splitting the | |
| 305 | -first 100 pages of PDF32000_2008.pdf 20 times and taking an average | |
| 306 | -duration. | |
| 307 | - | |
| 308 | -Commits that broke performance: | |
| 309 | - | |
| 310 | -* d0e99f195a987c483bbb6c5449cf39bee34e08a1 -- object description and | |
| 311 | - context: 0.39 -> 0.45 | |
| 312 | -* a01359189b32c60c2d55b039f7aefd6c3ce0ebde (minus 313ba08) -- fix | |
| 313 | - dangling references: 0.55 -> 0.6 | |
| 314 | -* e5f504b6c5dc34337cc0b316b4a7b1fca7e614b1 -- sparse array: 0.6 -> 0.62 | |
| 315 | - | |
| 316 | -Other intermediate steps that were previously fixed: | |
| 317 | - | |
| 318 | -* 313ba081265f69ac9a0324f9fe87087c72918191 -- copy outlines into | |
| 319 | - split: 0.55 -> 4.0 | |
| 320 | -* a01359189b32c60c2d55b039f7aefd6c3ce0ebde -- fix dangling references: | |
| 321 | - 4.0 -> 9.0 | |
| 322 | - | |
| 323 | -This commit fixed the awful problem introduced in 313ba081: | |
| 324 | - | |
| 325 | -* a5a016cdd26a8e5c99e5f019bc30d1bdf6c050a2 -- revert outline | |
| 326 | - preservation: 9.0 -> 0.6 | |
| 327 | - | |
| 328 | -Note that the fix dangling references commit had a much worse impact | |
| 329 | -prior to removing the outline preservation, so I also measured its | |
| 330 | -impact in isolation. | |
| 331 | - | |
| 332 | -A few important lessons: | |
| 333 | - | |
| 334 | -* Indirection through PointerHolder<Members> is expensive, and should | |
| 335 | - not be used for things that are created and destroyed frequently | |
| 336 | - such as QPDFObjectHandle and QPDFObject. | |
| 337 | -* Traversal of objects is expensive and should be avoided where | |
| 338 | - possible. | |
| 339 | - | |
| 340 | -Future ideas: | |
| 341 | - | |
| 342 | -* Look at places in the code where object traversal is being done and, | |
| 343 | - where possible, try to avoid it entirely or at least avoid ever | |
| 344 | - traversing the same objects multiple times. | |
| 345 | -* Avoid attaching too much metadata to objects and object handles | |
| 346 | - since those have to get copied around a lot. | |
| 347 | - | |
| 348 | -Also, it turns out that PointerHolder is more performant than | |
| 349 | -std::shared_ptr. | |
| 350 | - | |
| 351 | 297 | Analytics |
| 352 | 298 | ========= |
| 353 | 299 | |
| ... | ... | @@ -406,9 +352,6 @@ I find it useful to make reference to them in this list. |
| 406 | 352 | dictionary may need to be changed -- create test cases with lots of |
| 407 | 353 | duplicated/overlapping keys. |
| 408 | 354 | |
| 409 | - * Investigate whether there is a way to automate the memory checker | |
| 410 | - tests for Windows. | |
| 411 | - | |
| 412 | 355 | * Part of closed_file_input_source.cc is disabled on Windows because |
| 413 | 356 | of odd failures. It might be worth investigating so we can fully |
| 414 | 357 | exercise this in the test suite. That said, ClosedFileInputSource |
| ... | ... | @@ -435,14 +378,6 @@ I find it useful to make reference to them in this list. |
| 435 | 378 | * set value from CLI? Specify title, and provide way to |
| 436 | 379 | disambiguate, probably by giving objgen of field |
| 437 | 380 | |
| 438 | - * replace mode: --replace-object, --replace-stream-raw, | |
| 439 | - --replace-stream-filtered | |
| 440 | - * update first paragraph of QPDF JSON in the manual to mention this | |
| 441 | - * object numbers are not preserved by write, so object ID lookup | |
| 442 | - has to be done separately for each invocation | |
| 443 | - * you don't have to specify length for streams | |
| 444 | - * you only have to specify filtering for streams if providing raw data | |
| 445 | - | |
| 446 | 381 | * Pl_TIFFPredictor is pretty slow. |
| 447 | 382 | |
| 448 | 383 | * Support for handling file names with Unicode characters in Windows |
| ... | ... | @@ -486,46 +421,6 @@ I find it useful to make reference to them in this list. |
| 486 | 421 | |
| 487 | 422 | * Look at ~/Q/pdf-collection/forms-from-appian/ |
| 488 | 423 | |
| 489 | - * Consider adding "uninstall" target to makefile. It should only | |
| 490 | - uninstall what it installed, which means that you must run | |
| 491 | - uninstall from the version you ran install with. It would only be | |
| 492 | - supported for the toolchains that support the install target | |
| 493 | - (libtool). | |
| 494 | - | |
| 495 | - * Provide support in QPDFWriter for writing incremental updates. | |
| 496 | - Provide support in qpdf for preserving incremental updates. The | |
| 497 | - goal should be that QDF mode should be fully functional for files | |
| 498 | - with incremental updates including fix_qdf. | |
| 499 | - | |
| 500 | - Note that there's nothing that says an indirect object in one | |
| 501 | - update can't refer to an object that doesn't appear until a later | |
| 502 | - update. This means that QPDF has to treat indirect null objects | |
| 503 | - differently from how it does now. QPDF drops indirect null objects | |
| 504 | - that appear as members of arrays or dictionaries. For arrays, it's | |
| 505 | - handled in QPDFWriter where we make indirect nulls direct. This is | |
| 506 | - in a single if block, and nothing else in the code cares about it. | |
| 507 | - We could just remove that if block and not break anything except a | |
| 508 | - few test cases that exercise the current behavior. For | |
| 509 | - dictionaries, it's more complicated. In this case, | |
| 510 | - QPDF_Dictionary::getKeys() ignores all keys with null values, and | |
| 511 | - hasKey() returns false for keys that have null values. We would | |
| 512 | - probably want to make QPDF_Dictionary able to handle the special | |
| 513 | - case of keys that are indirect nulls and basically never have it | |
| 514 | - drop any keys that are indirect objects. | |
| 515 | - | |
| 516 | - If we make a change to have qpdf preserve indirect references to | |
| 517 | - null objects, we have to note this in ChangeLog and in the release | |
| 518 | - notes since this will change output files. We did this before when | |
| 519 | - we stopped flattening scalar references, so this is probably not a | |
| 520 | - big deal. We also have to make sure that the testing for this | |
| 521 | - handles non-trivial cases of the targets of indirect nulls being | |
| 522 | - replaced by real objects in an update. I'm not sure how this plays | |
| 523 | - with linearization, if at all. For cases where incremental updates | |
| 524 | - are not being preserved as incremental updates and where the data | |
| 525 | - is being folded in (as is always the case with qpdf now), none of | |
| 526 | - this should make any difference in the actual semantics of the | |
| 527 | - files. | |
| 528 | - | |
| 529 | 424 | * When decrypting files with /R=6, hash_V5 is called more than once |
| 530 | 425 | with the same inputs. Caching the results or refactoring to reduce |
| 531 | 426 | the number of identical calls could improve performance for |
| ... | ... | @@ -536,12 +431,6 @@ I find it useful to make reference to them in this list. |
| 536 | 431 | and replace the /Pages key of the root dictionary with the new |
| 537 | 432 | tree. |
| 538 | 433 | |
| 539 | - * Secure random number generation could be made more efficient by | |
| 540 | - using a local static to ensure a single random device or crypt | |
| 541 | - provider as long as this can be done in a thread-safe fashion. In | |
| 542 | - the initial implementation, this is being skipped to avoid having | |
| 543 | - to add any dependencies on threading libraries. | |
| 544 | - | |
| 545 | 434 | * Study what's required to support savable forms that can be saved by |
| 546 | 435 | Adobe Reader. Does this require actually signing the document with |
| 547 | 436 | an Adobe private key? Search for "Digital signatures" in the PDF |
| ... | ... | @@ -551,19 +440,6 @@ I find it useful to make reference to them in this list. |
| 551 | 440 | implemented, update the docs on crypto providers, which mention |
| 552 | 441 | that this may happen in the future. |
| 553 | 442 | |
| 554 | - * Provide APIs for embedded files. See *attachments*.pdf in test | |
| 555 | - suite. The private method findAttachmentStreams finds at least | |
| 556 | - cases for modern versions of Adobe Reader (>= 1.7, maybe earlier). | |
| 557 | - PDF Reference 1.7 section 3.10, "File Specifications", discusses | |
| 558 | - this. | |
| 559 | - | |
| 560 | - A sourceforge user asks if qpdf can handle extracting and embedded | |
| 561 | - resources and references these tools, which may be useful as a | |
| 562 | - reference. | |
| 563 | - | |
| 564 | - http://multivalent.sourceforge.net/Tools/pdf/Extract.html | |
| 565 | - http://multivalent.sourceforge.net/Tools/pdf/Embed.html | |
| 566 | - | |
| 567 | 443 | * Qpdf does not honor /EFF when adding new file attachments. When it |
| 568 | 444 | encrypts, it never generates streams with explicit crypt filters. |
| 569 | 445 | Prior to 10.2, there was an incorrect attempt to treat /EFF as a |
| ... | ... | @@ -572,15 +448,6 @@ I find it useful to make reference to them in this list. |
| 572 | 448 | writers to obey this when adding new attachments. Qpdf is not a |
| 573 | 449 | conforming writer in that respect. |
| 574 | 450 | |
| 575 | - * The second xref stream for linearized files has to be padded only | |
| 576 | - because we need file_size as computed in pass 1 to be accurate. If | |
| 577 | - we were not allowing writing to a pipe, we could seek back to the | |
| 578 | - beginning and fill in the value of /L in the linearization | |
| 579 | - dictionary as an optimization to alleviate the need for this | |
| 580 | - padding. Doing so would require us to pad the /L value | |
| 581 | - individually and also to save the file descriptor and determine | |
| 582 | - whether it's seekable. This is probably not worth bothering with. | |
| 583 | - | |
| 584 | 451 | * The whole xref handling code in the QPDF object allows the same |
| 585 | 452 | object with more than one generation to coexist, but a lot of logic |
| 586 | 453 | assumes this isn't the case. Anything that creates mappings only |
| ... | ... | @@ -593,10 +460,6 @@ I find it useful to make reference to them in this list. |
| 593 | 460 | viewing software silently ignores objects of this type, so this is |
| 594 | 461 | probably not a big deal. |
| 595 | 462 | |
| 596 | - * Based on an idea suggested by user "Atom Smasher", consider | |
| 597 | - providing some mechanism to recover earlier versions of a file | |
| 598 | - embedded prior to appended sections. | |
| 599 | - | |
| 600 | 463 | * From a suggestion in bug 3152169, consider having an option to |
| 601 | 464 | re-encode inline images with an ASCII encoding. |
| 602 | 465 | |
| ... | ... | @@ -616,70 +479,190 @@ I find it useful to make reference to them in this list. |
| 616 | 479 | fonts or font metrics, see email from Tobias with Message-ID |
| 617 | 480 | <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14. |
| 618 | 481 | |
| 619 | - * Consider creating a sanitizer to make it easier for people to send | |
| 620 | - broken files. Now that we have json mode, this is probably no | |
| 621 | - longer worth doing. Here is the previous idea, possibly implemented | |
| 622 | - by making it possible to run the lexer (tokenizer) over a whole | |
| 623 | - file. Make it possible to replace all strings in a file lexically | |
| 624 | - even on badly broken files. Ideally this should work files that are | |
| 625 | - lacking xref, have broken links, duplicated dictionary keys, syntax | |
| 626 | - errors, etc., and ideally it should work with encrypted files if | |
| 627 | - possible. This should go through the streams and strings and | |
| 628 | - replace them with fixed or random characters, preferably, but not | |
| 629 | - necessarily, in a manner that works with fonts. One possibility | |
| 630 | - would be to detect whether a string contains characters with normal | |
| 631 | - encoding, and if so, use 0x41. If the string uses character maps, | |
| 632 | - use 0x01. The output should otherwise be unrelated to the input. | |
| 633 | - This could be built after the filtering and tokenizer rewrite and | |
| 634 | - should be done in a manner that takes advantage of the other | |
| 635 | - lexical features. This sanitizer should also clear metadata and | |
| 636 | - replace images. If I ever do this, the file from issue #494 would | |
| 637 | - be a great one to look at. | |
| 638 | - | |
| 639 | - * Here are some notes about having stream data providers modify | |
| 640 | - stream dictionaries. I had wanted to add this functionality to make | |
| 641 | - it more efficient to create stream data providers that may | |
| 642 | - dynamically decide what kind of filters to use and that may end up | |
| 643 | - modifying the dictionary conditionally depending on the original | |
| 644 | - stream data. Ultimately I decided not to implement this feature. | |
| 645 | - This paragraph describes why. | |
| 646 | - | |
| 647 | - * When writing, the way objects are placed into the queue for | |
| 648 | - writing strongly precludes creation of any new indirect objects, | |
| 649 | - or even changing which indirect objects are referenced from which | |
| 650 | - other objects, because we sometimes write as we are traversing | |
| 651 | - and enqueuing objects. For non-linearized files, there is a risk | |
| 652 | - that an indirect object that used to be referenced would no | |
| 653 | - longer be referenced, and whether it was already written to the | |
| 654 | - output file would be based on an accident of where it was | |
| 655 | - encountered when traversing the object structure. For linearized | |
| 656 | - files, the situation is considerably worse. We decide which | |
| 657 | - section of the file to write an object to based on a mapping of | |
| 658 | - which objects are used by which other objects. Changing this | |
| 659 | - mapping could cause an object to appear in the wrong section, to | |
| 660 | - be written even though it is unreferenced, or to be entirely | |
| 661 | - omitted since, during linearization, we don't enqueue new objects | |
| 662 | - as we traverse for writing. | |
| 663 | - | |
| 664 | - * There are several places in QPDFWriter that query a stream's | |
| 665 | - dictionary in order to prepare for writing or to make decisions | |
| 666 | - about certain aspects of the writing process. If the stream data | |
| 667 | - provider has the chance to modify the dictionary, every piece of | |
| 668 | - code that gets stream data would have to be aware of this. This | |
| 669 | - would potentially include end user code. For example, any code | |
| 670 | - that called getDict() on a stream before installing a stream data | |
| 671 | - provider and expected that dictionary to be valid would | |
| 672 | - potentially be broken. As implemented right now, you must perform | |
| 673 | - any modifications on the dictionary in advance and provided | |
| 674 | - /Filter and /DecodeParms at the time you installed the stream | |
| 675 | - data provider. This means that some computations would have to be | |
| 676 | - done more than once, but for linearized files, stream data | |
| 677 | - providers are already called more than once. If the work done by | |
| 678 | - a stream data provider is especially expensive, it can implement | |
| 679 | - its own cache. | |
| 680 | - | |
| 681 | - The example examples/pdf-custom-filter.cc demonstrates the use of | |
| 682 | - custom stream filters. This includes a custom pipeline, a custom | |
| 683 | - stream filter, as well as modification of a stream's dictionary to | |
| 684 | - include creation of a new stream that is referenced from | |
| 685 | - /DecodeParms. | |
| 482 | + * Look at places in the code where object traversal is being done and, | |
| 483 | + where possible, try to avoid it entirely or at least avoid ever | |
| 484 | + traversing the same objects multiple times. | |
| 485 | + | |
| 486 | +---------------------------------------------------------------------- | |
| 487 | + | |
| 488 | +HISTORICAL NOTES | |
| 489 | + | |
| 490 | +Performance | |
| 491 | +=========== | |
| 492 | + | |
| 493 | +As described in https://github.com/qpdf/qpdf/issues/401, there was | |
| 494 | +great performance degradation between qpdf 7.1.1 and 9.1.1. Doing a | |
| 495 | +bisect between dac65a21fb4fa5f871e31c314280b75adde89a6c and | |
| 496 | +release-qpdf-7.1.1, I found several commits that damaged performance. | |
| 497 | +I fixed some of them, improving performance by about 70% (meaning | |
| 498 | +that old times were 170% of new times). The remaining | |
| 499 | +performance-breaking commits either can't be corrected, because doing | |
| 500 | +so would re-introduce an old bug, or aren't worth correcting because of | |
| 501 | +the high value they offer relative to the relatively low penalty. For | |
| 502 | +historical reference, here are the commits. The numbers are the time | |
| 503 | +in seconds on the machine I happened to be using of splitting the | |
| 504 | +first 100 pages of PDF32000_2008.pdf 20 times and taking an average | |
| 505 | +duration. | |
| 506 | + | |
| 507 | +Commits that broke performance: | |
| 508 | + | |
| 509 | +* d0e99f195a987c483bbb6c5449cf39bee34e08a1 -- object description and | |
| 510 | + context: 0.39 -> 0.45 | |
| 511 | +* a01359189b32c60c2d55b039f7aefd6c3ce0ebde (minus 313ba08) -- fix | |
| 512 | + dangling references: 0.55 -> 0.6 | |
| 513 | +* e5f504b6c5dc34337cc0b316b4a7b1fca7e614b1 -- sparse array: 0.6 -> 0.62 | |
| 514 | + | |
| 515 | +Other intermediate steps that were previously fixed: | |
| 516 | + | |
| 517 | +* 313ba081265f69ac9a0324f9fe87087c72918191 -- copy outlines into | |
| 518 | + split: 0.55 -> 4.0 | |
| 519 | +* a01359189b32c60c2d55b039f7aefd6c3ce0ebde -- fix dangling references: | |
| 520 | + 4.0 -> 9.0 | |
| 521 | + | |
| 522 | +This commit fixed the awful problem introduced in 313ba081: | |
| 523 | + | |
| 524 | +* a5a016cdd26a8e5c99e5f019bc30d1bdf6c050a2 -- revert outline | |
| 525 | + preservation: 9.0 -> 0.6 | |
| 526 | + | |
| 527 | +Note that the fix dangling references commit had a much worse impact | |
| 528 | +prior to removing the outline preservation, so I also measured its | |
| 529 | +impact in isolation. | |
| 530 | + | |
| 531 | +A few important lessons (now in README-maintainer): | |
| 532 | + | |
| 533 | +* Indirection through PointerHolder<Members> is expensive, and should | |
| 534 | + not be used for things that are created and destroyed frequently | |
| 535 | + such as QPDFObjectHandle and QPDFObject. | |
| 536 | +* Traversal of objects is expensive and should be avoided where | |
| 537 | + possible. | |
| 538 | + | |
| 539 | +Also, it turns out that PointerHolder is more performant than | |
| 540 | +std::shared_ptr. | |
| 541 | + | |
| 542 | + | |
| 543 | +Rejected Ideas | |
| 544 | +============== | |
| 545 | + | |
| 546 | +* Investigate whether there is a way to automate the memory checker | |
| 547 | + tests for Windows. | |
| 548 | + | |
| 549 | +* Consider adding "uninstall" target to makefile. It should only | |
| 550 | + uninstall what it installed, which means that you must run | |
| 551 | + uninstall from the version you ran install with. It would only be | |
| 552 | + supported for the toolchains that support the install target | |
| 553 | + (libtool). | |
| 554 | + | |
| 555 | +* Provide support in QPDFWriter for writing incremental updates. | |
| 556 | + Provide support in qpdf for preserving incremental updates. The | |
| 557 | + goal should be that QDF mode should be fully functional for files | |
| 558 | + with incremental updates including fix_qdf. | |
| 559 | + | |
| 560 | + Note that there's nothing that says an indirect object in one | |
| 561 | + update can't refer to an object that doesn't appear until a later | |
| 562 | + update. This means that QPDF has to treat indirect null objects | |
| 563 | + differently from how it does now. QPDF drops indirect null objects | |
| 564 | + that appear as members of arrays or dictionaries. For arrays, it's | |
| 565 | + handled in QPDFWriter where we make indirect nulls direct. This is | |
| 566 | + in a single if block, and nothing else in the code cares about it. | |
| 567 | + We could just remove that if block and not break anything except a | |
| 568 | + few test cases that exercise the current behavior. For | |
| 569 | + dictionaries, it's more complicated. In this case, | |
| 570 | + QPDF_Dictionary::getKeys() ignores all keys with null values, and | |
| 571 | + hasKey() returns false for keys that have null values. We would | |
| 572 | + probably want to make QPDF_Dictionary able to handle the special | |
| 573 | + case of keys that are indirect nulls and basically never have it | |
| 574 | + drop any keys that are indirect objects. | |
| 575 | + | |
| 576 | + If we make a change to have qpdf preserve indirect references to | |
| 577 | + null objects, we have to note this in ChangeLog and in the release | |
| 578 | + notes since this will change output files. We did this before when | |
| 579 | + we stopped flattening scalar references, so this is probably not a | |
| 580 | + big deal. We also have to make sure that the testing for this | |
| 581 | + handles non-trivial cases of the targets of indirect nulls being | |
| 582 | + replaced by real objects in an update. I'm not sure how this plays | |
| 583 | + with linearization, if at all. For cases where incremental updates | |
| 584 | + are not being preserved as incremental updates and where the data | |
| 585 | + is being folded in (as is always the case with qpdf now), none of | |
| 586 | + this should make any difference in the actual semantics of the | |
| 587 | + files. | |
| 588 | + | |
| 589 | +* The second xref stream for linearized files has to be padded only | |
| 590 | + because we need file_size as computed in pass 1 to be accurate. If | |
| 591 | + we were not allowing writing to a pipe, we could seek back to the | |
| 592 | + beginning and fill in the value of /L in the linearization | |
| 593 | + dictionary as an optimization to alleviate the need for this | |
| 594 | + padding. Doing so would require us to pad the /L value | |
| 595 | + individually and also to save the file descriptor and determine | |
| 596 | + whether it's seekable. This is probably not worth bothering with. | |
| 597 | + | |
| 598 | +* Based on an idea suggested by user "Atom Smasher", consider | |
| 599 | + providing some mechanism to recover earlier versions of a file | |
| 600 | + embedded prior to appended sections. | |
| 601 | + | |
| 602 | +* Consider creating a sanitizer to make it easier for people to send | |
| 603 | + broken files. Now that we have json mode, this is probably no | |
| 604 | + longer worth doing. Here is the previous idea, possibly implemented | |
| 605 | + by making it possible to run the lexer (tokenizer) over a whole | |
| 606 | + file. Make it possible to replace all strings in a file lexically | |
| 607 | + even on badly broken files. Ideally this should work with files that are | |
| 608 | + lacking xref, have broken links, duplicated dictionary keys, syntax | |
| 609 | + errors, etc., and ideally it should work with encrypted files if | |
| 610 | + possible. This should go through the streams and strings and | |
| 611 | + replace them with fixed or random characters, preferably, but not | |
| 612 | + necessarily, in a manner that works with fonts. One possibility | |
| 613 | + would be to detect whether a string contains characters with normal | |
| 614 | + encoding, and if so, use 0x41. If the string uses character maps, | |
| 615 | + use 0x01. The output should otherwise be unrelated to the input. | |
| 616 | + This could be built after the filtering and tokenizer rewrite and | |
| 617 | + should be done in a manner that takes advantage of the other | |
| 618 | + lexical features. This sanitizer should also clear metadata and | |
| 619 | + replace images. If I ever do this, the file from issue #494 would | |
| 620 | + be a great one to look at. | |
| 621 | + | |
| 622 | +* Here are some notes about having stream data providers modify | |
| 623 | + stream dictionaries. I had wanted to add this functionality to make | |
| 624 | + it more efficient to create stream data providers that may | |
| 625 | + dynamically decide what kind of filters to use and that may end up | |
| 626 | + modifying the dictionary conditionally depending on the original | |
| 627 | + stream data. Ultimately I decided not to implement this feature. | |
| 628 | + This paragraph describes why. | |
| 629 | + | |
| 630 | + * When writing, the way objects are placed into the queue for | |
| 631 | + writing strongly precludes creation of any new indirect objects, | |
| 632 | + or even changing which indirect objects are referenced from which | |
| 633 | + other objects, because we sometimes write as we are traversing | |
| 634 | + and enqueuing objects. For non-linearized files, there is a risk | |
| 635 | + that an indirect object that used to be referenced would no | |
| 636 | + longer be referenced, and whether it was already written to the | |
| 637 | + output file would be based on an accident of where it was | |
| 638 | + encountered when traversing the object structure. For linearized | |
| 639 | + files, the situation is considerably worse. We decide which | |
| 640 | + section of the file to write an object to based on a mapping of | |
| 641 | + which objects are used by which other objects. Changing this | |
| 642 | + mapping could cause an object to appear in the wrong section, to | |
| 643 | + be written even though it is unreferenced, or to be entirely | |
| 644 | + omitted since, during linearization, we don't enqueue new objects | |
| 645 | + as we traverse for writing. | |
| 646 | + | |
| 647 | + * There are several places in QPDFWriter that query a stream's | |
| 648 | + dictionary in order to prepare for writing or to make decisions | |
| 649 | + about certain aspects of the writing process. If the stream data | |
| 650 | + provider has the chance to modify the dictionary, every piece of | |
| 651 | + code that gets stream data would have to be aware of this. This | |
| 652 | + would potentially include end user code. For example, any code | |
| 653 | + that called getDict() on a stream before installing a stream data | |
| 654 | + provider and expected that dictionary to be valid would | |
| 655 | + potentially be broken. As implemented right now, you must perform | |
| 656 | + any modifications on the dictionary in advance and provide | |
| 657 | + /Filter and /DecodeParms at the time you installed the stream | |
| 658 | + data provider. This means that some computations would have to be | |
| 659 | + done more than once, but for linearized files, stream data | |
| 660 | + providers are already called more than once. If the work done by | |
| 661 | + a stream data provider is especially expensive, it can implement | |
| 662 | + its own cache. | |
| 663 | + | |
| 664 | + The example examples/pdf-custom-filter.cc demonstrates the use of | |
| 665 | + custom stream filters. This includes a custom pipeline, a custom | |
| 666 | + stream filter, as well as modification of a stream's dictionary to | |
| 667 | + include creation of a new stream that is referenced from | |
| 668 | + /DecodeParms. | ... | ... |