Commit 5b6cf45f8365562d805547b6ca774f2ba2a0c90f

Authored by Jay Berkenbilt
1 parent eff911a9

Cleanup TODO

Showing 2 changed files with 216 additions and 224 deletions
README-maintainer
@@ -110,7 +110,16 @@ CODING RULES
   the shared library boundary.
 
 * Put private member variables in PointerHolder<Members> for all
-  public classes. Remember to use QPDF_DLL on ~Members().
+  public classes. Remember to use QPDF_DLL on ~Members(). Exception:
+  indirection through PointerHolder<Members> is expensive, so don't do
+  it for classes that are copied a lot, like QPDFObjectHandle and
+  QPDFObject.
+
+* Traversal of objects is expensive. It's worth adding some complexity
+  to avoid needless traversals of objects.
+
+* Avoid attaching too much metadata to objects and object handles
+  since those have to get copied around a lot.
 
 
 RELEASE PREPARATION
@@ -3,6 +3,23 @@ Next
 
 * High-level API/doc overhaul: #593
 
+Consider in the context of #593, possibly with a different
+implementation:
+
+* replace mode: --replace-object, --replace-stream-raw,
+  --replace-stream-filtered
+  * update first paragraph of QPDF JSON in the manual to mention this
+  * object numbers are not preserved by write, so object ID lookup
+    has to be done separately for each invocation
+  * you don't have to specify length for streams
+  * you only have to specify filtering for streams if providing raw data
+
+* See if this has been done or is trivial with C++11 local static
+  initializers: secure random number generation could be made more
+  efficient by using a local static to ensure a single random device
+  or crypt provider, as long as this can be done in a thread-safe
+  fashion.
+
 Documentation
 =============
 
@@ -21,6 +38,8 @@ Documentation
 Document-level work
 ===================
 
+* Ideas here may be superseded by #593.
+
 * QPDFPageCopier -- object for moving pages around within files or
   between files and performing various transformations
 
@@ -97,19 +116,6 @@ Fuzz Errors
 * Ignoring these:
   * Out of memory in dct: 35001, 32516
 
-GitHub Actions
-==============
-
-* Actions are triggered on push to main and master. When we eventually
-  rename master to main, make sure the reference to master is removed
-  from .github/workflows/*.yml.
-
-* At the time of migrating from Azure Pipelines to GitHub Actions
-  (2020-10), there was no standard test result publisher (to replace
-  the PublishTestResults@2 task). There are some third-party actions,
-  but I'd rather not depend on them. Keep an eye open for this coming
-  to GitHub Actions.
-
 External Libraries
 ==================
 
@@ -288,66 +294,6 @@ Page splitting/merging
 
 * Form fields: should be similar to outlines.
 
-Performance
-===========
-
-As described in https://github.com/qpdf/qpdf/issues/401, there was
-great performance degradation between qpdf 7.1.1 and 9.1.1. Doing a
-bisect between dac65a21fb4fa5f871e31c314280b75adde89a6c and
-release-qpdf-7.1.1, I found several commits that damaged performance.
-I fixed some of them to improve performance by about 70% (as measured
-by saying that old times were 170% of new times). The remaining
-commits that broke performance either can't be correct because they
-would re-introduce an old bug or aren't worth correcting because of
-the high value they offer relative to a relatively low penalty. For
-historical reference, here are the commits. The numbers are the time
-in seconds on the machine I happened to be using of splitting the
-first 100 pages of PDF32000_2008.pdf 20 times and taking an average
-duration.
-
-Commits that broke performance:
-
-* d0e99f195a987c483bbb6c5449cf39bee34e08a1 -- object description and
-  context: 0.39 -> 0.45
-* a01359189b32c60c2d55b039f7aefd6c3ce0ebde (minus 313ba08) -- fix
-  dangling references: 0.55 -> 0.6
-* e5f504b6c5dc34337cc0b316b4a7b1fca7e614b1 -- sparse array: 0.6 -> 0.62
-
-Other intermediate steps that were previously fixed:
-
-* 313ba081265f69ac9a0324f9fe87087c72918191 -- copy outlines into
-  split: 0.55 -> 4.0
-* a01359189b32c60c2d55b039f7aefd6c3ce0ebde -- fix dangling references:
-  4.0 -> 9.0
-
-This commit fixed the awful problem introduced in 313ba081:
-
-* a5a016cdd26a8e5c99e5f019bc30d1bdf6c050a2 -- revert outline
-  preservation: 9.0 -> 0.6
-
-Note that the fix dangling references commit had a much worse impact
-prior to removing the outline preservation, so I also measured its
-impact in isolation.
-
-A few important lessons:
-
-* Indirection through PointerHolder<Members> is expensive, and should
-  not be used for things that are created and destroyed frequently
-  such as QPDFObjectHandle and QPDFObject.
-* Traversal of objects is expensive and should be avoided where
-  possible.
-
-Future ideas:
-
-* Look at places in the code where object traversal is being done and,
-  where possible, try to avoid it entirely or at least avoid ever
-  traversing the same objects multiple times.
-* Avoid attaching too much metadata to objects and object handles
-  since those have to get copied around a lot.
-
-Also, it turns out that PointerHolder is more performant than
-std::shared_ptr.
-
 Analytics
 =========
 
@@ -406,9 +352,6 @@ I find it useful to make reference to them in this list.
   dictionary may need to be changed -- create test cases with lots of
   duplicated/overlapping keys.
 
-* Investigate whether there is a way to automate the memory checker
-  tests for Windows.
-
 * Part of closed_file_input_source.cc is disabled on Windows because
   of odd failures. It might be worth investigating so we can fully
   exercise this in the test suite. That said, ClosedFileInputSource
@@ -435,14 +378,6 @@ I find it useful to make reference to them in this list.
   * set value from CLI? Specify title, and provide way to
     disambiguate, probably by giving objgen of field
 
-* replace mode: --replace-object, --replace-stream-raw,
-  --replace-stream-filtered
-  * update first paragraph of QPDF JSON in the manual to mention this
-  * object numbers are not preserved by write, so object ID lookup
-    has to be done separately for each invocation
-  * you don't have to specify length for streams
-  * you only have to specify filtering for streams if providing raw data
-
 * Pl_TIFFPredictor is pretty slow.
 
 * Support for handling file names with Unicode characters in Windows
@@ -486,46 +421,6 @@ I find it useful to make reference to them in this list.
 
 * Look at ~/Q/pdf-collection/forms-from-appian/
 
-* Consider adding "uninstall" target to makefile. It should only
-  uninstall what it installed, which means that you must run
-  uninstall from the version you ran install with. It would only be
-  supported for the toolchains that support the install target
-  (libtool).
-
-* Provide support in QPDFWriter for writing incremental updates.
-  Provide support in qpdf for preserving incremental updates. The
-  goal should be that QDF mode should be fully functional for files
-  with incremental updates including fix_qdf.
-
-  Note that there's nothing that says an indirect object in one
-  update can't refer to an object that doesn't appear until a later
-  update. This means that QPDF has to treat indirect null objects
-  differently from how it does now. QPDF drops indirect null objects
-  that appear as members of arrays or dictionaries. For arrays, it's
-  handled in QPDFWriter where we make indirect nulls direct. This is
-  in a single if block, and nothing else in the code cares about it.
-  We could just remove that if block and not break anything except a
-  few test cases that exercise the current behavior. For
-  dictionaries, it's more complicated. In this case,
-  QPDF_Dictionary::getKeys() ignores all keys with null values, and
-  hasKey() returns false for keys that have null values. We would
-  probably want to make QPDF_Dictionary able to handle the special
-  case of keys that are indirect nulls and basically never have it
-  drop any keys that are indirect objects.
-
-  If we make a change to have qpdf preserve indirect references to
-  null objects, we have to note this in ChangeLog and in the release
-  notes since this will change output files. We did this before when
-  we stopped flattening scalar references, so this is probably not a
-  big deal. We also have to make sure that the testing for this
-  handles non-trivial cases of the targets of indirect nulls being
-  replaced by real objects in an update. I'm not sure how this plays
-  with linearization, if at all. For cases where incremental updates
-  are not being preserved as incremental updates and where the data
-  is being folded in (as is always the case with qpdf now), none of
-  this should make any difference in the actual semantics of the
-  files.
-
 * When decrypting files with /R=6, hash_V5 is called more than once
   with the same inputs. Caching the results or refactoring to reduce
   the number of identical calls could improve performance for
@@ -536,12 +431,6 @@ I find it useful to make reference to them in this list.
   and replace the /Pages key of the root dictionary with the new
   tree.
 
-* Secure random number generation could be made more efficient by
-  using a local static to ensure a single random device or crypt
-  provider as long as this can be done in a thread-safe fashion. In
-  the initial implementation, this is being skipped to avoid having
-  to add any dependencies on threading libraries.
-
 * Study what's required to support savable forms that can be saved by
   Adobe Reader. Does this require actually signing the document with
   an Adobe private key? Search for "Digital signatures" in the PDF
@@ -551,19 +440,6 @@ I find it useful to make reference to them in this list.
   implemented, update the docs on crypto providers, which mention
   that this may happen in the future.
 
-* Provide APIs for embedded files. See *attachments*.pdf in test
-  suite. The private method findAttachmentStreams finds at least
-  cases for modern versions of Adobe Reader (>= 1.7, maybe earlier).
-  PDF Reference 1.7 section 3.10, "File Specifications", discusses
-  this.
-
-  A sourceforge user asks if qpdf can handle extracting and embedded
-  resources and references these tools, which may be useful as a
-  reference.
-
-  http://multivalent.sourceforge.net/Tools/pdf/Extract.html
-  http://multivalent.sourceforge.net/Tools/pdf/Embed.html
-
 * Qpdf does not honor /EFF when adding new file attachments. When it
   encrypts, it never generates streams with explicit crypt filters.
   Prior to 10.2, there was an incorrect attempt to treat /EFF as a
@@ -572,15 +448,6 @@ I find it useful to make reference to them in this list.
   writers to obey this when adding new attachments. Qpdf is not a
   conforming writer in that respect.
 
-* The second xref stream for linearized files has to be padded only
-  because we need file_size as computed in pass 1 to be accurate. If
-  we were not allowing writing to a pipe, we could seek back to the
-  beginning and fill in the value of /L in the linearization
-  dictionary as an optimization to alleviate the need for this
-  padding. Doing so would require us to pad the /L value
-  individually and also to save the file descriptor and determine
-  whether it's seekable. This is probably not worth bothering with.
-
 * The whole xref handling code in the QPDF object allows the same
   object with more than one generation to coexist, but a lot of logic
   assumes this isn't the case. Anything that creates mappings only
@@ -593,10 +460,6 @@ I find it useful to make reference to them in this list.
   viewing software silently ignores objects of this type, so this is
   probably not a big deal.
 
-* Based on an idea suggested by user "Atom Smasher", consider
-  providing some mechanism to recover earlier versions of a file
-  embedded prior to appended sections.
-
 * From a suggestion in bug 3152169, consider having an option to
   re-encode inline images with an ASCII encoding.
 
@@ -616,70 +479,190 @@ I find it useful to make reference to them in this list.
   fonts or font metrics, see email from Tobias with Message-ID
   <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
 
-* Consider creating a sanitizer to make it easier for people to send
-  broken files. Now that we have json mode, this is probably no
-  longer worth doing. Here is the previous idea, possibly implemented
-  by making it possible to run the lexer (tokenizer) over a whole
-  file. Make it possible to replace all strings in a file lexically
-  even on badly broken files. Ideally this should work files that are
-  lacking xref, have broken links, duplicated dictionary keys, syntax
-  errors, etc., and ideally it should work with encrypted files if
-  possible. This should go through the streams and strings and
-  replace them with fixed or random characters, preferably, but not
-  necessarily, in a manner that works with fonts. One possibility
-  would be to detect whether a string contains characters with normal
-  encoding, and if so, use 0x41. If the string uses character maps,
-  use 0x01. The output should otherwise be unrelated to the input.
-  This could be built after the filtering and tokenizer rewrite and
-  should be done in a manner that takes advantage of the other
-  lexical features. This sanitizer should also clear metadata and
-  replace images. If I ever do this, the file from issue #494 would
-  be a great one to look at.
-
-* Here are some notes about having stream data providers modify
-  stream dictionaries. I had wanted to add this functionality to make
-  it more efficient to create stream data providers that may
-  dynamically decide what kind of filters to use and that may end up
-  modifying the dictionary conditionally depending on the original
-  stream data. Ultimately I decided not to implement this feature.
-  This paragraph describes why.
-
-  * When writing, the way objects are placed into the queue for
-    writing strongly precludes creation of any new indirect objects,
-    or even changing which indirect objects are referenced from which
-    other objects, because we sometimes write as we are traversing
-    and enqueuing objects. For non-linearized files, there is a risk
-    that an indirect object that used to be referenced would no
-    longer be referenced, and whether it was already written to the
-    output file would be based on an accident of where it was
-    encountered when traversing the object structure. For linearized
-    files, the situation is considerably worse. We decide which
-    section of the file to write an object to based on a mapping of
-    which objects are used by which other objects. Changing this
-    mapping could cause an object to appear in the wrong section, to
-    be written even though it is unreferenced, or to be entirely
-    omitted since, during linearization, we don't enqueue new objects
-    as we traverse for writing.
-
-  * There are several places in QPDFWriter that query a stream's
-    dictionary in order to prepare for writing or to make decisions
-    about certain aspects of the writing process. If the stream data
-    provider has the chance to modify the dictionary, every piece of
-    code that gets stream data would have to be aware of this. This
-    would potentially include end user code. For example, any code
-    that called getDict() on a stream before installing a stream data
-    provider and expected that dictionary to be valid would
-    potentially be broken. As implemented right now, you must perform
-    any modifications on the dictionary in advance and provided
-    /Filter and /DecodeParms at the time you installed the stream
-    data provider. This means that some computations would have to be
-    done more than once, but for linearized files, stream data
-    providers are already called more than once. If the work done by
-    a stream data provider is especially expensive, it can implement
-    its own cache.
-
-  The example examples/pdf-custom-filter.cc demonstrates the use of
-  custom stream filters. This includes a custom pipeline, a custom
-  stream filter, as well as modification of a stream's dictionary to
-  include creation of a new stream that is referenced from
-  /DecodeParms.
+* Look at places in the code where object traversal is being done and,
+  where possible, try to avoid it entirely or at least avoid ever
+  traversing the same objects multiple times.
+
+----------------------------------------------------------------------
+
+HISTORICAL NOTES
+
+Performance
+===========
+
+As described in https://github.com/qpdf/qpdf/issues/401, there was
+great performance degradation between qpdf 7.1.1 and 9.1.1. Doing a
+bisect between dac65a21fb4fa5f871e31c314280b75adde89a6c and
+release-qpdf-7.1.1, I found several commits that damaged performance.
+I fixed some of them to improve performance by about 70% (as measured
+by old times being about 170% of new times). The remaining commits
+that broke performance either can't be corrected because doing so
+would re-introduce an old bug or aren't worth correcting because of
+the high value they offer relative to a relatively low penalty. For
+historical reference, here are the commits. The numbers are the time
+in seconds, on the machine I happened to be using, to split the
+first 100 pages of PDF32000_2008.pdf, averaged over 20 runs.
+
+Commits that broke performance:
+
+* d0e99f195a987c483bbb6c5449cf39bee34e08a1 -- object description and
+  context: 0.39 -> 0.45
+* a01359189b32c60c2d55b039f7aefd6c3ce0ebde (minus 313ba08) -- fix
+  dangling references: 0.55 -> 0.6
+* e5f504b6c5dc34337cc0b316b4a7b1fca7e614b1 -- sparse array: 0.6 -> 0.62
+
+Other intermediate steps that were previously fixed:
+
+* 313ba081265f69ac9a0324f9fe87087c72918191 -- copy outlines into
+  split: 0.55 -> 4.0
+* a01359189b32c60c2d55b039f7aefd6c3ce0ebde -- fix dangling references:
+  4.0 -> 9.0
+
+This commit fixed the awful problem introduced in 313ba081:
+
+* a5a016cdd26a8e5c99e5f019bc30d1bdf6c050a2 -- revert outline
+  preservation: 9.0 -> 0.6
+
+Note that the fix dangling references commit had a much worse impact
+prior to removing the outline preservation, so I also measured its
+impact in isolation.
+
+A few important lessons (now in README-maintainer):
+
+* Indirection through PointerHolder<Members> is expensive, and should
+  not be used for things that are created and destroyed frequently
+  such as QPDFObjectHandle and QPDFObject.
+* Traversal of objects is expensive and should be avoided where
+  possible.
+
+Also, it turns out that PointerHolder is more performant than
+std::shared_ptr.
+
+
+Rejected Ideas
+==============
+
+* Investigate whether there is a way to automate the memory checker
+  tests for Windows.
+
+* Consider adding an "uninstall" target to the makefile. It should
+  only uninstall what it installed, which means that you must run
+  uninstall from the version you ran install with. It would only be
+  supported for the toolchains that support the install target
+  (libtool).
+
+* Provide support in QPDFWriter for writing incremental updates.
+  Provide support in qpdf for preserving incremental updates. The
+  goal should be that QDF mode is fully functional for files with
+  incremental updates, including fix_qdf.
+
+  Note that there's nothing that says an indirect object in one
+  update can't refer to an object that doesn't appear until a later
+  update. This means that QPDF has to treat indirect null objects
+  differently from how it does now. QPDF drops indirect null objects
+  that appear as members of arrays or dictionaries. For arrays, it's
+  handled in QPDFWriter where we make indirect nulls direct. This is
+  in a single if block, and nothing else in the code cares about it.
+  We could just remove that if block and not break anything except a
+  few test cases that exercise the current behavior. For
+  dictionaries, it's more complicated. In this case,
+  QPDF_Dictionary::getKeys() ignores all keys with null values, and
+  hasKey() returns false for keys that have null values. We would
+  probably want to make QPDF_Dictionary able to handle the special
+  case of keys that are indirect nulls and basically never have it
+  drop any keys that are indirect objects.
+
+  If we make a change to have qpdf preserve indirect references to
+  null objects, we have to note this in ChangeLog and in the release
+  notes since this will change output files. We did this before when
+  we stopped flattening scalar references, so this is probably not a
+  big deal. We also have to make sure that the testing for this
+  handles non-trivial cases of the targets of indirect nulls being
+  replaced by real objects in an update. I'm not sure how this plays
+  with linearization, if at all. For cases where incremental updates
+  are not being preserved as incremental updates and where the data
+  is being folded in (as is always the case with qpdf now), none of
+  this should make any difference in the actual semantics of the
+  files.
+
+* The second xref stream for linearized files has to be padded only
+  because we need file_size as computed in pass 1 to be accurate. If
+  we were not allowing writing to a pipe, we could seek back to the
+  beginning and fill in the value of /L in the linearization
+  dictionary as an optimization to alleviate the need for this
+  padding. Doing so would require us to pad the /L value
+  individually and also to save the file descriptor and determine
+  whether it's seekable. This is probably not worth bothering with.
+
+* Based on an idea suggested by user "Atom Smasher", consider
+  providing some mechanism to recover earlier versions of a file
+  embedded prior to appended sections.
+
+* Consider creating a sanitizer to make it easier for people to send
+  broken files. Now that we have json mode, this is probably no
+  longer worth doing. Here is the previous idea, possibly implemented
+  by making it possible to run the lexer (tokenizer) over a whole
+  file. Make it possible to replace all strings in a file lexically
+  even on badly broken files. Ideally this should work on files that
+  are lacking xref, have broken links, duplicated dictionary keys,
+  syntax errors, etc., and ideally it should work with encrypted
+  files if possible. This should go through the streams and strings
+  and replace them with fixed or random characters, preferably, but
+  not necessarily, in a manner that works with fonts. One possibility
+  would be to detect whether a string contains characters with normal
+  encoding, and if so, use 0x41. If the string uses character maps,
+  use 0x01. The output should otherwise be unrelated to the input.
+  This could be built after the filtering and tokenizer rewrite and
+  should be done in a manner that takes advantage of the other
+  lexical features. This sanitizer should also clear metadata and
+  replace images. If I ever do this, the file from issue #494 would
+  be a great one to look at.
+
+* Here are some notes about having stream data providers modify
+  stream dictionaries. I had wanted to add this functionality to make
+  it more efficient to create stream data providers that may
+  dynamically decide what kind of filters to use and that may end up
+  modifying the dictionary conditionally depending on the original
+  stream data. Ultimately I decided not to implement this feature.
+  The notes below describe why.
+
+  * When writing, the way objects are placed into the queue for
+    writing strongly precludes creation of any new indirect objects,
+    or even changing which indirect objects are referenced from which
+    other objects, because we sometimes write as we are traversing
+    and enqueuing objects. For non-linearized files, there is a risk
+    that an indirect object that used to be referenced would no
+    longer be referenced, and whether it was already written to the
+    output file would be based on an accident of where it was
+    encountered when traversing the object structure. For linearized
+    files, the situation is considerably worse. We decide which
+    section of the file to write an object to based on a mapping of
+    which objects are used by which other objects. Changing this
+    mapping could cause an object to appear in the wrong section, to
+    be written even though it is unreferenced, or to be entirely
+    omitted since, during linearization, we don't enqueue new objects
+    as we traverse for writing.
+
+  * There are several places in QPDFWriter that query a stream's
+    dictionary in order to prepare for writing or to make decisions
+    about certain aspects of the writing process. If the stream data
+    provider had the chance to modify the dictionary, every piece of
+    code that gets stream data would have to be aware of this. This
+    would potentially include end user code. For example, any code
+    that called getDict() on a stream before installing a stream data
+    provider and expected that dictionary to be valid would
+    potentially be broken. As implemented right now, you must perform
+    any modifications on the dictionary in advance and provide
+    /Filter and /DecodeParms at the time you install the stream
+    data provider. This means that some computations would have to be
+    done more than once, but for linearized files, stream data
+    providers are already called more than once. If the work done by
+    a stream data provider is especially expensive, it can implement
+    its own cache.
+
+  The example examples/pdf-custom-filter.cc demonstrates the use of
+  custom stream filters. This includes a custom pipeline and a custom
+  stream filter, as well as modification of a stream's dictionary,
+  including creation of a new stream that is referenced from
+  /DecodeParms.