Commit 48dfae6443943512739bee4d5a488592c89f3c1d

Authored by Jay Berkenbilt
1 parent 433be371

TODO: rescope some items

Showing 1 changed file with 198 additions and 189 deletions
@@ -21,31 +21,15 @@ Pending changes:
21 21 appimage build specifically is setting the runpath, which is
22 22 actually desirable in this case. Make sure to understand and
23 23 document this. Maybe add a check for it in the build.
24 -* Decide what to do about #664 (get*Box)  
25 -* Add an option --ignore-encryption to ignore encryption information  
26 - and treat encrypted files as if they weren't encrypted. This should  
27 - make it possible to solve #598 (--show-encryption without a  
28 - password). We'll need to make sure we don't try to filter any  
29 - streams in this mode. Ideally we should be able to combine this with  
30 - --json so we can look at the raw encrypted strings and streams if we  
31 - want to, though be sure to document that the resulting JSON won't be  
32 - convertible back to a valid PDF. Since providing the password may  
33 - reveal additional details, --show-encryption could potentially retry  
34 - with this option if the first time doesn't work. Then, with the file  
35 - open, we can read the encryption dictionary normally.  
36 -* In libtests, separate executables that need the object library  
37 - from those that strictly use public API. Move as many of the test  
38 - drivers from the qpdf directory into the latter category as long  
39 - as doing so isn't too troublesome from a coverage standpoint.  
40 -* Consider adding fuzzer code for JSON  
41 -* Consider generating a non-flat pages tree before creating output to  
42 - better handle files with lots of pages. If there are more than 256  
43 - pages, add a second layer with the second layer nodes having no more  
44 - than 256 nodes and being as evenly sized as possible. Don't worry  
45 - about the case of more than 65,536 pages. If the top node has more  
46 - than 256 children, we'll live with it.  
47 24
48 -Parent pointer idea:
  25 +Soon: Break ground on "Document-level work"
  26 +
  27 +Fix Multiple Direct Object Owner Issue
  28 +======================================
  29 +
  30 +These are some ideas I've had, but I'm parking them until I fully
  31 +understand m-holger's proposal to split QPDFObject into QPDFObject and
  32 +QPDFValue.
49 33
50 34 * Add std::weak_ptr<QPDFObject> parent to QPDFObject. When adding a
51 35 direct object to an array or dictionary, set its parent. When
@@ -65,8 +49,6 @@ Note that arrays and dictionaries still need to contain
65 49 QPDFObjectHandle because of indirect objects. This only pertains to
66 50 direct objects, which are always "resolved" in QPDFObjectHandle.
67 51
68 -Soon: Break ground on "Document-level work"  
69 -  
70 52 Possible future JSON enhancements
71 53 =================================
72 54
@@ -376,169 +358,196 @@ directory or that are otherwise not publicly accessible. This includes
376 358 things sent to me by email that are specifically not public. Even so,
377 359 I find it useful to make reference to them in this list.
378 360
379 - * Look at https://bestpractices.coreinfrastructure.org/en  
380 -  
381 - * Rework tests so that nothing is written into the source directory.  
382 - Ideally then the entire build could be done with a read-only  
383 - source tree.  
384 -  
385 - * Large file tests fail with linux32 before and after cmake. This was  
386 - first noticed after 10.6.3. I don't think it's worth fixing.  
387 -  
388 - * Consider updating the fuzzer with code that exercises  
389 - copyAnnotations, file attachments, and name and number trees. Check  
390 - fuzzer coverage.  
391 -  
392 - * Add code for creation of a file attachment annotation. It should  
393 - also be possible to create a widget annotation and a form field.  
394 - Update the pdf-attach-file.cc example with new APIs when ready.  
395 -  
396 - * Flattening of form XObjects seems like something that would be  
397 - useful in the library. We are seeing more cases of completely valid  
398 - PDF files with form XObjects that cause problems in other software.  
399 - Flattening of form XObjects could be a useful way to work around  
400 - those issues or to prepare files for additional processing, making  
401 - it possible for users of the qpdf library to not be concerned about  
402 - form XObjects. This could be done recursively; i.e., we could have a  
403 - method to embed a form XObject into whatever contains it, whether  
404 - that is a form XObject or a page. This would require more  
405 - significant interpretation of the content stream. We would need a  
406 - test file in which the placement of the form XObject has to be in  
407 - the right place, e.g., the form XObject partially obscures earlier  
408 - code and is partially obscured by later code. Keys in the resource  
409 - dictionary may need to be changed -- create test cases with lots of  
410 - duplicated/overlapping keys.  
411 -  
412 - * Part of closed_file_input_source.cc is disabled on Windows because  
413 - of odd failures. It might be worth investigating so we can fully  
414 - exercise this in the test suite. That said, ClosedFileInputSource  
415 - is exercised elsewhere in qpdf's test suite, so this is not that  
416 - pressing.  
417 -  
418 - * If possible, consider adding CCITT3, CCITT4, or any other easy  
419 - filters. For some reference code that we probably can't use but may  
420 - be handy anyway, see  
421 - http://partners.adobe.com/public/developer/ps/sdk/index_archive.html  
422 -  
423 - * If possible, support the following types of broken files:  
424 -  
425 - - Files that have no whitespace token after "endobj" such that  
426 - endobj collides with the start of the next object  
427 -  
428 - - See ../misc/broken-files  
429 -  
430 - - See ../misc/bad-files-issue-476. This directory contains a  
431 - snapshot of the google doc and linked PDF files from issue #476.  
432 - Please see the issue for details.  
433 -  
434 - * Additional form features  
435 - * set value from CLI? Specify title, and provide way to  
436 - disambiguate, probably by giving objgen of field  
437 -  
438 - * Pl_TIFFPredictor is pretty slow.  
439 -  
440 - * Support for handling file names with Unicode characters in Windows  
441 - is incomplete. qpdf seems to support them okay from a functionality  
442 - standpoint, and the right thing happens if you pass in UTF-8  
443 - encoded filenames to QPDF library routines in Windows (they are  
444 - converted internally to wchar_t*), but file names are encoded in  
445 - UTF-8 on output, which doesn't produce nice error messages or  
446 - output on Windows in some cases.  
447 -  
448 - * If we ever wanted to do anything more with character encoding, see  
449 - ../misc/character-encoding/, which includes machine-readable dump  
450 - of table D.2 in the ISO-32000 PDF spec. This shows the mapping  
451 - between Unicode, StandardEncoding, WinAnsiEncoding,  
452 - MacRomanEncoding, and PDFDocEncoding.  
453 -  
454 - * Some test cases on bad files fail because qpdf is unable to find  
455 - the root dictionary when it fails to read the trailer. Recovery  
456 - could find the root dictionary and even the info dictionary in  
457 - other ways. In particular, issue-202.pdf can be opened by evince,  
458 - and there's no real reason that qpdf couldn't be made to be able to  
459 - recover that file as well.  
460 -  
461 - * Audit every place where qpdf allocates memory to see whether there  
462 - are cases where malicious inputs could cause qpdf to attempt to  
463 - grab very large amounts of memory. Certainly there are cases like  
464 - this, such as if a very highly compressed, very large image stream  
465 - is requested in a buffer. Hopefully normal input to output  
466 - filtering doesn't ever try to do this. QPDFWriter should be checked  
467 - carefully too. See also bugs/private/from-email-663916/  
468 -  
469 - * Interactive form modification:  
470 - https://github.com/qpdf/qpdf/issues/213 contains a good discussion  
471 - of some ideas for adding methods to modify annotations and form  
472 - fields if we want to make it easier to support modifications to  
473 - interactive forms. Some of the ideas have been implemented, and  
474 - some of them probably never will be implemented, but it's worth a  
475 - read if there is an intention to work on this. In the issue, search  
476 - for "Regarding write functionality", and read that comment and the  
477 - responses to it.  
478 -  
479 - * Look at ~/Q/pdf-collection/forms-from-appian/  
480 -  
481 - * When decrypting files with /R=6, hash_V5 is called more than once  
482 - with the same inputs. Caching the results or refactoring to reduce  
483 - the number of identical calls could improve performance for  
484 - workloads that involve processing large numbers of small files.  
485 -  
486 - * Consider adding a method to balance the pages tree. It would call  
487 - pushInheritedAttributesToPage, construct a pages tree from scratch,  
488 - and replace the /Pages key of the root dictionary with the new  
489 - tree.  
490 -  
491 - * Study what's required to support savable forms that can be saved by  
492 - Adobe Reader. Does this require actually signing the document with  
493 - an Adobe private key? Search for "Digital signatures" in the PDF  
494 - spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which  
495 - came from Adobe's example site. See also  
496 - ../misc/digital-sign-from-trueroad/. If digital signatures are  
497 - implemented, update the docs on crypto providers, which mention  
498 - that this may happen in the future.  
499 -  
500 - * Qpdf does not honor /EFF when adding new file attachments. When it  
501 - encrypts, it never generates streams with explicit crypt filters.  
502 - Prior to 10.2, there was an incorrect attempt to treat /EFF as a  
503 - default value for decrypting file attachment streams, but it is not  
504 - supposed to mean that. Instead, it is intended for conforming  
505 - writers to obey this when adding new attachments. Qpdf is not a  
506 - conforming writer in that respect.  
507 -  
508 - * The whole xref handling code in the QPDF object allows the same  
509 - object with more than one generation to coexist, but a lot of logic  
510 - assumes this isn't the case. Anything that creates mappings only  
511 - with the object number and not the generation is this way,  
512 - including most of the interaction between QPDFWriter and QPDF. If  
513 - we wanted to allow the same object with more than one generation to  
514 - coexist, which I'm not sure is allowed, we could fix this by  
515 - changing xref_table. Alternatively, we could detect and disallow  
516 - that case. In fact, it appears that Adobe reader and other PDF  
517 - viewing software silently ignores objects of this type, so this is  
518 - probably not a big deal.  
519 -  
520 - * From a suggestion in bug 3152169, consider having an option to  
521 - re-encode inline images with an ASCII encoding.  
522 -  
523 - * From github issue 2, provide more in-depth output for examining  
524 - hint stream contents. Consider adding an option to provide a  
525 - human-readable dump of linearization hint tables. This should  
526 - include improving the 'overflow reading bit stream' message as  
527 - reported in issue #2. There are multiple calls to stopOnError in  
528 - the linearization checking code. Ideally, these should not  
529 - terminate checking. It would require re-acquiring an understanding  
530 - of all that code to make the checks more robust. In particular,  
531 - it's hard to look at the code and quickly determine what is a true  
532 - logic error and what could happen because of malformed user input.  
533 - See also ../misc/linearization-errors.  
534 -  
535 - * If I ever decide to make appearance stream-generation aware of  
536 - fonts or font metrics, see email from Tobias with Message-ID  
537 - <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.  
538 -  
539 - * Look at places in the code where object traversal is being done and,  
540 - where possible, try to avoid it entirely or at least avoid ever  
541 - traversing the same objects multiple times.
  361 +* Add an option --ignore-encryption to ignore encryption information
  362 + and treat encrypted files as if they weren't encrypted. This should
  363 + make it possible to solve #598 (--show-encryption without a
  364 + password). We'll need to make sure we don't try to filter any
  365 + streams in this mode. Ideally we should be able to combine this with
  366 + --json so we can look at the raw encrypted strings and streams if we
  367 + want to, though be sure to document that the resulting JSON won't be
  368 + convertible back to a valid PDF. Since providing the password may
  369 + reveal additional details, --show-encryption could potentially retry
  370 + with this option if the first time doesn't work. Then, with the file
  371 + open, we can read the encryption dictionary normally.
  372 +
  373 +* In libtests, separate executables that need the object library
  374 + from those that strictly use public API. Move as many of the test
  375 + drivers from the qpdf directory into the latter category as long
  376 + as doing so isn't too troublesome from a coverage standpoint.
  377 +
  378 +* Consider generating a non-flat pages tree before creating output to
  379 + better handle files with lots of pages. If there are more than 256
  380 + pages, add a second layer with the second layer nodes having no more
  381 + than 256 nodes and being as evenly sized as possible. Don't worry
  382 + about the case of more than 65,536 pages. If the top node has more
  383 + than 256 children, we'll live with it. This is only safe if all
  384 + intermediate page nodes have only /Kids, /Parent, /Type, and /Count.
  385 +
  386 +* Look at https://bestpractices.coreinfrastructure.org/en
  387 +
  388 +* Consider adding fuzzer code for JSON
  389 +
  390 +* Rework tests so that nothing is written into the source directory.
  391 + Ideally then the entire build could be done with a read-only
  392 + source tree.
  393 +
  394 +* Large file tests fail with linux32 before and after cmake. This was
  395 + first noticed after 10.6.3. I don't think it's worth fixing.
  396 +
  397 +* Consider updating the fuzzer with code that exercises
  398 + copyAnnotations, file attachments, and name and number trees. Check
  399 + fuzzer coverage.
  400 +
  401 +* Add code for creation of a file attachment annotation. It should
  402 + also be possible to create a widget annotation and a form field.
  403 + Update the pdf-attach-file.cc example with new APIs when ready.
  404 +
  405 +* Flattening of form XObjects seems like something that would be
  406 + useful in the library. We are seeing more cases of completely valid
  407 + PDF files with form XObjects that cause problems in other software.
  408 + Flattening of form XObjects could be a useful way to work around
  409 + those issues or to prepare files for additional processing, making
  410 + it possible for users of the qpdf library to not be concerned about
  411 + form XObjects. This could be done recursively; i.e., we could have a
  412 + method to embed a form XObject into whatever contains it, whether
  413 + that is a form XObject or a page. This would require more
  414 + significant interpretation of the content stream. We would need a
  415 + test file in which the placement of the form XObject has to be in
  416 + the right place, e.g., the form XObject partially obscures earlier
  417 + code and is partially obscured by later code. Keys in the resource
  418 + dictionary may need to be changed -- create test cases with lots of
  419 + duplicated/overlapping keys.
  420 +
  421 +* Part of closed_file_input_source.cc is disabled on Windows because
  422 + of odd failures. It might be worth investigating so we can fully
  423 + exercise this in the test suite. That said, ClosedFileInputSource
  424 + is exercised elsewhere in qpdf's test suite, so this is not that
  425 + pressing.
  426 +
  427 +* If possible, consider adding CCITT3, CCITT4, or any other easy
  428 + filters. For some reference code that we probably can't use but may
  429 + be handy anyway, see
  430 + http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
  431 +
  432 +* If possible, support the following types of broken files:
  433 +
  434 + - Files that have no whitespace token after "endobj" such that
  435 + endobj collides with the start of the next object
  436 +
  437 + - See ../misc/broken-files
  438 +
  439 + - See ../misc/bad-files-issue-476. This directory contains a
  440 + snapshot of the google doc and linked PDF files from issue #476.
  441 + Please see the issue for details.
  442 +
  443 +* Additional form features
  444 + * set value from CLI? Specify title, and provide way to
  445 + disambiguate, probably by giving objgen of field
  446 +
  447 +* Pl_TIFFPredictor is pretty slow.
  448 +
  449 +* Support for handling file names with Unicode characters in Windows
  450 + is incomplete. qpdf seems to support them okay from a functionality
  451 + standpoint, and the right thing happens if you pass in UTF-8
  452 + encoded filenames to QPDF library routines in Windows (they are
  453 + converted internally to wchar_t*), but file names are encoded in
  454 + UTF-8 on output, which doesn't produce nice error messages or
  455 + output on Windows in some cases.
  456 +
  457 +* If we ever wanted to do anything more with character encoding, see
  458 + ../misc/character-encoding/, which includes machine-readable dump
  459 + of table D.2 in the ISO-32000 PDF spec. This shows the mapping
  460 + between Unicode, StandardEncoding, WinAnsiEncoding,
  461 + MacRomanEncoding, and PDFDocEncoding.
  462 +
  463 +* Some test cases on bad files fail because qpdf is unable to find
  464 + the root dictionary when it fails to read the trailer. Recovery
  465 + could find the root dictionary and even the info dictionary in
  466 + other ways. In particular, issue-202.pdf can be opened by evince,
  467 + and there's no real reason that qpdf couldn't be made to be able to
  468 + recover that file as well.
  469 +
  470 +* Audit every place where qpdf allocates memory to see whether there
  471 + are cases where malicious inputs could cause qpdf to attempt to
  472 + grab very large amounts of memory. Certainly there are cases like
  473 + this, such as if a very highly compressed, very large image stream
  474 + is requested in a buffer. Hopefully normal input to output
  475 + filtering doesn't ever try to do this. QPDFWriter should be checked
  476 + carefully too. See also bugs/private/from-email-663916/
  477 +
  478 +* Interactive form modification:
  479 + https://github.com/qpdf/qpdf/issues/213 contains a good discussion
  480 + of some ideas for adding methods to modify annotations and form
  481 + fields if we want to make it easier to support modifications to
  482 + interactive forms. Some of the ideas have been implemented, and
  483 + some of them probably never will be implemented, but it's worth a
  484 + read if there is an intention to work on this. In the issue, search
  485 + for "Regarding write functionality", and read that comment and the
  486 + responses to it.
  487 +
  488 +* Look at ~/Q/pdf-collection/forms-from-appian/
  489 +
  490 +* When decrypting files with /R=6, hash_V5 is called more than once
  491 + with the same inputs. Caching the results or refactoring to reduce
  492 + the number of identical calls could improve performance for
  493 + workloads that involve processing large numbers of small files.
  494 +
  495 +* Consider adding a method to balance the pages tree. It would call
  496 + pushInheritedAttributesToPage, construct a pages tree from scratch,
  497 + and replace the /Pages key of the root dictionary with the new
  498 + tree.
  499 +
  500 +* Study what's required to support savable forms that can be saved by
  501 + Adobe Reader. Does this require actually signing the document with
  502 + an Adobe private key? Search for "Digital signatures" in the PDF
  503 + spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
  504 + came from Adobe's example site. See also
  505 + ../misc/digital-sign-from-trueroad/. If digital signatures are
  506 + implemented, update the docs on crypto providers, which mention
  507 + that this may happen in the future.
  508 +
  509 +* Qpdf does not honor /EFF when adding new file attachments. When it
  510 + encrypts, it never generates streams with explicit crypt filters.
  511 + Prior to 10.2, there was an incorrect attempt to treat /EFF as a
  512 + default value for decrypting file attachment streams, but it is not
  513 + supposed to mean that. Instead, it is intended for conforming
  514 + writers to obey this when adding new attachments. Qpdf is not a
  515 + conforming writer in that respect.
  516 +
  517 +* The whole xref handling code in the QPDF object allows the same
  518 + object with more than one generation to coexist, but a lot of logic
  519 + assumes this isn't the case. Anything that creates mappings only
  520 + with the object number and not the generation is this way,
  521 + including most of the interaction between QPDFWriter and QPDF. If
  522 + we wanted to allow the same object with more than one generation to
  523 + coexist, which I'm not sure is allowed, we could fix this by
  524 + changing xref_table. Alternatively, we could detect and disallow
  525 + that case. In fact, it appears that Adobe reader and other PDF
  526 + viewing software silently ignores objects of this type, so this is
  527 + probably not a big deal.
  528 +
  529 +* From a suggestion in bug 3152169, consider having an option to
  530 + re-encode inline images with an ASCII encoding.
  531 +
  532 +* From github issue 2, provide more in-depth output for examining
  533 + hint stream contents. Consider adding an option to provide a
  534 + human-readable dump of linearization hint tables. This should
  535 + include improving the 'overflow reading bit stream' message as
  536 + reported in issue #2. There are multiple calls to stopOnError in
  537 + the linearization checking code. Ideally, these should not
  538 + terminate checking. It would require re-acquiring an understanding
  539 + of all that code to make the checks more robust. In particular,
  540 + it's hard to look at the code and quickly determine what is a true
  541 + logic error and what could happen because of malformed user input.
  542 + See also ../misc/linearization-errors.
  543 +
  544 +* If I ever decide to make appearance stream-generation aware of
  545 + fonts or font metrics, see email from Tobias with Message-ID
  546 + <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
  547 +
  548 +* Look at places in the code where object traversal is being done and,
  549 + where possible, try to avoid it entirely or at least avoid ever
  550 + traversing the same objects multiple times.
542 551
543 552 ----------------------------------------------------------------------
544 553
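
The "non-flat pages tree" item in the TODO text above describes splitting more than 256 pages across a second layer of nodes, each holding at most 256 children and sized as evenly as possible. A minimal sketch of just that sizing arithmetic follows. The function name `second_layer_sizes` and its `max_fanout` parameter are hypothetical and not part of qpdf; the sketch does not touch any PDF objects. As the TODO notes, the restructuring itself is only safe when intermediate page nodes carry nothing beyond /Kids, /Parent, /Type, and /Count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a qpdf API): returns how many pages each
// second-layer node would hold. An empty result means the tree can stay flat.
std::vector<std::size_t>
second_layer_sizes(std::size_t n_pages, std::size_t max_fanout = 256)
{
    assert(max_fanout > 0);
    std::vector<std::size_t> sizes;
    if (n_pages <= max_fanout) {
        return sizes; // few enough pages: no second layer needed
    }
    // Fewest second-layer nodes such that none exceeds max_fanout children.
    std::size_t n_nodes = (n_pages + max_fanout - 1) / max_fanout;
    // Spread pages evenly: the first `extra` nodes get one more page.
    std::size_t base = n_pages / n_nodes;
    std::size_t extra = n_pages % n_nodes;
    sizes.reserve(n_nodes);
    for (std::size_t i = 0; i < n_nodes; ++i) {
        sizes.push_back(base + (i < extra ? 1 : 0));
    }
    return sizes;
}
```

With more than 65,536 pages the top node ends up with more than 256 children, which matches the TODO's stated choice to live with that case rather than add a third layer.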