Commit 48dfae6443943512739bee4d5a488592c89f3c1d

Authored by Jay Berkenbilt
1 parent 433be371

TODO: rescope some items

Showing 1 changed file with 198 additions and 189 deletions
... ... @@ -21,31 +21,15 @@ Pending changes:
21 21 appimage build specifically is setting the runpath, which is
22 22 actually desirable in this case. Make sure to understand and
23 23 document this. Maybe add a check for it in the build.
24   -* Decide what to do about #664 (get*Box)
25   -* Add an option --ignore-encryption to ignore encryption information
26   - and treat encrypted files as if they weren't encrypted. This should
27   - make it possible to solve #598 (--show-encryption without a
28   - password). We'll need to make sure we don't try to filter any
29   - streams in this mode. Ideally we should be able to combine this with
30   - --json so we can look at the raw encrypted strings and streams if we
31   - want to, though be sure to document that the resulting JSON won't be
32   - convertible back to a valid PDF. Since providing the password may
33   - reveal additional details, --show-encryption could potentially retry
34   - with this option if the first time doesn't work. Then, with the file
35   - open, we can read the encryption dictionary normally.
36   -* In libtests, separate executables that need the object library
37   - from those that strictly use public API. Move as many of the test
38   - drivers from the qpdf directory into the latter category as long
39   - as doing so isn't too troublesome from a coverage standpoint.
40   -* Consider adding fuzzer code for JSON
41   -* Consider generating a non-flat pages tree before creating output to
42   - better handle files with lots of pages. If there are more than 256
43   - pages, add a second layer with the second layer nodes having no more
44   - than 256 nodes and being as evenly sized as possible. Don't worry
45   - about the case of more than 65,536 pages. If the top node has more
46   - than 256 children, we'll live with it.
47 24  
48   -Parent pointer idea:
  25 +Soon: Break ground on "Document-level work"
  26 +
  27 +Fix Multiple Direct Object Owner Issue
  28 +======================================
  29 +
  30 +These are some ideas I've had, but I'm parking them until I fully
  31 +understand m-holger's proposal to split QPDFObject into QPDFObject and
  32 +QPDFValue.
49 33  
50 34 * Add std::weak_ptr<QPDFObject> parent to QPDFObject. When adding a
51 35 direct object to an array or dictionary, set its parent. When
... ... @@ -65,8 +49,6 @@ Note that arrays and dictionaries still need to contain
65 49 QPDFObjectHandle because of indirect objects. This only pertains to
66 50 direct objects, which are always "resolved" in QPDFObjectHandle.
67 51  
68   -Soon: Break ground on "Document-level work"
69   -
70 52 Possible future JSON enhancements
71 53 =================================
72 54  
... ... @@ -376,169 +358,196 @@ directory or that are otherwise not publicly accessible. This includes
376 358 things sent to me by email that are specifically not public. Even so,
377 359 I find it useful to make reference to them in this list.
378 360  
379   - * Look at https://bestpractices.coreinfrastructure.org/en
380   -
381   - * Rework tests so that nothing is written into the source directory.
382   - Ideally then the entire build could be done with a read-only
383   - source tree.
384   -
385   - * Large file tests fail with linux32 before and after cmake. This was
386   - first noticed after 10.6.3. I don't think it's worth fixing.
387   -
388   - * Consider updating the fuzzer with code that exercises
389   - copyAnnotations, file attachments, and name and number trees. Check
390   - fuzzer coverage.
391   -
392   - * Add code for creation of a file attachment annotation. It should
393   - also be possible to create a widget annotation and a form field.
394   - Update the pdf-attach-file.cc example with new APIs when ready.
395   -
396   - * Flattening of form XObjects seems like something that would be
397   - useful in the library. We are seeing more cases of completely valid
398   - PDF files with form XObjects that cause problems in other software.
399   - Flattening of form XObjects could be a useful way to work around
400   - those issues or to prepare files for additional processing, making
401   - it possible for users of the qpdf library to not be concerned about
402   - form XObjects. This could be done recursively; i.e., we could have a
403   - method to embed a form XObject into whatever contains it, whether
404   - that is a form XObject or a page. This would require more
405   - significant interpretation of the content stream. We would need a
406   - test file in which the placement of the form XObject has to be in
407   - the right place, e.g., the form XObject partially obscures earlier
408   - code and is partially obscured by later code. Keys in the resource
409   - dictionary may need to be changed -- create test cases with lots of
410   - duplicated/overlapping keys.
411   -
412   - * Part of closed_file_input_source.cc is disabled on Windows because
413   - of odd failures. It might be worth investigating so we can fully
414   - exercise this in the test suite. That said, ClosedFileInputSource
415   - is exercised elsewhere in qpdf's test suite, so this is not that
416   - pressing.
417   -
418   - * If possible, consider adding CCITT3, CCITT4, or any other easy
419   - filters. For some reference code that we probably can't use but may
420   - be handy anyway, see
421   - http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
422   -
423   - * If possible, support the following types of broken files:
424   -
425   - - Files that have no whitespace token after "endobj" such that
426   - endobj collides with the start of the next object
427   -
428   - - See ../misc/broken-files
429   -
430   - - See ../misc/bad-files-issue-476. This directory contains a
431   - snapshot of the google doc and linked PDF files from issue #476.
432   - Please see the issue for details.
433   -
434   - * Additional form features
435   - * set value from CLI? Specify title, and provide way to
436   - disambiguate, probably by giving objgen of field
437   -
438   - * Pl_TIFFPredictor is pretty slow.
439   -
440   - * Support for handling file names with Unicode characters in Windows
441   - is incomplete. qpdf seems to support them okay from a functionality
442   - standpoint, and the right thing happens if you pass in UTF-8
443   - encoded filenames to QPDF library routines in Windows (they are
444   - converted internally to wchar_t*), but file names are encoded in
445   - UTF-8 on output, which doesn't produce nice error messages or
446   - output on Windows in some cases.
447   -
448   - * If we ever wanted to do anything more with character encoding, see
449   - ../misc/character-encoding/, which includes machine-readable dump
450   - of table D.2 in the ISO-32000 PDF spec. This shows the mapping
451   - between Unicode, StandardEncoding, WinAnsiEncoding,
452   - MacRomanEncoding, and PDFDocEncoding.
453   -
454   - * Some test cases on bad files fail because qpdf is unable to find
455   - the root dictionary when it fails to read the trailer. Recovery
456   - could find the root dictionary and even the info dictionary in
457   - other ways. In particular, issue-202.pdf can be opened by evince,
458   - and there's no real reason that qpdf couldn't be made to be able to
459   - recover that file as well.
460   -
461   - * Audit every place where qpdf allocates memory to see whether there
462   - are cases where malicious inputs could cause qpdf to attempt to
463   - grab very large amounts of memory. Certainly there are cases like
464   - this, such as if a very highly compressed, very large image stream
465   - is requested in a buffer. Hopefully normal input to output
466   - filtering doesn't ever try to do this. QPDFWriter should be checked
467   - carefully too. See also bugs/private/from-email-663916/
468   -
469   - * Interactive form modification:
470   - https://github.com/qpdf/qpdf/issues/213 contains a good discussion
471   - of some ideas for adding methods to modify annotations and form
472   - fields if we want to make it easier to support modifications to
473   - interactive forms. Some of the ideas have been implemented, and
474   - some of them probably never will be implemented, but it's worth a
475   - read if there is an intention to work on this. In the issue, search
476   - for "Regarding write functionality", and read that comment and the
477   - responses to it.
478   -
479   - * Look at ~/Q/pdf-collection/forms-from-appian/
480   -
481   - * When decrypting files with /R=6, hash_V5 is called more than once
482   - with the same inputs. Caching the results or refactoring to reduce
483   - the number of identical calls could improve performance for
484   - workloads that involve processing large numbers of small files.
485   -
486   - * Consider adding a method to balance the pages tree. It would call
487   - pushInheritedAttributesToPage, construct a pages tree from scratch,
488   - and replace the /Pages key of the root dictionary with the new
489   - tree.
490   -
491   - * Study what's required to support savable forms that can be saved by
492   - Adobe Reader. Does this require actually signing the document with
493   - an Adobe private key? Search for "Digital signatures" in the PDF
494   - spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
495   - came from Adobe's example site. See also
496   - ../misc/digital-sign-from-trueroad/. If digital signatures are
497   - implemented, update the docs on crypto providers, which mention
498   - that this may happen in the future.
499   -
500   - * Qpdf does not honor /EFF when adding new file attachments. When it
501   - encrypts, it never generates streams with explicit crypt filters.
502   - Prior to 10.2, there was an incorrect attempt to treat /EFF as a
503   - default value for decrypting file attachment streams, but it is not
504   - supposed to mean that. Instead, it is intended for conforming
505   - writers to obey this when adding new attachments. Qpdf is not a
506   - conforming writer in that respect.
507   -
508   - * The whole xref handling code in the QPDF object allows the same
509   - object with more than one generation to coexist, but a lot of logic
510   - assumes this isn't the case. Anything that creates mappings only
511   - with the object number and not the generation is this way,
512   - including most of the interaction between QPDFWriter and QPDF. If
513   - we wanted to allow the same object with more than one generation to
514   - coexist, which I'm not sure is allowed, we could fix this by
515   - changing xref_table. Alternatively, we could detect and disallow
516   - that case. In fact, it appears that Adobe reader and other PDF
517   - viewing software silently ignores objects of this type, so this is
518   - probably not a big deal.
519   -
520   - * From a suggestion in bug 3152169, consider having an option to
521   - re-encode inline images with an ASCII encoding.
522   -
523   - * From github issue 2, provide more in-depth output for examining
524   - hint stream contents. Consider adding an option to provide a
525   - human-readable dump of linearization hint tables. This should
526   - include improving the 'overflow reading bit stream' message as
527   - reported in issue #2. There are multiple calls to stopOnError in
528   - the linearization checking code. Ideally, these should not
529   - terminate checking. It would require re-acquiring an understanding
530   - of all that code to make the checks more robust. In particular,
531   - it's hard to look at the code and quickly determine what is a true
532   - logic error and what could happen because of malformed user input.
533   - See also ../misc/linearization-errors.
534   -
535   - * If I ever decide to make appearance stream-generation aware of
536   - fonts or font metrics, see email from Tobias with Message-ID
537   - <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
538   -
539   - * Look at places in the code where object traversal is being done and,
540   - where possible, try to avoid it entirely or at least avoid ever
541   - traversing the same objects multiple times.
  361 +* Add an option --ignore-encryption to ignore encryption information
  362 + and treat encrypted files as if they weren't encrypted. This should
  363 + make it possible to solve #598 (--show-encryption without a
  364 + password). We'll need to make sure we don't try to filter any
  365 + streams in this mode. Ideally we should be able to combine this with
  366 + --json so we can look at the raw encrypted strings and streams if we
  367 + want to, though be sure to document that the resulting JSON won't be
  368 + convertible back to a valid PDF. Since providing the password may
  369 + reveal additional details, --show-encryption could potentially retry
  370 + with this option if the first time doesn't work. Then, with the file
  371 + open, we can read the encryption dictionary normally.
  372 +
  373 +* In libtests, separate executables that need the object library
  374 + from those that strictly use public API. Move as many of the test
  375 + drivers from the qpdf directory into the latter category as long
  376 + as doing so isn't too troublesome from a coverage standpoint.
  377 +
  378 +* Consider generating a non-flat pages tree before creating output to
  379 + better handle files with lots of pages. If there are more than 256
  380 + pages, add a second layer with the second layer nodes having no more
  381 + than 256 nodes and being as evenly sized as possible. Don't worry
  382 + about the case of more than 65,536 pages. If the top node has more
  383 + than 256 children, we'll live with it. This is only safe if all
  384 + intermediate page nodes have only /Kids, /Parent, /Type, and /Count.
  385 +
  386 +* Look at https://bestpractices.coreinfrastructure.org/en
  387 +
  388 +* Consider adding fuzzer code for JSON
  389 +
  390 +* Rework tests so that nothing is written into the source directory.
  391 + Ideally then the entire build could be done with a read-only
  392 + source tree.
  393 +
  394 +* Large file tests fail with linux32 before and after cmake. This was
  395 + first noticed after 10.6.3. I don't think it's worth fixing.
  396 +
  397 +* Consider updating the fuzzer with code that exercises
  398 + copyAnnotations, file attachments, and name and number trees. Check
  399 + fuzzer coverage.
  400 +
  401 +* Add code for creation of a file attachment annotation. It should
  402 + also be possible to create a widget annotation and a form field.
  403 + Update the pdf-attach-file.cc example with new APIs when ready.
  404 +
  405 +* Flattening of form XObjects seems like something that would be
  406 + useful in the library. We are seeing more cases of completely valid
  407 + PDF files with form XObjects that cause problems in other software.
  408 + Flattening of form XObjects could be a useful way to work around
  409 + those issues or to prepare files for additional processing, making
  410 + it possible for users of the qpdf library to not be concerned about
  411 + form XObjects. This could be done recursively; i.e., we could have a
  412 + method to embed a form XObject into whatever contains it, whether
  413 + that is a form XObject or a page. This would require more
  414 + significant interpretation of the content stream. We would need a
  415 + test file in which the placement of the form XObject has to be in
  416 + the right place, e.g., the form XObject partially obscures earlier
  417 + code and is partially obscured by later code. Keys in the resource
  418 + dictionary may need to be changed -- create test cases with lots of
  419 + duplicated/overlapping keys.
  420 +
  421 +* Part of closed_file_input_source.cc is disabled on Windows because
  422 + of odd failures. It might be worth investigating so we can fully
  423 + exercise this in the test suite. That said, ClosedFileInputSource
  424 + is exercised elsewhere in qpdf's test suite, so this is not that
  425 + pressing.
  426 +
  427 +* If possible, consider adding CCITT3, CCITT4, or any other easy
  428 + filters. For some reference code that we probably can't use but may
  429 + be handy anyway, see
  430 + http://partners.adobe.com/public/developer/ps/sdk/index_archive.html
  431 +
  432 +* If possible, support the following types of broken files:
  433 +
  434 + - Files that have no whitespace token after "endobj" such that
  435 + endobj collides with the start of the next object
  436 +
  437 + - See ../misc/broken-files
  438 +
  439 + - See ../misc/bad-files-issue-476. This directory contains a
  440 + snapshot of the google doc and linked PDF files from issue #476.
  441 + Please see the issue for details.
  442 +
  443 +* Additional form features
  444 + * set value from CLI? Specify title, and provide way to
  445 + disambiguate, probably by giving objgen of field
  446 +
  447 +* Pl_TIFFPredictor is pretty slow.
  448 +
  449 +* Support for handling file names with Unicode characters in Windows
  450 + is incomplete. qpdf seems to support them okay from a functionality
  451 + standpoint, and the right thing happens if you pass in UTF-8
  452 + encoded filenames to QPDF library routines in Windows (they are
  453 + converted internally to wchar_t*), but file names are encoded in
  454 + UTF-8 on output, which doesn't produce nice error messages or
  455 + output on Windows in some cases.
  456 +
  457 +* If we ever wanted to do anything more with character encoding, see
  458 + ../misc/character-encoding/, which includes machine-readable dump
  459 + of table D.2 in the ISO-32000 PDF spec. This shows the mapping
  460 + between Unicode, StandardEncoding, WinAnsiEncoding,
  461 + MacRomanEncoding, and PDFDocEncoding.
  462 +
  463 +* Some test cases on bad files fail because qpdf is unable to find
  464 + the root dictionary when it fails to read the trailer. Recovery
  465 + could find the root dictionary and even the info dictionary in
  466 + other ways. In particular, issue-202.pdf can be opened by evince,
  467 + and there's no real reason that qpdf couldn't be made to be able to
  468 + recover that file as well.
  469 +
  470 +* Audit every place where qpdf allocates memory to see whether there
  471 + are cases where malicious inputs could cause qpdf to attempt to
  472 + grab very large amounts of memory. Certainly there are cases like
  473 + this, such as if a very highly compressed, very large image stream
  474 + is requested in a buffer. Hopefully normal input to output
  475 + filtering doesn't ever try to do this. QPDFWriter should be checked
  476 + carefully too. See also bugs/private/from-email-663916/
  477 +
  478 +* Interactive form modification:
  479 + https://github.com/qpdf/qpdf/issues/213 contains a good discussion
  480 + of some ideas for adding methods to modify annotations and form
  481 + fields if we want to make it easier to support modifications to
  482 + interactive forms. Some of the ideas have been implemented, and
  483 + some of them probably never will be implemented, but it's worth a
  484 + read if there is an intention to work on this. In the issue, search
  485 + for "Regarding write functionality", and read that comment and the
  486 + responses to it.
  487 +
  488 +* Look at ~/Q/pdf-collection/forms-from-appian/
  489 +
  490 +* When decrypting files with /R=6, hash_V5 is called more than once
  491 + with the same inputs. Caching the results or refactoring to reduce
  492 + the number of identical calls could improve performance for
  493 + workloads that involve processing large numbers of small files.
  494 +
  495 +* Consider adding a method to balance the pages tree. It would call
  496 + pushInheritedAttributesToPage, construct a pages tree from scratch,
  497 + and replace the /Pages key of the root dictionary with the new
  498 + tree.
  499 +
  500 +* Study what's required to support savable forms that can be saved by
  501 + Adobe Reader. Does this require actually signing the document with
  502 + an Adobe private key? Search for "Digital signatures" in the PDF
  503 + spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which
  504 + came from Adobe's example site. See also
  505 + ../misc/digital-sign-from-trueroad/. If digital signatures are
  506 + implemented, update the docs on crypto providers, which mention
  507 + that this may happen in the future.
  508 +
  509 +* Qpdf does not honor /EFF when adding new file attachments. When it
  510 + encrypts, it never generates streams with explicit crypt filters.
  511 + Prior to 10.2, there was an incorrect attempt to treat /EFF as a
  512 + default value for decrypting file attachment streams, but it is not
  513 + supposed to mean that. Instead, it is intended for conforming
  514 + writers to obey this when adding new attachments. Qpdf is not a
  515 + conforming writer in that respect.
  516 +
  517 +* The whole xref handling code in the QPDF object allows the same
  518 + object with more than one generation to coexist, but a lot of logic
  519 + assumes this isn't the case. Anything that creates mappings only
  520 + with the object number and not the generation is this way,
  521 + including most of the interaction between QPDFWriter and QPDF. If
  522 + we wanted to allow the same object with more than one generation to
  523 + coexist, which I'm not sure is allowed, we could fix this by
  524 + changing xref_table. Alternatively, we could detect and disallow
  525 + that case. In fact, it appears that Adobe reader and other PDF
  526 + viewing software silently ignores objects of this type, so this is
  527 + probably not a big deal.
  528 +
  529 +* From a suggestion in bug 3152169, consider having an option to
  530 + re-encode inline images with an ASCII encoding.
  531 +
  532 +* From github issue 2, provide more in-depth output for examining
  533 + hint stream contents. Consider adding an option to provide a
  534 + human-readable dump of linearization hint tables. This should
  535 + include improving the 'overflow reading bit stream' message as
  536 + reported in issue #2. There are multiple calls to stopOnError in
  537 + the linearization checking code. Ideally, these should not
  538 + terminate checking. It would require re-acquiring an understanding
  539 + of all that code to make the checks more robust. In particular,
  540 + it's hard to look at the code and quickly determine what is a true
  541 + logic error and what could happen because of malformed user input.
  542 + See also ../misc/linearization-errors.
  543 +
  544 +* If I ever decide to make appearance stream-generation aware of
  545 + fonts or font metrics, see email from Tobias with Message-ID
  546 + <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
  547 +
  548 +* Look at places in the code where object traversal is being done and,
  549 + where possible, try to avoid it entirely or at least avoid ever
  550 + traversing the same objects multiple times.
542 551  
543 552 ----------------------------------------------------------------------
544 553  
... ...
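
Note on the non-flat pages tree item in the TODO text above: it asks for second-layer nodes with no more than 256 children each, as evenly sized as possible. The grouping arithmetic might be sketched as follows. This is only an illustration under stated assumptions: evenGroupSizes is a hypothetical helper name, not part of the qpdf API, and it computes group sizes only. Building the actual /Pages nodes, and checking that intermediate nodes carry only /Kids, /Parent, /Type, and /Count, is left out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a qpdf function): split n_pages into the
// fewest groups of at most max_kids pages each, with group sizes that
// differ from one another by at most one page.
std::vector<std::size_t>
evenGroupSizes(std::size_t n_pages, std::size_t max_kids = 256)
{
    std::vector<std::size_t> sizes;
    if (n_pages == 0 || max_kids == 0) {
        return sizes;
    }
    // Ceiling division: the minimum number of second-layer nodes.
    std::size_t n_groups = (n_pages + max_kids - 1) / max_kids;
    std::size_t base = n_pages / n_groups;
    // The first `extra` groups receive one additional page.
    std::size_t extra = n_pages % n_groups;
    for (std::size_t i = 0; i < n_groups; ++i) {
        sizes.push_back(base + (i < extra ? 1 : 0));
    }
    return sizes;
}
```

A caller would presumably use this only when there are more than 256 pages. With more than 65,536 pages the number of groups itself exceeds 256, which matches the TODO's "we'll live with it" case for the top node.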