Commit 48dfae6443943512739bee4d5a488592c89f3c1d
1 parent 433be371
TODO: rescope some items
Showing 1 changed file with 198 additions and 189 deletions
TODO
| ... | ... | @@ -21,31 +21,15 @@ Pending changes: |
| 21 | 21 | appimage build specifically is setting the runpath, which is |
| 22 | 22 | actually desirable in this case. Make sure to understand and |
| 23 | 23 | document this. Maybe add a check for it in the build. |
| 24 | -* Decide what to do about #664 (get*Box) | |
| 25 | -* Add an option --ignore-encryption to ignore encryption information | |
| 26 | - and treat encrypted files as if they weren't encrypted. This should | |
| 27 | - make it possible to solve #598 (--show-encryption without a | |
| 28 | - password). We'll need to make sure we don't try to filter any | |
| 29 | - streams in this mode. Ideally we should be able to combine this with | |
| 30 | - --json so we can look at the raw encrypted strings and streams if we | |
| 31 | - want to, though be sure to document that the resulting JSON won't be | |
| 32 | - convertible back to a valid PDF. Since providing the password may | |
| 33 | - reveal additional details, --show-encryption could potentially retry | |
| 34 | - with this option if the first time doesn't work. Then, with the file | |
| 35 | - open, we can read the encryption dictionary normally. | |
| 36 | -* In libtests, separate executables that need the object library | |
| 37 | - from those that strictly use public API. Move as many of the test | |
| 38 | - drivers from the qpdf directory into the latter category as long | |
| 39 | - as doing so isn't too troublesome from a coverage standpoint. | |
| 40 | -* Consider adding fuzzer code for JSON | |
| 41 | -* Consider generating a non-flat pages tree before creating output to | |
| 42 | - better handle files with lots of pages. If there are more than 256 | |
| 43 | - pages, add a second layer with the second layer nodes having no more | |
| 44 | - than 256 nodes and being as evenly sized as possible. Don't worry | |
| 45 | - about the case of more than 65,536 pages. If the top node has more | |
| 46 | - than 256 children, we'll live with it. | |
| 47 | 24 | |
| 48 | -Parent pointer idea: | |
| 25 | +Soon: Break ground on "Document-level work" | |
| 26 | + | |
| 27 | +Fix Multiple Direct Object Owner Issue | |
| 28 | +====================================== | |
| 29 | + | |
| 30 | +These are some ideas I've had, but I'm parking them until I fully | |
| 31 | +understand m-holger's proposal to split QPDFObject into QPDFObject and | |
| 32 | +QPDFValue. | |
| 49 | 33 | |
| 50 | 34 | * Add std::weak_ptr<QPDFObject> parent to QPDFObject. When adding a |
| 51 | 35 | direct object to an array or dictionary, set its parent. When |
| ... | ... | @@ -65,8 +49,6 @@ Note that arrays and dictionaries still need to contain |
| 65 | 49 | QPDFObjectHandle because of indirect objects. This only pertains to |
| 66 | 50 | direct objects, which are always "resolved" in QPDFObjectHandle. |
| 67 | 51 | |
| 68 | -Soon: Break ground on "Document-level work" | |
| 69 | - | |
| 70 | 52 | Possible future JSON enhancements |
| 71 | 53 | ================================= |
| 72 | 54 | |
| ... | ... | @@ -376,169 +358,196 @@ directory or that are otherwise not publicly accessible. This includes |
| 376 | 358 | things sent to me by email that are specifically not public. Even so, |
| 377 | 359 | I find it useful to make reference to them in this list. |
| 378 | 360 | |
| 379 | - * Look at https://bestpractices.coreinfrastructure.org/en | |
| 380 | - | |
| 381 | - * Rework tests so that nothing is written into the source directory. | |
| 382 | - Ideally then the entire build could be done with a read-only | |
| 383 | - source tree. | |
| 384 | - | |
| 385 | - * Large file tests fail with linux32 before and after cmake. This was | |
| 386 | - first noticed after 10.6.3. I don't think it's worth fixing. | |
| 387 | - | |
| 388 | - * Consider updating the fuzzer with code that exercises | |
| 389 | - copyAnnotations, file attachments, and name and number trees. Check | |
| 390 | - fuzzer coverage. | |
| 391 | - | |
| 392 | - * Add code for creation of a file attachment annotation. It should | |
| 393 | - also be possible to create a widget annotation and a form field. | |
| 394 | - Update the pdf-attach-file.cc example with new APIs when ready. | |
| 395 | - | |
| 396 | - * Flattening of form XObjects seems like something that would be | |
| 397 | - useful in the library. We are seeing more cases of completely valid | |
| 398 | - PDF files with form XObjects that cause problems in other software. | |
| 399 | - Flattening of form XObjects could be a useful way to work around | |
| 400 | - those issues or to prepare files for additional processing, making | |
| 401 | - it possible for users of the qpdf library to not be concerned about | |
| 402 | - form XObjects. This could be done recursively; i.e., we could have a | |
| 403 | - method to embed a form XObject into whatever contains it, whether | |
| 404 | - that is a form XObject or a page. This would require more | |
| 405 | - significant interpretation of the content stream. We would need a | |
| 406 | - test file in which the placement of the form XObject has to be in | |
| 407 | - the right place, e.g., the form XObject partially obscures earlier | |
| 408 | - code and is partially obscured by later code. Keys in the resource | |
| 409 | - dictionary may need to be changed -- create test cases with lots of | |
| 410 | - duplicated/overlapping keys. | |
| 411 | - | |
| 412 | - * Part of closed_file_input_source.cc is disabled on Windows because | |
| 413 | - of odd failures. It might be worth investigating so we can fully | |
| 414 | - exercise this in the test suite. That said, ClosedFileInputSource | |
| 415 | - is exercised elsewhere in qpdf's test suite, so this is not that | |
| 416 | - pressing. | |
| 417 | - | |
| 418 | - * If possible, consider adding CCITT3, CCITT4, or any other easy | |
| 419 | - filters. For some reference code that we probably can't use but may | |
| 420 | - be handy anyway, see | |
| 421 | - http://partners.adobe.com/public/developer/ps/sdk/index_archive.html | |
| 422 | - | |
| 423 | - * If possible, support the following types of broken files: | |
| 424 | - | |
| 425 | - - Files that have no whitespace token after "endobj" such that | |
| 426 | - endobj collides with the start of the next object | |
| 427 | - | |
| 428 | - - See ../misc/broken-files | |
| 429 | - | |
| 430 | - - See ../misc/bad-files-issue-476. This directory contains a | |
| 431 | - snapshot of the google doc and linked PDF files from issue #476. | |
| 432 | - Please see the issue for details. | |
| 433 | - | |
| 434 | - * Additional form features | |
| 435 | - * set value from CLI? Specify title, and provide way to | |
| 436 | - disambiguate, probably by giving objgen of field | |
| 437 | - | |
| 438 | - * Pl_TIFFPredictor is pretty slow. | |
| 439 | - | |
| 440 | - * Support for handling file names with Unicode characters in Windows | |
| 441 | - is incomplete. qpdf seems to support them okay from a functionality | |
| 442 | - standpoint, and the right thing happens if you pass in UTF-8 | |
| 443 | - encoded filenames to QPDF library routines in Windows (they are | |
| 444 | - converted internally to wchar_t*), but file names are encoded in | |
| 445 | - UTF-8 on output, which doesn't produce nice error messages or | |
| 446 | - output on Windows in some cases. | |
| 447 | - | |
| 448 | - * If we ever wanted to do anything more with character encoding, see | |
| 449 | - ../misc/character-encoding/, which includes a machine-readable dump | |
| 450 | - of table D.2 in the ISO-32000 PDF spec. This shows the mapping | |
| 451 | - between Unicode, StandardEncoding, WinAnsiEncoding, | |
| 452 | - MacRomanEncoding, and PDFDocEncoding. | |
| 453 | - | |
| 454 | - * Some test cases on bad files fail because qpdf is unable to find | |
| 455 | - the root dictionary when it fails to read the trailer. Recovery | |
| 456 | - could find the root dictionary and even the info dictionary in | |
| 457 | - other ways. In particular, issue-202.pdf can be opened by evince, | |
| 458 | - and there's no real reason that qpdf couldn't be made to be able to | |
| 459 | - recover that file as well. | |
| 460 | - | |
| 461 | - * Audit every place where qpdf allocates memory to see whether there | |
| 462 | - are cases where malicious inputs could cause qpdf to attempt to | |
| 463 | - grab very large amounts of memory. Certainly there are cases like | |
| 464 | - this, such as if a very highly compressed, very large image stream | |
| 465 | - is requested in a buffer. Hopefully normal input to output | |
| 466 | - filtering doesn't ever try to do this. QPDFWriter should be checked | |
| 467 | - carefully too. See also bugs/private/from-email-663916/ | |
| 468 | - | |
| 469 | - * Interactive form modification: | |
| 470 | - https://github.com/qpdf/qpdf/issues/213 contains a good discussion | |
| 471 | - of some ideas for adding methods to modify annotations and form | |
| 472 | - fields if we want to make it easier to support modifications to | |
| 473 | - interactive forms. Some of the ideas have been implemented, and | |
| 474 | - some of the probably never will be implemented, but it's worth a | |
| 475 | - read if there is an intention to work on this. In the issue, search | |
| 476 | - for "Regarding write functionality", and read that comment and the | |
| 477 | - responses to it. | |
| 478 | - | |
| 479 | - * Look at ~/Q/pdf-collection/forms-from-appian/ | |
| 480 | - | |
| 481 | - * When decrypting files with /R=6, hash_V5 is called more than once | |
| 482 | - with the same inputs. Caching the results or refactoring to reduce | |
| 483 | - the number of identical calls could improve performance for | |
| 484 | - workloads that involve processing large numbers of small files. | |
| 485 | - | |
| 486 | - * Consider adding a method to balance the pages tree. It would call | |
| 487 | - pushInheritedAttributesToPage, construct a pages tree from scratch, | |
| 488 | - and replace the /Pages key of the root dictionary with the new | |
| 489 | - tree. | |
| 490 | - | |
| 491 | - * Study what's required to support savable forms that can be saved by | |
| 492 | - Adobe Reader. Does this require actually signing the document with | |
| 493 | - an Adobe private key? Search for "Digital signatures" in the PDF | |
| 494 | - spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which | |
| 495 | - came from Adobe's example site. See also | |
| 496 | - ../misc/digital-sign-from-trueroad/. If digital signatures are | |
| 497 | - implemented, update the docs on crypto providers, which mention | |
| 498 | - that this may happen in the future. | |
| 499 | - | |
| 500 | - * Qpdf does not honor /EFF when adding new file attachments. When it | |
| 501 | - encrypts, it never generates streams with explicit crypt filters. | |
| 502 | - Prior to 10.2, there was an incorrect attempt to treat /EFF as a | |
| 503 | - default value for decrypting file attachment streams, but it is not | |
| 504 | - supposed to mean that. Instead, it is intended for conforming | |
| 505 | - writers to obey this when adding new attachments. Qpdf is not a | |
| 506 | - conforming writer in that respect. | |
| 507 | - | |
| 508 | - * The whole xref handling code in the QPDF object allows the same | |
| 509 | - object with more than one generation to coexist, but a lot of logic | |
| 510 | - assumes this isn't the case. Anything that creates mappings only | |
| 511 | - with the object number and not the generation is this way, | |
| 512 | - including most of the interaction between QPDFWriter and QPDF. If | |
| 513 | - we wanted to allow the same object with more than one generation to | |
| 514 | - coexist, which I'm not sure is allowed, we could fix this by | |
| 515 | - changing xref_table. Alternatively, we could detect and disallow | |
| 516 | - that case. In fact, it appears that Adobe Reader and other PDF | |
| 517 | - viewing software silently ignores objects of this type, so this is | |
| 518 | - probably not a big deal. | |
| 519 | - | |
| 520 | - * From a suggestion in bug 3152169, consider having an option to | |
| 521 | - re-encode inline images with an ASCII encoding. | |
| 522 | - | |
| 523 | - * From github issue 2, provide more in-depth output for examining | |
| 524 | - hint stream contents. Consider adding an option to provide a | |
| 525 | - human-readable dump of linearization hint tables. This should | |
| 526 | - include improving the 'overflow reading bit stream' message as | |
| 527 | - reported in issue #2. There are multiple calls to stopOnError in | |
| 528 | - the linearization checking code. Ideally, these should not | |
| 529 | - terminate checking. It would require re-acquiring an understanding | |
| 530 | - of all that code to make the checks more robust. In particular, | |
| 531 | - it's hard to look at the code and quickly determine what is a true | |
| 532 | - logic error and what could happen because of malformed user input. | |
| 533 | - See also ../misc/linearization-errors. | |
| 534 | - | |
| 535 | - * If I ever decide to make appearance stream-generation aware of | |
| 536 | - fonts or font metrics, see email from Tobias with Message-ID | |
| 537 | - <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14. | |
| 538 | - | |
| 539 | - * Look at places in the code where object traversal is being done and, | |
| 540 | - where possible, try to avoid it entirely or at least avoid ever | |
| 541 | - traversing the same objects multiple times. | |
| 361 | +* Add an option --ignore-encryption to ignore encryption information | |
| 362 | + and treat encrypted files as if they weren't encrypted. This should | |
| 363 | + make it possible to solve #598 (--show-encryption without a | |
| 364 | + password). We'll need to make sure we don't try to filter any | |
| 365 | + streams in this mode. Ideally we should be able to combine this with | |
| 366 | + --json so we can look at the raw encrypted strings and streams if we | |
| 367 | + want to, though be sure to document that the resulting JSON won't be | |
| 368 | + convertible back to a valid PDF. Since providing the password may | |
| 369 | + reveal additional details, --show-encryption could potentially retry | |
| 370 | + with this option if the first time doesn't work. Then, with the file | |
| 371 | + open, we can read the encryption dictionary normally. | |
| 372 | + | |
| 373 | +* In libtests, separate executables that need the object library | |
| 374 | + from those that strictly use public API. Move as many of the test | |
| 375 | + drivers from the qpdf directory into the latter category as long | |
| 376 | + as doing so isn't too troublesome from a coverage standpoint. | |
| 377 | + | |
| 378 | +* Consider generating a non-flat pages tree before creating output to | |
| 379 | + better handle files with lots of pages. If there are more than 256 | |
| 380 | + pages, add a second layer with the second layer nodes having no more | |
| 381 | + than 256 nodes and being as evenly sized as possible. Don't worry | |
| 382 | + about the case of more than 65,536 pages. If the top node has more | |
| 383 | + than 256 children, we'll live with it. This is only safe if all | |
| 384 | + intermediate page nodes have only /Kids, /Parent, /Type, and /Count. | |
| 385 | + | |
| 386 | +* Look at https://bestpractices.coreinfrastructure.org/en | |
| 387 | + | |
| 388 | +* Consider adding fuzzer code for JSON | |
| 389 | + | |
| 390 | +* Rework tests so that nothing is written into the source directory. | |
| 391 | + Ideally then the entire build could be done with a read-only | |
| 392 | + source tree. | |
| 393 | + | |
| 394 | +* Large file tests fail with linux32 before and after cmake. This was | |
| 395 | + first noticed after 10.6.3. I don't think it's worth fixing. | |
| 396 | + | |
| 397 | +* Consider updating the fuzzer with code that exercises | |
| 398 | + copyAnnotations, file attachments, and name and number trees. Check | |
| 399 | + fuzzer coverage. | |
| 400 | + | |
| 401 | +* Add code for creation of a file attachment annotation. It should | |
| 402 | + also be possible to create a widget annotation and a form field. | |
| 403 | + Update the pdf-attach-file.cc example with new APIs when ready. | |
| 404 | + | |
| 405 | +* Flattening of form XObjects seems like something that would be | |
| 406 | + useful in the library. We are seeing more cases of completely valid | |
| 407 | + PDF files with form XObjects that cause problems in other software. | |
| 408 | + Flattening of form XObjects could be a useful way to work around | |
| 409 | + those issues or to prepare files for additional processing, making | |
| 410 | + it possible for users of the qpdf library to not be concerned about | |
| 411 | + form XObjects. This could be done recursively; i.e., we could have a | |
| 412 | + method to embed a form XObject into whatever contains it, whether | |
| 413 | + that is a form XObject or a page. This would require more | |
| 414 | + significant interpretation of the content stream. We would need a | |
| 415 | + test file in which the placement of the form XObject has to be in | |
| 416 | + the right place, e.g., the form XObject partially obscures earlier | |
| 417 | + code and is partially obscured by later code. Keys in the resource | |
| 418 | + dictionary may need to be changed -- create test cases with lots of | |
| 419 | + duplicated/overlapping keys. | |
| 420 | + | |
| 421 | +* Part of closed_file_input_source.cc is disabled on Windows because | |
| 422 | + of odd failures. It might be worth investigating so we can fully | |
| 423 | + exercise this in the test suite. That said, ClosedFileInputSource | |
| 424 | + is exercised elsewhere in qpdf's test suite, so this is not that | |
| 425 | + pressing. | |
| 426 | + | |
| 427 | +* If possible, consider adding CCITT3, CCITT4, or any other easy | |
| 428 | + filters. For some reference code that we probably can't use but may | |
| 429 | + be handy anyway, see | |
| 430 | + http://partners.adobe.com/public/developer/ps/sdk/index_archive.html | |
| 431 | + | |
| 432 | +* If possible, support the following types of broken files: | |
| 433 | + | |
| 434 | + - Files that have no whitespace token after "endobj" such that | |
| 435 | + endobj collides with the start of the next object | |
| 436 | + | |
| 437 | + - See ../misc/broken-files | |
| 438 | + | |
| 439 | + - See ../misc/bad-files-issue-476. This directory contains a | |
| 440 | + snapshot of the google doc and linked PDF files from issue #476. | |
| 441 | + Please see the issue for details. | |
| 442 | + | |
| 443 | +* Additional form features | |
| 444 | + * set value from CLI? Specify title, and provide way to | |
| 445 | + disambiguate, probably by giving objgen of field | |
| 446 | + | |
| 447 | +* Pl_TIFFPredictor is pretty slow. | |
| 448 | + | |
| 449 | +* Support for handling file names with Unicode characters in Windows | |
| 450 | + is incomplete. qpdf seems to support them okay from a functionality | |
| 451 | + standpoint, and the right thing happens if you pass in UTF-8 | |
| 452 | + encoded filenames to QPDF library routines in Windows (they are | |
| 453 | + converted internally to wchar_t*), but file names are encoded in | |
| 454 | + UTF-8 on output, which doesn't produce nice error messages or | |
| 455 | + output on Windows in some cases. | |
| 456 | + | |
| 457 | +* If we ever wanted to do anything more with character encoding, see | |
| 458 | + ../misc/character-encoding/, which includes a machine-readable dump | |
| 459 | + of table D.2 in the ISO-32000 PDF spec. This shows the mapping | |
| 460 | + between Unicode, StandardEncoding, WinAnsiEncoding, | |
| 461 | + MacRomanEncoding, and PDFDocEncoding. | |
| 462 | + | |
| 463 | +* Some test cases on bad files fail because qpdf is unable to find | |
| 464 | + the root dictionary when it fails to read the trailer. Recovery | |
| 465 | + could find the root dictionary and even the info dictionary in | |
| 466 | + other ways. In particular, issue-202.pdf can be opened by evince, | |
| 467 | + and there's no real reason that qpdf couldn't be made to be able to | |
| 468 | + recover that file as well. | |
| 469 | + | |
| 470 | +* Audit every place where qpdf allocates memory to see whether there | |
| 471 | + are cases where malicious inputs could cause qpdf to attempt to | |
| 472 | + grab very large amounts of memory. Certainly there are cases like | |
| 473 | + this, such as if a very highly compressed, very large image stream | |
| 474 | + is requested in a buffer. Hopefully normal input to output | |
| 475 | + filtering doesn't ever try to do this. QPDFWriter should be checked | |
| 476 | + carefully too. See also bugs/private/from-email-663916/ | |
| 477 | + | |
| 478 | +* Interactive form modification: | |
| 479 | + https://github.com/qpdf/qpdf/issues/213 contains a good discussion | |
| 480 | + of some ideas for adding methods to modify annotations and form | |
| 481 | + fields if we want to make it easier to support modifications to | |
| 482 | + interactive forms. Some of the ideas have been implemented, and | |
| 483 | + some of them probably never will be implemented, but it's worth a | |
| 484 | + read if there is an intention to work on this. In the issue, search | |
| 485 | + for "Regarding write functionality", and read that comment and the | |
| 486 | + responses to it. | |
| 487 | + | |
| 488 | +* Look at ~/Q/pdf-collection/forms-from-appian/ | |
| 489 | + | |
| 490 | +* When decrypting files with /R=6, hash_V5 is called more than once | |
| 491 | + with the same inputs. Caching the results or refactoring to reduce | |
| 492 | + the number of identical calls could improve performance for | |
| 493 | + workloads that involve processing large numbers of small files. | |
| 494 | + | |
| 495 | +* Consider adding a method to balance the pages tree. It would call | |
| 496 | + pushInheritedAttributesToPage, construct a pages tree from scratch, | |
| 497 | + and replace the /Pages key of the root dictionary with the new | |
| 498 | + tree. | |
| 499 | + | |
| 500 | +* Study what's required to support savable forms that can be saved by | |
| 501 | + Adobe Reader. Does this require actually signing the document with | |
| 502 | + an Adobe private key? Search for "Digital signatures" in the PDF | |
| 503 | + spec, and look at ~/Q/pdf-collection/form-with-full-save.pdf, which | |
| 504 | + came from Adobe's example site. See also | |
| 505 | + ../misc/digital-sign-from-trueroad/. If digital signatures are | |
| 506 | + implemented, update the docs on crypto providers, which mention | |
| 507 | + that this may happen in the future. | |
| 508 | + | |
| 509 | +* Qpdf does not honor /EFF when adding new file attachments. When it | |
| 510 | + encrypts, it never generates streams with explicit crypt filters. | |
| 511 | + Prior to 10.2, there was an incorrect attempt to treat /EFF as a | |
| 512 | + default value for decrypting file attachment streams, but it is not | |
| 513 | + supposed to mean that. Instead, it is intended for conforming | |
| 514 | + writers to obey this when adding new attachments. Qpdf is not a | |
| 515 | + conforming writer in that respect. | |
| 516 | + | |
| 517 | +* The whole xref handling code in the QPDF object allows the same | |
| 518 | + object with more than one generation to coexist, but a lot of logic | |
| 519 | + assumes this isn't the case. Anything that creates mappings only | |
| 520 | + with the object number and not the generation is this way, | |
| 521 | + including most of the interaction between QPDFWriter and QPDF. If | |
| 522 | + we wanted to allow the same object with more than one generation to | |
| 523 | + coexist, which I'm not sure is allowed, we could fix this by | |
| 524 | + changing xref_table. Alternatively, we could detect and disallow | |
| 525 | + that case. In fact, it appears that Adobe Reader and other PDF | |
| 526 | + viewing software silently ignores objects of this type, so this is | |
| 527 | + probably not a big deal. | |
| 528 | + | |
| 529 | +* From a suggestion in bug 3152169, consider having an option to | |
| 530 | + re-encode inline images with an ASCII encoding. | |
| 531 | + | |
| 532 | +* From github issue 2, provide more in-depth output for examining | |
| 533 | + hint stream contents. Consider adding an option to provide a | |
| 534 | + human-readable dump of linearization hint tables. This should | |
| 535 | + include improving the 'overflow reading bit stream' message as | |
| 536 | + reported in issue #2. There are multiple calls to stopOnError in | |
| 537 | + the linearization checking code. Ideally, these should not | |
| 538 | + terminate checking. It would require re-acquiring an understanding | |
| 539 | + of all that code to make the checks more robust. In particular, | |
| 540 | + it's hard to look at the code and quickly determine what is a true | |
| 541 | + logic error and what could happen because of malformed user input. | |
| 542 | + See also ../misc/linearization-errors. | |
| 543 | + | |
| 544 | +* If I ever decide to make appearance stream-generation aware of | |
| 545 | + fonts or font metrics, see email from Tobias with Message-ID | |
| 546 | + <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14. | |
| 547 | + | |
| 548 | +* Look at places in the code where object traversal is being done and, | |
| 549 | + where possible, try to avoid it entirely or at least avoid ever | |
| 550 | + traversing the same objects multiple times. | |
| 542 | 551 | |
| 543 | 552 | ---------------------------------------------------------------------- |
| 544 | 553 | ... | ... |
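
The "non-flat pages tree" item in the TODO text above describes splitting a long page list into second-layer nodes of at most 256 children, sized as evenly as possible. A minimal sketch of just the size calculation might look like the following. The function name even_group_sizes and its parameters are hypothetical and are not part of the qpdf API; building the actual /Pages nodes is left out, and, as the TODO notes, the rebuild would only be safe when intermediate page nodes carry nothing but /Kids, /Parent, /Type, and /Count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a qpdf function): given a page count, return
// the number of pages to place under each second-layer node so that no
// node exceeds max_kids and the sizes differ by at most one.
std::vector<std::size_t>
even_group_sizes(std::size_t n_pages, std::size_t max_kids = 256)
{
    assert(max_kids > 0);
    std::vector<std::size_t> sizes;
    if (n_pages == 0) {
        return sizes;
    }
    if (n_pages <= max_kids) {
        // Small enough to stay flat: one group holding every page.
        sizes.push_back(n_pages);
        return sizes;
    }
    // Fewest groups that keep every group within max_kids.
    std::size_t n_groups = (n_pages + max_kids - 1) / max_kids;
    std::size_t base = n_pages / n_groups;
    std::size_t extra = n_pages % n_groups;
    for (std::size_t i = 0; i < n_groups; ++i) {
        // The first `extra` groups take one additional page.
        sizes.push_back(base + (i < extra ? 1 : 0));
    }
    return sizes;
}
```

With the default limit, 257 pages yield two groups of 129 and 128, and 1000 pages yield four groups of 250. Past 65,536 pages the number of groups itself exceeds 256, which matches the TODO's decision to live with a wide top node in that case.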