Commit 910a373a79f885cba1023fa69aa0c679e4ae0601
1 parent
a6c4b293
Clean up the Design and Library Notes chapter of the manual
Showing 1 changed file with 195 additions and 207 deletions
manual/design.rst
| @@ -8,50 +8,53 @@ Design and Library Notes | @@ -8,50 +8,53 @@ Design and Library Notes | ||
| 8 | Introduction | 8 | Introduction |
| 9 | ------------ | 9 | ------------ |
| 10 | 10 | ||
| 11 | -This section was written prior to the implementation of the qpdf package | ||
| 12 | -and was subsequently modified to reflect the implementation. In some | ||
| 13 | -cases, for purposes of explanation, it may differ slightly from the | ||
| 14 | -actual implementation. As always, the source code and test suite are | ||
| 15 | -authoritative. Even if there are some errors, this document should serve | ||
| 16 | -as a road map to understanding how this code works. | 11 | +This section was written prior to the implementation of the qpdf |
| 12 | +library and was subsequently modified to reflect the implementation. | ||
| 13 | +In some cases, for purposes of explanation, it may differ slightly | ||
| 14 | +from the actual implementation. As always, the source code and test | ||
| 15 | +suite are authoritative. Even if there are some errors, this document | ||
| 16 | +should serve as a road map to understanding how this code works. | ||
| 17 | 17 | ||
| 18 | In general, one should adhere strictly to a specification when writing | 18 | In general, one should adhere strictly to a specification when writing |
| 19 | -but be liberal in reading. This way, the product of our software will be | ||
| 20 | -accepted by the widest range of other programs, and we will accept the | ||
| 21 | -widest range of input files. This library attempts to conform to that | ||
| 22 | -philosophy whenever possible but also aims to provide strict checking | ||
| 23 | -for people who want to validate PDF files. If you don't want to see | ||
| 24 | -warnings and are trying to write something that is tolerant, you can | ||
| 25 | -call ``setSuppressWarnings(true)``. If you want to fail on the first | ||
| 26 | -error, you can call ``setAttemptRecovery(false)``. The default behavior | ||
| 27 | -is to generating warnings for recoverable problems. Note that recovery | ||
| 28 | -will not always produce the desired results even if it is able to get | ||
| 29 | -through the file. Unlike most other PDF files that produce generic | ||
| 30 | -warnings such as "This file is damaged,", qpdf generally issues a | ||
| 31 | -detailed error message that would be most useful to a PDF developer. | 19 | +but be liberal in reading. This way, the product of our software will |
| 20 | +be accepted by the widest range of other programs, and we will accept | ||
| 21 | +the widest range of input files. This library attempts to conform to | ||
| 22 | +that philosophy whenever possible but also aims to provide strict | ||
| 23 | +checking for people who want to validate PDF files. If you don't want | ||
| 24 | +to see warnings and are trying to write something that is tolerant, | ||
| 25 | +you can call ``setSuppressWarnings(true)``. If you want to fail on the | ||
| 26 | +first error, you can call ``setAttemptRecovery(false)``. The default | ||
| 27 | +behavior is to generate warnings for recoverable problems. Note that | ||
| 28 | +recovery will not always produce the desired results even if it is | ||
| 29 | +able to get through the file. Unlike most other PDF tools that produce | ||
| 30 | +generic warnings such as "This file is damaged," qpdf generally issues | ||
| 31 | +a detailed error message that would be most useful to a PDF developer. | ||
| 32 | This is by design as there seems to be a shortage of PDF validation | 32 | This is by design as there seems to be a shortage of PDF validation |
| 33 | -tools out there. This was, in fact, one of the major motivations behind | ||
| 34 | -the initial creation of qpdf. | 33 | +tools out there. This was, in fact, one of the major motivations |
| 34 | +behind the initial creation of qpdf. That said, qpdf is not a strict | ||
| 35 | +PDF checker. There are many ways in which a PDF file can be out of | ||
| 36 | +conformance to the spec that qpdf doesn't notice or report. | ||
| 35 | 37 | ||
| 36 | .. _design-goals: | 38 | .. _design-goals: |
| 37 | 39 | ||
| 38 | Design Goals | 40 | Design Goals |
| 39 | ------------ | 41 | ------------ |
| 40 | 42 | ||
| 41 | -The QPDF package includes support for reading and rewriting PDF files. | 43 | +The qpdf library includes support for reading and rewriting PDF files. |
| 42 | It aims to hide from the user details involving object locations, | 44 | It aims to hide from the user details involving object locations, |
| 43 | -modified (appended) PDF files, the directness/indirectness of objects, | ||
| 44 | -and stream filters including encryption. It does not aim to hide | ||
| 45 | -knowledge of the object hierarchy or content stream contents. Put | ||
| 46 | -another way, a user of the qpdf library is expected to have knowledge | ||
| 47 | -about how PDF files work, but is not expected to have to keep track of | ||
| 48 | -bookkeeping details such as file positions. | ||
| 49 | - | ||
| 50 | -A user of the library never has to care whether an object is direct or | ||
| 51 | -indirect, though it is possible to determine whether an object is direct | ||
| 52 | -or not if this information is needed. All access to objects deals with | ||
| 53 | -this transparently. All memory management details are also handled by | ||
| 54 | -the library. | 45 | +modified (appended) PDF files, use of object streams, and stream |
| 46 | +filters including encryption. It does not aim to hide knowledge of the | ||
| 47 | +object hierarchy or content stream contents. Put another way, a user | ||
| 48 | +of the qpdf library is expected to have knowledge about how PDF files | ||
| 49 | +work, but is not expected to have to keep track of bookkeeping details | ||
| 50 | +such as file positions. | ||
| 51 | + | ||
| 52 | +When accessing objects, a user of the library never has to care | ||
| 53 | +whether an object is direct or indirect as all access to objects deals | ||
| 54 | +with this transparently. All memory management details are also | ||
| 55 | +handled by the library. When modifying objects, it is possible to | ||
| 56 | +determine whether an object is indirect and to make copies of the | ||
| 57 | +object if needed. | ||
| 55 | 58 | ||
| 56 | Memory is managed mostly with ``std::shared_ptr`` objects to minimize | 59 | Memory is managed mostly with ``std::shared_ptr`` objects to minimize |
| 57 | explicit memory handling. This library also makes use of a technique | 60 | explicit memory handling. This library also makes use of a technique |
| @@ -85,29 +88,32 @@ objects to indirect objects and vice versa. | @@ -85,29 +88,32 @@ objects to indirect objects and vice versa. | ||
| 85 | Instances of ``QPDFObjectHandle`` can be directly created and modified | 88 | Instances of ``QPDFObjectHandle`` can be directly created and modified |
| 86 | using static factory methods in the ``QPDFObjectHandle`` class. There | 89 | using static factory methods in the ``QPDFObjectHandle`` class. There |
| 87 | are factory methods for each type of object as well as a convenience | 90 | are factory methods for each type of object as well as a convenience |
| 88 | -method ``QPDFObjectHandle::parse`` that creates an object from a string | ||
| 89 | -representation of the object. Existing instances of ``QPDFObjectHandle`` | ||
| 90 | -can also be modified in several ways. See comments in | ||
| 91 | -:file:`QPDFObjectHandle.hh` for details. | 91 | +method ``QPDFObjectHandle::parse`` that creates an object from a |
| 92 | +string representation of the object. The ``_qpdf`` user-defined string | ||
| 93 | +literal is also available, making it possible to create instances of | ||
| 94 | +``QPDFObjectHandle`` with ``"(pdf-syntax)"_qpdf``. Existing instances | ||
| 95 | +of ``QPDFObjectHandle`` can also be modified in several ways. See | ||
| 96 | +comments in :file:`QPDFObjectHandle.hh` for details. | ||
| 92 | 97 | ||
| 93 | An instance of ``QPDF`` is constructed by using the class's default | 98 | An instance of ``QPDF`` is constructed by using the class's default |
| 94 | -constructor. If desired, the ``QPDF`` object may be configured with | ||
| 95 | -various methods that change its default behavior. Then the | ||
| 96 | -``QPDF::processFile()`` method is passed the name of a PDF file, which | ||
| 97 | -permanently associates the file with that QPDF object. A password may | ||
| 98 | -also be given for access to password-protected files. QPDF does not | ||
| 99 | -enforce encryption parameters and will treat user and owner passwords | ||
| 100 | -equivalently. Either password may be used to access an encrypted file. | ||
| 101 | -``QPDF`` will allow recovery of a user password given an owner password. | ||
| 102 | -The input PDF file must be seekable. (Output files written by | ||
| 103 | -``QPDFWriter`` need not be seekable, even when creating linearized | ||
| 104 | -files.) During construction, ``QPDF`` validates the PDF file's header, | ||
| 105 | -and then reads the cross reference tables and trailer dictionaries. The | ||
| 106 | -``QPDF`` class keeps only the first trailer dictionary though it does | ||
| 107 | -read all of them so it can check the ``/Prev`` key. ``QPDF`` class users | ||
| 108 | -may request the root object and the trailer dictionary specifically. The | ||
| 109 | -cross reference table is kept private. Objects may then be requested by | ||
| 110 | -number or by walking the object tree. | 99 | +constructor or with ``QPDF::create()``. If desired, the ``QPDF`` |
| 100 | +object may be configured with various methods that change its default | ||
| 101 | +behavior. Then the ``QPDF::processFile`` method is passed the name of | ||
| 102 | +a PDF file, which permanently associates the file with that ``QPDF`` | ||
| 103 | +object. A password may also be given for access to password-protected | ||
| 104 | +files. ``QPDF`` does not enforce encryption parameters and will treat | ||
| 105 | +user and owner passwords equivalently. Either password may be used to | ||
| 106 | +access an encrypted file. ``QPDF`` will allow recovery of a user | ||
| 107 | +password given an owner password. The input PDF file must be seekable. | ||
| 108 | +Output files written by ``QPDFWriter`` need not be seekable, even when | ||
| 109 | +creating linearized files. During construction, ``QPDF`` validates the | ||
| 110 | +PDF file's header, and then reads the cross reference tables and | ||
| 111 | +trailer dictionaries. The ``QPDF`` class keeps only the first trailer | ||
| 112 | +dictionary though it does read all of them so it can check the | ||
| 113 | +``/Prev`` key. ``QPDF`` class users may request the root object and | ||
| 114 | +the trailer dictionary specifically. The cross reference table is kept | ||
| 115 | +private. Objects may then be requested by number or by walking the | ||
| 116 | +object tree. | ||
| 111 | 117 | ||
| 112 | When a PDF file has a cross-reference stream instead of a | 118 | When a PDF file has a cross-reference stream instead of a |
| 113 | cross-reference table and trailer, requesting the document's trailer | 119 | cross-reference table and trailer, requesting the document's trailer |
| @@ -240,13 +246,14 @@ the ``QPDFObjectHandle`` type to hold onto objects and to abstract | @@ -240,13 +246,14 @@ the ``QPDFObjectHandle`` type to hold onto objects and to abstract | ||
| 240 | away in most cases whether the object is direct or indirect. | 246 | away in most cases whether the object is direct or indirect. |
| 241 | 247 | ||
| 242 | Internally, ``QPDFObjectHandle`` holds onto a shared pointer to the | 248 | Internally, ``QPDFObjectHandle`` holds onto a shared pointer to the |
| 243 | -underlying object value. When a direct object is created, the | ||
| 244 | -``QPDFObjectHandle`` that holds it is not associated with a ``QPDF`` | ||
| 245 | -object. When an indirect object reference is created, it starts off in | ||
| 246 | -an *unresolved* state and must be associated with a ``QPDF`` object, | ||
| 247 | -which is considered its *owner*. To access the actual value of the | ||
| 248 | -object, the object must be *resolved*. This happens automatically when | ||
| 249 | -the the object is accessed in any way. | 249 | +underlying object value. When a direct object is created |
| 250 | +programmatically by client code (rather than being read from the | ||
| 251 | +file), the ``QPDFObjectHandle`` that holds it is not associated with a | ||
| 252 | +``QPDF`` object. When an indirect object reference is created, it | ||
| 253 | +starts off in an *unresolved* state and must be associated with a | ||
| 254 | +``QPDF`` object, which is considered its *owner*. To access the actual | ||
| 255 | +value of the object, the object must be *resolved*. This happens | ||
| 256 | +automatically when the object is accessed in any way. | ||
| 250 | 257 | ||
| 251 | To resolve an object, qpdf checks its object cache. If not found in | 258 | To resolve an object, qpdf checks its object cache. If not found in |
| 252 | the cache, it attempts to read the object from the input source | 259 | the cache, it attempts to read the object from the input source |
| @@ -286,18 +293,20 @@ file. | @@ -286,18 +293,20 @@ file. | ||
| 286 | it is looking before the last ``%%EOF``. After getting to ``trailer`` | 293 | it is looking before the last ``%%EOF``. After getting to ``trailer`` |
| 287 | keyword, it invokes the parser. | 294 | keyword, it invokes the parser. |
| 288 | 295 | ||
| 289 | -- The parser sees ``<<``, so it calls itself recursively in | ||
| 290 | - dictionary creation mode. | 296 | +- The parser sees ``<<``, so it changes state and starts accumulating |
| 297 | + the keys and values of the dictionary. | ||
| 291 | 298 | ||
| 292 | - In dictionary creation mode, the parser keeps accumulating objects | 299 | - In dictionary creation mode, the parser keeps accumulating objects |
| 293 | until it encounters ``>>``. Each object that is read is pushed onto | 300 | until it encounters ``>>``. Each object that is read is pushed onto |
| 294 | a stack. If ``R`` is read, the last two objects on the stack are | 301 | a stack. If ``R`` is read, the last two objects on the stack are |
| 295 | inspected. If they are integers, they are popped off the stack and | 302 | inspected. If they are integers, they are popped off the stack and |
| 296 | - their values are used to construct an indirect object handle which | ||
| 297 | - is then pushed onto the stack. When ``>>`` is finally read, the | ||
| 298 | - stack is converted into a ``QPDF_Dictionary`` (not directly | ||
| 299 | - accessible through the API) which is placed in a | ||
| 300 | - ``QPDFObjectHandle`` and returned. | 303 | + their values are used to obtain an indirect object handle from the |
| 304 | + ``QPDF`` class. The ``QPDF`` class consults its cache, and if | ||
| 305 | + necessary, inserts a new unresolved object, and returns an object | ||
| 306 | + handle pointing to the cache entry, which is then pushed onto the | ||
| 307 | + stack. When ``>>`` is finally read, the stack is converted into a | ||
| 308 | + ``QPDF_Dictionary`` (not directly accessible through the API) which | ||
| 309 | + is placed in a ``QPDFObjectHandle`` and returned. | ||
| 301 | 310 | ||
| 302 | - The resulting dictionary is saved as the trailer dictionary. | 311 | - The resulting dictionary is saved as the trailer dictionary. |
| 303 | 312 | ||
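The stack-based handling of ``R`` and ``>>`` described in the parser steps above can be illustrated with a small standalone sketch. This is not qpdf's parser: ``Ref``, ``Value``, ``Dict``, and ``parseDict`` are hypothetical names, the input is assumed to be whitespace-separated tokens, and only integers, names, and indirect references are modeled. In qpdf the reference would be obtained from the ``QPDF`` object cache; here it is a plain struct.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical miniature object model: integers, names, indirect references.
struct Ref { int obj; int gen; };
using Value = std::variant<long long, std::string, Ref>;
using Dict = std::map<std::string, Value>;

// Parse a flat dictionary such as "<< /Root 1 0 R /Size 12 >>".
Dict parseDict(const std::string& text)
{
    std::istringstream in(text);
    std::string tok;
    if (!(in >> tok) || tok != "<<") {
        throw std::runtime_error("expected <<");
    }
    std::vector<Value> stack;  // every object read is pushed here
    while (in >> tok) {
        assert(!tok.empty());  // operator>> never yields an empty token
        if (tok == ">>") {
            // End of dictionary: the stack holds alternating keys and values.
            if (stack.size() % 2 != 0) {
                throw std::runtime_error("dictionary key without a value");
            }
            Dict dict;
            for (std::size_t i = 0; i < stack.size(); i += 2) {
                const auto* key = std::get_if<std::string>(&stack[i]);
                if (key == nullptr) {
                    throw std::runtime_error("dictionary key is not a name");
                }
                dict[*key] = stack[i + 1];
            }
            return dict;
        }
        if (tok == "R") {
            // Inspect the last two objects; both must be integers.
            const std::size_t n = stack.size();
            const auto* obj = n >= 2 ? std::get_if<long long>(&stack[n - 2]) : nullptr;
            const auto* gen = n >= 2 ? std::get_if<long long>(&stack[n - 1]) : nullptr;
            if (obj == nullptr || gen == nullptr) {
                throw std::runtime_error("R must follow two integers");
            }
            const Ref ref{static_cast<int>(*obj), static_cast<int>(*gen)};
            stack.pop_back();
            stack.pop_back();
            stack.emplace_back(ref);  // replace the two integers with one reference
        } else if (tok.front() == '/') {
            stack.emplace_back(tok);  // a name
        } else {
            stack.emplace_back(std::in_place_type<long long>, std::stoll(tok));
        }
    }
    throw std::runtime_error("unterminated dictionary");
}
```

Calling ``parseDict("<< /Root 1 0 R /Size 12 >>")`` yields a two-entry dictionary in which ``/Root`` is a reference to object 1, generation 0, and ``/Size`` is the integer 12.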
| @@ -309,23 +318,21 @@ file. | @@ -309,23 +318,21 @@ file. | ||
| 309 | - If there is an encryption dictionary, the document's encryption | 318 | - If there is an encryption dictionary, the document's encryption |
| 310 | parameters are initialized. | 319 | parameters are initialized. |
| 311 | 320 | ||
| 312 | -- The client requests root object. The ``QPDF`` class gets the value of | ||
| 313 | - root key from trailer dictionary and returns it. It is an unresolved | ||
| 314 | - indirect ``QPDFObjectHandle``. | 321 | +- The client requests the root object by getting the value of the |
| 322 | + ``/Root`` key from trailer dictionary and returns it. It is an | ||
| 323 | + unresolved indirect ``QPDFObjectHandle``. | ||
| 315 | 324 | ||
| 316 | - The client requests the ``/Pages`` key from root | 325 | - The client requests the ``/Pages`` key from root |
| 317 | - ``QPDFObjectHandle``. The ``QPDFObjectHandle`` notices that it is | ||
| 318 | - indirect so it asks ``QPDF`` to resolve it. ``QPDF`` looks in the | ||
| 319 | - object cache for an object with the root dictionary's object ID and | ||
| 320 | - generation number. Upon not seeing it, it checks the cross reference | ||
| 321 | - table, gets the offset, and reads the object present at that offset. | ||
| 322 | - It stores the result in the object cache. The cache entry's value is | ||
| 323 | - replaced by the actual value, which causes any previously unresolved | ||
| 324 | - ``QPDFObjectHandle`` objects that that pointed there to now have a | ||
| 325 | - shared copy of the actual object. Modifications through any such | ||
| 326 | - ``QPDFObjectHandle`` will be reflected in all of them. As the client | ||
| 327 | - continues to request objects, the same process is followed for each | ||
| 328 | - new requested object. | 326 | + ``QPDFObjectHandle``. The ``QPDFObjectHandle`` notices that it is an |
| 327 | + unresolved indirect object, so it asks ``QPDF`` to resolve it. | ||
| 328 | + ``QPDF`` checks the cross reference table, gets the offset, and | ||
| 329 | + reads the object present at that offset. The object cache entry's | ||
| 330 | + ``unresolved`` value is replaced by the actual value, which causes | ||
| 331 | + any previously unresolved ``QPDFObjectHandle`` objects that pointed | ||
| 332 | + there to now have a shared copy of the actual object. Modifications | ||
| 333 | + through any such ``QPDFObjectHandle`` will be reflected in all of | ||
| 334 | + them. As the client continues to request objects, the same process | ||
| 335 | + is followed for each new requested object. | ||
| 329 | 336 | ||
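The resolution step described above, where an unresolved cache entry is filled in on first access and every handle sharing that entry sees the result, can be modeled with a short sketch. The classes ``Entry``, ``Document``, and ``Handle`` are hypothetical stand-ins, not qpdf's types, and the ``Loader`` callback stands in for consulting the cross reference table and reading the object at its offset.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical cache entry: starts out unresolved, filled in on first access.
struct Entry {
    bool resolved = false;
    std::string value;
};

class Document {
public:
    // Stand-in for "look up the offset and read the object from the file".
    using Loader = std::function<std::string(int objid, int gen)>;

    explicit Document(Loader loader) : loader_(std::move(loader)) {}

    // Return the shared entry for (objid, gen), inserting an unresolved
    // placeholder into the cache if none exists yet.
    std::shared_ptr<Entry> getEntry(int objid, int gen)
    {
        auto& slot = cache_[{objid, gen}];
        if (!slot) {
            slot = std::make_shared<Entry>();
        }
        return slot;
    }

    // Replace the placeholder's contents in place, so every handle that
    // already points at the entry sees the resolved value.
    void resolve(int objid, int gen)
    {
        auto entry = getEntry(objid, gen);
        if (!entry->resolved) {
            entry->value = loader_(objid, gen);
            entry->resolved = true;
            ++reads;
        }
    }

    int reads = 0;  // number of times the loader was actually invoked

private:
    Loader loader_;
    std::map<std::pair<int, int>, std::shared_ptr<Entry>> cache_;
};

class Handle {
public:
    Handle(Document& doc, int objid, int gen)
        : doc_(&doc), objid_(objid), gen_(gen), entry_(doc.getEntry(objid, gen))
    {
        assert(entry_ != nullptr);
    }

    bool isResolved() const { return entry_->resolved; }

    // Any access resolves the object automatically.
    const std::string& get()
    {
        if (!entry_->resolved) {
            doc_->resolve(objid_, gen_);
        }
        return entry_->value;
    }

    void set(std::string v)
    {
        get();  // make sure the entry is resolved before modifying it
        entry_->value = std::move(v);
    }

private:
    Document* doc_;
    int objid_;
    int gen_;
    std::shared_ptr<Entry> entry_;
};
```

In this model, two ``Handle`` objects created for the same object number both start unresolved. Reading through one resolves both, and the loader runs only once. A later ``set`` through either handle is visible through the other.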
| 330 | .. _object_internals: | 337 | .. _object_internals: |
| 331 | 338 | ||
| @@ -339,11 +346,12 @@ Object Internals | @@ -339,11 +346,12 @@ Object Internals | ||
| 339 | ~~~~~~~~~~~~~~~~ | 346 | ~~~~~~~~~~~~~~~~ |
| 340 | 347 | ||
| 341 | The ``QPDF`` object has an object cache which contains a shared | 348 | The ``QPDF`` object has an object cache which contains a shared |
| 342 | -pointer to each object that was read from the file. Changes can be | ||
| 343 | -made to any of those objects through ``QPDFObjectHandle`` methods. Any | ||
| 344 | -such changes are visible to all ``QPDFObjectHandle`` instances that | ||
| 345 | -point to the same object. When a ``QPDF`` object is written by | ||
| 346 | -``QPDFWriter`` or serialized to JSON, any changes are reflected. | 349 | +pointer to each object that was read from the file or added as an |
| 350 | +indirect object. Changes can be made to any of those objects through | ||
| 351 | +``QPDFObjectHandle`` methods. Any such changes are visible to all | ||
| 352 | +``QPDFObjectHandle`` instances that point to the same object. When a | ||
| 353 | +``QPDF`` object is written by ``QPDFWriter`` or serialized to JSON, | ||
| 354 | +any changes are reflected. | ||
| 347 | 355 | ||
| 348 | Objects in qpdf 11 and Newer | 356 | Objects in qpdf 11 and Newer |
| 349 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 357 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| @@ -356,30 +364,32 @@ reference to that object has a copy of that shared pointer. Each | @@ -356,30 +364,32 @@ reference to that object has a copy of that shared pointer. Each | ||
| 356 | is an implementation for each of the basic object types (array, | 364 | is an implementation for each of the basic object types (array, |
| 357 | dictionary, null, boolean, string, number, etc.) as well as a few | 365 | dictionary, null, boolean, string, number, etc.) as well as a few |
| 358 | special ones including ``uninitialized``, ``unresolved``, | 366 | special ones including ``uninitialized``, ``unresolved``, |
| 359 | -``reserved``, and ``destroyed``. When an object is first referenced, | 367 | +``reserved``, and ``destroyed``. When an object is first created, |
| 360 | its underlying ``QPDFValue`` has type ``unresolved``. When the object | 368 | its underlying ``QPDFValue`` has type ``unresolved``. When the object |
| 361 | -is first resolved, the ``QPDFObject`` in the cache has its internal | 369 | +is first accessed, the ``QPDFObject`` in the cache has its internal |
| 362 | ``QPDFValue`` replaced with the object as read from the file. Since it | 370 | ``QPDFValue`` replaced with the object as read from the file. Since it |
| 363 | is the ``QPDFObject`` object that is shared by all referencing | 371 | is the ``QPDFObject`` object that is shared by all referencing |
| 364 | ``QPDFObjectHandle`` objects as well as by the owning ``QPDF`` object, | 372 | ``QPDFObjectHandle`` objects as well as by the owning ``QPDF`` object, |
| 365 | this ensures that any future changes to the object, including | 373 | this ensures that any future changes to the object, including |
| 366 | -replacing the object with a completely different one, will be | 374 | +replacing the object with a completely different one by calling |
| 375 | +``QPDF::replaceObject`` or ``QPDF::swapObjects``, will be | ||
| 367 | reflected across all ``QPDFObjectHandle`` objects that reference it. | 376 | reflected across all ``QPDFObjectHandle`` objects that reference it. |
| 368 | 377 | ||
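The reason replacement propagates to every handle is the extra level of indirection: handles share the outer object, and only the inner value is exchanged. The following sketch models that arrangement under simplifying assumptions. ``Value``, ``Object``, ``Handle``, ``replaceValue``, and ``swapValues`` are hypothetical names loosely mirroring ``QPDFValue``, ``QPDFObject``, ``QPDFObjectHandle``, ``QPDF::replaceObject``, and ``QPDF::swapObjects``; they are not the library's actual definitions.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical inner value: a type tag plus some data.
struct Value {
    std::string type;
    std::string data;
};

// Hypothetical outer object: shared by all handles, owns the current value.
struct Object {
    std::unique_ptr<Value> value =
        std::make_unique<Value>(Value{"unresolved", ""});
};

class Handle {
public:
    explicit Handle(std::shared_ptr<Object> obj) : obj_(std::move(obj))
    {
        assert(obj_ != nullptr);
    }
    const std::string& type() const { return obj_->value->type; }
    const std::string& data() const { return obj_->value->data; }

private:
    std::shared_ptr<Object> obj_;  // handles copy this pointer, never the value
};

// Replace the value held by an object; existing handles are untouched but
// observe the new value because they reach it through the shared Object.
void replaceValue(Object& obj, Value v)
{
    obj.value = std::make_unique<Value>(std::move(v));
}

// Exchange the values of two objects without touching any handle.
void swapValues(Object& a, Object& b)
{
    std::swap(a.value, b.value);
}
```

With two handles on one object and a third on another, calling ``replaceValue`` on the first object changes what both of its handles report, and ``swapValues`` exchanges what the two groups of handles see.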
| 369 | A ``QPDFValue`` that originated from a PDF input source maintains a | 378 | A ``QPDFValue`` that originated from a PDF input source maintains a |
| 370 | pointer to the ``QPDF`` object that read it (its *owner*). When that | 379 | pointer to the ``QPDF`` object that read it (its *owner*). When that |
| 371 | -``QPDF`` object is destroyed, it disconnects all reachable from it by | ||
| 372 | -clearing their owner. For indirect objects (all objects in the object | ||
| 373 | -cache), it also replaces the object's value with an object of type | ||
| 374 | -``destroyed``. This means that, if there are still any referencing | ||
| 375 | -``QPDFObjectHandle`` objects floating around, requesting their owning | ||
| 376 | -``QPDF`` will return a null pointer rather than a pointer to a | ||
| 377 | -``QPDF`` object that is either invalid or points to something else, | ||
| 378 | -and any attempt to access an indirect object that is associated with a | ||
| 379 | -destroyed ``QPDF`` object will throw an exception. This operation also | ||
| 380 | -has the effect of breaking any circular references (which are common | ||
| 381 | -and, in some cases, required by the PDF specification), thus | ||
| 382 | -preventing memory leaks when ``QPDF`` objects are destroyed. | 380 | +``QPDF`` object is destroyed, it disconnects all objects reachable |
| 381 | +from it by clearing their owner. For indirect objects (all objects in | ||
| 382 | +the object cache), it also replaces the object's value with an object | ||
| 383 | +of type ``destroyed``. This means that, if there are still any | ||
| 384 | +referencing ``QPDFObjectHandle`` objects floating around, requesting | ||
| 385 | +their owning ``QPDF`` will return a null pointer rather than a pointer | ||
| 386 | +to a ``QPDF`` object that is either invalid or points to something | ||
| 387 | +else, and any attempt to access an indirect object that is associated | ||
| 388 | +with a destroyed ``QPDF`` object will throw an exception. This | ||
| 389 | +operation also has the effect of breaking any circular references | ||
| 390 | +(which are common and, in some cases, required by the PDF | ||
| 391 | +specification), thus preventing memory leaks when ``QPDF`` objects are | ||
| 392 | +destroyed. | ||
| 383 | 393 | ||
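The disconnect-on-destruction behavior can be sketched in a few lines. This is a simplified model, not qpdf's code: ``Node``, ``Doc``, and ``access`` are hypothetical, and a boolean flag stands in for replacing the value with one of type ``destroyed``. The point it illustrates is that clearing each cached object's contents in the owner's destructor both invalidates surviving references and breaks reference cycles between objects.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class Doc;

// Hypothetical indirect object: knows its owner and may refer to other objects.
struct Node {
    Doc* owner = nullptr;
    bool destroyed = false;
    std::string data;
    std::vector<std::shared_ptr<Node>> refs;  // may form cycles
};

class Doc {
public:
    std::shared_ptr<Node> add(int id, std::string data)
    {
        assert(cache_.count(id) == 0);  // ids are assumed unique in this sketch
        auto node = std::make_shared<Node>();
        node->owner = this;
        node->data = std::move(data);
        cache_[id] = node;
        return node;
    }

    ~Doc()
    {
        for (auto& item : cache_) {
            Node& node = *item.second;
            node.owner = nullptr;   // asking for the owner now yields null
            node.destroyed = true;  // stand-in for a value of type "destroyed"
            node.data.clear();
            node.refs.clear();      // drop outgoing references, breaking cycles
        }
    }

private:
    std::map<int, std::shared_ptr<Node>> cache_;
};

// Accessing an object whose owner is gone is an error rather than
// undefined behavior.
const std::string& access(const Node& node)
{
    if (node.destroyed) {
        throw std::logic_error("object belongs to a destroyed document");
    }
    return node.data;
}
```

If two nodes refer to each other and a reference to one of them outlives its ``Doc``, the surviving node reports a null owner and throws on access, while the other node is freed despite the cycle.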
| 384 | Objects prior to qpdf 11 | 394 | Objects prior to qpdf 11 |
| 385 | ~~~~~~~~~~~~~~~~~~~~~~~~ | 395 | ~~~~~~~~~~~~~~~~~~~~~~~~ |
| @@ -478,22 +488,6 @@ and 64-bit platforms, and the test suite is very thorough, so it is | @@ -478,22 +488,6 @@ and 64-bit platforms, and the test suite is very thorough, so it is | ||
| 478 | hard to make any of the potential errors here without being caught in | 488 | hard to make any of the potential errors here without being caught in |
| 479 | build or test. | 489 | build or test. |
| 480 | 490 | ||
| 481 | -Non-const ``unsigned char*`` is used in the ``Pipeline`` interface. The | ||
| 482 | -pipeline interface has a ``write`` call that uses ``unsigned char*`` | ||
| 483 | -without a ``const`` qualifier. The main reason for this is | ||
| 484 | -to support pipelines that make calls to third-party libraries, such as | ||
| 485 | -zlib, that don't include ``const`` in their interfaces. Unfortunately, | ||
| 486 | -there are many places in the code where it is desirable to have | ||
| 487 | -``const char*`` with pipelines. None of the pipeline implementations | ||
| 488 | -in qpdf | ||
| 489 | -currently modify the data passed to write, and doing so would be counter | ||
| 490 | -to the intent of ``Pipeline``, but there is nothing in the code to | ||
| 491 | -prevent this from being done. There are places in the code where | ||
| 492 | -``const_cast`` is used to remove the const-ness of pointers going into | ||
| 493 | -``Pipeline``\ s. This could theoretically be unsafe, but there is | ||
| 494 | -adequate testing to assert that it is safe and will remain safe in | ||
| 495 | -qpdf's code. | ||
| 496 | - | ||
| 497 | .. _encryption: | 491 | .. _encryption: |
| 498 | 492 | ||
| 499 | Encryption | 493 | Encryption |
| @@ -516,14 +510,14 @@ given an encryption key. This is used by ``QPDFWriter`` when it rewrites | @@ -516,14 +510,14 @@ given an encryption key. This is used by ``QPDFWriter`` when it rewrites | ||
| 516 | encrypted files. | 510 | encrypted files. |
| 517 | 511 | ||
| 518 | When copying encrypted files, unless otherwise directed, qpdf will | 512 | When copying encrypted files, unless otherwise directed, qpdf will |
| 519 | -preserve any encryption in force in the original file. qpdf can do this | ||
| 520 | -with either the user or the owner password. There is no difference in | ||
| 521 | -capability based on which password is used. When 40 or 128 bit | ||
| 522 | -encryption keys are used, the user password can be recovered with the | ||
| 523 | -owner password. With 256 keys, the user and owner passwords are used | ||
| 524 | -independently to encrypt the actual encryption key, so while either can | ||
| 525 | -be used, the owner password can no longer be used to recover the user | ||
| 526 | -password. | 513 | +preserve any encryption in effect in the original file. qpdf can do |
| 514 | +this with either the user or the owner password. There is no | ||
| 515 | +difference in capability based on which password is used. When 40 or | ||
| 516 | +128 bit encryption keys are used, the user password can be recovered | ||
| 517 | +with the owner password. With 256-bit keys, the user and owner passwords | ||
| 518 | +are used independently to encrypt the actual encryption key, so while | ||
| 519 | +either can be used, the owner password can no longer be used to | ||
| 520 | +recover the user password. | ||
| 527 | 521 | ||
| 528 | Starting with version 4.0.0, qpdf can read files that are not encrypted | 522 | Starting with version 4.0.0, qpdf can read files that are not encrypted |
| 529 | but that contain encrypted attachments, but it cannot write such files. | 523 | but that contain encrypted attachments, but it cannot write such files. |
| @@ -538,33 +532,37 @@ format. The only exception to this is that clear-text metadata will be | @@ -538,33 +532,37 @@ format. The only exception to this is that clear-text metadata will be | ||
| 538 | preserved as clear-text if it is that way in the original file. | 532 | preserved as clear-text if it is that way in the original file. |
| 539 | 533 | ||
| 540 | One point of confusion some people have about encrypted PDF files is | 534 | One point of confusion some people have about encrypted PDF files is |
| 541 | -that encryption is not the same as password protection. Password | ||
| 542 | -protected files are always encrypted, but it is also possible to create | ||
| 543 | -encrypted files that do not have passwords. Internally, such files use | ||
| 544 | -the empty string as a password, and most readers try the empty string | ||
| 545 | -first to see if it works and prompt for a password only if the empty | ||
| 546 | -string doesn't work. Normally such files have an empty user password and | ||
| 547 | -a non-empty owner password. In that way, if the file is opened by an | ||
| 548 | -ordinary reader without specification of password, the restrictions | ||
| 549 | -specified in the encryption dictionary can be enforced. Most users | ||
| 550 | -wouldn't even realize such a file was encrypted. Since qpdf always | ||
| 551 | -ignores the restrictions (except for the purpose of reporting what they | ||
| 552 | -are), qpdf doesn't care which password you use. QPDF will allow you to | ||
| 553 | -create PDF files with non-empty user passwords and empty owner | ||
| 554 | -passwords. Some readers will require a password when you open these | ||
| 555 | -files, and others will open the files without a password and not enforce | ||
| 556 | -restrictions. Having a non-empty user password and an empty owner | ||
| 557 | -password doesn't really make sense because it would mean that opening | ||
| 558 | -the file with the user password would be more restrictive than not | ||
| 559 | -supplying a password at all. QPDF also allows you to create PDF files | ||
| 560 | -with the same password as both the user and owner password. Some readers | ||
| 561 | -will not ever allow such files to be accessed without restrictions | ||
| 562 | -because they never try the password as the owner password if it works as | ||
| 563 | -the user password. Nonetheless, one of the powerful aspects of qpdf is | ||
| 564 | -that it allows you to finely specify the way encrypted files are | ||
| 565 | -created, even if the results are not useful to some readers. One use | ||
| 566 | -case for this would be for testing a PDF reader to ensure that it | ||
| 567 | -handles odd configurations of input files. | 535 | +that encryption is not the same as password protection. |
| 536 | +Password-protected files are always encrypted, but it is also possible | ||
| 537 | +to create encrypted files that do not have passwords. Internally, such | ||
| 538 | +files use the empty string as a password, and most readers try the | ||
| 539 | +empty string first to see if it works and prompt for a password only | ||
| 540 | +if the empty string doesn't work. Normally such files have an empty | ||
| 541 | +user password and a non-empty owner password. In that way, if the file | ||
| 542 | +is opened by an ordinary reader without specification of password, the | ||
| 543 | +restrictions specified in the encryption dictionary can be enforced. | ||
| 544 | +Most users wouldn't even realize such a file was encrypted. Since qpdf | ||
| 545 | +always ignores the restrictions (except for the purpose of reporting | ||
| 546 | +what they are), qpdf doesn't care which password you use. QPDF will | ||
| 547 | +allow you to create PDF files with non-empty user passwords and empty | ||
| 548 | +owner passwords. Some readers will require a password when you open | ||
| 549 | +these files, and others will open the files without a password and not | ||
| 550 | +enforce restrictions. Having a non-empty user password and an empty | ||
| 551 | +owner password doesn't really make sense because it would mean that | ||
| 552 | +opening the file with the user password would be more restrictive than | ||
| 553 | +not supplying a password at all. QPDF also allows you to create PDF | ||
| 554 | +files with the same password as both the user and owner password. Some | ||
| 555 | +readers will not ever allow such files to be accessed without | ||
| 556 | +restrictions because they never try the password as the owner password | ||
| 557 | +if it works as the user password. Nonetheless, one of the powerful | ||
| 558 | +aspects of qpdf is that it allows you to finely specify the way | ||
| 559 | +encrypted files are created, even if the results are not useful to | ||
| 560 | +some readers. One use case for this would be for testing a PDF reader | ||
| 561 | +to ensure that it handles odd configurations of input files. If you | ||
| 562 | +attempt to create an encrypted file that is not secure, qpdf will warn | ||
| 563 | +you and require you to explicitly state your intention to create an | ||
| 564 | +insecure file. So while qpdf can create insecure files, it won't let | ||
| 565 | +you do it by mistake. | ||
| 568 | 566 | ||
| 569 | .. _random-numbers: | 567 | .. _random-numbers: |
| 570 | 568 | ||
| @@ -630,23 +628,21 @@ Copying Objects From Other PDF Files | @@ -630,23 +628,21 @@ Copying Objects From Other PDF Files | ||
| 630 | 628 | ||
| 631 | Version 3.0 of qpdf introduced the ability to copy objects into a | 629 | Version 3.0 of qpdf introduced the ability to copy objects into a |
| 632 | ``QPDF`` object from a different ``QPDF`` object, which we refer to as | 630 | ``QPDF`` object from a different ``QPDF`` object, which we refer to as |
| 633 | -*foreign objects*. This allows arbitrary | ||
| 634 | -merging of PDF files. The "from" ``QPDF`` object must remain valid after | ||
| 635 | -the copy as discussed in the note below. The | ||
| 636 | -:command:`qpdf` command-line tool provides limited | ||
| 637 | -support for basic page selection, including merging in pages from other | ||
| 638 | -files, but the library's API makes it possible to implement arbitrarily | ||
| 639 | -complex merging operations. The main method for copying foreign objects | ||
| 640 | -is ``QPDF::copyForeignObject``. This takes an indirect object from | 631 | +*foreign objects*. This allows arbitrary merging of PDF files. The |
| 632 | +:command:`qpdf` command-line tool provides limited support for basic | ||
| 633 | +page selection, including merging in pages from other files, but the | ||
| 634 | +library's API makes it possible to implement arbitrarily complex | ||
| 635 | +merging operations. The main method for copying foreign objects is | ||
| 636 | +``QPDF::copyForeignObject``. This takes an indirect object from | ||
| 641 | another ``QPDF`` and copies it recursively into this object while | 637 | another ``QPDF`` and copies it recursively into this object while |
| 642 | preserving all object structure, including circular references. This | 638 | preserving all object structure, including circular references. This |
| 643 | means you can add a direct object that you create from scratch to a | 639 | means you can add a direct object that you create from scratch to a |
| 644 | ``QPDF`` object with ``QPDF::makeIndirectObject``, and you can add an | 640 | ``QPDF`` object with ``QPDF::makeIndirectObject``, and you can add an |
| 645 | -indirect object from another file with ``QPDF::copyForeignObject``. The | ||
| 646 | -fact that ``QPDF::makeIndirectObject`` does not automatically detect a | ||
| 647 | -foreign object and copy it is an explicit design decision. Copying a | ||
| 648 | -foreign object seems like a sufficiently significant thing to do that it | ||
| 649 | -should be done explicitly. | 641 | +indirect object from another file with ``QPDF::copyForeignObject``. |
| 642 | +The fact that ``QPDF::makeIndirectObject`` does not automatically | ||
| 643 | +detect a foreign object and copy it is an explicit design decision. | ||
| 644 | +Copying a foreign object seems like a sufficiently significant thing | ||
| 645 | +to do that it should be done explicitly. | ||
| 650 | 646 | ||
| 651 | The other way to copy foreign objects is by passing a page from one | 647 | The other way to copy foreign objects is by passing a page from one |
| 652 | ``QPDF`` to another by calling ``QPDF::addPage``. In contrast to | 648 | ``QPDF`` to another by calling ``QPDF::addPage``. In contrast to |
| @@ -654,26 +650,30 @@ The other way to copy foreign objects is by passing a page from one | @@ -654,26 +650,30 @@ The other way to copy foreign objects is by passing a page from one | ||
| 654 | between indirect objects in the current file, foreign objects, and | 650 | between indirect objects in the current file, foreign objects, and |
| 655 | direct objects. | 651 | direct objects. |
| 656 | 652 | ||
| 657 | -Please note: when you copy objects from one ``QPDF`` to another, the | ||
| 658 | -source ``QPDF`` object must remain valid until you have finished with | ||
| 659 | -the destination object. This is because the original object is still | ||
| 660 | -used to retrieve any referenced stream data from the copied object. | 653 | +When you copy objects from one ``QPDF`` to another, the input source |
| 654 | +of the original file must remain valid until you have finished with the | ||
| 655 | +destination object. This is because the input source is still used | ||
| 656 | +to retrieve any referenced stream data from the copied object. If | ||
| 657 | +needed, there are methods to force the data to be copied. See comments | ||
| 658 | +near the declaration of ``copyForeignObject`` in | ||
| 659 | +:file:`include/qpdf/QPDF.hh` for details. | ||
| 661 | 660 | ||
| 662 | .. _rewriting: | 661 | .. _rewriting: |
| 663 | 662 | ||
| 664 | Writing PDF Files | 663 | Writing PDF Files |
| 665 | ----------------- | 664 | ----------------- |
| 666 | 665 | ||
| 667 | -The qpdf library supports file writing of ``QPDF`` objects to PDF files | ||
| 668 | -through the ``QPDFWriter`` class. The ``QPDFWriter`` class has two | ||
| 669 | -writing modes: one for non-linearized files, and one for linearized | ||
| 670 | -files. See :ref:`linearization` for a description of | 666 | +The qpdf library supports file writing of ``QPDF`` objects to PDF |
| 667 | +files through the ``QPDFWriter`` class. The ``QPDFWriter`` class has | ||
| 668 | +two writing modes: one for non-linearized files, and one for | ||
| 669 | +linearized files. See :ref:`linearization` for a description of how | ||
| 671 | linearization is implemented. This section describes how we write | 670 | linearization is implemented. This section describes how we write |
| 672 | -non-linearized files including the creation of QDF files (see :ref:`qdf`. | 671 | +non-linearized files including the creation of QDF files (see |
| 672 | +:ref:`qdf`). | ||
| 673 | 673 | ||
| 674 | This outline was written prior to implementation and is not exactly | 674 | This outline was written prior to implementation and is not exactly |
| 675 | -accurate, but it provides a correct "notional" idea of how writing | ||
| 676 | -works. Look at the code in ``QPDFWriter`` for exact details. | 675 | +accurate, but it portrays the essence of how writing works. Look at |
| 676 | +the code in ``QPDFWriter`` for exact details. | ||
| 677 | 677 | ||
| 678 | - Initialize state: | 678 | - Initialize state: |
| 679 | 679 | ||
| @@ -685,7 +685,7 @@ works. Look at the code in ``QPDFWriter`` for exact details. | @@ -685,7 +685,7 @@ works. Look at the code in ``QPDFWriter`` for exact details. | ||
| 685 | 685 | ||
| 686 | - xref table: new id -> offset = empty | 686 | - xref table: new id -> offset = empty |
| 687 | 687 | ||
| 688 | -- Create a QPDF object from a file. | 688 | +- Create a ``QPDF`` object from a file. |
| 689 | 689 | ||
| 690 | - Write header for new PDF file. | 690 | - Write header for new PDF file. |
| 691 | 691 | ||
| @@ -750,7 +750,7 @@ Filtered Streams | @@ -750,7 +750,7 @@ Filtered Streams | ||
| 750 | ---------------- | 750 | ---------------- |
| 751 | 751 | ||
| 752 | Support for streams is implemented through the ``Pipeline`` interface | 752 | Support for streams is implemented through the ``Pipeline`` interface |
| 753 | -which was designed for this package. | 753 | +which was designed for this library. |
| 754 | 754 | ||
| 755 | When reading streams, create a series of ``Pipeline`` objects. The | 755 | When reading streams, create a series of ``Pipeline`` objects. The |
| 756 | ``Pipeline`` abstract base requires implementation of ``write()`` and | 756 | ``Pipeline`` abstract base requires implementation of ``write()`` and | ||
| @@ -802,32 +802,20 @@ file might be, the presence of type warnings can save lots of developer | @@ -802,32 +802,20 @@ file might be, the presence of type warnings can save lots of developer | ||
| 802 | time. They have also proven useful in exposing issues in qpdf itself | 802 | time. They have also proven useful in exposing issues in qpdf itself |
| 803 | that would have otherwise gone undetected. | 803 | that would have otherwise gone undetected. |
| 804 | 804 | ||
| 805 | -*Can there be a type-safe ``QPDFObjectHandle``?* It would be great if | ||
| 806 | -``QPDFObjectHandle`` could be more strongly typed so that you'd have to | ||
| 807 | -have check that something was of a particular type before calling | ||
| 808 | -type-specific accessor methods. However, implementing this at this stage | ||
| 809 | -of the library's history would be quite difficult, and it would make a | ||
| 810 | -the common pattern of drilling into an object no longer work. While it | ||
| 811 | -would be possible to have a parallel interface, it would create a lot of | ||
| 812 | -extra code. If qpdf were written in a language like rust, an interface | ||
| 813 | -like this would make a lot of sense, but, for a variety of reasons, the | ||
| 814 | -qpdf API is consistent with other APIs of its time, relying on exception | ||
| 815 | -handling to catch errors. The underlying PDF objects are inherently not | ||
| 816 | -type-safe. Forcing stronger type safety in ``QPDFObjectHandle`` would | ||
| 817 | -ultimately cause a lot more code to have to be written and would like | ||
| 818 | -make software that uses qpdf more brittle, and even so, checks would | ||
| 819 | -have to occur at runtime. | ||
| 820 | - | ||
| 821 | -*Why do type errors sometimes raise exceptions?* The way warnings work | ||
| 822 | -in qpdf requires a ``QPDF`` object to be associated with an object | ||
| 823 | -handle for a warning to be issued. It would be nice if this could be | ||
| 824 | -fixed, but it would require major changes to the API. Rather than | ||
| 825 | -throwing away these conditions, we convert them to exceptions. It's not | ||
| 826 | -that bad though. Since any object handle that was read from a file has | ||
| 827 | -an associated ``QPDF`` object, it would only be type errors on objects | ||
| 828 | -that were created explicitly that would cause exceptions, and in that | ||
| 829 | -case, type errors are much more likely to be the result of a coding | ||
| 830 | -error than invalid input. | 805 | +*Can there be a type-safe* ``QPDFObjectHandle``? At the time of the |
| 806 | +release of qpdf 11, there is active work being done toward the goal of | ||
| 807 | +creating a way to work with PDF objects that is more type-safe and | ||
| 808 | +closer in feel to the current C++ standard library. It is hoped that | ||
| 809 | +this work will make it easier to write bindings to qpdf in modern | ||
| 810 | +languages like `Rust <https://www.rust-lang.org/>`__. If this happens, | ||
| 811 | +it will likely be by providing an alternative to ``QPDFObjectHandle`` | ||
| 812 | +that provides a separate path to the underlying object. Details are | ||
| 813 | +still being worked out. Fundamentally, PDF objects are not strongly | ||
| 814 | +typed. They are similar to ``JSON`` objects or to objects in dynamic | ||
| 815 | +languages like `Python <https://python.org/>`__: there are certain | ||
| 816 | +things you can only do to objects of a given type, but you can replace | ||
| 817 | +an object of one type with an object of another. Because of this, | ||
| 818 | +there will always be some checks that will happen at runtime. | ||
| 831 | 819 | ||
| 832 | *Why does the behavior of a type exception differ between the C and C++ | 820 | *Why does the behavior of a type exception differ between the C and C++ |
| 833 | API?* There is no way to throw and catch exceptions in C short of | 821 | API?* There is no way to throw and catch exceptions in C short of |