Commit ed04b80caf7400622aa9d12797e221271c4d2016
1 parent 55cc2ab6
Update internals documentation to reflect QPDFObject split
Showing 1 changed file with 154 additions and 58 deletions
manual/design.rst
| @@ -67,17 +67,20 @@ files. | @@ -67,17 +67,20 @@ files. | ||
| 67 | The primary class for interacting with PDF objects is | 67 | The primary class for interacting with PDF objects is |
| 68 | ``QPDFObjectHandle``. Instances of this class can be passed around by | 68 | ``QPDFObjectHandle``. Instances of this class can be passed around by |
| 69 | value, copied, stored in containers, etc. with very low overhead. The | 69 | value, copied, stored in containers, etc. with very low overhead. The |
| 70 | -``QPDFObjectHandle`` object contains an internal shared pointer to an | ||
| 71 | -underlying ``QPDFObject``. Instances of ``QPDFObjectHandle`` created | ||
| 72 | -by reading from a file will always contain a reference back to the | 70 | +``QPDFObjectHandle`` object contains an internal shared pointer to the |
| 71 | +underlying object. Instances of ``QPDFObjectHandle`` created by | ||
| 72 | +reading from a file will always contain a reference back to the | ||
| 73 | ``QPDF`` object from which they were created. A ``QPDFObjectHandle`` | 73 | ``QPDF`` object from which they were created. A ``QPDFObjectHandle`` |
| 74 | -may be direct or indirect. If indirect, the ``QPDFObject`` shared | ||
| 75 | -pointer is initially null. In this case, the first attempt to access | ||
| 76 | -the underlying ``QPDFObject`` will result in the ``QPDFObject`` being | ||
| 77 | -resolved via a call to the referenced ``QPDF`` instance. This makes it | ||
| 78 | -essentially impossible to make coding errors in which certain things | ||
| 79 | -will work for some PDF files and not for others based on which objects | ||
| 80 | -are direct and which objects are indirect. | 74 | +may be direct or indirect. If indirect, the object is initially |
| 75 | +*unresolved*. In this case, the first attempt to access the underlying | ||
| 76 | +object will result in the object being resolved via a call to the | ||
| 77 | +referenced ``QPDF`` instance. This makes it essentially impossible to | ||
| 78 | +make coding errors in which certain things will work for some PDF | ||
| 79 | +files and not for others based on which objects are direct and which | ||
| 80 | +objects are indirect. In cases where it is necessary to know whether | ||
| 81 | +an object is indirect or not, this information can be obtained from | ||
| 82 | +the ``QPDFObjectHandle``. It is also possible to convert direct | ||
| 83 | +objects to indirect objects and vice versa. | ||
| 81 | 84 | ||
| 82 | Instances of ``QPDFObjectHandle`` can be directly created and modified | 85 | Instances of ``QPDFObjectHandle`` can be directly created and modified |
| 83 | using static factory methods in the ``QPDFObjectHandle`` class. There | 86 | using static factory methods in the ``QPDFObjectHandle`` class. There |
| @@ -230,43 +233,46 @@ could serve as a starting point to someone trying to understand the | @@ -230,43 +233,46 @@ could serve as a starting point to someone trying to understand the | ||
| 230 | implementation. There is nothing in this section that you need to know | 233 | implementation. There is nothing in this section that you need to know |
| 231 | to use the qpdf library. | 234 | to use the qpdf library. |
| 232 | 235 | ||
| 233 | -``QPDFObject`` is the basic PDF Object class. It is an abstract base | ||
| 234 | -class from which are derived classes for each type of PDF object. | ||
| 235 | -Clients do not interact with Objects directly but instead interact with | ||
| 236 | -``QPDFObjectHandle``. | 236 | +In a PDF file, objects may be direct or indirect. Direct objects are |
| 237 | +objects whose representations appear directly in PDF syntax. Indirect | ||
| 238 | +objects are references to objects by their ID. The qpdf library uses | ||
| 239 | +the ``QPDFObjectHandle`` type to hold onto objects and to abstract | ||
| 240 | +away in most cases whether the object is direct or indirect. | ||
| 241 | + | ||
| 242 | +Internally, ``QPDFObjectHandle`` holds onto a shared pointer to the | ||
| 243 | +underlying object value. When a direct object is created, the | ||
| 244 | +``QPDFObjectHandle`` that holds it is not associated with a ``QPDF`` | ||
| 245 | +object. When an indirect object reference is created, it starts off in | ||
| 246 | +an *unresolved* state and must be associated with a ``QPDF`` object, | ||
| 247 | +which is considered its *owner*. To access the actual value of the | ||
| 248 | +object, the object must be *resolved*. This happens automatically when | ||
| 249 | +the object is accessed in any way. | ||
| 250 | + | ||
| 251 | +To resolve an object, qpdf checks its object cache. If not found in | ||
| 252 | +the cache, it attempts to read the object from the input source | ||
| 253 | +associated with the ``QPDF`` object. If it is not found, a ``null`` | ||
| 254 | +object is returned. A ``null`` object is an object type, just like | ||
| 255 | +boolean, string, number, etc. It is not a null pointer. The PDF | ||
| 256 | +specification states that an indirect reference to an object that | ||
| 257 | +doesn't exist is to be treated as a ``null``. The resulting object, | ||
| 258 | +whether a ``null`` or the actual object that was read, is stored in | ||
| 259 | +the cache. If the object is later replaced or swapped, the underlying | ||
| 260 | +object remains the same, but its value is replaced. This way, if you | ||
| 261 | +have a ``QPDFObjectHandle`` to an indirect object and the object by | ||
| 262 | +that number is replaced (by calling ``QPDF::replaceObject`` or | ||
| 263 | +``QPDF::swapObjects``), your ``QPDFObjectHandle`` will reflect the new | ||
| 264 | +value of the object. This is consistent with what would happen to PDF | ||
| 265 | +objects if you were to replace the definition of an object in the | ||
| 266 | +file. | ||
| 237 | 267 | ||
| 238 | -When the ``QPDF`` class creates a new object, it dynamically allocates | ||
| 239 | -the appropriate type of ``QPDFObject`` and immediately hands the pointer | ||
| 240 | -to an instance of ``QPDFObjectHandle``. The parser reads a token from | ||
| 241 | -the current file position. If the token is a not either a dictionary or | ||
| 242 | -array opener, an object is immediately constructed from the single token | ||
| 243 | -and the parser returns. Otherwise, the parser iterates in a special mode | ||
| 244 | -in which it accumulates objects until it finds a balancing closer. | ||
| 245 | -During this process, the ``R`` keyword is recognized and an indirect | ||
| 246 | -``QPDFObjectHandle`` may be constructed. | ||
| 247 | - | ||
| 248 | -The ``QPDF::resolve()`` method, which is used to resolve an indirect | ||
| 249 | -object, may be invoked from the ``QPDFObjectHandle`` class. It first | ||
| 250 | -checks a cache to see whether this object has already been read. If | ||
| 251 | -not, it reads the object from the PDF file and caches it. It the | ||
| 252 | -returns the resulting ``QPDFObjectHandle``. The calling object handle | ||
| 253 | -then replaces its ``std::shared_ptr<QDFObject>`` with the one from the | ||
| 254 | -newly returned ``QPDFObjectHandle``. In this way, only a single copy | ||
| 255 | -of any direct object need exist and clients can access objects | ||
| 256 | -transparently without knowing or caring whether they are direct or | ||
| 257 | -indirect objects. Additionally, no object is ever read from the file | ||
| 258 | -more than once. That means that only the portions of the PDF file that | ||
| 259 | -are actually needed are ever read from the input file, thus allowing | ||
| 260 | -the qpdf package to take advantage of this important design goal of | ||
| 261 | -PDF files. | ||
| 262 | - | ||
| 263 | -If the requested object is inside of an object stream, the object stream | ||
| 264 | -itself is first read into memory. Then the tokenizer reads objects from | ||
| 265 | -the memory stream based on the offset information stored in the stream. | ||
| 266 | -Those individual objects are cached, after which the temporary buffer | ||
| 267 | -holding the object stream contents is discarded. In this way, the first | ||
| 268 | -time an object in an object stream is requested, all objects in the | ||
| 269 | -stream are cached. | 268 | +When reading an object from the input source, if the requested object |
| 269 | +is inside of an object stream, the object stream itself is first read | ||
| 270 | +into memory. Then the tokenizer reads objects from the memory stream | ||
| 271 | +based on the offset information stored in the stream. Those individual | ||
| 272 | +objects are cached, after which the temporary buffer holding the | ||
| 273 | +object stream contents is discarded. In this way, the first time an | ||
| 274 | +object in an object stream is requested, all objects in the stream are | ||
| 275 | +cached. | ||
| 270 | 276 | ||
| 271 | The following example should clarify how ``QPDF`` processes a simple | 277 | The following example should clarify how ``QPDF`` processes a simple |
| 272 | file. | 278 | file. |
| @@ -287,9 +293,10 @@ file. | @@ -287,9 +293,10 @@ file. | ||
| 287 | until it encounters ``>>``. Each object that is read is pushed onto | 293 | until it encounters ``>>``. Each object that is read is pushed onto |
| 288 | a stack. If ``R`` is read, the last two objects on the stack are | 294 | a stack. If ``R`` is read, the last two objects on the stack are |
| 289 | inspected. If they are integers, they are popped off the stack and | 295 | inspected. If they are integers, they are popped off the stack and |
| 290 | - their values are used to construct an indirect object handle which is | ||
| 291 | - then pushed onto the stack. When ``>>`` is finally read, the stack | ||
| 292 | - is converted into a ``QPDF_Dictionary`` which is placed in a | 296 | + their values are used to construct an indirect object handle which |
| 297 | + is then pushed onto the stack. When ``>>`` is finally read, the | ||
| 298 | + stack is converted into a ``QPDF_Dictionary`` (not directly | ||
| 299 | + accessible through the API) which is placed in a | ||
| 293 | ``QPDFObjectHandle`` and returned. | 300 | ``QPDFObjectHandle`` and returned. |
| 294 | 301 | ||
| 295 | - The resulting dictionary is saved as the trailer dictionary. | 302 | - The resulting dictionary is saved as the trailer dictionary. |
| @@ -299,7 +306,7 @@ file. | @@ -299,7 +306,7 @@ file. | ||
| 299 | saved. If ``/Prev`` is not present, the initial parsing process is | 306 | saved. If ``/Prev`` is not present, the initial parsing process is |
| 300 | complete. | 307 | complete. |
| 301 | 308 | ||
| 302 | - If there is an encryption dictionary, the document's encryption | 309 | +- If there is an encryption dictionary, the document's encryption |
| 303 | parameters are initialized. | 310 | parameters are initialized. |
| 304 | 311 | ||
| 305 | - The client requests root object. The ``QPDF`` class gets the value of | 312 | - The client requests root object. The ``QPDF`` class gets the value of |
| @@ -312,14 +319,103 @@ file. | @@ -312,14 +319,103 @@ file. | ||
| 312 | object cache for an object with the root dictionary's object ID and | 319 | object cache for an object with the root dictionary's object ID and |
| 313 | generation number. Upon not seeing it, it checks the cross reference | 320 | generation number. Upon not seeing it, it checks the cross reference |
| 314 | table, gets the offset, and reads the object present at that offset. | 321 | table, gets the offset, and reads the object present at that offset. |
| 315 | - It stores the result in the object cache and returns the cached | ||
| 316 | - result. The calling ``QPDFObjectHandle`` replaces its object pointer | ||
| 317 | - with the one from the resolved ``QPDFObjectHandle``, verifies that it | ||
| 318 | - a valid dictionary object, and returns the (unresolved indirect) | ||
| 319 | - ``QPDFObject`` handle to the top of the Pages hierarchy. | ||
| 320 | - | ||
| 321 | - As the client continues to request objects, the same process is | ||
| 322 | - followed for each new requested object. | 322 | + It stores the result in the object cache. The cache entry's value is |
| 323 | + replaced by the actual value, which causes any previously unresolved | ||
| 324 | + ``QPDFObjectHandle`` objects that pointed there to now have a | ||
| 325 | + shared copy of the actual object. Modifications through any such | ||
| 326 | + ``QPDFObjectHandle`` will be reflected in all of them. As the client | ||
| 327 | + continues to request objects, the same process is followed for each | ||
| 328 | + new requested object. | ||
| 329 | + | ||
| 330 | +.. _object_internals: | ||
| 331 | + | ||
| 332 | +QPDF Object Internals | ||
| 333 | +--------------------- | ||
| 334 | + | ||
| 335 | +The internals of ``QPDFObjectHandle`` and how qpdf stores objects were | ||
| 336 | +significantly rewritten for QPDF 11. Here are some additional details. | ||
| 337 | + | ||
| 338 | +Object Internals | ||
| 339 | +~~~~~~~~~~~~~~~~ | ||
| 340 | + | ||
| 341 | +The ``QPDF`` object has an object cache which contains a shared | ||
| 342 | +pointer to each object that was read from the file. Changes can be | ||
| 343 | +made to any of those objects through ``QPDFObjectHandle`` methods. Any | ||
| 344 | +such changes are visible to all ``QPDFObjectHandle`` instances that | ||
| 345 | +point to the same object. When a ``QPDF`` object is written by | ||
| 346 | +``QPDFWriter`` or serialized to JSON, any changes are reflected. | ||
| 347 | + | ||
| 348 | +Objects in qpdf 11 and Newer | ||
| 349 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 350 | + | ||
| 351 | +The object cache in ``QPDF`` contains a shared pointer to | ||
| 352 | +``QPDFValueProxy``. Any ``QPDFObjectHandle`` resolved from an indirect | ||
| 353 | +reference to that object has a copy of that shared pointer. Each | ||
| 354 | +``QPDFValueProxy`` object contains a shared pointer to an object of | ||
| 355 | +type ``QPDFValue``. The ``QPDFValue`` type is an abstract base class. | ||
| 356 | +There is an implementation for each of the basic object types (array, | ||
| 357 | +dictionary, null, boolean, string, number, etc.) as well as a few | ||
| 358 | +special ones including ``uninitialized``, ``unresolved``, and | ||
| 359 | +``reserved``. When an object is first referenced, its underlying | ||
| 360 | +``QPDFValue`` has type ``unresolved``. When the object is first | ||
| 361 | +resolved, the ``QPDFValueProxy`` in the cache has its internal | ||
| 362 | +``QPDFValue`` replaced with the object as read from the file. Since it | ||
| 363 | +is the ``QPDFValueProxy`` object that is shared by all referencing | ||
| 364 | +``QPDFObjectHandle`` objects as well as by the owning ``QPDF`` object, | ||
| 365 | +this ensures that any future changes to the object, including | ||
| 366 | +replacing the object with a completely different one, will be | ||
| 367 | +reflected across all ``QPDFObjectHandle`` objects that reference it. | ||
| 368 | + | ||
| 369 | +A ``QPDFValue`` that originated from a PDF input source maintains a | ||
| 370 | +pointer to the ``QPDF`` object that read it (its *owner*). When that | ||
| 371 | +``QPDF`` object is destroyed, it replaces the value of each | ||
| 372 | +``QPDFValueProxy`` in its cache with a direct ``null`` object and | ||
| 373 | +clears the pointer to the owning ``QPDF``. This means that, if there | ||
| 374 | +are still any referencing ``QPDFObjectHandle`` objects floating | ||
| 375 | +around, requesting their owning ``QPDF`` will return a null pointer | ||
| 376 | +rather than a pointer to a ``QPDF`` object that is either invalid or | ||
| 377 | +points to something else. This operation also has the effect of | ||
| 378 | +breaking any circular references (which are common and, in some cases, | ||
| 379 | +required by the PDF specification), thus preventing memory leaks when | ||
| 380 | +``QPDF`` objects are destroyed. | ||
| 381 | + | ||
| 382 | +Objects prior to qpdf 11 | ||
| 383 | +~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 384 | + | ||
| 385 | +Prior to qpdf 11, the functionality of the ``QPDFValue`` and | ||
| 386 | +``QPDFValueProxy`` classes was combined into a single ``QPDFObject`` | ||
| 387 | +class, which served the dual purpose of being the cache entry for | ||
| 388 | +``QPDF`` and being the abstract base class for all the different PDF | ||
| 389 | +object types. The behavior was nearly the same, but there were a few | ||
| 390 | +problems: | ||
| 391 | + | ||
| 392 | +- While changes to a ``QPDFObjectHandle`` through mutation were | ||
| 393 | + visible across all referencing ``QPDFObjectHandle`` objects, | ||
| 394 | + *replacing* an object with ``QPDF::replaceObject`` or | ||
| 395 | + ``QPDF::swapObjects`` would leave ``QPDF`` with no way of notifying | ||
| 396 | + ``QPDFObjectHandle`` objects that pointed to the old ``QPDFObject``. | ||
| 397 | + To work around this, every attempt to access the underlying object | ||
| 398 | + that a ``QPDFObjectHandle`` pointed to had to ask the owning | ||
| 399 | + ``QPDF`` whether the object had changed, and if so, it had to | ||
| 400 | + replace its internal ``QPDFObject`` pointer. This added overhead to | ||
| 401 | + every indirect object access even if no objects were ever changed. | ||
| 402 | + | ||
| 403 | +- When a ``QPDF`` object was destroyed, it was necessary to | ||
| 404 | + recursively traverse the structure of every object in the file to | ||
| 405 | + break any circular references. For complex files, this significantly | ||
| 406 | + increased the cost of destroying ``QPDF`` objects. | ||
| 407 | + | ||
| 408 | +- When a ``QPDF`` object was destroyed, any ``QPDFObjectHandle`` | ||
| 409 | + objects that referenced it would maintain a potentially invalid | ||
| 410 | + pointer as the owning ``QPDF``. In practice, this wasn't usually a | ||
| 411 | + problem since generally people would have no need to maintain copies | ||
| 412 | + of a ``QPDFObjectHandle`` from a destroyed ``QPDF`` object, but | ||
| 413 | + in cases where this was possible, it was necessary for other | ||
| 414 | + software to do its own bookkeeping to ensure that an object's owner | ||
| 415 | + was still valid. | ||
| 416 | + | ||
| 417 | +All of these problems were effectively solved by splitting | ||
| 418 | +``QPDFObject`` into ``QPDFValueProxy`` and ``QPDFValue``. | ||
| 323 | 419 | ||
| 324 | .. _casting: | 420 | .. _casting: |
| 325 | 421 |