Commit ed04b80caf7400622aa9d12797e221271c4d2016

Authored by Jay Berkenbilt
1 parent 55cc2ab6

Update internals documentation to reflect QPDFObject split

Showing 1 changed file with 154 additions and 58 deletions
manual/design.rst
... ... @@ -67,17 +67,20 @@ files.
67 67 The primary class for interacting with PDF objects is
68 68 ``QPDFObjectHandle``. Instances of this class can be passed around by
69 69 value, copied, stored in containers, etc. with very low overhead. The
70   -``QPDFObjectHandle`` object contains an internal shared pointer to an
71   -underlying ``QPDFObject``. Instances of ``QPDFObjectHandle`` created
72   -by reading from a file will always contain a reference back to the
  70 +``QPDFObjectHandle`` object contains an internal shared pointer to the
  71 +underlying object. Instances of ``QPDFObjectHandle`` created by
  72 +reading from a file will always contain a reference back to the
73 73 ``QPDF`` object from which they were created. A ``QPDFObjectHandle``
74   -may be direct or indirect. If indirect, the ``QPDFObject`` shared
75   -pointer is initially null. In this case, the first attempt to access
76   -the underlying ``QPDFObject`` will result in the ``QPDFObject`` being
77   -resolved via a call to the referenced ``QPDF`` instance. This makes it
78   -essentially impossible to make coding errors in which certain things
79   -will work for some PDF files and not for others based on which objects
80   -are direct and which objects are indirect.
  74 +may be direct or indirect. If indirect, the object is initially
  75 +*unresolved*. In this case, the first attempt to access the underlying
  76 +object will result in the object being resolved via a call to the
  77 +referenced ``QPDF`` instance. This makes it essentially impossible to
  78 +make coding errors in which certain things will work for some PDF
  79 +files and not for others based on which objects are direct and which
  80 +objects are indirect. In cases where it is necessary to know whether
  81 +an object is indirect or not, this information can be obtained from
  82 +the ``QPDFObjectHandle``. It is also possible to convert direct
  83 +objects to indirect objects and vice versa.
81 84  
82 85 Instances of ``QPDFObjectHandle`` can be directly created and modified
83 86 using static factory methods in the ``QPDFObjectHandle`` class. There
... ... @@ -230,43 +233,46 @@ could serve as a starting point to someone trying to understand the
230 233 implementation. There is nothing in this section that you need to know
231 234 to use the qpdf library.
232 235  
233   -``QPDFObject`` is the basic PDF Object class. It is an abstract base
234   -class from which are derived classes for each type of PDF object.
235   -Clients do not interact with Objects directly but instead interact with
236   -``QPDFObjectHandle``.
  236 +In a PDF file, objects may be direct or indirect. Direct objects are
  237 +objects whose representations appear directly in PDF syntax. Indirect
  238 +objects are references to objects by their ID. The qpdf library uses
  239 +the ``QPDFObjectHandle`` type to hold onto objects and to abstract
  240 +away in most cases whether the object is direct or indirect.
  241 +
  242 +Internally, ``QPDFObjectHandle`` holds onto a shared pointer to the
  243 +underlying object value. When a direct object is created, the
  244 +``QPDFObjectHandle`` that holds it is not associated with a ``QPDF``
  245 +object. When an indirect object reference is created, it starts off in
  246 +an *unresolved* state and must be associated with a ``QPDF`` object,
  247 +which is considered its *owner*. To access the actual value of the
  248 +object, the object must be *resolved*. This happens automatically when
  249 +the object is accessed in any way.
  250 +
  251 +To resolve an object, qpdf checks its object cache. If not found in
  252 +the cache, it attempts to read the object from the input source
  253 +associated with the ``QPDF`` object. If it is not found, a ``null``
  254 +object is returned. A ``null`` object is an object type, just like
  255 +boolean, string, number, etc. It is not a null pointer. The PDF
  256 +specification states that an indirect reference to an object that
  257 +doesn't exist is to be treated as a ``null``. The resulting object,
  258 +whether a ``null`` or the actual object that was read, is stored in
  259 +the cache. If the object is later replaced or swapped, the underlying
  260 +object remains the same, but its value is replaced. This way, if you
  261 +have a ``QPDFObjectHandle`` to an indirect object and the object by
  262 +that number is replaced (by calling ``QPDF::replaceObject`` or
  263 +``QPDF::swapObjects``), your ``QPDFObjectHandle`` will reflect the new
  264 +value of the object. This is consistent with what would happen to PDF
  265 +objects if you were to replace the definition of an object in the
  266 +file.
237 267  
238   -When the ``QPDF`` class creates a new object, it dynamically allocates
239   -the appropriate type of ``QPDFObject`` and immediately hands the pointer
240   -to an instance of ``QPDFObjectHandle``. The parser reads a token from
241   -the current file position. If the token is a not either a dictionary or
242   -array opener, an object is immediately constructed from the single token
243   -and the parser returns. Otherwise, the parser iterates in a special mode
244   -in which it accumulates objects until it finds a balancing closer.
245   -During this process, the ``R`` keyword is recognized and an indirect
246   -``QPDFObjectHandle`` may be constructed.
247   -
248   -The ``QPDF::resolve()`` method, which is used to resolve an indirect
249   -object, may be invoked from the ``QPDFObjectHandle`` class. It first
250   -checks a cache to see whether this object has already been read. If
251   -not, it reads the object from the PDF file and caches it. It the
252   -returns the resulting ``QPDFObjectHandle``. The calling object handle
253   -then replaces its ``std::shared_ptr<QDFObject>`` with the one from the
254   -newly returned ``QPDFObjectHandle``. In this way, only a single copy
255   -of any direct object need exist and clients can access objects
256   -transparently without knowing or caring whether they are direct or
257   -indirect objects. Additionally, no object is ever read from the file
258   -more than once. That means that only the portions of the PDF file that
259   -are actually needed are ever read from the input file, thus allowing
260   -the qpdf package to take advantage of this important design goal of
261   -PDF files.
262   -
263   -If the requested object is inside of an object stream, the object stream
264   -itself is first read into memory. Then the tokenizer reads objects from
265   -the memory stream based on the offset information stored in the stream.
266   -Those individual objects are cached, after which the temporary buffer
267   -holding the object stream contents is discarded. In this way, the first
268   -time an object in an object stream is requested, all objects in the
269   -stream are cached.
  268 +When reading an object from the input source, if the requested object
  269 +is inside of an object stream, the object stream itself is first read
  270 +into memory. Then the tokenizer reads objects from the memory stream
  271 +based on the offset information stored in the stream. Those individual
  272 +objects are cached, after which the temporary buffer holding the
  273 +object stream contents is discarded. In this way, the first time an
  274 +object in an object stream is requested, all objects in the stream are
  275 +cached.
270 276  
271 277 The following example should clarify how ``QPDF`` processes a simple
272 278 file.
... ... @@ -287,9 +293,10 @@ file.
287 293 until it encounters ``>>``. Each object that is read is pushed onto
288 294 a stack. If ``R`` is read, the last two objects on the stack are
289 295 inspected. If they are integers, they are popped off the stack and
290   - their values are used to construct an indirect object handle which is
291   - then pushed onto the stack. When ``>>`` is finally read, the stack
292   - is converted into a ``QPDF_Dictionary`` which is placed in a
  296 + their values are used to construct an indirect object handle which
  297 + is then pushed onto the stack. When ``>>`` is finally read, the
  298 + stack is converted into a ``QPDF_Dictionary`` (not directly
  299 + accessible through the API) which is placed in a
293 300 ``QPDFObjectHandle`` and returned.
294 301  
295 302 - The resulting dictionary is saved as the trailer dictionary.
... ... @@ -299,7 +306,7 @@ file.
299 306 saved. If ``/Prev`` is not present, the initial parsing process is
300 307 complete.
301 308  
302   - If there is an encryption dictionary, the document's encryption
  309 +- If there is an encryption dictionary, the document's encryption
303 310 parameters are initialized.
304 311  
305 312 - The client requests the root object. The ``QPDF`` class gets the value of
... ... @@ -312,14 +319,103 @@ file.
312 319 object cache for an object with the root dictionary's object ID and
313 320 generation number. Upon not seeing it, it checks the cross reference
314 321 table, gets the offset, and reads the object present at that offset.
315   - It stores the result in the object cache and returns the cached
316   - result. The calling ``QPDFObjectHandle`` replaces its object pointer
317   - with the one from the resolved ``QPDFObjectHandle``, verifies that it
318   - a valid dictionary object, and returns the (unresolved indirect)
319   - ``QPDFObject`` handle to the top of the Pages hierarchy.
320   -
321   - As the client continues to request objects, the same process is
322   - followed for each new requested object.
  322 + It stores the result in the object cache. The cache entry's value is
  323 + replaced by the actual value, which causes any previously unresolved
  324 + ``QPDFObjectHandle`` objects that pointed there to now have a
  325 + shared copy of the actual object. Modifications through any such
  326 + ``QPDFObjectHandle`` will be reflected in all of them. As the client
  327 + continues to request objects, the same process is followed for each
  328 + new requested object.
  329 +
  330 +.. _object_internals:
  331 +
  332 +QPDF Object Internals
  333 +---------------------
  334 +
  335 +The internals of ``QPDFObjectHandle`` and how qpdf stores objects were
  336 +significantly rewritten for qpdf 11. Here are some additional details.
  337 +
  338 +Object Internals
  339 +~~~~~~~~~~~~~~~~
  340 +
  341 +The ``QPDF`` object has an object cache which contains a shared
  342 +pointer to each object that was read from the file. Changes can be
  343 +made to any of those objects through ``QPDFObjectHandle`` methods. Any
  344 +such changes are visible to all ``QPDFObjectHandle`` instances that
  345 +point to the same object. When a ``QPDF`` object is written by
  346 +``QPDFWriter`` or serialized to JSON, any changes are reflected.
  347 +
  348 +Objects in qpdf 11 and Newer
  349 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  350 +
  351 +The object cache in ``QPDF`` contains a shared pointer to
  352 +``QPDFValueProxy``. Any ``QPDFObjectHandle`` resolved from an indirect
  353 +reference to that object has a copy of that shared pointer. Each
  354 +``QPDFValueProxy`` object contains a shared pointer to an object of
  355 +type ``QPDFValue``. The ``QPDFValue`` type is an abstract base class.
  356 +There is an implementation for each of the basic object types (array,
  357 +dictionary, null, boolean, string, number, etc.) as well as a few
  358 +special ones including ``uninitialized``, ``unresolved``, and
  359 +``reserved``. When an object is first referenced, its underlying
  360 +``QPDFValue`` has type ``unresolved``. When the object is first
  361 +resolved, the ``QPDFValueProxy`` in the cache has its internal
  362 +``QPDFValue`` replaced with the object as read from the file. Since it
  363 +is the ``QPDFValueProxy`` object that is shared by all referencing
  364 +``QPDFObjectHandle`` objects as well as by the owning ``QPDF`` object,
  365 +this ensures that any future changes to the object, including
  366 +replacing the object with a completely different one, will be
  367 +reflected across all ``QPDFObjectHandle`` objects that reference it.
  368 +
  369 +A ``QPDFValue`` that originated from a PDF input source maintains a
  370 +pointer to the ``QPDF`` object that read it (its *owner*). When that
  371 +``QPDF`` object is destroyed, it replaces the value of each
  372 +``QPDFValueProxy`` in its cache with a direct ``null`` object and
  373 +clears the pointer to the owning ``QPDF``. This means that, if there
  374 +are still any referencing ``QPDFObjectHandle`` objects floating
  375 +around, requesting their owning ``QPDF`` will return a null pointer
  376 +rather than a pointer to a ``QPDF`` object that is either invalid or
  377 +points to something else. This operation also has the effect of
  378 +breaking any circular references (which are common and, in some cases,
  379 +required by the PDF specification), thus preventing memory leaks when
  380 +``QPDF`` objects are destroyed.
  381 +
  382 +Objects Prior to qpdf 11
  383 +~~~~~~~~~~~~~~~~~~~~~~~~
  384 +
  385 +Prior to qpdf 11, the functionality of the ``QPDFValue`` and
  386 +``QPDFValueProxy`` classes was combined into a single ``QPDFObject``
  387 +class, which served the dual purpose of being the cache entry for
  388 +``QPDF`` and being the abstract base class for all the different PDF
  389 +object types. The behavior was nearly the same, but there were a few
  390 +problems:
  391 +
  392 +- While changes to a ``QPDFObjectHandle`` through mutation were
  393 + visible across all referencing ``QPDFObjectHandle`` objects,
  394 + *replacing* an object with ``QPDF::replaceObject`` or
  395 + ``QPDF::swapObjects`` would leave ``QPDF`` with no way of notifying
  396 + ``QPDFObjectHandle`` objects that pointed to the old ``QPDFObject``.
  397 + To work around this, every attempt to access the underlying object
  398 + that a ``QPDFObjectHandle`` pointed to had to ask the owning
  399 + ``QPDF`` whether the object had changed, and if so, it had to
  400 + replace its internal ``QPDFObject`` pointer. This added overhead to
  401 + every indirect object access even if no objects were ever changed.
  402 +
  403 +- When a ``QPDF`` object was destroyed, it was necessary to
  404 + recursively traverse the structure of every object in the file to
  405 + break any circular references. For complex files, this significantly
  406 + increased the cost of destroying ``QPDF`` objects.
  407 +
  408 +- When a ``QPDF`` object was destroyed, any ``QPDFObjectHandle``
  409 + objects that referenced it would maintain a potentially invalid
  410 + pointer as the owning ``QPDF``. In practice, this wasn't usually a
  411 + problem since generally people would have no need to maintain copies
  412 + of a ``QPDFObjectHandle`` from a destroyed ``QPDF`` object, but
  413 + in cases where this was possible, it was necessary for other
  414 + software to do its own bookkeeping to ensure that an object's owner
  415 + was still valid.
  416 +
  417 +All of these problems were effectively solved by splitting
  418 +``QPDFObject`` into ``QPDFValueProxy`` and ``QPDFValue``.
323 419  
324 420 .. _casting:
325 421  
... ...
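
The proxy arrangement described in the "Objects in qpdf 11 and Newer" hunk can be illustrated with a small standalone model. This is only a sketch under stated assumptions: ``Document``, ``ValueProxy``, ``Value``, ``reference``, ``replace``, and ``observe_lifecycle`` are hypothetical names invented for this example and are not the qpdf classes or API. A single ``replace`` call stands in for both resolution and object replacement, and a plain shared pointer to the proxy stands in for an object handle.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

class Document; // hypothetical stand-in for the owning document

// Hypothetical abstract value type; each value remembers its owner.
struct Value {
    virtual ~Value() = default;
    virtual std::string type() const = 0;
    Document* owner = nullptr;
};

struct Unresolved : Value {
    std::string type() const override { return "unresolved"; }
};

struct Null : Value {
    std::string type() const override { return "null"; }
};

struct Integer : Value {
    explicit Integer(long long v) : number(v) {}
    std::string type() const override { return "integer"; }
    long long number;
};

// The cache entry. Handles share the proxy, never the value directly.
struct ValueProxy {
    std::shared_ptr<Value> value;
};

class Document {
public:
    Document() = default;
    Document(const Document&) = delete;
    Document& operator=(const Document&) = delete;

    // On destruction, every cached proxy receives an ownerless null value,
    // so surviving handles never see a dangling owner pointer.
    ~Document() {
        for (auto& entry : cache_) {
            entry.second->value = std::make_shared<Null>();
        }
    }

    // Returns the shared proxy for an object ID, creating it unresolved.
    std::shared_ptr<ValueProxy> reference(int id) {
        std::shared_ptr<ValueProxy>& slot = cache_[id];
        if (!slot) {
            slot = std::make_shared<ValueProxy>();
            slot->value = std::make_shared<Unresolved>();
            slot->value->owner = this;
        }
        return slot;
    }

    // Swaps the value inside the proxy; the proxy itself stays the same.
    void replace(int id, std::shared_ptr<Value> v) {
        v->owner = this;
        reference(id)->value = std::move(v);
    }

private:
    std::map<int, std::shared_ptr<ValueProxy>> cache_;
};

// Records the value type seen through one handle over the document's life.
std::string observe_lifecycle() {
    std::string seen;
    std::shared_ptr<ValueProxy> handle;
    {
        Document doc;
        handle = doc.reference(7);
        assert(handle == doc.reference(7)); // same proxy for the same ID
        seen += handle->value->type();
        doc.replace(7, std::make_shared<Integer>(42));
        seen += " " + handle->value->type();
    } // doc is destroyed here
    seen += " " + handle->value->type();
    return seen;
}
```

Because the handle keeps pointing at the same proxy, replacing the value needs no notification step, and destroying the document only has to walk its own cache rather than traverse every object.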