Commit 49f4600dd6feae74079ad3a3678f6a390bb4e3a1

Authored by Jay Berkenbilt
1 parent 0ae19c37

TODO: Move lexical stuff and add detail

Showing 1 changed file with 18 additions and 23 deletions
... ... @@ -59,29 +59,6 @@ C++-11
59 59 time.
60 60  
61 61  
62   -Lexical
63   -=======
64   -
65   - * Make it possible to run the lexer (tokenizer) over a whole file
66   - such that the following things would be possible:
67   -
68   - * Rewrite fix-qdf in C++ so that there is no longer a runtime perl
69   - dependency
70   -
71   - * Make it possible to replace all strings in a file lexically even
72   - on badly broken files. Ideally this should work on files that
73   - are lacking xref, have broken links, etc., and it should work
74   - with encrypted files if possible. This should go through the
75   - streams and strings and replace them with fixed or random
76   - characters, preferably, but not necessarily, in a manner that
77   - works with fonts. One possibility would be to detect whether a
78   - string contains characters with normal encoding, and if so, use
79   - 0x41. If the string uses character maps, use 0x01. The output
80   - should otherwise be unrelated to the input. This could be built
81   - after the filtering and tokenizer rewrite and should be done in a
82   - manner that takes advantage of the other lexical features. This
83   - sanitizer should also clear metadata and replace images.
84   -
85 62 Page splitting/merging
86 63 ======================
87 64  
... ... @@ -407,3 +384,21 @@ I find it useful to make reference to them in this list
407 384 * If I ever decide to make appearance stream-generation aware of
408 385 fonts or font metrics, see email from Tobias with Message-ID
409 386 <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
  387 +
  388 + * Consider creating a sanitizer to make it easier for people to send
  389 + broken files. Now that we have json mode, this is probably no
  390 + longer worth doing. Here is the previous idea, possibly implemented
  391 + by making it possible to run the lexer (tokenizer) over a whole
  392 + file. Make it possible to replace all strings in a file lexically
  393 + even on badly broken files. Ideally this should work on files
  394 + that are lacking xref, have broken links, etc., and it should
  395 + work with encrypted files if possible. This should go through the
  396 + streams and strings and replace them with fixed or random
  397 + characters, preferably, but not necessarily, in a manner that works
  398 + with fonts. One possibility would be to detect whether a string
  399 + contains characters with normal encoding, and if so, use 0x41. If
  400 + the string uses character maps, use 0x01. The output should
  401 + otherwise be unrelated to the input. This could be built after the
  402 + filtering and tokenizer rewrite and should be done in a manner that
  403 + takes advantage of the other lexical features. This sanitizer
  404 + should also clear metadata and replace images.
... ...
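The replacement rule described in the sanitizer bullet above could look roughly like this. This is a minimal sketch, not qpdf API: `looks_normal` and `sanitize_string` are illustrative names, and the "normal encoding" check is a stand-in heuristic for whatever encoding detection the real tokenizer-based implementation would use.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Heuristic stand-in: treat a string as normally encoded if every byte
// is printable or whitespace; otherwise assume it goes through a
// character map (CMap).
bool looks_normal(std::string const& s)
{
    for (unsigned char c : s) {
        if (!(std::isprint(c) || std::isspace(c))) {
            return false;
        }
    }
    return true;
}

// Overwrite every byte with a fixed value so the output is unrelated
// to the input while keeping the original length: 0x41 ('A') for
// normally encoded strings (renders with most fonts), 0x01 for
// character-map strings.
std::string sanitize_string(std::string const& s)
{
    char fill = looks_normal(s) ? '\x41' : '\x01';
    return std::string(s.size(), fill);
}
```

Keeping the string length unchanged means the replacement can be done purely lexically, without repairing xref tables or object structure, which is what makes the approach viable on badly broken files.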