Commit 49f4600dd6feae74079ad3a3678f6a390bb4e3a1

Authored by Jay Berkenbilt
1 parent 0ae19c37

TODO: Move lexical stuff and add detail

Showing 1 changed file with 18 additions and 23 deletions
@@ -59,29 +59,6 @@ C++-11
 time.
 
 
-Lexical
-=======
-
- * Make it possible to run the lexer (tokenizer) over a whole file
-   such that the following things would be possible:
-
- * Rewrite fix-qdf in C++ so that there is no longer a runtime perl
-   dependency
-
- * Make it possible to replace all strings in a file lexically even
-   on badly broken files. Ideally this should work files that are
-   lacking xref, have broken links, etc., and ideally it should work
-   with encrypted files if possible. This should go through the
-   streams and strings and replace them with fixed or random
-   characters, preferably, but not necessarily, in a manner that
-   works with fonts. One possibility would be to detect whether a
-   string contains characters with normal encoding, and if so, use
-   0x41. If the string uses character maps, use 0x01. The output
-   should otherwise be unrelated to the input. This could be built
-   after the filtering and tokenizer rewrite and should be done in a
-   manner that takes advantage of the other lexical features. This
-   sanitizer should also clear metadata and replace images.
-
 Page splitting/merging
 ======================
 
@@ -407,3 +384,21 @@ I find it useful to make reference to them in this list
 * If I ever decide to make appearance stream-generation aware of
   fonts or font metrics, see email from Tobias with Message-ID
   <5C3C9C6C.8000102@thax.hardliners.org> dated 2019-01-14.
+
+ * Consider creating a sanitizer to make it easier for people to send
+   broken files. Now that we have json mode, this is probably no
+   longer worth doing. Here is the previous idea, possibly implemented
+   by making it possible to run the lexer (tokenizer) over a whole
+   file. Make it possible to replace all strings in a file lexically
+   even on badly broken files. Ideally this should work on files that
+   are lacking xref, have broken links, etc., and ideally it should
+   work with encrypted files if possible. This should go through the
+   streams and strings and replace them with fixed or random
+   characters, preferably, but not necessarily, in a manner that works
+   with fonts. One possibility would be to detect whether a string
+   contains characters with normal encoding, and if so, use 0x41. If
+   the string uses character maps, use 0x01. The output should
+   otherwise be unrelated to the input. This could be built after the
+   filtering and tokenizer rewrite and should be done in a manner that
+   takes advantage of the other lexical features. This sanitizer
+   should also clear metadata and replace images.
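As a rough illustration of the sanitizer idea in the bullet above, here is a minimal standalone C++ sketch. It is hypothetical, not qpdf code: it handles only literal `(...)` strings in a byte stream, uses 0x41 for strings that look like normally encoded text and 0x01 otherwise, and ignores hex strings, streams, encryption, metadata, and images.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch (not qpdf's actual API): walk a PDF-like byte
// stream and blank out the contents of literal strings "(...)".
// Bytes are replaced with 0x41 ('A') when the string looks like
// normally encoded text and with 0x01 otherwise (the character-map
// case from the TODO item).
std::string sanitize_strings(std::string const& in)
{
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] != '(') {
            out += in[i++];
            continue;
        }
        // Find the matching ')', honoring backslash escapes and the
        // balanced nested parentheses PDF literal strings allow.
        std::size_t j = i + 1;
        int depth = 1;
        while (j < in.size() && depth > 0) {
            if (in[j] == '\\' && j + 1 < in.size()) {
                j += 2;
                continue;
            }
            if (in[j] == '(') {
                ++depth;
            } else if (in[j] == ')') {
                --depth;
            }
            ++j;
        }
        if (depth != 0) {
            // Unterminated string: pass the tail through unchanged.
            out.append(in, i, std::string::npos);
            break;
        }
        // Pick the replacement byte based on the string's contents.
        bool printable = true;
        for (std::size_t k = i + 1; k + 1 < j; ++k) {
            unsigned char u = static_cast<unsigned char>(in[k]);
            if (u < 0x20 || u > 0x7e) {
                printable = false;
                break;
            }
        }
        out += '(';
        out.append(j - i - 2, printable ? '\x41' : '\x01');
        out += ')';
        i = j;
    }
    return out;
}
```

A real version would run off the tokenizer rather than a hand-rolled scan, which is why the TODO ties it to the lexer-over-a-whole-file work.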