-
Want to discourage people from working on ppt_parser, since that would increase the amount of code that has to be reproduced in ppt_record_parser before it can replace ppt_parser.
-
The regular expression \w behaves differently in Python 2 (matches only ASCII by default) and Python 3 (matches all Unicode word characters in str patterns). Make explicit that we only want ASCII in sanitized filenames.
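A minimal sketch of an ASCII-only sanitizer, spelling out the character class instead of relying on \w. The function name and the set of allowed punctuation are assumptions for illustration, not necessarily what the code base uses.

```python
import re

# Explicit ASCII class: behaves the same in Python 2 and Python 3.
# (In Python 3 only, r'[^\w.-]' with flags=re.ASCII would be an alternative.)
_NOT_ALLOWED = re.compile(r'[^a-zA-Z0-9_.-]')


def sanitize_filename(name, replacement='_'):
    """Hypothetical helper: replace every character outside the ASCII set."""
    return _NOT_ALLOWED.sub(replacement, name)
```

Each non-ASCII character is replaced individually, so accented letters do not silently survive in Python 3.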
-
Strangest thing: this change was necessary for unit-testing oleobj. Without it, running python3.3 -m unittest tests.oleobj.test_basic resulted in: AttributeError: 'module' object has no attribute 'oleobj'. That was a rather unhelpful error message.
-
OleFileIO requires a complete seek() and checks for a closed attribute. Also added some commented-out debug print commands to ZipSubFile.
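A rough sketch of what "a complete seek()" plus a closed attribute can look like for a read-only view of one zip member. The class name and its details are hypothetical and are not the actual ZipSubFile code. Backward seeks are emulated by reopening the member and skipping forward.

```python
import io
import zipfile


class SeekableZipMember(object):
    """Hypothetical read-only, seekable view of one member of a ZipFile."""

    def __init__(self, container, name):
        self.container = container
        self.name = name
        self.size = container.getinfo(name).file_size
        self.handle = container.open(name, 'r')
        self.pos = 0
        self.closed = False

    def read(self, size=-1):
        data = self.handle.read(size)
        self.pos += len(data)
        return data

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        # Translate all three whence modes into an absolute target position
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self.pos + offset
        elif whence == io.SEEK_END:
            target = self.size + offset
        else:
            raise ValueError('invalid whence: %r' % (whence,))
        if target < 0:
            raise ValueError('negative seek position')
        if target < self.pos:
            # Cannot go backwards in compressed data: start over
            self.handle.close()
            self.handle = self.container.open(self.name, 'r')
            self.pos = 0
        while self.pos < target:
            # Skip forward by reading and discarding
            chunk = self.handle.read(min(4096, target - self.pos))
            if not chunk:
                break
            self.pos += len(chunk)
        return self.pos

    def close(self):
        self.handle.close()
        self.closed = True
```

Reopening on every backward seek is slow for large members, but it avoids holding the uncompressed data in memory.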
-
Also remove one exception from the output and add a comment.
-
This makes compatibility with py3 easier, but requires us to guess an encoding. Should work fine for European-generated files, but could produce strange results for Asian files.
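To illustrate the trade-off, a small sketch that decodes raw bytes with a guessed single-byte encoding. The choice of latin-1 and the helper name are assumptions; the encoding actually guessed in the code may differ.

```python
# Assumption for this sketch: a Western European single-byte encoding
GUESSED_ENCODING = 'latin-1'


def to_text(raw, encoding=GUESSED_ENCODING):
    """Hypothetical helper: turn bytes into text using a guessed encoding."""
    return raw.decode(encoding, 'replace')


european = to_text(b'caf\xe9')    # byte 0xe9 is e-acute in latin-1
asian = to_text(b'\x93\xfa')      # Shift-JIS bytes for one kanji, decoded wrongly
```

latin-1 maps every byte to some character, so decoding never fails. Wrong guesses therefore show up as garbled text rather than as exceptions.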
-
Most changes are just whitespace, line-break, or case changes. But: this did find an actual error (variable exc was used before creation); moved imports up between license and changelog (although I would prefer them in their original place); removed the _ansi_ from read_*_ansi_string; moved logging constants from main to global scope.
-
Tell the caller of the script roughly what happened in the call. Also: check whether the given file arguments exist and return a non-zero exit code if they do not, and remove the print of the non-existent __doc__.
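A minimal sketch of this pattern: main() checks the file arguments up front and reports the outcome through its return value. The constant names and values are hypothetical, chosen only for illustration.

```python
import os
import sys

# Hypothetical exit codes for this sketch
EXIT_OK = 0
EXIT_MISSING_FILE = 2


def main(filenames):
    """Return an exit code that tells the caller roughly what happened."""
    missing = [name for name in filenames if not os.path.isfile(name)]
    if missing:
        for name in missing:
            sys.stderr.write('File does not exist: %s\n' % name)
        return EXIT_MISSING_FILE
    # ... actual processing of the files would go here ...
    return EXIT_OK
```

A script would typically end with sys.exit(main(sys.argv[1:])) so that the shell sees the code.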
-
This way we do not have to keep a whole big office file in memory. (Olefile might do that anyway, but then we have one copy less.) Also merge subfunction process_native_stream back into process_file (harder to read, but makes more sense for exception handling).
-
Can now parse from either a bytes array or a stream.
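One way to support both input kinds is to wrap byte arrays in io.BytesIO and write all readers against the stream interface. This is a sketch under that assumption; the helper names are hypothetical.

```python
import io
import struct


def as_stream(data):
    """Wrap bytes in a stream; pass existing streams through unchanged."""
    if isinstance(data, (bytes, bytearray)):
        return io.BytesIO(data)
    return data


def read_uint32(stream):
    """Read a little-endian unsigned 32-bit integer from a stream."""
    raw = stream.read(4)
    if len(raw) != 4:
        raise ValueError('stream ended inside a 4-byte integer')
    return struct.unpack('<L', raw)[0]


def read_zero_terminated(stream, max_len=1024):
    """Read bytes up to (not including) the next zero byte."""
    result = bytearray()
    for _ in range(max_len):
        char = stream.read(1)
        if not char:
            raise ValueError('stream ended before terminator')
        if char == b'\x00':
            return bytes(result)
        result += char
    raise ValueError('no terminator within %d bytes' % max_len)
```

Because the readers consume the stream, consecutive calls pick up where the previous one stopped, with no index bookkeeping.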
-
This is more efficient and simplifies generalization to using byte-streams instead of byte arrays as data input.
-
This way, oleobj can now handle Office 2007+ types (docx, xlsx, pptx, and derivatives). Since this adds another loop level to process_file, created a separate function for the innermost part of the code (the actual dumping).
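Office 2007+ files are zip archives, so the extra loop level amounts to iterating over the archive members. The sketch below picks out members that start with the OLE2 signature. Selecting by signature and the function name are assumptions for illustration, not necessarily how oleobj chooses its sub-files.

```python
import io
import zipfile

# Signature at the start of every OLE2 compound file
OLE_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'


def find_ole_members(path_or_file):
    """Yield names of zip members that look like OLE files (hypothetical)."""
    with zipfile.ZipFile(path_or_file) as archive:
        for info in archive.infolist():
            with archive.open(info) as member:
                if member.read(len(OLE_MAGIC)) == OLE_MAGIC:
                    yield info.filename


# Usage with an in-memory archive standing in for a .docx file
sample = io.BytesIO()
with zipfile.ZipFile(sample, 'w') as writer:
    writer.writestr('word/document.xml', b'<xml/>')
    writer.writestr('word/embeddings/oleObject1.bin', OLE_MAGIC + b'\x00' * 8)
found = list(find_ole_members(sample))
```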
-
This was not easy to do if we want to avoid having the complete embedded file in uncompressed form in memory. Had to create a stream around an iterable, kind of fun :-)
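The idea of a stream around an iterable can be sketched as follows: chunks are pulled from the iterator only when read() needs more data, so at most the current chunk plus a small remainder is buffered. The class name is hypothetical and the real implementation may differ.

```python
class IterableStream(object):
    """Hypothetical read-only stream over an iterable of bytes chunks."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buffer = b''
        self.closed = False

    def read(self, size=-1):
        if size is None or size < 0:
            # Read everything that is left
            data = self._buffer + b''.join(self._chunks)
            self._buffer = b''
            return data
        # Pull chunks only until the request can be satisfied
        while len(self._buffer) < size:
            try:
                self._buffer += next(self._chunks)
            except StopIteration:
                break
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def close(self):
        self.closed = True
```

This sketch is forward-only; seeking would need a way to restart the iterable.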
-
This compensates for an inconsistency that is probably just an error in some ppt versions. The size attribute of the CurrentUserAtom "forgets" about the optional Unicode user name, which then leaves strange data after the record (where nothing should be).