-
This way, oleobj can now handle Office 2007+ types (docx, xlsx, pptx, and derivatives). Since this adds another loop level to process_file, moved the innermost code part (the actual dumping) into its own function.
-
This was not easy to do if we want to avoid having the complete embedded file in uncompressed form in memory. Had to create a stream around an iterable, kind of fun :-)
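
A minimal sketch of what a stream around an iterable could look like, assuming the iterable yields bytes chunks (for example, pieces of decompressed data). The class name IterStream and its attributes are hypothetical and not taken from the actual change:

```python
import io


class IterStream(io.RawIOBase):
    """Read-only, non-seekable stream over an iterable of bytes chunks.

    Hypothetical illustration; only one chunk is held in memory at a time.
    """

    def __init__(self, chunks):
        super().__init__()
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Fetch the next non-empty chunk only when the previous one is used up.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals end of stream
        count = min(len(buffer), len(self._leftover))
        buffer[:count] = self._leftover[:count]
        self._leftover = self._leftover[count:]
        return count
```

Because only readinto is implemented, read() and readall() come from io.RawIOBase, and the object can be wrapped in io.BufferedReader where buffered reads are wanted.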
-
This compensates for an inconsistency that is probably just an error in some ppt versions. The size attribute of the CurrentUserAtom "forgets" about the optional Unicode user name, which then leaves strange data behind the record (where nothing should be).
-
So far, the ppt_parser is rather stupid: it does not understand the structure of the streams but just looks for a certain byte sequence anywhere in the stream (search_* methods). There was another attempt to understand and parse the stream structure, but that failed (parse_* methods). Encouraged by xls_parser, which also parses the data as a series of records, tried the same with ppt files, and it works nicely so far. Might be able to replace ppt_parser soon.
-
Parsing through records seems to make sense. Try to repeat the same with ppt files next. To avoid copy-and-paste, move the code used by both into a common base, record_base.py.
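
A rough sketch of the record-by-record idea, not the actual contents of record_base.py. It assumes a header made of a 2-byte type and a 2-byte size (little-endian), as in xls BIFF records; ppt records use a different header layout, which is why the format is a parameter here. The names Record and iter_records are hypothetical:

```python
import io
import struct
from collections import namedtuple

# Hypothetical container for one parsed record.
Record = namedtuple("Record", "type size data")


def iter_records(stream, header_format="<HH"):
    """Yield records from stream until it is exhausted.

    Assumes the header unpacks to exactly (type, size) and that
    size counts only the payload bytes following the header.
    """
    header_size = struct.calcsize(header_format)
    while True:
        header = stream.read(header_size)
        if not header:
            return  # clean end of stream
        if len(header) < header_size:
            raise ValueError("truncated record header")
        rec_type, rec_size = struct.unpack(header_format, header)
        data = stream.read(rec_size)
        if len(data) < rec_size:
            raise ValueError("truncated record data")
        yield Record(rec_type, rec_size, data)
```

A format-specific parser would then only need to supply its own header layout and interpret the record types it cares about.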
-
Dde in csv
-
They actually found a few \ in strings I had overlooked
-
Replace #print(...) with DEBUG_FLAG and conditional print(...)
-
This is not necessary in python3
-
Python's csv sniffer would find "i" as delimiter in text or "<" in xml. We prefer an error over misinterpretation. Also, try all delimiters, not just a second one. Rename one constant (added CSV_).
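
A small sketch of restricting the sniffer to a fixed set of delimiters, using the delimiters argument of csv.Sniffer.sniff. The value of CSV_DELIMITERS and both function names are assumptions for illustration, not the actual code:

```python
import csv

# Assumed set of acceptable delimiters; the real constant may differ.
CSV_DELIMITERS = ",;\t|"


def sniff_delimiter(sample, delimiters=CSV_DELIMITERS):
    """Return the sniffed delimiter, or raise if none of the allowed ones fits."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=delimiters).delimiter
    except csv.Error as exc:
        # Prefer an error over misinterpreting plain text or xml as csv.
        raise ValueError("no plausible CSV delimiter found") from exc


def delimiters_to_try(sample, delimiters=CSV_DELIMITERS):
    """Return the sniffed delimiter first, followed by all remaining ones."""
    first = sniff_delimiter(sample, delimiters)
    return [first] + [delim for delim in delimiters if delim != first]
```

With the restriction in place, a letter such as "i" or a character such as "<" can no longer be picked, so input that contains none of the allowed delimiters ends in an error instead of a wrong guess.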
-
- move imports further up
- simplify code for oletools import hack
- make a few variable names longer
-
- disable pylint-whitespace-check from FIELD_BLACKLIST
- shortened almost all lines to max 79 chars (except pylint: disable-*)
- moved imports further up
- re-wrap a few lines
- add missing doc strings
- add/remove whitespace
- remove old commented debug-log/print statements
-
Fixes