-
This makes compatibility with py3 easier, but requires us to guess an encoding. Should work fine for European-generated files, but could produce strange results for Asian files.
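
A minimal sketch of what guessing an encoding can look like. The helper name `guess_decode` and the list of candidate encodings are hypothetical and not taken from the actual code, which may simply hard-code one encoding.

```python
def guess_decode(raw, candidates=("utf-8", "cp1252")):
    """Decode bytes by trying a few encodings in order (hypothetical helper)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never fails,
    # but it may produce nonsense for non-European text
    return raw.decode("latin-1")
```

The final fallback is what makes the "strange results" possible: it always returns a string, whether or not the guess was right.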
-
Most changes are just whitespace, line-break, or case changes. But this also: found an actual error (variable exc was used before creation); moved imports up between license and changelog (although I would prefer them in their original place); removed the _ansi_ from read_*_ansi_string; moved logging constants from main to global scope.
-
Tell the caller of the script roughly what happened in the call. Also: check whether the given file arguments exist and return a non-zero exit code if they do not, and remove the print of the non-existent __doc__.
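
A rough sketch of the idea, assuming a `main` function that returns the exit code. The constant names and their values are hypothetical; the real script may use different codes.

```python
import os
import sys

# Hypothetical return codes; the actual values used by the script may differ
RETURN_OK = 0
RETURN_MISSING_FILE = 1


def main(filenames):
    """Return an exit code that tells the caller roughly what happened."""
    missing = [name for name in filenames if not os.path.isfile(name)]
    for name in missing:
        print("File does not exist: " + name, file=sys.stderr)
    if missing:
        return RETURN_MISSING_FILE
    # ... process the files here ...
    return RETURN_OK
```

A script would then end with `sys.exit(main(sys.argv[1:]))` so that the shell sees the code.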
-
This way we do not have to keep a whole big office file in memory. (Olefile might do that anyway, but then we have one copy fewer.) Also merge the subfunction process_native_stream back into process_file (harder to read, but makes more sense for exception handling).
-
Can now parse both from a byte array or from a stream.
-
This is more efficient and simplifies generalization to using byte-streams instead of byte arrays as data input.
-
This way, oleobj can now handle office 2007+ types (docx, xlsx, pptx, and derivatives). Since this adds another loop level to process_file, created a separate function for the innermost code part (the actual dumping).
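
Office 2007+ files are zip archives, so the extra loop level iterates over the archive members. A minimal sketch, assuming embedded OLE files are recognized by their magic bytes; the function name `find_ole_members` is hypothetical and not part of oleobj.

```python
import io
import zipfile

# The 8 magic bytes at the start of every OLE2 (compound) file
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def find_ole_members(file_or_path):
    """Yield names of zip members that start with the OLE2 magic bytes."""
    with zipfile.ZipFile(file_or_path) as archive:
        for name in archive.namelist():
            # read only the first few bytes, not the whole member
            with archive.open(name) as member:
                if member.read(len(OLE_MAGIC)) == OLE_MAGIC:
                    yield name
```

Each name found this way could then be opened and handed to the per-OLE-file code.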
-
This was not easy to do if we want to avoid having the complete embedded file in uncompressed form in memory. Had to create a stream around an iterable, kind of fun :-)
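
A minimal sketch of a stream around an iterable, assuming the iterable yields chunks of bytes. The class name `IterStream` is hypothetical here and this is not necessarily how the actual code does it.

```python
import io


class IterStream(io.RawIOBase):
    """Read-only raw stream that pulls its data from an iterable of bytes."""

    def __init__(self, iterable):
        super().__init__()
        self._chunks = iter(iterable)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # refill from the iterable until we have data or it is exhausted
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n
```

Wrapping it as `io.BufferedReader(IterStream(chunks))` gives the usual `read(size)` interface while only one chunk at a time is held in memory.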
-
This compensates for an inconsistency that is probably just an error in some ppt versions. The size attribute of the CurrentUserAtom "forgets" about the optional Unicode user name, which then creates strange data behind the record (where nothing should be).
-
So far, the ppt_parser is rather stupid: it does not understand the structure of the streams but just looks for a certain byte sequence anywhere in the stream (search_* methods). There was another attempt to understand and parse the stream structure, but that failed (parse_* methods). Encouraged by xls_parser, which parses the data as a series of records, I tried the same with ppt files and it works nicely so far. Might be able to replace ppt_parser soon.
-
Parsing through records seems to make sense. Try to repeat the same with ppt files next. To avoid copy-and-paste, move the code used by both into a common base module, record_base.py.
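
A rough sketch of parsing a stream as a series of records. It assumes the xls-style header (2-byte type and 2-byte payload size, little-endian); ppt records use a different, longer header. The function name `iter_records` is hypothetical and not the API of record_base.py.

```python
import io
import struct


def iter_records(stream):
    """Yield (record_type, payload) tuples from a stream of records.

    Assumes a 4-byte header per record: 2-byte type and 2-byte payload
    size, both little-endian (the layout used in xls workbook streams).
    """
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break  # end of stream (or truncated header)
        rec_type, size = struct.unpack("<HH", header)
        payload = stream.read(size)
        if len(payload) < size:
            raise ValueError("truncated record of type 0x%04x" % rec_type)
        yield rec_type, payload
```

Only the header layout differs between formats, which is why the loop itself is a natural candidate for a shared base.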
-
DDE in CSV
-
They actually found a few \ in strings I had overlooked
-
Replace #print(...) with DEBUG_FLAG and conditional print(...)
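
The pattern, sketched with a hypothetical helper `debug_print` (the actual code may just use an inline `if DEBUG_FLAG:` before each print):

```python
DEBUG_FLAG = False  # set to True to get debug output


def debug_print(*args):
    """Print only when DEBUG_FLAG is set (hypothetical helper)."""
    if DEBUG_FLAG:
        print(*args)
```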
-
This is not necessary in python3