-
This way we do not have to keep a whole large office file in memory. (Olefile might do that anyway, but then we at least hold one copy less.) Also merge the subfunction process_native_stream back into process_file (harder to read, but makes more sense for exception handling).
-
Can now parse from either a byte array or a stream.
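A minimal sketch of how accepting both input kinds can look, not the actual oleobj code. The helper names `as_stream` and `read_magic` are made up for illustration; the idea is to wrap byte arrays in `io.BytesIO` so the rest of the parser only ever deals with a stream.

```python
import io


def as_stream(data):
    """Return a readable binary stream (hypothetical helper).

    Bytes-like input is wrapped in io.BytesIO; anything else is assumed
    to already be a stream and is passed through unchanged.
    """
    if isinstance(data, (bytes, bytearray, memoryview)):
        return io.BytesIO(data)
    return data


def read_magic(data, length=4):
    """Read the first `length` bytes from a byte array or a stream."""
    return as_stream(data).read(length)
```

Callers can then pass `b'...'` or an open file object without caring which one it is.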
-
This is more efficient and simplifies generalizing the data input from byte arrays to byte streams.
-
This way, oleobj can now handle Office 2007+ types (docx, xlsx, pptx, and derivatives). Since this adds another loop level to process_file, created a separate function for the innermost code part (the actual dumping).
-
This was not easy to do if we want to avoid having the complete embedded file in uncompressed form in memory. Had to create a stream around an iterable, kind of fun :-)
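A sketch of the "stream around an iterable" idea, assuming the iterable yields `bytes` chunks. The class name `IterableStream` is made up for illustration and is not necessarily what the code uses. Only one chunk is held at a time, so the full uncompressed data never sits in memory.

```python
import io


class IterableStream(io.RawIOBase):
    """Read-only stream over an iterable of bytes chunks (hypothetical name)."""

    def __init__(self, chunks):
        super().__init__()
        self._chunks = iter(chunks)
        self._leftover = b''      # unread rest of the current chunk

    def readable(self):
        return True

    def readinto(self, buffer):
        # Fetch chunks until one is non-empty or the iterable is exhausted.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0          # 0 bytes read signals end of stream
        count = min(len(buffer), len(self._leftover))
        buffer[:count] = self._leftover[:count]
        self._leftover = self._leftover[count:]
        return count


# Wrapping in BufferedReader gives the usual read(n) semantics.
stream = io.BufferedReader(IterableStream([b'abc', b'', b'defg', b'h']))
first = stream.read(5)
rest = stream.read()
```

The chunks could come from any generator, for example one that yields decompressed pieces of an embedded file.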
-
This compensates for an inconsistency that is probably just an error in some ppt versions. The size attribute of the CurrentUserAtom "forgets" about the optional Unicode user name, which then shows up as unexpected data behind the record (where nothing should be).
-
So far, the ppt_parser is rather simple-minded: it does not understand the structure of the streams but just looks for a certain byte sequence anywhere in the stream (search_* methods). There was another attempt to understand and parse the stream structure, but that failed (parse_* methods). Encouraged by xls_parser, which also parses the data as a series of records, tried the same with ppt files, and it works nicely so far. Might be able to replace ppt_parser soon.
-
Parsing through records seems to make sense. Try to repeat the same with ppt files next. To avoid copy-and-paste, move the code used by both into a common base, record_base.py.
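To illustrate what parsing a stream as a series of records means, here is a sketch and not the contents of record_base.py. The class `RecordBase`, the function `iter_records` and the header layout (16-bit type, 16-bit size, little endian) are assumptions for this example; a format with a different header would override `HEADER`.

```python
import io
import struct


class RecordBase:
    """Hypothetical minimal record: a type id plus its raw payload."""

    HEADER = struct.Struct('<HH')   # assumed layout: uint16 type, uint16 size

    def __init__(self, rec_type, data):
        self.type = rec_type
        self.size = len(data)
        self.data = data


def iter_records(stream, record_class=RecordBase):
    """Yield records from a stream until it is exhausted."""
    header = record_class.HEADER
    while True:
        raw = stream.read(header.size)
        if not raw:
            return                              # clean end of stream
        if len(raw) < header.size:
            raise ValueError('truncated record header')
        rec_type, size = header.unpack(raw)
        data = stream.read(size)
        if len(data) < size:
            raise ValueError('truncated record data')
        yield record_class(rec_type, data)


# Two made-up records: one with a 2-byte payload, one empty.
sample = (struct.pack('<HH', 0x0809, 2) + b'\x00\x06' +
          struct.pack('<HH', 0x000A, 0))
records = list(iter_records(io.BytesIO(sample)))
```

Format-specific parsers would then only subclass the record and interpret `data` per type.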
-
DDE in CSV
-
They actually found a few \ in strings I had overlooked
-
Replace #print(...) with DEBUG_FLAG and conditional print(...)
-
This is not necessary in Python 3.
-
The Python csv sniffer would pick "i" as the delimiter in plain text or "<" in XML. We prefer an error over misinterpretation. Also, try all delimiters, not just a second one. Rename one constant (added CSV_ prefix).
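A sketch of restricting the sniffer to a fixed set of delimiters and trying each in turn. The constant `CSV_DELIMITERS`, its contents and the function `sniff_dialect` are assumptions for illustration; `csv.Sniffer.sniff` and its `delimiters` argument are from the standard library.

```python
import csv

# Hypothetical list of acceptable delimiters; the real constant may differ.
CSV_DELIMITERS = ',;\t|'


def sniff_dialect(sample, delimiters=CSV_DELIMITERS):
    """Return a csv dialect for `sample`, allowing only known delimiters.

    Each candidate is tried on its own; if none fits, raise csv.Error
    instead of letting the sniffer guess some arbitrary character.
    """
    sniffer = csv.Sniffer()
    for delim in delimiters:
        try:
            return sniffer.sniff(sample, delimiters=delim)
        except csv.Error:
            continue                 # this delimiter does not fit, try next
    raise csv.Error('no allowed delimiter fits the sample')
```

Input that contains none of the allowed delimiters, such as XML, then ends in an error rather than a misleading dialect.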