Commit 649f2e0f2e91fe55a93df7ed8a7ca5bd97ff5c2f

Authored by Philippe Lagadec
Parents: 823d07b3, 85bd93ee

Merged in christian_intra2net/oletools_json/ppt-parser (pull request #10)

Ppt parser
oletools/README.html
@@ -14,7 +14,8 @@
 <p>Note: python-oletools is not related to OLETools published by BeCubed Software.</p>
 <h2 id="news">News</h2>
 <ul>
-<li><strong>2016-04-12 v0.45</strong>: improved <a href="https://bitbucket.org/decalage/oletools/wiki/rtfobj">rtfobj</a> to handle several <a href="http://www.decalage.info/rtf_tricks">anti-analysis tricks</a>, improved <a href="https://bitbucket.org/decalage/oletools/wiki/olevba">olevba</a> to export results in JSON format.</li>
+<li><strong>2016-04-19 v0.46</strong>: <a href="https://bitbucket.org/decalage/oletools/wiki/olevba">olevba</a> does not deobfuscate VBA expressions by default (much faster), new option --deobf to enable it. Fixed color display bug on Windows for several tools.</li>
+<li>2016-04-12 v0.45: improved <a href="https://bitbucket.org/decalage/oletools/wiki/rtfobj">rtfobj</a> to handle several <a href="http://www.decalage.info/rtf_tricks">anti-analysis tricks</a>, improved <a href="https://bitbucket.org/decalage/oletools/wiki/olevba">olevba</a> to export results in JSON format.</li>
 <li>2016-03-11 v0.44: improved <a href="https://bitbucket.org/decalage/oletools/wiki/olevba">olevba</a> to extract and analyse strings from VBA Forms.</li>
 <li>2016-03-04 v0.43: added new tool MacroRaptor (mraptor) to detect malicious macros, bugfix and slight improvements in <a href="https://bitbucket.org/decalage/oletools/wiki/olevba">olevba</a>.</li>
 <li>2016-02-07 v0.42: added two new tools oledir and olemap, better handling of malformed files and several bugfixes in <a href="https://bitbucket.org/decalage/oletools/wiki/olevba">olevba</a>, improved display for <a href="https://bitbucket.org/decalage/oletools/wiki/olemeta">olemeta</a>.</li>
oletools/README.rst
@@ -26,7 +26,12 @@ Software.
 News
 ----
 
-- **2016-04-12 v0.45**: improved
+- **2016-04-19 v0.46**:
+  `olevba <https://bitbucket.org/decalage/oletools/wiki/olevba>`__ does
+  not deobfuscate VBA expressions by default (much faster), new option
+  --deobf to enable it. Fixed color display bug on Windows for several
+  tools.
+- 2016-04-12 v0.45: improved
   `rtfobj <https://bitbucket.org/decalage/oletools/wiki/rtfobj>`__ to
   handle several `anti-analysis
   tricks <http://www.decalage.info/rtf_tricks>`__, improved
oletools/olevba.py
@@ -77,7 +77,7 @@ https://github.com/unixfreak0037/officeparser
 # CHANGELOG:
 # 2014-08-05 v0.01 PL: - first version based on officeparser code
 # 2014-08-14 v0.02 PL: - fixed bugs in code, added license from officeparser
-# 2014-08-15       PL: - fixed incorrect value check in PROJECTHELPFILEPATH Record
+# 2014-08-15       PL: - fixed incorrect value check in projecthelpfilepath Record
 # 2014-08-15 v0.03 PL: - refactored extract_macros to support OpenXML formats
 #                        and to find the VBA project root anywhere in the file
 # 2014-11-29 v0.04 PL: - use olefile instead of OleFileIO_PL
@@ -169,8 +169,12 @@ https://github.com/unixfreak0037/officeparser
 # 2016-03-16       CH: - added option --no-deobfuscate (temporary)
 # 2016-04-19 v0.46 PL: - new option --deobf instead of --no-deobfuscate
 #                      - updated suspicious keywords
+# 2016-05-04 v0.47 PL: - look for VBA code in any stream including orphans
+# 2016-04-28       CH: - return an exit code depending on the results
+#                      - improved error and exception handling
+#                      - improved JSON output
 
-__version__ = '0.46'
+__version__ = '0.47'
 
 #------------------------------------------------------------------------------
 # TODO:
@@ -211,10 +215,8 @@ import math
 import zipfile
 import re
 import optparse
-import os.path
 import binascii
 import base64
-import traceback
 import zlib
 import email  # for MHTML parsing
 import string  # for printable
@@ -291,8 +293,51 @@ def get_logger(name, level=logging.CRITICAL+1):
 log = get_logger('olevba')
 
 
+#=== EXCEPTIONS ==============================================================
+
+class FileOpenError(Exception):
+    """ raised by VBA_Parser constructor if all open_... attempts failed
+
+    probably means the file type is not supported
+    """
+
+    def __init__(self, filename):
+        super(FileOpenError, self).__init__(
+            'Failed to open file %s ... probably not supported' % filename)
+        self.filename = filename
+
+
+class ProcessingError(Exception):
+    """ raised by VBA_Parser.process_file* functions """
+
+    def __init__(self, filename, orig_exception):
+        super(ProcessingError, self).__init__(
+            'Error processing file %s (%s)' % (filename, orig_exception))
+        self.filename = filename
+        self.orig_exception = orig_exception
+
+
+class MsoExtractionError(RuntimeError):
+    """ raised by mso_file_extract if parsing MSO/ActiveMIME data failed """
+
+    def __init__(self, msg):
+        super(MsoExtractionError, self).__init__(msg)
+        self.msg = msg
+
+
 #--- CONSTANTS ----------------------------------------------------------------
 
+# return codes
+RETURN_OK = 0
+RETURN_WARNINGS = 1  # (reserved, not used yet)
+RETURN_WRONG_ARGS = 2  # (fixed, built into optparse)
+RETURN_FILE_NOT_FOUND = 3
+RETURN_XGLOB_ERR = 4
+RETURN_OPEN_ERROR = 5
+RETURN_PARSE_ERROR = 6
+RETURN_SEVERAL_ERRS = 7
+RETURN_UNEXPECTED = 8
+
 # URL and message to report issues:
 URL_OLEVBA_ISSUES = 'https://bitbucket.org/decalage/oletools/issues'
 MSG_OLEVBA_ISSUES = 'Please report this issue on %s' % URL_OLEVBA_ISSUES
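The new exception classes and return codes are designed to work together: a caller tries every format-specific opener, raises `FileOpenError` when all of them fail, and maps that outcome to an exit code. A minimal sketch of that flow (not olevba's actual main loop; the opener callbacks are hypothetical):

```python
RETURN_OK = 0
RETURN_OPEN_ERROR = 5


class FileOpenError(Exception):
    """ raised if all open_... attempts failed (file type probably unsupported) """
    def __init__(self, filename):
        super(FileOpenError, self).__init__(
            'Failed to open file %s ... probably not supported' % filename)
        self.filename = filename


def open_with_fallbacks(filename, openers):
    # try each format-specific opener in turn; keep the filename in the
    # exception instead of losing the context in a bare traceback
    for opener in openers:
        try:
            return opener(filename)
        except Exception:
            continue
    raise FileOpenError(filename)


def run(filename, openers):
    # map the outcome to one of the module-level return codes
    try:
        open_with_fallbacks(filename, openers)
        return RETURN_OK
    except FileOpenError:
        return RETURN_OPEN_ERROR
```

Carrying the filename on the exception is what lets the JSON output and the exit-code logic report per-file failures without re-parsing tracebacks.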
@@ -376,7 +421,7 @@ SUSPICIOUS_KEYWORDS = {
     #ADODB.Stream sample: http://pastebin.com/Z4TMyuq6
     'May run an executable file or a system command':
         ('Shell', 'vbNormal', 'vbNormalFocus', 'vbHide', 'vbMinimizedFocus', 'vbMaximizedFocus', 'vbNormalNoFocus',
-         'vbMinimizedNoFocus', 'WScript.Shell', 'Run'),
+         'vbMinimizedNoFocus', 'WScript.Shell', 'Run', 'ShellExecute'),
     #Shell: http://msdn.microsoft.com/en-us/library/office/gg278437%28v=office.15%29.aspx
     #WScript.Shell+Run sample: http://pastebin.com/Z4TMyuq6
     'May run PowerShell commands':
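The `SUSPICIOUS_KEYWORDS` table (extended above with `ShellExecute`) drives a simple case-insensitive scan over the extracted VBA source. A stripped-down sketch of that matching, using a one-entry table for illustration (olevba's real table covers many more categories):

```python
SUSPICIOUS_KEYWORDS = {
    'May run an executable file or a system command':
        ('Shell', 'WScript.Shell', 'Run', 'ShellExecute'),
}


def detect_suspicious(vba_code):
    # case-insensitive substring matching, in the spirit of olevba's analysis
    results = []
    code = vba_code.lower()
    for description, keywords in SUSPICIOUS_KEYWORDS.items():
        for keyword in keywords:
            if keyword.lower() in code:
                results.append((keyword, description))
    return results
```

Substring matching deliberately over-triggers (e.g. `Run` inside a longer word); olevba treats these as hints to review, not verdicts.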
@@ -848,36 +893,37 @@ def mso_file_extract(data):
     :param data: bytes string, MSO/ActiveMime file content
     :return: bytes string, extracted data (uncompressed)
 
-    raise a RuntimeError if the data cannot be extracted
+    raise a MsoExtractionError if the data cannot be extracted
     """
     # check the magic:
     assert is_mso_file(data)
+
+    # In all the samples seen so far, Word always uses an offset of 0x32,
+    # and Excel 0x22A. But we read the offset from the header to be more
+    # generic.
+    offsets = [0x32, 0x22A]
+
     # First, attempt to get the compressed data offset from the header
     # According to my tests, it should be an unsigned 16 bits integer,
     # at offset 0x1E (little endian) + add 46:
     try:
         offset = struct.unpack_from('<H', data, offset=0x1E)[0] + 46
         log.debug('Parsing MSO file: data offset = 0x%X' % offset)
-    except KeyboardInterrupt:
-        # do not ignore exceptions when the user presses Ctrl+C/Pause:
-        raise
-    except:
-        log.exception('Unable to parse MSO/ActiveMime file header')
-        raise RuntimeError('Unable to parse MSO/ActiveMime file header')
-    # In all the samples seen so far, Word always uses an offset of 0x32,
-    # and Excel 0x22A. But we read the offset from the header to be more
-    # generic.
-    # Let's try that offset, then 0x32 and 0x22A, just in case:
-    for start in (offset, 0x32, 0x22A):
+        offsets.insert(0, offset)  # insert at beginning of offsets
+    except struct.error as exc:
+        log.info('Unable to parse MSO/ActiveMime file header (%s)' % exc)
+        log.debug('Trace:', exc_info=True)
+        raise MsoExtractionError('Unable to parse MSO/ActiveMime file header')
+    # now try offsets
+    for start in offsets:
         try:
             log.debug('Attempting zlib decompression from MSO file offset 0x%X' % start)
             extracted_data = zlib.decompress(data[start:])
             return extracted_data
-        except KeyboardInterrupt:
-            # do not ignore exceptions when the user presses Ctrl+C/Pause:
-            raise
-        except:
-            log.exception('zlib decompression failed')
+        except zlib.error as exc:
+            log.info('zlib decompression failed for offset %s (%s)'
+                     % (start, exc))
+            log.debug('Trace:', exc_info=True)
     # None of the guessed offsets worked, let's try brute-forcing by looking
     # for potential zlib-compressed blocks starting with 0x78:
     log.debug('Looking for potential zlib-compressed blocks in MSO file')
@@ -887,12 +933,10 @@ def mso_file_extract(data):
             log.debug('Attempting zlib decompression from MSO file offset 0x%X' % start)
             extracted_data = zlib.decompress(data[start:])
             return extracted_data
-        except KeyboardInterrupt:
-            # do not ignore exceptions when the user presses Ctrl+C/Pause:
-            raise
-        except:
-            log.exception('zlib decompression failed')
-    raise RuntimeError('Unable to decompress data from a MSO/ActiveMime file')
+        except zlib.error as exc:
+            log.info('zlib decompression failed (%s)' % exc)
+            log.debug('Trace:', exc_info=True)
+    raise MsoExtractionError('Unable to decompress data from a MSO/ActiveMime file')
 
 
 #--- FUNCTIONS ----------------------------------------------------------------
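The rewritten `mso_file_extract` keeps a three-stage strategy: try the header-declared offset first, then the offsets observed for Word (0x32) and Excel (0x22A), then brute-force any byte that looks like the start of a zlib stream (0x78). A self-contained sketch of that strategy, simplified (no logging, and `ValueError` standing in for `MsoExtractionError`):

```python
import struct
import zlib


def extract_compressed(data, fallback_offsets=(0x32, 0x22A)):
    # The header stores the data offset as a uint16 LE at 0x1E, plus 46.
    offsets = list(fallback_offsets)
    try:
        offsets.insert(0, struct.unpack_from('<H', data, offset=0x1E)[0] + 46)
    except struct.error:
        pass  # header too short to carry an offset; rely on the fallbacks
    for start in offsets:
        try:
            return zlib.decompress(data[start:])
        except zlib.error:
            continue
    # Last resort: try every 0x78 byte as a potential zlib stream start.
    for start, byte in enumerate(data):
        if byte == 0x78:
            try:
                return zlib.decompress(data[start:])
            except zlib.error:
                continue
    raise ValueError('unable to decompress MSO/ActiveMime payload')
```

Catching only `struct.error` and `zlib.error`, as the patch does, removes the need for the old `except KeyboardInterrupt: raise` guards: Ctrl+C now propagates naturally.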
@@ -913,29 +957,6 @@ def is_printable(s):
     return set(s).issubset(_PRINTABLE_SET)
 
 
-def print_json(j):
-    """
-    Print a dictionary, a list or any other object to stdout
-    :param j: object to be printed
-    :return:
-    """
-    if isinstance(j, dict):
-        for key, val in j.items():
-            print_json(key)
-            print_json(val)
-    elif isinstance(j, list):
-        for elem in j:
-            print_json(elem)
-    else:
-        try:
-            if len(j) > 20:
-                print type(j), repr(j[:20]), '...(len {0})'.format(len(j))
-            else:
-                print type(j), repr(j)
-        except TypeError:
-            print type(j), repr(j)
-
-
 def copytoken_help(decompressed_current, decompressed_chunk_start):
     """
     compute bit masks to decode a CopyToken according to MS-OVBA 2.4.1.3.19.1 CopyToken Help
@@ -1059,7 +1080,7 @@ def decompress_stream(compressed_container):
                 copy_token = \
                     struct.unpack("<H", compressed_container[compressed_current:compressed_current + 2])[0]
                 #TODO: check this
-                length_mask, offset_mask, bit_count, maximum_length = copytoken_help(
+                length_mask, offset_mask, bit_count, _ = copytoken_help(
                     len(decompressed_container), decompressed_chunk_start)
                 length = (copy_token & length_mask) + 3
                 temp1 = copy_token & offset_mask
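`copytoken_help`, whose unused `maximum_length` result the hunk above now discards, derives the CopyToken bit masks from MS-OVBA 2.4.1.3.19.1: the further into the chunk the decompressor is, the more bits the token spends on the offset and the fewer on the length. A standalone version of those formulas (here `offset_mask` is masked to 16 bits for clarity, which olevba's `~length_mask` leaves implicit):

```python
import math


def copytoken_help(decompressed_current, decompressed_chunk_start):
    """ bit masks for a CopyToken, per MS-OVBA 2.4.1.3.19.1 CopyToken Help """
    # difference must be >= 1: a CopyToken can only point back into the chunk
    difference = decompressed_current - decompressed_chunk_start
    bit_count = max(int(math.ceil(math.log(difference, 2))), 4)
    length_mask = 0xFFFF >> bit_count          # low bits encode the length
    offset_mask = (~length_mask) & 0xFFFF      # high bits encode the offset
    maximum_length = (0xFFFF >> bit_count) + 3
    return length_mask, offset_mask, bit_count, maximum_length
```

For example, 16 bytes into a chunk, `bit_count` is 4, so a token splits into a 4-bit offset field (`0xF000`) and a 12-bit length field (`0x0FFF`).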
@@ -1138,122 +1159,130 @@ def _extract_vba(ole, vba_root, project_path, dir_path):
     dir_stream = cStringIO.StringIO(decompress_stream(dir_compressed))
 
     # PROJECTSYSKIND Record
-    PROJECTSYSKIND_Id = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTSYSKIND_Id', 0x0001, PROJECTSYSKIND_Id)
-    PROJECTSYSKIND_Size = struct.unpack("<L", dir_stream.read(4))[0]
-    check_value('PROJECTSYSKIND_Size', 0x0004, PROJECTSYSKIND_Size)
-    PROJECTSYSKIND_SysKind = struct.unpack("<L", dir_stream.read(4))[0]
-    if PROJECTSYSKIND_SysKind == 0x00:
+    projectsyskind_id = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTSYSKIND_Id', 0x0001, projectsyskind_id)
+    projectsyskind_size = struct.unpack("<L", dir_stream.read(4))[0]
+    check_value('PROJECTSYSKIND_Size', 0x0004, projectsyskind_size)
+    projectsyskind_syskind = struct.unpack("<L", dir_stream.read(4))[0]
+    if projectsyskind_syskind == 0x00:
         log.debug("16-bit Windows")
-    elif PROJECTSYSKIND_SysKind == 0x01:
+    elif projectsyskind_syskind == 0x01:
         log.debug("32-bit Windows")
-    elif PROJECTSYSKIND_SysKind == 0x02:
+    elif projectsyskind_syskind == 0x02:
         log.debug("Macintosh")
-    elif PROJECTSYSKIND_SysKind == 0x03:
+    elif projectsyskind_syskind == 0x03:
         log.debug("64-bit Windows")
     else:
-        log.error("invalid PROJECTSYSKIND_SysKind {0:04X}".format(PROJECTSYSKIND_SysKind))
+        log.error("invalid PROJECTSYSKIND_SysKind {0:04X}".format(projectsyskind_syskind))
 
     # PROJECTLCID Record
-    PROJECTLCID_Id = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTLCID_Id', 0x0002, PROJECTLCID_Id)
-    PROJECTLCID_Size = struct.unpack("<L", dir_stream.read(4))[0]
-    check_value('PROJECTLCID_Size', 0x0004, PROJECTLCID_Size)
-    PROJECTLCID_Lcid = struct.unpack("<L", dir_stream.read(4))[0]
-    check_value('PROJECTLCID_Lcid', 0x409, PROJECTLCID_Lcid)
+    projectlcid_id = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTLCID_Id', 0x0002, projectlcid_id)
+    projectlcid_size = struct.unpack("<L", dir_stream.read(4))[0]
+    check_value('PROJECTLCID_Size', 0x0004, projectlcid_size)
+    projectlcid_lcid = struct.unpack("<L", dir_stream.read(4))[0]
+    check_value('PROJECTLCID_Lcid', 0x409, projectlcid_lcid)
 
     # PROJECTLCIDINVOKE Record
-    PROJECTLCIDINVOKE_Id = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTLCIDINVOKE_Id', 0x0014, PROJECTLCIDINVOKE_Id)
-    PROJECTLCIDINVOKE_Size = struct.unpack("<L", dir_stream.read(4))[0]
-    check_value('PROJECTLCIDINVOKE_Size', 0x0004, PROJECTLCIDINVOKE_Size)
-    PROJECTLCIDINVOKE_LcidInvoke = struct.unpack("<L", dir_stream.read(4))[0]
-    check_value('PROJECTLCIDINVOKE_LcidInvoke', 0x409, PROJECTLCIDINVOKE_LcidInvoke)
+    projectlcidinvoke_id = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTLCIDINVOKE_Id', 0x0014, projectlcidinvoke_id)
+    projectlcidinvoke_size = struct.unpack("<L", dir_stream.read(4))[0]
+    check_value('PROJECTLCIDINVOKE_Size', 0x0004, projectlcidinvoke_size)
+    projectlcidinvoke_lcidinvoke = struct.unpack("<L", dir_stream.read(4))[0]
+    check_value('PROJECTLCIDINVOKE_LcidInvoke', 0x409, projectlcidinvoke_lcidinvoke)
 
     # PROJECTCODEPAGE Record
-    PROJECTCODEPAGE_Id = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTCODEPAGE_Id', 0x0003, PROJECTCODEPAGE_Id)
-    PROJECTCODEPAGE_Size = struct.unpack("<L", dir_stream.read(4))[0]
-    check_value('PROJECTCODEPAGE_Size', 0x0002, PROJECTCODEPAGE_Size)
-    PROJECTCODEPAGE_CodePage = struct.unpack("<H", dir_stream.read(2))[0]
+    projectcodepage_id = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTCODEPAGE_Id', 0x0003, projectcodepage_id)
+    projectcodepage_size = struct.unpack("<L", dir_stream.read(4))[0]
+    check_value('PROJECTCODEPAGE_Size', 0x0002, projectcodepage_size)
+    projectcodepage_codepage = struct.unpack("<H", dir_stream.read(2))[0]
 
     # PROJECTNAME Record
-    PROJECTNAME_Id = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTNAME_Id', 0x0004, PROJECTNAME_Id)
-    PROJECTNAME_SizeOfProjectName = struct.unpack("<L", dir_stream.read(4))[0]
-    if PROJECTNAME_SizeOfProjectName < 1 or PROJECTNAME_SizeOfProjectName > 128:
-        log.error("PROJECTNAME_SizeOfProjectName value not in range: {0}".format(PROJECTNAME_SizeOfProjectName))
-    PROJECTNAME_ProjectName = dir_stream.read(PROJECTNAME_SizeOfProjectName)
+    projectname_id = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTNAME_Id', 0x0004, projectname_id)
+    projectname_sizeof_projectname = struct.unpack("<L", dir_stream.read(4))[0]
+    if projectname_sizeof_projectname < 1 or projectname_sizeof_projectname > 128:
+        log.error("PROJECTNAME_SizeOfProjectName value not in range: {0}".format(projectname_sizeof_projectname))
+    projectname_projectname = dir_stream.read(projectname_sizeof_projectname)
+    unused = projectname_projectname
 
     # PROJECTDOCSTRING Record
-    PROJECTDOCSTRING_Id = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTDOCSTRING_Id', 0x0005, PROJECTDOCSTRING_Id)
-    PROJECTDOCSTRING_SizeOfDocString = struct.unpack("<L", dir_stream.read(4))[0]
-    if PROJECTNAME_SizeOfProjectName > 2000:
+    projectdocstring_id = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTDOCSTRING_Id', 0x0005, projectdocstring_id)
+    projectdocstring_sizeof_docstring = struct.unpack("<L", dir_stream.read(4))[0]
+    if projectdocstring_sizeof_docstring > 2000:
         log.error(
-            "PROJECTDOCSTRING_SizeOfDocString value not in range: {0}".format(PROJECTDOCSTRING_SizeOfDocString))
-    PROJECTDOCSTRING_DocString = dir_stream.read(PROJECTDOCSTRING_SizeOfDocString)
-    PROJECTDOCSTRING_Reserved = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTDOCSTRING_Reserved', 0x0040, PROJECTDOCSTRING_Reserved)
-    PROJECTDOCSTRING_SizeOfDocStringUnicode = struct.unpack("<L", dir_stream.read(4))[0]
-    if PROJECTDOCSTRING_SizeOfDocStringUnicode % 2 != 0:
+            "PROJECTDOCSTRING_SizeOfDocString value not in range: {0}".format(projectdocstring_sizeof_docstring))
+    projectdocstring_docstring = dir_stream.read(projectdocstring_sizeof_docstring)
+    projectdocstring_reserved = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTDOCSTRING_Reserved', 0x0040, projectdocstring_reserved)
+    projectdocstring_sizeof_docstring_unicode = struct.unpack("<L", dir_stream.read(4))[0]
+    if projectdocstring_sizeof_docstring_unicode % 2 != 0:
         log.error("PROJECTDOCSTRING_SizeOfDocStringUnicode is not even")
-    PROJECTDOCSTRING_DocStringUnicode = dir_stream.read(PROJECTDOCSTRING_SizeOfDocStringUnicode)
+    projectdocstring_docstring_unicode = dir_stream.read(projectdocstring_sizeof_docstring_unicode)
+    unused = projectdocstring_docstring
+    unused = projectdocstring_docstring_unicode
 
     # PROJECTHELPFILEPATH Record - MS-OVBA 2.3.4.2.1.7
-    PROJECTHELPFILEPATH_Id = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTHELPFILEPATH_Id', 0x0006, PROJECTHELPFILEPATH_Id)
-    PROJECTHELPFILEPATH_SizeOfHelpFile1 = struct.unpack("<L", dir_stream.read(4))[0]
-    if PROJECTHELPFILEPATH_SizeOfHelpFile1 > 260:
+    projecthelpfilepath_id = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTHELPFILEPATH_Id', 0x0006, projecthelpfilepath_id)
+    projecthelpfilepath_sizeof_helpfile1 = struct.unpack("<L", dir_stream.read(4))[0]
+    if projecthelpfilepath_sizeof_helpfile1 > 260:
         log.error(
-            "PROJECTHELPFILEPATH_SizeOfHelpFile1 value not in range: {0}".format(PROJECTHELPFILEPATH_SizeOfHelpFile1))
-    PROJECTHELPFILEPATH_HelpFile1 = dir_stream.read(PROJECTHELPFILEPATH_SizeOfHelpFile1)
-    PROJECTHELPFILEPATH_Reserved = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTHELPFILEPATH_Reserved', 0x003D, PROJECTHELPFILEPATH_Reserved)
-    PROJECTHELPFILEPATH_SizeOfHelpFile2 = struct.unpack("<L", dir_stream.read(4))[0]
-    if PROJECTHELPFILEPATH_SizeOfHelpFile2 != PROJECTHELPFILEPATH_SizeOfHelpFile1:
+            "PROJECTHELPFILEPATH_SizeOfHelpFile1 value not in range: {0}".format(projecthelpfilepath_sizeof_helpfile1))
+    projecthelpfilepath_helpfile1 = dir_stream.read(projecthelpfilepath_sizeof_helpfile1)
+    projecthelpfilepath_reserved = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTHELPFILEPATH_Reserved', 0x003D, projecthelpfilepath_reserved)
+    projecthelpfilepath_sizeof_helpfile2 = struct.unpack("<L", dir_stream.read(4))[0]
+    if projecthelpfilepath_sizeof_helpfile2 != projecthelpfilepath_sizeof_helpfile1:
         log.error("PROJECTHELPFILEPATH_SizeOfHelpFile1 does not equal PROJECTHELPFILEPATH_SizeOfHelpFile2")
-    PROJECTHELPFILEPATH_HelpFile2 = dir_stream.read(PROJECTHELPFILEPATH_SizeOfHelpFile2)
-    if PROJECTHELPFILEPATH_HelpFile2 != PROJECTHELPFILEPATH_HelpFile1:
+    projecthelpfilepath_helpfile2 = dir_stream.read(projecthelpfilepath_sizeof_helpfile2)
+    if projecthelpfilepath_helpfile2 != projecthelpfilepath_helpfile1:
         log.error("PROJECTHELPFILEPATH_HelpFile1 does not equal PROJECTHELPFILEPATH_HelpFile2")
 
     # PROJECTHELPCONTEXT Record
-    PROJECTHELPCONTEXT_Id = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTHELPCONTEXT_Id', 0x0007, PROJECTHELPCONTEXT_Id)
-    PROJECTHELPCONTEXT_Size = struct.unpack("<L", dir_stream.read(4))[0]
-    check_value('PROJECTHELPCONTEXT_Size', 0x0004, PROJECTHELPCONTEXT_Size)
-    PROJECTHELPCONTEXT_HelpContext = struct.unpack("<L", dir_stream.read(4))[0]
+    projecthelpcontext_id = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTHELPCONTEXT_Id', 0x0007, projecthelpcontext_id)
+    projecthelpcontext_size = struct.unpack("<L", dir_stream.read(4))[0]
+    check_value('PROJECTHELPCONTEXT_Size', 0x0004, projecthelpcontext_size)
+    projecthelpcontext_helpcontext = struct.unpack("<L", dir_stream.read(4))[0]
+    unused = projecthelpcontext_helpcontext
 
     # PROJECTLIBFLAGS Record
-    PROJECTLIBFLAGS_Id = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTLIBFLAGS_Id', 0x0008, PROJECTLIBFLAGS_Id)
-    PROJECTLIBFLAGS_Size = struct.unpack("<L", dir_stream.read(4))[0]
-    check_value('PROJECTLIBFLAGS_Size', 0x0004, PROJECTLIBFLAGS_Size)
-    PROJECTLIBFLAGS_ProjectLibFlags = struct.unpack("<L", dir_stream.read(4))[0]
-    check_value('PROJECTLIBFLAGS_ProjectLibFlags', 0x0000, PROJECTLIBFLAGS_ProjectLibFlags)
+    projectlibflags_id = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTLIBFLAGS_Id', 0x0008, projectlibflags_id)
+    projectlibflags_size = struct.unpack("<L", dir_stream.read(4))[0]
+    check_value('PROJECTLIBFLAGS_Size', 0x0004, projectlibflags_size)
+    projectlibflags_projectlibflags = struct.unpack("<L", dir_stream.read(4))[0]
+    check_value('PROJECTLIBFLAGS_ProjectLibFlags', 0x0000, projectlibflags_projectlibflags)
 
     # PROJECTVERSION Record
-    PROJECTVERSION_Id = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTVERSION_Id', 0x0009, PROJECTVERSION_Id)
-    PROJECTVERSION_Reserved = struct.unpack("<L", dir_stream.read(4))[0]
-    check_value('PROJECTVERSION_Reserved', 0x0004, PROJECTVERSION_Reserved)
-    PROJECTVERSION_VersionMajor = struct.unpack("<L", dir_stream.read(4))[0]
-    PROJECTVERSION_VersionMinor = struct.unpack("<H", dir_stream.read(2))[0]
+    projectversion_id = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTVERSION_Id', 0x0009, projectversion_id)
+    projectversion_reserved = struct.unpack("<L", dir_stream.read(4))[0]
+    check_value('PROJECTVERSION_Reserved', 0x0004, projectversion_reserved)
+    projectversion_versionmajor = struct.unpack("<L", dir_stream.read(4))[0]
+    projectversion_versionminor = struct.unpack("<H", dir_stream.read(2))[0]
+    unused = projectversion_versionmajor
+    unused = projectversion_versionminor
 
     # PROJECTCONSTANTS Record
-    PROJECTCONSTANTS_Id = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTCONSTANTS_Id', 0x000C, PROJECTCONSTANTS_Id)
-    PROJECTCONSTANTS_SizeOfConstants = struct.unpack("<L", dir_stream.read(4))[0]
-    if PROJECTCONSTANTS_SizeOfConstants > 1015:
+    projectconstants_id = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTCONSTANTS_Id', 0x000C, projectconstants_id)
+    projectconstants_sizeof_constants = struct.unpack("<L", dir_stream.read(4))[0]
+    if projectconstants_sizeof_constants > 1015:
         log.error(
-            "PROJECTCONSTANTS_SizeOfConstants value not in range: {0}".format(PROJECTCONSTANTS_SizeOfConstants))
-    PROJECTCONSTANTS_Constants = dir_stream.read(PROJECTCONSTANTS_SizeOfConstants)
-    PROJECTCONSTANTS_Reserved = struct.unpack("<H", dir_stream.read(2))[0]
-    check_value('PROJECTCONSTANTS_Reserved', 0x003C, PROJECTCONSTANTS_Reserved)
-    PROJECTCONSTANTS_SizeOfConstantsUnicode = struct.unpack("<L", dir_stream.read(4))[0]
-    if PROJECTCONSTANTS_SizeOfConstantsUnicode % 2 != 0:
+            "PROJECTCONSTANTS_SizeOfConstants value not in range: {0}".format(projectconstants_sizeof_constants))
+    projectconstants_constants = dir_stream.read(projectconstants_sizeof_constants)
+    projectconstants_reserved = struct.unpack("<H", dir_stream.read(2))[0]
+    check_value('PROJECTCONSTANTS_Reserved', 0x003C, projectconstants_reserved)
+    projectconstants_sizeof_constants_unicode = struct.unpack("<L", dir_stream.read(4))[0]
+    if projectconstants_sizeof_constants_unicode % 2 != 0:
         log.error("PROJECTCONSTANTS_SizeOfConstantsUnicode is not even")
-    PROJECTCONSTANTS_ConstantsUnicode = dir_stream.read(PROJECTCONSTANTS_SizeOfConstantsUnicode)
+    projectconstants_constants_unicode = dir_stream.read(projectconstants_sizeof_constants_unicode)
+    unused = projectconstants_constants
+    unused = projectconstants_constants_unicode
 
     # array of REFERENCE records
     check = None
@@ -1265,194 +1294,230 @@ def _extract_vba(ole, vba_root, project_path, dir_path):
 
         if check == 0x0016:
             # REFERENCENAME
-            REFERENCE_Id = check
-            REFERENCE_SizeOfName = struct.unpack("<L", dir_stream.read(4))[0]
-            REFERENCE_Name = dir_stream.read(REFERENCE_SizeOfName)
-            REFERENCE_Reserved = struct.unpack("<H", dir_stream.read(2))[0]
-            check_value('REFERENCE_Reserved', 0x003E, REFERENCE_Reserved)
-            REFERENCE_SizeOfNameUnicode = struct.unpack("<L", dir_stream.read(4))[0]
-            REFERENCE_NameUnicode = dir_stream.read(REFERENCE_SizeOfNameUnicode)
+            reference_id = check
+            reference_sizeof_name = struct.unpack("<L", dir_stream.read(4))[0]
+            reference_name = dir_stream.read(reference_sizeof_name)
+            reference_reserved = struct.unpack("<H", dir_stream.read(2))[0]
+            check_value('REFERENCE_Reserved', 0x003E, reference_reserved)
+            reference_sizeof_name_unicode = struct.unpack("<L", dir_stream.read(4))[0]
+            reference_name_unicode = dir_stream.read(reference_sizeof_name_unicode)
+            unused = reference_id
+            unused = reference_name
+            unused = reference_name_unicode
             continue
 
         if check == 0x0033:
             # REFERENCEORIGINAL (followed by REFERENCECONTROL)
-            REFERENCEORIGINAL_Id = check
-            REFERENCEORIGINAL_SizeOfLibidOriginal = struct.unpack("<L", dir_stream.read(4))[0]
-            REFERENCEORIGINAL_LibidOriginal = dir_stream.read(REFERENCEORIGINAL_SizeOfLibidOriginal)
+            referenceoriginal_id = check
+            referenceoriginal_sizeof_libidoriginal = struct.unpack("<L", dir_stream.read(4))[0]
+            referenceoriginal_libidoriginal = dir_stream.read(referenceoriginal_sizeof_libidoriginal)
+            unused = referenceoriginal_id
+            unused = referenceoriginal_libidoriginal
            continue
 
         if check == 0x002F:
             # REFERENCECONTROL
-            REFERENCECONTROL_Id = check
-            REFERENCECONTROL_SizeTwiddled = struct.unpack("<L", dir_stream.read(4))[0]  # ignore
-            REFERENCECONTROL_SizeOfLibidTwiddled = struct.unpack("<L", dir_stream.read(4))[0]
-            REFERENCECONTROL_LibidTwiddled = dir_stream.read(REFERENCECONTROL_SizeOfLibidTwiddled)
1290 - REFERENCECONTROL_Reserved1 = struct.unpack("<L", dir_stream.read(4))[0] # ignore  
1291 - check_value('REFERENCECONTROL_Reserved1', 0x0000, REFERENCECONTROL_Reserved1)  
1292 - REFERENCECONTROL_Reserved2 = struct.unpack("<H", dir_stream.read(2))[0] # ignore  
1293 - check_value('REFERENCECONTROL_Reserved2', 0x0000, REFERENCECONTROL_Reserved2) 1320 + referencecontrol_id = check
  1321 + referencecontrol_sizetwiddled = struct.unpack("<L", dir_stream.read(4))[0] # ignore
  1322 + referencecontrol_sizeof_libidtwiddled = struct.unpack("<L", dir_stream.read(4))[0]
  1323 + referencecontrol_libidtwiddled = dir_stream.read(referencecontrol_sizeof_libidtwiddled)
  1324 + referencecontrol_reserved1 = struct.unpack("<L", dir_stream.read(4))[0] # ignore
  1325 + check_value('REFERENCECONTROL_Reserved1', 0x0000, referencecontrol_reserved1)
  1326 + referencecontrol_reserved2 = struct.unpack("<H", dir_stream.read(2))[0] # ignore
  1327 + check_value('REFERENCECONTROL_Reserved2', 0x0000, referencecontrol_reserved2)
  1328 + unused = referencecontrol_id
  1329 + unused = referencecontrol_sizetwiddled
  1330 + unused = referencecontrol_libidtwiddled
1294 # optional field 1331 # optional field
1295 check2 = struct.unpack("<H", dir_stream.read(2))[0] 1332 check2 = struct.unpack("<H", dir_stream.read(2))[0]
1296 if check2 == 0x0016: 1333 if check2 == 0x0016:
1297 - REFERENCECONTROL_NameRecordExtended_Id = check  
1298 - REFERENCECONTROL_NameRecordExtended_SizeofName = struct.unpack("<L", dir_stream.read(4))[0]  
1299 - REFERENCECONTROL_NameRecordExtended_Name = dir_stream.read(  
1300 - REFERENCECONTROL_NameRecordExtended_SizeofName)  
1301 - REFERENCECONTROL_NameRecordExtended_Reserved = struct.unpack("<H", dir_stream.read(2))[0] 1334 + referencecontrol_namerecordextended_id = check
  1335 + referencecontrol_namerecordextended_sizeof_name = struct.unpack("<L", dir_stream.read(4))[0]
  1336 + referencecontrol_namerecordextended_name = dir_stream.read(
  1337 + referencecontrol_namerecordextended_sizeof_name)
  1338 + referencecontrol_namerecordextended_reserved = struct.unpack("<H", dir_stream.read(2))[0]
1302 check_value('REFERENCECONTROL_NameRecordExtended_Reserved', 0x003E, 1339 check_value('REFERENCECONTROL_NameRecordExtended_Reserved', 0x003E,
1303 - REFERENCECONTROL_NameRecordExtended_Reserved)  
1304 - REFERENCECONTROL_NameRecordExtended_SizeOfNameUnicode = struct.unpack("<L", dir_stream.read(4))[0]  
1305 - REFERENCECONTROL_NameRecordExtended_NameUnicode = dir_stream.read(  
1306 - REFERENCECONTROL_NameRecordExtended_SizeOfNameUnicode)  
1307 - REFERENCECONTROL_Reserved3 = struct.unpack("<H", dir_stream.read(2))[0] 1340 + referencecontrol_namerecordextended_reserved)
  1341 + referencecontrol_namerecordextended_sizeof_name_unicode = struct.unpack("<L", dir_stream.read(4))[0]
  1342 + referencecontrol_namerecordextended_name_unicode = dir_stream.read(
  1343 + referencecontrol_namerecordextended_sizeof_name_unicode)
  1344 + referencecontrol_reserved3 = struct.unpack("<H", dir_stream.read(2))[0]
  1345 + unused = referencecontrol_namerecordextended_id
  1346 + unused = referencecontrol_namerecordextended_name
  1347 + unused = referencecontrol_namerecordextended_name_unicode
1308 else: 1348 else:
1309 - REFERENCECONTROL_Reserved3 = check2  
1310 -  
1311 - check_value('REFERENCECONTROL_Reserved3', 0x0030, REFERENCECONTROL_Reserved3)  
1312 - REFERENCECONTROL_SizeExtended = struct.unpack("<L", dir_stream.read(4))[0]  
1313 - REFERENCECONTROL_SizeOfLibidExtended = struct.unpack("<L", dir_stream.read(4))[0]  
1314 - REFERENCECONTROL_LibidExtended = dir_stream.read(REFERENCECONTROL_SizeOfLibidExtended)  
1315 - REFERENCECONTROL_Reserved4 = struct.unpack("<L", dir_stream.read(4))[0]  
1316 - REFERENCECONTROL_Reserved5 = struct.unpack("<H", dir_stream.read(2))[0]  
1317 - REFERENCECONTROL_OriginalTypeLib = dir_stream.read(16)  
1318 - REFERENCECONTROL_Cookie = struct.unpack("<L", dir_stream.read(4))[0] 1349 + referencecontrol_reserved3 = check2
  1350 +
  1351 + check_value('REFERENCECONTROL_Reserved3', 0x0030, referencecontrol_reserved3)
  1352 + referencecontrol_sizeextended = struct.unpack("<L", dir_stream.read(4))[0]
  1353 + referencecontrol_sizeof_libidextended = struct.unpack("<L", dir_stream.read(4))[0]
  1354 + referencecontrol_libidextended = dir_stream.read(referencecontrol_sizeof_libidextended)
  1355 + referencecontrol_reserved4 = struct.unpack("<L", dir_stream.read(4))[0]
  1356 + referencecontrol_reserved5 = struct.unpack("<H", dir_stream.read(2))[0]
  1357 + referencecontrol_originaltypelib = dir_stream.read(16)
  1358 + referencecontrol_cookie = struct.unpack("<L", dir_stream.read(4))[0]
  1359 + unused = referencecontrol_sizeextended
  1360 + unused = referencecontrol_libidextended
  1361 + unused = referencecontrol_reserved4
  1362 + unused = referencecontrol_reserved5
  1363 + unused = referencecontrol_originaltypelib
  1364 + unused = referencecontrol_cookie
1319 continue 1365 continue
1320 1366
1321 if check == 0x000D: 1367 if check == 0x000D:
1322 # REFERENCEREGISTERED 1368 # REFERENCEREGISTERED
1323 - REFERENCEREGISTERED_Id = check  
1324 - REFERENCEREGISTERED_Size = struct.unpack("<L", dir_stream.read(4))[0]  
1325 - REFERENCEREGISTERED_SizeOfLibid = struct.unpack("<L", dir_stream.read(4))[0]  
1326 - REFERENCEREGISTERED_Libid = dir_stream.read(REFERENCEREGISTERED_SizeOfLibid)  
1327 - REFERENCEREGISTERED_Reserved1 = struct.unpack("<L", dir_stream.read(4))[0]  
1328 - check_value('REFERENCEREGISTERED_Reserved1', 0x0000, REFERENCEREGISTERED_Reserved1)  
1329 - REFERENCEREGISTERED_Reserved2 = struct.unpack("<H", dir_stream.read(2))[0]  
1330 - check_value('REFERENCEREGISTERED_Reserved2', 0x0000, REFERENCEREGISTERED_Reserved2) 1369 + referenceregistered_id = check
  1370 + referenceregistered_size = struct.unpack("<L", dir_stream.read(4))[0]
  1371 + referenceregistered_sizeof_libid = struct.unpack("<L", dir_stream.read(4))[0]
  1372 + referenceregistered_libid = dir_stream.read(referenceregistered_sizeof_libid)
  1373 + referenceregistered_reserved1 = struct.unpack("<L", dir_stream.read(4))[0]
  1374 + check_value('REFERENCEREGISTERED_Reserved1', 0x0000, referenceregistered_reserved1)
  1375 + referenceregistered_reserved2 = struct.unpack("<H", dir_stream.read(2))[0]
  1376 + check_value('REFERENCEREGISTERED_Reserved2', 0x0000, referenceregistered_reserved2)
  1377 + unused = referenceregistered_id
  1378 + unused = referenceregistered_size
  1379 + unused = referenceregistered_libid
1331 continue 1380 continue
1332 1381
1333 if check == 0x000E: 1382 if check == 0x000E:
1334 # REFERENCEPROJECT 1383 # REFERENCEPROJECT
1335 - REFERENCEPROJECT_Id = check  
1336 - REFERENCEPROJECT_Size = struct.unpack("<L", dir_stream.read(4))[0]  
1337 - REFERENCEPROJECT_SizeOfLibidAbsolute = struct.unpack("<L", dir_stream.read(4))[0]  
1338 - REFERENCEPROJECT_LibidAbsolute = dir_stream.read(REFERENCEPROJECT_SizeOfLibidAbsolute)  
1339 - REFERENCEPROJECT_SizeOfLibidRelative = struct.unpack("<L", dir_stream.read(4))[0]  
1340 - REFERENCEPROJECT_LibidRelative = dir_stream.read(REFERENCEPROJECT_SizeOfLibidRelative)  
1341 - REFERENCEPROJECT_MajorVersion = struct.unpack("<L", dir_stream.read(4))[0]  
1342 - REFERENCEPROJECT_MinorVersion = struct.unpack("<H", dir_stream.read(2))[0] 1384 + referenceproject_id = check
  1385 + referenceproject_size = struct.unpack("<L", dir_stream.read(4))[0]
  1386 + referenceproject_sizeof_libidabsolute = struct.unpack("<L", dir_stream.read(4))[0]
  1387 + referenceproject_libidabsolute = dir_stream.read(referenceproject_sizeof_libidabsolute)
  1388 + referenceproject_sizeof_libidrelative = struct.unpack("<L", dir_stream.read(4))[0]
  1389 + referenceproject_libidrelative = dir_stream.read(referenceproject_sizeof_libidrelative)
  1390 + referenceproject_majorversion = struct.unpack("<L", dir_stream.read(4))[0]
  1391 + referenceproject_minorversion = struct.unpack("<H", dir_stream.read(2))[0]
  1392 + unused = referenceproject_id
  1393 + unused = referenceproject_size
  1394 + unused = referenceproject_libidabsolute
  1395 + unused = referenceproject_libidrelative
  1396 + unused = referenceproject_majorversion
  1397 + unused = referenceproject_minorversion
1343 continue 1398 continue
1344 1399
1345 log.error('invalid or unknown check Id {0:04X}'.format(check)) 1400 log.error('invalid or unknown check Id {0:04X}'.format(check))
1346 sys.exit(0) 1401 sys.exit(0)
1347 1402
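The chain of `if check == …: … continue` tests above dispatches on the 16-bit record id, falling through to an error for unknown ids. The same control flow can be sketched with a handler table (the two ids are taken from the hunk above; the handler bodies are illustrative stubs, not the real parsers):

```python
def handle_reference_name(stream):        # record id 0x0016
    return "REFERENCENAME"

def handle_reference_registered(stream):  # record id 0x000D
    return "REFERENCEREGISTERED"

HANDLERS = {
    0x0016: handle_reference_name,
    0x000D: handle_reference_registered,
}

def dispatch(check, stream):
    """Route a record id to its parser, mirroring the if/continue chain."""
    handler = HANDLERS.get(check)
    if handler is None:
        raise ValueError("invalid or unknown check Id {0:04X}".format(check))
    return handler(stream)
```

Raising an exception (rather than `sys.exit(0)` as in the code above) would let library callers recover from malformed input.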
1348 - PROJECTMODULES_Id = check #struct.unpack("<H", dir_stream.read(2))[0]  
1349 - check_value('PROJECTMODULES_Id', 0x000F, PROJECTMODULES_Id)  
1350 - PROJECTMODULES_Size = struct.unpack("<L", dir_stream.read(4))[0]  
1351 - check_value('PROJECTMODULES_Size', 0x0002, PROJECTMODULES_Size)  
1352 - PROJECTMODULES_Count = struct.unpack("<H", dir_stream.read(2))[0]  
1353 - PROJECTMODULES_ProjectCookieRecord_Id = struct.unpack("<H", dir_stream.read(2))[0]  
1354 - check_value('PROJECTMODULES_ProjectCookieRecord_Id', 0x0013, PROJECTMODULES_ProjectCookieRecord_Id)  
1355 - PROJECTMODULES_ProjectCookieRecord_Size = struct.unpack("<L", dir_stream.read(4))[0]  
1356 - check_value('PROJECTMODULES_ProjectCookieRecord_Size', 0x0002, PROJECTMODULES_ProjectCookieRecord_Size)  
1357 - PROJECTMODULES_ProjectCookieRecord_Cookie = struct.unpack("<H", dir_stream.read(2))[0]  
1358 -  
1359 - log.debug("parsing {0} modules".format(PROJECTMODULES_Count))  
1360 - for x in xrange(0, PROJECTMODULES_Count):  
1361 - MODULENAME_Id = struct.unpack("<H", dir_stream.read(2))[0]  
1362 - check_value('MODULENAME_Id', 0x0019, MODULENAME_Id)  
1363 - MODULENAME_SizeOfModuleName = struct.unpack("<L", dir_stream.read(4))[0]  
1364 - MODULENAME_ModuleName = dir_stream.read(MODULENAME_SizeOfModuleName) 1403 + projectmodules_id = check #struct.unpack("<H", dir_stream.read(2))[0]
  1404 + check_value('PROJECTMODULES_Id', 0x000F, projectmodules_id)
  1405 + projectmodules_size = struct.unpack("<L", dir_stream.read(4))[0]
  1406 + check_value('PROJECTMODULES_Size', 0x0002, projectmodules_size)
  1407 + projectmodules_count = struct.unpack("<H", dir_stream.read(2))[0]
  1408 + projectmodules_projectcookierecord_id = struct.unpack("<H", dir_stream.read(2))[0]
  1409 + check_value('PROJECTMODULES_ProjectCookieRecord_Id', 0x0013, projectmodules_projectcookierecord_id)
  1410 + projectmodules_projectcookierecord_size = struct.unpack("<L", dir_stream.read(4))[0]
  1411 + check_value('PROJECTMODULES_ProjectCookieRecord_Size', 0x0002, projectmodules_projectcookierecord_size)
  1412 + projectmodules_projectcookierecord_cookie = struct.unpack("<H", dir_stream.read(2))[0]
  1413 + unused = projectmodules_projectcookierecord_cookie
  1414 +
  1415 + log.debug("parsing {0} modules".format(projectmodules_count))
  1416 + for _ in xrange(0, projectmodules_count):
  1417 + modulename_id = struct.unpack("<H", dir_stream.read(2))[0]
  1418 + check_value('MODULENAME_Id', 0x0019, modulename_id)
  1419 + modulename_sizeof_modulename = struct.unpack("<L", dir_stream.read(4))[0]
  1420 + modulename_modulename = dir_stream.read(modulename_sizeof_modulename)
1365 # account for optional sections 1421 # account for optional sections
1366 section_id = struct.unpack("<H", dir_stream.read(2))[0] 1422 section_id = struct.unpack("<H", dir_stream.read(2))[0]
1367 if section_id == 0x0047: 1423 if section_id == 0x0047:
1368 - MODULENAMEUNICODE_Id = section_id  
1369 - MODULENAMEUNICODE_SizeOfModuleNameUnicode = struct.unpack("<L", dir_stream.read(4))[0]  
1370 - MODULENAMEUNICODE_ModuleNameUnicode = dir_stream.read(MODULENAMEUNICODE_SizeOfModuleNameUnicode) 1424 + modulename_unicode_id = section_id
  1425 + modulename_unicode_sizeof_modulename_unicode = struct.unpack("<L", dir_stream.read(4))[0]
  1426 + modulename_unicode_modulename_unicode = dir_stream.read(modulename_unicode_sizeof_modulename_unicode)
  1427 + unused = modulename_unicode_id
  1428 + unused = modulename_unicode_modulename_unicode
1371 section_id = struct.unpack("<H", dir_stream.read(2))[0] 1429 section_id = struct.unpack("<H", dir_stream.read(2))[0]
1372 if section_id == 0x001A: 1430 if section_id == 0x001A:
1373 - MODULESTREAMNAME_id = section_id  
1374 - MODULESTREAMNAME_SizeOfStreamName = struct.unpack("<L", dir_stream.read(4))[0]  
1375 - MODULESTREAMNAME_StreamName = dir_stream.read(MODULESTREAMNAME_SizeOfStreamName)  
1376 - MODULESTREAMNAME_Reserved = struct.unpack("<H", dir_stream.read(2))[0]  
1377 - check_value('MODULESTREAMNAME_Reserved', 0x0032, MODULESTREAMNAME_Reserved)  
1378 - MODULESTREAMNAME_SizeOfStreamNameUnicode = struct.unpack("<L", dir_stream.read(4))[0]  
1379 - MODULESTREAMNAME_StreamNameUnicode = dir_stream.read(MODULESTREAMNAME_SizeOfStreamNameUnicode) 1431 + modulestreamname_id = section_id
  1432 + modulestreamname_sizeof_streamname = struct.unpack("<L", dir_stream.read(4))[0]
  1433 + modulestreamname_streamname = dir_stream.read(modulestreamname_sizeof_streamname)
  1434 + modulestreamname_reserved = struct.unpack("<H", dir_stream.read(2))[0]
  1435 + check_value('MODULESTREAMNAME_Reserved', 0x0032, modulestreamname_reserved)
  1436 + modulestreamname_sizeof_streamname_unicode = struct.unpack("<L", dir_stream.read(4))[0]
  1437 + modulestreamname_streamname_unicode = dir_stream.read(modulestreamname_sizeof_streamname_unicode)
  1438 + unused = modulestreamname_id
1380 section_id = struct.unpack("<H", dir_stream.read(2))[0] 1439 section_id = struct.unpack("<H", dir_stream.read(2))[0]
1381 if section_id == 0x001C: 1440 if section_id == 0x001C:
1382 - MODULEDOCSTRING_Id = section_id  
1383 - check_value('MODULEDOCSTRING_Id', 0x001C, MODULEDOCSTRING_Id)  
1384 - MODULEDOCSTRING_SizeOfDocString = struct.unpack("<L", dir_stream.read(4))[0]  
1385 - MODULEDOCSTRING_DocString = dir_stream.read(MODULEDOCSTRING_SizeOfDocString)  
1386 - MODULEDOCSTRING_Reserved = struct.unpack("<H", dir_stream.read(2))[0]  
1387 - check_value('MODULEDOCSTRING_Reserved', 0x0048, MODULEDOCSTRING_Reserved)  
1388 - MODULEDOCSTRING_SizeOfDocStringUnicode = struct.unpack("<L", dir_stream.read(4))[0]  
1389 - MODULEDOCSTRING_DocStringUnicode = dir_stream.read(MODULEDOCSTRING_SizeOfDocStringUnicode) 1441 + moduledocstring_id = section_id
  1442 + check_value('MODULEDOCSTRING_Id', 0x001C, moduledocstring_id)
  1443 + moduledocstring_sizeof_docstring = struct.unpack("<L", dir_stream.read(4))[0]
  1444 + moduledocstring_docstring = dir_stream.read(moduledocstring_sizeof_docstring)
  1445 + moduledocstring_reserved = struct.unpack("<H", dir_stream.read(2))[0]
  1446 + check_value('MODULEDOCSTRING_Reserved', 0x0048, moduledocstring_reserved)
  1447 + moduledocstring_sizeof_docstring_unicode = struct.unpack("<L", dir_stream.read(4))[0]
  1448 + moduledocstring_docstring_unicode = dir_stream.read(moduledocstring_sizeof_docstring_unicode)
  1449 + unused = moduledocstring_docstring
  1450 + unused = moduledocstring_docstring_unicode
1390 section_id = struct.unpack("<H", dir_stream.read(2))[0] 1451 section_id = struct.unpack("<H", dir_stream.read(2))[0]
1391 if section_id == 0x0031: 1452 if section_id == 0x0031:
1392 - MODULEOFFSET_Id = section_id  
1393 - check_value('MODULEOFFSET_Id', 0x0031, MODULEOFFSET_Id)  
1394 - MODULEOFFSET_Size = struct.unpack("<L", dir_stream.read(4))[0]  
1395 - check_value('MODULEOFFSET_Size', 0x0004, MODULEOFFSET_Size)  
1396 - MODULEOFFSET_TextOffset = struct.unpack("<L", dir_stream.read(4))[0] 1453 + moduleoffset_id = section_id
  1454 + check_value('MODULEOFFSET_Id', 0x0031, moduleoffset_id)
  1455 + moduleoffset_size = struct.unpack("<L", dir_stream.read(4))[0]
  1456 + check_value('MODULEOFFSET_Size', 0x0004, moduleoffset_size)
  1457 + moduleoffset_textoffset = struct.unpack("<L", dir_stream.read(4))[0]
1397 section_id = struct.unpack("<H", dir_stream.read(2))[0] 1458 section_id = struct.unpack("<H", dir_stream.read(2))[0]
1398 if section_id == 0x001E: 1459 if section_id == 0x001E:
1399 - MODULEHELPCONTEXT_Id = section_id  
1400 - check_value('MODULEHELPCONTEXT_Id', 0x001E, MODULEHELPCONTEXT_Id)  
1401 - MODULEHELPCONTEXT_Size = struct.unpack("<L", dir_stream.read(4))[0]  
1402 - check_value('MODULEHELPCONTEXT_Size', 0x0004, MODULEHELPCONTEXT_Size)  
1403 - MODULEHELPCONTEXT_HelpContext = struct.unpack("<L", dir_stream.read(4))[0] 1460 + modulehelpcontext_id = section_id
  1461 + check_value('MODULEHELPCONTEXT_Id', 0x001E, modulehelpcontext_id)
  1462 + modulehelpcontext_size = struct.unpack("<L", dir_stream.read(4))[0]
  1463 + check_value('MODULEHELPCONTEXT_Size', 0x0004, modulehelpcontext_size)
  1464 + modulehelpcontext_helpcontext = struct.unpack("<L", dir_stream.read(4))[0]
  1465 + unused = modulehelpcontext_helpcontext
1404 section_id = struct.unpack("<H", dir_stream.read(2))[0] 1466 section_id = struct.unpack("<H", dir_stream.read(2))[0]
1405 if section_id == 0x002C: 1467 if section_id == 0x002C:
1406 - MODULECOOKIE_Id = section_id  
1407 - check_value('MODULECOOKIE_Id', 0x002C, MODULECOOKIE_Id)  
1408 - MODULECOOKIE_Size = struct.unpack("<L", dir_stream.read(4))[0]  
1409 - check_value('MODULECOOKIE_Size', 0x0002, MODULECOOKIE_Size)  
1410 - MODULECOOKIE_Cookie = struct.unpack("<H", dir_stream.read(2))[0] 1468 + modulecookie_id = section_id
  1469 + check_value('MODULECOOKIE_Id', 0x002C, modulecookie_id)
  1470 + modulecookie_size = struct.unpack("<L", dir_stream.read(4))[0]
  1471 + check_value('MODULECOOKIE_Size', 0x0002, modulecookie_size)
  1472 + modulecookie_cookie = struct.unpack("<H", dir_stream.read(2))[0]
  1473 + unused = modulecookie_cookie
1411 section_id = struct.unpack("<H", dir_stream.read(2))[0] 1474 section_id = struct.unpack("<H", dir_stream.read(2))[0]
1412 if section_id == 0x0021 or section_id == 0x0022: 1475 if section_id == 0x0021 or section_id == 0x0022:
1413 - MODULETYPE_Id = section_id  
1414 - MODULETYPE_Reserved = struct.unpack("<L", dir_stream.read(4))[0] 1476 + moduletype_id = section_id
  1477 + moduletype_reserved = struct.unpack("<L", dir_stream.read(4))[0]
  1478 + unused = moduletype_id
  1479 + unused = moduletype_reserved
1415 section_id = struct.unpack("<H", dir_stream.read(2))[0] 1480 section_id = struct.unpack("<H", dir_stream.read(2))[0]
1416 if section_id == 0x0025: 1481 if section_id == 0x0025:
1417 - MODULEREADONLY_Id = section_id  
1418 - check_value('MODULEREADONLY_Id', 0x0025, MODULEREADONLY_Id)  
1419 - MODULEREADONLY_Reserved = struct.unpack("<L", dir_stream.read(4))[0]  
1420 - check_value('MODULEREADONLY_Reserved', 0x0000, MODULEREADONLY_Reserved) 1482 + modulereadonly_id = section_id
  1483 + check_value('MODULEREADONLY_Id', 0x0025, modulereadonly_id)
  1484 + modulereadonly_reserved = struct.unpack("<L", dir_stream.read(4))[0]
  1485 + check_value('MODULEREADONLY_Reserved', 0x0000, modulereadonly_reserved)
1421 section_id = struct.unpack("<H", dir_stream.read(2))[0] 1486 section_id = struct.unpack("<H", dir_stream.read(2))[0]
1422 if section_id == 0x0028: 1487 if section_id == 0x0028:
1423 - MODULEPRIVATE_Id = section_id  
1424 - check_value('MODULEPRIVATE_Id', 0x0028, MODULEPRIVATE_Id)  
1425 - MODULEPRIVATE_Reserved = struct.unpack("<L", dir_stream.read(4))[0]  
1426 - check_value('MODULEPRIVATE_Reserved', 0x0000, MODULEPRIVATE_Reserved) 1488 + moduleprivate_id = section_id
  1489 + check_value('MODULEPRIVATE_Id', 0x0028, moduleprivate_id)
  1490 + moduleprivate_reserved = struct.unpack("<L", dir_stream.read(4))[0]
  1491 + check_value('MODULEPRIVATE_Reserved', 0x0000, moduleprivate_reserved)
1427 section_id = struct.unpack("<H", dir_stream.read(2))[0] 1492 section_id = struct.unpack("<H", dir_stream.read(2))[0]
1428 if section_id == 0x002B: # TERMINATOR 1493 if section_id == 0x002B: # TERMINATOR
1429 - MODULE_Reserved = struct.unpack("<L", dir_stream.read(4))[0]  
1430 - check_value('MODULE_Reserved', 0x0000, MODULE_Reserved) 1494 + module_reserved = struct.unpack("<L", dir_stream.read(4))[0]
  1495 + check_value('MODULE_Reserved', 0x0000, module_reserved)
1431 section_id = None 1496 section_id = None
1432 if section_id != None: 1497 if section_id != None:
1433 log.warning('unknown or invalid module section id {0:04X}'.format(section_id)) 1498 log.warning('unknown or invalid module section id {0:04X}'.format(section_id))
1434 1499
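Each MODULE record above is a run of optional sections, each introduced by a 16-bit id and closed by the TERMINATOR id 0x002B followed by a 4-byte reserved field. A condensed sketch of that walk, assuming (as a simplification) that every non-terminator section is a plain id/size/payload triple — in the real format some sections have fixed layouts instead:

```python
import io
import struct

TERMINATOR = 0x002B

def skip_module_sections(stream):
    """Walk id-prefixed sections until the terminator; return the ids seen."""
    seen = []
    while True:
        section_id = struct.unpack("<H", stream.read(2))[0]
        if section_id == TERMINATOR:
            stream.read(4)            # MODULE_Reserved, expected 0x0000
            return seen
        size = struct.unpack("<L", stream.read(4))[0]
        stream.read(size)             # skip the section payload
        seen.append(section_id)

# hypothetical stream: one MODULECOOKIE-like section, then the terminator
data = (struct.pack("<HL", 0x002C, 2) + b"\x00\x00" +
        struct.pack("<HL", TERMINATOR, 0))
ids = skip_module_sections(io.BytesIO(data))
```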
1435 - log.debug('Project CodePage = %d' % PROJECTCODEPAGE_CodePage)  
1436 - vba_codec = 'cp%d' % PROJECTCODEPAGE_CodePage  
1437 - log.debug("ModuleName = {0}".format(MODULENAME_ModuleName))  
1438 - log.debug("StreamName = {0}".format(repr(MODULESTREAMNAME_StreamName)))  
1439 - streamname_unicode = MODULESTREAMNAME_StreamName.decode(vba_codec) 1500 + log.debug('Project CodePage = %d' % projectcodepage_codepage)
  1501 + vba_codec = 'cp%d' % projectcodepage_codepage
  1502 + log.debug("ModuleName = {0}".format(modulename_modulename))
  1503 + log.debug("StreamName = {0}".format(repr(modulestreamname_streamname)))
  1504 + streamname_unicode = modulestreamname_streamname.decode(vba_codec)
1440 log.debug("StreamName.decode('%s') = %s" % (vba_codec, repr(streamname_unicode))) 1505 log.debug("StreamName.decode('%s') = %s" % (vba_codec, repr(streamname_unicode)))
1441 - log.debug("StreamNameUnicode = {0}".format(repr(MODULESTREAMNAME_StreamNameUnicode)))  
1442 - log.debug("TextOffset = {0}".format(MODULEOFFSET_TextOffset)) 1506 + log.debug("StreamNameUnicode = {0}".format(repr(modulestreamname_streamname_unicode)))
  1507 + log.debug("TextOffset = {0}".format(moduleoffset_textoffset))
1443 1508
1444 code_path = vba_root + u'VBA/' + streamname_unicode 1509 code_path = vba_root + u'VBA/' + streamname_unicode
1445 #TODO: test if stream exists 1510 #TODO: test if stream exists
1446 log.debug('opening VBA code stream %s' % repr(code_path)) 1511 log.debug('opening VBA code stream %s' % repr(code_path))
1447 code_data = ole.openstream(code_path).read() 1512 code_data = ole.openstream(code_path).read()
1448 log.debug("length of code_data = {0}".format(len(code_data))) 1513 log.debug("length of code_data = {0}".format(len(code_data)))
1449 - log.debug("offset of code_data = {0}".format(MODULEOFFSET_TextOffset))  
1450 - code_data = code_data[MODULEOFFSET_TextOffset:] 1514 + log.debug("offset of code_data = {0}".format(moduleoffset_textoffset))
  1515 + code_data = code_data[moduleoffset_textoffset:]
1451 if len(code_data) > 0: 1516 if len(code_data) > 0:
1452 code_data = decompress_stream(code_data) 1517 code_data = decompress_stream(code_data)
1453 # case-insensitive search in the code_modules dict to find the file extension: 1518 # case-insensitive search in the code_modules dict to find the file extension:
1454 - filext = code_modules.get(MODULENAME_ModuleName.lower(), 'bin')  
1455 - filename = '{0}.{1}'.format(MODULENAME_ModuleName, filext) 1519 + filext = code_modules.get(modulename_modulename.lower(), 'bin')
  1520 + filename = '{0}.{1}'.format(modulename_modulename, filext)
1456 #TODO: also yield the codepage so that callers can decode it properly 1521 #TODO: also yield the codepage so that callers can decode it properly
1457 yield (code_path, filename, code_data) 1522 yield (code_path, filename, code_data)
1458 # print '-'*79 1523 # print '-'*79
@@ -1462,7 +1527,8 @@ def _extract_vba(ole, vba_root, project_path, dir_path): @@ -1462,7 +1527,8 @@ def _extract_vba(ole, vba_root, project_path, dir_path):
1462 # print '' 1527 # print ''
1463 log.debug('extracted file {0}'.format(filename)) 1528 log.debug('extracted file {0}'.format(filename))
1464 else: 1529 else:
1465 - log.warning("module stream {0} has code data length 0".format(MODULESTREAMNAME_StreamName)) 1530 + log.warning("module stream {0} has code data length 0".format(modulestreamname_streamname))
  1531 + _ = unused
1466 return 1532 return
1467 1533
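`_extract_vba` is a generator yielding `(code_path, filename, code_data)` tuples, one per decompressed module. A hedged sketch of how a caller might consume that shape (the stub generator below stands in for the real parse, which needs an open OLE file):

```python
def fake_extract_vba():
    # stand-in for _extract_vba(ole, vba_root, project_path, dir_path)
    yield (u'Macros/VBA/Module1', 'Module1.bas', b'Sub Test()\r\nEnd Sub')

# collect extracted modules by filename, ignoring the stream path
modules = {filename: code for _path, filename, code in fake_extract_vba()}
```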
1468 1534
@@ -1618,12 +1684,9 @@ def detect_base64_strings(vba_code): @@ -1618,12 +1684,9 @@ def detect_base64_strings(vba_code):
1618 decoded = base64.b64decode(value) 1684 decoded = base64.b64decode(value)
1619 results.append((value, decoded)) 1685 results.append((value, decoded))
1620 found.add(value) 1686 found.add(value)
1621 - except KeyboardInterrupt:  
1622 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
1623 - raise  
1624 - except: 1687 + except (TypeError, ValueError) as exc:
  1688 + log.debug('Failed to base64-decode (%s)' % exc)
1625 # if an exception occurs, it is likely not a base64-encoded string 1689 # if an exception occurs, it is likely not a base64-encoded string
1626 - pass  
1627 return results 1690 return results
1628 1691
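The hunk above replaces a bare `except:` (plus the explicit `KeyboardInterrupt` re-raise it required) with a narrow `except (TypeError, ValueError)`, so Ctrl+C propagates naturally while undecodable candidates are still treated as not-base64. The pattern in isolation, with the function name as an illustration (in Python 3, `binascii.Error` is the concrete decode failure and subclasses `ValueError`; Python 2's `b64decode` raised `TypeError` instead):

```python
import base64
import binascii

def try_b64decode(value):
    """Return decoded bytes, or None if value is not valid base64."""
    try:
        return base64.b64decode(value)
    except (TypeError, ValueError, binascii.Error):
        # decoding failed: likely not a base64-encoded string.
        # KeyboardInterrupt is NOT caught, so Ctrl+C still works.
        return None
```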
1629 1692
@@ -1648,12 +1711,9 @@ def detect_dridex_strings(vba_code): @@ -1648,12 +1711,9 @@ def detect_dridex_strings(vba_code):
1648 decoded = DridexUrlDecode(value) 1711 decoded = DridexUrlDecode(value)
1649 results.append((value, decoded)) 1712 results.append((value, decoded))
1650 found.add(value) 1713 found.add(value)
1651 - except KeyboardInterrupt:  
1652 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
1653 - raise  
1654 - except: 1714 + except Exception as exc:
  1715 + log.debug('Failed to Dridex-decode (%s)' % exc)
1655 # if an exception occurs, it is likely not a dridex-encoded string 1716 # if an exception occurs, it is likely not a dridex-encoded string
1656 - pass  
1657 return results 1717 return results
1658 1718
1659 1719
@@ -1703,16 +1763,17 @@ def json2ascii(json_obj, encoding=&#39;utf8&#39;, errors=&#39;replace&#39;): @@ -1703,16 +1763,17 @@ def json2ascii(json_obj, encoding=&#39;utf8&#39;, errors=&#39;replace&#39;):
1703 elif isinstance(json_obj, (bool, int, float)): 1763 elif isinstance(json_obj, (bool, int, float)):
1704 pass 1764 pass
1705 elif isinstance(json_obj, str): 1765 elif isinstance(json_obj, str):
  1766 + # de-code and re-encode
1706 dencoded = json_obj.decode(encoding, errors).encode(encoding, errors) 1767 dencoded = json_obj.decode(encoding, errors).encode(encoding, errors)
1707 - if dencoded != str:  
1708 - logging.info('json2ascii: replaced: {0} (len {1})'  
1709 - .format(json_obj, len(json_obj)))  
1710 - logging.info('json2ascii: with: {0} (len {1})'  
1711 - .format(dencoded, len(dencoded))) 1768 + if dencoded != json_obj:
  1769 + log.info('json2ascii: replaced: {0} (len {1})'
  1770 + .format(json_obj, len(json_obj)))
  1771 + log.info('json2ascii: with: {0} (len {1})'
  1772 + .format(dencoded, len(dencoded)))
1712 return dencoded 1773 return dencoded
1713 elif isinstance(json_obj, unicode): 1774 elif isinstance(json_obj, unicode):
1714 - logging.info('json2ascii: replaced: {0}'  
1715 - .format(json_obj.encode(encoding, errors))) 1775 + log.info('json2ascii: replaced: {0}'
  1776 + .format(json_obj.encode(encoding, errors)))
1716 # cannot put original into logger 1777 # cannot put original into logger
1717 # print 'original: ' json_obj 1778 # print 'original: ' json_obj
1718 return json_obj.encode(encoding, errors) 1779 return json_obj.encode(encoding, errors)
@@ -1723,11 +1784,50 @@ def json2ascii(json_obj, encoding=&#39;utf8&#39;, errors=&#39;replace&#39;): @@ -1723,11 +1784,50 @@ def json2ascii(json_obj, encoding=&#39;utf8&#39;, errors=&#39;replace&#39;):
1723 for item in json_obj: 1784 for item in json_obj:
1724 item = json2ascii(item) 1785 item = json2ascii(item)
1725 else: 1786 else:
1726 - logging.debug('unexpected type in json2ascii: {0} -- leave as is'  
1727 - .format(type(json_obj))) 1787 + log.debug('unexpected type in json2ascii: {0} -- leave as is'
  1788 + .format(type(json_obj)))
1728 return json_obj 1789 return json_obj
1729 1790
1730 1791
  1792 +_have_printed_json_start = False
  1793 +
  1794 +def print_json(json_dict=None, _json_is_last=False, **json_parts):
  1795 + """ line-wise print of json.dumps(json2ascii(..)) with options and indent+1
  1796 +
  1797 + can use in two ways:
  1798 + (1) print_json(some_dict)
  1799 + (2) print_json(key1=value1, key2=value2, ...)
  1800 +
  1801 + :param bool _json_is_last: set to True only for very last entry to complete
  1802 + the top-level json-list
  1803 + """
  1804 + global _have_printed_json_start
  1805 +
  1806 + if json_dict and json_parts:
  1807 + raise ValueError('Invalid json argument: want either single dict or '
  1808 + 'key=value parts but got both)')
  1809 + elif (json_dict is not None) and (not isinstance(json_dict, dict)):
  1810 + raise ValueError('Invalid json argument: want either single dict or '
  1811 + 'key=value parts but got {} instead of dict)'
  1812 + .format(type(json_dict)))
  1813 + if json_parts:
  1814 + json_dict = json_parts
  1815 +
  1816 + if not _have_printed_json_start:
  1817 + print '['
  1818 + _have_printed_json_start = True
  1819 +
  1820 + lines = json.dumps(json2ascii(json_dict), check_circular=False,
  1821 + indent=4, ensure_ascii=False).splitlines()
  1822 + for line in lines[:-1]:
  1823 + print ' {}'.format(line)
  1824 + if _json_is_last:
  1825 + print ' {}'.format(lines[-1]) # print last line without comma
  1826 + print ']'
  1827 + else:
  1828 + print ' {},'.format(lines[-1]) # print last line with comma
  1829 +
  1830 +
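`print_json` above emits one top-level JSON list incrementally: `[` is printed before the first entry, every entry's last line gets a trailing comma, and `_json_is_last` closes the list. A Python 3 sketch of the same incremental-list idea (collecting lines instead of using Python 2 `print` statements; the function name is illustrative):

```python
import json

def format_json_entries(entries):
    """Render dicts as one indented JSON list, entry by entry."""
    lines = ['[']
    for i, entry in enumerate(entries):
        body = json.dumps(entry, indent=4, check_circular=False).splitlines()
        is_last = (i == len(entries) - 1)
        for line in body[:-1]:
            lines.append('    ' + line)
        # last line of an entry gets a comma unless it ends the list
        lines.append('    ' + body[-1] + ('' if is_last else ','))
    lines.append(']')
    return '\n'.join(lines)

out = format_json_entries([{'type': 'msg', 'text': 'a'},
                           {'type': 'msg', 'text': 'b'}])
```

Emitting entry by entry lets olevba stream valid JSON even if a later file aborts the analysis, at the cost of the module-level `_have_printed_json_start` flag.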
1731 class VBA_Scanner(object): 1831 class VBA_Scanner(object):
1732 """ 1832 """
1733 Class to scan the source code of a VBA module to find obfuscated strings, 1833 Class to scan the source code of a VBA module to find obfuscated strings,
@@ -1927,6 +2027,8 @@ class VBA_Parser(object): @@ -1927,6 +2027,8 @@ class VBA_Parser(object):
1927 2027
1928 :param container: str, path and filename of container if the file is within 2028 :param container: str, path and filename of container if the file is within
1929 a zip archive, None otherwise. 2029 a zip archive, None otherwise.
  2030 +
  2031 + raises a FileOpenError if all attempts to interpret the data header failed
1930 """ 2032 """
1931 #TODO: filename should only be a string, data should be used for the file-like object 2033 #TODO: filename should only be a string, data should be used for the file-like object
1932 #TODO: filename should be mandatory, optional data is a string or file-like object 2034 #TODO: filename should be mandatory, optional data is a string or file-like object
@@ -2006,8 +2108,8 @@ class VBA_Parser(object): @@ -2006,8 +2108,8 @@ class VBA_Parser(object):
2006 if self.type is None: 2108 if self.type is None:
2007 # At this stage, could not match a known format: 2109 # At this stage, could not match a known format:
2008 msg = '%s is not a supported file type, cannot extract VBA Macros.' % self.filename 2110 msg = '%s is not a supported file type, cannot extract VBA Macros.' % self.filename
2009 - log.error(msg)  
2010 - raise TypeError(msg) 2111 + log.info(msg)
  2112 + raise FileOpenError(msg)
2011 2113
2012 def open_ole(self, _file): 2114 def open_ole(self, _file):
2013 """ 2115 """
@@ -2021,13 +2123,10 @@ class VBA_Parser(object): @@ -2021,13 +2123,10 @@ class VBA_Parser(object):
2021 self.ole_file = olefile.OleFileIO(_file, path_encoding=None) 2123 self.ole_file = olefile.OleFileIO(_file, path_encoding=None)
2022 # set type only if parsing succeeds 2124 # set type only if parsing succeeds
2023 self.type = TYPE_OLE 2125 self.type = TYPE_OLE
2024 - except KeyboardInterrupt:  
2025 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2026 - raise  
2027 - except: 2126 + except (IOError, TypeError, ValueError) as exc:
2028 # TODO: handle OLE parsing exceptions 2127 # TODO: handle OLE parsing exceptions
2029 - log.exception('Failed OLE parsing for file %r' % self.filename)  
2030 - pass 2128 + log.info('Failed OLE parsing for file %r (%s)' % (self.filename, exc))
  2129 + log.debug('Trace:', exc_info=True)
2031 2130
2032 2131
2033 def open_openxml(self, _file): 2132 def open_openxml(self, _file):
@@ -2053,22 +2152,17 @@ class VBA_Parser(object): @@ -2053,22 +2152,17 @@ class VBA_Parser(object):
2053 ole_data = z.open(subfile).read() 2152 ole_data = z.open(subfile).read()
2054 try: 2153 try:
2055 self.ole_subfiles.append(VBA_Parser(filename=subfile, data=ole_data)) 2154 self.ole_subfiles.append(VBA_Parser(filename=subfile, data=ole_data))
2056 - except KeyboardInterrupt:  
2057 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2058 - raise  
2059 - except:  
2060 - log.debug('%s is not a valid OLE file' % subfile) 2155 + except FileOpenError as exc:
  2156 + log.info('%s is not a valid OLE file (%s)' % (subfile, exc))
2061 continue 2157 continue
2062 z.close() 2158 z.close()
2063 # set type only if parsing succeeds 2159 # set type only if parsing succeeds
2064 self.type = TYPE_OpenXML 2160 self.type = TYPE_OpenXML
2065 - except KeyboardInterrupt:  
2066 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2067 - raise  
2068 - except: 2161 + except (RuntimeError, zipfile.BadZipfile, zipfile.LargeZipFile, IOError) as exc:
2069 # TODO: handle parsing exceptions 2162 # TODO: handle parsing exceptions
2070 - log.exception('Failed Zip/OpenXML parsing for file %r' % self.filename)  
2071 - pass 2163 + log.info('Failed Zip/OpenXML parsing for file %r (%s)'
  2164 + % (self.filename, exc))
  2165 + log.debug('Trace:', exc_info=True)
2072 2166
2073 def open_word2003xml(self, data): 2167 def open_word2003xml(self, data):
2074 """ 2168 """
@@ -2092,25 +2186,25 @@ class VBA_Parser(object): @@ -2092,25 +2186,25 @@ class VBA_Parser(object):
2092 if is_mso_file(mso_data): 2186 if is_mso_file(mso_data):
2093 # decompress the zlib data stored in the MSO file, which is the OLE container: 2187 # decompress the zlib data stored in the MSO file, which is the OLE container:
2094 # TODO: handle different offsets => separate function 2188 # TODO: handle different offsets => separate function
2095 - ole_data = mso_file_extract(mso_data)  
2096 try: 2189 try:
  2190 + ole_data = mso_file_extract(mso_data)
2097 self.ole_subfiles.append(VBA_Parser(filename=fname, data=ole_data)) 2191 self.ole_subfiles.append(VBA_Parser(filename=fname, data=ole_data))
2098 - except KeyboardInterrupt:  
2099 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2100 - raise  
2101 - except:  
2102 - log.error('%s does not contain a valid OLE file' % fname) 2192 + except MsoExtractionError:
  2193 + log.info('Failed decompressing an MSO container in %r - %s'
  2194 + % (fname, MSG_OLEVBA_ISSUES))
  2195 + log.debug('Trace:', exc_info=True)
  2196 + except FileOpenError as exc:
  2197 + log.debug('%s is not a valid OLE sub file (%s)' % (fname, exc))
2103 else: 2198 else:
2104 - log.error('%s is not a valid MSO file' % fname) 2199 + log.info('%s is not a valid MSO file' % fname)
2105 # set type only if parsing succeeds 2200 # set type only if parsing succeeds
2106 self.type = TYPE_Word2003_XML 2201 self.type = TYPE_Word2003_XML
2107 - except KeyboardInterrupt:  
2108 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2109 - raise  
2110 - except: 2202 + except Exception as exc:
2111 # TODO: differentiate exceptions for each parsing stage 2203 # TODO: differentiate exceptions for each parsing stage
2112 - log.exception('Failed XML parsing for file %r' % self.filename)  
2113 - pass 2204 + # (but ET is different libs, no good exception description in API)
  2205 + # found: XMLSyntaxError
  2206 + log.info('Failed XML parsing for file %r (%s)' % (self.filename, exc))
  2207 + log.debug('Trace:', exc_info=True)
2114 2208
2115 def open_mht(self, data): 2209 def open_mht(self, data):
2116 """ 2210 """
@@ -2153,40 +2247,30 @@ class VBA_Parser(object): @@ -2153,40 +2247,30 @@ class VBA_Parser(object):
2153 log.debug('Found ActiveMime header, decompressing MSO container') 2247 log.debug('Found ActiveMime header, decompressing MSO container')
2154 try: 2248 try:
2155 ole_data = mso_file_extract(part_data) 2249 ole_data = mso_file_extract(part_data)
2156 - try:  
2157 - # TODO: check if it is actually an OLE file  
2158 - # TODO: get the MSO filename from content_location?  
2159 - self.ole_subfiles.append(VBA_Parser(filename=fname, data=ole_data))  
2160 - except KeyboardInterrupt:  
2161 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2162 - raise  
2163 - except:  
2164 - log.debug('%s does not contain a valid OLE file' % fname)  
2165 - except KeyboardInterrupt:  
2166 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2167 - raise  
2168 - except:  
2169 - log.exception('Failed decompressing an MSO container in %r - %s' 2250 +
  2251 + # TODO: check if it is actually an OLE file
  2252 + # TODO: get the MSO filename from content_location?
  2253 + self.ole_subfiles.append(VBA_Parser(filename=fname, data=ole_data))
  2254 + except MsoExtractionError:
  2255 + log.info('Failed decompressing an MSO container in %r - %s'
2170 % (fname, MSG_OLEVBA_ISSUES)) 2256 % (fname, MSG_OLEVBA_ISSUES))
  2257 + log.debug('Trace:', exc_info=True)
2171 # TODO: bug here - need to split in smaller functions/classes? 2258 # TODO: bug here - need to split in smaller functions/classes?
  2259 + except FileOpenError as exc:
  2260 + log.debug('%s does not contain a valid OLE file (%s)'
  2261 + % (fname, exc))
2172 else: 2262 else:
  2263 + log.debug('type(part_data) = %s' % type(part_data))
2173 try: 2264 try:
2174 - log.debug('type(part_data) = %s' % type(part_data))  
2175 log.debug('part_data[0:20] = %r' % part_data[0:20]) 2265 log.debug('part_data[0:20] = %r' % part_data[0:20])
2176 - except KeyboardInterrupt:  
2177 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2178 - raise  
2179 - except:  
2180 - pass 2266 + except TypeError:
  2267 + log.debug('part_data has no __getitem__')
2181 # set type only if parsing succeeds 2268 # set type only if parsing succeeds
2182 self.type = TYPE_MHTML 2269 self.type = TYPE_MHTML
2183 - except KeyboardInterrupt:  
2184 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2185 - raise  
2186 - except:  
2187 - log.exception('Failed MIME parsing for file %r - %s'  
2188 - % (self.filename, MSG_OLEVBA_ISSUES))  
2189 - pass 2270 + except Exception:
  2271 + log.info('Failed MIME parsing for file %r - %s'
  2272 + % (self.filename, MSG_OLEVBA_ISSUES))
  2273 + log.debug('Trace:', exc_info=True)
2190 2274
2191 def open_ppt(self): 2275 def open_ppt(self):
2192 """ try to interpret self.ole_file as PowerPoint 97-2003 using PptParser 2276 """ try to interpret self.ole_file as PowerPoint 97-2003 using PptParser
@@ -2225,19 +2309,11 @@ class VBA_Parser(object): @@ -2225,19 +2309,11 @@ class VBA_Parser(object):
2225 :return: nothing 2309 :return: nothing
2226 """ 2310 """
2227 log.info('Opening text file %s' % self.filename) 2311 log.info('Opening text file %s' % self.filename)
2228 - try:  
2229 - # directly store the source code:  
2230 - self.vba_code_all_modules = data  
2231 - self.contains_macros = True  
2232 - # set type only if parsing succeeds  
2233 - self.type = TYPE_TEXT  
2234 - except KeyboardInterrupt:  
2235 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2236 - raise  
2237 - except:  
2238 - log.exception('Failed text parsing for file %r - %s'  
2239 - % (self.filename, MSG_OLEVBA_ISSUES))  
2240 - pass 2312 + # directly store the source code:
  2313 + self.vba_code_all_modules = data
  2314 + self.contains_macros = True
  2315 + # set type only if parsing succeeds
  2316 + self.type = TYPE_TEXT
2241 2317
2242 2318
2243 def find_vba_projects(self): 2319 def find_vba_projects(self):
@@ -2294,6 +2370,15 @@ class VBA_Parser(object): @@ -2294,6 +2370,15 @@ class VBA_Parser(object):
2294 # - The root/VBA storage MUST contain a _VBA_PROJECT stream and a dir stream 2370 # - The root/VBA storage MUST contain a _VBA_PROJECT stream and a dir stream
2295 # - all names are case-insensitive 2371 # - all names are case-insensitive
2296 2372
  2373 + def check_vba_stream(ole, vba_root, stream_path):
  2374 + full_path = vba_root + stream_path
  2375 + if ole.exists(full_path) and ole.get_type(full_path) == olefile.STGTY_STREAM:
  2376 + log.debug('Found %s stream: %s' % (stream_path, full_path))
  2377 + return full_path
  2378 + else:
  2379 + log.debug('Missing %s stream, this is not a valid VBA project structure' % stream_path)
  2380 + return False
  2381 +
2297 # start with an empty list: 2382 # start with an empty list:
2298 self.vba_projects = [] 2383 self.vba_projects = []
2299 # Look for any storage containing those storage/streams: 2384 # Look for any storage containing those storage/streams:
@@ -2310,15 +2395,6 @@ class VBA_Parser(object): @@ -2310,15 +2395,6 @@ class VBA_Parser(object):
2310 vba_root += '/' 2395 vba_root += '/'
2311 log.debug('Checking vba_root="%s"' % vba_root) 2396 log.debug('Checking vba_root="%s"' % vba_root)
2312 2397
2313 - def check_vba_stream(ole, vba_root, stream_path):  
2314 - full_path = vba_root + stream_path  
2315 - if ole.exists(full_path) and ole.get_type(full_path) == olefile.STGTY_STREAM:  
2316 - log.debug('Found %s stream: %s' % (stream_path, full_path))  
2317 - return full_path  
2318 - else:  
2319 - log.debug('Missing %s stream, this is not a valid VBA project structure' % stream_path)  
2320 - return False  
2321 -  
2322 # Check if the VBA root storage also contains a PROJECT stream: 2398 # Check if the VBA root storage also contains a PROJECT stream:
2323 project_path = check_vba_stream(ole, vba_root, 'PROJECT') 2399 project_path = check_vba_stream(ole, vba_root, 'PROJECT')
2324 if not project_path: continue 2400 if not project_path: continue
@@ -2369,6 +2445,29 @@ class VBA_Parser(object): @@ -2369,6 +2445,29 @@ class VBA_Parser(object):
2369 self.contains_macros = False 2445 self.contains_macros = False
2370 else: 2446 else:
2371 self.contains_macros = True 2447 self.contains_macros = True
  2448 + # Also look for VBA code in any stream including orphans
  2449 + # (happens in some malformed files)
  2450 + ole = self.ole_file
  2451 + for sid in xrange(len(ole.direntries)):
  2452 + # check if id is already done above:
  2453 + log.debug('Checking DirEntry #%d' % sid)
  2454 + d = ole.direntries[sid]
  2455 + if d is None:
  2456 + # this direntry is not part of the tree: either unused or an orphan
  2457 + d = ole._load_direntry(sid)
  2458 + log.debug('This DirEntry is an orphan or unused')
  2459 + if d.entry_type == olefile.STGTY_STREAM:
  2460 + # read data
  2461 + log.debug('Reading data from stream %r - size: %d bytes' % (d.name, d.size))
  2462 + try:
  2463 + data = ole._open(d.isectStart, d.size).read()
  2464 + log.debug('Read %d bytes' % len(data))
  2465 + log.debug(repr(data))
  2466 + if 'Attribut' in data:
  2467 + log.debug('Found VBA compressed code')
  2468 + self.contains_macros = True
  2469 + except Exception:
  2470 + log.exception('Error when reading OLE Stream %r' % d.name)
2372 return self.contains_macros 2471 return self.contains_macros
2373 2472
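The orphan-stream scan added to detect_vba_macros flags a stream as soon as its raw bytes contain 'Attribut': the 'Attribute VB_Name' header that opens every VBA module usually survives MS-OVBA compression as a literal run. A hedged sketch of that heuristic (the sample bytes below are fabricated for illustration, not real compressed data):

```python
def looks_like_vba(stream_data):
    # 'Attribute VB_Name' opens every VBA module; under MS-OVBA compression
    # the run 'Attribut' typically stays literal, so a substring test is a
    # cheap first-pass detector (false positives are possible).
    return b'Attribut' in stream_data

# fabricated fragment mimicking a compressed module header
sample = b'\x01\x16\x03\x00Attribut\x00e VB_Nam\x00e = "Mod"'
```

The full extraction pass still has to decompress the stream; this test only decides whether a stream is worth decompressing.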
2374 def extract_macros(self): 2473 def extract_macros(self):
@@ -2381,6 +2480,7 @@ class VBA_Parser(object): @@ -2381,6 +2480,7 @@ class VBA_Parser(object):
2381 within the zip archive, e.g. word/vbaProject.bin. 2480 within the zip archive, e.g. word/vbaProject.bin.
2382 If the file is PPT, result is as for OpenXML but filename is useless 2481 If the file is PPT, result is as for OpenXML but filename is useless
2383 """ 2482 """
  2483 + log.debug('extract_macros:')
2384 if self.ole_file is None: 2484 if self.ole_file is None:
2385 # This may be either an OpenXML/PPT or a text file: 2485 # This may be either an OpenXML/PPT or a text file:
2386 if self.type == TYPE_TEXT: 2486 if self.type == TYPE_TEXT:
@@ -2394,11 +2494,41 @@ class VBA_Parser(object): @@ -2394,11 +2494,41 @@ class VBA_Parser(object):
2394 else: 2494 else:
2395 # This is an OLE file: 2495 # This is an OLE file:
2396 self.find_vba_projects() 2496 self.find_vba_projects()
  2497 + # set of stream ids
  2498 + vba_stream_ids = set()
2397 for vba_root, project_path, dir_path in self.vba_projects: 2499 for vba_root, project_path, dir_path in self.vba_projects:
2398 # extract all VBA macros from that VBA root storage: 2500 # extract all VBA macros from that VBA root storage:
2399 for stream_path, vba_filename, vba_code in _extract_vba(self.ole_file, vba_root, project_path, 2501 for stream_path, vba_filename, vba_code in _extract_vba(self.ole_file, vba_root, project_path,
2400 dir_path): 2502 dir_path):
  2503 + # store direntry ids in a set:
  2504 + vba_stream_ids.add(self.ole_file._find(stream_path))
2401 yield (self.filename, stream_path, vba_filename, vba_code) 2505 yield (self.filename, stream_path, vba_filename, vba_code)
  2506 + # Also look for VBA code in any stream including orphans
  2507 + # (happens in some malformed files)
  2508 + ole = self.ole_file
  2509 + for sid in xrange(len(ole.direntries)):
  2510 + # check if id is already done above:
  2511 + log.debug('Checking DirEntry #%d' % sid)
  2512 + if sid in vba_stream_ids:
  2513 + log.debug('Already extracted')
  2514 + continue
  2515 + d = ole.direntries[sid]
  2516 + if d is None:
  2517 + # this direntry is not part of the tree: either unused or an orphan
  2518 + d = ole._load_direntry(sid)
  2519 + log.debug('This DirEntry is an orphan or unused')
  2520 + if d.entry_type == olefile.STGTY_STREAM:
  2521 + # read data
  2522 + log.debug('Reading data from stream %r' % d.name)
  2523 + data = ole._open(d.isectStart, d.size).read()
  2524 + for match in re.finditer(r'\x00Attribut[^e]', data, flags=re.IGNORECASE):
  2525 + start = match.start() - 3
  2526 + log.debug('Found VBA compressed code at index %X' % start)
  2527 + compressed_code = data[start:]
  2528 + vba_code = decompress_stream(compressed_code)
  2529 + yield (self.filename, d.name, d.name, vba_code)
  2530 +
  2531 +
2402 2532
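The orphan-stream loop in extract_macros locates compressed code with the regex `\x00Attribut[^e]` and backs up 3 bytes to the start of the compressed container. A sketch of that offset arithmetic, under my reading of the MS-OVBA layout (the 0x00 is the all-literal token-flag byte preceding the first eight literals; 3 bytes earlier sit the 0x01 container signature and the 2-byte chunk header), with `[^e]` excluding plain-text occurrences of the word 'Attribute':

```python
import re

# match the token-flag byte followed by the start of 'Attribute VB_Name',
# but reject uncompressed text where 'e' directly follows 'Attribut'
SIG = re.compile(rb'\x00Attribut[^e]', re.IGNORECASE)

def compressed_vba_offsets(data):
    # candidate start offsets of compressed VBA containers in raw bytes:
    # back up over signature byte + 2-byte chunk header
    return [m.start() - 3 for m in SIG.finditer(data)]
```

Each offset can then be fed to the MS-OVBA decompressor (decompress_stream in this module) to recover the source code.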
2403 2533
2404 def extract_all_macros(self): 2534 def extract_all_macros(self):
@@ -2429,10 +2559,10 @@ class VBA_Parser(object): @@ -2429,10 +2559,10 @@ class VBA_Parser(object):
2429 # variable to merge source code from all modules: 2559 # variable to merge source code from all modules:
2430 if self.vba_code_all_modules is None: 2560 if self.vba_code_all_modules is None:
2431 self.vba_code_all_modules = '' 2561 self.vba_code_all_modules = ''
2432 - for (subfilename, stream_path, vba_filename, vba_code) in self.extract_all_macros(): 2562 + for (_, _, _, vba_code) in self.extract_all_macros():
2433 #TODO: filter code? (each module) 2563 #TODO: filter code? (each module)
2434 self.vba_code_all_modules += vba_code + '\n' 2564 self.vba_code_all_modules += vba_code + '\n'
2435 - for (subfilename, form_path, form_string) in self.extract_form_strings(): 2565 + for (_, _, form_string) in self.extract_form_strings():
2436 self.vba_code_all_modules += form_string + '\n' 2566 self.vba_code_all_modules += form_string + '\n'
2437 # Analyze the whole code at once: 2567 # Analyze the whole code at once:
2438 scanner = VBA_Scanner(self.vba_code_all_modules) 2568 scanner = VBA_Scanner(self.vba_code_all_modules)
@@ -2592,8 +2722,7 @@ class VBA_Parser_CLI(VBA_Parser): @@ -2592,8 +2722,7 @@ class VBA_Parser_CLI(VBA_Parser):
2592 def __init__(self, filename, data=None, container=None): 2722 def __init__(self, filename, data=None, container=None):
2593 """ 2723 """
2594 Constructor for VBA_Parser_CLI. 2724 Constructor for VBA_Parser_CLI.
2595 - Calls __init__ from VBA_Parser, but handles the TypeError exception  
2596 - when the file type is not supported. 2725 + Calls __init__ from VBA_Parser
2597 2726
2598 :param filename: filename or path of file to parse, or file-like object 2727 :param filename: filename or path of file to parse, or file-like object
2599 2728
@@ -2604,11 +2733,7 @@ class VBA_Parser_CLI(VBA_Parser): @@ -2604,11 +2733,7 @@ class VBA_Parser_CLI(VBA_Parser):
2604 :param container: str, path and filename of container if the file is within 2733 :param container: str, path and filename of container if the file is within
2605 a zip archive, None otherwise. 2734 a zip archive, None otherwise.
2606 """ 2735 """
2607 - try:  
2608 - VBA_Parser.__init__(self, filename, data=data, container=container)  
2609 - except TypeError:  
2610 - # in that case, self.type=None  
2611 - pass 2736 + super(VBA_Parser_CLI, self).__init__(filename, data=data, container=container)
2612 2737
2613 2738
2614 def print_analysis(self, show_decoded_strings=False, deobfuscate=False): 2739 def print_analysis(self, show_decoded_strings=False, deobfuscate=False):
@@ -2658,7 +2783,7 @@ class VBA_Parser_CLI(VBA_Parser): @@ -2658,7 +2783,7 @@ class VBA_Parser_CLI(VBA_Parser):
2658 for kw_type, keyword, description in self.analyze_macros(show_decoded_strings)] 2783 for kw_type, keyword, description in self.analyze_macros(show_decoded_strings)]
2659 2784
2660 def process_file(self, show_decoded_strings=False, 2785 def process_file(self, show_decoded_strings=False,
2661 - display_code=True, global_analysis=True, hide_attributes=True, 2786 + display_code=True, hide_attributes=True,
2662 vba_code_only=False, show_deobfuscated_code=False, 2787 vba_code_only=False, show_deobfuscated_code=False,
2663 deobfuscate=False): 2788 deobfuscate=False):
2664 """ 2789 """
@@ -2704,19 +2829,12 @@ class VBA_Parser_CLI(VBA_Parser): @@ -2704,19 +2829,12 @@ class VBA_Parser_CLI(VBA_Parser):
2704 print '(empty macro)' 2829 print '(empty macro)'
2705 else: 2830 else:
2706 print vba_code_filtered 2831 print vba_code_filtered
2707 - if not global_analysis and not vba_code_only:  
2708 - #TODO: remove this option  
2709 - raise NotImplementedError  
2710 - print '- ' * 39  
2711 - print 'ANALYSIS:'  
2712 - # analyse each module's code, filtered to avoid false positives:  
2713 - self.print_analysis(show_decoded_strings, deobfuscate)  
2714 for (subfilename, stream_path, form_string) in self.extract_form_strings(): 2832 for (subfilename, stream_path, form_string) in self.extract_form_strings():
2715 print '-' * 79 2833 print '-' * 79
2716 print 'VBA FORM STRING IN %r - OLE stream: %r' % (subfilename, stream_path) 2834 print 'VBA FORM STRING IN %r - OLE stream: %r' % (subfilename, stream_path)
2717 print '- ' * 39 2835 print '- ' * 39
2718 print form_string 2836 print form_string
2719 - if global_analysis and not vba_code_only: 2837 + if not vba_code_only:
2720 # analyse the code from all modules at once: 2838 # analyse the code from all modules at once:
2721 self.print_analysis(show_decoded_strings, deobfuscate) 2839 self.print_analysis(show_decoded_strings, deobfuscate)
2722 if show_deobfuscated_code: 2840 if show_deobfuscated_code:
@@ -2724,20 +2842,16 @@ class VBA_Parser_CLI(VBA_Parser): @@ -2724,20 +2842,16 @@ class VBA_Parser_CLI(VBA_Parser):
2724 print self.reveal() 2842 print self.reveal()
2725 else: 2843 else:
2726 print 'No VBA macros found.' 2844 print 'No VBA macros found.'
2727 - except KeyboardInterrupt:  
2728 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2729 - raise  
2730 - except: #TypeError:  
2731 - #raise  
2732 - #TODO: print more info if debug mode  
2733 - #print sys.exc_value  
2734 - # display the exception with full stack trace for debugging, but do not stop:  
2735 - traceback.print_exc() 2845 + except Exception as exc:
  2846 + # display the exception with full stack trace for debugging
  2847 + log.info('Error processing file %s (%s)' % (self.filename, exc))
  2848 + log.debug('Traceback:', exc_info=True)
  2849 + raise ProcessingError(self.filename, exc)
2736 print '' 2850 print ''
2737 2851
2738 2852
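The new error handling above replaces blanket traceback printing with a uniform pattern: log the details, then re-raise as a ProcessingError carrying the filename so main() can react per file. A hypothetical re-implementation of that pattern (the actual ProcessingError definition is outside this hunk, so names and attributes here are assumptions):

```python
import logging

log = logging.getLogger('olevba-sketch')

class ProcessingError(Exception):
    # wrap any error raised while processing one file, keeping the filename
    def __init__(self, filename, orig_exc):
        super(ProcessingError, self).__init__(
            'Error processing file %s (%s)' % (filename, orig_exc))
        self.filename = filename
        self.orig_exc = orig_exc

def process_one(filename, work):
    try:
        return work()
    except Exception as exc:
        # full stack trace only at debug level, single line otherwise
        log.debug('Traceback:', exc_info=True)
        raise ProcessingError(filename, exc)
```

KeyboardInterrupt is not an Exception subclass, so Ctrl+C still propagates without the explicit re-raise the old code needed.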
2739 def process_file_json(self, show_decoded_strings=False, 2853 def process_file_json(self, show_decoded_strings=False,
2740 - display_code=True, global_analysis=True, hide_attributes=True, 2854 + display_code=True, hide_attributes=True,
2741 vba_code_only=False, show_deobfuscated_code=False): 2855 vba_code_only=False, show_deobfuscated_code=False):
2742 """ 2856 """
2743 Process a single file 2857 Process a single file
@@ -2786,27 +2900,19 @@ class VBA_Parser_CLI(VBA_Parser): @@ -2786,27 +2900,19 @@ class VBA_Parser_CLI(VBA_Parser):
2786 curr_macro['ole_stream'] = stream_path 2900 curr_macro['ole_stream'] = stream_path
2787 if display_code: 2901 if display_code:
2788 curr_macro['code'] = vba_code_filtered.strip() 2902 curr_macro['code'] = vba_code_filtered.strip()
2789 - if not global_analysis and not vba_code_only:  
2790 - # analyse each module's code, filtered to avoid false positives:  
2791 - #TODO: remove this option  
2792 - curr_macro['analysis'] = self.print_analysis_json(show_decoded_strings)  
2793 macros.append(curr_macro) 2903 macros.append(curr_macro)
2794 - if global_analysis and not vba_code_only: 2904 + if not vba_code_only:
2795 # analyse the code from all modules at once: 2905 # analyse the code from all modules at once:
2796 result['analysis'] = self.print_analysis_json(show_decoded_strings) 2906 result['analysis'] = self.print_analysis_json(show_decoded_strings)
2797 if show_deobfuscated_code: 2907 if show_deobfuscated_code:
2798 result['code_deobfuscated'] = self.reveal() 2908 result['code_deobfuscated'] = self.reveal()
2799 result['macros'] = macros 2909 result['macros'] = macros
2800 result['json_conversion_successful'] = True 2910 result['json_conversion_successful'] = True
2801 - except KeyboardInterrupt:  
2802 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2803 - raise  
2804 - except: #TypeError:  
2805 - #raise  
2806 - #TODO: print more info if debug mode  
2807 - #print sys.exc_value  
2808 - # display the exception with full stack trace for debugging, but do not stop:  
2809 - traceback.print_exc() 2911 + except Exception as exc:
  2912 + # display the exception with full stack trace for debugging
  2913 + log.info('Error processing file %s (%s)' % (self.filename, exc))
  2914 + log.debug('Traceback:', exc_info=True)
  2915 + raise ProcessingError(self.filename, exc)
2810 2916
2811 return result 2917 return result
2812 2918
@@ -2816,57 +2922,46 @@ class VBA_Parser_CLI(VBA_Parser): @@ -2816,57 +2922,46 @@ class VBA_Parser_CLI(VBA_Parser):
2816 Process a file in triage mode, showing only summary results on one line. 2922 Process a file in triage mode, showing only summary results on one line.
2817 """ 2923 """
2818 #TODO: replace print by writing to a provided output file (sys.stdout by default) 2924 #TODO: replace print by writing to a provided output file (sys.stdout by default)
2819 - message = ''  
2820 try: 2925 try:
2821 - if self.type is not None:  
2822 - #TODO: handle olefile errors, when an OLE file is malformed  
2823 - if self.detect_vba_macros():  
2824 - # print a waiting message only if the output is not redirected to a file:  
2825 - if sys.stdout.isatty():  
2826 - print 'Analysis...\r',  
2827 - sys.stdout.flush()  
2828 - self.analyze_macros(show_decoded_strings=show_decoded_strings,  
2829 - deobfuscate=deobfuscate)  
2830 - flags = TYPE2TAG[self.type]  
2831 - macros = autoexec = suspicious = iocs = hexstrings = base64obf = dridex = vba_obf = '-'  
2832 - if self.contains_macros: macros = 'M'  
2833 - if self.nb_autoexec: autoexec = 'A'  
2834 - if self.nb_suspicious: suspicious = 'S'  
2835 - if self.nb_iocs: iocs = 'I'  
2836 - if self.nb_hexstrings: hexstrings = 'H'  
2837 - if self.nb_base64strings: base64obf = 'B'  
2838 - if self.nb_dridexstrings: dridex = 'D'  
2839 - if self.nb_vbastrings: vba_obf = 'V'  
2840 - flags += '%s%s%s%s%s%s%s%s' % (macros, autoexec, suspicious, iocs, hexstrings,  
2841 - base64obf, dridex, vba_obf)  
2842 - # old table display:  
2843 - # macros = autoexec = suspicious = iocs = hexstrings = 'no'  
2844 - # if nb_macros: macros = 'YES:%d' % nb_macros  
2845 - # if nb_autoexec: autoexec = 'YES:%d' % nb_autoexec  
2846 - # if nb_suspicious: suspicious = 'YES:%d' % nb_suspicious  
2847 - # if nb_iocs: iocs = 'YES:%d' % nb_iocs  
2848 - # if nb_hexstrings: hexstrings = 'YES:%d' % nb_hexstrings  
2849 - # # 2nd line = info  
2850 - # print '%-8s %-7s %-7s %-7s %-7s %-7s' % (self.type, macros, autoexec, suspicious, iocs, hexstrings)  
2851 - else:  
2852 - # self.type==None  
2853 - # file type not OLE nor OpenXML  
2854 - flags = '?'  
2855 - message = 'File format not supported'  
2856 - except KeyboardInterrupt:  
2857 - # do not ignore exceptions when the user presses Ctrl+C/Pause:  
2858 - raise  
2859 - except:  
2860 - # another error occurred  
2861 - #raise  
2862 - #TODO: print more info if debug mode  
2863 - #TODO: distinguish real errors from incorrect file types  
2864 - flags = '!ERROR'  
2865 - message = sys.exc_value  
2866 - line = '%-12s %s' % (flags, self.filename)  
2867 - if message:  
2868 - line += ' - %s' % message  
2869 - print line 2926 + #TODO: handle olefile errors, when an OLE file is malformed
  2927 + if self.detect_vba_macros():
  2928 + # print a waiting message only if the output is not redirected to a file:
  2929 + if sys.stdout.isatty():
  2930 + print 'Analysis...\r',
  2931 + sys.stdout.flush()
  2932 + self.analyze_macros(show_decoded_strings=show_decoded_strings,
  2933 + deobfuscate=deobfuscate)
  2934 + flags = TYPE2TAG[self.type]
  2935 + macros = autoexec = suspicious = iocs = hexstrings = base64obf = dridex = vba_obf = '-'
  2936 + if self.contains_macros: macros = 'M'
  2937 + if self.nb_autoexec: autoexec = 'A'
  2938 + if self.nb_suspicious: suspicious = 'S'
  2939 + if self.nb_iocs: iocs = 'I'
  2940 + if self.nb_hexstrings: hexstrings = 'H'
  2941 + if self.nb_base64strings: base64obf = 'B'
  2942 + if self.nb_dridexstrings: dridex = 'D'
  2943 + if self.nb_vbastrings: vba_obf = 'V'
  2944 + flags += '%s%s%s%s%s%s%s%s' % (macros, autoexec, suspicious, iocs, hexstrings,
  2945 + base64obf, dridex, vba_obf)
  2946 +
  2947 + line = '%-12s %s' % (flags, self.filename)
  2948 + print line
  2949 +
  2950 + # old table display:
  2951 + # macros = autoexec = suspicious = iocs = hexstrings = 'no'
  2952 + # if nb_macros: macros = 'YES:%d' % nb_macros
  2953 + # if nb_autoexec: autoexec = 'YES:%d' % nb_autoexec
  2954 + # if nb_suspicious: suspicious = 'YES:%d' % nb_suspicious
  2955 + # if nb_iocs: iocs = 'YES:%d' % nb_iocs
  2956 + # if nb_hexstrings: hexstrings = 'YES:%d' % nb_hexstrings
  2957 + # # 2nd line = info
  2958 + # print '%-8s %-7s %-7s %-7s %-7s %-7s' % (self.type, macros, autoexec, suspicious, iocs, hexstrings)
  2959 + except Exception as exc:
  2960 + # display the exception with full stack trace for debugging only
  2961 + log.debug('Error processing file %s (%s)' % (self.filename, exc),
  2962 + exc_info=True)
  2963 + raise ProcessingError(self.filename, exc)
  2964 +
2870 2965
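process_file_triage condenses each result into a fixed-order flag string: a type tag, then one letter per indicator ('M', 'A', 'S', 'I', 'H', 'B', 'D', 'V') or '-' when absent. The same assembly, sketched as a standalone helper (function name and keyword interface are mine):

```python
def triage_flags(type_tag, **counts):
    # one letter per indicator, in the fixed order used by the triage output;
    # '-' marks an indicator whose count is zero or missing
    order = [('macros', 'M'), ('autoexec', 'A'), ('suspicious', 'S'),
             ('iocs', 'I'), ('hexstrings', 'H'), ('base64', 'B'),
             ('dridex', 'D'), ('vba_obf', 'V')]
    return type_tag + ''.join(letter if counts.get(key) else '-'
                              for key, letter in order)
```

Keeping the column order fixed makes the one-line-per-file output grep-friendly across large batches.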
2871 # t = prettytable.PrettyTable(('filename', 'type', 'macros', 'autoexec', 'suspicious', 'ioc', 'hexstrings'), 2966 # t = prettytable.PrettyTable(('filename', 'type', 'macros', 'autoexec', 'suspicious', 'ioc', 'hexstrings'),
2872 # header=False, border=False) 2967 # header=False, border=False)
@@ -2888,7 +2983,6 @@ def main(): @@ -2888,7 +2983,6 @@ def main():
2888 """ 2983 """
2889 Main function, called when olevba is run from the command line 2984 Main function, called when olevba is run from the command line
2890 """ 2985 """
2891 - global log  
2892 DEFAULT_LOG_LEVEL = "warning" # Default log level 2986 DEFAULT_LOG_LEVEL = "warning" # Default log level
2893 LOG_LEVELS = { 2987 LOG_LEVELS = {
2894 'debug': logging.DEBUG, 2988 'debug': logging.DEBUG,
@@ -2944,13 +3038,14 @@ def main(): @@ -2944,13 +3038,14 @@ def main():
2944 if len(args) == 0: 3038 if len(args) == 0:
2945 print __doc__ 3039 print __doc__
2946 parser.print_help() 3040 parser.print_help()
2947 - sys.exit() 3041 + sys.exit(RETURN_WRONG_ARGS)
2948 3042
2949 # provide info about tool and its version 3043 # provide info about tool and its version
2950 if options.output_mode == 'json': 3044 if options.output_mode == 'json':
2951 - json_results = [dict(script_name='olevba', version=__version__,  
2952 - url='http://decalage.info/python/oletools',  
2953 - type='MetaInformation'), ] 3045 + # prints opening [
  3046 + print_json(script_name='olevba', version=__version__,
  3047 + url='http://decalage.info/python/oletools',
  3048 + type='MetaInformation')
2954 else: 3049 else:
2955 print 'olevba %s - http://decalage.info/python/oletools' % __version__ 3050 print 'olevba %s - http://decalage.info/python/oletools' % __version__
2956 3051
@@ -2976,65 +3071,120 @@ def main(): @@ -2976,65 +3071,120 @@ def main():
2976 count = 0 3071 count = 0
2977 container = filename = data = None 3072 container = filename = data = None
2978 vba_parser = None 3073 vba_parser = None
2979 - for container, filename, data in xglob.iter_files(args, recursive=options.recursive,  
2980 - zip_password=options.zip_password, zip_fname=options.zip_fname):  
2981 - # ignore directory names stored in zip files:  
2982 - if container and filename.endswith('/'):  
2983 - continue  
2984 - # Open the file  
2985 - vba_parser = VBA_Parser_CLI(filename, data=data, container=container)  
2986 - if options.output_mode == 'detailed':  
2987 - # fully detailed output  
2988 - vba_parser.process_file(show_decoded_strings=options.show_decoded_strings,  
2989 - display_code=options.display_code, global_analysis=True, #options.global_analysis,  
2990 - hide_attributes=options.hide_attributes, vba_code_only=options.vba_code_only,  
2991 - show_deobfuscated_code=options.show_deobfuscated_code,  
2992 - deobfuscate=options.deobfuscate)  
2993 - elif options.output_mode in ('triage', 'unspecified'):  
2994 - # print container name when it changes:  
2995 - if container != previous_container:  
2996 - if container is not None:  
2997 - print '\nFiles in %s:' % container  
2998 - previous_container = container  
2999 - # summarized output for triage:  
3000 - vba_parser.process_file_triage(show_decoded_strings=options.show_decoded_strings,  
3001 - deobfuscate=options.deobfuscate)  
3002 - elif options.output_mode == 'json':  
3003 - json_results.append(  
3004 - vba_parser.process_file_json(show_decoded_strings=options.show_decoded_strings,  
3005 - display_code=options.display_code, global_analysis=True, #options.global_analysis,  
3006 - hide_attributes=options.hide_attributes, vba_code_only=options.vba_code_only,  
3007 - show_deobfuscated_code=options.show_deobfuscated_code))  
3008 - else: # (should be impossible)  
3009 - raise ValueError('unexpected output mode: "{0}"!'.format(options.output_mode))  
3010 - count += 1  
3011 - if options.output_mode == 'triage':  
3012 - print '\n(Flags: OpX=OpenXML, XML=Word2003XML, MHT=MHTML, TXT=Text, M=Macros, ' \  
3013 - 'A=Auto-executable, S=Suspicious keywords, I=IOCs, H=Hex strings, ' \  
3014 - 'B=Base64 strings, D=Dridex strings, V=VBA strings, ?=Unknown)\n'  
3015 -  
3016 - if count == 1 and options.output_mode == 'unspecified':  
3017 - # if options -t, -d and -j were not specified and it's a single file, print details:  
3018 - vba_parser.process_file(show_decoded_strings=options.show_decoded_strings,  
3019 - display_code=options.display_code, global_analysis=True, #options.global_analysis,  
3020 - hide_attributes=options.hide_attributes, vba_code_only=options.vba_code_only,  
3021 - show_deobfuscated_code=options.show_deobfuscated_code,  
3022 - deobfuscate=options.deobfuscate)  
3023 -  
3024 - if options.output_mode == 'json':  
3025 - json_options = dict(check_circular=False, indent=4, ensure_ascii=False)  
3026 -  
3027 - # json.dump[s] cannot deal with unicode objects that are not properly  
3028 - # encoded --> encode in own function:  
3029 - json_results = json2ascii(json_results)  
3030 - #print_json(json_results)  
3031 -  
3032 - # if False: # options.outfile: # (option currently commented out)  
3033 - # with open(outfile, 'w') as write_handle:  
3034 - # json.dump(write_handle, **json_options)  
3035 - # else:  
3036 - print json.dumps(json_results, **json_options) 3074 + return_code = RETURN_OK
  3075 + try:
  3076 + for container, filename, data in xglob.iter_files(args, recursive=options.recursive,
  3077 + zip_password=options.zip_password, zip_fname=options.zip_fname):
  3078 + # ignore directory names stored in zip files:
  3079 + if container and filename.endswith('/'):
  3080 + continue
  3081 +
  3082 + # handle errors from xglob
  3083 + if isinstance(data, Exception):
  3084 + if isinstance(data, PathNotFoundException):
  3085 + if options.output_mode in ('triage', 'unspecified'):
  3086 + print '%-12s %s - File not found' % ('?', filename)
  3087 + elif options.output_mode != 'json':
  3088 + log.error('Given path %r does not exist!' % filename)
  3089 + return_code = RETURN_FILE_NOT_FOUND if return_code == 0 \
  3090 + else RETURN_SEVERAL_ERRS
  3091 + else:
  3092 + if options.output_mode in ('triage', 'unspecified'):
  3093 + print '%-12s %s - Failed to read from zip file %s' % ('?', filename, container)
  3094 + elif options.output_mode != 'json':
  3095 + log.error('Exception opening/reading %r from zip file %r: %s'
  3096 + % (filename, container, data))
  3097 + return_code = RETURN_XGLOB_ERR if return_code == 0 \
  3098 + else RETURN_SEVERAL_ERRS
  3099 + if options.output_mode == 'json':
  3100 + print_json(file=filename, type='error',
  3101 + error=type(data).__name__, message=str(data))
  3102 + continue
3037 3103
  3104 + try:
  3105 + # Open the file
  3106 + vba_parser = VBA_Parser_CLI(filename, data=data, container=container)
  3107 +
  3108 + if options.output_mode == 'detailed':
  3109 + # fully detailed output
  3110 + vba_parser.process_file(show_decoded_strings=options.show_decoded_strings,
  3111 + display_code=options.display_code,
  3112 + hide_attributes=options.hide_attributes, vba_code_only=options.vba_code_only,
  3113 + show_deobfuscated_code=options.show_deobfuscated_code,
  3114 + deobfuscate=options.deobfuscate)
  3115 + elif options.output_mode in ('triage', 'unspecified'):
  3116 + # print container name when it changes:
  3117 + if container != previous_container:
  3118 + if container is not None:
  3119 + print '\nFiles in %s:' % container
  3120 + previous_container = container
  3121 + # summarized output for triage:
  3122 + vba_parser.process_file_triage(show_decoded_strings=options.show_decoded_strings,
  3123 + deobfuscate=options.deobfuscate)
  3124 + elif options.output_mode == 'json':
  3125 + print_json(
  3126 + vba_parser.process_file_json(show_decoded_strings=options.show_decoded_strings,
  3127 + display_code=options.display_code,
  3128 + hide_attributes=options.hide_attributes, vba_code_only=options.vba_code_only,
  3129 + show_deobfuscated_code=options.show_deobfuscated_code))
  3130 + else: # (should be impossible)
  3131 + raise ValueError('unexpected output mode: "{0}"!'.format(options.output_mode))
  3132 + count += 1
  3133 +
  3134 + except FileOpenError as exc:
  3135 + if options.output_mode in ('triage', 'unspecified'):
  3136 + print '%-12s %s - File format not supported' % ('?', filename)
  3137 + elif options.output_mode == 'json':
  3138 + print_json(file=filename, type='error',
  3139 + error=type(exc).__name__, message=str(exc))
  3140 + else:
  3141 + log.exception('Failed to open %s -- probably not supported!' % filename)
  3142 + return_code = RETURN_OPEN_ERROR if return_code == 0 \
  3143 + else RETURN_SEVERAL_ERRS
  3144 + except ProcessingError as exc:
  3145 + if options.output_mode in ('triage', 'unspecified'):
  3146 + print '%-12s %s - %s' % ('!ERROR', filename, exc.orig_exception)
  3147 + elif options.output_mode == 'json':
  3148 + print_json(file=filename, type='error',
  3149 + error=type(exc).__name__,
  3150 + message=str(exc.orig_exception))
  3151 + else:
  3152 + log.exception('Error processing file %s (%s)!'
  3153 + % (filename, exc.orig_exception))
  3154 + return_code = RETURN_PARSE_ERROR if return_code == 0 \
  3155 + else RETURN_SEVERAL_ERRS
  3156 + finally:
  3157 + if vba_parser is not None:
  3158 + vba_parser.close()
  3159 +
  3160 + if options.output_mode == 'triage':
  3161 + print '\n(Flags: OpX=OpenXML, XML=Word2003XML, MHT=MHTML, TXT=Text, M=Macros, ' \
  3162 + 'A=Auto-executable, S=Suspicious keywords, I=IOCs, H=Hex strings, ' \
  3163 + 'B=Base64 strings, D=Dridex strings, V=VBA strings, ?=Unknown)\n'
  3164 +
  3165 + if count == 1 and options.output_mode == 'unspecified':
  3166 + # if options -t, -d and -j were not specified and it's a single file, print details:
  3167 + vba_parser.process_file(show_decoded_strings=options.show_decoded_strings,
  3168 + display_code=options.display_code,
  3169 + hide_attributes=options.hide_attributes, vba_code_only=options.vba_code_only,
  3170 + show_deobfuscated_code=options.show_deobfuscated_code,
  3171 + deobfuscate=options.deobfuscate)
  3172 +
  3173 + if options.output_mode == 'json':
  3174 + # print last json entry (a last one without a comma) and closing ]
  3175 + print_json(type='MetaInformation', return_code=return_code,
  3176 + n_processed=count, _json_is_last=True)
  3177 +
  3178 + except Exception as exc:
  3179 + # some unexpected error, maybe some of the types caught in except clauses
  3180 + # above were not sufficient. This is very bad, so log complete trace at exception level
  3181 + # and do not care about output mode
  3182 + log.exception('Unhandled exception in main: %s' % exc, exc_info=True)
  3183 + return_code = RETURN_UNEXPECTED # even if there were others before -- this is more important
  3184 +
  3185 + # done. exit
  3186 + log.debug('will exit now with code %s' % return_code)
  3187 + sys.exit(return_code)
3038 3188
3039 if __name__ == '__main__': 3189 if __name__ == '__main__':
3040 main() 3190 main()
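The reworked main() above tracks a single process-wide return code: the first error class encountered is kept, and any later error collapses it to a generic "several errors" code. A minimal standalone sketch of that pattern (constant names and values here are illustrative, not olevba's actual RETURN_* definitions):

```python
# Illustrative constants; olevba defines its own RETURN_* values.
RETURN_OK = 0
RETURN_FILE_NOT_FOUND = 2
RETURN_OPEN_ERROR = 3
RETURN_SEVERAL_ERRS = 9

def merge_return_code(current, new_error):
    """Keep the first error code; once a second error occurs,
    report the generic 'several errors' code instead."""
    if current == RETURN_OK:
        return new_error
    return RETURN_SEVERAL_ERRS

code = RETURN_OK
code = merge_return_code(code, RETURN_FILE_NOT_FOUND)  # first error is kept
code = merge_return_code(code, RETURN_OPEN_ERROR)      # now several errors
print(code)  # prints: 9
```

This mirrors the repeated `RETURN_X if return_code == 0 else RETURN_SEVERAL_ERRS` expressions in the diff, so callers can distinguish a single well-defined failure from a mixed batch.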
oletools/rtfobj.py
@@ -48,8 +48,11 @@ http://www.decalage.info/python/oletools @@ -48,8 +48,11 @@ http://www.decalage.info/python/oletools
48 # - extract files from OLE Package objects 48 # - extract files from OLE Package objects
49 # 2016-04-01 v0.04 PL: - fixed logging output to use stdout instead of stderr 49 # 2016-04-01 v0.04 PL: - fixed logging output to use stdout instead of stderr
50 # 2016-04-07 v0.45 PL: - improved parsing to handle some malware tricks 50 # 2016-04-07 v0.45 PL: - improved parsing to handle some malware tricks
  51 +# 2016-05-06 v0.47 TJ: - added option -d to set the output directory
  52 +# (contribution by Thomas Jarosch)
  53 +# TJ: - sanitize filenames to avoid special characters
51 54
52 -__version__ = '0.45' 55 +__version__ = '0.47'
53 56
54 #------------------------------------------------------------------------------ 57 #------------------------------------------------------------------------------
55 # TODO: 58 # TODO:
@@ -60,7 +63,7 @@ __version__ = '0.45' @@ -60,7 +63,7 @@ __version__ = '0.45'
60 63
61 #=== IMPORTS ================================================================= 64 #=== IMPORTS =================================================================
62 65
63 -import re, sys, string, binascii, logging, optparse 66 +import re, os, sys, string, binascii, logging, optparse
64 67
65 from thirdparty.xglob import xglob 68 from thirdparty.xglob import xglob
66 from oleobj import OleObject, OleNativeStream 69 from oleobj import OleObject, OleNativeStream
@@ -280,7 +283,42 @@ def rtf_iter_objects (data, min_size=32): @@ -280,7 +283,42 @@ def rtf_iter_objects (data, min_size=32):
280 match = re_hexblock.search(data, pos=current) 283 match = re_hexblock.search(data, pos=current)
281 284
282 285
283 -def process_file(container, filename, data): 286 +
  287 +def sanitize_filename(filename, replacement='_', max_length=200):
  288 + """compute basename of filename. Replaces all non-whitelisted characters.
  289 + The returned filename is always a basename of the file."""
  290 + basepath = os.path.basename(filename).strip()
  291 + sane_fname = re.sub(r'[^\w\.\- ]', replacement, basepath)
  292 +
  293 + while ".." in sane_fname:
  294 + sane_fname = sane_fname.replace('..', '.')
  295 +
  296 + while " " in sane_fname:
  297 + sane_fname = sane_fname.replace(' ', ' ')
  298 +
  299 + if not len(sane_fname):
  300 + sane_fname = 'NONAME'
  301 +
  302 + # limit filename length
  303 + if max_length:
  304 + sane_fname = sane_fname[:max_length]
  305 +
  306 + return sane_fname
  307 +
  308 +
  309 +def process_file(container, filename, data, output_dir=None):
  310 + if output_dir:
  311 + if not os.path.isdir(output_dir):
  312 + log.info('creating output directory %s' % output_dir)
  313 + os.mkdir(output_dir)
  314 +
  315 + fname_prefix = os.path.join(output_dir,
  316 + sanitize_filename(filename))
  317 + else:
  318 + base_dir = os.path.dirname(filename)
  319 + sane_fname = sanitize_filename(filename)
  320 + fname_prefix = os.path.join(base_dir, sane_fname)
  321 +
284 # TODO: option to extract objects to files (false by default) 322 # TODO: option to extract objects to files (false by default)
285 if data is None: 323 if data is None:
286 data = open(filename, 'rb').read() 324 data = open(filename, 'rb').read()
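The new sanitize_filename() helper whitelists characters in the basename before it is reused to name extracted objects. A standalone sketch of that behavior (re-implemented here for illustration; this version guards against an empty *sanitized* result rather than an empty input):

```python
import os
import re

def sanitize_filename(filename, replacement='_', max_length=200):
    """Illustrative re-implementation of the sanitizer added above:
    keep only word chars, dots, dashes and spaces from the basename."""
    sane = re.sub(r'[^\w\.\- ]', replacement,
                  os.path.basename(filename).strip())
    while '..' in sane:    # collapse dot runs (blocks '../..' traversal)
        sane = sane.replace('..', '.')
    while '  ' in sane:    # collapse whitespace runs
        sane = sane.replace('  ', ' ')
    if not sane:           # empty after sanitizing: use a placeholder
        sane = 'NONAME'
    return sane[:max_length] if max_length else sane

print(sanitize_filename('../../etc/pass wd$'))  # prints: pass wd_
```

Taking the basename first means a hostile path embedded in an RTF can never escape the output directory, and the character whitelist removes shell metacharacters from the saved filenames.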
@@ -288,7 +326,7 @@ def process_file(container, filename, data): @@ -288,7 +326,7 @@ def process_file(container, filename, data):
288 print 'File: %r - %d bytes' % (filename, len(data)) 326 print 'File: %r - %d bytes' % (filename, len(data))
289 for index, orig_len, objdata in rtf_iter_objects(data): 327 for index, orig_len, objdata in rtf_iter_objects(data):
290 print 'found object size %d at index %08X - end %08X' % (len(objdata), index, index+orig_len) 328 print 'found object size %d at index %08X - end %08X' % (len(objdata), index, index+orig_len)
291 - fname = '%s_object_%08X.raw' % (filename, index) 329 + fname = '%s_object_%08X.raw' % (fname_prefix, index)
292 print 'saving object to file %s' % fname 330 print 'saving object to file %s' % fname
293 open(fname, 'wb').write(objdata) 331 open(fname, 'wb').write(objdata)
294 # TODO: check if all hex data is extracted properly 332 # TODO: check if all hex data is extracted properly
@@ -308,7 +346,8 @@ def process_file(container, filename, data): @@ -308,7 +346,8 @@ def process_file(container, filename, data):
308 ext = 'package' 346 ext = 'package'
309 else: 347 else:
310 ext = 'bin' 348 ext = 'bin'
311 - fname = '%s_object_%08X.%s' % (filename, index, ext) 349 +
  350 + fname = '%s_object_%08X.%s' % (fname_prefix, index, ext)
312 print 'saving to file %s' % fname 351 print 'saving to file %s' % fname
313 open(fname, 'wb').write(obj.data) 352 open(fname, 'wb').write(obj.data)
314 if obj.class_name.lower() == 'package': 353 if obj.class_name.lower() == 'package':
@@ -318,9 +357,10 @@ def process_file(container, filename, data): @@ -318,9 +357,10 @@ def process_file(container, filename, data):
318 print 'Source path = %r' % opkg.src_path 357 print 'Source path = %r' % opkg.src_path
319 print 'Temp path = %r' % opkg.temp_path 358 print 'Temp path = %r' % opkg.temp_path
320 if opkg.filename: 359 if opkg.filename:
321 - fname = '%s_%s' % (filename, opkg.filename) 360 + fname = '%s_%s' % (fname_prefix,
  361 + sanitize_filename(opkg.filename))
322 else: 362 else:
323 - fname = '%s_object_%08X.noname' % (filename, index) 363 + fname = '%s_object_%08X.noname' % (fname_prefix, index)
324 print 'saving to file %s' % fname 364 print 'saving to file %s' % fname
325 open(fname, 'wb').write(opkg.data) 365 open(fname, 'wb').write(opkg.data)
326 except: 366 except:
@@ -354,6 +394,8 @@ if __name__ == '__main__': @@ -354,6 +394,8 @@ if __name__ == '__main__':
354 # help='export results to a CSV file') 394 # help='export results to a CSV file')
355 parser.add_option("-r", action="store_true", dest="recursive", 395 parser.add_option("-r", action="store_true", dest="recursive",
356 help='find files recursively in subdirectories.') 396 help='find files recursively in subdirectories.')
  397 + parser.add_option("-d", type="str", dest="output_dir",
  398 + help='use specified directory to output files.', default=None)
357 parser.add_option("-z", "--zip", dest='zip_password', type='str', default=None, 399 parser.add_option("-z", "--zip", dest='zip_password', type='str', default=None,
358 help='if the file is a zip archive, open first file from it, using the provided password (requires Python 2.6+)') 400 help='if the file is a zip archive, open first file from it, using the provided password (requires Python 2.6+)')
359 parser.add_option("-f", "--zipfname", dest='zip_fname', type='str', default='*', 401 parser.add_option("-f", "--zipfname", dest='zip_fname', type='str', default='*',
@@ -384,7 +426,7 @@ if __name__ == '__main__': @@ -384,7 +426,7 @@ if __name__ == '__main__':
384 # ignore directory names stored in zip files: 426 # ignore directory names stored in zip files:
385 if container and filename.endswith('/'): 427 if container and filename.endswith('/'):
386 continue 428 continue
387 - process_file(container, filename, data) 429 + process_file(container, filename, data, options.output_dir)
388 430
389 431
390 432
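With the new -d option, extracted objects are written into the chosen directory, named after the input file plus the object's offset and type. A rough sketch of how those output paths are built (the helper name is made up for illustration; the real code also sanitizes the basename, omitted here for brevity):

```python
import os

def object_path(input_file, index, ext, output_dir=None):
    # Hypothetical helper mirroring the fname_prefix logic in process_file:
    # write next to the input file unless an output directory is given.
    base = os.path.basename(input_file)
    directory = output_dir if output_dir else os.path.dirname(input_file)
    return os.path.join(directory, '%s_object_%08X.%s' % (base, index, ext))

print(object_path('/tmp/samples/doc.rtf', 0x42, 'bin', output_dir='out'))
# on POSIX: out/doc.rtf_object_00000042.bin
```

Keeping the offset in the name (%08X) makes each extracted object traceable back to its position in the RTF stream, even when several objects come from one file.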
oletools/thirdparty/olefile/olefile.py 100644 → 100755
1 -#!/usr/bin/env python  
2 -  
3 -# olefile (formerly OleFileIO_PL)  
4 -#  
5 -# Module to read/write Microsoft OLE2 files (also called Structured Storage or  
6 -# Microsoft Compound Document File Format), such as Microsoft Office 97-2003  
7 -# documents, Image Composer and FlashPix files, Outlook messages, ...  
8 -# This version is compatible with Python 2.6+ and 3.x  
9 -#  
10 -# Project website: http://www.decalage.info/olefile  
11 -#  
12 -# olefile is copyright (c) 2005-2016 Philippe Lagadec (http://www.decalage.info)  
13 -#  
14 -# olefile is based on the OleFileIO module from the PIL library v1.1.6  
15 -# See: http://www.pythonware.com/products/pil/index.htm  
16 -#  
17 -# The Python Imaging Library (PIL) is  
18 -# Copyright (c) 1997-2005 by Secret Labs AB  
19 -# Copyright (c) 1995-2005 by Fredrik Lundh  
20 -#  
21 -# See source code and LICENSE.txt for information on usage and redistribution.  
22 -  
23 -  
24 -# Since OleFileIO_PL v0.30, only Python 2.6+ and 3.x is supported  
25 -# This import enables print() as a function rather than a keyword  
26 -# (main requirement to be compatible with Python 3.x)  
27 -# The comment on the line below should be printed on Python 2.5 or older:  
28 -from __future__ import print_function # This version of olefile requires Python 2.6+ or 3.x.  
29 -  
30 -  
31 -__author__ = "Philippe Lagadec"  
32 -__date__ = "2016-02-02"  
33 -__version__ = '0.44'  
34 -  
35 -#--- LICENSE ------------------------------------------------------------------  
36 -  
37 -# olefile (formerly OleFileIO_PL) is copyright (c) 2005-2016 Philippe Lagadec  
38 -# (http://www.decalage.info)  
39 -#  
40 -# All rights reserved.  
41 -#  
42 -# Redistribution and use in source and binary forms, with or without modification,  
43 -# are permitted provided that the following conditions are met:  
44 -#  
45 -# * Redistributions of source code must retain the above copyright notice, this  
46 -# list of conditions and the following disclaimer.  
47 -# * Redistributions in binary form must reproduce the above copyright notice,  
48 -# this list of conditions and the following disclaimer in the documentation  
49 -# and/or other materials provided with the distribution.  
50 -#  
51 -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND  
52 -# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED  
53 -# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE  
54 -# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE  
55 -# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL  
56 -# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR  
57 -# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER  
58 -# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,  
59 -# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE  
60 -# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.  
61 -  
62 -# ----------  
63 -# PIL License:  
64 -#  
65 -# olefile is based on source code from the OleFileIO module of the Python  
66 -# Imaging Library (PIL) published by Fredrik Lundh under the following license:  
67 -  
68 -# The Python Imaging Library (PIL) is  
69 -# Copyright (c) 1997-2005 by Secret Labs AB  
70 -# Copyright (c) 1995-2005 by Fredrik Lundh  
71 -#  
72 -# By obtaining, using, and/or copying this software and/or its associated  
73 -# documentation, you agree that you have read, understood, and will comply with  
74 -# the following terms and conditions:  
75 -#  
76 -# Permission to use, copy, modify, and distribute this software and its  
77 -# associated documentation for any purpose and without fee is hereby granted,  
78 -# provided that the above copyright notice appears in all copies, and that both  
79 -# that copyright notice and this permission notice appear in supporting  
80 -# documentation, and that the name of Secret Labs AB or the author(s) not be used  
81 -# in advertising or publicity pertaining to distribution of the software  
82 -# without specific, written prior permission.  
83 -#  
84 -# SECRET LABS AB AND THE AUTHORS DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS  
85 -# SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS.  
86 -# IN NO EVENT SHALL SECRET LABS AB OR THE AUTHORS BE LIABLE FOR ANY SPECIAL,  
87 -# INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM  
88 -# LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR  
89 -# OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR  
90 -# PERFORMANCE OF THIS SOFTWARE.  
91 -  
92 -#-----------------------------------------------------------------------------  
93 -# CHANGELOG: (only olefile/OleFileIO_PL changes compared to PIL 1.1.6)  
94 -# 2005-05-11 v0.10 PL: - a few fixes for Python 2.4 compatibility  
95 -# (all changes flagged with [PL])  
96 -# 2006-02-22 v0.11 PL: - a few fixes for some Office 2003 documents which raise  
97 -# exceptions in _OleStream.__init__()  
98 -# 2006-06-09 v0.12 PL: - fixes for files above 6.8MB (DIFAT in loadfat)  
99 -# - added some constants  
100 -# - added header values checks  
101 -# - added some docstrings  
102 -# - getsect: bugfix in case sectors >512 bytes  
103 -# - getsect: added conformity checks  
104 -# - DEBUG_MODE constant to activate debug display  
105 -# 2007-09-04 v0.13 PL: - improved/translated (lots of) comments  
106 -# - updated license  
107 -# - converted tabs to 4 spaces  
108 -# 2007-11-19 v0.14 PL: - added OleFileIO._raise_defect() to adapt sensitivity  
109 -# - improved _unicode() to use Python 2.x unicode support  
110 -# - fixed bug in _OleDirectoryEntry  
111 -# 2007-11-25 v0.15 PL: - added safety checks to detect FAT loops  
112 -# - fixed _OleStream which didn't check stream size  
113 -# - added/improved many docstrings and comments  
114 -# - moved helper functions _unicode and _clsid out of  
115 -# OleFileIO class  
116 -# - improved OleFileIO._find() to add Unix path syntax  
117 -# - OleFileIO._find() is now case-insensitive  
118 -# - added get_type() and get_rootentry_name()  
119 -# - rewritten loaddirectory and _OleDirectoryEntry  
120 -# 2007-11-27 v0.16 PL: - added _OleDirectoryEntry.kids_dict  
121 -# - added detection of duplicate filenames in storages  
122 -# - added detection of duplicate references to streams  
123 -# - added get_size() and exists() to _OleDirectoryEntry  
124 -# - added isOleFile to check header before parsing  
125 -# - added __all__ list to control public keywords in pydoc  
126 -# 2007-12-04 v0.17 PL: - added _load_direntry to fix a bug in loaddirectory  
127 -# - improved _unicode(), added workarounds for Python <2.3  
128 -# - added set_debug_mode and -d option to set debug mode  
129 -# - fixed bugs in OleFileIO.open and _OleDirectoryEntry  
130 -# - added safety check in main for large or binary  
131 -# properties  
132 -# - allow size>0 for storages for some implementations  
133 -# 2007-12-05 v0.18 PL: - fixed several bugs in handling of FAT, MiniFAT and  
134 -# streams  
135 -# - added option '-c' in main to check all streams  
136 -# 2009-12-10 v0.19 PL: - bugfix for 32 bit arrays on 64 bits platforms  
137 -# (thanks to Ben G. and Martijn for reporting the bug)  
138 -# 2009-12-11 v0.20 PL: - bugfix in OleFileIO.open when filename is not plain str  
139 -# 2010-01-22 v0.21 PL: - added support for big-endian CPUs such as PowerPC Macs  
140 -# 2012-02-16 v0.22 PL: - fixed bug in getproperties, patch by chuckleberryfinn  
141 -# (https://bitbucket.org/decalage/olefileio_pl/issue/7)  
142 -# - added close method to OleFileIO (fixed issue #2)  
143 -# 2012-07-25 v0.23 PL: - added support for file-like objects (patch by mete0r_kr)  
144 -# 2013-05-05 v0.24 PL: - getproperties: added conversion from filetime to python  
145 -# datetime  
146 -# - main: displays properties with date format  
147 -# - new class OleMetadata to parse standard properties  
148 -# - added get_metadata method  
149 -# 2013-05-07 v0.24 PL: - a few improvements in OleMetadata  
150 -# 2013-05-24 v0.25 PL: - getproperties: option to not convert some timestamps  
151 -# - OleMetaData: total_edit_time is now a number of seconds,  
152 -# not a timestamp  
153 -# - getproperties: added support for VT_BOOL, VT_INT, V_UINT  
154 -# - getproperties: filter out null chars from strings  
155 -# - getproperties: raise non-fatal defects instead of  
156 -# exceptions when properties cannot be parsed properly  
157 -# 2013-05-27 PL: - getproperties: improved exception handling  
158 -# - _raise_defect: added option to set exception type  
159 -# - all non-fatal issues are now recorded, and displayed  
160 -# when run as a script  
161 -# 2013-07-11 v0.26 PL: - added methods to get modification and creation times  
162 -# of a directory entry or a storage/stream  
163 -# - fixed parsing of direntry timestamps  
164 -# 2013-07-24 PL: - new options in listdir to list storages and/or streams  
165 -# 2014-02-04 v0.30 PL: - upgraded code to support Python 3.x by Martin Panter  
166 -# - several fixes for Python 2.6 (xrange, MAGIC)  
167 -# - reused i32 from Pillow's _binary  
168 -# 2014-07-18 v0.31 - preliminary support for 4K sectors  
169 -# 2014-07-27 v0.31 PL: - a few improvements in OleFileIO.open (header parsing)  
170 -# - Fixed loadfat for large files with 4K sectors (issue #3)  
171 -# 2014-07-30 v0.32 PL: - added write_sect to write sectors to disk  
172 -# - added write_mode option to OleFileIO.__init__ and open  
173 -# 2014-07-31 PL: - fixed padding in write_sect for Python 3, added checks  
174 -# - added write_stream to write a stream to disk  
175 -# 2014-09-26 v0.40 PL: - renamed OleFileIO_PL to olefile  
176 -# 2014-11-09 NE: - added support for Jython (Niko Ehrenfeuchter)  
177 -# 2014-11-13 v0.41 PL: - improved isOleFile and OleFileIO.open to support OLE  
178 -# data in a string buffer and file-like objects.  
179 -# 2014-11-21 PL: - updated comments according to Pillow's commits  
180 -# 2015-01-24 v0.42 PL: - changed the default path name encoding from Latin-1  
181 -# to UTF-8 on Python 2.x (Unicode on Python 3.x)  
182 -# - added path_encoding option to override the default  
183 -# - fixed a bug in _list when a storage is empty  
184 -# 2015-04-17 v0.43 PL: - slight changes in _OleDirectoryEntry  
185 -# 2015-10-19 - fixed issue #26 in OleFileIO.getproperties  
186 -# (using id and type as local variable names)  
187 -# 2015-10-29 - replaced debug() with proper logging  
188 -# - use optparse to handle command line options  
189 -# - improved attribute names in OleFileIO class  
190 -# 2015-11-05 - fixed issue #27 by correcting the MiniFAT sector  
191 -# cutoff size if invalid.  
192 -# 2016-02-02 - logging is disabled by default  
193 -  
194 -#-----------------------------------------------------------------------------  
195 -# TODO (for version 1.0):  
196 -# + get rid of print statements, to simplify Python 2.x and 3.x support  
197 -# + add is_stream and is_storage  
198 -# + remove leading and trailing slashes where a path is used  
199 -# + add functions path_list2str and path_str2list  
200 -# + fix how all the methods handle unicode str and/or bytes as arguments  
201 -# + add path attrib to _OleDirEntry, set it once and for all in init or  
202 -# append_kids (then listdir/_list can be simplified)  
203 -# - TESTS with Linux, MacOSX, Python 1.5.2, various files, PIL, ...  
204 -# - add underscore to each private method, to avoid their display in  
205 -# pydoc/epydoc documentation - Remove it for classes to be documented  
206 -# - replace all raised exceptions with _raise_defect (at least in OleFileIO)  
207 -# - merge code from _OleStream and OleFileIO.getsect to read sectors  
208 -# (maybe add a class for FAT and MiniFAT ?)  
209 -# - add method to check all streams (follow sectors chains without storing all  
210 -# stream in memory, and report anomalies)  
211 -# - use _OleDirectoryEntry.kids_dict to improve _find and _list ?  
212 -# - fix Unicode names handling (find some way to stay compatible with Py1.5.2)  
213 -# => if possible avoid converting names to Latin-1  
214 -# - review DIFAT code: fix handling of DIFSECT blocks in FAT (not stop)  
215 -# - rewrite OleFileIO.getproperties  
216 -# - improve docstrings to show more sample uses  
217 -# - see also original notes and FIXME below  
218 -# - remove all obsolete FIXMEs  
219 -# - OleMetadata: fix version attrib according to  
220 -# http://msdn.microsoft.com/en-us/library/dd945671%28v=office.12%29.aspx  
221 -  
222 -# IDEAS:  
223 -# - in OleFileIO._open and _OleStream, use size=None instead of 0x7FFFFFFF for  
224 -# streams with unknown size  
225 -# - use arrays of int instead of long integers for FAT/MiniFAT, to improve  
226 -# performance and reduce memory usage ? (possible issue with values >2^31)  
227 -# - provide tests with unittest (may need write support to create samples)  
228 -# - move all debug code (and maybe dump methods) to a separate module, with  
229 -# a class which inherits OleFileIO ?  
230 -# - fix docstrings to follow epydoc format  
231 -# - add support for big endian byte order ?  
232 -# - create a simple OLE explorer with wxPython  
233 -  
234 -# FUTURE EVOLUTIONS to add write support:  
235 -# see issue #6 on Bitbucket:  
236 -# https://bitbucket.org/decalage/olefileio_pl/issue/6/improve-olefileio_pl-to-write-ole-files  
237 -  
238 -#-----------------------------------------------------------------------------  
239 -# NOTES from PIL 1.1.6:  
240 -  
241 -# History:  
242 -# 1997-01-20 fl Created  
243 -# 1997-01-22 fl Fixed 64-bit portability quirk  
244 -# 2003-09-09 fl Fixed typo in OleFileIO.loadfat (noted by Daniel Haertle)  
245 -# 2004-02-29 fl Changed long hex constants to signed integers  
246 -#  
247 -# Notes:  
248 -# FIXME: sort out sign problem (eliminate long hex constants)  
249 -# FIXME: change filename to use "a/b/c" instead of ["a", "b", "c"]  
250 -# FIXME: provide a glob mechanism function (using fnmatchcase)  
251 -#  
252 -# Literature:  
253 -#  
254 -# "FlashPix Format Specification, Appendix A", Kodak and Microsoft,  
255 -# September 1996.  
256 -#  
257 -# Quotes:  
258 -#  
259 -# "If this document and functionality of the Software conflict,  
260 -# the actual functionality of the Software represents the correct  
261 -# functionality" -- Microsoft, in the OLE format specification  
262 -  
263 -#------------------------------------------------------------------------------  
264 -  
265 -  
266 -import io  
267 -import sys  
268 -import struct, array, os.path, datetime, logging  
269 -  
270 -#=== COMPATIBILITY WORKAROUNDS ================================================  
271 -  
272 -#[PL] Define explicitly the public API to avoid private objects in pydoc:  
273 -#TODO: add more  
274 -# __all__ = ['OleFileIO', 'isOleFile', 'MAGIC']  
275 -  
276 -# For Python 3.x, need to redefine long as int:  
277 -if str is not bytes:  
278 - long = int  
279 -  
280 -# Need to make sure we use xrange both on Python 2 and 3.x:  
281 -try:  
282 - # on Python 2 we need xrange:  
283 - iterrange = xrange  
284 -except:  
285 - # no xrange, for Python 3 it was renamed as range:  
286 - iterrange = range  
287 -  
288 -#[PL] workaround to fix an issue with array item size on 64 bits systems:  
289 -if array.array('L').itemsize == 4:  
290 - # on 32 bits platforms, long integers in an array are 32 bits:  
291 - UINT32 = 'L'  
292 -elif array.array('I').itemsize == 4:  
293 - # on 64 bits platforms, integers in an array are 32 bits:  
294 - UINT32 = 'I'  
295 -elif array.array('i').itemsize == 4:  
296 - # On 64 bit Jython, signed integers ('i') are the only way to store our 32  
297 - # bit values in an array in a *somewhat* reasonable way, as the otherwise  
298 - # perfectly suited 'H' (unsigned int, 32 bits) results in a completely  
299 - # unusable behaviour. This is most likely caused by the fact that Java  
300 - # doesn't have unsigned values, and thus Jython's "array" implementation,  
301 - # which is based on "jarray", doesn't have them either.  
302 - # NOTE: to trick Jython into converting the values it would normally  
303 - # interpret as "signed" into "unsigned", a binary-and operation with  
304 - # 0xFFFFFFFF can be used. This way it is possible to use the same comparing  
305 - # operations on all platforms / implementations. The corresponding code  
306 - # lines are flagged with a 'JYTHON-WORKAROUND' tag below.  
307 - UINT32 = 'i'  
308 -else:  
309 - raise ValueError('Need to fix a bug with 32 bit arrays, please contact author...')  
310 -  
311 -  
312 -#[PL] These workarounds were inspired from the Path module  
313 -# (see http://www.jorendorff.com/articles/python/path/)  
314 -#TODO: test with old Python versions  
315 -  
316 -# Pre-2.3 workaround for basestring.  
317 -try:  
318 - basestring  
319 -except NameError:  
320 - try:  
321 - # is Unicode supported (Python >2.0 or >1.6 ?)  
322 - basestring = (str, unicode)  
323 - except NameError:  
324 - basestring = str  
325 -  
326 -#[PL] Experimental setting: if True, OLE filenames will be kept in Unicode  
327 -# if False (default PIL behaviour), all filenames are converted to Latin-1.  
328 -KEEP_UNICODE_NAMES = True  
329 -  
330 -if sys.version_info[0] < 3:  
331 - # On Python 2.x, the default encoding for path names is UTF-8:  
332 - DEFAULT_PATH_ENCODING = 'utf-8'  
333 -else:  
334 - # On Python 3.x, the default encoding for path names is Unicode (None):  
335 - DEFAULT_PATH_ENCODING = None  
336 -  
337 -  
338 -# === LOGGING =================================================================  
339 -  
340 -class NullHandler(logging.Handler):  
341 - """  
342 - Log Handler without output, to avoid printing messages if logging is not  
343 - configured by the main application.  
344 - Python 2.7 has logging.NullHandler, but this is necessary for 2.6:  
345 - see https://docs.python.org/2.6/library/logging.html#configuring-logging-for-a-library  
346 - """  
347 - def emit(self, record):  
348 - pass  
349 -  
350 -def get_logger(name, level=logging.CRITICAL+1):  
351 - """  
352 - Create a suitable logger object for this module.  
353 - The goal is not to change settings of the root logger, to avoid getting  
354 - other modules' logs on the screen.  
355 - If a logger exists with the same name, reuse it. (Else it would have duplicate  
356 - handlers and messages would be doubled.)  
357 - The level is set to CRITICAL+1 by default, to avoid any logging.  
358 - """  
359 - # First, test if there is already a logger with the same name, else it  
360 - # will generate duplicate messages (due to duplicate handlers):  
361 - if name in logging.Logger.manager.loggerDict:  
362 - #NOTE: another less intrusive but more "hackish" solution would be to  
363 - # use getLogger then test if its effective level is not default.  
364 - logger = logging.getLogger(name)  
365 - # make sure level is OK:  
366 - logger.setLevel(level)  
367 - return logger  
368 - # get a new logger:  
369 - logger = logging.getLogger(name)  
370 - # only add a NullHandler for this logger, it is up to the application  
371 - # to configure its own logging:  
372 - logger.addHandler(NullHandler())  
373 - logger.setLevel(level)  
374 - return logger  
375 -  
376 -# a global logger object used for debugging:  
377 -log = get_logger('olefile')  
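A runnable condensation of the pattern above, showing how an application opts in to the module's otherwise silent logger (the logger name `demo_olefile` is illustrative):

```python
import logging

class NullHandler(logging.Handler):
    """Swallow records until the application configures logging."""
    def emit(self, record):
        pass

def get_logger(name, level=logging.CRITICAL + 1):
    # Reuse an existing logger so handlers are never duplicated:
    if name in logging.Logger.manager.loggerDict:
        logger = logging.getLogger(name)
        logger.setLevel(level)
        return logger
    logger = logging.getLogger(name)
    logger.addHandler(NullHandler())
    logger.setLevel(level)
    return logger

log = get_logger('demo_olefile')
# Silent by default: CRITICAL+1 filters everything out...
assert not log.isEnabledFor(logging.CRITICAL)
# ...until the application raises the level itself:
log.setLevel(logging.DEBUG)
assert log.isEnabledFor(logging.DEBUG)
# A second get_logger call must not stack a second handler:
assert len(get_logger('demo_olefile').handlers) == 1
```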
378 -  
379 -  
380 -#=== CONSTANTS ===============================================================  
381 -  
382 -# magic bytes that should be at the beginning of every OLE file:  
383 -MAGIC = b'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'  
384 -  
385 -#[PL]: added constants for Sector IDs (from AAF specifications)  
386 -MAXREGSECT = 0xFFFFFFFA # (-6) maximum SECT  
387 -DIFSECT = 0xFFFFFFFC # (-4) denotes a DIFAT sector in a FAT  
388 -FATSECT = 0xFFFFFFFD # (-3) denotes a FAT sector in a FAT  
389 -ENDOFCHAIN = 0xFFFFFFFE # (-2) end of a virtual stream chain  
390 -FREESECT = 0xFFFFFFFF # (-1) unallocated sector  
391 -  
392 -#[PL]: added constants for Directory Entry IDs (from AAF specifications)  
393 -MAXREGSID = 0xFFFFFFFA # (-6) maximum directory entry ID  
394 -NOSTREAM = 0xFFFFFFFF # (-1) unallocated directory entry  
395 -  
396 -#[PL] object types in storage (from AAF specifications)  
397 -STGTY_EMPTY = 0 # empty directory entry (according to OpenOffice.org doc)  
398 -STGTY_STORAGE = 1 # element is a storage object  
399 -STGTY_STREAM = 2 # element is a stream object  
400 -STGTY_LOCKBYTES = 3 # element is an ILockBytes object  
401 -STGTY_PROPERTY = 4 # element is an IPropertyStorage object  
402 -STGTY_ROOT = 5 # element is a root storage  
403 -  
404 -  
405 -#  
406 -# --------------------------------------------------------------------  
407 -# property types  
408 -  
409 -VT_EMPTY=0; VT_NULL=1; VT_I2=2; VT_I4=3; VT_R4=4; VT_R8=5; VT_CY=6;  
410 -VT_DATE=7; VT_BSTR=8; VT_DISPATCH=9; VT_ERROR=10; VT_BOOL=11;  
411 -VT_VARIANT=12; VT_UNKNOWN=13; VT_DECIMAL=14; VT_I1=16; VT_UI1=17;  
412 -VT_UI2=18; VT_UI4=19; VT_I8=20; VT_UI8=21; VT_INT=22; VT_UINT=23;  
413 -VT_VOID=24; VT_HRESULT=25; VT_PTR=26; VT_SAFEARRAY=27; VT_CARRAY=28;  
414 -VT_USERDEFINED=29; VT_LPSTR=30; VT_LPWSTR=31; VT_FILETIME=64;  
415 -VT_BLOB=65; VT_STREAM=66; VT_STORAGE=67; VT_STREAMED_OBJECT=68;  
416 -VT_STORED_OBJECT=69; VT_BLOB_OBJECT=70; VT_CF=71; VT_CLSID=72;  
417 -VT_VECTOR=0x1000;  
418 -  
419 -# map property id to name (for debugging purposes)  
420 -  
421 -VT = {}  
422 -for keyword, var in list(vars().items()):  
423 - if keyword[:3] == "VT_":  
424 - VT[var] = keyword  
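The loop above builds a reverse map from property-type ids to their names by scanning `vars()`; a minimal illustration with a few of the constants (values copied from the list above):

```python
# A handful of the VT_* constants defined above:
VT_EMPTY = 0; VT_I4 = 3; VT_LPWSTR = 31; VT_FILETIME = 64

# Same vars() scan as in the module: any name starting with "VT_" is indexed
# by its value, giving an id -> name map for debugging output:
VT = {}
for keyword, var in list(vars().items()):
    if keyword[:3] == 'VT_':
        VT[var] = keyword

assert VT[31] == 'VT_LPWSTR'
assert VT[64] == 'VT_FILETIME'
```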
425 -  
426 -#  
427 -# --------------------------------------------------------------------  
428 -# Some common document types (root.clsid fields)  
429 -  
430 -WORD_CLSID = "00020900-0000-0000-C000-000000000046"  
431 -#TODO: check Excel, PPT, ...  
432 -  
433 -#[PL]: Defect levels to classify parsing errors - see OleFileIO._raise_defect()  
434 -DEFECT_UNSURE = 10 # a case which looks weird, but not sure it's a defect  
435 -DEFECT_POTENTIAL = 20 # a potential defect  
436 -DEFECT_INCORRECT = 30 # an error according to specifications, but parsing  
437 - # can go on  
438 -DEFECT_FATAL = 40 # an error which cannot be ignored, parsing is  
439 - # impossible  
440 -  
441 -# Minimal size of an empty OLE file, with 512-bytes sectors = 1536 bytes  
442 -# (this is used in isOleFile and OleFile.open)  
443 -MINIMAL_OLEFILE_SIZE = 1536  
444 -  
445 -#[PL] add useful constants to __all__:  
446 -# for key in list(vars().keys()):  
447 -# if key.startswith('STGTY_') or key.startswith('DEFECT_'):  
448 -# __all__.append(key)  
449 -  
450 -  
451 -#=== FUNCTIONS ===============================================================  
452 -  
453 -def isOleFile (filename):  
454 - """  
455 - Test if a file is an OLE container (according to the magic bytes in its header).  
456 -  
457 - :param filename: string-like or file-like object, OLE file to parse  
458 -  
459 - - if filename is a string shorter than 1536 bytes, it is the path  
460 - of the file to open. (bytes or unicode string)  
461 - - if filename is a string of 1536 bytes or more, it is parsed  
462 - as the content of an OLE file in memory. (bytes type only)  
463 - - if filename is a file-like object (with read and seek methods),  
464 - it is parsed as-is.  
465 -  
466 - :returns: True if OLE, False otherwise.  
467 - """  
468 - # check if filename is a string-like or file-like object:  
469 - if hasattr(filename, 'read'):  
470 - # file-like object: use it directly  
471 - header = filename.read(len(MAGIC))  
472 - # just in case, seek back to start of file:  
473 - filename.seek(0)  
474 - elif isinstance(filename, bytes) and len(filename) >= MINIMAL_OLEFILE_SIZE:  
475 - # filename is a bytes string containing the OLE file to be parsed:  
476 - header = filename[:len(MAGIC)]  
477 - else:  
478 - # string-like object: filename of file on disk  
479 - header = open(filename, 'rb').read(len(MAGIC))  
480 - if header == MAGIC:  
481 - return True  
482 - else:  
483 - return False  
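A self-contained sketch of the same header check, exercising the three accepted input types; `is_ole` is a hypothetical standalone name (the constants are copied from above):

```python
import io

MAGIC = b'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'
MINIMAL_OLEFILE_SIZE = 1536

def is_ole(filename):
    """Standalone version of the magic-bytes check above."""
    if hasattr(filename, 'read'):
        # file-like object: read the header, then rewind
        header = filename.read(len(MAGIC))
        filename.seek(0)
    elif isinstance(filename, bytes) and len(filename) >= MINIMAL_OLEFILE_SIZE:
        # bytes string holding the whole file content in memory
        header = filename[:len(MAGIC)]
    else:
        # otherwise, treat it as a path on disk
        with open(filename, 'rb') as f:
            header = f.read(len(MAGIC))
    return header == MAGIC

fake_ole = MAGIC + b'\x00' * (MINIMAL_OLEFILE_SIZE - len(MAGIC))
assert is_ole(fake_ole)                      # bytes in memory
assert is_ole(io.BytesIO(fake_ole))          # file-like object
assert not is_ole(b'\x00' * MINIMAL_OLEFILE_SIZE)
```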
484 -  
485 -  
486 -if bytes is str:  
487 - # version for Python 2.x  
488 - def i8(c):  
489 - return ord(c)  
490 -else:  
491 - # version for Python 3.x  
492 - def i8(c):  
493 - return c if c.__class__ is int else c[0]  
494 -  
495 -  
496 -#TODO: replace i16 and i32 with more readable struct.unpack equivalent?  
497 -  
498 -def i16(c, o = 0):  
499 - """  
500 - Converts a 2-byte (16-bit) byte string to an integer.  
501 -  
502 - :param c: string containing bytes to convert  
503 - :param o: offset of bytes to convert in string  
504 - """  
505 - return i8(c[o]) | (i8(c[o+1])<<8)  
506 -  
507 -  
508 -def i32(c, o = 0):  
509 - """  
510 - Converts a 4-byte (32-bit) byte string to an integer.  
511 -  
512 - :param c: string containing bytes to convert  
513 - :param o: offset of bytes to convert in string  
514 - """  
515 -## return int(ord(c[o])+(ord(c[o+1])<<8)+(ord(c[o+2])<<16)+(ord(c[o+3])<<24))  
516 -## # [PL]: added int() because "<<" gives long int since Python 2.4  
517 - # copied from Pillow's _binary:  
518 - return i8(c[o]) | (i8(c[o+1])<<8) | (i8(c[o+2])<<16) | (i8(c[o+3])<<24)  
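The TODO above suggests `struct.unpack` equivalents; this sketch checks that the byte-shifting helpers agree with `struct.unpack_from` on little-endian data (Python 3 variants of the helpers, copied from above):

```python
import struct

def i8(c):
    return c if c.__class__ is int else c[0]

def i16(c, o=0):
    return i8(c[o]) | (i8(c[o+1]) << 8)

def i32(c, o=0):
    return i8(c[o]) | (i8(c[o+1]) << 8) | (i8(c[o+2]) << 16) | (i8(c[o+3]) << 24)

data = b'\x34\x12\x78\x56\x00\x00'
# '<H' / '<I' are the little-endian 16/32-bit unsigned formats:
assert i16(data) == struct.unpack_from('<H', data, 0)[0] == 0x1234
assert i32(data, 2) == struct.unpack_from('<I', data, 2)[0] == 0x5678
```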
519 -  
520 -  
521 -def _clsid(clsid):  
522 - """  
523 - Converts a CLSID to a human-readable string.  
524 -  
525 - :param clsid: string of length 16.  
526 - """  
527 - assert len(clsid) == 16  
528 - # if clsid is only made of null bytes, return an empty string:  
529 - # (PL: why not simply return the string with zeroes?)  
530 - if not clsid.strip(b"\0"):  
531 - return ""  
532 - return (("%08X-%04X-%04X-%02X%02X-" + "%02X" * 6) %  
533 - ((i32(clsid, 0), i16(clsid, 4), i16(clsid, 6)) +  
534 - tuple(map(i8, clsid[8:16]))))  
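A sketch of the CLSID formatting, using the Word root CLSID listed further down as a check (`clsid_to_str` is an illustrative standalone name; the first three fields are little-endian, the last eight bytes are emitted as-is):

```python
def i8(c):
    return c if c.__class__ is int else c[0]

def i16(c, o=0):
    return i8(c[o]) | (i8(c[o+1]) << 8)

def i32(c, o=0):
    return i8(c[o]) | (i8(c[o+1]) << 8) | (i8(c[o+2]) << 16) | (i8(c[o+3]) << 24)

def clsid_to_str(clsid):
    """Same layout as _clsid above: three little-endian fields, rest verbatim."""
    assert len(clsid) == 16
    if not clsid.strip(b'\0'):
        return ""
    return (("%08X-%04X-%04X-%02X%02X-" + "%02X" * 6) %
            ((i32(clsid, 0), i16(clsid, 4), i16(clsid, 6)) +
             tuple(map(i8, clsid[8:16]))))

# Raw 16 bytes of the Word CLSID "00020900-0000-0000-C000-000000000046":
word = bytes([0x00, 0x09, 0x02, 0x00, 0, 0, 0, 0,
              0xC0, 0, 0, 0, 0, 0, 0, 0x46])
assert clsid_to_str(word) == "00020900-0000-0000-C000-000000000046"
assert clsid_to_str(b'\x00' * 16) == ""
```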
535 -  
536 -  
537 -  
538 -def filetime2datetime(filetime):  
539 - """  
540 - convert FILETIME (64 bits int) to Python datetime.datetime  
541 - """  
542 - # TODO: manage exception when microseconds is too large  
543 - # inspired from http://code.activestate.com/recipes/511425-filetime-to-datetime/  
544 - _FILETIME_null_date = datetime.datetime(1601, 1, 1, 0, 0, 0)  
545 - #log.debug('timedelta days=%d' % (filetime//(10*1000000*3600*24)))  
546 - return _FILETIME_null_date + datetime.timedelta(microseconds=filetime//10)  
547 -  
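A quick check of the FILETIME conversion above: FILETIME counts 100-nanosecond intervals since 1601-01-01, and 2000-01-01 falls 12,591,158,400 seconds after that epoch:

```python
import datetime

def filetime2datetime(filetime):
    # FILETIME counts 100 ns intervals since 1601-01-01 (UTC):
    epoch = datetime.datetime(1601, 1, 1, 0, 0, 0)
    return epoch + datetime.timedelta(microseconds=filetime // 10)

# 12,591,158,400 s * 10**7 (100 ns units) lands exactly on 2000-01-01:
assert filetime2datetime(125911584000000000) == datetime.datetime(2000, 1, 1)
assert filetime2datetime(0) == datetime.datetime(1601, 1, 1)
```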
548 -  
549 -  
550 -#=== CLASSES ==================================================================  
551 -  
552 -class OleMetadata:  
553 - """  
554 - class to parse and store metadata from standard properties of OLE files.  
555 -  
556 - Available attributes:  
557 - codepage, title, subject, author, keywords, comments, template,  
558 - last_saved_by, revision_number, total_edit_time, last_printed, create_time,  
559 - last_saved_time, num_pages, num_words, num_chars, thumbnail,  
560 - creating_application, security, codepage_doc, category, presentation_target,  
561 - bytes, lines, paragraphs, slides, notes, hidden_slides, mm_clips,  
562 - scale_crop, heading_pairs, titles_of_parts, manager, company, links_dirty,  
563 - chars_with_spaces, unused, shared_doc, link_base, hlinks, hlinks_changed,  
564 - version, dig_sig, content_type, content_status, language, doc_version  
565 -  
566 - Note: an attribute is set to None when not present in the properties of the  
567 - OLE file.  
568 -  
569 - References for SummaryInformation stream:  
570 - - http://msdn.microsoft.com/en-us/library/dd942545.aspx  
571 - - http://msdn.microsoft.com/en-us/library/dd925819%28v=office.12%29.aspx  
572 - - http://msdn.microsoft.com/en-us/library/windows/desktop/aa380376%28v=vs.85%29.aspx  
573 - - http://msdn.microsoft.com/en-us/library/aa372045.aspx  
574 - - http://sedna-soft.de/summary-information-stream/  
575 - - http://poi.apache.org/apidocs/org/apache/poi/hpsf/SummaryInformation.html  
576 -  
577 - References for DocumentSummaryInformation stream:  
578 - - http://msdn.microsoft.com/en-us/library/dd945671%28v=office.12%29.aspx  
579 - - http://msdn.microsoft.com/en-us/library/windows/desktop/aa380374%28v=vs.85%29.aspx  
580 - - http://poi.apache.org/apidocs/org/apache/poi/hpsf/DocumentSummaryInformation.html  
581 -  
582 - new in version 0.25  
583 - """  
584 -  
585 - # attribute names for SummaryInformation stream properties:  
586 - # (ordered by property id, starting at 1)  
587 - SUMMARY_ATTRIBS = ['codepage', 'title', 'subject', 'author', 'keywords', 'comments',  
588 - 'template', 'last_saved_by', 'revision_number', 'total_edit_time',  
589 - 'last_printed', 'create_time', 'last_saved_time', 'num_pages',  
590 - 'num_words', 'num_chars', 'thumbnail', 'creating_application',  
591 - 'security']  
592 -  
593 - # attribute names for DocumentSummaryInformation stream properties:  
594 - # (ordered by property id, starting at 1)  
595 - DOCSUM_ATTRIBS = ['codepage_doc', 'category', 'presentation_target', 'bytes', 'lines', 'paragraphs',  
596 - 'slides', 'notes', 'hidden_slides', 'mm_clips',  
597 - 'scale_crop', 'heading_pairs', 'titles_of_parts', 'manager',  
598 - 'company', 'links_dirty', 'chars_with_spaces', 'unused', 'shared_doc',  
599 - 'link_base', 'hlinks', 'hlinks_changed', 'version', 'dig_sig',  
600 - 'content_type', 'content_status', 'language', 'doc_version']  
601 -  
602 - def __init__(self):  
603 - """  
604 - Constructor for OleMetadata  
605 - All attributes are set to None by default  
606 - """  
607 - # properties from SummaryInformation stream  
608 - self.codepage = None  
609 - self.title = None  
610 - self.subject = None  
611 - self.author = None  
612 - self.keywords = None  
613 - self.comments = None  
614 - self.template = None  
615 - self.last_saved_by = None  
616 - self.revision_number = None  
617 - self.total_edit_time = None  
618 - self.last_printed = None  
619 - self.create_time = None  
620 - self.last_saved_time = None  
621 - self.num_pages = None  
622 - self.num_words = None  
623 - self.num_chars = None  
624 - self.thumbnail = None  
625 - self.creating_application = None  
626 - self.security = None  
627 - # properties from DocumentSummaryInformation stream  
628 - self.codepage_doc = None  
629 - self.category = None  
630 - self.presentation_target = None  
631 - self.bytes = None  
632 - self.lines = None  
633 - self.paragraphs = None  
634 - self.slides = None  
635 - self.notes = None  
636 - self.hidden_slides = None  
637 - self.mm_clips = None  
638 - self.scale_crop = None  
639 - self.heading_pairs = None  
640 - self.titles_of_parts = None  
641 - self.manager = None  
642 - self.company = None  
643 - self.links_dirty = None  
644 - self.chars_with_spaces = None  
645 - self.unused = None  
646 - self.shared_doc = None  
647 - self.link_base = None  
648 - self.hlinks = None  
649 - self.hlinks_changed = None  
650 - self.version = None  
651 - self.dig_sig = None  
652 - self.content_type = None  
653 - self.content_status = None  
654 - self.language = None  
655 - self.doc_version = None  
656 -  
657 -  
658 - def parse_properties(self, olefile):  
659 - """  
660 - Parse standard properties of an OLE file, from the streams  
661 - "\x05SummaryInformation" and "\x05DocumentSummaryInformation",  
662 - if present.  
663 - Properties are converted to strings, integers or python datetime objects.  
664 - If a property is not present, its value is set to None.  
665 - """  
666 - # first set all attributes to None:  
667 - for attrib in (self.SUMMARY_ATTRIBS + self.DOCSUM_ATTRIBS):  
668 - setattr(self, attrib, None)  
669 - if olefile.exists("\x05SummaryInformation"):  
670 - # get properties from the stream:  
671 - # (converting timestamps to python datetime, except total_edit_time,  
672 - # which is property #10)  
673 - props = olefile.getproperties("\x05SummaryInformation",  
674 - convert_time=True, no_conversion=[10])  
675 - # store them into this object's attributes:  
676 - for i in range(len(self.SUMMARY_ATTRIBS)):  
677 - # ids for standard properties start at 0x01, up to 0x13  
678 - value = props.get(i+1, None)  
679 - setattr(self, self.SUMMARY_ATTRIBS[i], value)  
680 - if olefile.exists("\x05DocumentSummaryInformation"):  
681 - # get properties from the stream:  
682 - props = olefile.getproperties("\x05DocumentSummaryInformation",  
683 - convert_time=True)  
684 - # store them into this object's attributes:  
685 - for i in range(len(self.DOCSUM_ATTRIBS)):  
686 - # ids for standard properties start at 0x01, up to 0x1C here  
687 - value = props.get(i+1, None)  
688 - setattr(self, self.DOCSUM_ATTRIBS[i], value)  
689 -  
690 - def dump(self):  
691 - """  
692 - Dump all metadata, for debugging purposes.  
693 - """  
694 - print('Properties from SummaryInformation stream:')  
695 - for prop in self.SUMMARY_ATTRIBS:  
696 - value = getattr(self, prop)  
697 - print('- %s: %s' % (prop, repr(value)))  
698 - print('Properties from DocumentSummaryInformation stream:')  
699 - for prop in self.DOCSUM_ATTRIBS:  
700 - value = getattr(self, prop)  
701 - print('- %s: %s' % (prop, repr(value)))  
702 -  
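The id-to-attribute mapping used by `parse_properties` (property id = list index + 1, missing ids become None) can be illustrated on a toy property dict; `Meta` and the truncated attribute list are fabricated for the example:

```python
# Truncated version of SUMMARY_ATTRIBS, ordered by property id starting at 1:
SUMMARY_ATTRIBS = ['codepage', 'title', 'subject', 'author']

class Meta:
    pass

# Simulated output of getproperties(): a dict keyed by property id
props = {2: 'Quarterly report', 4: 'A. Author'}

meta = Meta()
for i, attrib in enumerate(SUMMARY_ATTRIBS):
    # property id i+1 maps to attribute i; absent properties become None
    setattr(meta, attrib, props.get(i + 1, None))

assert meta.title == 'Quarterly report'
assert meta.author == 'A. Author'
assert meta.codepage is None
```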
703 -  
704 -#--- _OleStream ---------------------------------------------------------------  
705 -  
706 -class _OleStream(io.BytesIO):  
707 - """  
708 - OLE2 Stream  
709 -  
710 - Returns a read-only file object which can be used to read  
711 - the contents of an OLE stream (instance of the BytesIO class).  
712 - To open a stream, use the openstream method in the OleFile class.  
713 -  
714 - This class can be used with either ordinary streams,  
715 - or ministreams, depending on the offset, sectorsize, and  
716 - fat table arguments.  
717 -  
718 - Attributes:  
719 -  
720 - - size: actual size of data stream, after it was opened.  
721 - """  
722 - #TODO: use _raise_defect instead of exceptions  
723 -  
724 - # FIXME: should store the list of sects obtained by following  
725 - # the fat chain, and load new sectors on demand instead of  
726 - # loading it all in one go.  
727 -  
728 - def __init__(self, fp, sect, size, offset, sectorsize, fat, filesize):  
729 - """  
730 - Constructor for _OleStream class.  
731 -  
732 - :param fp: file object, the OLE container or the MiniFAT stream  
733 - :param sect: sector index of first sector in the stream  
734 - :param size: total size of the stream  
735 - :param offset: offset in bytes for the first FAT or MiniFAT sector  
736 - :param sectorsize: size of one sector  
737 - :param fat: array/list of sector indexes (FAT or MiniFAT)  
738 - :param filesize: size of OLE file (for debugging)  
739 - :returns: a BytesIO instance containing the OLE stream  
740 - """  
741 - log.debug('_OleStream.__init__:')  
742 - log.debug(' sect=%d (%X), size=%d, offset=%d, sectorsize=%d, len(fat)=%d, fp=%s'  
743 - %(sect,sect,size,offset,sectorsize,len(fat), repr(fp)))  
744 - #[PL] To detect malformed documents with FAT loops, we compute the  
745 - # expected number of sectors in the stream:  
746 - unknown_size = False  
747 - if size==0x7FFFFFFF:  
748 - # this is the case when called from OleFileIO._open(), and stream  
749 - # size is not known in advance (for example when reading the  
750 - # Directory stream). Then we can only guess maximum size:  
751 - size = len(fat)*sectorsize  
752 - # and we keep a record that size was unknown:  
753 - unknown_size = True  
754 - log.debug(' stream with UNKNOWN SIZE')  
755 - nb_sectors = (size + (sectorsize-1)) // sectorsize  
756 - log.debug('nb_sectors = %d' % nb_sectors)  
757 - # This number should (at least) be less than the total number of  
758 - # sectors in the given FAT:  
759 - if nb_sectors > len(fat):  
760 - raise IOError('malformed OLE document, stream too large')  
761 - # optimization(?): data is first a list of strings, and join() is called  
762 - # at the end to concatenate all in one string.  
763 - # (this may not be really useful with recent Python versions)  
764 - data = []  
765 - # if size is zero, then first sector index should be ENDOFCHAIN:  
766 - if size == 0 and sect != ENDOFCHAIN:  
767 - log.debug('size == 0 and sect != ENDOFCHAIN:')  
768 - raise IOError('incorrect OLE sector index for empty stream')  
769 - #[PL] A fixed-length for loop is used instead of an undefined while  
770 - # loop to avoid DoS attacks:  
771 - for i in range(nb_sectors):  
772 - # Sector index may be ENDOFCHAIN, but only if size was unknown  
773 - if sect == ENDOFCHAIN:  
774 - if unknown_size:  
775 - break  
776 - else:  
777 - # else this means that the stream is smaller than declared:  
778 - log.debug('sect=ENDOFCHAIN before expected size')  
779 - raise IOError('incomplete OLE stream')  
780 - # sector index should be within FAT:  
781 - if sect<0 or sect>=len(fat):  
782 - log.debug('sect=%d (%X) / len(fat)=%d' % (sect, sect, len(fat)))  
783 - log.debug('i=%d / nb_sectors=%d' %(i, nb_sectors))  
784 -## tmp_data = b"".join(data)  
785 -## f = open('test_debug.bin', 'wb')  
786 -## f.write(tmp_data)  
787 -## f.close()  
788 -## log.debug('data read so far: %d bytes' % len(tmp_data))  
789 - raise IOError('incorrect OLE FAT, sector index out of range')  
790 - #TODO: merge this code with OleFileIO.getsect() ?  
791 - #TODO: check if this works with 4K sectors:  
792 - try:  
793 - fp.seek(offset + sectorsize * sect)  
794 - except Exception:  
795 - log.debug('sect=%d, seek=%d, filesize=%d' %  
796 - (sect, offset+sectorsize*sect, filesize))  
797 - raise IOError('OLE sector index out of range')  
798 - sector_data = fp.read(sectorsize)  
799 - # [PL] check if there was enough data:  
800 - # Note: if sector is the last of the file, sometimes it is not a  
801 - # complete sector (of 512 or 4K), so we may read less than  
802 - # sectorsize.  
803 - if len(sector_data)!=sectorsize and sect!=(len(fat)-1):  
804 - log.debug('sect=%d / len(fat)=%d, seek=%d / filesize=%d, len read=%d' %  
805 - (sect, len(fat), offset+sectorsize*sect, filesize, len(sector_data)))  
806 - log.debug('seek+len(read)=%d' % (offset+sectorsize*sect+len(sector_data)))  
807 - raise IOError('incomplete OLE sector')  
808 - data.append(sector_data)  
809 - # jump to next sector in the FAT:  
810 - try:  
811 - sect = fat[sect] & 0xFFFFFFFF # JYTHON-WORKAROUND  
812 - except IndexError:  
813 - # [PL] if pointer is out of the FAT an exception is raised  
814 - raise IOError('incorrect OLE FAT, sector index out of range')  
815 - #[PL] Last sector should be a "end of chain" marker:  
816 - if sect != ENDOFCHAIN:  
817 - raise IOError('incorrect last sector index in OLE stream')  
818 - data = b"".join(data)  
819 - # Data is truncated to the actual stream size:  
820 - if len(data) >= size:  
821 - data = data[:size]  
822 - # actual stream size is stored for future use:  
823 - self.size = size  
824 - elif unknown_size:  
825 - # actual stream size was not known, now we know the size of read  
826 - # data:  
827 - self.size = len(data)  
828 - else:  
829 - # read data is less than expected:  
830 - log.debug('len(data)=%d, size=%d' % (len(data), size))  
831 - # TODO: provide details in exception message  
832 - raise IOError('OLE stream size is less than declared')  
833 - # when all data is read in memory, BytesIO constructor is called  
834 - io.BytesIO.__init__(self, data)  
835 - # Then the _OleStream object can be used as a read-only file object.  
836 -  
837 -  
838 -#--- _OleDirectoryEntry -------------------------------------------------------  
839 -  
840 -class _OleDirectoryEntry:  
841 -  
842 - """  
843 - OLE2 Directory Entry  
844 - """  
845 - #[PL] parsing code moved from OleFileIO.loaddirectory  
846 -  
847 - # struct to parse directory entries:  
848 - # <: little-endian byte order, standard sizes  
849 - # (note: this should guarantee that Q returns a 64 bits int)  
850 - # 64s: string containing entry name in unicode UTF-16 (max 31 chars) + null char = 64 bytes  
851 - # H: uint16, number of bytes used in name buffer, including null = (len+1)*2  
852 - # B: uint8, dir entry type (between 0 and 5)  
853 - # B: uint8, color: 0=black, 1=red  
854 - # I: uint32, index of left child node in the red-black tree, NOSTREAM if none  
855 - # I: uint32, index of right child node in the red-black tree, NOSTREAM if none  
856 - # I: uint32, index of child root node if it is a storage, else NOSTREAM  
857 - # 16s: CLSID, unique identifier (only used if it is a storage)  
858 - # I: uint32, user flags  
859 - # Q (was 8s): uint64, creation timestamp or zero  
860 - # Q (was 8s): uint64, modification timestamp or zero  
861 - # I: uint32, SID of first sector if stream or ministream, SID of 1st sector  
862 - # of stream containing ministreams if root entry, 0 otherwise  
863 - # I: uint32, total stream size in bytes if stream (low 32 bits), 0 otherwise  
864 - # I: uint32, total stream size in bytes if stream (high 32 bits), 0 otherwise  
865 - STRUCT_DIRENTRY = '<64sHBBIII16sIQQIII'  
866 - # size of a directory entry: 128 bytes  
867 - DIRENTRY_SIZE = 128  
868 - assert struct.calcsize(STRUCT_DIRENTRY) == DIRENTRY_SIZE  
869 -  
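The field-by-field layout commented above can be exercised by packing a synthetic 128-byte root entry and unpacking it with the same format string (the entry's field values are fabricated; field offsets follow the `<64sHBBIII16sIQQIII` format):

```python
import struct

STRUCT_DIRENTRY = '<64sHBBIII16sIQQIII'
assert struct.calcsize(STRUCT_DIRENTRY) == 128

NOSTREAM = 0xFFFFFFFF
name = 'Root Entry'.encode('utf-16-le') + b'\x00\x00'
entry = struct.pack(
    STRUCT_DIRENTRY,
    name.ljust(64, b'\x00'),   # 64s: UTF-16LE name buffer
    len(name),                 # H:   bytes used in name buffer, incl. null
    5,                         # B:   entry type = STGTY_ROOT
    1,                         # B:   color (1 = red)
    NOSTREAM, NOSTREAM,        # I,I: no left/right siblings
    1,                         # I:   child root node
    b'\x00' * 16,              # 16s: CLSID
    0,                         # I:   user flags
    0, 0,                      # Q,Q: creation/modification timestamps
    3,                         # I:   first sector of the ministream
    4096, 0)                   # I,I: stream size low/high

fields = struct.unpack(STRUCT_DIRENTRY, entry)
namelength, entry_type = fields[1], fields[2]
assert entry_type == 5                     # STGTY_ROOT
# drop the trailing null char, as __init__ does above:
assert fields[0][:namelength - 2].decode('utf-16-le') == 'Root Entry'
```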
870 -  
871 - def __init__(self, entry, sid, olefile):  
872 - """  
873 - Constructor for an _OleDirectoryEntry object.  
874 - Parses a 128-bytes entry from the OLE Directory stream.  
875 -  
876 - :param entry : string (must be 128 bytes long)  
877 - :param sid : index of this directory entry in the OLE file directory  
878 - :param olefile: OleFileIO containing this directory entry  
879 - """  
880 - self.sid = sid  
881 - # ref to olefile is stored for future use  
882 - self.olefile = olefile  
883 - # kids is a list of children entries, if this entry is a storage:  
884 - # (list of _OleDirectoryEntry objects)  
885 - self.kids = []  
886 - # kids_dict is a dictionary of children entries, indexed by their  
887 - # name in lowercase: used to quickly find an entry, and to detect  
888 - # duplicates  
889 - self.kids_dict = {}  
890 - # flag used to detect if the entry is referenced more than once in  
891 - # directory:  
892 - self.used = False  
893 - # decode DirEntry  
894 - (  
895 - self.name_raw, # 64s: string containing entry name in unicode UTF-16 (max 31 chars) + null char = 64 bytes  
896 - self.namelength, # H: uint16, number of bytes used in name buffer, including null = (len+1)*2  
897 - self.entry_type,  
898 - self.color,  
899 - self.sid_left,  
900 - self.sid_right,  
901 - self.sid_child,  
902 - clsid,  
903 - self.dwUserFlags,  
904 - self.createTime,  
905 - self.modifyTime,  
906 - self.isectStart,  
907 - self.sizeLow,  
908 - self.sizeHigh  
909 - ) = struct.unpack(_OleDirectoryEntry.STRUCT_DIRENTRY, entry)  
910 - if self.entry_type not in [STGTY_ROOT, STGTY_STORAGE, STGTY_STREAM, STGTY_EMPTY]:  
911 - olefile._raise_defect(DEFECT_INCORRECT, 'unhandled OLE storage type')  
912 - # only first directory entry can (and should) be root:  
913 - if self.entry_type == STGTY_ROOT and sid != 0:  
914 - olefile._raise_defect(DEFECT_INCORRECT, 'duplicate OLE root entry')  
915 - if sid == 0 and self.entry_type != STGTY_ROOT:  
916 - olefile._raise_defect(DEFECT_INCORRECT, 'incorrect OLE root entry')  
917 - #log.debug(struct.unpack(fmt_entry, entry[:len_entry]))  
918 - # name should be at most 31 unicode characters + null character,  
919 - # so 64 bytes in total (31*2 + 2):  
920 - if self.namelength>64:  
921 - olefile._raise_defect(DEFECT_INCORRECT, 'incorrect DirEntry name length >64 bytes')  
922 - # if exception not raised, namelength is set to the maximum value:  
923 - self.namelength = 64  
924 - # only characters without ending null char are kept:  
925 - self.name_utf16 = self.name_raw[:(self.namelength-2)]  
926 - #TODO: check if the name is actually followed by a null unicode character ([MS-CFB] 2.6.1)  
927 - #TODO: check if the name does not contain forbidden characters:  
928 - # [MS-CFB] 2.6.1: "The following characters are illegal and MUST NOT be part of the name: '/', '\', ':', '!'."  
929 - # name is converted from UTF-16LE to the path encoding specified in the OleFileIO:  
930 - self.name = olefile._decode_utf16_str(self.name_utf16)  
931 -  
932 - log.debug('DirEntry SID=%d: %s' % (self.sid, repr(self.name)))  
933 - log.debug(' - type: %d' % self.entry_type)  
934 - log.debug(' - sect: %Xh' % self.isectStart)  
935 - log.debug(' - SID left: %d, right: %d, child: %d' % (self.sid_left,  
936 - self.sid_right, self.sid_child))  
937 -  
938 - # sizeHigh is only used for 4K sectors, it should be zero for 512 bytes  
939 - # sectors, BUT apparently some implementations set it as 0xFFFFFFFF, 1  
940 - # or some other value so it cannot be raised as a defect in general:  
941 - if olefile.sectorsize == 512:  
942 - if self.sizeHigh != 0 and self.sizeHigh != 0xFFFFFFFF:  
943 - log.debug('sectorsize=%d, sizeLow=%d, sizeHigh=%d (%X)' %  
944 - (olefile.sectorsize, self.sizeLow, self.sizeHigh, self.sizeHigh))  
945 - olefile._raise_defect(DEFECT_UNSURE, 'incorrect OLE stream size')  
946 - self.size = self.sizeLow  
947 - else:  
948 - self.size = self.sizeLow + (long(self.sizeHigh)<<32)  
949 - log.debug(' - size: %d (sizeLow=%d, sizeHigh=%d)' % (self.size, self.sizeLow, self.sizeHigh))  
950 -  
951 - self.clsid = _clsid(clsid)  
952 - # a storage should have a null size, BUT some implementations such as  
953 - # Word 8 for Mac seem to allow non-null values => Potential defect:  
954 - if self.entry_type == STGTY_STORAGE and self.size != 0:  
955 - olefile._raise_defect(DEFECT_POTENTIAL, 'OLE storage with size>0')  
956 - # check if stream is not already referenced elsewhere:  
957 - if self.entry_type in (STGTY_ROOT, STGTY_STREAM) and self.size>0:  
958 - if self.size < olefile.minisectorcutoff \  
959 - and self.entry_type==STGTY_STREAM: # only streams can be in MiniFAT  
960 - # ministream object  
961 - minifat = True  
962 - else:  
963 - minifat = False  
964 - olefile._check_duplicate_stream(self.isectStart, minifat)  
965 -  
966 -  
967 -  
968 - def build_storage_tree(self):  
969 - """  
970 - Read and build the red-black tree attached to this _OleDirectoryEntry  
971 - object, if it is a storage.  
972 - Note that this method builds a tree of all subentries, so it should  
973 - only be called for the root object once.  
974 - """  
975 - log.debug('build_storage_tree: SID=%d - %s - sid_child=%d'  
976 - % (self.sid, repr(self.name), self.sid_child))  
977 - if self.sid_child != NOSTREAM:  
978 - # if child SID is not NOSTREAM, then this entry is a storage.  
979 - # Let's walk through the tree of children to fill the kids list:  
980 - self.append_kids(self.sid_child)  
981 -  
982 - # Note from OpenOffice documentation: the safest way is to  
983 - # recreate the tree because some implementations may store broken  
984 - # red-black trees...  
985 -  
986 - # in the OLE file, entries are sorted on (length, name).  
987 - # for convenience, we sort them on name instead:  
988 - # (see rich comparison methods in this class)  
989 - self.kids.sort()  
990 -  
991 -  
992 - def append_kids(self, child_sid):  
993 - """  
994 - Walk through red-black tree of children of this directory entry to add  
995 - all of them to the kids list. (recursive method)  
996 -  
997 - :param child_sid : index of child directory entry to use, or None when called  
998 - first time for the root. (only used during recursion)  
999 - """  
1000 - #[PL] this method was added to use simple recursion instead of a complex  
1001 - # algorithm.  
1002 - # if this is not a storage or a leaf of the tree, nothing to do:  
1003 - if child_sid == NOSTREAM:  
1004 - return  
1005 - # check if child SID is in the proper range:  
1006 - if child_sid<0 or child_sid>=len(self.olefile.direntries):  
1007 - self.olefile._raise_defect(DEFECT_FATAL, 'OLE DirEntry index out of range')  
1008 - # get child direntry:  
1009 - child = self.olefile._load_direntry(child_sid) #direntries[child_sid]  
1010 - log.debug('append_kids: child_sid=%d - %s - sid_left=%d, sid_right=%d, sid_child=%d'  
1011 - % (child.sid, repr(child.name), child.sid_left, child.sid_right, child.sid_child))  
1012 - # the directory entries are organized as a red-black tree.  
1013 - # (cf. Wikipedia for details)  
1014 - # First walk through left side of the tree:  
1015 - self.append_kids(child.sid_left)  
1016 - # Check if its name is not already used (case-insensitive):  
1017 - name_lower = child.name.lower()  
1018 - if name_lower in self.kids_dict:  
1019 - self.olefile._raise_defect(DEFECT_INCORRECT,  
1020 - "Duplicate filename in OLE storage")  
1021 - # Then the child_sid _OleDirectoryEntry object is appended to the  
1022 - # kids list and dictionary:  
1023 - self.kids.append(child)  
1024 - self.kids_dict[name_lower] = child  
1025 - # Check if kid was not already referenced in a storage:  
1026 - if child.used:  
1027 - self.olefile._raise_defect(DEFECT_INCORRECT,  
1028 - 'OLE Entry referenced more than once')  
1029 - child.used = True  
1030 - # Finally walk through right side of the tree:  
1031 - self.append_kids(child.sid_right)  
1032 - # Afterwards build kid's own tree if it's also a storage:  
1033 - child.build_storage_tree()  
1034 -  
1035 -  
1036 - def __eq__(self, other):  
1037 - "Compare entries by name"  
1038 - return self.name == other.name  
1039 -  
1040 - def __lt__(self, other):  
1041 - "Compare entries by name"  
1042 - return self.name < other.name  
1043 -  
1044 - def __ne__(self, other):  
1045 - return not self.__eq__(other)  
1046 -  
1047 - def __le__(self, other):  
1048 - return self.__eq__(other) or self.__lt__(other)  
1049 -  
1050 - # Reflected __lt__() and __le__() will be used for __gt__() and __ge__()  
1051 -  
1052 - #TODO: replace by the same function as MS implementation ?  
1053 - # (order by name length first, then case-insensitive order)  
1054 -  
1055 -  
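The TODO above refers to the [MS-CFB] ordering rule for directory entry names: shorter names sort first, and names of equal length are compared case-insensitively. A minimal sketch of that rule as a sort key (a hypothetical helper, not part of olefile's API):

```python
def cfb_name_key(name):
    # [MS-CFB] directory ordering: compare by name length first,
    # then by case-insensitive (upper-cased) lexicographic order.
    upper = name.upper()
    return (len(upper), upper)

entries = ["Root", "a", "B", "longname"]
entries.sort(key=cfb_name_key)
print(entries)  # ['a', 'B', 'Root', 'longname']
```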
1056 - def dump(self, tab = 0):  
1057 - "Dump this entry, and all its subentries (for debug purposes only)"  
1058 - TYPES = ["(invalid)", "(storage)", "(stream)", "(lockbytes)",  
1059 - "(property)", "(root)"]  
1060 - print(" "*tab + repr(self.name), TYPES[self.entry_type], end=' ')  
1061 - if self.entry_type in (STGTY_STREAM, STGTY_ROOT):  
1062 - print(self.size, "bytes", end=' ')  
1063 - print()  
1064 - if self.entry_type in (STGTY_STORAGE, STGTY_ROOT) and self.clsid:  
1065 - print(" "*tab + "{%s}" % self.clsid)  
1066 -  
1067 - for kid in self.kids:  
1068 - kid.dump(tab + 2)  
1069 -  
1070 -  
1071 - def getmtime(self):  
1072 - """  
1073 - Return modification time of a directory entry.  
1074 -  
1075 - :returns: None if modification time is null, a python datetime object  
1076 - otherwise (UTC timezone)  
1077 -  
1078 - new in version 0.26  
1079 - """  
1080 - if self.modifyTime == 0:  
1081 - return None  
1082 - return filetime2datetime(self.modifyTime)  
1083 -  
1084 -  
1085 - def getctime(self):  
1086 - """  
1087 - Return creation time of a directory entry.  
1088 -  
1089 -        :returns: None if creation time is null, a python datetime object  
1090 - otherwise (UTC timezone)  
1091 -  
1092 - new in version 0.26  
1093 - """  
1094 - if self.createTime == 0:  
1095 - return None  
1096 - return filetime2datetime(self.createTime)  
1097 -  
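getmtime and getctime delegate to filetime2datetime (defined elsewhere in this module) to convert Windows FILETIME values, i.e. 100-nanosecond ticks since 1601-01-01 UTC, into datetime objects. A minimal sketch of that conversion, assuming only the standard FILETIME definition:

```python
from datetime import datetime, timedelta

def filetime_to_datetime(filetime):
    # FILETIME counts 100-ns intervals since 1601-01-01 00:00:00 UTC.
    # Sketch only; olefile's filetime2datetime performs the same conversion.
    epoch = datetime(1601, 1, 1)
    return epoch + timedelta(microseconds=filetime // 10)

# The Unix epoch expressed as a FILETIME value:
print(filetime_to_datetime(116444736000000000))  # 1970-01-01 00:00:00
```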
1098 -  
1099 -#--- OleFileIO ----------------------------------------------------------------  
1100 -  
1101 -class OleFileIO:  
1102 - """  
1103 - OLE container object  
1104 -  
1105 - This class encapsulates the interface to an OLE 2 structured  
1106 - storage file. Use the listdir and openstream methods to  
1107 - access the contents of this file.  
1108 -  
1109 - Object names are given as a list of strings, one for each subentry  
1110 - level. The root entry should be omitted. For example, the following  
1111 - code extracts all image streams from a Microsoft Image Composer file::  
1112 -  
1113 - ole = OleFileIO("fan.mic")  
1114 -  
1115 - for entry in ole.listdir():  
1116 -            if entry[0].startswith("Image"):  
1117 -                fin = ole.openstream(entry)  
1118 -                fout = open(entry[0], "wb")  
1119 - while True:  
1120 - s = fin.read(8192)  
1121 - if not s:  
1122 - break  
1123 - fout.write(s)  
1124 -  
1125 - You can use the viewer application provided with the Python Imaging  
1126 -    Library to view the resulting files (which happen to be standard  
1127 - TIFF files).  
1128 - """  
1129 -  
1130 - def __init__(self, filename=None, raise_defects=DEFECT_FATAL,  
1131 - write_mode=False, debug=False, path_encoding=DEFAULT_PATH_ENCODING):  
1132 - """  
1133 - Constructor for the OleFileIO class.  
1134 -  
1135 - :param filename: file to open.  
1136 -  
1137 - - if filename is a string smaller than 1536 bytes, it is the path  
1138 - of the file to open. (bytes or unicode string)  
1139 - - if filename is a string longer than 1535 bytes, it is parsed  
1140 - as the content of an OLE file in memory. (bytes type only)  
1141 - - if filename is a file-like object (with read, seek and tell methods),  
1142 - it is parsed as-is.  
1143 -  
1144 - :param raise_defects: minimal level for defects to be raised as exceptions.  
1145 - (use DEFECT_FATAL for a typical application, DEFECT_INCORRECT for a  
1146 - security-oriented application, see source code for details)  
1147 -  
1148 - :param write_mode: bool, if True the file is opened in read/write mode instead  
1149 - of read-only by default.  
1150 -  
1151 - :param debug: bool, set debug mode (deprecated, not used anymore)  
1152 -  
1153 - :param path_encoding: None or str, name of the codec to use for path  
1154 - names (streams and storages), or None for Unicode.  
1155 - Unicode by default on Python 3+, UTF-8 on Python 2.x.  
1156 - (new in olefile 0.42, was hardcoded to Latin-1 until olefile v0.41)  
1157 - """  
1158 - # minimal level for defects to be raised as exceptions:  
1159 - self._raise_defects_level = raise_defects  
1160 - # list of defects/issues not raised as exceptions:  
1161 - # tuples of (exception type, message)  
1162 - self.parsing_issues = []  
1163 - self.write_mode = write_mode  
1164 - self.path_encoding = path_encoding  
1165 - self._filesize = None  
1166 - self.fp = None  
1167 - if filename:  
1168 - self.open(filename, write_mode=write_mode)  
1169 -  
1170 -  
1171 - def _raise_defect(self, defect_level, message, exception_type=IOError):  
1172 - """  
1173 - This method should be called for any defect found during file parsing.  
1174 - It may raise an IOError exception according to the minimal level chosen  
1175 - for the OleFileIO object.  
1176 -  
1177 - :param defect_level: defect level, possible values are:  
1178 -  
1179 - - DEFECT_UNSURE : a case which looks weird, but not sure it's a defect  
1180 - - DEFECT_POTENTIAL : a potential defect  
1181 - - DEFECT_INCORRECT : an error according to specifications, but parsing can go on  
1182 - - DEFECT_FATAL : an error which cannot be ignored, parsing is impossible  
1183 -  
1184 - :param message: string describing the defect, used with raised exception.  
1185 - :param exception_type: exception class to be raised, IOError by default  
1186 - """  
1187 - # added by [PL]  
1188 - if defect_level >= self._raise_defects_level:  
1189 - log.error(message)  
1190 - raise exception_type(message)  
1191 - else:  
1192 - # just record the issue, no exception raised:  
1193 - self.parsing_issues.append((exception_type, message))  
1194 - log.warning(message)  
1195 -  
1196 -  
1197 - def _decode_utf16_str(self, utf16_str, errors='replace'):  
1198 - """  
1199 - Decode a string encoded in UTF-16 LE format, as found in the OLE  
1200 - directory or in property streams. Return a string encoded  
1201 - according to the path_encoding specified for the OleFileIO object.  
1202 -  
1203 - :param utf16_str: bytes string encoded in UTF-16 LE format  
1204 - :param errors: str, see python documentation for str.decode()  
1205 - :return: str, encoded according to path_encoding  
1206 - """  
1207 - unicode_str = utf16_str.decode('UTF-16LE', errors)  
1208 - if self.path_encoding:  
1209 - # an encoding has been specified for path names:  
1210 - return unicode_str.encode(self.path_encoding, errors)  
1211 - else:  
1212 - # path_encoding=None, return the Unicode string as-is:  
1213 - return unicode_str  
1214 -  
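Stream and storage names are stored on disk as UTF-16LE, which is why _decode_utf16_str first decodes with that codec and then optionally re-encodes using path_encoding. A sketch of that round-trip, assuming path_encoding='utf-8':

```python
# A directory entry name as stored on disk (UTF-16LE):
raw = 'Root Entry'.encode('UTF-16LE')
unicode_name = raw.decode('UTF-16LE', 'replace')
# With a path_encoding set, the name is re-encoded to bytes, mirroring
# _decode_utf16_str; with path_encoding=None the unicode string is returned.
encoded = unicode_name.encode('utf-8', 'replace')
print(unicode_name)  # Root Entry
```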
1215 -  
1216 - def open(self, filename, write_mode=False):  
1217 - """  
1218 - Open an OLE2 file in read-only or read/write mode.  
1219 - Read and parse the header, FAT and directory.  
1220 -  
1221 - :param filename: string-like or file-like object, OLE file to parse  
1222 -  
1223 - - if filename is a string smaller than 1536 bytes, it is the path  
1224 - of the file to open. (bytes or unicode string)  
1225 - - if filename is a string longer than 1535 bytes, it is parsed  
1226 - as the content of an OLE file in memory. (bytes type only)  
1227 - - if filename is a file-like object (with read, seek and tell methods),  
1228 - it is parsed as-is.  
1229 -  
1230 - :param write_mode: bool, if True the file is opened in read/write mode instead  
1231 - of read-only by default. (ignored if filename is not a path)  
1232 - """  
1233 - self.write_mode = write_mode  
1234 - #[PL] check if filename is a string-like or file-like object:  
1235 - # (it is better to check for a read() method)  
1236 - if hasattr(filename, 'read'):  
1237 - #TODO: also check seek and tell methods?  
1238 - # file-like object: use it directly  
1239 - self.fp = filename  
1240 - elif isinstance(filename, bytes) and len(filename) >= MINIMAL_OLEFILE_SIZE:  
1241 - # filename is a bytes string containing the OLE file to be parsed:  
1242 - # convert it to BytesIO  
1243 - self.fp = io.BytesIO(filename)  
1244 - else:  
1245 - # string-like object: filename of file on disk  
1246 - if self.write_mode:  
1247 - # open file in mode 'read with update, binary'  
1248 - # According to https://docs.python.org/2/library/functions.html#open  
1249 - # 'w' would truncate the file, 'a' may only append on some Unixes  
1250 - mode = 'r+b'  
1251 - else:  
1252 - # read-only mode by default  
1253 - mode = 'rb'  
1254 - self.fp = open(filename, mode)  
1255 - # obtain the filesize by using seek and tell, which should work on most  
1256 - # file-like objects:  
1257 - #TODO: do it above, using getsize with filename when possible?  
1258 - #TODO: fix code to fail with clear exception when filesize cannot be obtained  
1259 - filesize=0  
1260 - self.fp.seek(0, os.SEEK_END)  
1261 - try:  
1262 - filesize = self.fp.tell()  
1263 - finally:  
1264 - self.fp.seek(0)  
1265 - self._filesize = filesize  
1266 - log.debug('File size: %d' % self._filesize)  
1267 -  
1268 - # lists of streams in FAT and MiniFAT, to detect duplicate references  
1269 - # (list of indexes of first sectors of each stream)  
1270 - self._used_streams_fat = []  
1271 - self._used_streams_minifat = []  
1272 -  
1273 - header = self.fp.read(512)  
1274 -  
1275 - if len(header) != 512 or header[:8] != MAGIC:  
1276 - log.debug('Magic = %r instead of %r' % (header[:8], MAGIC))  
1277 - self._raise_defect(DEFECT_FATAL, "not an OLE2 structured storage file")  
1278 -  
1279 - # [PL] header structure according to AAF specifications:  
1280 - ##Header  
1281 - ##struct StructuredStorageHeader { // [offset from start (bytes), length (bytes)]  
1282 - ##BYTE _abSig[8]; // [00H,08] {0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1,  
1283 - ## // 0x1a, 0xe1} for current version  
1284 - ##CLSID _clsid; // [08H,16] reserved must be zero (WriteClassStg/  
1285 - ## // GetClassFile uses root directory class id)  
1286 - ##USHORT _uMinorVersion; // [18H,02] minor version of the format: 33 is  
1287 - ## // written by reference implementation  
1288 - ##USHORT _uDllVersion; // [1AH,02] major version of the dll/format: 3 for  
1289 - ## // 512-byte sectors, 4 for 4 KB sectors  
1290 - ##USHORT _uByteOrder; // [1CH,02] 0xFFFE: indicates Intel byte-ordering  
1291 - ##USHORT _uSectorShift; // [1EH,02] size of sectors in power-of-two;  
1292 - ## // typically 9 indicating 512-byte sectors  
1293 - ##USHORT _uMiniSectorShift; // [20H,02] size of mini-sectors in power-of-two;  
1294 - ## // typically 6 indicating 64-byte mini-sectors  
1295 - ##USHORT _usReserved; // [22H,02] reserved, must be zero  
1296 - ##ULONG _ulReserved1; // [24H,04] reserved, must be zero  
1297 - ##FSINDEX _csectDir; // [28H,04] must be zero for 512-byte sectors,  
1298 - ## // number of SECTs in directory chain for 4 KB  
1299 - ## // sectors  
1300 - ##FSINDEX _csectFat; // [2CH,04] number of SECTs in the FAT chain  
1301 - ##SECT _sectDirStart; // [30H,04] first SECT in the directory chain  
1302 - ##DFSIGNATURE _signature; // [34H,04] signature used for transactions; must  
1303 - ## // be zero. The reference implementation  
1304 - ## // does not support transactions  
1305 - ##ULONG _ulMiniSectorCutoff; // [38H,04] maximum size for a mini stream;  
1306 - ## // typically 4096 bytes  
1307 - ##SECT _sectMiniFatStart; // [3CH,04] first SECT in the MiniFAT chain  
1308 - ##FSINDEX _csectMiniFat; // [40H,04] number of SECTs in the MiniFAT chain  
1309 - ##SECT _sectDifStart; // [44H,04] first SECT in the DIFAT chain  
1310 - ##FSINDEX _csectDif; // [48H,04] number of SECTs in the DIFAT chain  
1311 - ##SECT _sectFat[109]; // [4CH,436] the SECTs of first 109 FAT sectors  
1312 - ##};  
1313 -  
1314 - # [PL] header decoding:  
1315 - # '<' indicates little-endian byte ordering for Intel (cf. struct module help)  
1316 - fmt_header = '<8s16sHHHHHHLLLLLLLLLL'  
1317 - header_size = struct.calcsize(fmt_header)  
1318 - log.debug( "fmt_header size = %d, +FAT = %d" % (header_size, header_size + 109*4) )  
1319 - header1 = header[:header_size]  
1320 - (  
1321 - self.header_signature,  
1322 - self.header_clsid,  
1323 - self.minor_version,  
1324 - self.dll_version,  
1325 - self.byte_order,  
1326 - self.sector_shift,  
1327 - self.mini_sector_shift,  
1328 - self.reserved1,  
1329 - self.reserved2,  
1330 - self.num_dir_sectors,  
1331 - self.num_fat_sectors,  
1332 - self.first_dir_sector,  
1333 - self.transaction_signature_number,  
1334 - self.mini_stream_cutoff_size,  
1335 - self.first_mini_fat_sector,  
1336 - self.num_mini_fat_sectors,  
1337 - self.first_difat_sector,  
1338 - self.num_difat_sectors  
1339 - ) = struct.unpack(fmt_header, header1)  
1340 - log.debug( struct.unpack(fmt_header, header1))  
1341 -  
1342 - if self.header_signature != MAGIC:  
1343 - # OLE signature should always be present  
1344 - self._raise_defect(DEFECT_FATAL, "incorrect OLE signature")  
1345 - if self.header_clsid != bytearray(16):  
1346 - # according to AAF specs, CLSID should always be zero  
1347 - self._raise_defect(DEFECT_INCORRECT, "incorrect CLSID in OLE header")  
1348 - log.debug( "Minor Version = %d" % self.minor_version )  
1349 - log.debug( "DLL Version = %d (expected: 3 or 4)" % self.dll_version )  
1350 - if self.dll_version not in [3, 4]:  
1351 - # version 3: usual format, 512 bytes per sector  
1352 - # version 4: large format, 4K per sector  
1353 - self._raise_defect(DEFECT_INCORRECT, "incorrect DllVersion in OLE header")  
1354 - log.debug( "Byte Order = %X (expected: FFFE)" % self.byte_order )  
1355 - if self.byte_order != 0xFFFE:  
1356 - # For now only common little-endian documents are handled correctly  
1357 - self._raise_defect(DEFECT_FATAL, "incorrect ByteOrder in OLE header")  
1358 - # TODO: add big-endian support for documents created on Mac ?  
1359 -        # But according to [MS-CFB] v20140502, ByteOrder MUST be 0xFFFE.  
1360 - self.sector_size = 2**self.sector_shift  
1361 - log.debug( "Sector Size = %d bytes (expected: 512 or 4096)" % self.sector_size )  
1362 - if self.sector_size not in [512, 4096]:  
1363 - self._raise_defect(DEFECT_INCORRECT, "incorrect sector_size in OLE header")  
1364 - if (self.dll_version==3 and self.sector_size!=512) \  
1365 - or (self.dll_version==4 and self.sector_size!=4096):  
1366 - self._raise_defect(DEFECT_INCORRECT, "sector_size does not match DllVersion in OLE header")  
1367 - self.mini_sector_size = 2**self.mini_sector_shift  
1368 - log.debug( "MiniFAT Sector Size = %d bytes (expected: 64)" % self.mini_sector_size )  
1369 - if self.mini_sector_size not in [64]:  
1370 - self._raise_defect(DEFECT_INCORRECT, "incorrect mini_sector_size in OLE header")  
1371 - if self.reserved1 != 0 or self.reserved2 != 0:  
1372 - self._raise_defect(DEFECT_INCORRECT, "incorrect OLE header (non-null reserved bytes)")  
1373 - log.debug( "Number of directory sectors = %d" % self.num_dir_sectors )  
1374 - # Number of directory sectors (only allowed if DllVersion != 3)  
1375 - if self.sector_size==512 and self.num_dir_sectors!=0:  
1376 - self._raise_defect(DEFECT_INCORRECT, "incorrect number of directory sectors in OLE header")  
1377 - log.debug( "num_fat_sectors = %d" % self.num_fat_sectors )  
1378 - # num_fat_sectors = number of FAT sectors in the file  
1379 - log.debug( "first_dir_sector = %X" % self.first_dir_sector )  
1380 - # first_dir_sector = 1st sector containing the directory  
1381 - log.debug( "transaction_signature_number = %d" % self.transaction_signature_number )  
1382 - # Signature should be zero, BUT some implementations do not follow this  
1383 - # rule => only a potential defect:  
1384 - # (according to MS-CFB, may be != 0 for applications supporting file  
1385 - # transactions)  
1386 - if self.transaction_signature_number != 0:  
1387 - self._raise_defect(DEFECT_POTENTIAL, "incorrect OLE header (transaction_signature_number>0)")  
1388 - log.debug( "mini_stream_cutoff_size = 0x%X (expected: 0x1000)" % self.mini_stream_cutoff_size )  
1389 - # MS-CFB: This integer field MUST be set to 0x00001000. This field  
1390 - # specifies the maximum size of a user-defined data stream allocated  
1391 - # from the mini FAT and mini stream, and that cutoff is 4096 bytes.  
1392 - # Any user-defined data stream larger than or equal to this cutoff size  
1393 - # must be allocated as normal sectors from the FAT.  
1394 - if self.mini_stream_cutoff_size != 0x1000:  
1395 - self._raise_defect(DEFECT_INCORRECT, "incorrect mini_stream_cutoff_size in OLE header")  
1396 - # if no exception is raised, the cutoff size is fixed to 0x1000  
1397 - log.warning('Fixing the mini_stream_cutoff_size to 4096 (mandatory value) instead of %d' %  
1398 - self.mini_stream_cutoff_size)  
1399 - self.mini_stream_cutoff_size = 0x1000  
1400 - log.debug( "first_mini_fat_sector = %Xh" % self.first_mini_fat_sector )  
1401 - log.debug( "num_mini_fat_sectors = %d" % self.num_mini_fat_sectors )  
1402 - log.debug( "first_difat_sector = %Xh" % self.first_difat_sector )  
1403 - log.debug( "num_difat_sectors = %d" % self.num_difat_sectors )  
1404 -  
1405 - # calculate the number of sectors in the file  
1406 - # (-1 because header doesn't count)  
1407 - self.nb_sect = ( (filesize + self.sector_size-1) // self.sector_size) - 1  
1408 - log.debug( "Number of sectors in the file: %d" % self.nb_sect )  
1409 - #TODO: change this test, because an OLE file MAY contain other data  
1410 - # after the last sector.  
1411 -  
1412 - # file clsid  
1413 - self.header_clsid = _clsid(header[8:24])  
1414 -  
1415 - #TODO: remove redundant attributes, and fix the code which uses them?  
1416 - self.sectorsize = self.sector_size #1 << i16(header, 30)  
1417 - self.minisectorsize = self.mini_sector_size #1 << i16(header, 32)  
1418 - self.minisectorcutoff = self.mini_stream_cutoff_size # i32(header, 56)  
1419 -  
1420 - # check known streams for duplicate references (these are always in FAT,  
1421 - # never in MiniFAT):  
1422 - self._check_duplicate_stream(self.first_dir_sector)  
1423 - # check MiniFAT only if it is not empty:  
1424 - if self.num_mini_fat_sectors:  
1425 - self._check_duplicate_stream(self.first_mini_fat_sector)  
1426 - # check DIFAT only if it is not empty:  
1427 - if self.num_difat_sectors:  
1428 - self._check_duplicate_stream(self.first_difat_sector)  
1429 -  
1430 - # Load file allocation tables  
1431 - self.loadfat(header)  
1432 -        # Load directory. This sets both the direntries list (ordered by sid)  
1433 - # and the root (ordered by hierarchy) members.  
1434 - self.loaddirectory(self.first_dir_sector)#i32(header, 48))  
1435 - self.ministream = None  
1436 - self.minifatsect = self.first_mini_fat_sector #i32(header, 60)  
1437 -  
1438 -  
1439 - def close(self):  
1440 - """  
1441 - close the OLE file, to release the file object  
1442 - """  
1443 - self.fp.close()  
1444 -  
1445 -  
1446 - def _check_duplicate_stream(self, first_sect, minifat=False):  
1447 - """  
1448 -        Checks that a stream has not already been referenced elsewhere.  
1449 - This method should only be called once for each known stream, and only  
1450 - if stream size is not null.  
1451 -  
1452 - :param first_sect: int, index of first sector of the stream in FAT  
1453 - :param minifat: bool, if True, stream is located in the MiniFAT, else in the FAT  
1454 - """  
1455 - if minifat:  
1456 - log.debug('_check_duplicate_stream: sect=%Xh in MiniFAT' % first_sect)  
1457 - used_streams = self._used_streams_minifat  
1458 - else:  
1459 - log.debug('_check_duplicate_stream: sect=%Xh in FAT' % first_sect)  
1460 - # some values can be safely ignored (not a real stream):  
1461 - if first_sect in (DIFSECT,FATSECT,ENDOFCHAIN,FREESECT):  
1462 - return  
1463 - used_streams = self._used_streams_fat  
1464 - #TODO: would it be more efficient using a dict or hash values, instead  
1465 - # of a list of long ?  
1466 - if first_sect in used_streams:  
1467 - self._raise_defect(DEFECT_INCORRECT, 'Stream referenced twice')  
1468 - else:  
1469 - used_streams.append(first_sect)  
1470 -  
1471 -  
1472 - def dumpfat(self, fat, firstindex=0):  
1473 - """  
1474 - Display a part of FAT in human-readable form for debugging purposes  
1475 - """  
1476 - # dictionary to convert special FAT values in human-readable strings  
1477 - VPL = 8 # values per line (8+1 * 8+1 = 81)  
1478 - fatnames = {  
1479 - FREESECT: "..free..",  
1480 - ENDOFCHAIN: "[ END. ]",  
1481 - FATSECT: "FATSECT ",  
1482 - DIFSECT: "DIFSECT "  
1483 - }  
1484 - nbsect = len(fat)  
1485 - nlines = (nbsect+VPL-1)//VPL  
1486 - print("index", end=" ")  
1487 - for i in range(VPL):  
1488 - print("%8X" % i, end=" ")  
1489 - print()  
1490 - for l in range(nlines):  
1491 - index = l*VPL  
1492 - print("%6X:" % (firstindex+index), end=" ")  
1493 - for i in range(index, index+VPL):  
1494 - if i>=nbsect:  
1495 - break  
1496 - sect = fat[i]  
1497 - aux = sect & 0xFFFFFFFF # JYTHON-WORKAROUND  
1498 - if aux in fatnames:  
1499 - name = fatnames[aux]  
1500 - else:  
1501 - if sect == i+1:  
1502 - name = " --->"  
1503 - else:  
1504 - name = "%8X" % sect  
1505 - print(name, end=" ")  
1506 - print()  
1507 -  
1508 -  
1509 - def dumpsect(self, sector, firstindex=0):  
1510 - """  
1511 - Display a sector in a human-readable form, for debugging purposes  
1512 - """  
1513 - VPL=8 # number of values per line (8+1 * 8+1 = 81)  
1514 - tab = array.array(UINT32, sector)  
1515 - if sys.byteorder == 'big':  
1516 - tab.byteswap()  
1517 - nbsect = len(tab)  
1518 - nlines = (nbsect+VPL-1)//VPL  
1519 - print("index", end=" ")  
1520 - for i in range(VPL):  
1521 - print("%8X" % i, end=" ")  
1522 - print()  
1523 - for l in range(nlines):  
1524 - index = l*VPL  
1525 - print("%6X:" % (firstindex+index), end=" ")  
1526 - for i in range(index, index+VPL):  
1527 - if i>=nbsect:  
1528 - break  
1529 - sect = tab[i]  
1530 - name = "%8X" % sect  
1531 - print(name, end=" ")  
1532 - print()  
1533 -  
1534 - def sect2array(self, sect):  
1535 - """  
1536 - convert a sector to an array of 32 bits unsigned integers,  
1537 - swapping bytes on big endian CPUs such as PowerPC (old Macs)  
1538 - """  
1539 - a = array.array(UINT32, sect)  
1540 - # if CPU is big endian, swap bytes:  
1541 - if sys.byteorder == 'big':  
1542 - a.byteswap()  
1543 - return a  
1544 -  
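sect2array parses a raw sector as little-endian 32-bit unsigned integers, byteswapping on big-endian hosts. A self-contained sketch of the same idea, assuming the array typecode 'I' is a 32-bit unsigned int (true on common platforms, which is what the module's UINT32 constant resolves):

```python
import array
import sys

def sector_to_uint32s(sect):
    # Parse a raw sector (bytes) as 32-bit unsigned ints; the on-disk
    # layout is little-endian, so swap bytes on big-endian CPUs.
    a = array.array('I', sect)
    if sys.byteorder == 'big':
        a.byteswap()
    return a

# A 512-byte sector yields 128 FAT entries:
sect = bytes(range(4)) * 128
values = sector_to_uint32s(sect)
print(len(values))     # 128
print(hex(values[0]))  # 0x3020100 (bytes 00 01 02 03, little-endian)
```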
1545 -  
1546 - def loadfat_sect(self, sect):  
1547 - """  
1548 - Adds the indexes of the given sector to the FAT  
1549 -  
1550 - :param sect: string containing the first FAT sector, or array of long integers  
1551 - :returns: index of last FAT sector.  
1552 - """  
1553 - # a FAT sector is an array of ulong integers.  
1554 - if isinstance(sect, array.array):  
1555 - # if sect is already an array it is directly used  
1556 - fat1 = sect  
1557 - else:  
1558 - # if it's a raw sector, it is parsed in an array  
1559 - fat1 = self.sect2array(sect)  
1560 - # Display the sector contents only if the logging level is debug:  
1561 - if log.isEnabledFor(logging.DEBUG):  
1562 - self.dumpsect(sect)  
1563 - # The FAT is a sector chain starting at the first index of itself.  
1564 - # initialize isect, just in case:  
1565 - isect = None  
1566 - for isect in fat1:  
1567 - isect = isect & 0xFFFFFFFF # JYTHON-WORKAROUND  
1568 - log.debug("isect = %X" % isect)  
1569 - if isect == ENDOFCHAIN or isect == FREESECT:  
1570 - # the end of the sector chain has been reached  
1571 - log.debug("found end of sector chain")  
1572 - break  
1573 - # read the FAT sector  
1574 - s = self.getsect(isect)  
1575 - # parse it as an array of 32 bits integers, and add it to the  
1576 - # global FAT array  
1577 - nextfat = self.sect2array(s)  
1578 - self.fat = self.fat + nextfat  
1579 - return isect  
1580 -  
1581 -  
1582 - def loadfat(self, header):  
1583 - """  
1584 - Load the FAT table.  
1585 - """  
1586 - # The 1st sector of the file contains sector numbers for the first 109  
1587 - # FAT sectors, right after the header which is 76 bytes long.  
1588 - # (always 109, whatever the sector size: 512 bytes = 76+4*109)  
1589 - # Additional sectors are described by DIF blocks  
1590 -  
1591 - sect = header[76:512]  
1592 - log.debug( "len(sect)=%d, so %d integers" % (len(sect), len(sect)//4) )  
1593 - #fat = []  
1594 - # [PL] FAT is an array of 32 bits unsigned ints, it's more effective  
1595 - # to use an array than a list in Python.  
1596 - # It's initialized as empty first:  
1597 - self.fat = array.array(UINT32)  
1598 - self.loadfat_sect(sect)  
1599 - #self.dumpfat(self.fat)  
1600 -## for i in range(0, len(sect), 4):  
1601 -## ix = i32(sect, i)  
1602 -## #[PL] if ix == -2 or ix == -1: # ix == 0xFFFFFFFE or ix == 0xFFFFFFFF:  
1603 -## if ix == 0xFFFFFFFE or ix == 0xFFFFFFFF:  
1604 -## break  
1605 -## s = self.getsect(ix)  
1606 -## #fat = fat + [i32(s, i) for i in range(0, len(s), 4)]  
1607 -## fat = fat + array.array(UINT32, s)  
1608 - if self.num_difat_sectors != 0:  
1609 - # [PL] There's a DIFAT because file is larger than 6.8MB  
1610 - # some checks just in case:  
1611 - if self.num_fat_sectors <= 109:  
1612 - # there must be at least 109 blocks in header and the rest in  
1613 - # DIFAT, so number of sectors must be >109.  
1614 - self._raise_defect(DEFECT_INCORRECT, 'incorrect DIFAT, not enough sectors')  
1615 - if self.first_difat_sector >= self.nb_sect:  
1616 - # initial DIFAT block index must be valid  
1617 - self._raise_defect(DEFECT_FATAL, 'incorrect DIFAT, first index out of range')  
1618 - log.debug( "DIFAT analysis..." )  
1619 - # We compute the necessary number of DIFAT sectors :  
1620 - # Number of pointers per DIFAT sector = (sectorsize/4)-1  
1621 - # (-1 because the last pointer is the next DIFAT sector number)  
1622 - nb_difat_sectors = (self.sectorsize//4)-1  
1623 - # (if 512 bytes: each DIFAT sector = 127 pointers + 1 towards next DIFAT sector)  
1624 - nb_difat = (self.num_fat_sectors-109 + nb_difat_sectors-1)//nb_difat_sectors  
1625 - log.debug( "nb_difat = %d" % nb_difat )  
1626 - if self.num_difat_sectors != nb_difat:  
1627 - raise IOError('incorrect DIFAT')  
1628 - isect_difat = self.first_difat_sector  
1629 - for i in iterrange(nb_difat):  
1630 - log.debug( "DIFAT block %d, sector %X" % (i, isect_difat) )  
1631 - #TODO: check if corresponding FAT SID = DIFSECT  
1632 - sector_difat = self.getsect(isect_difat)  
1633 - difat = self.sect2array(sector_difat)  
1634 - # Display the sector contents only if the logging level is debug:  
1635 - if log.isEnabledFor(logging.DEBUG):  
1636 - self.dumpsect(sector_difat)  
1637 - self.loadfat_sect(difat[:nb_difat_sectors])  
1638 - # last DIFAT pointer is next DIFAT sector:  
1639 - isect_difat = difat[nb_difat_sectors]  
1640 - log.debug( "next DIFAT sector: %X" % isect_difat )  
1641 - # checks:  
1642 - if isect_difat not in [ENDOFCHAIN, FREESECT]:  
1643 - # last DIFAT pointer value must be ENDOFCHAIN or FREESECT  
1644 - raise IOError('incorrect end of DIFAT')  
1645 -## if len(self.fat) != self.num_fat_sectors:  
1646 -## # FAT should contain num_fat_sectors blocks  
1647 -## print("FAT length: %d instead of %d" % (len(self.fat), self.num_fat_sectors))  
1648 -## raise IOError('incorrect DIFAT')  
1649 - # since FAT is read from fixed-size sectors, it may contain more values  
1650 - # than the actual number of sectors in the file.  
1651 - # Keep only the relevant sector indexes:  
1652 - if len(self.fat) > self.nb_sect:  
1653 - log.debug('len(fat)=%d, shrunk to nb_sect=%d' % (len(self.fat), self.nb_sect))  
1654 - self.fat = self.fat[:self.nb_sect]  
1655 - # Display the FAT contents only if the logging level is debug:  
1656 - if log.isEnabledFor(logging.DEBUG):  
1657 - log.debug('\nFAT:')  
1658 - self.dumpfat(self.fat)  
1659 -  
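The DIFAT arithmetic in loadfat works as follows: the header lists the first 109 FAT sector numbers, and each 512-byte DIFAT sector lists 127 more plus one pointer to the next DIFAT sector. A sketch of the sector-count computation as a standalone helper (hypothetical, mirroring the code above):

```python
def difat_sector_count(num_fat_sectors, sector_size=512):
    # Pointers per DIFAT sector: one uint32 slot per sector is reserved
    # for the link to the next DIFAT sector.
    per_sector = sector_size // 4 - 1   # 127 for 512-byte sectors
    extra = num_fat_sectors - 109       # FAT sectors not listed in the header
    if extra <= 0:
        return 0
    # ceiling division:
    return (extra + per_sector - 1) // per_sector

print(difat_sector_count(109))  # 0 - all FAT sectors fit in the header
print(difat_sector_count(110))  # 1
print(difat_sector_count(236))  # 1  (109 in header + 127 in one DIFAT sector)
print(difat_sector_count(237))  # 2
```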
1660 -  
1661 - def loadminifat(self):  
1662 - """  
1663 - Load the MiniFAT table.  
1664 - """  
1665 - # MiniFAT is stored in a standard sub-stream, pointed to by a header  
1666 - # field.  
1667 - # NOTE: there are two sizes to take into account for this stream:  
1668 - # 1) Stream size is calculated according to the number of sectors  
1669 - # declared in the OLE header. This allocated stream may be more than  
1670 - # needed to store the actual sector indexes.  
1671 - # (self.num_mini_fat_sectors is the number of sectors of size self.sector_size)  
1672 - stream_size = self.num_mini_fat_sectors * self.sector_size  
1673 - # 2) Actually used size is calculated by dividing the MiniStream size  
1674 - # (given by root entry size) by the size of mini sectors, *4 for  
1675 - # 32 bits indexes:  
1676 - nb_minisectors = (self.root.size + self.mini_sector_size-1) // self.mini_sector_size  
1677 - used_size = nb_minisectors * 4  
1678 - log.debug('loadminifat(): minifatsect=%d, nb FAT sectors=%d, used_size=%d, stream_size=%d, nb MiniSectors=%d' %  
1679 - (self.minifatsect, self.num_mini_fat_sectors, used_size, stream_size, nb_minisectors))  
1680 - if used_size > stream_size:  
1681 - # This is not really a problem, but may indicate a wrong implementation:  
1682 - self._raise_defect(DEFECT_INCORRECT, 'OLE MiniStream is larger than MiniFAT')  
1683 - # In any case, first read stream_size:  
1684 - s = self._open(self.minifatsect, stream_size, force_FAT=True).read()  
1685 - #[PL] Old code replaced by an array:  
1686 - #self.minifat = [i32(s, i) for i in range(0, len(s), 4)]  
1687 - self.minifat = self.sect2array(s)  
1688 - # Then shrink the array to used size, to avoid indexes out of MiniStream:  
1689 - log.debug('MiniFAT shrunk from %d to %d sectors' % (len(self.minifat), nb_minisectors))  
1690 - self.minifat = self.minifat[:nb_minisectors]  
1691 - log.debug('loadminifat(): len=%d' % len(self.minifat))  
1692 - # Display the FAT contents only if the logging level is debug:  
1693 - if log.isEnabledFor(logging.DEBUG):  
1694 - log.debug('\nMiniFAT:')  
1695 - self.dumpfat(self.minifat)  
1696 -  
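loadminifat sizes the "used" part of the MiniFAT from the root entry: the mini stream is divided into 64-byte mini sectors, each requiring one 32-bit MiniFAT entry. A sketch of that arithmetic (hypothetical helper names):

```python
def minifat_used_size(ministream_size, mini_sector_size=64):
    # Number of mini sectors backing the mini stream (ceiling division),
    # times 4 bytes per 32-bit MiniFAT entry.
    nb_minisectors = (ministream_size + mini_sector_size - 1) // mini_sector_size
    return nb_minisectors * 4

print(minifat_used_size(0))     # 0
print(minifat_used_size(1))     # 4
print(minifat_used_size(4096))  # 256  (64 mini sectors * 4 bytes)
```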
1697 - def getsect(self, sect):  
1698 - """  
1699 - Read given sector from file on disk.  
1700 -  
1701 - :param sect: int, sector index  
1702 - :returns: a string containing the sector data.  
1703 - """  
1704 - # From [MS-CFB]: A sector number can be converted into a byte offset  
1705 - # into the file by using the following formula:  
1706 - # (sector number + 1) x Sector Size.  
1707 - # This implies that sector #0 of the file begins at byte offset Sector  
1708 - # Size, not at 0.  
1709 -  
1710 - # [PL] the original code in PIL was wrong when sectors are 4KB instead of  
1711 - # 512 bytes:  
1712 - #self.fp.seek(512 + self.sectorsize * sect)  
1713 - #[PL]: added safety checks:  
1714 - #print("getsect(%X)" % sect)  
1715 - try:  
1716 - self.fp.seek(self.sectorsize * (sect+1))  
1717 -        except Exception:  
1718 - log.debug('getsect(): sect=%X, seek=%d, filesize=%d' %  
1719 - (sect, self.sectorsize*(sect+1), self._filesize))  
1720 - self._raise_defect(DEFECT_FATAL, 'OLE sector index out of range')  
1721 - sector = self.fp.read(self.sectorsize)  
1722 - if len(sector) != self.sectorsize:  
1723 - log.debug('getsect(): sect=%X, read=%d, sectorsize=%d' %  
1724 - (sect, len(sector), self.sectorsize))  
1725 - self._raise_defect(DEFECT_FATAL, 'incomplete OLE sector')  
1726 - return sector  
1727 -  
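The offset formula used by getsect can be checked in isolation: sector number n starts at byte (n + 1) * sector_size, since the 512-byte header occupies the space before sector #0. A sketch:

```python
def sector_offset(sect, sector_size=512):
    # [MS-CFB]: byte offset = (sector number + 1) * sector size,
    # so sector #0 begins right after the header.
    return (sect + 1) * sector_size

print(sector_offset(0))         # 512
print(sector_offset(3))         # 2048
print(sector_offset(0, 4096))   # 4096 (version-4 files with 4 KB sectors)
```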
1728 -  
1729 - def write_sect(self, sect, data, padding=b'\x00'):  
1730 - """  
1731 - Write given sector to file on disk.  
1732 -  
1733 - :param sect: int, sector index  
1734 - :param data: bytes, sector data  
1735 - :param padding: single byte, padding character if data < sector size  
1736 - """  
1737 - if not isinstance(data, bytes):  
1738 - raise TypeError("write_sect: data must be a bytes string")  
1739 - if not isinstance(padding, bytes) or len(padding)!=1:  
1740 - raise TypeError("write_sect: padding must be a bytes string of 1 char")  
1741 - #TODO: we could allow padding=None for no padding at all  
1742 - try:  
1743 - self.fp.seek(self.sectorsize * (sect+1))  
1744 - except:  
1745 - log.debug('write_sect(): sect=%X, seek=%d, filesize=%d' %  
1746 - (sect, self.sectorsize*(sect+1), self._filesize))  
1747 - self._raise_defect(DEFECT_FATAL, 'OLE sector index out of range')  
1748 - if len(data) < self.sectorsize:  
1749 - # add padding  
1750 - data += padding * (self.sectorsize - len(data))  
1751 - elif len(data) > self.sectorsize:  
1752 - raise ValueError("Data is larger than sector size")  
1753 - self.fp.write(data)  
1754 -  
1755 -  
1756 - def loaddirectory(self, sect):  
1757 - """  
1758 - Load the directory.  
1759 -  
1760 - :param sect: sector index of directory stream.  
1761 - """  
1762 - # The directory is stored in a standard  
1763 - # substream, independent of its size.  
1764 -  
1765 - # open directory stream as a read-only file:  
1766 - # (stream size is not known in advance)  
1767 - self.directory_fp = self._open(sect)  
1768 -  
1769 - #[PL] to detect malformed documents and avoid DoS attacks, the maximum  
1770 - # number of directory entries can be calculated:  
1771 - max_entries = self.directory_fp.size // 128  
1772 - log.debug('loaddirectory: size=%d, max_entries=%d' %  
1773 - (self.directory_fp.size, max_entries))  
1774 -  
1775 - # Create list of directory entries  
1776 - #self.direntries = []  
1777 - # We start with a list of "None" objects  
1778 - self.direntries = [None] * max_entries  
1779 -## for sid in iterrange(max_entries):  
1780 -## entry = fp.read(128)  
1781 -## if not entry:  
1782 -## break  
1783 -## self.direntries.append(_OleDirectoryEntry(entry, sid, self))  
1784 - # load root entry:  
1785 - root_entry = self._load_direntry(0)  
1786 - # Root entry is the first entry:  
1787 - self.root = self.direntries[0]  
1788 - # TODO: read ALL directory entries (ignore bad entries?)  
1789 - # TODO: adapt build_storage_tree to avoid duplicate reads  
1790 - # for i in range(1, max_entries):  
1791 - # self._load_direntry(i)  
1792 - # read and build all storage trees, starting from the root:  
1793 - self.root.build_storage_tree()  
1794 -  
1795 -  
1796 - def _load_direntry (self, sid):  
1797 - """  
1798 - Load a directory entry from the directory.  
1799 - This method should only be called once for each storage/stream when  
1800 - loading the directory.  
1801 -  
1802 - :param sid: index of storage/stream in the directory.  
1803 - :returns: a _OleDirectoryEntry object  
1804 -  
1805 - :exception IOError: if the entry has already been referenced.  
1806 - """  
1807 - # check if SID is OK:  
1808 - if sid<0 or sid>=len(self.direntries):  
1809 - self._raise_defect(DEFECT_FATAL, "OLE directory index out of range")  
1810 - # check if entry was already referenced:  
1811 - if self.direntries[sid] is not None:  
1812 - self._raise_defect(DEFECT_INCORRECT,  
1813 - "double reference for OLE stream/storage")  
1814 - # if exception not raised, return the object  
1815 - return self.direntries[sid]  
1816 - self.directory_fp.seek(sid * 128)  
1817 - entry = self.directory_fp.read(128)  
1818 - self.direntries[sid] = _OleDirectoryEntry(entry, sid, self)  
1819 - return self.direntries[sid]  
1820 -  
1821 -  
1822 - def dumpdirectory(self):  
1823 - """  
1824 - Dump directory (for debugging only)  
1825 - """  
1826 - self.root.dump()  
1827 -  
1828 -  
1829 - def _open(self, start, size = 0x7FFFFFFF, force_FAT=False):  
1830 - """  
1831 - Open a stream, either in FAT or MiniFAT according to its size.  
1832 - (openstream helper)  
1833 -  
1834 - :param start: index of first sector  
1835 - :param size: size of stream (or nothing if size is unknown)  
1836 - :param force_FAT: if False (default), stream will be opened in FAT or MiniFAT  
1837 - according to size. If True, it will always be opened in FAT.  
1838 - """  
1839 - log.debug('OleFileIO._open(): sect=%Xh, size=%d, force_FAT=%s' %  
1840 - (start, size, str(force_FAT)))  
1841 - # stream size is compared to the mini_stream_cutoff_size threshold:  
1842 - if size < self.minisectorcutoff and not force_FAT:  
1843 - # ministream object  
1844 - if not self.ministream:  
1845 - # load MiniFAT if it wasn't already done:  
1846 - self.loadminifat()  
1847 - # The first sector index of the miniFAT stream is stored in the  
1848 - # root directory entry:  
1849 - size_ministream = self.root.size  
1850 - log.debug('Opening MiniStream: sect=%Xh, size=%d' %  
1851 - (self.root.isectStart, size_ministream))  
1852 - self.ministream = self._open(self.root.isectStart,  
1853 - size_ministream, force_FAT=True)  
1854 - return _OleStream(fp=self.ministream, sect=start, size=size,  
1855 - offset=0, sectorsize=self.minisectorsize,  
1856 - fat=self.minifat, filesize=self.ministream.size)  
1857 - else:  
1858 - # standard stream  
1859 - return _OleStream(fp=self.fp, sect=start, size=size,  
1860 - offset=self.sectorsize,  
1861 - sectorsize=self.sectorsize, fat=self.fat,  
1862 - filesize=self._filesize)  
1863 -  
1864 -  
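`_open` routes a stream to the MiniFAT or the regular FAT depending on its size. The decision can be sketched as follows (`storage_location` and the 4096-byte default are illustrative; olefile reads the actual cutoff from the file header):

```python
MINI_STREAM_CUTOFF = 4096  # typical mini stream cutoff size per [MS-CFB]

def storage_location(stream_size, cutoff=MINI_STREAM_CUTOFF, force_fat=False):
    # Streams strictly smaller than the cutoff live in the MiniFAT, unless the
    # caller forces regular-FAT access (as _open does for the ministream
    # container itself, which is always read through the regular FAT).
    if stream_size < cutoff and not force_fat:
        return 'minifat'
    return 'fat'
```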
1865 - def _list(self, files, prefix, node, streams=True, storages=False):  
1866 - """  
1867 - listdir helper  
1868 -  
1869 - :param files: list of files to fill in  
1870 - :param prefix: current location in storage tree (list of names)  
1871 - :param node: current node (_OleDirectoryEntry object)  
1872 - :param streams: bool, include streams if True (True by default) - new in v0.26  
1873 - :param storages: bool, include storages if True (False by default) - new in v0.26  
1874 - (note: the root storage is never included)  
1875 - """  
1876 - prefix = prefix + [node.name]  
1877 - for entry in node.kids:  
1878 - if entry.entry_type == STGTY_STORAGE:  
1879 - # this is a storage  
1880 - if storages:  
1881 - # add it to the list  
1882 - files.append(prefix[1:] + [entry.name])  
1883 - # check its kids  
1884 - self._list(files, prefix, entry, streams, storages)  
1885 - elif entry.entry_type == STGTY_STREAM:  
1886 - # this is a stream  
1887 - if streams:  
1888 - # add it to the list  
1889 - files.append(prefix[1:] + [entry.name])  
1890 - else:  
1891 - self._raise_defect(DEFECT_INCORRECT, 'The directory tree contains an entry which is neither a stream nor a storage.')  
1892 -  
1893 -  
1894 - def listdir(self, streams=True, storages=False):  
1895 - """  
1896 - Return a list of streams and/or storages stored in this file  
1897 -  
1898 - :param streams: bool, include streams if True (True by default) - new in v0.26  
1899 - :param storages: bool, include storages if True (False by default) - new in v0.26  
1900 - (note: the root storage is never included)  
1901 - :returns: list of stream and/or storage paths  
1902 - """  
1903 - files = []  
1904 - self._list(files, [], self.root, streams, storages)  
1905 - return files  
1906 -  
1907 -  
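The recursion in `_list` can be demonstrated on a toy directory tree (a minimal sketch over nested dicts, not olefile's real data structures):

```python
STGTY_STORAGE, STGTY_STREAM = 1, 2  # entry type constants, as in olefile

def list_tree(node, prefix=(), streams=True, storages=False):
    # Recursively collect stream/storage paths, mirroring the logic of
    # OleFileIO._list: the root's own name is never part of the result.
    # A node is a dict: {'type': ..., 'name': ..., 'kids': [...]}.
    paths = []
    for kid in node['kids']:
        path = list(prefix) + [kid['name']]
        if kid['type'] == STGTY_STORAGE:
            if storages:
                paths.append(path)
            paths.extend(list_tree(kid, path, streams, storages))
        elif kid['type'] == STGTY_STREAM:
            if streams:
                paths.append(path)
    return paths
```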
1908 - def _find(self, filename):  
1909 - """  
1910 - Returns directory entry of given filename. (openstream helper)  
1911 - Note: this method is case-insensitive.  
1912 -  
1913 - :param filename: path of stream in storage tree (except root entry), either:  
1914 -  
1915 - - a string using Unix path syntax, for example:  
1916 - 'storage_1/storage_1.2/stream'  
1917 - - or a list of storage filenames, path to the desired stream/storage.  
1918 - Example: ['storage_1', 'storage_1.2', 'stream']  
1919 -  
1920 - :returns: sid of requested filename  
1921 - :exception IOError: if file not found  
1922 - """  
1923 -  
1924 - # if filename is a string instead of a list, split it on slashes to  
1925 - # convert to a list:  
1926 - if isinstance(filename, basestring):  
1927 - filename = filename.split('/')  
1928 - # walk across storage tree, following given path:  
1929 - node = self.root  
1930 - for name in filename:  
1931 - for kid in node.kids:  
1932 - if kid.name.lower() == name.lower():  
1933 - break  
1934 - else:  
1935 - raise IOError("file not found")  
1936 - node = kid  
1937 - return node.sid  
1938 -  
1939 -  
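The case-insensitive path resolution performed by `_find` (accepting either a `'a/b/c'` string or a list of names) can be sketched over nested dicts; `find_entry` is an illustrative stand-in, not the real API:

```python
def find_entry(root, path):
    # Walk a storage tree given as nested dicts, matching each path
    # component case-insensitively, as OleFileIO._find does.
    if isinstance(path, str):
        path = path.split('/')
    node = root
    for name in path:
        for kid_name, kid in node.items():
            if kid_name.lower() == name.lower():
                node = kid
                break
        else:
            raise IOError('file not found')
    return node
```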
1940 - def openstream(self, filename):  
1941 - """  
1942 - Open a stream as a read-only file object (BytesIO).  
1943 - Note: filename is case-insensitive.  
1944 -  
1945 - :param filename: path of stream in storage tree (except root entry), either:  
1946 -  
1947 - - a string using Unix path syntax, for example:  
1948 - 'storage_1/storage_1.2/stream'  
1949 - - or a list of storage filenames, path to the desired stream/storage.  
1950 - Example: ['storage_1', 'storage_1.2', 'stream']  
1951 -  
1952 - :returns: file object (read-only)  
1953 - :exception IOError: if filename not found, or if this is not a stream.  
1954 - """  
1955 - sid = self._find(filename)  
1956 - entry = self.direntries[sid]  
1957 - if entry.entry_type != STGTY_STREAM:  
1958 - raise IOError("this file is not a stream")  
1959 - return self._open(entry.isectStart, entry.size)  
1960 -  
1961 -  
1962 - def write_stream(self, stream_name, data):  
1963 - """  
1964 - Write a stream to disk. For now, it is only possible to replace an  
1965 - existing stream with data of the same size.  
1966 -  
1967 - :param stream_name: path of stream in storage tree (except root entry), either:  
1968 -  
1969 - - a string using Unix path syntax, for example:  
1970 - 'storage_1/storage_1.2/stream'  
1971 - - or a list of storage filenames, path to the desired stream/storage.  
1972 - Example: ['storage_1', 'storage_1.2', 'stream']  
1973 -  
1974 - :param data: bytes, data to be written, must be the same size as the original  
1975 - stream.  
1976 - """  
1977 - if not isinstance(data, bytes):  
1978 - raise TypeError("write_stream: data must be a bytes string")  
1979 - sid = self._find(stream_name)  
1980 - entry = self.direntries[sid]  
1981 - if entry.entry_type != STGTY_STREAM:  
1982 - raise IOError("this is not a stream")  
1983 - size = entry.size  
1984 - if size != len(data):  
1985 - raise ValueError("write_stream: data must be the same size as the existing stream")  
1986 - if size < self.minisectorcutoff:  
1987 - raise NotImplementedError("Writing a stream in MiniFAT is not implemented yet")  
1988 - sect = entry.isectStart  
1989 - # number of sectors to write  
1990 - nb_sectors = (size + (self.sectorsize-1)) // self.sectorsize  
1991 - log.debug('nb_sectors = %d' % nb_sectors)  
1992 - for i in range(nb_sectors):  
1993 -## try:  
1994 -## self.fp.seek(offset + self.sectorsize * sect)  
1995 -## except:  
1996 -## log.debug('sect=%d, seek=%d' %  
1997 -## (sect, offset+self.sectorsize*sect))  
1998 -## raise IOError('OLE sector index out of range')  
1999 - # extract one sector from data, the last one possibly being smaller:  
2000 - if i<(nb_sectors-1):  
2001 - data_sector = data [i*self.sectorsize : (i+1)*self.sectorsize]  
2002 - #TODO: comment this if it works  
2003 - assert(len(data_sector)==self.sectorsize)  
2004 - else:  
2005 - data_sector = data [i*self.sectorsize:]  
2006 - #TODO: comment this if it works  
2007 - log.debug('write_stream: size=%d sectorsize=%d data_sector=%Xh size%%sectorsize=%d'  
2008 - % (size, self.sectorsize, len(data_sector), size % self.sectorsize))  
2009 - assert(len(data_sector) % self.sectorsize==size % self.sectorsize)  
2010 - self.write_sect(sect, data_sector)  
2011 -## self.fp.write(data_sector)  
2012 - # jump to next sector in the FAT:  
2013 - try:  
2014 - sect = self.fat[sect]  
2015 - except IndexError:  
2016 - # [PL] if pointer is out of the FAT an exception is raised  
2017 - raise IOError('incorrect OLE FAT, sector index out of range')  
2018 - #[PL] Last sector should be a "end of chain" marker:  
2019 - if sect != ENDOFCHAIN:  
2020 - raise IOError('incorrect last sector index in OLE stream')  
2021 -  
2022 -  
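The sector-chain traversal that `write_stream` performs (follow FAT entries until the end-of-chain marker, raising on bad indices) can be exercised on its own; this is a minimal sketch with a plain list standing in for the FAT, with an extra loop guard:

```python
ENDOFCHAIN = 0xFFFFFFFE  # end-of-chain marker defined by [MS-CFB]

def walk_chain(fat, start):
    # Follow a chain of sector indices through the FAT until ENDOFCHAIN,
    # rejecting out-of-range indices and loops in malformed files.
    chain, sect, seen = [], start, set()
    while sect != ENDOFCHAIN:
        if sect >= len(fat) or sect in seen:
            raise IOError('incorrect OLE FAT, sector index out of range')
        seen.add(sect)
        chain.append(sect)
        sect = fat[sect]
    return chain
```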
2023 - def get_type(self, filename):  
2024 - """  
2025 - Test if given filename exists as a stream or a storage in the OLE  
2026 - container, and return its type.  
2027 -  
2028 - :param filename: path of stream in storage tree. (see openstream for syntax)  
2029 - :returns: False if object does not exist, its entry type (>0) otherwise:  
2030 -  
2031 - - STGTY_STREAM: a stream  
2032 - - STGTY_STORAGE: a storage  
2033 - - STGTY_ROOT: the root entry  
2034 - """  
2035 - try:  
2036 - sid = self._find(filename)  
2037 - entry = self.direntries[sid]  
2038 - return entry.entry_type  
2039 - except:  
2040 - return False  
2041 -  
2042 -  
2043 - def getmtime(self, filename):  
2044 - """  
2045 - Return modification time of a stream/storage.  
2046 -  
2047 - :param filename: path of stream/storage in storage tree. (see openstream for  
2048 - syntax)  
2049 - :returns: None if modification time is null, a python datetime object  
2050 - otherwise (UTC timezone)  
2051 -  
2052 - new in version 0.26  
2053 - """  
2054 - sid = self._find(filename)  
2055 - entry = self.direntries[sid]  
2056 - return entry.getmtime()  
2057 -  
2058 -  
2059 - def getctime(self, filename):  
2060 - """  
2061 - Return creation time of a stream/storage.  
2062 -  
2063 - :param filename: path of stream/storage in storage tree. (see openstream for  
2064 - syntax)  
2065 - :returns: None if creation time is null, a python datetime object  
2066 - otherwise (UTC timezone)  
2067 -  
2068 - new in version 0.26  
2069 - """  
2070 - sid = self._find(filename)  
2071 - entry = self.direntries[sid]  
2072 - return entry.getctime()  
2073 -  
2074 -  
2075 - def exists(self, filename):  
2076 - """  
2077 - Test if given filename exists as a stream or a storage in the OLE  
2078 - container.  
2079 - Note: filename is case-insensitive.  
2080 -  
2081 - :param filename: path of stream in storage tree. (see openstream for syntax)  
2082 - :returns: True if object exists, else False.  
2083 - """  
2084 - try:  
2085 - sid = self._find(filename)  
2086 - return True  
2087 - except:  
2088 - return False  
2089 -  
2090 -  
2091 - def get_size(self, filename):  
2092 - """  
2093 - Return size of a stream in the OLE container, in bytes.  
2094 -  
2095 - :param filename: path of stream in storage tree (see openstream for syntax)  
2096 - :returns: size in bytes (long integer)  
2097 - :exception IOError: if file not found  
2098 - :exception TypeError: if this is not a stream.  
2099 - """  
2100 - sid = self._find(filename)  
2101 - entry = self.direntries[sid]  
2102 - if entry.entry_type != STGTY_STREAM:  
2103 - #TODO: Should it return zero instead of raising an exception ?  
2104 - raise TypeError('object is not an OLE stream')  
2105 - return entry.size  
2106 -  
2107 -  
2108 - def get_rootentry_name(self):  
2109 - """  
2110 - Return root entry name. Should usually be 'Root Entry' or 'R' in most  
2111 - implementations.  
2112 - """  
2113 - return self.root.name  
2114 -  
2115 -  
2116 - def getproperties(self, filename, convert_time=False, no_conversion=None):  
2117 - """  
2118 - Return properties described in substream.  
2119 -  
2120 - :param filename: path of stream in storage tree (see openstream for syntax)  
2121 - :param convert_time: bool, if True timestamps will be converted to Python datetime  
2122 - :param no_conversion: None or list of int, timestamps not to be converted  
2123 - (for example total editing time is not a real timestamp)  
2124 -  
2125 - :returns: a dictionary of values indexed by id (integer)  
2126 - """  
2127 - #REFERENCE: [MS-OLEPS] https://msdn.microsoft.com/en-us/library/dd942421.aspx  
2128 - # make sure no_conversion is a list, just to simplify code below:  
2129 - if no_conversion is None:  
2130 - no_conversion = []  
2131 - # stream path as a string to report exceptions:  
2132 - streampath = filename  
2133 - if not isinstance(streampath, str):  
2134 - streampath = '/'.join(streampath)  
2135 -  
2136 - fp = self.openstream(filename)  
2137 -  
2138 - data = {}  
2139 -  
2140 - try:  
2141 - # header  
2142 - s = fp.read(28)  
2143 - clsid = _clsid(s[8:24])  
2144 -  
2145 - # format id  
2146 - s = fp.read(20)  
2147 - fmtid = _clsid(s[:16])  
2148 - fp.seek(i32(s, 16))  
2149 -  
2150 - # get section  
2151 - s = b"****" + fp.read(i32(fp.read(4))-4)  
2152 - # number of properties:  
2153 - num_props = i32(s, 4)  
2154 - except BaseException as exc:  
2155 - # catch exception while parsing property header, and only raise  
2156 - # a DEFECT_INCORRECT then return an empty dict, because this is not  
2157 - # a fatal error when parsing the whole file  
2158 - msg = 'Error while parsing properties header in stream %s: %s' % (  
2159 - repr(streampath), exc)  
2160 - self._raise_defect(DEFECT_INCORRECT, msg, type(exc))  
2161 - return data  
2162 -  
2163 - for i in range(num_props):  
2164 - property_id = 0 # just in case of an exception  
2165 - try:  
2166 - property_id = i32(s, 8+i*8)  
2167 - offset = i32(s, 12+i*8)  
2168 - property_type = i32(s, offset)  
2169 -  
2170 - log.debug('property id=%d: type=%d offset=%X' % (property_id, property_type, offset))  
2171 -  
2172 - # test for common types first (should perhaps use  
2173 - # a dictionary instead?)  
2174 -  
2175 - if property_type == VT_I2: # 16-bit signed integer  
2176 - value = i16(s, offset+4)  
2177 - if value >= 32768:  
2178 - value = value - 65536  
2179 - elif property_type == VT_UI2: # 2-byte unsigned integer  
2180 - value = i16(s, offset+4)  
2181 - elif property_type in (VT_I4, VT_INT, VT_ERROR):  
2182 - # VT_I4: 32-bit signed integer  
2183 - # VT_ERROR: HRESULT, similar to 32-bit signed integer,  
2184 - # see http://msdn.microsoft.com/en-us/library/cc230330.aspx  
2185 - value = i32(s, offset+4)  
2186 - elif property_type in (VT_UI4, VT_UINT): # 4-byte unsigned integer  
2187 - value = i32(s, offset+4) # FIXME  
2188 - elif property_type in (VT_BSTR, VT_LPSTR):  
2189 - # CodePageString, see http://msdn.microsoft.com/en-us/library/dd942354.aspx  
2190 - # size is a 32-bit integer, including the null terminator, and  
2191 - # possibly trailing or embedded null chars  
2192 - #TODO: if codepage is unicode, the string should be converted as such  
2193 - count = i32(s, offset+4)  
2194 - value = s[offset+8:offset+8+count-1]  
2195 - # remove all null chars:  
2196 - value = value.replace(b'\x00', b'')  
2197 - elif property_type == VT_BLOB:  
2198 - # binary large object (BLOB)  
2199 - # see http://msdn.microsoft.com/en-us/library/dd942282.aspx  
2200 - count = i32(s, offset+4)  
2201 - value = s[offset+8:offset+8+count]  
2202 - elif property_type == VT_LPWSTR:  
2203 - # UnicodeString  
2204 - # see http://msdn.microsoft.com/en-us/library/dd942313.aspx  
2205 - # "the string should NOT contain embedded or additional trailing  
2206 - # null characters."  
2207 - count = i32(s, offset+4)  
2208 - value = self._decode_utf16_str(s[offset+8:offset+8+count*2])  
2209 - elif property_type == VT_FILETIME:  
2210 - value = long(i32(s, offset+4)) + (long(i32(s, offset+8))<<32)  
2211 - # FILETIME is a 64-bit int: "number of 100ns periods  
2212 - # since Jan 1,1601".  
2213 - if convert_time and property_id not in no_conversion:  
2214 - log.debug('Converting property #%d to python datetime, value=%d=%fs'  
2215 - %(property_id, value, float(value)/10000000))  
2216 - # convert FILETIME to Python datetime.datetime  
2217 - # inspired from http://code.activestate.com/recipes/511425-filetime-to-datetime/  
2218 - _FILETIME_null_date = datetime.datetime(1601, 1, 1, 0, 0, 0)  
2219 - log.debug('timedelta days=%d' % (value//(10*1000000*3600*24)))  
2220 - value = _FILETIME_null_date + datetime.timedelta(microseconds=value//10)  
2221 - else:  
2222 - # legacy code kept for backward compatibility: returns a  
2223 - # number of seconds since Jan 1,1601  
2224 - value = value // 10000000 # seconds  
2225 - elif property_type == VT_UI1: # 1-byte unsigned integer  
2226 - value = i8(s[offset+4])  
2227 - elif property_type == VT_CLSID:  
2228 - value = _clsid(s[offset+4:offset+20])  
2229 - elif property_type == VT_CF:  
2230 - # PropertyIdentifier or ClipboardData??  
2231 - # see http://msdn.microsoft.com/en-us/library/dd941945.aspx  
2232 - count = i32(s, offset+4)  
2233 - value = s[offset+8:offset+8+count]  
2234 - elif property_type == VT_BOOL:  
2235 - # VARIANT_BOOL, 16-bit bool, 0x0000=False, 0xFFFF=True  
2236 - # see http://msdn.microsoft.com/en-us/library/cc237864.aspx  
2237 - value = bool(i16(s, offset+4))  
2238 - else:  
2239 - value = None # everything else yields "None"  
2240 - log.debug('property id=%d: type=%d not implemented in parser yet' % (property_id, property_type))  
2241 -  
2242 - # missing: VT_EMPTY, VT_NULL, VT_R4, VT_R8, VT_CY, VT_DATE,  
2243 - # VT_DECIMAL, VT_I1, VT_I8, VT_UI8,  
2244 - # see http://msdn.microsoft.com/en-us/library/dd942033.aspx  
2245 -  
2246 - # FIXME: add support for VT_VECTOR  
2247 - # VT_VECTOR is a 32-bit uint giving the number of items, followed by  
2248 - # the items in sequence. The VT_VECTOR value is combined with the  
2249 - # type of items, e.g. VT_VECTOR|VT_BSTR  
2250 - # see http://msdn.microsoft.com/en-us/library/dd942011.aspx  
2251 -  
2252 - #print("%08x" % property_id, repr(value), end=" ")  
2253 - #print("(%s)" % VT[i32(s, offset) & 0xFFF])  
2254 -  
2255 - data[property_id] = value  
2256 - except BaseException as exc:  
2257 - # catch exception while parsing each property, and only raise  
2258 - # a DEFECT_INCORRECT, because parsing can go on  
2259 - msg = 'Error while parsing property id %d in stream %s: %s' % (  
2260 - property_id, repr(streampath), exc)  
2261 - self._raise_defect(DEFECT_INCORRECT, msg, type(exc))  
2262 -  
2263 - return data  
2264 -  
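The VT_FILETIME conversion used above can be tried in isolation (a minimal sketch of the same arithmetic, not olefile's API):

```python
import datetime

def filetime_to_datetime(ft):
    # A FILETIME is a 64-bit count of 100-nanosecond intervals since
    # 1601-01-01 00:00:00 UTC; integer-divide by 10 to get microseconds.
    base = datetime.datetime(1601, 1, 1)
    return base + datetime.timedelta(microseconds=ft // 10)

# the Unix epoch is a well-known FILETIME value:
assert filetime_to_datetime(116444736000000000) == datetime.datetime(1970, 1, 1)
```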
2265 - def get_metadata(self):  
2266 - """  
2267 - Parse standard properties streams, return an OleMetadata object  
2268 - containing all the available metadata.  
2269 - (also stored in the metadata attribute of the OleFileIO object)  
2270 -  
2271 - new in version 0.25  
2272 - """  
2273 - self.metadata = OleMetadata()  
2274 - self.metadata.parse_properties(self)  
2275 - return self.metadata  
2276 -  
2277 -#  
2278 -# --------------------------------------------------------------------  
2279 -# This script can be used to dump the directory of any OLE2 structured  
2280 -# storage file.  
2281 -  
2282 -if __name__ == "__main__":  
2283 -  
2284 - import sys, optparse  
2285 -  
2286 - DEFAULT_LOG_LEVEL = "warning" # Default log level  
2287 - LOG_LEVELS = {  
2288 - 'debug': logging.DEBUG,  
2289 - 'info': logging.INFO,  
2290 - 'warning': logging.WARNING,  
2291 - 'error': logging.ERROR,  
2292 - 'critical': logging.CRITICAL  
2293 - }  
2294 -  
2295 - usage = 'usage: %prog [options] <filename> [filename2 ...]'  
2296 - parser = optparse.OptionParser(usage=usage)  
2297 - parser.add_option("-c", action="store_true", dest="check_streams",  
2298 - help='check all streams (for debugging purposes)')  
2299 - parser.add_option("-d", action="store_true", dest="debug_mode",  
2300 - help='debug mode, shortcut for -l debug (displays a lot of debug information, for developers only)')  
2301 - parser.add_option('-l', '--loglevel', dest="loglevel", action="store", default=DEFAULT_LOG_LEVEL,  
2302 - help="logging level debug/info/warning/error/critical (default=%default)")  
2303 -  
2304 - (options, args) = parser.parse_args()  
2305 -  
2306 - print('olefile version %s %s - http://www.decalage.info/en/olefile\n' % (__version__, __date__))  
2307 -  
2308 - # Print help if no arguments are passed  
2309 - if len(args) == 0:  
2310 - print(__doc__)  
2311 - parser.print_help()  
2312 - sys.exit()  
2313 -  
2314 - if options.debug_mode:  
2315 - options.loglevel = 'debug'  
2316 -  
2317 - # setup logging to the console  
2318 - logging.basicConfig(level=LOG_LEVELS[options.loglevel], format='%(levelname)-8s %(message)s')  
2319 -  
2320 - # also set the same log level for the module's logger to enable it:  
2321 - log.setLevel(LOG_LEVELS[options.loglevel])  
2322 -  
2323 - for filename in args:  
2324 - try:  
2325 - ole = OleFileIO(filename)#, raise_defects=DEFECT_INCORRECT)  
2326 - print("-" * 68)  
2327 - print(filename)  
2328 - print("-" * 68)  
2329 - ole.dumpdirectory()  
2330 - for streamname in ole.listdir():  
2331 - if streamname[-1][0] == "\005":  
2332 - print("%r: properties" % streamname)  
2333 - try:  
2334 - props = ole.getproperties(streamname, convert_time=True)  
2335 - props = sorted(props.items())  
2336 - for k, v in props:  
2337 - #[PL]: avoid to display too large or binary values:  
2338 - if isinstance(v, (basestring, bytes)):  
2339 - if len(v) > 50:  
2340 - v = v[:50]  
2341 - if isinstance(v, bytes):  
2342 - # quick and dirty binary check:  
2343 - for c in (1,2,3,4,5,6,7,11,12,14,15,16,17,18,19,20,  
2344 - 21,22,23,24,25,26,27,28,29,30,31):  
2345 - if c in bytearray(v):  
2346 - v = '(binary data)'  
2347 - break  
2348 - print(" ", k, v)  
2349 - except:  
2350 - log.exception('Error while parsing property stream %r' % streamname)  
2351 -  
2352 - if options.check_streams:  
2353 - # Read all streams to check if there are errors:  
2354 - print('\nChecking streams...')  
2355 - for streamname in ole.listdir():  
2356 - # print name using repr() to convert binary chars to \xNN:  
2357 - print('-', repr('/'.join(streamname)),'-', end=' ')  
2358 - st_type = ole.get_type(streamname)  
2359 - if st_type == STGTY_STREAM:  
2360 - print('size %d' % ole.get_size(streamname))  
2361 - # just try to read stream in memory:  
2362 - ole.openstream(streamname)  
2363 - else:  
2364 - print('NOT a stream : type=%d' % st_type)  
2365 - print()  
2366 -  
2367 -## for streamname in ole.listdir():  
2368 -## # print name using repr() to convert binary chars to \xNN:  
2369 -## print('-', repr('/'.join(streamname)),'-', end=' ')  
2370 -## print(ole.getmtime(streamname))  
2371 -## print()  
2372 -  
2373 - print('Modification/Creation times of all directory entries:')  
2374 - for entry in ole.direntries:  
2375 - if entry is not None:  
2376 - print('- %s: mtime=%s ctime=%s' % (entry.name,  
2377 - entry.getmtime(), entry.getctime()))  
2378 - print()  
2379 -  
2380 - # parse and display metadata:  
2381 - try:  
2382 - meta = ole.get_metadata()  
2383 - meta.dump()  
2384 - except:  
2385 - log.exception('Error while parsing metadata')  
2386 - print()  
2387 - #[PL] Test a few new methods:  
2388 - root = ole.get_rootentry_name()  
2389 - print('Root entry name: "%s"' % root)  
2390 - if ole.exists('worddocument'):  
2391 - print("This is a Word document.")  
2392 - print("type of stream 'WordDocument':", ole.get_type('worddocument'))  
2393 - print("size :", ole.get_size('worddocument'))  
2394 - if ole.exists('macros/vba'):  
2395 - print("This document may contain VBA macros.")  
2396 -  
2397 - # print parsing issues:  
2398 - print('\nNon-fatal issues raised during parsing:')  
2399 - if ole.parsing_issues:  
2400 - for exctype, msg in ole.parsing_issues:  
2401 - print('- %s: %s' % (exctype.__name__, msg))  
2402 - else:  
2403 - print('None')  
2404 - except:  
2405 - log.exception('Error while parsing file %r' % filename)  
2406 -  
2407 -# this code was developed while listening to The Wedding Present "Sea Monsters" 1 +#!/usr/bin/env python
  2 +
  3 +# olefile (formerly OleFileIO_PL)
  4 +#
  5 +# Module to read/write Microsoft OLE2 files (also called Structured Storage or
  6 +# Microsoft Compound Document File Format), such as Microsoft Office 97-2003
  7 +# documents, Image Composer and FlashPix files, Outlook messages, ...
  8 +# This version is compatible with Python 2.6+ and 3.x
  9 +#
  10 +# Project website: http://www.decalage.info/olefile
  11 +#
  12 +# olefile is copyright (c) 2005-2016 Philippe Lagadec (http://www.decalage.info)
  13 +#
  14 +# olefile is based on the OleFileIO module from the PIL library v1.1.6
  15 +# See: http://www.pythonware.com/products/pil/index.htm
  16 +#
  17 +# The Python Imaging Library (PIL) is
  18 +# Copyright (c) 1997-2005 by Secret Labs AB
  19 +# Copyright (c) 1995-2005 by Fredrik Lundh
  20 +#
  21 +# See source code and LICENSE.txt for information on usage and redistribution.
  22 +
  23 +
24 +# Since OleFileIO_PL v0.30, only Python 2.6+ and 3.x are supported
  25 +# This import enables print() as a function rather than a keyword
  26 +# (main requirement to be compatible with Python 3.x)
  27 +# The comment on the line below should be printed on Python 2.5 or older:
  28 +from __future__ import print_function # This version of olefile requires Python 2.6+ or 3.x.
  29 +
  30 +
  31 +__author__ = "Philippe Lagadec"
  32 +__date__ = "2016-04-26"
  33 +__version__ = '0.44'
  34 +
  35 +#--- LICENSE ------------------------------------------------------------------
  36 +
  37 +# olefile (formerly OleFileIO_PL) is copyright (c) 2005-2016 Philippe Lagadec
  38 +# (http://www.decalage.info)
  39 +#
  40 +# All rights reserved.
  41 +#
  42 +# Redistribution and use in source and binary forms, with or without modification,
  43 +# are permitted provided that the following conditions are met:
  44 +#
  45 +# * Redistributions of source code must retain the above copyright notice, this
  46 +# list of conditions and the following disclaimer.
  47 +# * Redistributions in binary form must reproduce the above copyright notice,
  48 +# this list of conditions and the following disclaimer in the documentation
  49 +# and/or other materials provided with the distribution.
  50 +#
  51 +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
  52 +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
  53 +# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  54 +# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  55 +# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  56 +# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  57 +# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  58 +# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  59 +# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  60 +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  61 +
  62 +# ----------
  63 +# PIL License:
  64 +#
  65 +# olefile is based on source code from the OleFileIO module of the Python
  66 +# Imaging Library (PIL) published by Fredrik Lundh under the following license:
  67 +
  68 +# The Python Imaging Library (PIL) is
  69 +# Copyright (c) 1997-2005 by Secret Labs AB
  70 +# Copyright (c) 1995-2005 by Fredrik Lundh
  71 +#
  72 +# By obtaining, using, and/or copying this software and/or its associated
  73 +# documentation, you agree that you have read, understood, and will comply with
  74 +# the following terms and conditions:
  75 +#
  76 +# Permission to use, copy, modify, and distribute this software and its
  77 +# associated documentation for any purpose and without fee is hereby granted,
  78 +# provided that the above copyright notice appears in all copies, and that both
  79 +# that copyright notice and this permission notice appear in supporting
  80 +# documentation, and that the name of Secret Labs AB or the author(s) not be used
  81 +# in advertising or publicity pertaining to distribution of the software
  82 +# without specific, written prior permission.
  83 +#
  84 +# SECRET LABS AB AND THE AUTHORS DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS
  85 +# SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS.
  86 +# IN NO EVENT SHALL SECRET LABS AB OR THE AUTHORS BE LIABLE FOR ANY SPECIAL,
  87 +# INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
  88 +# LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
  89 +# OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
  90 +# PERFORMANCE OF THIS SOFTWARE.
  91 +
  92 +#-----------------------------------------------------------------------------
  93 +# CHANGELOG: (only olefile/OleFileIO_PL changes compared to PIL 1.1.6)
  94 +# 2005-05-11 v0.10 PL: - a few fixes for Python 2.4 compatibility
  95 +# (all changes flagged with [PL])
  96 +# 2006-02-22 v0.11 PL: - a few fixes for some Office 2003 documents which raise
  97 +# exceptions in OleStream.__init__()
  98 +# 2006-06-09 v0.12 PL: - fixes for files above 6.8MB (DIFAT in loadfat)
  99 +# - added some constants
  100 +# - added header values checks
  101 +# - added some docstrings
  102 +# - getsect: bugfix in case sectors >512 bytes
  103 +# - getsect: added conformity checks
  104 +# - DEBUG_MODE constant to activate debug display
  105 +# 2007-09-04 v0.13 PL: - improved/translated (lots of) comments
  106 +# - updated license
  107 +# - converted tabs to 4 spaces
  108 +# 2007-11-19 v0.14 PL: - added OleFileIO._raise_defect() to adapt sensitivity
  109 +# - improved _unicode() to use Python 2.x unicode support
  110 +# - fixed bug in OleDirectoryEntry
  111 +# 2007-11-25 v0.15 PL: - added safety checks to detect FAT loops
  112 +# - fixed OleStream which didn't check stream size
  113 +# - added/improved many docstrings and comments
  114 +# - moved helper functions _unicode and _clsid out of
  115 +# OleFileIO class
  116 +# - improved OleFileIO._find() to add Unix path syntax
  117 +# - OleFileIO._find() is now case-insensitive
  118 +# - added get_type() and get_rootentry_name()
  119 +# - rewritten loaddirectory and OleDirectoryEntry
  120 +# 2007-11-27 v0.16 PL: - added OleDirectoryEntry.kids_dict
  121 +# - added detection of duplicate filenames in storages
  122 +# - added detection of duplicate references to streams
  123 +# - added get_size() and exists() to OleDirectoryEntry
  124 +# - added isOleFile to check header before parsing
  125 +# - added __all__ list to control public keywords in pydoc
  126 +# 2007-12-04 v0.17 PL: - added _load_direntry to fix a bug in loaddirectory
  127 +# - improved _unicode(), added workarounds for Python <2.3
  128 +# - added set_debug_mode and -d option to set debug mode
  129 +# - fixed bugs in OleFileIO.open and OleDirectoryEntry
  130 +# - added safety check in main for large or binary
  131 +# properties
  132 +# - allow size>0 for storages for some implementations
  133 +# 2007-12-05 v0.18 PL: - fixed several bugs in handling of FAT, MiniFAT and
  134 +# streams
  135 +# - added option '-c' in main to check all streams
  136 +# 2009-12-10 v0.19 PL: - bugfix for 32 bit arrays on 64 bits platforms
  137 +# (thanks to Ben G. and Martijn for reporting the bug)
  138 +# 2009-12-11 v0.20 PL: - bugfix in OleFileIO.open when filename is not plain str
  139 +# 2010-01-22 v0.21 PL: - added support for big-endian CPUs such as PowerPC Macs
  140 +# 2012-02-16 v0.22 PL: - fixed bug in getproperties, patch by chuckleberryfinn
  141 +# (https://bitbucket.org/decalage/olefileio_pl/issue/7)
  142 +# - added close method to OleFileIO (fixed issue #2)
  143 +# 2012-07-25 v0.23 PL: - added support for file-like objects (patch by mete0r_kr)
  144 +# 2013-05-05 v0.24 PL: - getproperties: added conversion from filetime to python
  145 +# datetime
  146 +# - main: displays properties with date format
  147 +# - new class OleMetadata to parse standard properties
  148 +# - added get_metadata method
  149 +# 2013-05-07 v0.24 PL: - a few improvements in OleMetadata
  150 +# 2013-05-24 v0.25 PL: - getproperties: option to not convert some timestamps
  151 +# - OleMetaData: total_edit_time is now a number of seconds,
  152 +# not a timestamp
  153 +# - getproperties: added support for VT_BOOL, VT_INT, V_UINT
  154 +# - getproperties: filter out null chars from strings
  155 +# - getproperties: raise non-fatal defects instead of
  156 +# exceptions when properties cannot be parsed properly
  157 +# 2013-05-27 PL: - getproperties: improved exception handling
  158 +# - _raise_defect: added option to set exception type
  159 +# - all non-fatal issues are now recorded, and displayed
  160 +# when run as a script
  161 +# 2013-07-11 v0.26 PL: - added methods to get modification and creation times
  162 +# of a directory entry or a storage/stream
  163 +# - fixed parsing of direntry timestamps
  164 +# 2013-07-24 PL: - new options in listdir to list storages and/or streams
  165 +# 2014-02-04 v0.30 PL: - upgraded code to support Python 3.x by Martin Panter
  166 +# - several fixes for Python 2.6 (xrange, MAGIC)
  167 +# - reused i32 from Pillow's _binary
  168 +# 2014-07-18 v0.31 - preliminary support for 4K sectors
  169 +# 2014-07-27 v0.31 PL: - a few improvements in OleFileIO.open (header parsing)
  170 +# - Fixed loadfat for large files with 4K sectors (issue #3)
  171 +# 2014-07-30 v0.32 PL: - added write_sect to write sectors to disk
  172 +# - added write_mode option to OleFileIO.__init__ and open
  173 +# 2014-07-31 PL: - fixed padding in write_sect for Python 3, added checks
  174 +# - added write_stream to write a stream to disk
  175 +# 2014-09-26 v0.40 PL: - renamed OleFileIO_PL to olefile
  176 +# 2014-11-09 NE: - added support for Jython (Niko Ehrenfeuchter)
  177 +# 2014-11-13 v0.41 PL: - improved isOleFile and OleFileIO.open to support OLE
  178 +# data in a string buffer and file-like objects.
  179 +# 2014-11-21 PL: - updated comments according to Pillow's commits
  180 +# 2015-01-24 v0.42 PL: - changed the default path name encoding from Latin-1
  181 +# to UTF-8 on Python 2.x (Unicode on Python 3.x)
  182 +# - added path_encoding option to override the default
  183 +# - fixed a bug in _list when a storage is empty
  184 +# 2015-04-17 v0.43 PL: - slight changes in OleDirectoryEntry
  185 +# 2015-10-19 - fixed issue #26 in OleFileIO.getproperties
  186 +# (using id and type as local variable names)
  187 +# 2015-10-29 - replaced debug() with proper logging
  188 +# - use optparse to handle command line options
  189 +# - improved attribute names in OleFileIO class
  190 +# 2015-11-05 - fixed issue #27 by correcting the MiniFAT sector
  191 +# cutoff size if invalid.
  192 +# 2016-02-02 - logging is disabled by default
  193 +# 2016-04-26 v0.44 PL: - added enable_logging
  194 +# - renamed _OleDirectoryEntry and _OleStream without '_'
  195 +# - in OleStream use _raise_defect instead of exceptions
  196 +# 2016-04-27 - added support for incomplete streams and incorrect
  197 +# directory entries (to read malformed documents)
  198 +# 2016-05-04 - fixed slight bug in OleStream
  199 +
  200 +#-----------------------------------------------------------------------------
  201 +# TODO (for version 1.0):
  202 +# + get rid of print statements, to simplify Python 2.x and 3.x support
  203 +# + add is_stream and is_storage
  204 +# + remove leading and trailing slashes where a path is used
  205 +# + add functions path_list2str and path_str2list
  206 +# + fix how all the methods handle unicode str and/or bytes as arguments
  207 +# + add path attrib to _OleDirEntry, set it once and for all in init or
  208 +# append_kids (then listdir/_list can be simplified)
  209 +# - TESTS with Linux, MacOSX, Python 1.5.2, various files, PIL, ...
  210 +# - add underscore to each private method, to avoid their display in
  211 +# pydoc/epydoc documentation - Remove it for classes to be documented
  212 +# - replace all raised exceptions with _raise_defect (at least in OleFileIO)
  213 +# - merge code from OleStream and OleFileIO.getsect to read sectors
  214 +# (maybe add a class for FAT and MiniFAT ?)
  215 +# - add method to check all streams (follow sectors chains without storing all
  216 +# stream in memory, and report anomalies)
  217 +# - use OleDirectoryEntry.kids_dict to improve _find and _list ?
  218 +# - fix Unicode names handling (find some way to stay compatible with Py1.5.2)
  219 +# => if possible avoid converting names to Latin-1
  220 +# - review DIFAT code: fix handling of DIFSECT blocks in FAT (not stop)
  221 +# - rewrite OleFileIO.getproperties
  222 +# - improve docstrings to show more sample uses
  223 +# - see also original notes and FIXME below
  224 +# - remove all obsolete FIXMEs
  225 +# - OleMetadata: fix version attrib according to
  226 +# http://msdn.microsoft.com/en-us/library/dd945671%28v=office.12%29.aspx
  227 +
  228 +# IDEAS:
  229 +# - in OleFileIO._open and OleStream, use size=None instead of 0x7FFFFFFF for
  230 +# streams with unknown size
  231 +# - use arrays of int instead of long integers for FAT/MiniFAT, to improve
  232 +# performance and reduce memory usage ? (possible issue with values >2^31)
  233 +# - provide tests with unittest (may need write support to create samples)
  234 +# - move all debug code (and maybe dump methods) to a separate module, with
  235 +# a class which inherits OleFileIO ?
  236 +# - fix docstrings to follow epydoc format
  237 +# - add support for big endian byte order ?
  238 +# - create a simple OLE explorer with wxPython
  239 +
  240 +# FUTURE EVOLUTIONS to add write support:
  241 +# see issue #6 on Bitbucket:
  242 +# https://bitbucket.org/decalage/olefileio_pl/issue/6/improve-olefileio_pl-to-write-ole-files
  243 +
  244 +#-----------------------------------------------------------------------------
  245 +# NOTES from PIL 1.1.6:
  246 +
  247 +# History:
  248 +# 1997-01-20 fl Created
  249 +# 1997-01-22 fl Fixed 64-bit portability quirk
  250 +# 2003-09-09 fl Fixed typo in OleFileIO.loadfat (noted by Daniel Haertle)
  251 +# 2004-02-29 fl Changed long hex constants to signed integers
  252 +#
  253 +# Notes:
  254 +# FIXME: sort out sign problem (eliminate long hex constants)
  255 +# FIXME: change filename to use "a/b/c" instead of ["a", "b", "c"]
  256 +# FIXME: provide a glob mechanism function (using fnmatchcase)
  257 +#
  258 +# Literature:
  259 +#
  260 +# "FlashPix Format Specification, Appendix A", Kodak and Microsoft,
  261 +# September 1996.
  262 +#
  263 +# Quotes:
  264 +#
  265 +# "If this document and functionality of the Software conflict,
  266 +# the actual functionality of the Software represents the correct
  267 +# functionality" -- Microsoft, in the OLE format specification
  268 +
  269 +#------------------------------------------------------------------------------
  270 +
  271 +
  272 +import io
  273 +import sys
  274 +import struct, array, os.path, datetime, logging
  275 +
  276 +#=== COMPATIBILITY WORKAROUNDS ================================================
  277 +
  278 +#[PL] Define explicitly the public API to avoid private objects in pydoc:
  279 +#TODO: add more
  280 +# __all__ = ['OleFileIO', 'isOleFile', 'MAGIC']
  281 +
  282 +# For Python 3.x, need to redefine long as int:
  283 +if str is not bytes:
  284 + long = int
  285 +
  286 +# Need to make sure we use xrange both on Python 2 and 3.x:
  287 +try:
  288 + # on Python 2 we need xrange:
  289 + iterrange = xrange
  290 +except NameError:
  291 + # no xrange: on Python 3 it was renamed to range:
  292 + iterrange = range
  293 +
  294 +#[PL] workaround to fix an issue with array item size on 64 bits systems:
  295 +if array.array('L').itemsize == 4:
  296 + # on 32 bits platforms, long integers in an array are 32 bits:
  297 + UINT32 = 'L'
  298 +elif array.array('I').itemsize == 4:
  299 + # on 64 bits platforms, integers in an array are 32 bits:
  300 + UINT32 = 'I'
  301 +elif array.array('i').itemsize == 4:
  302 + # On 64 bit Jython, signed integers ('i') are the only way to store our 32
  303 + # bit values in an array in a *somewhat* reasonable way, as the otherwise
  304 +# perfectly suited 'I' (unsigned int, 32 bits) results in a completely
  305 + # unusable behaviour. This is most likely caused by the fact that Java
  306 + # doesn't have unsigned values, and thus Jython's "array" implementation,
  307 + # which is based on "jarray", doesn't have them either.
  308 + # NOTE: to trick Jython into converting the values it would normally
  309 + # interpret as "signed" into "unsigned", a binary-and operation with
  310 + # 0xFFFFFFFF can be used. This way it is possible to use the same comparing
  311 + # operations on all platforms / implementations. The corresponding code
  312 + # lines are flagged with a 'JYTHON-WORKAROUND' tag below.
  313 + UINT32 = 'i'
  314 +else:
  315 + raise ValueError('Need to fix a bug with 32 bit arrays, please contact author...')
  316 +
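The typecode probing above can be exercised on its own: the goal is simply the first array typecode whose items are exactly 4 bytes. A standalone sketch (the function name is illustrative, not part of olefile):

```python
import array

def pick_uint32_typecode():
    # Probe typecodes in the same order as the chain above: 'L' is 4 bytes
    # on 32-bit CPython, 'I' on 64-bit CPython, and signed 'i' is the
    # Jython fallback (values must then be masked with 0xFFFFFFFF when
    # compared, as the JYTHON-WORKAROUND comments describe).
    for typecode in ('L', 'I', 'i'):
        if array.array(typecode).itemsize == 4:
            return typecode
    raise ValueError('no 4-byte array typecode available')

UINT32 = pick_uint32_typecode()
```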
  317 +
  318 +#[PL] These workarounds were inspired from the Path module
  319 +# (see http://www.jorendorff.com/articles/python/path/)
  320 +#TODO: test with old Python versions
  321 +
  322 +# Pre-2.3 workaround for basestring.
  323 +try:
  324 + basestring
  325 +except NameError:
  326 + try:
  327 + # is Unicode supported (Python >2.0 or >1.6 ?)
  328 + basestring = (str, unicode)
  329 + except NameError:
  330 + basestring = str
  331 +
  332 +#[PL] Experimental setting: if True, OLE filenames will be kept in Unicode
  333 +# if False (default PIL behaviour), all filenames are converted to Latin-1.
  334 +KEEP_UNICODE_NAMES = True
  335 +
  336 +if sys.version_info[0] < 3:
  337 + # On Python 2.x, the default encoding for path names is UTF-8:
  338 + DEFAULT_PATH_ENCODING = 'utf-8'
  339 +else:
  340 + # On Python 3.x, the default encoding for path names is Unicode (None):
  341 + DEFAULT_PATH_ENCODING = None
  342 +
  343 +
  344 +# === LOGGING =================================================================
  345 +
  346 +class NullHandler(logging.Handler):
  347 + """
  348 + Log Handler without output, to avoid printing messages if logging is not
  349 + configured by the main application.
  350 + Python 2.7 has logging.NullHandler, but this is necessary for 2.6:
  351 + see https://docs.python.org/2.6/library/logging.html#configuring-logging-for-a-library
  352 + """
  353 + def emit(self, record):
  354 + pass
  355 +
  356 +def get_logger(name, level=logging.CRITICAL+1):
  357 + """
  358 + Create a suitable logger object for this module.
  359 + The goal is not to change settings of the root logger, to avoid getting
  360 + other modules' logs on the screen.
  361 + If a logger with the same name already exists, reuse it. (Otherwise it
  362 + would get duplicate handlers and messages would be doubled.)
  363 + The level is set to CRITICAL+1 by default, to avoid any logging.
  364 + """
  365 + # First, test if there is already a logger with the same name, else it
  366 + # will generate duplicate messages (due to duplicate handlers):
  367 + if name in logging.Logger.manager.loggerDict:
  368 + #NOTE: another less intrusive but more "hackish" solution would be to
  369 + # use getLogger then test if its effective level is not default.
  370 + logger = logging.getLogger(name)
  371 + # make sure level is OK:
  372 + logger.setLevel(level)
  373 + return logger
  374 + # get a new logger:
  375 + logger = logging.getLogger(name)
  376 + # only add a NullHandler for this logger, it is up to the application
  377 + # to configure its own logging:
  378 + logger.addHandler(NullHandler())
  379 + logger.setLevel(level)
  380 + return logger
  381 +
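The pattern above (a NullHandler plus a level of CRITICAL+1) is the standard way to keep a library silent until the application opts in. A self-contained demonstration, reusing the same logic under an illustrative logger name:

```python
import logging

class NullHandler(logging.Handler):
    # swallow records when the application has not configured logging
    def emit(self, record):
        pass

def get_logger(name, level=logging.CRITICAL + 1):
    # reuse an existing logger, else it would accumulate duplicate handlers
    if name in logging.Logger.manager.loggerDict:
        logger = logging.getLogger(name)
        logger.setLevel(level)
        return logger
    logger = logging.getLogger(name)
    logger.addHandler(NullHandler())
    logger.setLevel(level)
    return logger

log = get_logger('olefile_demo')
# CRITICAL+1 is above every standard level, so nothing is emitted:
silent_by_default = not log.isEnabledFor(logging.CRITICAL)
log.setLevel(logging.NOTSET)   # this is all enable_logging() does
enabled = log.isEnabledFor(logging.CRITICAL)
```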
  382 +
  383 +# a global logger object used for debugging:
  384 +log = get_logger('olefile')
  385 +
  386 +
  387 +def enable_logging():
  388 + """
  389 + Enable logging for this module (disabled by default).
  390 + This will set the module-specific logger level to NOTSET, which
  391 + means the main application controls the actual logging level.
  392 + """
  393 + log.setLevel(logging.NOTSET)
  394 +
  395 +
  396 +#=== CONSTANTS ===============================================================
  397 +
  398 +# magic bytes that should be at the beginning of every OLE file:
  399 +MAGIC = b'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'
  400 +
  401 +#[PL]: added constants for Sector IDs (from AAF specifications)
  402 +MAXREGSECT = 0xFFFFFFFA # (-6) maximum SECT
  403 +DIFSECT = 0xFFFFFFFC # (-4) denotes a DIFAT sector in a FAT
  404 +FATSECT = 0xFFFFFFFD # (-3) denotes a FAT sector in a FAT
  405 +ENDOFCHAIN = 0xFFFFFFFE # (-2) end of a virtual stream chain
  406 +FREESECT = 0xFFFFFFFF # (-1) unallocated sector
  407 +
  408 +#[PL]: added constants for Directory Entry IDs (from AAF specifications)
  409 +MAXREGSID = 0xFFFFFFFA # (-6) maximum directory entry ID
  410 +NOSTREAM = 0xFFFFFFFF # (-1) unallocated directory entry
  411 +
  412 +#[PL] object types in storage (from AAF specifications)
  413 +STGTY_EMPTY = 0 # empty directory entry (according to OpenOffice.org doc)
  414 +STGTY_STORAGE = 1 # element is a storage object
  415 +STGTY_STREAM = 2 # element is a stream object
  416 +STGTY_LOCKBYTES = 3 # element is an ILockBytes object
  417 +STGTY_PROPERTY = 4 # element is an IPropertyStorage object
  418 +STGTY_ROOT = 5 # element is a root storage
  419 +
  420 +# Unknown size for a stream (used by OleStream):
  421 +UNKNOWN_SIZE = 0x7FFFFFFF
  422 +
  423 +#
  424 +# --------------------------------------------------------------------
  425 +# property types
  426 +
  427 +VT_EMPTY=0; VT_NULL=1; VT_I2=2; VT_I4=3; VT_R4=4; VT_R8=5; VT_CY=6;
  428 +VT_DATE=7; VT_BSTR=8; VT_DISPATCH=9; VT_ERROR=10; VT_BOOL=11;
  429 +VT_VARIANT=12; VT_UNKNOWN=13; VT_DECIMAL=14; VT_I1=16; VT_UI1=17;
  430 +VT_UI2=18; VT_UI4=19; VT_I8=20; VT_UI8=21; VT_INT=22; VT_UINT=23;
  431 +VT_VOID=24; VT_HRESULT=25; VT_PTR=26; VT_SAFEARRAY=27; VT_CARRAY=28;
  432 +VT_USERDEFINED=29; VT_LPSTR=30; VT_LPWSTR=31; VT_FILETIME=64;
  433 +VT_BLOB=65; VT_STREAM=66; VT_STORAGE=67; VT_STREAMED_OBJECT=68;
  434 +VT_STORED_OBJECT=69; VT_BLOB_OBJECT=70; VT_CF=71; VT_CLSID=72;
  435 +VT_VECTOR=0x1000;
  436 +
  437 +# map property id to name (for debugging purposes)
  438 +
  439 +VT = {}
  440 +for keyword, var in list(vars().items()):
  441 + if keyword[:3] == "VT_":
  442 + VT[var] = keyword
  443 +
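The vars() scan above builds a reverse map from property type ids to their names without listing them twice. The same trick over an illustrative subset of the VT_* constants:

```python
# Subset of the VT_* property type constants defined above:
VT_EMPTY = 0; VT_NULL = 1; VT_I2 = 2; VT_I4 = 3; VT_FILETIME = 64

# Reverse-map every name with the VT_ prefix to its value:
VT = {}
for keyword, var in list(vars().items()):
    if keyword[:3] == 'VT_':
        VT[var] = keyword
```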
  444 +#
  445 +# --------------------------------------------------------------------
  446 +# Some common document types (root.clsid fields)
  447 +
  448 +WORD_CLSID = "00020900-0000-0000-C000-000000000046"
  449 +#TODO: check Excel, PPT, ...
  450 +
  451 +#[PL]: Defect levels to classify parsing errors - see OleFileIO._raise_defect()
  452 +DEFECT_UNSURE = 10 # a case which looks weird, but not sure it's a defect
  453 +DEFECT_POTENTIAL = 20 # a potential defect
  454 +DEFECT_INCORRECT = 30 # an error according to specifications, but parsing
  455 + # can go on
  456 +DEFECT_FATAL = 40 # an error which cannot be ignored, parsing is
  457 + # impossible
  458 +
  459 +# Minimal size of an empty OLE file, with 512-bytes sectors = 1536 bytes
  460 +# (this is used in isOleFile and OleFile.open)
  461 +MINIMAL_OLEFILE_SIZE = 1536
  462 +
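OleFileIO._raise_defect() itself appears further down in the file; the thresholding idea these constants encode can be sketched independently (the class and attribute names below are illustrative, not olefile's):

```python
DEFECT_UNSURE, DEFECT_POTENTIAL, DEFECT_INCORRECT, DEFECT_FATAL = 10, 20, 30, 40

class DefectRecorder:
    """Record defects below the sensitivity threshold, raise the rest."""
    def __init__(self, raise_level=DEFECT_FATAL):
        self.raise_level = raise_level
        self.parsing_issues = []

    def raise_defect(self, level, message, exception_type=IOError):
        if level >= self.raise_level:
            raise exception_type(message)
        # non-fatal: keep a record so it can be reported later
        self.parsing_issues.append((level, message))

recorder = DefectRecorder()
recorder.raise_defect(DEFECT_INCORRECT, 'stream too large')  # recorded only
```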
  463 +#[PL] add useful constants to __all__:
  464 +# for key in list(vars().keys()):
  465 +# if key.startswith('STGTY_') or key.startswith('DEFECT_'):
  466 +# __all__.append(key)
  467 +
  468 +
  469 +#=== FUNCTIONS ===============================================================
  470 +
  471 +def isOleFile(filename):
  472 + """
  473 + Test if a file is an OLE container (according to the magic bytes in its header).
  474 +
  475 + :param filename: string-like or file-like object, OLE file to parse
  476 +
  477 + - if filename is a string shorter than 1536 bytes, it is the path
  478 + of the file to open (bytes or unicode string)
  479 + - if filename is a string of 1536 bytes or more, it is parsed
  480 + as the content of an OLE file in memory (bytes type only)
  481 + - if filename is a file-like object (with read and seek methods),
  482 + it is parsed as-is.
  483 +
  484 + :returns: True if OLE, False otherwise.
  485 + """
  486 + # check if filename is a string-like or file-like object:
  487 + if hasattr(filename, 'read'):
  488 + # file-like object: use it directly
  489 + header = filename.read(len(MAGIC))
  490 + # just in case, seek back to start of file:
  491 + filename.seek(0)
  492 + elif isinstance(filename, bytes) and len(filename) >= MINIMAL_OLEFILE_SIZE:
  493 + # filename is a bytes string containing the OLE file to be parsed:
  494 + header = filename[:len(MAGIC)]
  495 + else:
  496 + # string-like object: filename of file on disk
  497 + header = open(filename, 'rb').read(len(MAGIC))
  498 + if header == MAGIC:
  499 + return True
  500 + else:
  501 + return False
  502 +
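The three input forms accepted by isOleFile can be demonstrated with in-memory data only (no file on disk); the snippet below mirrors the header check against the magic bytes:

```python
import io

MAGIC = b'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'

data = MAGIC + b'\x00' * 2048          # pretend OLE content, >= 1536 bytes

# bytes form: the header is compared directly against the magic
ole_from_bytes = data[:len(MAGIC)] == MAGIC

# file-like form: read the header, then seek back to the start
stream = io.BytesIO(data)
ole_from_stream = stream.read(len(MAGIC)) == MAGIC
stream.seek(0)
```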
  503 +
  504 +if bytes is str:
  505 + # version for Python 2.x
  506 + def i8(c):
  507 + return ord(c)
  508 +else:
  509 + # version for Python 3.x
  510 + def i8(c):
  511 + return c if c.__class__ is int else c[0]
  512 +
  513 +
  514 +#TODO: replace i16 and i32 with more readable struct.unpack equivalent?
  515 +
  516 +def i16(c, o = 0):
  517 + """
  518 + Converts a 2-bytes (16 bits) string to an integer.
  519 +
  520 + :param c: string containing bytes to convert
  521 + :param o: offset of bytes to convert in string
  522 + """
  523 + return i8(c[o]) | (i8(c[o+1])<<8)
  524 +
  525 +
  526 +def i32(c, o = 0):
  527 + """
  528 + Converts a 4-bytes (32 bits) string to an integer.
  529 +
  530 + :param c: string containing bytes to convert
  531 + :param o: offset of bytes to convert in string
  532 + """
  533 +## return int(ord(c[o])+(ord(c[o+1])<<8)+(ord(c[o+2])<<16)+(ord(c[o+3])<<24))
  534 +## # [PL]: added int() because "<<" gives long int since Python 2.4
  535 + # copied from Pillow's _binary:
  536 + return i8(c[o]) | (i8(c[o+1])<<8) | (i8(c[o+2])<<16) | (i8(c[o+3])<<24)
  537 +
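The TODO above suggests struct.unpack as a more readable equivalent; the equivalence is easy to check. A standalone copy of the Python 3 helpers compared against struct's little-endian formats '<H' and '<I':

```python
import struct

def i8(c):
    return c if c.__class__ is int else c[0]

def i16(c, o=0):
    return i8(c[o]) | (i8(c[o+1]) << 8)

def i32(c, o=0):
    return i8(c[o]) | (i8(c[o+1]) << 8) | (i8(c[o+2]) << 16) | (i8(c[o+3]) << 24)

data = b'\x01\x02\x03\x04'
# both helpers decode little-endian, exactly like struct's '<' formats:
same16 = i16(data) == struct.unpack('<H', data[:2])[0] == 0x0201
same32 = i32(data) == struct.unpack('<I', data)[0] == 0x04030201
```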
  538 +
  539 +def _clsid(clsid):
  540 + """
  541 + Converts a CLSID to a human-readable string.
  542 +
  543 + :param clsid: string of length 16.
  544 + """
  545 + assert len(clsid) == 16
  546 + # if clsid is only made of null bytes, return an empty string:
  547 + # (PL: why not simply return the string with zeroes?)
  548 + if not clsid.strip(b"\0"):
  549 + return ""
  550 + return (("%08X-%04X-%04X-%02X%02X-" + "%02X" * 6) %
  551 + ((i32(clsid, 0), i16(clsid, 4), i16(clsid, 6)) +
  552 + tuple(map(i8, clsid[8:16]))))
  553 +
  554 +
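The field layout handled by _clsid can be shown with struct alone: the first three fields are little-endian, the last eight bytes are emitted in order. clsid_to_str is an illustrative stand-in, checked here against the WORD_CLSID value defined above:

```python
import struct

def clsid_to_str(clsid):
    # DWORD + two WORDs little-endian, then 8 raw bytes printed as-is:
    assert len(clsid) == 16
    d1, w1, w2 = struct.unpack('<IHH', clsid[:8])
    return (('%08X-%04X-%04X-%02X%02X-' + '%02X' * 6) %
            ((d1, w1, w2) + tuple(bytearray(clsid[8:16]))))

# raw root CLSID of a Word document:
word_clsid = b'\x00\x09\x02\x00\x00\x00\x00\x00\xC0\x00\x00\x00\x00\x00\x00\x46'
```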
  555 +
  556 +def filetime2datetime(filetime):
  557 + """
  558 + convert FILETIME (64 bits int) to Python datetime.datetime
  559 + """
  560 + # TODO: manage exception when microseconds is too large
  561 + # inspired from http://code.activestate.com/recipes/511425-filetime-to-datetime/
  562 + _FILETIME_null_date = datetime.datetime(1601, 1, 1, 0, 0, 0)
  563 + #log.debug('timedelta days=%d' % (filetime//(10*1000000*3600*24)))
  564 + return _FILETIME_null_date + datetime.timedelta(microseconds=filetime//10)
  565 +
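filetime2datetime is easy to sanity-check: FILETIME counts 100-nanosecond ticks since 1601-01-01, so one day is 864,000,000,000 ticks. A self-contained copy of the conversion:

```python
import datetime

def filetime2datetime(filetime):
    # FILETIME: 100-nanosecond intervals since 1601-01-01
    epoch = datetime.datetime(1601, 1, 1, 0, 0, 0)
    return epoch + datetime.timedelta(microseconds=filetime // 10)

one_day_ticks = 24 * 3600 * 10000000   # 100-ns ticks in one day
next_day = filetime2datetime(one_day_ticks)
```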
  566 +
  567 +
  568 +#=== CLASSES ==================================================================
  569 +
  570 +class OleMetadata:
  571 + """
  572 + class to parse and store metadata from standard properties of OLE files.
  573 +
  574 + Available attributes:
  575 + codepage, title, subject, author, keywords, comments, template,
  576 + last_saved_by, revision_number, total_edit_time, last_printed, create_time,
  577 + last_saved_time, num_pages, num_words, num_chars, thumbnail,
  578 + creating_application, security, codepage_doc, category, presentation_target,
  579 + bytes, lines, paragraphs, slides, notes, hidden_slides, mm_clips,
  580 + scale_crop, heading_pairs, titles_of_parts, manager, company, links_dirty,
  581 + chars_with_spaces, unused, shared_doc, link_base, hlinks, hlinks_changed,
  582 + version, dig_sig, content_type, content_status, language, doc_version
  583 +
  584 + Note: an attribute is set to None when not present in the properties of the
  585 + OLE file.
  586 +
  587 + References for SummaryInformation stream:
  588 + - http://msdn.microsoft.com/en-us/library/dd942545.aspx
  589 + - http://msdn.microsoft.com/en-us/library/dd925819%28v=office.12%29.aspx
  590 + - http://msdn.microsoft.com/en-us/library/windows/desktop/aa380376%28v=vs.85%29.aspx
  591 + - http://msdn.microsoft.com/en-us/library/aa372045.aspx
  592 + - http://sedna-soft.de/summary-information-stream/
  593 + - http://poi.apache.org/apidocs/org/apache/poi/hpsf/SummaryInformation.html
  594 +
  595 + References for DocumentSummaryInformation stream:
  596 + - http://msdn.microsoft.com/en-us/library/dd945671%28v=office.12%29.aspx
  597 + - http://msdn.microsoft.com/en-us/library/windows/desktop/aa380374%28v=vs.85%29.aspx
  598 + - http://poi.apache.org/apidocs/org/apache/poi/hpsf/DocumentSummaryInformation.html
  599 +
  600 + new in version 0.25
  601 + """
  602 +
  603 + # attribute names for SummaryInformation stream properties:
  604 + # (ordered by property id, starting at 1)
  605 + SUMMARY_ATTRIBS = ['codepage', 'title', 'subject', 'author', 'keywords', 'comments',
  606 + 'template', 'last_saved_by', 'revision_number', 'total_edit_time',
  607 + 'last_printed', 'create_time', 'last_saved_time', 'num_pages',
  608 + 'num_words', 'num_chars', 'thumbnail', 'creating_application',
  609 + 'security']
  610 +
  611 + # attribute names for DocumentSummaryInformation stream properties:
  612 + # (ordered by property id, starting at 1)
  613 + DOCSUM_ATTRIBS = ['codepage_doc', 'category', 'presentation_target', 'bytes', 'lines', 'paragraphs',
  614 + 'slides', 'notes', 'hidden_slides', 'mm_clips',
  615 + 'scale_crop', 'heading_pairs', 'titles_of_parts', 'manager',
  616 + 'company', 'links_dirty', 'chars_with_spaces', 'unused', 'shared_doc',
  617 + 'link_base', 'hlinks', 'hlinks_changed', 'version', 'dig_sig',
  618 + 'content_type', 'content_status', 'language', 'doc_version']
  619 +
  620 + def __init__(self):
  621 + """
  622 + Constructor for OleMetadata
  623 + All attributes are set to None by default
  624 + """
  625 + # properties from SummaryInformation stream
  626 + self.codepage = None
  627 + self.title = None
  628 + self.subject = None
  629 + self.author = None
  630 + self.keywords = None
  631 + self.comments = None
  632 + self.template = None
  633 + self.last_saved_by = None
  634 + self.revision_number = None
  635 + self.total_edit_time = None
  636 + self.last_printed = None
  637 + self.create_time = None
  638 + self.last_saved_time = None
  639 + self.num_pages = None
  640 + self.num_words = None
  641 + self.num_chars = None
  642 + self.thumbnail = None
  643 + self.creating_application = None
  644 + self.security = None
  645 + # properties from DocumentSummaryInformation stream
  646 + self.codepage_doc = None
  647 + self.category = None
  648 + self.presentation_target = None
  649 + self.bytes = None
  650 + self.lines = None
  651 + self.paragraphs = None
  652 + self.slides = None
  653 + self.notes = None
  654 + self.hidden_slides = None
  655 + self.mm_clips = None
  656 + self.scale_crop = None
  657 + self.heading_pairs = None
  658 + self.titles_of_parts = None
  659 + self.manager = None
  660 + self.company = None
  661 + self.links_dirty = None
  662 + self.chars_with_spaces = None
  663 + self.unused = None
  664 + self.shared_doc = None
  665 + self.link_base = None
  666 + self.hlinks = None
  667 + self.hlinks_changed = None
  668 + self.version = None
  669 + self.dig_sig = None
  670 + self.content_type = None
  671 + self.content_status = None
  672 + self.language = None
  673 + self.doc_version = None
  674 +
  675 +
  676 + def parse_properties(self, olefile):
  677 + """
  678 + Parse standard properties of an OLE file, from the streams
  679 + "\x05SummaryInformation" and "\x05DocumentSummaryInformation",
  680 + if present.
  681 + Properties are converted to strings, integers or python datetime objects.
  682 + If a property is not present, its value is set to None.
  683 + """
  684 + # first set all attributes to None:
  685 + for attrib in (self.SUMMARY_ATTRIBS + self.DOCSUM_ATTRIBS):
  686 + setattr(self, attrib, None)
  687 + if olefile.exists("\x05SummaryInformation"):
  688 + # get properties from the stream:
  689 + # (converting timestamps to python datetime, except total_edit_time,
  690 + # which is property #10)
  691 + props = olefile.getproperties("\x05SummaryInformation",
  692 + convert_time=True, no_conversion=[10])
  693 + # store them into this object's attributes:
  694 + for i in range(len(self.SUMMARY_ATTRIBS)):
  695 + # ids for standard properties start at 0x01, up to 0x13
  696 + value = props.get(i+1, None)
  697 + setattr(self, self.SUMMARY_ATTRIBS[i], value)
  698 + if olefile.exists("\x05DocumentSummaryInformation"):
  699 + # get properties from the stream:
  700 + props = olefile.getproperties("\x05DocumentSummaryInformation",
  701 + convert_time=True)
  702 + # store them into this object's attributes:
  703 + for i in range(len(self.DOCSUM_ATTRIBS)):
  704 + # ids for standard properties start at 0x01
  705 + value = props.get(i+1, None)
  706 + setattr(self, self.DOCSUM_ATTRIBS[i], value)
  707 +
  708 + def dump(self):
  709 + """
  710 + Dump all metadata, for debugging purposes.
  711 + """
  712 + print('Properties from SummaryInformation stream:')
  713 + for prop in self.SUMMARY_ATTRIBS:
  714 + value = getattr(self, prop)
  715 + print('- %s: %s' % (prop, repr(value)))
  716 + print('Properties from DocumentSummaryInformation stream:')
  717 + for prop in self.DOCSUM_ATTRIBS:
  718 + value = getattr(self, prop)
  719 + print('- %s: %s' % (prop, repr(value)))
  720 +
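The i+1 indexing in parse_properties reflects that property ids are 1-based while the attribute lists are 0-based; a toy version of that mapping, with an illustrative attribute subset and a fake properties dict:

```python
# illustrative subset of SUMMARY_ATTRIBS; property ids are 1-based
attribs = ['codepage', 'title', 'subject']

class Meta(object):
    pass

props = {1: 1252, 2: 'Example document'}   # as a getproperties() result
meta = Meta()
for i in range(len(attribs)):
    # id i+1 maps to attribute i; missing properties become None
    setattr(meta, attribs[i], props.get(i + 1, None))
```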
  721 +
  722 +#--- OleStream ---------------------------------------------------------------
  723 +
  724 +class OleStream(io.BytesIO):
  725 + """
  726 + OLE2 Stream
  727 +
  728 + Returns a read-only file object which can be used to read
  729 + the contents of an OLE stream (instance of the BytesIO class).
  730 + To open a stream, use the openstream method in the OleFile class.
  731 +
  732 + This class can be used with either ordinary streams,
  733 + or ministreams, depending on the offset, sectorsize, and
  734 + fat table arguments.
  735 +
  736 + Attributes:
  737 +
  738 + - size: actual size of data stream, after it was opened.
  739 + """
  740 + # FIXME: should store the list of sects obtained by following
  741 + # the fat chain, and load new sectors on demand instead of
  742 + # loading it all in one go.
  743 +
  744 + def __init__(self, fp, sect, size, offset, sectorsize, fat, filesize, olefileio):
  745 + """
  746 + Constructor for OleStream class.
  747 +
  748 + :param fp: file object, the OLE container or the MiniFAT stream
  749 + :param sect: sector index of first sector in the stream
  750 + :param size: total size of the stream
  751 + :param offset: offset in bytes for the first FAT or MiniFAT sector
  752 + :param sectorsize: size of one sector
  753 + :param fat: array/list of sector indexes (FAT or MiniFAT)
  754 + :param filesize: size of OLE file (for debugging)
  755 + :param olefileio: OleFileIO object containing this stream
  756 + :returns: a BytesIO instance containing the OLE stream
  757 + """
  758 + log.debug('OleStream.__init__:')
  759 + log.debug(' sect=%d (%X), size=%d, offset=%d, sectorsize=%d, len(fat)=%d, fp=%s'
  760 + %(sect,sect,size,offset,sectorsize,len(fat), repr(fp)))
  761 + self.ole = olefileio
  762 + #[PL] To detect malformed documents with FAT loops, we compute the
  763 + # expected number of sectors in the stream:
  764 + unknown_size = False
  765 + if size == UNKNOWN_SIZE:
  766 + # this is the case when called from OleFileIO._open(), and stream
  767 + # size is not known in advance (for example when reading the
  768 + # Directory stream). Then we can only guess maximum size:
  769 + size = len(fat)*sectorsize
  770 + # and we keep a record that size was unknown:
  771 + unknown_size = True
  772 + log.debug(' stream with UNKNOWN SIZE')
  773 + nb_sectors = (size + (sectorsize-1)) // sectorsize
  774 + log.debug('nb_sectors = %d' % nb_sectors)
  775 + # This number should (at least) be less than the total number of
  776 + # sectors in the given FAT:
  777 + if nb_sectors > len(fat):
  778 + self.ole._raise_defect(DEFECT_INCORRECT, 'malformed OLE document, stream too large')
  779 + # optimization(?): data is first a list of strings, and join() is called
  780 + # at the end to concatenate all in one string.
  781 + # (this may not be really useful with recent Python versions)
  782 + data = []
  783 + # if size is zero, then first sector index should be ENDOFCHAIN:
  784 + if size == 0 and sect != ENDOFCHAIN:
  785 + log.debug('size == 0 and sect != ENDOFCHAIN:')
  786 + self.ole._raise_defect(DEFECT_INCORRECT, 'incorrect OLE sector index for empty stream')
  787 + #[PL] A fixed-length for loop is used instead of an undefined while
  788 + # loop to avoid DoS attacks:
  789 + for i in range(nb_sectors):
  790 + log.debug('Reading stream sector[%d] = %Xh' % (i, sect))
  791 + # Sector index may be ENDOFCHAIN, but only if size was unknown
  792 + if sect == ENDOFCHAIN:
  793 + if unknown_size:
  794 + log.debug('Reached ENDOFCHAIN sector for stream with unknown size')
  795 + break
  796 + else:
  797 + # else this means that the stream is smaller than declared:
  798 + log.debug('sect=ENDOFCHAIN before expected size')
  799 + self.ole._raise_defect(DEFECT_INCORRECT, 'incomplete OLE stream')
  800 + # sector index should be within FAT:
  801 + if sect<0 or sect>=len(fat):
  802 + log.debug('sect=%d (%X) / len(fat)=%d' % (sect, sect, len(fat)))
  803 + log.debug('i=%d / nb_sectors=%d' %(i, nb_sectors))
  804 +## tmp_data = b"".join(data)
  805 +## f = open('test_debug.bin', 'wb')
  806 +## f.write(tmp_data)
  807 +## f.close()
  808 +## log.debug('data read so far: %d bytes' % len(tmp_data))
  809 + self.ole._raise_defect(DEFECT_INCORRECT, 'incorrect OLE FAT, sector index out of range')
  810 + # stop reading here if the exception is ignored:
  811 + break
  812 + #TODO: merge this code with OleFileIO.getsect() ?
  813 + #TODO: check if this works with 4K sectors:
  814 + try:
  815 + fp.seek(offset + sectorsize * sect)
  816 + except:
  817 + log.debug('sect=%d, seek=%d, filesize=%d' %
  818 + (sect, offset+sectorsize*sect, filesize))
  819 + self.ole._raise_defect(DEFECT_INCORRECT, 'OLE sector index out of range')
  820 + # stop reading here if the exception is ignored:
  821 + break
  822 + sector_data = fp.read(sectorsize)
  823 + # [PL] check if there was enough data:
  824 + # Note: if sector is the last of the file, sometimes it is not a
  825 + # complete sector (of 512 or 4K), so we may read less than
  826 + # sectorsize.
  827 + if len(sector_data)!=sectorsize and sect!=(len(fat)-1):
  828 + log.debug('sect=%d / len(fat)=%d, seek=%d / filesize=%d, len read=%d' %
  829 + (sect, len(fat), offset+sectorsize*sect, filesize, len(sector_data)))
  830 + log.debug('seek+len(read)=%d' % (offset+sectorsize*sect+len(sector_data)))
  831 + self.ole._raise_defect(DEFECT_INCORRECT, 'incomplete OLE sector')
  832 + data.append(sector_data)
  833 + # jump to next sector in the FAT:
  834 + try:
  835 + sect = fat[sect] & 0xFFFFFFFF # JYTHON-WORKAROUND
  836 + except IndexError:
  837 + # [PL] if pointer is out of the FAT an exception is raised
  838 + self.ole._raise_defect(DEFECT_INCORRECT, 'incorrect OLE FAT, sector index out of range')
  839 + # stop reading here if the exception is ignored:
  840 + break
  841 + #[PL] Last sector should be an "end of chain" marker:
  842 + # if sect != ENDOFCHAIN:
  843 + # raise IOError('incorrect last sector index in OLE stream')
  844 + data = b"".join(data)
  845 + # Data is truncated to the actual stream size:
  846 + if len(data) >= size:
  847 + log.debug('Read data of length %d, truncated to stream size %d' % (len(data), size))
  848 + data = data[:size]
  849 + # actual stream size is stored for future use:
  850 + self.size = size
  851 + elif unknown_size:
  852 + # actual stream size was not known, now we know the size of read
  853 + # data:
  854 + log.debug('Read data of length %d, the stream size was unknown' % len(data))
  855 + self.size = len(data)
  856 + else:
  857 + # read data is less than expected:
  858 + log.debug('Read data of length %d, less than expected stream size %d' % (len(data), size))
  859 + # TODO: provide details in exception message
  860 + self.ole._raise_defect(DEFECT_INCORRECT, 'OLE stream size is less than declared')
  861 + self.size = len(data)
  862 + # when all data is read in memory, BytesIO constructor is called
  863 + io.BytesIO.__init__(self, data)
  864 + # Then the OleStream object can be used as a read-only file object.
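The core of the constructor above, walking a FAT sector chain with a fixed-length for loop instead of an unbounded while (the DoS protection noted in the comments), can be illustrated in isolation. This is a minimal sketch with a toy FAT, not olefile's API:

```python
ENDOFCHAIN = 0xFFFFFFFE  # same end-of-chain marker value as in the OLE spec

def follow_chain(fat, first_sect, max_sectors):
    """Walk a FAT sector chain with a bounded loop (avoids FAT loops / DoS)."""
    chain = []
    sect = first_sect
    for _ in range(max_sectors):  # bounded loop instead of 'while'
        if sect == ENDOFCHAIN:
            break
        if sect < 0 or sect >= len(fat):
            raise IOError('incorrect FAT, sector index out of range')
        chain.append(sect)
        sect = fat[sect]
    return chain

# toy FAT: sector 0 -> 2 -> 1 -> end of chain
print(follow_chain([2, ENDOFCHAIN, 1], 0, 3))  # [0, 2, 1]
# a malicious looping FAT (0 -> 1 -> 0 -> ...) still terminates:
print(follow_chain([1, 0], 0, 2))  # [0, 1]
```

Bounding the iteration count to the expected number of sectors is what turns a crafted circular FAT into a recoverable defect instead of an infinite loop.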
  865 +
  866 +
  867 +#--- OleDirectoryEntry -------------------------------------------------------
  868 +
  869 +class OleDirectoryEntry:
  870 +
  871 + """
  872 + OLE2 Directory Entry
  873 + """
  874 + #[PL] parsing code moved from OleFileIO.loaddirectory
  875 +
  876 + # struct to parse directory entries:
  877 + # <: little-endian byte order, standard sizes
  878 + # (note: this should guarantee that Q returns a 64 bits int)
  879 + # 64s: string containing entry name in unicode UTF-16 (max 31 chars) + null char = 64 bytes
  880 + # H: uint16, number of bytes used in name buffer, including null = (len+1)*2
  881 + # B: uint8, dir entry type (between 0 and 5)
  882 + # B: uint8, color: 0=black, 1=red
  883 + # I: uint32, index of left child node in the red-black tree, NOSTREAM if none
  884 + # I: uint32, index of right child node in the red-black tree, NOSTREAM if none
  885 + # I: uint32, index of child root node if it is a storage, else NOSTREAM
  886 + # 16s: CLSID, unique identifier (only used if it is a storage)
  887 + # I: uint32, user flags
  888 + # Q (was 8s): uint64, creation timestamp or zero
  889 + # Q (was 8s): uint64, modification timestamp or zero
  890 + # I: uint32, SID of first sector if stream or ministream, SID of 1st sector
  891 + # of stream containing ministreams if root entry, 0 otherwise
  892 + # I: uint32, total stream size in bytes if stream (low 32 bits), 0 otherwise
  893 + # I: uint32, total stream size in bytes if stream (high 32 bits), 0 otherwise
  894 + STRUCT_DIRENTRY = '<64sHBBIII16sIQQIII'
  895 + # size of a directory entry: 128 bytes
  896 + DIRENTRY_SIZE = 128
  897 + assert struct.calcsize(STRUCT_DIRENTRY) == DIRENTRY_SIZE
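To see the STRUCT_DIRENTRY format string in action, one can pack and unpack a synthetic 128-byte directory entry. This is a standalone illustration; the "Root Entry" record below is made up for the example, not read from a real file:

```python
import struct

STRUCT_DIRENTRY = '<64sHBBIII16sIQQIII'
assert struct.calcsize(STRUCT_DIRENTRY) == 128

NOSTREAM = 0xFFFFFFFF
name = 'Root Entry'.encode('UTF-16LE')
name_raw = name + b'\x00' * (64 - len(name))  # padded 64-byte name buffer
entry = struct.pack(STRUCT_DIRENTRY,
                    name_raw, len(name) + 2,   # name + length incl. null char
                    5, 1,                      # type=STGTY_ROOT, color=red
                    NOSTREAM, NOSTREAM, 1,     # left, right, child SIDs
                    b'\x00' * 16,              # CLSID
                    0, 0, 0,                   # user flags, timestamps
                    3, 0, 0)                   # first sector, size low/high
fields = struct.unpack(STRUCT_DIRENTRY, entry)
namelength = fields[1]
print(fields[0][:namelength - 2].decode('UTF-16LE'))  # Root Entry
```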
  898 +
  899 +
  900 + def __init__(self, entry, sid, olefile):
  901 + """
  902 + Constructor for an OleDirectoryEntry object.
  903 + Parses a 128-byte entry from the OLE Directory stream.
  904 +
  905 + :param entry: bytes string (must be 128 bytes long)
  906 + :param sid: index of this directory entry in the OLE file directory
  907 + :param olefile: OleFileIO containing this directory entry
  908 + """
  909 + self.sid = sid
  910 + # ref to olefile is stored for future use
  911 + self.olefile = olefile
  912 + # kids is a list of children entries, if this entry is a storage:
  913 + # (list of OleDirectoryEntry objects)
  914 + self.kids = []
  915 + # kids_dict is a dictionary of children entries, indexed by their
  916 + # name in lowercase: used to quickly find an entry, and to detect
  917 + # duplicates
  918 + self.kids_dict = {}
  919 + # flag used to detect if the entry is referenced more than once in
  920 + # directory:
  921 + self.used = False
  922 + # decode DirEntry
  923 + (
  924 + self.name_raw, # 64s: string containing entry name in unicode UTF-16 (max 31 chars) + null char = 64 bytes
  925 + self.namelength, # H: uint16, number of bytes used in name buffer, including null = (len+1)*2
  926 + self.entry_type,
  927 + self.color,
  928 + self.sid_left,
  929 + self.sid_right,
  930 + self.sid_child,
  931 + clsid,
  932 + self.dwUserFlags,
  933 + self.createTime,
  934 + self.modifyTime,
  935 + self.isectStart,
  936 + self.sizeLow,
  937 + self.sizeHigh
  938 + ) = struct.unpack(OleDirectoryEntry.STRUCT_DIRENTRY, entry)
  939 + if self.entry_type not in [STGTY_ROOT, STGTY_STORAGE, STGTY_STREAM, STGTY_EMPTY]:
  940 + olefile._raise_defect(DEFECT_INCORRECT, 'unhandled OLE storage type')
  941 + # only first directory entry can (and should) be root:
  942 + if self.entry_type == STGTY_ROOT and sid != 0:
  943 + olefile._raise_defect(DEFECT_INCORRECT, 'duplicate OLE root entry')
  944 + if sid == 0 and self.entry_type != STGTY_ROOT:
  945 + olefile._raise_defect(DEFECT_INCORRECT, 'incorrect OLE root entry')
  946 + #log.debug(struct.unpack(fmt_entry, entry[:len_entry]))
  947 + # name should be at most 31 unicode characters + null character,
  948 + # so 64 bytes in total (31*2 + 2):
  949 + if self.namelength>64:
  950 + olefile._raise_defect(DEFECT_INCORRECT, 'incorrect DirEntry name length >64 bytes')
  951 + # if exception not raised, namelength is set to the maximum value:
  952 + self.namelength = 64
  953 + # only characters without ending null char are kept:
  954 + self.name_utf16 = self.name_raw[:(self.namelength-2)]
  955 + #TODO: check if the name is actually followed by a null unicode character ([MS-CFB] 2.6.1)
  956 + #TODO: check if the name does not contain forbidden characters:
  957 + # [MS-CFB] 2.6.1: "The following characters are illegal and MUST NOT be part of the name: '/', '\', ':', '!'."
  958 + # name is converted from UTF-16LE to the path encoding specified in the OleFileIO:
  959 + self.name = olefile._decode_utf16_str(self.name_utf16)
  960 +
  961 + log.debug('DirEntry SID=%d: %s' % (self.sid, repr(self.name)))
  962 + log.debug(' - type: %d' % self.entry_type)
  963 + log.debug(' - sect: %Xh' % self.isectStart)
  964 + log.debug(' - SID left: %d, right: %d, child: %d' % (self.sid_left,
  965 + self.sid_right, self.sid_child))
  966 +
  967 + # sizeHigh is only used for 4K sectors, it should be zero for 512 bytes
  968 + # sectors, BUT apparently some implementations set it as 0xFFFFFFFF, 1
  969 + # or some other value so it cannot be raised as a defect in general:
  970 + if olefile.sectorsize == 512:
  971 + if self.sizeHigh != 0 and self.sizeHigh != 0xFFFFFFFF:
  972 + log.debug('sectorsize=%d, sizeLow=%d, sizeHigh=%d (%X)' %
  973 + (olefile.sectorsize, self.sizeLow, self.sizeHigh, self.sizeHigh))
  974 + olefile._raise_defect(DEFECT_UNSURE, 'incorrect OLE stream size')
  975 + self.size = self.sizeLow
  976 + else:
  977 + self.size = self.sizeLow + (long(self.sizeHigh)<<32)
  978 + log.debug(' - size: %d (sizeLow=%d, sizeHigh=%d)' % (self.size, self.sizeLow, self.sizeHigh))
  979 +
  980 + self.clsid = _clsid(clsid)
  981 + # a storage should have a null size, BUT some implementations such as
  982 + # Word 8 for Mac seem to allow non-null values => Potential defect:
  983 + if self.entry_type == STGTY_STORAGE and self.size != 0:
  984 + olefile._raise_defect(DEFECT_POTENTIAL, 'OLE storage with size>0')
  985 + # check if stream is not already referenced elsewhere:
  986 + if self.entry_type in (STGTY_ROOT, STGTY_STREAM) and self.size>0:
  987 + if self.size < olefile.minisectorcutoff \
  988 + and self.entry_type==STGTY_STREAM: # only streams can be in MiniFAT
  989 + # ministream object
  990 + minifat = True
  991 + else:
  992 + minifat = False
  993 + olefile._check_duplicate_stream(self.isectStart, minifat)
  994 +
  995 +
  996 +
  997 + def build_storage_tree(self):
  998 + """
  999 + Read and build the red-black tree attached to this OleDirectoryEntry
  1000 + object, if it is a storage.
  1001 + Note that this method builds a tree of all subentries, so it should
  1002 + only be called for the root object once.
  1003 + """
  1004 + log.debug('build_storage_tree: SID=%d - %s - sid_child=%d'
  1005 + % (self.sid, repr(self.name), self.sid_child))
  1006 + if self.sid_child != NOSTREAM:
  1007 + # if child SID is not NOSTREAM, then this entry is a storage.
  1008 + # Let's walk through the tree of children to fill the kids list:
  1009 + self.append_kids(self.sid_child)
  1010 +
  1011 + # Note from OpenOffice documentation: the safest way is to
  1012 + # recreate the tree because some implementations may store broken
  1013 + # red-black trees...
  1014 +
  1015 + # in the OLE file, entries are sorted on (length, name).
  1016 + # for convenience, we sort them on name instead:
  1017 + # (see rich comparison methods in this class)
  1018 + self.kids.sort()
  1019 +
  1020 +
  1021 + def append_kids(self, child_sid):
  1022 + """
  1023 + Walk through red-black tree of children of this directory entry to add
  1024 + all of them to the kids list. (recursive method)
  1025 +
  1026 + :param child_sid: index of the child directory entry to add, or NOSTREAM
  1027 + when this branch of the tree ends. (only used during recursion)
  1028 + """
  1029 + log.debug('append_kids: child_sid=%d' % child_sid)
  1030 + #[PL] this method was added to use simple recursion instead of a complex
  1031 + # algorithm.
  1032 + # if this is not a storage or a leaf of the tree, nothing to do:
  1033 + if child_sid == NOSTREAM:
  1034 + return
  1035 + # check if child SID is in the proper range:
  1036 + if child_sid<0 or child_sid>=len(self.olefile.direntries):
  1037 + self.olefile._raise_defect(DEFECT_INCORRECT, 'OLE DirEntry index out of range')
  1038 + else:
  1039 + # get child direntry:
  1040 + child = self.olefile._load_direntry(child_sid) #direntries[child_sid]
  1041 + log.debug('append_kids: child_sid=%d - %s - sid_left=%d, sid_right=%d, sid_child=%d'
  1042 + % (child.sid, repr(child.name), child.sid_left, child.sid_right, child.sid_child))
  1043 + # the directory entries are organized as a red-black tree.
  1044 + # (cf. Wikipedia for details)
  1045 + # First walk through left side of the tree:
  1046 + self.append_kids(child.sid_left)
  1047 + # Check if its name is not already used (case-insensitive):
  1048 + name_lower = child.name.lower()
  1049 + if name_lower in self.kids_dict:
  1050 + self.olefile._raise_defect(DEFECT_INCORRECT,
  1051 + "Duplicate filename in OLE storage")
  1052 + # Then the child_sid OleDirectoryEntry object is appended to the
  1053 + # kids list and dictionary:
  1054 + self.kids.append(child)
  1055 + self.kids_dict[name_lower] = child
  1056 + # Check if kid was not already referenced in a storage:
  1057 + if child.used:
  1058 + self.olefile._raise_defect(DEFECT_INCORRECT,
  1059 + 'OLE Entry referenced more than once')
  1060 + child.used = True
  1061 + # Finally walk through right side of the tree:
  1062 + self.append_kids(child.sid_right)
  1063 + # Afterwards build kid's own tree if it's also a storage:
  1064 + child.build_storage_tree()
  1065 +
  1066 +
  1067 + def __eq__(self, other):
  1068 + "Compare entries by name"
  1069 + return self.name == other.name
  1070 +
  1071 + def __lt__(self, other):
  1072 + "Compare entries by name"
  1073 + return self.name < other.name
  1074 +
  1075 + def __ne__(self, other):
  1076 + return not self.__eq__(other)
  1077 +
  1078 + def __le__(self, other):
  1079 + return self.__eq__(other) or self.__lt__(other)
  1080 +
  1081 + # Reflected __lt__() and __le__() will be used for __gt__() and __ge__()
  1082 +
  1083 + #TODO: replace by the same function as MS implementation ?
  1084 + # (order by name length first, then case-insensitive order)
  1085 +
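For reference, the [MS-CFB] ordering mentioned in the TODO above (shorter names first, then a case-insensitive comparison) could be approximated with a sort key. This is a sketch of that alternative ordering, not what the class currently implements:

```python
def cfb_sort_key(name):
    # [MS-CFB] orders sibling entries by name length first, then by an
    # uppercase comparison of the characters.
    return (len(name), name.upper())

print(sorted(['b', 'A', 'aa'], key=cfb_sort_key))  # ['A', 'b', 'aa']
```

Plain name sorting, as used by `kids.sort()` above, would instead yield `['A', 'aa', 'b']`.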
  1086 +
  1087 + def dump(self, tab = 0):
  1088 + "Dump this entry, and all its subentries (for debug purposes only)"
  1089 + TYPES = ["(invalid)", "(storage)", "(stream)", "(lockbytes)",
  1090 + "(property)", "(root)"]
  1091 + print(" "*tab + repr(self.name), TYPES[self.entry_type], end=' ')
  1092 + if self.entry_type in (STGTY_STREAM, STGTY_ROOT):
  1093 + print(self.size, "bytes", end=' ')
  1094 + print()
  1095 + if self.entry_type in (STGTY_STORAGE, STGTY_ROOT) and self.clsid:
  1096 + print(" "*tab + "{%s}" % self.clsid)
  1097 +
  1098 + for kid in self.kids:
  1099 + kid.dump(tab + 2)
  1100 +
  1101 +
  1102 + def getmtime(self):
  1103 + """
  1104 + Return modification time of a directory entry.
  1105 +
  1106 + :returns: None if modification time is null, a python datetime object
  1107 + otherwise (UTC timezone)
  1108 +
  1109 + new in version 0.26
  1110 + """
  1111 + if self.modifyTime == 0:
  1112 + return None
  1113 + return filetime2datetime(self.modifyTime)
  1114 +
  1115 +
  1116 + def getctime(self):
  1117 + """
  1118 + Return creation time of a directory entry.
  1119 +
  1120 + :returns: None if creation time is null, a python datetime object
  1121 + otherwise (UTC timezone)
  1122 +
  1123 + new in version 0.26
  1124 + """
  1125 + if self.createTime == 0:
  1126 + return None
  1127 + return filetime2datetime(self.createTime)
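The filetime2datetime helper used by getmtime and getctime converts a Windows FILETIME, i.e. a 64-bit count of 100-nanosecond intervals since 1601-01-01 UTC. The conversion amounts to the following; this standalone sketch is independent of the helper's actual implementation:

```python
from datetime import datetime, timedelta

def filetime_to_datetime(ft):
    # FILETIME counts 100-ns intervals since 1601-01-01 00:00:00 UTC;
    # 10 intervals = 1 microsecond
    return datetime(1601, 1, 1) + timedelta(microseconds=ft // 10)

# round-trip check: build the FILETIME value for 2016-01-01 00:00:00 UTC
delta = datetime(2016, 1, 1) - datetime(1601, 1, 1)
ft = (delta.days * 86400 + delta.seconds) * 10**7
print(filetime_to_datetime(ft))  # 2016-01-01 00:00:00
```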
  1128 +
  1129 +
  1130 +#--- OleFileIO ----------------------------------------------------------------
  1131 +
  1132 +class OleFileIO:
  1133 + """
  1134 + OLE container object
  1135 +
  1136 + This class encapsulates the interface to an OLE 2 structured
  1137 + storage file. Use the listdir and openstream methods to
  1138 + access the contents of this file.
  1139 +
  1140 + Object names are given as a list of strings, one for each subentry
  1141 + level. The root entry should be omitted. For example, the following
  1142 + code extracts all image streams from a Microsoft Image Composer file::
  1143 +
  1144 + ole = OleFileIO("fan.mic")
  1145 +
  1146 + for entry in ole.listdir():
  1147 + if entry[1:2] == ["Image"]:
  1148 + fin = ole.openstream(entry)
  1149 + fout = open(entry[0], "wb")
  1150 + while True:
  1151 + s = fin.read(8192)
  1152 + if not s:
  1153 + break
  1154 + fout.write(s)
  1155 +
  1156 + You can use the viewer application provided with the Python Imaging
  1157 + Library to view the resulting files (which happens to be standard
  1158 + TIFF files).
  1159 + """
  1160 +
  1161 + def __init__(self, filename=None, raise_defects=DEFECT_FATAL,
  1162 + write_mode=False, debug=False, path_encoding=DEFAULT_PATH_ENCODING):
  1163 + """
  1164 + Constructor for the OleFileIO class.
  1165 +
  1166 + :param filename: file to open.
  1167 +
  1168 + - if filename is a string smaller than 1536 bytes, it is the path
  1169 + of the file to open. (bytes or unicode string)
  1170 + - if filename is a string longer than 1535 bytes, it is parsed
  1171 + as the content of an OLE file in memory. (bytes type only)
  1172 + - if filename is a file-like object (with read, seek and tell methods),
  1173 + it is parsed as-is.
  1174 +
  1175 + :param raise_defects: minimal level for defects to be raised as exceptions.
  1176 + (use DEFECT_FATAL for a typical application, DEFECT_INCORRECT for a
  1177 + security-oriented application, see source code for details)
  1178 +
  1179 + :param write_mode: bool, if True the file is opened in read/write mode instead
  1180 + of read-only by default.
  1181 +
  1182 + :param debug: bool, set debug mode (deprecated, not used anymore)
  1183 +
  1184 + :param path_encoding: None or str, name of the codec to use for path
  1185 + names (streams and storages), or None for Unicode.
  1186 + Unicode by default on Python 3+, UTF-8 on Python 2.x.
  1187 + (new in olefile 0.42, was hardcoded to Latin-1 until olefile v0.41)
  1188 + """
  1189 + # minimal level for defects to be raised as exceptions:
  1190 + self._raise_defects_level = raise_defects
  1191 + # list of defects/issues not raised as exceptions:
  1192 + # tuples of (exception type, message)
  1193 + self.parsing_issues = []
  1194 + self.write_mode = write_mode
  1195 + self.path_encoding = path_encoding
  1196 + self._filesize = None
  1197 + self.fp = None
  1198 + if filename:
  1199 + self.open(filename, write_mode=write_mode)
  1200 +
  1201 +
  1202 + def _raise_defect(self, defect_level, message, exception_type=IOError):
  1203 + """
  1204 + This method should be called for any defect found during file parsing.
  1205 + It may raise an IOError exception according to the minimal level chosen
  1206 + for the OleFileIO object.
  1207 +
  1208 + :param defect_level: defect level, possible values are:
  1209 +
  1210 + - DEFECT_UNSURE : a case which looks weird, but not sure it's a defect
  1211 + - DEFECT_POTENTIAL : a potential defect
  1212 + - DEFECT_INCORRECT : an error according to specifications, but parsing can go on
  1213 + - DEFECT_FATAL : an error which cannot be ignored, parsing is impossible
  1214 +
  1215 + :param message: string describing the defect, used with raised exception.
  1216 + :param exception_type: exception class to be raised, IOError by default
  1217 + """
  1218 + # added by [PL]
  1219 + if defect_level >= self._raise_defects_level:
  1220 + log.error(message)
  1221 + raise exception_type(message)
  1222 + else:
  1223 + # just record the issue, no exception raised:
  1224 + self.parsing_issues.append((exception_type, message))
  1225 + log.warning(message)
  1226 +
  1227 +
  1228 + def _decode_utf16_str(self, utf16_str, errors='replace'):
  1229 + """
  1230 + Decode a string encoded in UTF-16 LE format, as found in the OLE
  1231 + directory or in property streams. Return a string encoded
  1232 + according to the path_encoding specified for the OleFileIO object.
  1233 +
  1234 + :param utf16_str: bytes string encoded in UTF-16 LE format
  1235 + :param errors: str, see python documentation for str.decode()
  1236 + :return: str, encoded according to path_encoding
  1237 + """
  1238 + unicode_str = utf16_str.decode('UTF-16LE', errors)
  1239 + if self.path_encoding:
  1240 + # an encoding has been specified for path names:
  1241 + return unicode_str.encode(self.path_encoding, errors)
  1242 + else:
  1243 + # path_encoding=None, return the Unicode string as-is:
  1244 + return unicode_str
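The decoding step itself is plain UTF-16LE followed by an optional re-encode; for instance (illustrative bytes only, outside of any OleFileIO object):

```python
utf16 = b'R\x00o\x00o\x00t\x00 \x00E\x00n\x00t\x00r\x00y\x00'
print(utf16.decode('UTF-16LE'))  # Root Entry
# with a path_encoding such as 'utf-8', the result would be re-encoded:
print(utf16.decode('UTF-16LE').encode('utf-8'))  # b'Root Entry'
```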
  1245 +
  1246 +
  1247 + def open(self, filename, write_mode=False):
  1248 + """
  1249 + Open an OLE2 file in read-only or read/write mode.
  1250 + Read and parse the header, FAT and directory.
  1251 +
  1252 + :param filename: string-like or file-like object, OLE file to parse
  1253 +
  1254 + - if filename is a string smaller than 1536 bytes, it is the path
  1255 + of the file to open. (bytes or unicode string)
  1256 + - if filename is a string longer than 1535 bytes, it is parsed
  1257 + as the content of an OLE file in memory. (bytes type only)
  1258 + - if filename is a file-like object (with read, seek and tell methods),
  1259 + it is parsed as-is.
  1260 +
  1261 + :param write_mode: bool, if True the file is opened in read/write mode instead
  1262 + of read-only by default. (ignored if filename is not a path)
  1263 + """
  1264 + self.write_mode = write_mode
  1265 + #[PL] check if filename is a string-like or file-like object:
  1266 + # (it is better to check for a read() method)
  1267 + if hasattr(filename, 'read'):
  1268 + #TODO: also check seek and tell methods?
  1269 + # file-like object: use it directly
  1270 + self.fp = filename
  1271 + elif isinstance(filename, bytes) and len(filename) >= MINIMAL_OLEFILE_SIZE:
  1272 + # filename is a bytes string containing the OLE file to be parsed:
  1273 + # convert it to BytesIO
  1274 + self.fp = io.BytesIO(filename)
  1275 + else:
  1276 + # string-like object: filename of file on disk
  1277 + if self.write_mode:
  1278 + # open file in mode 'read with update, binary'
  1279 + # According to https://docs.python.org/2/library/functions.html#open
  1280 + # 'w' would truncate the file, 'a' may only append on some Unixes
  1281 + mode = 'r+b'
  1282 + else:
  1283 + # read-only mode by default
  1284 + mode = 'rb'
  1285 + self.fp = open(filename, mode)
  1286 + # obtain the filesize by using seek and tell, which should work on most
  1287 + # file-like objects:
  1288 + #TODO: do it above, using getsize with filename when possible?
  1289 + #TODO: fix code to fail with clear exception when filesize cannot be obtained
  1290 + filesize=0
  1291 + self.fp.seek(0, os.SEEK_END)
  1292 + try:
  1293 + filesize = self.fp.tell()
  1294 + finally:
  1295 + self.fp.seek(0)
  1296 + self._filesize = filesize
  1297 + log.debug('File size: %d bytes (%Xh)' % (self._filesize, self._filesize))
  1298 +
  1299 + # lists of streams in FAT and MiniFAT, to detect duplicate references
  1300 + # (list of indexes of first sectors of each stream)
  1301 + self._used_streams_fat = []
  1302 + self._used_streams_minifat = []
  1303 +
  1304 + header = self.fp.read(512)
  1305 +
  1306 + if len(header) != 512 or header[:8] != MAGIC:
  1307 + log.debug('Magic = %r instead of %r' % (header[:8], MAGIC))
  1308 + self._raise_defect(DEFECT_FATAL, "not an OLE2 structured storage file")
  1309 +
  1310 + # [PL] header structure according to AAF specifications:
  1311 + ##Header
  1312 + ##struct StructuredStorageHeader { // [offset from start (bytes), length (bytes)]
  1313 + ##BYTE _abSig[8]; // [00H,08] {0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1,
  1314 + ## // 0x1a, 0xe1} for current version
  1315 + ##CLSID _clsid; // [08H,16] reserved must be zero (WriteClassStg/
  1316 + ## // GetClassFile uses root directory class id)
  1317 + ##USHORT _uMinorVersion; // [18H,02] minor version of the format: 33 is
  1318 + ## // written by reference implementation
  1319 + ##USHORT _uDllVersion; // [1AH,02] major version of the dll/format: 3 for
  1320 + ## // 512-byte sectors, 4 for 4 KB sectors
  1321 + ##USHORT _uByteOrder; // [1CH,02] 0xFFFE: indicates Intel byte-ordering
  1322 + ##USHORT _uSectorShift; // [1EH,02] size of sectors in power-of-two;
  1323 + ## // typically 9 indicating 512-byte sectors
  1324 + ##USHORT _uMiniSectorShift; // [20H,02] size of mini-sectors in power-of-two;
  1325 + ## // typically 6 indicating 64-byte mini-sectors
  1326 + ##USHORT _usReserved; // [22H,02] reserved, must be zero
  1327 + ##ULONG _ulReserved1; // [24H,04] reserved, must be zero
  1328 + ##FSINDEX _csectDir; // [28H,04] must be zero for 512-byte sectors,
  1329 + ## // number of SECTs in directory chain for 4 KB
  1330 + ## // sectors
  1331 + ##FSINDEX _csectFat; // [2CH,04] number of SECTs in the FAT chain
  1332 + ##SECT _sectDirStart; // [30H,04] first SECT in the directory chain
  1333 + ##DFSIGNATURE _signature; // [34H,04] signature used for transactions; must
  1334 + ## // be zero. The reference implementation
  1335 + ## // does not support transactions
  1336 + ##ULONG _ulMiniSectorCutoff; // [38H,04] maximum size for a mini stream;
  1337 + ## // typically 4096 bytes
  1338 + ##SECT _sectMiniFatStart; // [3CH,04] first SECT in the MiniFAT chain
  1339 + ##FSINDEX _csectMiniFat; // [40H,04] number of SECTs in the MiniFAT chain
  1340 + ##SECT _sectDifStart; // [44H,04] first SECT in the DIFAT chain
  1341 + ##FSINDEX _csectDif; // [48H,04] number of SECTs in the DIFAT chain
  1342 + ##SECT _sectFat[109]; // [4CH,436] the SECTs of first 109 FAT sectors
  1343 + ##};
  1344 +
  1345 + # [PL] header decoding:
  1346 + # '<' indicates little-endian byte ordering for Intel (cf. struct module help)
  1347 + fmt_header = '<8s16sHHHHHHLLLLLLLLLL'
  1348 + header_size = struct.calcsize(fmt_header)
  1349 + log.debug( "fmt_header size = %d, +FAT = %d" % (header_size, header_size + 109*4) )
  1350 + header1 = header[:header_size]
  1351 + (
  1352 + self.header_signature,
  1353 + self.header_clsid,
  1354 + self.minor_version,
  1355 + self.dll_version,
  1356 + self.byte_order,
  1357 + self.sector_shift,
  1358 + self.mini_sector_shift,
  1359 + self.reserved1,
  1360 + self.reserved2,
  1361 + self.num_dir_sectors,
  1362 + self.num_fat_sectors,
  1363 + self.first_dir_sector,
  1364 + self.transaction_signature_number,
  1365 + self.mini_stream_cutoff_size,
  1366 + self.first_mini_fat_sector,
  1367 + self.num_mini_fat_sectors,
  1368 + self.first_difat_sector,
  1369 + self.num_difat_sectors
  1370 + ) = struct.unpack(fmt_header, header1)
  1371 + log.debug( struct.unpack(fmt_header, header1))
  1372 +
  1373 + if self.header_signature != MAGIC:
  1374 + # OLE signature should always be present
  1375 + self._raise_defect(DEFECT_FATAL, "incorrect OLE signature")
  1376 + if self.header_clsid != bytearray(16):
  1377 + # according to AAF specs, CLSID should always be zero
  1378 + self._raise_defect(DEFECT_INCORRECT, "incorrect CLSID in OLE header")
  1379 + log.debug( "Minor Version = %d" % self.minor_version )
  1380 + # TODO: according to MS-CFB, minor version should be 0x003E
  1381 + log.debug( "DLL Version = %d (expected: 3 or 4)" % self.dll_version )
  1382 + if self.dll_version not in [3, 4]:
  1383 + # version 3: usual format, 512 bytes per sector
  1384 + # version 4: large format, 4K per sector
  1385 + self._raise_defect(DEFECT_INCORRECT, "incorrect DllVersion in OLE header")
  1386 + log.debug( "Byte Order = %X (expected: FFFE)" % self.byte_order )
  1387 + if self.byte_order != 0xFFFE:
  1388 + # For now only common little-endian documents are handled correctly
  1389 + self._raise_defect(DEFECT_FATAL, "incorrect ByteOrder in OLE header")
  1390 + # TODO: add big-endian support for documents created on Mac ?
  1391 + # But according to [MS-CFB] v20140502, ByteOrder MUST be 0xFFFE.
  1392 + self.sector_size = 2**self.sector_shift
  1393 + log.debug( "Sector Size = %d bytes (expected: 512 or 4096)" % self.sector_size )
  1394 + if self.sector_size not in [512, 4096]:
  1395 + self._raise_defect(DEFECT_INCORRECT, "incorrect sector_size in OLE header")
  1396 + if (self.dll_version==3 and self.sector_size!=512) \
  1397 + or (self.dll_version==4 and self.sector_size!=4096):
  1398 + self._raise_defect(DEFECT_INCORRECT, "sector_size does not match DllVersion in OLE header")
  1399 + self.mini_sector_size = 2**self.mini_sector_shift
  1400 + log.debug( "MiniFAT Sector Size = %d bytes (expected: 64)" % self.mini_sector_size )
  1401 + if self.mini_sector_size not in [64]:
  1402 + self._raise_defect(DEFECT_INCORRECT, "incorrect mini_sector_size in OLE header")
  1403 + if self.reserved1 != 0 or self.reserved2 != 0:
  1404 + self._raise_defect(DEFECT_INCORRECT, "incorrect OLE header (non-null reserved bytes)")
  1405 + log.debug( "Number of Directory sectors = %d" % self.num_dir_sectors )
  1406 + # Number of directory sectors (only allowed if DllVersion != 3)
  1407 + if self.sector_size==512 and self.num_dir_sectors!=0:
  1408 + self._raise_defect(DEFECT_INCORRECT, "incorrect number of directory sectors in OLE header")
  1409 + log.debug( "Number of FAT sectors = %d" % self.num_fat_sectors )
  1410 + # num_fat_sectors = number of FAT sectors in the file
  1411 + log.debug( "First Directory sector = %Xh" % self.first_dir_sector )
  1412 + # first_dir_sector = 1st sector containing the directory
  1413 + log.debug( "Transaction Signature Number = %d" % self.transaction_signature_number )
  1414 + # Signature should be zero, BUT some implementations do not follow this
  1415 + # rule => only a potential defect:
  1416 + # (according to MS-CFB, may be != 0 for applications supporting file
  1417 + # transactions)
  1418 + if self.transaction_signature_number != 0:
  1419 + self._raise_defect(DEFECT_POTENTIAL, "incorrect OLE header (transaction_signature_number>0)")
  1420 + log.debug( "Mini Stream cutoff size = %Xh (expected: 1000h)" % self.mini_stream_cutoff_size )
  1421 + # MS-CFB: This integer field MUST be set to 0x00001000. This field
  1422 + # specifies the maximum size of a user-defined data stream allocated
  1423 + # from the mini FAT and mini stream, and that cutoff is 4096 bytes.
  1424 + # Any user-defined data stream larger than or equal to this cutoff size
  1425 + # must be allocated as normal sectors from the FAT.
  1426 + if self.mini_stream_cutoff_size != 0x1000:
  1427 + self._raise_defect(DEFECT_INCORRECT, "incorrect mini_stream_cutoff_size in OLE header")
  1428 + # if no exception is raised, the cutoff size is fixed to 0x1000
  1429 + log.warning('Fixing the mini_stream_cutoff_size to 4096 (mandatory value) instead of %d' %
  1430 + self.mini_stream_cutoff_size)
  1431 + self.mini_stream_cutoff_size = 0x1000
  1432 + # TODO: check if these values are OK
  1433 + log.debug( "First MiniFAT sector = %Xh" % self.first_mini_fat_sector )
  1434 + log.debug( "Number of MiniFAT sectors = %d" % self.num_mini_fat_sectors )
  1435 + log.debug( "First DIFAT sector = %Xh" % self.first_difat_sector )
  1436 + log.debug( "Number of DIFAT sectors = %d" % self.num_difat_sectors )
  1437 +
  1438 + # calculate the number of sectors in the file
  1439 + # (-1 because header doesn't count)
  1440 + self.nb_sect = ( (filesize + self.sector_size-1) // self.sector_size) - 1
  1441 + log.debug( "Maximum number of sectors in the file: %d (%Xh)" % (self.nb_sect, self.nb_sect))
  1442 + #TODO: change this test, because an OLE file MAY contain other data
  1443 + # after the last sector.
  1444 +
  1445 + # file clsid
  1446 + self.header_clsid = _clsid(header[8:24])
  1447 +
  1448 + #TODO: remove redundant attributes, and fix the code which uses them?
  1449 + self.sectorsize = self.sector_size #1 << i16(header, 30)
  1450 + self.minisectorsize = self.mini_sector_size #1 << i16(header, 32)
  1451 + self.minisectorcutoff = self.mini_stream_cutoff_size # i32(header, 56)
  1452 +
  1453 + # check known streams for duplicate references (these are always in FAT,
  1454 + # never in MiniFAT):
  1455 + self._check_duplicate_stream(self.first_dir_sector)
  1456 + # check MiniFAT only if it is not empty:
  1457 + if self.num_mini_fat_sectors:
  1458 + self._check_duplicate_stream(self.first_mini_fat_sector)
  1459 + # check DIFAT only if it is not empty:
  1460 + if self.num_difat_sectors:
  1461 + self._check_duplicate_stream(self.first_difat_sector)
  1462 +
  1463 + # Load file allocation tables
  1464 + self.loadfat(header)
  1465 + # Load directory. This sets both the direntries list (ordered by sid)
  1466 + # and the root (ordered by hierarchy) members.
  1467 + self.loaddirectory(self.first_dir_sector)
  1468 + self.ministream = None
  1469 + self.minifatsect = self.first_mini_fat_sector
  1470 +
  1471 +
  1472 + def close(self):
  1473 + """
  1474 + close the OLE file, to release the file object
  1475 + """
  1476 + self.fp.close()
  1477 +
  1478 +
  1479 + def _check_duplicate_stream(self, first_sect, minifat=False):
  1480 + """
  1481 + Check that a stream has not already been referenced elsewhere.
  1482 + This method should only be called once for each known stream, and only
  1483 + if the stream size is non-zero.
  1484 +
  1485 + :param first_sect: int, index of first sector of the stream in FAT
  1486 + :param minifat: bool, if True, stream is located in the MiniFAT, else in the FAT
  1487 + """
  1488 + if minifat:
  1489 + log.debug('_check_duplicate_stream: sect=%Xh in MiniFAT' % first_sect)
  1490 + used_streams = self._used_streams_minifat
  1491 + else:
  1492 + log.debug('_check_duplicate_stream: sect=%Xh in FAT' % first_sect)
  1493 + # some values can be safely ignored (not a real stream):
  1494 + if first_sect in (DIFSECT,FATSECT,ENDOFCHAIN,FREESECT):
  1495 + return
  1496 + used_streams = self._used_streams_fat
  1497 + #TODO: would it be more efficient using a dict or hash values, instead
  1498 + # of a list of long ?
  1499 + if first_sect in used_streams:
  1500 + self._raise_defect(DEFECT_INCORRECT, 'Stream referenced twice')
  1501 + else:
  1502 + used_streams.append(first_sect)
  1503 +
  1504 +
  1505 + def dumpfat(self, fat, firstindex=0):
  1506 + """
  1507 + Display a part of FAT in human-readable form for debugging purposes
  1508 + """
  1509 + # dictionary to convert special FAT values in human-readable strings
  1510 + VPL = 8 # values per line (8+1 * 8+1 = 81)
  1511 + fatnames = {
  1512 + FREESECT: "..free..",
  1513 + ENDOFCHAIN: "[ END. ]",
  1514 + FATSECT: "FATSECT ",
  1515 + DIFSECT: "DIFSECT "
  1516 + }
  1517 + nbsect = len(fat)
  1518 + nlines = (nbsect+VPL-1)//VPL
  1519 + print("index", end=" ")
  1520 + for i in range(VPL):
  1521 + print("%8X" % i, end=" ")
  1522 + print()
  1523 + for l in range(nlines):
  1524 + index = l*VPL
  1525 + print("%6X:" % (firstindex+index), end=" ")
  1526 + for i in range(index, index+VPL):
  1527 + if i>=nbsect:
  1528 + break
  1529 + sect = fat[i]
  1530 + aux = sect & 0xFFFFFFFF # JYTHON-WORKAROUND
  1531 + if aux in fatnames:
  1532 + name = fatnames[aux]
  1533 + else:
  1534 + if sect == i+1:
  1535 + name = " --->"
  1536 + else:
  1537 + name = "%8X" % sect
  1538 + print(name, end=" ")
  1539 + print()
  1540 +
  1541 +
  1542 + def dumpsect(self, sector, firstindex=0):
  1543 + """
  1544 + Display a sector in a human-readable form, for debugging purposes
  1545 + """
  1546 + VPL=8 # number of values per line (8+1 * 8+1 = 81)
  1547 + tab = array.array(UINT32, sector)
  1548 + if sys.byteorder == 'big':
  1549 + tab.byteswap()
  1550 + nbsect = len(tab)
  1551 + nlines = (nbsect+VPL-1)//VPL
  1552 + print("index", end=" ")
  1553 + for i in range(VPL):
  1554 + print("%8X" % i, end=" ")
  1555 + print()
  1556 + for l in range(nlines):
  1557 + index = l*VPL
  1558 + print("%6X:" % (firstindex+index), end=" ")
  1559 + for i in range(index, index+VPL):
  1560 + if i>=nbsect:
  1561 + break
  1562 + sect = tab[i]
  1563 + name = "%8X" % sect
  1564 + print(name, end=" ")
  1565 + print()
  1566 +
  1567 + def sect2array(self, sect):
  1568 + """
  1569 + convert a sector to an array of 32-bit unsigned integers,
  1570 + swapping bytes on big-endian CPUs such as PowerPC (old Macs)
  1571 + """
  1572 + a = array.array(UINT32, sect)
  1573 + # if CPU is big endian, swap bytes:
  1574 + if sys.byteorder == 'big':
  1575 + a.byteswap()
  1576 + return a
  1577 +
  1578 +
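As a standalone sketch of what `sect2array` does (assuming `UINT32` resolves to a 4-byte array typecode, as olefile selects at import time):

```python
import array
import sys

# two 32-bit little-endian integers, as stored on disk in an OLE file
raw = b'\x01\x00\x00\x00\x02\x00\x00\x00'
a = array.array('I', raw)  # 'I' is 4 bytes on common platforms
if sys.byteorder == 'big':
    # on big-endian CPUs the bytes must be swapped to read LE values
    a.byteswap()
print(list(a))  # [1, 2]
```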
  1579 + def loadfat_sect(self, sect):
  1580 + """
  1581 + Adds the indexes of the given sector to the FAT
  1582 +
  1583 + :param sect: string containing the first FAT sector, or array of long integers
  1584 + :returns: index of last FAT sector.
  1585 + """
  1586 + # a FAT sector is an array of ulong integers.
  1587 + if isinstance(sect, array.array):
  1588 + # if sect is already an array it is directly used
  1589 + fat1 = sect
  1590 + else:
  1591 + # if it's a raw sector, it is parsed in an array
  1592 + fat1 = self.sect2array(sect)
  1593 + # Display the sector contents only if the logging level is debug:
  1594 + if log.isEnabledFor(logging.DEBUG):
  1595 + self.dumpsect(sect)
  1596 + # The FAT is a sector chain starting at the first index of itself.
  1597 + # initialize isect, just in case:
  1598 + isect = None
  1599 + for isect in fat1:
  1600 + isect = isect & 0xFFFFFFFF # JYTHON-WORKAROUND
  1601 + log.debug("isect = %X" % isect)
  1602 + if isect == ENDOFCHAIN or isect == FREESECT:
  1603 + # the end of the sector chain has been reached
  1604 + log.debug("found end of sector chain")
  1605 + break
  1606 + # read the FAT sector
  1607 + s = self.getsect(isect)
  1608 + # parse it as an array of 32 bits integers, and add it to the
  1609 + # global FAT array
  1610 + nextfat = self.sect2array(s)
  1611 + self.fat = self.fat + nextfat
  1612 + return isect
  1613 +
  1614 +
  1615 + def loadfat(self, header):
  1616 + """
  1617 + Load the FAT table.
  1618 + """
  1619 + # The 1st sector of the file contains sector numbers for the first 109
  1620 + # FAT sectors, right after the header which is 76 bytes long.
  1621 + # (always 109, whatever the sector size: 512 bytes = 76+4*109)
  1622 + # Additional sectors are described by DIF blocks
  1623 +
  1624 + log.debug('Loading the FAT table, starting with the 1st sector after the header')
  1625 + sect = header[76:512]
  1626 + log.debug( "len(sect)=%d, so %d integers" % (len(sect), len(sect)//4) )
  1627 + #fat = []
  1628 + # [PL] FAT is an array of 32 bits unsigned ints, it's more effective
  1629 + # to use an array than a list in Python.
  1630 + # It's initialized as empty first:
  1631 + self.fat = array.array(UINT32)
  1632 + self.loadfat_sect(sect)
  1633 + #self.dumpfat(self.fat)
  1634 +## for i in range(0, len(sect), 4):
  1635 +## ix = i32(sect, i)
  1636 +## #[PL] if ix == -2 or ix == -1: # ix == 0xFFFFFFFE or ix == 0xFFFFFFFF:
  1637 +## if ix == 0xFFFFFFFE or ix == 0xFFFFFFFF:
  1638 +## break
  1639 +## s = self.getsect(ix)
  1640 +## #fat = fat + [i32(s, i) for i in range(0, len(s), 4)]
  1641 +## fat = fat + array.array(UINT32, s)
  1642 + if self.num_difat_sectors != 0:
  1643 + log.debug('DIFAT is used, because file size > 6.8MB.')
  1644 + # [PL] There's a DIFAT because file is larger than 6.8MB
  1645 + # some checks just in case:
  1646 + if self.num_fat_sectors <= 109:
  1647 + # there must be at least 109 blocks in header and the rest in
  1648 + # DIFAT, so number of sectors must be >109.
  1649 + self._raise_defect(DEFECT_INCORRECT, 'incorrect DIFAT, not enough sectors')
  1650 + if self.first_difat_sector >= self.nb_sect:
  1651 + # initial DIFAT block index must be valid
  1652 + self._raise_defect(DEFECT_FATAL, 'incorrect DIFAT, first index out of range')
  1653 + log.debug( "DIFAT analysis..." )
  1654 + # We compute the necessary number of DIFAT sectors:
  1655 + # Number of pointers per DIFAT sector = (sectorsize/4)-1
  1656 + # (-1 because the last pointer is the next DIFAT sector number)
  1657 + nb_difat_sectors = (self.sectorsize//4)-1
  1658 + # (if 512 bytes: each DIFAT sector = 127 pointers + 1 towards next DIFAT sector)
  1659 + nb_difat = (self.num_fat_sectors-109 + nb_difat_sectors-1)//nb_difat_sectors
  1660 + log.debug( "nb_difat = %d" % nb_difat )
  1661 + if self.num_difat_sectors != nb_difat:
  1662 + raise IOError('incorrect DIFAT')
  1663 + isect_difat = self.first_difat_sector
  1664 + for i in iterrange(nb_difat):
  1665 + log.debug( "DIFAT block %d, sector %X" % (i, isect_difat) )
  1666 + #TODO: check if corresponding FAT SID = DIFSECT
  1667 + sector_difat = self.getsect(isect_difat)
  1668 + difat = self.sect2array(sector_difat)
  1669 + # Display the sector contents only if the logging level is debug:
  1670 + if log.isEnabledFor(logging.DEBUG):
  1671 + self.dumpsect(sector_difat)
  1672 + self.loadfat_sect(difat[:nb_difat_sectors])
  1673 + # last DIFAT pointer is next DIFAT sector:
  1674 + isect_difat = difat[nb_difat_sectors]
  1675 + log.debug( "next DIFAT sector: %X" % isect_difat )
  1676 + # checks:
  1677 + if isect_difat not in [ENDOFCHAIN, FREESECT]:
  1678 + # last DIFAT pointer value must be ENDOFCHAIN or FREESECT
  1679 + raise IOError('incorrect end of DIFAT')
  1680 +## if len(self.fat) != self.num_fat_sectors:
  1681 +## # FAT should contain num_fat_sectors blocks
  1682 +## print("FAT length: %d instead of %d" % (len(self.fat), self.num_fat_sectors))
  1683 +## raise IOError('incorrect DIFAT')
  1684 + else:
  1685 + log.debug('No DIFAT, because file size < 6.8MB.')
  1686 + # since FAT is read from fixed-size sectors, it may contain more values
  1687 + # than the actual number of sectors in the file.
  1688 + # Keep only the relevant sector indexes:
  1689 + if len(self.fat) > self.nb_sect:
  1690 + log.debug('len(fat)=%d, shrunk to nb_sect=%d' % (len(self.fat), self.nb_sect))
  1691 + self.fat = self.fat[:self.nb_sect]
  1692 + log.debug('FAT references %d sectors / Maximum %d sectors in file' % (len(self.fat), self.nb_sect))
  1693 + # Display the FAT contents only if the logging level is debug:
  1694 + if log.isEnabledFor(logging.DEBUG):
  1695 + log.debug('\nFAT:')
  1696 + self.dumpfat(self.fat)
  1697 +
  1698 +
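The DIFAT sizing arithmetic in `loadfat` can be checked in isolation. This hypothetical helper mirrors the `nb_difat` computation: the header holds the first 109 FAT sector indexes, and each DIFAT sector holds `sectorsize/4 - 1` more, its last 32-bit slot chaining to the next DIFAT sector:

```python
def difat_sector_count(num_fat_sectors, sector_size=512):
    """Hypothetical helper mirroring loadfat()'s nb_difat computation."""
    # the OLE header holds the first 109 FAT sector indexes
    extra = num_fat_sectors - 109
    # each DIFAT sector holds (sector_size/4 - 1) indexes; the last
    # 32-bit slot points to the next DIFAT sector
    per_sector = sector_size // 4 - 1  # 127 for 512-byte sectors
    # round up to the number of DIFAT sectors needed
    return (extra + per_sector - 1) // per_sector

print(difat_sector_count(110))  # 1: one index spills past the header
print(difat_sector_count(236))  # 1: 109 + 127 indexes exactly
print(difat_sector_count(237))  # 2
```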
  1699 + def loadminifat(self):
  1700 + """
  1701 + Load the MiniFAT table.
  1702 + """
  1703 + # MiniFAT is stored in a standard sub-stream, pointed to by a header
  1704 + # field.
  1705 + # NOTE: there are two sizes to take into account for this stream:
  1706 + # 1) Stream size is calculated according to the number of sectors
  1707 + # declared in the OLE header. This allocated stream may be more than
  1708 + # needed to store the actual sector indexes.
  1709 + # (self.num_mini_fat_sectors is the number of sectors of size self.sector_size)
  1710 + stream_size = self.num_mini_fat_sectors * self.sector_size
  1711 + # 2) Actually used size is calculated by dividing the MiniStream size
  1712 + # (given by root entry size) by the size of mini sectors, *4 for
  1713 + # 32 bits indexes:
  1714 + nb_minisectors = (self.root.size + self.mini_sector_size-1) // self.mini_sector_size
  1715 + used_size = nb_minisectors * 4
  1716 + log.debug('loadminifat(): minifatsect=%d, nb FAT sectors=%d, used_size=%d, stream_size=%d, nb MiniSectors=%d' %
  1717 + (self.minifatsect, self.num_mini_fat_sectors, used_size, stream_size, nb_minisectors))
  1718 + if used_size > stream_size:
  1719 + # This is not really a problem, but may indicate a wrong implementation:
  1720 + self._raise_defect(DEFECT_INCORRECT, 'OLE MiniStream is larger than MiniFAT')
  1721 + # In any case, first read stream_size:
  1722 + s = self._open(self.minifatsect, stream_size, force_FAT=True).read()
  1723 + #[PL] Old code replaced by an array:
  1724 + #self.minifat = [i32(s, i) for i in range(0, len(s), 4)]
  1725 + self.minifat = self.sect2array(s)
  1726 + # Then shrink the array to used size, to avoid indexes out of MiniStream:
  1727 + log.debug('MiniFAT shrunk from %d to %d sectors' % (len(self.minifat), nb_minisectors))
  1728 + self.minifat = self.minifat[:nb_minisectors]
  1729 + log.debug('loadminifat(): len=%d' % len(self.minifat))
  1730 + # Display the FAT contents only if the logging level is debug:
  1731 + if log.isEnabledFor(logging.DEBUG):
  1732 + log.debug('\nMiniFAT:')
  1733 + self.dumpfat(self.minifat)
  1734 +
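The "actually used size" compared against the allocated stream size in `loadminifat` can be sketched independently (64-byte mini sectors and 4-byte FAT entries are the usual values; the helper name is illustrative):

```python
def minifat_used_size(ministream_size, mini_sector_size=64):
    """Hypothetical helper: bytes of MiniFAT actually needed."""
    # number of mini sectors needed to hold the MiniStream (round up)...
    nb_minisectors = (ministream_size + mini_sector_size - 1) // mini_sector_size
    # ...times 4 bytes per 32-bit MiniFAT entry
    return nb_minisectors * 4

print(minifat_used_size(64))  # 4: one mini sector, one 4-byte entry
print(minifat_used_size(65))  # 8: spills into a second mini sector
```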
  1735 + def getsect(self, sect):
  1736 + """
  1737 + Read given sector from file on disk.
  1738 +
  1739 + :param sect: int, sector index
  1740 + :returns: a string containing the sector data.
  1741 + """
  1742 + # From [MS-CFB]: A sector number can be converted into a byte offset
  1743 + # into the file by using the following formula:
  1744 + # (sector number + 1) x Sector Size.
  1745 + # This implies that sector #0 of the file begins at byte offset Sector
  1746 + # Size, not at 0.
  1747 +
  1748 + # [PL] the original code in PIL was wrong when sectors are 4KB instead of
  1749 + # 512 bytes:
  1750 + #self.fp.seek(512 + self.sectorsize * sect)
  1751 + #[PL]: added safety checks:
  1752 + #print("getsect(%X)" % sect)
  1753 + try:
  1754 + self.fp.seek(self.sectorsize * (sect+1))
  1755 + except Exception:
  1756 + log.debug('getsect(): sect=%X, seek=%d, filesize=%d' %
  1757 + (sect, self.sectorsize*(sect+1), self._filesize))
  1758 + self._raise_defect(DEFECT_FATAL, 'OLE sector index out of range')
  1759 + sector = self.fp.read(self.sectorsize)
  1760 + if len(sector) != self.sectorsize:
  1761 + log.debug('getsect(): sect=%X, read=%d, sectorsize=%d' %
  1762 + (sect, len(sector), self.sectorsize))
  1763 + self._raise_defect(DEFECT_FATAL, 'incomplete OLE sector')
  1764 + return sector
  1765 +
  1766 +
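The [MS-CFB] sector-to-offset formula quoted in `getsect` can be illustrated with a small sketch (function name is illustrative):

```python
def sector_offset(sect, sector_size=512):
    # [MS-CFB]: byte offset = (sector number + 1) x sector size,
    # because the 512-byte header occupies the space before sector #0
    return (sect + 1) * sector_size

print(sector_offset(0))        # 512: sector #0 starts right after the header
print(sector_offset(3, 4096))  # 16384: the original PIL code got 4KB sectors wrong
```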
  1767 + def write_sect(self, sect, data, padding=b'\x00'):
  1768 + """
  1769 + Write given sector to file on disk.
  1770 +
  1771 + :param sect: int, sector index
  1772 + :param data: bytes, sector data
  1773 + :param padding: single byte, padding character if data < sector size
  1774 + """
  1775 + if not isinstance(data, bytes):
  1776 + raise TypeError("write_sect: data must be a bytes string")
  1777 + if not isinstance(padding, bytes) or len(padding)!=1:
  1778 + raise TypeError("write_sect: padding must be a bytes string of 1 char")
  1779 + #TODO: we could allow padding=None for no padding at all
  1780 + try:
  1781 + self.fp.seek(self.sectorsize * (sect+1))
  1782 + except Exception:
  1783 + log.debug('write_sect(): sect=%X, seek=%d, filesize=%d' %
  1784 + (sect, self.sectorsize*(sect+1), self._filesize))
  1785 + self._raise_defect(DEFECT_FATAL, 'OLE sector index out of range')
  1786 + if len(data) < self.sectorsize:
  1787 + # add padding
  1788 + data += padding * (self.sectorsize - len(data))
  1789 + elif len(data) > self.sectorsize:
  1790 + raise ValueError("Data is larger than sector size")
  1791 + self.fp.write(data)
  1792 +
  1793 +
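The padding behaviour of `write_sect` for a short final sector can be sketched as follows (values illustrative):

```python
sector_size = 512
padding = b'\x00'
data = b'last sector payload'
# pad the short final sector up to a full sector, as write_sect() does
padded = data + padding * (sector_size - len(data))
print(len(padded))              # 512
print(padded.startswith(data))  # True
```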
  1794 + def loaddirectory(self, sect):
  1795 + """
  1796 + Load the directory.
  1797 +
  1798 + :param sect: sector index of directory stream.
  1799 + """
  1800 + log.debug('Loading the Directory:')
  1801 + # The directory is stored in a standard
  1802 + # substream, independent of its size.
  1803 +
  1804 + # open directory stream as a read-only file:
  1805 + # (stream size is not known in advance)
  1806 + self.directory_fp = self._open(sect)
  1807 +
  1808 + #[PL] to detect malformed documents and avoid DoS attacks, the maximum
  1809 + # number of directory entries can be calculated:
  1810 + max_entries = self.directory_fp.size // 128
  1811 + log.debug('loaddirectory: size=%d, max_entries=%d' %
  1812 + (self.directory_fp.size, max_entries))
  1813 +
  1814 + # Create list of directory entries
  1815 + #self.direntries = []
  1816 + # We start with a list of "None" object
  1817 + self.direntries = [None] * max_entries
  1818 +## for sid in iterrange(max_entries):
  1819 +## entry = fp.read(128)
  1820 +## if not entry:
  1821 +## break
  1822 +## self.direntries.append(OleDirectoryEntry(entry, sid, self))
  1823 + # load root entry:
  1824 + root_entry = self._load_direntry(0)
  1825 + # Root entry is the first entry:
  1826 + self.root = self.direntries[0]
  1827 + # TODO: read ALL directory entries (ignore bad entries?)
  1828 + # TODO: adapt build_storage_tree to avoid duplicate reads
  1829 + # for i in range(1, max_entries):
  1830 + # self._load_direntry(i)
  1831 + # read and build all storage trees, starting from the root:
  1832 + self.root.build_storage_tree()
  1833 +
  1834 +
  1835 + def _load_direntry (self, sid):
  1836 + """
  1837 + Load a directory entry from the directory.
  1838 + This method should only be called once for each storage/stream when
  1839 + loading the directory.
  1840 +
  1841 + :param sid: index of storage/stream in the directory.
  1842 + :returns: a OleDirectoryEntry object
  1843 +
  1844 + :exception IOError: if the entry has already been referenced.
  1845 + """
  1846 + # check if SID is OK:
  1847 + if sid<0 or sid>=len(self.direntries):
  1848 + self._raise_defect(DEFECT_FATAL, "OLE directory index out of range")
  1849 + # check if entry was already referenced:
  1850 + if self.direntries[sid] is not None:
  1851 + self._raise_defect(DEFECT_INCORRECT,
  1852 + "double reference for OLE stream/storage")
  1853 + # if exception not raised, return the object
  1854 + return self.direntries[sid]
  1855 + self.directory_fp.seek(sid * 128)
  1856 + entry = self.directory_fp.read(128)
  1857 + self.direntries[sid] = OleDirectoryEntry(entry, sid, self)
  1858 + return self.direntries[sid]
  1859 +
  1860 +
  1861 + def dumpdirectory(self):
  1862 + """
  1863 + Dump directory (for debugging only)
  1864 + """
  1865 + self.root.dump()
  1866 +
  1867 +
  1868 + def _open(self, start, size = UNKNOWN_SIZE, force_FAT=False):
  1869 + """
  1870 + Open a stream, either in FAT or MiniFAT according to its size.
  1871 + (openstream helper)
  1872 +
  1873 + :param start: index of first sector
  1874 + :param size: size of stream (or nothing if size is unknown)
  1875 + :param force_FAT: if False (default), stream will be opened in FAT or MiniFAT
  1876 + according to size. If True, it will always be opened in FAT.
  1877 + """
  1878 + log.debug('OleFileIO.open(): sect=%Xh, size=%d, force_FAT=%s' %
  1879 + (start, size, str(force_FAT)))
  1880 + # stream size is compared to the mini_stream_cutoff_size threshold:
  1881 + if size < self.minisectorcutoff and not force_FAT:
  1882 + # ministream object
  1883 + if not self.ministream:
  1884 + # load MiniFAT if it wasn't already done:
  1885 + self.loadminifat()
  1886 + # The first sector index of the miniFAT stream is stored in the
  1887 + # root directory entry:
  1888 + size_ministream = self.root.size
  1889 + log.debug('Opening MiniStream: sect=%Xh, size=%d' %
  1890 + (self.root.isectStart, size_ministream))
  1891 + self.ministream = self._open(self.root.isectStart,
  1892 + size_ministream, force_FAT=True)
  1893 + return OleStream(fp=self.ministream, sect=start, size=size,
  1894 + offset=0, sectorsize=self.minisectorsize,
  1895 + fat=self.minifat, filesize=self.ministream.size,
  1896 + olefileio=self)
  1897 + else:
  1898 + # standard stream
  1899 + return OleStream(fp=self.fp, sect=start, size=size,
  1900 + offset=self.sectorsize,
  1901 + sectorsize=self.sectorsize, fat=self.fat,
  1902 + filesize=self._filesize,
  1903 + olefileio=self)
  1904 +
  1905 +
  1906 + def _list(self, files, prefix, node, streams=True, storages=False):
  1907 + """
  1908 + listdir helper
  1909 +
  1910 + :param files: list of files to fill in
  1911 + :param prefix: current location in storage tree (list of names)
  1912 + :param node: current node (OleDirectoryEntry object)
  1913 + :param streams: bool, include streams if True (True by default) - new in v0.26
  1914 + :param storages: bool, include storages if True (False by default) - new in v0.26
  1915 + (note: the root storage is never included)
  1916 + """
  1917 + prefix = prefix + [node.name]
  1918 + for entry in node.kids:
  1919 + if entry.entry_type == STGTY_STORAGE:
  1920 + # this is a storage
  1921 + if storages:
  1922 + # add it to the list
  1923 + files.append(prefix[1:] + [entry.name])
  1924 + # check its kids
  1925 + self._list(files, prefix, entry, streams, storages)
  1926 + elif entry.entry_type == STGTY_STREAM:
  1927 + # this is a stream
  1928 + if streams:
  1929 + # add it to the list
  1930 + files.append(prefix[1:] + [entry.name])
  1931 + else:
  1932 + self._raise_defect(DEFECT_INCORRECT, 'The directory tree contains an entry which is neither a stream nor a storage.')
  1933 +
  1934 +
  1935 + def listdir(self, streams=True, storages=False):
  1936 + """
  1937 + Return a list of streams and/or storages stored in this file
  1938 +
  1939 + :param streams: bool, include streams if True (True by default) - new in v0.26
  1940 + :param storages: bool, include storages if True (False by default) - new in v0.26
  1941 + (note: the root storage is never included)
  1942 + :returns: list of stream and/or storage paths
  1943 + """
  1944 + files = []
  1945 + self._list(files, [], self.root, streams, storages)
  1946 + return files
  1947 +
  1948 +
  1949 + def _find(self, filename):
  1950 + """
  1951 + Return the SID of the directory entry matching the given filename. (openstream helper)
  1952 + Note: this method is case-insensitive.
  1953 +
  1954 + :param filename: path of stream in storage tree (except root entry), either:
  1955 +
  1956 + - a string using Unix path syntax, for example:
  1957 + 'storage_1/storage_1.2/stream'
  1958 + - or a list of storage filenames, path to the desired stream/storage.
  1959 + Example: ['storage_1', 'storage_1.2', 'stream']
  1960 +
  1961 + :returns: sid of requested filename
  1962 + :exception IOError: if file not found
  1963 + """
  1964 +
  1965 + # if filename is a string instead of a list, split it on slashes to
  1966 + # convert to a list:
  1967 + if isinstance(filename, basestring):
  1968 + filename = filename.split('/')
  1969 + # walk across storage tree, following given path:
  1970 + node = self.root
  1971 + for name in filename:
  1972 + for kid in node.kids:
  1973 + if kid.name.lower() == name.lower():
  1974 + break
  1975 + else:
  1976 + raise IOError("file not found")
  1977 + node = kid
  1978 + return node.sid
  1979 +
  1980 +
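The two path forms accepted by `_find` (and thus `openstream`) are equivalent; a quick sketch of the conversion and of the case-insensitive matching:

```python
# a string path is split on slashes into a list of storage names
path = 'storage_1/storage_1.2/stream'
parts = path.split('/')
print(parts)  # ['storage_1', 'storage_1.2', 'stream']

# matching against directory entry names is case-insensitive
entry_name = 'Stream'
print(entry_name.lower() == parts[-1].lower())  # True
```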
  1981 + def openstream(self, filename):
  1982 + """
  1983 + Open a stream as a read-only file object (BytesIO).
  1984 + Note: filename is case-insensitive.
  1985 +
  1986 + :param filename: path of stream in storage tree (except root entry), either:
  1987 +
  1988 + - a string using Unix path syntax, for example:
  1989 + 'storage_1/storage_1.2/stream'
  1990 + - or a list of storage filenames, path to the desired stream/storage.
  1991 + Example: ['storage_1', 'storage_1.2', 'stream']
  1992 +
  1993 + :returns: file object (read-only)
  1994 + :exception IOError: if filename not found, or if this is not a stream.
  1995 + """
  1996 + sid = self._find(filename)
  1997 + entry = self.direntries[sid]
  1998 + if entry.entry_type != STGTY_STREAM:
  1999 + raise IOError("this file is not a stream")
  2000 + return self._open(entry.isectStart, entry.size)
  2001 +
  2002 +
  2003 + def write_stream(self, stream_name, data):
  2004 + """
  2005 + Write a stream to disk. For now, it is only possible to replace an
  2006 + existing stream by data of the same size.
  2007 +
  2008 + :param stream_name: path of stream in storage tree (except root entry), either:
  2009 +
  2010 + - a string using Unix path syntax, for example:
  2011 + 'storage_1/storage_1.2/stream'
  2012 + - or a list of storage filenames, path to the desired stream/storage.
  2013 + Example: ['storage_1', 'storage_1.2', 'stream']
  2014 +
  2015 + :param data: bytes, data to be written, must be the same size as the original
  2016 + stream.
  2017 + """
  2018 + if not isinstance(data, bytes):
  2019 + raise TypeError("write_stream: data must be a bytes string")
  2020 + sid = self._find(stream_name)
  2021 + entry = self.direntries[sid]
  2022 + if entry.entry_type != STGTY_STREAM:
  2023 + raise IOError("this is not a stream")
  2024 + size = entry.size
  2025 + if size != len(data):
  2026 + raise ValueError("write_stream: data must be the same size as the existing stream")
  2027 + if size < self.minisectorcutoff:
  2028 + raise NotImplementedError("Writing a stream in MiniFAT is not implemented yet")
  2029 + sect = entry.isectStart
  2030 + # number of sectors to write
  2031 + nb_sectors = (size + (self.sectorsize-1)) // self.sectorsize
  2032 + log.debug('nb_sectors = %d' % nb_sectors)
  2033 + for i in range(nb_sectors):
  2034 +## try:
  2035 +## self.fp.seek(offset + self.sectorsize * sect)
  2036 +## except:
  2037 +## log.debug('sect=%d, seek=%d' %
  2038 +## (sect, offset+self.sectorsize*sect))
  2039 +## raise IOError('OLE sector index out of range')
  2040 + # extract one sector from data, the last one being smaller:
  2041 + if i<(nb_sectors-1):
  2042 + data_sector = data [i*self.sectorsize : (i+1)*self.sectorsize]
  2043 + #TODO: comment this if it works
  2044 + assert(len(data_sector)==self.sectorsize)
  2045 + else:
  2046 + data_sector = data [i*self.sectorsize:]
  2047 + #TODO: comment this if it works
  2048 + log.debug('write_stream: size=%d sectorsize=%d data_sector=%Xh size%%sectorsize=%d'
  2049 + % (size, self.sectorsize, len(data_sector), size % self.sectorsize))
  2050 + assert(len(data_sector) % self.sectorsize==size % self.sectorsize)
  2051 + self.write_sect(sect, data_sector)
  2052 +## self.fp.write(data_sector)
  2053 + # jump to next sector in the FAT:
  2054 + try:
  2055 + sect = self.fat[sect]
  2056 + except IndexError:
  2057 + # [PL] if pointer is out of the FAT an exception is raised
  2058 + raise IOError('incorrect OLE FAT, sector index out of range')
  2059 + #[PL] Last sector should be a "end of chain" marker:
  2060 + if sect != ENDOFCHAIN:
  2061 + raise IOError('incorrect last sector index in OLE stream')
  2062 +
  2063 +
  2064 + def get_type(self, filename):
  2065 + """
  2066 + Test if given filename exists as a stream or a storage in the OLE
  2067 + container, and return its type.
  2068 +
  2069 + :param filename: path of stream in storage tree. (see openstream for syntax)
  2070 + :returns: False if object does not exist, its entry type (>0) otherwise:
  2071 +
  2072 + - STGTY_STREAM: a stream
  2073 + - STGTY_STORAGE: a storage
  2074 + - STGTY_ROOT: the root entry
  2075 + """
  2076 + try:
  2077 + sid = self._find(filename)
  2078 + entry = self.direntries[sid]
  2079 + return entry.entry_type
  2080 + except Exception:
  2081 + return False
  2082 +
  2083 +
  2084 + def getmtime(self, filename):
  2085 + """
  2086 + Return modification time of a stream/storage.
  2087 +
  2088 + :param filename: path of stream/storage in storage tree. (see openstream for
  2089 + syntax)
  2090 + :returns: None if modification time is null, a python datetime object
  2091 + otherwise (UTC timezone)
  2092 +
  2093 + new in version 0.26
  2094 + """
  2095 + sid = self._find(filename)
  2096 + entry = self.direntries[sid]
  2097 + return entry.getmtime()
  2098 +
  2099 +
  2100 + def getctime(self, filename):
  2101 + """
  2102 + Return creation time of a stream/storage.
  2103 +
  2104 + :param filename: path of stream/storage in storage tree. (see openstream for
  2105 + syntax)
  2106 + :returns: None if creation time is null, a python datetime object
  2107 + otherwise (UTC timezone)
  2108 +
  2109 + new in version 0.26
  2110 + """
  2111 + sid = self._find(filename)
  2112 + entry = self.direntries[sid]
  2113 + return entry.getctime()
  2114 +
  2115 +
  2116 + def exists(self, filename):
  2117 + """
  2118 + Test if given filename exists as a stream or a storage in the OLE
  2119 + container.
  2120 + Note: filename is case-insensitive.
  2121 +
  2122 + :param filename: path of stream in storage tree. (see openstream for syntax)
  2123 + :returns: True if the object exists, else False.
  2124 + """
  2125 + try:
  2126 + sid = self._find(filename)
  2127 + return True
  2128 + except Exception:
  2129 + return False
  2130 +
  2131 +
  2132 + def get_size(self, filename):
  2133 + """
  2134 + Return size of a stream in the OLE container, in bytes.
  2135 +
  2136 + :param filename: path of stream in storage tree (see openstream for syntax)
  2137 + :returns: size in bytes (long integer)
  2138 + :exception IOError: if file not found
  2139 + :exception TypeError: if this is not a stream.
  2140 + """
  2141 + sid = self._find(filename)
  2142 + entry = self.direntries[sid]
  2143 + if entry.entry_type != STGTY_STREAM:
  2144 + #TODO: Should it return zero instead of raising an exception?
  2145 + raise TypeError('object is not an OLE stream')
  2146 + return entry.size
  2147 +
  2148 +
  2149 + def get_rootentry_name(self):
  2150 + """
  2151 + Return root entry name. Should usually be 'Root Entry' or 'R' in most
  2152 + implementations.
  2153 + """
  2154 + return self.root.name
  2155 +
  2156 +
  2157 + def getproperties(self, filename, convert_time=False, no_conversion=None):
  2158 + """
  2159 + Return properties described in substream.
  2160 +
  2161 + :param filename: path of stream in storage tree (see openstream for syntax)
  2162 + :param convert_time: bool, if True timestamps will be converted to Python datetime
  2163 + :param no_conversion: None or list of int, timestamps not to be converted
  2164 + (for example total editing time is not a real timestamp)
  2165 +
  2166 + :returns: a dictionary of values indexed by id (integer)
  2167 + """
  2168 + #REFERENCE: [MS-OLEPS] https://msdn.microsoft.com/en-us/library/dd942421.aspx
  2169 + # make sure no_conversion is a list, just to simplify code below:
  2170 + if no_conversion is None:
  2171 + no_conversion = []
  2172 + # stream path as a string to report exceptions:
  2173 + streampath = filename
  2174 + if not isinstance(streampath, str):
  2175 + streampath = '/'.join(streampath)
  2176 +
  2177 + fp = self.openstream(filename)
  2178 +
  2179 + data = {}
  2180 +
  2181 + try:
  2182 + # header
  2183 + s = fp.read(28)
  2184 + clsid = _clsid(s[8:24])
  2185 +
  2186 + # format id
  2187 + s = fp.read(20)
  2188 + fmtid = _clsid(s[:16])
  2189 + fp.seek(i32(s, 16))
  2190 +
  2191 + # get section
  2192 + s = b"****" + fp.read(i32(fp.read(4))-4)
  2193 + # number of properties:
  2194 + num_props = i32(s, 4)
  2195 + except BaseException as exc:
  2196 + # catch exception while parsing property header, and only raise
  2197 + # a DEFECT_INCORRECT then return an empty dict, because this is not
  2198 + # a fatal error when parsing the whole file
  2199 + msg = 'Error while parsing properties header in stream %s: %s' % (
  2200 + repr(streampath), exc)
  2201 + self._raise_defect(DEFECT_INCORRECT, msg, type(exc))
  2202 + return data
  2203 +
  2204 + for i in range(num_props):
  2205 + property_id = 0 # just in case of an exception
  2206 + try:
  2207 + property_id = i32(s, 8+i*8)
  2208 + offset = i32(s, 12+i*8)
  2209 + property_type = i32(s, offset)
  2210 +
  2211 + log.debug('property id=%d: type=%d offset=%X' % (property_id, property_type, offset))
  2212 +
  2213 + # test for common types first (should perhaps use
  2214 + # a dictionary instead?)
  2215 +
  2216 + if property_type == VT_I2: # 16-bit signed integer
  2217 + value = i16(s, offset+4)
  2218 + if value >= 32768:
  2219 + value = value - 65536
  2220 + elif property_type == VT_UI2: # 2-byte unsigned integer
  2221 + value = i16(s, offset+4)
  2222 + elif property_type in (VT_I4, VT_INT, VT_ERROR):
  2223 + # VT_I4: 32-bit signed integer
  2224 + # VT_ERROR: HRESULT, similar to 32-bit signed integer,
  2225 + # see http://msdn.microsoft.com/en-us/library/cc230330.aspx
  2226 + value = i32(s, offset+4)
  2227 + elif property_type in (VT_UI4, VT_UINT): # 4-byte unsigned integer
  2228 + value = i32(s, offset+4) # FIXME
  2229 + elif property_type in (VT_BSTR, VT_LPSTR):
  2230 + # CodePageString, see http://msdn.microsoft.com/en-us/library/dd942354.aspx
  2231 + # size is a 32 bits integer, including the null terminator, and
  2232 + # possibly trailing or embedded null chars
  2233 + #TODO: if codepage is unicode, the string should be converted as such
  2234 + count = i32(s, offset+4)
  2235 + value = s[offset+8:offset+8+count-1]
  2236 + # remove all null chars:
  2237 + value = value.replace(b'\x00', b'')
  2238 + elif property_type == VT_BLOB:
  2239 + # binary large object (BLOB)
  2240 + # see http://msdn.microsoft.com/en-us/library/dd942282.aspx
  2241 + count = i32(s, offset+4)
  2242 + value = s[offset+8:offset+8+count]
  2243 + elif property_type == VT_LPWSTR:
  2244 + # UnicodeString
  2245 + # see http://msdn.microsoft.com/en-us/library/dd942313.aspx
  2246 + # "the string should NOT contain embedded or additional trailing
  2247 + # null characters."
  2248 + count = i32(s, offset+4)
  2249 + value = self._decode_utf16_str(s[offset+8:offset+8+count*2])
  2250 + elif property_type == VT_FILETIME:
  2251 + value = long(i32(s, offset+4)) + (long(i32(s, offset+8))<<32)
  2252 + # FILETIME is a 64-bit int: "number of 100ns periods
  2253 + # since Jan 1,1601".
  2254 + if convert_time and property_id not in no_conversion:
  2255 + log.debug('Converting property #%d to python datetime, value=%d=%fs'
  2256 + %(property_id, value, float(value)/10000000))
  2257 + # convert FILETIME to Python datetime.datetime
  2258 + # inspired from http://code.activestate.com/recipes/511425-filetime-to-datetime/
  2259 + _FILETIME_null_date = datetime.datetime(1601, 1, 1, 0, 0, 0)
  2260 + log.debug('timedelta days=%d' % (value//(10*1000000*3600*24)))
  2261 + value = _FILETIME_null_date + datetime.timedelta(microseconds=value//10)
  2262 + else:
  2263 + # legacy code kept for backward compatibility: returns a
  2264 + # number of seconds since Jan 1,1601
  2265 + value = value // 10000000 # seconds
  2266 + elif property_type == VT_UI1: # 1-byte unsigned integer
  2267 + value = i8(s[offset+4])
  2268 + elif property_type == VT_CLSID:
  2269 + value = _clsid(s[offset+4:offset+20])
  2270 + elif property_type == VT_CF:
  2271 + # PropertyIdentifier or ClipboardData??
  2272 + # see http://msdn.microsoft.com/en-us/library/dd941945.aspx
  2273 + count = i32(s, offset+4)
  2274 + value = s[offset+8:offset+8+count]
  2275 + elif property_type == VT_BOOL:
  2276 + # VARIANT_BOOL, 16-bit bool, 0x0000=False, 0xFFFF=True
  2277 + # see http://msdn.microsoft.com/en-us/library/cc237864.aspx
  2278 + value = bool(i16(s, offset+4))
  2279 + else:
  2280 + value = None # everything else yields "None"
  2281 + log.debug('property id=%d: type=%d not implemented in parser yet' % (property_id, property_type))
  2282 +
  2283 + # missing: VT_EMPTY, VT_NULL, VT_R4, VT_R8, VT_CY, VT_DATE,
  2284 + # VT_DECIMAL, VT_I1, VT_I8, VT_UI8,
  2285 + # see http://msdn.microsoft.com/en-us/library/dd942033.aspx
  2286 +
  2287 + # FIXME: add support for VT_VECTOR
  2288 + # VT_VECTOR is a 32 uint giving the number of items, followed by
  2289 + # the items in sequence. The VT_VECTOR value is combined with the
  2290 + # type of items, e.g. VT_VECTOR|VT_BSTR
  2291 + # see http://msdn.microsoft.com/en-us/library/dd942011.aspx
  2292 +
  2293 + #print("%08x" % property_id, repr(value), end=" ")
  2294 + #print("(%s)" % VT[i32(s, offset) & 0xFFF])
  2295 +
  2296 + data[property_id] = value
  2297 + except BaseException as exc:
  2298 + # catch exception while parsing each property, and only raise
  2299 + # a DEFECT_INCORRECT, because parsing can go on
  2300 + msg = 'Error while parsing property id %d in stream %s: %s' % (
  2301 + property_id, repr(streampath), exc)
  2302 + self._raise_defect(DEFECT_INCORRECT, msg, type(exc))
  2303 +
  2304 + return data
  2305 +
  2306 + def get_metadata(self):
  2307 + """
  2308 + Parse standard properties streams, return an OleMetadata object
  2309 + containing all the available metadata.
  2310 + (also stored in the metadata attribute of the OleFileIO object)
  2311 +
  2312 + new in version 0.25
  2313 + """
  2314 + self.metadata = OleMetadata()
  2315 + self.metadata.parse_properties(self)
  2316 + return self.metadata
  2317 +
  2318 +#
  2319 +# --------------------------------------------------------------------
  2320 +# This script can be used to dump the directory of any OLE2 structured
  2321 +# storage file.
  2322 +
  2323 +if __name__ == "__main__":
  2324 +
  2325 + import sys, optparse
  2326 +
  2327 + DEFAULT_LOG_LEVEL = "warning" # Default log level
  2328 + LOG_LEVELS = {
  2329 + 'debug': logging.DEBUG,
  2330 + 'info': logging.INFO,
  2331 + 'warning': logging.WARNING,
  2332 + 'error': logging.ERROR,
  2333 + 'critical': logging.CRITICAL
  2334 + }
  2335 +
  2336 + usage = 'usage: %prog [options] <filename> [filename2 ...]'
  2337 + parser = optparse.OptionParser(usage=usage)
  2338 + parser.add_option("-c", action="store_true", dest="check_streams",
  2339 + help='check all streams (for debugging purposes)')
  2340 + parser.add_option("-d", action="store_true", dest="debug_mode",
  2341 + help='debug mode, shortcut for -l debug (displays a lot of debug information, for developers only)')
  2342 + parser.add_option('-l', '--loglevel', dest="loglevel", action="store", default=DEFAULT_LOG_LEVEL,
  2343 + help="logging level debug/info/warning/error/critical (default=%default)")
  2344 +
  2345 + (options, args) = parser.parse_args()
  2346 +
  2347 + print('olefile version %s %s - http://www.decalage.info/en/olefile\n' % (__version__, __date__))
  2348 +
  2349 + # Print help if no arguments are passed
  2350 + if len(args) == 0:
  2351 + print(__doc__)
  2352 + parser.print_help()
  2353 + sys.exit()
  2354 +
  2355 + if options.debug_mode:
  2356 + options.loglevel = 'debug'
  2357 +
  2358 + # setup logging to the console
  2359 + logging.basicConfig(level=LOG_LEVELS[options.loglevel], format='%(levelname)-8s %(message)s')
  2360 +
  2361 + # also enable the module's logger:
  2362 + enable_logging()
  2363 +
  2364 + for filename in args:
  2365 + try:
  2366 + ole = OleFileIO(filename)#, raise_defects=DEFECT_INCORRECT)
  2367 + print("-" * 68)
  2368 + print(filename)
  2369 + print("-" * 68)
  2370 + ole.dumpdirectory()
  2371 + for streamname in ole.listdir():
  2372 + if streamname[-1][0] == "\005":
  2373 + print("%r: properties" % streamname)
  2374 + try:
  2375 + props = ole.getproperties(streamname, convert_time=True)
  2376 + props = sorted(props.items())
  2377 + for k, v in props:
  2378 + #[PL]: avoid displaying overly large or binary values:
  2379 + if isinstance(v, (basestring, bytes)):
  2380 + if len(v) > 50:
  2381 + v = v[:50]
  2382 + if isinstance(v, bytes):
  2383 + # quick and dirty binary check:
  2384 + for c in (1,2,3,4,5,6,7,11,12,14,15,16,17,18,19,20,
  2385 + 21,22,23,24,25,26,27,28,29,30,31):
  2386 + if c in bytearray(v):
  2387 + v = '(binary data)'
  2388 + break
  2389 + print(" ", k, v)
  2390 + except:
  2391 + log.exception('Error while parsing property stream %r' % streamname)
  2392 +
  2393 + if options.check_streams:
  2394 + # Read all streams to check if there are errors:
  2395 + print('\nChecking streams...')
  2396 + for streamname in ole.listdir():
  2397 + # print name using repr() to convert binary chars to \xNN:
  2398 + print('-', repr('/'.join(streamname)),'-', end=' ')
  2399 + st_type = ole.get_type(streamname)
  2400 + if st_type == STGTY_STREAM:
  2401 + print('size %d' % ole.get_size(streamname))
  2402 + # just try to read stream in memory:
  2403 + ole.openstream(streamname)
  2404 + else:
  2405 + print('NOT a stream : type=%d' % st_type)
  2406 + print()
  2407 +
  2408 +## for streamname in ole.listdir():
  2409 +## # print name using repr() to convert binary chars to \xNN:
  2410 +## print('-', repr('/'.join(streamname)),'-', end=' ')
  2411 +## print(ole.getmtime(streamname))
  2412 +## print()
  2413 +
  2414 + print('Modification/Creation times of all directory entries:')
  2415 + for entry in ole.direntries:
  2416 + if entry is not None:
  2417 + print('- %s: mtime=%s ctime=%s' % (entry.name,
  2418 + entry.getmtime(), entry.getctime()))
  2419 + print()
  2420 +
  2421 + # parse and display metadata:
  2422 + try:
  2423 + meta = ole.get_metadata()
  2424 + meta.dump()
  2425 + except:
  2426 + log.exception('Error while parsing metadata')
  2427 + print()
  2428 + #[PL] Test a few new methods:
  2429 + root = ole.get_rootentry_name()
  2430 + print('Root entry name: "%s"' % root)
  2431 + if ole.exists('worddocument'):
  2432 + print("This is a Word document.")
  2433 + print("type of stream 'WordDocument':", ole.get_type('worddocument'))
  2434 + print("size :", ole.get_size('worddocument'))
  2435 + if ole.exists('macros/vba'):
  2436 + print("This document may contain VBA macros.")
  2437 +
  2438 + # print parsing issues:
  2439 + print('\nNon-fatal issues raised during parsing:')
  2440 + if ole.parsing_issues:
  2441 + for exctype, msg in ole.parsing_issues:
  2442 + print('- %s: %s' % (exctype.__name__, msg))
  2443 + else:
  2444 + print('None')
  2445 + except:
  2446 + log.exception('Error while parsing file %r' % filename)
  2447 +
  2448 +# this code was developed while listening to The Wedding Present "Sea Monsters"
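The VT_FILETIME branch above converts a 64-bit count of 100 ns intervals since 1601-01-01 into a Python datetime. A minimal standalone sketch of that conversion (the function name is illustrative, not part of olefile's API):

```python
import datetime

def filetime_to_datetime(value):
    """Convert a FILETIME value (100 ns intervals since Jan 1, 1601 UTC)
    to a naive datetime, mirroring the conversion in getproperties()."""
    _FILETIME_null_date = datetime.datetime(1601, 1, 1, 0, 0, 0)
    # 1 microsecond = 10 intervals of 100 ns, hence the integer division by 10
    return _FILETIME_null_date + datetime.timedelta(microseconds=value // 10)
```

For example, 116444736000000000 (the well-known offset between the FILETIME and Unix epochs) maps to 1970-01-01.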
oletools/thirdparty/xglob/xglob.py
@@ -52,14 +52,25 @@ For more info and updates: http://www.decalage.info/xglob @@ -52,14 +52,25 @@ For more info and updates: http://www.decalage.info/xglob
52 # 2015-01-03 v0.04 PL: - fixed issues in iter_files + yield container name 52 # 2015-01-03 v0.04 PL: - fixed issues in iter_files + yield container name
53 # 2016-02-24 v0.05 PL: - do not stop on exceptions, return them as data 53 # 2016-02-24 v0.05 PL: - do not stop on exceptions, return them as data
54 # - fixed issue when using wildcards with empty path 54 # - fixed issue when using wildcards with empty path
  55 +# 2016-04-28 v0.06 CH: - improved handling of non-existing files
  56 +# (by Christian Herdtweck)
55 57
56 -__version__ = '0.05' 58 +__version__ = '0.06'
57 59
58 60
59 #=== IMPORTS ================================================================= 61 #=== IMPORTS =================================================================
60 62
61 import os, fnmatch, glob, zipfile 63 import os, fnmatch, glob, zipfile
62 64
  65 +#=== EXCEPTIONS ==============================================================
  66 +
  67 +class PathNotFoundException(Exception):
  68 + """ raised if given a fixed file/dir (not a glob) that does not exist """
  69 + def __init__(self, path):
  70 + super(PathNotFoundException, self).__init__(
  71 + 'Given path does not exist: %r' % path)
  72 +
  73 +
63 #=== FUNCTIONS =============================================================== 74 #=== FUNCTIONS ===============================================================
64 75
65 # recursive glob function to find files in any subfolder: 76 # recursive glob function to find files in any subfolder:
@@ -118,8 +129,11 @@ def iter_files(files, recursive=False, zip_password=None, zip_fname=&#39;*&#39;): @@ -118,8 +129,11 @@ def iter_files(files, recursive=False, zip_password=None, zip_fname=&#39;*&#39;):
118 - then files matching zip_fname are opened from the zip archive 129 - then files matching zip_fname are opened from the zip archive
119 130
120 Iterator: yields (container, filename, data) for each file. If zip_password is None, then 131 Iterator: yields (container, filename, data) for each file. If zip_password is None, then
121 - only the filename is returned, container and data=None. Otherwise container si the  
122 - filename of the container (zip file), and data is the file content. 132 + only the filename is returned, container and data=None. Otherwise container is the
  133 + filename of the container (zip file), and data is the file content (or an exception).
  134 + If a given filename is not a glob and does not exist, the triplet
  135 + (None, filename, PathNotFoundException) is yielded. (Globs matching nothing
  136 + do not trigger exceptions)
123 """ 137 """
124 #TODO: catch exceptions and yield them for the caller (no file found, file is not zip, wrong password, etc) 138 #TODO: catch exceptions and yield them for the caller (no file found, file is not zip, wrong password, etc)
125 #TODO: use logging instead of printing 139 #TODO: use logging instead of printing
@@ -131,6 +145,9 @@ def iter_files(files, recursive=False, zip_password=None, zip_fname=&#39;*&#39;): @@ -131,6 +145,9 @@ def iter_files(files, recursive=False, zip_password=None, zip_fname=&#39;*&#39;):
131 else: 145 else:
132 iglob = glob.iglob 146 iglob = glob.iglob
133 for filespec in files: 147 for filespec in files:
  148 + if not is_glob(filespec) and not os.path.exists(filespec):
  149 + yield None, filespec, PathNotFoundException(filespec)
  150 + continue
134 for filename in iglob(filespec): 151 for filename in iglob(filespec):
135 if zip_password is not None: 152 if zip_password is not None:
136 # Each file is expected to be a zip archive: 153 # Each file is expected to be a zip archive:
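The hunk above makes iter_files yield a (None, filename, PathNotFoundException) triplet for fixed paths that do not exist, while silent globs stay silent. A simplified, self-contained model of that behavior (zip handling omitted; `iter_paths` is an illustrative name, not xglob's API):

```python
import glob
import os

class PathNotFoundException(Exception):
    """ raised if given a fixed file/dir (not a glob) that does not exist """
    def __init__(self, path):
        super(PathNotFoundException, self).__init__(
            'Given path does not exist: %r' % path)

def iter_paths(files):
    # simplified model of the new iter_files behavior (no zip support,
    # crude wildcard check standing in for is_glob)
    for filespec in files:
        has_wildcard = any(c in filespec for c in '*?[')
        if not has_wildcard and not os.path.exists(filespec):
            # fixed path that does not exist: report it instead of
            # silently yielding nothing
            yield None, filespec, PathNotFoundException(filespec)
            continue
        for filename in glob.iglob(filespec):
            yield None, filename, None
```

Globs that match nothing still produce no output at all; only literal, non-existing paths surface the exception to the caller.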
@@ -153,3 +170,39 @@ def iter_files(files, recursive=False, zip_password=None, zip_fname=&#39;*&#39;): @@ -153,3 +170,39 @@ def iter_files(files, recursive=False, zip_password=None, zip_fname=&#39;*&#39;):
153 #data = open(filename, 'rb').read() 170 #data = open(filename, 'rb').read()
154 #yield None, filename, data 171 #yield None, filename, data
155 172
  173 +
  174 +def is_glob(filespec):
  175 + """ determine if given file specification is a single file name or a glob
  176 +
  177 + python's glob and fnmatch can only interpret ?, *, [list], and [ra-nge],
  178 + (and combinations: hex_*_[A-Fabcdef0-9]).
  179 + The special chars *?[-] can only be escaped using []
  180 + --> file_name is not a glob
  181 + --> file?name is a glob
  182 + --> file* is a glob
  183 + --> file[-._]name is a glob
  184 + --> file[?]name is not a glob (matches literal "file?name")
  185 + --> file[*]name is not a glob (matches literal "file*name")
  186 + --> file[-]name is not a glob (matches literal "file-name")
  187 + --> file-name is not a glob
  188 +
  189 + Obviously incorrect globs are mostly treated as non-globs:
  190 + --> file[name is not a glob (matches literal "file[name")
  191 + --> file]-[name is treated as a glob
  192 + (it is not a valid glob but detecting errors like this requires
  193 + sophisticated regular expression matching)
  194 +
  195 + Python's glob also works with globs in directory-part of path
  196 + --> dir-part of path is analyzed just like filename-part
  197 + --> thirdparty/*/xglob.py is a (valid) glob
  198 +
  199 + TODO: create a correct regexp to test for validity of ranges
  200 + """
  201 +
  202 + # remove escaped special chars
  203 + cleaned = filespec.replace('[*]', '').replace('[?]', '') \
  204 + .replace('[[]', '').replace('[]]', '').replace('[-]', '')
  205 +
  206 + # check if special chars remain
  207 + return '*' in cleaned or '?' in cleaned or \
  208 + ('[' in cleaned and ']' in cleaned)
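The heuristic above can be exercised standalone; this sketch duplicates the is_glob logic so the docstring's examples can be checked directly:

```python
def is_glob(filespec):
    """Return True if filespec contains unescaped glob wildcards
    (same heuristic as xglob.is_glob)."""
    # remove []-escaped special chars first
    cleaned = filespec.replace('[*]', '').replace('[?]', '') \
                      .replace('[[]', '').replace('[]]', '').replace('[-]', '')
    # any remaining wildcard chars make it a glob
    return '*' in cleaned or '?' in cleaned or \
           ('[' in cleaned and ']' in cleaned)

# examples from the docstring:
for spec, expected in [('file_name', False), ('file?name', True),
                       ('file*', True), ('file[-._]name', True),
                       ('file[?]name', False), ('file[*]name', False),
                       ('file[-]name', False), ('thirdparty/*/xglob.py', True)]:
    assert is_glob(spec) == expected
```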