Commit c1d26ba7fe93a2070ccc675a2a94c2c948ec2577

Authored by Philippe Lagadec
1 parent 03c0a9ec

pyxswf v0.02: added extraction from RTF embedded objects, with new rtfobj module

README.md
... ... @@ -12,13 +12,14 @@ Tools in python-oletools:
12 12 view and extract individual data streams.
13 13 - **oleid**: a tool to analyze OLE files to detect specific characteristics that could potentially indicate that the file is suspicious or malicious.
14 14 - **pyxswf**: a tool to detect, extract and analyze Flash objects (SWF) that may
15   - be embedded in files such as MS Office documents (e.g. Word, Excel),
  15 + be embedded in files such as MS Office documents (e.g. Word, Excel) and RTF,
16 16 which is especially useful for malware analysis.
17 17 - and a few others (coming soon)
18 18  
19 19 News
20 20 ----
21 21  
  22 +- 2012-11-09 v0.03: Improved pyxswf to extract Flash objects from RTF
22 23 - 2012-10-29 v0.02: Added oleid
23 24 - 2012-10-09 v0.01: Initial version of olebrowse and pyxswf
24 25 - see changelog in source code for more info.
... ... @@ -84,13 +85,18 @@ their OLE structure properly, which is necessary when streams are fragmented.
84 85 Stream fragmentation is a known obfuscation technique, as explained on
85 86 [http://www.breakingpointsystems.com/resources/blog/evasion-with-ole2-fragmentation/](http://www.breakingpointsystems.com/resources/blog/evasion-with-ole2-fragmentation/)
86 87  
87   -For this, simply add the -o option to work on OLE streams rather than raw files.
  88 +It can also extract Flash objects from RTF documents, by parsing embedded objects encoded in hexadecimal format (-f option).
  89 +
  90 +
  91 +For this, simply add the -o option to work on OLE streams rather than raw files, or the -f option to work on RTF files.
88 92  
89 93 Usage: pyxswf.py [options] <file.bad>
90 94  
91 95 Options:
92 96 -o, --ole Parse an OLE file (e.g. Word, Excel) to look for SWF
93 97 in each stream
  98 + -f, --rtf Parse an RTF file to look for SWF in each embedded
  99 + object
94 100 -x, --extract Extracts the embedded SWF(s), names it MD5HASH.swf &
95 101 saves it in the working dir. No addition args needed
96 102 -h, --help show this help message and exit
... ... @@ -106,7 +112,7 @@ For this, simply add the -o option to work on OLE streams rather than raw files.
106 112 contain SWFs. Must provide path in quotes
107 113 -c, --compress Compresses the SWF using Zlib
108 114  
109   -Example - detecting and extracting a SWF file from a Word document on Windows:
  115 +Example 1 - detecting and extracting a SWF file from a Word document on Windows:
110 116  
111 117 C:\oletools>pyxswf.py -o word_flash.doc
112 118 OLE stream: 'Contents'
... ... @@ -118,7 +124,16 @@ Example - detecting and extracting a SWF file from a Word document on Windows:
118 124 [SUMMARY] 1 SWF(s) in MD5:993664cc86f60d52d671b6610813cfd1:Contents
119 125 [ADDR] SWF 1 at 0x8 - FWS Header
120 126 [FILE] Carved SWF MD5: 2498e9c0701dc0e461ab4358f9102bc5.swf
121   -
  127 +
  128 +Example 2 - detecting and extracting a SWF file from an RTF document on Windows:
  129 +
  130 + C:\oletools>pyxswf.py -xf "rtf_flash.rtf"
  131 + RTF embedded object size 1498557 at index 000036DD
  132 + [SUMMARY] 1 SWF(s) in MD5:46a110548007e04f4043785ac4184558:RTF_embedded_object_0
  133 + 00036DD
  134 + [ADDR] SWF 1 at 0xc40 - FWS Header
  135 + [FILE] Carved SWF MD5: 2498e9c0701dc0e461ab4358f9102bc5.swf
  136 +
122 137 For more info, see [http://www.decalage.info/python/pyxswf](http://www.decalage.info/python/pyxswf)
123 138  
124 139  
... ...
oletools/README.txt
... ... @@ -25,12 +25,13 @@ Tools in python-oletools:
25 25 suspicious or malicious.
26 26 - **pyxswf**: a tool to detect, extract and analyze Flash objects (SWF)
27 27 that may be embedded in files such as MS Office documents (e.g. Word,
28   - Excel), which is especially useful for malware analysis.
  28 + Excel) and RTF, which is especially useful for malware analysis.
29 29 - and a few others (coming soon)
30 30  
31 31 News
32 32 ----
33 33  
  34 +- 2012-11-09 v0.03: Improved pyxswf to extract Flash objects from RTF
34 35 - 2012-10-29 v0.02: Added oleid
35 36 - 2012-10-09 v0.01: Initial version of olebrowse and pyxswf
36 37 - see changelog in source code for more info.
... ... @@ -112,8 +113,11 @@ are fragmented. Stream fragmentation is a known obfuscation technique,
112 113 as explained on
113 114 `http://www.breakingpointsystems.com/resources/blog/evasion-with-ole2-fragmentation/ <http://www.breakingpointsystems.com/resources/blog/evasion-with-ole2-fragmentation/>`_
114 115  
  116 +It can also extract Flash objects from RTF documents, by parsing
  117 +embedded objects encoded in hexadecimal format (-f option).
  118 +
115 119 For this, simply add the -o option to work on OLE streams rather than
116   -raw files.
  120 +raw files, or the -f option to work on RTF files.
117 121  
118 122 ::
119 123  
... ... @@ -122,6 +126,8 @@ raw files.
122 126 Options:
123 127 -o, --ole Parse an OLE file (e.g. Word, Excel) to look for SWF
124 128 in each stream
  129 + -f, --rtf Parse an RTF file to look for SWF in each embedded
  130 + object
125 131 -x, --extract Extracts the embedded SWF(s), names it MD5HASH.swf &
126 132 saves it in the working dir. No addition args needed
127 133 -h, --help show this help message and exit
... ... @@ -137,7 +143,7 @@ raw files.
137 143 contain SWFs. Must provide path in quotes
138 144 -c, --compress Compresses the SWF using Zlib
139 145  
140   -Example - detecting and extracting a SWF file from a Word document on
  146 +Example 1 - detecting and extracting a SWF file from a Word document on
141 147 Windows:
142 148  
143 149 ::
... ... @@ -153,6 +159,18 @@ Windows:
153 159 [ADDR] SWF 1 at 0x8 - FWS Header
154 160 [FILE] Carved SWF MD5: 2498e9c0701dc0e461ab4358f9102bc5.swf
155 161  
  162 +Example 2 - detecting and extracting a SWF file from an RTF document on
  163 +Windows:
  164 +
  165 +::
  166 +
  167 + C:\oletools>pyxswf.py -xf "rtf_flash.rtf"
  168 + RTF embedded object size 1498557 at index 000036DD
  169 + [SUMMARY] 1 SWF(s) in MD5:46a110548007e04f4043785ac4184558:RTF_embedded_object_0
  170 + 00036DD
  171 + [ADDR] SWF 1 at 0xc40 - FWS Header
  172 + [FILE] Carved SWF MD5: 2498e9c0701dc0e461ab4358f9102bc5.swf
  173 +
156 174 For more info, see
157 175 `http://www.decalage.info/python/pyxswf <http://www.decalage.info/python/pyxswf>`_
158 176  
... ...
oletools/pyxswf.py
1 1 #!/usr/bin/env python
2 2 """
3   -pyxswf.py - Philippe Lagadec 2012-09-17
  3 +pyxswf.py
4 4  
5 5 pyxswf is a script to detect, extract and analyze Flash objects (SWF) that may
6 6 be embedded in files such as MS Office documents (e.g. Word, Excel),
7 7 which is especially useful for malware analysis.
  8 +
8 9 pyxswf is an extension to xxxswf.py published by Alexander Hanel on
9 10 http://hooked-on-mnemonics.blogspot.nl/2011/12/xxxswfpy.html
10 11 Compared to xxxswf, it can extract streams from MS Office documents by parsing
11   -their OLE structure properly, which is necessary when streams are fragmented.
  12 +their OLE structure properly (-o option), which is necessary when streams are
  13 +fragmented.
12 14 Stream fragmentation is a known obfuscation technique, as explained on
13 15 http://www.breakingpointsystems.com/resources/blog/evasion-with-ole2-fragmentation/
14 16  
  17 +It can also extract Flash objects from RTF documents, by parsing embedded
  18 +objects encoded in hexadecimal format (-f option).
  19 +
15 20 pyxswf project website: http://www.decalage.info/python/pyxswf
16 21  
17 22 pyxswf is part of the python-oletools package:
... ... @@ -41,18 +46,19 @@ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
41 46 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
42 47 """
43 48  
44   -__version__ = '0.01'
  49 +__version__ = '0.02'
45 50  
46 51 #------------------------------------------------------------------------------
47 52 # CHANGELOG:
48 53 # 2012-09-17 v0.01 PL: - first version
  54 +# 2012-11-09 v0.02 PL: - added RTF embedded objects extraction
49 55  
50 56 #------------------------------------------------------------------------------
51 57 # TODO:
52 58 # - check if file is OLE
53 59 # - support -r
54 60  
55   -import optparse, sys, os
  61 +import optparse, sys, os, rtfobj, StringIO
56 62 from thirdparty.xxxswf import xxxswf
57 63 from thirdparty.OleFileIO_PL import OleFileIO_PL
58 64  
... ... @@ -76,6 +82,7 @@ def main():
76 82 parser.add_option('-c', '--compress', action='store_true', dest='compress', help='Compresses the SWF using Zlib')
77 83  
78 84 parser.add_option('-o', '--ole', action='store_true', dest='ole', help='Parse an OLE file (e.g. Word, Excel) to look for SWF in each stream')
  85 + parser.add_option('-f', '--rtf', action='store_true', dest='rtf', help='Parse an RTF file to look for SWF in each embedded object')
79 86  
80 87  
81 88 (options, args) = parser.parse_args()
... ... @@ -85,6 +92,7 @@ def main():
85 92 parser.print_help()
86 93 return
87 94  
  95 + # OLE MODE:
88 96 if options.ole:
89 97 for filename in args:
90 98 ole = OleFileIO_PL.OleFileIO(filename)
... ... @@ -99,6 +107,18 @@ def main():
99 107 xxxswf.disneyland(f, direntry.name, options)
100 108 f.close()
101 109 ole.close()
  110 +
  111 + # RTF MODE:
  112 + elif options.rtf:
  113 + for filename in args:
  114 + for index, data in rtfobj.rtf_iter_objects(filename):
  115 + if 'FWS' in data or 'CWS' in data:
  116 + print 'RTF embedded object size %d at index %08X' % (len(data), index)
  117 + f = StringIO.StringIO(data)
  118 + name = 'RTF_embedded_object_%08X' % index
  119 + # call xxxswf to scan or extract Flash files:
  120 + xxxswf.disneyland(f, name, options)
  121 +
102 122 else:
103 123 xxxswf.main()
104 124  
... ...
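The RTF branch above only tests whether `'FWS'` or `'CWS'` occurs somewhere in the decoded object before handing it to `xxxswf.disneyland`. As a minimal sketch of that kind of signature scan, written for Python 3 rather than the Python 2 used in the commit, the helper below returns the offsets of both signatures. The name `find_swf_offsets` is hypothetical and is not part of pyxswf or xxxswf, and a signature hit is only a candidate, not proof of a valid SWF.

```python
import re

# The two SWF signatures the commit looks for: 'FWS' (uncompressed)
# and 'CWS' (zlib-compressed).
SWF_SIGNATURES = re.compile(rb'[FC]WS')


def find_swf_offsets(data):
    """Return the offsets of every 'FWS' or 'CWS' signature in a bytes object.

    Hypothetical helper for illustration only: it reports candidate
    positions and does not validate the SWF header that follows.
    """
    return [m.start() for m in SWF_SIGNATURES.finditer(data)]


if __name__ == '__main__':
    blob = b'\x00' * 8 + b'FWS\x09' + b'xx' + b'CWS'
    for offset in find_swf_offsets(blob):
        print('candidate SWF signature at 0x%x' % offset)
```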
oletools/rtfobj.py 0 → 100644
  1 +#!/usr/bin/env python
  2 +"""
  3 +rtfobj.py - Philippe Lagadec 2012-11-09
  4 +
  5 +rtfobj is a Python module to extract embedded objects from RTF files, such as
  6 +OLE objects. It can be used as a Python library or a command-line tool.
  7 +
  8 +Usage: rtfobj.py <file.rtf>
  9 +
  10 +rtfobj project website: http://www.decalage.info/python/rtfobj
  11 +
  12 +rtfobj is part of the python-oletools package:
  13 +http://www.decalage.info/python/oletools
  14 +
  15 +rtfobj is copyright (c) 2012, Philippe Lagadec (http://www.decalage.info)
  16 +All rights reserved.
  17 +
  18 +Redistribution and use in source and binary forms, with or without modification,
  19 +are permitted provided that the following conditions are met:
  20 +
  21 + * Redistributions of source code must retain the above copyright notice, this
  22 + list of conditions and the following disclaimer.
  23 + * Redistributions in binary form must reproduce the above copyright notice,
  24 + this list of conditions and the following disclaimer in the documentation
  25 + and/or other materials provided with the distribution.
  26 +
  27 +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
  28 +ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
  29 +WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  30 +DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  31 +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  32 +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  33 +SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  34 +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  35 +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  36 +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  37 +"""
  38 +
  39 +__version__ = '0.01'
  40 +
  41 +#------------------------------------------------------------------------------
  42 +# CHANGELOG:
  43 +# 2012-11-09 v0.01 PL: - first version
  44 +
  45 +#------------------------------------------------------------------------------
  46 +# TODO:
  47 +# - improve regex pattern for better performance?
  48 +
  49 +import re, sys, string, binascii
  50 +
  51 +# REGEX pattern to extract embedded OLE objects in hexadecimal format:
  52 +# hex digit: [0-9A-Fa-f]
  53 +# hex byte = two hex digits: [0-9A-Fa-f]{2}
  54 +# run of at least 4 hex bytes: (?:[0-9A-Fa-f]{2}){4,}
  55 +# optional leading runs of hex bytes, each followed by optional whitespace or CR/LF: (?:(?:[0-9A-Fa-f]{2})+\s*)*
  56 +PATTERN = r'(?:(?:[0-9A-Fa-f]{2})+\s*)*(?:[0-9A-Fa-f]{2}){4,}'
  57 +
  58 +# a dummy translation table for str.translate, which does not change anything:
  59 +TRANSTABLE_NOCHANGE = string.maketrans('', '')
  60 +
  61 +
  62 +def rtf_iter_objects (filename, min_size=32):
  63 +    """
  64 +    Open an RTF file, extract each embedded object encoded in hexadecimal of
  65 +    size > min_size, yield the index of the object in the RTF file and its data
  66 +    in binary format.
  67 +    This is a generator.
  68 +    """
  69 +    data = open(filename, 'rb').read()
  70 +    for m in re.finditer(PATTERN, data):
  71 +        found = m.group(0)
  72 +        # remove all whitespace and line feeds:
  73 +        #NOTE: with Python 2.6+, we could use None instead of TRANSTABLE_NOCHANGE
  74 +        found = found.translate(TRANSTABLE_NOCHANGE, ' \t\r\n\f\v')
  75 +        found = binascii.unhexlify(found)
  76 +        #print repr(found)
  77 +        if len(found)>min_size:
  78 +            yield m.start(), found
  79 +
  80 +if __name__ == '__main__':
  81 +    if len(sys.argv) < 2:
  82 +        sys.exit(__doc__)
  83 +    for index, data in rtf_iter_objects(sys.argv[1]):
  84 +        print 'found object size %d at index %08X' % (len(data), index)
  85 +        fname = 'object_%08X.bin' % index
  86 +        print 'saving to file %s' % fname
  87 +        open(fname, 'wb').write(data)
... ...
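The new module is written for Python 2 (`string.maketrans`, `print` statements, regex over `str`). The sketch below shows the same technique as a Python 3 adaptation: match runs of hex digits that may be broken by whitespace, strip the whitespace, and decode with `binascii.unhexlify`. It reuses the regular expression from the diff, but the names `HEX_PATTERN` and `iter_hex_objects` are assumptions made for this example, and it takes the file content as `bytes` instead of a filename so it can be tried without an RTF file on disk.

```python
import binascii
import re

# Same expression as PATTERN in rtfobj.py, compiled as a bytes pattern
# because Python 3 keeps binary file content as bytes.
HEX_PATTERN = re.compile(rb'(?:(?:[0-9A-Fa-f]{2})+\s*)*(?:[0-9A-Fa-f]{2}){4,}')


def iter_hex_objects(data, min_size=32):
    """Yield (offset, decoded_bytes) for each hex-encoded run in data.

    Only runs that decode to more than min_size bytes are yielded,
    mirroring the size filter in rtf_iter_objects.
    """
    for m in HEX_PATTERN.finditer(data):
        # Remove whitespace and line breaks before decoding.
        hex_text = m.group(0).translate(None, b' \t\r\n\f\v')
        decoded = binascii.unhexlify(hex_text)
        if len(decoded) > min_size:
            yield m.start(), decoded


if __name__ == '__main__':
    payload = b'FWS' + bytes(range(40))
    hexed = binascii.hexlify(payload)
    sample = (b'{\\rtf1{\\object\\objdata ' + hexed[:20] + b'\r\n'
              + hexed[20:] + b'}}')
    for offset, obj in iter_hex_objects(sample):
        print('found object size %d at index %08X' % (len(obj), offset))
```

Because the pattern only ever consumes hex digits in pairs, the text passed to `unhexlify` always has an even length once whitespace is removed.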