diff --git a/README.md b/README.md
index b1ee2ef..019cb87 100644
--- a/README.md
+++ b/README.md
@@ -12,13 +12,14 @@ Tools in python-oletools:
   view and extract individual data streams.
 - **oleid**: a tool to analyze OLE files to detect specific characteristics that could potentially indicate that the file is suspicious or malicious.
 - **pyxswf**: a tool to detect, extract and analyze Flash objects (SWF) that may
-  be embedded in files such as MS Office documents (e.g. Word, Excel),
+  be embedded in files such as MS Office documents (e.g. Word, Excel) and RTF,
   which is especially useful for malware analysis.
 - and a few others (coming soon)
 
 News
 ----
 
+- 2012-11-09 v0.03: Improved pyxswf to extract Flash objects from RTF
 - 2012-10-29 v0.02: Added oleid
 - 2012-10-09 v0.01: Initial version of olebrowse and pyxswf
 - see changelog in source code for more info.
@@ -84,13 +85,18 @@ their OLE structure properly, which is necessary when streams are fragmented.
 Stream fragmentation is a known obfuscation technique, as explained on
 [http://www.breakingpointsystems.com/resources/blog/evasion-with-ole2-fragmentation/](http://www.breakingpointsystems.com/resources/blog/evasion-with-ole2-fragmentation/)
 
-For this, simply add the -o option to work on OLE streams rather than raw files.
+It can also extract Flash objects from RTF documents, by parsing embedded objects encoded in hexadecimal format (-f option).
+
+
+For this, simply add the -o option to work on OLE streams rather than raw files, or the -f option to work on RTF files.
 
     Usage: pyxswf.py [options]
 
     Options:
       -o, --ole             Parse an OLE file (e.g. Word, Excel) to look for SWF
                             in each stream
+      -f, --rtf             Parse an RTF file to look for SWF in each embedded
+                            object
       -x, --extract         Extracts the embedded SWF(s), names it MD5HASH.swf &
                             saves it in the working dir. No addition args needed
       -h, --help            show this help message and exit
@@ -106,7 +112,7 @@ For this, simply add the -o option to work on OLE streams rather than raw files.
                             contain SWFs. Must provide path in quotes
       -c, --compress        Compresses the SWF using Zlib
 
-Example - detecting and extracting a SWF file from a Word document on Windows:
+Example 1 - detecting and extracting a SWF file from a Word document on Windows:
 
     C:\oletools>pyxswf.py -o word_flash.doc
     OLE stream: 'Contents'
@@ -118,7 +124,16 @@ Example - detecting and extracting a SWF file from a Word document on Windows:
     [SUMMARY] 1 SWF(s) in MD5:993664cc86f60d52d671b6610813cfd1:Contents
             [ADDR] SWF 1 at 0x8 - FWS Header
                     [FILE] Carved SWF MD5: 2498e9c0701dc0e461ab4358f9102bc5.swf
-
+
+Example 2 - detecting and extracting a SWF file from an RTF document on Windows:
+
+    C:\oletools>pyxswf.py -xf "rtf_flash.rtf"
+    RTF embedded object size 1498557 at index 000036DD
+    [SUMMARY] 1 SWF(s) in MD5:46a110548007e04f4043785ac4184558:RTF_embedded_object_0
+    00036DD
+            [ADDR] SWF 1 at 0xc40 - FWS Header
+                    [FILE] Carved SWF MD5: 2498e9c0701dc0e461ab4358f9102bc5.swf
+
 For more info, see [http://www.decalage.info/python/pyxswf](http://www.decalage.info/python/pyxswf)
 
 
diff --git a/oletools/README.txt b/oletools/README.txt
index 73b4c51..601594a 100644
--- a/oletools/README.txt
+++ b/oletools/README.txt
@@ -25,12 +25,13 @@ Tools in python-oletools:
   suspicious or malicious.
 - **pyxswf**: a tool to detect, extract and analyze Flash objects (SWF)
   that may be embedded in files such as MS Office documents (e.g. Word,
-  Excel), which is especially useful for malware analysis.
+  Excel) and RTF, which is especially useful for malware analysis.
 - and a few others (coming soon)
 
 News
 ----
 
+- 2012-11-09 v0.03: Improved pyxswf to extract Flash objects from RTF
 - 2012-10-29 v0.02: Added oleid
 - 2012-10-09 v0.01: Initial version of olebrowse and pyxswf
 - see changelog in source code for more info.
@@ -112,8 +113,11 @@ are fragmented.
 Stream fragmentation is a known obfuscation technique, as explained on
 `http://www.breakingpointsystems.com/resources/blog/evasion-with-ole2-fragmentation/ `_
 
+It can also extract Flash objects from RTF documents, by parsing
+embedded objects encoded in hexadecimal format (-f option).
+
 For this, simply add the -o option to work on OLE streams rather than
-raw files.
+raw files, or the -f option to work on RTF files.
 
 ::
 
@@ -122,6 +126,8 @@ raw files.
     Options:
       -o, --ole             Parse an OLE file (e.g. Word, Excel) to look for SWF
                             in each stream
+      -f, --rtf             Parse an RTF file to look for SWF in each embedded
+                            object
       -x, --extract         Extracts the embedded SWF(s), names it MD5HASH.swf &
                             saves it in the working dir. No addition args needed
       -h, --help            show this help message and exit
@@ -137,7 +143,7 @@ raw files.
                             contain SWFs. Must provide path in quotes
       -c, --compress        Compresses the SWF using Zlib
 
-Example - detecting and extracting a SWF file from a Word document on
+Example 1 - detecting and extracting a SWF file from a Word document on
 Windows:
 
 ::
@@ -153,6 +159,18 @@ Windows:
             [ADDR] SWF 1 at 0x8 - FWS Header
                     [FILE] Carved SWF MD5: 2498e9c0701dc0e461ab4358f9102bc5.swf
 
+Example 2 - detecting and extracting a SWF file from an RTF document on
+Windows:
+
+::
+
+    C:\oletools>pyxswf.py -xf "rtf_flash.rtf"
+    RTF embedded object size 1498557 at index 000036DD
+    [SUMMARY] 1 SWF(s) in MD5:46a110548007e04f4043785ac4184558:RTF_embedded_object_0
+    00036DD
+            [ADDR] SWF 1 at 0xc40 - FWS Header
+                    [FILE] Carved SWF MD5: 2498e9c0701dc0e461ab4358f9102bc5.swf
+
 For more info, see
 `http://www.decalage.info/python/pyxswf `_
 
diff --git a/oletools/pyxswf.py b/oletools/pyxswf.py
index 8076b52..794a90e 100644
--- a/oletools/pyxswf.py
+++ b/oletools/pyxswf.py
@@ -1,17 +1,22 @@
 #!/usr/bin/env python
 """
-pyxswf.py - Philippe Lagadec 2012-09-17
+pyxswf.py
 
 pyxswf is a script to detect, extract and analyze Flash objects (SWF) that may
 be embedded in files such as MS Office documents (e.g. Word, Excel), which is
 especially useful for malware analysis.
+
 pyxswf is an extension to xxxswf.py published by Alexander Hanel on
 http://hooked-on-mnemonics.blogspot.nl/2011/12/xxxswfpy.html
 Compared to xxxswf, it can extract streams from MS Office documents by parsing
-their OLE structure properly, which is necessary when streams are fragmented.
+their OLE structure properly (-o option), which is necessary when streams are
+fragmented.
 Stream fragmentation is a known obfuscation technique, as explained on
 http://www.breakingpointsystems.com/resources/blog/evasion-with-ole2-fragmentation/
 
+It can also extract Flash objects from RTF documents, by parsing embedded
+objects encoded in hexadecimal format (-f option).
+
 pyxswf project website: http://www.decalage.info/python/pyxswf
 
 pyxswf is part of the python-oletools package:
@@ -41,18 +46,19 @@ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 """
 
-__version__ = '0.01'
+__version__ = '0.02'
 
 #------------------------------------------------------------------------------
 # CHANGELOG:
 # 2012-09-17 v0.01 PL: - first version
+# 2012-11-09 v0.02 PL: - added RTF embedded objects extraction
 
 #------------------------------------------------------------------------------
 # TODO:
 # - check if file is OLE
 # - support -r
 
-import optparse, sys, os
+import optparse, sys, os, rtfobj, StringIO
 
 from thirdparty.xxxswf import xxxswf
 from thirdparty.OleFileIO_PL import OleFileIO_PL
@@ -76,6 +82,7 @@ def main():
 
     parser.add_option('-c', '--compress', action='store_true', dest='compress', help='Compresses the SWF using Zlib')
     parser.add_option('-o', '--ole', action='store_true', dest='ole', help='Parse an OLE file (e.g. Word, Excel) to look for SWF in each stream')
+    parser.add_option('-f', '--rtf', action='store_true', dest='rtf', help='Parse an RTF file to look for SWF in each embedded object')
 
     (options, args) = parser.parse_args()
 
@@ -85,6 +92,7 @@ def main():
         parser.print_help()
         return
 
+    # OLE MODE:
     if options.ole:
         for filename in args:
             ole = OleFileIO_PL.OleFileIO(filename)
@@ -99,6 +107,18 @@ def main():
                     xxxswf.disneyland(f, direntry.name, options)
                     f.close()
             ole.close()
+
+    # RTF MODE:
+    elif options.rtf:
+        for filename in args:
+            for index, data in rtfobj.rtf_iter_objects(filename):
+                if 'FWS' in data or 'CWS' in data:
+                    print 'RTF embedded object size %d at index %08X' % (len(data), index)
+                    f = StringIO.StringIO(data)
+                    name = 'RTF_embedded_object_%08X' % index
+                    # call xxxswf to scan or extract Flash files:
+                    xxxswf.disneyland(f, name, options)
+
     else:
         xxxswf.main()
 
diff --git a/oletools/rtfobj.py b/oletools/rtfobj.py
new file mode 100644
index 0000000..96539bc
--- /dev/null
+++ b/oletools/rtfobj.py
@@ -0,0 +1,87 @@
+#!/usr/bin/env python
+"""
+rtfobj.py - Philippe Lagadec 2012-11-09
+
+rtfobj is a Python module to extract embedded objects from RTF files, such as
+OLE objects. It can be used as a Python library or a command-line tool.
+
+Usage: rtfobj.py
+
+rtfobj project website: http://www.decalage.info/python/rtfobj
+
+rtfobj is part of the python-oletools package:
+http://www.decalage.info/python/oletools
+
+rtfobj is copyright (c) 2012, Philippe Lagadec (http://www.decalage.info)
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice, this
+   list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+"""
+
+__version__ = '0.01'
+
+#------------------------------------------------------------------------------
+# CHANGELOG:
+# 2012-11-09 v0.01 PL: - first version
+
+#------------------------------------------------------------------------------
+# TODO:
+# - improve regex pattern for better performance?
+
+import re, sys, string, binascii
+
+# REGEX pattern to extract embedded OLE objects in hexadecimal format:
+# hex digit: [0-9A-Fa-f]
+# hex char = two hex digits: [0-9A-Fa-f]{2}
+# several hex chars, at least 4: (?:[0-9A-Fa-f]{2}){4,}
+# at least 4 hex chars, followed by whitespace or CR/LF: (?:[0-9A-Fa-f]{2}){4,}\s*
+PATTERN = r'(?:(?:[0-9A-Fa-f]{2})+\s*)*(?:[0-9A-Fa-f]{2}){4,}'
+
+# a dummy translation table for str.translate, which does not change anything:
+TRANSTABLE_NOCHANGE = string.maketrans('', '')
+
+
+def rtf_iter_objects(filename, min_size=32):
+    """
+    Open an RTF file, extract each embedded object encoded in hexadecimal of
+    size > min_size, yield the index of the object in the RTF file and its data
+    in binary format.
+    This is an iterator.
+    """
+    data = open(filename, 'rb').read()
+    for m in re.finditer(PATTERN, data):
+        found = m.group(0)
+        # remove all whitespace and line feeds:
+        #NOTE: with Python 2.6+, we could use None instead of TRANSTABLE_NOCHANGE
+        found = found.translate(TRANSTABLE_NOCHANGE, ' \t\r\n\f\v')
+        found = binascii.unhexlify(found)
+        #print repr(found)
+        if len(found)>min_size:
+            yield m.start(), found
+
+if __name__ == '__main__':
+    if len(sys.argv) < 2:
+        sys.exit(__doc__)
+    for index, data in rtf_iter_objects(sys.argv[1]):
+        print 'found object size %d at index %08X' % (len(data), index)
+        fname = 'object_%08X.bin' % index
+        print 'saving to file %s' % fname
+        open(fname, 'wb').write(data)
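
Note on the extraction technique: the diff above targets Python 2. The same idea (find runs of hex digits, strip the whitespace between them, decode, and keep results above a size threshold) can be sketched in Python 3 as below. This is only an illustrative sketch, not part of the commit: the name `iter_hex_objects` and the sample bytes are hypothetical, and the regex is copied from `PATTERN` in rtfobj.py and applied to `bytes` input.

```python
import binascii
import re

# Same expression as PATTERN in rtfobj.py, compiled here for bytes input.
HEX_RUN = re.compile(rb'(?:(?:[0-9A-Fa-f]{2})+\s*)*(?:[0-9A-Fa-f]{2}){4,}')


def iter_hex_objects(data, min_size=32):
    """Yield (offset, decoded_bytes) for each hex-encoded run in data.

    Hypothetical Python 3 counterpart of rtf_iter_objects, working on an
    in-memory bytes object instead of a filename.
    """
    for match in HEX_RUN.finditer(data):
        # Drop the whitespace and line breaks found between hex digits.
        hex_digits = re.sub(rb'\s+', b'', match.group(0))
        decoded = binascii.unhexlify(hex_digits)
        if len(decoded) > min_size:
            yield match.start(), decoded


# Made-up sample: 40 bytes starting with the "FWS" magic, hex-encoded and
# split over two lines inside an RTF-like wrapper.
payload = b'FWS' + bytes(range(37))
hex_text = binascii.hexlify(payload)
sample = (b'{\\rtf1{\\object{\\*\\objdata '
          + hex_text[:40] + b'\r\n' + hex_text[40:] + b'}}}')

results = list(iter_hex_objects(sample))
for offset, blob in results:
    is_flash = blob[:3] in (b'FWS', b'CWS')
    print('object of %d bytes at index %08X, Flash magic: %s'
          % (len(blob), offset, is_flash))
```

Note that the sketch checks the magic only at the start of the decoded data, whereas the `-f` mode in pyxswf.py tests `'FWS' in data or 'CWS' in data` anywhere in the object and leaves precise location to xxxswf.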