OpenRCE

Friday, June 24 2005 10:07.18 CDT

Author:

ero

Python and IDA

Python is a powerful scripting language which has features greatly appreciated by its followers. Versatility, speed of development and readability are among the top ones.

IDA provides the advanced user with IDC, a C-like scripting language, to automate some analysis tasks. Yet, compared to Python, IDC feels clumsy and slow, and the author (like others) has often wished for something more versatile.

IDAPython (Erdélyi 2005) was first introduced in an earlier joint paper, Carrera and Erdélyi 2004, where a general overview was given together with minimal examples comparing IDC and equivalent Python scripts.

Python goes well beyond the possibilities of IDC by providing networking support, advanced I/O and a host of other features not available in IDC at all.

In this article, a series of examples will be introduced in order to get acquainted with IDAPython and its possibilities.

The examples presented in this paper are known to work with IDA 4.8 and IDAPython 0.7.0 running under Linux.

IDAPython keeps the same global dictionary regardless of the input method. Whether Python code is run from external files or typed into its notepad, the data is persistent. This is extremely convenient: one might run a script that gathers and parses certain data without yet knowing, or wanting to decide, what to do with it next. Having such data always accessible makes a wonderful environment for poking and tinkering around.
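The persistence described above is ordinary Python behaviour whenever every piece of code is executed against the same globals dictionary. The following sketch imitates it outside IDA with exec(). The names shared_globals, first_script and second_script are made up for illustration, and nothing here claims to show how IDAPython is implemented internally.

```python
# One dictionary plays the role of the interpreter's global namespace.
shared_globals = {}

# A first "script" gathers some data and then finishes.
first_script = "collected = [0x401000, 0x401020, 0x401080]"
exec(first_script, shared_globals)

# A later "script" (or a line typed interactively) still sees that data,
# because it runs against the very same dictionary.
second_script = "largest = max(collected)"
exec(second_script, shared_globals)

print(hex(shared_globals["largest"]))
```

Running the two strings against separate dictionaries would make the second one fail with a NameError, which is the situation the shared dictionary avoids.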

IDAPython provides the full API available to those writing plugins and also the well-known IDC functions. It's possible to access nearly anything within IDA's database.

Walking the Functions

As an introductory script, the first example will loop through all the functions IDA has found and any others the user has defined, and will print their effective addresses and names. The script is nearly identical to one of the examples in (Carrera and Erdélyi 2004):

### Walk the functions

# Get the address under the cursor; its segment bounds the search
ea = ScreenEA()

# Loop through all the functions
for function_ea in Functions(SegStart(ea), SegEnd(ea)):
    
    # Print the address and the function name.
    print hex(function_ea), GetFunctionName(function_ea)

Functions such as ScreenEA and GetFunctionName exist also in IDC and documentation for them can be found at .

The function Functions() is provided by IDAPython's idautils module, which is automatically imported on load.

Walking the Segments

This example will loop through all segments and fetch their data, byte by byte, storing it in a Python string.

### Going through the segments

segments = dict()

# For each segment
for seg_ea in Segments():

    data = []
    
    # For each byte in the address range of the segment
    for ea in range(seg_ea, SegEnd(seg_ea)):

        # Fetch byte
        data.append(chr(Byte(ea)))
    
    # Put the data together
    segments[SegName(seg_ea)] = ''.join(data)

# Loop through the dictionary and print the segment's names
# and their sizes
for seg_name, seg_data in segments.items():
    print seg_name, len(seg_data)

The function Segments() is again provided by idautils. Byte(), SegEnd() and SegName() exist in IDC and their functionality is quite self-evident.

Function Connectivity

The third example is a bit more elaborate. It will go through all the functions and will find all the calls performed to and from each of them. The references will be stored in two dictionaries and, in the end, a list of functions with their indegree and outdegree will be shown.

### Indegree and outdegree of functions

from sets import Set

# Get the address under the cursor; its segment bounds the search
ea = ScreenEA()

callers = dict()
callees = dict()

# Loop through all the functions

for function_ea in Functions(SegStart(ea), SegEnd(ea)):

    f_name = GetFunctionName(function_ea)

    # Create a set with all the names of the functions calling (referring to)
    # the current one.
    callers[f_name] = Set(map(GetFunctionName, CodeRefsTo(function_ea, 0)))

    # For each of the incoming references
    for ref_ea in CodeRefsTo(function_ea, 0):

        
        # Get the name of the referring function
        caller_name = GetFunctionName(ref_ea)
    
        # Add the current function to the list of functions
        # called by the referring function
        callees[caller_name] = callees.get(caller_name, Set())
        callees[caller_name].add(f_name)

# Get the list of all functions
functions = Set(callees.keys()+callers.keys())

# For each of the functions, print the number of functions calling it and
# number of functions being called. In short, indegree and outdegree
for f in functions:
    print '%d:%s:%d' % (len(callers.get(f, [])), f, len(callees.get(f, [])))
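The bookkeeping in the script above can be tried outside IDA on a hand-written call list. This is only a sketch: the (caller, callee) pairs are hypothetical stand-ins for what CodeRefsTo() and GetFunctionName() would yield, and collections.defaultdict replaces the get(..., Set()) idiom.

```python
from collections import defaultdict

# Hypothetical (caller, callee) pairs standing in for what
# CodeRefsTo() and GetFunctionName() would yield inside IDA.
calls = [
    ("main", "parse_args"),
    ("main", "run"),
    ("run", "read_input"),
    ("run", "parse_args"),
    ("run", "run"),  # direct recursion counts in both directions
]

callers = defaultdict(set)  # function -> functions calling it
callees = defaultdict(set)  # function -> functions it calls

for caller, callee in calls:
    callers[callee].add(caller)
    callees[caller].add(callee)


def degrees(name):
    """Return (indegree, outdegree) for a function name."""
    return len(callers.get(name, ())), len(callees.get(name, ()))


# Same 'indegree:name:outdegree' layout as the script above.
for name in sorted(set(callers) | set(callees)):
    indegree, outdegree = degrees(name)
    print("%d:%s:%d" % (indegree, name, outdegree))
```

Because sets are used, several call sites from the same caller to the same callee count once, as in the original script.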

Walking the Instructions

The fourth example takes us to the instruction level. For each segment, we walk through all the defined elements by means of Heads(start address, end address) and check whether the element defined at each address is an instruction; if so, the mnemonic is fetched and its occurrence count is updated in the mnemonics dictionary. Finally, the mnemonics and their number of occurrences are shown.

### Mnemonics histogram
    
mnemonics = dict()

# For each of the segments
for seg_ea in Segments():

    # For each of the defined elements
    for head in Heads(seg_ea, SegEnd(seg_ea)):


        # If it's an instruction
        if isCode(GetFlags(head)):
        
            # Get the mnemonic and increment the mnemonic
            # count
            mnem = GetMnem(head)
            mnemonics[mnem] = mnemonics.get(mnem, 0)+1

# Sort the mnemonics by number of occurrences, as (count, mnemonic) pairs
by_count = map(lambda x:(x[1], x[0]), mnemonics.items())
by_count.sort()


# Print the sorted list
for count, mnemonic in by_count:
    print mnemonic, count
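The counting and sorting steps can also be expressed with collections.Counter from the standard library. The sketch below runs outside IDA on a hypothetical list of mnemonics standing in for the GetMnem() results. Note that most_common() lists the highest counts first, whereas the script above sorts in ascending order.

```python
from collections import Counter

# Hypothetical mnemonic stream standing in for GetMnem() results.
mnemonics_seen = ["push", "mov", "call", "mov", "pop", "mov", "call", "ret"]

# Counter does the get(mnem, 0) + 1 bookkeeping internally.
histogram = Counter(mnemonics_seen)

# most_common() returns (mnemonic, count) pairs, highest count first.
for mnemonic, count in histogram.most_common():
    print("%s %d" % (mnemonic, count))
```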

Cyclomatic Complexity

The next example goes a bit further. It will go through all the functions and for each of them it will compute the Cyclomatic Complexity. The Cyclomatic Complexity measures the complexity of the code by looking at the nodes and edges (basic blocks and branches) of the graph of a function. It is usually defined as:

CC = Edges - Nodes + 2
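As a quick check of the formula, the sketch below applies it to two small hand-written control-flow graphs. The helper name cyclomatic_complexity_of and the graphs are illustrative only. The components parameter is an addition not found in the article: McCabe's general form is Edges - Nodes + 2P for P connected components, which reduces to the formula above when the graph of a function is a single component.

```python
def cyclomatic_complexity_of(edges, components=1):
    """Compute Edges - Nodes + 2 * components for a list of (src, dst) edges.

    Nodes are inferred from the edge endpoints; with components == 1
    this is the formula given in the text.
    """
    nodes = set()
    for src, dst in edges:
        nodes.add(src)
        nodes.add(dst)
    return len(set(edges)) - len(nodes) + 2 * components


# Straight-line code: one block falling through into the next.
straight = [("entry", "exit")]

# An if/else "diamond": entry branches to two blocks that rejoin at exit.
diamond = [("entry", "then"), ("entry", "else"),
           ("then", "exit"), ("else", "exit")]

# A loop: the header either enters the body or leaves; the body jumps back.
loop = [("entry", "header"), ("header", "body"),
        ("body", "header"), ("header", "exit")]

print(cyclomatic_complexity_of(straight))
print(cyclomatic_complexity_of(diamond))
print(cyclomatic_complexity_of(loop))
```

Each additional decision point (a branch or a loop condition) raises the value by one, which is why straight-line code scores 1 and the two other graphs score 2.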

The function cyclomatic_complexity() in the example will compute this value, given the function's start address as input.

The example can be run in two different modes. The first one is invoked as usual, through IDAPython, by locating the Python script and running it. A second way is to launch IDA and make it run the script in batch mode; that will be explored in the next section.

Function chunks are not considered in this example. Recent versions of IDA added support for function chunks, which result from some compilers' optimization passes. It is possible to walk the chunks by using func_tail_iterator_t from the function API. The following code shows how to iterate through the chunks.

### Collecting function chunks

function_chunks = []

# Get the tail iterator
func_iter = func_tail_iterator_t(get_func(ea))

# While the iterator's status is valid
status = func_iter.main()

while status:
    # Get the chunk
    chunk = func_iter.chunk()

    # Store its start and ending address as a tuple
    function_chunks.append((chunk.startEA, chunk.endEA))

    # Get the last status
    status = func_iter.next()
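Once the chunks are collected as (start, end) tuples, deciding whether an address belongs to the function becomes a range test over that list. The helper in_any_chunk and the addresses below are hypothetical, and the sketch assumes the end address is exclusive, which is the convention IDA uses for address ranges.

```python
def in_any_chunk(address, chunks):
    """Return True if address falls inside one of the (start, end) chunks.

    The end address is treated as exclusive: start <= address < end.
    """
    for start, end in chunks:
        if start <= address < end:
            return True
    return False


# Hypothetical chunk list: a main body plus one tail placed elsewhere.
example_chunks = [(0x401000, 0x401040), (0x402200, 0x402210)]

print(in_any_chunk(0x401010, example_chunks))  # inside the main body
print(in_any_chunk(0x402205, example_chunks))  # inside the tail chunk
print(in_any_chunk(0x401800, example_chunks))  # between the two chunks
```

A test of this kind could stand in for a single start/end comparison if chunk support were added to a function-level analysis.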

### Cyclomatic complexity

import os

from sets import Set

def cyclomatic_complexity(function_ea):
    """Calculate the cyclomatic complexity measure for a function.
    
    Given the starting address of a function, it will find all
    the basic block's boundaries and edges between them and will
    return the cyclomatic complexity, defined as:
    
        CC = Edges - Nodes + 2
    """

    f_start = function_ea
    f_end = FindFuncEnd(function_ea)
    
    edges = Set()
    boundaries = Set((f_start,))
    
    # For each defined element in the function.
    for head in Heads(f_start, f_end):
    
        # If the element is an instruction
        if isCode(GetFlags(head)):
        
            # Get the references made from the current instruction
            # and keep only the ones local to the function.
            refs = CodeRefsFrom(head, 0)
            # f_end is the first address past the function, so it is excluded.
            refs = Set(filter(lambda x: x>=f_start and x<f_end, refs))
            
            if refs:
                # If the flow continues also to the next (address-wise)
                # instruction, we add a reference to it.
                # For instance, a conditional jump will not branch
                # if the condition is not met, so we save that
                # reference as well.
                next_head = NextHead(head, f_end)
                if isFlow(GetFlags(next_head)):
                    refs.add(next_head)
                
                # Update the boundaries found so far.
                boundaries.union_update(refs)
                            
                # For each of the references found, an edge is
                # created.
                for r in refs:
                    # If the flow could also come from the address
                    # previous to the destination of the branching
                    # an edge is created.
                    if isFlow(GetFlags(r)):
                        edges.add((PrevHead(r, f_start), r))
                    edges.add((head, r))

    return len(edges) - len(boundaries) + 2
    
    
def do_functions():
    cc_dict = dict()
    
    # For each of the segments
    for seg_ea in Segments():
        # For each of the functions
        for function_ea in Functions(seg_ea, SegEnd(seg_ea)):
            cc_dict[GetFunctionName(function_ea)] = cyclomatic_complexity(function_ea)
    
    return cc_dict

# Wait until IDA has done all the analysis tasks.
# If loaded in batch mode, the script will be run before
# everything is finished, so the script will explicitly
# wait until the autoanalysis is done.
autoWait()

# Collect data
cc_dict = do_functions()

# Get the list of functions and sort it.
functions = cc_dict.keys()
functions.sort()
ccs = cc_dict.values()

# If the environment variable IDAPYTHON exists and its value is 'auto' 
# the results will be appended to a data file and the script will quit
# IDA. Otherwise it will just output the results.

if os.getenv('IDAPYTHON') == 'auto':
    results = file('example5.dat', 'a+')

    results.write('%3.4f,%03d,%03d %s\n' % (
        sum(ccs)/float(len(ccs)), max(ccs), min(ccs), GetInputFile()))

    results.close()
            
    Exit(0)
    
else:
    # Print the cyclomatic complexity for each of the functions.
    for f in functions:
        print f, cc_dict[f]
        
    # Print the maximum, minimum and average cyclomatic complexity.
    print 'Max: %d, Min: %d, Avg: %f' % (max(ccs), min(ccs), sum(ccs)/float(len(ccs)))

Automating IDA through IDAPython

As mentioned in the last section, the previous example has a second mode of operation. IDAPython now supports running Python scripts at startup, from the command line. Such functionality comes in handy, to say the least, when analyzing a set of binaries in batch mode.

The switch -OIDAPython:/path/to/python/script.py tells IDAPython which script to run. Another useful switch is -A, which instructs IDA to run in batch mode, asking no questions and just performing the auto-analysis. With those two options combined it is possible to auto-analyze a binary and run a Python script to perform some mining. A function that will usually be required is autoWait(), which makes the Python script wait until IDA has finished the analysis. It is a good idea to call it at the beginning of any script. To analyze a bunch of files, a command like the following could be entered (if working in Bash on Linux):

for virus in virus/*.idb; do IDAPYTHON='auto' idal -A -OIDAPython:example5.py $virus; done

It will go through all the .idb files in the virus/ directory and will invoke idal with each of them, running the script example5.py on load.

The script is the one in the last example. If it finds the environment variable IDAPYTHON set to 'auto', it will just collect the data and append it to a file instead of showing it in IDA's messages window. Subsequently it will call Exit() to close the database and quit.

It would be equally easy to analyze a set of executables in batch mode. If IDB files are given, IDA will just load them and no auto-analysis will be performed; otherwise, if a binary file is provided, the analysis will be done and the script run once it has finished.
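The Bash loop above can also be driven from Python. The sketch below only assembles the command line and environment for one run and executes nothing. The helper name build_idal_invocation and the target path virus/klez_a.idb are hypothetical, while the switches and the IDAPYTHON variable are the ones described in the text.

```python
import os


def build_idal_invocation(script_path, target_path, ida_binary="idal"):
    """Return (argv, env) for one batch run; nothing is executed here."""
    # -A: batch mode, -OIDAPython:<script>: script to run on load.
    argv = [ida_binary, "-A", "-OIDAPython:" + script_path, target_path]

    # Copy the current environment and add the flag example5.py checks
    # before appending to its data file and exiting.
    env = dict(os.environ)
    env["IDAPYTHON"] = "auto"
    return argv, env


argv, env = build_idal_invocation("example5.py", "virus/klez_a.idb")
print(" ".join(argv))
```

The returned pair could then be handed to subprocess.call(argv, env=env) once per input file, for instance over the result of glob.glob('virus/*.idb').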

All this allows for a good degree of automation in the analysis of a set of binaries. For instance, the next table is the output of running the previous script on a set of malware IDBs. Note how the families cluster clearly by their cyclomatic complexity values.

Output of running the example in batch mode on a set of malware binaries.

Family	Avg.	Max	Min	Filename
Klez	7.4197	148	001	klez_a.ex
	7.4975	148	001	klez_b.ex
	7.5972	148	001	klez_c.ex
	7.5972	148	001	klez_d.ex
	7.0349	148	001	klez_e.ex
	7.0502	148	001	klez_f.ex
	7.0502	148	001	klez_g.ex
	7.0573	148	001	klez_h.ex
	7.0573	148	001	klez_i.ex
	7.0502	148	001	klez-j.ex
Mimail	3.2190	052	001	mimailA.ex_.1.unp
	3.2353	052	001	mimailB.ex_
	3.2313	052	001	mimailC.ex_.1.unp
	3.4148	052	001	mimailD.ex_
	2.8110	052	001	mimailE.ex_.1.unp
	2.7953	052	001	mimailF.ex_.1.unp
	2.7638	052	001	mimailG.ex_.1.unp
	2.7874	052	001	mimailH.ex_.1.unp
	2.8376	052	001	mimailI.ex_.1.unp
	2.8632	052	001	mimailJ.ex_
	2.8984	052	001	mimailL.ex_.1.unp
	2.8231	052	001	mimail-m_u.ex
	3.4375	052	001	outlook_.dmp
	3.1138	052	001	mimail-s_u.ex
Sasser	6.5301	039	001	sasser.avpe
	6.5422	039	001	sasser-b.avpe
	6.6098	039	001	sasser-c.avpe
	6.5955	041	001	sasser-d.ex_unp.exe
	6.5444	041	001	sasser-e.unp
	6.8452	041	001	sasser-f.unp
	8.0000	041	001	sasser-g.unp
Netsky	7.3505	041	001	netskyaa.unp
	7.4947	041	001	netsky_unk.unp
	7.1667	041	001	netsky_ac.ex_unp
	5.9694	051	001	Netsky.AD.unp
	7.3125	041	001	virus.ex_.1.unp
	7.2478	041	001	your_details.doc.exe.2.unp
	8.0407	123	001	userconfig9x.dl.1.unp
	7.9068	041	001	netsky-q-dll.unp
	7.9068	041	001	netsky-q-dll.unp
	7.5702	041	001	netsky-r-dll_unp_.exe
	7.5657	041	001	list0_unp_.pif
	7.5743	041	001	private.unp.pi_
	7.5268	041	001	netsky_v_unp_.exe
	7.8824	041	001	netsky-w.unp
	6.8165	041	001	netsky.pif.2.unp
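Each record in example5.dat follows the '%3.4f,%03d,%03d %s' format written by the script, so the data behind the table can be read back with a few string splits. The sketch below is one possible reader, with the hypothetical helper name parse_result_line. The sample lines are copied from the table above, and grouping by maximum complexity is just one simple way to look at the clustering.

```python
def parse_result_line(line):
    """Split one 'avg,max,min filename' record into typed fields."""
    avg_text, max_text, rest = line.rstrip("\n").split(",", 2)
    # The filename is everything after the first space and may itself
    # contain dots or further spaces.
    min_text, filename = rest.split(" ", 1)
    return float(avg_text), int(max_text), int(min_text), filename


# A few records copied from the table above, in the file's line format.
sample_lines = [
    "7.4197,148,001 klez_a.ex\n",
    "7.4975,148,001 klez_b.ex\n",
    "3.2190,052,001 mimailA.ex_.1.unp\n",
    "6.5301,039,001 sasser.avpe\n",
]

# Group the file names by their maximum cyclomatic complexity.
by_max = {}
for line in sample_lines:
    avg, cc_max, cc_min, filename = parse_result_line(line)
    by_max.setdefault(cc_max, []).append(filename)

for cc_max in sorted(by_max):
    print("%d %s" % (cc_max, ", ".join(by_max[cc_max])))
```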

Visualizing Binaries

This example is based on the one collecting the indegree and outdegree of all functions. This time, we will use that information to generate a graph of the call tree and plot it using pydot (Carrera 2005a), a package that interfaces with Graphviz (Ellson et al. 2005).

The code follows. The only changes from the example it is based on are the lines creating the graph, setting some defaults and then adding the edges.

### Visualizing Binaries

from sets import Set
import pydot

# Get the address under the cursor; its segment bounds the search
ea = ScreenEA()

callers = dict()
callees = dict()

# Loop through all the functions
for function_ea in Functions(SegStart(ea), SegEnd(ea)):

    f_name = GetFunctionName(function_ea)
    
    # For each of the incoming references
    for ref_ea in CodeRefsTo(function_ea, 0):
    
        # Get the name of the referring function
        caller_name = GetFunctionName(ref_ea)
        
        # Add the current function to the list of functions
        # called by the referring function
        callees[caller_name] = callees.get(caller_name, Set())

        callees[caller_name].add(f_name)
        
# Create graph        
g = pydot.Dot(type='digraph')

# Set some defaults
g.set_rankdir('LR')
g.set_size('11,11')
g.add_node(pydot.Node('node', shape='ellipse', color='lightblue', style='filled'))
g.add_node(pydot.Node('edge', color='lightgrey'))


# Get the list of all functions
functions = Set(callees.keys()+callers.keys())

# For each of the functions and each of the called ones, add
# the corresponding edges.
for f in functions:
    if callees.has_key(f):
        for f2 in callees[f]:
            g.add_edge(pydot.Edge(f, f2))
            
# Write the output to a Postscript file
g.write_ps('example6.ps')

Different plots of the same output can be obtained by using the different layout utilities provided by Graphviz.
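When pydot is not at hand, the same call graph can be written out as DOT text by hand and passed to any of the Graphviz tools afterwards. The sketch below runs outside IDA. The function call_graph_to_dot and the toy call relationships are illustrative only, and names containing double quotes would need escaping, which is not handled here.

```python
def call_graph_to_dot(callees):
    """Render a {caller: set_of_callees} mapping as DOT source text."""
    lines = ["digraph G {", "    rankdir=LR;"]
    # Sorting keeps the output stable between runs.
    for caller in sorted(callees):
        for callee in sorted(callees[caller]):
            lines.append('    "%s" -> "%s";' % (caller, callee))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical call relationships in the shape of the callees dictionary.
toy_callees = {"main": set(["init", "run"]), "run": set(["read_input"])}

dot_text = call_graph_to_dot(toy_callees)
print(dot_text)
```

Saving dot_text to a file and running a Graphviz layout program such as dot or neato on it should yield plots comparable to those produced through pydot.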

Projects Using IDAPython

It might also be useful to look at some existing projects based solely on IDAPython. Some of them are:

idb2reml (Carrera 2005): exports IDB information to an XML format, REML (ReverseEngineering ML).
pyreml (Carrera 2005a): loads the REML produced by idb2reml and provides a set of functions to perform advanced analysis.

This paper is also available in PDF form. The PDF version is 64 pages long, contains in addition to this article a full function reference and is available from Introduction to IDAPython.pdf.
