BinNavi database format
Ero Carrera (ero) <erocarreragmailcom> Tuesday, February 6 2007 16:16.00 CST


In the latest version of BinNavi we moved to SQL for storage of the disassembly information.

We spent a fair amount of time thinking about how to best store the disassembly information in a way that would be as little architecture-dependent as possible and allow for fast querying while, at the same time, trying to not make it too hard to use directly through SQL.

We also wanted to be able to have a central repository of all disassembly data, doing away with the need of keeping local databases that easily get out of sync. A central repository has other advantages like allowing for different users to look at the same project.

The disassembly data is currently exported from IDA into the database via a exporter , ida2sql, written in Python and requiring IDAPython. This exporter is included with BinNavi and also made available separately. If you want to play with it just drops us a line at Sabre.

One can already perform interesting analysis by just using SQL queries directly, although we provide a lot of the functionality in a more convenient form through BinNavis integrated Python interface.

The core set of tables looks like this:

BinNavis database schema

The basic table layout is:

  • modules, which holds the information about all the disassemblies in the database, the name of the file, the date it was imported and a comment field

For each of the following the module id will be appended to the table name.

  • functions, containing all the functions in the disassembly and specifying their address name and type

  • basic_blocks, all the basic blocks in the disassembly and their parent function

  • instructions, the instructions making the basic blocks. Contains the data making up the instruction together with its address, mnemonic and parent basic block

  • callgraph relates all callers and callees

  • control_flow_graph expresses all the links between basic blocks

  • operand_strings contains the operand strings as shown by IDA

  • expression_tree represents all expressions composing the operands as a tree

  • operand_tuples maps addresses to the operands used by the instruction at such location

  • expression_substitutions allows to replace any part of the expression tree of an operand with a string, variable names are handled through this table

  • operand_expressions relates the operands to the expressions composing them

  • address_references contains all references, both to code and data labeled with their type

  • sections holds the raw data for the section composing the binary source for the disassembly


Currently only MySQL has been thoroughly tested. Some of the advantages the schema brings are:
  • Supports multiple modules in a single database

  • Instruction operands are stored as trees, which enables it to support a variety of architectures, efficient storage and advanced querying

  • Use of SQL statements to perform advanced analysis and data-mining

  • The exporter module from IDA to the SQL schema currently supports METAPC(x86), PPC and ARM exporting (the latter two are still in beta). More will be added in the future.

Some query examples



Listing the modules in the database:
SELECT * FROM modules

Counting the number of functions:
SELECT count(address) FROM functions

Counting the number of basic blocks (blocks shared by functions will be counted multiple times):
SELECT count(address) FROM basic_blocks

and without counting the shared ones:
SELECT count(DISTINCT(address)) FROM basic_blocks;

How about a histogram showing the of mnemonic distribution?
SELECT mnemonic, COUNT(mnemonic) as mnem_count FROM instructions GROUP BY mnemonic ORDER BY mnem_count;


.
.
or, 275
inc, 361
movzx, 377
leave, 392
sub, 403
stos, 429
.
.

or something a bit more elaborate like getting all addresses in a disassembly where a specific register is used:

SELECT
    HEX(instructions.address), mnemonic
FROM
    instructions
JOIN
    (operand_tuples, operand_expressions, expression_tree)
ON
    instructions.address = operand_tuples.address AND
    operand_tuples.operand_id = operand_expressions.operand_id AND
    expression_tree.id = operand_expressions.expr_id
WHERE
    symbol=ebp;



.
.
71CF14D3, mov
71CF14D6, mov
71CF14E9, pop
71CF15E1, push
71CF15E2, mov
71CF15E6, mov
71CF1617, pop
.
.

for queries like the last one or more complex ones its usually a good idea to move to the embedded Python interpreter in BinNavi before getting lost in SQL...

Comments
Posted: Wednesday, December 31 1969 18:00.00 CST