Flag: Tornado! Hurricane!

Blogs >> RolfRolles's Blog

Created: Monday, January 21 2008 13:33.49 CST Modified: Monday, January 21 2008 14:08.02 CST
Printer Friendly ...
Binary Search in Large-Scale Structure Recovery
Author: RolfRolles # Views: 3881

Here's an old and simple trick that I use extensively while recovering large structures in C++ reversing.  Briefly, the challenges in structure recovery are to determine:

1) The size of the structure;
2) The inheritance hierarchy that relates this structure to others;
3) The location of the members within the structure;
4) The data types of the members;
5) All of the locations in the code where the structure is used;
6) The holistic picture:  the overall purpose of the structure and the contributions of each data member to that.

This entry is concerned with point #5.  Let's assume that we know (#3) where a particular member within a structure is situated.  In order to figure out (#4) its data type and (#6) its purpose, we should inspect (#5) the locations at which this data member is used.  We might get lucky; maybe we'll find something like this:


mov eax, [esi+Structure.field_XYZ]
push eax
push offset fmtstr ; "%s:  Loading into memory for emulation\n"
call LoggingFunction
add esp, 8


From this we can infer both the data type (char *) and functionality (it's a pointer to the name of the file that is about to be emulated), and draw a conclusion about the overall structure (that it's probably related to emulation).  Perhaps we won't get as lucky as this scenario, but maybe a more subtle clue is revealed by one of the references.  So, how do we find other locations at which this structure member is being used?

The obvious answer would be to text-search for the phrase "[reg32+0XYZh]", but this method has a few drawbacks:

A)  It's slow;
B)  It relies upon the disassembler properly distinguishing code from data, which is in fact impossible to solve generally due to equivalence with the halting problem (a result of indirect addressing, which is the bread and butter of C++'s implementation of polymorphism via function pointers);
C)  Finding the string above just tells us that field_XYZ in *some* structure is being used, not necessarily our particular structure of interest.

Ignoring points A and B for the moment, point C is critical and bears closer inspection; how can we be sure that the results we are finding actually refer to the structure that interests us?  Let's examine the situation for some specific values of XYZ:

Q:  How many structures contain a member defined at XYZ = +0?
A:  All of them.  Therefore if we were to search for [reg32], we could make no guarantees about which structure is actually being used (or even that a structure is being used, period).

Q:  How many structures contain a member defined at XYZ = +4?
A:  Most of them.  The same comment from above applies.

Q:  How many structures contain a member defined at XYZ = +40?
A:  Few of them.  In my experience a program generally contains proportionally very many structures that have size 0x40 or less, and proportionally very few structures larger than that.

Q:  How many structures contain a member defined at XYZ = +X, where X >= 0x80?  X >= 0x100?  X >= 0x1000?  X >= 0x10000?  X >= 0x80 and X is not dword-boundary-aligned?  X >= 0x80 and X is not a multiple of a high power of two?
A:  The larger the structure, the better the chance that the location of its data members are unique, or if not unique, then at least that the structures were derived from a common base class.

The first lesson is that the higher the offset within the structure, the fewer structures are going to have data members defined at that offset, which means that offset searching begins to become feasible for these high-offset data members.  Point C is addressed, but points A and B remain.  To address the latter, let's briefly look at some characteristics of instruction encoding on x86.  Below are some typical structure references:


8B 16             mov edx, [esi]       ; notice that the +0 is not present in the encoding
66 89 42 36       mov [edx+36h], ax    ; notice that the +36h is present as a byte in the encoding
8B 8E AC 5F 00 00 mov ecx, [esi+5FACh] ; notice that the +5FACh is present as a dword in the encoding


Displacements off of a register that fit into a single signed byte, e.g. [esi-80h] ... [esi+7Fh], are represented with a single signed byte in the instruction's encoding, e.g. the 36h from the above.  But searching for a byte is no good; a single byte could appear in any context.  Displacements off of a register that are outside of this small window, e.g. [esi+80h], are represented with a dword in the instruction's encoding, e.g. the AC 5F 00 00 from the above.  Therefore, any time one of these high-offset structure members is accessed directly, we're going to see an entire dword in the instruction stream that corresponds to the offset of the structure member.  Searching for an entire dword gives much more precise results than searching for a byte.

Now all of the machinery is in place for the real point of this entry.  Suppose we can't figure out field_5FAC's data type or functionality, and we would like to see other references to that member to see if they provide any clues.  We could text search for the regular expression [.*+5FACh], and we would be reasonably sure that we were finding references to our structure of interest, or at least structures from the same family, but it would be slow, and would only find references that were defined as code.

This is where IDA's "binary search" feature, alt-B or Search->Sequence of bytes..., comes in handy.  Enter AC 5F 00 00 into the window.  IDA instantaneously brings up a window with sixty-nine lines of code, all of which have the form "mov reg32, [reg32+5FACh]" or "mov [reg32+5FACh], reg32".  There is one additional result of the form "db 0ACh", which, when the surrounding bytes are turned into code, is revealed to be a structure reference of the aforementioned variety.  None of the results are false positives.

The point of this blog entry was to say that, the larger a structure becomes, the more "unique" the addresses of the members within the structure become, and due to the instruction encoding on x86, we can find all direct references to the high-offset structure members quickly, easily, and with few to no false positives using IDA's binary search feature.


Blog Comments
upbupb2 Posted: Tuesday, January 22 2008 08:06.47 CST


but nicely explained :)

c1de0x Posted: Tuesday, January 22 2008 11:25.13 CST
Rolf: good explanation.

I wrote a script to do exactly this (step #5)... now if only I could find it!

Of course, things get a whole lot more interesting when dealing with the 'instance structures' of static instances of polymorphic classes (for e.g.). In those circumstances, MSVC emits code which uses different 'base-offsets' to access the same members, even utilizing 'negative offsets' to look 'before' the base-offset in some strange cases.

It'd be great if we could automate some of the 'heuristics' of structure reconstruction... perhaps by coupling a powerful static analyser with some limited data-flow analysis?

RolfRolles Posted: Tuesday, January 22 2008 15:08.50 CST
Haha upb, I haven't seen that meme in a while ;-)  I wonder how many distinct copies of that image that ImageShack is hosting ... You're right, byte-searching certainly is not brain surgery, although sometimes it's the best/fastest way to find something.  For instance, I wondered to myself, "this AV emulator has got to take a processor exception code and convert it into a Windows exception code at some point ... how/where does it do that?"  Searched for 03 00 00 80, and found it within thiry seconds after weeding out the three "real-looking" false positives (two registry accesses and the program's top-level exception handler).  

Another example would be auditing file-format parsing code.  I was once dealing with a program that had something like sixty DLLs, and I wanted to audit the image handlers for overflows.  Instead of going through some complicated dynamic reversing exercise, I simply byte-searched the directory for the "magic/signature" field of the image file header, and located that code within a matter of minutes.

c1de0x:  you're certainly right on a couple of points.  For one it would be great to have a tool that was formally justified in an academic sense to recover structures; all public attempts thus far have fallen short, but I have high hopes that Hex-Rays will eventually be able to do this.  For two, static instances (and even stack instances, for that matter) do not even appear to be structures at first glance, unless their address is passed into a function that operates on a structure.  I would be happy with a tool that worked reliably in even the simplest of cases.

Regarding negated structure offsets, hmm, that's a good topic for a new entry :-)  Will post that soon.

c1de0x Posted: Wednesday, January 23 2008 00:07.32 CST
Rolf:

Glad to have helped fuel the fire ;0

I'd like to try and develop some heuristics for structure recovery in pynary at some point.

For example, I'd like to use ctors (which are fairly easy to pattern match) as a means of locating statically instanced structures. Once you have a vtable (or 2 ;) ) you could run through the table analyzing offsets from the 'this' (ecx) argument and generating a 'usage table' for class member accesses. In some cases it may be possible to infer field type information, although that relies on a fairly advanced type propagation mechanism.

(Note to self: add type propagation and structure reconstruction to pynary ToDo list.)

Regarding 'dynamic' heap/stack instances, it may be possible to utilize similar logic.... have to give it a little more thought....

All in all, I hope that pynary will become a good platform for 'prototyping' these kinds of heuristics fairly quickly. Care to lend a hand?




Add New Comment
Comment:









There are 31,310 total registered users.


Recently Created Topics
[help] Unpacking VMP...
Mar/12
Reverse Engineering ...
Jul/06
hi!
Jul/01
let 'IDAPython' impo...
Sep/24
set 'IDAPython' as t...
Sep/24
GuessType return une...
Sep/20
About retrieving the...
Sep/07
How to find specific...
Aug/15
How to get data depe...
Jul/07
Identify RVA data in...
May/06


Recent Forum Posts
Finding the procedur...
rolEYder
Question about debbu...
rolEYder
Identify RVA data in...
sohlow
let 'IDAPython' impo...
sohlow
How to find specific...
hackgreti
Problem with ollydbg
sh3dow
How can I write olly...
sh3dow
New LoadMAP plugin v...
mefisto...
Intel pin in loaded ...
djnemo
OOP_RE tool available?
Bl4ckm4n


Recent Blog Entries
halsten
Mar/14
Breaking IonCUBE VM

oleavr
Oct/24
Anatomy of a code tracer

hasherezade
Sep/24
IAT Patcher - new tool for ...

oleavr
Aug/27
CryptoShark: code tracer ba...

oleavr
Jun/25
Build a debugger in 5 minutes

More ...


Recent Blog Comments
nieo on:
Mar/22
IAT Patcher - new tool for ...

djnemo on:
Nov/17
Kernel debugger vs user mod...

acel on:
Nov/14
Kernel debugger vs user mod...

pedram on:
Dec/21
frida.github.io: scriptable...

capadleman on:
Jun/19
Using NtCreateThreadEx for ...

More ...


Imagery
SoySauce Blueprint
Jun 6, 2008

[+] expand

View Gallery (11) / Submit