Function Search Engine (or whatever)

Tue, 23 Jan 2007 13:20:35 -0600

Well, here is the first draft of how the "Function Search Engine" will work. BTW, I'm looking for a name for this thing (something better than FSE or FUSEN). Okay, here we go!

Concept
--------------

The function matching will use several levels of filters. A group of first-level filters will make up the core of the search engine, and the 2nd, 3rd... level filters will be used to "refine" the results. Each filter will handle a single function attribute, a set of attributes, or a relation between parts of the function (this is where we will have to deal with some graph analysis and/or some other abstract representation of the function).

I've put together a first draft of what goes into each filter level. Consider it pre-alpha xD: depending on the test results, some attributes will move from one level to another, new attributes may appear and others may disappear.

First level filters:
- Set of external references.
- Set of internal constant values.
- Stack size.
- Abstract graph representation.

Second level filters:
- Number of parameters.
- Set of types (parameters and/or local variables).
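
To make the level idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the FunctionProfile record, the way the first-level score is averaged and the 0.5 threshold are all hypothetical, and the graph filter is left out. It only shows the mechanics: the first level discards most candidates, and the second level only sees the survivors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionProfile:
    """Hypothetical record holding some of the attributes listed above."""
    name: str
    external_refs: frozenset = frozenset()
    constants: frozenset = frozenset()
    stack_size: int = 0
    num_params: int = 0


def jaccard(a, b):
    """Set similarity in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def first_level(query, candidate, threshold=0.5):
    """Coarse filter: external references, constants and stack size.

    The plain average and the threshold are arbitrary choices for this sketch.
    """
    score = (
        jaccard(query.external_refs, candidate.external_refs)
        + jaccard(query.constants, candidate.constants)
        + (1.0 if query.stack_size == candidate.stack_size else 0.0)
    ) / 3
    return score >= threshold


def second_level(query, candidate):
    """Refinement: here only the number of parameters is compared."""
    return query.num_params == candidate.num_params


def search(query, database):
    """Apply the levels in order; each level only sees the survivors."""
    survivors = [c for c in database if first_level(query, c)]
    return [c for c in survivors if second_level(query, c)]


# Toy database with made-up functions.
db = [
    FunctionProfile("copy_a", frozenset({"memcpy"}), frozenset({0x10}), 16, 2),
    FunctionProfile("copy_b", frozenset({"memcpy"}), frozenset({0x10}), 16, 3),
    FunctionProfile("hash_a", frozenset({"table"}), frozenset({0x9E3779B9}), 64, 2),
]
query = FunctionProfile("unknown", frozenset({"memcpy"}), frozenset({0x10}), 16, 2)
print([c.name for c in search(query, db)])  # ['copy_a']
```

Here copy_b passes the first level but is dropped by the second one (3 parameters instead of 2), and hash_a never gets past the first level.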

Brief description of main filters
----------------------------------------------------

External References: The data used by the function (strings, constant values, data blocks such as a vector or a matrix, floating point numbers, etc.). The data a function uses says a lot about its functionality. Sometimes a simple Internet search for one of these values saves us a lot of time.

Constant values: Same idea as above, but limited to constant values that are embedded in the function's own code, not data stored outside it. Commonly used constants like 0xFFFFFFFF, 0x0, etc. need special care, because they would add a lot of noise to the system.
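
A rough sketch of how the noise problem could be handled: drop a small list of "boring" constants first, then compare what is left as sets. The contents of NOISE_CONSTANTS and the choice of a Jaccard index are my assumptions here, not a decided design.

```python
# Hypothetical noise list: values so common that they say little about a function.
NOISE_CONSTANTS = frozenset({0x0, 0x1, 0xFF, 0xFFFF, 0xFFFFFFFF})


def significant_constants(constants):
    """Keep only the constants that are not in the noise list."""
    return frozenset(constants) - NOISE_CONSTANTS


def constant_similarity(a, b):
    """Jaccard index over the significant constants of two functions.

    Returns None when neither function has a distinctive constant,
    so that "nothing to compare" is not confused with "perfect match".
    """
    sa, sb = significant_constants(a), significant_constants(b)
    if not sa and not sb:
        return None
    return len(sa & sb) / len(sa | sb)


# Two made-up functions that share two of the MD5 initialization constants.
f1 = [0x0, 0xFFFFFFFF, 0x67452301, 0xEFCDAB89]
f2 = [0x1, 0x67452301, 0xEFCDAB89, 0x98BADCFE]
similarity = constant_similarity(f1, f2)  # 2 shared values out of 3 distinct ones
```

Without the noise list, f1 and f2 would look less similar than they are, and two unrelated functions that only use 0x0 and 0xFFFFFFFF would look identical.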

Stack size: Just look at the function prologue (push-mov-sub). It tells us the size of the local variables. I know that the same function can be written with static buffers or with allocated ones, so I'm still looking for the best way to handle this.
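
As an illustration, this is how the push-mov-sub check could look when working on plain disassembly text. It is only a sketch under several assumptions: 32-bit x86, Intel syntax, one instruction per string, and a hypothetical window of 8 instructions. Prologues that use enter, that adjust esp in some other way, or that have no frame at all are simply reported as None.

```python
import re

# Matches "sub esp, 0x28", "sub esp, 28h" or "sub esp, 40" (Intel syntax).
_SUB_ESP = re.compile(
    r"^\s*sub\s+esp\s*,\s*(0x[0-9a-f]+|[0-9a-f]+h|\d+)\s*$",
    re.IGNORECASE,
)


def _parse_immediate(text):
    """Convert the immediate operand to an integer."""
    text = text.lower()
    if text.startswith("0x"):
        return int(text, 16)
    if text.endswith("h"):
        return int(text[:-1], 16)
    return int(text)


def stack_size(lines, window=8):
    """Return the size reserved by the first 'sub esp, N' found within the
    first `window` instructions, or None if there is none."""
    for line in lines[:window]:
        match = _SUB_ESP.match(line)
        if match:
            return _parse_immediate(match.group(1))
    return None


prologue = ["push ebp", "mov ebp, esp", "sub esp, 0x28", "mov eax, [ebp+8]"]
print(stack_size(prologue))  # 40
```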

Abstract graph: This is the "bindiff-like" graph representation of the function. It will be the hardest part of the implementation, so we will focus on it once we have a good general design. I've been talking with the radare developer about this, because he wants to add a graphical diff to his tool. Maybe we can get a base design in a couple of weeks (I know, my time estimates are not very accurate, but...)
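
Just to have something to play with while the real design is pending, here is a very coarse structural signature of a control flow graph: number of blocks, number of edges and the sorted list of (in-degree, out-degree) pairs. This is my own simplification, not BinDiff's algorithm, and the dict-of-successors input format is an assumption. Two functions with different signatures cannot have the same graph shape, but equal signatures do not prove that the graphs are the same.

```python
def graph_signature(cfg):
    """Coarse, label-independent signature of a control flow graph.

    cfg maps a block id to the list of its successor block ids.
    Blocks that only appear as successors are counted as well.
    """
    nodes = set(cfg)
    for successors in cfg.values():
        nodes.update(successors)

    in_degree = {node: 0 for node in nodes}
    for successors in cfg.values():
        for successor in successors:
            in_degree[successor] += 1

    edge_count = sum(len(successors) for successors in cfg.values())
    degrees = tuple(sorted((in_degree[n], len(cfg.get(n, []))) for n in nodes))
    return (len(nodes), edge_count, degrees)


# The same if/else "diamond" with different block names, and a simple loop.
diamond_a = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
diamond_b = {0: [1, 2], 1: [3], 2: [3], 3: []}
loop = {0: [1], 1: [1, 2], 2: []}

print(graph_signature(diamond_a) == graph_signature(diamond_b))  # True
print(graph_signature(diamond_a) == graph_signature(loop))       # False
```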

Testing
--------------

Regarding the tests, at the beginning we will be "good boys" and use short and (of course) well-known functions that are very similar to each other. Think about what makes toupper() different from tolower(), or strcpy() from strncpy(). At the same time, we will try to make different implementations of the same function match each other.

For the final tests, it will be interesting to use the same function but compiled for different architectures (if the system finally supports it).

That's all folks!
-------------------------

As a final word, I just want to mention a concept I have been discussing with a friend. We were talking about the abstract graph representation and wondered why not use a "Divide and Conquer" design: apply all these filters and techniques to "basic function blocks" instead of the entire function. With this approach, we can ask ourselves: does the behaviour of the basic blocks define the behaviour of the whole function? I think it's a nice question, but too hard to answer at this moment... or maybe not ;D

This "basic block" approach brings new problems, like how to define what a "basic block" is, how these blocks change with different compiler flags, whether we need to ignore certain types of blocks (like heap init or clean-up blocks), etc... Well, while I was writing this I saw that Ero posted another interesting entry in his blog about "basic blocks" ;D
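
For the "what is a basic block" question, one possible starting point is the classic textbook definition based on leaders: the first instruction, every jump target and every instruction that follows a jump or a return starts a new block. Below is a minimal sketch of that partition on a toy instruction list. The (mnemonic, target index) format and the short list of jump mnemonics are assumptions for the example, and calls are deliberately not treated as block terminators here.

```python
# Assumed, incomplete list of jump mnemonics for this toy example.
JUMPS = {"jmp", "jz", "jnz", "je", "jne"}
TERMINATORS = JUMPS | {"ret"}


def basic_blocks(instructions):
    """Split a list of (mnemonic, target_index_or_None) into basic blocks.

    Returns a list of blocks, each one a list of instruction indexes.
    """
    if not instructions:
        return []

    leaders = {0}  # the first instruction always starts a block
    for index, (mnemonic, target) in enumerate(instructions):
        if mnemonic in JUMPS and target is not None:
            leaders.add(target)  # a jump target starts a block
        if mnemonic in TERMINATORS and index + 1 < len(instructions):
            leaders.add(index + 1)  # so does whatever follows a jump or ret

    starts = sorted(leaders)
    ends = starts[1:] + [len(instructions)]
    return [list(range(start, end)) for start, end in zip(starts, ends)]


# A tiny if/else shape: index 1 jumps to 4, index 3 jumps to 5.
code = [
    ("cmp", None),  # 0
    ("jz", 4),      # 1
    ("mov", None),  # 2
    ("jmp", 5),     # 3
    ("mov", None),  # 4
    ("ret", None),  # 5
]
print(basic_blocks(code))  # [[0, 1], [2, 3], [4], [5]]
```

How these blocks move around with different compiler flags is exactly the open problem; the sketch only fixes one definition to start testing with.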

"I have promises to keep,
And miles to go before I sleep."

Disassembling in XML

Mon, 18 Dec 2006 09:35:58 -0600

While I was talking with a friend about how to obtain a good abstract representation of a piece of code, he came up with an interesting tool that you can see in action here. It seems that the tool was born as an advanced hex editor, but it now has a lot of reversing uses/utilities.

At first glance, the way instructions are represented in XML seemed very strange to me. For example, the xor has two source operands and a destination, while on x86 xor takes only two operands and the destination is also one of the sources. But if we consider that the tool is not limited to x86, this makes more sense.
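
To show what I mean, here is a small sketch that writes an x86 "xor eax, ebx" in that explicit two-sources-plus-destination style. The element and attribute names (insn, op, dst, src) are made up for the example; I'm not reproducing the tool's real schema.

```python
import xml.etree.ElementTree as ET


def three_operand_xml(mnemonic, dest, sources):
    """Serialize an instruction with one destination and explicit sources.

    The tag and attribute names are hypothetical.
    """
    insn = ET.Element("insn", {"op": mnemonic})
    ET.SubElement(insn, "dst").text = dest
    for source in sources:
        ET.SubElement(insn, "src").text = source
    return ET.tostring(insn, encoding="unicode")


# x86 "xor eax, ebx" reads eax and ebx and writes eax.
print(three_operand_xml("xor", "eax", ["eax", "ebx"]))
# <insn op="xor"><dst>eax</dst><src>eax</src><src>ebx</src></insn>
```

Written this way, the same form can describe a three-operand instruction from another architecture without any special case.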

I still want to use IDA as a base framework for getting all this "function property abstraction" stuff, but maybe it's time to think about how to embed some external tools (a proxy.plw plugin? Mmm, interesting...). Well, that's all for now. I hope to have a first definition of how this whole crazy thing will work soon, maybe with a bit of "alpha" code to play with ;D

who needs backups?

Wed, 13 Dec 2006 06:30:50 -0600

Well, a couple of weeks ago, I started a thread to get some feedback on a complex topic: creating an indexable database of known (i.e. already-analyzed) functions.

After that, no more news. Maybe some of you thought I had given up... well, almost true. A few days after my last post, my smartmon threw a SMART warning saying "hey, your hdd has an expected lifetime of 0 hours" (?). Five minutes later, my root partition was remounted read-only due to read errors, and then the disk just died (very "smart", that smartmon... sorry, I had to make that joke).

The funniest thing is that my last backup was an old and dusty 650 MB CD with a few scripts, tools and reports. Of course I'm getting out of recovery mode, but at very low speed ;D

Ok then, after this "here I am" stuff, I just want to say that I will try to keep this blog updated with thoughts about the function database idea, collecting interesting concepts, suggestions, bibliography, etc.

That's all folks!

OpenRCE: Blog
