📚 OpenRCE is preserved as a read-only archive. Launched at RECon Montreal in 2005. Registration and posting are disabled.









 Forums >>  Brainstorms - General  >>  Automated Function Recognition

Topic created on: November 5, 2006 17:43 CST by tagetora .

Hi all,

I have an idea I'd like to discuss here. When I begin reversing a new target, I usually find that it contains some known functions. I mean functions that are common constructions emitted by the compiler, copy&pasted code, or standard implementations of public protocols/operations (for example: SMTP/POP engines, MIME, NAT operations, RSA...)

We can recognise these functions and label them, but this is a repetitive task. For every piece of code that fits this description, there will be a lot of people who have already analyzed it (or a very similar one). So... why not have a repository of this kind of information?

I mean a repository of functions (or pieces of code) with their respective comments or descriptions, where people can submit their findings or search for code that resists analysis ;D

Think of a "google code" for reversers, maybe merged with a "bindiff"-like engine.

I don't know if something like this already exists, so I want any kind of feedback (from "you're crazy, man" to "I have an implementation of that thing", there's no problem).

Well, thanks for reading all this stuff. I hope that it was not a waste of time ;P

  ryanlrussell     November 5, 2006 21:26.18 CST
Yep, IDA Pro does this for libc with its FLIRT libraries.  It would be nice if these were available for openssl, zlib, and similar libraries.

In some cases, if you've got hard-linked code like openssl, it was probably included from source.  That means it's not going to be a bit-for-bit match on different compilers and settings.  That's when you would need the bindiff-like functionality, and you will get some false positives and negatives.
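A minimal sketch of why a bit-for-bit match breaks across compilers, and of one possible looser comparison. The helper names and the hand-written instruction lists are hypothetical; this is not how FLIRT or bindiff work internally. It only contrasts a digest over raw bytes with a digest over instruction mnemonics.

```python
import hashlib

def exact_digest(raw_bytes: bytes) -> str:
    """Digest of the raw machine code: any encoding difference changes it."""
    return hashlib.sha1(raw_bytes).hexdigest()

def mnemonic_digest(instructions) -> str:
    """Digest of the mnemonic sequence only, ignoring registers and addresses."""
    mnemonics = [mnem.lower() for mnem, _operands in instructions]
    return hashlib.sha1(" ".join(mnemonics).encode("ascii")).hexdigest()

# Two valid encodings of "push ebp; mov ebp, esp" (8B EC vs 89 E5).
prologue_a = bytes.fromhex("558bec")
prologue_b = bytes.fromhex("5589e5")

# Hypothetical disassembly of the same source built with different settings:
# register allocation and call targets differ, the mnemonic sequence does not.
build_a = [("push", "ebp"), ("mov", "ebp, esp"), ("mov", "eax, [ebp+8]"),
           ("call", "0x401000"), ("pop", "ebp"), ("ret", "")]
build_b = [("push", "ebp"), ("mov", "ebp, esp"), ("mov", "ecx, [ebp+8]"),
           ("call", "0x40A230"), ("pop", "ebp"), ("ret", "")]

print(exact_digest(prologue_a) == exact_digest(prologue_b))   # byte-level match
print(mnemonic_digest(build_a) == mnemonic_digest(build_b))   # mnemonic-level match
```

Dropping operands makes the comparison tolerant of compiler differences, but it also makes unrelated functions with the same instruction shape collide, which is where the false positives come from.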

  MohammadHosein     November 5, 2006 22:36.26 CST
D:\IDA pro 5\IDA5FLAIR\flair\bin
make your own and share ;)

  pedram     November 6, 2006 00:22.51 CST
I don't think generation of new FLIRT sigs will work quite as well as creating a database of bindiff-like heuristics.

We've talked about building something like this at TippingPoint and I think it's a great idea. I'd be curious to see how effective this would be and would love to host the repository / service on OpenRCE. Hopefully it won't receive as little contribution as other references (such as the IDA SDK) have.

Taking a step to an even higher abstraction level, it might also be interesting (if it ever gets implemented) to maintain a database of my malware profiler XML entries. See my RECON 2006 slides from this year for more information but essentially the idea is to take guesses at common functionality of unnamed subroutines through XML descriptions such as:

<classification name="SMTP Engine">
    <API name="htons">
        <argument index="1">25</argument>
    </API>
</classification>

<classification name="Address Harvesting">
    <API name="FindFirstFile"></API>
    <API name="FindNextFile"></API>
    <API name="MapViewOfFile"></API>
    <string match="regex">
        [^@]+@[^\.]+\.com
    </string>
</classification>

<classification name="Startup Entry">
    <API name="RegCreateKeyEx">
        <argument index="1">
            HKEY_LOCAL_MACHINE
        </argument>
        <argument index="2">
            <string match="regex">\\run|\\runonce</string>
        </argument>
    </API>
</classification>
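As a rough illustration of how such descriptions could be applied, here is a small sketch that mirrors two of the classifications above as Python dictionaries and checks them against facts extracted from one subroutine. The `RULES` layout, the `classify` function and the input format are assumptions made for this example; they are not the profiler's actual format or code.

```python
import re

# Hypothetical in-memory form of the XML classifications above.
# "apis" maps an API name to required {argument index: value} pairs.
RULES = {
    "SMTP Engine": {"apis": {"htons": {1: "25"}}, "strings": []},
    "Address Harvesting": {
        "apis": {"FindFirstFile": {}, "FindNextFile": {}, "MapViewOfFile": {}},
        "strings": [r"[^@]+@[^\.]+\.com"],
    },
}

def classify(calls, strings):
    """Return the labels whose rules are all satisfied.

    calls:   list of (api_name, {arg_index: value}) seen in the subroutine
    strings: list of string constants referenced by the subroutine
    """
    labels = []
    for label, rule in RULES.items():
        # Every required API must be called with every required argument.
        apis_ok = all(
            any(name == api and all(args.get(i) == v for i, v in wanted.items())
                for name, args in calls)
            for api, wanted in rule["apis"].items()
        )
        # Every required pattern must match at least one referenced string.
        strings_ok = all(
            any(re.search(pattern, s) for s in strings)
            for pattern in rule["strings"]
        )
        if apis_ok and strings_ok:
            labels.append(label)
    return labels

print(classify([("socket", {}), ("htons", {1: "25"})], []))
```

A rule this loose is only a guess at functionality: any subroutine that passes 25 to `htons` is labelled an SMTP engine, so the output is a hint for the analyst rather than a verdict.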

  sp     November 6, 2006 02:18.15 CST
In 2005 Andrew Schulman published three articles named "Finding Binary Clones with Opstrings & Function Digests" in Dr. Dobb's Journal. They're not completely about your suggestion but they might nevertheless be interesting.

http://www.ddj.com/showArticle.jhtml?articleID=184406152
http://www.ddj.com/documents/ddj0508h/0508h.html
http://www.ddj.com/documents/ddj0509i/0509i.html

  tagetora   November 6, 2006 13:38.51 CST
First of all, thank you...

Ryan and Mohammad:

It's a nice idea to use FLIRT signatures, but it doesn't seem to be as flexible as I want. However, I think it's a good start.

Pedram:

My concept of abstract representation (bindiff-like output) is close to your "XML description", but I think I would need a higher level of detail to represent a function/piece of code. I don't know exactly how it would work, but my initial idea is to have an XML output that interacts with the search engine.

sp:

very interesting read! I'm taking a look just now, but it will take me some time to process all this information ;D

Well, thanks again.

  tthtlc     November 10, 2006 01:35.41 CST
Your idea can be solved by Google CodeSearch.

Just go to:

http://www.google.com/codesearch/advanced_code_search

And select "C language" or "Assembly" depending on your inputs, and then you can find lots of similar code segments everywhere.

  PSUJobu     November 10, 2006 07:58.27 CST
> tthtlc: Your idea can be solved by Google CodeSearch.

To clarify, we are discussing ways to "find" routines with known purpose inside a disassembly. We don't need the source for an "SMTP Engine" (to use Pedram's example) -- we want to identify a number of "known" routines in an unknown binary.

This way, we don't waste hours of work reverse engineering "well known" routines and can boil down a binary (e.g., a new malware sample) to:

1. Its feature set - by identifying the well-known routines

2. Its unique payload - by reverse engineering the rest (or some subset thereof)

BTW, since I haven't chimed in previously, I like some of the ideas I see here. I have been faced with similar challenges before, and there are definitely some good concepts here that I'll have to try out...

  pedram     November 10, 2006 11:52.12 CST
> PSUJobu: > tthtlc: Your idea can be solved by Google CodeSearch.
>
> To clarify, we are discussing ways to "find" routines with known purpose inside a disassembly. We don't need the source for an "SMTP Engine" (to use Pedram's example) -- we want to identify a number of "known" routines in an unknown binary.

Exactly. Most interesting of all would be a central repository of submitted high-level "signatures" that can then be applied to whatever new binary you may be looking at.

  tagetora   November 11, 2006 06:15.34 CST
What tthtlc says is what we usually do (at least I do) when we find an unknown function. I try to pick a non-standard piece of code inside the function (unusual asm or operands) and search the internet for hits.

The main problem I see is that you will face a lot of functions where you can't easily pick out a piece of code that makes them different/unique. Also, the asm you can search for in Google Code Search is code written directly in asm, which will always be a bit different from compiler-generated asm. I haven't been able to match code 3 or more lines long; maybe my regexp skills are not the best ;D

I'm still reading a lot of technical papers related to this topic, but my initial idea is to represent functions as a set of properties such as API calls, data references, code structure, data flow... so you can search based on these (hey, I have a function with 3 parameters that calls <list_of_api's> and uses <list_of_data_resources>, etc.) and get results that match, plus functions that share part of your function's set of properties (think of functions that call the same APIs and have a similar set of data refs or the same code structure).
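One way to picture the property-set idea is a tiny in-memory search that scores stored functions against a query by set overlap. Everything here is hypothetical: the property names (`apis`, `data_refs`, `num_args`), the equal weighting, the threshold and the sample repository are all assumptions, and code structure and data flow are left out.

```python
def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity(query, candidate):
    """Average of three simple property comparisons (equal weights assumed)."""
    api_score = jaccard(query["apis"], candidate["apis"])
    data_score = jaccard(query["data_refs"], candidate["data_refs"])
    arg_score = 1.0 if query["num_args"] == candidate["num_args"] else 0.0
    return (api_score + data_score + arg_score) / 3

def search(query, repository, threshold=0.5):
    """Return (score, name) pairs at or above the threshold, best first."""
    scored = [(similarity(query, props), name) for name, props in repository.items()]
    return sorted([hit for hit in scored if hit[0] >= threshold], reverse=True)

# Hypothetical repository of previously analyzed functions.
repository = {
    "smtp_send": {"apis": {"socket", "connect", "send", "recv"},
                  "data_refs": {"HELO", "MAIL FROM:", "RCPT TO:"},
                  "num_args": 3},
    "file_copy": {"apis": {"CreateFileA", "ReadFile", "WriteFile"},
                  "data_refs": set(),
                  "num_args": 2},
}

query = {"apis": {"socket", "connect", "send"},
         "data_refs": {"HELO", "MAIL FROM:"},
         "num_args": 3}

for score, name in search(query, repository):
    print(f"{name}: {score:.2f}")
```

Because the score is a partial overlap rather than an exact match, the search also returns functions that share only part of the query's properties, which is the "similar but not identical" behaviour described above.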

As always, any ideas are welcome.
Thank you all.

  gera     November 13, 2006 06:23.49 CST
Not exactly what you are describing, but a much simpler old idea, now finally instantiated by Ilfak (http://hexblog.com/2006/02/findcrypt2.html), may help you sometimes. It can certainly be extended to more stuff.

  ryanlrussell     November 13, 2006 13:53.13 CST
I think findcrypt is a lot more of a special case, since it is just looking for various ways to represent a handful of magic numbers.

I don't think that extends to, for example, 20 different ways to do strlen().
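To make the "handful of magic numbers" point concrete, here is a minimal sketch of a constant scanner. It is not findcrypt's implementation; the `MAGIC` table and `find_magic` function are hypothetical, and only three well-known constants are included.

```python
import struct

# A few well-known 32-bit constants (a tiny, illustrative table).
MAGIC = {
    0x67452301: "MD5/SHA-1 initial state word",
    0xC3D2E1F0: "SHA-1 initial state word",
    0xEDB88320: "CRC-32 reflected polynomial",
}

def find_magic(blob: bytes):
    """Return sorted (offset, byte order, label) hits for each known constant."""
    hits = []
    for value, label in MAGIC.items():
        # The same number can be stored in either byte order.
        for fmt, order in (("<I", "little"), (">I", "big")):
            needle = struct.pack(fmt, value)
            offset = blob.find(needle)
            while offset != -1:
                hits.append((offset, order, label))
                offset = blob.find(needle, offset + 1)
    return sorted(hits)

sample = (b"\x90\x90" + struct.pack("<I", 0x67452301)
          + b"\xcc" + struct.pack(">I", 0xEDB88320))
for offset, order, label in find_magic(sample):
    print(f"{offset:#x} ({order}-endian): {label}")
```

This works because the constants are fixed by the algorithm's specification. A routine like strlen() has no such fixed data, so there is nothing comparable to search for.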
