Topic created on: June 27, 2005 23:48 CDT by hoglund.
This is just an idea, but it might be useful to put into the OpenRCE database a list of functions that can be intrinsic to a binary, for example, strcmp, strtok, etc. Oftentimes these can be identified using a simple byte match, but a database of these signatures might be useful. It might help people who have utilities to 'dress' a binary, similar to FLIRT from DataRescue I guess. If someone knows of an already compiled database of such a thing please let me know :-)
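To make the "simple byte match" idea concrete, here is a rough sketch in Python. The names (matches_at, find_all) and the sample bytes are made up for illustration and are not taken from any real signature database. A None entry in the pattern is a wildcard for a byte that varies between builds, such as an immediate value or a relocated address.

```python
def matches_at(code, offset, pattern):
    """Return True if pattern matches code at offset; None entries match any byte."""
    if offset + len(pattern) > len(code):
        return False
    return all(p is None or code[offset + i] == p
               for i, p in enumerate(pattern))


def find_all(code, pattern):
    """Return every offset in code where pattern matches."""
    last_start = len(code) - len(pattern)
    return [off for off in range(last_start + 1)
            if matches_at(code, off, pattern)]


# Made-up example: two x86 prologues (push ebp; mov ebp, esp; sub esp, imm8)
# that differ only in the stack-frame size, which the pattern wildcards.
code = bytes([0x90,
              0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x10,
              0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x20])
pattern = [0x55, 0x8B, 0xEC, 0x83, 0xEC, None]
offsets = find_all(code, pattern)
```

A real tool would anchor matches at known function starts rather than scanning every offset, but the matching step is the same.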
Funny, just yesterday Pedram was asking me about the usefulness of IDB2PAT for building a library of malware function signatures to ease the analysis of variants.
A limitation/feature of IDA signatures is function length. When you've got lots of very short functions, the odds of a pattern collision are greatly increased (i.e. two functions with the same signature).
Ilfak made a (wise) design decision when creating the FLIRT/FLAIR features in IDA. It only uses the first "X" bytes of a function (I've forgotten what "X" is but could probably look it up), so the resulting signatures are truncated. This adds to the collision problem (i.e. two functions that start off the same way but end differently) and decreases accuracy somewhat, but on the bright side, it greatly increases the speed of application and greatly decreases the storage space required to hold all the signatures. If I remember correctly, an odd side effect of truncation is that the looser match can compensate for differences introduced by various compile-time optimizations.
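The truncation and collision trade-off is easy to see in a toy version. This is only a sketch of the general idea, not how FLIRT stores or matches its signatures; prefix_len stands in for the "X" above and the values used here are arbitrary, as are the function names and bytes.

```python
from collections import defaultdict


def build_signature_db(functions, prefix_len):
    """Map each truncated signature (first prefix_len bytes) to the function names sharing it."""
    db = defaultdict(list)
    for name, body in functions.items():
        db[bytes(body[:prefix_len])].append(name)
    return dict(db)


def collisions(db):
    """Return only the signatures claimed by more than one function."""
    return {sig: names for sig, names in db.items() if len(names) > 1}


# Made-up function bodies: f_a and f_b start the same way but end differently.
functions = {
    "f_a": b"\x55\x8b\xec\x83\xec\x10\xc3",
    "f_b": b"\x55\x8b\xec\x83\xec\x20\xc3",
    "f_c": b"\x33\xc0\xc3",
}

short_db = build_signature_db(functions, prefix_len=4)  # smaller, but f_a/f_b collide
long_db = build_signature_db(functions, prefix_len=6)   # larger, no collisions here
```

With the shorter prefix, collisions(short_db) reports f_a and f_b under one signature; with the longer prefix every function gets its own entry, at the cost of more bytes stored per signature.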
When you're dealing with whole libraries, where complete signatures covering *EVERY* byte would run to multiple GiB, trading some accuracy for speed and storage space is worth it.
Though I've never personally used it, in the commercial BinDiff plugin from Sabre-Security, Halvar uses a novel approach to reach similar identification ends, namely, graph theoretic analysis. For the pedantically challenged, "graph theoretic analysis" means building a signature of the function flows. A painful over-simplification would be: if Myfunct() calls strcmp three times and then calls strlen twice in executable #1, the odds are that if I find a similar flow in executable #2, it might be the same function. Some forms of conditional execution can cause trouble for this method, but you can still get a reasonable degree of accuracy. On the other hand, it would work like a dream on executables with tons of tiny functions.
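The over-simplified version of that idea fits in a few lines of Python. To be clear, this is not BinDiff's algorithm, just the "same sequence of calls" intuition from the example; the function names, the sub_ addresses, and the input format (a dict mapping each function to its ordered list of callees) are all assumptions for illustration.

```python
def flow_signature(callees):
    """Use the ordered sequence of called functions as a crude flow signature."""
    return tuple(callees)


def match_functions(exe1, exe2):
    """For each function in exe1, list the functions in exe2 with the same flow signature."""
    index = {}
    for name, callees in exe2.items():
        index.setdefault(flow_signature(callees), []).append(name)
    return {name: index.get(flow_signature(callees), [])
            for name, callees in exe1.items()}


# Executable #1 has symbols; executable #2 is stripped.
exe1 = {"Myfunct": ["strcmp", "strcmp", "strcmp", "strlen", "strlen"]}
exe2 = {
    "sub_401000": ["strcmp", "strcmp", "strcmp", "strlen", "strlen"],
    "sub_402000": ["malloc", "free"],
}

candidates = match_functions(exe1, exe2)
```

Here candidates maps Myfunct to sub_401000. An exact sequence match like this is brittle in exactly the way mentioned above: a conditional branch that adds or skips a call changes the signature, so a practical tool would need a fuzzier comparison.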
The real questions to answer are, "how accurate/detailed do you want the identification to be?" and "what kind of compute/storage resources will you require?" Once you have a disassembly with function start/end info, the problem falls into the world of "embarrassingly parallel" and could be run on a cluster.
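As a sketch of why this is "embarrassingly parallel": each function's signature depends only on that function's bytes, so the work can be handed out with no coordination between workers. The example below uses Python's standard concurrent.futures with a thread pool purely to stay self-contained; on real data you would likely swap in a process pool or cluster jobs. The SHA-1 hash is a stand-in for whatever signature scheme you pick, and the function names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def signature(item):
    """Compute a stand-in signature (SHA-1 of the bytes) for one (name, body) pair."""
    name, body = item
    return name, hashlib.sha1(body).hexdigest()


def signatures_parallel(functions, workers=4):
    """Compute all signatures independently across a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(signature, functions.items()))


# Made-up function bodies, as if sliced out using function start/end info.
functions = {
    "sub_401000": b"\x55\x8b\xec\x83\xec\x10\xc3",
    "sub_402000": b"\x33\xc0\xc3",
}

parallel_result = signatures_parallel(functions)
serial_result = dict(signature(item) for item in functions.items())
```

The parallel and serial results are identical, which is the point: the order in which functions are processed does not matter.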
JCR