Topic created on: October 17, 2005 02:25 CDT by pedram.
On the notion of freeware patch analysis tools, Dave Zimmer (dzzie) and I were tossing back and forth some heuristics that could be used for a freeware binary diffing tool he is putting together. I thought I'd share them on the forum for everyone to discuss and contribute to.
First, some important background on the seminal heuristics developed by Halvar Flake and implemented in his commercial product SABRE BinDiff. From what I recall reading in Halvar's whitepapers (and Halvar, please make any appropriate corrections), his approach can be broken down into function matching, node matching, and instruction matching, built on the following heuristics:
Match functions with similar node/edge/call (NEC) counts. The NEC "trifecta" comes, of course, from the control-flow graph. The call count is used as a heuristic for node matching as well.
Match functions that share the same non-trivial name (i.e., excluding auto-generated names like sub_XXXXX).
Match functions that make recursive calls with other functions that make recursive calls.
Match functions/nodes/instructions with identical string references.
Match functions/nodes with similar in/out degrees (incoming calls / outgoing calls).
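As a sketch of how the NEC "trifecta" above might be computed, assuming a hypothetical dictionary layout for a disassembled function (a real disassembler's API would differ):

```python
def nec_signature(blocks):
    """Compute the (node, edge, call) counts from a control-flow graph.
    `blocks` maps a basic-block address to (successor_list, call_count);
    this layout is a hypothetical stand-in for real disassembler output."""
    nodes = len(blocks)
    edges = sum(len(succs) for succs, _ in blocks.values())
    calls = sum(call_count for _, call_count in blocks.values())
    return (nodes, edges, calls)

# Toy two-block function: the entry block branches to an exit block and
# makes one call.
func = {0x1000: ([0x1010], 1), 0x1010: ([], 0)}
print(nec_signature(func))  # (2, 1, 1)
```

Two functions whose signatures compare equal (or nearly so) become matching candidates for the finer-grained heuristics below.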
SPP (Small Prime Product)
This is a genius approach. Match functions/nodes with similar SPP values. The SPP is stored as a large numeric value and is determined as follows: assign a unique small prime number to each opcode; the SPP is then the product of the primes for every instruction in the function/node. This is a great heuristic because multiplication is commutative, making it resistant to instruction reordering.
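A minimal sketch of the SPP computation; the mnemonic-to-prime table here is an illustrative assumption, as a real implementation would assign a prime to every opcode in the instruction set:

```python
# Illustrative mnemonic-to-prime table; a real tool covers the full opcode set.
PRIME = {"mov": 2, "push": 3, "pop": 5, "call": 7, "xor": 11, "add": 13}

def spp(mnemonics):
    """Small Prime Product of a function/node: multiply one unique prime
    per instruction. Multiplication is commutative, so instruction
    reordering does not change the result."""
    product = 1
    for m in mnemonics:
        product *= PRIME[m]
    return product

# Two reorderings of the same instructions yield the same SPP:
assert spp(["push", "mov", "xor", "call"]) == spp(["mov", "push", "call", "xor"])
```

Because primes factor uniquely, equal products imply the same multiset of instructions, so a collision between genuinely different instruction mixes cannot occur.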
Shortest Path to Entry/Exit
Match nodes/instructions based upon the shortest path to the function entry/exit (in the case of nodes) and to the node entry/exit (in the case of instructions).
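The shortest-path distances can be computed with a plain breadth-first search; here is a sketch over a hypothetical successor-map representation of the control-flow graph:

```python
from collections import deque

def bfs_distances(graph, start):
    """Shortest distance (in edges) from `start` to every reachable node;
    `graph` maps each node to its list of successors."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, []):
            if succ not in dist:
                dist[succ] = dist[node] + 1
                queue.append(succ)
    return dist

# Toy CFG with entry A and exit D. Distance from the entry is one half of
# the signature; distance to the exit comes from running the same BFS on
# the reversed graph.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_distances(cfg, "A"))  # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

Each node's signature would then be the pair (distance from entry, distance to exit), and nodes in the two binaries with equal pairs become alignment candidates.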
Expanding on this list of heuristics, the following ideas were tossed around:
A straight MD5 of function instructions. The idea is that a straight MD5 may help in matching smaller functions, e.g., those with NEC = 1, 1, 1.
Take the instruction list, tokenize jumps (i.e., convert jump instructions into SIGNED, UNSIGNED, or ABSOLUTE), sort it alphanumerically, and take the MD5. This approach is roughly similar in spirit to SPP.
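Both MD5 variants could be sketched as follows; the jump-tokenization table is a hypothetical subset, as a full implementation would cover every conditional jump:

```python
import hashlib

# Hypothetical subset of the jump-tokenization table.
JUMP_TOKENS = {"jg": "SIGNED", "jl": "SIGNED", "ja": "UNSIGNED",
               "jb": "UNSIGNED", "jmp": "ABSOLUTE"}

def straight_md5(mnemonics):
    """Order-sensitive hash of the raw instruction list."""
    return hashlib.md5(" ".join(mnemonics).encode()).hexdigest()

def tokenized_sorted_md5(mnemonics):
    """Tokenize jumps, sort alphanumerically, then hash; like SPP, the
    sort makes the result insensitive to instruction reordering."""
    tokens = sorted(JUMP_TOKENS.get(m, m) for m in mnemonics)
    return hashlib.md5(" ".join(tokens).encode()).hexdigest()

a = ["cmp", "jg", "mov"]
b = ["cmp", "jl", "mov"]  # e.g., the compiler flipped the signed jump
assert straight_md5(a) != straight_md5(b)
assert tokenized_sorted_md5(a) == tokenized_sorted_md5(b)
```

The straight hash only matches byte-for-byte identical instruction streams (useful for the tiny NEC = 1,1,1 functions), while the tokenized, sorted variant survives both jump inversion and reordering.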
Expanding on using the number of calls as a heuristic, the idea is to also count the number of PUSHes (arguments) preceding each CALL in an attempt to more accurately match functions/nodes.
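One way to sketch this, assuming cdecl/stdcall-style argument passing where PUSHes immediately precede the CALL (register-based conventions would need a different approach):

```python
def call_arg_counts(mnemonics):
    """Count the run of PUSHes immediately preceding each CALL as a rough
    proxy for its argument count (assumes stack-based calling conventions)."""
    counts, run = [], 0
    for m in mnemonics:
        if m == "push":
            run += 1
        elif m == "call":
            counts.append(run)
            run = 0
        else:
            run = 0  # any other instruction breaks the PUSH run (simplistic)
    return counts

print(call_arg_counts(["push", "push", "call", "mov", "push", "call"]))  # [2, 1]
```

The resulting per-call argument counts, compared in order, give a finer fingerprint than the bare call count alone.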
Match functions/nodes based on the number of API calls they make or, more specifically, on the actual API calls they make.
Match functions/nodes/instructions based on references to constant values. For example, if two functions each uniquely XOR, PUSH, AND, etc. a certain constant numeric value, then they should be matched/aligned.
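A sketch of constant-reference extraction: collect the immediate values a function touches, then pair functions across binaries whose constant sets agree. The (mnemonic, operand_list) tuple layout is a hypothetical stand-in for real disassembler output:

```python
def constant_refs(instructions):
    """Set of immediate (integer) operands referenced by a function.
    `instructions` is a hypothetical list of (mnemonic, operand_list) tuples."""
    return {op for _, operands in instructions for op in operands
            if isinstance(op, int)}

f1 = [("xor", ["eax", 0xDEADBEEF]), ("push", [0x40])]
f2 = [("xor", ["ebx", 0xDEADBEEF]), ("add", ["esp", 0x40])]
shared = constant_refs(f1) & constant_refs(f2)
print(sorted(shared))  # [64, 3735928559]
```

Distinctive magic values (like the 0xDEADBEEF above) that appear in only one function per binary make for especially reliable anchors.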
Stack Frame Size
Match functions with similar stack frame sizes, i.e., a similar amount of stack space reserved for local variables.
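On x86 the frame size can be approximated by scanning the prologue for the `sub esp, N` that reserves local-variable space; a sketch over textual disassembly (frame-pointer-omitted or unusual prologues would need more work):

```python
import re

def frame_size(disasm_lines):
    """Approximate local-variable stack space by finding the first
    'sub esp, N' in the prologue; returns 0 if none is found."""
    for line in disasm_lines:
        m = re.match(r"sub\s+esp,\s*(0x[0-9a-fA-F]+|\d+)", line.strip())
        if m:
            return int(m.group(1), 0)  # base 0 handles both hex and decimal
    return 0

print(frame_size(["push ebp", "mov ebp, esp", "sub esp, 0x20"]))  # 32
```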
Match functions "closer" to one another with more of a bias. Example: if in binary A we have "exact" matches from function indexes 20 and 25 to function indexes 30 and 35 in binary B, respectively (i.e., A.20 == B.30 and A.25 == B.35), and we are trying to find a match within B for A.26, then we should give more weight to candidates in B between B.30 and B.35.
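The proximity bias can be expressed as a score that rewards candidate pairs whose offset from the nearest confirmed match agrees on both sides; a sketch (the scoring shape is my assumption, not a published formula):

```python
def locality_score(anchors, a_index, b_index):
    """Score a candidate pairing (a_index, b_index) given confirmed
    (a, b) anchor matches: the closer the A-side and B-side offsets from
    the nearest anchor agree, the higher the score (max 1.0)."""
    if not anchors:
        return 0.0
    a0, b0 = min(anchors, key=lambda ab: abs(ab[0] - a_index))
    return 1.0 / (1.0 + abs((a_index - a0) - (b_index - b0)))

anchors = [(20, 30), (25, 35)]  # A.20 == B.30, A.25 == B.35
# For A.26, candidate B.36 (same offset from the nearest anchor, A.25)
# outscores a distant candidate like B.50:
assert locality_score(anchors, 26, 36) > locality_score(anchors, 26, 50)
```

A score like this would be blended with the other heuristics rather than used alone, since compilers do occasionally reorder functions within the binary.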
None of these heuristics have been tested enough to determine their value. Perhaps others have tried similar approaches and can share their results. Some questions:
- What are your thoughts regarding the order of these tests?
- What are your thoughts regarding the recursion of these tests? I.e., if we have a list of tests 1-5, should we rewind and restart at test 1 every time we make a new match?
- What other interesting heuristics can be used?