Justin Seitz (jms) <jms bughunter ca> |
Friday, November 2 2007 11:49.00 CDT |
So, I was having a glance at Greg Conti's book "Security Data Visualization" and got to wondering: is there a way we could apply some of those visualization techniques to better understand how to fuzz a file?
Well, I put together a Python script using PIL that essentially reads in a file and plots a red pixel for each printable byte, leaving the pixel white if the byte is non-printable. Rudimentary, but it works. Here are some of the results:

(This is a loaded DLL file; you can clearly see the structure at the beginning of the DLL)

(This is an MS Word document; you can clearly see where the text sits, as well as the top and bottom portions, which are the best spots to fuzz)

(This is a PDF document; because many of the PDF directives are plaintext, you can see them clearly at the top)

(This is a small MS Excel document; again, it's quite clear where the control structures are located)
And here are two captures of a normal exe and the same exe that was UPX packed. Can you tell which is which?

I have been walking down this path in some of my recent research and have, unfortunately, shelved it for other pressing matters. But what I have found clearly mirrors the results here.
There are a few other applicable areas of use for it too that I will have to get into in a future version.
(Incidentally, I am assuming that of the last two images, the one on the left is the original and the one on the right is UPX'd? Given UPX's nature, I am assuming the less dense image reflects its tendency to compress and reorganize data a bit.) |
|
Yep, you are correct, the one on the right is UPX'd. If you have some interesting research you would like to share, drop me an email; I am always interested in hearing what everyone is up to. |
Nice. The output looks similar to Nwdiff, which you may already be aware of.
http://computer.forensikblog.de/en/2006/02/compare_binary_files_with_nwdiff.html
Nwdiff also shows a hex view but I've found it a bit difficult to navigate when selecting pixel points. Perhaps you can improve on that feature.
Kayaker |
Great, thanks for the link. Here is the code that does the magic in Python. It's pretty simple really, though of course it isn't as polished as it could be:
[code]
import Image
import ImageDraw
import sys

filename = sys.argv[1]
threshold = 10000   # stop after this many bytes (0 = plot the whole file)
width = 512

# Define a blank white canvas to plot onto
im = Image.new("RGB", (width, 2056), (255, 255, 255))
draw = ImageDraw.Draw(im)

# Now open the target file and read in its bytes
file_contents = open(filename, "rb").read()

x_pos = 0
y_pos = 0
counter = 0

for byte in file_contents:
    # Plot a red pixel for a printable byte; leave it white otherwise
    if 32 <= ord(byte) <= 126:
        print "Plotting Byte: %d" % counter
        draw.point((x_pos, y_pos), fill=(255, 0, 0))

    # Advance to the next pixel, wrapping at the edge of the canvas
    x_pos += 1
    if x_pos >= width:
        x_pos = 0
        y_pos += 1

    counter += 1
    if counter == threshold and threshold != 0:
        break

im.save(filename + ".png")
[/code]
Kayaker, drop me an email! I am always looking to chat it up with fellow Canadians! |
This reminds me, without a doubt, of the Amiga days :) Every decent monitor (read: disassembler) had similar functionality, with one more option: changing the actual width of the bitmap. That's trivial to add, though. Excellent :) This is something that shouldn't be overlooked in some specific tools, like IDA or Immunity Debugger; it really helps a lot to get a quick bird's-eye view of where everything is.
|
When Aaron Portnoy of TSRT was playing with this idea, we were discussing different ways of rendering his example binary data in order to possibly identify structures in code.
I was thinking that if you decide to render a dword utilizing
each byte as a CMYK value (which you then transform to RGB),
it'd produce an image that could be used to visually
determine data structures. Probably won't really be useful at all, but still fun enough to keep one occupied for 30m. |
Damn site. Lost my comment because I wasn't logged in.
Anyway, here's an idea that may help further improve your analysis: entropy analysis. There was some interest in this over the summer. The algorithm is fairly simple:
[code]
dsize;                 // size of original data buffer
wsize;                 // size of sliding window
inbuff[dsize];         // original data
outbuff[dsize-wsize];  // entropy calculations
counts[256];           // a running count of the number of each byte
                       // value currently visible in the sliding window

// Produce an initial count of each unique byte visible in the first
// window position and place the values in counts[]; all other
// indexes should be zeroed.

// Calculate entropy for the initial window position. Terms with
// counts[i] == 0 are skipped, since p*log(2,p) -> 0 as p -> 0.
for (i = 0; i < 256; i++)
    if (counts[i] > 0)
        outbuff[0] -= counts[i]/wsize * log(2, counts[i]/wsize);

// Slide the window and iteratively update the entropy calculation
// (the same zero-count guard applies to each term below).
for (i = 1; i < dsize-wsize; i++)
{
    // add back the old contributions of the outgoing and incoming bytes
    t1 = counts[inbuff[i-1]]/wsize;
    t2 = counts[inbuff[i+wsize-1]]/wsize;
    outbuff[i] = outbuff[i-1] + t1 * log(2, t1) + t2 * log(2, t2);

    // update counts[]: the outgoing byte leaves, the incoming one enters
    counts[inbuff[i-1]]--;
    counts[inbuff[i+wsize-1]]++;

    // subtract the new contributions
    t1 = counts[inbuff[i-1]]/wsize;
    t2 = counts[inbuff[i+wsize-1]]/wsize;
    outbuff[i] -= t1 * log(2, t1) + t2 * log(2, t2);
}
[/code]
As long as you're familiar with Shannon entropy, this algorithm should be fairly obvious. The only catch might be the probability function, p(x) = count(x)/wsize. I've seen other implementations that used count(x)/256, but unless wsize is exactly 256, that result will be incorrect. I tend to use a wsize of ~64-128.
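(Editor's note: the sliding-window update above can be expressed as a short, runnable Python function. It is written in modern Python for clarity, unlike the Python 2/PIL script earlier in the thread; the function name is mine.)

```python
import math

def sliding_entropy(data, wsize=64):
    """Shannon entropy (in bits) of every wsize-byte window over data,
    updated incrementally instead of recomputed from scratch."""
    counts = [0] * 256
    for b in data[:wsize]:
        counts[b] += 1
    # Entropy of the first window; zero counts contribute nothing
    h = -sum(c / wsize * math.log2(c / wsize) for c in counts if c)
    out = [h]
    for i in range(1, len(data) - wsize + 1):
        out_b, in_b = data[i - 1], data[i + wsize - 1]
        # Add back the old contributions of the outgoing/incoming bytes
        for b in (out_b, in_b):
            c = counts[b]
            if c:
                h += c / wsize * math.log2(c / wsize)
        # Update counts[]: drop the outgoing byte, add the incoming one
        counts[out_b] -= 1
        counts[in_b] += 1
        # Subtract the new contributions
        for b in (out_b, in_b):
            c = counts[b]
            if c:
                h -= c / wsize * math.log2(c / wsize)
        out.append(h)
    return out
```

A constant buffer yields entropy 0 throughout, while an alternating two-byte pattern yields 1 bit per window, which is a quick sanity check for any implementation.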
The way I envision this working with what you've already done: a composite image where entropy is added as a different color.
One hangup, though: the entropy output is wsize bytes shorter than the original data and is centered around the midpoint of the file. I have a few ideas for how to deal with this:
:- Only 'graph' the entropy starting floor(wsize/2) bytes into the file and ending ceiling(wsize/2) bytes before its end, but graph the original data as before
:- Extend the algorithm to allow a variable window size, whose maximum is whatever you'd have used in the current algorithm but which is allowed to vary between 1 and wsize. In the beginning it would be small, then grow until it hits the max; nearing the end of the file, it would begin shrinking again. On the one hand you get data that matches up perfectly with your original data; on the other hand, the data is less meaningful at the edges.
-Brian |
One more thing I forgot to mention. In my implementation I'm using an index of pre-calculated p(x)*log(2,p(x)) values, for x from 0...wsize, and just indexing into that list with a line similar to:
somevalue = eindex[counts[buff[i]]];
This significantly improves the speed of the algorithm when running it on a large dataset.
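(Editor's note: such a lookup table might be built as follows; `eindex` matches the name used above, and the rest is a sketch under the assumption that index c holds the term for a byte seen c times in the window.)

```python
import math

wsize = 64
# eindex[c] holds p*log(2,p) for a byte value seen c times in the
# window, with p = c/wsize and the 0*log(0) case defined as 0
eindex = [0.0] + [(c / wsize) * math.log2(c / wsize)
                  for c in range(1, wsize + 1)]
```

The table has wsize+1 entries and only needs rebuilding when wsize changes, so the per-byte cost in the inner loop drops to an array lookup.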
Furthermore, you can take this a step further and start analyzing the entropy data itself: calculate a mean and standard deviation for the entire file, then calculate a mean/SD with another sliding window (a two-pass algorithm) to give you an idea of how the entropy varies throughout the file.
The update for mean/variance isn't awful, but I don't have any code I can demonstrate at the moment. Suffice it to say, if this interests you, look at the Wikipedia article on calculating mean/variance.
From the definition of mean/variance it may not be clear, but with some algebraic manipulation it isn't hard to get them into a form that, while not numerically stable (i.e., your results won't be quite as accurate as if you'd done it the hard way), can be updated much faster from previously calculated values rather than re-calculated from scratch every time the window slides.
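(Editor's note: one common form of that incremental update keeps a running sum and sum of squares; as noted above, this trades some numerical stability for speed. The function name is mine.)

```python
def sliding_mean_var(xs, w):
    """Mean and population variance of every w-sample window over xs,
    updated from a running sum and sum of squares instead of being
    recomputed per window."""
    s = sum(xs[:w])
    sq = sum(x * x for x in xs[:w])
    out = [(s / w, sq / w - (s / w) ** 2)]
    for i in range(1, len(xs) - w + 1):
        out_x, in_x = xs[i - 1], xs[i + w - 1]
        # Slide the window: swap the outgoing sample for the incoming one
        s += in_x - out_x
        sq += in_x * in_x - out_x * out_x
        out.append((s / w, sq / w - (s / w) ** 2))
    return out
```

The variance here comes from E[x^2] - E[x]^2, which is exactly the algebraic rearrangement mentioned above: fast to update, but prone to cancellation error when the mean is large relative to the variance.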
-Brian |
|