

Created: Friday, November 2 2007 11:49.00 CDT Modified: Friday, November 2 2007 11:57.28 CDT
Visual Patterns for File Format Fuzzing
Author: jms

So, I was having a glance at Greg Conti's book "Security Data Visualization" and got to wondering: is there a way we could apply some of the visualization techniques to better understand how to fuzz a file?

Well, I put together a Python script using PIL that essentially reads in a file and plots a red pixel for each printable byte, leaving non-printable bytes white. Rudimentary, but it works. Here are some of the results:


(This is a loaded DLL file; you can clearly see the header structure at the beginning of the DLL)



(This is an MS Word document; you can clearly see where the text sits, as well as the top and bottom portions, which are the best spots to fuzz)



(This is a PDF document; because many of the PDF directives are plaintext, you can see them clearly at the top of this document)



(This is a small MS Excel document; again, it's quite clear where the control structures are located)



And here are two captures, one of a normal exe and one of the same exe after UPX packing. Can you tell which is which?



Blog Comments
aeppert Posted: Friday, November 2 2007 13:07.01 CDT
I have been walking down this path in some of my recent research but have, unfortunately, shelved it as I have other pressing matters.  What I have found, though, obviously mirrors the results here.

There are a few other applicable areas of use for it too that I will have to get into in a future version.

(Incidentally, I am assuming that of the last two images, the one on the left is the original and the one on the right is UPXed?  Given UPX's nature, I am assuming the less dense image reflects its natural tendency to compress and reorganize data a bit.)

jms Posted: Friday, November 2 2007 13:16.10 CDT
Yep, you are correct: the one on the right is UPX'd. If you have some interesting research you would like to share, drop me an email; I am always interested in hearing what everyone is up to.

Kayaker Posted: Friday, November 2 2007 16:35.12 CDT
Nice. The output looks similar to Nwdiff, which you may already be aware of.

http://computer.forensikblog.de/en/2006/02/compare_binary_files_with_nwdiff.html

Nwdiff also shows a hex view, but I've found it a bit difficult to navigate when selecting pixel points.  Perhaps you can improve on that feature.

Kayaker

jms Posted: Friday, November 2 2007 16:39.12 CDT
Great, thanks for the link. Here is the code to do the magic in Python, pretty simple really, but of course isn't as polished as it could be:

[code]
import Image       # Python Imaging Library (PIL)
import ImageDraw

filename = "test.wmv"
threshold = 10000   # stop after this many bytes (0 = no limit)
width = 512

# Define a blank white canvas to draw on
im = Image.new("RGB", (width, 2056), (255, 255, 255))
draw = ImageDraw.Draw(im)

# Now open the test file and read in the bytes
file_contents = open(filename, "rb").read()

x_pos = 0
y_pos = 0
counter = 0

for byte in file_contents:

    # Wrap to the next row when we hit the right edge
    if x_pos >= width:
        x_pos = 0
        y_pos += 1

    # Plot printable bytes as red pixels, leave the rest white
    if ord(byte) >= 32 and ord(byte) <= 126:
        draw.point((x_pos, y_pos), fill=(255, 0, 0))

    if counter == threshold and threshold != 0:
        break

    counter += 1
    x_pos += 1

im.save(filename + ".png")
[/code]

Kayaker, drop me an email! I am always looking to chat it up with fellow Canadians!

VLaaD Posted: Friday, November 2 2007 17:46.38 CDT
This reminds me, without a doubt, of the Amiga days :) Every decent monitor (read: disassembler) had similar functionality, with one extra option: changing the actual width of the bitmap. That's trivial to add, though. Excellent :) This is something that shouldn't be overlooked in specific tools like IDA or Immunity Debugger; a bird's-eye view really helps you see quickly where everything is.

sohlow Posted: Friday, November 9 2007 16:35.14 CST
When Aaron Portnoy of TSRT was playing with this idea, we were discussing different ways of rendering his example binary data in order to possibly identify structures in code.

I was thinkin' that if you decide to render a dword utilizing each byte as a CMYK value (which you then transform to RGB), it'd produce an image that could be used to visually determine data structures. It probably won't be useful at all, but it's still fun enough to keep one occupied for 30m.
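A quick sketch of that idea, assuming the standard CMYK-to-RGB conversion; the `dword_to_rgb` helper name and the little-endian byte order are my assumptions, not sohlow's:

```python
import struct

def dword_to_rgb(dword_bytes):
    """Treat the four bytes of a dword as CMYK channels (0-255)
    and convert them to an RGB tuple with the usual formula."""
    c, m, y, k = [b / 255.0 for b in bytearray(dword_bytes)]
    r = int(255 * (1 - c) * (1 - k))
    g = int(255 * (1 - m) * (1 - k))
    b = int(255 * (1 - y) * (1 - k))
    return (r, g, b)

# Example: the dword 0x00FF0000, packed little-endian, lands in the
# Y channel and renders as yellow
print(dword_to_rgb(struct.pack("<I", 0x00FF0000)))  # -> (255, 255, 0)
```

Each dword of the file would then become one pixel, so repeating structure sizes should show up as vertical color bands.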

phantal Posted: Tuesday, November 13 2007 12:42.55 CST
  Damn site.  Lost my comment because I wasn't logged in.

  Anyway, here's an idea that may help further improve your analysis: entropy analysis.  There was some interest in this over the summer.  The algorithm is fairly simple:


dsize; // size of original data buffer
wsize; // size of sliding window
inbuff[dsize]; // original data
outbuff[dsize-wsize]; // entropy calculations
counts[256]; // a running count of the number of each type
             // of byte currently visible in the sliding window

// produce an initial count of each unique byte currently
// visible in the sliding window, place values in counts[],
// all other indexes should be zeroed.

// calculate entropy for initial window position
// (skip terms where counts[i] == 0; p*log(p) -> 0 as p -> 0)
for (i = 0; i < 256; i++)
  outbuff[0] -= counts[i]/wsize * log(2, counts[i]/wsize);

// slide window and iteratively update entropy calculation
for (i = 1; i < dsize-wsize; i++)
{
  t1 = counts[inbuff[i-1]]/wsize;
  t2 = counts[inbuff[i+wsize-1]]/wsize;
  outbuff[i] = outbuff[i-1] + t1 * log(2, t1) + t2 * log(2,t2);

  // update counts[]: one byte leaves the window, one enters
  counts[inbuff[i-1]]--;
  counts[inbuff[i+wsize-1]]++;

  t1 = counts[inbuff[i-1]]/wsize;
  t2 = counts[inbuff[i+wsize-1]]/wsize;
  outbuff[i] -= t1 * log(2, t1) + t2 * log(2,t2);
}


  As long as you're familiar with Shannon entropy, this algorithm should be fairly obvious.  The only catch might be the probability function, p(x) = count(x)/wsize.  I've seen other implementations that used count(x)/256, but unless wsize is exactly 256, that gives the wrong result.  I tend to use a wsize of ~64-128.
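For concreteness, here's a minimal Python sketch of the iterative update described above; the `sliding_entropy` name and the `term()` helper (which handles the p = 0 case) are my additions:

```python
from math import log

def sliding_entropy(data, wsize=64):
    """Shannon entropy (in bits) of each wsize-byte window,
    updated incrementally as the window slides one byte at a time."""
    data = bytearray(data)            # bytes -> ints
    counts = [0] * 256
    for b in data[:wsize]:
        counts[b] += 1

    def term(c):
        # contribution -p*log2(p) of a byte value seen c times
        if c == 0:
            return 0.0
        p = float(c) / wsize
        return -p * log(p, 2)

    h = sum(term(c) for c in counts)
    out = [h]
    for i in range(1, len(data) - wsize + 1):
        old, new = data[i - 1], data[i + wsize - 1]
        # remove the stale contributions of the two affected byte values
        h -= term(counts[old]) + term(counts[new])
        counts[old] -= 1        # byte leaving the window
        counts[new] += 1        # byte entering the window
        # add back their updated contributions
        h += term(counts[old]) + term(counts[new])
        out.append(h)
    return out
```

A window of identical bytes scores 0 bits, while a 64-byte window of 64 distinct values scores log2(64) = 6 bits, so packed or encrypted regions stand out as high plateaus.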

  The way I envision this working with what you've already done: a composite image where entropy is added as a different color.

  One hangup, though: the entropy calculation is wsize bytes smaller than the original data, and is centered around the midpoint of the file, but I have a few ideas for how to deal with this:

:- Only 'graph' the entropy starting floor(wsize/2) bytes into the file and ending ceiling(wsize/2) bytes before the end, but graph the original data as before
:- Extend this algorithm to allow a variable window size, whose max is whatever max you'd have used in the current algorithm, but which is allowed to slide between 1 and wsize.  In the beginning it would be small, then grow until it hits the max; when you near the end of the file, the window size would begin shrinking again.  On the one hand you get data that matches up perfectly with your original data; on the other hand the data is less meaningful at the edges.

-Brian

phantal Posted: Tuesday, November 13 2007 12:54.40 CST
  One more thing I forgot to mention.  In my implementation I'm using an index of pre-calculated p(x)*log(2,p(x)) values, x from 0...wsize, and just indexing into that list with a line similar to:

somevalue = eindex[counts[buff[i]]];

  This significantly improves speed of the algorithm when running it on a large dataset.
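That lookup table might look like this in Python; `build_eindex` is a hypothetical name, and the table stores the -p*log2(p) entropy term for every possible in-window count:

```python
from math import log

def build_eindex(wsize):
    """Precompute the entropy term -p*log2(p), with p = c/wsize,
    for every possible in-window count c = 0..wsize, so the inner
    loop of the entropy update becomes a single table lookup."""
    table = [0.0] * (wsize + 1)
    for c in range(1, wsize + 1):      # c = 0 contributes nothing
        p = float(c) / wsize
        table[c] = -p * log(p, 2)
    return table

eindex = build_eindex(64)
# per the comment above: somevalue = eindex[counts[buff[i]]]
```

Since counts can only take wsize + 1 values, the table is tiny and replaces two floating-point log calls per slide.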

  Furthermore, you can take this a step further and start doing analysis of the entropy data by calculating a mean & standard deviation for the entire file, then calculating a mean/SD with another sliding window (2 pass algorithm) to give you an idea how the entropy is varying throughout the file.

  The update for mean/variance isn't awful, but I don't have any code I can demonstrate at the moment.  Suffice it to say, if this interests you, look at the wikipedia article on calculating mean/variance.

  From the definition of mean/variance it may not be clear, but with some algebraic manipulation it isn't hard to get it into a form that, while not numerically stable (eg, your results won't be quite as accurate as if you'd done it the hard way), can be updated much faster from previously calculated values rather than re-calculated from scratch every time the window slides.

-Brian


