OpenRCE Article

Cody Brocious (daeken) <cody

brocious

gmail

com>

Tuesday, September 6 2005 17:30.58 CDT

Introduction

In this article I will cover the basics of reverse-engineering binary file formats, starting with the simple archive format used in the popular game EverQuest II known as VPK. This is the first of many articles I'll be writing on the subject as time goes on. You can find VPK files in any EverQuest II installation, and any will do for the purposes of this article. For the purposes of this article, I will assume the reader has knowledge of programming in C. The architecture being used is x86, and thus, everything is unsigned and little-endian. The tools you will need to follow along with this article are a decent hex editor (I recommend "HexEditor" for Mac OS X) and a copy of Python.

Diving In

For this article I targeted the file boat_06p_tutorial01.vpk. If you choose to target another file, please note that there may be differences between what you see in the article and what you come across in the file.

The main thing you must understand when working with file formats is the way data is typically stored. The most basic C data-type that you will find in a binary file is a char. A char is stored as a single byte, a short is stored as two bytes, a long or a float is stored as 4 bytes. Since we're dealing with little-endian files, the bytes will be stored in the reverse order to what they would normally be. For instance, given the number 9, if you wanted to store it as a long in little-endian, it would be represented as 4 bytes with the values 0x09, 0x00, 0x00, and 0x00, in that order. Aside from integers, you will often find arrays. Typically these are stored as an integer representing the length of the array and then that many elements. The elements of an array could be anything from a char to a string to a complex data structure, so it can be difficult to figure out where arrays are and how to read their elements.

To start reverse-engineering a file format, you will want to choose at least one sample file. It is often a good idea to pick a number of sample files to be able to look at the similarities and differences between them, but it is not absolutely necessary in most cases. Once you have picked your sample file, you'll need to examine it within a hex editor.

A good first step is to think about how you would store the data if you were designing the file format. In my example, the VPK archive format, my first thought was that the architects would probably store the data in a compressed format. I immediately drew this conclusion due to the fact that I could not see any plain-text data, telling me that the file could be compressed and/or even encrypted. In most cases, developers will not waste time writing their own compression algorithm. What I will usually do is come up with a list of compression methods that are commonly used, royalty-free (nobody likes to spend money when they don't have to) and easy to implement. Since I am using Python, I decided to try zlib first. zlib is a compression library used in many applications as it is easy to work with and fairly efficient.

Locating Compressed Data

Since we don't know where the compressed data is in the file, it was necessary to write code that would search through the file for anything that can be decompressed with zlib. I start by reading the file into a string. From there, I attempt to decompress data from each byte to the end of the file, which will tell us where the start of a decompressed block is. I wrote the following snippet of code to accomplish this task:

import zlib, sys

def main(args):
    f = file(args[0], 'rb').read()
    size = len(f)

    i = 0
    while i < size:
        try:
            de = zlib.decompress(f[i:])
        except zlib.error:
            i += 1
            continue

        print 'Found a compressed block starting at', i

        i += 1

if __name__=='__main__':
    sys.exit(main(sys.argv[1:]))

And the output:

    Found a compressed block starting at 4
    Found a compressed block starting at 1370
    Found a compressed block starting at 51781
    Found a compressed block starting at 51905
    Found a compressed block starting at 52283
    Found a compressed block starting at 52401

This code takes a filename to search as an argument and will return the location of any zlib-compressed blocks. Immediately when I run this I see many compressed blocks, so it is confirmed that they are using zlib. Now, just having the location of the compressed block doesn't tell you a whole lot. You also need to know the size of the data. To accomplish this I wrote code that upon discovery of a compressed block, will loop through all the data from that point on, looking for the first point where the decompressed data matches the length of the decompressed data from the beginning of the block to the end of the file. This is not fool-proof, but should work well enough. The code is as follows:

import zlib, sys

try:

    import psyco
    psyco.full()
except ImportError:
    pass


def main(args):
    f = file(args[0], 'rb').read()
    size = len(f)

    i = 0
    slen = 0
    while i < size:
        try:
            slen = len(zlib.decompress(f[i:]))
        except zlib.error:
            print i
            i += 1
            continue

        start = i
        while i < size:
            try:
                slen2 = len(zlib.decompress(f[start:i]))
            except zlib.error:
                i += 1
                continue

            if slen == slen2:
                break

            i += 1

        print 'Found:', start, i - start

if __name__=='__main__':
    sys.exit(main(sys.argv[1:]))

Output:

    Found: 4 1362
    Found: 1370 50407
    Found: 51781 120
    Found: 51905 374
    Found: 52283 114
    Found: 52401 145

When I run this code, it gives me some interesting results. It seems that the first compressed block starts 4 bytes into the file and that there are 4 bytes between each compressed block. Looking at the 4 bytes before each block, it seems to be a long integer representing the length of the compressed block. We now have enough information to decompress the data and make use of it. In the following code I make use of the struct module of Python to decode the long.

import zlib, struct, sys

try:
    import psyco
    psyco.full()
except ImportError:
    pass

def main(args):
    fp = file(args[0], 'rb')

    i = 1

    while True:
        data = fp.read(4)
        if len(data) != 4:
            break

        length = struct.unpack('<L', data)[0]
        file('block_%i.dat' % i, 'wb').write(zlib.decompress(fp.read(length)))
        i += 1

if __name__=='__main__':
    sys.exit(main(sys.argv[1:]))

This code will write all of the compressed blocks to files in the current directory called block_#.dat.

Looking at the first block, it appears to start off with a small header, followed by the filename for the block, followed by the data of the file. This matches all of the blocks until the last. It seems to be the filename list used to pull out arbitrary blocks. This seems like a good place to work on next.

Just looking at it with my hex editor, I see 20 bytes, then the first filename, then 12 bytes followed by the second filename, followed by 12 bytes, etc until the end of the block. In the case of the filename block I'm working with, the entire thing is 487 bytes and there are 5 filenames. Since the data before the first filename is 20 bytes and the data before the rest is only 12, this says there's probably an 8-byte header on the actual filename block.

Looking at the first 8 bytes, I see what seems to be 2 long integers. Selecting the first, my hex editor tells me it has a value of 487, which is the total size of the block. Selecting the second, my hex editor tells me it has a value of 5, which is the number of filenames in the block. Fantastic.

Now looking at the first filename and 12-byte header, I notice the filename is 85 bytes long and since it has no null-termination, the length should be stored in the header somewhere. Checking the value of the 4 bytes immediately before the filename gives me 85, so we have our length. But what are the other 8 bytes? The first 4 bytes seem to be a long integer with a value of 0, and the second 4 bytes seem to be a long integer with the value 1362 (0x0552 hex). This is not immediately obvious as to what it does, so I look at the next file's header. Yet again the filename length is the last 4 bytes of the header. Looking again at the 2 unknown longs, in this case the first is 1366 and the second is 50407. If you look at the difference between the second long in the first filename header and the first long in the second filename header, they differ by 4 bytes. Earlier we found that the header on compressed blocks is 4 bytes, and seeing that the first filename header's first long is 0, it seems likely that these integers represent the starting position and length of the compressed blocks. Looking back at the output of the earlier script, it's confirmed. We now know the filename block format.

Now we have to look at the previous blocks. Looking at the first one, I see an 8-byte header followed by the filename. In this case, the filename is 85 bytes long and the file is 2276 bytes long. Looking at the header, I see two long integers. One with a value of 86 and one with a value of 2182. The first looks suspiciously like the length of the filename, but it's 1 byte too long. This is because the filename is null-terminated. When you count the ending null, the filename is 86 bytes, which makes the first integer of the header the filename length. The second is relatively close to the length of the file, but what could be causing this difference? The header, including the filename and null terminator, is 94 bytes long. Subtracting this from the total file length gives 2182, the value of the second long integer.

Putting It All Together

The file consists of a number of blocks, which are each a long integer followed by a zlib-compressed block of data of the size the integer indicates.

Each block is its own file, with the exception of the last which is a filename block.

The filename block consists of an 8-byte header, made up of a long representing the length of the whole filename block and a long representing the number of files. After this header are the filenames, each with a 12-byte header consisting of a long that represents the beginning of the file's block, a long that represents the length of this block, and a long that represents the length of the filename, followed by the (non-null-terminated) filename itself.

The other blocks consist of an 8-byte header consisting of a long storing the length of the filename (null-terminated) and a long representing the length of the data in the block, followed by the filename and then the data.

So when I write my code, now that everything is figured out, I end up with this:

import zlib, struct, os, sys

try:
    import psyco
    psyco.full()
except ImportError:
    pass

def main(args):
    fp = file(args[0], 'rb')
    data = fp.read(4)

    while len(data) == 4:
        l = struct.unpack('<L', data)[0]
        data2 = fp.read(l)
        data = fp.read(4)

        if len(data) != 4:
            try:
                filenameblock = zlib.decompress(data2)
            except ImportError:
                filenameblock = data2

    fnblocksize, fncount = struct.unpack('<LL', filenameblock[:8])

    filenameblock = filenameblock[8:]

    try:
        os.mkdir(args[1])
    except OSError:
        pass

    for i in range(fncount):
        block_start, block_len, fnlen = struct.unpack('<LLL', filenameblock[:12])
        filename = filenameblock[12:12 + fnlen]

        paths = filename.split('/')[:-1]
        path = args[1]

        for dir in paths:
            path += '/' + dir

            try:
                os.mkdir(path)
            except OSError:
                pass

        print filename
        f = file(args[1] + '/' + filename, 'w')
        fp.seek(block_start + 4)
        d = fp.read(block_len)

        try:
            d = zlib.decompress(d)
        except zlib.error:
            pass

        f.write(d[12 + len(filename):])
        f.close()
        filenameblock = filenameblock[12 + fnlen:]

if __name__=='__main__':
    sys.exit(main(sys.argv[1:]))

Conclusion

And there we have it. To run this script, use `python <script.py> <filename.vpk> <directory to output to>`.

Hopefully you've found this article useful. I hope to cover more advanced file formats in the future, as well as network protocols.

If you have any questions or comments, please let me know.

Happy Hacking,

Cody Brocious.