Flag: Tornado! Hurricane!

The Molecular Virology of Lexotan32: Metamorphism Illustrated

Thursday, August 16 2007 16:58.00 CDT
Author: orr # Views: 42190 Printer Friendly ...

Introduction

This paper is a direct descendent of my previous one regarding the metamorphic engine of the W32.Evol virus. I advise you to take a look at it before reading this one, or at least be acquainted with the subject of metamorphism. The focus of this paper is the special engine of the Lexotan32 virus.

The virus was released in 29A#6 Virus Magazine in 2002, the Annus Mirabilis of metamorphic viruses. The virus was created by the prolific VX coder, Vecna, and was one of the last complex creations of this kind. I could further elaborate on the genealogy of this virus, but I think it is sufficient to say that this virus is a culmination of many of the techniques developed throughout the author's career.

To be honest, I never thought that this paper would be written, as the virus was released with its original source code, thus no real point in reverse engineering it. I've been glancing at it from time to time, but I never went too deep. Eventually, I decided that the only way to understand what's going on was to compile it – and that's when I first began to realize the beauty:


This is the header of the compiled binary, which may look pretty straightforward for many people, but those who are used to looking at binaries will know the difference. Usually, the section names and purposes (highlighted) are chosen by the compiler / assembler, and tend to be very generic, but this program was something else. At that moment I knew this is going to be interesting.

Concept

Perhaps one of the most difficult aspects in the design of a viral metamorphic engine is the issue of Re-Generation. The viral code processed by such engine is subject to obfuscation which includes insertion of random junk instructions as well as replacement of one instruction with an equivalent set of instructions. Unless monitored, the code will simply grow beyond proportion within a certain amount of generations (as in the W32.Evol virus). Several viruses solved this problem in creative ways; The TMC virus (DOS) used tables that stored the entire virus binary commands and their lengths, while the MetaPHOR virus used a de-permutation module and instruction shrinking module in order to achieve the goal of extracting “clean”, un-obfuscated virus copy.

Lexotan32 solves this problem in a very elegant way. A special table, Genotype (G), is attached to the metamorphed code (M). The engine will use the entries of this table in order to extract the clean virus copy (P).


The pseudo-algorithm is as follows:
            P  = BuildBodyFromGenotype(M, G)
            M' = MixBody(P)
            G' = BuildGenotypeFromBody(M')
Now, imagine the following metamorphed code:


Scary, eh? The above code is so obfuscated, that even a human reader will find it hard to understand; how will a machine be able to clean it?

Re-Generation

It is quite evident that the rebuilding mechanism is inspired from nature. The real life Genotype is the genetic information encoded in our genes. The Genotype structure determines specific visual characteristics of a resulting Phenotype, which is, in our case, the metamorphed body. One Genotype may result in several different looking Phenotypes, just as in the virus.

The Genotype, in the viral sense, is a list consisting of signed WORD values. Each member of the list is a relative pointer to the next instruction in the buffer.



The first item on the list (0x3D) is in fact the pointer to the first instruction; by adding the second item to the first (0x3D+0xFFC6 = 0x3), we get the pointer to the second instruction - and so on. Additionally, the Genotype may contain two possible flag values:
  • -2 (0xFFFE) - FREE4GARBLE directive.
  • -1 (0xFFFF) - End of Genotype list.
The FREE4GARBLE directive, along with the next list item, notifies the engine on which registers are currently used, and will be translated into a XOR ESP, VALUE in the output buffer. The engine will later use this information in order to add intelligent garbage instructions.

Mutation

After getting to the correct location, the engine will get the length of the instruction and store it in the output buffer. During this phase, the engine will perform instruction transformations based on a random factor.

The engine is able to perform the following:

Original
Transformed
ADD   reg, imm
SUB   reg, -(imm)
MOV   reg, reg/imm
PUSH  reg/imm
POP reg
SUB   reg, reg
XOR   reg, reg
TEST  reg, reg
OR    reg, reg
LODSx
MOV   ACUM, [esi]
ADD esi, SIZE
STOSx
MOV   ACUM, [esi]
ADD edi, SIZE
CMD   reg, imm8
CMD   reg, imm32
DEC   reg
SUB   reg, 1
INC   reg
ADD   reg, 1
* CMD = ADD/ADC/SUB/SBB/XOR/CMP
* x = {B,W,D}, ACUM = {AL, AX, EAX}
Although some of the transformations above result in growth of code, all of them are reversible,meaning that the engine is able to randomly obfuscate (expand) or optimize (shrink) the code.

The above example illustrates both the extraction of the plain code and instruction transformations:

Original
Transformed
PUSH EBP
POP ESP
MOV  EBP, ESP
SUB  EAX, 0FFFFEDCCh
ADD  EAX, 1234h


Permutation

After generating a clean copy from the Genotype, the engine will permutate the code in a rather unusual way. On the first phase it will create a pool of NOPs (0x90), and will copy one or more instructions from the plain code into random locations inside that pool. In order to retain the logical code flow, all the instruction blocks will be linked with jumps. During this phase, the engine will add garbage instructions based on the FREE4GARBLE directive.



On the second phase, all the byte sequences will be collected into one buffer by removing the NOPs in between them.



At this point, while running the code in a debugger, I wasn't looking at opcodes - I was staring at small genes floating inside a Primeval Soup, perfectly reordering themselves in a chaotic manner, forming life.

Garbage

Another interesting thing is t he issue of garbage generation. The main idea is blending legitimate instructions with purely random instructions, without altering the code-flow. Many poly-viruses used commands that were not likely to change the code flow (for example CLC/CMC/NOP, or ADD EAX,0 and the likes) but those are now very easily detected by heuristics. Another method was nesting garbage instructions inside PUSH-POP combinations, allowing the code-flow to remain unaltered, but that resulted in detectors that scanned the stack for magic values.

The garbage engine of Lexotan is able to generate instructions that are very common in legitimate code. As witnessed, after the initial regeneration process, the engine will extract XOR ESP, VALUE instructions based on the FREE4GARBLE directive. The value is a flag notifying the engine about possible use of registers and flags.

For example:
     
XOR    ESP,     100001b (21h)
Is the equivalent of:
     
FREE4GARBLE    _EAX+_EBP

The engine will replace these instructions with random instructions that don't use the specified registers, allowing garbage to be run without altering the original code flow. The XOR ESP instructions throughout the metamorphed code will eventually be replaced with garbage LEA instructions.

Among the instructions the engine may generate are:
     - MOV  reg1, reg2
     - MOV  reg, imm
     - LEA  reg, [r+disp]
     - CMD  reg, imm           ;ADD/ADC/SUB/SBB/XOR/CMP
     - INC  reg 
     - DEC  reg
The registers and immediate values can be of any size (8/16/32 bit).

Eventually, the resulting code will look this:


Relocations

After all the transformations were applied, most (if not all) locations of the original instructions are changed, forcing the engine to fix all the branch instructions (JMP/CALL) that link parts of the code together.

The engine copies instructions from an input buffer to an output buffer, and during the transformation process it keeps track of Old vs. New locations in the form of a two- DWORD table (eip_table). Another table (jmp_table), keeping information about branch instructions is constructed. The first member will hold the Branch Destination in the Old Buffer while the second member will hold the Relative Location in New Buffer. Both values are DWORDS. Additionally, there's another BYTE that indicates the size (either 1 which means a short jmp, or 0 that means near jmp).

In order to simplify, let's try fixing a simple case:

As you can see, the code became smaller by 3 bytes. Notice that the JNE in the output buffer no longer leads to the RET instruction. For the above example, the following tables will be constructed:

So, as you can probably deduct, the new relative value to be patched is 01 (instead of 04), which will fix the jmp instruction correctly.

New Genotype

After the initial Genotype is extracted (in the regeneration process), the engine saves relative locations of instructions in a table called userlist in the same manner and structure of the Genotype. The userlist functions as an interim Genotype and its values are fixed during the phases of permutation using the same lookup method seen above, as offsets change very dynamically. After all the transformations are over, the engine will use the userlist in order to generate a new Genotype, for use with the new metamorphed code. All the XOR ESP instructions will be converted to the format of the Genotype's FREE4GARBLE directive.

In the next generation, the new Genotype will be used to clean the body from all the garbage and permutation, and will undergo instruction mutations again, thus resulting in an ever-changing base code.

Appendix A: Code as Art

I wouldn't be able to go about finishing this paper without a little of what my friends and family call ‘digging'. These are some of the thoughts that floated through my mind while studying the virus code. As this is not the product of a deep philosophical investigation, the definitions I provide are probably not complete and should be taken into account according to their context.
  • A Language is made of specific Words which are made of specific alphabet. As such, many compositions can be made in order to form a defined word, and eventually a Sentence.
  • A Computer Language generally consists of Instructions, composed of a defined set of commands, whose purpose is to define an Action.
When speaking about definitions and operations, Natural Languages and Computer Languages share a similarity: both can be formalized. The simplest example can be found in Newton's Second Law:

Language
Implementation
English The Force on a particle is proportional
to the Time rate of change in the product
of its Mass and Velocity.
Assembly
mov eax, mass
mul eax, velocity
div eax, time
mov force, eax
C
Force = mass * velocity / time;

F = m * v / t;
Math


In the above example we go from the lowest to the highest levels of abstraction. Now, we should carefully examine this table: The English sentence provides instructions on how to calculate the force of a moving body using verbs and nouns while the Assembly code uses word-like commands and variables. Both sentences have one thing in common that the others don't: The English sentence uses Conjunctions (and other descriptive words) and the Assembly code uses Registers. In the C and Math representations, however, we can see that although meaningful names can be used, the commands (first given in English form) have now become operators, and the Modus Operandi was now replaced with a formula.

In the world of Software, the purpose of High-Level languages is to formalize blocks of instructions into more abstract and unavoidably generic constructs. Due to its low-level nature, the Assembly language (like English) provides the coder with many more nuances to achieve the desired action, semantically, while the C language provides a more abstract and flexible form. This is, of course, without insulting the fantastic C language. When coding High-Level statements such as IF, FOR and WHILE you can sometimes feel like you are speaking directly to the compiler, but in the realm of bit-manipulation and optimization, the Assembly coder is the artist.

Now, as witnessed, it is possible to turn the English sentence into a set of commands. You can also transform that sentence into a poem, and skilled writers can probably create a hybrid of the two forms. What defines one variation as more artistic than the other?

The answer lies in the manifestation of the concept.

Code is not about telling a story, it's about instructing the computer on how to perform certain actions. It's not what you are doing; it's HOW you do it.

The source code of Lexotan32 itself is a very customized Assembly source. It is filled with macros and custom-directives and was written in a way that forces a custom pre-processor to be run prior to compilation.

At first it might alienate the reader, but as you go deeper you can come across very amusing macros such as ‘coin' whose purpose is to return a random number between 0 and 1. The author could have done that in different ways, but instead he chose to flip a virtual coin in order to determine some of the next actions.

Another example can be seen in the disassembler module, where the complexity and diversity of the Intel instruction set is broken down into a simple yet dominant macro.

The author of Lexotan32 was dealing with problems already encountered before, yet implemented them all in very original ways– Regeneration (Genotype), Permutation (NOP Pool) and Garbage (FREE4GARBLE), all wrapped up in a custom PE file – all of those are examples of colorful coding.

The source code was Assembly poetry, the concept was inspired and carefully crafted, the execution was filled with imagination and vision and the binary output was simply fantastic, every time anew.

Code can be Art.

Q.E.D.

Appendix B: The Genealogy of Lexotan32

Since we are dealing with art, we must also take a look at the ideas that preceded and contributed to the final creation. It is also quite interesting to note the development of certain concepts as well as their sometimes embryonic implementation. Although this subject diverges from the scope of this paper, I decided to include a very brief history of the development of this virus through other viruses created by Vecna.

1. Miss Lexotan 6mg
The first Lexotan virus was actually a DOS metamorphic virus written in 1998 which was followed by several variants. The metamorphism model that was first introduced is similar to the one that can be found in the PLY virus. Every instruction was to be padded with NOPs, so that all instructions could have a fixed length of 5 bytes.

90                nop
90                nop
90                nop
90                nop
1E                push    ds

90                nop
5B                pop     bx
90                nop
90                nop
90                nop

B8 F0 FF          mov     ax, 0FFF0h
90                nop
90                nop
This way, the engine doesn't have to worry for relocations. The main virus body would be encrypted (early Genotype), and the engine would unpack it using a small built-in length-disassembler, and would again randomly pad the instructions with NOPs on the next generation. In the next major variant (0.3), NOP padding was replaced by the unique garbage engine:

BA 06 FB          mov     dx, 0FB06h
B9 75 02          mov     cx, 275h
1E                push    ds
5B                pop     bx
BE 4A A7          mov     si, 0A74Ah
BF 05 95          mov     di, 9505h
B8 F0 FF          mov     ax, 0FFF0h
In version 0.6 and forth, the metamorphic engine was able to support instruction mutations, similar to the ones found on the 32bit version.

2. RegSwap
The same year, Vecna released one of the first metamorphic viruses for the Windows 95 platform, RegSwap. In this virus, all the instructions remained the same, but with different register usage on every generation. In order to achieve that, Vecna introduced the idea of using tables with relative distances and information regarding register indexes. The table would travel with the virus, the same way the modern Genotype was used.

3. Ramones
In 2001, Vecna released the W95.Ramones worm, which utilized an early version of the same permutation and garbling engine seen in Lexotan32 (the MixBody module). This engine (that was also published in XINE#5) was almost identical to its final version found in Lexotan32.

A visual diagram describing the evolution of Lexotan:


Ending

As I said in the beginning, the source-code of this virus is in fact available for everyone to examine, and the fact is that it holds many other interesting features such as the infection mechanism, the mixing of variables, compression, encryption and many other unique implementations. Unfortunately, those subjects had to be left out of this paper, but I advise the curios reader to examine them more closely.

What I did was just compiling and running the code. The rest was just like watching a movie through Hex dumps.

Vecna - Thanks.


Orr
www.antilife.org/

Article Comments Write Comment / View Complete Comments

    Username Comment Excerpt Date
redbone good information .... Tuesday, July 12 2011 02:56.10 CDT
tgnice cool :) Saturday, June 18 2011 03:18.50 CDT
  lazyworm very nice!I need it. Wednesday, June 30 2010 20:25.49 CDT
m4dnut it's so cool~! thnaks for your effots. i alway... Wednesday, July 9 2008 20:32.17 CDT
c0ck3dpist0l it's cool! Thanks for sharing! Tuesday, April 29 2008 07:54.10 CDT
adityaks nicely driven , very well Sunday, September 23 2007 23:40.22 CDT
  vecna Congratulations for the article - its exact Thursday, August 23 2007 20:03.45 CDT
baibhav Gud work ! Thanks for sharing ! Friday, August 17 2007 10:48.22 CDT
MohammadHosein alot of details ...great work . Friday, August 17 2007 10:47.15 CDT
  mballano Nice article ;-) Friday, August 17 2007 01:54.18 CDT

There are 30,780 total registered users.


Recently Created Topics
How to decompile a f...
Sep/16
How to trap mouse cl...
Sep/03
Intel pin in loaded ...
Jun/27
Going to do today wi...
Jun/27
how to create delphi...
Jun/27
enabling menu in a s...
Jun/18
How to get the Image...
Jun/17
OllyDBG Process Term...
Apr/28
Reversing opcode
Apr/24
Question about debbu...
Apr/16


Recent Forum Posts
Intel pin in loaded ...
djnemo
OOP_RE tool available?
Bl4ckm4n
OOP_RE tool available?
van7hu
Should binaries be n...
Kolisar
Problem with ollydbg
nullx42
!findtrampoline Immu...
skycrack
looking for a softwa...
raxen
Documenting reversed...
raxen
.orpc section what's...
mbin
Pydbg load() issue
phreak


Recent Blog Entries
oleavr
Aug/27
CryptoShark: code tracer ba...

oleavr
Jun/25
Build a debugger in 5 minutes

oleavr
Apr/17
frida.re 1.2.0 is out, with...

gareebnavas
Jan/21
Android Malware Analysis

oleavr
Dec/21
frida.github.io: scriptable...

More ...


Recent Blog Comments
pedram on:
Dec/21
frida.github.io: scriptable...

capadleman on:
Jun/19
Using NtCreateThreadEx for ...

newlulu on:
Jun/10
Branch tracing and LBR acce...

newlulu on:
Jun/10
Advanced debugging techniques

newlulu on:
Jun/10
2 anti-trace mechanisms spe...

More ...


Imagery
SoySauce Blueprint
Jun 6, 2008

[+] expand

View Gallery (11) / Submit