|
The Molecular Virology of Lexotan32: Metamorphism Illustrated
Introduction This paper is a direct descendent of my previous one regarding the metamorphic engine of the W32.Evol virus. I advise you to take a look at it before reading this one, or at least be acquainted with the subject of metamorphism. The focus of this paper is the special engine of the Lexotan32 virus. The virus was released in 29A#6 Virus Magazine in 2002, the Annus Mirabilis of metamorphic viruses. The virus was created by the prolific VX coder, Vecna, and was one of the last complex creations of this kind. I could further elaborate on the genealogy of this virus, but I think it is sufficient to say that this virus is a culmination of many of the techniques developed throughout the author's career. To be honest, I never thought that this paper would be written, as the virus was released with its original source code, thus no real point in reverse engineering it. I've been glancing at it from time to time, but I never went too deep. Eventually, I decided that the only way to understand what's going on was to compile it � and that's when I first began to realize the beauty: ![]() This is the header of the compiled binary, which may look pretty straightforward for many people, but those who are used to looking at binaries will know the difference. Usually, the section names and purposes (highlighted) are chosen by the compiler / assembler, and tend to be very generic, but this program was something else. At that moment I knew this is going to be interesting. Concept Perhaps one of the most difficult aspects in the design of a viral metamorphic engine is the issue of Re-Generation. The viral code processed by such engine is subject to obfuscation which includes insertion of random junk instructions as well as replacement of one instruction with an equivalent set of instructions. Unless monitored, the code will simply grow beyond proportion within a certain amount of generations (as in the W32.Evol virus). Several viruses solved this problem in creative ways; The TMC virus (DOS) used tables that stored the entire virus binary commands and their lengths, while the MetaPHOR virus used a de-permutation module and instruction shrinking module in order to achieve the goal of extracting �clean�, un-obfuscated virus copy. Lexotan32 solves this problem in a very elegant way. A special table, Genotype (G), is attached to the metamorphed code (M). The engine will use the entries of this table in order to extract the clean virus copy (P). ![]() The pseudo-algorithm is as follows:
P = BuildBodyFromGenotype(M, G)
M' = MixBody(P)
G' = BuildGenotypeFromBody(M')
Now, imagine the following metamorphed code:
![]() Scary, eh? The above code is so obfuscated, that even a human reader will find it hard to understand; how will a machine be able to clean it? Re-Generation It is quite evident that the rebuilding mechanism is inspired from nature. The real life Genotype is the genetic information encoded in our genes. The Genotype structure determines specific visual characteristics of a resulting Phenotype, which is, in our case, the metamorphed body. One Genotype may result in several different looking Phenotypes, just as in the virus. The Genotype, in the viral sense, is a list consisting of signed WORD values. Each member of the list is a relative pointer to the next instruction in the buffer. ![]() The first item on the list (0x3D) is in fact the pointer to the first instruction; by adding the second item to the first (0x3D+0xFFC6 = 0x3), we get the pointer to the second instruction - and so on. Additionally, the Genotype may contain two possible flag values:
Mutation After getting to the correct location, the engine will get the length of the instruction and store it in the output buffer. During this phase, the engine will perform instruction transformations based on a random factor. The engine is able to perform the following:
* CMD = ADD/ADC/SUB/SBB/XOR/CMP
* x = {B,W,D}, ACUM = {AL, AX, EAX}
Although some of the transformations above result in growth of code, all of them are reversible,meaning that the engine is able to randomly obfuscate (expand) or optimize (shrink) the code.
The above example illustrates both the extraction of the plain code and instruction transformations:
Permutation After generating a clean copy from the Genotype, the engine will permutate the code in a rather unusual way. On the first phase it will create a pool of NOPs (0x90), and will copy one or more instructions from the plain code into random locations inside that pool. In order to retain the logical code flow, all the instruction blocks will be linked with jumps. During this phase, the engine will add garbage instructions based on the FREE4GARBLE directive. ![]() On the second phase, all the byte sequences will be collected into one buffer by removing the NOPs in between them. ![]() At this point, while running the code in a debugger, I wasn't looking at opcodes - I was staring at small genes floating inside a Primeval Soup, perfectly reordering themselves in a chaotic manner, forming life. Garbage Another interesting thing is t he issue of garbage generation. The main idea is blending legitimate instructions with purely random instructions, without altering the code-flow. Many poly-viruses used commands that were not likely to change the code flow (for example CLC/CMC/NOP, or ADD EAX,0 and the likes) but those are now very easily detected by heuristics. Another method was nesting garbage instructions inside PUSH-POP combinations, allowing the code-flow to remain unaltered, but that resulted in detectors that scanned the stack for magic values. The garbage engine of Lexotan is able to generate instructions that are very common in legitimate code. As witnessed, after the initial regeneration process, the engine will extract XOR ESP, VALUE instructions based on the FREE4GARBLE directive. The value is a flag notifying the engine about possible use of registers and flags.
The engine will replace these instructions with random instructions that don't use the specified registers, allowing garbage to be run without altering the original code flow. The XOR ESP instructions throughout the metamorphed code will eventually be replaced with garbage LEA instructions. Among the instructions the engine may generate are:
- MOV reg1, reg2
- MOV reg, imm
- LEA reg, [r+disp]
- CMD reg, imm ;ADD/ADC/SUB/SBB/XOR/CMP
- INC reg
- DEC reg
The registers and immediate values can be of any size (8/16/32 bit).
Eventually, the resulting code will look this: ![]() Relocations After all the transformations were applied, most (if not all) locations of the original instructions are changed, forcing the engine to fix all the branch instructions (JMP/CALL) that link parts of the code together. The engine copies instructions from an input buffer to an output buffer, and during the transformation process it keeps track of Old vs. New locations in the form of a two- DWORD table (eip_table). Another table (jmp_table), keeping information about branch instructions is constructed. The first member will hold the Branch Destination in the Old Buffer while the second member will hold the Relative Location in New Buffer. Both values are DWORDS. Additionally, there's another BYTE that indicates the size (either 1 which means a short jmp, or 0 that means near jmp). In order to simplify, let's try fixing a simple case: ![]() As you can see, the code became smaller by 3 bytes. Notice that the JNE in the output buffer no longer leads to the RET instruction. For the above example, the following tables will be constructed: ![]() So, as you can probably deduct, the new relative value to be patched is 01 (instead of 04), which will fix the jmp instruction correctly. New Genotype After the initial Genotype is extracted (in the regeneration process), the engine saves relative locations of instructions in a table called userlist in the same manner and structure of the Genotype. The userlist functions as an interim Genotype and its values are fixed during the phases of permutation using the same lookup method seen above, as offsets change very dynamically. After all the transformations are over, the engine will use the userlist in order to generate a new Genotype, for use with the new metamorphed code. All the XOR ESP instructions will be converted to the format of the Genotype's FREE4GARBLE directive. In the next generation, the new Genotype will be used to clean the body from all the garbage and permutation, and will undergo instruction mutations again, thus resulting in an ever-changing base code. Appendix A: Code as Art I wouldn't be able to go about finishing this paper without a little of what my friends and family call �digging'. These are some of the thoughts that floated through my mind while studying the virus code. As this is not the product of a deep philosophical investigation, the definitions I provide are probably not complete and should be taken into account according to their context.
In the above example we go from the lowest to the highest levels of abstraction. Now, we should carefully examine this table: The English sentence provides instructions on how to calculate the force of a moving body using verbs and nouns while the Assembly code uses word-like commands and variables. Both sentences have one thing in common that the others don't: The English sentence uses Conjunctions (and other descriptive words) and the Assembly code uses Registers. In the C and Math representations, however, we can see that although meaningful names can be used, the commands (first given in English form) have now become operators, and the Modus Operandi was now replaced with a formula. In the world of Software, the purpose of High-Level languages is to formalize blocks of instructions into more abstract and unavoidably generic constructs. Due to its low-level nature, the Assembly language (like English) provides the coder with many more nuances to achieve the desired action, semantically, while the C language provides a more abstract and flexible form. This is, of course, without insulting the fantastic C language. When coding High-Level statements such as IF, FOR and WHILE you can sometimes feel like you are speaking directly to the compiler, but in the realm of bit-manipulation and optimization, the Assembly coder is the artist. Now, as witnessed, it is possible to turn the English sentence into a set of commands. You can also transform that sentence into a poem, and skilled writers can probably create a hybrid of the two forms. What defines one variation as more artistic than the other? The answer lies in the manifestation of the concept. Code is not about telling a story, it's about instructing the computer on how to perform certain actions. It's not what you are doing; it's HOW you do it. The source code of Lexotan32 itself is a very customized Assembly source. It is filled with macros and custom-directives and was written in a way that forces a custom pre-processor to be run prior to compilation. At first it might alienate the reader, but as you go deeper you can come across very amusing macros such as �coin' whose purpose is to return a random number between 0 and 1. The author could have done that in different ways, but instead he chose to flip a virtual coin in order to determine some of the next actions. Another example can be seen in the disassembler module, where the complexity and diversity of the Intel instruction set is broken down into a simple yet dominant macro. The author of Lexotan32 was dealing with problems already encountered before, yet implemented them all in very original ways� Regeneration (Genotype), Permutation (NOP Pool) and Garbage (FREE4GARBLE), all wrapped up in a custom PE file � all of those are examples of colorful coding. The source code was Assembly poetry, the concept was inspired and carefully crafted, the execution was filled with imagination and vision and the binary output was simply fantastic, every time anew. Code can be Art. Q.E.D. Appendix B: The Genealogy of Lexotan32 Since we are dealing with art, we must also take a look at the ideas that preceded and contributed to the final creation. It is also quite interesting to note the development of certain concepts as well as their sometimes embryonic implementation. Although this subject diverges from the scope of this paper, I decided to include a very brief history of the development of this virus through other viruses created by Vecna. 1. Miss Lexotan 6mg The first Lexotan virus was actually a DOS metamorphic virus written in 1998 which was followed by several variants. The metamorphism model that was first introduced is similar to the one that can be found in the PLY virus. Every instruction was to be padded with NOPs, so that all instructions could have a fixed length of 5 bytes. 90 nop 90 nop 90 nop 90 nop 1E push ds 90 nop 5B pop bx 90 nop 90 nop 90 nop B8 F0 FF mov ax, 0FFF0h 90 nop 90 nopThis way, the engine doesn't have to worry for relocations. The main virus body would be encrypted (early Genotype), and the engine would unpack it using a small built-in length-disassembler, and would again randomly pad the instructions with NOPs on the next generation. In the next major variant (0.3), NOP padding was replaced by the unique garbage engine: BA 06 FB mov dx, 0FB06h B9 75 02 mov cx, 275h 1E push ds 5B pop bx BE 4A A7 mov si, 0A74Ah BF 05 95 mov di, 9505h B8 F0 FF mov ax, 0FFF0hIn version 0.6 and forth, the metamorphic engine was able to support instruction mutations, similar to the ones found on the 32bit version. 2. RegSwap The same year, Vecna released one of the first metamorphic viruses for the Windows 95 platform, RegSwap. In this virus, all the instructions remained the same, but with different register usage on every generation. In order to achieve that, Vecna introduced the idea of using tables with relative distances and information regarding register indexes. The table would travel with the virus, the same way the modern Genotype was used. 3. Ramones In 2001, Vecna released the W95.Ramones worm, which utilized an early version of the same permutation and garbling engine seen in Lexotan32 (the MixBody module). This engine (that was also published in XINE#5) was almost identical to its final version found in Lexotan32. A visual diagram describing the evolution of Lexotan: ![]() Ending As I said in the beginning, the source-code of this virus is in fact available for everyone to examine, and the fact is that it holds many other interesting features such as the infection mechanism, the mixing of variables, compression, encryption and many other unique implementations. Unfortunately, those subjects had to be left out of this paper, but I advise the curios reader to examine them more closely. What I did was just compiling and running the code. The rest was just like watching a movie through Hex dumps. Vecna - Thanks. Orr www.antilife.org/ |