Created: Wednesday, December 13 2006 23:58.16 CST Modified: Wednesday, December 27 2006 21:10.05 CST
Branch Tracing with Intel MSR Registers
Author: pedram # Views: 76528

The ability to "trace" code is rather useful and can be leveraged to ease the burden of a number of tasks ranging from standard issue debugging to vulnerability hunting to malware analysis and more. Debuggers such as OllyDbg implement code tracing via conventional single stepping. This is easily accomplished in a debugger by setting the appropriate EFlag bit in the thread context you wish to single step (Python Win32 example):

    context = self.get_thread_context(thread_handle)
    context.EFlags |= EFLAGS_TRAP
    self.set_thread_context(context, thread_handle=thread_handle)
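
Setting and clearing the flag are easy to mix up, so here is a minimal, debugger-independent sketch of the bit arithmetic. The constant value assumes the standard x86 layout, where the trap flag is bit 8 of EFLAGS; the helper names are illustrative and not part of PyDbg.

```python
# Trap flag (TF) is bit 8 of the x86 EFLAGS register.
EFLAGS_TRAP = 0x00000100

def set_trap_flag(eflags):
    """Return eflags with TF set, which requests a single-step exception."""
    return eflags | EFLAGS_TRAP

def clear_trap_flag(eflags):
    """Return eflags with TF cleared and every other bit left untouched."""
    return eflags & ~EFLAGS_TRAP & 0xFFFFFFFF

eflags = 0x00000246                      # a typical user-mode EFLAGS value
stepping = set_trap_flag(eflags)
print(hex(stepping))                     # 0x346
print(hex(clear_trap_flag(stepping)))    # 0x246
```

OR-ing sets the bit; AND-ing with the inverted mask clears it.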

One by one, as instructions are executed, the debugger is trapped with an EXCEPTION_SINGLE_STEP event. In the case of OllyDbg, various register states are stored and execution is continued. For those of you who haven't used this feature before, believe me when I say that it's painfully slow on medium to large chunks of code. This was one of my main motivations behind creating the PaiMei Process Stalker module. Process Stalker improves code tracing speed by monitoring execution of basic blocks as opposed to individual instructions. What exactly does this mean? Sequences of assembly instructions can be broken down into "basic blocks". Basic blocks can be grouped together to form Control-Flow Graphs (CFGs). These are familiar terms, but for those of you who don't know them, consider the following example deadlisting:

This straight sequence can be broken into basic blocks which are defined as sub-sequences of instructions where each instruction within the block is guaranteed to be run, in order, once the first instruction of the basic block is reached. The strict definition for basic blocks differs here and there, for example you may or may not want to consider a CALL instruction the end of a basic block (depending on whether you care to take the effort to determine if that CALL is guaranteed to return). You get the general idea though. Here is that same sequence broken down into a CFG:

Instead of tracing code at the instruction level, Process Stalker traces code at a higher level by setting and monitoring breakpoints at the head of every basic block. Since every instruction within a block is guaranteed to be executed once the block is reached, there is no need to trace any further into the block. Improving code trace speed using the basic block method is not a novel idea; SABRE Security's (http://www.sabre-security.com/) commercial product Bin Navi utilizes the same technique. Even more creative mechanisms for improving trace speed have been developed as well: see Matt Conover's project list at http://www.cybertech.net/~sh0ksh0k/projects/ as well as the recent blog entry and tool announcement from McAfee at UMSS: Efficient Single Stepping on Win32.
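
To make the block-head idea concrete, here is a minimal sketch of how the breakpoint addresses could be derived from an already-disassembled listing. The tuple format, the tiny branch-mnemonic set and the sample listing are all assumptions made up for illustration; Process Stalker itself takes this information from IDA Pro.

```python
# Hypothetical instruction records: (address, mnemonic, branch_target or None).
BRANCHES = {"jmp", "jz", "jnz", "ret"}   # illustrative subset, not a full list

def block_heads(instructions):
    """Return the sorted addresses at which a basic block begins."""
    heads = {instructions[0][0]}                  # the entry point starts a block
    for i, (addr, mnem, target) in enumerate(instructions):
        if mnem in BRANCHES:
            if target is not None:
                heads.add(target)                 # a branch target starts a block
            if i + 1 < len(instructions):
                heads.add(instructions[i + 1][0])  # so does the instruction after a branch
    return sorted(heads)

listing = [
    (0x1000, "mov", None),
    (0x1002, "cmp", None),
    (0x1004, "jz",  0x100A),
    (0x1006, "inc", None),
    (0x1008, "jmp", 0x100C),
    (0x100A, "dec", None),
    (0x100C, "ret", None),
]
print([hex(a) for a in block_heads(listing)])
# ['0x1000', '0x1006', '0x100a', '0x100c']
```

Seven instructions collapse into four breakpoints here; on real code the ratio is usually much more favourable.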

There are downfalls to all of these shortcuts, however. The creative ones can get quite complicated, and the basic block method requires that the target binary be pre-analyzed to determine the location of all the basic blocks. To successfully enumerate basic blocks you must first correctly differentiate between code and data within an individual binary. This is easier said than done, which is why both Process Stalker and Bin Navi rely on IDA Pro's analysis. This is a drag because not only does it introduce significant steps to set up the trace, but IDA can make mistakes that can botch the entire trace. An improved version of basic block tracing is desired.

Some time ago I was flipping through the IA-32 Intel Architecture Software Developer's Manual Volume 3 when I came across the following information in section 15.5 (page 515, specific to the provided link):

    The MSR_DEBUGCTLA MSR enables and disables the various last branch recording mechanisms
    described in the previous section. This register can be written to using the WRMSR
    instruction, when operating at privilege level 0 or when in real-address mode. A protected-mode
    operating system procedure is required to provide user access to this register. Figure 15-4 shows
    the flags in the MSR_DEBUGCTLA MSR. The functions of these flags are as follows:
    BTF (single-step on branches) flag (bit 1)
    When set, the processor treats the TF flag in the EFLAGS register as a "single-step
    on branches" flag rather than a "single-step on instructions" flag. This
    mechanism allows single-stepping the processor on taken branches, interrupts,
    and exceptions. See Section 15.5.4., "Single-Stepping on Branches, Exceptions,
    and Interrupts" for more information about the BTF flag.
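
The flag layout above boils down to a small bit mask. As a rough sketch (the helper name is made up; the LBR position, bit 0, comes from the same Intel figure and is also mentioned in the comments below):

```python
MSR_DEBUGCTLA = 0x1D9      # register address used on the Pentium M in this post
DEBUGCTL_LBR  = 1 << 0     # bit 0: last branch recording
DEBUGCTL_BTF  = 1 << 1     # bit 1: single-step on branches

def debugctl_value(btf=True, lbr=False):
    """Build the value to write to MSR_DEBUGCTLA for the requested flags."""
    value = 0
    if btf:
        value |= DEBUGCTL_BTF
    if lbr:
        value |= DEBUGCTL_LBR
    return value

print(debugctl_value())            # 2
print(debugctl_value(lbr=True))    # 3
```

A value of 2 therefore means "BTF only", and 3 means "BTF plus last branch recording".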

According to the documentation, the behaviour of single step can be altered through a flag in one of the Model Specific Registers (MSRs). So I threw some PyDbg based Python code together to test this out. First, I implemented a conventional single step tracer: tracer_single_step.py. Next, I modified that tracer with the appropriate MSR setting code: tracer_msr_branch.py. I ran them both and, to my pleasant surprise, it worked like a charm. Try them for yourself: attach to and interact with Calculator using the single step tracer, then try it again with the MSR tracer. The speed difference is quite notable. Implementing the MSR tracer required only minimal changes. First, some definitions:

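    from ctypes import Structure, byref, c_ulong, c_ulonglong, sizeof, windll

    # SYSDBG_COMMAND values understood by NtSystemDebugControl().
    SysDbgReadMsr  = 16
    SysDbgWriteMsr = 17

    ULONG     = c_ulong
    ULONGLONG = c_ulonglong

    class SYSDBG_MSR(Structure):
        _fields_ = [
            ("Address", ULONG),
            ("Data",    ULONGLONG),
        ]

    def write_msr():
        msr = SYSDBG_MSR()
        msr.Address = 0x1D9     # MSR_DEBUGCTLA
        msr.Data    = 2         # bit 1: BTF (single-step on branches)
        # arguments: command, input buffer, input size,
        #            output buffer, output size, returned length
        status = windll.ntdll.NtSystemDebugControl(SysDbgWriteMsr,
            byref(msr), sizeof(SYSDBG_MSR), None, 0, None)
        return status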

The write_msr() routine defined above utilizes the NtSystemDebugControl() Windows native API (special thanks to Alex Ionescu for his help with this) to set the appropriate MSR values specific to my Pentium M processor. Your mileage may vary with those values; check the Intel manual for the appropriate numbers. Next, all you have to do is follow every call to single_step() with a call to write_msr():

    # re-raise the single step flag on every block.
    def handler_single_step (dbg):
        dbg.single_step(True)
        write_msr()
        return DBG_CONTINUE

    # ensure every new thread starts in single step mode.
    def handler_new_thread (dbg):
        dbg.single_step(True)
        write_msr()
        return DBG_CONTINUE

I'll be adding MSR routines to PyDbg in a future release and will also release a new version of Process Stalker that does not require any pre-analysis to accomplish its code tracing... When I find the time to do all that I may expand this blog entry with further details and examples into a full blown article.

Caveat Emptor: It appears that you can kill your CPU if you carelessly fool around with MSRs. So there, I said it, be warned.

Blog Comments
morphique Posted: Thursday, December 14 2006 10:07.50 CST
Great stuff. Thanks for the pointer. I have been working on path finding and buffer mutation stuff and this information really helped.

I can't believe that I failed to spot this information in the IA-32 manual.

Cheers Morph

PSUJobu Posted: Friday, December 15 2006 18:40.00 CST
First, thanks for the pointer. Second, when you say "kill your CPU", are you speaking of a temporary or permanent death?  In other words, are you killing it until the next power cycle or actually bricking your PC?!

camill8 Posted: Friday, December 15 2006 19:05.26 CST
Here is code which will do it in Linux.  Warning: I ripped this out of some other code I have so no promises on how well it works.  Your msr "device" may be at a different location, edit those lines accordingly.  You need the msr kernel module installed and the program must be run as root.  Enjoy.
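
For readers on Linux, a minimal Python 3 sketch of the same idea follows. It is not camill8's code. It assumes the stock msr driver, which exposes each CPU's registers as a file (commonly /dev/cpu/0/msr) where the file offset is the MSR address and every access is 8 bytes wide; the function names and the default device path are assumptions.

```python
import os
import struct

def linux_write_msr(address, value, device="/dev/cpu/0/msr"):
    """Write a 64-bit value to the MSR at `address` via the msr device file."""
    fd = os.open(device, os.O_WRONLY)
    try:
        # The file offset selects the register; the payload is 8 bytes, little-endian.
        os.pwrite(fd, struct.pack("<Q", value), address)
    finally:
        os.close(fd)

def linux_read_msr(address, device="/dev/cpu/0/msr"):
    """Read the 64-bit value of the MSR at `address` via the msr device file."""
    fd = os.open(device, os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)

# Example (needs root and a loaded msr module):
#   linux_write_msr(0x1D9, 2)    # set BTF in the debug control MSR on CPU 0
```

Note that this only touches CPU 0; a multi-core machine has one device file per core.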


pedram Posted: Friday, December 15 2006 19:27.09 CST
PSUJobu: No I really mean you can burn out your CPU. I could be wrong ... someone should try it and find out for sure ;-)

Darawk Posted: Sunday, December 17 2006 21:14.12 CST
You can do this with other events too by using a clever little trick with the performance monitoring buffers.  There is a bit in one of the MSRs that specifies what action to take when a trace buffer gets full.  If you tell it to generate an interrupt, every time your trace buffer gets full, your code will get control.  However, you can set the trace buffer to any size you want, including a size too small to fit a single event.  This means that you can effectively break on every event listed in Appendix A of Volume 3, just like a branch trace.

Opcode Posted: Sunday, December 17 2006 23:51.04 CST
Very nice trick, Darawk!
Thanx ;)


pedram Posted: Tuesday, December 19 2006 15:54.30 CST
Just a quick note. A friend of mine wrote and reminded me to mention that these MSR games do not work in VMWare (unfortunately).

otto Posted: Wednesday, December 27 2006 19:58.18 CST
pedram: The code that you give at the beginning clears the trap flag instead of setting it (as you wrote). At least with my logic you cannot set a flag with the AND operator :)

pedram Posted: Wednesday, December 27 2006 21:10.35 CST
otto: My bad ;-) Copy pasted the wrong line, fixed now.

stam321 Posted: Wednesday, January 3 2007 03:33.52 CST
Thanx for the info.

Is the tracer_msr_branch.py link forbidden on purpose?


pedram Posted: Wednesday, January 3 2007 05:19.49 CST
stam321: Nope, it shouldn't be forbidden. Fixed now, thanks for pointing it out.

stam321 Posted: Wednesday, January 3 2007 07:37.21 CST

I've tried branch tracing using the MSR with a P6 family dual-core CPU and couldn't get it to work.
It seems that the DEBUGCTLA address is the same on all CPUs
after the P4 (0x1D9) and bit 1 controls the BTF.
I tried many things and it doesn't work. Any ideas?


morphique Posted: Friday, January 19 2007 10:06.26 CST
I'm trying to write an MSR with the following function, but NtSystemDebugControl returns 0xc0000022 (access denied). Any ideas on what might be the possible cause?

typedef struct _SysDbgMsr{
ULONG Address;
ULONGLONG Data;
}SysDbgMsr, *PSysDbgMsr;

void SetMsr(){

PSysDbgMsr msr;
SysDbgMsr readmsr;
msr = (PSysDbgMsr)LocalAlloc(LPTR, sizeof(*msr));
msr->Address = 0x01D9;
ULONGLONG value = 2;
msr->Data = value;
NtSystemDebugControl = (PNtSystemDebugControl)getfunc("ntdll", "NtSystemDebugControl");
if (!NtSystemDebugControl)
error("failed retrieving the systemdebugcontrol pointer");
NTSTATUS status = NtSystemDebugControl(SysDbgWriteMsr, (PVOID)msr, sizeof(*msr), 0, 0, 0);
printf("Return status: 0x%08x\n", status);
}


stam321 Posted: Sunday, January 21 2007 03:48.46 CST
Did you set debug privs?
Like in pydbg, get_debug_privileges().
I checked it and if you don't set your
privs it doesn't work.

Care to chat about this? I did some experiments
on the subject.

Hope it helps,

stam321 Posted: Sunday, January 21 2007 08:44.13 CST

Did you try this trick?

morphique Posted: Sunday, January 21 2007 12:19.18 CST
OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &token_handle)
instead of this I was doing
OpenProcessToken(process_handle, TOKEN_ADJUST_PRIVILEGES, &token_handle)

process_handle = process handle of debugee.

It works fine now.

Kasperle Posted: Sunday, January 21 2007 18:57.17 CST
This blog post finally made me regain interest in some unfinished work I did to integrate this feature with FreeBSD's ptrace API :)

Very nice work. This solution does enable branch tracing for every process being single stepped at the same time, is that correct?

Sirmabus Posted: Sunday, June 10 2007 18:54.32 CDT
After I read this some time ago, I've been wanting to dust off an old project that might make for a handy tool if I can get it to work right.

I setup things as described here (I'm using C/C++ in WinXP SP2) and got it to "single step on branches".  But I saw strange things, at least not what I was expecting.
The exception/EIP address seemed to land before or after a branch, more or less at random.

After a few hours of reading I found we are missing a few things here.
First I recommend looking at the AMD manuals; they are a little less dry and easier to read than the Intel ones.  Grab them here, and/or search the site for publication "24593".

I saw this simple explanation some place:
"Single Step Control Transfers:
* Breakpoints can be set to occur on control transfers, such as calls,
jumps, interrupts, and exceptions. This can allow a debugger to narrow a problem search to a specific section of code before enabling single stepping of all instructions."

Most of the MSR/debug stuff I found in volume 2.
The MSR/debug info we are interested in starts on page 321, and 420.

FYI, the MSR debug registers are very similar to, and even complement, the "legacy" DRx registers. So don't fear them.  I seriously doubt you can "..burn out your CPU". At most I could see this if one messed with the wrong registers, like the CPU clock/multiplier registers that some processors have (that you can access via the MSR).

Here is what I was missing, and probably what others will need to use this single step mode:

Page 322, 420, vol 2, defines the DebugCtlMSR (MSR 0x1D9) control bits.
In addition to the BTF (bit 1) already described, also set the LBR (bit 0).
This will turn on the recording of the last BTF exception.

Page 323, 420, vol 2, defines the recording registers (0x1DB to 0x1DE).
You will see two interesting registers, "LastBranchFromIP" (0x1DB) and "LastBranchToIP" (0x1DC).  We already get the "to" from "ExceptionInfo->ExceptionRecord->ExceptionAddress" (or ExceptionInfo->ContextRecord->Eip); 0x1DB tells us the address where the branch came from, and for a virtual address we can just read the low 32 bits (ignore the high 32 bits)!

So now it should make sense, when the "EXCEPTION_SINGLE_STEP" happens, read the recording  register 0x1DB to get the source/from address.

For me, I just want to handle the calls.  So for every exception I have to grab the source address and do sort of a mini-disassemble on the first few opcode bytes to determine if it was a CALL instruction or not.
Fortunately, in most normal 32bit applications the majority of the calls will be a single byte 'E8' (5 byte relative 32bit call) test.
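
A minimal Python 3 sketch of that E8 test, for illustration only. It assumes the raw 64-bit LastBranchFromIP value and the first bytes at that address have already been read by some other means; all function names here are made up.

```python
import struct

def low32(msr_value):
    """Keep only the low 32 bits of a 64-bit MSR value."""
    return msr_value & 0xFFFFFFFF

def rel32_call_target(from_addr, code):
    """Return the destination if `code` starts with an E8 rel32 CALL, else None."""
    if len(code) < 5 or code[0] != 0xE8:
        return None
    (rel,) = struct.unpack_from("<i", code, 1)     # signed 32-bit displacement
    # The displacement is relative to the end of the 5-byte instruction.
    return (from_addr + 5 + rel) & 0xFFFFFFFF

def is_direct_call(from_addr, to_addr, code):
    """True if the branch from `from_addr` to `to_addr` was an E8 CALL."""
    return rel32_call_target(from_addr, code) == to_addr

src = low32(0xFFFFFFFF00401000)                    # high half is ignored
print(is_direct_call(src, 0x00402000, bytes.fromhex("e8fb0f0000")))   # True
print(is_direct_call(src, 0x00402000, bytes.fromhex("ebfe")))         # False
```

Comparing the decoded target against the exception address also filters out the case where an E8 byte merely happens to sit at the source address. Indirect calls (FF /2) are not covered by this check.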

Page 329,  vol 2, shows us what "control transfers" trigger this single step exception to fire:
* Jcc, JrCXZ, LOOPcc
* Exceptions, IRET

The good news is this whole feature seems to be very common on all relatively new CPUs.  So far it has tested functional on an older P4 and on an AMD64-3700.

So far my tests have only used a single-threaded (and single-core) test application.  I look at this and wonder if there will be thread problems.  There is just one recorder register on the CPU, not one per thread.  Hopefully, the way it and the Windows SEH work, it will all be synchronous and not miss a step. If there are such problems, one might have to make some sort of kernel driver that adds the missing support in the SEH.  It seems this should be in the context structure, and read just like one can read the DRx registers.

Sirmabus Posted: Tuesday, June 12 2007 16:44.24 CDT
And more, Intel processors have a stack of up to 16 stores for the recorder depending on the processor.

Looks like one will have to make a little system that takes the particular processor into account.

And I'm still not certain this doesn't require a modification to the kernel SEH flow. Maybe one thread is easy enough in user mode, but what happens to the recording register(s) when you've got multiple threads firing off the branch exception around the same instant?
I see this ideally as ending up in the  Win32 context structure.  
Seems to me a "ExceptionInfo->ContextRecord->BranchFrom" would be ideal..

Sirmabus Posted: Wednesday, June 13 2007 19:29.14 CDT
These guys made a OllyDbg branch trace logger plug-in:


Edit: But, not using the MSR/BTF thing yet.

Sirmabus Posted: Wednesday, July 25 2007 00:12.09 CDT

Got branch tracing working very well using a KMD and doing a hardware hook on int1.

So well that it can trace complex processes (with many threads) in real time. Quite a slow down, but workable.

More R&D and eventually demo to follow..

EDIT: I'm giving Pedram's blog back and I'll write my adventures in my own blog here :-)

feryno Posted: Monday, October 29 2007 09:54.34 CDT
hello, I successfully implemented the branch features in my debugger in the x64 version of windows, here is the link to get to know the history:

my first successful approach was done using a driver, it worked, but at the end I discovered an easier way by a backdoor in windows (only by a coincidence thanks to disassembling ntoskrnl.exe when debugging my driver)

my project implements both:
1. step on branches (DebugCtlMSR.BTF=1)
2. LastBranches recording (DebugCtlMSR.LBR=1) -LastBranchFromIP, LastBranchToIP, LastExceptionFromIP, LastExceptionToIP

enjoy it,

feryno Posted: Tuesday, October 30 2007 01:44.59 CDT
and here a link to an article about the easy ring3 method:
a bit of a long read, so in short: for enabling branches in win x64, just set some bits in the debug 7 register (DR7.LE, DR7.GE). That's all! Simple, huh? Unbelievable and funny? No, that way really works under current versions of x64 windows.
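
As a small illustration of which bits are meant: in the architectural DR7 layout, LE is bit 8 and GE is bit 9. The sketch below only shows the mask arithmetic on a DR7 value taken from a thread context; the helper name is made up, and whether Windows really maps these bits onto the branch features is feryno's finding and is not verified here.

```python
DR7_LE = 1 << 8    # local exact breakpoint enable
DR7_GE = 1 << 9    # global exact breakpoint enable

def enable_exact_bits(dr7):
    """Return the DR7 value with LE and GE set, other bits unchanged."""
    return dr7 | DR7_LE | DR7_GE

print(hex(enable_exact_bits(0x0)))      # 0x300
print(hex(enable_exact_bits(0x401)))    # 0x701
```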

Sirmabus Posted: Thursday, November 1 2007 09:45.19 CDT
Thanks feryno.

I used a driver myself for a direct int1 handler.
It works well because it eliminates a lot of overhead.

I was able to use the last branch record stack too, but have you been able to get the "DS store" going under windows?
It takes a lot of setup and apparently can only work with several kernel OS hacks (like VTune does).  There is next to no documentation (that I can find) setting the DS buffer to work on windows..

Sirmabus Posted: Tuesday, May 24 2011 02:26.36 CDT
It's been some time now, but I incidentally found the problem with doing this BT ("branch tracing") in R3 some time ago.
I just didn't look deep enough.
Although really to do this in real time you probably want to use a R0 driver anyhow.
Without a driver you will be going through all the hoops of the OS exception handling, down from its IPC messages into the R3 handling of your process, etc.

Anyhow the problem is the same faced with doing "page guard" or similar memory exception tracing/hooking.
Every time an exception fires it's passed down from kernel to user code in ntdll.dll and on through kernel32.dll.

If you happen to branch trace into some of these places the flow will spin around in an obvious feedback loop and crash once stack space is exhausted.
With at least memory exceptions I found I could bypass the problem using a ntdll.dll hook on "KiUserExceptionDispatcher" but haven't tried with BT'ing yet.
Maybe with such a hook, and/or with some additional logic in the BT handler for these conditions, it will work fine in R3.
Another possibility is cleverly using hardware break points on entry and exit points to perhaps pause and continue the BT'ing at strategic points.  But then it would limit things to four threads and use this resource..

dimmu Posted: Wednesday, February 6 2013 10:42.11 CST
Is call opcode a branch instruction ??
My point is with setting the msr branch flag can we stop at step_in of a call ??
