|Feryno, 2007-09-29||Revision: 1.0|
A debugger is a kind of black box for a regular user. The interactions between the debugger and the OS are kept under the cover. Let's uncover them and see how it all works.
This article was composed from the Debugging in Long Mode (AMD64) slides, first presented at FASM Technical Conference II, Brno, Czech Republic, 25 August 2007.
While debugging, we are playing with an executable program. We can stop it, change its memory or registers when it is stopped, step through it, and resume its execution.
The CPU executes code very quickly. During debugging, we can execute code at a speed observable by human senses (sight).
To play this game, we need another program - a debugger.
Why do programmers need debugging?
Debugging is possible thanks to a CPU feature called exceptions.
The first 32 interrupt vectors (00h-1Fh) are reserved for exceptions. Exceptions behave very similarly to interrupts - every exception interrupts the program's execution, and control is transferred from the currently executing program to the routine handling the exception. These routines are part of the OS kernel and are called "exception handlers". During the control transfer to the exception handler, the CPU stops execution of the program and saves its return instruction pointer (RIP), stack pointer (RSP), and flags register (RFLAGS). The handler is responsible for saving the remaining state of the interrupted program (GPR, XMM, …). Saving the registers allows the interrupted program to be restarted after the handler finishes exception handling.
Most of the time, an exception means an occurrence of a "degenerated" instruction or code in the program. In this case, the exception boundary is reported before the instruction causing the exception, and the interrupted instruction isn't allowed to complete. These exceptions are called faults.
To make life more complicated, the reported instruction pointer sometimes lies at the address of the following instruction, so the boundary is reported after the instruction causing the exception, and that instruction is allowed to complete. These exceptions are called traps. The benefit of traps for our life is that they are the core of debugging.
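The difference between the two kinds of exception boundary can be put into a few lines of Python. This is only a minimal model for straight-line (non-branching) instructions; the function name, the addresses, and the instruction lengths are assumptions chosen for illustration.

```python
def reported_rip(kind, insn_addr, insn_len):
    """Return the instruction pointer the handler sees for a fault or a trap.

    Simplified model: valid for an instruction that does not transfer control.
    """
    if kind == "fault":
        # Boundary before the instruction: it has not completed.
        return insn_addr
    if kind == "trap":
        # Boundary after the instruction: it has completed.
        return insn_addr + insn_len
    raise ValueError("kind must be 'fault' or 'trap'")


# div rcx (3 bytes) faulting at a hypothetical address 401000h
fault_rip = reported_rip("fault", 0x401000, 3)
# int3 (1 byte) trapping at the same hypothetical address
trap_rip = reported_rip("trap", 0x401000, 1)
```

With these inputs the fault reports 401000h (the instruction can be restarted), while the trap reports 401001h (the instruction is already done).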
Divide error exception (#DE). Triggers int 0 vector.
Sample 1: divisor is zero
Note. All code samples are written in a syntax of my favorite assembler, fasm.
mov rcx, 0
div rcx
Sample 2: result (quotient) is too large for the designated register
mov rdx, 3
mov rax, 0 ; dividend rdx:rax = 3 * 2^64
mov rcx, 2
div rcx ; the quotient 3 * 2^63 doesn't fit into rax
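The arithmetic behind both samples can be checked with a small Python model of the unsigned 64-bit div instruction. It is a sketch, not an emulator: the function name and the use of Python exceptions to stand in for int 0 are assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1


def div64(rdx, rax, divisor):
    """Model unsigned `div r64`: divide rdx:rax by divisor.

    Returns (quotient, remainder) as they would land in (rax, rdx),
    or raises where the CPU would raise the divide error (int 0).
    """
    if divisor == 0:
        raise ZeroDivisionError("int 0: divisor is zero")
    dividend = (rdx << 64) | rax
    quotient, remainder = divmod(dividend, divisor)
    if quotient > MASK64:
        raise OverflowError("int 0: quotient does not fit into rax")
    return quotient, remainder


# Sample 2 from the text: rdx:rax = 3 * 2^64, divisor 2
try:
    div64(3, 0, 2)
    sample2_raises = False
except OverflowError:
    sample2_raises = True
```

The quotient of Sample 2 would be 3 * 2^63, which needs 65 bits, so the model raises just as the CPU does.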
Debug exception (#DB). Triggers int 1 vector.
Sample 1: the icebp instruction
icebp ; opcode 0F1h
Sample 2: fundamental method of single stepping (in fact, OS sets this bit in program context and reloads registers when task switching)
pushfq
or qword [rsp], 1 shl 8 ; set Trap Flag
popfq
Sample 3: fundamental method of hardware breakpoints
lea rax, [trap_instruction]
mov dr0, rax
mov eax, 1
mov dr7, rax
trap_instruction:
lea rax, [mem_write_addr]
mov dr0, rax
mov eax, 10001h ; L0=1, R/W0=01b (data write), LEN0=00b (1 byte)
mov dr7, rax
...
mem_write_addr rb 1
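Breakpoint exception (#BP). Triggers int 3 vector.
int3 ; 1-byte form
db 0CCh ; the same byte emitted directly
int 03h
db 0CDh, 03h ; the 2-byte encoding emitted directly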
Invalid opcode exception (#UD). Triggers int 6 vector.
Sample 1: documented invalid (undefined) instruction
Sample 2: the source operand of lea is a register; the correct form is lea rax, [rdx]
lea rax, rdx ; opcode 8Dh, 0C2h
Note that a lot of instructions are now illegal in AMD64 64-bit mode…
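Double fault exception (#DF). Triggers int 8 vector.
Stack exception (#SS). Triggers int 0Ch vector.
General protection exception (#GP). Triggers int 0Dh vector.
Page fault exception (#PF). Triggers int 0Eh vector.
Alignment check exception (#AC). Triggers int 11h vector.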
This exception can only occur if the AM bit (bit 18) of the CR0 register is set. Kernel code similar to this sets it:
mov rax, cr0
or rax, 1 shl 18 ; AM bit of CR0
mov cr0, rax
Sample: user-mode code (CPL=3); I assume qword or dqword stack alignment
pushfq
or qword [rsp], 1 shl 18 ; set AC bit of rflags
popfq
mov eax, [rsp+1] ; exception raised here
Note that this exception never occurs if CPL<3.
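The conditions above combine into a single predicate. The following Python sketch restates them; the function name and the sample register values are assumptions, and the model only covers naturally aligned power-of-two access sizes.

```python
CR0_AM = 1 << 18     # Alignment Mask bit of CR0
RFLAGS_AC = 1 << 18  # Alignment Check bit of rflags


def raises_alignment_check(cr0, rflags, cpl, address, size):
    """True if a `size`-byte access at `address` would raise int 11h.

    Simplified model of the rules in the text: the check is armed only
    when CR0.AM=1, rflags.AC=1 and CPL=3.
    """
    armed = bool(cr0 & CR0_AM) and bool(rflags & RFLAGS_AC) and cpl == 3
    return armed and address % size != 0


rsp = 0x7FFF0000  # hypothetical qword-aligned stack pointer
user_mode = raises_alignment_check(CR0_AM, RFLAGS_AC, 3, rsp + 1, 4)
kernel_mode = raises_alignment_check(CR0_AM, RFLAGS_AC, 0, rsp + 1, 4)
```

Here `user_mode` is true (the dword read at rsp+1 is misaligned), while `kernel_mode` is false because the exception never occurs if CPL<3.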
How do a program and a debugger interact with the OS?
The debugger waits for an event from the program. For Linux64:
mov eax, sys_wait4
syscall
For Win64:
call qword [KERNEL32.WaitForDebugEvent]
The debugger reads or changes the registers and memory of the stopped program. For Linux64:
mov edi, PTRACE_GETREGS
mov eax, sys_ptrace
syscall
Useful values are:
PTRACE_GETREGS, PTRACE_SETREGS, PTRACE_PEEKTEXT, PTRACE_POKETEXT, PTRACE_PEEKDATA, PTRACE_POKEDATA
For Win64:
call qword [KERNEL32.GetThreadContext]
Useful API functions are:
GetThreadContext, SetThreadContext, ReadProcessMemory, WriteProcessMemory
The debugger resumes the program. For Linux64:
mov edi, PTRACE_CONT ; continue
mov eax, sys_ptrace
syscall
mov edi, PTRACE_SINGLESTEP ; single step
mov eax, sys_ptrace
syscall
For Win64:
call qword [KERNEL32.ContinueDebugEvent]
If the program doesn't cause any exception, it runs to its end and terminates. In this case, the debugger doesn't encounter any exception; it is only notified about the program's termination. This is the dream of every assembly coder and the desirable final stage of developing any program. Well, not exactly: some procedures may still behave incorrectly and give unexpected return values...
A hardware breakpoint always triggers the int 1 vector. This breakpoint is created by setting some debug registers. There are only 6 useful debug registers: DR0, DR1, DR2, DR3, DR6, and DR7. The others are unused (accessing them causes an invalid opcode exception). Isn't it a pity? But on the other hand, it could be even more complicated.
The debug registers can be read and written only when the current privilege level (CPL) is 0 (most privileged) - kernel:
mov rax, dr7
mov dr3, rcx
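A user-mode debugger running at CPL=3 can access the debug registers of a program when the program is stopped after causing an exception. For Linux64 (the debug registers live in the USER area, so they are read and written with PTRACE_PEEKUSER and PTRACE_POKEUSER rather than PTRACE_GETREGS and PTRACE_SETREGS):
mov edi, PTRACE_PEEKUSER
mov eax, sys_ptrace
syscall
mov edi, PTRACE_POKEUSER
mov eax, sys_ptrace
syscall
For Win64:
call qword [KERNEL32.GetThreadContext]
call qword [KERNEL32.SetThreadContext]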
Each of DR0, DR1, DR2, and DR3 holds a 64-bit virtual (linear) address:
lea rax, [address]
mov dr0, rax
If we set any of the debug registers DR0, DR1, DR2, DR3, we must also set its conditions in the DR7 register: enable bit, type, and length:
|31-30||LEN3||Length of Breakpoint #3|
|29-28||R/W3||Type of Transaction to Trap for Breakpoint #3|
|27-26||LEN2||Length of Breakpoint #2|
|25-24||R/W2||Type of Transaction to Trap for Breakpoint #2|
|23-22||LEN1||Length of Breakpoint #1|
|21-20||R/W1||Type of Transaction to Trap for Breakpoint #1|
|19-18||LEN0||Length of Breakpoint #0|
|17-16||R/W0||Type of Transaction to Trap for Breakpoint #0|
|6||L3||Local Exact Breakpoint #3 Enabled|
|4||L2||Local Exact Breakpoint #2 Enabled|
|2||L1||Local Exact Breakpoint #1 Enabled|
|0||L0||Local Exact Breakpoint #0 Enabled|
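LEN (length of breakpoint):
|00b||1 byte|
|01b||2 bytes, address in the corresponding DRx must be word aligned|
|10b||8 bytes, address in the corresponding DRx must be qword aligned|
|11b||4 bytes, address must be dword aligned|
R/W (type of transaction to trap):
|00b||int 1 breakpoint on instruction execution, LEN must be 1 byte (00b)|
|01b||int 1 occurs only on data write|
|10b||int 1 occurs only on I/O read/write if CR4.DE=1|
|11b||int 1 occurs only on data read or data write|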
If we want to set one of the DR0, DR1, DR2, DR3 registers, we use this scheme:
lea rax, [address]
mov DRx, rax ; x = 0, 1, 2, 3
mov eax, ((length*4 + type) shl (x*4 + 16)) + (1 shl (x*2)) ; length, type = 2-bit LEN and R/W codes
mov dr7, rax
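The scheme above is easy to get wrong by one shift, so here is a small Python helper that computes the same DR7 value and can be used to double-check constants before they go into kernel code. It is a sketch under stated assumptions: the function name, the dictionary names, and the string labels for the breakpoint types are mine, not part of any API.

```python
# 2-bit LEN codes, keyed by breakpoint length in bytes
LEN_CODES = {1: 0b00, 2: 0b01, 8: 0b10, 4: 0b11}
# 2-bit R/W codes, keyed by a descriptive (hypothetical) label
RW_CODES = {"execute": 0b00, "write": 0b01, "io": 0b10, "readwrite": 0b11}


def dr7_set(dr7, x, length, kind):
    """Return dr7 with breakpoint x (0-3) locally enabled.

    Mirrors the scheme in the text:
    ((length_code*4 + type_code) shl (x*4 + 16)) + (1 shl (x*2))
    """
    if x not in (0, 1, 2, 3):
        raise ValueError("x must be 0, 1, 2 or 3")
    if length not in LEN_CODES or kind not in RW_CODES:
        raise ValueError("unsupported length or type")
    if kind == "execute" and length != 1:
        raise ValueError("execution breakpoints require LEN = 1 byte")
    field = (LEN_CODES[length] << 2) | RW_CODES[kind]
    # Clear the old LENx/R/Wx nibble and the Lx/Gx enable bits first.
    mask = (0b1111 << (x * 4 + 16)) | (0b11 << (x * 2))
    dr7 &= ~mask
    dr7 |= (field << (x * 4 + 16)) | (1 << (x * 2))
    return dr7 & 0xFFFFFFFF


qword_watch = dr7_set(0, 0, 8, "readwrite")  # 8-byte read/write watch in DR0
```

For an 8-byte read/write watch in DR0 the helper yields 000B0001h, i.e. (1011b shl 16) + 1.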
We want to watch reading from or writing into 1 qword at address 100005120h (address range 100005120h-100005127h)
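mov rax, 100005120h ; the address doesn't fit into 32 bits, so it can't be an absolute lea operand
mov dr0, rax
mov rax, dr7
and eax, not ((1111b shl 16) + 11b) ; clear LEN0, R/W0, G0, L0
or eax, (1011b shl 16) + 1 ; LEN0=10b (8 bytes), R/W0=11b (read or write), L0=1
mov dr7, rax ; set it finally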
Done, now we can wait until the code falls into the trap! After an access to any byte in the memory range 100005120h-100005127h, int 1 will occur and the DR6.B0 bit will be set to 1.
We want to watch writing into 8 bytes at address range 40AF31h-40AF38h. Setting the length to 8 bytes won't work, because the address isn't aligned on a qword boundary. We must set more breakpoints to cover the whole address range:
mov rax, dr7
and eax, 0000FF00h ; mask off all
lea rdx, [40AF31h]
mov dr0, rdx
or eax, (0001b shl 16) + 1 ; DR0: 1 byte, write, 40AF31h
lea rdx, [40AF32h]
mov dr1, rdx
or eax, (0101b shl 20) + 100b ; DR1: 2 bytes, write, 40AF32h-40AF33h
lea rdx, [40AF34h]
mov dr2, rdx
or eax, (1101b shl 24) + 10000b ; DR2: 4 bytes, write, 40AF34h-40AF37h
lea rdx, [40AF38h]
mov dr3, rdx
or eax, (0001b shl 28) + 1000000b ; DR3: 1 byte, write, 40AF38h
mov dr7, rax
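Splitting an unaligned range into aligned pieces, as done by hand above, can be automated. The following Python sketch greedily picks the largest naturally aligned 1, 2, 4 or 8-byte chunk at each step; the function name is an assumption, and a real debugger would additionally have to refuse ranges that need more than four chunks.

```python
def split_range(start, length):
    """Cover [start, start + length) with naturally aligned chunks.

    Returns a list of (address, size) pairs with size in {1, 2, 4, 8},
    one pair per hardware breakpoint needed.
    """
    chunks = []
    addr, end = start, start + length
    while addr < end:
        size = 8
        # Shrink until the chunk is aligned and stays inside the range.
        while addr % size != 0 or addr + size > end:
            size //= 2
        chunks.append((addr, size))
        addr += size
    return chunks


pieces = split_range(0x40AF31, 8)
```

For the range 40AF31h-40AF38h this gives four pieces of 1, 2, 4 and 1 bytes, the same layout as the DR0-DR3 setup above, and exactly as many as the hardware offers.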
We want to break on the execution of an instruction at 401235h.
Note that the instruction must start exactly at this address. If the set address lies somewhere inside the instruction (in case the instruction has 2 or more bytes) then int 1 won't occur!
lea rax, [401235h]
mov dr0, rax
mov rax, dr7
and eax, not ((1111b shl 16) + 11b) ; mask off all
or eax, (0000b shl 16) + 1
mov dr7, rax
We want to watch reading from or writing into ports 20h-27h. This is possible only if the CR4.DE bit (bit 3 - Debugging Extensions) is set by kernel code similar to this:
mov rax, cr4
or rax, 1 shl 3 ; CR4.DE (bit 3)
mov cr4, rax
This breakpoint is very useful in kernel mode:
mov eax, 20h ; port number
mov dr3, rax
mov rax, dr7
and eax, not ((1111b shl 28) + 11000000b) ; mask off all
or eax, (1010b shl 28) + 01000000b ; LEN3=10b (8 bytes), R/W3=10b (I/O)
mov dr7, rax
The condition which caused the int 1 exception is recorded in the DR6 debug-status register:
|14||BS||Single Step (rFLAGS.TF has been set)|
|13||BD||Breakpoint Debug Access Detected (DR7.GD has been set)|
|3||B3||Breakpoint #3 Condition Detected|
|2||B2||Breakpoint #2 Condition Detected|
|1||B1||Breakpoint #1 Condition Detected|
|0||B0||Breakpoint #0 Condition Detected|
|13||GD||General Detect Enabled|
When this bit is set, the debug exception (int 1) occurs when an attempt is made to execute a MOV DRn instruction to any debug register (DR0-DR7). This bit is cleared to 0 by the processor when the int 1 handler is entered, allowing the int 1 handler to read and write the DR registers. The int 1 exception occurs before executing the instruction, and DR6.BD is set by the processor. Software debuggers can use this bit to prevent the currently-executing program from interfering with the debug operation.
On entry to the handler, the processor has already cleared DR7.GD (bit 13), so mov rax, dr6 doesn't cause int 1 again.
int01_handler:
push rax
mov rax, dr6
bt eax, 14
jc single_step_detected
bt eax, 13
jc debug_access_detected
test eax, 1 shl 3
jnz bp3_detected
test eax, 1 shl 2
jnz bp2_detected
test eax, 1 shl 1
jnz bp1_detected
test eax, 1
jnz bp0_detected
If none of these bits is set, the exception was caused by the icebp instruction (opcode 0F1h).
icebp_detected:
...
pop rax
iretq
Note that there are no other sources of int 1 exception.
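The dispatch logic of the handler can be mirrored in Python to decode a DR6 value captured by a debugger. This is only an illustrative sketch: the function name and the returned labels are assumptions. Unlike the handler above, which jumps to the first matching condition, it reports every condition that is set.

```python
def decode_dr6(dr6):
    """List the int 1 causes recorded in a DR6 value.

    Only the bits named in the text are examined: BS (14), BD (13)
    and B3-B0 (3-0). All other bits are ignored.
    """
    causes = []
    if dr6 & (1 << 14):
        causes.append("single step")
    if dr6 & (1 << 13):
        causes.append("debug register access")
    for n in range(4):
        if dr6 & (1 << n):
            causes.append("breakpoint %d" % n)
    if not causes:
        # None of the condition bits is set: icebp, as described in the text.
        causes.append("icebp")
    return causes


both = decode_dr6((1 << 14) | 1)  # single step and breakpoint 0 together
```

Because several conditions may be reported for the same instruction, returning a list is a deliberate choice here.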
Instruction execution breakpoint and general-detect condition cause the int 1 exception to occur BEFORE the instruction is executed.
All other breakpoints (Data Write Only, Data Read or Data Write, I/O Read or I/O Write) and single-stepping conditions cause the int 1 exception to occur AFTER the instruction is executed. More int 1 conditions may occur on the same instruction.
Repeated operations (with the rep prefix, like rep movsb) can be suspended by an exception or interrupt, so int 1 can occur between iterations.
Data-breakpoint conditions on the previous instruction occur before an instruction-breakpoint condition on the next instruction. However, if instruction and data breakpoints can occur as a result of executing a single instruction, the instruction breakpoint occurs first (before the instruction is executed), followed by the data breakpoint (after the instruction is executed).
Single-step breakpoints (which trigger the int 1 vector) are enabled by setting the rFLAGS.TF bit to 1. When single stepping is enabled, an int 1 exception occurs after every instruction is executed until it is disabled by setting rFLAGS.TF to 0.
The instruction that sets the TF bit is not single stepped; the instruction that follows hits int 1 after completing execution (because the single-step exception is a trap type of exception). The instruction that clears the TF bit hits int 1 (because TF was set before the instruction, and the single-step exception is a trap type of exception - it is triggered after execution of the instruction).
pushf
or dword [rsp], 1 shl 8
popf ; RFLAGS.TF=1 now
mov edx, eax ; now int 1 occurs for the first time (as the mov instruction execution completes),
; because single step is TRAP type of exception, not FAULT type
pushf ; now int 1 occurs again
and dword [rsp], not (1 shl 8) ; int 1 occurs for the third time
popf ; int 1 occurs for the fourth time (as the execution of the popf instruction completes),
; it is the last time because execution of the popf instruction
; clears the TF bit
; rFLAGS.TF=0 now
mov ebx, ecx ; this doesn't trigger int 1 anymore
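The counting in the sample above follows from one rule: an instruction traps after it completes if TF was set before it started. The Python sketch below replays the sample under that rule. The function name and the `PROGRAM` encoding (each instruction paired with its effect on TF) are assumptions made for illustration, and the model assumes the int 1 handler returns with iretq so that TF is restored.

```python
TF = 1 << 8  # Trap Flag in rflags


def single_step_traps(program, rflags=0):
    """Return the instructions after which int 1 occurs.

    program is a list of (text, effect) pairs, where effect is "set",
    "clear" or None and describes what the instruction does to TF.
    """
    trapped = []
    for text, effect in program:
        tf_before = bool(rflags & TF)
        if effect == "set":
            rflags |= TF
        elif effect == "clear":
            rflags &= ~TF
        if tf_before:  # trap type: reported after the instruction completes
            trapped.append(text)
    return trapped


PROGRAM = [
    ("pushf", None),
    ("or dword [rsp], 1 shl 8", None),
    ("popf", "set"),
    ("mov edx, eax", None),
    ("pushf", None),
    ("and dword [rsp], not (1 shl 8)", None),
    ("popf", "clear"),
    ("mov ebx, ecx", None),
]
traps = single_step_traps(PROGRAM)
```

The model reports four traps: after mov edx, eax, after the second pushf, after the and, and after the popf that clears TF.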
When an int 1 exception occurs due to single stepping, the processor sets rFLAGS.TF to 0 before entering the int 1 handler, so that the handler itself is not single stepped. The processor also sets DR6.BS (bit 14) to 1, which indicates that the int 1 exception occurred as a result of single stepping. The rFLAGS image pushed onto the debug-handler stack has the TF bit set, and single stepping resumes when a subsequent iretq pops the stack image into the rFLAGS register.
int01_handler:
push rax
mov rax, dr6
bt eax, 14 ; DR6.BS
jnc other_than_single_step
single_step_detected:
...
iretq
Single stepping can be a bit more complicated, we discuss it below.
A software breakpoint always triggers the int 3 vector. It is based on the int3 instruction with opcode 0CCh. This instruction is very useful because its 1-byte encoding fits over the first byte of any other instruction.
In fact, there is another way to encode this instruction, using opcode 0CDh, 03h. This encoding is not very useful because it doesn't fit into 1-byte instructions (xchg gpr32, eax; …).
A debugger puts 1-byte form of this instruction at the desired address in a code. If a program hits this instruction, the debugger stops its execution until resumed.
A programmer puts this instruction into the source code at the development stage (int3 is compiled in). This is a trick for getting easily and quickly into the desired part of the program using a debugger.
int3 compiled into the program (the source of the program has the int3 instruction; the developer must remove it in the end):
int3 written by the debugger over the original byte at address X:
restore the original byte at address X
decrease RIP of the program (RIP-1 = X)
end_of_int3_handler:
iretq
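The bookkeeping a debugger does around int3 can be sketched in Python on a plain bytearray standing in for the program's code. Everything here is hypothetical scaffolding (the class name, its methods, the base address, and the sample bytes); a real debugger would read and write the target's memory through the OS services described earlier.

```python
INT3 = 0xCC


class SoftwareBreakpoints:
    """Track int3 bytes written over a program's code (a toy model)."""

    def __init__(self, memory, base):
        self.memory = memory  # bytearray holding the code
        self.base = base      # address of memory[0]
        self.saved = {}       # address -> original byte

    def insert(self, address):
        offset = address - self.base
        self.saved[address] = self.memory[offset]
        self.memory[offset] = INT3

    def hit(self, rip):
        """Handle int 3 reported with `rip`; return the RIP to resume at."""
        x = rip - 1  # int3 is a trap: RIP points just after the 0CCh byte
        if x not in self.saved:
            return rip  # int3 compiled into the program: just continue
        self.memory[x - self.base] = self.saved.pop(x)  # restore the byte
        return x  # decrease RIP so the original instruction runs


code = bytearray(b"\x48\x89\xC2\x90")  # hypothetical code bytes
bps = SoftwareBreakpoints(code, 0x401000)
bps.insert(0x401000)
patched_byte = code[0]
resume_at = bps.hit(0x401001)
```

After the hit, the original byte is back in place and execution resumes at 401000h, the address where the breakpoint was set.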
We can watch the addresses of instructions causing control transfers and exceptions. The instructions are: …, RSM. We can also watch NMIs and SMIs.
We just need to enable 1 bit in 1 register. However, I suppose that neither Windows nor Linux has this bit enabled. The register's name is Debug-Control MSR:
|1||BTF||Branch Single Step|
|0||LBR||Last-Branch Record|
Kernel-mode code similar to this sets the LBR bit (bit 0) to 1:
DebugCtlMSR = 01D9h
mov ecx, DebugCtlMSR
rdmsr
or eax, 1 ; LBR (bit 0)
wrmsr
Setting the LBR bit tells the processor to record the source and target addresses of the last control transfer (branch instruction, interrupt, and exception).
The processor automatically disables control-transfer recording when int 1 occurs by clearing DebugCtlMSR.LBR to 0. The contents of the control-transfer recording MSRs are not altered by the processor when int 1 occurs. Before exiting the debug-exception handler, software can set DebugCtlMSR.LBR to 1 to re-enable the recording mechanism.
After enabling the LBR bit of DebugCtlMSR, the source and destination addresses of control-transfer events - branches (call, jmp), interrupts, exceptions - are saved by the processor. We have four registers: LastBranchFromIP (01DBh), LastBranchToIP (01DCh), LastExceptionFromIP (01DDh), and LastExceptionToIP (01DEh). These 64-bit registers are read-only, so there is no way to prevent them from being destroyed during context switching. Well, we can hack this weakness in a limited way (topic for a presentation at the next FASM conference).
This code is a sample of how to read LastBranchFromIP:
LastBranchFromIP = 01DBh
foo dq ?
...
mov ecx, LastBranchFromIP
rdmsr
mov dword [foo+4], edx
mov dword [foo], eax ; qword [foo] now holds the 64-bit address
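rdmsr returns the 64-bit MSR value split across edx (high half) and eax (low half), which is why the sample stores the two dwords separately. The same composition in Python, with hypothetical function names and a made-up sample address:

```python
def msr_value(edx, eax):
    """Combine the edx:eax pair returned by rdmsr into one 64-bit value."""
    return ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)


def msr_split(value):
    """Split a 64-bit value into the (edx, eax) pair expected by wrmsr."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF


address = msr_value(0x00000001, 0x00005120)
```

With edx=1 and eax=5120h the combined address is 100005120h.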
DebugCtlMSR.BTF changes the behavior of the rFLAGS.TF bit. When BTF is cleared to 0 (the normal, most common setting), the rFLAGS.TF bit controls instruction single stepping. When BTF is set to 1, the rFLAGS.TF bit controls single stepping on control transfers (branch instruction, interrupt, exception): a single step doesn't occur on every instruction, but only on control transfers ("bigger single steps").
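The effect of BTF on single stepping can be illustrated with a tiny Python model. It is deliberately simplified: it assumes rFLAGS.TF=1 throughout, considers only branch instructions as control transfers (no interrupts or exceptions), and the function name and the instruction list are assumptions made for illustration.

```python
def stepping_traps(instructions, btf):
    """Return the instructions on which int 1 occurs while rFLAGS.TF=1.

    instructions is a list of (text, is_control_transfer) pairs.
    With btf false every instruction is single stepped; with btf true
    only control transfers are.
    """
    return [text for text, is_branch in instructions if is_branch or not btf]


BLOCK = [
    ("mov eax, 1", False),
    ("add eax, ebx", False),
    ("jmp next", True),
    ("mov ecx, eax", False),
    ("call routine", True),
]
fine = stepping_traps(BLOCK, btf=False)
coarse = stepping_traps(BLOCK, btf=True)
```

In this model the fine mode stops five times and the coarse mode only twice, on jmp and call.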
Debuggers can use this capability to perform a "coarse" single step across blocks of code (bound by control transfers, with DebugCtlMSR.BTF=1 and rFLAGS.TF=1), and then, as the problem search is narrowed, switch into a "fine" single-step mode on every instruction (DebugCtlMSR.BTF=0 and rFLAGS.TF=1).
We have two types of breakpoints:
Software breakpoint. This instruction breakpoint is done using the int3 instruction (opcode 0CCh). A debugger uses this byte to overwrite the original instruction. The breakpoint must lie at the beginning of the instruction (not inside it!). The disadvantage is that this breakpoint modifies the program's memory, so the CRC of code with such a breakpoint will not match the original one.
Hardware breakpoint. This kind of breakpoint uses debug registers, so it doesn't modify the program's memory. The advantage is that we can also watch memory and I/O port access. On the other hand, we can use only four breakpoints for every thread.
man ptrace (Linux help)
My own mistakes and the many years spent debugging because of them :-)
The author doesn't wish to publish his e-mail here.
|2007-09-29||1.0||First public version||Feryno|
(date format corresponds to ISO 8601)