Debugging in AMD64 64-bit Mode in Theory

Feryno, 2007-09-29

Revision: 1.0

A debugger is a kind of black box for a regular user. The interactions between the debugger and the OS are kept under cover. Let's uncover them and see how it all works.

This article was composed from Debugging in Long Mode (AMD64) slides, first presented at FASM Technical Conference II, Brno, Czech Republic, 25 August 2007.

While debugging, we are playing with an executable program. We can stop it, change its memory or registers when it is stopped, step through it, and resume its execution.

The CPU executes code very quickly. During debugging, we can execute code at a speed that a human can follow by sight.

To play this game, we need another program - a debugger.

Why do programmers need debugging?

To find bugs (critical errors causing a program crash)
To find mistakes in procedures giving wrong, unexpected results
To improve procedures by stepping through their instructions and watching register or memory changes
To analyze an unknown executable discovered in the system
To learn what instructions do - suitable especially for beginners. I do it very often too - instead of reading manuals

Debugging is possible thanks to a CPU feature called exceptions.

What are exceptions?

The first 32 interrupt vectors (00h-1Fh) are reserved for exceptions. Exceptions behave very similarly to interrupts - every exception interrupts the program execution, and control is transferred from the currently executing program to the routine handling the exception. These routines are part of the OS kernel and are called "exception handlers". During the control transfer to the exception handler, the CPU stops execution of the program and saves its return instruction pointer (RIP), stack pointer (RSP), and flags register (RFLAGS). The handler is responsible for saving the remaining state of the interrupted program (GPR, XMM, …). Saving the registers allows the CPU to restart the interrupted program after the handler finishes handling the exception.

Most of the time, an exception means an occurrence of a "degenerated" instruction or code in the program - in this case, the exception boundary is reported before the instruction causing the exception, and the interrupted instruction isn't allowed to complete. These exceptions are called faults.

To make life more complicated, the reported instruction pointer sometimes lies at the address of the following instruction, so the boundary is reported after the instruction causing the exception, and that instruction is allowed to complete. These exceptions are called traps. The benefit of traps for our life is that they are the core of debugging.

List of Exceptions

Divide Error

Triggers int 0 vector.

Sample 1: divisor is zero

Note: all code samples are written in the syntax of my favorite assembler, fasm.

 mov rcx, 0
 div rcx

Sample 2: result (quotient) is too large for the designated register

 mov rdx, 3
 mov rax, 0
 mov rcx, 2
 div rcx
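
The two divide-error conditions can be checked with plain arithmetic. The following is a minimal Python sketch, not code from the original article: `div64` is a hypothetical helper that models unsigned `div` with a 64-bit divisor, where the dividend is the 128-bit value RDX:RAX.

```python
MASK64 = (1 << 64) - 1

def div64(rdx, rax, divisor):
    """Model unsigned `div r64`: divide the 128-bit value RDX:RAX by divisor.

    Returns (quotient, remainder). Raises ZeroDivisionError or OverflowError
    in the two cases where the CPU would raise int 0 instead.
    """
    if divisor == 0:
        raise ZeroDivisionError("divide error: divisor is zero")
    dividend = (rdx << 64) | rax
    quotient, remainder = divmod(dividend, divisor)
    if quotient > MASK64:
        raise OverflowError("divide error: quotient does not fit in RAX")
    return quotient, remainder
```

With the values from Sample 2 (`rdx=3`, `rax=0`, `rcx=2`), the dividend is 3 * 2^64 and the quotient would be 1.5 * 2^64, which does not fit into 64 bits, so `div64(3, 0, 2)` raises `OverflowError`.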

Single Step

Triggers int 1 vector.

Sample 1:

 icebp ; opcode 0F1h

Sample 2: fundamental method of single stepping (in fact, OS sets this bit in program context and reloads registers when task switching)

 pushfq
 or qword [rsp], 1 shl 8 ; set Trap Flag
 popfq

Sample 3: fundamental method of hardware breakpoints

 lea rax, [trap_instruction]
 mov dr0, rax
 mov eax, 1
 mov dr7, rax
trap_instruction:

Sample 4:

 lea rax, [mem_write_addr]
 mov dr0, rax
 mov eax, 10001h
 mov dr7, rax
 ...
 mem_write_addr db ?

Breakpoint

Triggers int 3 vector.

Sample:

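 int3            ; 1-byte form
 db 0CCh         ; the same instruction as a raw byte
 int 03h         ; 2-byte form
 db 0CDh, 03h    ; the same instruction as raw bytes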

Invalid Opcode

Triggers int 6 vector.

Sample 1: documented invalid (undefined) instruction

ud2

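Sample 2: the source operand is a register; the correct form is lea rax, [rdx]. An assembler normally rejects the mnemonic form, so the bytes are emitted directly.

 db 48h, 8Dh, 0C2h   ; lea rax, rdx (REX.W prefix 48h, opcode 8Dh, ModRM 0C2h)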

Note that a lot of instructions are now illegal in AMD64 64-bit mode…

Double Fault

Triggers int 8 vector.

Stack-Segment Fault

Triggers int 0Ch vector.

General Protection

Triggers int 0Dh vector.

Page Fault

Triggers int 0Eh vector.

Alignment Check

Triggers int 11h vector.

This exception can occur only if the AM bit of the CR0 register is set. This is done by kernel code similar to this:

 mov rax, cr0
 or rax, 1 shl 18
 mov cr0, rax ; AM bit of CR0

Sample: user-mode code (CPL=3); I assume qword or dqword stack alignment

 pushfq
 or qword [rsp], 1 shl 18
 popfq   ; set AC bit of rflags

 mov eax, [rsp+1]   ; exception raised here

Note that this exception never occurs if CPL<3.
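
The exceptions listed above can be collected into a small lookup table. This is a hedged Python sketch for orientation only. The fault/trap classification follows the AMD64 manual; the "abort" class used for the double fault is not discussed in this article, and `reported_rip` is a hypothetical helper showing which instruction pointer the handler sees.

```python
# vector -> (name, kind) for the exceptions listed in this article
EXCEPTIONS = {
    0x00: ("Divide Error", "fault"),
    0x01: ("Single Step", "fault or trap"),  # depends on the int 1 condition
    0x03: ("Breakpoint", "trap"),
    0x06: ("Invalid Opcode", "fault"),
    0x08: ("Double Fault", "abort"),         # class not covered in the text
    0x0C: ("Stack-Segment Fault", "fault"),
    0x0D: ("General Protection", "fault"),
    0x0E: ("Page Fault", "fault"),
    0x11: ("Alignment Check", "fault"),
}

def reported_rip(kind, address, length):
    """Return the RIP saved for an instruction at `address` of `length` bytes.

    A fault reports the address of the instruction itself,
    a trap reports the address of the following instruction.
    """
    if kind == "fault":
        return address
    if kind == "trap":
        return address + length
    raise ValueError("kind must be 'fault' or 'trap'")
```

For example, a 1-byte int3 at address X is a trap, so the handler sees X+1; this is why a debugger later subtracts 1 from the reported address.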

Interactions, Exception Delivery

How do a program and a debugger interact with the OS?

A program causes an exception.
The CPU stops the execution of the program, saves the instruction pointer, stack pointer, and flags of the program, and gives control to the corresponding exception handler (i.e., interrupt vector).
The OS handles the interrupt vector and notifies the debugger about the exception. For Linux64:
```
 mov eax, sys_wait4
 syscall
```
For Win64:
```
 call qword [KERNEL32.WaitForDebugEvent]
```

The user is allowed to change registers or memory of the program via the debugger. For Linux64:

 mov edi, PTRACE_GETREGS
 mov eax, sys_ptrace
 syscall

Useful values are:

 PTRACE_GETREGS, PTRACE_SETREGS,
 PTRACE_PEEKTEXT, PTRACE_POKETEXT,
 PTRACE_PEEKDATA, PTRACE_POKEDATA

For Win64:

 call qword [KERNEL32.GetThreadContext]

Useful API functions are:

 GetThreadContext, SetThreadContext, ReadProcessMemory, WriteProcessMemory

The user can resume execution of the program via the debugger. For Linux64:

 mov edi, PTRACE_CONT ; continue
 mov eax, sys_ptrace
 syscall

 mov edi, PTRACE_SINGLESTEP ; single step
 mov eax, sys_ptrace
 syscall

For Win64:

 call qword [KERNEL32.ContinueDebugEvent]

If the program doesn't cause any exception, it runs to its end and terminates. In this case, the debugger doesn't encounter any exception and is only notified about program termination at the end. This is the dream of every assembly coder and the desirable final stage of developing any program. Well, not exactly: some procedures may still behave incorrectly and give unexpected return values...

Hardware Breakpoints

A hardware breakpoint always triggers the int 1 vector. This breakpoint is created by setting some debug registers. There are only 6 useful debug registers: DR0, DR1, DR2, DR3, DR6, and DR7. The others are unused (accessing them causes an invalid opcode exception). Isn't it a pity? But on the other hand, it could be even more complicated.

The debug registers can be read and written only when the current privilege level (CPL) is 0 (most privileged), i.e. in the kernel:

 mov rax, dr7

 mov dr3, rcx

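A user-mode debugger running at CPL=3 can access the debug registers of a program while the program is stopped after causing an exception. For Linux64, the debug registers are not part of the PTRACE_GETREGS register set; they are read and written one at a time through the u_debugreg array of struct user:

 mov edi, PTRACE_PEEKUSER   ; read one debug register
 mov eax, sys_ptrace
 syscall

 mov edi, PTRACE_POKEUSER   ; write one debug register
 mov eax, sys_ptrace
 syscall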

For Win64:

 call qword [KERNEL32.GetThreadContext]

 call qword [KERNEL32.SetThreadContext]

Debug registers DR0, DR1, DR2, and DR3 each hold a 64-bit virtual (linear) address:

 lea rax, [address]
 mov dr0, rax

If we need to set a debug register DR0-DR3, we must also set its conditions in the DR7 register - enable bit, type, and length.

DR7
bit(s)	mnemonic	description
31-30	LEN3	Length of Breakpoint #3
29-28	R/W3	Type of Transaction to Trap for Breakpoint #3
27-26	LEN2	Length of Breakpoint #2
25-24	R/W2	Type of Transaction to Trap for Breakpoint #2
23-22	LEN1	Length of Breakpoint #1
21-20	R/W1	Type of Transaction to Trap for Breakpoint #1
19-18	LEN0	Length of Breakpoint #0
17-16	R/W0	Type of Transaction to Trap for Breakpoint #0
6	L3	Local Exact Breakpoint #3 Enabled
4	L2	Local Exact Breakpoint #2 Enabled
2	L1	Local Exact Breakpoint #1 Enabled
0	L0	Local Exact Breakpoint #0 Enabled

LEN0-LEN3
00b	1 byte
01b	2 bytes, address in corresponding `DR` must be word aligned
10b	8 bytes, address in `DR` must be qword aligned
11b	4 bytes, address must be dword aligned

R/W0-R/W3
00b	int 1 breakpoint on instruction execution, LEN must be 1 byte (00b)
01b	int 1 occurs only on data write
10b	int 1 only on I/O read/write if `CR4.DE`=1 (bit 3 of `CR4`) - `in`, `out`, `insb`, `outsb`
10b	if `CR4.DE`=0 this setting is undefined
11b	int 1 occurs only on data read or data write

Setting Hardware Breakpoints

If we want to set one of DR0-DR3 registers, we use this scheme:

 lea rax, [address]
 mov DRx, rax ; x = 0, 1, 2, 3
 mov eax, ((length*4 + type) shl (x*4 + 16)) + (1 shl (x*2))
 mov dr7, rax
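
The scheme above can be checked with a few lines of Python. This is a minimal sketch, assuming the LEN and R/W encodings from the tables above; the function names (`dr7_bits`, `dr7_mask`, `update_dr7`) and the access-type strings are hypothetical and exist only for illustration.

```python
# Encodings taken from the LEN0-LEN3 and R/W0-R/W3 tables
LEN_CODE = {1: 0b00, 2: 0b01, 8: 0b10, 4: 0b11}
RW_CODE = {"execute": 0b00, "write": 0b01, "io": 0b10, "readwrite": 0b11}

def dr7_bits(x, length, access):
    """DR7 bits enabling local breakpoint x (0-3) with the given length/type."""
    if x not in range(4):
        raise ValueError("breakpoint number must be 0-3")
    if access == "execute" and length != 1:
        raise ValueError("execution breakpoints require length 1")
    field = LEN_CODE[length] * 4 + RW_CODE[access]
    return (field << (x * 4 + 16)) + (1 << (x * 2))

def dr7_mask(x):
    """All DR7 bits that belong to breakpoint x (LEN, R/W, L and G bits)."""
    return (0b1111 << (x * 4 + 16)) + (0b11 << (x * 2))

def update_dr7(old, x, length, access):
    """Read-modify-write: replace only the bits of breakpoint x."""
    return (old & ~dr7_mask(x)) | dr7_bits(x, length, access)

print(hex(dr7_bits(0, 1, "execute")))   # L0 only
print(hex(dr7_bits(0, 1, "write")))     # 1-byte write watch on DR0
```

The two printed values correspond to the constants `1` and `10001h` loaded into DR7 in the Single Step samples earlier in the article.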

Example 1: Memory Reading or Writing Breakpoint

We want to watch reading from or writing into 1 qword at address 100005120h (address range 100005120h-100005127h). The address does not fit into a 32-bit displacement, so it is loaded as a 64-bit immediate.

 mov rax, 100005120h
 mov dr0, rax
 mov rax, dr7
 and eax, not ((1111b shl 16) + 11b)	; mask off all
 or eax, (1011b shl 16) + 1		; prepare to set what we want
 mov dr7, rax				; set it finally

Done, now we can wait until code falls into the trap! After accessing any byte at memory range 100005120h-100005127h, int 1 will occur and DR6.B0 bit will be set to 1.

Example 2: Memory Writing Breakpoint at Unaligned Address

We want to watch writing into 8 bytes at address range 40AF31h-40AF38h. Setting the length to 8 bytes won't work, because the address isn't aligned on a qword boundary. We must set more breakpoints to cover the whole address range:

breakpoint 0. to watch 1 byte at 40AF31h
breakpoint 1. to watch 1 word at 40AF32h-40AF33h
breakpoint 2. to watch 1 dword at 40AF34h-40AF37h
breakpoint 3. to watch 1 byte at 40AF38h

 mov rax, dr7
 and eax, 0000FF00h   ; mask off all
 lea rdx, [40AF31h]
 mov dr0, rdx
 or eax, (0001b shl 16) + 1
 lea rdx, [40AF32h]
 mov dr1, rdx
 or eax, (0101b shl 20) + 100b
 lea rdx, [40AF34h]
 mov dr2, rdx
 or eax, (1101b shl 24) + 10000b
 lea rdx, [40AF38h]
 mov dr3, rdx
 or eax, (0001b shl 28) + 1000000b
 mov dr7, rax
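
The split into four aligned pieces used above can be computed mechanically. The following Python sketch is one possible greedy approach and not part of the original article; `split_watch_range` is a hypothetical name. At each address it picks the largest length (8, 4, 2, or 1 bytes) that is both aligned at that address and not longer than what remains.

```python
def split_watch_range(start, size):
    """Split [start, start+size) into aligned pieces of 8, 4, 2 or 1 bytes.

    Returns a list of (address, length) pairs, one per debug register needed.
    """
    pieces = []
    addr, remaining = start, size
    while remaining > 0:
        for length in (8, 4, 2, 1):
            # length 1 always matches, so the loop always ends with a break
            if addr % length == 0 and length <= remaining:
                break
        pieces.append((addr, length))
        addr += length
        remaining -= length
    return pieces

for addr, length in split_watch_range(0x40AF31, 8):
    print(f"{addr:X}h: {length} byte(s)")
```

The caller still has to check that the result has at most four entries, because only DR0-DR3 are available; a longer unaligned range may not be coverable at all.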

Example 3: Instruction Execution Breakpoint

We want to break on the execution of an instruction at 401235h.

Note that the instruction must start exactly at this address. If the set address lies somewhere inside the instruction (in case the instruction has 2 or more bytes) then int 1 won't occur!

 lea rax, [401235h]
 mov dr0, rax
 mov rax, dr7
 and eax, not ((1111b shl 16) + 11b)   ; mask off all
 or eax, (0000b shl 16) + 1
 mov dr7, rax

Example 4: Port Reading or Writing Breakpoint

We want to watch reading from or writing into ports 20h-27h. This is possible only if the CR4.DE bit (bit 3 - Debugging Extensions) is set by kernel code similar to this:

 mov rax, cr4
 or rax, 1 shl 3
 mov cr4, rax ; CR4.DE (bit 3)

This breakpoint is very useful in kernel mode (in, out, insb, and outsb instructions).

 mov eax, 20h   ; port number
 mov dr3, rax
 mov rax, dr7
 and eax, not ((1111b shl 28) + 11000000b)   ; mask off all
 or eax, (1010b shl 28) + 01000000b          ; LEN3=10b (8 bytes), R/W3=10b (I/O)
 mov dr7, rax

The condition which caused int 1 exception is recorded in the DR6 debug-status register:

DR6
bit	name	event
14	BS	Single Step (rFLAGS.TF has been set)
13	BD	Breakpoint Debug Access Detected (DR7.GD has been set)
3	B3	Breakpoint #3 Condition Detected
2	B2	Breakpoint #2 Condition Detected
1	B1	Breakpoint #1 Condition Detected
0	B0	Breakpoint #0 Condition Detected

DR7
bit(s)	mnemonic	description
13	GD	General Detect Enabled

When this bit is set, the debug exception (int 1) occurs when an attempt is made to execute a MOV DRn instruction to any debug register (DR0-DR3, DR6, DR7). This bit is cleared to 0 by the processor when the int 1 handler is entered, allowing the int 1 handler to read and write the DR registers. The int 1 exception occurs before executing the instruction, and DR6.BD is set by the processor. Software debuggers can use this bit to prevent the currently-executing program from interfering with the debug operation.

A Sample of int 1 Handler

At the entry of the handler, the CPU clears DR7.GD (bit 13), so mov rax, dr6 doesn't cause int 1 again.

int01_handler:
 push rax
 mov rax, dr6
 bt eax, 14
 jc single_step_detected
 bt eax, 13
 jc debug_access_detected
 test eax, 1 shl 3
 jnz bp3_detected
 test eax, 1 shl 2
 jnz bp2_detected
 test eax, 1 shl 1
 jnz bp1_detected
 test eax, 1
 jnz bp0_detected

If none of these bits is set, the exception was caused by the icebp instruction (opcode 0F1h).

icebp_detected:
 ...
 pop rax
 iretq

Note that there are no other sources of int 1 exception.
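
The same decision tree can be expressed as a pure function over the DR6 value, which is convenient for a user-mode debugger that receives DR6 from the OS. This is a minimal Python sketch with a hypothetical name (`decode_dr6`). Unlike the handler above, which jumps to the first matching condition, it collects every condition that is set.

```python
def decode_dr6(dr6):
    """Return the list of int 1 causes recorded in a DR6 value."""
    causes = []
    if dr6 & (1 << 14):          # BS
        causes.append("single step")
    if dr6 & (1 << 13):          # BD
        causes.append("debug register access")
    for n in range(4):           # B0-B3
        if dr6 & (1 << n):
            causes.append(f"breakpoint {n}")
    # none of the bits set: the only remaining source is icebp
    return causes or ["icebp"]
```

Only the six status bits are tested, so the values of the remaining DR6 bits do not affect the result.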

Instruction execution breakpoint and general-detect condition cause the int 1 exception to occur BEFORE the instruction is executed.

All other breakpoints (Data Write Only, Data Read or Data Write, I/O Read or I/O Write) and single-stepping conditions cause the int 1 exception to occur AFTER the instruction is executed. More int 1 conditions may occur on the same instruction.

Repeated operations (with a rep prefix, like rep movsb) can be suspended by an exception or interrupt, so int 1 can occur between iterations.

Data-breakpoint conditions on the previous instruction occur before an instruction-breakpoint condition on the next instruction. However, if instruction and data breakpoints can occur as a result of executing a single instruction, the instruction breakpoint occurs first (before the instruction is executed), followed by the data breakpoint (after the instruction is executed).

How Single Stepping Behaves

Single-step breakpoints (which trigger the int 1 vector) are enabled by setting the rFLAGS.TF bit to 1. While single stepping is enabled, an int 1 exception occurs after every executed instruction until it is disabled by clearing rFLAGS.TF to 0. The instruction that sets the TF bit is not single stepped; the instruction that follows it hits int 1 after completing execution (because single step is a trap type of exception). The instruction that clears the TF bit still hits int 1 (because TF was set before the instruction started, and the single-step exception is triggered after the instruction completes).

 pushf
 or dword [rsp], 1 shl 8
 popf

 ; RFLAGS.TF=1 now

 mov edx, eax

 ; int 1 occurs here for the first time (after the mov instruction completes),
 ; because single step is a TRAP type of exception, not a FAULT type

 pushf

 ; int 1 occurs again

 and dword [rsp], not (1 shl 8)

 ; int 1 occurs for the third time

 popf

 ; int 1 occurs for the fourth and last time: TF was still set when popf
 ; started executing, and popf itself clears the TF bit
 ; rFLAGS.TF=0 now

 mov ebx, ecx   ; this doesn't trigger int 1 anymore
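
The rule "int 1 fires after an instruction if TF was set before that instruction started" can be replayed on the sequence above. This is a simplified Python model, not an emulator: each instruction is just a text label plus a hypothetical `effect` field saying whether it sets or clears TF.

```python
def single_step_traps(program, tf=False):
    """Return the instructions after which int 1 (single step) occurs.

    program: list of (text, effect), effect is "set", "clear" or None.
    """
    trapped = []
    for text, effect in program:
        tf_before = tf               # TF as seen when the instruction starts
        if effect == "set":
            tf = True
        elif effect == "clear":
            tf = False
        if tf_before:                # trap is delivered after completion
            trapped.append(text)
    return trapped

PROGRAM = [
    ("pushf", None),
    ("or dword [rsp], 1 shl 8", None),
    ("popf", "set"),
    ("mov edx, eax", None),
    ("pushf", None),
    ("and dword [rsp], not (1 shl 8)", None),
    ("popf", "clear"),
    ("mov ebx, ecx", None),
]

for text in single_step_traps(PROGRAM):
    print("int 1 after:", text)
```

The model reports four traps: after `mov edx, eax`, the second `pushf`, the `and`, and the second `popf`, matching the comments in the listing.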

A Skeleton of int 1 Handler for Single Stepping

When an int 1 exception occurs due to single stepping, the processor sets rFLAGS.TF to 0 before entering the int 1 handler, so that the handler itself is not single stepped. The processor also sets DR6.BS (bit 14) to 1, which indicates that the int 1 exception occurred as a result of single stepping.

The rFLAGS image pushed onto the debug-handler stack has the TF bit set, and single stepping resumes when a subsequent iretq pops the stack image into the rFLAGS register.

int01_handler:
 push rax
 mov rax, dr6
 bt eax, 14   ; DR6.BS
 jnc other_than_single_step

single_step_detected:
 ...
 pop rax      ; restore the register saved at the entry
 iretq

Single stepping can be a bit more complicated; we discuss it below.

Software Breakpoints

A software breakpoint always triggers the int 3 vector. It is based on the int3 instruction with opcode 0CCh. This instruction is very useful because its single byte can overwrite the first byte of any other instruction.

In fact, there is another way to encode this instruction, using the bytes 0CDh, 03h. This encoding is not very useful because it doesn't fit into 1-byte instructions (cld; push/pop gpr64; xchg gpr32, eax; stosb; …).

A debugger puts the 1-byte form of this instruction at the desired address in the code. If the program hits this instruction, the debugger stops its execution until it is resumed.

A programmer can also put this instruction into the source code during development (int3 is compiled in). This is a trick for getting quickly and easily to the desired part of the program with a debugger.

Handling Software Breakpoints

1. The debugger reads the original byte and saves it, together with its address, in an internal buffer.
2. The debugger replaces the original byte with the byte 0CCh.
3. The debugger waits until int 3 occurs.
4. The int 3 handler gets the address just after the executed byte 0CCh (int 3 is a trap type of exception).
5. The debugger calculates an internal value X by subtracting 1 from the address returned in step 4 (X = RIP-1).
6. The debugger checks whether any address stored in its internal buffer matches X.
7. If no such address is found, it is an int3 instruction compiled into the program (the source of the program has an int3 instruction, which the developer must remove in the end):
```
 jmp end_of_int3_handler
```
8. If such an address is found, it was a breakpoint caused by the byte 0CCh inserted into the program by the debugger:
```
 restore the original byte at address X
 decrease RIP of the program (RIP-1 = X)
end_of_int3_handler:
 iretq
```
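
The bookkeeping in the steps above can be sketched in Python with a bytearray standing in for the code of the debugged program. The class and method names (`SoftBreakpoints`, `insert`, `on_int3`) are hypothetical; a real debugger would read and write the memory of the program through the OS calls mentioned earlier.

```python
INT3 = 0xCC

class SoftBreakpoints:
    def __init__(self, memory):
        self.memory = memory   # bytearray standing in for the program's code
        self.saved = {}        # internal buffer: address -> original byte

    def insert(self, address):
        """Save the original byte and overwrite it with 0CCh."""
        if address in self.saved:
            return             # already set; never save 0CCh as "original"
        self.saved[address] = self.memory[address]
        self.memory[address] = INT3

    def on_int3(self, rip):
        """Handle int 3; return the RIP the program should resume at."""
        x = rip - 1            # int 3 is a trap: RIP points past the 0CCh
        if x not in self.saved:
            return rip         # int3 compiled into the program itself
        self.memory[x] = self.saved.pop(x)   # restore the original byte
        return x               # re-execute the original instruction
```

A short usage example: with `code = bytearray([0x90, 0x48, 0x89, 0xC2, 0xCC])`, calling `insert(1)` turns `code[1]` into 0CCh, `on_int3(2)` restores 48h and returns 1, while `on_int3(5)` returns 5 unchanged because the 0CCh at offset 4 was not placed by the debugger.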

Other Features

We can watch addresses of instructions causing control transfers and exceptions. The instructions are: JMP, CALL, RET, Jcc, JrCXZ, LOOPcc, JMPF, CALLF, RETF, INT n, INT 3, ICEBP, IRETQ, SYSCALL, SYSRET, RSM. We can watch also NMIs and SMIs.

We just need to enable 1 bit in 1 register. However, I suppose that neither Windows nor Linux has this bit enabled. The register is called the Debug-Control MSR:

DebugCtlMSR
bit	mnemonic	description
1	BTF	Branch Single Step
0	LBR	Last-Branch Record

Kernel-mode code similar to this sets the LBR bit to 1:

DebugCtlMSR = 01D9h
 mov ecx, DebugCtlMSR
 rdmsr
 or eax, 1
 wrmsr

Setting LBR bit orders the processor to record the source and target addresses of the last control transfer (branch instruction, interrupt, and exception).

The processor automatically disables control-transfer recording when int 1 occurs by clearing DebugCtlMSR.LBR to 0. The contents of the control-transfer recording MSRs are not altered by the processor when int 1 occurs. Before exiting the debug-exception handler, software can set DebugCtlMSR.LBR to 1 to re-enable the recording mechanism.

After the LBR bit of DebugCtlMSR is enabled, the processor saves the source and destination addresses of control-transfer events - branches (call, jmp), interrupts, and exceptions. We have four registers: LastBranchFromIP (01DBh), LastBranchToIP (01DCh), LastExceptionFromIP (01DDh), and LastExceptionToIP (01DEh). These 64-bit registers are read-only, so their contents cannot be saved and restored across context switches. Well, we can hack around this weakness in a limited way (a topic for a presentation at the next FASM Technical Conference).

This code is a sample of how to read the LastBranchFromIP register:

LastBranchFromIP = 01DBh
 foo dq ?
 ...
 mov ecx, LastBranchFromIP
 rdmsr
 mov dword [foo+4], edx
 mov dword [foo], eax   ; qword [foo] now holds the 64-bit address
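
rdmsr returns the 64-bit MSR value split across EDX (high half) and EAX (low half). The following Python sketch shows, with hypothetical helper names, that the two dword stores in the sample produce the same qword as combining the halves arithmetically.

```python
import struct

def msr_value(edx, eax):
    """Combine the EDX:EAX pair returned by rdmsr into one 64-bit value."""
    return ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)

def store_like_sample(edx, eax):
    """Mimic the two dword stores into `foo`, then read foo as a qword."""
    foo = bytearray(8)
    struct.pack_into("<I", foo, 4, edx & 0xFFFFFFFF)   # mov dword [foo+4], edx
    struct.pack_into("<I", foo, 0, eax & 0xFFFFFFFF)   # mov dword [foo], eax
    return struct.unpack_from("<Q", foo, 0)[0]
```

Both helpers give the same result because x86 is little-endian: the low dword lives at the lower address.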

DebugCtlMSR.BTF changes the behavior of the rFLAGS.TF bit. When BTF is cleared to 0 (the normal, most common setting), rFLAGS.TF controls single stepping of every instruction. When BTF is set to 1, rFLAGS.TF controls single stepping on control transfers (branch instruction, interrupt, exception) - a single step doesn't occur on every instruction, but only on control transfers ("bigger single steps").

Debuggers can use this capability to perform a "coarse" single step across blocks of code (bound by control transfers) (DebugCtlMSR.BTF=1, rFLAGS.TF=1), and then, as the problem search is narrowed, switch into a "fine" single-step mode on every instruction (DebugCtlMSR.BTF=0, rFLAGS.TF=1).
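
The difference between the two modes can be illustrated on a recorded instruction trace. This is a deliberately simplified Python model, not a description of the exact hardware behavior: the trace format and the name `debug_traps` are assumptions, and every instruction flagged as a control transfer is assumed to be a taken branch.

```python
def debug_traps(trace, tf, btf):
    """Return the instructions after which int 1 occurs.

    trace: list of (text, is_control_transfer) in execution order.
    tf:    rFLAGS.TF, btf: DebugCtlMSR.BTF.
    """
    if not tf:
        return []                                  # single stepping disabled
    if btf:
        return [text for text, branch in trace if branch]   # "coarse" steps
    return [text for text, _ in trace]             # "fine" steps

TRACE = [
    ("mov eax, 1", False),
    ("cmp eax, ebx", False),
    ("jne target", True),
    ("add eax, 2", False),
    ("call helper", True),
]

print(debug_traps(TRACE, tf=True, btf=True))
print(debug_traps(TRACE, tf=True, btf=False))
```

With BTF=1 the debugger is stopped only at the two control transfers, while with BTF=0 it is stopped after each of the five instructions.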

Summary

We have two types of breakpoints:

Software Breakpoint

This instruction breakpoint is made with the int3 instruction (opcode 0CCh). A debugger uses this byte to overwrite the first byte of the original instruction. An instruction breakpoint must lie at the beginning of the instruction (not inside it!). The disadvantage is that this breakpoint modifies the program's memory, so the CRC of code containing such a breakpoint will not match the original one.

Hardware Breakpoint

This kind of breakpoint uses debug registers, so it doesn't modify the program's memory. The advantage is that we can also watch memory and I/O port accesses. On the other hand, we can use only four breakpoints per thread.

Resources

AMD64 Architecture Programmer's Manual Volume 2: System Programming

The Linux Kernel Archives

man ptrace (Linux help)

Microsoft Developer Network

My own mistakes and the many years spent debugging because of them :-)


Revisions

2007-09-29	1.0	First public version	Feryno

(the date format corresponds to ISO 8601)