Hafiz Shamnad

Posted on Feb 27

Day 13 — I Stopped Trusting File Names and Started Inspecting Files (SafeOpen v2)

#python #cybersecurity #security #linux

Yesterday my tool only looked at the filename.

Today I realised the filename is the lie attackers want you to believe.

The Moment Everything Changed

I took a harmless test script named malicious.bat and renamed it to invoice.pdf.

My old checker (Day 12) said: “Looks safe ✅”

Windows Explorer showed: invoice.pdf (icon = PDF)

A normal user would double-click without a second thought.

But the file was still a batch script.

That’s when it hit me:

The operating system doesn’t execute the name.

It executes the content.

Files Have Two Identities

What the user sees → filename + icon (easy to fake)
What the OS executes → magic bytes (first 2–8 bytes of the file)

Real examples:

PDF → starts with %PDF
Windows EXE → starts with MZ
ELF binary (Linux) → starts with \x7fELF (byte 0x7F, then the letters ELF)
ZIP (DOCX, XLSX, JAR…) → starts with PK\x03\x04

If the header says “executable” but the name says “document”, that’s a disguise. Game over for filename-only checkers.
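
To see the second identity for yourself, here is a minimal sketch that reads the first bytes of a file and prints them. The helper name read_header and the temporary invoice.pdf are made up for this demo; the fake file just carries the MZ bytes of a Windows executable.

```python
import os
import tempfile

def read_header(path, size=8):
    """Return the first `size` bytes of the file at `path`."""
    with open(path, "rb") as f:
        return f.read(size)

# Demo: a file *named* invoice.pdf whose content starts like a Windows EXE.
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "invoice.pdf")
    with open(fake, "wb") as f:
        f.write(b"MZ\x90\x00" + b"\x00" * 60)
    header = read_header(fake)

print(header[:4])           # b'MZ\x90\x00'
print(header[:4].hex(" "))  # 4d 5a 90 00
```

The name says PDF, the bytes say MZ. That gap is the whole idea behind v2.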

So I rebuilt everything.

SafeOpen v2 — “Inspect Before You Execute”

Here are the core pieces of the tool, with an explanation of every new capability.

#!/usr/bin/env python3
"""
SafeOpen v2 — File Security Analyzer
Inspect before you execute.
"""

import sys, hashlib, mimetypes, os, math, struct, time, argparse, json, re
from datetime import datetime

# === 1. Magic Byte Detection (Upgrade #1) ===
MAGIC_SIGNATURES = {
    b"MZ":                    "Windows PE Executable",
    b"\x7fELF":               "Linux ELF Executable",
    b"\xca\xfe\xba\xbe":      "Java Class File",
    b"PK\x03\x04":            "ZIP Archive (DOCX/XLSX/JAR/etc)",
    b"%PDF":                  "PDF Document",
    b"\x89PNG":               "PNG Image",
    b"\xff\xd8\xff":          "JPEG Image",
    # ... (full dict in the complete code below)
}

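def detect_magic(data):
    # Magic bytes live at offset 0, so match the start of the file only.
    # A case-insensitive substring search would flag any file that merely
    # contains the bytes "mz" somewhere in its first 512 bytes.
    for magic, desc in MAGIC_SIGNATURES.items():
        if data.startswith(magic):
            return desc
    return None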

What this does: Reads the first 2048 bytes and matches against known headers.

Rename malware.exe → report.pdf → tool now screams “CRITICAL — Executable disguised as document”.
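
How the "disguised" verdict could be reached is roughly this: compare what the extension promises with what the header delivers. This is a sketch, not the tool's actual code. EXPECTED_MAGIC, EXECUTABLE_MAGIC, check_disguise and the four verdict strings are names I'm assuming for illustration.

```python
import os

# Assumed mapping: extension -> magic bytes we expect at offset 0.
EXPECTED_MAGIC = {
    ".pdf": b"%PDF",
    ".png": b"\x89PNG",
    ".jpg": b"\xff\xd8\xff",
    ".zip": b"PK\x03\x04",
    ".docx": b"PK\x03\x04",
}
# Headers that mean "this is a native executable".
EXECUTABLE_MAGIC = (b"MZ", b"\x7fELF")

def check_disguise(filename, header):
    """Compare the extension's promise with the file's real header."""
    ext = os.path.splitext(filename)[1].lower()
    expected = EXPECTED_MAGIC.get(ext)
    if expected is None:
        return "UNKNOWN"    # no expectation recorded for this extension
    if header.startswith(expected):
        return "OK"         # name and content agree
    if header.startswith(EXECUTABLE_MAGIC):
        return "CRITICAL"   # executable wearing a document name
    return "MISMATCH"       # content is something else entirely

print(check_disguise("report.pdf", b"MZ\x90\x00"))   # CRITICAL
print(check_disguise("report.pdf", b"%PDF-1.7"))     # OK
print(check_disguise("invoice.pdf", b"@echo off"))   # MISMATCH
```

Note the last case: a batch script has no magic bytes of its own, so it can only be caught as "not a PDF", not as "an executable".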

# === 2. Entropy — The “Malware Smell” (Upgrade #2) ===
def entropy(data):
    if not data: return 0.0
    occur = [0] * 256
    for byte in data:
        occur[byte] += 1
    ent = 0.0
    length = len(data)
    for x in occur:
        if x == 0: continue
        p = x / length
        ent -= p * math.log2(p)
    return ent

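Why it matters: Plain text and uncompressed documents have structure → entropy ~4.0–6.5

Packed/encrypted malware → entropy >7.5 (looks like random noise). Compressed formats (ZIP, DOCX, JPEG, PNG) also score high, so treat entropy as a hint, not a verdict.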

No signatures needed. Pure mathematics.
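
A quick way to build intuition for the numbers: feed the same formula a few inputs where the answer is known in advance. The function below restates the entropy calculation so the snippet runs on its own; os.urandom is only a stand-in for packed or encrypted content.

```python
import math
import os

def entropy(data):
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = [0] * 256
    for byte in data:
        counts[byte] += 1
    ent = 0.0
    for c in counts:
        if c:
            p = c / len(data)
            ent -= p * math.log2(p)
    return ent

one_symbol = entropy(b"A" * 1000)        # one value only: nothing to guess
two_symbols = entropy(b"AB" * 500)       # coin flip: exactly 1 bit per byte
all_values = entropy(bytes(range(256)))  # every byte value once: the maximum
random_like = entropy(os.urandom(65536)) # stand-in for packed/encrypted data

print(one_symbol, two_symbols, all_values)  # 0.0 1.0 8.0
print(random_like > 7.5)                    # True
```

The scale tops out at 8.0 because a byte has 256 possible values and log2(256) = 8.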

# === 3. Suspicious Behaviour Indicators (Upgrade #3) ===
SUSPICIOUS_STRINGS = [
    (b"powershell", "PowerShell download cradle"),
    (b"Invoke-WebRequest", "Downloader"),
    (b"rm -rf", "Destructive delete"),
    (b"net user", "User creation"),
    # ... 20+ more patterns
]

def scan_strings(data):
    hits = []
    lower = data.lower()
    for pattern, desc in SUSPICIOUS_STRINGS:
        if pattern.lower() in lower:
            hits.append(desc)
    return hits

Scans first 512 KB for known malicious patterns. A document containing powershell -c Invoke-WebRequest is not a document.
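
Here is one way the 512 KB cap could be wired up. It's a sketch: scan_file and SCAN_LIMIT are names I'm assuming, and the pattern list is cut down to the four entries shown above.

```python
import os
import tempfile

SCAN_LIMIT = 512 * 1024  # only the first 512 KB is read

SUSPICIOUS_STRINGS = [
    (b"powershell", "PowerShell download cradle"),
    (b"Invoke-WebRequest", "Downloader"),
    (b"rm -rf", "Destructive delete"),
    (b"net user", "User creation"),
]

def scan_strings(data):
    """Case-insensitive substring search over raw bytes."""
    hits = []
    lower = data.lower()
    for pattern, desc in SUSPICIOUS_STRINGS:
        if pattern.lower() in lower:
            hits.append(desc)
    return hits

def scan_file(path, limit=SCAN_LIMIT):
    """Read at most `limit` bytes so huge files stay fast to triage."""
    with open(path, "rb") as f:
        return scan_strings(f.read(limit))

# Demo: a "PDF" that carries a PowerShell one-liner.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "invoice.pdf")
    with open(path, "wb") as f:
        f.write(b"%PDF-1.7\nPowerShell -c Invoke-WebRequest http://example.com/a.ps1\n")
    hits = scan_file(path)

print(hits)  # ['PowerShell download cradle', 'Downloader']
```

The trade-off of the cap: anything hidden past the first 512 KB is not seen by this check.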

# === 4. Embedded Network Indicators (Upgrade #4) ===
# Simple regex on decoded text
urls = re.findall(r'https?://[^\s\'"<>]{4,80}', text)
ips  = re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', text)

Shows you every URL or IP stored in plain text inside the file, including possible C2 servers, before you open it.
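
The two regexes need a text variable to run on. A minimal sketch of how that could look, assuming the raw bytes are decoded as latin-1 (which never fails) and that IP candidates such as 999.10.10.10 should be dropped. find_network_indicators is a hypothetical name, and the validation step is my addition.

```python
import ipaddress
import re

URL_RE = re.compile(r'https?://[^\s\'"<>]{4,80}')
IP_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

def find_network_indicators(data):
    """Return (urls, ips) found as plain text in raw bytes."""
    # latin-1 maps every byte to exactly one character, so binary
    # files decode without raising.
    text = data.decode("latin-1")
    urls = URL_RE.findall(text)
    ips = []
    for candidate in IP_RE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # rejects octets above 255
        except ValueError:
            continue
        ips.append(candidate)
    return urls, ips

sample = (b"\x00\x01GET http://evil.example/payload.bin HTTP/1.1\r\n"
          b"beacon 203.0.113.7 build 999.10.10.10\n")
urls, ips = find_network_indicators(sample)
print(urls)  # ['http://evil.example/payload.bin']
print(ips)   # ['203.0.113.7']
```

This only finds indicators stored as readable text. Anything encoded, encrypted, or assembled at runtime will not show up.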

# === 5. Cryptographic Hashes + PE Header Parsing (Upgrade #5) ===
def sha256sum(path): ...
def md5sum(path): ...

def get_pe_info(data):
    # Parses MZ → PE header, extracts architecture, compile time, DLL/EXE flag
    ...

Even if the file is renamed 10 times, the SHA-256 is the same.
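
The hash function bodies are elided above, so here is a sketch of how a file hash is commonly computed with hashlib, reading in chunks so large files don't sit in memory. file_digest is a name I'm assuming, not the tool's own. The demo shows the rename claim directly.

```python
import hashlib
import os
import tempfile

def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in fixed-size chunks and return the hex digest."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo: rename the file, hash it again, compare.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "malware.exe")
    renamed = os.path.join(tmp, "report.pdf")
    with open(original, "wb") as f:
        f.write(b"MZ" + b"\x00" * 100)
    before = file_digest(original)
    os.rename(original, renamed)
    after = file_digest(renamed)

print(before == after)  # True
```

The hash depends only on the bytes inside the file, which is why it makes a good lookup key for threat-intel services.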

PE parser tells you “64-bit executable compiled on 2025-11-03”.
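
The parser body is elided above, so this is not the tool's code. It's a sketch of the standard PE layout: the 4 bytes at offset 0x3C point to the "PE\0\0" signature, and the 20-byte COFF header that follows holds the machine type, the timestamp, and the characteristics flags. parse_pe_header and MACHINE_TYPES are assumed names, and the demo header is synthetic.

```python
import struct
from datetime import datetime, timezone

MACHINE_TYPES = {0x014C: "x86 (32-bit)", 0x8664: "x64 (64-bit)", 0xAA64: "ARM64"}
IMAGE_FILE_DLL = 0x2000  # characteristics flag: file is a DLL

def parse_pe_header(data):
    """Return basic PE facts, or None if the bytes are not a PE file."""
    if len(data) < 0x40 or not data.startswith(b"MZ"):
        return None
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        return None
    if len(data) < pe_offset + 24:
        return None  # truncated before the end of the COFF header
    machine, _sections, stamp, _symptr, _nsyms, _optsize, flags = \
        struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    compiled = datetime.fromtimestamp(stamp, tz=timezone.utc)
    return {
        "architecture": MACHINE_TYPES.get(machine, hex(machine)),
        "compiled": compiled.strftime("%Y-%m-%d"),
        "is_dll": bool(flags & IMAGE_FILE_DLL),
    }

# Synthetic header for the demo (not a real executable).
stamp = int(datetime(2025, 11, 3, tzinfo=timezone.utc).timestamp())
blob = bytearray(0x80 + 24)
blob[0:2] = b"MZ"
struct.pack_into("<I", blob, 0x3C, 0x80)
blob[0x80:0x84] = b"PE\x00\x00"
struct.pack_into("<HHIIIHH", blob, 0x84, 0x8664, 1, stamp, 0, 0, 0, 0x0022)

info = parse_pe_header(bytes(blob))
print(info)
```

Keep in mind the compile timestamp is just a field in the file. Build tools and attackers can set it to anything, so treat it as a clue.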

Risk Meter — One Number to Rule Them All

Every check adds to a 0–100 risk score:

Extension mismatch → +25
High entropy → +30
Suspicious strings → +5 each
Embedded URLs → +2 each
PE executable in .pdf → instant jump

Then a beautiful terminal risk meter with colour-coded threat level (CLEAN → CRITICAL).
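
Here is how the scoring could be put together. The weights come from the list above. Everything else is my assumption: the value of the "instant jump" (90 here), the level names between CLEAN and CRITICAL, their thresholds, and the plain ASCII meter without colours.

```python
def risk_score(extension_mismatch, entropy_value, string_hits, url_count,
               disguised_executable=False):
    """Combine the individual checks into a 0-100 score."""
    score = 0
    if extension_mismatch:
        score += 25
    if entropy_value > 7.5:
        score += 30
    score += 5 * string_hits
    score += 2 * url_count
    if disguised_executable:
        score = max(score, 90)  # assumed value for the "instant jump"
    return min(score, 100)      # clamp to the top of the scale

def threat_level(score):
    """Map a score to a label. Thresholds are assumptions."""
    if score == 0:
        return "CLEAN"
    if score < 40:
        return "LOW"
    if score < 70:
        return "MEDIUM"
    if score < 90:
        return "HIGH"
    return "CRITICAL"

def render_meter(score, width=20):
    """Plain-text bar, e.g. [##############------]  71/100 HIGH"""
    filled = round(score / 100 * width)
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {score:3d}/100 {threat_level(score)}"

# Mismatch + high entropy + 2 suspicious strings + 3 URLs = 25+30+10+6
example = risk_score(True, 7.8, string_hits=2, url_count=3)
print(render_meter(example))
```

Clamping at 100 matters because the per-hit bonuses are unbounded: a file with 20 URLs and 10 string hits would otherwise overshoot the scale.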

What SafeOpen Is (and Is Not)

Is: 5-second pre-execution triage for suspicious attachments.

Is not: Antivirus, sandbox, or signature-based detector.

It solves the exact moment every SOC analyst, helpdesk tech, and power user faces:

“Hey, is this invoice.pdf safe?”

How to use it right now:

python3 safeopen.py suspicious.pdf --strings
python3 safeopen.py *.exe --json-out report.json
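
For anyone building their own version, a command line like the one above can be parsed with argparse roughly as follows. The flag names are taken from the two commands. What each flag does internally is my guess (--strings switches on the string scan, --json-out takes a file path), and build_parser is a hypothetical name.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="safeopen.py",
        description="Inspect before you execute.",
    )
    # One or more paths; the shell expands *.exe before Python sees it.
    parser.add_argument("files", nargs="+", help="file(s) to inspect")
    parser.add_argument("--strings", action="store_true",
                        help="also run the suspicious-string scan (assumed meaning)")
    parser.add_argument("--json-out", metavar="PATH",
                        help="write the report as JSON to PATH (assumed meaning)")
    return parser

args = build_parser().parse_args(["suspicious.pdf", "--strings"])
print(args.files, args.strings, args.json_out)  # ['suspicious.pdf'] True None
```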

Final Thought

Most breaches aren’t zero-days.

They’re ordinary files opened by ordinary people who trusted the filename.

SafeOpen gives you the habit:

Don’t execute first. Inspect first.

Sorry for missing yesterday’s post. I got pulled into some serious debugging and real testing, and the write-up itself took longer than I expected. I didn’t want to rush it and post something half-baked, so I waited until it was stable and properly explained. Day 13 is finally here 🙂

Drop a 🔥 if you want Day 14 tomorrow.
