I've scanned the same 118 blockchain validator nodes probably 200 times over the past year. And for most of that time, my scanner was an idiot with amnesia - treating scan #200 exactly like scan #1, learning nothing.
Every single time, ports 2375 and 2376 showed up as "open." Every single time, my tools dutifully tested them for Docker APIs. Every single time, they found nothing. Ten seconds wasted per scan, multiplied by hundreds of scans, just... gone.
Then I had a thought: What if my scanner could remember?
The Ghost Port Problem
Here's what kept happening across all 118 nodes, spanning multiple cloud providers and geographies:
- Ports 2375/2376 (standard Docker API ports) responded to TCP handshakes
- But curl hung. Netcat got EOF immediately. No banner, no service, nothing
- Identical TCP fingerprints every time: TTL≈63, window=65408
- These were otherwise hardened validator nodes with strict firewalls
Traditional security scanners reported these as "open/tcpwrapped" or "unknown service." Which meant:
- Repeated Docker API testing (10+ seconds per port)
- Manual investigation on every scan
- False positives in my reports
- Wasted scanning budget when cloud providers flagged excessive probes
After the 50th identical scan, I was done. There had to be a better way.
Vector Embeddings: Not Just for Chatbots
Vector embeddings are typically associated with NLP and RAG systems — turning text into high-dimensional vectors where semantically similar things cluster together. But the core concept is universal: represent complex data as points in space, then query "what's similar to this?"
What if each network scan became a vector representing:
- Port combinations and states
- TCP-level behaviors (TTL, window size, response timing)
- Application-layer responses
- Infrastructure context (hosting provider, network profile)
Then instead of treating every scan independently, I could query: "What have I learned from similar infrastructure before?"
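To make that concrete, here's roughly how a scan gets flattened into text before it's embedded. This is a simplified Python sketch; the field names are illustrative, not the exact production schema:

def scan_to_text(scan: dict) -> str:
    # Flatten scan metadata into a structured text blob for the embedding model
    lines = [
        f"open_ports: {sorted(scan['ports'])}",
        f"tcp: ttl={scan['ttl']} window={scan['window']}",
        f"banners: {scan.get('banners') or 'none'}",
        f"provider: {scan.get('asn_org', 'unknown')}",
        f"behavior: {scan.get('behavior_notes', '')}",
    ]
    return "\n".join(lines)

example = {
    "ports": [22, 80, 2375, 2376, 9000, 9184],
    "ttl": 63,
    "window": 65408,
    "behavior_notes": "2375/2376 accept SYN-ACK, no banner, FIN on data send",
    "asn_org": "Vultr",
}
print(scan_to_text(example))

The point is that the text captures behavior and context, not just a list of port numbers.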
The Architecture
I built a three-part system:
Alan (AI Planner): LLM-based decision engine that receives scan context and historical patterns, then generates optimized probe sequences
Stan (Executor): Runs the actual scanning commands (nmap, masscan, protocol probes) and captures behavioral metadata
Vince (Vector Memory): PostgreSQL with the pgvector extension, storing 1536-dimensional embeddings and answering cosine-similarity queries
The flow looks like:
- Stan discovers open ports → [22, 80, 2375, 2376, 9000, 9184]
- Vector memory finds similar historical scans
- Alan gets enriched context with patterns
- Alan generates optimized probe plan based on what worked before
- Results stored with behavioral fingerprint
- Embedding generated and indexed for future queries
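Stripped down to sketch-level Python, the loop looks like this. The helper names are placeholders standing in for Stan, Vince, and Alan; they are not the project's actual API:

from typing import Dict, List

def run_discovery(host: str) -> List[int]:
    """Stan: port discovery (nmap/masscan wrapper). Placeholder."""
    raise NotImplementedError

def find_similar_scans(ports: List[int], limit: int = 5) -> List[Dict]:
    """Vince: pgvector similarity search over past scans. Placeholder."""
    raise NotImplementedError

def plan_probes(ports: List[int], history: List[Dict]) -> List[str]:
    """Alan: LLM planner that turns ports + history into a probe plan. Placeholder."""
    raise NotImplementedError

def execute_probes(host: str, plan: List[str]) -> Dict:
    """Stan: run the probes and capture behavioral metadata. Placeholder."""
    raise NotImplementedError

def store_scan(host: str, ports: List[int], results: Dict) -> None:
    """Vince: persist results, embed them, index for future queries. Placeholder."""
    raise NotImplementedError

def scan_host(host: str) -> Dict:
    ports = run_discovery(host)            # e.g. [22, 80, 2375, 2376, 9000, 9184]
    history = find_similar_scans(ports)    # what have I learned from similar hosts?
    plan = plan_probes(ports, history)     # skip probes history says are pointless
    results = execute_probes(host, plan)
    store_scan(host, ports, results)       # the next scan benefits from this one
    return results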
Setting Up pgvector
I chose pgvector because it's PostgreSQL-native, mature, and way more cost-effective than managed vector databases:
-- Enable pgvector and give each scan a 1536-dim embedding (ada-002 output size)
CREATE EXTENSION vector;

ALTER TABLE validator_scans
ADD COLUMN embedding vector(1536);

-- IVFFlat index over cosine distance; rule of thumb is lists ≈ rows/1000
CREATE INDEX validator_scans_embedding_idx
ON validator_scans
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
Similarity queries are simple:
SELECT id, host, ports,
1 - (embedding <=> query_embedding) as similarity
FROM validator_scans
WHERE created_at > NOW() - INTERVAL '90 days'
ORDER BY embedding <=> query_embedding
LIMIT 5;
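From application code, query_embedding is just a bound parameter. A minimal psycopg2 sketch, assuming the schema above (pgvector accepts a '[x,y,z]' literal cast to vector):

import psycopg2

def find_similar(conn, query_embedding: list[float], limit: int = 5):
    # Format the embedding as a pgvector literal and cast it in the query
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, host, ports,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM validator_scans
            WHERE created_at > NOW() - INTERVAL '90 days'
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, vec, limit),
        )
        return cur.fetchall()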
For embeddings, I use OpenAI's text-embedding-ada-002 (1536 dimensions) because it's dirt cheap ($0.0001 per 1K tokens) and handles structured text well.
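Generating the embedding is one API call. A sketch with the OpenAI Python client (v1.x), reusing the scan_to_text() flattener from earlier:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_scan(scan_text: str) -> list[float]:
    # One structured-text blob in, 1536 floats out
    resp = client.embeddings.create(
        model="text-embedding-ada-002",
        input=scan_text,
    )
    return resp.data[0].embedding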
Beyond Simple Signatures
Traditional fingerprinting is rule-based:
IF port == 2375 AND banner contains "Docker"
THEN service = Docker API
Vector-based learning captures behavior:
"Port 2375: SYN-ACK succeeds, TTL=63, window=65408,
no banner, immediate FIN on data send,
appears alongside ports 9000+9184 (Sui consensus/metrics),
ASN indicates Vultr hosting"
Similarity search returns:
"47 similar scans: 46 showed identical 'ghost port' behavior,
1 had actual Docker (flagged as anomaly)"
Conclusion: 98% confidence this is NOT Docker, likely cloud infrastructure artifact
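The decision itself can stay dead simple: count the votes from the nearest neighbors. A rough sketch; the 0.9 cutoff is an illustrative value, not a tuned one:

def docker_probe_decision(similar_scans: list[dict]) -> str:
    # No history yet: probe normally
    if not similar_scans:
        return "test"
    ghost = sum(1 for s in similar_scans if s["label"] == "ghost_port")
    confidence = ghost / len(similar_scans)
    return "skip" if confidence >= 0.9 else "test"

matches = [{"label": "ghost_port"}] * 46 + [{"label": "docker_api"}]
print(docker_probe_decision(matches))  # "skip" (46/47 ≈ 0.98)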
Watching It Learn
Scans 1-20 (Initial learning): System tests Docker APIs as expected, stores behavioral metadata showing timeouts and connection refusals.
Scans 21-50 (Pattern recognition): Vector similarity search starts clustering:
Query: Scan with ports [22, 80, 2375, 2376, 9000, 9184]
Top matches:
- Scan #14: 96% similarity → 2375/2376 ghost ports
- Scan #8: 94% similarity → 2375/2376 ghost ports
- Scan #19: 93% similarity → 2375/2376 ghost ports
Pattern confidence: 0.85 (17/20 matching scans)
Recommendation: Skip Docker testing on 2375/2376
Scans 51+ (Optimized): High confidence behavioral signatures:
{
"similar_scan_count": 47,
"confidence": 0.96,
"2375_behavior": "ghost_port - skip Docker probes",
"estimated_time_saved": "45s per scan"
}
The Results
After 200 scans of the same infrastructure:
Time efficiency: 58 seconds per scan → 20 seconds per scan (66% reduction)
Probe efficiency: 7.2 probes per host → 3.8 probes per host (47% less network traffic)
False positives: 2.4 per scan → 0.3 per scan (87% reduction)
Pattern recognition speed: Confident patterns (>0.85 similarity) after just 18-25 similar scans
But here's the coolest part: anomaly detection. On scan #73, port 2375 actually responded with a Docker API. The system immediately flagged it: "Unusual behavior — historical data shows 0.02% Docker response rate." Turned out to be a misconfigured node that needed immediate attention.
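The anomaly check is the same logic inverted: flag anything whose historical frequency is tiny. A sketch, with the 5% cutoff as an illustrative assumption:

def is_anomaly(observed: str, history: list[str], min_rate: float = 0.05) -> bool:
    # Flag behavior whose historical frequency falls below min_rate
    if not history:
        return False
    rate = history.count(observed) / len(history)
    return rate < min_rate

history = ["ghost_port"] * 46 + ["docker_api"]
print(is_anomaly("docker_api", history))  # True: a real Docker API here is rare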
Practical Considerations
Similarity thresholds matter:
- Homogeneous infrastructure (like validators): 0.75-0.85
- Mixed environments: 0.65-0.75
- Pentesting diverse targets: 0.60-0.70
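In practice this is just a config knob. Something like the following, with the ranges above collapsed into illustrative defaults:

# Illustrative defaults only; tune per environment
SIMILARITY_THRESHOLDS = {
    "homogeneous": 0.80,      # e.g. validator fleets
    "mixed": 0.70,
    "diverse_pentest": 0.65,
}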
Cold start problem: First 10-20 scans of new infrastructure provide no optimization. Mitigation: seed database with known patterns.
Temporal drift: Infrastructure changes over time. I time-weight similarity to prefer recent scans.
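A sketch of what that weighting can look like; the 30-day half-life is an assumed value, not the production setting:

from datetime import datetime, timezone

def time_weighted(similarity: float, scanned_at: datetime,
                  half_life_days: float = 30.0) -> float:
    # Exponentially decay the raw cosine similarity by scan age
    # (scanned_at is assumed to be timezone-aware)
    age_days = (datetime.now(timezone.utc) - scanned_at).days
    return similarity * 0.5 ** (age_days / half_life_days)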
Embedding overhead: Adds 50-100ms per scan. I generate embeddings asynchronously in production.
Why This Matters
Traditional security scanners treat every scan as a fresh start. They're like someone with no short-term memory, asking the same questions over and over. This made sense 20 years ago when each network was unique.
But modern security teams scan thousands of similar nodes repeatedly:
- Development environments that clone production
- Auto-scaling cloud infrastructure
- Container clusters with identical configurations
- Blockchain validator networks (my use case)
Vector-based behavioral fingerprinting lets scanners accumulate institutional knowledge that compounds over time. They get smarter with every scan, building confidence about what's normal and what's anomalous.
As cloud infrastructure grows more complex — with synthetic network responses, polymorphic services, and dynamic topologies — we need security tools that learn. Not just from signature databases, but from their own experience.
What's Next
I'm exploring:
- Multi-modal embeddings combining text with numeric TCP fingerprints
- Transfer learning: do patterns from Sui validators apply to Ethereum nodes?
- Hierarchical clustering to automatically build infrastructure taxonomies
- Tracking temporal pattern evolution to detect infrastructure migrations
The core insight stands: every scan is a training example. Stop forgetting. Start remembering.
I’m publishing the open source code here: github.com/pgdn-oss. Built with PostgreSQL, pgvector, and OpenAI embeddings. Part of a new venture, coming soon.