I've scanned the same 118 blockchain validator nodes probably 200 times over the past year. And for most of that time, my scanner was an idiot with amnesia - treating scan #200 exactly like scan #1, learning nothing.
Every single time, ports 2375 and 2376 showed up as "open." Every single time, my tools dutifully tested them for Docker APIs. Every single time, they found nothing. Ten seconds wasted per scan, multiplied by hundreds of scans, just... gone.
Then I had a thought: What if my scanner could remember?
The Ghost Port Problem
Here's what kept happening across all 118 nodes, spanning multiple cloud providers and geographies:
- Ports 2375/2376 (standard Docker API ports) responded to TCP handshakes
- But curl hung. Netcat got EOF immediately. No banner, no service, nothing
- Identical TCP fingerprints every time: TTL≈63, window=65408
- These were otherwise hardened validator nodes with strict firewalls
Traditional security scanners reported these as "open/tcpwrapped" or "unknown service." Which meant:
- Repeated Docker API testing (10+ seconds per port)
- Manual investigation on every scan
- False positives in my reports
- Wasted scanning budget when cloud providers flagged excessive probes
After the 50th identical scan, I was done. There had to be a better way.
Vector Embeddings: Not Just for Chatbots
Vector embeddings are typically associated with NLP and RAG systems — turning text into high-dimensional vectors where semantically similar things cluster together. But the core concept is universal: represent complex data as points in space, then query "what's similar to this?"
What if each network scan became a vector representing:
- Port combinations and states
- TCP-level behaviors (TTL, window size, response timing)
- Application-layer responses
- Infrastructure context (hosting provider, network profile)
Then instead of treating every scan independently, I could query: "What have I learned from similar infrastructure before?"
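To make that concrete, here's roughly how a scan gets flattened into text before it's embedded. This is a simplified Python sketch; the field names are illustrative, not the exact production schema:

def scan_to_text(scan: dict) -> str:
    # Flatten scan metadata into a structured text blob for the embedding model
    lines = [
        f"open_ports: {sorted(scan['ports'])}",
        f"tcp: ttl={scan['ttl']} window={scan['window']}",
        f"banners: {scan.get('banners') or 'none'}",
        f"provider: {scan.get('asn_org', 'unknown')}",
        f"behavior: {scan.get('behavior_notes', '')}",
    ]
    return "\n".join(lines)

example = {
    "ports": [22, 80, 2375, 2376, 9000, 9184],
    "ttl": 63,
    "window": 65408,
    "behavior_notes": "2375/2376 accept SYN-ACK, no banner, FIN on data send",
    "asn_org": "Vultr",
}
print(scan_to_text(example))

The point is that the text captures behavior and context, not just a list of port numbers.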
The Architecture
I built a three-part system:
Alan (AI Planner): LLM-based decision engine that receives scan context and historical patterns, then generates optimized probe sequences
Stan (Executor): Runs the actual scanning commands (nmap, masscan, protocol probes) and captures behavioral metadata
Vince (Vector Memory): PostgreSQL with the pgvector extension, storing 1536-dimensional embeddings and answering cosine-similarity queries
The flow looks like:
- Stan discovers open ports → [22, 80, 2375, 2376, 9000, 9184]
- Vector memory finds similar historical scans
- Alan gets enriched context with patterns
- Alan generates optimized probe plan based on what worked before
- Results stored with behavioral fingerprint
- Embedding generated and indexed for future queries
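Stripped down to sketch-level Python, the loop looks like this. The helper names are placeholders standing in for Stan, Vince, and Alan; they are not the project's actual API:

from typing import Dict, List

def run_discovery(host: str) -> List[int]:
    """Stan: port discovery (nmap/masscan wrapper). Placeholder."""
    raise NotImplementedError

def find_similar_scans(ports: List[int], limit: int = 5) -> List[Dict]:
    """Vince: pgvector similarity search over past scans. Placeholder."""
    raise NotImplementedError

def plan_probes(ports: List[int], history: List[Dict]) -> List[str]:
    """Alan: LLM planner that turns ports + history into a probe plan. Placeholder."""
    raise NotImplementedError

def execute_probes(host: str, plan: List[str]) -> Dict:
    """Stan: run the probes and capture behavioral metadata. Placeholder."""
    raise NotImplementedError

def store_scan(host: str, ports: List[int], results: Dict) -> None:
    """Vince: persist results, embed them, index for future queries. Placeholder."""
    raise NotImplementedError

def scan_host(host: str) -> Dict:
    ports = run_discovery(host)            # e.g. [22, 80, 2375, 2376, 9000, 9184]
    history = find_similar_scans(ports)    # what have I learned from similar hosts?
    plan = plan_probes(ports, history)     # skip probes history says are pointless
    results = execute_probes(host, plan)
    store_scan(host, ports, results)       # the next scan benefits from this one
    return results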
Setting Up pgvector
I chose pgvector because it's PostgreSQL-native, mature, and way more cost-effective than managed vector databases:
-- Enable pgvector and give each scan a 1536-dim embedding (ada-002 output size)
CREATE EXTENSION vector;

ALTER TABLE validator_scans
ADD COLUMN embedding vector(1536);

-- IVFFlat index over cosine distance; rule of thumb is lists ≈ rows/1000
CREATE INDEX validator_scans_embedding_idx
ON validator_scans
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
Similarity queries are simple:
SELECT id, host, ports,
1 - (embedding <=> query_embedding) as similarity
FROM validator_scans
WHERE created_at > NOW() - INTERVAL '90 days'
ORDER BY embedding <=> query_embedding
LIMIT 5;
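From application code, query_embedding is just a bound parameter. A minimal psycopg2 sketch, assuming the schema above (pgvector accepts a '[x,y,z]' literal cast to vector):

import psycopg2

def find_similar(conn, query_embedding: list[float], limit: int = 5):
    # Format the embedding as a pgvector literal and cast it in the query
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, host, ports,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM validator_scans
            WHERE created_at > NOW() - INTERVAL '90 days'
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, vec, limit),
        )
        return cur.fetchall()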
For embeddings, I use OpenAI's text-embedding-ada-002 (1536 dimensions) because it's dirt cheap ($0.0001 per 1K tokens) and handles structured text well.
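Generating the embedding is one API call. A sketch with the OpenAI Python client (v1.x), reusing the scan_to_text() flattener from earlier:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_scan(scan_text: str) -> list[float]:
    # One structured-text blob in, 1536 floats out
    resp = client.embeddings.create(
        model="text-embedding-ada-002",
        input=scan_text,
    )
    return resp.data[0].embedding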
Beyond Simple Signatures
Traditional fingerprinting is rule-based:
IF port == 2375 AND banner contains "Docker"
THEN service = Docker API
Vector-based learning captures behavior:
"Port 2375: SYN-ACK succeeds, TTL=63, window=65408,
no banner, immediate FIN on data send,
appears alongside ports 9000+9184 (Sui consensus/metrics),
ASN indicates Vultr hosting"
Similarity search returns:
"47 similar scans: 46 showed identical 'ghost port' behavior,
1 had actual Docker (flagged as anomaly)"
Conclusion: 98% confidence this is NOT Docker, likely cloud infrastructure artifact
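The decision itself can stay dead simple: count the votes from the nearest neighbors. A rough sketch; the 0.9 cutoff is an illustrative value, not a tuned one:

def docker_probe_decision(similar_scans: list[dict]) -> str:
    # No history yet: probe normally
    if not similar_scans:
        return "test"
    ghost = sum(1 for s in similar_scans if s["label"] == "ghost_port")
    confidence = ghost / len(similar_scans)
    return "skip" if confidence >= 0.9 else "test"

matches = [{"label": "ghost_port"}] * 46 + [{"label": "docker_api"}]
print(docker_probe_decision(matches))  # "skip" (46/47 ≈ 0.98)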
Watching It Learn
Scans 1-20 (Initial learning): System tests Docker APIs as expected, stores behavioral metadata showing timeouts and connection refusals.
Scans 21-50 (Pattern recognition): Vector similarity search starts clustering:
Query: Scan with ports [22, 80, 2375, 2376, 9000, 9184]
Top matches:
- Scan #14: 96% similarity → 2375/2376 ghost ports
- Scan #8: 94% similarity → 2375/2376 ghost ports
- Scan #19: 93% similarity → 2375/2376 ghost ports
Pattern confidence: 0.85 (17/20 matching scans)
Recommendation: Skip Docker testing on 2375/2376
Scans 51+ (Optimized): High confidence behavioral signatures:
{
"similar_scan_count": 47,
"confidence": 0.96,
"2375_behavior": "ghost_port - skip Docker probes",
"estimated_time_saved": "45s per scan"
}
The Results
After 200 scans of the same infrastructure:
Time efficiency: 58 seconds per scan → 20 seconds per scan (66% reduction)
Probe efficiency: 7.2 probes per host → 3.8 probes per host (47% less network traffic)
False positives: 2.4 per scan → 0.3 per scan (87% reduction)
Pattern recognition speed: Confident patterns (>0.85 similarity) after just 18-25 similar scans
But here's the coolest part: anomaly detection. On scan #73, port 2375 actually responded with a Docker API. The system immediately flagged it: "Unusual behavior — historical data shows 0.02% Docker response rate." Turned out to be a misconfigured node that needed immediate attention.
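The anomaly check is the same logic inverted: flag anything whose historical frequency is tiny. A sketch, with the 5% cutoff as an illustrative assumption:

def is_anomaly(observed: str, history: list[str], min_rate: float = 0.05) -> bool:
    # Flag behavior whose historical frequency falls below min_rate
    if not history:
        return False
    rate = history.count(observed) / len(history)
    return rate < min_rate

history = ["ghost_port"] * 46 + ["docker_api"]
print(is_anomaly("docker_api", history))  # True: a real Docker API here is rare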
Practical Considerations
Similarity thresholds matter:
- Homogeneous infrastructure (like validators): 0.75-0.85
- Mixed environments: 0.65-0.75
- Pentesting diverse targets: 0.60-0.70
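In practice this is just a config knob. Something like the following, with the ranges above collapsed into illustrative defaults:

# Illustrative defaults only; tune per environment
SIMILARITY_THRESHOLDS = {
    "homogeneous": 0.80,      # e.g. validator fleets
    "mixed": 0.70,
    "diverse_pentest": 0.65,
}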
Cold start problem: First 10-20 scans of new infrastructure provide no optimization. Mitigation: seed database with known patterns.
Temporal drift: Infrastructure changes over time. I time-weight similarity to prefer recent scans.
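A sketch of what that weighting can look like; the 30-day half-life is an assumed value, not the production setting:

from datetime import datetime, timezone

def time_weighted(similarity: float, scanned_at: datetime,
                  half_life_days: float = 30.0) -> float:
    # Exponentially decay the raw cosine similarity by scan age
    # (scanned_at is assumed to be timezone-aware)
    age_days = (datetime.now(timezone.utc) - scanned_at).days
    return similarity * 0.5 ** (age_days / half_life_days)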
Embedding overhead: Adds 50-100ms per scan. I generate embeddings asynchronously in production.
Why This Matters
Traditional security scanners treat every scan as a fresh start. They're like someone with no short-term memory, asking the same questions over and over. This made sense 20 years ago when each network was unique.
But modern security teams scan thousands of similar nodes repeatedly:
- Development environments that clone production
- Auto-scaling cloud infrastructure
- Container clusters with identical configurations
- Blockchain validator networks (my use case)
Vector-based behavioral fingerprinting lets scanners accumulate institutional knowledge that compounds over time. They get smarter with every scan, building confidence about what's normal and what's anomalous.
As cloud infrastructure grows more complex — with synthetic network responses, polymorphic services, and dynamic topologies — we need security tools that learn. Not just from signature databases, but from their own experience.
What's Next
I'm exploring:
- Multi-modal embeddings combining text with numeric TCP fingerprints
- Transfer learning: do patterns from Sui validators apply to Ethereum nodes?
- Hierarchical clustering to automatically build infrastructure taxonomies
- Tracking temporal pattern evolution to detect infrastructure migrations
The core insight stands: every scan is a training example. Stop forgetting. Start remembering.
I’m publishing the open source code here: github.com/pgdn-oss. Built with PostgreSQL, pgvector, and OpenAI embeddings. Part of a new venture, coming soon.