Fawole Joshua
Why Signature-Based Security Is No Longer Enough to Detect Cyber Attacks and How UEBA Hunts Vicious Threats


Introduction

Imagine a national museum that holds thousands of historic antiquities.

The monuments are precious, worth billions, and essential to preserving the nation's cultural heritage.

The department of security has the job of preventing unwanted persons from entering the museum. In order to do the job effectively, security personnel have lists and mugshots of criminals. At every entrance, guards check visitors against a book of mugshots actively searching for known criminals, troublemakers and persons of interest.

The identity of anyone entering the museum is verified. For centuries this worked because a criminal looked like a criminal: they wore suspicious clothes, walked around aimlessly, carried odd bags and lost their temper at the slightest questioning by security guards.

However, criminals and people with malicious intentions evolved. Today's thieves don't look like thieves. They dress sharply, walk with confidence, and never once glance nervously at a security camera. They've done their homework. They know which employees have access to the restricted wings. They've studied their mannerisms, their routines, their faces. And on the day of the heist, they walk straight through the front gate wearing the face of a trusted curator.

Mugshots became useless, posters became increasingly less effective, and thieves walked straight through the gate into the museum with little or no resistance.

This is the problem with modern cybersecurity: your network is the museum attackers are trying to enter, and your databases and servers are the precious, invaluable artifacts. Your firewalls, antivirus software and intrusion detection systems are the security guards armed with mugshots, lists of criminals and catalogues of expected attacks.

They check everything against a list of known-bad signatures, malicious IPs and known attack patterns. This is no longer effective. Attackers no longer look like attackers: they use stolen credentials, avoid irrational and suspicious movements, slip past your rules with dynamic, ever-changing techniques, and head straight for your data.

This article makes a single argument: signature-based security alone is insufficient. In a world where attackers constantly change their tools, malware architecture, IP addresses and overall techniques, manually writing rules to detect each type of attack is not only exhausting but also impractical and deeply inefficient.

New malware variants appear by the thousands every month; there is no way antivirus software can keep up. Living-off-the-land attacks use legitimate tools and leave no bad signature to match. Zero-day exploits are even worse: they are ghosts that leave almost no signature at all.

This means attackers have an easy way into networks. Worse, if they get their hands on an authorized account they can move mountains, and the signature-based security system will never flag it.

The cat-and-mouse game of find-the-bad-activity is a losing battle. The only sustainable defence is to learn what good activity looks like, so thoroughly that bad activity cannot hide, no matter what form it takes.

This is the philosophical shift at the heart of modern threat hunting. It is the undiluted application of machine learning and artificial intelligence in the realm of security. This is the security agency that does not need a list of attacks before it can successfully flag one.

If Jude from the HR department at a museum in California suddenly logs in on a Sunday evening from Dubai and starts downloading 678 gigabytes of customer data, we don't need to debate whether the IP is malicious or whether the download tool has a known signature. Nothing like this has happened in the five years Jude has worked at the museum, so it is obviously unusual activity and will certainly be flagged as an anomaly.

User and Entity Behaviour Analytics (UEBA) Detection: How It Learns

The government of the country realizes the critical issue at the museum and decides to introduce a special task force to help the security department. This is where UEBA comes in.

User and Entity Behaviour Analytics focuses on studying everyone and everything (users, their devices and other entities) in order to establish a ground truth: what we might otherwise call normal activity. Anything outside these normal activities is a potential threat.

Imagine a new security guard named Owen (UEBA) assigned to the museum. For his first three months, Owen does nothing but carefully observe every employee, every visitor, every delivery person: their arrival and departure times, their access levels and their general patterns of activity.

Owen is not just memorizing facts; he is building a robust profile that will serve as the baseline for evaluating all future activity. This is what UEBA does with your data: it consumes logs from countless sources:

  • Authentication logs (VPN, Active Directory)

  • Network flows (NetFlow, DNS queries)

  • Endpoint logs (process creation, file access)

  • Application logs (database queries, web server access)

  • Cloud service logs (Office 365, AWS, Salesforce)



Assuming you've loaded the data, imported the libraries and performed data cleaning and preprocessing:

print("\n[3] Feature Engineering for Behavioral Profiles")

df_behavior = df.copy()
df_behavior['is_attack'] = (df_behavior['Label'] == 'DDoS').astype(int)

# Replace inf values first
df_behavior = df_behavior.replace([np.inf, -np.inf], np.nan)

# 1. Packet rate features - add a small epsilon to avoid division by zero
df_behavior['packets_per_second'] = df_behavior['Total Fwd Packets'] / (df_behavior['Flow Duration'] + 1e-10)
df_behavior['bytes_per_packet'] = df_behavior['Total Length of Fwd Packets'] / (df_behavior['Total Fwd Packets'] + 1e-10)

# 2. Flag ratios
df_behavior['syn_ack_ratio'] = df_behavior['SYN Flag Count'] / (df_behavior['ACK Flag Count'] + 1e-10)
# ... (flag_diversity, fwd_bwd_ratio, packet_size_variation and iat_cv are built the same way)

Output

[3] Feature Engineering for Behavioral Profiles
Created new behavioral features:
['packets_per_second', 'bytes_per_packet', 'syn_ack_ratio', 'flag_diversity', 'fwd_bwd_ratio', 'packet_size_variation', 'iat_cv']

From this raw data, UEBA extracts behavioural features and a ground truth for every employee and user. It learns that Jonathan from the accounting department usually logs in around 7:35 AM, opens his spreadsheets, and has never attempted to open the organization's source code repository.

That series of observations is the ground truth; any significant deviation is a potential sign of compromise. The concept of ground truth is perhaps the strongest asset a defender can possess.
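The baseline itself need not be exotic. Here is a minimal sketch (the log records, usernames and thresholds are invented for illustration, not any real UEBA API) of how per-user login behaviour could be summarized and then scored:

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical auth-log records: (user, login_hour, country) tuples.
auth_logs = [
    ("jonathan", 7, "US"), ("jonathan", 8, "US"), ("jonathan", 7, "US"),
    ("jonathan", 8, "US"), ("jude", 9, "US"), ("jude", 9, "US"),
    ("jude", 10, "US"), ("jude", 9, "US"),
]

def build_baselines(logs):
    """Summarize each user's normal login hours and countries."""
    hours, countries = defaultdict(list), defaultdict(set)
    for user, hour, country in logs:
        hours[user].append(hour)
        countries[user].add(country)
    return {
        user: {
            "mean_hour": mean(hs),
            "std_hour": stdev(hs) if len(hs) > 1 else 0.0,
            "countries": countries[user],
        }
        for user, hs in hours.items()
    }

def is_anomalous(baseline, hour, country, z_threshold=3.0):
    """Flag logins from a new country or far outside the usual hours."""
    if country not in baseline["countries"]:
        return True
    spread = baseline["std_hour"] or 1.0  # avoid division by zero
    return abs(hour - baseline["mean_hour"]) / spread > z_threshold

baselines = build_baselines(auth_logs)
# Jude logging in at 22:00 from Dubai deviates on both dimensions:
print(is_anomalous(baselines["jude"], 22, "AE"))  # True
print(is_anomalous(baselines["jude"], 9, "US"))   # False
```

A production system would track far more dimensions (devices, resources, sequences) per user, but the principle is the same: summarize history, then score new events against it.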

How UEBA Works and Avoids a Flood of False Positives

If UEBA relies on an established ground truth, does that mean it flags everything that deviates from it?

Not exactly. It flags deviations internally but does not report every one of them. It classifies signals according to a predefined, domain-specific level of seriousness (a risk scoring system). This ensures SOC analysts are not drowned in threat reports that later turn out to be insignificant or entirely benign.

Say, on the 23rd of June, 2025, a thief manages to compromise the account of a young employee in the museum's maintenance department. Let's call this account “IamCareless600”.

Day 1: The account logs in on a Saturday at 10:45 PM. This is unusual; the owner has never done this. Owen sees it but doesn't react (maybe it's an emergency, or he just needs to pick something up).

Day 2: The account logs in on Monday, but instead of heading to the maintenance department it goes to the restoration lab (somewhere its owner has never been) and is denied entry. It then tries the administrative block and is denied again, and finally makes its way to the server room. Access denied once more.

Owen now has weak signals:

  • An anomalous late night entry
  • Multiple failed access attempts to restricted areas
  • A pattern of wandering that does not match any employee's normal behaviour

Day 3: IamCareless600 logs in and immediately tries to access the database. This time it succeeds and starts transferring 500 GB of files to an external IP in a foreign country. The combination of these activities gives Owen strong probable cause.

Owen's machine-learning-powered brain correlates the signals: [Anomalous entry time] + [Multiple failed access attempts to restricted areas] + [First-time server access] + [Massive data exfiltration] = COMPROMISED ACCOUNT. Owen doesn't raise a generic alarm. He runs to the security team with a precise report.

The thief is caught in the act, halfway through stealing the museum's most precious records. This is threat hunting. This is the difference between waiting for an alarm and actively watching out for strange activities.
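The correlation step above can be sketched as a simple weighted risk score. The signal names, weights and threshold below are invented for illustration; real UEBA products tune these per domain and often use learned rather than hand-set weights:

```python
# Hypothetical per-signal risk weights; real deployments tune these per domain.
SIGNAL_WEIGHTS = {
    "off_hours_login": 10,
    "failed_restricted_access": 15,
    "first_time_server_access": 25,
    "large_external_transfer": 40,
}
ALERT_THRESHOLD = 70  # only escalate to the SOC above this score

def risk_score(signals):
    """Sum the weights of the observed weak signals."""
    return sum(SIGNAL_WEIGHTS.get(s, 0) for s in signals)

# Day 1: a single weak signal stays well below the threshold.
day1 = risk_score(["off_hours_login"])  # 10
# Day 3: correlated signals cross it and trigger a precise report.
day3 = risk_score(["off_hours_login", "failed_restricted_access",
                   "first_time_server_access", "large_external_transfer"])  # 90
print(day1 >= ALERT_THRESHOLD, day3 >= ALERT_THRESHOLD)  # False True
```

The key design point is that no individual signal fires an alert; only the accumulated, correlated score does, which is what keeps the false-positive rate manageable.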


print("\n[5] Detecting Anomalous Ports with Isolation Forest")

port_features = ['total_flows', 'avg_flow_duration', 'avg_packet_size', 'avg_packet_rate', 'syn_ack_ratio', 'packet_size_std']

X_port = port_df[port_features].copy()
# Handle any remaining infinite or NaN values
X_port = X_port.replace([np.inf, -np.inf], np.nan)
X_port = X_port.fillna(X_port.mean())

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X_port)

expected_contamination = port_df['is_malicious_port'].mean()
print(f"Expected contamination: {expected_contamination:.2%}")

from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(n_estimators=100, contamination=expected_contamination, random_state=42, bootstrap=False)
iso_forest.fit(X_scaled)

port_df['anomaly_score'] = iso_forest.decision_function(X_scaled)
port_df['predicted_anomaly'] = (iso_forest.predict(X_scaled) == -1).astype(int)

print("\nTop 20 most anomalous ports:")
anomalous = port_df.sort_values('anomaly_score').head(20)

print(anomalous[['port', 'total_flows', 'attack_rate', 'anomaly_score', 'is_malicious_port']].to_string())

Practical Implementation of UEBA Using CIC-IDS-2017 Dataset as a Case Study

We conducted a practical implementation of UEBA on a collection of network traffic containing both normal activity and DDoS attacks.

Owen was able to find the needle in the haystack, using an Isolation Forest to detect anomalies. We discovered that port 80, the web server port, was drowning in attack traffic: 136,951 flows, 93% of them malicious. The flow volume was 4.1 standard deviations above normal, and the packet sizes were 3.4 standard deviations above normal. The probability of this happening by chance is less than 1 in 35 million.

Owen does not need to know how the attack was built or what it was called; all he cares about is that “this is unusual activity, a dangerous one at that, and it must be stopped immediately”. Here is the link to the full code: https://github.com/Akanji102/DDoS-anomaly-detection-using-Isolation-forest

Classification Report:

                 precision  recall  f1-score  support
Normal Port           1.00    1.00      1.00       19
Malicious Port        1.00    1.00      1.00        1

Accuracy: 100%
ROC-AUC: 1.00
Matthews Correlation Coefficient: 1.00

Although the dataset was generated in a controlled environment and built primarily for educational purposes, it demonstrates how Isolation Forest, other unsupervised algorithms and deep learning networks can detect strange activity and flag attacks, old or new.

Tools and Techniques UEBA Utilizes to Hunt Threats.

Owen isn't a single model; he is an ensemble of different algorithms and tools, all working together to classify behaviours and detect anomalies. Some of them include:

(1). Temporal analysis:

Temporal analysis (time series analysis) models every user and entity's activity as a pattern over time. It doesn't just track what they do, but when they do it and in what sequence. It detects unusual login times, modified work patterns and sequence violations.

The algorithms here range from statistical methods like Seasonal ARIMA (which captures weekly patterns) to deep learning approaches like LSTMs (which excel at learning sequences).
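A full SARIMA or LSTM pipeline is beyond a blog snippet, but the core idea, scoring each new observation against its own recent history, can be sketched with a rolling z-score. The login counts below are invented for illustration:

```python
from statistics import mean, stdev

def rolling_zscore(series, window=7):
    """Score each point against the mean/std of the preceding window."""
    scores = []
    for i, value in enumerate(series):
        history = series[max(0, i - window):i]
        if len(history) < 2:
            scores.append(0.0)  # not enough history to score yet
            continue
        spread = stdev(history) or 1.0  # avoid division by zero
        scores.append((value - mean(history)) / spread)
    return scores

# Hypothetical daily login counts for one account; the last day spikes.
logins = [21, 19, 22, 20, 18, 21, 20, 95]
scores = rolling_zscore(logins)
print(f"spike z-score: {scores[-1]:.1f}")  # far above a 3-sigma threshold
```

SARIMA and LSTMs improve on this by modelling seasonality (weekends, shift patterns) and event sequences, but the anomaly decision is still "how far is this from what history predicts".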

(2). Graph Analysis:

UEBA as a defence system does not see each account or device as an isolated node to be studied independently; it sees them as a giant, dynamic web of connections.

It detects data exfiltration, lateral movement and insider collusion. When a clerk suddenly starts sending huge data files to businessmen in the Middle East, or a group of accounts keeps acting together with malicious intent, they are studied as a whole. UEBA traces the underlying relationships between entities in order to detect fraud.

The magic here lies in algorithms like Community Detection (which automatically finds groups that normally work together) and Graph Neural Networks (which learn to spot structural anomalies).
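Proper community detection needs a graph library or a GNN, but the intuition can be sketched in a few lines: flag a transfer to a party the sender has never touched, directly or through any shared neighbour. The edge list and names below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical history of who-sends-data-to-whom inside the organization.
history = [
    ("clerk_a", "clerk_b"), ("clerk_a", "clerk_c"), ("clerk_b", "clerk_c"),
    ("curator", "restorer"), ("curator", "archivist"),
]

def build_graph(edges):
    """Adjacency sets: each node's habitual communication partners."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
        graph[dst].add(src)
    return graph

def suspicious_edge(graph, src, dst):
    """Flag a transfer to a node the sender has never touched, directly
    or through any shared neighbour (a crude two-hop community test)."""
    if dst in graph[src]:
        return False  # an established relationship
    return not (graph[src] & graph[dst])  # no common neighbours either

graph = build_graph(history)
# A clerk sending data to an unknown external party trips the check:
print(suspicious_edge(graph, "clerk_a", "external_broker"))  # True
print(suspicious_edge(graph, "clerk_a", "clerk_b"))          # False
```

Real community detection generalizes this two-hop test: it partitions the whole graph into groups that normally interact, so any cross-group or out-of-graph edge stands out structurally.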

(3). Statistical analysis:

The Volume Detective tracks quantities: data volumes, file counts, action frequencies. It builds a picture of what “normal” should look like. A marketing intern might download an average of 200 MB of data per day, while a video editor might go as far as 10 GB.

It detects massive downloads (a video editor has no business downloading the salary dataset, or pulling 500 GB in a day). It uses models like the Gaussian distribution, moving averages, exponential smoothing, etc.
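As a minimal sketch of the exponential-smoothing approach (the volumes and the threshold factor below are made up for illustration):

```python
def ewma_anomaly(volumes, alpha=0.3, factor=5.0):
    """Flag a day whose volume exceeds `factor` times the exponentially
    weighted moving average of the days before it."""
    flags, avg = [], volumes[0]
    for v in volumes[1:]:
        flags.append(v > factor * avg)
        avg = alpha * v + (1 - alpha) * avg  # exponential smoothing
    return flags

# Hypothetical daily download volumes in MB for a marketing intern.
daily_mb = [180, 210, 190, 205, 195, 512_000]  # last day: ~500 GB
print(ewma_anomaly(daily_mb))  # [False, False, False, False, True]
```

The smoothing lets the baseline adapt slowly to legitimate growth in a user's workload while a sudden 500 GB spike still stands out immediately.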

(4). Unsupervised Learning:

This is the instrument for setting the final ground truth. It sorts all of the data into clusters in order to find the outliers, effectively learning what normal looks like without labelled examples.

It includes models like K-Means, DBSCAN, hierarchical clustering and, notably, Isolation Forest.

Things to Note Before Implementing UEBA

(1). Data Quality:

UEBA as a defence system is only as good as the data on which it was built. The data should be clean, realistic and gathered through a solid pipeline. Bad data automatically means bad UEBA.

(2). Cold Start:

As previously explained, UEBA needs a significant amount of time to gather knowledge and establish its baseline. In the first few weeks after deployment, signals will be noisy and results unreliable, but accuracy improves over time.

(3). Concept Drift:

Companies are not static; roles change and so do policies. It is therefore recommended that UEBA models be retrained after major organizational changes in order to maintain accuracy.

(4). Signature-based and human-in-the-loop:

UEBA doesn't replace signature-based threat detection, it enriches it. The system does not replace human analysts as well, it empowers them. Human hunters investigate, confirm or refute, and provide feedback that closes the loop and improves future detection. This symbiosis is essential.

Conclusion

There are two options for the hypothetical museum in this article: it can add more guards and maintain its rule-based system, or it can fundamentally rethink its approach, shifting from reactive detection to proactive behaviour monitoring.

The same applies to the security of your company's data: UEBA offers flexibility, dynamic response and cautious proactiveness. It is currently one of the strongest defence mechanisms against cyber attacks.
