
Sulaiman Olubiyi
Building a RAG-Based AWS VPC Flow Log Analyzer

Understanding network traffic inside a Virtual Private Cloud (VPC) directly impacts your security posture, performance visibility, and compliance readiness. Yet most teams still sift through raw flow logs manually, reacting to incidents instead of proactively investigating them.

Rather than grepping through thousands of log lines or exporting data to spreadsheets, we can turn VPC Flow Logs into an interactive layer.

What if you could simply ask your logs questions like this?

Was that SSH connection rejected?
Which IP keeps hitting port 443?
Is this traffic normal or a problem?

In this article, we’ll build a Retrieval-Augmented Generation (RAG) powered VPC Flow Log Analyzer that turns static network telemetry into an interactive security assistant.

The Challenge of Manual Log Analysis

AWS VPC Flow Logs capture essential information about network traffic. Yet, analysing these raw logs to detect threats like SQL injection attempts or unauthorised access presents significant challenges:

  • Information Overload: The sheer volume of logs is overwhelming. Finding specific patterns or anomalies is like searching for a needle in a haystack.

  • Context Fragmentation: Raw logs lack context. Identifying related packets across different components and time frames is labour-intensive and error-prone.

The RAG-based VPC Flow Log Analyzer is built with:

  • Streamlit (interactive UI)
  • LangChain (RAG orchestration)
  • Chroma (vector database)
  • OpenAI GPT-4o (reasoning engine)

At the end, you'll have a conversational security assistant capable of answering questions like:

  • “Which IPs were rejected?”
  • “Was there unusual traffic to port 22?”
  • “Which destinations received the most packets?”

RAG Workflow

Functional Components

  1. Data Ingestion & Transformation ("Translator")
    Raw VPC Flow Logs are just strings of numbers and IPs (e.g., 2 123... 443 6 ACCEPT).
Component: A custom Python parser.
    It "hydrates" the logs, turning them into human-readable sentences like "Source 10.0.1.5 sent 1000 bytes to Port 443 and was ACCEPTED." This makes it much easier for the AI to "understand" the relationship between data points.

  2. Embedding Model ("Encoder")
    We can't search text mathematically, so we have to turn it into numbers (vectors).
    Component: OpenAI text-embedding-3-small.
    It creates a numerical "fingerprint" for every log line. Similar events (like multiple SSH brute-force attempts) will have similar numerical fingerprints, allowing for "fuzzy" or semantic searching.

  3. Vector Database ("Memory")
    Standard databases search for exact words; a vector DB searches for meaning.
    Component: ChromaDB.
    It stores thousands of these "fingerprints" locally. When you ask a question, it instantly finds the top 10 or 15 log entries that are most relevant to your specific query.

  4. RAG Orchestration & LLM ("Brain")
    This is where the actual "chatting" happens.
    Component: LangChain + GPT-4o.

LangChain takes the question, grabs the relevant logs from ChromaDB, and hands them both to GPT-4o with a set of instructions: "You are a security engineer; tell me what happened here."

  5. Streamlit Frontend ("Cockpit")
    Component: Streamlit Web Framework.
    It provides the UI for uploading .txt files, managing your API key via .env, and the chat interface, so you don't have to touch a terminal to investigate your network.
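The "translator" step can be sketched in a few lines. This is a minimal illustration, not the repo's actual parser: the sample record follows the AWS default 14-field flow log format, and the function and variable names are my own.

```python
# Hydrate a default-format VPC Flow Log record (14 space-separated fields)
# into a human-readable sentence the LLM can reason about.
PROTOCOLS = {"6": "TCP", "17": "UDP", "1": "ICMP"}

def hydrate(line: str) -> str:
    (version, account, eni, src, dst, srcport, dstport,
     proto, packets, nbytes, start, end, action, status) = line.split()
    proto_name = PROTOCOLS.get(proto, f"protocol {proto}")
    return (f"Source {src}:{srcport} sent {packets} packets "
            f"({nbytes} bytes) over {proto_name} to {dst}:{dstport} "
            f"and was {action}.")

# Sample record from the AWS default format (SSH traffic that was accepted)
record = ("2 123456789010 eni-1235b8ca 172.31.16.139 172.31.16.21 "
          "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK")
print(hydrate(record))
# Source 172.31.16.139:20641 sent 20 packets (4249 bytes) over TCP
# to 172.31.16.21:22 and was ACCEPT.
```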
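To see why vector search beats keyword search, here is a toy stand-in for the encoder and memory components. The real app uses text-embedding-3-small and ChromaDB; this pure-Python version uses bag-of-words vectors and cosine similarity only to illustrate the retrieval mechanics.

```python
import math
from collections import Counter

# Toy "embedding": a bag-of-words vector (real embeddings capture semantics
# far better; this only demonstrates how nearest-neighbour retrieval works).
def embed(text: str) -> Counter:
    return Counter(text.lower().replace(".", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hydrated log sentences acting as our tiny "knowledge base"
logs = [
    "Source 10.0.1.5 sent 1000 bytes to port 443 and was ACCEPTED.",
    "Source 203.0.113.9 sent 60 bytes to port 22 and was REJECTED.",
    "Source 10.0.2.7 sent 500 bytes to port 80 and was ACCEPTED.",
]
index = [(log, embed(log)) for log in logs]

# Retrieve the log entry closest to the analyst's question
query = embed("which connection to port 22 was rejected")
best = max(index, key=lambda item: cosine(query, item[1]))[0]
print(best)  # the REJECTED port-22 entry is the closest match
```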
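Conceptually, the orchestration step is simple: LangChain takes the retrieved logs and stuffs them into a prompt before calling GPT-4o. Stripped of the framework, it looks roughly like this (function name and prompt wording are illustrative, not the repo's actual code):

```python
# Assemble the chat messages a RAG chain hands to the model:
# system instructions + retrieved context + the user's question.
def build_prompt(question: str, retrieved_logs: list[str]) -> list[dict]:
    context = "\n".join(f"- {log}" for log in retrieved_logs)
    return [
        {"role": "system",
         "content": "You are a security engineer. Answer using only the "
                    "VPC Flow Log entries provided as context."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

messages = build_prompt(
    "Was that SSH connection rejected?",
    ["Source 203.0.113.9 sent 60 bytes to port 22 and was REJECTED."],
)
print(messages[1]["content"])
```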

Steps involved in the implementation:

Check out the codebase on GitHub

Step 1: Creating Virtual Environment and Installing Dependencies

git clone https://github.com/Damdev-95/rag_aws_flow_logs

cd rag_aws_flow_logs

python -m venv venv

source venv/bin/activate

pip install -r requirements.txt


Workspace Code

Step 2: Configuration handling

Sensitive data, such as the OpenAI API key, is handled through environment variables instead of being hard-coded:

ENV_API_KEY = os.getenv("OPENAI_API_KEY")
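For the line above to work outside Streamlit Cloud, the key has to be loaded into the environment first. A minimal sketch, assuming the app uses python-dotenv to read a local .env file (check the repo for the exact loader):

```python
import os

# Optionally load variables from a local .env file
# (requires python-dotenv; skipped gracefully if not installed).
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

# Never hard-code the key; always read it from the environment.
ENV_API_KEY = os.getenv("OPENAI_API_KEY")
print("OPENAI_API_KEY is set:", ENV_API_KEY is not None)
```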

Step 3: Running the Streamlit App

streamlit run app.py

Web Application

  • Click 'Browse files' to upload log files to the application; ensure the log file is in .txt format.

browse file

  • Select "Build Knowledge Base" to store the raw log data in the vector database after it has been converted into vectors.

vector data

  • The index is created successfully after the embedding process
    Index

  • We are live! The first question I asked:
    What is the summary of the flow logs based on traffic accept and reject

Demo

Additional examples of queries with interaction

nice examples

final

Stay tuned for more RAG and Generative AI projects in cloud networking in my upcoming articles.

I look forward to your comments.
