Mr Arindam Bhattacharjee

I Built an AI Security Scanner in an Afternoon. Here's What I Actually Learned

I'll be straight with you - I wasn't planning to build this. I stumbled across a project prompt about using Gemini to scan Python code for security vulnerabilities, and my first thought was "that sounds gimmicky." An LLM doing security analysis? Seems like exactly the kind of thing that would hallucinate a SQL injection where there isn't one, or miss an obvious hardcoded password.

Turns out I was partially wrong. The tool works better than I expected, and the places where it fell short taught me more about prompt engineering than the places it worked perfectly. So here's the full write-up.


What We're Building

A CLI tool that takes Python code - either a snippet or a real .py file - sends it to Gemini, and gets back a structured security report with severity ratings. Color-coded. In your terminal. Think of it as a lightweight version of what tools like Bandit or Snyk do, except you build it yourself in an afternoon and actually understand what's happening under the hood.

The stack is simple: Python, the google-generativeai SDK, python-dotenv for key management, and colorama for the colored output. No databases, no web server, no frontend. Just a script you can point at any Python file.


Why Gemini for This?

Fair question. Pattern-matching tools like Bandit are faster and don't require an API call. But they miss context. A regex rule that flags md5 will catch hashlib.md5(password), but it'll also flag hashlib.md5(file_content), which isn't necessarily a security issue at all. Gemini understands the context around the code, which matters more than people give it credit for.

That said, AI-powered scanning is a supplement, not a replacement. I'd run Gemini analysis alongside a static tool, not instead of one.


Setting Up (Don't Skip the Virtual Environment)

I know everyone says this and half the tutorials skip it anyway. Don't skip it. It takes 30 seconds and saves you from the "I broke my system Python" conversation.

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

pip install google-generativeai python-dotenv colorama

Get your Gemini API key from aistudio.google.com. The free tier gives you 1,500 requests per day - more than enough for development. Store it in a .env file:

GOOGLE_API_KEY=your-key-here

Add .env to your .gitignore before you do anything else. API key leaks are embarrassingly common. Google will revoke your key the moment their systems detect it in a public repo.

Quick sanity check to confirm you're connected:

import os
from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

model = genai.GenerativeModel("gemini-2.5-flash")
response = model.generate_content("Say hello if you can hear me.")
print(response.text)

If Gemini responds, you're in. That's the hardest part done.


The Three Vulnerabilities We're Targeting

Before writing the scanner logic, let's talk about what we're actually looking for.

SQL Injection is when user input gets concatenated directly into a SQL query. Instead of treating the input as data, the database treats it as code. An attacker types ' OR '1'='1 into a login form and suddenly they're authenticated as everyone. It's been in the OWASP Top 10 for as long as the list has existed, and it still shows up in production code all the time.

Hardcoded secrets are passwords and API keys stored directly in source code. The problem isn't just that someone reads your code: it's that secrets end up in git history, in build artifacts, in CI logs. Once they're committed, they're effectively public even if you delete them later.

Weak cryptography in this case means using MD5 to hash passwords. MD5 is fast, which sounds like a good thing until you realise that fast hashing is exactly what attackers want. Modern GPUs can compute billions of MD5 hashes per second. Password hashing should be slow - use bcrypt or Argon2.

Here's the intentionally bad test code to scan:

# SQL Injection
def get_user(username):
    query = "SELECT * FROM users WHERE username = '" + username + "'"
    cursor.execute(query)
    return cursor.fetchone()

# Hardcoded credentials
DATABASE_PASSWORD = "supersecret123"
API_KEY = "sk-1234567890abcdef"

# Weak hashing
import hashlib
def hash_password(password):
    return hashlib.md5(password.encode()).hexdigest()

The Prompt - This Is Where Most Tutorials Fall Short

Here's what I've learned about using LLMs for structured output: vague prompts get vague answers. If you just say "analyse this code for security issues," Gemini will write you an essay. Useful for learning, useless for a tool.

What you want is a rigid format that you can parse and colour-code:

security_prompt = """
You are a security expert. Analyse this code for vulnerabilities.

For each issue, use this exact format:

---
SEVERITY: [CRITICAL/HIGH/MEDIUM/LOW]
TYPE: [Vulnerability Name]
DESCRIPTION: [One sentence explaining the issue]
IMPACT: [One sentence on potential damage]
FIX: [Code snippet only]
---

Be concise. No preamble.

Code:
{code}
"""

The {code} placeholder gets replaced when you call security_prompt.format(code=your_code). The explicit format instruction is what makes the output parseable - without it, you get prose that's hard to highlight programmatically.

A quick word on severity levels:

  • CRITICAL : exploitable right now, fix before your next commit
  • HIGH : serious, fix within days
  • MEDIUM : real issue but needs specific conditions to exploit
  • LOW : minor, schedule it
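Once the output follows that rigid format, turning it into structured data is a short parsing job. Here's a minimal sketch that splits on the `---` delimiters and reads each `KEY: value` line into a dict (it assumes single-line values, so a multi-line FIX snippet would only keep its first line):

```python
EXPECTED_KEYS = {"SEVERITY", "TYPE", "DESCRIPTION", "IMPACT", "FIX"}

def parse_findings(report: str) -> list[dict]:
    findings = []
    # Each finding sits between "---" delimiters per the prompt's format.
    for block in report.split("---"):
        finding = {}
        for line in block.strip().splitlines():
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            if key.strip() in EXPECTED_KEYS:
                finding[key.strip()] = value.strip()
        if "SEVERITY" in finding:
            findings.append(finding)
    return findings
```

With findings as dicts you can sort by severity, count issues per file, or emit JSON instead of relying on string replacement for display.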

Adding Color-Coded Output

Install colorama if you haven't:

pip install colorama

Then add this function to map severity labels to terminal colors:

from colorama import init, Fore, Style
init(autoreset=True)

def add_colors_to_output(text):
    text = text.replace("SEVERITY: CRITICAL", f"SEVERITY: {Fore.RED}{Style.BRIGHT}CRITICAL{Style.RESET_ALL}")
    text = text.replace("SEVERITY: HIGH", f"SEVERITY: {Fore.YELLOW}{Style.BRIGHT}HIGH{Style.RESET_ALL}")
    text = text.replace("SEVERITY: MEDIUM", f"SEVERITY: {Fore.BLUE}MEDIUM{Style.RESET_ALL}")
    text = text.replace("SEVERITY: LOW", f"SEVERITY: {Fore.GREEN}LOW{Style.RESET_ALL}")
    return text

And your main scan loop looks like this:

for label, code in [
    ("SQL Injection", vulnerable_code_1),
    ("Hardcoded Credentials", vulnerable_code_2),
    ("Weak Cryptography", vulnerable_code_3),
]:
    print(f"\n{'=' * 50}")
    print(f"Analysing: {label}")
    print('=' * 50)
    response = model.generate_content(security_prompt.format(code=code))
    print(add_colors_to_output(response.text))

Gemini correctly flagged all three. SQL injection got CRITICAL. The hardcoded credentials got HIGH. Weak MD5 hashing got HIGH. The fixes it suggested were accurate: parameterized queries, environment variables, and bcrypt respectively.
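To see why the parameterized-query fix actually works, here's a self-contained sketch using sqlite3 (an in-memory database stands in for the real one; the placeholder syntax is `?` for sqlite3, `%s` for most other drivers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

def get_user(username):
    # The "?" placeholder sends the input as data, never as SQL,
    # so the classic ' OR '1'='1 payload is just a weird username.
    cursor = conn.execute(
        "SELECT * FROM users WHERE username = ?", (username,)
    )
    return cursor.fetchone()
```

A legitimate lookup still works, while the injection payload simply matches no rows instead of bypassing the WHERE clause.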


Taking It Further: Scanning Real Files

The hardcoded snippets are fine for testing, but the tool only becomes useful when you can point it at actual files. Adding that is simpler than you'd think:

import sys

def scan_file(filepath):
    try:
        with open(filepath, 'r') as f:
            code = f.read()

        print(f"\n{'=' * 60}")
        print(f"Scanning: {filepath}")
        print('=' * 60)

        response = model.generate_content(security_prompt.format(code=code))
        print(add_colors_to_output(response.text))

    except FileNotFoundError:
        print(f"{Fore.RED}Error: File '{filepath}' not found.{Style.RESET_ALL}")

if len(sys.argv) > 1:
    scan_file(sys.argv[1])
else:
    print("Usage: python scanner.py <filepath>")

Now you can run python scanner.py any_file.py and get a full report. I created a vuln.py test file with SQL injection, hardcoded secrets, weak hashing, AND a command injection vulnerability (using os.system() with unsanitized user input), and Gemini caught all four.


Honest Assessment: Where It Works and Where It Doesn't

Where it works well: context-dependent issues. A variable named password holding a string literal, for example - Gemini gets that immediately. It's also good at generating the fix code, not just identifying the problem.

Where it struggles: False confidence. If you pass it clean code, it sometimes still finds something to complain about. It doesn't know your codebase - it doesn't know that cursor is already using a properly configured parameterized query driver, for example. Treat every finding as a suggestion, not a verdict.

What it won't replace: Static analysis tools with knowledge of your specific framework, dependency scanning, runtime analysis. This is a first-pass reviewer, not a full security audit.


What I'd Build Next

A few directions worth exploring if you want to take this further:

  • Directory scanning - loop through all .py files in a folder and produce a combined report
  • JSON output - pipe results into other tools or dashboards
  • CI/CD integration - run the scanner automatically on every pull request and fail the build on CRITICAL findings
  • Side-by-side diffs - show original code next to the suggested fix
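The directory-scanning idea is mostly a file-gathering loop. A minimal sketch (the commented scan_file call assumes the function from the file-scanning section above):

```python
from pathlib import Path

def collect_python_files(root):
    # Recursively gather every .py file under root,
    # skipping anything inside a virtual environment.
    return sorted(
        p for p in Path(root).rglob("*.py")
        if "venv" not in p.parts
    )

# Each collected file could then be fed to the scanner in turn:
# for path in collect_python_files("src"):
#     scan_file(str(path))
```

One thing to watch: looping over a large codebase burns through the free tier's daily quota quickly, so batching files into fewer requests or adding a file-count cap is worth considering.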

The CI/CD angle is the most practically useful. If this scanner runs on every commit, it catches things before they ever make it to code review.


Wrapping Up

This took me about 90 minutes start to finish, including the time I spent fighting with my virtual environment activation on Windows and reading through Gemini's API docs. The end result is a tool I'll actually use - not as my only security check, but as a quick sanity pass when I'm reviewing my own code before pushing.

The bigger takeaway for me was the prompt engineering piece. Getting Gemini to output structured, parseable text consistently is a skill, and it transfers to any other LLM-powered tool you build. The model is capable. The prompt is the bottleneck.

If you build this, try pointing it at some old code you wrote a few years ago. You might be surprised what it finds.


The full source is on my GitHub. Stack: Python 3.11, google-generativeai, python-dotenv, colorama.
