DEV Community

Michael Smith

LiteLLM Malware Attack: My Minute-by-Minute Response



⚠️ Disclosure: This article describes a real security incident I experienced. Tool recommendations include affiliate links, but all assessments are based on actual use during the incident response. Your mileage may vary.


TL;DR

In early 2026, a critical supply chain vulnerability was discovered in LiteLLM, a widely-used open-source LLM proxy library. My infrastructure was affected. This article walks through my minute-by-minute incident response — what I did right, what I did wrong, and the specific tools and processes that helped me contain the damage. If you run LiteLLM in production, read this before your next deployment.


Key Takeaways

  • Act fast, isolate first: The first 15 minutes are about containment, not diagnosis.
  • Supply chain attacks are increasingly targeting AI/ML tooling — LiteLLM's popularity made it a high-value target.
  • Dependency pinning and SBOMs (Software Bill of Materials) would have reduced my exposure significantly.
  • Runtime monitoring caught the anomaly — static security scanning alone would have missed it.
  • Incident response plans need to explicitly cover AI infrastructure, not just traditional web stacks.
  • Secrets rotation is painful but non-negotiable after a compromise.

Background: What Is LiteLLM and Why Does It Matter?

If you've built anything serious with large language models in the last two years, there's a good chance LiteLLM is in your stack. It's an open-source Python library and proxy server that lets you call 100+ LLM APIs — OpenAI, Anthropic, Cohere, Mistral, and more — through a unified interface. As of early 2026, it has hundreds of thousands of weekly downloads on PyPI.

That popularity is exactly what made it a target.

The attack I experienced was a supply chain compromise — malicious code injected into a dependency of LiteLLM rather than LiteLLM itself. The malware was designed to:

  1. Exfiltrate API keys stored in environment variables
  2. Beacon out to an attacker-controlled C2 server
  3. Persist silently without crashing the application

I want to be clear: this is not a flaw in LiteLLM's core codebase. The maintainers responded quickly and professionally. This is a story about supply chain risk in the AI ecosystem — and how I responded when it hit me.

[INTERNAL_LINK: AI infrastructure security best practices]
[INTERNAL_LINK: Supply chain attacks in open source software]


The Timeline: Minute-by-Minute Response

T+0:00 — The Alert

It was 2:47 AM. My runtime monitoring stack — specifically Datadog with a custom anomaly detection rule — fired an alert. The rule was simple: flag any outbound network connection from my LLM proxy container to an IP not on an allowlist.

The alert read:

[CRITICAL] Unexpected egress detected
Container: litellm-proxy-prod
Destination: 185.220.xxx.xxx:443
Volume: 12KB over 3 connections in 90 seconds

That IP resolved to a Tor exit node. My stomach dropped.

What I did right: I had runtime network monitoring on my AI infrastructure. Most teams don't.

What I did wrong: The allowlist hadn't been reviewed in four months. It had three stale entries that should have been removed.
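The decision logic behind that rule is simple enough to sketch. The CIDR ranges below are placeholders, not my real allowlist, and the production check ran inside Datadog rather than in Python:

```python
import ipaddress

# Hypothetical allowlist of approved egress destinations (CIDR blocks).
# The real rule lived in Datadog; this only sketches the same decision.
EGRESS_ALLOWLIST = [
    ipaddress.ip_network("140.82.112.0/20"),  # placeholder: a provider API range
    ipaddress.ip_network("104.18.0.0/16"),    # placeholder: a CDN range
]

def egress_allowed(dst_ip: str) -> bool:
    """Return True if the destination IP falls inside an allowlisted range."""
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in net for net in EGRESS_ALLOWLIST)

print(egress_allowed("140.82.113.4"))   # allowlisted -> True
print(egress_allowed("185.220.101.1"))  # unexpected destination -> False
```

Anything returning False fires the alert. The hard part isn't the check itself; it's keeping the allowlist current, which is exactly where I fell down.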


T+0:03 — Isolation

Before I even knew what I was dealing with, I isolated the container. This is the golden rule of incident response: contain first, investigate second.

# Immediately revoke network access
docker network disconnect prod-network litellm-proxy-prod

# Capture the current state before anything changes
docker commit litellm-proxy-prod incident-snapshot-$(date +%Y%m%d%H%M%S)

I also immediately opened a private incident channel in Slack and pinged my on-call teammate. Even at 2:47 AM, you don't handle this alone.


T+0:08 — First Triage

With the container isolated, I started examining what had actually happened. I pulled logs from the last 6 hours and ran a quick diff against my known-good baseline image.

Wiz was invaluable here. Its container scanning had already flagged the compromised dependency during a routine scan 18 hours earlier — but the alert had been sitting in a queue, unacknowledged. That's on me.

The compromised package was a transitive dependency — something LiteLLM pulled in, not something I'd installed directly. It had been updated 22 hours before the incident with a version bump that looked entirely routine.

The malicious package had:

  • A legitimate-looking changelog
  • Passing CI tests (the malicious code was conditionally executed only in production-like environments)
  • A 6-month history of legitimate releases before the compromise

This is textbook supply chain attack sophistication.
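Once an advisory names the compromised release, checking your own environment against it takes a few lines of standard library. The package name and version in KNOWN_BAD below are invented placeholders; the real advisory identifies the actual compromised release:

```python
from importlib.metadata import distributions

# Hypothetical advisory data: package name -> known-compromised version.
# Replace with the real advisory's entries before running in anger.
KNOWN_BAD = {"some-transitive-dep": "2.4.1"}

def find_compromised(installed: dict[str, str], known_bad: dict[str, str]) -> list[str]:
    """Return packages whose installed version matches a known-bad release."""
    return [name for name, ver in installed.items() if known_bad.get(name) == ver]

# Build the installed-package map from the current environment:
installed = {(d.metadata["Name"] or "").lower(): d.version for d in distributions()}
print(find_compromised(installed, KNOWN_BAD))
```

Run this inside the suspect container image (or its forensic snapshot), not on your laptop, since transitive dependency trees differ per environment.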


T+0:15 — Scope Assessment

This is where I had to ask the hard question: What did the attacker potentially access?

My LiteLLM proxy had access to:

| Credential | Exposure Risk | Action Taken |
| --- | --- | --- |
| OpenAI API key | HIGH — in env var | Rotated immediately |
| Anthropic API key | HIGH — in env var | Rotated immediately |
| Internal database connection string | MEDIUM — read-only user | Rotated, audit logs pulled |
| Redis connection string | MEDIUM | Rotated |
| AWS IAM role | LOW — instance role, no static keys | Reviewed CloudTrail |

The 12KB of exfiltrated data was consistent with the size of my environment variables block. I had to assume full compromise of every credential the container could access.
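A rough way to enumerate that blast radius is to scan environment variable names for credential-like patterns. This is a heuristic sketch (the name patterns are my own assumption, and it reports variable names only, never values):

```python
import os
import re

# Name fragments that usually indicate a credential. Anything matching
# has to be treated as compromised once the container is suspect.
SECRET_PATTERN = re.compile(r"(KEY|TOKEN|SECRET|PASSWORD|CONN|DSN)", re.IGNORECASE)

def suspect_credentials(env: dict[str, str]) -> list[str]:
    """Return env var names that look like credentials (names only, no values)."""
    return sorted(name for name in env if SECRET_PATTERN.search(name))

# Against the live environment this enumerates the worst-case blast radius:
print(suspect_credentials(dict(os.environ)))
```

The output of this is effectively your rotation checklist: every name on it needs a new value and an audit of the old one.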


T+0:22 — Secrets Rotation Hell

I won't sugarcoat this: rotating secrets across multiple services at 3 AM while your production AI features are down is miserable. This is where having a proper secrets management system pays for itself.

I was using HashiCorp Vault for some credentials but — embarrassingly — still had several API keys hardcoded in .env files that were baked into the container image. Those were the ones that got exfiltrated.

Rotation order I followed:

  1. Highest blast radius first — OpenAI and Anthropic keys (these could rack up thousands in charges if abused)
  2. Database credentials — pulled audit logs to check for unauthorized queries before rotating
  3. Cache/queue credentials — lower risk but still rotated
  4. Internal service tokens — checked for anomalous usage patterns first

By T+0:45, all credentials were rotated. OpenAI and Anthropic's dashboards both showed the old keys as revoked. I set up usage alerts on the new keys immediately — something I should have had from day one.

[INTERNAL_LINK: Secrets management for AI applications]


T+0:47 — Forensic Preservation

With secrets rotated and production partially restored via a clean container image, I shifted to forensics. I wanted to understand exactly what happened, both for my own learning and for any potential disclosure obligations.

Tools I used at this stage:

  • Sysdig — Syscall-level forensics from the container. This showed me exactly which processes made the outbound connections and what data was in the network buffer.
  • Wireshark on a packet capture I'd started at T+0:05 — the connections were TLS-encrypted, so I couldn't read the payload, but I could see the timing and volume.
  • pip's install logs — traced exactly when the malicious package version was installed.

The forensic picture that emerged:

22:31 UTC — Malicious package version installed via pip during routine container rebuild
22:31–02:46 UTC — Malware dormant (likely fingerprinting environment)
02:46 UTC — Three rapid outbound connections, ~4KB each
02:47 UTC — Datadog alert fired
02:47 UTC — Manual isolation initiated

The malware sat dormant for more than four hours. That's not accidental: it's designed to avoid correlation with the deployment event.
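The dormancy window falls straight out of the timestamps above (the calendar date below is illustrative, since the timeline only records times of day; the window crosses midnight UTC):

```python
from datetime import datetime, timedelta

# Timestamps from the forensic timeline, all UTC. The date is a placeholder;
# only the times of day come from the incident record.
installed = datetime(2026, 1, 15, 22, 31)    # malicious version installed
first_beacon = datetime(2026, 1, 16, 2, 46)  # first outbound connection

dormancy = first_beacon - installed
print(dormancy)  # 4:15:00
```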


T+1:30 — Communication and Disclosure

By 4:17 AM, I had a clear enough picture to start communicating. I notified:

  1. My team — Full incident summary in the Slack channel
  2. LiteLLM maintainers — Opened a private security disclosure via their GitHub security advisory process. They responded within 2 hours.
  3. Affected API providers — Both OpenAI and Anthropic have security contact channels. I notified them of the key compromise even though I'd already rotated.
  4. Legal/compliance — We handle some enterprise customer data, so I looped in our counsel to assess notification obligations.

What I did not do: Post publicly on social media or in LiteLLM's public GitHub issues before coordinating with maintainers. Responsible disclosure matters.


T+2:00 — Root Cause and Prevention

By 5 AM, the immediate crisis was over. Production was restored on a clean, pinned image. All credentials were rotated. The forensic snapshot was preserved.

The root cause was straightforward: I was not pinning transitive dependencies.

My requirements.txt had:

litellm>=1.x.x

It should have been:

litellm==1.x.x
# And ideally, a pip-compile generated requirements.txt with all transitive deps pinned
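A minimal audit for this class of mistake is to flag any requirement line that allows version drift. This sketch only checks surface syntax; a pip-compile or Poetry lockfile with every transitive dependency pinned is the real fix:

```python
import re

# Specifiers that permit version drift (anything other than a full '==' pin).
UNPINNED = re.compile(r"(>=|<=|~=|>|<|\*)")

def unpinned_requirements(lines: list[str]) -> list[str]:
    """Return requirement lines that do not pin an exact version."""
    flagged = []
    for line in lines:
        spec = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if spec and ("==" not in spec or UNPINNED.search(spec)):
            flagged.append(spec)
    return flagged

reqs = ["litellm>=1.0.0", "httpx==0.27.0", "# a comment", "pydantic"]
print(unpinned_requirements(reqs))  # ['litellm>=1.0.0', 'pydantic']
```

Note that pinning your direct dependencies alone wouldn't have saved me here; the compromise was transitive, which is why the lockfile has to cover the whole tree.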

What I Changed After the Incident

Dependency Security

| Change | Tool | Status |
| --- | --- | --- |
| Pin all direct dependencies | pip-compile / Poetry | ✅ Done |
| Generate SBOM for every build | Anchore | ✅ Done |
| Automated dependency scanning in CI | Snyk | ✅ Done |
| Private PyPI mirror with allowlisted packages | Cloudsmith | ✅ Done |

Runtime Security

| Change | Tool | Status |
| --- | --- | --- |
| Network egress allowlisting (reviewed monthly) | Datadog / iptables | ✅ Done |
| Syscall filtering via seccomp profiles | Docker seccomp | ✅ Done |
| Runtime anomaly detection | Sysdig Secure | ✅ Done |
| Immutable container filesystem (read-only) | Docker --read-only | ✅ Done |

Secrets Management

| Change | Tool | Status |
| --- | --- | --- |
| Zero static secrets in container images | HashiCorp Vault | ✅ Done |
| Automatic secrets rotation (30-day cycle) | Vault dynamic secrets | ✅ Done |
| API key usage alerts | OpenAI / Anthropic dashboards | ✅ Done |

Honest Assessment: What Tools Actually Helped

I want to be direct about what worked and what didn't, because I'm recommending these tools with affiliate links and you deserve an honest take.

Datadog — Genuinely caught the attack. The custom egress alerting rule is what woke me up. That said, setting it up correctly took me a full day of work and it's expensive for small teams. Worth it if you're running production AI infrastructure.

Wiz — Caught the vulnerability but I missed the alert. That's a process failure, not a tool failure. Wiz's container scanning is excellent. The alert fatigue problem is real — you need a process to action these alerts, not just receive them.

HashiCorp Vault — Saved me on the credentials I'd migrated to it. The credentials still in .env files were the ones that got stolen. Lesson learned the hard way.

Snyk — Good for known CVEs, missed the zero-day. This is a limitation of any signature-based scanning tool. Snyk is still worth running — it catches a lot — but it's not a complete solution.


The Bigger Picture: AI Infrastructure Security in 2026

The LiteLLM incident isn't isolated. In the past 18 months, we've seen supply chain attacks targeting:

  • Python packages used in ML pipelines
  • Hugging Face model weights with embedded malicious code
  • LangChain dependencies
  • Vector database client libraries

The AI ecosystem has a security maturity gap. Teams that would never deploy a web application without a WAF, dependency scanning, and secrets management are running LLM infrastructure with .env files and unpinned dependencies.

If you're running AI in production, treat your LLM proxy with the same security rigor as your payment processing service. The API keys it holds can generate significant financial liability, and the data it processes may be sensitive.

[INTERNAL_LINK: Securing LLM applications in production]
[INTERNAL_LINK: AI red teaming and security testing]


Frequently Asked Questions

Q: Was LiteLLM itself compromised, or was this a dependency attack?

A: This was a supply chain attack via a transitive dependency — not a vulnerability in LiteLLM's core code. The LiteLLM maintainers responded responsibly and issued guidance within hours of disclosure. Always check the official LiteLLM security advisories for current status.

Q: How do I know if my LiteLLM deployment was affected?

A: Check your outbound network logs for connections to unexpected IP addresses from your LiteLLM container, particularly during the window of the known compromise. Audit your API key usage dashboards (OpenAI, Anthropic, etc.) for unexpected spikes. If in doubt, rotate your credentials — it's painful but it's the right call.

Q: Should I stop using LiteLLM?

A: No — not based on this incident. Supply chain attacks can target any popular open-source project. The right response is to improve your security posture around any LLM tooling you use, including dependency pinning, runtime monitoring, and proper secrets management. LiteLLM remains one of the best tools in its category.

Q: What's the single most important change I can make right now?

A: Pin your dependencies and get your API keys out of environment variables and into a proper secrets manager. If you can only do one thing today, audit where your LLM API keys live and ensure you have usage alerts configured on every key.

Q: How long did full recovery take?

A: The immediate incident was contained within 2 hours. Full recovery — including forensics, all communications, process improvements, and infrastructure hardening — took about two weeks of part-time work. The documentation and post-mortem alone took 3 days.


What to Do Right Now

If you're running LiteLLM or any LLM proxy in production, here's your immediate action list:

  1. Audit your secrets — Are API keys in .env files or container images? Move them to a secrets manager today.
  2. Pin your dependencies — Run pip-compile or switch to Poetry with a lockfile.
  3. Enable egress monitoring — Even a simple allowlist with alerting is dramatically better than nothing.
  4. Set up API key usage alerts — OpenAI, Anthropic, and most providers offer this for free.
  5. Write an incident response plan — Even a one-page document is better than improvising at 3 AM.

The security fundamentals that protect your web application protect your AI infrastructure too. The difference is that AI infrastructure often holds keys to services that can generate significant financial and reputational liability — which makes it a higher-value target than most teams realize.

If this article helped you, consider sharing it with your engineering team. Supply chain security in AI is a collective problem, and awareness is the first line of defense.

[INTERNAL_LINK: AI security checklist for engineering teams]


Last updated: March 2026. Security landscape evolves rapidly — always verify current CVEs and advisories through official channels.
