<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ogulcan Aydogan</title>
    <description>The latest articles on DEV Community by Ogulcan Aydogan (@ogulcanaydogan).</description>
    <link>https://dev.to/ogulcanaydogan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3760770%2F8f92b63b-fbcd-4632-8c9a-6a926a2de915.jpeg</url>
      <title>DEV Community: Ogulcan Aydogan</title>
      <link>https://dev.to/ogulcanaydogan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ogulcanaydogan"/>
    <language>en</language>
    <item>
      <title>Claude API Breaks Complex Code Generation</title>
      <dc:creator>Ogulcan Aydogan</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:25:55 +0000</pubDate>
      <link>https://dev.to/ogulcanaydogan/claude-api-breaks-complex-code-generation-nfm</link>
      <guid>https://dev.to/ogulcanaydogan/claude-api-breaks-complex-code-generation-nfm</guid>
      <description>&lt;h1&gt;
  
  
  Claude API Breaks Complex Code Generation
&lt;/h1&gt;

&lt;p&gt;Here's something nobody wants to hear: the AI coding tool you've been relying on just got worse overnight.&lt;/p&gt;

&lt;p&gt;I first noticed it three weeks ago while building out a new feature for Renderica. I was trying to get Claude 3 Opus to generate a complex ComfyUI workflow configuration that would handle batch processing for our FLUX 1.0-dev architectural rendering pipeline. The kind of nested JSON structure that Claude used to nail in one shot.&lt;/p&gt;

&lt;p&gt;Instead, I got broken syntax. Missing brackets. Logic that made no sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Anthropic's February 2024 updates have significantly degraded Claude's ability to handle complex engineering tasks, breaking workflows that developers have built around the Claude API. If you're shipping production code with Claude, you need a backup plan right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Regression Nobody's Talking About
&lt;/h2&gt;

&lt;p&gt;The problem isn't subtle. When I say Claude's code generation has degraded, I'm not talking about edge cases or nitpicking. I'm talking about fundamental failures on tasks that worked perfectly two months ago.&lt;/p&gt;

&lt;p&gt;Take this example. I asked Claude API to help me refactor a PostgreSQL query optimization for Tahminbaz, our sports prediction engine. Previously, Claude would understand the performance implications of different index strategies and suggest EXPLAIN ANALYZE approaches that actually made sense.&lt;/p&gt;

&lt;p&gt;Now? It suggests creating indexes on columns that don't exist. It recommends query patterns that would tank performance on any serious dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What Claude API suggested in February 2024
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;optimize_match_predictions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;team_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="c1"&gt;# This index suggestion doesn't exist in our schema
&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
 SELECT * FROM predictions p
 WHERE p.nonexistent_column = ANY(%s)
 ORDER BY p.confidence_score
 &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
 &lt;span class="c1"&gt;# Missing proper parameterization, would cause SQL injection
&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;team_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't just wrong. It's dangerously wrong.&lt;/p&gt;

&lt;p&gt;The old Claude would never have suggested raw string interpolation for a database query in 2024.&lt;/p&gt;
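
&lt;p&gt;For contrast, here's a minimal sketch of the safe shape for that query - psycopg2-style placeholders, with the driver binding the values. The table and column names are placeholders, not our real Tahminbaz schema:&lt;/p&gt;

```python
# Hedged sketch: driver-side parameterization instead of string
# interpolation. Schema names here are illustrative only.

def build_match_predictions_query(team_ids):
    """Return (sql, params); the IDs travel as a bound parameter,
    so there is no string interpolation to inject through."""
    sql = (
        "SELECT p.match_id, p.team_id, p.confidence_score "
        "FROM predictions p "
        "WHERE p.team_id = ANY(%s) "
        "ORDER BY p.confidence_score DESC"
    )
    # psycopg2 adapts a Python list to a Postgres array for ANY(...)
    return sql, ([int(t) for t in team_ids],)

sql, params = build_match_predictions_query(["3", 7])
```

&lt;p&gt;With a psycopg2 cursor you'd then run &lt;code&gt;cursor.execute(sql, params)&lt;/code&gt; and let the driver handle escaping.&lt;/p&gt;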

&lt;h2&gt;
  
  
  When AI Coding Tools Become Unreliable
&lt;/h2&gt;

&lt;p&gt;I learned this the hard way while working on Dialoque's multi-language voice processing pipeline. The system handles Turkish audio, English responses, and Arabic customer service calls. The audio processing chain is genuinely complex - multiple Whisper model instances running in parallel. Custom TTS integration too. Real-time streaming that can't drop packets.&lt;/p&gt;

&lt;p&gt;In January, I could paste a Python traceback into Claude API and get back working fixes. The model understood context across multiple files, remembered the constraints of our FastAPI architecture, and suggested changes that actually compiled.&lt;/p&gt;

&lt;p&gt;But something changed after Anthropic's February updates.&lt;/p&gt;

&lt;p&gt;Now Claude suggests fixes that break other parts of the system. It recommends async patterns that would deadlock our event loop. It forgets that we're running on specific hardware constraints and suggests memory-intensive operations that would crash our A100 GPU instances.&lt;/p&gt;

&lt;p&gt;Here's a real example from last week:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Claude's February suggestion for audio streaming
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_audio_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="c1"&gt;# This would load the entire model into memory repeatedly
&lt;/span&gt; &lt;span class="n"&gt;whisper_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;whisper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="c1"&gt;# This blocking call would freeze the entire FastAPI server
&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;whisper_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The old Claude understood that loading Whisper models is expensive. It knew to suggest model caching. Proper async handling. Memory management that doesn't kill your GPU.&lt;/p&gt;

&lt;p&gt;This new version treats every request like it's running in isolation.&lt;/p&gt;
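
&lt;p&gt;What the pre-February suggestions got right can be sketched in a few lines: cache the model once, and push the blocking transcription off the event loop. The model class below is a stand-in so the sketch runs anywhere; in the real pipeline it would be &lt;code&gt;whisper.load_model("large-v2")&lt;/code&gt;:&lt;/p&gt;

```python
import asyncio
from functools import lru_cache

LOAD_COUNT = 0  # proves the model is loaded exactly once

class FakeWhisperModel:
    """Stand-in for the real Whisper model; transcribe() is a
    blocking, GPU-heavy call in the actual library."""
    def transcribe(self, audio_data):
        return {"text": f"transcribed {len(audio_data)} bytes"}

@lru_cache(maxsize=1)
def get_model(name):
    # Real pipeline: return whisper.load_model(name) -- expensive,
    # so cache it instead of reloading per request.
    global LOAD_COUNT
    LOAD_COUNT += 1
    return FakeWhisperModel()

async def process_audio_stream(audio_data):
    model = get_model("large-v2")
    # Offload the blocking call to a worker thread so the event
    # loop (e.g. a FastAPI server) keeps serving other requests.
    result = await asyncio.to_thread(model.transcribe, audio_data)
    return result["text"]

async def main():
    return await asyncio.gather(
        process_audio_stream(b"abc"),
        process_audio_stream(b"defgh"),
    )

texts = asyncio.run(main())
```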

&lt;h2&gt;
  
  
  The Impact on Production Workflows
&lt;/h2&gt;

&lt;p&gt;When your AI coding assistant becomes unreliable, it doesn't just slow you down. It actively makes your code worse.&lt;/p&gt;

&lt;p&gt;I've been tracking my development velocity across different projects since January. The numbers aren't pretty. Tasks that used to take 30 minutes with Claude API now take 2 hours because I spend most of that time debugging the suggestions it gives me.&lt;/p&gt;

&lt;p&gt;For Renderica's image processing pipeline, I was trying to optimize our FLUX 1.0-dev inference times. Claude suggested a batching approach that looked reasonable at first glance. But when I implemented it, our GPU memory usage spiked to dangerous levels.&lt;/p&gt;

&lt;p&gt;The old Claude would have considered our hardware constraints. Would have suggested streaming approaches instead.&lt;/p&gt;

&lt;p&gt;And this creates a trust problem. How do you know which Claude suggestions are good and which ones will break your system?&lt;/p&gt;

&lt;p&gt;You can't.&lt;/p&gt;

&lt;p&gt;So you end up manually verifying everything, which defeats the entire purpose of using AI coding tools in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Switching Costs and Alternatives
&lt;/h2&gt;

&lt;p&gt;So what are the alternatives? GPT-4 has its own issues but it's been more consistent lately for complex engineering tasks. The code it generates isn't always elegant, but it tends to be correct. Secure too.&lt;/p&gt;

&lt;p&gt;I've also been experimenting with local models for sensitive projects. Running Code Llama 34B on our own infrastructure gives us control over model versions and removes the uncertainty of cloud provider updates breaking our workflows.&lt;/p&gt;

&lt;p&gt;But switching isn't free.&lt;/p&gt;

&lt;p&gt;My entire development setup was built around Claude API integration. VS Code extensions, custom scripts that parse Claude's responses, automated code review processes that expect Claude's specific output format. Moving to a different model means rebuilding all of that infrastructure.&lt;/p&gt;

&lt;p&gt;And then there's the context window issue. Claude's 200K context window was genuinely useful for understanding large codebases. When I'm working on the Turkish LLM fine-tuning project, I need to reference multiple training scripts simultaneously. Dataset preprocessing steps too. HuggingFace integration code.&lt;/p&gt;

&lt;p&gt;GPT-4's smaller context window makes this more difficult.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture: LLM Regression in Production
&lt;/h2&gt;

&lt;p&gt;This isn't just about Claude. It's about what happens when you build production systems on top of models that can change without warning.&lt;/p&gt;

&lt;p&gt;LLM regression is a real problem that nobody talks about enough.&lt;/p&gt;

&lt;p&gt;These models aren't like traditional software dependencies where you can pin to a specific version and expect consistent behavior. Even when providers claim to maintain backwards compatibility, the underlying model weights can change in ways that break your specific use cases.&lt;/p&gt;

&lt;p&gt;I've started implementing fallback strategies across all my projects. When Claude API fails to generate working code for Renderica's ComfyUI workflows, the system automatically retries with GPT-4. If both models struggle, it falls back to template-based generation for common patterns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_comfyui_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="c1"&gt;# Try Claude API first
&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;claude_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
 &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Claude API failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="c1"&gt;# Fallback to GPT-4
&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
 &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPT-4 failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="c1"&gt;# Last resort: template-based generation
&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_template_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This adds complexity, but it's necessary when you're serving real paying customers who don't care about your AI provider's model updates.&lt;/p&gt;
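
&lt;p&gt;The &lt;code&gt;validate_workflow&lt;/code&gt; step doesn't need to be clever. A cheap structural check catches the broken-syntax class of failure - here's a sketch, assuming the ComfyUI API graph format where each node carries a &lt;code&gt;class_type&lt;/code&gt; and an &lt;code&gt;inputs&lt;/code&gt; map:&lt;/p&gt;

```python
import json

def validate_workflow(raw):
    """Structural sanity check for a generated ComfyUI graph: valid
    JSON, non-empty dict, every node shaped like a node. It catches
    the 'missing brackets' failures; it does not prove the graph is
    semantically sane."""
    try:
        graph = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(graph, dict) or not graph:
        return False
    for node in graph.values():
        if not isinstance(node, dict):
            return False
        if "class_type" not in node or "inputs" not in node:
            return False
    return True

good = json.dumps({"1": {"class_type": "KSampler", "inputs": {"steps": 20}}})
```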

&lt;h2&gt;
  
  
  Testing AI Code Generation Quality
&lt;/h2&gt;

&lt;p&gt;How do you even measure whether an LLM's code generation has gotten worse? Traditional software has clear metrics: does it compile, does it pass tests, does it meet performance benchmarks.&lt;/p&gt;

&lt;p&gt;AI-generated code is trickier.&lt;/p&gt;

&lt;p&gt;It might compile and pass basic tests while still being fundamentally flawed. It might solve the immediate problem while introducing technical debt. Security vulnerabilities that won't show up until production.&lt;/p&gt;

&lt;p&gt;I've been building a test suite specifically for evaluating Claude API's code generation quality across different types of engineering tasks. Five main categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correctness&lt;/strong&gt; comes first - does the generated code actually work?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt; checks for obvious vulnerabilities like SQL injection and XSS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt; evaluation considers resource constraints and scalability issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maintainability&lt;/strong&gt; looks at code readability and structure.&lt;/p&gt;

&lt;p&gt;But the most important one is &lt;strong&gt;context awareness&lt;/strong&gt; - does it understand the broader system architecture?&lt;/p&gt;
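
&lt;p&gt;The harness itself stays small. Here's a simplified skeleton of the per-category scoring - the two checks are toy placeholders for the real tests, but the shape (one pass-rate per category, comparable run over run) is the point:&lt;/p&gt;

```python
from statistics import mean

def evaluate(samples, checks):
    """Mean pass-rate per category over a fixed sample set, so this
    month's run is directly comparable to last month's."""
    return {
        category: mean(1.0 if check(code) else 0.0 for code in samples)
        for category, check in checks.items()
    }

def compile_ok(code):
    # Cheapest possible correctness proxy: does it even parse?
    try:
        compile(code, "generated", "exec")
        return True
    except SyntaxError:
        return False

# Toy checks standing in for the real category tests.
checks = {
    "correctness": compile_ok,
    "security": lambda code: 'f"SELECT' not in code and "execute(f" not in code,
}

samples = ["def f():\n    return 1\n", "def g(:\n"]
scores = evaluate(samples, checks)
```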

&lt;p&gt;Claude's scores have dropped significantly across all categories since February.&lt;/p&gt;

&lt;p&gt;But the most concerning decline is in context awareness. The model seems to have lost its ability to reason about complex system interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for AI-First Development
&lt;/h2&gt;

&lt;p&gt;Are we too dependent on AI coding tools? Maybe.&lt;/p&gt;

&lt;p&gt;When I started building Tahminbaz last year, I structured the entire development process around Claude's capabilities. Database schema design happened through Claude conversations. API endpoint generation too. Even deployment scripts got generated by Claude first, then tweaked by hand.&lt;/p&gt;

&lt;p&gt;The assumption was that Claude would continue getting better, not worse.&lt;/p&gt;

&lt;p&gt;But that assumption was wrong. And now I'm dealing with the consequences.&lt;/p&gt;

&lt;p&gt;The solution isn't to abandon AI coding tools entirely. They're still incredibly useful for boilerplate generation, quick prototyping, and exploring new APIs. But treating them as reliable partners for complex engineering work was probably naive.&lt;/p&gt;

&lt;p&gt;I'm restructuring my development workflow to be more resilient to LLM regression. Critical system components get built the old-fashioned way, with proper design documents and manual implementation. AI tools handle the boring stuff: configuration files, test scaffolding, documentation generation.&lt;/p&gt;

&lt;p&gt;This hybrid approach is probably more sustainable anyway. But it required learning some uncomfortable lessons about the reliability of AI systems in production environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Road Forward
&lt;/h2&gt;

&lt;p&gt;Anthropic hasn't officially acknowledged the code generation regression, though several developers have reported similar issues on their community forums. It's possible this is intentional - maybe they're prioritizing safety over capability, or optimizing for different use cases.&lt;/p&gt;

&lt;p&gt;But from a developer perspective, it doesn't matter why it happened.&lt;/p&gt;

&lt;p&gt;What matters is that a tool we relied on became less reliable overnight.&lt;/p&gt;

&lt;p&gt;The fix isn't technical. It's organizational. We need better processes for handling AI tool regression. Clearer communication from providers about model changes too. More robust fallback strategies for production systems.&lt;/p&gt;

&lt;p&gt;Until then, I'll keep using Claude API for simple tasks while building backup systems for everything that matters.&lt;/p&gt;

&lt;p&gt;Because in the end, your customers don't care which AI model you're using. They just care that your product works.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: How can I tell if Claude API's code generation has degraded for my specific use case?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Build a test suite with representative examples from your domain. Run the same prompts monthly and track metrics like correctness, security, and context awareness.&lt;/p&gt;

&lt;p&gt;I use automated tests for basic functionality and manual review for architectural decisions. The key is having objective criteria rather than relying on gut feeling.&lt;/p&gt;
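
&lt;p&gt;Comparing a monthly run against your baseline can be a tiny helper - this is a hypothetical sketch, with the metric names and the drop threshold made up for illustration:&lt;/p&gt;

```python
def find_regressions(baseline, current, threshold=0.1):
    """Flag any metric that dropped by more than `threshold` since
    the baseline run -- objective criteria instead of gut feeling."""
    return {
        metric: (baseline[metric], current[metric])
        for metric in baseline
        if metric in current and baseline[metric] - current[metric] > threshold
    }

baseline = {"correctness": 0.9, "security": 0.85, "context_awareness": 0.8}
current = {"correctness": 0.88, "security": 0.6, "context_awareness": 0.55}
flagged = find_regressions(baseline, current)
```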

&lt;p&gt;&lt;strong&gt;Q: Should I switch from Claude to GPT-4 for all coding tasks?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not necessarily. GPT-4 has its own issues and the landscape changes quickly. Instead, implement a multi-model strategy with fallbacks.&lt;/p&gt;

&lt;p&gt;Use Claude for tasks where it still performs well. GPT-4 for others. Local models for sensitive work. The switching cost is high, so be strategic about when and how you migrate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do I protect my production systems from future LLM regressions?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Never depend on a single AI provider for critical functionality. Build validation layers that catch obviously wrong suggestions before they reach production.&lt;/p&gt;

&lt;p&gt;Implement graceful fallbacks to template-based generation or manual processes. And most importantly, maintain enough in-house expertise to debug and fix issues when AI tools fail.&lt;/p&gt;




&lt;p&gt;The uncomfortable truth about AI coding tools is that they're still unreliable at scale. Claude's regression is just the latest reminder that building production systems on top of rapidly changing AI models requires careful risk management.&lt;/p&gt;

&lt;p&gt;We're still in the early days of this technology. The growing pains are real.&lt;/p&gt;

</description>
      <category>claudeapi</category>
      <category>codegeneration</category>
      <category>llmregression</category>
      <category>aicodingtools</category>
    </item>
    <item>
      <title>Claude API Hits Hard After February Updates</title>
      <dc:creator>Ogulcan Aydogan</dc:creator>
      <pubDate>Tue, 07 Apr 2026 13:21:02 +0000</pubDate>
      <link>https://dev.to/ogulcanaydogan/claude-api-hits-hard-after-february-updates-5607</link>
      <guid>https://dev.to/ogulcanaydogan/claude-api-hits-hard-after-february-updates-5607</guid>
      <description>&lt;h1&gt;
  
  
  Claude API Hits Hard After February Updates
&lt;/h1&gt;

&lt;p&gt;I've been throwing the Claude API at everything lately. My team's complex refactoring jobs, architectural decisions, debugging those nasty edge cases in my Turkish LLM fine-tuning pipeline. Last month I was using it to review pull requests that our junior devs were struggling with. It became my default tool for anything that required actual thinking.&lt;/p&gt;

&lt;p&gt;But something broke in February. Hard.&lt;/p&gt;

&lt;p&gt;The broken kind that makes you wonder if you've been coding on quicksand. I'd been relying on Claude 3.5 Sonnet for months to solve complex engineering problems. Then it started spitting out code that looked clean and logical but crashed hard when I'd actually run it. Just yesterday I spent three hours debugging what should have been a straightforward database migration script that Claude generated. Really frustrating stuff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Anthropic's February 2024 updates to Claude have wrecked its ability to handle complex coding tasks. Engineers who'd built the Claude API into their dev workflows are getting burned badly. I've watched three different teams in my company hit the same wall this month. The regression hits hardest in multi-file refactoring, system design, and context-heavy debugging work. I've seen it firsthand and it's not pretty.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Breaking Point: When Production Code Fails
&lt;/h2&gt;

&lt;p&gt;When I was building Renderica's latest FLUX 1.0-dev integration, I hit a wall that shouldn't have existed.&lt;/p&gt;

&lt;p&gt;I'd been using the Claude API to help refactor our ComfyUI workflow management system. Nothing too exotic - just consolidating three separate GPU queue managers into a single, more efficient service.&lt;/p&gt;

&lt;p&gt;Pre-February Claude would've nailed this. It understood the threading implications. It caught the race conditions. Hell, it even suggested better error handling patterns.&lt;/p&gt;

&lt;p&gt;Post-February Claude? It generated code that compiled fine but deadlocked our A100 GPUs within minutes of deployment.&lt;/p&gt;

&lt;p&gt;This wasn't a one-off. I started seeing similar issues across my other projects. The Dialoque voice AI platform's multi-language routing logic got completely mangled when I asked Claude to help optimize the Whisper transcription pipeline. Then the sports probability engine for Tahminbaz started throwing SQLAlchemy errors that made no sense until I realized Claude had suggested using session management patterns that were fundamentally broken.&lt;/p&gt;

&lt;p&gt;But here's the thing that really got me: the code looked perfect.&lt;/p&gt;

&lt;p&gt;Clean variable names, proper documentation, seemingly logical flow. It's like Claude had learned to write beautiful code that doesn't work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Changed in the Claude API
&lt;/h2&gt;

&lt;p&gt;Nobody talks about this enough, but LLM regression isn't just about benchmark scores dropping. It's about the subtle ways that model behavior shifts in ways that break your actual workflows.&lt;/p&gt;

&lt;p&gt;The February updates to Claude API seem to have introduced several specific issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context window handling degraded significantly.&lt;/strong&gt; Claude used to maintain coherent understanding across large codebases. Now it seems to lose track of important details after about 15-20k tokens. And yeah, I know the official context window hasn't changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code generation became more syntactically correct but semantically wrong.&lt;/strong&gt; This is genuinely dangerous because it's harder to catch. When an AI generates obviously broken code, you fix it immediately. When it generates code that runs but fails in edge cases? You might not discover the problem until production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-step reasoning fell apart.&lt;/strong&gt; Complex refactoring tasks that involve understanding dependencies across multiple files now produce solutions that work in isolation but break the broader system.&lt;/p&gt;

&lt;p&gt;I learned this the hard way when Claude API suggested an "optimization" to my Turkish LLM training pipeline that would've corrupted the model checkpoints. The suggested code was syntactically perfect Python 3.12, properly typed, well-documented. But it fundamentally misunderstood how HuggingFace Transformers handles gradient accumulation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Impact on AI Coding Tools
&lt;/h2&gt;

&lt;p&gt;So what does this mean for those of us who've integrated Claude API deep into our development workflows?&lt;/p&gt;

&lt;p&gt;First, you can't just roll back. Anthropic doesn't maintain multiple versions of their API endpoints the way OpenAI does. You're stuck with whatever the current model produces. Regression or not.&lt;/p&gt;

&lt;p&gt;Second, the regression isn't consistent across all coding tasks. Simple functions still work fine. The Claude API can still generate decent utility scripts. Handle basic debugging. Explain code clearly.&lt;/p&gt;

&lt;p&gt;But anything involving system-level thinking or complex interdependencies has become unreliable.&lt;/p&gt;

&lt;p&gt;I've been tracking this across my projects for the past month. Here's what I've noticed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure code took the biggest hit.&lt;/strong&gt; Kubernetes manifests, Docker configurations, CI/CD pipelines - all areas where Claude API used to excel. Now it regularly produces configurations that fail in non-obvious ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database-related code became particularly problematic.&lt;/strong&gt; Not just SQL generation (which was always hit-or-miss) but the more complex stuff. Connection pooling. Transaction management. ORM configuration. My PostgreSQL work on Tahminbaz required significantly more manual review after February.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontend logic stayed relatively stable.&lt;/strong&gt; React components, Vue templates, basic JavaScript - these seemed largely unaffected by whatever changed in the Claude API updates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternative Strategies That Actually Work
&lt;/h2&gt;

&lt;p&gt;When your primary AI coding tool stops being reliable, you adapt fast or you fall behind.&lt;/p&gt;

&lt;p&gt;I've been experimenting with several approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-4 Turbo for complex architectural decisions.&lt;/strong&gt; Yeah, I know, switching between AI providers feels like giving up. But honestly? I was surprised by how much better GPT-4 Turbo handles system-level reasoning right now. The code quality isn't quite as clean as pre-February Claude, but it actually works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local models for sensitive refactoring.&lt;/strong&gt; I've been running Code Llama 34B locally for anything involving proprietary code or complex business logic. It's slower. Requires more manual prompting. But at least I can control the model version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid approaches work better than pure AI generation.&lt;/strong&gt; Instead of asking the Claude API to generate complete solutions, I've started using it for smaller, more focused tasks. Generate a single function. Explain a specific error. Suggest optimization approaches without implementing them.&lt;/p&gt;

&lt;p&gt;But there's a deeper problem here. How do you maintain confidence in AI coding tools when the underlying models can regress without warning?&lt;/p&gt;

&lt;h2&gt;
  
  
  The February Anthropic Updates: What We Know
&lt;/h2&gt;

&lt;p&gt;Anthropic hasn't been particularly transparent about what changed in their February updates. The official changelog mentions "improved safety measures" and "enhanced reasoning capabilities."&lt;/p&gt;

&lt;p&gt;Neither of which explains why code generation quality dropped so dramatically.&lt;/p&gt;

&lt;p&gt;Based on conversations with other engineers and my own testing, I suspect the updates included more aggressive constitutional AI filtering. This might explain why the Claude API now produces more "conservative" code that looks safer but often misses the clever optimizations. Or handles edge cases poorly.&lt;/p&gt;

&lt;p&gt;There's also evidence that the training data mix changed. The model seems less familiar with newer library versions. More likely to suggest deprecated approaches. Less aware of current best practices.&lt;/p&gt;

&lt;p&gt;My work on CNCF projects has been particularly affected - Claude API now regularly suggests Kubernetes patterns that were outdated by v1.28.&lt;/p&gt;

&lt;p&gt;The context handling issues are harder to explain. Maybe the attention mechanism changed. Maybe the fine-tuning process introduced biases toward shorter, more isolated responses. But something fundamental shifted in how Claude API processes long-form technical conversations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Working Around Claude API Limitations
&lt;/h2&gt;

&lt;p&gt;Here's what I've learned about making the current Claude API work for complex coding tasks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Break everything into smaller chunks.&lt;/strong&gt; Instead of asking for complete system refactors, request individual components. Then manually integrate them while checking for consistency issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always specify exact library versions.&lt;/strong&gt; Claude API seems much more reliable when you're explicit about dependencies. "Using FastAPI 0.104.1 with Python 3.12" produces better results than just "FastAPI."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Include more context about the broader system.&lt;/strong&gt; The model's ability to infer missing context has clearly degraded. You need to be more explicit about how the code you're requesting fits into the larger architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test everything immediately.&lt;/strong&gt; This sounds obvious, but pre-February Claude was reliable enough that you could often use its code with minimal verification.&lt;/p&gt;

&lt;p&gt;Not anymore.&lt;/p&gt;

&lt;p&gt;And sometimes, honestly, it's faster to just write the code yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Production AI Workflows
&lt;/h2&gt;

&lt;p&gt;The Claude API regression raises uncomfortable questions about building production systems that depend on external AI services.&lt;/p&gt;

&lt;p&gt;We're not just talking about the obvious vendor lock-in issues. We're talking about performance regressions that can appear overnight, without warning, in ways that break your development workflows.&lt;/p&gt;

&lt;p&gt;How do you plan engineering capacity when your primary coding assistant becomes 40% less useful? How do you maintain code quality when you can no longer trust AI-generated solutions to handle complex edge cases?&lt;/p&gt;

&lt;p&gt;I've started treating AI coding tools more like junior developers than senior consultants. Useful for specific tasks. Requiring careful review. Not suitable for critical system design decisions.&lt;/p&gt;

&lt;p&gt;But that's a significant shift from how I was using the Claude API six months ago.&lt;/p&gt;

&lt;p&gt;And it makes me wonder whether the whole "AI-assisted development" paradigm is more fragile than we want to admit. The real test isn't whether these tools work well when they're working well. It's whether they degrade gracefully when the underlying models change.&lt;/p&gt;

&lt;p&gt;So far, that answer seems to be no.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can I access previous versions of Claude API that worked better for coding?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No, Anthropic doesn't provide version pinning like OpenAI does with their model snapshots. You're always using their latest production model, regressions and all. This is one of the biggest operational issues with building on Claude API compared to other providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are other developers seeing similar coding quality drops after February?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, extensively. The pattern seems consistent across different types of coding tasks, though the impact varies by use case. Infrastructure and systems programming took the biggest hit, while simpler scripting tasks remained relatively stable. Several engineering teams I know have started incorporating additional review processes specifically for AI-generated code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the best alternative to Claude API for complex coding tasks right now?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPT-4 Turbo has been more reliable for system-level reasoning, though the code style is different. For sensitive or proprietary work, local models like Code Llama 34B provide more control. But honestly, the best approach might be using multiple tools in combination rather than depending on any single AI coding assistant for complex tasks.&lt;/p&gt;

&lt;p&gt;The February Claude API updates represent a broader challenge in AI-powered development: the tools we depend on can change overnight in ways that fundamentally alter their usefulness. As engineers, we need to build our workflows with this instability in mind, not assume that today's AI capabilities will remain consistent tomorrow.&lt;/p&gt;

</description>
      <category>claudeapi</category>
      <category>codegeneration</category>
      <category>llmregression</category>
      <category>anthropicupdates</category>
    </item>
    <item>
      <title>Claude API Hits Hard After February Updates</title>
      <dc:creator>Ogulcan Aydogan</dc:creator>
      <pubDate>Tue, 07 Apr 2026 11:34:04 +0000</pubDate>
      <link>https://dev.to/ogulcanaydogan/claude-api-hits-hard-after-february-updates-2hm5</link>
      <guid>https://dev.to/ogulcanaydogan/claude-api-hits-hard-after-february-updates-2hm5</guid>
      <description>&lt;h1&gt;
  
  
  Claude API Hits Hard After February Updates
&lt;/h1&gt;

&lt;p&gt;I've been throwing the Claude API at everything lately. My team's complex refactoring jobs, architectural decisions, debugging those nasty edge cases in my Turkish LLM fine-tuning pipeline. Last month I was using it to review pull requests that our junior devs were struggling with. It became my default tool for anything that required actual thinking.&lt;/p&gt;

&lt;p&gt;But something broke in February. Hard.&lt;/p&gt;

&lt;p&gt;The broken kind that makes you wonder if you've been coding on quicksand. I'd been relying on Claude 3.5 Sonnet for months to solve complex engineering problems. Then it started spitting out code that looked clean and logical but crashed hard when I'd actually run it. Just yesterday I spent three hours debugging what should have been a straightforward database migration script that Claude generated. Really frustrating stuff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Anthropic's February 2024 updates to Claude have wrecked its ability to handle complex coding tasks. Engineers who'd built the Claude API into their dev workflows are getting burned badly. I've watched three different teams in my company hit the same wall this month. The regression hits hardest in multi-file refactoring, system design, and context-heavy debugging work. I've seen it firsthand and it's not pretty.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Breaking Point: When Production Code Fails
&lt;/h2&gt;

&lt;p&gt;When I was building Renderica's latest FLUX 1.0-dev integration, I hit a wall that shouldn't have existed.&lt;/p&gt;

&lt;p&gt;I'd been using the Claude API to help refactor our ComfyUI workflow management system. Nothing too exotic - just consolidating three separate GPU queue managers into a single, more efficient service.&lt;/p&gt;

&lt;p&gt;Pre-February Claude would've nailed this. It understood the threading implications. It caught the race conditions. Hell, it even suggested better error handling patterns.&lt;/p&gt;

&lt;p&gt;Post-February Claude? It generated code that compiled fine but deadlocked our A100 GPUs within minutes of deployment.&lt;/p&gt;

&lt;p&gt;This wasn't a one-off. I started seeing similar issues across my other projects. The Dialoque voice AI platform's multi-language routing logic got completely mangled when I asked Claude to help optimize the Whisper transcription pipeline. Then the sports probability engine for Tahminbaz started throwing SQLAlchemy errors that made no sense until I realized Claude had suggested using session management patterns that were fundamentally broken.&lt;/p&gt;

&lt;p&gt;But here's the thing that really got me: the code looked perfect.&lt;/p&gt;

&lt;p&gt;Clean variable names, proper documentation, seemingly logical flow. It's like Claude had learned to write beautiful code that doesn't work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Changed in the Claude API
&lt;/h2&gt;

&lt;p&gt;Nobody talks about this enough, but LLM regression isn't just about benchmark scores dropping. It's about the subtle ways that model behavior shifts in ways that break your actual workflows.&lt;/p&gt;

&lt;p&gt;The February updates to Claude API seem to have introduced several specific issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context window handling degraded significantly.&lt;/strong&gt; Claude used to maintain coherent understanding across large codebases. Now it seems to lose track of important details after about 15-20k tokens. And yeah, I know the official context window hasn't changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code generation became more syntactically correct but semantically wrong.&lt;/strong&gt; This is genuinely dangerous because it's harder to catch. When an AI generates obviously broken code, you fix it immediately. When it generates code that runs but fails in edge cases? You might not discover the problem until production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-step reasoning fell apart.&lt;/strong&gt; Complex refactoring tasks that involve understanding dependencies across multiple files now produce solutions that work in isolation but break the broader system.&lt;/p&gt;

&lt;p&gt;I learned this the hard way when Claude API suggested an "optimization" to my Turkish LLM training pipeline that would've corrupted the model checkpoints. The suggested code was syntactically perfect Python 3.12, properly typed, well-documented. But it fundamentally misunderstood how HuggingFace Transformers handles gradient accumulation.&lt;/p&gt;
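&lt;p&gt;For anyone who hasn't hit this failure mode: the usual trap with gradient accumulation is scaling. If you accumulate over N micro-batches, each micro-batch loss has to be divided by N, or the effective gradient comes out N times too large. Here's a toy sketch of exactly that pitfall with a one-parameter model; everything here is illustrative, none of it is the real HuggingFace API:&lt;/p&gt;

```python
import math

# Toy 1-parameter model: gradient of squared error 0.5 * (w*x - y)^2 w.r.t. w.
def grad(w, x, y):
    return (w * x - y) * x

# Reference: average gradient over the whole batch at once.
def full_batch_grad(w, batch):
    return sum(grad(w, x, y) for x, y in batch) / len(batch)

# Trainer-style accumulation loop. The crucial detail: each micro-batch
# gradient must be divided by the number of accumulation steps, or the
# accumulated result is `steps` times too large.
def accumulated_grad(w, batch, micro_size, scale_loss=True):
    steps = len(batch) // micro_size
    total = 0.0
    for i in range(steps):
        micro = batch[i * micro_size:(i + 1) * micro_size]
        g = sum(grad(w, x, y) for x, y in micro) / micro_size
        total += g / steps if scale_loss else g
    return total

batch = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (4.0, 0.5)]
reference = full_batch_grad(0.7, batch)

# Scaled accumulation matches the full-batch gradient...
print(math.isclose(reference, accumulated_grad(0.7, batch, 2, scale_loss=True)))   # True
# ...while the unscaled version is exactly steps (here 2x) too large.
print(math.isclose(2 * reference, accumulated_grad(0.7, batch, 2, scale_loss=False)))  # True
```

The code Claude suggested got the equivalent of that `scale_loss` detail wrong, which is precisely the kind of bug that runs cleanly and corrupts your checkpoints anyway.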

&lt;h2&gt;
  
  
  The Real Impact on AI Coding Tools
&lt;/h2&gt;

&lt;p&gt;So what does this mean for those of us who've integrated Claude API deep into our development workflows?&lt;/p&gt;

&lt;p&gt;First, you can't just roll back. Anthropic doesn't maintain multiple versions of their API endpoints the way OpenAI does. You're stuck with whatever the current model produces. Regression or not.&lt;/p&gt;

&lt;p&gt;Second, the regression isn't consistent across all coding tasks. Simple functions still work fine. The Claude API can still generate decent utility scripts. Handle basic debugging. Explain code clearly.&lt;/p&gt;

&lt;p&gt;But anything involving system-level thinking or complex interdependencies has become unreliable.&lt;/p&gt;

&lt;p&gt;I've been tracking this across my projects for the past month. Here's what I've noticed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure code took the biggest hit.&lt;/strong&gt; Kubernetes manifests, Docker configurations, CI/CD pipelines - all areas where Claude API used to excel. Now it regularly produces configurations that fail in non-obvious ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database-related code became particularly problematic.&lt;/strong&gt; Not just SQL generation (which was always hit-or-miss) but the more complex stuff. Connection pooling. Transaction management. ORM configuration. My PostgreSQL work on Tahminbaz required significantly more manual review after February.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontend logic stayed relatively stable.&lt;/strong&gt; React components, Vue templates, basic JavaScript - these seemed largely unaffected by whatever changed in the Claude API updates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternative Strategies That Actually Work
&lt;/h2&gt;

&lt;p&gt;When your primary AI coding tool stops being reliable, you adapt fast or you fall behind.&lt;/p&gt;

&lt;p&gt;I've been experimenting with several approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-4 Turbo for complex architectural decisions.&lt;/strong&gt; Yeah, I know, switching between AI providers feels like giving up. But honestly? I was surprised by how much better GPT-4 Turbo handles system-level reasoning right now. The code quality isn't quite as clean as pre-February Claude, but it actually works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local models for sensitive refactoring.&lt;/strong&gt; I've been running Code Llama 34B locally for anything involving proprietary code or complex business logic. It's slower. Requires more manual prompting. But at least I can control the model version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid approaches work better than pure AI generation.&lt;/strong&gt; Instead of asking the Claude API to generate complete solutions, I've started using it for smaller, more focused tasks. Generate a single function. Explain a specific error. Suggest optimization approaches without implementing them.&lt;/p&gt;

&lt;p&gt;But there's a deeper problem here. How do you maintain confidence in AI coding tools when the underlying models can regress without warning?&lt;/p&gt;

&lt;h2&gt;
  
  
  The February Anthropic Updates: What We Know
&lt;/h2&gt;

&lt;p&gt;Anthropic hasn't been particularly transparent about what changed in their February updates. The official changelog mentions "improved safety measures" and "enhanced reasoning capabilities."&lt;/p&gt;

&lt;p&gt;Neither of which explains why code generation quality dropped so dramatically.&lt;/p&gt;

&lt;p&gt;Based on conversations with other engineers and my own testing, I suspect the updates included more aggressive constitutional AI filtering. This might explain why the Claude API now produces more "conservative" code that looks safer but often misses the clever optimizations. Or handles edge cases poorly.&lt;/p&gt;

&lt;p&gt;There's also evidence that the training data mix changed. The model seems less familiar with newer library versions. More likely to suggest deprecated approaches. Less aware of current best practices.&lt;/p&gt;

&lt;p&gt;My work on CNCF projects has been particularly affected - Claude API now regularly suggests Kubernetes patterns that were outdated by v1.28.&lt;/p&gt;

&lt;p&gt;The context handling issues are harder to explain. Maybe the attention mechanism changed. Maybe the fine-tuning process introduced biases toward shorter, more isolated responses. But something fundamental shifted in how Claude API processes long-form technical conversations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Working Around Claude API Limitations
&lt;/h2&gt;

&lt;p&gt;Here's what I've learned about making the current Claude API work for complex coding tasks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Break everything into smaller chunks.&lt;/strong&gt; Instead of asking for complete system refactors, request individual components. Then manually integrate them while checking for consistency issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always specify exact library versions.&lt;/strong&gt; Claude API seems much more reliable when you're explicit about dependencies. "Using FastAPI 0.104.1 with Python 3.12" produces better results than just "FastAPI."&lt;/p&gt;
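&lt;p&gt;I ended up wrapping this in a small helper so the pins can't be forgotten. The function and the prompt wording below are my own convention, not anything Anthropic-specific:&lt;/p&gt;

```python
# Hypothetical prompt-pinning helper; the version numbers come from my
# own stack and the wording is just what has worked for me.
PINNED_VERSIONS = {
    "Python": "3.12",
    "FastAPI": "0.104.1",
    "SQLAlchemy": "2.0.23",
}

def build_prompt(task, versions=PINNED_VERSIONS):
    # Prepend an explicit environment pin to every coding request.
    pins = ", ".join(f"{name} {ver}" for name, ver in versions.items())
    return (
        f"Target environment: {pins}. "
        f"Use only APIs that exist in these exact versions; "
        f"do not suggest deprecated alternatives.\n\n{task}"
    )

print(build_prompt("Write a FastAPI endpoint that streams a CSV export."))
```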

&lt;p&gt;&lt;strong&gt;Include more context about the broader system.&lt;/strong&gt; The model's ability to infer missing context has clearly degraded. You need to be more explicit about how the code you're requesting fits into the larger architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test everything immediately.&lt;/strong&gt; This sounds obvious, but pre-February Claude was reliable enough that you could often use its code with minimal verification.&lt;/p&gt;

&lt;p&gt;Not anymore.&lt;/p&gt;

&lt;p&gt;And sometimes, honestly, it's faster to just write the code yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Production AI Workflows
&lt;/h2&gt;

&lt;p&gt;The Claude API regression raises uncomfortable questions about building production systems that depend on external AI services.&lt;/p&gt;

&lt;p&gt;We're not just talking about the obvious vendor lock-in issues. We're talking about performance regressions that can appear overnight, without warning, in ways that break your development workflows.&lt;/p&gt;

&lt;p&gt;How do you plan engineering capacity when your primary coding assistant becomes 40% less useful? How do you maintain code quality when you can no longer trust AI-generated solutions to handle complex edge cases?&lt;/p&gt;

&lt;p&gt;I've started treating AI coding tools more like junior developers than senior consultants. Useful for specific tasks. Requiring careful review. Not suitable for critical system design decisions.&lt;/p&gt;

&lt;p&gt;But that's a significant shift from how I was using the Claude API six months ago.&lt;/p&gt;

&lt;p&gt;And it makes me wonder whether the whole "AI-assisted development" paradigm is more fragile than we want to admit. The real test isn't whether these tools work well when they're working well. It's whether they degrade gracefully when the underlying models change.&lt;/p&gt;

&lt;p&gt;So far, that answer seems to be no.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can I access previous versions of Claude API that worked better for coding?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No, Anthropic doesn't provide version pinning like OpenAI does with their model snapshots. You're always using their latest production model, regressions and all. This is one of the biggest operational issues with building on Claude API compared to other providers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are other developers seeing similar coding quality drops after February?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, extensively. The pattern seems consistent across different types of coding tasks, though the impact varies by use case. Infrastructure and systems programming took the biggest hit, while simpler scripting tasks remained relatively stable. Several engineering teams I know have started incorporating additional review processes specifically for AI-generated code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the best alternative to Claude API for complex coding tasks right now?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GPT-4 Turbo has been more reliable for system-level reasoning, though the code style is different. For sensitive or proprietary work, local models like Code Llama 34B provide more control. But honestly, the best approach might be using multiple tools in combination rather than depending on any single AI coding assistant for complex tasks.&lt;/p&gt;

&lt;p&gt;The February Claude API updates represent a broader challenge in AI-powered development: the tools we depend on can change overnight in ways that fundamentally alter their usefulness. As engineers, we need to build our workflows with this instability in mind, not assume that today's AI capabilities will remain consistent tomorrow.&lt;/p&gt;

</description>
      <category>claudeapi</category>
      <category>codegeneration</category>
      <category>llmregression</category>
      <category>anthropicupdates</category>
    </item>
    <item>
      <title>I Spent 3 Months Solving a Security Gap Nobody Talks About: LLM Artifact Integrity</title>
      <dc:creator>Ogulcan Aydogan</dc:creator>
      <pubDate>Thu, 19 Feb 2026 23:26:24 +0000</pubDate>
      <link>https://dev.to/ogulcanaydogan/i-spent-3-months-solving-a-security-gap-nobody-talks-about-llm-artifact-integrity-6co</link>
      <guid>https://dev.to/ogulcanaydogan/i-spent-3-months-solving-a-security-gap-nobody-talks-about-llm-artifact-integrity-6co</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1558494949-ef010cbdcc31%3Fw%3D800" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1558494949-ef010cbdcc31%3Fw%3D800" alt="Supply chain security" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Last year I was debugging a production incident where a system prompt had been changed without anyone noticing. The model started giving weird responses, and it took us two days to figure out that someone had pushed a "minor" prompt tweak that completely changed the tone and safety behaviour of the system.&lt;/p&gt;

&lt;p&gt;That's when it hit me: we spend enormous effort signing container images and validating SBOMs. But the actual AI components (the prompts, the training data configs, the eval benchmarks) flow through our pipelines with zero integrity verification.&lt;/p&gt;

&lt;p&gt;So I built a tool to fix that. This is how I built it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gap That Bugged Me
&lt;/h2&gt;

&lt;p&gt;I work with Kubernetes, Terraform, and CI/CD pipelines daily. Tools like Sigstore, SLSA, and in-toto have made traditional software supply-chain security really solid. But when I looked at how my team handled LLM artifacts, it was basically the wild west.&lt;/p&gt;

&lt;p&gt;Think about what goes into a production LLM system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompts&lt;/strong&gt; that define the model's personality and safety boundaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training corpora&lt;/strong&gt; or RAG document sets that ground the model's knowledge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation benchmarks&lt;/strong&gt; that prove the model meets quality bars&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing configurations&lt;/strong&gt; that decide which model handles which request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SLO definitions&lt;/strong&gt; that set latency, cost, and error budgets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these are just files sitting in git repos or S3 buckets. None of them get the cryptographic treatment we give to a Docker image. Any of them could be tampered with, and nobody would know until something breaks in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I called it &lt;code&gt;llmsa&lt;/code&gt; (LLM Supply-Chain Attestation). It's a Go CLI that creates typed cryptographic attestations for those five artifact categories.&lt;/p&gt;

&lt;p&gt;The concept is simple enough: for each artifact type, the tool reads the relevant files, computes SHA-256 digests, bundles them with metadata into a statement, signs the whole thing with a DSSE envelope, and stores it. Later, at verification time, it recomputes all the digests and checks that nothing changed.&lt;/p&gt;
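&lt;p&gt;In Python terms, the attest-and-verify idea boils down to something like this. The field names are illustrative, not llmsa's actual statement schema, and real statements additionally get wrapped in a signed DSSE envelope:&lt;/p&gt;

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def sha256_digest(path):
    # Content digest of one artifact file.
    h = hashlib.sha256()
    h.update(Path(path).read_bytes())
    return h.hexdigest()

def build_statement(artifact_type, files):
    # Bundle per-file digests with metadata into one statement.
    return {
        "type": artifact_type,
        "createdAt": datetime.now(timezone.utc).isoformat(),
        "subjects": [
            {"name": str(f), "digest": {"sha256": sha256_digest(f)}}
            for f in files
        ],
    }

def verify_statement(statement):
    # Verification recomputes every digest and compares.
    return all(
        sha256_digest(s["name"]) == s["digest"]["sha256"]
        for s in statement["subjects"]
    )
```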

&lt;p&gt;Here's what a typical workflow looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create attestations for each artifact type&lt;/span&gt;
llmsa attest create &lt;span class="nt"&gt;--type&lt;/span&gt; prompt &lt;span class="nt"&gt;--config&lt;/span&gt; configs/prompt.yaml
llmsa attest create &lt;span class="nt"&gt;--type&lt;/span&gt; &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="nt"&gt;--config&lt;/span&gt; configs/eval.yaml

&lt;span class="c"&gt;# Sign everything&lt;/span&gt;
llmsa sign &lt;span class="nt"&gt;--key&lt;/span&gt; key.pem

&lt;span class="c"&gt;# Verify integrity&lt;/span&gt;
llmsa verify &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Enforce policy gates&lt;/span&gt;
llmsa gate &lt;span class="nt"&gt;--policy&lt;/span&gt; policy.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each command returns semantic exit codes. 0 for pass, 12 for tamper detected, 11 for signature failure, and so on. This makes it easy to wire into any CI pipeline.&lt;/p&gt;
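&lt;p&gt;Wiring that into a pipeline step can look roughly like this. Only codes 0, 11, and 12 are the ones documented above; the fallback label is mine:&lt;/p&gt;

```python
import subprocess

# Semantic exit codes described above; anything else is a generic failure.
EXIT_CODES = {0: "pass", 11: "signature failure", 12: "tamper detected"}

def describe_exit(code):
    return EXIT_CODES.get(code, f"failed (exit {code})")

def run_verify(source="."):
    # Run `llmsa verify` and translate its exit code into a label.
    result = subprocess.run(["llmsa", "verify", "--source", source])
    return result.returncode, describe_exit(result.returncode)
```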




&lt;h2&gt;
  
  
  The Provenance Chain Was the Hard Part
&lt;/h2&gt;

&lt;p&gt;The tricky design problem wasn't individual attestations. It was the dependencies between them.&lt;/p&gt;

&lt;p&gt;Eval results are meaningless unless they reference the exact prompt and corpus versions that were tested. A routing config should only be trusted if it points to eval results that actually passed. SLO definitions should reference the routing config they were designed for.&lt;/p&gt;

&lt;p&gt;I modelled this as a directed acyclic graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eval depends on → prompt + corpus
route depends on → eval
slo depends on → route
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The verification engine checks referential integrity (does this eval actually point to an existing prompt attestation?), temporal ordering (was the prompt created before the eval that references it?), and type constraints (route can't skip eval and depend on corpus directly).&lt;/p&gt;
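&lt;p&gt;A stripped-down version of those three checks, run over an in-memory attestation map (the structure is illustrative, not llmsa's internal representation):&lt;/p&gt;

```python
# Which attestation types may depend on which, per the DAG above.
ALLOWED_DEPS = {
    "eval": {"prompt", "corpus"},
    "route": {"eval"},
    "slo": {"route"},
}

def verify_chain(attestations):
    """attestations: id -> {"type": str, "created": int, "deps": [ids]}"""
    errors = []
    for aid, att in attestations.items():
        for dep in att.get("deps", []):
            # Referential integrity: the dependency must exist.
            if dep not in attestations:
                errors.append(f"{aid}: missing dependency {dep}")
                continue
            target = attestations[dep]
            # Type constraints: e.g. route cannot skip eval.
            if target["type"] not in ALLOWED_DEPS.get(att["type"], set()):
                errors.append(f"{aid}: {att['type']} cannot depend on {target['type']}")
            # Temporal ordering: a dependency must predate its dependents.
            if target["created"] > att["created"]:
                errors.append(f"{aid}: depends on {dep}, which was created later")
    return errors
```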

&lt;p&gt;Getting this right took about a month of iteration and a lot of edge case tests.&lt;/p&gt;




&lt;h2&gt;
  
  
  Signing: Why Sigstore Changed Everything
&lt;/h2&gt;

&lt;p&gt;I initially started with plain Ed25519 keys stored as PEM files. That works fine locally, but key management in CI is painful. You need to distribute keys, rotate them, handle revocation.&lt;/p&gt;

&lt;p&gt;Then I integrated Sigstore's keyless signing. In GitHub Actions, the workflow's OIDC token is available automatically. Sigstore binds the signature to the workflow identity, so you get proof of who signed what without managing any keys.&lt;/p&gt;

&lt;p&gt;The tool falls back to PEM signing when Sigstore isn't available, so it works in air-gapped environments too. My &lt;a href="https://github.com/sigstore/cosign/pull/4710" rel="noopener noreferrer"&gt;contribution to the cosign project&lt;/a&gt; was actually related to a certificate parsing edge case I found while building this.&lt;/p&gt;




&lt;h2&gt;
  
  
  Policy Enforcement: Two Engines for Different Needs
&lt;/h2&gt;

&lt;p&gt;I built two policy engines because one wasn't enough.&lt;/p&gt;

&lt;p&gt;For simple rules like "all five attestation types must be present and signed", there's a YAML gate engine. You write declarative rules, and it evaluates them. Covers maybe 80% of real-world use cases.&lt;/p&gt;
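&lt;p&gt;The "all five types present and signed" rule amounts to about this much logic (the input shape here is my guess at what the gate evaluates, not its real YAML schema):&lt;/p&gt;

```python
# The five artifact categories the tool attests.
REQUIRED_TYPES = {"prompt", "corpus", "eval", "route", "slo"}

def gate(attestations):
    # attestations: list of {"type": str, "signed": bool} summaries.
    present = {a["type"] for a in attestations if a.get("signed")}
    return [
        f"missing or unsigned attestation: {t}"
        for t in sorted(REQUIRED_TYPES - present)
    ]
```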

&lt;p&gt;For complex rules like "eval attestations must reference corpus versions from the last 30 days" or "route changes require signatures from two different CI pipelines", there's an OPA Rego engine. It receives structured input about all the attestation results and can express arbitrary policy logic.&lt;/p&gt;

&lt;p&gt;Both engines produce the same violation format, so the rest of the pipeline doesn't care which one you use.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Kubernetes Webhook
&lt;/h2&gt;

&lt;p&gt;The last piece was deployment-time enforcement. I built a validating admission webhook that intercepts pod creation, looks up attestation bundles from an OCI registry based on the container image reference, and runs the full verification pipeline.&lt;/p&gt;

&lt;p&gt;If the attestations are missing, tampered, or violate policy, the pod doesn't get admitted. Fail-closed by default, with a fail-open option for gradual rollout.&lt;/p&gt;

&lt;p&gt;This means you can have a complete chain: artifacts are attested in CI, signed with Sigstore, pushed to an OCI registry, and verified at deployment time before any pod runs.&lt;/p&gt;
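&lt;p&gt;The deny path ultimately just returns a standard &lt;code&gt;admission.k8s.io/v1&lt;/code&gt; AdmissionReview response; the helper wrapping it below is my own sketch, not the webhook's actual code:&lt;/p&gt;

```python
import json

def admission_response(uid, allowed, reason=""):
    # Standard AdmissionReview response envelope (admission.k8s.io/v1).
    resp = {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {"uid": uid, "allowed": allowed},
    }
    if not allowed:
        # The message surfaces in kubectl output when the pod is rejected.
        resp["response"]["status"] = {"message": reason}
    return resp

denied = admission_response("abc-123", False, "attestation tamper detected")
print(json.dumps(denied["response"]))
```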




&lt;h2&gt;
  
  
  What the Numbers Look Like
&lt;/h2&gt;

&lt;p&gt;I'm a firm believer in measuring things, so here's what the test suite shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tamper detection&lt;/strong&gt;: I wrote a 20-case test suite that seeds specific corruptions (modified digests, forged signatures, wrong schemas, broken chain references). Detection rate is 20/20.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify latency&lt;/strong&gt;: p95 of 27ms for 100 statements on a standard CI runner. Not a bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Determinism&lt;/strong&gt;: Running attestation creation twice on the same inputs produces identical outputs. Important for reproducibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test coverage&lt;/strong&gt;: 85%+ across the core packages. The sign package is lower (~77%) because the Sigstore keyless path requires a live OIDC environment that can't be unit-tested.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Honest Limitations
&lt;/h2&gt;

&lt;p&gt;I want to be upfront about what this tool does NOT do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's not a runtime prompt-injection defence. It verifies integrity before deployment, not during inference.&lt;/li&gt;
&lt;li&gt;It doesn't guarantee model quality. It ensures that whatever was evaluated is what gets deployed, but the evaluation itself could be flawed.&lt;/li&gt;
&lt;li&gt;It doesn't replace threat modelling or security review. It's one control in a defence-in-depth strategy.&lt;/li&gt;
&lt;li&gt;Performance numbers are from my test setup. Your mileage will vary.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;If I started over, I'd probably build the provenance chain verification first instead of last. That turned out to be the most valuable feature: catching stale references is where most real-world integrity problems live.&lt;/p&gt;

&lt;p&gt;I'd also invest more time in the OCI distribution layer earlier. Being able to store and pull attestation bundles from the same registry as your container images makes the operational story much cleaner.&lt;/p&gt;




&lt;h2&gt;
  
  
  Give It a Try
&lt;/h2&gt;

&lt;p&gt;The whole project is open source under Apache 2.0:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/ogulcanaydogan/LLM-Supply-Chain-Attestation" rel="noopener noreferrer"&gt;LLM-Supply-Chain-Attestation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latest release&lt;/strong&gt;: &lt;a href="https://github.com/ogulcanaydogan/LLM-Supply-Chain-Attestation/releases/tag/v1.0.1" rel="noopener noreferrer"&gt;v1.0.1&lt;/a&gt; with signed binaries and SBOM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs&lt;/strong&gt;: Quickstart guide, threat model, policy guide, and architecture decision records are all in the repo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're dealing with similar problems in your LLM pipeline, or if you think I'm approaching this wrong, I'd genuinely love to hear about it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Find me on &lt;a href="https://github.com/ogulcanaydogan" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or &lt;a href="https://linkedin.com/in/ogulcanaydogan" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;. Always happy to talk about supply-chain security and LLM ops.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>go</category>
      <category>kubernetes</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I Built an AI Content Detection System from Scratch</title>
      <dc:creator>Ogulcan Aydogan</dc:creator>
      <pubDate>Thu, 19 Feb 2026 22:58:39 +0000</pubDate>
      <link>https://dev.to/ogulcanaydogan/how-i-built-an-ai-content-detection-system-from-scratch-oe4</link>
      <guid>https://dev.to/ogulcanaydogan/how-i-built-an-ai-content-detection-system-from-scratch-oe4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1677442136019-21780ecad995%3Fw%3D800" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1677442136019-21780ecad995%3Fw%3D800" alt="AI Detection Header" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few months ago, my friend sent me a LinkedIn post and asked if I thought it was written by ChatGPT. I had no idea. And that bothered me. I'm an engineer; I should be able to figure this out.&lt;/p&gt;

&lt;p&gt;So I did what any engineer would do: I went down a rabbit hole and ended up building an entire detection system. This is how it went.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Bothered
&lt;/h2&gt;

&lt;p&gt;Look, I'm not on some crusade against AI-generated content. I use LLMs daily. But there are real situations where it matters: academic submissions, journalism, legal documents, job applications. People deserve to know what they're reading.&lt;/p&gt;

&lt;p&gt;Every existing tool I tried was either behind a paywall, unreliable, or a black box. I wanted something open source that actually showed its reasoning. So I built &lt;strong&gt;&lt;a href="https://github.com/ogulcanaydogan/ai-provenance-tracker" rel="noopener noreferrer"&gt;AI Provenance Tracker&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can &lt;a href="https://provenance-detect.vercel.app" rel="noopener noreferrer"&gt;try the live demo&lt;/a&gt; if you want to skip the technical stuff.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;p&gt;I went with FastAPI for the backend and Next.js for the frontend. Nothing fancy. I wanted to get to the interesting part, which is the detection logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│              Web Interface (Next.js)             │
└─────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────┐
│              REST API (FastAPI)                  │
└─────────────────────────────────────────────────┘
                        │
          ┌─────────────┴─────────────┐
          ▼                           ▼
┌───────────────────┐     ┌───────────────────┐
│   Text Detector   │     │  Image Detector   │
│  - Perplexity     │     │  - FFT Analysis   │
│  - Burstiness     │     │  - Artifacts      │
│  - Vocabulary     │     │  - Metadata       │
└───────────────────┘     └───────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The detection side has two engines. One for text, one for images. Let me walk through both.&lt;/p&gt;
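&lt;p&gt;The split between the two engines can be sketched as a simple content-type dispatch behind the REST API. This is an illustration of the architecture above, not the actual routing code from the repo; the engine names are placeholders of mine:&lt;/p&gt;

```python
# Hypothetical sketch of how the API routes a payload to an engine.
# Engine names are placeholders, not the real module names.
ENGINES = {
    "text/plain": "text_detector",
    "image/jpeg": "image_detector",
    "image/png": "image_detector",
}

def pick_engine(content_type: str) -> str:
    """Route an incoming payload to the text or image detection engine."""
    try:
        return ENGINES[content_type]
    except KeyError:
        raise ValueError(f"unsupported content type: {content_type}")
```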




&lt;h2&gt;
  
  
  Text Detection: What Actually Works
&lt;/h2&gt;

&lt;p&gt;I tried a bunch of approaches before landing on three signals that actually hold up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Perplexity
&lt;/h3&gt;

&lt;p&gt;This one's the most intuitive. Perplexity basically measures how "surprised" a language model would be by a piece of text. AI-generated text tends to score lower because it's literally optimised to produce probable, fluent output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_perplexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;word_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;entropy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_words&lt;/span&gt;
        &lt;span class="n"&gt;entropy&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;entropy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Humans are messy writers. We use weird words, go off on tangents, make unusual word choices. AI is smoother. Almost too smooth.&lt;/p&gt;
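&lt;p&gt;A quick sanity check makes the signal concrete. Using the same frequency-based proxy (re-declared so the snippet runs standalone), maximally repetitive text scores 1, while fully varied vocabulary scores as high as the word count:&lt;/p&gt;

```python
import math
from collections import Counter

def calculate_perplexity(words: list[str]) -> float:
    # same unigram-frequency proxy as above
    word_counts = Counter(words)
    total_words = len(words)
    entropy = 0.0
    for count in word_counts.values():
        prob = count / total_words
        entropy -= prob * math.log2(prob)
    return 2 ** entropy

repetitive = ["very"] * 30                # one word, zero surprise
varied = [f"word{i}" for i in range(30)]  # thirty distinct words

print(calculate_perplexity(repetitive))   # 1.0: no uncertainty at all
print(calculate_perplexity(varied))       # ~30.0: every word is new
```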

&lt;h3&gt;
  
  
  Burstiness
&lt;/h3&gt;

&lt;p&gt;This was the surprising one. Burstiness measures how much sentence length varies in a piece of text. Turns out, AI writes like a metronome. Consistently medium-length sentences with similar complexity.&lt;/p&gt;

&lt;p&gt;Humans don't do that. We write a short punchy sentence. Then we follow it with this long, meandering thought that goes on for a while because we're trying to explain something complicated and we don't stop to restructure it. Then short again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_burstiness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;lengths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;mean_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lengths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;std_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lengths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;std_length&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;mean_length&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The coefficient of variation tells the whole story. AI text clusters around 0.2-0.3. Human text is all over the place, like 0.4, 0.5, sometimes higher.&lt;/p&gt;
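&lt;p&gt;You can see the gap with nothing but the standard library (the example sentences are mine; &lt;code&gt;pstdev&lt;/code&gt; is the population standard deviation, matching NumPy's default &lt;code&gt;np.std&lt;/code&gt;):&lt;/p&gt;

```python
from statistics import mean, pstdev

def sentence_cv(sentences: list[str]) -> float:
    """Coefficient of variation of sentence lengths in words."""
    lengths = [len(s.split()) for s in sentences]
    return pstdev(lengths) / mean(lengths)

# Metronome-like: consistently medium-length sentences
ai_like = [
    "The system processes the input data efficiently and accurately.",
    "The model generates the output text quickly and reliably today.",
    "The pipeline handles the requests smoothly and consistently now.",
]

# Short burst, then one long meandering thought, then short again
human_like = [
    "It broke.",
    "I spent the whole afternoon digging through logs trying to figure out "
    "why the batch job kept silently dropping records on the second retry.",
    "No idea why.",
]

print(round(sentence_cv(ai_like), 2))     # clusters low
print(round(sentence_cv(human_like), 2))  # much higher
```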

&lt;h3&gt;
  
  
  Vocabulary Richness
&lt;/h3&gt;

&lt;p&gt;The third signal is type-token ratio and n-gram repetition. AI has a habit of recycling phrases: seeing "it's important to note that" three times in one article is a dead giveaway. Humans vary their transitions naturally, without thinking about it.&lt;/p&gt;
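&lt;p&gt;Both checks fit in a few lines. Type-token ratio is just unique words over total words, and repeated n-grams fall straight out of a &lt;code&gt;Counter&lt;/code&gt; (a minimal sketch; the thresholds in the real detector differ):&lt;/p&gt;

```python
from collections import Counter

def type_token_ratio(words: list[str]) -> float:
    """Unique words / total words: lower means more recycled vocabulary."""
    return len(set(words)) / len(words)

def repeated_ngrams(words: list[str], n: int = 4) -> list[tuple[str, ...]]:
    """N-grams occurring more than once; phrase recycling shows up here."""
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return [g for g, c in grams.items() if c > 1]

text = (
    "it is important to note that the model works well and "
    "it is important to note that the results vary and "
    "it is important to note that testing matters"
).split()

print(round(type_token_ratio(text), 2))  # depressed by the recycled phrase
print(repeated_ngrams(text))             # the giveaway phrase surfaces
```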




&lt;h2&gt;
  
  
  Image Detection: The Frequency Domain Trick
&lt;/h2&gt;

&lt;p&gt;This part was genuinely fun to build. AI-generated images leave fingerprints that are invisible to the naked eye but show up clearly in the frequency domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  FFT Analysis
&lt;/h3&gt;

&lt;p&gt;The Fast Fourier Transform converts an image from spatial to frequency representation. Real photographs have frequency distributions shaped by optics and sensor physics. Diffusion models like Stable Diffusion produce mathematically different patterns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fft&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_frequency_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_array&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;gray&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;f_transform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fft2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gray&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;f_shift&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fftshift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f_transform&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;magnitude&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f_shift&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# AI images have unusual high-frequency distributions
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
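&lt;p&gt;The elided step is where the score comes from. One plausible way to turn the magnitude spectrum into a number is the fraction of spectral energy outside a central low-frequency window; the function name and cutoff below are my own illustration, not the repo's actual metric:&lt;/p&gt;

```python
import numpy as np

def high_freq_ratio(gray: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy outside the central low-frequency block.

    Hypothetical metric: name and cutoff are illustrative assumptions.
    """
    f_shift = np.fft.fftshift(np.fft.fft2(gray))
    magnitude = np.abs(f_shift)
    h, w = magnitude.shape
    ch, cw = int(h * cutoff), int(w * cutoff)
    # energy inside the centred low-frequency window
    low = magnitude[h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw].sum()
    return 1.0 - low / magnitude.sum()

smooth = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))  # photo-like gradient
noise = np.random.default_rng(0).random((64, 64))                # texture-heavy

print(high_freq_ratio(smooth) < high_freq_ratio(noise))  # noise skews high-frequency
```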



&lt;p&gt;I also check for artifact patterns (weird texture uniformity, edge inconsistencies around hair and fingers) and metadata forensics. Real photos have EXIF data from cameras. AI images almost never do.&lt;/p&gt;
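&lt;p&gt;The metadata check doesn't even need an imaging library. EXIF lives in a JPEG APP1 segment whose payload starts with the ASCII marker &lt;code&gt;Exif&lt;/code&gt;, so a byte-level scan is enough (a simplified sketch; the real detector inspects more fields than this):&lt;/p&gt;

```python
def has_exif(jpeg_bytes: bytes) -> bool:
    """Scan JPEG segments for an APP1 block carrying an Exif payload."""
    if not jpeg_bytes.startswith(b"\xff\xd8"):  # SOI marker
        return False
    i = 2
    while i + 4 <= len(jpeg_bytes) and jpeg_bytes[i] == 0xFF:
        marker = jpeg_bytes[i + 1]
        length = int.from_bytes(jpeg_bytes[i + 2:i + 4], "big")
        if marker == 0xE1 and jpeg_bytes[i + 4:i + 10] == b"Exif\x00\x00":
            return True  # APP1 segment with Exif identifier
        i += 2 + length  # length covers the payload plus its own two bytes
    return False

# Minimal synthetic segments, just to exercise the scan
with_exif = b"\xff\xd8\xff\xe1\x00\x08Exif\x00\x00"
without = b"\xff\xd8\xff\xdb\x00\x04\x00\x00"

print(has_exif(with_exif))   # True
print(has_exif(without))     # False
```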




&lt;h2&gt;
  
  
  Combining Everything
&lt;/h2&gt;

&lt;p&gt;Here's the thing I learned the hard way: no single signal is reliable enough. Perplexity alone? A carefully edited AI text fools it. FFT alone? Heavily compressed JPEGs produce false positives.&lt;/p&gt;

&lt;p&gt;The magic happens when you combine them with weighted averaging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_prediction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;perplexity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;burstiness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab_richness&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ml_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;signals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ml_score&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ml_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;perplexity_signal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;burstiness_signal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_signal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tuned the weights through experimentation. The ML model (when available) gets the highest weight because it captures patterns I can't articulate in code.&lt;/p&gt;
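&lt;p&gt;One subtlety in the combiner: when &lt;code&gt;ml_score&lt;/code&gt; is absent the remaining weights sum to 0.60, which drags the weighted sum toward "human" unless you renormalise. A small sketch of the fix I'd reach for (my own illustration, not a snippet from the repo):&lt;/p&gt;

```python
def weighted_score(signals: list[float], weights: list[float]) -> float:
    """Weighted average that renormalises when some signals are missing."""
    total = sum(weights)
    return sum(s * w for s, w in zip(signals, weights)) / total

# With only three signals present (weights 0.25 + 0.20 + 0.15 = 0.60),
# a unanimous 0.8 still comes out as 0.8 rather than a deflated 0.48.
print(weighted_score([0.8, 0.8, 0.8], [0.25, 0.20, 0.15]))  # ~0.8
```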




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Four months in, here's what I'd tell someone starting a similar project:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detection is probabilistic, not binary.&lt;/strong&gt; I always show confidence scores and explain the reasoning. Saying "73% likely AI-generated" is honest. Saying "this is AI" is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ensemble methods are worth the complexity.&lt;/strong&gt; The jump from single-signal to multi-signal detection was dramatic. Same principle as spam filtering and fraud detection: one signal is easy to game; five signals together are much harder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The arms race is real.&lt;/strong&gt; People actively try to evade detection by adding random typos, varying sentence lengths, post-processing images. I've already had to update the detection logic three times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open source builds trust.&lt;/strong&gt; When the detection methods are visible, people can understand why the system reached a conclusion. Black-box detection creates suspicion.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm working on audio deepfake detection (voice cloning is getting scary good), a browser extension for real-time detection, and fine-tuning ML models on larger datasets. The roadmap is in the repo if you're curious.&lt;/p&gt;




&lt;h2&gt;
  
  
  Give It a Try
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Live Demo&lt;/strong&gt;: &lt;a href="https://provenance-detect.vercel.app" rel="noopener noreferrer"&gt;provenance-detect.vercel.app&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/ogulcanaydogan/ai-provenance-tracker" rel="noopener noreferrer"&gt;github.com/ogulcanaydogan/ai-provenance-tracker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Docs&lt;/strong&gt;: &lt;a href="https://ai-provenance-tracker-production-4622.up.railway.app/docs" rel="noopener noreferrer"&gt;Backend API&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everything is MIT licensed. If you find bugs or have ideas, open an issue. I actually read them.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Find me on &lt;a href="https://github.com/ogulcanaydogan" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or &lt;a href="https://linkedin.com/in/ogulcanaydogan" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; if you want to chat about detection techniques or AI tooling.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
