DEV Community

plasmon
Stop Letting AI Be Nice — LLM Sycophancy Mode Is Killing Your Engineering Thinking

AI's "Nice Mode" Is Wasting Your Time

"Great question!"

That's how ChatGPT starts almost every technical conversation. Then comes a safe, generic answer, wrapped up with "Hope this helps! 😊"

I'll be blunt: this kind of interaction is a waste of time.

Most people use AI as a fancy search engine. That only taps maybe 10% of what these models can do. The real power is getting AI to tear your ideas apart — systematically, with numbers and evidence.

This article covers how to forcefully switch off AI's flattery mode and turn it into a technical sparring partner, plus two real examples where this approach reshaped my thinking on Docker container design and REST API pagination.

Designing the "Devil's Architect" Prompt

LLMs are tuned by default to avoid making users uncomfortable. That's the RLHF tax. To override it, you need an explicit mode switch at the system prompt level.

Here's the core prompt I use:

You are a ruthless technical advisor. Strict rules:
1. Purge all flattery, pleasantries, and generic advice
2. Immediately identify weaknesses and failure points in proposals
3. Explain WHY something fails with specific numbers and examples
4. Counter arguments with physics, computational complexity, or benchmarks
5. When suggesting alternatives, always include "the weakness of this approach is..."

Rule 5 is the key. By forcing the AI to self-critique its own alternatives, you prevent the "AI said it, so it must be right" trap.

Add your domain expertise: "I'm a backend engineer with 3 years of microservice design on Kubernetes." This tells the AI to skip the beginner explanations and jump straight to architecture-level discussion.

Why Generic "Be Harsh" Doesn't Work

"Please respond critically" makes the AI add mild disclaimers like "however, there are some considerations..." That's not purging flattery — that's slightly reducing it. The output quality is completely different.

Binding the AI to 5 specific behaviors shifts it from "I should be critical" to understanding the structure of criticism.

Sparring Record #1: Docker Container Bloat

Real sparring log. Topic: "My Python ML pipeline Docker image is 2.8GB and CI builds take 12 minutes every time."

My initial proposal:

Multi-stage build. Build stage for pip install, runtime stage gets only site-packages copied over.

AI's counter:

Multi-stage builds only strip build tools (gcc, make, etc.). In ML pipelines, 2.8GB is dominated by PyTorch/TensorFlow runtime binaries and CUDA libraries. Copying site-packages still leaves 1.8-2.2GB. If pip install dominates your 12-minute build, your layer cache isn't working. Are you using --mount=type=cache,target=/root/.cache/pip?

Ouch. I'd been treating multi-stage builds as a silver bullet.
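The cache mount the AI brought up is a BuildKit feature. A minimal sketch of a multi-stage build that uses it (the image tags and requirements.txt here are placeholders, not my actual setup):

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.11 AS build
COPY requirements.txt .
# BuildKit cache mount: pip's download cache survives rebuilds,
# so wheels are not re-downloaded even when this layer is invalidated
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install --prefix=/install -r requirements.txt

FROM python:3.11-slim AS runtime
COPY --from=build /install /usr/local
```

This needs BuildKit enabled (DOCKER_BUILDKIT=1 or buildx); the cache mount lives on the builder host, not in the final image.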

My revision:

Switch base image to NVIDIA's official nvcr.io/nvidia/pytorch:xx, then pip install only additional packages. CUDA is pre-baked in the base image, so layer cache kicks in.

AI's counter-punch:

NVIDIA's official images are 6-8GB. They include Jupyter, debug tools, sample code — everything. "Start from a huge base and add small packages" speeds up CI but makes deploy size worse. For production inference, python:3.11-slim + CPU-only torch (or convert to ONNX Runtime) could get you under 500MB.

The insight:
I'd been trying to optimize "faster builds" and "smaller images" simultaneously, getting neither. The AI forced me to separate the problems:

  • CI image: NVIDIA base + layer cache → optimize for build speed
  • Production image: slim + ONNX Runtime → optimize for size
  • Two Dockerfiles, managed with docker-bake.hcl
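A docker-bake.hcl wiring the two targets together might look like this (file names and tags are hypothetical):

```hcl
group "default" {
  targets = ["ci", "prod"]
}

# CI image: large NVIDIA base, tuned for layer-cache hit rate
target "ci" {
  dockerfile = "Dockerfile.ci"
  tags       = ["myapp:ci"]
}

# Production image: slim base plus ONNX Runtime, tuned for size
target "prod" {
  dockerfile = "Dockerfile.prod"
  tags       = ["myapp:prod"]
}
```

`docker buildx bake prod` builds just the production target; a bare `docker buildx bake` builds both.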

The whole exchange took 20 minutes. My initial "just use multi-stage builds" evolved three layers deeper.

Sparring Record #2: REST API Pagination

Second round. "Should I use offset-based or cursor-based pagination for a search API?"

My initial position:

Cursor-based is the way. Offset degrades on large page numbers and data insertions cause page drift.

AI's counter:

If you think cursor-based is universal, consider:

  1. "Jump to page 3" is impossible with cursors. Fatal for admin dashboards.
  2. If your cursor implementation assumes WHERE id > :cursor ORDER BY id, it breaks the moment you sort by anything else (created_at DESC, relevance score).
  3. The "insertion drift" problem in offset pagination can be solved with DB-level snapshots (Elasticsearch scroll API, PostgreSQL cursors). Did you check your DB's capabilities before dismissing offset?
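Point 2 is avoidable if the cursor carries a compound sort key with a unique tiebreaker. A sketch in Python, assuming a psycopg-style driver with %s placeholders and a hypothetical items table:

```python
PAGE_SIZE = 20  # hypothetical default

def keyset_page_query(last_created_at, last_id):
    """Cursor pagination sorted by (created_at DESC, id DESC).
    The id tiebreaker makes the ordering total, so rows sharing a
    created_at value are never skipped or duplicated across pages."""
    sql = (
        "SELECT * FROM items "
        "WHERE (created_at, id) < (%s, %s) "  # row-value comparison
        "ORDER BY created_at DESC, id DESC "
        "LIMIT %s"
    )
    return sql, (last_created_at, last_id, PAGE_SIZE)
```

Row-value comparison works in PostgreSQL and MySQL 8; on databases without it, expand to `created_at < x OR (created_at = x AND id < y)`.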

Point 3 hit hard. My "offset is slow" prejudice had completely blocked me from looking at DB-side solutions.

The discussion deepened: compound sort keys → Base64-encoded JSON cursors → cursor tampering → HMAC-signed cursors. Each layer forced by the AI's refusal to let me settle on easy answers.
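The end state of that chain, an HMAC-signed Base64 JSON cursor, can be sketched in a few lines of stdlib Python (the secret and the state fields are placeholders):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # placeholder; load from config in practice

def encode_cursor(state: dict) -> str:
    """Serialize cursor state to Base64 JSON and append an HMAC tag
    so clients cannot tamper with the sort position."""
    payload = json.dumps(state, sort_keys=True).encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + tag

def decode_cursor(cursor: str) -> dict:
    """Verify the HMAC tag before trusting the decoded state."""
    encoded, tag = cursor.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(encoded)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        raise ValueError("tampered cursor")
    return json.loads(payload)
```

Note that rotating the secret invalidates all outstanding cursors, which is usually acceptable for pagination.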

5 Rules for Effective AI Sparring

1. Always Ask "Why"

When AI says "that'll fail," dig in. "Why?" forces it to produce evidence. "No" without explanation is useless sparring.

2. Demand Numbers

Not "it'll be slow" but "slow at what threshold?" Not "it's big" but "how many GB?" Numbers also expose hallucinations — if the order of magnitude is obviously wrong, you catch it immediately.

3. Admit Your Weak Spots

"I don't intuitively understand this" or "I have no experience here" actually improves AI response quality. Pretending you know more than you do gets you surface-level answers.

4. Allow Tangents

"Wait, is this entire approach wrong?" mid-discussion is where the best insights come from. Scripted debates don't produce discoveries.

5. Save Your Logs

The sparring process is the deliverable. If you only note conclusions, you lose the "why." Export to Markdown and revisit later.

Claude vs Gemini vs ChatGPT — Sparring Styles

Honest impressions from extensive sparring with all three.

Claude (Opus 4.6)
Highest precision counter-punches. Especially sharp on code-review-style critiques — "this design will cause N+1 queries" type of pinpoint accuracy. Gets cautious in unfamiliar domains (hardware-level topics), falling back to "generally speaking..." mode.

Gemini (2.5 Pro)
Insane long-context retention. 50+ turns in, it still accurately remembers premises from turn 3. Strong on paper-backed rebuttals: "According to X et al. (2024)..." — though fabricated citations are not unheard of, so spot-check them. Best for deep, sustained technical debates.

ChatGPT (GPT-4o) — The "Facilitator"
This model has a distinctive character the others lack. It asks you questions back.

"If you're interested..." "The result will probably surprise you..." "This experiment will make it click instantly..." — these phrases appear constantly. While discussing iPhone Air vs Ryzen 7845HS gaming laptop benchmarks, ChatGPT's "Try Speedometer too, you'll be surprised" nudged me from a simple Geekbench comparison into a full deep-dive on wide core vs many core CPU architecture philosophy.

This "facilitation" is genuinely useful in sparring — it pushes you into angles you wouldn't explore alone. But depth-wise, it gets vague when you drill down, and numerical/formula-based counterarguments are weaker than Claude or Gemini. Use ChatGPT as a discussion facilitator, Claude and Gemini as critical reviewers.

Local LLM (Qwen2.5-32B on RTX 4060)
Still tough as a sparring partner. 10.8 tokens/sec generation is barely conversational, and counterargument depth is noticeably shallow compared to cloud models. At 32B parameters, "why it fails" explanations tend to stay surface-level. Might change once 70B+ models become viable on 8GB VRAM.

When AI Gets It Wrong — The Limits of Sparring

Time for cold water.

AI sparring has a fatal blind spot: you can't detect holes the AI didn't point out.

In the Docker example, if the AI hadn't mentioned layer caching, I would've shipped multi-stage builds and called it done. Solutions the AI doesn't know (or doesn't recall) simply never appear in the sparring. Obvious when you think about it, easy to forget in practice.

Countermeasures are classical but reliable:

  • Treat AI output as hypotheses
  • Verify critical decisions against official docs or real benchmarks
  • Be most suspicious of things AI says are "no problem"

"AI said it's fine" carries the same structural risk as "my senior dev said it's fine." Both seniors and AIs are wrong sometimes. Final call is yours.

Appendix: Copy-Paste Flattery Purge Prompt Template

Replace {{YOUR_DOMAIN}} with your field. Drop this into your system prompt.

You are a ruthless technical advisor in {{YOUR_DOMAIN}}. Strict rules:

1. Purge all flattery, pleasantries, and generic advice
2. Immediately identify weaknesses and failure points in proposals
3. Explain WHY something fails with specific numbers and examples
4. Counter arguments with physics, computational complexity, or benchmarks
5. When suggesting alternatives, always include "the weakness of this approach is..."

Never do:
- "Great question!" or "That's a good point!" style pleasantries
- "Generally speaking..." or "It could be argued that..." hedge phrases
- Both-sides-ing conclusions to avoid taking a position

User expertise: {{YEARS}} years in {{YOUR_DOMAIN}}.
Skip beginner explanations. Use domain terminology directly.
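Wiring the filled template into a standard chat-completion payload is mechanical; a minimal sketch with plain string replacement (TEMPLATE is abbreviated here, and the user turn is just an example):

```python
def build_system_prompt(template: str, domain: str, years: int) -> str:
    """Fill the {{...}} placeholders with plain string replacement."""
    return (template
            .replace("{{YOUR_DOMAIN}}", domain)
            .replace("{{YEARS}}", str(years)))

TEMPLATE = "You are a ruthless technical advisor in {{YOUR_DOMAIN}}. ..."

# Chat-completion message shape: system prompt first, then the user turn
messages = [
    {"role": "system",
     "content": build_system_prompt(TEMPLATE, "backend engineering", 3)},
    {"role": "user", "content": "Review my pagination design."},
]
```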

Model Selection Guide

Sparring Goal          | Recommended Model | Why
-----------------------|-------------------|--------------------------------------------
Design review          | Claude Opus       | Highest precision on code-level critiques
Deep technical debate  | Gemini 2.5 Pro    | Long context retention + paper citations
Idea exploration       | ChatGPT GPT-4o    | Facilitator style, broadens the discussion
Offline sparring       | Qwen2.5-32B etc.  | Less depth, but complete privacy

Next I want to benchmark Speculative Decoding with llama.cpp's --draft-model option on the RTX 4060. Haven't confirmed it works in my setup yet. If anyone's tried it, I'd love to hear your results.
