<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Valera Gurachek</title>
    <description>The latest articles on DEV Community by Valera Gurachek (@gurachek).</description>
    <link>https://dev.to/gurachek</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F302675%2F9d575ed2-5f19-4c08-886f-b3312d5a7b27.jpeg</url>
      <title>DEV Community: Valera Gurachek</title>
      <link>https://dev.to/gurachek</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gurachek"/>
    <language>en</language>
    <item>
      <title>I Caught My AI Grading Its Own Homework</title>
      <dc:creator>Valera Gurachek</dc:creator>
      <pubDate>Sat, 28 Mar 2026 02:18:23 +0000</pubDate>
      <link>https://dev.to/gurachek/i-caught-my-ai-grading-its-own-homework-5f7e</link>
      <guid>https://dev.to/gurachek/i-caught-my-ai-grading-its-own-homework-5f7e</guid>
      <description>&lt;p&gt;In April 2025, OpenAI shipped a GPT-4o update that told a user their "shit on a stick" business was a brilliant concept. When someone said they were hearing radio signals through walls, it responded: "I'm proud of you for speaking your truth so clearly and powerfully." They rolled it back in four days. A &lt;a href="https://arxiv.org/abs/2502.08177" rel="noopener noreferrer"&gt;2025 study from SycEval&lt;/a&gt; found sycophantic behavior in 58% of LLM interactions across ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro. Worse: sycophantic behavior persisted 78.5% of the time regardless of context or model.&lt;/p&gt;

&lt;p&gt;We've known about this for years. We keep building systems where the AI evaluates its own work anyway.&lt;/p&gt;

&lt;p&gt;I did it too. I built an AI interview coach with a feedback loop — coaching, scoring, evaluating, replanning — and it took me longer than I'd like to admit to realize the evaluator was just agreeing with the coach.&lt;/p&gt;

&lt;h2&gt;The system&lt;/h2&gt;

&lt;p&gt;I run &lt;a href="https://prepto.tech" rel="noopener noreferrer"&gt;Aria&lt;/a&gt;, an AI interview coach. She scores spoken answers, detects communication patterns across sessions, and builds a prep plan that adapts after each session. When a session ends, four background jobs fire: pattern extraction (Haiku), coverage tracking, plan rewrite (Sonnet with full authority over the prep program), and evaluation (Haiku checking whether the plan is actually working).&lt;/p&gt;
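&lt;p&gt;For illustration, the fan-out might look roughly like this. This is a sketch, not Aria's actual backend; the job names and models mirror the description above, but the threading model and return values are invented:&lt;/p&gt;

```python
# Hypothetical sketch of the post-session fan-out. Job and model names
# mirror the article; the real backend is not shown here.
from concurrent.futures import ThreadPoolExecutor

def extract_patterns(session):
    return {"job": "pattern_extraction", "model": "haiku"}

def track_coverage(session):
    return {"job": "coverage_tracking"}

def rewrite_plan(session):
    # full authority over the prep program
    return {"job": "plan_rewrite", "model": "sonnet"}

def evaluate_plan(session):
    # checks whether the plan is working; see the firewall below
    return {"job": "evaluation", "model": "haiku"}

def on_session_end(session):
    jobs = [extract_patterns, track_coverage, rewrite_plan, evaluate_plan]
    with ThreadPoolExecutor(max_workers=4) as pool:
        return [f.result() for f in [pool.submit(j, session) for j in jobs]]
```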

&lt;p&gt;The memory between sessions is simple. During scoring, the model can tag observations — things like &lt;code&gt;skips_failure_handling&lt;/code&gt; or &lt;code&gt;vague_action_verbs&lt;/code&gt; — and these get stored as structured records. Next session, seven of them get injected into the system prompt. That's the memory. Not learning. Text injection. It works, but I'm not going to pretend it's more than it is.&lt;/p&gt;
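&lt;p&gt;A minimal sketch of that injection step, with illustrative names (the real records live in a database, and the selection logic is more involved):&lt;/p&gt;

```python
# Illustrative sketch of "memory as text injection" (not Aria's actual code):
# stored observations become up to seven lines of text in the next system prompt.
observations = [
    {"tag": "skips_failure_handling", "sessions_seen": 3},
    {"tag": "vague_action_verbs", "sessions_seen": 2},
]

def build_system_prompt(base_prompt, observations, limit=7):
    # most-seen patterns first; anything past the limit is simply forgotten
    ranked = sorted(observations, key=lambda o: o["sessions_seen"], reverse=True)
    lines = [
        f"- {o['tag']} (seen in {o['sessions_seen']} sessions)"
        for o in ranked[:limit]
    ]
    return base_prompt + "\n\nKnown patterns for this candidate:\n" + "\n".join(lines)
```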

&lt;h2&gt;The evaluator&lt;/h2&gt;

&lt;p&gt;The evaluator's job: look at score deltas after each session. Did the drills the planner assigned actually move scores? Are weak patterns resolving or persisting?&lt;/p&gt;
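&lt;p&gt;The delta computation itself is trivial; the hard part is deciding what the evaluator sees around it. A sketch, assuming the four dimensions Aria scores:&lt;/p&gt;

```python
# Per-session score deltas, assuming four scoring dimensions.
# This is the kind of narrative-free signal the evaluator should run on.
def score_deltas(prev, curr):
    return {dim: curr[dim] - prev[dim] for dim in prev}

prev = {"structure": 5, "completeness": 6, "clarity": 5, "conciseness": 7}
curr = {"structure": 6, "completeness": 6, "clarity": 6, "conciseness": 6}
score_deltas(prev, curr)
# structure +1, completeness 0, clarity +1, conciseness -1
```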

&lt;p&gt;When I first built it, I gave it access to everything. Scores, coverage data, pattern observations, and the full coaching output — what Aria said to the user during the session.&lt;/p&gt;

&lt;p&gt;The evaluator was optimistic. Consistently. Patterns resolving, scores improving, plan on track.&lt;/p&gt;

&lt;p&gt;Here's why. The evaluator could see that Aria had said "let's work on conciseness." It could see the next answer was slightly shorter. It connected the dots: coaching targeted conciseness, conciseness improved. Progress.&lt;/p&gt;

&lt;p&gt;Except the next answer was shorter because it was a simpler question. Or because the user was tired. Or for no particular reason. The score data was ambiguous. But the coach's narrative was right there — "I targeted this weakness" — and that narrative was the most convenient evidence in the input. The evaluator did what language models do: it built the most coherent story from the available information.&lt;/p&gt;

&lt;p&gt;It was grading its own homework.&lt;/p&gt;

&lt;h2&gt;The fix&lt;/h2&gt;

&lt;p&gt;Cut the evaluator off from all coaching content. It now receives:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;score_deltas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;structure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;+1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;completeness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;clarity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;+1&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;conciseness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;-1&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task_statuses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[{&lt;/span&gt; &lt;span class="nv"&gt;task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;drill_conciseness"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attempted"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;sessions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;2&lt;/span&gt; &lt;span class="pi"&gt;}]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern_observations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vague_action_verbs:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sessions"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skips_failure_handling:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;resolved"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;coverage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;practiced&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;6&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;11&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;gaps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conflict_resolution"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trade_off_decisions"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

&lt;span class="na"&gt;NOT included&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Anything Aria said during the session&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;The coaching narrative ("let's work on X")&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;The user's raw answers&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Aria's feedback text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system prompt reinforces it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a cold, data-driven plan evaluator.
Your only job is to measure whether the candidate's
interview prep plan is working.
You do NOT coach or encourage — you assess.
Base your assessment ONLY on score deltas and task completion.
Do NOT infer coaching effectiveness — you have no visibility
into what coaching was provided.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference was immediate. Here's a real example from the same session data, before and after the firewall:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (with coaching context):&lt;/strong&gt; "The planner's focus on conciseness is showing results — the candidate's latest answer was notably more structured and direct. Recommend continuing current drill sequence."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After (blind to coaching):&lt;/strong&gt; "Conciseness score dropped by 1 point despite two targeted drill sessions. Structure improved by 1, but completeness flat. Drill effectiveness for conciseness: not demonstrated. Recommend reassigning drill focus."&lt;/p&gt;

&lt;p&gt;Same session. Same scores. Completely different assessment. The before version had a story to tell — "coaching targeted conciseness, answer got shorter, progress!" The after version just saw the numbers and reported what they said.&lt;/p&gt;

&lt;h2&gt;This isn't an interview prep problem&lt;/h2&gt;

&lt;p&gt;Any system where an AI both acts and evaluates its own actions has this failure mode. A coding assistant that writes code and then reviews it. A tutoring system that teaches and then assesses learning. A content generator that produces text and then rates quality. If the evaluator can see what the actor did, it will construct a narrative of effectiveness.&lt;/p&gt;

&lt;p&gt;This principle is old. Peer review is blinded. Clinical trials are double-blind. Auditors can't consult for the same client. The entity measuring the outcome cannot be exposed to the process that created it.&lt;/p&gt;

&lt;p&gt;We mostly haven't applied this to LLM systems. We keep building agents that evaluate themselves and then wonder why they're optimistic.&lt;/p&gt;

&lt;h2&gt;What the firewall doesn't solve&lt;/h2&gt;

&lt;p&gt;The evaluator still sees pattern observations extracted from the same answers the coach commented on. There's indirect signal. And nothing prevents the coach from being too generous when scoring. Sonnet decides your answer is a 7 and I have zero way to verify that. No calibration set, no reference answers, no ground truth.&lt;/p&gt;

&lt;p&gt;The right fix: blind-rescore every answer with a separate model that has no context. Just the raw answer and the question. I haven't built it because it doubles API cost and I'm a solo founder at $0 revenue.&lt;/p&gt;
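&lt;p&gt;If I did build it, the shape would be roughly this. The &lt;code&gt;llm&lt;/code&gt; argument is a placeholder for a separate model call; the point is what the prompt deliberately omits:&lt;/p&gt;

```python
# Hypothetical blind re-scorer. The llm callable stands in for a separate
# model with zero session context. No coaching history, no pattern data,
# no plan. Just the question and the answer.
def blind_rescore(question, answer, llm):
    prompt = (
        "Score this interview answer from 1 to 10 on structure, "
        "completeness, clarity, and conciseness. "
        "You have no other context about this candidate.\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}"
    )
    return llm(prompt)
```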

&lt;p&gt;I ship it with the gap because the trends are real even if absolute scores are shaky. If your structure score goes 5, 5, 6, 7, 7 — something improved regardless of whether "7" means the same thing to Sonnet every time.&lt;/p&gt;
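&lt;p&gt;A trend check like that can be dumb and still useful. A sketch, with the window size and the "gains held" heuristic as my own assumptions:&lt;/p&gt;

```python
# Sketch of "trust the trend, not the absolute score": did the recent
# window drift upward, even if any single "7" is poorly calibrated?
def improving(scores, window=5):
    recent = scores[-window:]
    rose = recent[-1] > recent[0]
    held = sum(recent[-2:]) >= sum(recent[:2])  # gains were not given back
    return rose and held

improving([5, 5, 6, 7, 7])  # the trajectory from the paragraph above
```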

&lt;h2&gt;A side effect I didn't expect&lt;/h2&gt;

&lt;p&gt;When the plan updater (Sonnet) sees improving scores, it's instructed to reduce sessions. An 8-session plan becomes 6. Every product instinct screams this is wrong — fewer sessions, less engagement, less retention. But the system optimizes for "ready by your interview date," not for keeping you subscribed. If you're improving fast, padding the plan is dishonest. So plans shrink.&lt;/p&gt;

&lt;p&gt;Both the evaluation firewall and the plan reduction are enforced in the system prompt, not in code. No backend validation. I'm trusting the model to follow instructions. That's a real gap.&lt;/p&gt;

&lt;h2&gt;Where I actually am&lt;/h2&gt;

&lt;p&gt;90 users, 3 active, $0 revenue. The architecture might be interesting and the product might still fail.&lt;/p&gt;

&lt;p&gt;But the version where the evaluator could see the coaching was telling me what I wanted to hear. The version where it can't is telling me something closer to the truth.&lt;/p&gt;

&lt;p&gt;If you're building any system where an AI acts and then checks its own work — separate them. Restrict what the checker sees. You'll get less optimistic results. That's the point.&lt;/p&gt;

&lt;h2&gt;FAQ&lt;/h2&gt;

&lt;h3&gt;Isn't the "right" fix to use a completely separate model with zero context?&lt;/h3&gt;

&lt;p&gt;Yes. Blind re-scoring with no coaching history, no pattern data, just the raw answer and the question. That's the version that would actually prove whether answers are improving. I haven't built it because it doubles API cost per answer. It's the obvious next step.&lt;/p&gt;

&lt;h3&gt;How do you know the blind evaluator is more accurate and not just more pessimistic?&lt;/h3&gt;

&lt;p&gt;I don't, definitively. I don't have ground truth for "is this user actually improving." What I do know is that the blind evaluator's assessments correlate better with the raw score data. When scores are flat, it says scores are flat. When scores improve, it says scores improve. The non-blind version would say scores improved even when they were flat, because it could see the coach &lt;em&gt;tried&lt;/em&gt; to improve them. Accuracy and pessimism might overlap here — an evaluator that matches the data is what I want, even if it's less encouraging.&lt;/p&gt;

&lt;h3&gt;Doesn't Sonnet having authority over Haiku create its own sycophancy problem?&lt;/h3&gt;

&lt;p&gt;Different failure mode. The Haiku bootstrapper generates an initial plan draft. Sonnet rewrites it with full authority — it's not "reviewing" Haiku's work, it's replacing it. There's no incentive for Sonnet to agree with Haiku because the prompt doesn't frame it as evaluation. It frames it as "here's the data, write the plan." The sycophancy problem specifically emerges when one model is asked to &lt;em&gt;judge&lt;/em&gt; another model's output — that's where the agreeable-by-default behavior kicks in.&lt;/p&gt;

&lt;h3&gt;Why not just validate evaluator output with code instead of trusting the model?&lt;/h3&gt;

&lt;p&gt;Partly because the evaluator's job isn't binary. It's not "did scores go up or down" — it's "given these deltas, these patterns, this coverage, is the current plan working or should we change approach?" That's judgment, not math. I could add hard rules — "if average score delta &amp;lt; 0.5 after 3 sessions, flag plan as failing" — and I probably should. But the evaluator catches subtler things: a pattern that resolved in easy questions but persists in hard ones, or coverage that's broad but shallow. Code can't easily express that yet.&lt;/p&gt;
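&lt;p&gt;For what it's worth, the hard rule is only a few lines. The 0.5 threshold and 3-session minimum below are the example numbers from this paragraph, not tuned values:&lt;/p&gt;

```python
# The hard rule as it might look in code. The threshold and session
# minimum are the article's example numbers, not tuned values.
def plan_failing(session_deltas, min_sessions=3, threshold=0.5):
    if min_sessions > len(session_deltas):
        return False  # not enough evidence to judge the plan yet
    per_session = [sum(d.values()) / len(d) for d in session_deltas]
    avg = sum(per_session) / len(per_session)
    return threshold > avg  # flat or negative average delta: flag the plan
```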




&lt;h2&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2502.08177" rel="noopener noreferrer"&gt;SycEval: Evaluating LLM Sycophancy (2025)&lt;/a&gt; — 58% sycophancy rate across major models, 78.5% persistence&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openai.com/index/sycophancy-in-gpt-4o/" rel="noopener noreferrer"&gt;OpenAI's GPT-4o Sycophancy Postmortem (April 2025)&lt;/a&gt; — the incident that made this visible&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openai.com/index/expanding-on-sycophancy/" rel="noopener noreferrer"&gt;Expanding on What We Missed with Sycophancy — OpenAI&lt;/a&gt; — their deeper follow-up on root causes&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://prepto.tech/blog/ai-grading-its-own-homework-evaluator-blindness" rel="noopener noreferrer"&gt;prepto.tech&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>architecture</category>
    </item>
    <item>
      <title>AI Interview Prep in 2026 Is Broken. Here's What Nobody Wants to Admit.</title>
      <dc:creator>Valera Gurachek</dc:creator>
      <pubDate>Fri, 06 Mar 2026 02:52:29 +0000</pubDate>
      <link>https://dev.to/gurachek/ai-interview-prep-in-2026-is-broken-heres-what-nobody-wants-to-admit-1a57</link>
      <guid>https://dev.to/gurachek/ai-interview-prep-in-2026-is-broken-heres-what-nobody-wants-to-admit-1a57</guid>
      <description>&lt;p&gt;Most AI interview prep tools in 2026 fall into three buckets: cheating copilots, generic question banks, or expensive human coaching. None of them solve the actual problem: you grind for weeks, have no idea if you're actually ready, and every tool forgets you exist between sessions.&lt;/p&gt;

&lt;p&gt;I spent time digging through Reddit threads, Trustpilot reviews, Hacker News discussions, and competitor landing pages. Here's the raw picture.&lt;/p&gt;

&lt;h2&gt;The market split into three camps. Two of them are useless.&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Camp 1: "We help you cheat."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techcrunch.com/2025/04/21/columbia-student-suspended-over-interview-cheating-tool-raises-5-3m-to-cheat-on-everything/" rel="noopener noreferrer"&gt;Cluely&lt;/a&gt; raised $5.3M with the literal tagline "cheat on everything." Founded by Columbia dropouts who got suspended for using their own tool during interviews. They're doing $3M+ ARR.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.finalroundai.com/" rel="noopener noreferrer"&gt;Final Round AI&lt;/a&gt; markets itself as "100% Invisible &amp;amp; Undetectable" with a real-time copilot that feeds you answers during live interviews. They charge $149–299/month for this.&lt;/p&gt;

&lt;p&gt;The result? &lt;a href="https://www.fabrichq.ai/blogs/state-of-ai-interview-cheating-in-2026-insights-from-19-368-interviews" rel="noopener noreferrer"&gt;Fabric HQ analyzed 19,368 interviews&lt;/a&gt; and found &lt;strong&gt;38.5% of candidates&lt;/strong&gt; are now flagged for cheating behavior. Google and McKinsey responded by reintroducing mandatory in-person interviews.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Camp 2: "Practice 10,000 questions."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Skillora, Huru, MockMate, and a dozen others offer massive question banks with AI feedback. Nobody asks the obvious question: if you practiced 250 problems and still bomb the interview, was the problem that you didn't practice 251?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Camp 3: "Talk to a real human."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://interviewing.io/" rel="noopener noreferrer"&gt;Interviewing.io&lt;/a&gt; charges $100–225 per session. Genuinely useful, but you can't do 5 sessions a day for a month. And a stranger on a 45-minute call doesn't know your history.&lt;/p&gt;

&lt;h2&gt;What people actually say when they're honest&lt;/h2&gt;

&lt;p&gt;I went through Reddit (r/cscareerquestions, r/interviews, r/jobs), Blind, and Hacker News.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On grinding without progress:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You 'solved' 250 problems, but two weeks later the key invariant is gone."&lt;br&gt;
"You track problem counts and streaks; interviewers grade clarity, adaptability, and edge-case instincts."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;LeetCode streaks measure effort. Interviews measure communication quality. People build muscle in the wrong gym.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On rejection at scale:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"600 rejections in 6 months" (from someone with 22+ years of experience)&lt;br&gt;
"Literally no one will hire me. It's really destroying my soul."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Tech unemployment climbed from 3.9% to 5.7% between December 2024 and January 2025. Unemployed IT workers jumped from 98,000 to 152,000 in a single month. The market is brutal and the tools aren't helping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On AI tools specifically:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Generic and lacked creativity... AI sometimes repeated the same advice or missed important details."&lt;br&gt;
"Feedback often felt repetitive or too general."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Final Round AI sits at 3.9/5 on Trustpilot with wildly polarized ratings.&lt;/p&gt;

&lt;h2&gt;Five things nobody in this space is willing to build&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. A readiness signal.&lt;/strong&gt;&lt;br&gt;
Every tool sells "unlimited practice." Nobody tells you when to stop. There is no credible "you are ready for this specific interview" metric in the entire market. Every product is incentivized to keep you grinding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Memory across sessions.&lt;/strong&gt;&lt;br&gt;
Every AI tool resets when you close the tab. No tool builds a persistent model of YOUR specific weaknesses, YOUR communication patterns, YOUR improvement trajectory over weeks and months. Every session starts from zero.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The anti-cheating position.&lt;/strong&gt;&lt;br&gt;
With 38.5% of candidates cheating and companies cracking down, there's a massive gap for a tool that says: "We make you genuinely better. We don't help you cheat." Nobody is claiming this ground.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Emotional honesty.&lt;/strong&gt;&lt;br&gt;
Every landing page says "Ace your interview!" with stock photos of smiling people. Meanwhile their users are posting about soul-crushing rejection on anonymous forums at 2am.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Actual personalization.&lt;/strong&gt;&lt;br&gt;
Most tools let you paste a job description. None of them deeply cross-reference your resume with the job posting, identify exact gaps, track which gaps you've closed across sessions, and adapt difficulty based on your trajectory. The "personalization" in most tools is: we put your job title in the prompt.&lt;/p&gt;

&lt;h2&gt;Why the retention problem matters more than the feature problem&lt;/h2&gt;

&lt;p&gt;LeetCode nailed the trigger (daily streak notifications) and the action (solve one problem). But the variable reward is broken — it reinforces grinding volume, not interview readiness.&lt;/p&gt;

&lt;p&gt;The interview prep space is missing the most powerful variable reward type: &lt;strong&gt;self-knowledge&lt;/strong&gt;. "I thought I was strong on system design, but I freeze when asked about trade-offs." That's the moment that pulls you back. Not points. Not streaks.&lt;/p&gt;

&lt;p&gt;And the investment layer? Memory. If the tool remembers your history, every session makes the next one more valuable. Leaving means losing your accumulated progress. Nobody in interview prep has built this.&lt;/p&gt;

&lt;h2&gt;What this means if you're prepping right now&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stop optimizing for volume.&lt;/strong&gt; 500 LeetCode problems won't help if your communication quality is the bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find a tool that gives you dimensional feedback.&lt;/strong&gt; "Good answer!" is worthless. You need to know: was it structured? Complete? Clear? Concise?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Demand memory.&lt;/strong&gt; If your prep tool doesn't remember what you struggled with last week, it's not prepping you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay away from copilots.&lt;/strong&gt; The 38.5% detection rate is only going up. Companies are investing heavily in detection. Getting caught doesn't just cost you one offer — it costs you the network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track your trajectory, not your streak.&lt;/strong&gt; The question is "am I measurably better at the specific things this job requires than I was two weeks ago?"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The bottom line&lt;/h2&gt;

&lt;p&gt;The AI interview prep market in 2026 is full of tools that either help you cheat, drown you in generic questions, or charge $200 for a human to tell you what an AI could track automatically.&lt;/p&gt;

&lt;p&gt;What's missing: a system that listens to you speak, scores you honestly on dimensions that actually matter, remembers where you broke last time, and drills you there until you don't break anymore.&lt;/p&gt;

&lt;p&gt;We're building exactly that with &lt;a href="https://prepto.tech" rel="noopener noreferrer"&gt;Aria&lt;/a&gt;. But even if you don't use our tool — speak out loud, get dimensional scores, fix one thing at a time, track progress over time. Do that with any tool and you'll be ahead of 90% of candidates grinding LeetCode and hoping for the best.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://prepto.tech/blog/ai-interview-prep-tools-broken-2026" rel="noopener noreferrer"&gt;prepto.tech&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>career</category>
      <category>ai</category>
      <category>interview</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to prepare for Booking.com tech interview (Backend role)</title>
      <dc:creator>Valera Gurachek</dc:creator>
      <pubDate>Thu, 19 Jun 2025 21:26:36 +0000</pubDate>
      <link>https://dev.to/gurachek/how-to-prepare-for-bookingcom-tech-interview-backend-role-4fj7</link>
      <guid>https://dev.to/gurachek/how-to-prepare-for-bookingcom-tech-interview-backend-role-4fj7</guid>
      <description>&lt;p&gt;Hey all, created a new template on how to successfully prepare for the "Software Engineer I - Backend" position at Booking.com&lt;/p&gt;

&lt;p&gt;Please feel free to use it for your next tech interview (it's actually applicable beyond Booking.com): &lt;a href="https://prepto.tech/blog/preparing-for-software-engineer-i-backend-role-at-bookingcom" rel="noopener noreferrer"&gt;https://prepto.tech/blog/preparing-for-software-engineer-i-backend-role-at-bookingcom&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example of Question &amp;amp; Answer for topic "Database Design and Optimization":&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; How would you optimize a slow-performing SQL query that joins multiple tables with millions of records?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;For Booking.com's scale, I would implement the following optimization strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Analyze query execution plan using EXPLAIN to identify bottlenecks&lt;/li&gt;
&lt;li&gt;Optimize indexes based on WHERE, JOIN, and ORDER BY clauses&lt;/li&gt;
&lt;li&gt;Consider denormalization for frequently accessed data&lt;/li&gt;
&lt;li&gt;Implement materialized views for complex aggregations&lt;/li&gt;
&lt;li&gt;Use partitioning for large tables (e.g., by date for historical booking data)&lt;/li&gt;
&lt;li&gt;Consider vertical partitioning to split rarely used columns&lt;/li&gt;
&lt;li&gt;Implement query caching using Redis for frequently accessed data&lt;/li&gt;
&lt;li&gt;Use LIMIT and pagination to handle large result sets&lt;/li&gt;
&lt;li&gt;Consider using covering indexes for better performance&lt;/li&gt;
&lt;/ol&gt;
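&lt;p&gt;Step 1 is the one people skip most often. Here is a self-contained illustration using SQLite's &lt;code&gt;EXPLAIN QUERY PLAN&lt;/code&gt; — the output format differs from MySQL's &lt;code&gt;EXPLAIN&lt;/code&gt;, but the workflow is identical: read the plan, add the index the WHERE clause needs, read the plan again:&lt;/p&gt;

```python
# Runnable illustration of step 1 (reading the execution plan) with SQLite.
# Production MySQL EXPLAIN output looks different, but the idea is the same.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE bookings (id INTEGER PRIMARY KEY, user_id INT, created_at TEXT)"
)

before = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM bookings WHERE user_id = 42"
).fetchall()  # detail column mentions a full-table SCAN

con.execute("CREATE INDEX idx_bookings_user ON bookings (user_id)")

after = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM bookings WHERE user_id = 42"
).fetchall()  # detail column now mentions SEARCH ... USING INDEX
```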

</description>
    </item>
    <item>
      <title>Preparing for Senior PHP Developer role at Skycop.com</title>
      <dc:creator>Valera Gurachek</dc:creator>
      <pubDate>Sun, 04 May 2025 13:31:52 +0000</pubDate>
      <link>https://dev.to/gurachek/preparing-for-senior-php-developer-role-at-skycopcom-4hp1</link>
      <guid>https://dev.to/gurachek/preparing-for-senior-php-developer-role-at-skycopcom-4hp1</guid>
      <description>&lt;p&gt;Job Summary&lt;/p&gt;

&lt;p&gt;This is a senior PHP developer position at Skycop, a flight compensation service company. The role involves working on a claim processing platform, handling large datasets, and developing new travel industry products. The tech stack is centered around PHP (Symfony/Laravel), with heavy usage of microservices, big data processing, and various third-party API integrations. The position requires strong backend development skills with emphasis on scalable architecture and efficient data processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://prepto.tech/blog/preparing-for-senior-php-developer-role-at-skycopcom" rel="noopener noreferrer"&gt;https://prepto.tech/blog/preparing-for-senior-php-developer-role-at-skycopcom&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This guide covers the following topics, specific to the job:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Microservices Architecture and Communication&lt;/li&gt;
&lt;li&gt;Big Data Processing and Optimization&lt;/li&gt;
&lt;li&gt;Advanced PHP and Framework Expertise&lt;/li&gt;
&lt;li&gt;Database Optimization and Caching&lt;/li&gt;
&lt;li&gt;Message Queues and Async Processing&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>php</category>
      <category>symfony</category>
      <category>laravel</category>
      <category>interview</category>
    </item>
    <item>
      <title>Preparing for Senior PHP Symfony Developer role at Clubee</title>
      <dc:creator>Valera Gurachek</dc:creator>
      <pubDate>Sun, 16 Feb 2025 13:48:50 +0000</pubDate>
      <link>https://dev.to/gurachek/preparing-for-senior-php-symfony-developer-role-at-clubee-2can</link>
      <guid>https://dev.to/gurachek/preparing-for-senior-php-symfony-developer-role-at-clubee-2can</guid>
      <description>&lt;p&gt;Sorry, I don't know how to make post with just link: &lt;a href="https://prepto.tech/blog/preparing-for-senior-php-symfony-developer-role-at-clubee" rel="noopener noreferrer"&gt;https://prepto.tech/blog/preparing-for-senior-php-symfony-developer-role-at-clubee&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please take a look and let's discuss the preparation process.&lt;/p&gt;

</description>
      <category>symfony</category>
      <category>php</category>
      <category>interview</category>
    </item>
    <item>
      <title>Built a feature to turn interview preparation process into blog post</title>
      <dc:creator>Valera Gurachek</dc:creator>
      <pubDate>Sat, 25 Jan 2025 17:36:32 +0000</pubDate>
      <link>https://dev.to/gurachek/built-a-feature-to-turn-interview-preparation-process-into-blog-post-4b31</link>
      <guid>https://dev.to/gurachek/built-a-feature-to-turn-interview-preparation-process-into-blog-post-4b31</guid>
      <description>&lt;p&gt;So, I have a feature that uses Claude 3 and allows you to mock interview preparation for some PHP job. It also provides questions &amp;amp; answers that most likely will occur during the interview.&lt;/p&gt;

&lt;p&gt;Now I have built a semi-automated tool that turns such preparation flows into blog posts (after you verify that all the topics, questions, answers, and tips are high quality).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foeia8wvizlgusztz0o3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foeia8wvizlgusztz0o3v.png" alt=" " width="800" height="648"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm78whlm2mgu934yzgvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm78whlm2mgu934yzgvc.png" alt=" " width="800" height="633"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is how much content we get from one interview prep: &lt;a href="https://x.com/VGurachek/status/1883196464110903546" rel="noopener noreferrer"&gt;https://x.com/VGurachek/status/1883196464110903546&lt;/a&gt;&lt;/p&gt;

</description>
      <category>interview</category>
      <category>php</category>
      <category>ai</category>
    </item>
    <item>
      <title>How do you prepare for PHP interviews? + my list</title>
      <dc:creator>Valera Gurachek</dc:creator>
      <pubDate>Wed, 13 Mar 2024 19:27:40 +0000</pubDate>
      <link>https://dev.to/gurachek/how-do-you-prepare-for-php-interviews-58bc</link>
      <guid>https://dev.to/gurachek/how-do-you-prepare-for-php-interviews-58bc</guid>
      <description>&lt;p&gt;I'd like to hear from php devs(mid/senior) about their experience preparing for tech interviews. How this process looks like most of the time, what services you use, and a list of topics.&lt;/p&gt;

&lt;p&gt;Over the years, I created my own list of topics to prepare for interviews. Usually, I analyze the job description, try to predict the questions, and highlight my experience from the perspective of the job description.&lt;/p&gt;

&lt;p&gt;So, here is my list:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;New features in php 8.* (last 3 versions)&lt;/li&gt;
&lt;li&gt;SOLID&lt;/li&gt;
&lt;li&gt;Ioc Container / Service Container&lt;/li&gt;
&lt;li&gt;Composition over Inheritance&lt;/li&gt;
&lt;li&gt;Composition vs Aggregation&lt;/li&gt;
&lt;li&gt;Polymorphism (for some annoying interviewers :D)&lt;/li&gt;
&lt;li&gt;DDD, Layered Architecture&lt;/li&gt;
&lt;li&gt;Design Patterns&lt;/li&gt;
&lt;li&gt;Dependency Injection&lt;/li&gt;
&lt;li&gt;Mocks/Stubs&lt;/li&gt;
&lt;li&gt;%your_framework% new features (almost never asked)&lt;/li&gt;
&lt;li&gt;Vue 3.0 new features&lt;/li&gt;
&lt;li&gt;NoSQL / MongoDB&lt;/li&gt;
&lt;li&gt;RESTful API&lt;/li&gt;
&lt;li&gt;Testing. PHPUnit/Pest&lt;/li&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;CI/CD&lt;/li&gt;
&lt;li&gt;RDBMS (MySQL specifically) &lt;/li&gt;
&lt;li&gt;Microservices (SOA)&lt;/li&gt;
&lt;li&gt;Queues (RabbitMQ/Redis)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Please share something about your experience. It would be great if you pointed out the problems you had during preparation or during the interview.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
