Jaskaran Singh — Senior Software Engineer, AI Trainer
I've spent the last year doing something most engineers haven't: reading AI-generated code all day and deciding whether it's actually good.
Not "does it compile." Not "did the tests pass." Good as in: would I be comfortable having shipped this to production when it breaks at 2 a.m. on a Friday.
The answer, more often than people want to admit, is no.
I use LLMs myself. But after evaluating enough AI-generated code across Python, Java, Kotlin, and C/C++, I know the failure modes aren't random. They follow patterns. And once you see them, you can't unsee them in AI code or your own.
The Job Nobody Has a Good Title For
My official role is AI Trainer. What that actually means: I'm a human in the RLHF loop.
Reinforcement Learning from Human Feedback works by having engineers like me evaluate model outputs against structured rubrics, then rank and rewrite them so the model learns what "better" looks like. I write adversarial prompts to expose failure modes. I do multi-turn code reviews, meaning I follow an entire back-and-forth between a user and a model across five or ten turns, and assess whether the reasoning held up or quietly drifted off the rails somewhere in the middle.
Less "AI whisperer." More "very opinionated senior reviewer who never runs out of things to flag."
The Pattern That Bothers Me Most
There's a category of bug I call "confident and wrong." The code compiles. It's readable. The variable names are sensible. It even has a comment explaining what it does. And it's still wrong. Not obviously wrong, but wrong in the way that only shows up under load, or with a specific input type, or after three other things happen first.
Here's a real example. Prompt was something like: "Write a function to fetch user details and cache the result."
The model produced:
```kotlin
object UserCache {
    private val cache = HashMap<String, User>()

    fun getUser(userId: String, fetchFn: () -> User): User {
        return cache.getOrPut(userId) { fetchFn() }
    }
}
```
Clean. Concise. Totally broken in a concurrent environment.
HashMap isn't thread-safe. Two coroutines calling getOrPut simultaneously on the same key can corrupt the map. The model didn't add a mutex, didn't suggest ConcurrentHashMap, didn't even mention the assumption that this runs single-threaded. It just wrote code that works in the demo and fails in production.
The correct version uses ConcurrentHashMap, and wraps the fetch in a Mutex if you need atomic get-or-fetch semantics:
```kotlin
import java.util.concurrent.ConcurrentHashMap
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

object UserCache {
    private val cache = ConcurrentHashMap<String, User>()
    private val mutex = Mutex()

    suspend fun getUser(userId: String, fetchFn: suspend () -> User): User {
        // fast path: no lock needed for a cache hit
        cache[userId]?.let { return it }
        return mutex.withLock {
            // double-checked after acquiring the lock
            cache.getOrPut(userId) { fetchFn() }
        }
    }
}
```
The model's version would pass code review at most places. That's what worries me.
The Edge Case Problem is Structural, Not Random
After a few hundred evaluations, I stopped thinking of missed edge cases as oversights. They're structural. LLMs optimize for the problem as stated. If the prompt doesn't mention null inputs, concurrent access, or network timeouts, the model won't think about them either.
Good engineers treat those as implied. You don't wait to be asked "what if this list is empty." You just handle it.
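A trivial illustration of treating the empty case as implied (the example is mine, not taken from an evaluation):

```kotlin
// The empty case is part of the contract even when nobody asked about it.
// Returning null forces the caller to decide, instead of dividing by zero
// or silently returning NaN.
fun average(xs: List<Double>): Double? =
    if (xs.isEmpty()) null else xs.sum() / xs.size
```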
Here are the categories where models fail most consistently:
Concurrency. Single-threaded assumptions that explode under real-world load. The HashMap example above is the most common flavor.
Failure state propagation. Functions that catch exceptions and return null or false, then callers that don't check the return value, and the whole chain silently fails. The model gets each function right in isolation. It gets the composition wrong.
Resource cleanup. Network connections, file handles, database cursors left open because the happy path worked and nobody wrote the finally block or used the right scoping construct.
Behavioral drift across turns. In turn 1, the model sets up a class a certain way. By turn 4, after a few "can you refactor this" prompts, it has made changes that contradict the original design without acknowledging it. The code still runs. The architecture is now inconsistent in ways that will cause problems in six months.
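The resource-cleanup flavor, sketched in Kotlin with a hypothetical file-reading example (not from a real evaluation): the leaky version works whenever nothing goes wrong, which is exactly why it survives review.

```kotlin
import java.io.File

// Leak-prone: the reader is opened but never closed, so the file handle
// leaks on every call, happy path included.
fun firstLineLeaky(path: String): String? {
    val reader = File(path).bufferedReader()
    return reader.readLine()
}

// Safe: `use` guarantees close() runs on every path, including when
// readLine() throws.
fun firstLineSafe(path: String): String? =
    File(path).bufferedReader().use { it.readLine() }
```

Kotlin's `use` is the scoping construct that makes the cleanup automatic; the leaky version is the one models tend to produce when the prompt only describes the read.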
What I Actually Look For in a Code Review
My rubric has eight criteria. The ones that surface the most issues:
Correctness under adversarial input. Not "does it work with the example." Does it work when the input is empty, null, malformed, enormous, or concurrent? I'll trace through a model's code in my head with the worst inputs I can think of before scoring it.
Explicitness of assumptions. Code that works is not the same as code that communicates its constraints. If a function assumes its input is sorted, that needs to be in a comment, a precondition check, or the function name. The model almost never does this unprompted.
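A minimal sketch of making an assumption explicit, with hypothetical names. The O(n) sortedness check is for illustration; in a hot path you would state the precondition in the name and documentation instead of verifying it on every call.

```kotlin
// The sorted-input assumption lives in three places: the function name,
// the require() check, and its failure message.
fun binarySearchSorted(sorted: List<Int>, target: Int): Int {
    require(sorted.zipWithNext().all { (a, b) -> a <= b }) {
        "binarySearchSorted requires ascending input"
    }
    return sorted.binarySearch(target)
}
```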
Error handling that means something. There's a specific anti-pattern I call "error theater":
```kotlin
// This is not error handling. This is error cosplay.
try {
    val result = riskyOperation()
    return result
} catch (e: Exception) {
    Log.e("TAG", "Something went wrong")
    return null
}
```
It looks like error handling. It isn't. The caller has no information. The system has no way to recover. The log message gets ignored. Good error handling changes what the caller can do. It doesn't just muffle the crash.
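One hedged sketch of the alternative: the caller gets a typed result it can branch on instead of a null. The sealed hierarchy and exception mapping here are illustrative, not from any real codebase.

```kotlin
// Errors the caller can act on: retry the transient ones, surface the rest.
sealed interface FetchResult {
    data class Success(val value: String) : FetchResult
    data class Retryable(val cause: Throwable) : FetchResult // caller may retry
    data class Fatal(val cause: Throwable) : FetchResult     // caller must surface
}

fun fetch(risky: () -> String): FetchResult =
    try {
        FetchResult.Success(risky())
    } catch (e: java.io.IOException) {
        FetchResult.Retryable(e)  // transient: network, timeouts
    } catch (e: IllegalStateException) {
        FetchResult.Fatal(e)      // programming error: retrying won't help
    }
```

The point isn't the specific types; it's that the signature now tells the caller what can happen and what to do about it.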
Security surface. SQL construction via string interpolation, credentials in code comments, user input passed to shell commands without sanitization. These come up. Not constantly, but often enough that I check every time.
The Skill That Transferred Back
I didn't expect this job to change how I write code. It did.
Spending eight hours a day articulating why something is wrong, not just flagging it but writing a clear explanation that a model can actually learn from, builds a habit of internal interrogation that's hard to turn off.
Now, before I submit a PR, I run my own rubric. Is this thread-safe? What happens on retry? Who owns cleanup? Does this function do what its name says, or has it quietly acquired a second responsibility?
That last one is underrated. Functions that do two things are where bugs live. The AI writes them constantly because function names get generated from the prompt context, and prompts often have two goals. "Fetch and validate" is two functions pretending to be one.
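A sketch of the split, with hypothetical types and names: one job per function, and the composition happens visibly at the call site rather than inside a do-everything helper.

```kotlin
data class RawUser(val name: String?, val email: String?)
data class ValidUser(val name: String, val email: String)

// Fetching: only responsibility is getting the data from some source.
fun fetchUser(id: String, source: (String) -> RawUser): RawUser = source(id)

// Validating: only responsibility is deciding whether the data is usable.
fun validateUser(raw: RawUser): ValidUser? {
    val name = raw.name?.takeIf { it.isNotBlank() } ?: return null
    val email = raw.email?.takeIf { "@" in it } ?: return null
    return ValidUser(name, email)
}
```

Now "fetch succeeded but validation failed" is a state the caller can see, instead of a null that could mean either.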
Where AI Code Actually Shines
I've been critical, so let me be fair.
AI-generated code is genuinely good at boilerplate. Serialization logic, configuration parsing, test scaffolding, adapters between interfaces that differ only in naming. Tedious work that models handle well. If I ask for a Room database entity with a DAO and a repository, the output is usually solid and saves thirty minutes.
```kotlin
// This kind of scaffolding? Models nail it.
@Entity(tableName = "users")
data class UserEntity(
    @PrimaryKey val id: String,
    val name: String,
    val email: String,
    val createdAt: Long = System.currentTimeMillis()
)

@Dao
interface UserDao {
    @Query("SELECT * FROM users WHERE id = :userId")
    suspend fun getUserById(userId: String): UserEntity?

    @Insert(onConflict = OnConflictStrategy.REPLACE)
    suspend fun insertUser(user: UserEntity)
}
```
Models are also good at surfacing options I'd forgotten about. Not because they know my codebase, but because they've seen enough code to suggest a StateFlow where I was reaching for LiveData, or runCatching in a context where it genuinely fits.
The mistake is treating it as something that reasons about your system. It doesn't know your system. It knows patterns. Those overlap most of the time and fail in ways that aren't obvious the other times.
Why I Wrote This
A few months ago I started noticing that engineers I respect were shipping AI-generated code without reviewing it seriously. Not because they're lazy. Because the code looked fine. That's the problem. It's calibrated to look fine.
The engineers who work well with AI tooling treat it the way experienced engineers treat a junior developer: capable, useful, not fully trusted without review, and prone to specific failure patterns you learn over time.
That framing changed how I work with it. I think it'll change how you do too.
Jaskaran Singh is a Senior Software Engineer working in AI training and evaluation. Previously built Android fintech apps at Comviva Technologies and Talentica Software. Currently building a Python-based OINP immigration monitoring bot on the side, because immigration status shouldn't require manually refreshing government websites.