Jangwook Kim

Posted on May 22 • Originally published at effloow.com

LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025)

#benchmark #researchreproducibility #llmagents #paperpoc

A research team from the University of Texas at Dallas published LMR-BENCH at EMNLP 2025, asking a specific question: can LLM agents reproduce the core implementation from an NLP research paper when given the paper, a partially masked codebase, and explicit instructions?

This is harder than it sounds, and the benchmark design is worth understanding in detail.

Sources: arXiv 2506.17335 | ACL Anthology | GitHub

What the benchmark actually tests

LMR-BENCH contains 28 reproduction tasks drawn from 23 NLP papers published in ACL, EMNLP, NAACL, and AAAI over the past five years. Each task follows the same structure:

Paper: the full PDF
Masked repository: a real codebase from the paper, but with one or more critical functions replaced by # TODO: implement stubs
Implementation instruction: a natural language description of what the masked function should do, including cross-file dependencies and design intent

The agent's job is to generate patch code that fills the stubs correctly.

This tests something distinct from "can an LLM write a function from a docstring." The function body has to match what the paper describes, use the surrounding codebase's conventions, and pass unit tests against the paper's reference implementation.
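To make the masking idea concrete, here is a minimal sketch of how a tool could locate the stubs an agent has to fill. It assumes the masked functions are ordinary Python functions whose body contains a top-level `raise NotImplementedError`; the helper name `find_masked_functions` is hypothetical and not part of the LMR-BENCH repo.

```python
import ast


def _raises_not_implemented(stmt: ast.Raise) -> bool:
    exc = stmt.exc
    # Handles both `raise NotImplementedError` and `raise NotImplementedError(...)`.
    if isinstance(exc, ast.Call):
        exc = exc.func
    return isinstance(exc, ast.Name) and exc.id == "NotImplementedError"


def find_masked_functions(source: str) -> list[str]:
    """Return names of functions with a top-level `raise NotImplementedError`.

    Assumption: stubs are marked this way. A repo that masks with `pass`
    or `...` would need a different check.
    """
    masked = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for stmt in node.body:
            if isinstance(stmt, ast.Raise) and _raises_not_implemented(stmt):
                masked.append(node.name)
                break
    return masked


# Hypothetical two-function module: one masked, one intact.
EXAMPLE = '''
def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to query and key tensors."""
    # TODO: implement
    raise NotImplementedError

def reshape_for_broadcast(freqs_cis, x):
    return freqs_cis
'''

print(find_masked_functions(EXAMPLE))
```

Parsing with `ast` rather than searching for the string `TODO` avoids false positives from comments in functions that are already implemented.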

The nine task categories

Category	What gets masked
Tokenization	Custom tokenizer logic
Attention mechanism	Scaled dot-product or custom attention
Positional encoding	RoPE, ALiBi, learned variants
Loss function	Custom training objectives
Data preprocessing	Dataset-specific transforms
Model architecture	Layer definitions, custom blocks
Training procedure	Optimizer steps, gradient modifications
Decoding strategy	Beam search variants, constrained decoding
Evaluation metric	BLEU variants, task-specific metrics

The hardest category is model architecture: reproducing a custom layer requires reading across multiple files to understand tensor shapes, class inheritance, and forward pass conventions — exactly the kind of multi-file reasoning that current LLMs struggle with.

The easiest is evaluation metric: formulas are usually self-contained, well-documented in the paper, and don't require deep codebase knowledge.

How masking works in practice

Here's what a masked task looks like (synthetic example based on paper methodology):

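# Original in paper's codebase: rotary_embedding.py
import torch

from utils import reshape_for_broadcast  # helper defined in utils.py

def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to query and key tensors."""
    # View the last dimension as (real, imag) pairs, i.e. complex numbers.
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    # Reshape the frequencies so they broadcast against xq_.
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    # Rotate by complex multiplication, then flatten the pairs back out.
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)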

# Masked version (what the agent receives):
def apply_rotary_emb(xq, xk, freqs_cis):
    """Apply rotary embeddings to query and key tensors."""
    # TODO: implement
    # Instruction: Apply rotary position embeddings to xq and xk.
    # Use torch.view_as_complex for complex number representation.
    # freqs_cis shape must be broadcast-compatible with xq_.
    # Return float tensors matching input dtype.
    raise NotImplementedError

The info.json for this task would also specify which files the agent should read (reshape_for_broadcast definition lives in utils.py, for example).
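For readers who want to see what the complex-number trick in that function computes, here is a dependency-free sketch of the same rotation on a single head vector. It is an illustration and not the benchmark's reference code: the names `precompute_rotations`, `rotate`, and `dot` are hypothetical, and the base `theta=10000.0` is the value commonly used for rotary embeddings, assumed here.

```python
import cmath


def precompute_rotations(head_dim: int, seq_len: int, theta: float = 10000.0):
    """Unit complex numbers e^{i * pos * freq_j} for each position and pair j."""
    freqs = [1.0 / (theta ** (2 * j / head_dim)) for j in range(head_dim // 2)]
    return [[cmath.rect(1.0, pos * f) for f in freqs] for pos in range(seq_len)]


def rotate(vec, rotations):
    """Treat (vec[0], vec[1]), (vec[2], vec[3]), ... as complex numbers
    and multiply each by the matching unit rotation."""
    out = []
    for j, rot in enumerate(rotations):
        z = complex(vec[2 * j], vec[2 * j + 1]) * rot
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rots = precompute_rotations(head_dim=4, seq_len=8)
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (1) at two different absolute positions.
score_near = dot(rotate(q, rots[2]), rotate(k, rots[1]))
score_far = dot(rotate(q, rots[6]), rotate(k, rots[5]))
print(score_near, score_far)
```

The two printed scores agree up to floating-point error, because the query-key dot product after rotation depends only on the relative offset between the two positions. That is the property a unit test for a function like this can check.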

Dual evaluation: unit tests + LLM-as-judge

LMR-BENCH scores agents on two axes:

Axis 1 — Functional correctness (unit tests)
Numerical equivalence against the reference implementation. The agent's patch must produce the same tensor outputs as the original function.

Axis 2 — Implementation fidelity (LLM-as-judge)
GPT-4o reads the paper's algorithm description and the agent's code, then scores whether the implementation actually follows the described method. This is judged separately from the unit tests, so a patch that passes them through an equivalent but differently structured approach is still assessed on fidelity.

This dual axis matters because:

A function can pass unit tests but use a different algorithm (memorized shortcut)
A function can fail unit tests due to floating-point differences but be conceptually correct

The two axes tell you different things about the agent's reasoning.
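The floating-point case is easy to reproduce. Below is a minimal sketch of a tolerance-based comparison over nested lists, assuming outputs have already been converted to plain Python numbers. The helper name `outputs_match` and the tolerance values are illustrative assumptions, not the thresholds LMR-BENCH uses.

```python
import math


def outputs_match(candidate, reference, rel_tol=1e-5, abs_tol=1e-8):
    """Compare nested lists of numbers element-wise within a tolerance.

    Returns False on any shape mismatch instead of raising.
    """
    ref_is_seq = isinstance(reference, (list, tuple))
    cand_is_seq = isinstance(candidate, (list, tuple))
    if ref_is_seq != cand_is_seq:
        return False
    if ref_is_seq:
        return len(candidate) == len(reference) and all(
            outputs_match(c, r, rel_tol, abs_tol)
            for c, r in zip(candidate, reference)
        )
    return math.isclose(candidate, reference, rel_tol=rel_tol, abs_tol=abs_tol)


reference = [[0.3, 1.0], [2.0, 4.0]]
candidate = [[0.1 + 0.2, 1.0], [2.0, 4.0]]  # 0.1 + 0.2 is not exactly 0.3

print(candidate == reference)               # strict equality
print(outputs_match(candidate, reference))  # tolerance-based comparison
```

Strict equality rejects the candidate while the tolerance-based check accepts it, which is why a test suite's choice of tolerance decides whether a conceptually correct patch counts as a pass.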

What the results show

The paper doesn't release a full leaderboard in the public arXiv version, but the findings indicate:

o3-mini (high compute) was the best-performing model tested
Pass@1 rates ranged roughly from 20% to 60% across task categories
Multi-file reasoning was the single biggest differentiator: models that could trace function calls across 3+ files significantly outperformed those that stayed in the target file
Giving the model only the paper PDF, without the masked code, produced worse results than giving both, which suggests the code context matters more than the paper text for reproduction tasks

The last point is counterintuitive. You'd expect the paper's equations to be the key signal. But the surrounding codebase (tensor shapes, variable naming, utility functions) constrains the solution space more tightly than the abstract algorithm description.
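For reference, pass@1 is the probability that a single sampled attempt passes the tests. A minimal sketch of the standard unbiased pass@k estimator from the code-generation literature follows. Whether LMR-BENCH computes its numbers with this estimator or from one attempt per task is not stated in this post, so treat that as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k attempts drawn without replacement are all failures.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy numbers for illustration only, not results from the paper.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

For k = 1 the estimator reduces to c / n, the plain fraction of attempts that passed.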

Why this benchmark matters for developers

If you're building an AI-assisted research coding tool (or evaluating whether an agent can help you implement a paper), LMR-BENCH is among the most realistic evaluation frameworks available. The alternatives:

HumanEval / MBPP: function-level, no paper context, no cross-file reasoning
SWE-bench: bug fixing in large codebases, different skill set from paper reproduction
APPS: competitive programming, not research implementation

LMR-BENCH specifically targets the "I read a paper, now implement it" workflow — which is what most ML engineers actually do.

Running the benchmark yourself

The benchmark repo requires Python ≥ 3.12 and supports any LLM backend through its evaluation harness:

git clone https://github.com/du-nlp-lab/LMR-Bench
cd LMR-Bench
pip install -r requirements.txt

# Run a single task with Claude
python evaluate.py \
  --task benchmark/rotary_emb_task/ \
  --model claude-opus-4-7-20251001 \
  --api-key "$ANTHROPIC_API_KEY"

The evaluation harness handles sending the paper and masked code to the model, collecting the patch, running unit tests, and recording fidelity scores.
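Those four steps can be sketched as a single function. This is a hypothetical outline of the flow and not the repo's actual harness: `TaskResult`, `run_task`, and the three injected callables are names invented here, and the model call, test runner, and judge are passed in so the sketch runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskResult:
    task_id: str
    tests_passed: bool
    fidelity_score: float


def run_task(
    task_id: str,
    paper_text: str,
    masked_code: str,
    instruction: str,
    generate_patch: Callable[[str], str],
    run_unit_tests: Callable[[str], bool],
    judge_fidelity: Callable[[str, str], float],
) -> TaskResult:
    """Run one task: prompt the model, then score the patch on both axes."""
    # 1. Send the paper, the masked code, and the instruction to the model.
    prompt = "\n\n".join([paper_text, masked_code, instruction])
    # 2. Collect the patch.
    patch = generate_patch(prompt)
    # 3. Functional correctness via unit tests.
    passed = run_unit_tests(patch)
    # 4. Fidelity is scored independently of the test outcome.
    score = judge_fidelity(paper_text, patch)
    return TaskResult(task_id, passed, score)


# Stand-in callables so the flow can be exercised offline.
result = run_task(
    task_id="demo",
    paper_text="(paper text)",
    masked_code="def f(x):\n    raise NotImplementedError",
    instruction="Return x doubled.",
    generate_patch=lambda prompt: "def f(x):\n    return 2 * x",
    run_unit_tests=lambda patch: "return 2 * x" in patch,
    judge_fidelity=lambda paper, patch: 1.0,
)
print(result)
```

Scoring fidelity even when tests fail keeps the two axes independent, which matches the dual-evaluation design described earlier.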

What to expect if you run it

Based on the paper's findings, expect:

Evaluation metric tasks: 50–60% pass@1 with a capable model
Model architecture tasks: 20–30% pass@1, sometimes lower
Most failures: not the wrong algorithm but wrong tensor handling, typically shape mismatches from not reading the surrounding code carefully enough

If you're using this to evaluate your own agent, the architecture tasks are the most informative discriminator between models.
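If you log one pass/fail outcome per task, a per-category breakdown takes a few lines. The sketch below assumes results are stored as `(category, passed)` pairs; the function name `pass_rate_by_category` is hypothetical and the sample data is made up for illustration, not taken from the paper.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """Map each category to its fraction of passed tasks.

    `results` is an iterable of (category, passed) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for category, passed in results:
        counts[category][0] += int(passed)
        counts[category][1] += 1
    return {cat: p / n for cat, (p, n) in counts.items()}


# Made-up outcomes for illustration only.
toy_results = [
    ("Evaluation metric", True),
    ("Evaluation metric", False),
    ("Model architecture", False),
    ("Model architecture", False),
    ("Model architecture", True),
    ("Model architecture", False),
]

for category, rate in pass_rate_by_category(toy_results).items():
    print(f"{category}: {rate:.0%}")
```

With only 28 tasks across nine categories, each category holds a handful of tasks, so per-category rates from a single run are coarse and small differences between models deserve caution.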

The broader picture

LMR-BENCH reveals a gap that matters: LLMs can explain papers well and can write code well, but the intersection — implement exactly what this paper describes, in this codebase, with these constraints — is still hard. The benchmark gives that gap a number.

For the AI research community, this is also a forcing function: if you want your paper to be reproducible by an LLM agent, write clearer implementation instructions and reduce cross-file dependencies in your codebase.

Paper: Shuo Yan et al., "LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research," EMNLP 2025. arXiv:2506.17335. All results cited from the published paper. PoC evidence in data/lab-runs/lmr-bench-llm-reproduce-nlp-research-code-paper-poc-2026.md.
