jg-noncelogic

Originally published at arxiv.org

PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading

Hook: PlotChain is a deterministic, generator-based benchmark that tests how well multimodal LLMs (MLLMs) read engineering plots such as Bode plots, FFT spectra, step responses, and pump curves. Ground truth comes straight from the generator, so the expected answers are exact. Paper: https://arxiv.org/abs/2602.13232 (arXiv:2602.13232)
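
To make "ground truth comes straight from the generator" concrete, here is a minimal sketch of the idea. It is not PlotChain's actual generator: the function name, the parameter choices, and the field names are all hypothetical, and it only models a first-order low-pass Bode magnitude curve.

```python
import math
import random


def generate_lowpass_item(seed: int) -> dict:
    """Hypothetical generator for one first-order low-pass Bode item."""
    rng = random.Random(seed)  # seeded, so the same seed always yields the same item
    cutoff_hz = rng.choice([100.0, 500.0, 1000.0, 5000.0])
    dc_gain_db = rng.choice([0.0, 6.0, 20.0])

    # Sample the magnitude curve that would be rendered as the plot image.
    freqs = [10 ** (1 + 4 * i / 200) for i in range(201)]  # 10 Hz to 100 kHz
    mags_db = [
        dc_gain_db - 10 * math.log10(1 + (f / cutoff_hz) ** 2) for f in freqs
    ]

    return {
        "seed": seed,
        "curve": list(zip(freqs, mags_db)),
        # The answers are computed from the parameters, never read off the image.
        "ground_truth": {
            "cutoff_hz": cutoff_hz,
            "dc_gain_db": dc_gain_db,
            "gain_at_cutoff_db": dc_gain_db - 10 * math.log10(2),
        },
    }
```

Because the labels are derived from the same parameters that draw the curve, there is no annotation noise, and rerunning with the same seed reproduces the item exactly.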

Insight: The neat trick is checkpointed fields (cp_*). Instead of scoring one free-form answer, each plot exposes intermediate sub-skill fields (cutoff frequency, peak magnitude, etc.), so you can localize where a model's reading fails. The authors release the generator, dataset, scoring code, and manifests for fully reproducible runs.
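
A rough sketch of what checkpointed scoring could look like follows. The field names, tolerance values, and pass rule here are assumptions for illustration, not the paper's released scoring code.

```python
def field_passes(pred, truth, abs_tol=0.0, rel_tol=0.0):
    """True when pred is within the larger of the absolute and relative tolerance."""
    if pred is None:  # the model left the field out
        return False
    return abs(pred - truth) <= max(abs_tol, rel_tol * abs(truth))


def score_item(pred: dict, truth: dict, tolerances: dict) -> dict:
    """Score every field independently so a failure can be traced to one sub-skill."""
    results = {}
    for name, expected in truth.items():
        abs_tol, rel_tol = tolerances.get(name, (0.0, 0.0))
        results[name] = field_passes(pred.get(name), expected, abs_tol, rel_tol)
    return results


# Hypothetical item: two checkpoint fields plus the final answer.
truth = {"cp_dc_gain_db": 20.0, "cp_gain_at_cutoff_db": 16.99, "cutoff_hz": 1000.0}
pred = {"cp_dc_gain_db": 20.0, "cp_gain_at_cutoff_db": 14.0, "cutoff_hz": 1600.0}
tolerances = {  # (abs_tol, rel_tol), assumed values
    "cp_dc_gain_db": (0.5, 0.0),
    "cp_gain_at_cutoff_db": (0.5, 0.0),
    "cutoff_hz": (0.0, 0.05),
}
results = score_item(pred, truth, tolerances)
```

In this toy case the DC gain checkpoint passes while the gain-at-cutoff checkpoint and the final cutoff frequency fail, which points at the -3 dB reading step rather than at the plot as a whole.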

Results (practical): The top models reach a field-level pass rate of about 80% (Gemini 2.5 Pro 80.4%, GPT-4.1 79.8%, Claude Sonnet 78.2%), while GPT-4o trails at ~61.6%. Frequency-domain tasks remain brittle: bandpass plots come in at ≤23%, and FFT spectra are still hard.
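
A headline pass rate can hide weak plot families, so it helps to break the field pass rate down per family. The sketch below assumes a hypothetical record layout of (family name, per-field pass/fail dict); the sample records are toy data, not PlotChain results.

```python
from collections import defaultdict


def pass_rate_by_family(records):
    """Aggregate field-level pass rate per plot family.

    records: iterable of (family, {field_name: bool}) pairs (assumed layout).
    """
    totals = defaultdict(lambda: [0, 0])  # family -> [fields passed, fields seen]
    for family, field_results in records:
        totals[family][0] += sum(field_results.values())
        totals[family][1] += len(field_results)
    return {
        family: passed / seen
        for family, (passed, seen) in totals.items()
        if seen  # skip families with no scored fields
    }


# Toy records for illustration only.
toy_records = [
    ("bode", {"cp_dc_gain_db": True, "cutoff_hz": False}),
    ("bode", {"cp_dc_gain_db": True, "cutoff_hz": True}),
    ("fft", {"cp_peak_hz": False}),
]
rates = pass_rate_by_family(toy_records)
```

Reporting the per-family numbers next to the overall rate makes a weak family visible instead of letting it average out.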

Takeaway for engineers: If you are building automation that reads plots, benchmark it with deterministic, checkpointed tests like PlotChain before you ship. Otherwise, subtle but critical gaps (FFT, bandpass) will bite you in production.
