<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: 黎辰悦</title>
    <description>The latest articles on DEV Community by 黎辰悦 (@_b3ac7984a6857e9b62757).</description>
    <link>https://dev.to/_b3ac7984a6857e9b62757</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3673688%2Faecf886a-c2f9-4b6f-a8c9-9ecc67aebcbb.png</url>
      <title>DEV Community: 黎辰悦</title>
      <link>https://dev.to/_b3ac7984a6857e9b62757</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/_b3ac7984a6857e9b62757"/>
    <language>en</language>
    <item>
      <title>A Reproducible Prompt Workflow for Multi-File Bug Fixing (Free Generator Included)</title>
      <dc:creator>黎辰悦</dc:creator>
      <pubDate>Tue, 23 Dec 2025 13:28:55 +0000</pubDate>
      <link>https://dev.to/_b3ac7984a6857e9b62757/a-reproducible-prompt-workflow-for-multi-file-bug-fixing-free-generator-included-28bo</link>
      <guid>https://dev.to/_b3ac7984a6857e9b62757/a-reproducible-prompt-workflow-for-multi-file-bug-fixing-free-generator-included-28bo</guid>
      <description>&lt;p&gt;Multi-file bug fixes go wrong in a very predictable way: the agent starts editing too early, touches unrelated files, and you end up with a messy change you can’t review or reproduce.&lt;/p&gt;

&lt;p&gt;This post shares a &lt;strong&gt;repeatable prompt workflow&lt;/strong&gt; you can reuse for multi-file fixes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recon → Plan → Patch → Verify&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And yes — I built a tiny free tool to generate this kind of prompt pack automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project home:&lt;/strong&gt; &lt;a href="https://devstral2.org" rel="noopener noreferrer"&gt;https://devstral2.org&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Free generator:&lt;/strong&gt; &lt;a href="https://devstral2.org/tools/devstral2-prompt-pack.html" rel="noopener noreferrer"&gt;https://devstral2.org/tools/devstral2-prompt-pack.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foodjrcyu1tkdpktdpssq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foodjrcyu1tkdpktdpssq.png" alt=" " width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Why multi-file bug fixing is harder than it looks&lt;/h2&gt;

&lt;p&gt;Multi-file bugs usually span boundaries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;page templates / layout wrappers
&lt;/li&gt;
&lt;li&gt;typography + global CSS rules
&lt;/li&gt;
&lt;li&gt;content rendering (Markdown/prose styles)
&lt;/li&gt;
&lt;li&gt;build/deploy differences and caching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agents fail here when they "just start editing" before understanding how the UI is composed.&lt;/p&gt;




&lt;h2&gt;The workflow (Recon → Plan → Patch → Verify)&lt;/h2&gt;

&lt;h3&gt;1) Recon (repo reconnaissance)&lt;/h3&gt;

&lt;p&gt;Goal: understand the minimum set of files involved.&lt;/p&gt;

&lt;p&gt;Ask the agent to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;locate the entry point(s)&lt;/li&gt;
&lt;li&gt;list candidate files with 1-line reasons&lt;/li&gt;
&lt;li&gt;form a root-cause hypothesis before editing anything&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;2) Plan (minimal change plan)&lt;/h3&gt;

&lt;p&gt;Goal: define the smallest patch that meets acceptance criteria.&lt;/p&gt;

&lt;p&gt;Require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file list + why&lt;/li&gt;
&lt;li&gt;patch steps (by file)&lt;/li&gt;
&lt;li&gt;acceptance criteria + verification steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;3) Patch (step-by-step edits)&lt;/h3&gt;

&lt;p&gt;Goal: keep diffs auditable.&lt;/p&gt;

&lt;p&gt;Require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;edits grouped by file&lt;/li&gt;
&lt;li&gt;“what changed” + “why”&lt;/li&gt;
&lt;li&gt;avoid unrelated refactors&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;4) Verify (proof it works)&lt;/h3&gt;

&lt;p&gt;Goal: provide evidence.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;manual checks&lt;/li&gt;
&lt;li&gt;quick regressions&lt;/li&gt;
&lt;li&gt;(optional) tests/build commands&lt;/li&gt;
&lt;/ul&gt;
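&lt;p&gt;The four phases above can be packaged into one reusable prompt pack. Here is a minimal Python sketch of that idea; the function name and prompt wording are my own illustration, not the actual devstral2.org generator:&lt;/p&gt;

```python
# Minimal sketch of a Recon -> Plan -> Patch -> Verify prompt pack builder.
# All names and wording here are illustrative, not the real generator's output.

def build_prompt_pack(bug: str, goal: str, constraints: list[str]) -> dict:
    """Assemble one prompt per phase so each step can be run and reviewed separately."""
    shared = f"Bug: {bug}\nGoal: {goal}\nConstraints: {'; '.join(constraints)}"
    return {
        "recon": shared + "\nBefore editing anything: locate entry points, "
                 "list candidate files with 1-line reasons, and state a root-cause hypothesis.",
        "plan": shared + "\nPropose the smallest patch that meets the acceptance criteria: "
                "file list plus why, patch steps by file, acceptance criteria plus verification steps.",
        "patch": shared + "\nApply the approved plan. Group edits by file, explain "
                 "'what changed' and 'why', and avoid unrelated refactors.",
        "verify": shared + "\nProvide evidence: manual checks, quick regressions, "
                  "and optionally test/build commands.",
    }

pack = build_prompt_pack(
    bug="blog detail content is too gray/low-contrast on the dark theme",
    goal="match the homepage typography/layout; ensure readability",
    constraints=["minimal diffs", "no unrelated refactors"],
)
for phase, prompt in pack.items():
    print(f"--- {phase.upper()} ---\n{prompt}\n")
```

&lt;p&gt;Running the phases one at a time (and approving the plan before allowing edits) is what keeps the agent from "just starting to edit".&lt;/p&gt;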




&lt;h2&gt;Real bug scenario (from devstral2.org)&lt;/h2&gt;

&lt;p&gt;While adding blog post detail pages to &lt;strong&gt;devstral2.org&lt;/strong&gt;, I hit a real UI bug:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bug:&lt;/strong&gt; the blog detail content looked &lt;strong&gt;too gray / low contrast&lt;/strong&gt; on the dark theme, so parts of the article were hard to read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expected behavior:&lt;/strong&gt; the blog detail page should match the &lt;strong&gt;homepage’s typography/layout style&lt;/strong&gt; (consistent font, contrast, spacing).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification:&lt;/strong&gt; after the fix, I refreshed the page and confirmed that the entire blog detail content is clearly readable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, the minimal fix was:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;make the blog detail wrapper reuse the homepage container/typography classes&lt;/strong&gt; (instead of rewriting global CSS).&lt;/p&gt;
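&lt;p&gt;As a toy illustration of that minimal fix, here is a Python sketch; the class names (&lt;code&gt;container prose prose-invert&lt;/code&gt;, &lt;code&gt;post-detail text-muted&lt;/code&gt;) are hypothetical stand-ins, not the real devstral2.org styles:&lt;/p&gt;

```python
# Illustrative sketch only: the class names below are hypothetical, not the actual
# devstral2.org source. The shape of the minimal fix is the point: point the blog
# detail wrapper at the classes the homepage already uses, instead of rewriting
# global CSS.

wrapper_classes = {
    "homepage": "container prose prose-invert",   # proven readable on the dark theme
    "blog_detail": "post-detail text-muted",      # one-off wrapper causing low contrast
}

def apply_minimal_fix(classes: dict) -> dict:
    """One targeted change: reuse the homepage container/typography classes."""
    fixed = dict(classes)
    fixed["blog_detail"] = fixed["homepage"]
    return fixed

fixed = apply_minimal_fix(wrapper_classes)
print(fixed["blog_detail"])
```

&lt;p&gt;The diff is one line of template markup, which makes it trivial to review and to revert.&lt;/p&gt;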




&lt;h2&gt;Copy-paste Prompt Pack (example)&lt;/h2&gt;

&lt;p&gt;Use this prompt pack for the same category of bug:&lt;/p&gt;

&lt;h3&gt;Prompt Pack: Multi-file bug fixing (low-contrast blog detail page)&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;You are an AI engineer working on a dark-themed website (devstral2.org).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug:&lt;/strong&gt; blog post detail content is too gray/low-contrast; parts are hard to read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal:&lt;/strong&gt; make the blog detail typography/layout consistent with the homepage; ensure readability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constraints:&lt;/strong&gt; minimal diffs; avoid broad refactors; don’t change unrelated pages; keep design consistent.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Recon (before edits):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Locate where the blog detail page is rendered (template/component/html).&lt;/li&gt;
&lt;li&gt;Identify the homepage wrapper/container + typography classes.&lt;/li&gt;
&lt;li&gt;Compare blog detail wrapper vs homepage wrapper.&lt;/li&gt;
&lt;li&gt;Find which styling rule causes low contrast (color/opacity/prose/text class/CSS vars).&lt;/li&gt;
&lt;li&gt;List candidate files to change with 1-line reasons.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Plan (minimal fix):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Propose the smallest patch: blog detail wrapper reuses homepage container/typography structure.&lt;/li&gt;
&lt;li&gt;Specify exact files to edit and why.&lt;/li&gt;
&lt;li&gt;Define acceptance criteria + regression checks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Patch:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Update the blog detail wrapper/container classes to match homepage.&lt;/li&gt;
&lt;li&gt;Keep changes scoped; no unrelated formatting refactors.&lt;/li&gt;
&lt;li&gt;Explain changes PR-style (what/why).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Verify:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Refresh blog detail page and verify readability.&lt;/li&gt;
&lt;li&gt;Check headings/paragraphs/links (and code blocks if present).&lt;/li&gt;
&lt;li&gt;Quick regression: open homepage + at least one tools page.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output format:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Root cause hypothesis&lt;/li&gt;
&lt;li&gt;Files to change&lt;/li&gt;
&lt;li&gt;Patch steps (by file)&lt;/li&gt;
&lt;li&gt;Verification checklist&lt;/li&gt;
&lt;/ol&gt;
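&lt;p&gt;If you automate this workflow, the required output format is also easy to check. A minimal sketch (the section titles mirror the list above; the substring matching is deliberately naive):&lt;/p&gt;

```python
# Minimal sketch: check that an agent's reply contains every required section.
# Section titles mirror the "Output format" list; the parsing is illustrative.

REQUIRED_SECTIONS = [
    "Root cause hypothesis",
    "Files to change",
    "Patch steps",
    "Verification checklist",
]

def missing_sections(reply: str) -> list[str]:
    """Return the required section titles that the reply does not contain."""
    return [s for s in REQUIRED_SECTIONS if s.lower() not in reply.lower()]

reply = "Root cause hypothesis: ...\nFiles to change: ...\nPatch steps (by file): ..."
print(missing_sections(reply))  # the sample reply above lacks a verification checklist
```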




&lt;h2&gt;Quick acceptance checklist&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Recon done before edits (candidate files listed)&lt;/li&gt;
&lt;li&gt;[ ] Plan includes acceptance criteria&lt;/li&gt;
&lt;li&gt;[ ] Changes are minimal and scoped&lt;/li&gt;
&lt;li&gt;[ ] Verify includes refresh + quick regressions&lt;/li&gt;
&lt;li&gt;[ ] No unrelated pages were impacted&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you try this workflow, which step helps you the most — Recon, Plan, Patch, or Verify?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project home:&lt;/strong&gt; &lt;a href="https://devstral2.org" rel="noopener noreferrer"&gt;https://devstral2.org&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Free generator:&lt;/strong&gt; &lt;a href="https://devstral2.org/tools/devstral2-prompt-pack.html" rel="noopener noreferrer"&gt;https://devstral2.org/tools/devstral2-prompt-pack.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>playwright</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Devstral 2 vs Devstral Small 2: A 30-Minute Playground Test for Multi-File Coding Tasks</title>
      <dc:creator>黎辰悦</dc:creator>
      <pubDate>Mon, 22 Dec 2025 12:44:36 +0000</pubDate>
      <link>https://dev.to/_b3ac7984a6857e9b62757/devstral-2-vs-devstral-small-2-a-30-minute-playground-test-for-multi-file-coding-tasks-1ak</link>
      <guid>https://dev.to/_b3ac7984a6857e9b62757/devstral-2-vs-devstral-small-2-a-30-minute-playground-test-for-multi-file-coding-tasks-1ak</guid>
      <description>&lt;p&gt;A practical decision tree: when to pick Devstral 2 vs Devstral Small 2&lt;/p&gt;

&lt;p&gt;A reproducible 30-minute Playground test plan (same prompt, same params, two runs)&lt;/p&gt;

&lt;p&gt;A comparison table you can screenshot and reuse&lt;/p&gt;

&lt;p&gt;A truthfulness statement: clearly separating [facts] / [test results] / [opinions]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What Are Devstral 2 and Devstral Small 2?&lt;/li&gt;
&lt;li&gt;Performance Comparison (What to Compare Without Making Up Benchmarks)&lt;/li&gt;
&lt;li&gt;Practical Applications (Multi-File vs Small Tasks)&lt;/li&gt;
&lt;li&gt;Cost and Accessibility (Verify First)&lt;/li&gt;
&lt;li&gt;Implementation Guide: 30-Minute Playground Test (My Template)&lt;/li&gt;
&lt;li&gt;Making the Right Choice (Decision Tree)&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;li&gt;Appendix: Full Prompt (Copy-Paste)&lt;/li&gt;
&lt;li&gt;Disclaimer: Facts vs Tests vs Opinions&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;1. What Are Devstral 2 and Devstral Small 2?&lt;/h2&gt;

&lt;p&gt;Both Devstral 2 and Devstral Small 2 are positioned for software engineering / code intelligence. The official pages emphasize similar strengths: tool usage, codebase exploration, and multi-file editing—which are core requirements for “code agents” that operate across a repository.&lt;/p&gt;

&lt;h3&gt;1.1 What is Devstral 2?&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Positioning (official):&lt;/strong&gt; a code-agent-oriented model for software engineering tasks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Emphasis:&lt;/strong&gt; tool usage + repo exploration + multi-file editing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Who it’s for:&lt;/strong&gt; higher-complexity engineering tasks where plan quality and regression control matter&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;1.2 What is Devstral Small 2?&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Positioning (official):&lt;/strong&gt; similar focus on tools + exploration + multi-file editing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Key difference (practical expectation):&lt;/strong&gt; typically framed as a lower-cost / lighter option for more frequent iteration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;1.3 Verifiable Facts Checklist (Please Verify on Official Pages)&lt;/h3&gt;

&lt;p&gt;Before making any choice, I recommend keeping these fields in front of you and verifying them on the official/model card pages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Context length:&lt;/strong&gt; does each model list the same context length?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; input/output price per 1M tokens (and whether there’s a free period)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model names / versions:&lt;/strong&gt; e.g., “Devstral 2512” vs “Labs Devstral Small 2512”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Positioning statement:&lt;/strong&gt; wording about “code agents / tools / multi-file editing”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Playground availability:&lt;/strong&gt; which models appear in Studio/Playground and under what labels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: The rest of this post intentionally avoids “invented numeric benchmarks.” I only use a reproducible comparison workflow you can run yourself.&lt;/p&gt;

&lt;h2&gt;2. Performance Comparison: What to Compare Without Fabricating Benchmarks&lt;/h2&gt;

&lt;p&gt;Instead of arguing “which is better” by feel, I compare the same multi-file project prompt twice in Playground and score the outputs using four practical engineering metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Plan Quality&lt;/strong&gt; – does it propose a step-by-step, engineering-grade plan?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scope Control&lt;/strong&gt; – does it limit changes to necessary files and explain impact?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test Awareness&lt;/strong&gt; – does it propose verification steps or tests, not just code?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reviewability&lt;/strong&gt; – is the output PR-friendly (clear diffs, rationale, checklist)?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics are useful because multi-file tasks are where models fail most painfully: one wrong assumption can cascade into broken imports, mismatched interfaces, hidden regressions, or unreviewable “big rewrites.”&lt;/p&gt;

&lt;h2&gt;3. Practical Applications: Multi-File Tasks vs “Small” Tasks&lt;/h2&gt;

&lt;p&gt;A simple way to frame the choice is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Devstral 2:&lt;/strong&gt; “I need fewer failures on complex repo tasks.”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Devstral Small 2:&lt;/strong&gt; “I’m iterating quickly on smaller pieces and cost matters.”&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;3.1 Typical Scenarios for Devstral 2 (e.g., Devstral 2512)&lt;/h3&gt;

&lt;p&gt;Choose this direction when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your task spans multiple files with interface linkage or dependency chains&lt;/li&gt;
&lt;li&gt;Regression risk is high (one change can break other modules)&lt;/li&gt;
&lt;li&gt;You want an engineering plan (scoping, test points, reviewability)&lt;/li&gt;
&lt;li&gt;You prioritize stability over token cost (verify pricing on the official page)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;3.2 Typical Scenarios for Devstral Small 2 (e.g., Labs Devstral Small 2512)&lt;/h3&gt;

&lt;p&gt;Choose this direction when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requirements are simpler: single file, low risk, or easy to decompose&lt;/li&gt;
&lt;li&gt;You’re budget-sensitive and want frequent, low-cost iterations (verify pricing)&lt;/li&gt;
&lt;li&gt;You’re willing to add stronger constraints for stability, for example:
&lt;ul&gt;
&lt;li&gt;“Scout before modifying”&lt;/li&gt;
&lt;li&gt;“Output only the smallest diff”&lt;/li&gt;
&lt;li&gt;“List test points explicitly”&lt;/li&gt;
&lt;li&gt;“Don’t refactor unrelated code”&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;4. Cost and Accessibility (Verify First)&lt;/h2&gt;

&lt;p&gt;This section is intentionally verification-first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to verify&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the API currently free? If yes, until when?&lt;/li&gt;
&lt;li&gt;After the free period, what are the input/output prices per 1M tokens for Devstral 2 and for Devstral Small 2?&lt;/li&gt;
&lt;li&gt;Are there regional / account limitations or model availability differences?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How cost changes your decision&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If two models produce similarly acceptable outputs for your tasks, cost becomes a meaningful tie-breaker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frequent iteration + small tasks → the lower-cost option may win&lt;/li&gt;
&lt;li&gt;High-risk multi-file tasks → paying more to reduce failures may be worth it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;5. One-Page Scorecard (Screenshot-Friendly)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Test Setup (keep identical for fairness)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temperature: 0.3&lt;/li&gt;
&lt;li&gt;max_tokens: 2048&lt;/li&gt;
&lt;li&gt;top_p: 1&lt;/li&gt;
&lt;li&gt;Response format: Text&lt;/li&gt;
&lt;li&gt;Same prompt, two runs (only switch the model)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Models&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run A: Devstral 2512&lt;/li&gt;
&lt;li&gt;Run B: Labs Devstral Small 2512&lt;/li&gt;
&lt;/ul&gt;
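&lt;p&gt;The two-run setup is easy to keep honest in code: build both request payloads from one shared set of parameters and change only the model field. The payload shape below follows common chat-completion APIs and the model identifiers are illustrative; verify exact field and model names against the official API reference:&lt;/p&gt;

```python
# Sketch of the two-run setup: identical sampling parameters, only the model differs.
# Field names follow common chat-completion APIs; exact names for any given provider
# (and the model identifiers) are assumptions -- check the official API reference.

BASE_PARAMS = {"temperature": 0.3, "max_tokens": 2048, "top_p": 1}

def build_run_payloads(prompt: str) -> dict[str, dict]:
    """Return one request payload per run, sharing everything except the model."""
    runs = {"A": "devstral-2512", "B": "labs-devstral-small-2512"}  # illustrative names
    return {
        run: {"model": model, "messages": [{"role": "user", "content": prompt}], **BASE_PARAMS}
        for run, model in runs.items()
    }

payloads = build_run_payloads("Refactor the auth module across three files...")

# Fairness check: everything except the model must be identical between runs.
a, b = payloads["A"], payloads["B"]
assert {k: v for k, v in a.items() if k != "model"} == {k: v for k, v in b.items() if k != "model"}
```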

&lt;p&gt;&lt;strong&gt;4-Metric Engineering Scorecard (1–5)&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;What “5/5” Looks Like&lt;/th&gt;&lt;th&gt;Devstral 2 (A)&lt;/th&gt;&lt;th&gt;Small 2 (B)&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Plan Quality&lt;/td&gt;&lt;td&gt;Step-by-step plan, sequencing, dependencies, clear deliverables&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Scope Control&lt;/td&gt;&lt;td&gt;Minimal necessary changes, avoids unrelated refactors, names affected files&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Test Awareness&lt;/td&gt;&lt;td&gt;Explicit verification steps/tests, rollback/risk notes&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Reviewability&lt;/td&gt;&lt;td&gt;PR-style output: readable sections, rationale, checklist, clear diffs&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;td&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;Quick Verdict (Circle One)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If A wins on Plan + Scope + Tests: choose Devstral 2 for high-risk multi-file work&lt;/p&gt;

&lt;p&gt;If outputs are similar and cost matters: choose Devstral Small 2 for frequent iteration&lt;/p&gt;
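&lt;p&gt;The scorecard and verdict rule can be tallied mechanically. A minimal sketch, with made-up placeholder scores and an assumed “outputs are similar” threshold of 2 total points:&lt;/p&gt;

```python
# Sketch of tallying the 1-5 scorecard and applying the verdict rule above.
# The scores are placeholders; fill in your own after the two Playground runs.
# The "similar outputs" threshold (2 total points) is my assumption, not the post's.

METRICS = ["plan_quality", "scope_control", "test_awareness", "reviewability"]

def verdict(scores_a: dict, scores_b: dict, cost_matters: bool) -> str:
    # "A wins on Plan + Scope + Tests" -> Devstral 2 for high-risk multi-file work
    if all(scores_a[m] > scores_b[m] for m in ["plan_quality", "scope_control", "test_awareness"]):
        return "Devstral 2 for high-risk multi-file work"
    total_a, total_b = (sum(s[m] for m in METRICS) for s in (scores_a, scores_b))
    # Similar totals and cost matters -> Small 2 for frequent iteration
    if cost_matters and not abs(total_a - total_b) > 2:
        return "Devstral Small 2 for frequent iteration"
    return "Devstral 2" if total_a > total_b else "Devstral Small 2"

a = {"plan_quality": 5, "scope_control": 4, "test_awareness": 4, "reviewability": 4}
b = {"plan_quality": 4, "scope_control": 3, "test_awareness": 3, "reviewability": 4}
print(verdict(a, b, cost_matters=True))
```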

&lt;p&gt;&lt;strong&gt;Notes (for your screenshots)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Figure A: Playground output screenshot (Devstral 2512, same params)&lt;/li&gt;
&lt;li&gt;Figure B: Playground output screenshot (Labs Devstral Small 2512, same params)&lt;/li&gt;
&lt;li&gt;What prompt did you use? → (paste prompt name / link / appendix section)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcakbfncc1m77dvxzq07a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcakbfncc1m77dvxzq07a.jpg" alt=" " width="800" height="288"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8hfh6bh0tkhak1kpjg9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8hfh6bh0tkhak1kpjg9.jpg" alt=" " width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;View the original article: &lt;a href="https://www.devstral2.org/blog/posts/devstral-compare-choose-playground-30min/" rel="noopener noreferrer"&gt;https://www.devstral2.org/blog/posts/devstral-compare-choose-playground-30min/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;6. Making the Right Choice: A Decision Tree (Task Complexity × Cost Sensitivity)&lt;/h2&gt;

&lt;p&gt;Ask two questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Q1: Is this a complex multi-file task with high regression risk?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;If yes, lean toward Devstral 2&lt;/li&gt;
&lt;li&gt;If no, go to Q2&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Q2: Are you cost-sensitive and iterating frequently?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;If yes, lean toward Devstral Small 2&lt;/li&gt;
&lt;li&gt;If no, pick based on your tolerance for failure vs your need for speed&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;7. Conclusion: Choose at a Glance&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Complex projects / multi-file linkage / high-risk modifications → prioritize Devstral 2 (Devstral 2512)&lt;/li&gt;
&lt;li&gt;Budget-sensitive / rapid iteration / tasks easily decomposed → prioritize Devstral Small 2 (Labs Devstral Small 2512)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;8. Appendix: Full Prompt (Copy-Paste)&lt;/h2&gt;

&lt;p&gt;Role: You are an “Engineering Lead + Architect”.&lt;br&gt;
Goal: Help me choose between Devstral 2 and Devstral Small 2 with practical guidance, without fabricating benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My Background&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I am a beginner, but I can use the console/Playground for testing&lt;/li&gt;
&lt;li&gt;I can use Postman (optional)&lt;/li&gt;
&lt;li&gt;I want: a comparison table + a selection conclusion + a risk warning + reproduction steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tasks&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Explain in 8–12 lines why “code agent / multi-file project tasks” place higher demands on the model (in layman’s terms).&lt;/li&gt;
&lt;li&gt;Provide a selection decision tree: when should I choose Devstral 2 vs Devstral Small 2?&lt;/li&gt;
&lt;li&gt;Output a comparison table including at least: suitable task type, inference/quality tendency, cost sensitivity, suitability for local use, dependence on context length, and risks/precautions.&lt;/li&gt;
&lt;li&gt;Provide a 30-minute field test plan (Playground only): how to run the same prompt twice and which metrics to compare (plan quality, scope control, test awareness, reviewability).&lt;/li&gt;
&lt;li&gt;Finally, output a disclaimer / statement of truthfulness distinguishing [facts], [test results], and [opinions].&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Strong Constraints&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not fabricate any numerical benchmarks or “I’ve seen a review” conclusions.&lt;/li&gt;
&lt;li&gt;If you cite facts such as positioning / context length / pricing, prompt me to verify them on the official page and list which fields to verify (do not hard-code numbers).&lt;/li&gt;
&lt;li&gt;Output should be screenshot-friendly: clear structure, bullet points, and tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;9. Disclaimer: Facts vs Tests vs Opinions (Paste Into Your Blog)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;[Facts]&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model positioning / feature emphasis / context length / pricing should be verified on official model card pages.&lt;/li&gt;
&lt;li&gt;I intentionally avoid claiming any numeric benchmark results that I did not personally reproduce.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;[Test Results]&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;My Playground run compared two models using the same prompt and the same parameters.&lt;/li&gt;
&lt;li&gt;For this particular prompt, the outputs were highly similar in structure and recommendations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;[Opinions]&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I believe the safest selection method is reproducible testing rather than “choosing by feel.”&lt;/li&gt;
&lt;li&gt;I expect discriminative gaps (if any) to show up more clearly on high-risk multi-file modification tasks with concrete repo constraints.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
