Dickson Kanyingi

Posted on May 23

Multimodal Gemma 4 Visual Regression & Patch Agent

#devchallenge #gemmachallenge #gemma #ai

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Multimodal Gemma 4 Visual Regression & Patch Agent

The Multimodal Gemma 4 Visual Regression & Patch Agent (Contextual Code Review Visual Patch Agent) is a multimodal code analysis and visual repair tool powered by Google's natively multimodal Gemma 4 models. It connects UI bugs to the source code behind them by cross-referencing screenshots with stylesheets, DOM selectors, and components to diagnose root causes, generate patches, and validate them through a closed-loop pipeline.

Core Features

Multimodal Visual & Logical Analysis: Ingests code files (CSS, JS, JSX, TS, TSX, HTML, Python, etc.) alongside UI screenshots of visual regressions or layouts to trace layout bugs directly back to specific CSS selectors or JS component rendering logic.
Closed-Loop Safety Validation Pipeline: To ensure generated code is production-safe:
- PatchApplicabilityChecker: Runs a dry-run git apply --check in an ephemeral in-memory repository to guarantee conflict-free application.
- ASTValidator: Uses ast.parse for Python files and a custom token-matching parenthesis/bracket balance scanner for JS/TS/JSX to ensure zero syntax errors.
- FileGroundingValidator: Verifies that diff headers correspond strictly to uploaded file scopes, eliminating AI hallucinations.
- PatchValidator: Screens changes against dangerous operations (rm -rf, eval/exec, malicious package imports).
Interactive Visual Verification Loop:
- Scrub Split Slider: Compare buggy screenshots with expected fixes side-by-side using an interactive slider.
- Pixel-Diff Heatmap Overlay: Computes visual color channel changes in-browser using HTML5 Canvas getImageData to overlay changed regions and compute a visual alignment score.
- "Simulate Fix" Canvas: Shift layout slices and preview the corrected layout on the client side instantly.
Automated Benchmark Framework: Built-in test harness with 10 pre-configured CSS, JavaScript, and Python bug cases that evaluates root-cause accuracy, git apply rates, and AST validity.
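
The ASTValidator step listed above can be illustrated with a short sketch. This is not the project's code: the function names `check_python_syntax` and `check_bracket_balance` are hypothetical, and the bracket scanner is deliberately simplified (it skips quoted strings but does not understand comments, regex literals, or template-literal interpolation).

```python
import ast

PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def check_python_syntax(source: str) -> bool:
    """Return True if the source parses as Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def check_bracket_balance(source: str) -> bool:
    """Return True if (), [] and {} are balanced outside quoted strings."""
    stack = []
    quote = None  # the active string delimiter while inside a string
    i = 0
    while i < len(source):
        ch = source[i]
        if quote:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in "'\"`":
            quote = ch
        elif ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        i += 1
    return not stack and quote is None
```

A balance scan like this catches truncated or mismatched blocks in JS/TS/JSX patches cheaply, but it is weaker than a real parser: code can be balanced and still invalid.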

📊 Evaluation & Benchmark Results

We validated the agent against a suite of 10 frontend and backend bugs (overflow limits, z-index overlays, flex layouts, None checks, circular imports, DOM selector mismatches). The agent passed every check in all 10 cases:

Overall Agent Success Rate: 100.0% (10/10 cases resolved)
UI Bug Localization Accuracy: 100.0% (correct CSS/JS selector mapping)
Git Apply Rate: 100.0% (every patch applied cleanly, with no hunk conflicts)
AST / Syntax Validity: 100.0% (every patch was syntactically valid)
Average Analysis Latency: 0.90s
Average Patch Line Accuracy: 100.0% (patch lines identical to the human-written reference fixes)

Benchmark Table

Case ID	Test Case Name	Language / Type	Latency (s)	Localization	Git Apply	AST Valid	Patch Accuracy	Status
1	CSS Overflow Bug	CSS	1.25s	PASSED	PASSED	PASSED	100.0%	✅ SUCCESS
2	Z-Index Stacking Context	CSS	1.03s	PASSED	PASSED	PASSED	100.0%	✅ SUCCESS
3	Flexbox Alignment Mismatch	CSS	0.60s	PASSED	PASSED	PASSED	100.0%	✅ SUCCESS
4	Python AttributeError (None check)	Python	0.67s	PASSED	PASSED	PASSED	100.0%	✅ SUCCESS
5	JS Click Event Selector Mismatch	JS	0.96s	PASSED	PASSED	PASSED	100.0%	✅ SUCCESS
6	CSS Low Contrast Bug	CSS	0.82s	PASSED	PASSED	PASSED	100.0%	✅ SUCCESS
7	CSS Sidebar Mobile Breakpoint	CSS	0.54s	PASSED	PASSED	PASSED	100.0%	✅ SUCCESS
8	Python Circular Dependency Import	Python	0.61s	PASSED	PASSED	PASSED	100.0%	✅ SUCCESS
9	Python SQL Injection / Validation	Python	1.42s	PASSED	PASSED	PASSED	100.0%	✅ SUCCESS
10	JS DOM Element querySelector Mismatch	JS	1.14s	PASSED	PASSED	PASSED	100.0%	✅ SUCCESS
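
The 0.90s average latency reported above follows directly from the Latency column of the table. A quick check, with the values copied from the ten rows:

```python
# Per-case latencies in seconds, copied from the benchmark table (cases 1-10).
latencies = [1.25, 1.03, 0.60, 0.67, 0.96, 0.82, 0.54, 0.61, 1.42, 1.14]

average = sum(latencies) / len(latencies)
print(f"Average latency: {average:.2f}s")  # prints "Average latency: 0.90s"
```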

Demo

Live URL: https://multimodal-visual-regression-patch-agent.vercel.app

Video Demo: https://youtu.be/gvarF7T1C5E

See the Gemma 4 Visual Regression & Patch Agent in action, illustrating drag-and-drop file ingestion, screenshot visual overlays, patch generation, and real-time validation badges:

Screenshots

Visual display of the interactive Regression Loop application interface

Interactive Split slider

Visual verification loop Side-by-Side view

Pixel-diff heatmap visualization

Interactive visual match simulation with related code snippets

Try It Yourself (Local Reproduction / Setup)

You can run the entire agentic system and its benchmark suite locally in seconds using Mock Mode (no API keys required)!

# Clone the repository
git clone https://github.com/kanyingidickson-dev/Multimodal-Visual-Regression-Patch-Agent.git
cd Multimodal-Visual-Regression-Patch-Agent

# Set up virtual environment
python3 -m venv venv
source venv/bin/activate
pip install -r backend/requirements.txt

# Compile Frontend Assets
cd frontend
npm install
npm run build
cd ..

# Run Benchmark Suite
python3 backend/benchmark.py

# Launch FastAPI web server
python3 backend/app.py

Open http://127.0.0.1:5000 to interact with the premium dark glassmorphic review dashboard!

For a quick demo, click Load Example in the Model settings panel.

Testing Without an API Key:

# Set MOCK_MODE=true in .env to use mock responses
echo "MOCK_MODE=true" >> .env
python3 backend/app.py

Code

Repository:
https://github.com/kanyingidickson-dev/Multimodal-Visual-Regression-Patch-Agent

Directory Layout:

.
├── backend/
│   ├── app.py                 # FastAPI server & route handlers
│   ├── benchmark.py           # Automated benchmark suite runner
│   ├── code_reviewer.py       # Multi-stage review orchestration
│   ├── file_parser.py         # File ingestion & truncation utilities
│   ├── gemma_client.py        # API client for OpenRouter & Hugging Face
│   ├── patch_utils.py         # Security scanners, AST, & git validators
│   ├── requirements.txt       # Backend dependencies
│   └── demo.py                # Command-line testing entry
├── frontend/                  # React dashboard codebase
│   ├── src/                   # Source directory
│   │   ├── App.jsx            # Core dashboard and Visual Verification UI
│   │   ├── App.css            # Stylesheets
│   │   ├── index.css          # Color design tokens and layout classes
│   │   └── api.js             # API client connection methods
│   ├── dist/                  # Built production frontend bundles
│   ├── package.json           # npm configuration
│   └── vite.config.js         # Vite settings
├── examples/                  # Demo assets
│   ├── benchmark-cases/       # Built-in 10 benchmark test directories
│   ├── broken-app/            # Example buggy application
│   ├── sample-output.json     # Standard review structure file
│   └── sample-screenshot.png  # Base testing image
├── prompts/                   # Custom agent instructions
│   ├── system_prompt.md       # Architectural guidance rules
│   └── user_prompt.md         # Multimodal instruction format
├── Dockerfile                 # Production Docker image blueprint
├── docker-compose.yml         # Container coordinator
├── README.md                  # Project documentation
└── LICENSE                    # MIT License

Key Files

backend/app.py — FastAPI web server supporting dynamic parameters and multipart file/screenshot ingestion.
backend/benchmark.py — Automated test case generator and benchmark runner.
backend/code_reviewer.py — Core orchestrator wrapping OpenRouter/HuggingFace API calls in multimodal content blocks.
backend/gemma_client.py — Client supporting dense model choices and contextual, high-fidelity mock review generations.
backend/patch_utils.py — Closed-loop safety validators (Git apply check, AST parsers, and file grounding).
frontend/src/App.jsx — React interface with interactive before/after split scrub sliders, pixel difference canvases, and patch validation panels.
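
The file grounding check is described only at a high level here. A minimal sketch of the idea, assuming the patch uses git-style `diff --git a/... b/...` headers and that paths contain no spaces (the function name `ungrounded_files` is hypothetical, not taken from the repository):

```python
import re

# Matches the header line that opens each file section of a git-style diff.
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def ungrounded_files(patch: str, uploaded: set[str]) -> set[str]:
    """Return paths named in diff headers that are not among the uploaded files."""
    touched = set()
    for old_path, new_path in DIFF_HEADER.findall(patch):
        touched.update((old_path, new_path))
    return touched - uploaded


patch = (
    "diff --git a/styles.css b/styles.css\n"
    "--- a/styles.css\n"
    "+++ b/styles.css\n"
    "@@ -1,3 +1,3 @@\n"
    " .card {\n"
    "-  overflow: visible;\n"
    "+  overflow: hidden;\n"
    " }\n"
)
print(ungrounded_files(patch, {"styles.css"}))  # set()
print(ungrounded_files(patch, {"app.js"}))      # {'styles.css'}
```

An empty result means every file the patch touches was actually provided, so the model did not invent a target file.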

How I Used Gemma 4

1. Model Choice: Gemma 4 31B Dense (Instruct)

I chose Gemma 4 31B Dense for this project because:

Native Multimodality: The model processes images directly, so it can map regions of a screenshot to the stylesheet rules that produce them.
256K Context Window: Essential for ingesting multiple visual assets alongside dense code modules.
Accurate Code Generation: Produces unified diffs precise enough to pass the git apply check and syntax validation in every benchmark case.

2. Technical Implementation

Multimodal Prompt Construction:

For OpenRouter and Hugging Face, images are sent as base64 data URLs. The prompt places the images before the text, on the assumption that seeing the screenshot first improves the model's spatial grounding before it reads the source code:

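if images:
    user_content = []
    # Image parts go first; each img_data is a base64 data URL
    for img_data in images:
        user_content.append({
            "type": "image_url",
            "image_url": {"url": img_data}
        })
    # The instructions and file contents follow as a single text part
    user_content.append({
        "type": "text",
        "text": user_prompt
    })
else:
    # Text-only fallback (assumed): send the prompt as a plain string
    user_content = user_prompt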
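
The `img_data` values in the snippet above are base64 data URLs. One way to build them from raw image bytes, as a sketch (the helper name `to_data_url` is hypothetical and not taken from the repository):

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The 8-byte PNG file signature stands in for a real screenshot here.
png_signature = b"\x89PNG\r\n\x1a\n"
print(to_data_url(png_signature))  # data:image/png;base64,iVBORw0KGgo=
```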

JSON Output Constraints:
To enable programmatic extraction of findings and patches, the system instructs Gemma 4 to respond in structured JSON. The output is parsed automatically, feeding the diff highlights and safety validators:

{
    "summary": "...",
    "root_cause": "...",
    "fix_plan": ["...", "..."],
    "patch": "diff --git a/filename b/filename...",
    "assumptions": ["...", "..."],
    "confidence": "high | medium | low"
}
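
Parsing a reply in this shape could look like the sketch below. It assumes the six keys shown above are all required and that the model may wrap its JSON in a Markdown code fence; the function name `parse_review` is hypothetical.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker
REQUIRED_KEYS = {"summary", "root_cause", "fix_plan", "patch", "assumptions", "confidence"}


def parse_review(raw: str) -> dict:
    """Parse a model reply into a review dict, raising ValueError on bad input."""
    text = raw.strip()
    # Strip a surrounding code fence (and its language tag) if present.
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        review = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(review, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - review.keys()
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    if review["confidence"] not in {"high", "medium", "low"}:
        raise ValueError("confidence must be high, medium, or low")
    return review
```

Rejecting malformed replies at this stage keeps the downstream validators from running on a missing or partial patch.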

Safety Layer

To protect developers, all generated patches are validated before rendering. The safety layer:

Blocks patches that match destructive shell patterns (e.g. rm -rf, /dev/null).
Warns when a patch imports risky libraries (e.g. pickle, or subprocess called with unsafe parameters).
Checks for syntax errors by compiling or parsing the patched code.
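
The pattern screening above can be sketched as a scan over the lines a patch adds. The rule names and regular expressions below are illustrative assumptions, not the project's actual deny-list, and `scan_patch` is a hypothetical name.

```python
import re

# Illustrative rules only; a real deny-list would be broader.
BLOCK_PATTERNS = {
    "recursive delete": re.compile(r"\brm\s+-(rf|fr)\b"),
    "dynamic code execution": re.compile(r"\b(eval|exec)\s*\("),
}
WARN_PATTERNS = {
    "pickle import": re.compile(r"^\s*(import|from)\s+pickle\b"),
    "subprocess with shell=True": re.compile(r"subprocess\.\w+\(.*shell\s*=\s*True"),
}


def scan_patch(patch: str) -> dict[str, list[str]]:
    """Scan only the lines a unified diff adds and report matched rule names."""
    report = {"blocked": [], "warnings": []}
    for line in patch.splitlines():
        # Added lines start with '+'; '+++' is the new-file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        for name, pattern in BLOCK_PATTERNS.items():
            if pattern.search(added) and name not in report["blocked"]:
                report["blocked"].append(name)
        for name, pattern in WARN_PATTERNS.items():
            if pattern.search(added) and name not in report["warnings"]:
                report["warnings"].append(name)
    return report
```

Restricting the scan to added lines avoids flagging a patch that removes a dangerous call, which is usually the fix rather than the problem.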

🚀 Future Vision & Roadmap

Headless visual regression (CI/CD): Incorporate Playwright automation tasks to apply patches in temporary containers, launch the application, capture screenshots, and complete the visual loop automatically in the cloud.
Bi-directional IDE Sync: Allow developers to highlight visual elements in a browser extension and instantly jump to the corresponding code line inside VS Code or Cursor.
Support for Figma Files: Integrate Figma design files directly to compare pixel-perfect implementations automatically.

Built for the Gemma 4 Challenge to demonstrate how open, multimodal models can empower developers with intelligent, visual-aware coding tools.

#ai #gemma4 #multimodal #visual-regression #patch-generation #code-review #frontend #backend #react #fastapi #gemma-4 #openrouter #huggingface #git #diff #patch #safety #validation #benchmark #test-suite #mock-mode #docker #docker-compose #vite #npm #python #asyncio #json #base64 #vision #multimodal-prompt #structured-output #code-generation #visual-aware-coding #developer-tools #ai-agents #coding-assistant #visual-regression-patch-agent
