TL;DR
I improved Gemini 2.5 Flash Image (Nano Banana)'s image generation quality from 49/100 to 95/100. Built an MCP with intelligent prompt optimization that actually works.
Auto-enhances prompts with 7 best practices • Preserves multimodal context • No manual prompt engineering needed
Jump to: Results | How It Works | GitHub
Why Prompt Optimization Matters
Even powerful models like Gemini 2.5 Flash Image (Nano Banana) require extensive prompt engineering for quality output. Most folks write simple prompts like "make the person smile and run on the road" and wonder why the results look off.
How I Built an Intelligent Orchestration Layer
This implementation was inspired by an insightful reader comment on my previous article. Special thanks to @guypowell for the "orchestration layer" concept.
I built an intelligent orchestration layer as an MCP (Model Context Protocol) server that automatically transforms simple prompts into rich, detailed instructions.
Why Schema-Based Wasn't Enough
In my previous post, I'd been working on an MCP to enable Nano Banana functionality. A dev.to reader suggested using schemas for the orchestration layer, but I had reasons for choosing dynamic optimization instead:
| Approach | Pros | Cons | Measured Performance |
|---|---|---|---|
| Schema-based | Predictable, fast, testable | Rigid, high maintenance, poor edge cases | ~50ms (estimated) |
| LLM-based | Flexible, context-aware, easy to improve | Extra latency, added cost | ~2.4 seconds (measured) |
For creative tasks with unpredictable inputs, I decided flexibility > speed. And honestly, that 2.4 second wait is worth it when you see the results.
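To make the trade-off concrete, here's a rough sketch of what the schema-based path would have looked like (illustrative names, not the shipped code). A fixed template is fast, but every new style or scene type means another field to maintain:

```typescript
// Hypothetical schema-based optimizer: predictable, but rigid.
interface PromptSchema {
  subject: string;
  style: string;
  lighting: string;
  composition: string;
}

function schemaOptimize(s: PromptSchema): string {
  // Every nuance the template doesn't have a slot for is simply lost.
  return `${s.style} rendering of ${s.subject}, ${s.lighting} lighting, ${s.composition} composition`;
}

// "make the person smile and run on the road" doesn't decompose cleanly:
schemaOptimize({
  subject: "a person smiling and running on a road",
  style: "anime",
  lighting: "sunny",
  composition: "dynamic low-angle",
});
```

The LLM-based path (sketched later in the pipeline section) sidesteps the decomposition problem entirely by letting a fast text model decide which details matter.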
What My MCP Actually Does
The transformation is dramatic. Here's a real example:
Input:
A cheerful anime-style girl with short blonde hair running on a road, wearing a white dress with heart design, big smile on her face, dynamic running pose, outdoor street scene, sunny day
After MCP optimization:
A cheerful anime-style girl with a distinctive single strand of hair sticking up, short blonde bob, and always wears small blue socks, running energetically on a sun-drenched road. She's wearing a white, knee-length dress with a pink heart design featuring Japanese characters inside. A big, bright smile illuminates her face as she sprints forward in a dynamic running pose, arms bent and legs lifted high. The outdoor street scene unfolds with vibrant colors: asphalt reflecting the clear blue sky, lush green trees lining the sidewalk, and pastel-colored buildings creating a charming urban backdrop. The sunny day casts soft shadows and highlights the joyous energy of the scene. The perspective is slightly low, enhancing the feeling of movement and her forward momentum, like a frame from a lively anime sequence.
The difference is clear: the optimized prompt delivers spatial consistency, visual coherence, and logical scene composition.
How I Implemented Phil Schmid's 7 Best Practices
I embedded Phil Schmid's 7 principles directly into the system prompt. Instead of manually crafting prompts, the MCP now automatically applies transformations like "be hyper-specific" (turning "blonde hair girl" into "distinctive single strand of hair sticking up, short blonde bob with subtle highlights") and "use semantic negative prompts" (converting "no cars, no rain" into "empty sun-drenched road with clear blue sky").
Here's the core system prompt structure:
```typescript
// Core principles embedded in the system prompt (abridged; full version on GitHub)
const SYSTEM_PROMPT = `You are an expert at crafting prompts for image generation models...
- Focus on what should be present rather than what should be absent
- Physical characteristics: textures, materials, colors, scale
- Spatial relationships: foreground, midground, background
- Style: artistic direction, photographic techniques`
```
The magic happens when these principles work together — character consistency fixes prevent drift between generations, context and intent transform vague requests into clear artistic direction, and camera control adds professional photographic terminology that the model understands.
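As a rough illustration, this is how those principles can be toggled on top of the base system prompt. The flag names below are hypothetical; the actual FeatureFlags type lives in the repo:

```typescript
// Hypothetical flag names; the real FeatureFlags type is defined in the repo.
interface FeatureFlags {
  characterConsistency?: boolean;
  cameraControl?: boolean;
  semanticNegatives?: boolean;
}

function buildSystemInstruction(base: string, flags: FeatureFlags): string {
  const directives = [base];
  if (flags.characterConsistency) {
    directives.push("Restate the character's defining features verbatim to prevent drift between generations.");
  }
  if (flags.cameraControl) {
    directives.push("Specify lens, angle, and framing in professional photographic terms.");
  }
  if (flags.semanticNegatives) {
    directives.push("Describe what should be present instead of listing what to avoid.");
  }
  return directives.join("\n");
}
```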
What I Learned About Multimodal Processing
During development, I hit a wall. Prompt optimization worked great for new images but completely ignored the original image's style when editing. The model would take my carefully crafted anime-style image and turn it into realistic art.
The solution? Pass the original image to Gemini 2.0 Flash during prompt generation. This way, the prompt optimizer actually sees what it's working with:
```typescript
async generateStructuredPrompt(
  userPrompt: string,
  features: FeatureFlags = {},
  inputImageData?: string // Base64-encoded image data
): Promise<Result<StructuredPromptResult, Error>> {
  const config = {
    temperature: 0.7,  // sweet spot between creativity and consistency
    maxTokens: 500,    // stay well under the ~1000-token mark where quality drops
    systemInstruction, // the best-practices system prompt shown above
    ...(inputImageData && { inputImage: inputImageData }), // include the source image
  }
  // With the original image in context, Gemini understands the style to preserve
}
```
I also learned the hard way about token limits. Gemini 2.5 Flash Image starts struggling above 1000 tokens, so I keep prompts under 500 while maximizing descriptive detail. Temperature at 0.7 hits that sweet spot between creativity and consistency.
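For reference, a simple guard like this keeps the optimizer honest about that budget. It's a sketch, not the exact code in the MCP, and it leans on the rough heuristic of ~4 characters per token for English text:

```typescript
const MAX_PROMPT_TOKENS = 500;

// Crude but serviceable token estimate for English prose (~4 chars per token).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// If the optimizer overshoots the budget, fall back to the user's original
// prompt rather than feeding the image model something it will struggle with.
function clampPrompt(optimized: string, original: string): string {
  return estimateTokens(optimized) <= MAX_PROMPT_TOKENS ? optimized : original;
}
```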
The Results I Got
The improvements were dramatic, but getting there wasn't smooth. My first attempts had the character running across the road instead of along it (traffic accident waiting to happen!), ignored the anime style completely, and failed when prompts got too long.
Quality Metrics
| Metric | Before | After | Impact |
|---|---|---|---|
| Prompt Adherence | 18/40 | 38/40 | ✅ |
| Spatial Logic | 2/20 | 20/20 | 🎯 |
| Character Consistency | 19/20 | 19/20 | ✅ |
| Technical Quality | 9/10 | 9/10 | ✅ |
| Scene Consistency | 1/10 | 10/10 | 🚀 |
| Total Score | 49/100 | 95/100 | +94% |
Scoring by Claude Code (Anthropic's coding assistant)
Visual Comparison
Original image created with Canva AI
Previous MCP: Character crossing the road, inconsistent style (49 points)
New MCP: Proper spatial logic, consistent anime style (95 points)
The key lesson? Surface-level verification isn't enough. You need to validate spatial relationships and scene logic, not just whether the image "looks good".
Performance and Implementation Details
| Processing Step | Measured Time | Notes |
|---|---|---|
| Prompt Optimization | ~2.4 seconds | Using Gemini 2.0 Flash |
| Prompt Length | 187 → 821 chars | ~4.4x more detail |
| Image Generation | 5-10 seconds | gemini-2.5-flash-image-preview |
I use Gemini 2.0 Flash for prompt optimization (fast & stable) and Gemini 2.5 Flash Image for generation (high quality). This two-model approach keeps things snappy while maintaining quality.
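Here's a minimal sketch of that two-model pipeline using the @google/genai SDK. The model names match what I describe above, but the response handling is simplified; the actual server adds feature flags, error handling, and file output:

```typescript
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const SYSTEM_PROMPT = "You are an expert at crafting prompts..."; // abridged, see above

async function generateOptimizedImage(userPrompt: string): Promise<string | undefined> {
  // Step 1: the fast text model rewrites the prompt (~2.4 s measured)
  const optimized = await ai.models.generateContent({
    model: "gemini-2.0-flash",
    contents: userPrompt,
    config: { systemInstruction: SYSTEM_PROMPT, temperature: 0.7, maxOutputTokens: 500 },
  });

  // Step 2: the image model renders the enriched prompt (5-10 s)
  const image = await ai.models.generateContent({
    model: "gemini-2.5-flash-image-preview",
    contents: optimized.text ?? userPrompt,
  });

  // The generated image comes back as inline base64 data in the response parts
  const part = image.candidates?.[0]?.content?.parts?.find((p) => p.inlineData);
  return part?.inlineData?.data;
}
```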
What's Next
The core orchestration layer is shipped and working. I'm considering adding perceptual hash validation and deterministic testing, but proceeding carefully to maintain the balance between flexibility and reliability.
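If I do go the perceptual-hash route, the idea is roughly this: downscale the generated image, hash it, and compare it against the source to catch style or composition drift. Here's a sketch using the sharp library (an assumption on my part; nothing in the current MCP depends on it yet):

```typescript
import sharp from "sharp";

// Average hash (aHash): 8x8 grayscale thumbnail, one bit per pixel vs. the mean.
async function averageHash(imagePath: string): Promise<bigint> {
  const { data } = await sharp(imagePath)
    .resize(8, 8, { fit: "fill" })
    .grayscale()
    .removeAlpha()
    .raw()
    .toBuffer({ resolveWithObject: true });

  const mean = data.reduce((sum, px) => sum + px, 0) / data.length;
  let hash = 0n;
  for (const px of data) {
    hash = (hash << 1n) | (px > mean ? 1n : 0n);
  }
  return hash;
}

// Hamming distance between two hashes; a small distance means visually similar.
function hammingDistance(a: bigint, b: bigint): number {
  let diff = a ^ b;
  let bits = 0;
  while (diff) {
    bits += Number(diff & 1n);
    diff >>= 1n;
  }
  return bits;
}
```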
Try It Yourself
```bash
# For Claude Code users
claude mcp add mcp-image \
  --env GEMINI_API_KEY=your-api-key \
  --env IMAGE_OUTPUT_DIR=/absolute/path/to/images \
  -- npx -y mcp-image

# Then just ask: "Generate a sunset mountain landscape"
```
Note: `IMAGE_OUTPUT_DIR` must be an absolute path (e.g., `/Users/username/images`, not `./images`).
Get your API key at Google AI Studio and start creating!
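For MCP clients that read a JSON config instead of the claude CLI (Cursor, Claude Desktop, and friends), the equivalent entry generally looks like this; check your client's docs for the exact file location:

```json
{
  "mcpServers": {
    "mcp-image": {
      "command": "npx",
      "args": ["-y", "mcp-image"],
      "env": {
        "GEMINI_API_KEY": "your-api-key",
        "IMAGE_OUTPUT_DIR": "/absolute/path/to/images"
      }
    }
  }
}
```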
Final Thoughts
While Nano Banana already produces decent results out of the box, I found that tuning prompts can dramatically improve generation accuracy. If you're thinking "I want great images but can't be bothered with constant prompt tweaking," definitely try the MCP.
The implementation is completely open source:
shinpr/mcp-image: A powerful MCP (Model Context Protocol) server that enables AI assistants to generate and edit images using Google's Gemini 2.5 Flash Image API. Seamlessly integrates advanced image generation into Claude Code, Cursor, and other MCP-compatible AI tools.
✨ Features
- AI-Powered Image Generation: Create images from text prompts using Gemini 2.5 Flash Image Preview
- Intelligent Prompt Enhancement: Automatically optimizes your prompts using Gemini 2.0 Flash for superior image quality
  - Adds photographic and artistic details
  - Enriches lighting, composition, and atmosphere descriptions
  - Preserves your intent while maximizing generation quality
- Image Editing: Transform existing images with natural language instructions
  - Context-aware editing that preserves original style
  - Maintains visual consistency with source image
- Advanced Options
  - Multi-image blending for composite scenes
  - Character consistency across generations
  - World knowledge integration for accurate context
- Multiple Output Formats: PNG, JPEG, WebP support
- File Output: Images are saved as files for easy…
By the way, I'm curious — how do you approach prompt optimization in your projects? Would you lean schema-based for predictability, or try something dynamic like this? Let me know in the comments!