Shinsuke KAGAWA

From 49 to 95: How Prompt Engineering Boosted Gemini MCP Image Generation

TL;DR

I improved Gemini 2.5 Flash Image (Nano Banana)'s image generation quality from 49/100 to 95/100. Built an MCP with intelligent prompt optimization that actually works.

(Comparison image: before vs. after prompt optimization)

Auto-enhances prompts with 7 best practices • Preserves multimodal context • No manual prompt engineering needed

Jump to: Results | How It Works | GitHub


Why Prompt Optimization Matters

Even powerful models like Gemini 2.5 Flash Image (Nano Banana) require extensive prompt engineering for quality output. Most folks write simple prompts like "make the person smile and run on the road" and wonder why the results look off.

How I Built an Intelligent Orchestration Layer

This implementation was inspired by an insightful reader comment on my previous article. Special thanks to @guypowell for the "orchestration layer" concept.

I built an intelligent orchestration layer as an MCP (Model Context Protocol) server that automatically transforms simple prompts into rich, detailed instructions.

Why Schema-Based Wasn't Enough

In my previous post, I'd been building an MCP to expose Nano Banana functionality. A dev.to reader suggested using schemas for the orchestration layer, but I had reasons for choosing dynamic optimization instead:

| Approach | Pros | Cons | Measured Performance |
| --- | --- | --- | --- |
| Schema-based | Predictable, fast, testable | Rigid, high maintenance, poor edge cases | ~50 ms (estimated) |
| LLM-based | Flexible, context-aware, easy to improve | Extra latency, added cost | ~2.4 seconds (measured) |

For creative tasks with unpredictable inputs, I decided flexibility > speed. And honestly, that 2.4-second wait is worth it when you see the results.
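To make that trade-off concrete, here's a purely illustrative TypeScript sketch of the two approaches. Neither is the actual mcp-image code, and the names (`PromptSchema`, `fillTemplate`, `optimizeWithLLM`) are hypothetical:

```typescript
// Schema-based: fill a fixed template. Fast and predictable, but every new
// kind of request means another field or another template to maintain.
interface PromptSchema {
  subject: string
  style: string
  setting: string
  lighting: string
}

function fillTemplate(s: PromptSchema): string {
  return `${s.style} illustration of ${s.subject}, set in ${s.setting}, ${s.lighting} lighting`
}

// LLM-based: hand the raw request plus a system prompt of best practices to a
// smaller model and let it rewrite the prompt. Slower (~2.4 s here), but it
// copes with inputs no schema anticipated.
async function optimizeWithLLM(
  userPrompt: string,
  generate: (system: string, user: string) => Promise<string> // e.g. a Gemini 2.0 Flash call
): Promise<string> {
  const system = "You are an expert at crafting prompts for image generation models..."
  return generate(system, userPrompt)
}
```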

What My MCP Actually Does

The transformation is dramatic. Here's a real example:

Input:

```
A cheerful anime-style girl with short blonde hair running on a road, wearing a white dress with heart design, big smile on her face, dynamic running pose, outdoor street scene, sunny day
```

After MCP optimization:

```
A cheerful anime-style girl with a distinctive single strand of hair sticking up, short blonde bob, and always wears small blue socks, running energetically on a sun-drenched road. She's wearing a white, knee-length dress with a pink heart design featuring Japanese characters inside. A big, bright smile illuminates her face as she sprints forward in a dynamic running pose, arms bent and legs lifted high. The outdoor street scene unfolds with vibrant colors: asphalt reflecting the clear blue sky, lush green trees lining the sidewalk, and pastel-colored buildings creating a charming urban backdrop. The sunny day casts soft shadows and highlights the joyous energy of the scene. The perspective is slightly low, enhancing the feeling of movement and her forward momentum, like a frame from a lively anime sequence.
```

The difference is clear: the optimized prompt delivers spatial consistency, visual coherence, and logical scene composition.

How I Implemented Phil Schmid's 7 Best Practices

I embedded Phil Schmid's 7 principles directly into the system prompt. Instead of manually crafting prompts, the MCP now automatically applies transformations like "be hyper-specific" (turning "blonde hair girl" into "distinctive single strand of hair sticking up, short blonde bob with subtle highlights") and "use semantic negative prompts" (converting "no cars, no rain" into "empty sun-drenched road with clear blue sky").

Here's the core system prompt structure:

```typescript
// Core principles embedded here - full version on GitHub
const SYSTEM_PROMPT = `You are an expert at crafting prompts for image generation models...
- Focus on what should be present rather than what should be absent
- Physical characteristics: textures, materials, colors, scale
- Spatial relationships: foreground, midground, background
- Style: artistic direction, photographic techniques`
```

The magic happens when these principles work together — character consistency fixes prevent drift between generations, context and intent transform vague requests into clear artistic direction, and camera control adds professional photographic terminology that the model understands.

What I Learned About Multimodal Processing

During development, I hit a wall. Prompt optimization worked great for new images but completely ignored the original image's style when editing. The model would take my carefully crafted anime-style image and turn it into realistic art.

The solution? Pass the original image to Gemini 2.0 Flash during prompt generation. This way, the prompt optimizer actually sees what it's working with:

```typescript
async generateStructuredPrompt(
  userPrompt: string,
  features: FeatureFlags = {},
  inputImageData?: string // Base64-encoded image data
): Promise<Result<StructuredPromptResult, Error>> {
  const config = {
    temperature: 0.7,
    maxTokens: 500,
    systemInstruction, // the best-practice system prompt shown above
    ...(inputImageData && { inputImage: inputImageData }), // Include image data only when editing
  }
  // Now Gemini understands the original style
  // ...call Gemini 2.0 Flash with `config` and parse the structured result
}
```
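For reference, here's a minimal sketch of what that call can look like with the @google/genai SDK, reusing the SYSTEM_PROMPT from above and assuming base64-encoded PNG input. The actual repo may structure this differently:

```typescript
import { GoogleGenAI } from "@google/genai"

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY })

async function optimizePrompt(userPrompt: string, inputImageData?: string): Promise<string> {
  // Send the text plus the original image (if any) so the optimizer
  // can see the style it has to preserve.
  const contents = inputImageData
    ? [
        { text: userPrompt },
        { inlineData: { mimeType: "image/png", data: inputImageData } },
      ]
    : userPrompt

  const response = await ai.models.generateContent({
    model: "gemini-2.0-flash",
    contents,
    config: {
      systemInstruction: SYSTEM_PROMPT,
      temperature: 0.7,     // sweet spot between creativity and consistency
      maxOutputTokens: 500, // keep the optimized prompt well under the ~1000-token limit
    },
  })
  return response.text ?? userPrompt // fall back to the original prompt if nothing comes back
}
```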

I also learned the hard way about token limits. Gemini 2.5 Flash Image starts struggling above 1000 tokens, so I keep prompts under 500 while maximizing descriptive detail. Temperature at 0.7 hits that sweet spot between creativity and consistency.
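One way to enforce that budget (a sketch, not necessarily how mcp-image does it) is to count tokens before generation and fall back to the original prompt if the optimizer overshoots:

```typescript
// Reuses the `ai` client from the sketch above. Token counts are taken with
// Gemini 2.0 Flash as an approximation of the image model's tokenizer.
async function withinBudget(prompt: string, limit = 1000): Promise<boolean> {
  const { totalTokens } = await ai.models.countTokens({
    model: "gemini-2.0-flash",
    contents: prompt,
  })
  return (totalTokens ?? 0) <= limit
}

async function choosePrompt(original: string, optimized: string): Promise<string> {
  // Prefer the optimized prompt, but never push the image model past its comfort zone.
  return (await withinBudget(optimized)) ? optimized : original
}
```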

The Results I Got

The improvements were dramatic, but getting there wasn't smooth. My first attempts had the character running across the road instead of along it (traffic accident waiting to happen!), ignored the anime style completely, and failed when prompts got too long.

Quality Metrics

| Metric | Before | After | Impact |
| --- | --- | --- | --- |
| Prompt Adherence | 18/40 | 38/40 | |
| Spatial Logic | 2/20 | 20/20 | 🎯 |
| Character Consistency | 19/20 | 19/20 | |
| Technical Quality | 9/10 | 9/10 | |
| Scene Consistency | 1/10 | 10/10 | 🚀 |
| Total Score | 49/100 | 95/100 | +94% |

Scoring by Claude Code (Anthropic's coding assistant)

Visual Comparison

Original image (created with Canva AI)

Previous MCP: Character crossing the road, inconsistent style (49 points)

New MCP: Proper spatial logic, consistent anime style (95 points)

The key lesson? Surface-level verification isn't enough. You need to validate spatial relationships and scene logic, not just whether the image "looks good".

Performance and Implementation Details

| Processing Step | Measurement | Notes |
| --- | --- | --- |
| Prompt Optimization | ~2.4 seconds | Using Gemini 2.0 Flash |
| Prompt Length | 187 → 821 chars | ~4.4x more detail |
| Image Generation | 5-10 seconds | gemini-2.5-flash-image-preview |

I use Gemini 2.0 Flash for prompt optimization (fast & stable) and Gemini 2.5 Flash Image for generation (high quality). This two-model approach keeps things snappy while maintaining quality.
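Conceptually, the pipeline is just two calls. Here's a simplified sketch reusing the `ai` client and `optimizePrompt` from the earlier snippets; the repo's actual response handling may differ:

```typescript
import { writeFile } from "node:fs/promises"

async function generateImage(userPrompt: string, outPath: string): Promise<string> {
  // Step 1: Gemini 2.0 Flash rewrites the prompt (~2.4 s).
  const optimized = await optimizePrompt(userPrompt)

  // Step 2: Gemini 2.5 Flash Image renders it (~5-10 s).
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash-image-preview",
    contents: optimized,
  })

  // The generated image comes back as an inline-data part; write it to disk.
  for (const part of response.candidates?.[0]?.content?.parts ?? []) {
    if (part.inlineData?.data) {
      await writeFile(outPath, Buffer.from(part.inlineData.data, "base64"))
      return outPath
    }
  }
  throw new Error("No image returned")
}
```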

What's Next

The core orchestration layer is shipped and working. I'm considering adding perceptual hash validation and deterministic testing, but proceeding carefully to maintain the balance between flexibility and reliability.

Try It Yourself

```bash
# For Claude Code users
claude mcp add mcp-image \
  --env GEMINI_API_KEY=your-api-key \
  --env IMAGE_OUTPUT_DIR=/absolute/path/to/images \
  -- npx -y mcp-image

# Then just ask: "Generate a sunset mountain landscape"
```

Note: IMAGE_OUTPUT_DIR must be an absolute path (e.g., /Users/username/images, not ./images).
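If you're on Cursor or another MCP-compatible client rather than Claude Code, the equivalent entry in a standard `mcpServers` config should look roughly like this (check the repo README for the exact setup):

```json
{
  "mcpServers": {
    "mcp-image": {
      "command": "npx",
      "args": ["-y", "mcp-image"],
      "env": {
        "GEMINI_API_KEY": "your-api-key",
        "IMAGE_OUTPUT_DIR": "/absolute/path/to/images"
      }
    }
  }
}
```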

Get your API key at Google AI Studio and start creating!

Final Thoughts

While Nano Banana already produces decent results out of the box, I found that tuning prompts can dramatically improve generation accuracy. If you're thinking "I want great images but can't be bothered with constant prompt tweaking," definitely try the MCP.

The implementation is completely open source:

GitHub: shinpr / mcp-image

MCP server for AI image generation using Google's Gemini API. Enables Claude Code, Cursor, and other MCP-compatible AI tools to generate and edit images seamlessly.

MCP Image Generator

A powerful MCP (Model Context Protocol) server that enables AI assistants to generate and edit images using Google's Gemini 2.5 Flash Image API. Seamlessly integrate advanced image generation capabilities into Claude Code, Cursor, and other MCP-compatible AI tools.

✨ Features

  • AI-Powered Image Generation: Create images from text prompts using Gemini 2.5 Flash Image Preview
  • Intelligent Prompt Enhancement: Automatically optimizes your prompts using Gemini 2.0 Flash for superior image quality
    • Adds photographic and artistic details
    • Enriches lighting, composition, and atmosphere descriptions
    • Preserves your intent while maximizing generation quality
  • Image Editing: Transform existing images with natural language instructions
    • Context-aware editing that preserves original style
    • Maintains visual consistency with source image
  • Advanced Options
    • Multi-image blending for composite scenes
    • Character consistency across generations
    • World knowledge integration for accurate context
  • Multiple Output Formats: PNG, JPEG, WebP support
  • File Output: Images are saved as files for easy…




By the way, I'm curious — how do you approach prompt optimization in your projects? Would you lean schema-based for predictability, or try something dynamic like this? Let me know in the comments!
