I've been working on a personal project that handles multiple images, and I wanted to streamline the image generation workflow. The repetitive cycle of generating images in a separate tool, downloading them, and integrating them into code was becoming tedious.
That's when Google's newly announced Gemini 2.5 Flash Image, released on August 26, 2025, caught my eye. This is the model internally known as "nano-banana". After experimenting with it, I found that while prompt engineering is still necessary, it generates images that fit my use cases quite well with reasonable accuracy.
https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/
I realized integrating this into my development environment (Claude Code) would significantly improve my workflow, so I decided to build an MCP server. In this post, I'll share my experience implementing an MCP server using Gemini 2.5 Flash Image.
Why nano-banana Caught My Interest
In my personal project, I regularly need to create sample images showing the same character in different situations or combining multiple image elements. Traditional image generation AI would subtly change character appearances between prompts or fail to edit images as intended. While these are just samples and imperfection is acceptable, having consistent quality from the sample stage definitely improves the overall product quality.
What particularly intrigued me about nano-banana was its promise of maintaining character consistency across different scenes, along with its ability to leverage Gemini's real-world knowledge for more contextually appropriate images. These features seemed perfect for my needs.
Why I Chose to Implement It as an MCP Server
For those unfamiliar, MCP stands for Model Context Protocol: a standardized protocol for host applications (Claude Code in this case) to communicate with servers that provide data sources or tools. Since I had already built an MCP server before and knew the process, creating one for Claude Code was the obvious choice.
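To make that concrete, an MCP server is essentially a small process that registers tools and talks to the host over stdio. Here's a minimal sketch using the official @modelcontextprotocol/sdk; this is illustrative only, not the actual mcp-image code, and the tool name and handler are made up for the example:
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js'
import { z } from 'zod'

// Register a single tool that the host (e.g. Claude Code) can call.
const server = new McpServer({ name: 'image-demo', version: '0.1.0' })

server.tool(
  'generate_image',
  { prompt: z.string().describe('Text prompt for the image') },
  async ({ prompt }) => {
    // A real server would call the image model here and save the result to disk.
    return { content: [{ type: 'text', text: `Would generate an image for: ${prompt}` }] }
  }
)

// The host launches the server and communicates with it over stdio.
await server.connect(new StdioServerTransport())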
Let's Try It Out!
I'll use Claude Code's MCP configuration as an example, but this MCP server works with any MCP-compatible tool, not just Claude Code, including Cursor and others.
Note (Aug 31, 2025): Updated to use the npm package directly for easier installation.
For Claude Code, add the MCP like this:
% cd /path/to/project
% claude mcp add mcp-image --env GEMINI_API_KEY=your-api-key --env IMAGE_OUTPUT_DIR=/path/to/project -- npx -y mcp-image
For GUI tools like Claude Desktop, add the following to your MCP configuration JSON file:
{
  "mcpServers": {
    "mcp-image": {
      "command": "npx",
      "args": ["-y", "mcp-image"],
      "env": {
        "GEMINI_API_KEY": "your-api-key",
        "IMAGE_OUTPUT_DIR": "/path/to/project/images"
      }
    }
  }
}
Set IMAGE_OUTPUT_DIR to the absolute path where you want the generated images to be saved.
Configuration Notes (Getting Your API Key)
Generate your API key from Google AI Studio:
https://aistudio.google.com/
You can create your API key here and optionally link it to Google Cloud.
Set the obtained API key as GEMINI_API_KEY.
Actual Results
Let's regenerate this original image created by Canva AI:
It keeps the same tone but shifts the character's gaze to the left! (Well, technically it's to the right from our point of view, but close enough!)
The Appeal of nano-banana
At $0.039 per image, it's quite affordable considering the quality. Because the model is specialized for image generation, it tends to produce images that match expectations well when you provide appropriate prompts and leverage the API's features. Since it's still in preview, I see great potential as accuracy improves and features expand.
Practical Examples of New API Features
This MCP leverages three key features:
- maintainCharacterConsistency → Ensures the same character appearance across different scenes
- blendImages → Naturally combines input images with generated content
- useWorldKnowledge → Applies real-world knowledge for contextually accurate images
These features are exposed as tool parameters, so the LLM can decide when to enable them and adjust prompts accordingly.
MCP tool definition with three parameters:
{
  blendImages: {
    type: 'boolean',
    description: 'Enable multi-image blending for combining multiple visual elements naturally. Use when prompt mentions multiple subjects or composite scenes'
  },
  maintainCharacterConsistency: {
    type: 'boolean',
    description: 'Maintain character appearance consistency. Enable when generating same character in different poses/scenes'
  },
  useWorldKnowledge: {
    type: 'boolean',
    description: 'Use real-world knowledge for accurate context. Enable for historical figures, landmarks, or factual scenarios'
  }
}
Enhanced prompt generation based on flags:
if (params.maintainCharacterConsistency) {
  enhancedPrompt += ' [INSTRUCTION: Maintain exact character appearance, including facial features, hairstyle, clothing, and all physical characteristics consistent throughout the image]'
}
if (params.blendImages) {
  enhancedPrompt += ' [INSTRUCTION: Seamlessly blend multiple visual elements into a natural, cohesive composition with smooth transitions]'
}
if (params.useWorldKnowledge) {
  enhancedPrompt += ' [INSTRUCTION: Apply accurate real-world knowledge including historical facts, geographical accuracy, cultural contexts, and realistic depictions]'
}
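Once the flags are folded in, the request to Gemini itself is a single generateContent() call. Here's a minimal sketch of that step, assuming the @google/genai SDK and the gemini-2.5-flash-image-preview model name; the actual mcp-image server also handles output formats, input images, and errors:
import { writeFileSync } from 'node:fs'
import { GoogleGenAI } from '@google/genai'

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY })

async function generateImage(enhancedPrompt: string, outputPath: string): Promise<string> {
  const response = await ai.models.generateContent({
    model: 'gemini-2.5-flash-image-preview',
    contents: enhancedPrompt,
  })

  // Image bytes come back as base64-encoded inlineData parts.
  for (const part of response.candidates?.[0]?.content?.parts ?? []) {
    if (part.inlineData?.data) {
      writeFileSync(outputPath, Buffer.from(part.inlineData.data, 'base64'))
      return outputPath
    }
  }
  throw new Error('No image data returned')
}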
Implementation Challenges and Lessons Learned
I've been developing an Agentic Coding environment over the past few months.
Recently, I added safeguards to prevent unexpected behavior during end-to-end integration testing, which let this implementation go smoothly, with few gaps between what I expected and what the tests showed. While feedback from practical use and fine-tuning are still necessary, the daily reduction in implementation burden is genuinely rewarding.
That said, this development ran into two main challenges:
First: Integration Testing
To ensure expected behavior during integration, I introduced a mechanism (sub-agent) that generates E2E tests from Acceptance Criteria (AC) written in the Design Doc. While this significantly stabilized integration behavior, analyzing Japanese AC from multiple perspectives (requirements, dependencies, constraints/prerequisites, success criteria) to generate test cases resulted in a certain percentage of flaky tests.
I addressed this by tolerating flakiness until the implementation was complete, then removing flaky tests once all tests passed after implementation. Balancing ideals with reality remains an ongoing challenge.
Second: "LLMs Don't Know the Latest Technology" Problem
This MCP uses the @google/genai SDK. While generating the design documents involved web searches for the latest information, during implementation I didn't explicitly require web searches, assuming that if it was in the design doc, it would be fine. As a result, the LLM incorrectly concluded that the SDK in the design doc was wrong and should be @google/generative-ai, and implemented against the wrong package.
Since web searching for every task execution would be too context-heavy, I had to manually intervene and correct this. I'm still considering how to systematically address this feedback mechanism.
Conclusion
Implementing the image generation MCP server using nano-banana (Gemini 2.5 Flash Image) went smoothly. Having prior experience with both MCP and Gemini API applications helped avoid major pitfalls.
With practical features like character consistency and natural language editing, I highly recommend trying this if you're working on projects involving image generation. Since the model is currently in preview, I plan to keep up with regular updates.
The actual code is publicly available, so feel free to check it out if you're interested!
shinpr / mcp-image
MCP server for AI image generation using Google's Gemini API. Enables Claude Code, Cursor, and other MCP-compatible AI tools to generate and edit images seamlessly.
MCP Image Generator
A powerful MCP (Model Context Protocol) server that enables AI assistants to generate and edit images using Google's Gemini 2.5 Flash Image API. Seamlessly integrate advanced image generation capabilities into Claude Code, Cursor, and other MCP-compatible AI tools.
✨ Features
- AI-Powered Image Generation: Create images from text prompts using Gemini 2.5 Flash Image Preview
- Image Editing: Transform existing images with natural language instructions
- Advanced Options:
  - Multi-image blending for composite scenes
  - Character consistency across generations
  - World knowledge integration for accurate context
- Multiple Output Formats: PNG, JPEG, WebP support
- File Output: Images are saved as files for easy access and integration
🔧 Prerequisites
- Node.js 20 or higher
- Gemini API Key - Get yours at Google AI Studio
- Claude Code or Cursor (or any MCP-compatible AI tool)
- Basic terminal/command line knowledge
Top comments (6)
Well done on the MCP framing. Now move beyond “it basically works” to “it will survive scaling.”
Here's what I'd do to make it shippable:
Add validation: after image generation, run lightweight consistency checks (e.g., perceptual hash comparisons for character features). If off, fallback to alternative prompt strategy (e.g., longer seed or explicit example injection).
Lock down test behaviour: isolate random seeds in integration tests, record golden outputs, and flag drift as CI failures rather than “flaky.”
Raise MCP abstraction: don’t just append text instructions, consider embedding an instruction template or schema so that phrasing drift is minimized.
Next steps / TL;DR
Ship the orchestration layer, that’s your real product. It maps intent → flags → prompts → results.
Harden the edge cases: validation + deterministic testing.
Then scale: test STT, streaming, and full-stack agent loops next.
Thanks a lot for the detailed feedback, particularly the suggestion to use perceptual hashing for validation; that's a great idea I hadn't considered. I'm planning to work on improvements along these lines: consistency validation with perceptual-hash checks, deterministic integration tests with recorded golden outputs, and a structured instruction template instead of free-text prompt appends.
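Roughly the kind of check I have in mind for the first item (an untested sketch, assuming sharp as a dependency and a simple 8x8 average hash):
import sharp from 'sharp'

// 8x8 average hash: greyscale, downscale, then set one bit per pixel above the mean brightness.
async function averageHash(path: string): Promise<bigint> {
  const { data } = await sharp(path)
    .greyscale()
    .resize(8, 8, { fit: 'fill' })
    .raw()
    .toBuffer({ resolveWithObject: true })
  const mean = data.reduce((sum, v) => sum + v, 0) / data.length
  let hash = 0n
  data.forEach((v, i) => {
    if (v >= mean) hash |= 1n << BigInt(i)
  })
  return hash
}

// Hamming distance between two hashes; a small distance suggests the images look alike.
function hammingDistance(a: bigint, b: bigint): number {
  let diff = a ^ b
  let bits = 0
  while (diff > 0n) {
    bits += Number(diff & 1n)
    diff >>= 1n
  }
  return bits
}

// Usage idea: if hammingDistance(await averageHash(generated), await averageHash(reference))
// exceeds some threshold, retry with an alternative prompt strategy.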
I'll be tackling these step by step. Really appreciate the push to think beyond the prototype phase!
Glad I could help spark some new thinking. Good luck!
Super cool! Quick clarification: are maintainCharacterConsistency/blendImages/useWorldKnowledge actual Gemini 2.5 Flash Image parameters, or custom MCP flags you translate into prompt instructions (or specific API fields)? If the latter, which SDK options/endpoints are you using under the hood?
These aren’t official Gemini 2.5 Flash Image parameters. They’re custom MCP flags handled in my server code.
The implementation just calls the standard generateContent() method from the @google/genai SDK. When a flag like maintainCharacterConsistency is set, the server appends extra text instructions to the prompt before sending it to Gemini. You can see this logic in lines 122–135 of the code. So from Gemini's perspective, it's just receiving a regular prompt; the flags exist only in the MCP layer.
This MCP layer also allows the calling LLM (e.g., Claude Code) to map user intent to specific flags. For example, if the user asks "generate the same character in different poses," Claude may set maintainCharacterConsistency automatically. In practice, this two-step flow (Claude Code → MCP → Gemini) can be more effective than sending only the raw prompt.
Interesting!