Gao Dalie (Ilyass)
How I built Nano Banana AI Image Editing Agent

Recently, I’ve been working on a personal development project to create a service that handles multiple images, and I wanted to make the image generation workflow smoother. It’s a pain to have to generate images in a separate tool, download them, and then incorporate them into the code.

Then a new wind began blowing through the world of image-generation AI: a model called “nano-banana”. This unidentified image generator appeared out of nowhere on LMArena, a comparison site for AI models.

With no official announcement, it remains shrouded in mystery, but its high accuracy has caused quite a stir in the AI community.

In the world of image generation AI, well-known models such as DALL-E, Midjourney, and Stable Diffusion have long dominated the market. However, the emergence of nano-banana is about to change this landscape dramatically.

What’s particularly noteworthy about nano-banana is its consistency and editing capabilities. It effectively maintains character consistency across multiple images and handles complex image edits, tasks that previous models struggled with.

When I actually used it, I found that although the prompts required some ingenuity, it generated images suited to my usage scenarios with fairly high accuracy. So I decided to incorporate it into my development environment.

So, let me give you a quick demo of a live chatbot to show you what I mean.

[Demo video]

If you take a look at how the chatbot generates the output, you’ll see that the AI agent captures the user input and adds it to the session state message history, then constructs an API request by combining the system prompt (which instructs the agent to generate images and describe changes), the user’s text, and any uploaded reference image.

This request is sent to Google’s Gemini 2.5 Flash Image Preview model, configured to return both text descriptions and generated images. The model processes the prompt to extract visual concepts, objects, and styles, then generates visual content by transforming the instructions into a coherent image based on both the text prompt and any visual inputs.

It returns a structured response from which the unpack_response function extracts text into full_response and converts binary image data into a PIL Image object. The result is displayed in the Streamlit chat interface alongside the generated description and stored in the session-state message history, creating a persistent conversational record that users can reference and build upon iteratively.

Why this is a game changer

While conventional AI image generation has tended to rely on generating several attempts and picking the best one, Gemini 2.5 (nano-banana) excels at maintaining consistency even when the composition changes and at finishing an image by correcting only the targeted areas.

Localised editing, compositing multiple images, copying styles, and maintaining subject consistency can all be done with the same model, dramatically reducing the number of trials. A minimal sketch of a localized edit is shown below.
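To make the targeted-editing idea concrete, here is a minimal sketch of a localized edit using the same google-genai call pattern the app relies on later in this post. The file name photo.png and the prompt are hypothetical; it assumes the google-genai and Pillow packages are installed and a GEMINI_API_KEY environment variable is set:

import os
from io import BytesIO
from PIL import Image
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# A local photo to edit (hypothetical file name)
source = Image.open("photo.png")

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    # The instruction targets one region; the rest should stay intact
    contents=[
        "Replace only the sky with a dramatic sunset; keep everything else unchanged.",
        source,
    ],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# Save the first returned image part
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("edited.png")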

By reducing trial and error on original drawings and rough sketches in the pre-production stage and producing more refined output, the need for rework in the post-production stage also decreases.

In the workplace, the chances of a plan remaining just a plan decrease, and actually producing it becomes a realistic option. Depending on the product, a dramatic increase in production speed can also be expected.

Features

Gemini 2.5 Flash Image (formerly nano-banana) is the latest image generation and editing model developed by Google. This model can perform advanced image generation and editing using only text instructions, and also supports editing and compositing of existing images.

High-speed image generation: Images are generated in just a few seconds each, significantly faster than competing models, and at a competitive cost.

Designed specifically for image editing: You can change backgrounds and people’s expressions simply by sending text commands. It also supports editing tasks like blurring backgrounds, erasing people, changing poses, and colorizing black-and-white photos, and it faithfully follows multi-step commands within the same chat session.

Maintaining character consistency: Maintains facial features, body shapes, clothing, and so on with high accuracy, which is effective for generating and editing a series of images.

Fusion and composition of multiple images: An input image can be combined with another scene, or elements of multiple images can be merged into a new fused image (see the sketch after this list).

Gemini knowledge integration: Leveraging the world knowledge and logical inference capabilities of Google’s large-scale language model Gemini, the system generates semantically consistent images. It also performs well at accurately reproducing text and logos, expressing factual details, and reading diagrams.

Digital watermark embedding: A SynthID digital watermark is automatically embedded in every output image, making it possible to later identify it as AI-generated.
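As referenced in the list above, here is a minimal sketch of fusing two images into one scene; the setup is the same as in the localized-edit sketch earlier, and the file names product.png and background.png are hypothetical:

import os
from io import BytesIO
from PIL import Image
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Two local inputs to combine (hypothetical file names)
product = Image.open("product.png")
background = Image.open("background.png")

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=[
        "Place the product into the background scene, matching lighting and perspective.",
        product,
        background,
    ],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("fused.png")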

Let’s Start Coding

Let us now explore step by step how to build a Nano Banana AI Image Editing Agent. First, we install the libraries that support the model from a requirements file:

pip install -r requirements.txt
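If you are starting from scratch, a minimal requirements.txt for this project could look like the following; the package names are inferred from the imports used throughout this post:

streamlit
python-dotenv
google-genai
pillow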
Once installed, we import the key dependencies: streamlit, dotenv, io, PIL, and the google-genai client.


import os
from io import BytesIO

import streamlit as st
from dotenv import load_dotenv
from PIL import Image
from google import genai
from google.genai import types

Let’s set the page configuration with a custom title and sidebar. I made a dictionary to hold avatars for the assistant and the user so their messages look distinct, and I created styled headings on the main page using HTML with custom colors.

I also added a sidebar with a banana emoji title for fun, and I initialized the session state so the chatbot remembers past messages, starting with a default greeting from the assistant.

Then, I created a loop that displays each stored message with the right avatar and content, and if the assistant sends an image, I made sure it is displayed directly under the message, giving the app an interactive and conversational feel.

st.set_page_config(page_title='Gemini Nano Banana Chatbot',
                    initial_sidebar_state='auto')

# Distinct avatars for each chat role
avatars = {
    "assistant": "🤖",
    "user": "👤"
}

st.markdown("<h2 style='text-align: center; color: #3184a0;'>Gemini Nano Banana</h2>", unsafe_allow_html=True)
st.markdown("<h3 style='text-align: center; color: #3184a0;'>Image generator chatbot</h3>", unsafe_allow_html=True)

with st.sidebar:
    st.markdown("### 🍌 Gemini Nano Banana")

# Initialize the conversation history with a default greeting
if "messages" not in st.session_state:
    st.session_state.messages = [
        {"role": "assistant", "content": "How may I assist you today?", "image": None}
    ]

# Replay the stored conversation, showing images under assistant messages
for message in st.session_state.messages:
    with st.chat_message(message["role"],
                         avatar=avatars[message["role"]]):
        st.write(message["content"])
        if message["role"] == "assistant" and message["image"]:
            st.image(message["image"])

I developed a function called clear_chat_history that resets the conversation by replacing the session state messages with a single default assistant greeting, and then I connected this function to a "Clear Chat History" button in the sidebar so users can restart the chat whenever they want.

I also added a file uploader inside the sidebar that lets users upload images in JPG, JPEG, or PNG format, and once an image is uploaded, I made sure it gets opened with Image.open, saved into the session state for later use, and immediately displayed in the sidebar with a caption so users can see the image they just uploaded.

def clear_chat_history():
    st.session_state.messages = [
        {"role": "assistant", "content": "How may I assist you today?", "image": None}
    ]

st.sidebar.button("Clear Chat History", on_click=clear_chat_history)

with st.sidebar:
    uploaded_file = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])

    if uploaded_file:
        # Open the upload as a PIL Image and keep it for later API calls
        uploaded_image = Image.open(uploaded_file)
        st.session_state.image = uploaded_image
        st.image(uploaded_image, caption="Uploaded Image", use_container_width=True)


After that, I created a function run_query that lets the Agent send a request to Google’s Gemini API to generate text and images from what the user inputs. I started by loading the environment variables to safely get the key GEMINI_API_KEY and then set up the API client with that key.

I wrote a system prompt that clearly tells the model to generate an image and a short text describing any changes. Then I put together a contents list that includes the user’s input and the uploaded image if there is one, or just the input if not. I called client.models.generate_content using the "gemini-2.5-flash-image-preview" model, set it to return both text and images, and finally made the function return the model’s response, or an error string if something goes wrong.

def run_query(input_text):
    try:
        # Load the API key from a .env file
        load_dotenv()
        GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
        client = genai.Client(api_key=GEMINI_API_KEY)
        system_prompt = """
        #INSTRUCTIONS
        Generate an image according to the instructions.
        Specify in the output text the changes made to the image
        #OUTPUT
        A generated image and a short text
        """

        # Include the uploaded reference image if one exists
        if "image" in st.session_state and st.session_state.image:
            contents = [system_prompt, input_text, st.session_state.image]
        else:
            contents = [system_prompt, input_text]

        response = client.models.generate_content(
            model="gemini-2.5-flash-image-preview",
            contents=contents,
            config=types.GenerateContentConfig(
                response_modalities=['TEXT', 'IMAGE']
            )
        )

        if response:
            return response
        else:
            return "Error"
    except Exception as ex:
        # Surface failures (missing key, network errors) to the caller
        return f"Error: {ex}"

Next, I built a function called unpack_response that takes what the user types, sends it to the Gemini model, and then separates the text and image that the model creates. I set up a placeholder so we could update the output dynamically, started with an empty string for the text, and created a variable to hold the image.

If something goes wrong, the function returns an error message; otherwise it loops through the response, appending any text to the response string and opening any image data so it can be shown.

To make the chat feel real, I used st.chat_input so users can type messages, displayed each message with a 👤 avatar, then showed the assistant’s reply behind a loading spinner, including both the text and the image if the model generated one. Finally, I saved the assistant’s reply in the session state so the whole conversation stays visible and interactive.

def unpack_response(prompt):
    response = run_query(prompt)

    placeholder = st.empty()
    full_response = ""
    generated_image = None

    # Handle error responses
    if isinstance(response, str) and "Error" in response:
        return response, placeholder, None

    try:
        for part in response.candidates[0].content.parts:
            if part.text is not None:
                full_response += part.text
            elif part.inline_data is not None:
                generated_image = Image.open(BytesIO(part.inline_data.data))
    except Exception as ex:
        full_response = f"ERROR in unpack response: {str(ex)}"
        generated_image = st.session_state.image if "image" in st.session_state else None

    return full_response, placeholder, generated_image

output = st.empty()
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user", avatar=avatars["user"]):
        st.write(prompt)

if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant", avatar=avatars["assistant"]):
        with st.spinner("Thinking..."):

            full_response, placeholder, generated_image = unpack_response(prompt)
            if full_response:
                st.write(full_response)
            if generated_image:
                st.image(generated_image)

    message = {"role": "assistant", 
               "content": full_response,
               "avatar": avatars["assistant"],
               "image": generated_image}
    st.session_state.messages.append(message)
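With all the pieces in place, save the full script (for example as app.py, a name chosen here for illustration) and launch it locally:

streamlit run app.py

Streamlit will open the chat interface in your browser, with the image uploader and the Clear Chat History button in the sidebar.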

Conclusion

The arrival of nano-banana marks a major turning point in the image generation AI industry, with its overwhelming performance and innovative features opening up new possibilities for the creative industry.

Gemini 2.5 (nano-banana) is not just an AI that helps you create things better; it has the potential to change the production process itself. It is worth planning your business on the assumption that such capabilities will keep improving and let you achieve your goals more completely.

🧙‍♂️ I am a Generative AI expert! If you want to collaborate on a project, drop an inquiry here or book a 1-on-1 consulting call with me.

I would highly appreciate it if you:

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI

Book an appointment with me: https://topmate.io/gaodalie_ai

Support the content (every dollar goes back into the video): https://buymeacoffee.com/gaodalie98d

Subscribe to the newsletter for free: https://substack.com/@gaodalie
