Wesley Chun (@wescpy)

Gemini API 102a: Putting together basic GenAI web apps

TL;DR:

The first pair of posts in this ongoing Gemini API series provide a thorough introduction to using the API, primarily from Google AI. (Existing GCP users can easily migrate their code to its Vertex AI platform without much difficulty.) However, while command-line scripts are a great way to get started, they aren't how you're going to reach users. Aspiring data scientists and AI professionals make great use of powerful tools like Jupyter Notebooks, but being able to create/prototype web apps may also be useful. This post aims to address both of these "issues" by demonstrating use of the Gemini API in a basic genAI web app using Flask (Python) or Express.js (Node.js), all in about 100 lines of code!

[IMG] Build with Gemini

Introduction

Welcome to the blog covering Google developer technologies. Whether you're learning how to code Google Maps, export Google Docs as PDF, or explore serverless computing with Google, this is the right place to be. You'll also find posts on common knowledge like credentials, including API keys and OAuth client IDs... all of this from Python and sometimes Node.js.

If you've been following along in this series covering the Gemini API, you now know how to perform text-only and multimodal queries, use streaming, and hold multi-turn (or "chat") conversations, all from the command-line against one of the Gemini LLMs (large language models). It's time to take it to the next level by building a basic web app that uses the Gemini API.

Application

Regardless of whether you build the app with Python or Node.js, it works identically. End-users upload an image file (JPG, PNG, and GIF formats supported) along with a text prompt. The app then performs a multimodal query to the Gemini API using the latest 1.5 Flash model, then displays a reduced-size version of the image along with the prompt and the generated result from the model.

Both apps, with comments, plus the web templates can be found in the repo folder. Be sure you've created an API key and stored it as API_KEY = '<YOUR_API_KEY>' in settings.py for Python or .env for Node.js before jumping into the code. We'll start with Python first.
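For reference, here's roughly what each of those one-line config files looks like, with an illustrative placeholder value (never commit real API keys to source control):

# settings.py (Python version)
API_KEY = 'YOUR_API_KEY'

# .env (Node.js version)
API_KEY=YOUR_API_KEY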

Python

The example assumes you've acquired an API key and performed the other prerequisites from the previous post:

  1. Ensure your Python (including pip) installation is up-to-date (recommend 3.9+)
  2. Install packages: pip install -U pip flask pillow google-generativeai (or pip3)

Initial code walk-through

The only new package installed this time is the Flask micro web framework, which comes with the Jinja2 templating system. The app consists of only two files: the main application, main.py, and its accompanying web template, templates/index.html. With that, let's dive into the main application file, one chunk at a time, starting with the imports:

from base64 import b64encode
import io

from flask import Flask, render_template, request, url_for
from werkzeug.utils import secure_filename
from PIL import Image

import google.generativeai as genai
from settings import API_KEY

The io standard library package has an object (io.BytesIO) used in this app as an in-memory "disk file" for the thumbnail. You create your own settings.py local file to hold the API key (it's not in the repo). There are several 3rd-party packages as well:

| Module/package | Use |
| --- | --- |
| flask | Flask: popular micro web framework |
| werkzeug | Werkzeug: collection of WSGI web app utilities |
| pillow | Pillow: flexible fork of the well-known Python Imaging Library (PIL) |
| google.generativeai | Google AI Python SDK: provides access to Gemini API & models |

Werkzeug and the Jinja2 templating system are Flask dependencies. For the most part, their functionality is accessed directly through Flask, save for the single explicit import of Werkzeug's secure_filename() function, which is used in the app.
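As a quick aside, here's a sketch of what secure_filename() does to messy or even hostile file names (example values borrowed from the Werkzeug docs):

from werkzeug.utils import secure_filename

# path components and unsafe characters are stripped or replaced
print(secure_filename('../../etc/passwd'))    # 'etc_passwd'
print(secure_filename('My cool movie.mov'))   # 'My_cool_movie.mov'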

ALLOW_EXTS = {'png', 'jpg', 'jpeg', 'gif'}
MODEL_NAME = 'gemini-1.5-flash-latest'
THUMB_DIMS = 480, 360
JINUN_TMPL = 'index.html'

app = Flask(__name__)
genai.configure(api_key=API_KEY)
model = genai.GenerativeModel(MODEL_NAME)

| Constant | Use |
| --- | --- |
| ALLOW_EXTS | Image file types supported by the app |
| MODEL_NAME | Gemini LLM model to use in this app |
| THUMB_DIMS | Thumbnail dimensions |
| JINUN_TMPL | Jinja2/Nunjucks template |

After Flask is initialized comes the API key authorization required to access the Gemini API, and finally, the chosen model is set.

def is_allowed_file(fname: str) -> bool:
    return '.' in fname and fname.rsplit('.', 1)[1].lower() in ALLOW_EXTS

The is_allowed_file() function takes an uploaded file's name and checks it against the image file types supported by the app per ALLOW_EXTS. Everything else is the main application.
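A few hypothetical calls illustrate its behavior:

print(is_allowed_file('waterfall.JPG'))  # True  (extension check is case-insensitive)
print(is_allowed_file('notes.txt'))      # False (unsupported file type)
print(is_allowed_file('README'))         # False (no file extension at all)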

App operation

Before we dive into that, let's visualize what the app does so you can connect the dots more easily when reviewing the rest of the code. When you hit the app the first time, you get an initial empty-form view:

[IMG] Gemini API web app: empty form

You can clearly see two primary form elements, a file-picker to choose the image and a text field for the LLM prompt (which defaults to "Describe this image"), plus a submit button. From here, a user is expected to choose a locally-stored image file:

[IMG] Gemini API web app: image file picker

I picked the waterfall picture from the previous post. After the image has been selected, modify the prompt to something of your choosing. Below, I changed the default to "Where is this and what is it?"

[IMG] Gemini API web app: image and prompt set

With the image selected and prompt set, clicking on the submit button "causes things to happen," and the resulting screen shows a smaller thumbnail of the selected image, the prompt entered, the model used, and the LLM results:

[IMG] Gemini API web app: results

Note that there's always a blank form at the bottom of every step so users can move on to another image once they're done with the current one. With that, let's look at the main handler:

Main handler: error-checking

The first half of the main handler consists of checking for bad input... take a look:

@app.route('/', methods=['GET', 'POST'])
def main():
    context = {'upload_url': url_for(request.endpoint)}

    if request.method == 'POST':
        upload = request.files.get('file')
        if not upload:
            context['error'] = 'No uploaded file'
            return render_template(JINUN_TMPL, **context)

        fname = secure_filename(upload.filename.strip())
        if not fname:
            context['error'] = 'Upload must have file name'
            return render_template(JINUN_TMPL, **context)

        if not is_allowed_file(fname):
            context['error'] = 'Only JPG/PNG/GIF files allowed'
            return render_template(JINUN_TMPL, **context)

        prompt = request.form.get('prompt', '').strip()  # '' default avoids AttributeError if field absent
        if not prompt:
            context['error'] = 'LLM prompt missing'
            return render_template(JINUN_TMPL, **context)

This is the only handler in the app, supporting both GET and POST requests, meaning it handles the initial empty form page as well as actual work requests (via POST). Since the page always shows a blank form at the bottom, the first thing you see is the template context being set with the upload URL, which points the form submission back at the same endpoint (/).
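To make the url_for(request.endpoint) idiom concrete, here's a minimal sketch of what those values resolve to during a request to this app (Flask derives the default endpoint name from the view function):

# inside a request handled by main() above:
request.endpoint            # 'main' -- the view function's name by default
url_for(request.endpoint)   # '/'   -- the URL routed to that endpoint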

The next four sections handle various types of bad input:

  • No uploaded file
  • Upload with no file name
  • Upload with unsupported file type
  • No LLM prompt

In each of these situations, an error message is set for the template, and the user is sent straight back to the web page. Any error is highlighted at the top, followed by the same empty form, giving the user a chance to correct their mistake. Here's what it looks like if you forget to upload a file:

[IMG] Gemini API web app: no file error

Main handler: core functionality

The last chunk of code is where all the magic happens.

        try:
            image = Image.open(upload)
            thumb = image.copy()
            thumb.thumbnail(THUMB_DIMS)
            img_io = io.BytesIO()
            thumb.save(img_io, format=image.format)
            img_io.seek(0)
        except IOError:
            context['error'] = 'Invalid image file/format'
            return render_template(JINUN_TMPL, **context)

        context['model']  = MODEL_NAME
        context['prompt'] = prompt
        thumb_b64 = b64encode(img_io.getvalue()).decode('ascii')
        context['image']  = f'data:{upload.mimetype};base64,{thumb_b64}'
        context['result'] = model.generate_content((prompt, image)).text

    return render_template(JINUN_TMPL, **context)

if __name__ == '__main__':
    import os
    app.run(debug=True, threaded=True, host='0.0.0.0',
            port=int(os.environ.get('PORT', 8080)))

Assuming all the inputs pass muster, the real work begins. A copy of the original image is made, which is then converted into a smaller thumbnail for display. It's saved to an in-memory file object (io.BytesIO) and later base64-encoded for the template. Any errors occurring during this image processing result in an error message sent to the template.
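Here's that same thumbnail pipeline as a standalone sketch outside of Flask, using a hypothetical local file (the MIME type is hard-coded here, whereas the app gets it from the upload):

from base64 import b64encode
import io

from PIL import Image

image = Image.open('waterfall.jpg')   # hypothetical local image
thumb = image.copy()                  # keep the original intact for the LLM call
thumb.thumbnail((480, 360))           # shrinks in place, preserving aspect ratio
buf = io.BytesIO()                    # in-memory "disk file"
thumb.save(buf, format=image.format)
b64 = b64encode(buf.getvalue()).decode('ascii')
data_uri = f'data:image/jpeg;base64,{b64}'  # drops into an <img src="..."> tag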

If all has succeeded thus far, the request goes through the final round: being sent to the LLM for analysis. All of the fields needed to render a successful result are added to the template context: the prompt, the model used, the base64-encoded thumbnail, and finally, the result returned from the Gemini API, which was sent the (full-size) image and prompt.

Whether it's a plain GET or a POST resulting in all of this processing, the template is then rendered, wrapping up the last part of the handler. The rest of the code just kicks off the Flask development server on port 8080 to run the app. (The "devserver" is great for development and testing, but you would choose a more robust server for production.)
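Swapping out the devserver is straightforward; for example, one common production-grade choice for Flask apps is Gunicorn (an illustrative invocation; tune the worker count and port for your environment):

pip install gunicorn
gunicorn -w 2 -b :8080 main:app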

Web template

Now, let's look at the web template to tie the whole thing together:

<!doctype html>
<html>
<head>
<title>GenAI image analyzer example</title>
</head>
<body>

<style>
body {
  font-family: Verdana, Helvetica, sans-serif;
  background-color: #DDDDDD;
}
</style>

<h1>GenAI basic image analyzer (v0.1)</h1>

{% if error %}
    <h3>Error on previous request</h3>
    <p style="color: red;">{{ error }}</p>
    <hr>
{% endif %}

{% if result and image %}
    <h3>Image uploaded</h3>
    <img src="{{ image }}" />

    <h3>LLM analysis</h3>
    <b>Prompt received:</b> {{ prompt }}<p></p>
    <b>Model used:</b> {{ model }}<p></p>
    <b>Model response:</b> {{ result }}<p></p>
    <hr>
{% endif %}

<h3>Analyze an image</h3>

<form action="{{ upload_url }}" method="POST" enctype="multipart/form-data">
    <label for="file">Upload image to analyze:</label><br>
    <input type="file" name="file"><p></p>
    <label for="prompt">Image prompt for LLM:</label><br>
    <input type="text" name="prompt" value="Describe this image"><p></p>
    <input type="submit">
</form>

</body>
</html>

The initial headers and limited CSS (Cascading Style Sheets) styling show up at the top, followed by the app title. The error section comes next, displayed only if an error occurs. If an image is processed successfully, the results are displayed along with a thumbnail version of the image, the model, and the prompt. Finally, the empty form shows up at the end, and that's it! To run the app, just execute python main.py (or python3).

Both the app (main.py) and template (templates/index.html) can be found in the python folder of the repo.

Node.js

The Node version of the app is a near-mirror image of the Python version, and the web template is exactly the same... I chose Nunjucks (instead of another templating system like EJS) specifically because it uses the same templating format as Jinja2 (Python). Now ensure you have an API key and NPM-install the necessary packages:

  1. Ensure your Node (including NPM) installation is up-to-date (recommend 18+)
  2. Install packages: npm i dotenv express multer nunjucks sharp @google/generative-ai

Time to look at code... below is the modern JavaScript ECMAScript module, main.mjs:

import 'dotenv/config';
import express from 'express';
import multer from 'multer';
import nunjucks from 'nunjucks';
import sharp from 'sharp';
import { GoogleGenerativeAI } from '@google/generative-ai';

const PORT = process.env.PORT || 8080;
const ALLOW_EXTS = ['png', 'jpg', 'jpeg', 'gif'];
const MODEL_NAME = 'gemini-1.5-flash-latest';
const THUMB_DIMS = [480, 360];
const JINUN_TMPL = 'index.html';

const app = express();
app.use(express.urlencoded({ extended: false }));
nunjucks.configure('templates', { autoescape: true, express: app });
const upload = multer({ storage: multer.memoryStorage() });
const genAI = new GoogleGenerativeAI(process.env.API_KEY);
const model = genAI.getGenerativeModel({ model: MODEL_NAME });

async function is_allowed_file(fname) {
    return (fname.includes('.') && ALLOW_EXTS.includes(
        fname.toLowerCase().slice(((fname.lastIndexOf('.') - 1) >>> 0) + 2)));
}

The major difference in this Node version vs. Python is that there is more initialization required for Express.js middleware, such as setting up the Nunjucks templating system and configuring the Multer system to handle file uploads. These are the 3rd-party packages you see imported at the top.

| Package | Use |
| --- | --- |
| dotenv | Dotenv: adds environment variables from .env |
| express | Express.js: popular micro web framework |
| multer | Multer: middleware to handle file uploads |
| nunjucks | Nunjucks: JavaScript templating system |
| sharp | Sharp: high-performance image processing library |
| @google/generative-ai | Google AI SDK for JavaScript: provides access to Gemini API & models |

Python uses fewer 3rd-party packages explicitly because the (Jinja2) templating system is a Flask dependency, and Flask itself handles file uploads. The Python app also uses settings.py, a nod to Django, instead of .env like Node.js, which requires dotenv.

app.all('/', upload.single('file'), async (req, rsp) => {
    let context = {
        upload_url: `${req.protocol}://${req.get('host')}${req.originalUrl}`
    };

    if (req.method === 'POST') {
        const upload = req.file;
        if (!upload) {
            context.error = 'No uploaded file';
            return rsp.render(JINUN_TMPL, context);
        }
        const fname = upload.originalname.trim();
        if (!fname) {
            context.error = 'Upload must have file name';
            return rsp.render(JINUN_TMPL, context);
        }
        const allowed = await is_allowed_file(fname);
        if (!allowed) {
            context.error = 'Only JPG/PNG/GIF files allowed';
            return rsp.render(JINUN_TMPL, context);
        }
        const prompt = (req.body.prompt || '').trim();  // '' fallback avoids a TypeError if field absent
        if (!prompt) {
            context.error = 'LLM prompt missing';
            return rsp.render(JINUN_TMPL, context);
        }

        const image = upload.buffer;
        const mimeType = upload.mimetype;
        let thumb_b64;
        try {
            const thumb = await sharp(image);
            const thumb_buf = await thumb.resize({ width: THUMB_DIMS[0] }).toBuffer();
            thumb_b64 = thumb_buf.toString('base64');
        }
        catch (ex) {
            context.error = 'Invalid image file/format';
            return rsp.render(JINUN_TMPL, context);
        }

        context.model = MODEL_NAME;
        context.prompt = prompt;
        context.image = `data:${mimeType};base64,${thumb_b64}`;
        const payload = { inlineData: { data: image.toString('base64'), mimeType } };
        const result = await model.generateContent([prompt, payload]);
        context.result = await result.response.text();
    }
    return rsp.render(JINUN_TMPL, context);
});

app.listen(PORT, () => console.log(`* Running on port ${PORT}`));

The main handler is a twin of the Python version, consisting of the same major sections:

  1. Set upload_url in context and error-check
  2. Create thumbnail and base64-encode it
  3. Send image thumb, model, prompt, and results from Gemini to template

As mentioned above, the templates/index.html template is identical to the Python version. There is also a CommonJS version (main.js) of the app if you prefer. To run either, execute node main.mjs or node main.js. The app(s) along with the template can be found in the Node.js repo folder.

Summary

Many are excited to delve into the world of GenAI & LLMs, and while user-friendly "Hello World!" scripts are a great way to get started, seeing how to integrate the Gemini API into web apps brings developers one step closer to productizing something. This post highlights a basic web app that takes a prompt and an image as input, sends them to the Gemini API, and displays the results along with an empty form for analyzing the next image. From here, use your imagination as to what you can build on top of this baseline "MVP" web app.

If you found an error in this post, a bug in the code, or have a topic you want me to cover in the future, drop a note in the comments below or file an issue at the repo. Upcoming posts in this series will highlight the differences between the 1.0 and 1.5 Gemini models' outputs across a variety of queries as well as more advanced topics like fine-tuning, function-calling, and embeddings, so stay tuned for those. Thanks for reading, and I hope to meet you at an upcoming event soon... see the travel calendar at the bottom of my consulting site.

References

Below are links to resources for this post or additional references you may find useful.

Blog post code samples

Google AI, Gemini, Gemini API



WESLEY CHUN, MSCS, is a Google Developer Expert (GDE) in Google Cloud (GCP) & Google Workspace (GWS), author of Prentice Hall's bestselling "Core Python" series, co-author of "Python Web Development with Django", and has written for Linux Journal & CNET. He runs CyberWeb specializing in GCP & GWS APIs and serverless platforms, Python & App Engine migrations, and Python training & engineering. Wesley was one of the original Yahoo!Mail engineers and spent 13+ years on various Google product teams, speaking on behalf of their APIs, producing sample apps, codelabs, and videos for serverless migration and GWS developers. He holds degrees in Computer Science, Mathematics, and Music from the University of California, is a Fellow of the Python Software Foundation, and loves to travel to meet developers worldwide at conferences, user group events, and universities. Follow he/him @wescpy & his technical blog. Find this content useful? Contact CyberWeb or buy him a coffee (or tea)!
