DEV Community

Cover image for I built a tool to turn screenshots into docs with Gemma 4 and watched it think
Ola Adesoye
Ola Adesoye

Posted on

I built a tool to turn screenshots into docs with Gemma 4 and watched it think

Gemma 4 Challenge: Write about Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

For someone who isn't easily moved, I was intrigued the first time i ran it.

I'd uploaded a screenshot of an Anchor docs page which contained a Rust struct, a wall of prose, and a sidebar of nav links. Gemma 4 came back with a clean three-section writeup which included overview, key elements, notes for readers.

It had read the page like a person would and that's the moment this post is really about.


Lens, before the first image goes in. Streamlit and settings on the left.

What I built (and why)

The challenge was to write something useful about Gemma 4. The easiest path I could think of at the time was a getting-started tutorial. But depending on much the submission will be, this could also the most crowded approach.

Afterall, every model release gets a hundred of those, and most are interchangeable.

So I built a tool instead, and wrote about using it.

Lens is a small Streamlit app that takes an image (a UI screenshot, a diagram, a photo, a sticky note) and an intent (documentation, blog-outline, or alt-text), and asks Gemma 4 to do the writing. It does two things on top of a normal API call that I think are actually interesting:

  1. It surfaces Gemma 4's thinking trace alongside the answer — so you can see what the model considered before deciding what to write.
  2. It uses Gemma 4's native bounding-box detection to identify visual elements in the image and draws them over the upload.

The code lives on GitHub if you want to skip ahead. The rest of this post is what I learned by using it, end to end with over 35 real API calls.


How Lens uses Gemma 4

The whole client is small enough to read. Here's the multimodal call, which is the load-bearing five lines of the entire project:

from google import genai
from google.genai import types

client = genai.Client()  # picks up GEMINI_API_KEY from env

with open("screenshot.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemma-4-26b-a4b-it",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "You are a senior technical writer. Look at the image and produce..."
    ],
)

print(response.text)
Enter fullscreen mode Exit fullscreen mode

It's just that. Image as bytes, MIME type, a prompt, and you're talking to a Mixture-of-Experts model that activates roughly 4B parameters per token while staying in the 256K-context family.

(For the curious: I default to gemma-4-26b-a4b-it because the MoE is gentler on free-tier rate limits. gemma-4-31b-it, the flagship dense model, is one sidebar toggle away.)

Asking for thoughts

Gemma 4 ships with a thinking mode — the model emits a chain-of-thought trace before its final answer. The Python SDK exposes this through a config object:

from google.genai import types

config = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(include_thoughts=True),
)

response = client.models.generate_content(
    model="gemma-4-26b-a4b-it",
    contents=[...],
    config=config,
)

# Thought parts are flagged on the candidate's content parts.
for part in response.candidates[0].content.parts:
    if getattr(part, "thought", False):
        print("THOUGHT:", part.text)
    else:
        print("ANSWER:", part.text)
Enter fullscreen mode Exit fullscreen mode

In Lens, I render the answer up top and tuck the thoughts behind an expandable panel. I expected the SDK path to be flaky and wrote a fallback that asks the model to wrap its reasoning in <thinking>...</thinking> tags so I could parse it from plain text.

The fallback never triggered. Over 14 runs with thinking on, every single one came back via the SDK path with proper thought parts.

![Thoughts panel expanded]

Gemma 4's thinking trace alongside the final writeup. The thoughts are notes-to-self. They're not polished prose so don't be too hard on the model here.

Asking for bounding boxes

This one I genuinely didn't believe would work without prompt-engineering gymnastics. The Hugging Face notes for Gemma 4 claim it returns object detection results as raw JSON.

I took this to mean I wasn't going to encounter grammar constraints or a structured-output dance whatsoever. So I tried the bluntest possible prompt:

You are a UI element detector. Look at the image and return ONLY a JSON array.
Each item must be: {"box_2d": [y_min, x_min, y_max, x_max], "label": "<short noun phrase>"}.
Coordinates are normalised 0-1000 against image dimensions (0,0 is top-left).
Detect up to 10 of the most prominent elements.
No prose. No markdown code fences. Just the JSON array.
Enter fullscreen mode Exit fullscreen mode

And it just... worked. Every single image I tested produced parseable JSON. The numbers landed in the right places. The labels were sensible. Some examples from the batch I ran:

  • A UI screenshot of a GitHub PR page produced 5 cleanly-bounded boxes (PR title, file diff, comment thread, etc.)
  • A sticky note produced 10 boxes for individual items
  • A flowchart diagram produced 10 boxes. The model identified each one as a different shape in the flow.

That last one is where things got really interesting.


The flowchart moment

I uploaded a screenshot of a flowchart. The flowchart was the kind with diamonds and rectangles and an "advisory only" branch. After uploading, I asked Lens to detect elements. Here's what came back:

[
  {"box_2d": [5, 238, 54, 431], "label": "User Request box"},
  {"box_2d": [101, 206, 285, 463], "label": "Gate 1 diamond"},
  {"box_2d": [305, 32, 332, 241], "label": "Frozen label"},
  {"box_2d": [449, 15, 521, 256], "label": "All actions frozen box"},
  {"box_2d": [365, 331, 616, 724], "label": "Gate 2 diamond"},
  {"box_2d": [642, 268, 667, 415], "label": "Rule Violated label"},
  {"box_2d": [734, 167, 803, 521], "label": "Blocked box"},
  {"box_2d": [713, 634, 848, 831], "label": "Gate 3 diamond"},
  {"box_2d": [925, 431, 985, 691], "label": "Advisory only box"},
  {"box_2d": [925, 766, 982, 985], "label": "Execute Action box"}
]
Enter fullscreen mode Exit fullscreen mode

![Flowchart with detection overlay]


Ten boxes with ten labels. The geometry is precise.

The precision floored me. The model was able to find the shapes, follow the flow, name the branches, and distinguish a label ("Rule Violated") from the box it pointed at. It clearly understood it was looking at a flowchart.

But then I noticed something.

It called the decision symbol a diamond. It called the process symbols boxes. It called the terminator a box. Every flowchart diagram has a vocabulary such as decision, process, terminator, connector, etc, but Gemma 4 was naming the shapes by their visual primitives instead of their domain function.

This is the most interesting thing I learned during the build, so I want to sit with it for a second.

The model knew enough to follow the flow. It knew enough to label "Rule Violated" as belonging next to "Gate 2 diamond." It detected ten distinct elements in the right places. It just didn't bridge the last gap from "I see a diamond" to "I see a decision node."

I don't think this is a Gemma 4 weakness. I think it's a prompt design finding. If I'd written "You are a flowchart annotator. Use standard flowchart terminology such as decision, process, terminator, connector when labeling elements," I'd bet money I'd get domain-accurate labels back. I think the capability is there but the framing just wasn't.

And I believe this is the kind of thing benchmark scores can't easily catch. You only notice when you actually use a model on your real inputs.


What the numbers looked like

I built a small automated test battery alongside the app. The automated test runs every image in test-assets/ through every intent (with thinking on and off) plus a detection pass, and writes a markdown report. Over 35 calls on 5 images:

Metric Value
Calls succeeded 34 / 35
Calls with thoughts returned 14 / 14 requested
Detection passes parsed cleanly 5 / 5
Average latency ~32.6 s
Total wall time ~19 min

A few things stand out.

The one failure was a 500 INTERNAL from the API mid-batch. It wasn't a rate limit or a malformed input like i initially thought. It was just a transient backend hiccup.

The runner caught it, logged it, kept going. The next call succeeded though.

I would like to think one point of failure for starters is permissible. Afterall, free-tier APIs have moments; the resilience pattern matters more than pretending they don't.

Latency is real. ~32 seconds per call on the MoE model is not snappy. For an interactive UI you'd absolutely want to stream responses, which the SDK supports but Lens doesn't, because the point was clarity over polish. Worth knowing if you're planning anything user-facing.

Vision quality is the headline. The longest output (a documentation pass on a GitHub PR screenshot, thinking ON) read like something a human would write. It correctly extracted the PR number, the repo, the contributor, the file changes Copilot AI summarized, and the in-line bug fixes (I didn't specifically prompt for any of these things btw).

![Result + run-info panel]

Latency and token usage on the right; the actual writeup on the left. Both matter.


Build choices I'd repeat (and how Lens is laid out)

A few decisions that worked out:

Streamlit, not a frontend framework. Single file. Hot reload. Browser preview. The whole UI is ~180 lines. If your project's job is to demonstrate a model, frontend ceremony is a tax as far as I'm concerned.

A separate test battery script. Manually probing the app one image at a time would have given me anecdotes. The battery gave me numbers. It was within the same code path as the app so there was no extra surface area to maintain.

An honest fallback for thinking mode. I shipped the SDK-config path AND a prompt-tag fallback because I couldn't verify SDK parity ahead of time. The fallback never triggered, but writing it kept the build moving without a HALT for verification.

Centralized error mapping. Every API exception gets classified. This includes rate-limit, auth, network, other. All of them get converted to a friendly Streamlit message instead of a Python traceback. The user never sees a stack trace; the test battery never crashes on a 429.

The layout, if you want to clone and modify:

gemma4-lens/
├── app.py              # Streamlit UI
├── gemma_client.py     # google-genai wrapper, thinking + error handling
├── prompts.py          # Three intent templates + the detection prompt
├── render.py           # Resize, JSON parse, box drawing, thought split
├── tests/
│   └── test_battery.py # Matrix runner + markdown report writer
├── test-assets/        # Drop your own images here
├── reports/            # Auto-generated reports
├── requirements.txt
└── README.md
Enter fullscreen mode Exit fullscreen mode

Quickstart, for anyone who wants to try Gemma 4 today

Grab an API key from Google AI Studio. It's free and takes about 30 seconds.

git clone https://github.com/Zolldyk/Lens.git
cd Lens

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# paste your key into .env

streamlit run app.py
Enter fullscreen mode Exit fullscreen mode

Upload anything. A screenshot, a photo, a diagram you sketched on a napkin. Pick an intent. Toggle thinking on if you want to see the model reason. Toggle detection on if you want it to identify elements.

If you're on macOS, use python3, not python — stock macOS doesn't ship a python symlink, which was the one snag I hit during setup. I added it to the README so you won't.


What I'd say about Gemma 4 after using it

I came in expecting to be either dazzled or underwhelmed. What I got was something more useful: a model that behaves like a senior contributor. It reads visual context carefully. It surfaces its reasoning when asked. It returns structured output when you ask for structure. It does the thing you wanted, and then a little more, and it labels the parts.

It also still benefits enormously from precise prompting — the flowchart shape vs. flowchart function thing is the cleanest example I have. The raw capability seems to be there; getting domain-accurate output is a craft.

If I were starting a new project that needed multimodal understanding today, I wouldn't mind reaching for Gemma 4. Two model variants on the API. Apache 2.0 license. Native structured output. Native thinking mode. 256K context. Free tier that's generous enough to build and test something real before you commit.

That's a lot of value living in one model release, and it's all reachable through five lines of Python.


Top comments (0)