How to Automate the ChatGPT & Gemini Web UIs Without an API Key

Usama — Tue, 30 Jun 2026 12:35:30 +0000

You've got a folder of a few hundred screenshots and you want the text out of each one. Or you want to generate a batch of images for a side project. Or you just want to drop a single "summarize this" call into a script you're writing on a Sunday afternoon. So you open the pricing page for the official API, do the math on per-token billing plus setting up keys and a payment method, and it's hard to justify, because the exact same model will do the exact same thing for free in a browser tab.

There are really two ways to get a model like ChatGPT or Gemini to do work for you. The web UI is free, or already covered by a subscription you're paying for anyway, but you drive it by hand. The API is scriptable, but you pay by the token. Most of the time that trade-off is fine. But for a whole category of work like hobby projects, throwaway scripts, research, or anything that doesn't need production-grade reliability, you're stuck picking between "free but manual" and "automated but paid."

Which raises the obvious question: why not automate the free web UI? It's just a webpage. You open it, type in the box, click send. It turns out that hides a few fiddly problems, which I ran into enough times that I eventually built a small library for them. In this article we'll work through what it takes to automate these UIs, and at the end I'll show how little code it comes down to.

1. What it takes to drive a chat UI

A single round trip with ChatGPT or Gemini breaks down into four jobs:

Get your text into the input box
Optionally attach a file
Wait for the model to finish answering
Read the answer back out

Every one of these is harder than it sounds, because the page is a modern single-page app that was never built to be driven by a script. We'll use Selenium with undetected-chromedriver, and for now assume the browser is already open (we'll get to launching it in the next section). To keep the code readable I'll show whichever of the two platforms makes each problem clearest, and mention the other where it differs.

1.1 Typing the message

The first surprise is that the input isn't a normal text field you can drop a string into. On ChatGPT it's a contenteditable div, and on Gemini it's a custom rich-textarea element. You can still send keystrokes to it, but two things will trip you up. A plain Enter submits the message, so any newline inside your prompt has to go in as Shift+Enter. And ChromeDriver's send_keys only handles characters in Unicode's Basic Multilingual Plane, so emoji and anything else outside that range break it and have to be inserted through JavaScript instead.

That pushes you toward sending the message one character at a time:

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

box = driver.find_element(By.CSS_SELECTOR, 'div[contenteditable="true"]')
box.click()

for char in message:
    if char == "\n":
        # A plain Enter would send the message early
        box.send_keys(Keys.SHIFT, Keys.ENTER)
    else:
        box.send_keys(char)

Gemini works the same way, just against the rich-textarea element instead of the contenteditable div.
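The loop above leaves out the emoji case. Here's a rough sketch of one way to cover it: split the prompt into steps first, then replay them. The names plan_keystrokes and type_message are hypothetical, and inserting text with document.execCommand('insertText', ...) is my assumption about what these editors accept (it's a deprecated but still widely supported browser call), so treat it as a starting point rather than a guarantee.

```python
def plan_keystrokes(message):
    """Turn a prompt into (action, text) steps that are safe to replay."""
    steps = []
    for char in message:
        if char == "\n":
            steps.append(("newline", ""))      # must be sent as Shift+Enter
        elif ord(char) > 0xFFFF:
            steps.append(("script", char))     # above U+FFFF: insert via JavaScript
        else:
            steps.append(("keys", char))       # ordinary character: send_keys is fine
    return steps


def type_message(driver, box, message):
    """Replay the plan against an input element that already has focus."""
    # Imported here so the planner above can run without Selenium installed
    from selenium.webdriver.common.keys import Keys

    for action, text in plan_keystrokes(message):
        if action == "newline":
            box.send_keys(Keys.SHIFT, Keys.ENTER)
        elif action == "script":
            # Assumption: the editor accepts text inserted at the caret this way
            driver.execute_script(
                "document.execCommand('insertText', false, arguments[0]);", text
            )
        else:
            box.send_keys(text)
```

Keeping the planning separate from the Selenium calls means you can check the tricky part, which characters go down which path, without opening a browser.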

1.2 Uploading a file

This is where it gets interesting. The file <input> on the page is hidden, and the useful trick is that you don't need to open a file dialog at all: if you can get a reference to a hidden input[type=file], you can hand it a path with send_keys and ChromeDriver does the upload internally, no dialog involved.

ChatGPT is the easy case. The input already exists in the page, so you unhide it and send the path. Gemini is the awkward one. Clicking its upload button makes the page call the input's own .click(), which pops the operating system's file picker, a window Selenium has no way to drive. The fix is to stop the page from opening that dialog in the first place, by monkey-patching the browser's click method so it ignores the call on file inputs:

driver.execute_script("""
    const orig = HTMLInputElement.prototype.click;
    HTMLInputElement.prototype.click = function () {
        if (this.type === 'file') return;   // swallow the call that opens the OS dialog
        return orig.apply(this, arguments);
    };
""")

With that in place you can walk through Gemini's upload menu without a dialog ever appearing, then find the hidden input it creates, unhide it, and feed it the path:

file_input = driver.find_element(By.CSS_SELECTOR, 'input[name="Filedata"]')
driver.execute_script("arguments[0].style.display = 'block';", file_input)
file_input.send_keys("/path/to/receipt.jpg")

In real code you'd restore the original click afterward so the patch doesn't leak into the rest of the session, but the patch and the three lines above are the whole idea. The recurring lesson with this kind of automation is that the hardest problems are the ones where the page actively fights you.
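One way to handle the restore step is a context manager, so the original click comes back even if the upload fails halfway. This is a minimal sketch: suppress_file_dialog and the window.__origInputClick property are names I made up for illustration, and it assumes the page doesn't navigate between the patch and the restore.

```python
from contextlib import contextmanager

# Same patch as above, but the original method is parked on window so a
# later script call can put it back.
PATCH_JS = """
    if (!window.__origInputClick) {
        window.__origInputClick = HTMLInputElement.prototype.click;
        HTMLInputElement.prototype.click = function () {
            if (this.type === 'file') return;   // swallow the call that opens the OS dialog
            return window.__origInputClick.apply(this, arguments);
        };
    }
"""

RESTORE_JS = """
    if (window.__origInputClick) {
        HTMLInputElement.prototype.click = window.__origInputClick;
        delete window.__origInputClick;
    }
"""


@contextmanager
def suppress_file_dialog(driver):
    """Block the OS file picker for the duration of the with-block."""
    driver.execute_script(PATCH_JS)
    try:
        yield
    finally:
        driver.execute_script(RESTORE_JS)   # runs even if the upload step raises
```

You'd wrap the menu clicks and the send_keys call in `with suppress_file_dialog(driver):` and let the block's exit do the cleanup.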

1.3 Waiting for the response

You've sent the message. Now you have to know when the model is done, and there's no event you can listen for and no callback that fires. You poll the page and read its visual cues. The cleanest signal on ChatGPT is the stop button: while a response is being generated there's a stop button on screen, and when generation finishes it disappears.

import time

def is_generating():
    return bool(driver.find_elements(By.CSS_SELECTOR, '[data-testid="stop-button"]'))

while is_generating():
    time.sleep(1)

The principle here is that you're inferring application state from interface elements that were never meant to be read as an API.
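The bare loop has two gaps worth knowing about: if you start polling before the stop button has appeared, it exits immediately, and if the page hangs, it never exits. A slightly sturdier version, sketched here with hypothetical names and timeout values I picked arbitrarily, waits for generation to start before waiting for it to stop, and gives up after a deadline.

```python
import time


def wait_for_response(is_generating, start_timeout=10.0, finish_timeout=300.0, poll=1.0):
    """Block until generation has started and then stopped.

    is_generating: zero-argument callable, True while the stop button is visible.
    Returns True if the response finished, False if either timeout was hit.
    """
    # Phase 1: give the stop button time to appear, so we don't return early
    deadline = time.monotonic() + start_timeout
    while not is_generating():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)

    # Phase 2: wait for it to disappear again
    deadline = time.monotonic() + finish_timeout
    while is_generating():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

A False result only means the function never saw a clean start and finish. A very short answer can complete between two polls, so it's worth checking whether a new response appeared before treating False as a failure.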

1.4 Getting the response out

The reply lives in the page as rendered HTML. Pulling the text out is a matter of finding the right container in the last response and reading it:

turn = driver.find_elements(By.CSS_SELECTOR, ".agent-turn")[-1]   # the most recent response
text = turn.find_element(By.CSS_SELECTOR, ".markdown").text

If you want the raw markdown source instead of the rendered text, there's a copy button you can click and then read off the clipboard. And if the response contains a generated image, getting it out is its own small pipeline: you click the image's download button and then wait for the file to arrive in your download folder, skipping the partial .crdownload file the browser writes while the download is still in progress.
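The download-folder wait can be sketched in plain Python. This version assumes you recorded the folder's file names before clicking the download button (the known argument), and that the browser renames the .crdownload file to its final name only once the download is complete. The function name and defaults are mine.

```python
import time
from pathlib import Path


def wait_for_download(folder, known=(), timeout=60.0, poll=0.5):
    """Return the first new, fully written file in `folder`, or None on timeout.

    known: file names that were already present before the download started.
    """
    folder = Path(folder)
    known = set(known)
    deadline = time.monotonic() + timeout
    while True:
        new = sorted(
            p.name
            for p in folder.iterdir()
            if p.is_file()
            and p.name not in known
            and not p.name.endswith(".crdownload")   # still being written
        )
        if new:
            return folder / new[0]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

Comparing against a snapshot of existing names, rather than picking the newest file, keeps an unrelated older download from being mistaken for the image.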

That's a full round trip: text in, file attached, wait for the answer, text or image back out. Run it twice, though, and you hit the next problem. The second time your script opens the browser, you're logged out and starting from a blank session, which is where the next piece comes in.

2. Making it survive across runs

The reason your second run starts logged out is that an automated browser, by default, begins every session from nothing: no cookies, no history, no saved login. So before any of the previous section's code is useful in practice, you need the browser to remember who you are between runs, and you need it to behave enough like a real session that the platform doesn't start throttling you. That comes down to one Chrome setting, a one-time setup step, and typing at a human pace.

2.1 A browser profile that persists

Chrome keeps everything about your identity on a site, including cookies and login sessions, inside a profile directory. If you let Chrome spin up a throwaway profile each run, you lose all of that the moment the script ends. Point it at a directory you control instead, and the login survives:

import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--user-data-dir=/path/to/your/profile")

driver = uc.Chrome(options=options)

Two things are happening here. undetected-chromedriver is a drop-in replacement for Selenium's Chrome that smooths over the most obvious tells of an automated browser. And the --user-data-dir flag is the part that gives you persistence: it tells Chrome to store its profile in a folder of your choosing, so the session you logged into yesterday is still there today. A profile with real history also looks like a returning user rather than a brand-new automated one, which keeps the session healthier over time.

2.2 Logging in, once

A profile directory is only useful once there's a logged-in session inside it, so there's a one-time setup step. You open the browser pointed at your profile, log in by hand, then close it. Every automated run after that reuses the saved session.

driver = uc.Chrome(options=options)
driver.get("https://gemini.google.com")

input("Log into the browser window, then press Enter here to finish setup.")
driver.quit()

Logging in is also where a paid plan pays off. If you already subscribe to ChatGPT Plus or a paid Gemini tier, signing in during setup means every automated run uses that subscription, with its higher message limits and access to the better models, instead of being capped at the free tier. You do this once per machine and forget about it.

2.3 Typing at a human pace

A script that drops an entire prompt into the box in a single instant doesn't behave like a person at a keyboard, and sessions that look automated are the ones that get rate-limited or challenged. The fix is cheap. We're already sending the message one character at a time, so all it takes is a small, slightly random delay between keystrokes:

import random
import time

for char in message:
    if char == "\n":
        box.send_keys(Keys.SHIFT, Keys.ENTER)   # newline without submitting
    else:
        box.send_keys(char)
    time.sleep(random.uniform(0.02, 0.05))   # a human pace, not an instant dump

The randomness matters more than the exact timing, since a perfectly even rhythm is itself a tell.

With that, the machinery is complete. The browser stays logged in across runs, and the typing looks enough like a real person's to keep the session stable. You've now seen everything that goes into automating these interfaces, which means it's a good moment to step back and see how much of it you have to write yourself.

3. All of that, in a few lines

Every problem in the last two sections is the kind you want to solve once and then never think about again. That's what pushed me to wrap the whole thing up into a library. It's called Hermex, and you install it with pip install hermex.

The one-time login from the previous section becomes a single call:

from hermex import ChatGPT

ChatGPT.setup()   # opens a browser once: log in, then close the window

After that, the entire round trip from earlier, launching the browser, typing, uploading, waiting for the response, and reading it back, is one line:

response = ChatGPT.simple_query("What does this receipt say?", attachments=["receipt.jpg"])
print(response.text)
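For the folder-of-screenshots case from the introduction, a batch loop might look like the sketch below. It relies only on the call shape shown above (a prompt, an attachments list, and a result with a .text attribute). The ocr_folder helper, the default prompt, and the choice to handle only .png files are my own assumptions, and the query function is passed in so the loop doesn't depend on anything else about the library.

```python
from pathlib import Path


def ocr_folder(folder, query, prompt="Transcribe all text in this image."):
    """Run one query per image in `folder` and collect the answers by file name.

    query: a callable shaped like ChatGPT.simple_query(prompt, attachments=[...])
           whose return value has a .text attribute.
    """
    results = {}
    for path in sorted(Path(folder).glob("*.png")):
        try:
            results[path.name] = query(prompt, attachments=[str(path)]).text
        except Exception as exc:   # one failed image shouldn't abort the batch
            results[path.name] = f"ERROR: {exc}"
    return results
```

You'd call it as `ocr_folder("screenshots", ChatGPT.simple_query)`. For a few hundred files it's probably worth adding a pause between queries, since a steady stream of requests is exactly the kind of pattern that gets a session rate-limited.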

For a back-and-forth conversation, keep the browser open and call query as many times as you want:

from hermex import Gemini

gemini = Gemini()
gemini.open_url()

print(gemini.query("Summarize the history of the internet.").text)
print(gemini.query("Now just the key dates.").text)

gemini.close()

And a generated image comes back as a path to the downloaded file:

response = gemini.query("Generate an image of a mountain at sunset.")
print(response.image)

Under the hood, that's everything from the previous sections: the character-by-character typing with its newline and emoji handling, the hidden-input upload with Gemini's dialog suppression, the polling that waits for generation to finish, the text and image extraction, and the persistent profile that keeps you logged in. None of it is conceptually hard, but it's a lot of fiddly surface area to get right and, harder still, to keep working as the interfaces change. That last part is the real argument for not hand-rolling it every time. Hermex is open source under the MIT license, and the code is on GitHub at github.com/pseudo-usama/hermex.

4. Wrapping up

Automating a chat web UI comes down to a handful of problems that each look trivial and aren't: getting text into an input that isn't a text field, attaching files through an element the page hides from you, knowing when the model has finished without any event to tell you, and pulling the answer back out. Wrap those up with a profile that stays logged in, and it collapses to a single line you can call from a script.

The catch is that it's brittle by nature. You're driving an interface built for people, not programs, and a redesign that moves a button or renames a class will quietly break it. That makes it a great fit for hobby projects, scripts, and research, and a poor fit for production, where the official API earns its cost. And since ChatGPT and Gemini each have their own terms of service, where you take this is your call and your responsibility.

The code is on GitHub if it's useful. The documentation is available at hermex.usama.ai.

DEV Community: Usama