<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Usama</title>
    <description>The latest articles on DEV Community by Usama (@pseudo_usama).</description>
    <link>https://dev.to/pseudo_usama</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4009514%2F4e8eb8d4-5b07-4033-9fb4-16b3e92e562c.png</url>
      <title>DEV Community: Usama</title>
      <link>https://dev.to/pseudo_usama</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pseudo_usama"/>
    <language>en</language>
    <item>
      <title>How to Automate the ChatGPT &amp; Gemini Web UIs Without an API Key</title>
      <dc:creator>Usama</dc:creator>
      <pubDate>Tue, 30 Jun 2026 12:35:30 +0000</pubDate>
      <link>https://dev.to/pseudo_usama/how-to-automate-the-chatgpt-gemini-web-uis-without-an-api-key-2ek8</link>
      <guid>https://dev.to/pseudo_usama/how-to-automate-the-chatgpt-gemini-web-uis-without-an-api-key-2ek8</guid>
      <description>&lt;p&gt;You've got a folder of a few hundred screenshots and you want the text out of each one. Or you want to generate a batch of images for a side project. Or you just want to drop a single "summarize this" call into a script you're writing on a Sunday afternoon. So you open the pricing page for the official API, do the math on per-token billing plus setting up keys and a payment method, and it's hard to justify, because the exact same model will do the exact same thing for free in a browser tab.&lt;/p&gt;

&lt;p&gt;There are really two ways to get a model like ChatGPT or Gemini to do work for you. The web UI is free, or already covered by a subscription you're paying for anyway, but you drive it by hand. The API is scriptable, but you pay by the token. Most of the time that trade-off is fine. But for a whole category of work like hobby projects, throwaway scripts, research, or anything that doesn't need production-grade reliability, you're stuck picking between "free but manual" and "automated but paid."&lt;/p&gt;

&lt;p&gt;Which raises the obvious question: why not automate the free web UI? It's just a webpage. You open it, type in the box, click send. It turns out that hides a few fiddly problems, which I ran into enough times that I eventually built a small &lt;a href="https://github.com/pseudo-usama/hermex" rel="noopener noreferrer"&gt;library&lt;/a&gt; for them. In this article we'll work through what it takes to automate these UIs, and at the end I'll show how little code it comes down to.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. What it takes to drive a chat UI
&lt;/h2&gt;

&lt;p&gt;A single round trip with ChatGPT or Gemini breaks down into four jobs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get your text into the input box&lt;/li&gt;
&lt;li&gt;Optionally attach a file&lt;/li&gt;
&lt;li&gt;Wait for the model to finish answering&lt;/li&gt;
&lt;li&gt;And read the answer back out.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every one of these is harder than it sounds, because the page is a modern single-page app that was never built to be driven by a script. We'll use Selenium with undetected-chromedriver, and for now assume the browser is already open (we'll get to launching it in the next section). To keep the code readable I'll show whichever of the two platforms makes each problem clearest, and mention the other where it differs.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Typing the message
&lt;/h3&gt;

&lt;p&gt;The first surprise is that the input isn't a normal text field you can drop a string into. On ChatGPT it's a &lt;code&gt;contenteditable&lt;/code&gt; div, and on Gemini it's a custom &lt;code&gt;rich-textarea&lt;/code&gt; element. You can still send keystrokes to it, but two things will trip you up. A plain Enter submits the message, so any newline inside your prompt has to go in as Shift+Enter. And emoji and other characters outside the basic range quietly break send_keys, so those need to be inserted through JavaScript instead.&lt;/p&gt;

&lt;p&gt;That pushes you toward sending the message one character at a time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium.webdriver.common.by&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;By&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium.webdriver.common.keys&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Keys&lt;/span&gt;

&lt;span class="n"&gt;box&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;div[contenteditable=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# A plain Enter would send the message early
&lt;/span&gt;        &lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SHIFT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ENTER&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini works the same way, just against the &lt;code&gt;rich-textarea&lt;/code&gt; element instead of the &lt;code&gt;contenteditable&lt;/code&gt; div.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Uploading a file
&lt;/h3&gt;

&lt;p&gt;This is where it gets interesting. The file &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt; on the page is hidden, and the useful trick is that you don't need to open a file dialog at all: if you can get a reference to a hidden &lt;code&gt;input[type=file]&lt;/code&gt;, you can hand it a path with &lt;code&gt;send_keys&lt;/code&gt; and ChromeDriver does the upload internally, no dialog involved.&lt;/p&gt;

&lt;p&gt;ChatGPT is the easy case. The input already exists in the page, so you unhide it and send the path. Gemini is the awkward one. Clicking its upload button makes the page call the input's own &lt;code&gt;.click()&lt;/code&gt;, which pops the operating system's file picker, a window Selenium has no way to drive. The fix is to stop the page from opening that dialog in the first place, by monkey-patching the browser's &lt;code&gt;click&lt;/code&gt; method so it ignores the call on file inputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_script&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    const orig = HTMLInputElement.prototype.click;
    HTMLInputElement.prototype.click = function () {
        if (this.type === &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) return;   // swallow the call that opens the OS dialog
        return orig.apply(this, arguments);
    };
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that in place you can walk through Gemini's upload menu without a dialog ever appearing, then find the hidden input it creates, unhide it, and feed it the path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;file_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input[name=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Filedata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_script&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arguments[0].style.display = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;block&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;file_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/receipt.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In real code you'd restore the original &lt;code&gt;click&lt;/code&gt; afterward so the patch doesn't leak into the rest of the session, but the four lines above are the whole idea. The recurring lesson with this kind of automation is that the hardest problems are the ones where the page actively fights you.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.3 Waiting for the response
&lt;/h3&gt;

&lt;p&gt;You've sent the message. Now you have to know when the model is done, and there's no event you can listen for and no callback that fires. You poll the page and read its visual cues. The cleanest signal on ChatGPT is the stop button: while a response is being generated there's a stop button on screen, and when generation finishes it disappears.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_generating&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[data-testid=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop-button&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nf"&gt;is_generating&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The principle here is that you're inferring application state from interface elements that were never meant to be read as an API.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.4 Getting the response out
&lt;/h3&gt;

&lt;p&gt;The reply lives in the page as rendered HTML. Pulling the text out is a matter of finding the right container in the last response and reading it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;turn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.agent-turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# the most recent response
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want the raw markdown source instead of the rendered text, there's a copy button you can click and then read off the clipboard. And if the response contains a generated image, getting it out is its own small pipeline: you click the image's download button and then wait for the file to arrive in your download folder, skipping the partial &lt;code&gt;.crdownload&lt;/code&gt; file the browser writes while the download is still in progress.&lt;/p&gt;

&lt;p&gt;That's a full round trip: text in, file attached, wait for the answer, text or image back out. Run it twice, though, and you hit the next problem. The second time your script opens the browser, you're logged out and starting from a blank session, which is where the next piece comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Making it survive across runs
&lt;/h2&gt;

&lt;p&gt;The reason your second run starts logged out is that an automated browser, by default, begins every session from nothing: no cookies, no history, no saved login. So before any of the previous section's code is useful in practice, you need the browser to remember who you are between runs, and you need it to behave enough like a real session that the platform doesn't start throttling you. That comes down to one Chrome setting, a one-time setup step, and typing at a human pace.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 A browser profile that persists
&lt;/h3&gt;

&lt;p&gt;Chrome keeps everything about your identity on a site, including cookies and login sessions, inside a profile directory. If you let Chrome spin up a throwaway profile each run, you lose all of that the moment the script ends. Point it at a directory you control instead, and the login survives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;undetected_chromedriver&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;uc&lt;/span&gt;

&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;uc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ChromeOptions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--user-data-dir=/path/to/your/profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;uc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things are happening here. &lt;code&gt;undetected-chromedriver&lt;/code&gt; is a drop-in replacement for Selenium's Chrome that smooths over the most obvious tells of an automated browser. And the &lt;code&gt;--user-data-dir&lt;/code&gt; flag is the part that gives you persistence: it tells Chrome to store its profile in a folder of your choosing, so the session you logged into yesterday is still there today. A profile with real history also looks like a returning user rather than a brand-new automated one, which keeps the session healthier over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Logging in, once
&lt;/h3&gt;

&lt;p&gt;A profile directory is only useful once there's a logged-in session inside it, so there's a one-time setup step. You open the browser pointed at your profile, log in by hand, then close it. Every automated run after that reuses the saved session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;uc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://gemini.google.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Log into the browser window, then press Enter here to finish setup.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;quit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Logging in is also where a paid plan pays off. If you already subscribe to ChatGPT Plus or a paid Gemini tier, signing in during setup means every automated run uses that subscription, with its higher message limits and access to the better models, instead of being capped at the free tier. You do this once per machine and forget about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 Typing at a human pace
&lt;/h3&gt;

&lt;p&gt;A script that drops an entire prompt into the box in a single instant doesn't behave like a person at a keyboard, and sessions that look automated are the ones that get rate-limited or challenged. The fix is cheap. We're already sending the message one character at a time, so all it takes is a small, slightly random delay between keystrokes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# a human pace, not an instant dump
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The randomness matters more than the exact timing, since a perfectly even rhythm is itself a tell.&lt;/p&gt;

&lt;p&gt;With that, the machine is complete. The browser stays logged in across runs, and the input behaves enough like a real person to keep the session stable. You've now seen everything that goes into automating these interfaces, which means it's a good moment to step back and see how much of it you have to write yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. All of that, in a few lines
&lt;/h2&gt;

&lt;p&gt;Every problem in the last two sections is the kind you want to solve once and then never think about again. That's what pushed me to wrap the whole thing up into a library. It's called Hermex, and you install it with &lt;code&gt;pip install hermex&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The one-time login from the previous section becomes a single call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hermex&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatGPT&lt;/span&gt;

&lt;span class="n"&gt;ChatGPT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# opens a browser once: log in, then close the window
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, the entire round trip from earlier, launching the browser, typing, uploading, waiting for the response, and reading it back, is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatGPT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;simple_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What does this receipt say?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attachments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;receipt.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a back-and-forth conversation, keep the browser open and call &lt;code&gt;query&lt;/code&gt; as many times as you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hermex&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Gemini&lt;/span&gt;

&lt;span class="n"&gt;gemini&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Gemini&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open_url&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the history of the internet.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Now just the key dates.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a generated image comes back as a path to the downloaded file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gemini&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate an image of a mountain at sunset.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, that's everything from the previous sections: the character-by-character typing with its newline and emoji handling, the hidden-input upload with Gemini's dialog suppression, the polling that waits for generation to finish, the text and image extraction, and the persistent profile that keeps you logged in. None of it is conceptually hard, but it's a lot of fiddly surface area to get right and, harder still, to keep working as the interfaces change. That last part is the real argument for not hand-rolling it every time. Hermex is open source under the MIT license, and the code is on GitHub at &lt;a href="https://github.com/pseudo-usama/hermex" rel="noopener noreferrer"&gt;github.com/pseudo-usama/hermex&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Wrapping up
&lt;/h2&gt;

&lt;p&gt;Automating a chat web UI comes down to a handful of problems that each look trivial and aren't: getting text into an input that isn't a text field, attaching files through an element the page hides from you, knowing when the model has finished without any event to tell you, and pulling the answer back out. Wrap those up with a profile that stays logged in, and it collapses to a single line you can call from a script.&lt;/p&gt;

&lt;p&gt;The catch is that it's brittle by nature. You're driving an interface built for people, not programs, and a redesign that moves a button or renames a class will quietly break it. That makes it a great fit for hobby projects, scripts, and research, and a poor fit for production, where the official API earns its cost. And since ChatGPT and Gemini each have their own terms of service, where you take this is your call and your responsibility.&lt;/p&gt;

&lt;p&gt;The code is on &lt;a href="https://github.com/pseudo-usama/hermex" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; if it's useful. The documentation is available at &lt;a href="https://hermex.usama.ai/" rel="noopener noreferrer"&gt;hermex.usama.ai&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>chatgpt</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
