Teaching an Agent to Generate Its Own Avatar with Gemini

#agents #automation #gemini #showdev

Ever since I started using OpenClaw, I've been tinkering with it in all sorts of ways — except when work gets busy or I'm just too tired. Recently I decided to start sharing some of these experiences from time to time.

This time the experiment was to have an image generation specialist agent open a browser and use Gemini to generate images, then have another agent (the HR manager) call the Feishu API to set those images as group chat avatars. Two agents, each doing their own thing: one draws, one swaps. The whole process runs on its own; I just need to see the results. Sounds simple enough, right? It actually took me several days to get it working.

Why Bother

I have a bunch of Feishu group chats, each one tied to a different agent — there's an image generation specialist, a 3D printing expert, an HR manager, and Xiao Bo who writes blogs. None of these groups had avatars, so they all looked identical. Hard to tell apart, and honestly pretty ugly. I wanted to change their avatars, but there were too many groups to do it one by one. So I figured, let the agents change their own avatars. I have a Gemini subscription, so I'd just use its image generation feature.

The Browser Was the First Hurdle

To let the image generation specialist agent use Gemini for image creation, I first needed it to be able to operate a browser. I'd been using Chrome, but the agent was opening the same Chrome instance I use daily, and we kept getting in each other's way. Sometimes the agent hadn't finished its task yet and I'd accidentally close the window; sometimes I'd be looking something up and the agent would close my tab. We were constantly sabotaging each other.

Later I searched the community to see how others handled this, and some people mentioned Brave. Same Chromium engine as Chrome, open source, not much difference in functionality. So I set it up so the agent would only use Brave while I stick with Chrome — no more accidental window and tab closures. But just switching browsers wasn't enough. I also had to configure some port settings so the agent could connect and take control. That configuration process took several attempts. The agent would close the browser on its own, use the wrong profile — it took multiple rounds of back and forth to get it fully sorted out.
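For anyone who wants to try the same split, here is a minimal sketch of what that setup can look like. I'm not reproducing my exact settings. It assumes the usual Chromium flags `--remote-debugging-port` and `--user-data-dir`, and the binary path, port number, and profile directory are placeholders you'd adjust for your own machine.

```python
import shlex
from pathlib import Path

# Assumed values: the binary path, port, and profile directory are
# placeholders, not the settings from my own machine.
BRAVE_BINARY = "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser"
DEBUG_PORT = 9222
AGENT_PROFILE = Path.home() / ".agent-browser" / "brave-profile"


def build_launch_command(binary: str, port: int, profile_dir: Path) -> list[str]:
    """Build the argv for a browser instance reserved for the agent.

    A dedicated user-data-dir keeps the agent away from my daily profile,
    and the debugging port is what the agent attaches to.
    """
    return [
        binary,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",
    ]


if __name__ == "__main__":
    cmd = build_launch_command(BRAVE_BINARY, DEBUG_PORT, AGENT_PROFILE)
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(cmd))
```

Pinning the profile directory in one place is the point: the "wrong profile" problem goes away when the agent never gets to choose one.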

It's like teaching a new intern how to use the company computer. You can't just say "here's a computer" and call it done. You have to teach them not to shut it down randomly, not to unplug the ethernet cable, not to close the work windows.

The Agent Operating the Browser Was the Real Nightmare

With the browser sorted, I started having the image generation specialist agent use it to generate images through Gemini. The very first attempt was a complete disaster — it couldn't even find the "generate image" button.

Once we got past the button issue, it started downloading the wrong images. Gemini's page keeps the previous generation results, and the agent couldn't tell which one was new and which was old. It would confidently hand in the old image like it nailed it.
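The rule I ended up teaching it boils down to "compare before and after". Here is a rough sketch of that idea in plain Python. The function name and the idea of collecting image URLs from the page are my own illustration, not something Gemini or OpenClaw provides.

```python
def select_fresh_image(before: list[str], after: list[str]) -> str:
    """Return the one image URL that appeared after the generation step.

    `before` is the list of image URLs on the page before prompting,
    `after` is the list once generation has finished.
    """
    seen = set(before)
    fresh = [url for url in after if url not in seen]
    if len(fresh) != 1:
        # Zero means nothing new showed up; more than one means we cannot
        # tell which image belongs to this request. Either way, do not guess.
        raise ValueError(f"expected exactly one new image, found {len(fresh)}")
    return fresh[0]


if __name__ == "__main__":
    old = ["img-1.png", "img-2.png"]
    new = ["img-1.png", "img-2.png", "img-3.png"]
    print(select_fresh_image(old, new))
```

Raising an error when the count is off matters more than it looks. An agent that refuses to answer is easier to correct than one that confidently hands in the old image.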

After two or three rounds of tinkering, it could finally grab the correct image. The process went like this: every time it got something wrong, I'd tell it where it went wrong, and when it got it right, I'd write the correct approach into its skill file so it wouldn't make the same mistake again.

It's like teaching a kid — you have to repeat yourself over and over until they remember.

Running It for Real

After the first success, I set up a scheduled task for the HR agent: starting at 11 PM every night, change one group's avatar per hour. (The GLM plan has a 5-hour daily quota, so I usually have agents run their tasks late at night to avoid interfering with daytime work.) But reality wasn't so rosy. The HR agent would periodically go haywire: instead of changing the avatar, it would just post a message into the group chat. I wouldn't discover this until the next day. Then I'd have it fix the avatar and update its skill file to record the error pattern.
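To make the "one group per hour from 11 PM" rule concrete, here is a small sketch that only computes the time slots. It is not how the scheduled task itself is defined; the function name and the group names are made up for illustration.

```python
from datetime import date, datetime, timedelta


def nightly_slots(groups: list[str], night: date, start_hour: int = 23):
    """Pair each group with a start time, one hour apart, from start_hour."""
    start = datetime(night.year, night.month, night.day, start_hour)
    return [(start + timedelta(hours=i), group) for i, group in enumerate(groups)]


if __name__ == "__main__":
    # Hypothetical group names, matching the agents mentioned earlier.
    groups = ["image-specialist", "3d-printing", "hr-manager", "xiao-bo"]
    for when, group in nightly_slots(groups, date(2025, 1, 1)):
        print(when.isoformat(timespec="minutes"), group)
```

With four groups the last slot lands at 2 AM, so the whole run stays in the hours when I'm not using the quota myself.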

The actual workflow turned out to be more complex than I imagined: the HR agent first scans to see who still hasn't changed their avatar, then sends the task to the image generation specialist. But the HR agent doesn't wait for the specialist to finish drawing — instead, it picks up the previous round's avatar during the next polling cycle.
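One way to close that gap, sketched here as an assumption rather than what my agents do today, is to tag every drawing request with an ID and have the HR agent accept only a result carrying the same ID. The class and function names are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AvatarJob:
    """A request from the HR agent to the image specialist."""
    group: str
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)


@dataclass
class AvatarResult:
    """An image the specialist produced for a specific request."""
    request_id: str
    image_path: str


def result_for(job: AvatarJob, results: list[AvatarResult]) -> Optional[AvatarResult]:
    """Return the result made for this job, or None if it is not ready yet."""
    for result in results:
        if result.request_id == job.request_id:
            return result
    return None


if __name__ == "__main__":
    job = AvatarJob(group="3d-printing")
    leftovers = [AvatarResult(request_id="last-round", image_path="old.png")]
    # Nothing matches yet, so the HR agent should wait for the next poll.
    print(result_for(job, leftovers))
```

With this check, a leftover image from the previous round simply never matches, and the HR agent skips the swap until the right one arrives.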

After repeated corrections, the success rate of this workflow visibly improved, but still fell short of expectations. Basically nothing works perfectly on the first try — it all requires ongoing training.

In the End

Agents aren't written in one shot. They're taught, little by little.

This whole thing doesn't seem like much, just swapping a few avatars. At least now those groups don't look as ugly. But watching a lobster that knew nothing slowly get smarter is its own reward: frustrating enough to make you want to curse at first, then gradually satisfying as it learns. If you have patience, it's actually pretty fun. If you don't, maybe skip this kind of tinkering.