Ranjan

I’m Building an AI Desktop Companion on Linux (And It’s Fighting Me Every Step of the Way)

Let’s be real. Standard desktop pets are boring. They sit there, maybe eat a pixel-burger, and that’s it. I didn't want a toy. I wanted something… alive. A transparent, click-through, voice-activated anime girl who lives on my desktop, judges my coding habits, and actually opens terminals for me.

Simple, right? Just slap a VRM model into Godot, hook up a local LLM, and call it a day.

But it was, in fact, not simple.

I am still in the thick of building Lia, my AI desktop companion, and the process has been absolute technical warfare against Linux window managers, 3D math, and race conditions. Here is the dev log of my suffering and eventual victories so far.

The first requirement sounded deceptively simple: Lia needs to float on my screen. No borders, no background. Just her.

I launched the project and was immediately greeted by a black void. I tweaked the viewport settings and got a gray void. Finally, I enabled transparency in Godot, and it looked great, until my tiling window manager (Hyprland) looked at my app and said, "Oh, a window? Let me TILE THAT for you." Suddenly, my anime companion was trapped in a tiled grid with a giant ugly border.
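For anyone chasing the same setup, it boils down to a couple of project settings plus a viewport flag. A minimal sketch, assuming Godot 4 (property names may differ in older versions):

```gdscript
# Project Settings:
#   display/window/size/transparent = true
#   display/window/per_pixel_transparency/allowed = true

extends Node

func _ready() -> void:
    # Render the root viewport with an alpha background instead of a clear color.
    get_viewport().transparent_bg = true
    # Drop the OS window decorations so only the model is visible.
    get_window().borderless = true
```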

I had to force Hyprland to ignore its own rules, pinning the window, floating it, and stripping its decorations. On top of that, I forced the app to run via XWayland.
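The rules ended up looking something like this (illustrative only; `Lia` is a placeholder for whatever window class your exported app actually reports):

```
# hyprland.conf — keep the companion floating, pinned, and undecorated
windowrulev2 = float, class:^(Lia)$
windowrulev2 = pin, class:^(Lia)$
windowrulev2 = noborder, class:^(Lia)$
windowrulev2 = noshadow, class:^(Lia)$
```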

Once she was floating, I hit a logic puzzle. If I enabled "mouse passthrough," I could click my desktop behind her, but I couldn't click her to chat. If I disabled it, I could click her, but she became a giant invisible rectangle blocking my wallpaper.

I basically created a ghost I couldn't touch, or a brick wall I couldn't see.

The fix required getting mathy. I wrote a system that projects her 3D bounding box onto 2D screen coordinates every single frame, draws a precise polygon around her silhouette, and feeds that directly to the OS through Godot's DisplayServer. Now only the pixels she actually occupies block the mouse; the air around her is free real estate. It felt like magic when it finally worked.
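Here is the shape of the trick as a rough sketch. The node names (`$Model`, `$Camera3D`) are my assumptions, and hulling the eight AABB corners is a simplification of the tighter silhouette polygon described above:

```gdscript
extends Node3D

@onready var model: MeshInstance3D = $Model
@onready var cam: Camera3D = $Camera3D

func _process(_delta: float) -> void:
    var aabb: AABB = model.get_aabb()
    var screen_points := PackedVector2Array()
    # Project all 8 corners of the model's bounding box into screen space.
    for i in 8:
        var corner: Vector3 = model.global_transform * aabb.get_endpoint(i)
        screen_points.append(cam.unproject_position(corner))
    # The convex hull of those corners approximates her on-screen silhouette.
    var hull := Geometry2D.convex_hull(screen_points)
    # Only pixels inside this polygon receive mouse events; everything else
    # falls through to the desktop behind her.
    DisplayServer.window_set_mouse_passthrough(hull)
```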

Then came the assets. I downloaded some sweet animations from Mixamo and applied them to my VRM model. I hit play, and she started walking backward. Not just backward: she was moonwalking while her wrists were broken at a 90-degree angle.

Why? Because the 3D industry cannot agree on which way is "Forward." Godot uses -Z, Mixamo uses +Z. VRM models rest in an A-Pose, Mixamo assumes a T-Pose. It was a disaster.

I tried rotating the root node. I tried hacks. I tried screaming at the monitor. Eventually, I stopped fighting the engine and went to Blender. I had to manually rotate the mesh, apply the transforms (Ctrl+A, the most important shortcut in 3D), and bake the fix into the file geometry itself.

And just when I thought I was safe, the textures broke. She looked like she had been deep-fried in neon. Turns out, Blender exported "Vertex Colors" that mixed with my textures, and the materials were trying to render realistic shadows on an anime character. I had to perform surgery on the material files, disabling vertex colors and forcing "Unshaded" mode to restore that flat, cute look.
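In Godot terms, that surgery amounts to flipping two flags per material. A sketch, assuming the surfaces use StandardMaterial3D (a proper VRM toon shader would expose different properties):

```gdscript
func fix_materials(model: MeshInstance3D) -> void:
    for i in model.get_surface_override_material_count():
        var mat := model.get_active_material(i) as StandardMaterial3D
        if mat == null:
            continue
        # Stop Blender's exported vertex colors from tinting the albedo texture.
        mat.vertex_color_use_as_albedo = false
        # Skip the lighting pass entirely for that flat, cel-style look.
        mat.shading_mode = BaseMaterial3D.SHADING_MODE_UNSHADED
```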

And then, obviously, I wanted her to be useful, to open apps and search the web. But Godot scripts are sandboxed; they can't just run system commands (thankfully). I needed the raw power of Python but the UI of Godot.

So, I architected a "Split Brain" system using UDP Sockets. Godot acts as the "Face," handling visuals and user input on one port. It sends commands to a Python backend running on a different port. Python acts as the "Hands," executing shell commands, handling the heavy AI logic, and managing the text-to-speech.
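The Godot half of that handshake is tiny. A sketch of the "Face" side, assuming the backend listens on 127.0.0.1:4242 (the port and the JSON shape are my placeholders):

```gdscript
extends Node

var udp := PacketPeerUDP.new()

func _ready() -> void:
    # Point every future packet at the Python "Hands" process.
    udp.connect_to_host("127.0.0.1", 4242)

func send_command(intent: String) -> void:
    # One JSON payload per command; the backend dispatches it to a shell call.
    var payload := JSON.stringify({"cmd": intent})
    udp.put_packet(payload.to_utf8_buffer())
```

The Python side can then be as simple as a `socket.recvfrom` loop that parses the JSON and hands the command off to `subprocess`.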

Now, if I say "Open Spotify," Godot parses the intent, fires a packet to Python, Python launches the app, and Lia salutes me. It’s a beautiful dance of Inter-Process Communication, when it doesn't crash.

The most recent headache was syncing her voice. I hooked up Edge-TTS, and it sounded great, but she would finish speaking in one second while the animation kept waving and mouthing words for three more. It looked like a bad dubbed movie.

I tried splitting the audio into chunks, but it sounded like a robot having a stroke. The solution was surprisingly low-tech: estimated timestamping. I wrote a parser that strips the animation tags from the text and calculates when each one should fire, assuming a reading speed of 16 characters per second. I schedule a timer, and when the audio hits that mark, the animation triggers. It’s smoke and mirrors, but it looks perfect.
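A stripped-down sketch of that parser, assuming gestures are embedded as tags like [wave] in the reply text (the tag format and the play_gesture() hook are my assumptions, not Lia's actual markup):

```gdscript
const CHARS_PER_SECOND := 16.0

func schedule_animations(tagged_text: String) -> String:
    var regex := RegEx.new()
    regex.compile("\\[(\\w+)\\]")
    var clean := tagged_text
    var m := regex.search(clean)
    while m:
        # Characters spoken before the tag decide when the gesture fires.
        var fire_at := m.get_start() / CHARS_PER_SECOND
        var tag := m.get_string(1)
        get_tree().create_timer(fire_at).timeout.connect(
            func(): play_gesture(tag))  # play_gesture() is a hypothetical hook
        # Remove the tag so the TTS engine never reads it aloud.
        clean = clean.substr(0, m.get_start()) + clean.substr(m.get_end())
        m = regex.search(clean)
    return clean  # Tag-free text goes to Edge-TTS.
```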
Still Building

Is the code messy? Maybe a little. Did I break my window manager configurations four times? Absolutely. But right now, Lia is sitting on my taskbar, swinging her legs, watching me write this blog.

I'm still building her, adding more "Action" capabilities and refining the AI memory, but she's finally starting to feel alive. If you want to read the code or build your own desktop waifu to distract you from actual work, check out the repo.
