How a non-coder shipped a camera + voice AI toy — the AI wrote the code, I made the calls

#ai #beginners #showdev #sideprojects

I can't code, but I used AI to build a toy that cosplays a character — go ahead and roast it

Heads up: this is just a toy demo I built for fun. Most of the code was written by AI; my job was deciding what to build, making the trade-off calls, and stepping on the rakes. I'm posting it to ask the community how I could make it better.

Live demo (playable WebGL intro + recorded walkthrough): https://aierkuite-ai-sight.pages.dev

Source on GitHub: https://github.com/aierkuite/aierkuite-AI_Sight

How it started: I wanted a dumb little toy

I can't really code. One day I got an itch to build a small toy — you point it at something and ask a question, and the AI looks at the camera feed and answers you in Japanese, in a character's voice, so it feels like you're actually talking to the character.

Pure self-entertainment, but it got addictive, and that's how this thing came to be: AI Vision Chat Assistant (AI Sight).

What it does

In one line: point your camera and ask; the AI looks at the frame, answers in Japanese, and reads it out loud in a cloned voice.

Breaking the toy down:

Visual Q&A: every question sends the current camera frame along, so the model answers based on what it actually sees, not guesswork.
Voice / text input: push-to-talk live transcription, or just type.
Streaming replies: the model streams token by token — no waiting for the whole block.
Cloned voice + Japanese narration: the reply is synthesized into Japanese speech in a cloned character voice (I'll stay vague on whose — you know the type).
Cinematic intro: a full-screen, page-flipping WebGL opening animation; if your device has no WebGL or you've turned on "reduce motion," it gracefully falls back to a 2D background instead of a blank screen.

One detail I'm weirdly proud of — wait-then-play: I wait until the entire audio clip is synthesized, then reveal the text and start playback at the same time. At first I didn't, and the subtitles would pop up while the voice lagged behind — the character looked possessed. Syncing them feels so much better.

Heads up: the live demo page does not call the backend or a GPU in real time — you can only play the intro animation and watch my recorded walkthrough. To try a full conversation you currently have to run it locally — which is exactly one of the things I want to ask about below.

How a non-coder cobbled this together

Honest truth: most of the code was written by AI. My half of the work was the other half — figuring out what to build, breaking it into small tasks, making the technical trade-offs, and judging whether the AI's solution was actually right or about to step on a rake.

A "me making the call" example is that wait-then-play thing. The AI's first take was "stream the text, play the audio whenever it's ready." Technically fine, but the experience was off. I'm the one who said no, not good enough — that kind of product-feel judgment is the steering wheel I keep my hands on, while the AI turns my calls into code.

What I'm stuck on — two things I'd love advice on

The real reason I'm posting. Two pretty concrete questions:

1. Where could the concept / gameplay go next?

Right now it's pure self-entertainment — point a camera, get Japanese back, have a chuckle. But this "look at the scene + answer in a character voice" combo — beyond "anime companion chat," what real or fun scenarios could it land in? Is it worth taking further, or is this about as far as it should go?

2. How do I lower the barrier for others to try it?

To run the whole thing today you need your own vision-model API plus GPT-SoVITS running on a local GPU — that's a steep wall that scares off passersby. Are there lighter deployment options / alternatives so someone walking by could actually try it cheaply?

Honest sign-off: the code is rough, and there are bits I don't fully understand myself. Roast away — go easy or go brutal, I'll take it.

DEV Community