DEV Community

innermost47

Drawing Sound: A New Interface Between Musicians and AI

How visual gestures can become a creative language for generative music

When I started working on OBSIDIAN Neural seven months ago, I had a simple goal: bring AI music generation into live performance without replacing the musician's creative control. The plugin lets you generate audio loops on-the-fly with text prompts, control them with MIDI, and layer them with your instruments.

But something was missing.

The Problem With Prompts

Text prompts are powerful, but they have limitations:

  • They interrupt creative flow (you have to stop and type)
  • They require linguistic precision (finding the right words)
  • They're inherently descriptive rather than expressive

When you're in the middle of a jam session or live performance, typing "ambient pad with slow attack and dark timbre" breaks your momentum. You're thinking in sound, but you're forced to translate that into words.

There had to be a more intuitive way.

Enter: Draw-to-Audio

The concept is deceptively simple: What if you could just draw what you want to hear?

I added a canvas to OBSIDIAN Neural where musicians can sketch visual gestures. A Vision Language Model (VLM) interprets the drawing and translates it into audio generation prompts.

Here's how it works:

  1. You draw on the canvas (lines, shapes, textures, anything)
  2. The VLM analyzes your drawing
  3. It generates a detailed prompt describing the sonic equivalent
  4. The audio generation model creates a sample (~10-20 seconds)
  5. You can loop it, layer it, and perform with it
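The five steps above can be sketched as a single pipeline function. This is a minimal illustrative sketch, not the plugin's actual code: the real implementation is a JUCE/C++ VST calling Gemini 2.5 Flash and Stable Audio Open, and every function name here is a hypothetical stand-in.

```python
# Hypothetical sketch of the Draw-to-Audio loop; the stub bodies stand in
# for real VLM and audio-model calls.

def vlm_describe(drawing: bytes) -> str:
    # Stand-in: the real call sends the canvas image to the vision model.
    return "gentle ambient pad with slow evolution, soft attack"

def generate_audio(prompt: str, seconds: int = 15) -> dict:
    # Stand-in for the audio model; returns a fake "sample" record
    # in the ~10-20 second range mentioned above.
    return {"prompt": prompt, "length_s": seconds}

def draw_to_audio(canvas_pixels: bytes) -> dict:
    prompt = vlm_describe(canvas_pixels)   # steps 2-3: drawing -> sonic prompt
    sample = generate_audio(prompt)        # step 4: short sample
    sample["looping"] = True               # step 5: loop it, layer it, perform
    return sample

sample = draw_to_audio(b"\x00" * 16)
```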

Credits Where Credits Are Due

The Draw-to-Audio concept came from A.D., who suggested: "What if people could just draw what they want to hear?"

Sometimes the best ideas come from people outside the technical bubble—people who think in creative possibilities rather than implementation constraints. This feature exists because A.D. saw a more intuitive way to interact with AI.

Why This Matters

Visual thinking is natural for musicians

When you think about sound, you often think visually:

  • "I want something that rises"
  • "This needs texture"
  • "Make it more chaotic"

These are spatial, visual concepts. Drawing lets you express them directly.

It's faster in the flow

Draw a quick gesture → get sound. No context switching, no vocabulary hunting. Your hand moves, music happens.

It's expressive, not descriptive

Words describe. Drawings express. There's a difference between typing "aggressive rhythm" and scribbling violent, jagged lines across a canvas.

The VLM picks up on that energy and translates it.

Real Examples

Smooth, flowing curves:
→ VLM interprets as "gentle ambient pad with slow evolution, soft attack, ethereal texture"

Chaotic scribbles:
→ "Aggressive distorted rhythm with sharp transients, dissonant harmonics, high energy"

Geometric patterns:
→ "Structured sequence with clear rhythm, precise timing, minimal texture"

Dense clusters:
→ "Complex layered texture with multiple voices, rich harmonic content"

Technical Implementation

For those curious about the stack:

  • Frontend: Canvas in the VST interface (JUCE framework)
  • Vision Model: Gemini 2.5 Flash
  • Audio Generation: Stable Audio Open
  • Workflow: Drawing → Base64 → VLM → Enhanced prompt → Audio model
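The Drawing → Base64 step of that workflow can be sketched in a few lines. This is a hedged illustration of the general pattern, not OBSIDIAN Neural's code: the payload shape is hypothetical, and the actual Gemini API schema differs.

```python
import base64
import json

def encode_drawing(png_bytes: bytes) -> str:
    # The canvas snapshot is serialized as base64 text before being
    # embedded in the request to the vision model.
    return base64.b64encode(png_bytes).decode("ascii")

def build_vlm_request(png_bytes: bytes) -> str:
    # Hypothetical payload shape for the VLM call; field names are
    # illustrative only.
    payload = {
        "image_b64": encode_drawing(png_bytes),
        "instruction": "Interpret this drawing as a sonic gesture and "
                       "write a detailed audio generation prompt.",
    }
    return json.dumps(payload)
```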

The VLM is instructed to interpret drawings as sonic gestures, considering:

  • Line density and chaos
  • Shapes and geometry
  • Distribution across space
  • Visual energy and intensity
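An instruction covering those four visual cues might look like the template below. This is a hypothetical example of such a system prompt, not the plugin's actual wording.

```python
# Hypothetical VLM instruction template covering the four visual cues
# listed above; the real plugin's prompt may be worded differently.
VLM_INSTRUCTION = """\
You are translating a drawing into an audio generation prompt.
Consider:
- line density and chaos (sparse vs. dense strokes)
- shapes and geometry (smooth curves vs. jagged edges)
- distribution across space (clustered vs. spread out)
- visual energy and intensity (calm vs. violent gestures)
Respond with one concise prompt describing the sonic equivalent."""

def build_instruction(style_hint: str = "") -> str:
    # Optionally append a performer-supplied hint (e.g. "ambient").
    return VLM_INSTRUCTION + (f"\nStyle hint: {style_hint}" if style_hint else "")
```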

This Isn't Magic

Let me be clear: this doesn't "read your mind" or perfectly translate every drawing into exactly what you imagined. It's not telepathic.

What it does is give you another creative input method—one that's often more intuitive than typing, especially in performance contexts.

Sometimes the VLM surprises you with its interpretation. Sometimes that surprise is exactly what you needed.

Beyond the Gimmick

I know what some of you are thinking: "This sounds like a gimmick."

Fair. But here's why I think it's more than that:

1. It challenges how we interface with AI

Most generative AI tools assume text is the primary interface. But musicians think in sound, gesture, and space. Why limit ourselves to words?

2. It opens new workflows

Imagine:

  • Sketching a performance score as visual shapes
  • Drawing evolving textures during live improvisation
  • Creating "sonic sketches" before refining with traditional tools

3. It democratizes AI music tools

Not everyone is comfortable writing detailed prompts. But everyone can draw a line, a circle, a gesture. This lowers the barrier to experimentation.

What's Next

The Draw-to-Audio feature is still evolving. I'm exploring:

  • Temporal drawing: Draw left-to-right to control how sound evolves over time
  • Color mapping: Different colors → different timbres or frequency ranges
  • Multi-layer drawing: Separate layers for different sonic elements
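The color-mapping idea could be as simple as a lookup table from stroke color to a timbre fragment that gets appended to the prompt. This is purely speculative: nothing like this exists in the plugin yet, and the mappings below are invented for illustration.

```python
# Illustrative color -> timbre table for the proposed color-mapping
# feature; all mappings are hypothetical.
COLOR_TO_TIMBRE = {
    "red": "aggressive, distorted, high energy",
    "blue": "soft, airy, sustained pad",
    "green": "organic, textured, mid-range",
    "yellow": "bright, bell-like, high frequencies",
}

def timbre_for(color: str) -> str:
    # Unknown colors fall back to a neutral description.
    return COLOR_TO_TIMBRE.get(color, "neutral timbre")
```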

I'm also curious about integrating this with the Game of Life oscillator concept A.D. and I have been developing—where cellular automata patterns both generate and visualize music simultaneously.

Try It Yourself

OBSIDIAN Neural is open source, so you can try it yourself.

I'd love to hear what you create, what works, what doesn't, and how you think visual interfaces could evolve in generative music tools.

Final Thoughts

AI in music isn't about replacing musicians. It's about giving them new instruments.

Text prompts are one instrument. Draw-to-Audio is another. MIDI control is another. They all serve different creative needs.

The goal isn't to find the "best" interface—it's to offer multiple ways to interact with generative systems, letting musicians choose what fits their workflow and creative moment.

Sometimes you want precision. Sometimes you want spontaneity. Sometimes you want to just... draw.


Anthony is a French developer and musician who created OBSIDIAN Neural, an open-source VST3 plugin for AI music generation. The project was presented at AES AIMLA 2025 in London. Special thanks to A.D. for the creative vision behind Draw-to-Audio.
