
Hagicode

Posted on • Originally published at docs.hagicode.com

Typing is Not as Good as Speaking, Speaking is Not as Good as Screenshots—Multimodal Input Practices for AI Code Assistants

Actually, when it comes to writing code, there's an upper limit to how fast you can type. Sometimes something that can be said in a sentence requires banging on the keyboard for ages; sometimes a single image can explain clearly what would take piles of text to describe. This article discusses the experiences we encountered while building HagiCode—whether it's speech recognition or image upload, the goal is simply to make the AI code assistant a bit easier to use, that's all.

Background

While working on HagiCode, we discovered a problem—or rather, a problem that naturally emerged as users spent more time with it: relying solely on typing can be quite exhausting.

Think about it: user interaction with the Agent is a core scenario. But having to sit at the keyboard clacking away every time... well, the efficiency isn't exactly high:

  1. Typing is too slow: Some complex problems—errors, interface issues—can take half a minute to type out, but might be spoken in ten seconds. That time difference is pretty frustrating.

  2. Images are more direct: Sometimes when the interface throws an error, or you want to compare with a design mock, or show code structure... the saying "a picture is worth a thousand words" may be old, but the truth remains. Letting AI "see" the problem directly is much clearer than describing it for ages.

  3. Interaction should be natural: Modern AI assistants should support text, voice, images and other methods, right? Users should be able to use whatever they want—that's what natural means, isn't it?

So we thought: why not add speech recognition and image upload to HagiCode and make working with the Agent more convenient? After all, anything that saves users a few keystrokes is a good thing.

About HagiCode

The solutions shared in this article come from our practice in the HagiCode project—or rather, lessons learned by repeatedly stumbling into pitfalls.

HagiCode is an open-source AI code assistant project with a simple idea: use AI technology to improve development efficiency. As we built it, we discovered that users actually have quite strong demand for multimodal input—sometimes saying a sentence is faster than typing a pile of text, sometimes a single screenshot is clearer than describing for ages.

These demands pushed us forward, and eventually we ended up with features like speech recognition and image upload. Users can interact with AI in the most natural way. It feels pretty good.

Analysis

Technical Challenges of Speech Recognition

When implementing the speech recognition feature, we encountered a tricky problem: the browser's WebSocket API doesn't support custom HTTP headers.

And the speech recognition service we chose is ByteDance's Doubao speech recognition API. This API, of all things, insists on passing authentication information through HTTP headers—things like accessToken, secretKey, and such. Great, now we have a technical contradiction:

// Browser WebSocket API doesn't support the following approach
const ws = new WebSocket('wss://api.com/ws', {
  headers: {
    'Authorization': 'Bearer token'  // Not supported
  }
});

The solutions before us basically came down to two:

  1. URL query parameter solution: Put authentication info in the URL

    • Advantage: simple to implement
    • Disadvantage: credentials exposed on frontend, poor security; and some APIs strictly require header validation
  2. Backend proxy solution: Implement WebSocket proxy on the backend

    • Advantage: credentials stored securely on backend; fully compatible with API requirements
    • Disadvantage: slightly more complex to implement

Ultimately we chose the backend proxy solution. After all, security is a bottom line that cannot be compromised—on this point, no one should try to fool anyone.
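The proxy itself is conceptually simple: the browser connects to our backend without any credentials, and the backend opens the upstream connection with the auth headers attached server-side, then pipes frames in both directions. Here is a dependency-free sketch of those two pieces—note that the header names are illustrative assumptions (check the Doubao docs for the real ones), and the relay is expressed against a minimal WebSocket-like interface rather than any particular library:

```typescript
type DoubaoCredentials = { appId: string; accessToken: string };

// Build the auth headers attached to the upstream connection.
// Header names here are assumptions for illustration, not the exact Doubao names.
function buildAuthHeaders(creds: DoubaoCredentials): Record<string, string> {
  return {
    'X-Api-App-Key': creds.appId,
    'X-Api-Access-Key': creds.accessToken,
  };
}

// Minimal structural interface so the sketch stays library-agnostic.
interface WsLike {
  send(data: string | ArrayBuffer): void;
  onmessage: ((data: string | ArrayBuffer) => void) | null;
}

// Pipe frames both ways: audio up to the API, recognition results back down.
function relay(client: WsLike, upstream: WsLike): void {
  client.onmessage = (data) => upstream.send(data);
  upstream.onmessage = (data) => client.send(data);
}
```

With a real server you would call `buildAuthHeaders` when opening the upstream socket, then hand both sockets to `relay`; the credentials never leave the backend.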

Image Upload Functional Requirements

For the image upload feature, our requirements were actually quite simple:

  1. Multiple upload methods: click to select file, drag-and-drop upload, clipboard paste—gotta have them all, right?
  2. File validation: type restrictions (PNG, JPG, WebP, GIF), size limits (5-10MB)—these are basic operations
  3. User experience: upload progress, preview, error prompts—people need to know what's happening
  4. Security: server-side validation, prevent malicious file uploads—this is a big deal

Solutions

Speech Recognition: WebSocket Proxy Architecture

We designed a three-layer architecture for speech recognition. How should I put it—we basically found a path:

Browser WebSocket
       |
       | ws://backend/api/voice/ws
       | (binary audio)
       v
Backend Proxy
       |
       | wss://openspeech.bytedance.com/ (with auth header)
       v
Doubao API

Core Component Implementation:

  1. Frontend AudioWorklet Processor:
class AudioProcessorWorklet extends AudioWorkletProcessor {
  constructor() {
    super();
    // Buffer for accumulating resampled audio between messages
    this.accumulatedSamples = [];
  }

  process(inputs, outputs, parameters) {
    const input = inputs[0]?.[0];
    if (!input) return true;

    // Resample to 16kHz (Doubao API requirement)
    const samples = this.resampleAudio(input, 48000, 16000);

    // Accumulate samples into 500ms chunks (8000 samples at 16kHz)
    this.accumulatedSamples.push(...samples);

    if (this.accumulatedSamples.length >= 8000) {
      // Convert to 16-bit PCM and transfer the buffer to the main thread
      const pcm = this.floatToPcm16(this.accumulatedSamples);
      this.port.postMessage({ type: 'audioData', data: pcm.buffer }, [pcm.buffer]);
      this.accumulatedSamples = [];
    }
    return true;
  }
}
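The worklet calls two helpers that aren't shown: `resampleAudio` and `floatToPcm16`. Both are easy to get subtly wrong, so here is one plausible implementation of each (naive linear-interpolation resampling; HagiCode's actual code may differ):

```typescript
// Convert float samples in [-1, 1] to 16-bit signed PCM.
function floatToPcm16(samples: number[]): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range floats don't wrap around
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// Naive linear-interpolation resampler from `fromRate` to `toRate`.
function resampleAudio(input: Float32Array, fromRate: number, toRate: number): number[] {
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const out: number[] = [];
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const idx = Math.floor(pos);
    const frac = pos - idx;
    const next = input[Math.min(idx + 1, input.length - 1)];
    // Interpolate between the two nearest source samples
    out.push(input[idx] * (1 - frac) + next * frac);
  }
  return out;
}
```

For 48kHz input the ratio is 3, so a 480-sample render quantum yields 160 samples at 16kHz.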
  1. Backend WebSocket Handler (C#):
[HttpGet("ws")]
public async Task GetWebSocket()
{
    if (HttpContext.WebSockets.IsWebSocketRequest)
    {
        await _webSocketHandler.HandleAsync(HttpContext);
    }
}
  1. Frontend VoiceTextArea Component:
export const VoiceTextArea = forwardRef<HTMLTextAreaElement, VoiceTextAreaProps>(
  ({ value, onChange, onTextRecognized, maxDuration }, ref) => {
    const { isRecording, interimText, volume, duration, startRecording, stopRecording } =
      useVoiceRecording({ onTextRecognized, maxDuration });

    return (
      <div className="flex gap-2">
        {/* Voice button */}
        <button onClick={handleButtonClick}>
          {isRecording ? <VolumeWaveform volume={volume} /> : <Mic />}
        </button>
        {/* Text input */}
        <textarea value={displayValue} onChange={handleChange} />
      </div>
    );
  }
);

Image Upload: Multi-Method Upload Component

We built a fully-featured image upload component that supports all three upload methods. How should I put it—we basically covered all the common user scenarios.

Core Features:

  1. Three Upload Methods:
// Click upload
const handleClick = () => fileInputRef.current?.click();

// Drag-and-drop upload
const handleDrop = (e: React.DragEvent) => {
  const file = e.dataTransfer.files?.[0];
  if (file) uploadFile(file);
};

// Clipboard paste
const handlePaste = (e: ClipboardEvent) => {
  for (const item of Array.from(e.clipboardData?.items || [])) {
    if (item.type.startsWith('image/')) {
      const file = item.getAsFile();
      if (file) uploadFile(file);
    }
  }
};
  2. Frontend Validation:
const validateFile = (file: File): { valid: boolean; error?: string } => {
  if (!acceptedTypes.includes(file.type)) {
    return { valid: false, error: 'Only PNG, JPG, JPEG, WebP, and GIF images are allowed' };
  }
  if (file.size > maxSize) {
    return { valid: false, error: `Maximum file size is ${(maxSize / 1024 / 1024).toFixed(1)}MB` };
  }
  return { valid: true };
};
  3. Backend Upload Handler (TypeScript):
export const Route = createFileRoute('/api/upload')({
  server: {
    handlers: {
      POST: async ({ request }) => {
        const formData = await request.formData();
        const file = formData.get('file') as File;

        // Validate (same rules as the frontend, enforced server-side)
        const validation = validateFile(file);
        if (!validation.valid) {
          return Response.json({ error: validation.error }, { status: 400 });
        }

        // Save under a generated name to avoid collisions and path tricks
        const buffer = Buffer.from(await file.arrayBuffer());
        const extension = extname(file.name);
        const uuid = uuidv4();
        const filePath = join(uploadDir, `${uuid}${extension}`);
        await writeFile(filePath, buffer);

        return Response.json({ url: `/uploaded/${today}/${uuid}${extension}` });
      }
    }
  }
});
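One more note on the "prevent malicious file uploads" requirement: the validation above trusts the reported MIME type and extension, both of which the client controls. Server-side code can additionally sniff the file's magic bytes. A minimal sketch covering the formats listed above (not HagiCode's exact implementation):

```typescript
// Identify an image by its leading magic bytes instead of trusting Content-Type.
function sniffImageType(bytes: Uint8Array): 'png' | 'jpeg' | 'gif' | 'webp' | null {
  const startsWith = (sig: number[], offset = 0) =>
    sig.every((b, i) => bytes[offset + i] === b);

  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return 'png';   // \x89PNG
  if (startsWith([0xff, 0xd8, 0xff])) return 'jpeg';        // JPEG SOI marker
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return 'gif';   // GIF8
  // WebP: "RIFF" header with "WEBP" at offset 8
  if (startsWith([0x52, 0x49, 0x46, 0x46]) && startsWith([0x57, 0x45, 0x42, 0x50], 8)) {
    return 'webp';
  }
  return null;
}
```

In the upload handler you would run this on the first bytes of the buffer and reject the request when the sniffed type doesn't match an allowed format.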

Practice Guide

How to Use Speech Recognition

  1. Configure speech recognition service:

    • Go to speech recognition settings page
    • Configure Doubao Speech's AppId and AccessToken
    • (Optional) Configure hot words to improve recognition accuracy for technical terms
  2. Use in input field:

    • Click the microphone icon on the left side of the input field
    • Start speaking when you see the waveform animation
    • Click the icon again to stop recording
    • Recognition results will be automatically inserted at the cursor position
  3. Hot Word Configuration Example:

TypeScript
React
useState
useEffect

How to Use Image Upload

  1. Upload methods:

    • Click upload button to select file
    • Drag image directly to upload area
    • Use Ctrl+V to paste screenshot from clipboard
  2. Supported formats: PNG, JPG, JPEG, WebP, GIF

  3. Size limit: Default 5MB (configurable)

Notes

  1. Speech recognition:

    • Requires microphone permission
    • Recommended for use in quiet environments
    • Maximum recording duration supported is 300 seconds (configurable)
  2. Image upload:

    • Only supports common image formats
    • Note file size limits
    • Uploaded images automatically generate preview URLs
  3. Security considerations:

    • Speech recognition credentials stored on backend
    • Image upload has strict server-side validation
    • Production environment recommends using HTTPS/WSS
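On the HTTPS/WSS point: a page served over HTTPS cannot open an insecure ws:// connection (browsers block mixed content), so the frontend can derive the WebSocket scheme from the page's own protocol. A small sketch—the path is illustrative, not necessarily HagiCode's actual endpoint:

```typescript
// Pick ws:// or wss:// based on the page protocol, so HTTPS pages get WSS.
function voiceWsUrl(pageProtocol: string, host: string, path = '/api/voice/ws'): string {
  const scheme = pageProtocol === 'https:' ? 'wss' : 'ws';
  return `${scheme}://${host}${path}`;
}
```

In a browser you would call it as `voiceWsUrl(window.location.protocol, window.location.host)`.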

Summary

After adding speech recognition and image upload, HagiCode's user experience has indeed improved significantly. Users can now interact with AI in more natural ways—speaking instead of typing, screenshots instead of describing. How should I put it... it's like finally finding a more comfortable way to communicate.

When building this feature, we encountered the browser WebSocket issue of not supporting custom headers, and ultimately solved it through the backend proxy solution. This solution not only ensures security but also lays the foundation for integrating other WebSocket services that require authentication in the future—I guess that's an unexpected bonus.

The image upload component follows the same spirit—supporting multiple upload methods lets users choose the most convenient one for their scenario. Whether clicking, dragging, or pasting directly, uploads complete quickly. All roads lead to Rome; it's just that some roads are easier to travel, and others are a bit more winding.

"Typing is not as good as speaking, speaking is not as good as screenshots"—this saying fits quite well here. If you're also building similar AI assistant products, I hope these experiences can help you, even if just a little.

Thanks for reading. If this article helped, consider liking, bookmarking, or sharing it.
This article was created with AI assistance and reviewed by the author before publication.
