Introduction
When building apps with AI-assisted coding, you get to decide what to build. The design and the overall direction are yours to set. Occasionally, though, you hit a wall.
There are things I want to do, but I lack the background knowledge to work out the logic behind them.
I am not a programmer, nor am I an expert in signal processing. This is a story about how Claude became a reliable partner when I faced such situations.
The Problem: Speech Gender Identification Isn't Working Well
I wanted to add a feature to LiveTR (a real-time voice translation app) that switches the synthesized voice based on the speaker's gender: a male-sounding voice for male speakers and a female-sounding voice for female speakers.
The first method that comes to mind is identifying gender from the fundamental frequency (pitch). Men's voices tend to be lower, women's higher. Simple.
I tried it. It works okay for casual conversation.
However, when I ran it on an F1 race broadcast, the app identified the commentator as female whenever he got excited. Pitch rises as the race heats up, so a man's voice crossed into the range the app treated as female. This happened frequently.
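To make that failure concrete, here is a minimal sketch of a pitch-only classifier. It is not LiveTR's actual code. The 165 Hz threshold, the function names, and the sine-wave test tones are all assumptions chosen for illustration, and real speech is far messier than a pure tone.

```python
import math

PITCH_THRESHOLD_HZ = 165.0  # assumed cut-off, for illustration only


def estimate_pitch_hz(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate pitch as the lag with the strongest autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(a * b for a, b in zip(samples, samples[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def classify_by_pitch(pitch_hz):
    """Pitch-only rule: anything at or above the threshold is 'female'."""
    return "female" if pitch_hz >= PITCH_THRESHOLD_HZ else "male"


def sine_tone(freq_hz, sample_rate=8000, duration_s=0.1):
    """Synthetic stand-in for one voiced frame (hypothetical test input)."""
    count = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(count)]


# 120 Hz stands in for a calm male voice, 230 Hz for the same voice when excited.
calm = estimate_pitch_hz(sine_tone(120.0), 8000)
excited = estimate_pitch_hz(sine_tone(230.0), 8000)
print(classify_by_pitch(calm), classify_by_pitch(excited))
```

With these inputs the 120 Hz tone lands on the male side of the threshold and the 230 Hz tone on the female side, even though both are meant to represent the same speaker. Moving the threshold only shifts where the mistake happens.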
Pitch alone is not enough. But then, what else should I look at? I didn't know.
I Asked Claude
"When it comes to speech gender identification, using only pitch leads to misidentification during moments of excitement. What are some academic methods for this?"
I had Claude investigate using Research mode (a feature where Claude autonomously searches the web). Its search through academic papers and patents turned up several methods I could never have reached on my own.
It turns out that combining multiple indicators, not just pitch but also formants (resonance frequencies of the vocal tract) and MFCCs (Mel-frequency cepstral coefficients), can give stable identification even when the speaker is excited.
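The idea of combining indicators can be sketched roughly as follows. This is a simplified nearest-profile classifier, not the method from any specific paper and not what LiveTR ships. The reference profiles, the scale values, and the function name are hypothetical placeholders, and MFCCs are left out to keep it short. The point is only that formants depend on the shape of the vocal tract, which does not change when the speaker gets excited, so they can outvote a raised pitch.

```python
import math

# Hypothetical reference profiles: (pitch Hz, first formant Hz, second formant Hz).
# Placeholder numbers for illustration only, not measured values.
PROFILES = {
    "male": (120.0, 500.0, 1500.0),
    "female": (210.0, 600.0, 1800.0),
}

# Hypothetical per-feature scales so that no single feature dominates the distance.
SCALES = (50.0, 50.0, 150.0)


def classify_by_features(features):
    """Return the label whose profile is closest in scaled Euclidean distance."""
    def distance(label):
        return math.sqrt(sum(
            ((value - ref) / scale) ** 2
            for value, ref, scale in zip(features, PROFILES[label], SCALES)
        ))
    return min(PROFILES, key=distance)


# An excited male speaker: pitch has jumped to 230 Hz, formants have barely moved.
excited_male = (230.0, 510.0, 1520.0)
print(classify_by_features(excited_male))
```

A pitch-only rule would call this input female. Here the two formant features pull the result back toward the male profile. In a real system the profiles and weights would have to come from labeled recordings, not hand-picked numbers.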
If you asked me whether I understood the entire content of the papers, honestly, that's doubtful. However, because Claude explained, "This method works on this principle and has these characteristics," I was able to establish a direction. From there, I decided, "I'll go with this combination," and refined the structure.
Straight to Implementation
Once the approach was decided, I worked with Claude to build it.
When I told Claude, "Based on the method in this paper, please implement it with this structure," it wrote the code for me. I ran it, checked the results, and adjusted it if something felt off. It's the usual cycle.
When I tested it with the F1 broadcast, it kept identifying the voice as male even when the commentator was excited. It was far more stable than when I was using only pitch.
I Think This Is the Strength of AI Coding
Having the AI write code has become a matter of course. However, being able to pull knowledge from fields you are unfamiliar with and translate it into implementation is a different kind of value.
I don't have the capacity to search for and read signal processing papers on my own. But if I tell Claude, "I have this problem," it can investigate relevant research and turn it into working code.
Of course, I don't trust what it produces blindly. I run it, test it, and if it's off, I rethink the approach. That part hasn't changed. But it's a huge help that you don't start from zero in terms of knowledge.
One caution: even if you get good results, if you do business using logic built by referencing patents, you might infringe on patent rights. Even with paper-based logic, patents related to those methods can exist. When using it commercially, you need to check the rights. You can ask Claude, "Are there any patents related to this method?" to have it investigate. However, Claude's investigation isn't necessarily perfect, so you must make the final judgment yourself.
I'm not the person writing the code. My job is to design, decide on policies, and make judgments. Claude pulls in knowledge and turns it into working code. I think this is an example where this division of roles fit perfectly.