After completing the Azure AI Foundry agentic AI challenge, the goal was to experiment with multi-agents. There was a lot to absorb around building and orchestrating agents using Azure and Semantic Kernel, so I decided to experiment with simple model deployments using a chat completion model to gain a deeper understanding, especially of the Azure ecosystem.
I had a bunch of ideas floating around, but after a visit to Nando's I decided on something to improve their ordering system. Currently, customers scan a QR code on their table, get redirected to a website, place their order, and minutes later their food arrives. It's smooth and efficient.
How could Agents make this more flexible?
Hmmm? Something that uses speech?
The upside of using voice as the main interaction is that it feels like talking to a real person. But there's also untapped potential: users could navigate the entire website through speech alone. Scrolling pages, clicking buttons, filling forms, all through voice commands. That's where this could become something much more powerful.
So I set up a quick FastAPI backend to connect to my model deployment on Azure, scaffolded a simple front end with HTML and CSS, and reached into my creative toolbox.
The image below represents the current pipeline. It's functional, but extremely slow in practice, because Azure agent deployments do not yet support voice models directly. That's a limitation worth noting: similar to LangChain and Hugging Face, they require integration with external speech-to-text (STT) and text-to-speech (TTS) tools to build voice-enabled applications.
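Roughly, the backend glue looks like this. This is just a minimal sketch of the setup, not the exact code, and the environment variable names are illustrative:

# Minimal sketch: FastAPI app plus an Azure OpenAI client pointed at the
# chat completion deployment (env var names are illustrative).
import os
from fastapi import FastAPI
from openai import AzureOpenAI

app = FastAPI()
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-15-preview",
)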
Playing with Voice using p5.js
I've had some experience with p5.js from a hackathon. It's super flexible and quick for visual prototyping, so I thought, why not use it to build a voice-driven interface? I also once saw someone build an agent that uses Bland AI to represent them in an interview, and I really liked the effects and UI they used to represent the voice waves.
So I whipped up something simple and similar, a nice responsive round blob that pulses with your voice input using p5.AudioIn() and p5.FFT().
Then I added speech recognition so you could talk to it: as you speak, the waveform animates and words pop up on screen one by one.
Record the speech
// Speech recognition via p5.speech: continuous listening, final results only
speechRec = new p5.SpeechRec('en-GB', gotSpeech);
speechRec.continuous = true;
speechRec.interimResults = false;
speechRec.start(); // start listening; gotSpeech fires with each recognised phrase

// Microphone input feeds the FFT that drives the pulsing blob
fft = new p5.FFT();
mic = new p5.AudioIn();
mic.start();
fft.setInput(mic);
Next was to give instructions to the AI agent. Different prompts bring out different outputs.
# Embed the transcribed speech and pull the most relevant menu chunks from ChromaDB
query_embedding = embedder.encode(speech).tolist()
results = collection.query(query_embeddings=[query_embedding], n_results=5)
top_chunks = results["documents"][0]
rag_context = "\n---\n".join(top_chunks)

# Build the prompt: system instructions + retrieved menu context + the user's speech
messages = [
    {
        "role": "system",
        "content": (
            "You are a helpful Nando's assistant. Help the user place an order by asking for:\n"
            "- Side dish\n- Main dish\n- Drink + size\n- Sauce\n- Table number\n- Payment method\n"
            "Use the following menu context to answer accurately:\n" + rag_context +
            "\nReturn the final order in JSON format when ready."
        )
    },
    {
        "role": "user",
        "content": speech
    }
]
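The messages then go to the deployment through the client set up earlier. Something like this, with a made-up deployment name:

# Send the prompt to the chat completion deployment and pull out the reply text.
# "nandos-assistant" stands in for whatever the deployment is actually called.
response = client.chat.completions.create(
    model="nandos-assistant",
    messages=messages,
)
reply = response.choices[0].message.content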
RAG
ChromaDB was set up locally. This is where the difference between a normal chat completion model and an agent really became clear, and with it came a key learning curve.
Beginner Tip: Chat completion is like sending a single message to GPT and getting a reply. Agent deployments give you threads, memory, and advanced tools like function calling.
The distinction between standard model deployments and agents in Azure is that agents offer thread management, which is helpful for tracking state, managing conversation history, and linking file contexts. However, they can also incur additional costs.
To avoid diving in too deep too soon, the initial setup stuck with chat completions, manually managing context until the overall architecture became clearer.
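In practice, "manually managing context" just means keeping the conversation history in a plain list and resending it on every call. A rough sketch:

# Mimic an agent "thread" by hand: keep the running history in a plain list
# and resend the whole thing with every chat completion call.
history = [{"role": "system", "content": "You are a helpful Nando's assistant."}]

def remember_turn(user_text, assistant_reply):
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_reply})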
Using a Nando’s menu PDF, the content was converted into JSON and flattened for embedding:
Setting up ChromaDB for RAG
# Flatten content into chunks
for item in menu.get("peri_peri_chicken", []):
    chunks.append(f"{item['name']}: {item['description']} - £{item['price']}")
# ... same for burgers, wraps, drinks, etc.

embeddings = embedder.encode(chunks).tolist()
collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[f"item-{i}" for i in range(len(chunks))]
)
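For reference, the snippets above assume a small amount of setup that isn't shown. Roughly like this, where the embedding model and collection name are just illustrative choices:

# Assumed setup for the chunking and query snippets (names are illustrative).
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")            # local embedding model
client = chromadb.PersistentClient(path="./chroma_db")        # local ChromaDB store
collection = client.get_or_create_collection("nandos_menu")
chunks = []                                                   # filled by the flattening loop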
Converting the model reply to audio
After Chroma was set up, the model was much more context aware and didn't need to hallucinate. Once the input was processed and the AI responded with text, the output needed to be converted into spoken feedback to mimic a real conversation.
Several TTS tools were explored. While some failed to deliver usable output, gTTS stood out for its simplicity and effectiveness.
Azure Insight: Azure offers audio transcription through its Speech services—separate from chat models. And while GPT-4o preview can accept audio input, it’s not yet fully supported in agents.
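For reference, transcription with the Azure Speech SDK looks roughly like this (the key, region, and filename are placeholders):

# Rough sketch of Azure speech-to-text with the Speech SDK (placeholders throughout).
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<SPEECH_KEY>", region="<SPEECH_REGION>")
audio_config = speechsdk.audio.AudioConfig(filename="order.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
result = recognizer.recognize_once()
print(result.text)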
import io
import sys

try:
    from gtts import gTTS

    tts = gTTS(text=text, lang='en', slow=False)
    audio_buffer = io.BytesIO()
    tts.write_to_fp(audio_buffer)
    audio_buffer.seek(0)
    # Write to stdout for streaming
    sys.stdout.buffer.write(audio_buffer.read())
except Exception as e:
    print(f"TTS failed: {e}", file=sys.stderr)
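Hooked into the FastAPI app from the setup sketch earlier, the audio can also be streamed straight back to the browser. Something like this, where the /speak route and its query parameter are illustrative:

# Stream gTTS audio back to the browser (assumes the FastAPI `app` from earlier;
# the route name and parameter are illustrative).
import io
from fastapi.responses import StreamingResponse
from gtts import gTTS

@app.get("/speak")
def speak(text: str):
    buffer = io.BytesIO()
    gTTS(text=text, lang="en").write_to_fp(buffer)
    buffer.seek(0)
    return StreamingResponse(buffer, media_type="audio/mpeg")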
Future Ideas
As I've already mentioned, the next feature planned is keyword-based interaction. For example, saying certain phrases to trigger buttons or frontend events. Initially, Hugging Face tools were on the table, but after all this, it's important I don't mix too many frameworks. It's clear that using Semantic Kernel is the smarter choice if I decide to stick with Azure model deployments or agents.
I'm also thinking about downloading some open-source models to run locally instead of relying on endpoints. I think that could improve performance as well.
In terms of architecture, the current thinking is this: it’s often better to commit to a single ecosystem unless there’s a clear reason to mix. Semantic Kernel pairs naturally with Azure endpoints, offering memory, RAG, and function calling. Hugging Face, on the other hand, is powerful when fine-tuning, customizing, or hosting models locally. It’s flexible but demands more manual setup.
Biggest learning point?
This can be considered version 1 of the project. So, note to self:
- For voice agents, use audio model deployments, not Azure agent deployments (which only support chat-type models for now).
- Some voice models like gpt-4o-mini-realtime-preview are available, but I haven’t integrated them yet. That’s probably where the next blog will go.
- Voice agents aren’t fully supported yet, but previews show that more is coming.
This started off as a half-baked idea and it’s grown into a real project with legs. The next phase? More performance tests, more model comparisons, and building out the interaction layer.
Keep track of the project on my GitHub:
https://github.com/ocansey11/azure.git
I recently saw a TikTok that shows what I wanted to do:
https://www.tiktok.com/@kyutai_labs/video/7507687261098741015?_t=ZM-8wkuLndLoUH&_r=1