Hassan Yosuf

Adding a LLaMA 3.3 AI Assistant to My Spring Boot WebSocket App

Real-time messaging apps are great engineering exercises, but adding a conversational AI that seamlessly interacts within the same chat room takes the complexity—and the fun—to the next level.

Recently, I integrated the LLaMA 3.3 model into my messaging backend, ChatUp. Here is a breakdown of what the application is and how I architected the AI integration using Spring Boot, WebSockets, and the Groq API.

🔗 Live Demo: Try ChatUp here

💻 GitHub Repo: HassanYosuf/ChatUp


What is ChatUp?

Before adding AI, I built ChatUp as a robust, real-time messaging backend. The goal was to engineer a system capable of low-latency, bi-directional communication and real-time state synchronization.

Under the hood, it is powered by an event-driven architecture using Spring Boot and WebSocket/STOMP. The frontend is highly responsive, built entirely with HTML5, CSS3, and ES6+ JavaScript, utilizing SockJS for reliable WebSocket fallback and cross-browser compatibility. To ensure the application is easily deployable and scalable, the entire system is containerized with Docker and hosted on the Render cloud.
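For reference, the STOMP-over-WebSocket side of a stack like this is typically wired up with a small configuration class. The sketch below uses the common conventions (a /ws handshake endpoint, /topic for broadcasts, /app for inbound messages); these names are illustrative, not necessarily ChatUp's exact values:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.simp.config.MessageBrokerRegistry;
import org.springframework.web.socket.config.annotation.EnableWebSocketMessageBroker;
import org.springframework.web.socket.config.annotation.StompEndpointRegistry;
import org.springframework.web.socket.config.annotation.WebSocketMessageBrokerConfigurer;

@Configuration
@EnableWebSocketMessageBroker
public class WebSocketConfig implements WebSocketMessageBrokerConfigurer {

    @Override
    public void registerStompEndpoints(StompEndpointRegistry registry) {
        // HTTP handshake endpoint; withSockJS() enables the fallback
        // transport for browsers and proxies that block raw WebSockets.
        registry.addEndpoint("/ws").withSockJS();
    }

    @Override
    public void configureMessageBroker(MessageBrokerRegistry registry) {
        // Clients subscribe to /topic/** destinations; messages sent to
        // /app/** are routed to @MessageMapping handler methods.
        registry.enableSimpleBroker("/topic");
        registry.setApplicationDestinationPrefixes("/app");
    }
}
```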

The Core Architecture of the AI Integration

To keep the application highly maintainable, ChatUp utilizes a modular MVC architecture. When it came time to introduce AI capabilities, my primary architectural constraint was strict: the AI integration could not disrupt or add latency to the core human-to-human message flow.

How It Works: Intercepting the Stream

Instead of building a separate "bot chat" interface, I wanted the assistant to live directly inside the existing multi-user chat rooms.

Here is the event flow (a minimal controller sketch follows the list):

  1. The Trigger: The application monitors the WebSocket stream and intercepts any messages prefixed with @ai.
  2. The Routing: Once detected, the message is decoupled from the standard chat flow and routed to a dedicated AI service layer.
  3. The Brains: This service layer processes the message and handles the external call to the Groq API (running the LLaMA 3.3 model).
  4. The Broadcast: Once the LLM generates a response, the Spring Boot service broadcasts the AI's reply in real time back to all subscribers in that specific chat room.
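A minimal sketch of that flow is below. The destination names, the ChatMessage DTO (sketched under Frontend Handling), and the AiService wrapper are my illustrative assumptions rather than ChatUp's exact code; the important part is broadcasting immediately and handing the LLM round-trip off asynchronously so it never blocks human traffic:

```java
import java.util.concurrent.CompletableFuture;
import org.springframework.messaging.handler.annotation.DestinationVariable;
import org.springframework.messaging.handler.annotation.MessageMapping;
import org.springframework.messaging.handler.annotation.Payload;
import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.stereotype.Controller;

@Controller
public class ChatController {

    private final SimpMessagingTemplate broker;
    private final AiService aiService; // thin wrapper around the Groq call, sketched below

    public ChatController(SimpMessagingTemplate broker, AiService aiService) {
        this.broker = broker;
        this.aiService = aiService;
    }

    @MessageMapping("/chat.send/{roomId}")
    public void onMessage(@DestinationVariable String roomId, @Payload ChatMessage msg) {
        // Steps 1 and 4: every message (including the user's @ai prompt)
        // is broadcast immediately, so the human-to-human path stays fast.
        broker.convertAndSend("/topic/room." + roomId, msg);

        // Steps 2 and 3: @ai-prefixed messages are also routed to the AI layer.
        if (msg.content().startsWith("@ai")) {
            String prompt = msg.content().substring("@ai".length()).trim();
            // The LLM round-trip runs asynchronously and never blocks this thread.
            CompletableFuture<String> answer = aiService.reply(prompt);
            answer.thenAccept(text -> broker.convertAndSend(
                    "/topic/room." + roomId, ChatMessage.fromAi(text)));
        }
    }
}
```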

Frontend Handling

On the client side, I wanted to ensure the user experience remained intuitive. I added distinct UI rendering logic to the JavaScript. AI-generated messages are styled differently from standard user messages, making it immediately clear when the LLM is speaking to the room versus a human participant.
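One simple way to make that distinction robust is to have the backend tag AI replies explicitly, so the JavaScript renderer only has to branch on a single field. A hypothetical DTO shape (the field and enum names are my assumptions, not ChatUp's actual model):

```java
// Hypothetical message DTO: the "type" field is what the frontend keys on.
public record ChatMessage(String sender, String content, MessageType type) {

    public enum MessageType { CHAT, JOIN, LEAVE, AI }

    // Factory used by the controller when broadcasting an LLM reply.
    public static ChatMessage fromAi(String answer) {
        return new ChatMessage("AI Assistant", answer, MessageType.AI);
    }
}
```

The SockJS/STOMP subscription callback on the client can then attach a different CSS class whenever the incoming message's type is AI.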


Swapping the Brain: Alternative APIs

One of the biggest advantages of decoupling the AI logic into a dedicated service layer is that you aren't locked into a single provider. While I used Groq for its incredibly low latency with LLaMA 3.3, this architecture makes it trivial to swap in other LLM APIs by simply changing the endpoint and payload structure in the service class.
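To make that concrete, here is a hedged sketch of what such a service layer can look like, assuming Spring Boot 3.2+'s RestClient. The property names (ai.base-url, ai.api-key, ai.model) and the response parsing are illustrative, not ChatUp's actual code:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestClient;

@Service
public class AiService {

    private final RestClient http;
    private final String model;

    public AiService(@Value("${ai.base-url}") String baseUrl,
                     @Value("${ai.api-key}") String apiKey,
                     @Value("${ai.model}") String model) {
        this.http = RestClient.builder()
                .baseUrl(baseUrl)
                .defaultHeader(HttpHeaders.AUTHORIZATION, "Bearer " + apiKey)
                .build();
        this.model = model;
    }

    public CompletableFuture<String> reply(String prompt) {
        // Run the blocking HTTP call on a worker thread, off the broker thread.
        return CompletableFuture.supplyAsync(() -> {
            Map<String, Object> body = Map.of(
                    "model", model,
                    "messages", List.of(Map.of("role", "user", "content", prompt)));
            Map<?, ?> response = http.post()
                    .uri("/chat/completions") // OpenAI-style chat schema, which Groq implements
                    .contentType(MediaType.APPLICATION_JSON)
                    .body(body)
                    .retrieve()
                    .body(Map.class);
            // Pull choices[0].message.content out of the JSON response.
            Map<?, ?> choice = (Map<?, ?>) ((List<?>) response.get("choices")).get(0);
            Map<?, ?> message = (Map<?, ?>) choice.get("message");
            return (String) message.get("content");
        });
    }
}
```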

If you are looking to build something similar, here are a few other APIs that fit perfectly into this pattern (a config sketch follows the list):

  • Anthropic Claude API: If your chat rooms require complex reasoning, coding assistance, or handling large contexts (like summarizing a long chat history), Claude is an excellent alternative. It is highly capable and relatively easy to integrate into a Spring Boot pipeline.
  • OpenAI API (GPT-4o-mini): The industry standard. It's incredibly reliable, well-documented, and the -mini model is fast and cost-effective for high-volume, real-time chat environments.
  • Local Models via Ollama: If data privacy is a strict constraint or you want to run the system entirely on-premise without API costs, you can point your Spring Boot service to a local Ollama instance running open-weights models. The latency will depend entirely on your hardware, but the integration pattern remains exactly the same.
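In practice, the swap can live entirely in configuration. Because the service sketch above reads its base URL, key, and model from properties, pointing it at a different provider looks like this (property names and model IDs are illustrative; Groq, OpenAI, and Ollama all expose OpenAI-compatible chat-completions endpoints):

```properties
# Groq running LLaMA 3.3 (what I used)
ai.base-url=https://api.groq.com/openai/v1
ai.api-key=${GROQ_API_KEY}
ai.model=llama-3.3-70b-versatile

# OpenAI (same chat-completions schema):
# ai.base-url=https://api.openai.com/v1
# ai.model=gpt-4o-mini

# Local Ollama (OpenAI-compatible endpoint; the key is ignored):
# ai.base-url=http://localhost:11434/v1
# ai.model=llama3.3
```

Anthropic's Messages API uses a different request shape, so moving to Claude also means adjusting the payload in the service class, as noted above.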

Wrapping Up

Integrating an LLM via the Groq API into a live STOMP/WebSocket pipeline was a fantastic exercise in decoupling services. By isolating the AI routing into its own layer, the app successfully maintains high-performance real-time state sync for normal chats while offering powerful generative AI capabilities on demand.

If you are working with Spring Boot WebSockets or integrating LLMs into event-driven systems, feel free to check out the repo or drop a comment below!
