Follow-up to the "From Zero to AI Agent: My Journey into Java-based Intelligent Applications" series
After seven posts building our Java MCP client from scratch, it was time to record a demo and see how our AI agent performs with some real user queries. The results are in.
Today I'm sharing what happened during our demo session: the successes, the failures, and the valuable lessons learned about AI agent behavior.
The Demo Setup
Our JavaChatAI agent was running with:
- Three MCP servers: FileSystem, Weather and Time services
- LLM Provider: Groq with Llama model
- Full multi-tool orchestration as built in our series
- Three execution modes: DIRECT_ANSWER, SINGLE_TOOL, and MULTI_TOOL
The plan was simple: test various query types and see how our intelligent routing performs.
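For readers joining at this post, here is a minimal sketch of what that routing looks like. The ResponseMode and QueryRouter names are illustrative stand-ins, not the exact classes from the series:

```java
// Illustrative sketch of the three-mode routing; class and method names
// are stand-ins, not the exact code from the series.
enum ResponseMode { DIRECT_ANSWER, SINGLE_TOOL, MULTI_TOOL }

class QueryRouter {

    // The LLM is asked to classify the query; we map its reply back to a mode
    // and default to DIRECT_ANSWER if the reply is unrecognized.
    ResponseMode classify(String llmReply) {
        String normalized = llmReply.toUpperCase();
        for (ResponseMode mode : ResponseMode.values()) {
            if (normalized.contains(mode.name())) {
                return mode;
            }
        }
        return ResponseMode.DIRECT_ANSWER;
    }
}
```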
Test 1: "What is the capital of the United States?"
Expected Behavior: DIRECT_ANSWER (no tools needed)
What Happened:
🤖 Washington, D.C.
Result: PERFECT
Our agent correctly identified this as a knowledge-based query that doesn't require any external tools.
Test 2: "What time is it in Denver, CO?"
Expected Behavior: SINGLE_TOOL (time server)
What Happened:
Detected response modality: SINGLE_TOOL
🤖 It is 5:22 PM in Denver.
Result: PERFECT
The agent correctly identified this as a single-tool operation, selected the appropriate time server, and accurately extracted the required location parameter ("Denver, CO").
Our parameter extraction and tool selection logic is working.
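The single-tool path can be as small as the sketch below. The ToolClient interface and the "location" parameter name are assumptions standing in for the MCP plumbing we built earlier in the series; get_current_time is the actual time tool:

```java
import java.util.Map;

// Hypothetical stand-in for the MCP client wiring built earlier in the series;
// the real client class and method signatures differ.
interface ToolClient {
    String callTool(String toolName, Map<String, Object> arguments);
}

class SingleToolExecutor {
    private final ToolClient timeServer;

    SingleToolExecutor(ToolClient timeServer) {
        this.timeServer = timeServer;
    }

    // The LLM has already picked the tool and extracted the parameters;
    // we simply forward them to the selected server.
    String execute(String toolName, Map<String, Object> extractedParams) {
        return timeServer.callTool(toolName, extractedParams);
    }
}

// For Test 2 this boils down to something like:
// executor.execute("get_current_time", Map.of("location", "Denver, CO"));
```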
Test 3: "How far is Denver, CO from Washington, D.C. by plane?"
Expected Behavior: DIRECT_ANSWER (no distance/flight tools available)
What Actually Happened:
Detected response modality: MULTI_TOOL
[Orchestrator attempts to create execution plan]
[Plan fails - no appropriate tools available]
🤖 The flight distance between Denver, CO and Washington, D.C. is approximately **1,600 miles (2,575 kilometers)**. The actual travel time will vary depending on the flight path and speed.
Result: CORRECT OUTPUT, WRONG PROCESS
Our agent initially misclassified this as a MULTI_TOOL query (get_current_time and get-forecast), but when the orchestrator couldn't find an appropriate tool for the distance calculation, it fell back to the LLM's own knowledge.
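That fallback can be expressed roughly like this; the Orchestrator shape below is a simplified sketch, not the series' actual class:

```java
import java.util.List;

// Simplified sketch of the fallback: if any planned step has no matching tool,
// skip the plan and answer from the LLM's own knowledge. Class and method
// names are illustrative, not the series' actual orchestrator.
class Orchestrator {

    private final List<String> availableTools;

    Orchestrator(List<String> availableTools) {
        this.availableTools = availableTools;
    }

    String handle(List<String> plannedTools, String userQuery) {
        boolean planIsExecutable = availableTools.containsAll(plannedTools);
        if (!planIsExecutable) {
            // e.g. no tool can compute a flight distance -> direct answer
            return askLlmDirectly(userQuery);
        }
        return executePlan(plannedTools, userQuery);
    }

    String askLlmDirectly(String query) { return "LLM answer for: " + query; }

    String executePlan(List<String> tools, String query) { return "tool results"; }
}
```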
Test 4: "What time is it and How's the weather in Washington, DC?"
Expected Behavior: MULTI_TOOL with two parallel operations
What Actually Happened:
Detected response modality: MULTI_TOOL
[Creates plan with time tool, weather tool, and... file write tool?]
...
...
🤖 It's 7:22 PM Eastern Time. The weather in Washington, DC is currently 69°F with a chance of showers and thunderstorms tonight. Expect similar conditions for the next few days, with highs in the 70s and 80s.
Result: CORRECT INFORMATION, WRONG TOOL SELECTION
The agent successfully got both pieces of information we requested, but it hallucinated an additional step: saving the results to a file! And it actually did write that file into the project's folder.
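One guardrail this result suggests (not something our current code does) is to sanitize the plan before execution and drop side-effect tools the user never asked for. The tool names and keyword check below are assumptions:

```java
import java.util.List;
import java.util.Set;

// Possible guardrail (not yet in our code): strip side-effect tools from the
// plan unless the user explicitly asked for them. Tool names are assumptions.
class PlanGuard {

    private static final Set<String> SIDE_EFFECT_TOOLS = Set.of("write_file", "delete_file");

    List<String> sanitize(List<String> plannedTools, String userQuery) {
        String query = userQuery.toLowerCase();
        boolean userWantsWrite = query.contains("save") || query.contains("write");
        return plannedTools.stream()
                .filter(tool -> userWantsWrite || !SIDE_EFFECT_TOOLS.contains(tool))
                .toList();
    }
}
```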
What These Results Tell Us
What's Working Well
The agent's routing logic reliably distinguishes between queries that need a DIRECT_ANSWER and those that require tool-based resolution. For the latter, single-tool execution is solid: parameters are extracted accurately and the appropriate tool gets called.
What Needs Improvement
First, there is a tendency for over-classification as MULTI_TOOL, where the agent is too aggressive in assuming queries require multiple tools. This is closely related to the issue of tool hallucination, where the agent adds unnecessary steps, like saving files, that were not requested by the user. To address these problems, the primary focus should be on analysis prompt tuning, making the criteria for when a multi-tool approach is actually needed much more precise and restrictive.
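Concretely, the tightened analysis prompt could look something like the sketch below; the wording is a suggestion, not the exact prompt used in the series:

```java
class AnalysisPrompts {

    // Sketch of a more restrictive analysis prompt; the wording is a
    // suggestion, not the exact prompt from the series.
    static String buildAnalysisPrompt(String toolList, String userQuery) {
        return """
                Classify the user query into exactly one mode:
                - DIRECT_ANSWER: answerable from general knowledge, no tools needed.
                - SINGLE_TOOL: needs exactly one of the available tools.
                - MULTI_TOOL: explicitly asks for two or more different pieces of
                  information, each requiring a different available tool.
                Use only the tools listed below. Never add steps the user did not
                ask for, such as saving results to a file.

                Available tools: %s
                User query: %s
                Answer with the mode name only.
                """.formatted(toolList, userQuery);
    }
}
```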
This demo reinforced something important: building AI agents is an iterative process. Our architecture is solid, our core functionality works, but the intelligence layer (prompt engineering and decision-making) needs continuous refinement.
Try It Yourself
The complete code from our series is available on GitHub. I encourage you to clone it, run your own tests, and see what interesting behaviors you discover.
Conclusion
This demo session showed us that our Java MCP agent is functional and can handle user queries, but it also highlighted areas where we need to improve the intelligence layer.
The agent never crashed, always provided some form of useful response, and demonstrated that our architecture can handle the complexity of AI interactions.
Follow @gazolla for more AI development using Java.