
Mitansh Gor

Personal Health Agent (PHA): Multi-Agent Health System

VIDEO LINK

In the last few years, we have seen different types of AI systems try to answer the same underlying question: how can AI help people make sense of their own health data?
  • PH-LLM (2024) → focused on personalized coaching from wearable data.
  • PHIA (2024) → acted like an agent that analyzes your data, writes code, and explains results.
  • IR Prediction + IR Explainer Agent (2025) → estimated your metabolic risk and explained it like a doctor.

All of these systems were important steps. But they shared the same core weakness:

  • A single LLM tries to do everything.
  • It must read data, interpret medical meaning, check safety, and give advice.
  • This often leads to confident hallucinations — answers that sound correct but are actually wrong.
  • It also mixes medical reasoning with casual coaching, which creates risk.
  • And one model cannot be an expert in every health domain.


So the big idea behind the Personal Health Agent (PHA) is simple:

Instead of one giant model doing everything, why not create multiple smaller agents — each with a role — working together like a mini health team?
This solves the main problem with earlier systems: no single AI can fully handle the complexity of real human health. But a team of specialized agents can.

Because health is multi-dimensional—where your sleep affects your stress, stress affects metabolic health, exercise influences sleep, and food and biomarkers impact everything—one AI model can’t realistically handle all of it at once. Just like a hospital relies on different specialists, a digital health system also needs different “roles”: one agent to interpret raw data, one agent to check medical safety, and one agent to translate insights into simple coaching. This is why multi-agent systems are becoming so important: they add structure, accountability, and more reliable reasoning, solving many of the issues earlier single-LLM systems struggled with.

A System of Specialized Agents

When I look at the PHA system, what stands out first is the Analyst Agent. I think of this agent like a friend who loves numbers and graphs. Its whole job is to read my raw data — sleep hours, steps, heart rate, and other signals — and turn them into clear observations. It doesn’t try to coach me or act like a doctor. It just says, “Here’s what I see happening in your data,” which makes everything feel clean and easy to understand.


Then there is the Domain Expert Agent, which to me is the “doctor brain” of the whole system. This agent takes what the Analyst found and checks if it makes medical sense. It worries about safety, accuracy, and whether the explanation aligns with real health science. I really like this part because it stops the system from giving random or misleading advice. It acts like the strict friend who always says, “Wait, is this actually correct?”


Next comes the Coach Agent, the most human-feeling part of PHA. This agent talks to me like a supportive friend who wants me to succeed. It takes all the technical pieces and translates them into simple, everyday advice I can act on. No medical jargon. No complicated charts. Just small, doable suggestions like, “Try walking after dinner” or “Aim for an earlier bedtime tonight.” It’s friendly and practical, which is exactly what people need.


What ties everything together is how these agents work as a team. Instead of one giant AI trying to do everything — and often making confident mistakes — each agent stays in its own lane. One analyzes, one verifies, one coaches. Because of this teamwork, the final answer feels clearer, safer, and much more trustworthy. It’s almost like having a mini health team inside your phone, and that makes the whole system feel both smart and human at the same time.

Technical Deep Dive

Even though all agents may start from the same base LLM (like Gemini or GPT-class models), they are trained separately, so each one becomes good at one thing instead of being average at everything.

Finetuning

  • Analyst Agent is finetuned on wearable logs (heart rate, steps, sleep), time-series patterns, data summaries, and statistical examples. It becomes good at reading numbers and spotting patterns.
  • Domain Expert Agent needs more serious finetuning data: medical guidelines, curated clinical examples, safe/unsafe reasoning samples, and explanations validated by experts. It learns how to avoid hallucinations and unsafe claims.
  • Coach Agent is the “human” one. It is finetuned on friendly conversational data, lifestyle advice examples, and behavior-change coaching patterns. It learns to talk in a simple, supportive way.
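The post doesn't describe the actual training data format, but to make the split concrete, here is a minimal sketch of how role-specific finetuning examples could be organized as separate JSONL files. The field names, file names, and example content are my own and purely illustrative:

```python
import json

# Hypothetical role-specific finetuning records; field names and file paths
# are illustrative, not taken from the PHA paper.
examples = {
    "analyst": {
        "instruction": "Summarize this week's sleep data and flag anomalies.",
        "input": "sleep_hours: [6.1, 5.8, 7.2, 4.9, 6.5, 7.0, 6.8]",
        "output": "Average sleep was 6.3 h; Thursday (4.9 h) was well below baseline.",
    },
    "domain_expert": {
        "instruction": "Review this interpretation for medical accuracy and safety.",
        "input": "Low HRV on two nights means the user has a heart condition.",
        "output": "Unsafe claim: two nights of low HRV do not indicate a heart condition.",
    },
    "coach": {
        "instruction": "Turn this finding into one supportive, actionable suggestion.",
        "input": "User sleeps about 1.5 h less on weeknights than on weekends.",
        "output": "You sleep noticeably less on weeknights. Could you try winding down 30 minutes earlier tonight?",
    },
}

# Write one JSONL finetuning file per agent so each model only sees its role.
for role, record in examples.items():
    with open(f"{role}_finetune.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
```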


This division is what makes the final system feel more stable than earlier systems like PHIA or the IR Agent, where a single LLM had to juggle all roles and often made confident mistakes.

The process itself was simple but very intentional. Here’s how I imagine it:

  1. We first collected examples for each agent’s role.
  2. Then we cleaned and organized everything.
  3. After that, we fine-tuned each agent separately.
  4. Then we tested the agents together.
  5. Finally, we fixed mistakes and improved the dataset.

System Architecture and Data Flow


Underneath all of this, there’s a system architecture that makes their collaboration possible. I picture it like a small command center. The user’s data flows into a main controller, which decides which agent speaks first and who validates whom. The Analyst generates its interpretation, the Domain Expert checks it, and the Coach turns everything into a simple message. Each agent is isolated enough to stay focused, but interconnected enough to build on each other’s work. There’s also a safety layer running quietly in the background — something like a “medical guardrail” — to make sure the final output stays safe, consistent, and responsible before it reaches the user.

Data Pipeline Flow

  1. My raw data comes in (wearable data, blood test results, communication history).
  2. The system cleans and prepares the data.
  3. The Analyst Agent gets the cleaned data first, and it gives a report.
  4. The Expert Agent double-checks everything and makes sure the explanation is medically correct and safe.
  5. The Coach Agent gets the approved explanation and rewrites the message in simple, friendly language.
  6. A final safety filter reviews the message and checks for risky content like medical diagnoses, unsafe suggestions, or harmful claims.
  7. I receive the final output.
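This isn't the authors' code, but a minimal sketch of that seven-step flow helps show how the hand-offs and the final guardrail fit together. The agent calls are stubbed out, and the keyword-based safety check is only a stand-in for whatever the real guardrail does:

```python
# Minimal sketch of the pipeline above. The agent functions are stubs standing
# in for LLM calls; the blocklist-style safety check is a simplification.

RISKY_PHRASES = ["you have", "diagnosis", "stop taking", "increase your dose"]

def clean(raw: dict) -> dict:
    # Steps 1-2: drop missing values before anything else sees the data.
    return {k: v for k, v in raw.items() if v is not None}

def call_analyst(data: dict) -> str:
    # Step 3: turn raw numbers into an observation.
    return f"Average sleep over {len(data.get('sleep_hours', []))} days looks stable."

def call_expert(report: str) -> str:
    # Step 4: approve or correct the analyst's interpretation.
    return report + " No medically unsafe claims detected."

def call_coach(approved: str) -> str:
    # Step 5: rewrite the approved explanation in plain, supportive language.
    return "Your sleep looks steady this week, nice work. Keep the same bedtime tonight."

def safety_filter(message: str) -> str:
    # Step 6: block anything that reads like a diagnosis or medication advice.
    lowered = message.lower()
    if any(phrase in lowered for phrase in RISKY_PHRASES):
        return "I can't give medical advice on that. Please talk to a clinician."
    return message

def run_pipeline(raw: dict) -> str:
    data = clean(raw)                  # steps 1-2
    report = call_analyst(data)        # step 3
    approved = call_expert(report)     # step 4
    friendly = call_coach(approved)    # step 5
    return safety_filter(friendly)     # steps 6-7

print(run_pipeline({"sleep_hours": [6.5, 7.1, 6.8], "steps": None}))
```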


Inference Time

During inference — which is just a fancy word for “when the system actually responds to me” — the agents talk to each other almost like coworkers passing notes in a group chat. I don’t see any of this, but it’s happening behind the scenes. The Analyst looks at my raw data and says, “Here’s what I think is happening.” The Domain Expert looks at what the Analyst wrote and either approves it or corrects it. And finally, the Coach turns their combined reasoning into something that sounds human and helpful. What I receive as a single answer is actually the teamwork of multiple roles stitched together.

Tools Used

To make all of this work, the system relies on a set of tools that help each agent extend its abilities. The Analyst might use data-processing tools to clean and summarize wearable data. The Domain Expert might have access to reference tools that help it compare findings against medical knowledge or validated guidelines. The Coach might use prompt templates designed for empathy, clarity, and motivation. And the main controller uses orchestration tools — something like LangGraph or a custom workflow engine — to decide who talks when and how information flows between them. These tools don’t replace the agents, but they give them structure, context, and external capabilities they wouldn’t have on their own.
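As a concrete illustration of that orchestration idea, here is how the Analyst → Expert → Coach hand-off could be wired with LangGraph. This is only a sketch of the pattern, not the actual PHA implementation; the node functions are trivial placeholders rather than real agents:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Shared state that flows through the graph; each node fills in its own field.
class PHAState(TypedDict):
    question: str
    analyst_report: str
    expert_review: str
    coach_message: str

def analyst(state: PHAState) -> dict:
    return {"analyst_report": f"Interpreting data for: {state['question']}"}

def expert(state: PHAState) -> dict:
    return {"expert_review": f"Checked for safety: {state['analyst_report']}"}

def coach(state: PHAState) -> dict:
    return {"coach_message": f"In plain terms: {state['expert_review']}"}

# Wire the agents in sequence: analyst speaks first, expert validates, coach rewrites.
graph = StateGraph(PHAState)
graph.add_node("analyst", analyst)
graph.add_node("expert", expert)
graph.add_node("coach", coach)
graph.set_entry_point("analyst")
graph.add_edge("analyst", "expert")
graph.add_edge("expert", "coach")
graph.add_edge("coach", END)

app = graph.compile()
result = app.invoke({"question": "How has my sleep been this month?"})
print(result["coach_message"])
```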

Collection of Data

Everything starts with a real study they ran called WEAR-ME, which included more than a thousand Fitbit users. People had to explicitly opt-in, sign digital consent, link their wearable device, and even go for a blood draw at a Quest Diagnostics center. So the data wasn’t scraped or assumed—it came from real people who agreed to share it for research. That part mattered to me because it made the whole system feel responsible, not rushed.


Once someone joined the study, their data flowed through a very controlled pipeline, so each person had a mix of: (1) wearable time-series, (2) lab biomarkers, (3) self-reported history, and (4) lifestyle questionnaires.

What I liked is that the system doesn’t directly reach into a live Fitbit account or do anything “real-time.” Instead, the PHA gets access to summaries of the participant’s data, organized into tables—like a daily summary table, an activities table, and even a population-level reference table so the system could compare one user’s values against others in their age group. This gave context, not just raw numbers.
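The exact table schemas aren't described, so here is just a sketch of that idea with pandas: a made-up daily summary table compared against a made-up age-group reference table. All column names and numbers are invented:

```python
import pandas as pd

# Hypothetical daily-summary and population-reference tables; the columns
# are illustrative, not the study's actual schema.
daily = pd.DataFrame({
    "date": pd.date_range("2024-06-01", periods=3),
    "sleep_hours": [6.2, 5.9, 7.1],
    "steps": [8400, 5200, 10100],
})

reference = pd.DataFrame({
    "age_group": ["30-39", "40-49"],
    "sleep_hours_median": [6.9, 6.7],
    "steps_median": [7600, 7100],
})

# Look up the reference row for this user's age group.
ref = reference.set_index("age_group").loc["30-39"]

# Compare the user's averages against their age-group medians for context.
user_mean = daily[["sleep_hours", "steps"]].mean()
age_group_median = pd.Series(
    {"sleep_hours": ref["sleep_hours_median"], "steps": ref["steps_median"]}
)
summary = pd.DataFrame({"user_mean": user_mean, "age_group_median": age_group_median})
print(summary)
```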

Inside the actual agent pipeline, each sub-agent only receives the part of the data that’s relevant to its job. The system never dumps the full raw dataset into every agent. Instead, the orchestrator acts like a coordinator who hands the right information to the right specialist.


Nothing is fetched unless the user asks something that genuinely needs it. And the system also avoids asking the user for things it already knows—like their sleep hours or steps—because the orchestrator checks the available data first.
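A tiny sketch of that routing idea, with a hypothetical orchestrator that hands each agent only its relevant tables and checks what is already on file before asking the user (the table names are invented):

```python
# Hypothetical routing table: which data each specialist is allowed to see.
AGENT_VIEWS = {
    "analyst": ["daily_summary", "activities"],
    "domain_expert": ["lab_biomarkers", "medical_history"],
    "coach": ["lifestyle_questionnaire"],
}

# Data the orchestrator already has on file for this user.
available = {
    "daily_summary": {"avg_sleep_hours": 6.4, "avg_steps": 8100},
    "activities": [{"type": "run", "minutes": 32}],
    "lab_biomarkers": {"hba1c": 5.4},
}

def slice_for(agent: str) -> dict:
    """Hand an agent only the tables relevant to its role."""
    return {t: available[t] for t in AGENT_VIEWS[agent] if t in available}

def needs_user_input(table: str) -> bool:
    """Only ask the user for something the orchestrator doesn't already have."""
    return table not in available

print(slice_for("analyst"))                 # the analyst never sees lab biomarkers
print(needs_user_input("medical_history"))  # True: not on file, so ask the user
```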

PHA vs Gemini Results

PHA wasn’t just a small improvement. It was in a completely different league.

From the end-user side, I noticed that people consistently picked PHA as the best experience. Not just once or twice — almost half the time. And this is after comparing it side-by-side against a single Gemini agent and a parallel multi-agent system.


But the biggest eye-opener for me was the expert evaluation. Experts didn’t just prefer PHA — they almost abandoned the Gemini single-agent baseline. Only around 4–5% of expert rankings put Gemini in first place, while PHA was chosen as #1 about 80% of the time. That’s not a small margin.


Strengths of the System

What struck me most was how real the system felt. Not just in terms of architecture, but in how thoroughly it was tested. It wasn’t a small demo — it involved thousands of human annotations and hundreds of hours of expert evaluation. Because of that, the strengths and limitations felt very honest.

One of the biggest strengths I noticed is how well the whole system comes together. The three-agent setup — the data scientist, the domain expert, and the coach — isn’t just theoretical. In the evaluations, PHA actually outperformed both a single-agent system and a parallel multi-agent system. End-users preferred talking to PHA almost half the time, and experts loved it even more, choosing it as the best system in about 80% of cases. To me, that says the collaboration between these agents isn’t just helpful — it genuinely changes the quality of the conversation.


The data agent in particular felt like a major upgrade. It was much better at breaking down messy, vague health questions into proper statistical analyses, catching missing data, choosing the right timeframes, and generating correct code. I could see why the system became more trustworthy — it wasn’t guessing. It was reasoning over the user’s actual data. And the domain expert agent made a big difference too. It produced safer, more accurate, and more personalized medical explanations. Users rated its responses as far more trustworthy than the base model, and clinicians preferred its summaries because they were more complete and clinically meaningful.
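To make that concrete, here is the kind of small analysis the data agent might generate for a vague question like "does walking help my sleep?". The data and code here are mine, not output from the actual system:

```python
import pandas as pd

# Toy wearable log; in the real system the data agent would generate code
# like this against the user's actual daily-summary table.
df = pd.DataFrame({
    "date": pd.date_range("2024-06-01", periods=14),
    "steps": [4200, 9800, 7600, 11000, 3900, 8700, 10400,
              5100, 9900, 7200, 12000, 4500, 8300, 9700],
    "sleep_hours": [5.9, 7.2, 6.8, 7.4, 5.6, 7.0, 7.3,
                    6.1, 7.1, 6.7, 7.5, 5.8, 6.9, 7.2],
})

# "Does walking help my sleep?" becomes: correlate each day's steps with that
# night's sleep, after dropping any days with missing values.
clean = df.dropna(subset=["steps", "sleep_hours"])
corr = clean["steps"].corr(clean["sleep_hours"])
print(f"Pearson correlation between daily steps and sleep hours: {corr:.2f}")
```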


The health coach agent also impressed me. It wasn’t just delivering motivational lines. It followed real coaching principles — active listening, SMART goals, motivational interviewing — and this made conversations feel more natural and supportive. Users were more engaged, and the conversations ended more naturally. It reminded me of talking to a real human coach who’s actually paying attention.


As a whole system, PHA feels more thoughtful. The orchestrator doesn’t just mix answers together — it understands the user’s goal, assigns the right agent, reflects on the output, and remembers important details for later turns. This gives the conversation a sense of direction and personalization that the baselines just didn’t have.

Limitations of the System

But the system isn’t perfect. One thing that stood out immediately was the cost. PHA is slower and more computationally expensive because each request triggers multiple agents and reflection steps. Where the single-agent system might respond in around 35 seconds, PHA could take over 200. It’s powerful, but not cheap.

Another limitation is that the diagnostic reasoning, while improved, is still not bulletproof. Sometimes the domain expert agent didn’t complete its reasoning chain, and its reliance on web search occasionally pulled in conflicting information. This reminded me that, even though it acts like a doctor in some ways, it’s not actually one — and the authors make that very clear.

Bias is also a real concern. The system personalizes advice based on user traits, conditions, and context… which is great, but it also means it can unintentionally repeat patterns or assumptions from datasets that might not represent everyone equally. The paper calls out this risk directly.


The coach agent, despite being strong overall, also had a weakness: it wasn’t great at tracking user progress over time. It did a good job helping set goals, but didn’t always follow up on them in later turns. For long-term coaching, that’s something they’ll need to improve.

And finally — maybe the biggest limitation — everything was tested in short-term interactions. We don’t yet know whether PHA can support behavior change over weeks or months, which is where health coaching really matters. The team also notes that this is a research framework, not a clinical tool, and major regulatory work would be needed before deploying anything like this in the real world.

So when I put it all together, here’s how I personally see it:
PHA is a huge improvement over past systems — smarter, safer, more human, and more useful — but it’s still early. It has real strengths, but also real gaps. It’s powerful as an idea, promising as a system, and clearly built with care. But it also reminds me that great AI doesn’t replace medical care — it supports it, and it still has a lot of growing up to do.

Conclusion

After reading the PHA paper, the biggest thing I realized is that this system isn’t just an AI answering health questions — it feels more like a small team working behind the scenes to help you make sense of your own data. The evaluations show that this multi-agent design really does lead to better, clearer, and more personalized guidance compared to older single-agent approaches.

At the same time, the authors are very honest about their limits. The system can still make mistakes, it can be biased depending on the data, and it should never replace real medical care. It’s powerful, but it’s also early — more like a research blueprint than a ready product.

To me, PHA represents a hopeful direction: AI that supports people in understanding their health, not by taking over, but by helping them make smarter choices with empathy, safety, and transparency in mind.
