Mitansh Gor

PHIA: The Agentic LLM That Writes Code, Analyzes Your Data & Explains Health Insights

VIDEO LINK

I honestly felt like I was looking at the “next version” of what health AI should be. Whenever I asked myself questions like, “Is my sleep improving?” or “Does exercising at night change my deep sleep?”, I realized how hard it is to answer these without real analysis.

That is exactly the problem the PHIA team talks about on page 1 of the paper. They explain that even a simple question like “Do I sleep better after exercising?” requires many steps: checking recent data, comparing different days, calculating metrics, and then interpreting everything in the context of what “healthy” even means. And honestly, I could relate — that’s the kind of analysis I never do on my own.
What really caught my attention is that the paper says today’s LLMs struggle with numerical reasoning, meaning they often miscalculate or oversimplify things. I’ve definitely seen models do that — giving a confident answer but completely messing up basic math. The paper points to this as a key limitation of PH-LLM.


So PHIA tries to fix this problem by giving an LLM the ability to plan, use tools, write code, and search the web. It doesn’t just “chat.” It becomes more like a small data analyst that works with your wearable data step-by-step.

PH-LLM vs PHIA

PH-LLM was an earlier coaching system that worked only with simple, pre-aggregated 30-day summaries and relied purely on the LLM’s internal reasoning. That meant it couldn’t actually analyze detailed daily wearable data, couldn’t run calculations, and completely failed on questions that required numbers or step-by-step reasoning. PHIA addresses these limitations by giving the LLM three new capabilities: (1) it can write and run real Python code to analyze raw wearable data, (2) it can plan and break down tasks using an agent loop, and (3) it can search the web to incorporate fresh, verified health knowledge.


Because of this, PHIA shifts from being “just a health coach” to becoming a true data analyst + coach, delivering insights that are more accurate, personalized, and grounded in both the user’s data and real health science. PHIA feels like the bridge between “AI as a friendly coach” and “AI as a personal data scientist.” By mixing reasoning, code execution, and web search, it finally unlocks the kind of deeper, more accurate insights that wearables have always had the potential to provide.


How PHIA Works

PHIA is more like a mini health data analyst that can think step by step, write code, check your wearable data, search the web, and then explain everything in plain language. The PHIA paper shows this clearly — PHIA literally cycles through think → act → observe, just like a human analyst would.

I’ll explain this in the simple way I understood it.


Instead of guessing answers, PHIA acts like a small data analyst that plans its steps, checks your wearable data, and only then explains what it found. The paper describes this using the ReAct loop — Thought → Action → Observation — and seeing it in action helped me understand why PHIA is so different.
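To make that loop concrete, here is a minimal sketch of a ReAct-style controller in Python. Everything in it is my own placeholder scaffolding (the `llm` callable, the `tools` dict, and the `Action: name[input]` text format), not PHIA's actual implementation:

```python
import re

def parse_action(step_text):
    # Expects a line like: Action: python[df["rhr"].mean()]  or  Action: search[sleep guidelines]
    match = re.search(r"Action:\s*(\w+)\[(.*)\]", step_text, re.DOTALL)
    if match is None:
        return None, None
    return match.group(1), match.group(2)

def react_agent(question, llm, tools, max_steps=10):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # 1. Think: the LLM reads the transcript so far and proposes the next step.
        step = llm(transcript)                      # e.g. "Thought: ...\nAction: python[...]"
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()

        # 2. Act: dispatch the chosen tool (run code or search the web).
        tool_name, tool_input = parse_action(step)
        if tool_name not in tools:
            transcript += "Observation: unknown action, please try again.\n"
            continue
        observation = tools[tool_name](tool_input)

        # 3. Observe: feed the tool's result back so the next Thought can use it.
        transcript += f"Observation: {observation}\n"
    return "Could not reach an answer within the step limit."
```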

PHIA starts by thinking through the question. If I ask something like “Is my resting heart rate improving?”, it doesn’t jump to a quick reply. It pauses and decides what it needs—maybe comparing two weeks of data or calculating averages. Then PHIA takes action by writing and running Python code in a safe, sandboxed environment. This lets it analyze real daily time-series data using Pandas, just like a real data scientist would. The nice part is that this avoids the mathematical mistakes that normal LLMs often make. And if its code crashes, PHIA actually fixes the error and tries again (a recurrent trials system), which the paper highlights as one of its strengths.
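Just to visualize what such an "Action" could look like, here is a tiny Pandas sketch on made-up data. The column names (`date`, `resting_heart_rate`) are my own assumptions; PHIA runs against its own synthetic wearable schema:

```python
import pandas as pd

# Toy daily wearable data; in PHIA the agent operates on a real daily dataframe.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=28, freq="D"),
    "resting_heart_rate": [62, 61, 63, 60, 59, 61, 60, 58, 59, 60, 58, 57, 58, 59,
                           57, 56, 58, 57, 56, 55, 57, 56, 55, 56, 54, 55, 54, 55],
})

# Compare the most recent two weeks against the two weeks before them.
recent = df.tail(14)["resting_heart_rate"].mean()
previous = df.iloc[-28:-14]["resting_heart_rate"].mean()

print(f"Previous 2 weeks: {previous:.1f} bpm, recent 2 weeks: {recent:.1f} bpm")
print("Improving" if recent < previous else "Not improving")
```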

In their evaluation, PHIA scored 84% accuracy on objective questions, way higher than plain LLM reasoning. A big part of that gap comes from PHIA being able to recover when its code fails.
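Here is a rough sketch of how that recovery behaviour could be wired up, assuming hypothetical `generate_code` and `execute` callables (the failing traceback is fed back so the model can repair its own code):

```python
import traceback

def run_python_with_retries(generate_code, execute, max_trials=3):
    """Ask the LLM for code, run it, and feed any traceback back as the next observation."""
    feedback = ""
    for _ in range(max_trials):
        code = generate_code(feedback)   # LLM writes (or repairs) the analysis code
        try:
            return execute(code)         # e.g. run in a sandboxed Python environment
        except Exception:
            # The error text becomes the observation for the next attempt.
            feedback = f"Your last code failed with:\n{traceback.format_exc()}"
    return "Analysis failed after repeated attempts."
```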


When a question requires more than numbers—like understanding recommended sleep hours—PHIA uses a built-in web search tool to fetch verified information from trusted sources. Sometimes the user needs context, like: “Is this amount of sleep normal for my age?” “What workouts improve resting heart rate?” “Is my stress score healthy?” This mixing of personal data and domain knowledge is what makes PHIA feel smart and practical.
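A minimal sketch of how the personal number and the searched context could be combined, with `web_search` as a stand-in for whatever retrieval backend is used (not PHIA's actual tool):

```python
def answer_with_context(user_sleep_hours, web_search):
    # Step 1: the number comes from the user's own wearable data.
    personal = f"Your average sleep over the last week was {user_sleep_hours:.1f} hours."

    # Step 2: the context comes from a trusted source via the search tool.
    guideline = web_search("recommended hours of sleep per night for adults")

    # Step 3: the final answer grounds the personal number in that guideline.
    return f"{personal} For reference: {guideline}"

# Example usage with a stubbed search tool:
print(answer_with_context(6.4, lambda q: "Most adults need 7-9 hours per night."))
```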

What impressed me most is how PHIA blends everything together. It doesn’t just give numbers or copy facts. It calculates, it checks, it researches, and then it explains the result in clear language. The examples in the paper, like comparing a user’s sleep with national guidelines, really show how these steps come together. In the end, PHIA feels less like a chatbot and more like someone doing careful, step-by-step reasoning to help you understand your own health patterns.

A Deep Dive Into PHIA’s Technical Architecture

PHIA is not a fine-tuned model. It’s an agent framework built around Gemini 1.0 Ultra, with two major tools: (1) a Python data-analysis runtime, and (2) a web search tool.

These tools are orchestrated using the ReAct (Reason + Act) agent pattern, which is why PHIA can do multi-step reasoning without fine-tuning. This design gives PHIA abilities that raw LLMs don’t have: planning, correction, stepwise attention, and grounded analysis.

No Fine-Tuning

One of the most interesting things I learned from the paper is that PHIA is not fine-tuned like PH-LLM. Instead of modifying the model’s weights, the authors built a smart scaffolding around the model and taught it how to act like an agent.

Fine-tuning a huge LLM like Gemini Ultra is expensive and slow, requires tons of supervised data, and is risky (it can break general reasoning ability).

So instead, PHIA keeps the base model untouched and teaches it how to “think” and use tools through few-shot ReAct examples.

This is a totally different philosophy: PH-LLM → change the model. PHIA → change the system around the model.

Process

  • The authors started with thousands of wearable-related questions (objective + open-ended). They needed a tiny set of example tasks that could teach the model how to behave like an agent.
  • Instead of randomly picking examples, they converted every question into numerical embeddings using Sentence-T5. This turns each question into a vector representing its meaning, so similar questions sit close together in vector space.


  • They ran k-means clustering (k=20) on these embeddings to automatically group similar questions into 20 clusters. Each cluster represents a whole “type” of question, like sleep trends, workout comparisons, anomaly detection, correlations, and so on.
  • From each cluster, they selected the most central question — the one closest to the cluster centroid. This gives 20 representative queries, each standing in for a larger family of similar queries (see the sketch after this list).
  • For each of these 20 representative queries, the team manually wrote a complete ReAct agent trajectory that included every step of reasoning:
    • Thought: what the model should plan to do
    • Action (Python): the exact Pandas code needed
    • Observation: the real output that the code would return
    • Thought again: interpreting the output
    • Action (Search): when domain knowledge is needed
    • Observation: the retrieved information
    • Final Answer: a clear, natural-language explanation.
  These aren’t partial examples — they are full step-by-step walkthroughs that demonstrate how an agent should behave.
  • These 20 full trajectories were inserted into PHIA’s few-shot prompt, acting as demonstrations inside its system instructions. So the model is not fine-tuned — instead, it is taught by example how to: plan before acting, decide when to run code, decide when to search the web, fix code errors, combine numerical results with domain knowledge, and produce a final personalized insight.
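
Here is a small sketch of that selection step, assuming the sentence-transformers port of Sentence-T5 and scikit-learn's KMeans; the paper's exact pipeline may differ:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

questions = [
    "Is my sleep improving over the last month?",
    "Does exercising at night change my deep sleep?",
    "How does my resting heart rate compare to last week?",
    # ...in the paper this pool contains thousands of wearable questions
]

# 1. Turn each question into an embedding so similar questions land close together.
model = SentenceTransformer("sentence-transformers/sentence-t5-base")
embeddings = model.encode(questions)

# 2. Cluster the questions into k "types" (the paper uses k=20 over the full pool).
k = min(20, len(questions))
kmeans = KMeans(n_clusters=k, random_state=0).fit(embeddings)

# 3. For each cluster, keep the question closest to the centroid as its representative.
distances = kmeans.transform(embeddings)   # shape: (n_questions, k)
representatives = [questions[i] for i in distances.argmin(axis=0)]
print(representatives)
```

With the toy list above, k collapses to 3; with the paper's full question pool it stays at 20.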

This is called “behavior cloning via prompting”: the model learns the pattern of behavior from the examples, even without changing its internal parameters.
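
To make "teaching by example" concrete, here is a toy mock-up of how such a demonstration could sit inside the few-shot prompt. The exemplar text below is mine, not a real trajectory from the paper:

```python
# One abbreviated, made-up ReAct exemplar. PHIA's real prompt contains 20 full
# hand-written trajectories; this only shows the shape of a demonstration.
EXEMPLAR = """Question: Has my resting heart rate improved over the last two weeks?
Thought: I should compare the mean resting heart rate of the last 14 days with the 14 days before.
Action: python[df.tail(14)["resting_heart_rate"].mean() - df.iloc[-28:-14]["resting_heart_rate"].mean()]
Observation: -2.3
Thought: The recent average is 2.3 bpm lower, which usually points to improving fitness.
Final Answer: Yes, your resting heart rate dropped by about 2.3 bpm compared to the previous two weeks."""

def build_prompt(exemplars, new_question):
    # Demonstrations are simply prepended; the model's weights are never updated.
    return "\n\n".join(exemplars) + f"\n\nQuestion: {new_question}\nThought:"

print(build_prompt([EXEMPLAR], "Do I sleep better after exercising?"))
```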

PHIA is a perfect example of how powerful LLMs become when they are paired with the right scaffolding instead of relying only on internal reasoning.


Limitations of PHIA

PHIA is a huge step forward, especially compared to simple chat-based coaching. But it’s still early-stage. It shows what’s possible when LLMs use tools, but it also exposes how much work is left to make an AI truly trustworthy in personal health.

  • PHIA hasn’t been tested on real people yet, so we still don’t know if it truly helps users change habits or understand their health better in everyday life.
  • It only uses wearable data, and I feel it needs more context—like nutrition, medical records, or mood logs—to give deeper and more holistic insights.
  • PHIA isn’t medically validated, so even though it sounds smart, its recommendations haven’t been checked by doctors or health experts for real accuracy.
  • Personalization still feels limited, because sometimes PHIA gives generic advice even when the user’s data is available right there.
  • Its reasoning depends heavily on prompting and Gemini Ultra, which means its behavior might change across models or updates, making it less consistent.
  • Error handling is better than basic LLMs but still imperfect, since PHIA can misread data columns or fail on messy real-world data.
  • Its toolset is still narrow, and I’d love to see it generate visualizations, analyze other health signals, or track long-term goals instead of just answering one question at a time.

Learnings from the paper

Reading the PHIA paper honestly changed the way I think about building AI systems. Before this, I used to focus mainly on making the model smarter, but PHIA showed me that the real progress comes from giving the model the right tools and the right structure. Seeing how PHIA uses the ReAct loop—think, act, observe—made me realize how important disciplined reasoning is for avoiding hallucinations and building trust.

The use of synthetic data was another eye-opener, because it proved that we can train and evaluate personal-health agents responsibly without touching real user data. On top of that, the way the authors handled safety—650 hours of human review, strict guardrails, and cautious refusal behavior—reminded me that responsible AI isn’t optional; it’s part of the engineering.

More than anything, PHIA taught me that the future of AI isn’t “a bigger model,” but a system where models, tools, data, and reasoning frameworks all work together. As an AI engineer, this shifted how I think about agents: not as chatbots, but as full pipelines designed to solve real problems with accuracy, humility, and safety.

Key Insights I’m Taking With Me

  • PHIA wins because it’s structured, not because the LLM is smarter.
  • Reasoning and architecture beat raw model size.
  • PHIA shines when a question needs data + external knowledge + multi-step logic, not just basic statistics.
  • Its real strength is disciplined reasoning, not raw computation.
  • PHIA’s strict safety guardrails make it trustworthy in a way normal LLMs aren’t.
