<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mitansh Gor</title>
    <description>The latest articles on DEV Community by Mitansh Gor (@mitanshgor).</description>
    <link>https://dev.to/mitanshgor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F999457%2F1e38e750-fe49-48f5-8f41-8b0a1ea57000.png</url>
      <title>DEV Community: Mitansh Gor</title>
      <link>https://dev.to/mitanshgor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mitanshgor"/>
    <language>en</language>
    <item>
      <title>Personal Health Agent (PHA): Multi-Agent Health System</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Thu, 27 Nov 2025 05:33:43 +0000</pubDate>
      <link>https://dev.to/mitanshgor/personal-health-agent-pha-multi-agent-health-system-1i6l</link>
      <guid>https://dev.to/mitanshgor/personal-health-agent-pha-multi-agent-health-system-1i6l</guid>
      <description>&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=cHYuEzxDJb4" rel="noopener noreferrer"&gt;VIDEO LINK&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the last few years, we have seen different types of AI systems try to answer this question:&lt;br&gt;
PH-LLM (2024) → focused on personalized coaching from wearable data.&lt;br&gt;
PHIA (2024) → acted like an agent that analyzes your data, writes code, and explains results.&lt;br&gt;
IR Prediction + IR Explainer Agent (2025) → estimated your metabolic risk and explained it like a doctor.&lt;/p&gt;

&lt;p&gt;All of these systems were important steps. But each one had a weakness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single LLM tries to do everything.&lt;/li&gt;
&lt;li&gt;It must read data, interpret medical meaning, check safety, and give advice.&lt;/li&gt;
&lt;li&gt;This often leads to confident hallucinations — answers that sound correct but are actually wrong.&lt;/li&gt;
&lt;li&gt;It also mixes medical reasoning with casual coaching, which creates risk. And one model cannot be an expert in every health domain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9314ab1b0c2ovlg8crme.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9314ab1b0c2ovlg8crme.png" alt="banner" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the big idea behind the Personal Health Agent (PHA) is simple:&lt;/p&gt;

&lt;p&gt;Instead of one giant model doing everything, why not create multiple smaller agents — each with a role — working together like a mini health team?&lt;br&gt;
This solves the main problem with earlier systems: no single AI can fully handle the complexity of real human health. But a team of specialized agents can.&lt;/p&gt;

&lt;p&gt;Because health is multi-dimensional—where your sleep affects your stress, stress affects metabolic health, exercise influences sleep, and food and biomarkers impact everything—one AI model can’t realistically handle all of it at once. Just like a hospital relies on different specialists, a digital health system also needs different “roles”: one agent to interpret raw data, one agent to check medical safety, and one agent to translate insights into simple coaching. This is why multi-agent systems are becoming so important: they add structure, accountability, and more reliable reasoning, solving many of the issues earlier single-LLM systems struggled with.&lt;/p&gt;

&lt;h2&gt;
  
  
  A System of Specialized Agents
&lt;/h2&gt;

&lt;p&gt;When I look at the PHA system, what stands out first is the Analyst Agent. I think of this agent like a friend who loves numbers and graphs. Its whole job is to read my raw data — sleep hours, steps, heart rate, and other signals — and turn them into clear observations. It doesn’t try to coach me or act like a doctor. It just says, “Here’s what I see happening in your data,” which makes everything feel clean and easy to understand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy58om1b0a4jbjs6ocgvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy58om1b0a4jbjs6ocgvd.png" alt="dsAgent" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then there is the Domain Expert Agent, which to me is the “doctor brain” of the whole system. This agent takes what the Analyst found and checks if it makes medical sense. It worries about safety, accuracy, and whether the explanation aligns with real health science. I really like this part because it stops the system from giving random or misleading advice. It acts like the strict friend who always says, “Wait, is this actually correct?”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4rkm5lh1in07tfqpz67.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4rkm5lh1in07tfqpz67.png" alt="domainExp" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next comes the Coach Agent, the most human-feeling part of PHA. This agent talks to me like a supportive friend who wants me to succeed. It takes all the technical pieces and translates them into simple, everyday advice I can act on. No medical jargon. No complicated charts. Just small, doable suggestions like, “Try walking after dinner” or “Aim for an earlier bedtime tonight.” It’s friendly and practical, which is exactly what people need.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg315ncyax16diti30hvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg315ncyax16diti30hvt.png" alt="healthcoach" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What ties everything together is how these agents work as a team. Instead of one giant AI trying to do everything — and often making confident mistakes — each agent stays in its own lane. One analyzes, one verifies, one coaches. Because of this teamwork, the final answer feels clearer, safer, and much more trustworthy. It’s almost like having a mini health team inside your phone, and that makes the whole system feel both smart and human at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Deep Dive
&lt;/h2&gt;

&lt;p&gt;Even though all agents may start from the same base LLM (like Gemini or GPT-class models), they are trained separately, so each one becomes good at one thing instead of being average at everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finetuning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The Analyst Agent is finetuned on wearable logs (heart rate, steps, sleep), time-series patterns, data summaries, and statistical examples. It becomes good at reading numbers and spotting patterns.&lt;/li&gt;
&lt;li&gt;The Domain Expert needs more serious finetuning data: medical guidelines, curated clinical examples, safe/unsafe reasoning samples, and explanations validated by experts. It learns how to avoid hallucinations and unsafe claims.&lt;/li&gt;
&lt;li&gt;The Coach Agent is the “human” one. It is finetuned on friendly conversational data, lifestyle-advice examples, and behavior-change coaching patterns. It learns to talk in a simple, supportive way.&lt;/li&gt;
&lt;/ul&gt;
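&lt;p&gt;The split above can be pictured as per-agent dataset configs. A minimal, hypothetical sketch (the agent names and data-source labels are my own illustration, not the paper's actual training setup):&lt;/p&gt;

```python
# Hypothetical per-agent fine-tuning configuration: every agent starts from
# the same base model but is specialized on a different dataset mix.
FINETUNE_CONFIGS = {
    "analyst": {
        "base_model": "base-llm",  # e.g. a Gemini/GPT-class model
        "data_sources": ["wearable_logs", "time_series_patterns", "stat_examples"],
    },
    "domain_expert": {
        "base_model": "base-llm",
        "data_sources": ["medical_guidelines", "clinical_examples", "safety_reasoning"],
    },
    "coach": {
        "base_model": "base-llm",
        "data_sources": ["coaching_dialogues", "lifestyle_advice", "behavior_change"],
    },
}

def dataset_for(agent: str) -> list:
    """Return the data sources used to specialize a given agent."""
    return FINETUNE_CONFIGS[agent]["data_sources"]
```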

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxutm9nruja3q39zgfr6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxutm9nruja3q39zgfr6h.png" alt="data set" width="800" height="196"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This division is what makes the final system feel more stable than earlier systems like PHIA or the IR Agent, where a single LLM had to juggle all roles and often made confident mistakes.&lt;/p&gt;

&lt;p&gt;The process itself was simple but very intentional. Here’s how I imagine it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We first collected examples for each agent’s role.&lt;/li&gt;
&lt;li&gt;Then we cleaned and organized everything.&lt;/li&gt;
&lt;li&gt;After that, we fine-tuned each agent separately.&lt;/li&gt;
&lt;li&gt;Then we tested the agents together.&lt;/li&gt;
&lt;li&gt;Finally, we fixed mistakes and improved the dataset.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  System architecture and Data Flow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpizblec7crocjocrwah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpizblec7crocjocrwah.png" alt="system arch" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Underneath all of this, there’s a system architecture that makes their collaboration possible. I picture it like a small command center. The user’s data flows into a main controller, which decides which agent speaks first and who validates whom. The Analyst generates its interpretation, the Domain Expert checks it, and the Coach turns everything into a simple message. Each agent is isolated enough to stay focused, but interconnected enough to build on each other’s work. There’s also a safety layer running quietly in the background — something like a “medical guardrail” — to make sure the final output stays safe, consistent, and responsible before it reaches the user.&lt;/p&gt;
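&lt;p&gt;To make the routing concrete, here is a toy sketch of that analyze, verify, coach, guardrail chain. The agents are stand-in Python functions rather than real LLM calls, and the banned-word guardrail is purely illustrative:&lt;/p&gt;

```python
# Toy version of the PHA command center: fixed routing where each agent
# builds on the previous agent's output. All logic here is a stand-in.

def analyst(data: dict) -> str:
    # Turn raw signals into a plain observation about the data.
    avg = sum(data["sleep_hours"]) / len(data["sleep_hours"])
    return f"Average sleep was {avg:.1f} h."

def domain_expert(observation: str) -> str:
    # Check the observation against medical context before it moves on.
    return observation + " This is within the commonly cited 7-9 h adult range."

def coach(verified: str) -> str:
    # Translate the verified insight into friendly, actionable advice.
    return verified + " Keep your current bedtime routine going!"

def guardrail(message: str) -> str:
    # Final safety layer: block anything that looks like a diagnosis.
    banned = ("diagnose", "prescription")
    if any(word in message.lower() for word in banned):
        raise ValueError("unsafe message blocked")
    return message

def controller(data: dict) -> str:
    # The "command center" wiring: analyze -> verify -> coach -> guardrail.
    return guardrail(coach(domain_expert(analyst(data))))

reply = controller({"sleep_hours": [7.5, 8.0, 6.5]})
```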

&lt;h3&gt;
  
  
  Data Pipeline Flow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;My raw data comes in (wearable data, blood test results, communication history).&lt;/li&gt;
&lt;li&gt;The system cleans and prepares the data.&lt;/li&gt;
&lt;li&gt;The Analyst Agent gets the cleaned data first, and it gives a report.&lt;/li&gt;
&lt;li&gt;The Expert Agent double-checks everything and makes sure the explanation is medically correct and safe.&lt;/li&gt;
&lt;li&gt;The Coach Agent gets the approved explanation and rewrites the message in simple, friendly language.&lt;/li&gt;
&lt;li&gt;A final safety filter reviews the message and checks for risky things like any medical diagnosis, unsafe suggestions, and harmful claims.&lt;/li&gt;
&lt;li&gt;I receive the final output.&lt;/li&gt;
&lt;/ol&gt;
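&lt;p&gt;Step 6 above can be sketched as a last-pass filter. The phrase list is invented for illustration; the real guardrail would be far more sophisticated than string matching:&lt;/p&gt;

```python
# Hypothetical final safety filter: scan the outgoing message for phrases
# that resemble a diagnosis or an unsafe claim before it reaches the user.
RISKY_PHRASES = ["you have diabetes", "stop taking your medication", "guaranteed cure"]

def safety_filter(message: str):
    """Return (is_safe, text); block anything matching a risky phrase."""
    lowered = message.lower()
    for phrase in RISKY_PHRASES:
        if phrase in lowered:
            return False, f"Blocked: contains risky phrase '{phrase}'."
    return True, message

ok, out = safety_filter("Try a short walk after dinner to help your glucose levels.")
bad, why = safety_filter("Your data suggests you have diabetes.")
```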

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspd5g9lkz788hjcvok15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspd5g9lkz788hjcvok15.png" alt="pipeline inference" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Inference time
&lt;/h3&gt;

&lt;p&gt;During inference — which is just a fancy word for “when the system actually responds to me” — the agents talk to each other almost like coworkers passing notes in a group chat. I don’t see any of this, but it’s happening behind the scenes. The Analyst looks at my raw data and says, “Here’s what I think is happening.” The Domain Expert looks at what the Analyst wrote and either approves it or corrects it. And finally, the Coach turns their combined reasoning into something that sounds human and helpful. What I receive as a single answer is actually the teamwork of multiple roles stitched together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools used
&lt;/h3&gt;

&lt;p&gt;To make all of this work, the system relies on a set of tools that help each agent extend its abilities. The Analyst might use data-processing tools to clean and summarize wearable data. The Domain Expert might have access to reference tools that help it compare findings against medical knowledge or validated guidelines. The Coach might use prompt templates designed for empathy, clarity, and motivation. And the main controller uses orchestration tools — something like LangGraph or a custom workflow engine — to decide who talks when and how information flows between them. These tools don’t replace the agents, but they give them structure, context, and external capabilities they wouldn’t have on their own.&lt;/p&gt;
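&lt;p&gt;One way to picture this tooling layer is a small registry where each agent only sees its permitted tools. The tool names and lookup API here are my own illustration, not anything the paper specifies:&lt;/p&gt;

```python
# Hypothetical tool registry: the orchestrator hands each agent only the
# tools relevant to its role at dispatch time.
TOOLBOX = {
    "summarize_wearables": lambda rows: f"{len(rows)} days of data",
    "lookup_guideline": lambda topic: f"guideline text for {topic}",
    "empathy_template": lambda msg: f"I hear you. {msg}",
}

AGENT_TOOLS = {
    "analyst": ["summarize_wearables"],
    "domain_expert": ["lookup_guideline"],
    "coach": ["empathy_template"],
}

def tools_for(agent: str) -> dict:
    """Return only the tools an agent is permitted to use."""
    return {name: TOOLBOX[name] for name in AGENT_TOOLS[agent]}

coach_tools = tools_for("coach")
```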

&lt;h3&gt;
  
  
  Collection of data
&lt;/h3&gt;

&lt;p&gt;Everything starts with a real study they ran called &lt;strong&gt;WEAR-ME&lt;/strong&gt;, which included more than a thousand Fitbit users. People had to explicitly opt-in, sign digital consent, link their wearable device, and even go for a blood draw at a Quest Diagnostics center. So the data wasn’t scraped or assumed—it came from real people who agreed to share it for research. That part mattered to me because it made the whole system feel responsible, not rushed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozrz0zw4prfut4yef2yk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozrz0zw4prfut4yef2yk.png" alt="data1" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once someone joined the study, their data flowed through a very controlled pipeline. Each person ended up with a mix of: (1) wearable time-series, (2) lab biomarkers, (3) self-reported history, and (4) lifestyle questionnaires.&lt;/p&gt;

&lt;p&gt;What I liked is that the system doesn’t directly reach into a live Fitbit account or do anything “real-time.” Instead, the PHA gets access to summaries of the participant’s data, organized into tables—like a daily summary table, an activities table, and even a population-level reference table so the system could compare one user’s values against others in their age group. This gave context, not just raw numbers.&lt;/p&gt;

&lt;p&gt;Inside the actual agent pipeline, each sub-agent only receives the part of the data that’s relevant to its job. The system never dumps the full raw dataset into every agent. Instead, the orchestrator acts like a coordinator who hands the right information to the right specialist. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34ozz316vec63s0dh1mi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34ozz316vec63s0dh1mi.png" alt="data2" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nothing is fetched unless the user asks something that genuinely needs it. And the system also avoids asking the user for things it already knows—like their sleep hours or steps—because the orchestrator checks the available data first.&lt;/p&gt;
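&lt;p&gt;A rough sketch of that routing logic, with invented table and field names: each agent gets only its relevant slice, and the orchestrator checks what it already knows before asking the user anything:&lt;/p&gt;

```python
# Hypothetical user data, organized as summary tables rather than raw streams.
USER_TABLES = {
    "daily_summary": {"sleep_hours": 7.2, "steps": 8400},
    "labs": {"hba1c": 5.4},
    "population_reference": {"sleep_hours_median": 7.0},
}

# Which tables each agent is allowed to see (illustrative mapping).
RELEVANT = {
    "analyst": ["daily_summary", "population_reference"],
    "domain_expert": ["labs"],
}

def slice_for(agent: str) -> dict:
    """Hand an agent only the tables relevant to its job."""
    return {t: USER_TABLES[t] for t in RELEVANT.get(agent, [])}

def needs_to_ask(field: str) -> bool:
    """Only ask the user for a field the system doesn't already have."""
    return not any(field in table for table in USER_TABLES.values())

analyst_view = slice_for("analyst")
```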

&lt;h2&gt;
  
  
  PHA vs Gemini Results
&lt;/h2&gt;

&lt;p&gt;PHA wasn’t just a small improvement. It was in a completely different league.&lt;/p&gt;

&lt;p&gt;From the end-user side, I noticed that people consistently picked PHA as the best experience. Not just once or twice — almost half the time. And this is after comparing it side-by-side against a single Gemini agent and a parallel multi-agent system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsqsilw9r4hl35m3n5xs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsqsilw9r4hl35m3n5xs.png" alt="enduser" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the biggest eye-opener for me was the expert evaluation. Experts didn’t just prefer PHA — they almost abandoned the Gemini single-agent baseline. Only around 4–5% of expert rankings put Gemini in first place, while PHA was chosen as #1 about 80% of the time. That’s not a small margin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yxu0ccew7z9qmoo2mba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yxu0ccew7z9qmoo2mba.png" alt="expUser" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Strengths of the system
&lt;/h2&gt;

&lt;p&gt;What struck me most was how real the system felt. Not just in terms of architecture, but in how thoroughly it was tested. It wasn’t a small demo — it involved thousands of human annotations and hundreds of hours of expert evaluation. Because of that, the strengths and limitations felt very honest.&lt;/p&gt;

&lt;p&gt;One of the biggest strengths I noticed is how well the whole system comes together. The three-agent setup — the data scientist, the domain expert, and the coach — isn’t just theoretical. In the evaluations, PHA actually outperformed both a single-agent system and a parallel multi-agent system. End-users preferred talking to PHA almost half the time, and experts loved it even more, choosing it as the best system in about 80% of cases. To me, that says the collaboration between these agents isn’t just helpful — it genuinely changes the quality of the conversation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbzp9nzsn8e7xq9ytewz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbzp9nzsn8e7xq9ytewz.png" alt="daAdv" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data agent in particular felt like a major upgrade. It was much better at breaking down messy, vague health questions into proper statistical analyses, catching missing data, choosing the right timeframes, and generating correct code. I could see why the system became more trustworthy — it wasn’t guessing. It was reasoning over the user’s actual data. And the domain expert agent made a big difference too. It produced safer, more accurate, and more personalized medical explanations. Users rated its responses as far more trustworthy than the base model, and clinicians preferred its summaries because they were more complete and clinically meaningful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr1mvix1o1lhgkrxff2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr1mvix1o1lhgkrxff2f.png" alt="domainAdv" width="800" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The health coach agent also impressed me. It wasn’t just delivering motivational lines. It followed real coaching principles — active listening, SMART goals, motivational interviewing — and this made conversations feel more natural and supportive. Users were more engaged, and the conversations ended more naturally. It reminded me of talking to a real human coach who’s actually paying attention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpetug6re3d588sbi666a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpetug6re3d588sbi666a.png" alt="coachAdv" width="800" height="722"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a whole system, PHA feels more thoughtful. The orchestrator doesn’t just mix answers together — it understands the user’s goal, assigns the right agent, reflects on the output, and remembers important details for later turns. This gives the conversation a sense of direction and personalization that the baselines just didn’t have.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of the System
&lt;/h2&gt;

&lt;p&gt;But the system isn’t perfect. One thing that stood out immediately was the cost. PHA is slower and more computationally expensive because each request triggers multiple agents and reflection steps. Where the single-agent system might respond in around 35 seconds, PHA could take over 200 seconds. It’s powerful, but not cheap.&lt;/p&gt;

&lt;p&gt;Another limitation is that the diagnostic reasoning, while improved, is still not bulletproof. Sometimes the domain expert agent didn’t complete its reasoning chain, and its reliance on web search occasionally pulled in conflicting information. This reminded me that, even though it acts like a doctor in some ways, it’s not actually one — and the authors make that very clear.&lt;/p&gt;

&lt;p&gt;Bias is also a real concern. The system personalizes advice based on user traits, conditions, and context… which is great, but it also means it can unintentionally repeat patterns or assumptions from datasets that might not represent everyone equally. The paper calls out this risk directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjq9spxcb92u8fqw3qlnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjq9spxcb92u8fqw3qlnw.png" alt="Limitations" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The coach agent, despite being strong overall, also had a weakness: it wasn’t great at tracking user progress over time. It did a good job helping set goals, but didn’t always follow up on them in later turns. For long-term coaching, that’s something they’ll need to improve.&lt;/p&gt;

&lt;p&gt;And finally — maybe the biggest limitation — everything was tested in short-term interactions. We don’t yet know whether PHA can support behavior change over weeks or months, which is where health coaching really matters. The team also notes that this is a research framework, not a clinical tool, and major regulatory work would be needed before deploying anything like this in the real world.&lt;/p&gt;

&lt;p&gt;So when I put it all together, here’s how I personally see it:&lt;br&gt;
PHA is a huge improvement over past systems — smarter, safer, more human, and more useful — but it’s still early. It has real strengths, but also real gaps. It’s powerful as an idea, promising as a system, and clearly built with care. But it also reminds me that great AI doesn’t replace medical care — it supports it, and it still has a lot of growing up to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;After reading the PHA paper, the biggest thing I realized is that this system isn’t just an AI answering health questions — it feels more like a small team working behind the scenes to help you make sense of your own data. The evaluations show that this multi-agent design really does lead to better, clearer, and more personalized guidance compared to older single-agent approaches.&lt;/p&gt;

&lt;p&gt;At the same time, the authors are very honest about their limits. The system can still make mistakes, it can be biased depending on the data, and it should never replace real medical care. It’s powerful, but it’s also early — more like a research blueprint than a ready product.&lt;/p&gt;

&lt;p&gt;To me, PHA represents a hopeful direction: AI that supports people in understanding their health, not by taking over, but by helping them make smarter choices with empathy, safety, and transparency in mind.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>health</category>
      <category>google</category>
    </item>
    <item>
      <title>PHIA: The Agentic LLM That Writes Code, Analyzes Your Data &amp; Explains Health Insights</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Thu, 27 Nov 2025 05:33:34 +0000</pubDate>
      <link>https://dev.to/mitanshgor/phia-the-agentic-llm-that-writes-code-analyzes-your-data-explains-health-insights-4g1a</link>
      <guid>https://dev.to/mitanshgor/phia-the-agentic-llm-that-writes-code-analyzes-your-data-explains-health-insights-4g1a</guid>
      <description>&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=UR2MTLJ1d4E" rel="noopener noreferrer"&gt;VIDEO LINK&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I honestly felt like I was looking at the “next version” of what health AI should be. Whenever I asked myself questions like, “Is my sleep improving?” or “Does exercising at night change my deep sleep?”, I realized how hard it is to answer these without real analysis.&lt;/p&gt;

&lt;p&gt;That is exactly the problem the PHIA team talks about on page 1 of the paper. They explain that even a simple question like “Do I sleep better after exercising?” requires many steps: checking recent data, comparing different days, calculating metrics, and then interpreting everything in the context of what “healthy” even means. And honestly, I could relate — that’s the kind of analysis I never do on my own.&lt;br&gt;
What really caught my attention is that the paper says today’s LLMs struggle with numerical reasoning, meaning they often miscalculate or oversimplify things. I’ve definitely seen models do that — giving a confident answer but completely messing up basic math. (Limitation of PH-LLM)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijr3r0wiwgfamfnpd0dv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijr3r0wiwgfamfnpd0dv.png" alt="int" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So PHIA tries to fix this problem by giving an LLM the ability to plan, use tools, write code, and search the web. It doesn’t just “chat.” It becomes more like a small data analyst that works with your wearable data step-by-step.&lt;/p&gt;

&lt;h2&gt;
  
  
  PH-LLM vs PHIA
&lt;/h2&gt;

&lt;p&gt;PH-LLM was an earlier coaching system that worked only with simple, pre-aggregated 30-day summaries and relied purely on the LLM’s internal reasoning. That meant it couldn’t actually analyze detailed daily wearable data, couldn’t run calculations, and completely failed on questions that required numbers or step-by-step reasoning. PHIA addresses these limitations by granting the LLM three new capabilities: (1) it can generate and run real Python code to analyze raw wearable data, (2) it can plan and break down tasks using an agent loop, and (3) it can search the web to incorporate fresh, verified health knowledge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8xekitn7246fd5p1e9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8xekitn7246fd5p1e9k.png" alt="difference" width="800" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because of this, PHIA shifts from being “just a health coach” to becoming a true data analyst + coach, delivering insights that are more accurate, personalized, and grounded in both the user’s data and real health science. PHIA feels like the bridge between “AI as a friendly coach” and “AI as a personal data scientist.” By mixing reasoning, code execution, and web search, it finally unlocks the kind of deeper, more accurate insights that wearables have always had the potential to provide.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqkihmir6plcn25xlyp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqkihmir6plcn25xlyp4.png" alt="diff" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How PHIA Works
&lt;/h2&gt;

&lt;p&gt;PHIA works more like a mini health data analyst that can think step-by-step, write code, check your wearable data, search the web, and then explain everything in normal language. The PHIA paper shows this clearly — PHIA literally cycles through think → act → observe, just like a human analyst would.&lt;/p&gt;

&lt;p&gt;I’ll explain this in the simple way I understood it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9fqs7m9hl4jtzdwrc61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9fqs7m9hl4jtzdwrc61.png" alt="ReAct" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of guessing answers, PHIA acts more like a small data analyst that plans its steps, checks your wearable data, and only then explains what it found. The paper describes this using the ReAct loop — &lt;strong&gt;Thought → Action → Observation&lt;/strong&gt; — and seeing it in action helped me understand why PHIA is so different.&lt;/p&gt;

&lt;p&gt;PHIA starts by thinking through the question. If I ask something like “Is my resting heart rate improving?”, it doesn’t jump to a quick reply. It pauses and decides what it needs—maybe comparing two weeks of data or calculating averages. Then, PHIA takes action by writing and running Python code in a safe &lt;strong&gt;virtual environment tool&lt;/strong&gt;. This lets it analyze real daily time-series data using Pandas, just like a real data scientist. The nice part is that this removes mathematical mistakes that normal LLMs often make. And if its code crashes, PHIA actually fixes the error and tries again (&lt;u&gt;recurrent trials system&lt;/u&gt;), which the paper highlights as one of its strengths.&lt;/p&gt;
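&lt;p&gt;To make this concrete, here is a tiny sketch (my own toy code, not from the paper) of what one Thought → Action → Observation cycle with retry-on-error could look like. The data, the &lt;code&gt;repair&lt;/code&gt; helper, and the column names are all invented for illustration:&lt;/p&gt;

```python
import pandas as pd

# Toy wearable log: daily resting heart rate over two weeks (assumed schema).
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "resting_hr": [62, 63, 61, 64, 62, 61, 60, 59, 60, 58, 59, 57, 58, 57],
})

def repair(code, err):
    # Stand-in for an LLM call that reads the traceback and fixes the bug;
    # here it just returns the code unchanged.
    return code

def run_action(code, env, max_trials=3):
    """Act step: execute model-written analysis code; on error, feed the
    failure back so the next trial can fix it (the 'recurrent trials' idea)."""
    for trial in range(max_trials):
        try:
            exec(code, env)              # Action
            return env["observation"]    # Observation
        except Exception as err:
            code = repair(code, err)     # Thought about the error, then retry
    raise RuntimeError("analysis failed after retries")

# Thought: "Is resting HR improving? Compare the two week-long windows."
action = (
    "week1 = df['resting_hr'].iloc[:7].mean()\n"
    "week2 = df['resting_hr'].iloc[7:].mean()\n"
    "observation = round(week1 - week2, 2)\n"
)
drop = run_action(action, {"df": df})
print(f"Resting HR fell by {drop} bpm week-over-week")
```

&lt;p&gt;In the real system the Action code is generated by Gemini and the repair step is another model call that reads the traceback; the point is simply that execution errors feed back into the loop instead of ending it.&lt;/p&gt;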

&lt;p&gt;In their evaluation, PHIA scored &lt;strong&gt;84% accuracy&lt;/strong&gt; on objective questions, way higher than simple LLM reasoning. This happens because PHIA can recover when the code fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5p6ma2p9tzg1o75gjizq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5p6ma2p9tzg1o75gjizq.png" alt="ReAct Arch" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a question requires more than numbers—like understanding recommended sleep hours—PHIA uses a built-in web &lt;strong&gt;search tool&lt;/strong&gt; to fetch information from trusted, verified sources. Sometimes the user needs context, like: “Is this amount of sleep normal for my age?” “What workouts improve resting heart rate?” “Is my stress score healthy?”&lt;br&gt;
This mixing of data + domain knowledge is what makes PHIA feel smart and practical.&lt;/p&gt;

&lt;p&gt;What impressed me most is how PHIA blends everything together. It doesn’t just give numbers or copy facts. It calculates, it checks, it researches, and then it explains the result in clear language. The examples in the paper, like &lt;u&gt;comparing a user’s sleep with national guidelines&lt;/u&gt;, really show how these steps come together. In the end, PHIA feels less like a chatbot and more like someone doing careful, step-by-step reasoning to help you understand your own health patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Deep Dive Into PHIA’s Technical Architecture
&lt;/h2&gt;

&lt;p&gt;PHIA is not a fine-tuned model. It’s an agent framework built around &lt;strong&gt;&lt;u&gt;Gemini 1.0 Ultra&lt;/u&gt;&lt;/strong&gt;, with two major tools: (1) a Python data-analysis runtime and (2) a web search tool.&lt;/p&gt;

&lt;p&gt;These tools are orchestrated using the ReAct (Reason + Act) agent pattern, which is why PHIA can do multi-step reasoning without fine-tuning. This design gives PHIA abilities that raw LLMs don’t have: planning, correction, stepwise attention, and grounded analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  No Fine-Tuning
&lt;/h3&gt;

&lt;p&gt;One of the most interesting things I learned from the paper is that PHIA is not fine-tuned like PH-LLM. Instead of modifying the model’s weights, the authors built a smart scaffolding around the model and taught it how to act like an agent.&lt;/p&gt;

&lt;p&gt;Fine-tuning a huge LLM like Gemini Ultra is expensive and slow, requires tons of supervised data, and is risky (it can break general reasoning ability).&lt;/p&gt;

&lt;p&gt;So instead, PHIA keeps the base model untouched and teaches the model how it should “think” and use tools using few-shot ReAct examples.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is a totally different philosophy: PH-LLM → change the model. PHIA → change the system around the model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Process
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The authors started with thousands of wearable-related questions (objective + open-ended). They needed a tiny set of example tasks that could teach the model how to behave like an agent.&lt;/li&gt;
&lt;li&gt;Instead of randomly picking examples, they converted every question into numerical embeddings using Sentence-T5. This turns each question into a vector representing its meaning, so similar questions sit close together in vector space.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpyn3yfndl2r7mcb4vox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpyn3yfndl2r7mcb4vox.png" alt="Process" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They ran k-means clustering (k=20) on these embeddings to automatically group similar questions into 20 clusters. Each cluster represents a whole “type” of question, like sleep trends, workout comparisons, anomaly detection, correlations, and so on.&lt;/li&gt;
&lt;li&gt;From each cluster, they selected the most central question — the one closest to the cluster centroid. This gives 20 representative queries, each standing in for a larger family of similar queries.&lt;/li&gt;
&lt;li&gt;For each of these 20 representative queries, the team manually wrote a complete ReAct agent trajectory that included every step of reasoning:

&lt;ul&gt;
&lt;li&gt;Thought: what the model should plan to do&lt;/li&gt;
&lt;li&gt;Action (Python): the exact Pandas code needed&lt;/li&gt;
&lt;li&gt;Observation: the real output that the code would return&lt;/li&gt;
&lt;li&gt;Thought again: interpreting the output&lt;/li&gt;
&lt;li&gt;Action (Search): when domain knowledge is needed&lt;/li&gt;
&lt;li&gt;Observation: the retrieved information&lt;/li&gt;
&lt;li&gt;Final Answer: a clear, natural-language explanation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren’t partial examples — they are full step-by-step walkthroughs that demonstrate how an agent should behave.&lt;/p&gt;
&lt;/li&gt;

&lt;li&gt;These 20 full trajectories were inserted into PHIA’s few-shot prompt, acting as demonstrations inside its system instructions. So the model is not fine-tuned — instead, it is taught by example how to: plan before acting, decide when to run code, decide when to search the web, fix code errors, combine numerical results with domain knowledge, and produce a final personalized insight.&lt;/li&gt;

&lt;/ul&gt;
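
&lt;p&gt;The selection step above is easy to sketch. Here is a toy version in which random vectors stand in for the Sentence-T5 embeddings and scikit-learn’s k-means stands in for whatever clustering stack the authors actually used:&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder embeddings standing in for Sentence-T5 vectors of the question
# bank (the paper embeds thousands of wearable questions; we simulate 200
# questions with 32-dimensional vectors).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))

k = 20
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# For each cluster, pick the question closest to the centroid: these 20
# indices become the queries whose ReAct trajectories are hand-written.
representatives = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
    representatives.append(int(members[dists.argmin()]))

print(sorted(representatives))
```

&lt;p&gt;Each index in &lt;code&gt;representatives&lt;/code&gt; plays the role of one of the 20 queries that gets a full hand-annotated trajectory.&lt;/p&gt;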

&lt;blockquote&gt;
&lt;p&gt;This is called “behavior cloning via prompting”: the model learns the pattern of behavior from the examples, even without changing its internal parameters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;PHIA is a perfect example of how powerful LLMs become when they are paired with the right scaffolding instead of relying only on internal reasoning. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpezd2e5c7ozuk7p1ljgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpezd2e5c7ozuk7p1ljgh.png" alt="graph" width="800" height="696"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of PHIA
&lt;/h2&gt;

&lt;p&gt;PHIA is a huge step forward, especially compared to simple chat-based coaching. But it’s still early-stage. It shows what’s possible when LLMs use tools, but it also exposes how much work is left to make an AI truly trustworthy in personal health.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PHIA hasn’t been tested on real people yet, so we still don’t know if it truly helps users change habits or understand their health better in everyday life.&lt;/li&gt;
&lt;li&gt;It only uses wearable data, and I feel it needs more context—like nutrition, medical records, or mood logs—to give deeper and more holistic insights.&lt;/li&gt;
&lt;li&gt;PHIA isn’t medically validated, so even though it sounds smart, its recommendations haven’t been checked by doctors or health experts for real accuracy.&lt;/li&gt;
&lt;li&gt;Personalization still feels limited, because sometimes PHIA gives generic advice even when the user’s data is available right there.&lt;/li&gt;
&lt;li&gt;Its reasoning depends heavily on prompting and Gemini Ultra, which means its behavior might change across models or updates, making it less consistent.&lt;/li&gt;
&lt;li&gt;Error handling is better than basic LLMs but still imperfect, since PHIA can misread data columns or fail on messy real-world data.&lt;/li&gt;
&lt;li&gt;Its toolset is still narrow, and I’d love to see it generate visualizations, analyze other health signals, or track long-term goals instead of just answering one question at a time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Learnings from the paper
&lt;/h2&gt;

&lt;p&gt;Reading the PHIA paper honestly changed the way I think about building AI systems. Before this, I used to focus mainly on making the model smarter, but PHIA showed me that the real progress comes from giving the model the right tools and the right structure. Seeing how PHIA uses the ReAct loop—think, act, observe—made me realize how important disciplined reasoning is for avoiding hallucinations and building trust.&lt;/p&gt;

&lt;p&gt;The use of synthetic data was another eye-opener, because it proved that we can train and evaluate personal-health agents responsibly without touching real user data. On top of that, the way the authors handled safety—650 hours of human review, strict guardrails, and cautious refusal behavior—reminded me that responsible AI isn’t optional; it’s part of the engineering.&lt;/p&gt;

&lt;p&gt;More than anything, PHIA taught me that the future of AI isn’t “a bigger model,” but a system where models, tools, data, and reasoning frameworks all work together. As an AI engineer, this shifted how I think about agents: not as chatbots, but as full pipelines designed to solve real problems with accuracy, humility, and safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Insights I’m Taking With Me
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;PHIA wins because it’s structured, not because the LLM is smarter.&lt;/li&gt;
&lt;li&gt;Reasoning and architecture beat raw model size.&lt;/li&gt;
&lt;li&gt;PHIA shines when a question needs data + external knowledge + multi-step logic, not just basic statistics.&lt;/li&gt;
&lt;li&gt;Its real strength is disciplined reasoning, not raw computation.&lt;/li&gt;
&lt;li&gt;PHIA’s strict safety guardrails make it trustworthy in a way normal LLMs aren’t.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>google</category>
      <category>health</category>
    </item>
    <item>
      <title>PH-LLM - A LLM that gives personalized sleep and fitness coaching using wearable data.</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Thu, 27 Nov 2025 05:32:57 +0000</pubDate>
      <link>https://dev.to/mitanshgor/ph-llm-a-llm-that-gives-personalized-sleep-and-fitness-coaching-using-wearable-data-2m7f</link>
      <guid>https://dev.to/mitanshgor/ph-llm-a-llm-that-gives-personalized-sleep-and-fitness-coaching-using-wearable-data-2m7f</guid>
      <description>&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=G-n7BMqxJqk" rel="noopener noreferrer"&gt;VIDEO LINK&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PH-LLM (Personal Health Large Language Model) is a version of Gemini Ultra that has been fine-tuned specifically for sleep and fitness coaching. It isn’t just chatting—it actually learns from up to 30 days of wearable data, understands your patterns, and gives expert-level insights. The research paper shows PH-LLM scoring 79% on sleep medicine exams and 88% on fitness exams, which is on par with or better than the expert groups they tested against.&lt;/p&gt;

&lt;p&gt;What really caught my interest is that the system can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;analyze daily-resolution sensor data&lt;/li&gt;
&lt;li&gt;generate personalized insights&lt;/li&gt;
&lt;li&gt;predict how rested or tired you will feel based only on wearable data&lt;/li&gt;
&lt;li&gt;speak to you like a sleep coach or fitness expert&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5efo1s358xmkr5lfp16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5efo1s358xmkr5lfp16.png" alt="system can solve 3 tasks" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Problem PH-LLM Tries to Solve
&lt;/h2&gt;

&lt;p&gt;Wearables track you, but they don’t talk to you. They tell you what happened, but not why it happened or how to fix it. PH-LLM aims to act like a personal health coach — one that understands your data, your patterns, your goals, and the science behind all of it.&lt;/p&gt;

&lt;p&gt;Instead of treating sensor data like random numbers on a dashboard, it tries to interpret them the way a human expert would. If your bedtime has been drifting later over the past two weeks, or your deep sleep dropped right after you increased workout intensity, PH-LLM doesn’t just point it out — it explains why it happened and how it connects to your goals. It transforms passive metrics into an actual conversation about your habits.&lt;/p&gt;

&lt;p&gt;The system was trained specifically for this kind of reasoning. The researchers fine-tuned Gemini Ultra so it could combine textbook knowledge with real-world wearable data. Interestingly, PH-LLM doesn’t just “sound smart.” In testing, it actually outperformed sleep experts on sleep medicine exam questions and matched them in fitness knowledge. That means the model isn’t just giving generic advice — it has a near-expert understanding of the underlying science.&lt;/p&gt;

&lt;p&gt;But what really makes PH-LLM different is that it doesn’t stop at general knowledge. It looks at a person’s actual patterns over weeks. If you always sleep well on weekdays but crash on weekends, it notices. If your HRV is dropping while your workouts are getting harder, it connects the dots. In one of the paper’s examples, the model realized a user had a very regular sleep schedule but consistently slept too little, and it suggested shifting bedtime by small increments rather than giving a one-size-fits-all rule. That’s the kind of nuance real coaches provide, but most apps don’t.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-level Working of the PH-LLM
&lt;/h2&gt;

&lt;p&gt;Researchers took Gemini Ultra — a very capable general-purpose model — and taught it how to understand sleep and fitness the same way a human expert would. They basically turned a large language model into a personal health specialist.&lt;/p&gt;

&lt;p&gt;The process happened in two major steps. First, they fine-tuned the entire Gemini Ultra model on hundreds of detailed sleep and fitness case studies. These weren’t made-up examples — each case study was based on real wearable data from real people. The case studies included up to thirty days of information such as bedtimes, wake times, restlessness, workout intensity, heart rate metrics, and more. Alongside the data, sleep physicians and athletic trainers wrote expert-level insights, possible causes, and recommendations. These human-written explanations became the “teacher examples” that the model learned from. By imitating these expert responses over and over, PH-LLM learned how to talk like a coach and reason like one too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy98wa021cc4x12jo4c5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy98wa021cc4x12jo4c5a.png" alt="case study" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, the researchers added a second layer of training using something called a multimodal adapter. This part fascinated me because it let the model go beyond simple text input. Instead of only reading written summaries of the user’s data, PH-LLM also receives a compressed representation of the actual sensor values — the statistical patterns hidden in daily heart rate, HRV, sleep duration, respiratory rate, and other signals across fifteen days. These adapter-generated “soft tokens” get injected directly into the LLM’s internal understanding. In simple terms, the model doesn’t just read your data — it absorbs it. This is how PH-LLM can predict things like “Will this person feel tired tomorrow?” with accuracy on par with traditional machine learning approaches.&lt;/p&gt;

&lt;p&gt;The result of these two training phases is a model that can look at your daily metrics and instantly form a holistic picture. If your bedtime keeps drifting later, PH-LLM notices. If your deep sleep drops right after your workout intensity spikes, it picks up the relationship. If your HRV has been declining all week, it connects that to stress, recovery, or the need for rest. What impressed me is that PH-LLM isn’t just matching patterns — it can articulate the reasoning behind them, almost like an expert thinking out loud.&lt;/p&gt;

&lt;p&gt;The paper shows a great example of this. For one user, the model pointed out that the midsleep point was extremely consistent — meaning their circadian rhythm was stable — but their total sleep time was consistently too low. From there, PH-LLM suggested a gradual shift in bedtime over several days. This wasn’t a generic “try to sleep more” tip; it was a specific plan tailored to what the data actually showed. That kind of reasoning is the whole point of the system.&lt;/p&gt;

&lt;p&gt;Another thing I appreciated is that PH-LLM adapts its responses based on how much data it has. When the researchers removed parts of the input — like today’s sleep metrics or the last week of workout logs — PH-LLM still adjusted its explanations in a sensible way. To me, this shows that the model doesn’t rely on memorized patterns but actually understands the structure of sleep and fitness behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Working
&lt;/h2&gt;

&lt;p&gt;The foundation of PH-LLM is Gemini Ultra 1.0, Google’s flagship multimodal model. In this work, it functions mainly as a text LLM (the vision-side is not used). Structurally, it’s a Transformer decoder with extremely large context and token embedding dimensions. This gives it the capacity needed to reason across long-form case studies, sleep explanations, and multi-step fitness logic.&lt;/p&gt;

&lt;p&gt;The base model (Gemini Ultra 1.0) already has strong performance in medical question answering and general health reasoning. But by itself, it doesn’t know how to interpret raw wearable data — it needs domain-specific training.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finetuning
&lt;/h3&gt;

&lt;p&gt;The first major training step is a full-parameter finetuning of Gemini Ultra on 857 expert-annotated sleep and fitness case studies. Each case study includes: (1) Up to 30 days of daily metrics, (2) Aggregated (mean, variance, percentiles) statistics, (3) Expert-written insights, etiologies, and recommendations. This dataset is unique because each example combines real sensor patterns with expert-level reasoning. When the researchers fine-tuned the model, they essentially taught it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When you see patterns like this in the data, here’s how a real sleep doctor or athletic trainer explains it.”&lt;/p&gt;
&lt;/blockquote&gt;
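
&lt;p&gt;The aggregated statistics in (2) are plain descriptive stats over the daily table. A toy Pandas version, with invented columns standing in for the real Fitbit schema, might look like this:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy 30-day wearable case study (assumed columns; the real schema
# comes from Fitbit daily metrics).
rng = np.random.default_rng(1)
days = pd.DataFrame({
    "sleep_minutes": rng.normal(420, 30, 30).round(),
    "deep_sleep_minutes": rng.normal(80, 12, 30).round(),
    "resting_hr": rng.normal(60, 2, 30).round(1),
})

# Aggregated statistics like those paired with each case study:
# per-metric mean, variance, and a percentile summary.
summary = days.agg(["mean", "var"]).T
summary["p25"] = days.quantile(0.25)
summary["p75"] = days.quantile(0.75)
print(summary.round(1))
```

&lt;p&gt;Pairing a table like &lt;code&gt;summary&lt;/code&gt; with the raw daily rows gives roughly the numeric half of one case study; the expert-written insights are the other half.&lt;/p&gt;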

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r28f6w5d75qwklf8o5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2r28f6w5d75qwklf8o5e.png" alt="Fine Tuned" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To teach Gemini Ultra how to behave like a real sleep and fitness coach, the researchers fine-tuned the entire model—not just a small part of it—using a large collection of expert-written examples. Each example paired a user’s real wearable data with the exact explanation a sleep doctor or athletic trainer would give. There were about thirteen hundred of these pairs for sleep and fifteen hundred for fitness, and the model was trained on them over roughly fifteen hundred optimization steps. Instead of using shortcuts like LoRA, they updated all of the model’s weights directly, following a smooth cosine schedule for the learning rate so the model gradually stabilized as it learned. After this full training process, the base Gemini model effectively “became” PH-LLM: an LLM that now understands how to read multi-day sensor patterns and talk like a personal health expert.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multimodal Adapter for Sensor Data
&lt;/h3&gt;

&lt;p&gt;Once PH-LLM can generate coaching advice, the second challenge is enabling it to interpret raw numerical sensor data for prediction tasks (like estimating how tired someone will feel).&lt;/p&gt;

&lt;p&gt;To do this, the researchers added a custom MLP-based multimodal adapter. This is the most “technical” part of the architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvb7hndcteg79bo0a1qgn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvb7hndcteg79bo0a1qgn.png" alt="Multimodal Adapter" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How the adapter works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For each of the 20 wearable sensor signals (HRV, resting HR, sleep duration, etc.), the system collects 15 days of values.&lt;/li&gt;
&lt;li&gt;It computes standardized mean and variance for each signal.&lt;/li&gt;
&lt;li&gt;These 40 numbers (20 means + 20 variances) feed into a multi-layer perceptron (MLP):

&lt;ul&gt;
&lt;li&gt;Input: 40&lt;/li&gt;
&lt;li&gt;Hidden layers: 1024 → 4096 → 1024 (ReLU activations)&lt;/li&gt;
&lt;li&gt;Output: 4 “soft tokens,” each of size 14,336 (the embedding size of PH-LLM).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;These 4 soft tokens are prepended to the text input as if they were real tokens.&lt;/li&gt;

&lt;li&gt;The LLM then processes the numerical data inside its own embedding space, letting its reasoning layers combine subjective sleep patterns with wearable readings.&lt;/li&gt;

&lt;/ul&gt;
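
&lt;p&gt;Here is a minimal NumPy sketch of that adapter’s forward pass. The weights are untrained stand-ins, and I shrank the output embedding size from 14,336 to 64 so the demo stays small; only the layer shapes and the 40-feature input mirror the paper:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# 15 days x 20 wearable signals (HRV, resting HR, sleep duration, ...).
sensor = rng.normal(size=(15, 20))

# Standardized per-signal mean and variance: 20 + 20 = 40 adapter inputs.
feats = np.concatenate([sensor.mean(axis=0), sensor.var(axis=0)])  # shape (40,)

def relu(x):
    return np.maximum(x, 0.0)

n_tokens, embed_dim = 4, 64  # paper uses embed_dim = 14,336; shrunk for the demo
dims = [40, 1024, 4096, 1024, n_tokens * embed_dim]
weights = [rng.normal(scale=0.01, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]

# Hidden layers 40 -> 1024 -> 4096 -> 1024 with ReLU, then a linear output
# layer whose result is reshaped into the soft tokens.
h = feats
for w in weights[:-1]:
    h = relu(h @ w)
soft_tokens = (h @ weights[-1]).reshape(n_tokens, embed_dim)

# These rows are prepended to the text-token embeddings before the LLM runs.
print(soft_tokens.shape)
```

&lt;p&gt;In the real system the adapter is trained so that these rows land in a region of the embedding space the LLM can attend to like ordinary tokens.&lt;/p&gt;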

&lt;blockquote&gt;
&lt;p&gt;The LLM never sees raw numbers — it sees learned embeddings representing the person’s physiological state. This allows PH-LLM to achieve machine-learning level accuracy in predicting subjective sleep outcomes, without needing a separate ML pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Input Format and Context Handling
&lt;/h3&gt;

&lt;p&gt;The system uses two kinds of input representations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Textual Representations

&lt;ul&gt;
&lt;li&gt;Daily tables written out in text&lt;/li&gt;
&lt;li&gt;Time ranges (bedtime, sleep duration)&lt;/li&gt;
&lt;li&gt;Percentile comparisons&lt;/li&gt;
&lt;li&gt;Metric summaries&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Soft Token Representation via Adapter

&lt;ul&gt;
&lt;li&gt;Encodes underlying numerical structure&lt;/li&gt;
&lt;li&gt;Injected directly into the model’s attention layers&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyollwpfqcdqvz3ggyqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwyollwpfqcdqvz3ggyqf.png" alt="input format" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Because PH-LLM was trained on long, structured case studies, it naturally handles: multi-day patterns, variations in missing data, differences in available context.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Output Structure
&lt;/h3&gt;

&lt;p&gt;PH-LLM produces multi-part responses in the same structure experts use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insights: Patterns detected in the data&lt;/li&gt;
&lt;li&gt;Etiology: Possible causes based on sleep medicine frameworks (like RU-SATED)&lt;/li&gt;
&lt;li&gt;Recommendations: Personalized, SMART-style advice&lt;/li&gt;
&lt;li&gt;Readiness Scoring (Fitness): Evaluation of fatigue, HRV trends, and recovery loads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Full7svmjpuvkmtlv0ek4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Full7svmjpuvkmtlv0ek4.png" alt="output structure" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Because the model was trained on structured expert templates, it learns to produce cohesive, medically-grounded narratives instead of generic advice.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Evaluation Architecture
&lt;/h3&gt;

&lt;p&gt;The team built a secondary system called &lt;strong&gt;AutoEval&lt;/strong&gt;, which is an LLM finetuned to grade PH-LLM’s responses against expert criteria. This created an automated loop for model validation, enabling: fast benchmarking, ablation studies, large-scale quality scoring.&lt;br&gt;
AutoEval itself is built using Gemini Pro with LoRA fine-tuning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflzsbauapv7l5oq46m0x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflzsbauapv7l5oq46m0x.png" alt="Evaluation Arch" width="800" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Strengths of the system
&lt;/h2&gt;

&lt;p&gt;What I like most about PH-LLM is how personal it feels. Instead of giving generic sleep or fitness advice, it actually looks at your patterns over weeks and speaks to you the way a real coach would. When it notices your bedtime shifting or your deep sleep dropping after intense workouts, it doesn’t just state the numbers—it explains what those changes mean and why they matter.&lt;/p&gt;

&lt;p&gt;The system’s expert-level reasoning also stands out. In the paper, PH-LLM &lt;strong&gt;performs as well as or better than trained professionals&lt;/strong&gt; on board-style sleep and fitness exams, scoring &lt;u&gt;79% in sleep medicine&lt;/u&gt; and &lt;u&gt;88% in fitness&lt;/u&gt;. This gives its recommendations a level of credibility that most consumer health apps don’t have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1yk9ktks5jjb3gl55nj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1yk9ktks5jjb3gl55nj.png" alt="trained professionals vs PHLLM" width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another strength is how smoothly it blends wearable data with human-like reasoning. Using the multimodal adapter, PH-LLM can predict subjective feelings like tiredness or restfulness based purely on sensor trends, something even many traditional ML models struggle with.&lt;/p&gt;

&lt;p&gt;Finally, PH-LLM adapts well when information is missing or incomplete. The paper shows that when certain sleep or workout metrics are removed, the model still adjusts its reasoning instead of failing outright. That flexibility makes it feel more intelligent and reliable, almost like it truly understands how sleep and fitness behaviors change from day to day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;The biggest one, in my opinion, is its dependence on the quality of the wearable data itself. If your device misreads your sleep stages or your heart rate jumps around because the watch wasn’t snug, PH-LLM will still try to interpret that noise as if it’s meaningful. The paper even points out that the model sometimes references data incorrectly or forms conclusions that don’t perfectly match the input — small confabulations that become more noticeable when the data is messy.&lt;/p&gt;

&lt;p&gt;Another issue is that the model sometimes struggles with consistency when giving recommendations. For sleep insights, the fine-tuning helped a lot, but for fitness coaching the improvements were smaller, and in certain sections like “training load,” PH-LLM actually performed worse than the base Gemini model and human experts.&lt;strong&gt;&lt;u&gt; That tells me the model doesn’t fully grasp every fitness scenario as deeply as it understands sleep patterns.&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The paper itself admits there were demographic skews — more middle-aged users, fewer younger or older participants, and no information about race or ethnicity. That means the &lt;strong&gt;evaluation may not reflect how well the model performs across more diverse populations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And finally, there’s the broader limitation that PH-LLM, no matter how smart it sounds, is not a medical device. It can give coaching-style suggestions, but it isn’t validated for clinical decision-making. Sometimes its tone feels authoritative, which can make the advice sound more medically precise than it actually is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprfuqfo9k4xwc0ftalxb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fprfuqfo9k4xwc0ftalxb.png" alt="summarize" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What impressed me most is the model’s ability to turn a messy week of habits into a simple, actionable story. It doesn’t lecture or overwhelm. It just explains what’s going on and nudges you in the right direction. At the same time, it’s important to remember that PH-LLM isn’t a medical system. It still makes small mistakes, relies heavily on the data it sees, and carries the biases of the Fitbit-dominated population it was trained on.&lt;/p&gt;

&lt;p&gt;But the core idea feels powerful. PH-LLM represents a shift from “apps that measure you” to “systems that understand you.” This paper is a glimpse of where personal health AI is heading, and it sets a strong foundation for the next models I’ll be reviewing in this series. As I move into PHIA, the IR Explainer Agent, and later multi-agent systems, I can already see how all these ideas start to connect. PH-LLM feels like the first major building block in creating an AI that doesn’t just track your health—but helps you improve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Data + code: &lt;a href="https://github.com/google-health/consumer-health-research" rel="noopener noreferrer"&gt;https://github.com/google-health/consumer-health-research&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;NSCA: &lt;a href="https://www.nsca.com/certification/cscs/certified-strength-and-conditioning-specialist-exam-description" rel="noopener noreferrer"&gt;https://www.nsca.com/certification/cscs/certified-strength-and-conditioning-specialist-exam-description&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Healthcare Board Exam: &lt;a href="https://www.boardvitals.com/" rel="noopener noreferrer"&gt;https://www.boardvitals.com/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>google</category>
      <category>ai</category>
      <category>health</category>
      <category>llm</category>
    </item>
    <item>
      <title>AI for Personal Health — My Review, Reflections &amp; Real-Life Understanding</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Thu, 27 Nov 2025 05:32:31 +0000</pubDate>
      <link>https://dev.to/mitanshgor/ai-for-personal-health-my-review-reflections-real-life-understanding-12kh</link>
      <guid>https://dev.to/mitanshgor/ai-for-personal-health-my-review-reflections-real-life-understanding-12kh</guid>
      <description>&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=0JXDsAcQ6JU" rel="noopener noreferrer"&gt;VIDEO LINK&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I’ve been using wearables like Fitbit and Pixel Watch for a while, and I’ve always had the same question in the back of my mind:&lt;br&gt;
&lt;strong&gt;“These devices collect so much data… but what does it actually mean for my health?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I could see my steps, sleep score, heart rate, all the usual numbers — but I didn’t really know how to connect them to real insights about my body. That curiosity is what pulled me into this world of AI for personal health.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2fxpfyi8zq3goovtaoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2fxpfyi8zq3goovtaoh.png" alt="i am personal coach" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As I started reading research papers and experimenting with data, I realized something important:&lt;br&gt;
&lt;strong&gt;AI is slowly becoming the “missing link” between raw wearable data and actual, useful guidance.&lt;/strong&gt;&lt;br&gt;
It can interpret patterns, explain them in simple language, and sometimes even coach you like a personal guide.&lt;/p&gt;

&lt;p&gt;This blog series is my attempt to share what I learned — not as a researcher writing a formal report, but as a student trying to make sense of a fast-moving field in a friendly, simple way.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What This Series Will Cover (Based on the Timeline of Publications)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I studied four papers and mapped out how AI in personal health evolved over the past few years.&lt;br&gt;
Here’s the order I’ll follow in this series — from earlier work to the most recent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pkx6pw5z4uwcispwd3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pkx6pw5z4uwcispwd3v.png" alt="timeline" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intro Blog (this one)&lt;/strong&gt; - Why I’m exploring this topic and how the whole series is structured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blog 1 (2024)&lt;/strong&gt; — &lt;strong&gt;PH-LLM&lt;/strong&gt; - A large language model that gives personalized sleep and fitness coaching using wearable data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blog 2 (2024)&lt;/strong&gt; — &lt;strong&gt;PHIA&lt;/strong&gt; - An agentic LLM that writes code, analyzes your data, and turns it into meaningful health insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blog 3 (2025)&lt;/strong&gt; — &lt;strong&gt;IR Prediction + IR Explainer Agent&lt;/strong&gt; - A combination of machine learning and an LLM-based explainer to estimate metabolic risk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blog 4 (2025)&lt;/strong&gt; — &lt;strong&gt;Personal Health Agent (PHA)&lt;/strong&gt; - A multi-agent system where different AI “roles” collaborate — data analyst, coach, and domain expert.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blog 5 (My Project, 2025)&lt;/strong&gt; - My own proof-of-concept that mixes ideas from all four systems to create a unified health agent architecture. My goal is to keep everything straightforward, even if someone is reading about AI for the first time.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Topic Matters (and What I Hope You’ll Learn)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most of us don’t think much about our health until something feels wrong. But our wearables are quietly tracking us every day — our sleep, heart rate, movement, and even stress signals. That means they’re collecting clues long before we ever feel anything.&lt;/p&gt;

&lt;p&gt;AI can turn those clues into something meaningful by noticing patterns we might miss, predicting early risks, explaining things in plain English, and even giving small suggestions that feel personal. To me, this makes health feel more understandable and less like a set of random numbers.&lt;/p&gt;

&lt;p&gt;Through this series, I want to share what I learned about these systems in a simple, relatable way. I’ll talk about how wearables work with AI, how different models make sense of health data, why LLMs and multi-agent systems matter, where the ethical issues show up, and how these ideas connect to real projects.&lt;/p&gt;

&lt;p&gt;I’m writing as a student — sharing what made sense to me, what confused me at first, and what finally “clicked.”  And at the end of the series, I’ll also walk through my own proof-of-concept project where I combine everything I learned into one architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;References&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;PH-LLM: A Personal Health LLM for Sleep &amp;amp; Fitness Coaching&lt;br&gt;
Nature Medicine (2025)&lt;br&gt;
&lt;a href="https://www.nature.com/articles/s41591-025-03888-0" rel="noopener noreferrer"&gt;https://www.nature.com/articles/s41591-025-03888-0&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PHIA: Transforming Wearable Data into Health Insights Using LLM Agents&lt;br&gt;
Google Research Blog (2024)&lt;br&gt;
&lt;a href="https://research.google/blog/advancing-personal-health-and-wellness-insights-with-ai/" rel="noopener noreferrer"&gt;https://research.google/blog/advancing-personal-health-and-wellness-insights-with-ai/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SHARP Framework: Principles for Building Health &amp;amp; Wellness LLMs&lt;br&gt;
Google (2025)&lt;br&gt;
&lt;a href="https://services.google.com/fh/files/blogs/winslow_2025_sharp_framework.pdf" rel="noopener noreferrer"&gt;https://services.google.com/fh/files/blogs/winslow_2025_sharp_framework.pdf&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Insulin Resistance Prediction From Wearables + Routine Blood Biomarkers&lt;br&gt;
arXiv (2025)&lt;br&gt;
&lt;a href="https://arxiv.org/pdf/2505.03784" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2505.03784&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Personal Health Agent (PHA): Multi-Agent System for Data + Coaching + Expertise&lt;br&gt;
arXiv (2025)&lt;br&gt;
&lt;a href="https://arxiv.org/pdf/2508.20148" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2508.20148&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>google</category>
      <category>llm</category>
      <category>health</category>
      <category>ai</category>
    </item>
    <item>
      <title>Nested Learning — My Reflections on a Model That Learns How to Learn</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Mon, 17 Nov 2025 02:57:13 +0000</pubDate>
      <link>https://dev.to/mitanshgor/nested-learning-my-reflections-on-a-model-that-learns-how-to-learn-14b5</link>
      <guid>https://dev.to/mitanshgor/nested-learning-my-reflections-on-a-model-that-learns-how-to-learn-14b5</guid>
      <description>&lt;p&gt;I recently came across a paper called &lt;a href="https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/" rel="noopener noreferrer"&gt;&lt;strong&gt;Nested Learning: The Illusion of Deep Learning&lt;/strong&gt;&lt;/a&gt; by Behrouz and the team — the same researchers behind Titans and Atlas. It really caught my attention because it challenges what we usually think “deep learning” means. The paper says that depth in neural networks isn’t just about stacking layers — it’s about how many layers of learning the system can apply to itself. Instead of just updating weights, this model learns how to improve its own learning process.&lt;/p&gt;

&lt;p&gt;While reading it, I realized this isn’t just another optimization trick. It actually feels like a glimpse into what real intelligence could be — an AI that doesn’t just react but reflects, improves, and evolves how it learns over time. The authors even built a prototype called Hope, a model that modifies itself using feedback, learning not just what to learn but how to learn better.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65zccs9kcg5k6ilukwrf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65zccs9kcg5k6ilukwrf.png" alt="LearningToLearn" width="389" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  My Take on the Problems in Deep Learning and Transformers
&lt;/h2&gt;

&lt;p&gt;Today’s deep learning systems — even transformers — are strong but still limited in how they actually learn.&lt;/p&gt;

&lt;p&gt;Neural networks are called “deep” because of their layers, but their learning is flat — one optimizer like Adam or SGD updates everything the same way. Once training ends, the model stops learning, like a student who graduates and never studies again. :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5sr0zxhnevjaum8ho0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5sr0zxhnevjaum8ho0v.png" alt="contextWindow" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Transformers improved context handling, but they only have short-term memory. They remember what’s inside their context window and forget the rest. Even if I teach GPT something new, it won’t remember it next time — its knowledge is frozen.&lt;/p&gt;

&lt;p&gt;To me, that’s what makes this paper exciting. It explores real, continuous learning, where models don’t just perform tasks but actually grow and evolve from their experiences — more like how the human brain learns over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Human Brain Actually Learns
&lt;/h2&gt;

&lt;p&gt;When I compared this idea to how our brain works, the difference was clear. The brain doesn’t just store facts — it keeps updating how it learns from every experience. Each new moment changes not only what we know but how we learn next time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc2xjpat5qyf0uesc61o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc2xjpat5qyf0uesc61o.png" alt="brain" width="706" height="726"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like when I study late and remember less, my brain quietly adjusts — it learns how to learn better. We also have different kinds of memory: fast, short-term memory for quick thoughts and slow, long-term memory for what truly matters.&lt;/p&gt;

&lt;p&gt;What’s amazing is that all this happens at different speeds — reflexes form instantly, habits take time. That’s what makes our learning flexible and self-improving. This paper helped me realize that real intelligence isn’t just about storing knowledge — it’s about systems that can adapt the way they adapt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Real Intelligence Learns at Many Speeds
&lt;/h2&gt;

&lt;p&gt;“Nested Learning” mirrors the way our brain learns at multiple speeds. In real life, not all learning happens instantly — some lessons come from quick feedback, and others sink in over time through reflection and repetition.&lt;/p&gt;

&lt;p&gt;For example, when I make a mistake in code, I fix it fast — that’s short-term learning. But when I notice a pattern of mistakes across projects and change how I approach debugging, that’s slow, higher-order learning. My brain is basically nesting layers of learning, one inside another.&lt;/p&gt;

&lt;p&gt;This is exactly what the paper argues AI should do. Instead of having one rigid update rule for all situations, it should have systems that operate on different timescales — fast ones for reacting to the present and slow ones for improving how it learns in the future. Real intelligence, human or artificial, grows when it can &lt;em&gt;learn fast, remember slow, and keep adjusting both.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7cf72h9sp4dz83sq58w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7cf72h9sp4dz83sq58w.png" alt="levelBRain" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Optimizers Are Still “Shallow”
&lt;/h2&gt;

&lt;p&gt;One part that stood out to me was how the paper calls our current optimizers “shallow.” At first, that sounded odd — optimizers like Adam or SGD are what make models learn, right? But the point is deeper: they only operate at one level. They adjust the weights, but they never learn &lt;em&gt;how to optimize better&lt;/em&gt; on their own.&lt;/p&gt;

&lt;p&gt;Think about it like this — an optimizer is a rulebook. It says, “If error is high, change parameters this way.” That rule never changes, no matter how the model behaves or what patterns it encounters. It doesn’t evolve. It’s like a student who keeps using the same study method forever, even when it stops working.&lt;/p&gt;

&lt;p&gt;Nested Learning challenges that. It treats the optimizer as something that can &lt;em&gt;learn from its own history&lt;/em&gt; — almost like giving the optimizer memory and awareness. So instead of being a fixed rule, it becomes a learner itself. That’s why normal optimizers are called “shallow” — they only see one layer of the learning process, while true intelligence needs many.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Nested Learning Really Means
&lt;/h2&gt;

&lt;p&gt;When I finally got to the main idea — Nested Learning — it clicked for me that this isn’t just about deeper networks, but deeper &lt;em&gt;learning loops.&lt;/em&gt; Normally, a model learns by updating its parameters once per round. But in Nested Learning, there are multiple layers of learning stacked inside each other, each operating at its own level. &lt;/p&gt;

&lt;p&gt;The paper calls these “levels.”&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Level 1&lt;/strong&gt; is the fast learner — it adjusts to new data right away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 2&lt;/strong&gt; is slower — it learns how well Level 1 is learning and changes its strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 3 and beyond&lt;/strong&gt; keep zooming out, letting the model reflect on its own updates and tweak the process itself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxenaeptaijj1ljete540.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxenaeptaijj1ljete540.png" alt="levelsLearning" width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5wn947001ly0ew6abyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5wn947001ly0ew6abyi.png" alt="nestedLearning" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s like having a mind inside a mind inside a mind — each layer watching and improving the one below. What makes it powerful is that it never stops at one rule; it can always find a better way to learn. I realized that’s what makes it feel almost human — because that’s how we grow too, by not just learning facts, but by constantly refining &lt;em&gt;how&lt;/em&gt; we learn them.&lt;/p&gt;
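&lt;p&gt;To make the levels concrete for myself, I wrote a tiny toy sketch. This is my own simplification, not the paper’s algorithm: Level 1 does plain gradient steps on a one-dimensional loss, while Level 2 watches whether the loss improved and adjusts the learning rate itself.&lt;/p&gt;

```python
# Toy sketch of two "levels" of learning (my simplification, not the paper's code):
# Level 1 updates a parameter w to minimize loss(w) = (w - 3)^2.
# Level 2 watches Level 1's progress and adapts the learning rate itself.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w, lr = 0.0, 1.2          # learning rate deliberately too high at first
prev = loss(w)
for step in range(50):
    w -= lr * grad(w)     # Level 1: fast, per-step weight update
    cur = loss(w)
    if cur > prev:        # Level 2: slow, adjusts HOW Level 1 learns
        lr *= 0.5         # learning got worse, so be more cautious
    prev = cur

print(round(w, 3), round(lr, 3))
```

&lt;p&gt;With the starting learning rate set too high, Level 2 halves it after the first bad step, and Level 1 then converges to the minimum: a two-level learning loop in miniature.&lt;/p&gt;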

&lt;h2&gt;
  
  
  The Power of Associative Memory
&lt;/h2&gt;

&lt;p&gt;In this paper, associative memory is what allows the model to connect surprise signals over time. Each time it encounters something unexpected, it doesn’t just correct the output; it stores that “surprise” pattern and learns from how surprises evolve. So instead of forgetting past mistakes, it builds a history of how it has been wrong before — and uses that as context for new learning.&lt;/p&gt;

&lt;p&gt;I liked how this turns memory from a passive storage system into an active, learning part of the network. It’s not just remembering data; it’s remembering &lt;em&gt;how learning felt last time.&lt;/em&gt; That’s what makes the system more adaptive and self-improving, just like how human intuition forms through repeated experiences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsc3ap0y9ran5cvszl20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsc3ap0y9ran5cvszl20.png" alt="associativeMemory" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  HOPE — The Model That Learns to Learn
&lt;/h2&gt;

&lt;p&gt;The paper introduces a model called &lt;strong&gt;HOPE (Hierarchical Optimizing Processing Ensemble)&lt;/strong&gt;, and it really ties everything together. HOPE is basically the first real example of Nested Learning in action. It builds on the older &lt;strong&gt;Titans architecture&lt;/strong&gt;, which was already designed for smart memory management — storing “surprising” experiences and forgetting the rest. But Titans could only update its parameters in two layers, which meant it was still limited to first-order learning.&lt;/p&gt;

&lt;p&gt;HOPE takes that concept and adds &lt;em&gt;self-modification&lt;/em&gt;. It doesn’t just store knowledge — it rewrites how it learns based on what it experiences. The more it learns, the better it gets at learning itself. That’s what makes it “hierarchical” — every layer is optimizing the one below it, creating an infinite loop of improvement.&lt;/p&gt;

&lt;p&gt;When I read that, it felt like looking at a prototype for true adaptive intelligence. HOPE doesn’t just grow its memory; it evolves its own way of thinking. It’s almost like the model is building its own brain architecture in real time — guided only by feedback and surprise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2h73hnhrw783xirbqwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2h73hnhrw783xirbqwv.png" alt="hope" width="548" height="1064"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CMS — Learning Across Multiple Memory Speeds
&lt;/h2&gt;

&lt;p&gt;One of the coolest parts of the paper was the &lt;strong&gt;Continuum Memory System (CMS)&lt;/strong&gt;. This idea clicked with me right away because it’s inspired by how our brain manages memory at different speeds. We have fast, short-term memory for reacting to the moment, and slower, long-term memory for storing what truly matters. CMS brings that same principle to AI.&lt;/p&gt;

&lt;p&gt;In HOPE, CMS creates layers of memory that operate on different time scales. Fast memory reacts instantly to new data, slow memory holds onto stable knowledge, and middle layers balance both. The system learns what to keep, what to adapt, and what to forget — automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbwdvohd05ptemienwye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbwdvohd05ptemienwye.png" alt="cms" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes the model more flexible and less likely to “forget” old knowledge when it learns something new, solving a big problem in continual learning. For me, CMS felt like giving the model an actual sense of &lt;em&gt;time&lt;/em&gt; — letting it learn short-term lessons without losing its long-term wisdom. It’s memory that grows, refines, and stays balanced, just like ours.&lt;/p&gt;
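&lt;p&gt;Here’s a tiny sketch of how I picture multi-speed memory (my own analogy, not the paper’s CMS implementation): two exponential moving averages tracking the same signal, one fast and one slow.&lt;/p&gt;

```python
# Illustrative sketch (my analogy, not the paper's CMS code): two memories
# tracking the same signal at different timescales via exponential moving averages.
# The fast memory chases recent values; the slow memory keeps stable knowledge.

fast, slow = 0.0, 0.0
signal = [1.0] * 20 + [5.0] * 3 + [1.0] * 20   # a brief "surprise" spike

for x in signal:
    fast = 0.5 * fast + 0.5 * x    # fast timescale: adapts within a few steps
    slow = 0.99 * slow + 0.01 * x  # slow timescale: barely moves on the spike

print(round(fast, 2), round(slow, 2))
```

&lt;p&gt;The fast memory shoots up during the spike and then settles back to the baseline, while the slow memory barely registers it. That separation of timescales is the intuition I took away from CMS.&lt;/p&gt;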

&lt;h2&gt;
  
  
  From Titans to HOPE — How the Architecture Evolved
&lt;/h2&gt;

&lt;p&gt;Before HOPE came along, there was the &lt;strong&gt;Titans architecture&lt;/strong&gt;, which was already an interesting idea. Titans worked like a long-term memory system for AI — it didn’t try to remember everything, only what was &lt;em&gt;surprising&lt;/em&gt;. Whenever the model saw something that didn’t match its expectations, it marked that as “important” and stored it. This made Titans good at keeping rare or unexpected experiences while forgetting routine ones, kind of like how our brain remembers unusual events more vividly than daily habits.  &lt;/p&gt;
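&lt;p&gt;Here’s how I imagine that surprise filter in a few lines of Python (a heavy simplification for intuition only, not the actual Titans update rule): keep a slowly drifting expectation, and store an input only when it deviates from that expectation by more than a threshold.&lt;/p&gt;

```python
# Minimal sketch of "surprise-based" storage (my simplification of the idea):
# keep a running prediction and memorize only inputs that deviate a lot from it.

memory = []
prediction = 0.0

def maybe_store(x, threshold=2.0):
    global prediction
    surprise = abs(x - prediction)
    if surprise > threshold:      # unexpected event, worth remembering
        memory.append(x)
    prediction = 0.9 * prediction + 0.1 * x   # expectations drift slowly

for x in [0.1, 0.2, 0.1, 9.0, 0.2, 0.1, 8.5]:
    maybe_store(x)

print(memory)   # only the surprising spikes are kept
```

&lt;p&gt;Running it on a mostly boring stream, only the two out-of-pattern spikes end up in memory, while the routine values are forgotten.&lt;/p&gt;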

&lt;p&gt;But Titans had a big limitation — it could only learn at &lt;strong&gt;two levels&lt;/strong&gt;. It could store knowledge (Level 1) and slightly adjust how it stored it (Level 2), but it couldn’t modify its own learning process. It was stuck with fixed update rules, so even though its memory was smart, its way of learning stayed static.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11ji4gmdtjpf0zooj1ta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11ji4gmdtjpf0zooj1ta.png" alt="Titan" width="800" height="1217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;HOPE (Hierarchical Optimizing Processing Ensemble)&lt;/strong&gt; came in as the next step. HOPE keeps Titans’ “surprise-based memory,” but adds &lt;strong&gt;self-modification&lt;/strong&gt; — meaning it can change how it learns over time. Instead of just remembering, it &lt;em&gt;reflects&lt;/em&gt; on how it learned and improves that process.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg95ycli6qj5dfgutnrh8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg95ycli6qj5dfgutnrh8.png" alt="Hope" width="756" height="1266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In simple terms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Titans learns &lt;em&gt;what&lt;/em&gt; to remember.
&lt;/li&gt;
&lt;li&gt;HOPE learns &lt;em&gt;how&lt;/em&gt; to learn better next time.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This shift from reactive memory (Titans) to reflective learning (HOPE) is what made the architecture truly recursive — an AI that can not only adapt but evolve its own learning rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Takes on the Paper
&lt;/h2&gt;

&lt;p&gt;This paper made me rethink what “deep learning” really means. It’s not just about adding layers — it’s about adding &lt;em&gt;levels of learning.&lt;/em&gt; I liked how it pushed the idea that intelligence should evolve, not just perform.&lt;/p&gt;

&lt;p&gt;What stood out to me was the mindset shift. Instead of models that just learn tasks, it showed a system that learns &lt;em&gt;how to learn better.&lt;/em&gt; That’s the kind of loop real intelligence needs — self-awareness in its own process.&lt;/p&gt;

&lt;p&gt;I also liked the balance between fast and slow learning. It reminded me of how humans think — reacting quickly to new events while slowly refining long-term habits. The whole idea felt less like training a model and more like nurturing an evolving mind.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>google</category>
      <category>nestedlearning</category>
      <category>hope</category>
    </item>
    <item>
      <title>GEN-AI 5 : WGAN and WGAN-GP</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Sun, 20 Apr 2025 01:19:50 +0000</pubDate>
      <link>https://dev.to/mitanshgor/gen-ai-2-generative-ai-and-auto-encoders-do8</link>
      <guid>https://dev.to/mitanshgor/gen-ai-2-generative-ai-and-auto-encoders-do8</guid>
      <description>&lt;h2&gt;
  
  
  Why ❌ GAN
&lt;/h2&gt;

&lt;p&gt;In the ever-evolving landscape of generative models, GANs have taken center stage with their remarkable ability to generate data that mimics real-world distributions. But as with all great things, classic GANs came with caveats—training instability, vanishing gradients, and mode collapse, to name a few. &lt;/p&gt;

&lt;p&gt;Let’s dive deeper into these powerful models: WGAN and its enhanced cousin WGAN-GP, two sophisticated upgrades that fix many of GANs' shortcomings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyi492wjs6nr3z58wf2hg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyi492wjs6nr3z58wf2hg.png" alt="I0" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Wasserstein distance
&lt;/h2&gt;

&lt;p&gt;Wasserstein Distance, also known as the Earth Mover's Distance (EMD), is a mathematical measure of the distance between two probability distributions.&lt;/p&gt;

&lt;p&gt;Imagine you have two piles of sand—one symbolizing real data and the other representing data generated by a model. The goal is to reshape one pile into the other by moving portions of sand. The effort required depends not only on how much sand you move but also on how far you move it. This effort is what the Wasserstein Distance measures—a cost-efficient way to transform one distribution into another.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkddlg8za5taek2wm2503.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkddlg8za5taek2wm2503.gif" alt="I1" width="400" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What makes this distance especially powerful is its ability to compare distributions even when they don't overlap at all—a situation where traditional measures like JS divergence fail. It offers a continuous and interpretable signal for how "far off" the generated data is from the real data, making it incredibly useful for training generative models like WGANs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lout24z5bpr2hnaw7vf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lout24z5bpr2hnaw7vf.png" alt="I2" width="800" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traditional GANs use Jensen-Shannon (JS) divergence to measure the similarity between distributions.&lt;br&gt;
But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JS divergence becomes undefined or uninformative when distributions don't overlap.&lt;/li&gt;
&lt;li&gt;This leads to vanishing gradients, a huge problem for training.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By contrast, Wasserstein Distance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remains well-defined and continuous even when distributions are far apart.&lt;/li&gt;
&lt;li&gt;Provides meaningful gradients, allowing the generator to learn effectively from the start.&lt;/li&gt;
&lt;/ul&gt;
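The failure mode above shows up numerically. In this toy sketch (discrete distributions with made-up masses, chosen purely for illustration), two distributions with disjoint support always have a JS divergence of log 2, no matter how far apart their modes are:

```python
import math

# Toy demonstration: Jensen-Shannon divergence between discrete
# distributions on a shared support.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

real   = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # all mass at position 0
fake_a = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]  # all mass at position 5
fake_b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # all mass at position 9

# Disjoint support: JS saturates at log(2) regardless of how far
# apart the two piles are, so it carries no "how far off" signal.
print(js(real, fake_a))  # ≈ 0.693
print(js(real, fake_b))  # ≈ 0.693
```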

&lt;h2&gt;
  
  
  Improvements in WGAN over GAN
&lt;/h2&gt;

&lt;p&gt;WGAN is a variant of GAN that uses the Wasserstein Distance as its loss function instead of JS divergence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdia5wcuesljnfdvcm59n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdia5wcuesljnfdvcm59n.png" alt="I3" width="685" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key differences from traditional GAN:&lt;/strong&gt;&lt;br&gt;
The discriminator is now called a critic. It doesn't classify data as real/fake, but scores it to reflect how “real” it looks.&lt;/p&gt;

&lt;p&gt;The loss function is based on the difference in critic scores for real vs. fake data.&lt;/p&gt;

&lt;p&gt;In simple terms:&lt;br&gt;
&lt;code&gt;Real Data → High Critic Score&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Fake Data → Low Critic Score&lt;/code&gt;&lt;br&gt;
The generator improves by producing data that the critic scores higher.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;Just like in standard GANs, WGAN consists of two neural networks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Generator (G) that tries to generate realistic data from random noise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Critic (C) (not called a discriminator here) that scores the realness of data, assigning higher values to real data and lower values to fake ones.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Unlike GANs, which use a &lt;strong&gt;sigmoid function&lt;/strong&gt; to classify samples as real or fake, WGAN's critic outputs real-valued scores—this small shift changes everything.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Loss Function
&lt;/h3&gt;

&lt;p&gt;The heart of WGAN lies in replacing Jensen-Shannon divergence with the Wasserstein distance (also called Earth Mover’s Distance)—a metric that measures how much "effort" it takes to morph the generated distribution into the real one.&lt;/p&gt;

&lt;p&gt;This results in the following loss functions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2hxwngfpuhab0mlsvf7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2hxwngfpuhab0mlsvf7.png" alt="I4" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Critic's Loss:&lt;/p&gt;

&lt;p&gt;L_C = E[C(fake)] − E[C(real)]&lt;/p&gt;

&lt;p&gt;The critic aims to make this loss as negative as possible, rewarding real data with higher scores and penalizing fakes.&lt;/p&gt;

&lt;p&gt;Generator's Loss:&lt;/p&gt;

&lt;p&gt;L_G = −E[C(fake)]&lt;/p&gt;

&lt;p&gt;The generator tries to minimize the critic’s ability to detect its fakes by generating better, more realistic data.&lt;/p&gt;
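As a sanity check, both losses can be computed directly from critic scores. A minimal sketch (the score values below are made up for illustration):

```python
# Sketch of the two WGAN losses computed from raw critic scores.
def critic_loss(c_real, c_fake):
    # L_C = E[C(fake)] - E[C(real)]: minimizing this pushes real
    # scores up and fake scores down.
    return sum(c_fake) / len(c_fake) - sum(c_real) / len(c_real)

def generator_loss(c_fake):
    # L_G = -E[C(fake)]: the generator wants higher critic scores
    # on its samples.
    return -sum(c_fake) / len(c_fake)

print(critic_loss([2.0, 3.0], [-1.0, 0.0]))  # -3.0 (critic doing well)
print(generator_loss([-1.0, 0.0]))           # 0.5
```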

&lt;h2&gt;
  
  
  Missing Something? 🤔
&lt;/h2&gt;

&lt;p&gt;The model we've described so far has some issues.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The critic assigns real-valued scores to data samples, so nothing stops it from amplifying its outputs to maximize the loss. For example, it might assign real samples a score of +10,000 and fake samples a score of -10,000. This leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The generator receiving extreme gradients, which are not helpful and can distort learning. &lt;/li&gt;
&lt;li&gt;Loss curves spike or crash. Generated samples look worse over time instead of better, making training unstable.&lt;/li&gt;
&lt;li&gt;Mode collapse and exploding gradients can occur.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How can we solve this problem?&lt;/strong&gt;&lt;br&gt;
Let's constrain the critic (discriminator) to be a 1-Lipschitz function; this solves our problem.&lt;br&gt;
For the Lipschitz condition, we take any 2 images (x1 and x2) as input and require the following ratio to be at most 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqyaa5r2vhscshwtdqeh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqyaa5r2vhscshwtdqeh.png" alt="I5" width="360" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Numerator: Absolute difference between the critic predictions&lt;/code&gt;&lt;br&gt;
&lt;code&gt;Denominator: Average pixelwise absolute difference between two images&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What this constraint does is limit the rate at which the critic's predictions can change between two images, relative to the pixel-wise difference between those images.&lt;/p&gt;

&lt;p&gt;You can visualize this on a graph as two (white) cones whose apex can be moved along the curve, such that the whole curve always stays outside the cones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fip68lijcqaz11cb3ytl7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fip68lijcqaz11cb3ytl7.gif" alt="I6" width="737" height="677"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the original WGAN, this constraint is enforced by clipping the weights of the critic to lie within a small range, which eliminates the exploding-gradient problem.&lt;br&gt;
But again, a NEW PROBLEM!!&lt;br&gt;
Learning slows down significantly 😢 😢&lt;/p&gt;
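For reference, the weight clipping itself is a one-liner. A minimal sketch (the clip threshold 0.01 follows the original WGAN paper's default; the weight values are illustrative):

```python
# After every critic update, clamp each weight into [-c, c] so the
# critic stays (approximately) Lipschitz.
def clip_weights(weights, c=0.01):
    return [max(-c, min(c, w)) for w in weights]

print(clip_weights([0.5, -0.02, 0.005]))  # [0.01, -0.01, 0.005]
```

This bluntness is exactly why learning suffers: most weights end up pinned at the two clip boundaries.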

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futcfv5rhwinwjilaquer.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futcfv5rhwinwjilaquer.png" alt="I7" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A much better solution is WGAN-GP.&lt;/p&gt;




&lt;h2&gt;
  
  
  Adding Gradient Penalty
&lt;/h2&gt;

&lt;p&gt;Enforcing the Lipschitz constraint via weight clipping only did a partial job, so we need to tweak the approach to eliminate the problem of significantly decreased learning.&lt;br&gt;
The solution is to penalize the critic whenever its gradient norm is too far away from 1.&lt;/p&gt;

&lt;p&gt;The Lipschitz constraint is enforced by including a gradient penalty (GP) term in the critic's loss function. The GP penalizes the model whenever the gradient norm deviates from 1.&lt;br&gt;
Concretely, the GP term measures the squared difference between 1 and the norm of the gradient of the critic's prediction with respect to the input image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevcfujtd57fj59oku5c2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevcfujtd57fj59oku5c2.png" alt="I8" width="456" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft56qfswe9xzp17xwgvij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft56qfswe9xzp17xwgvij.png" alt="III" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above image, it's clear that the gradient penalty increases as we move away from 1.&lt;/p&gt;
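Numerically, the penalty term is just λ·(‖∇‖ − 1)². A minimal sketch (it assumes the gradient norm at an interpolated sample between a real and a fake image has already been computed; λ = 10 is the weight used in the WGAN-GP paper):

```python
# Gradient-penalty term: zero when the gradient norm is exactly 1,
# growing quadratically as the norm deviates in either direction.
def gradient_penalty(grad_norm, lambda_gp=10.0):
    return lambda_gp * (grad_norm - 1.0) ** 2

print(gradient_penalty(1.0))  # 0.0: norm exactly 1, no penalty
print(gradient_penalty(1.5))  # 2.5
print(gradient_penalty(0.5))  # 2.5
```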

&lt;p&gt;&lt;strong&gt;There are a few important things to take care of while training a WGAN-GP&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;With Wasserstein loss, the Critic must be trained to converge before updating the Generator. To do so, we train the Critic several times between Generator updates. A typical ratio used is &lt;code&gt;3 to 5 Critic updates per Generator update&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Batch Normalization shouldn't be used in the critic. Batch Norm creates correlations between images in the same batch, which makes the gradient penalty loss less effective. Experiments have shown that WGAN-GPs work great without it anyway.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
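The first point above can be sketched as a training-loop skeleton. The step functions are placeholders that only record the update order; a ratio of 5 critic updates per generator update is shown:

```python
# Training-loop skeleton for the critic:generator update ratio.
N_CRITIC = 5  # critic updates per generator update

log = []
def train_critic_step():    log.append("C")
def train_generator_step(): log.append("G")

for _ in range(2):                 # two generator updates shown
    for _ in range(N_CRITIC):
        train_critic_step()
    train_generator_step()

print("".join(log))  # CCCCCGCCCCCG
```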

&lt;h2&gt;
  
  
  Results with the WGAN-GP Model
&lt;/h2&gt;

&lt;p&gt;Using the Wasserstein distance, the Lipschitz constraint, and the gradient penalty together significantly improves the quality of the generated images.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional GANs suffer from vanishing gradients or mode collapse due to the Jensen-Shannon divergence. WGAN-GP, by using Wasserstein distance + gradient penalty, leads to more consistent and stable convergence. You get smoother generator updates, making training less sensitive to hyperparameters and architectural choices.&lt;/li&gt;
&lt;li&gt;In classic GANs, generators can produce a limited variety (mode collapse). WGAN-GP encourages the generator to explore the data distribution more fully, thanks to the meaningful gradient feedback from the critic. Hence, more diverse outputs, especially noticeable in image generation tasks.&lt;/li&gt;
&lt;li&gt;The loss correlates with sample quality, unlike the traditional GAN loss, which becomes meaningless when the discriminator is too good. Hence, you can track training progress numerically, not just visually.&lt;/li&gt;
&lt;li&gt;There are no sudden spikes or flat zones in gradients. Hence, more efficient and productive generator learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Disadvantages of WGAN
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Slower training, higher memory usage, and increased training time, especially on large datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hyperparameters are sensitive: even a small change can destabilize training, so they need to be tuned carefully.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A WGAN-GP uses the Wasserstein loss&lt;/li&gt;
&lt;li&gt;The WGAN-GP is trained using labels of 1 for real and -1 for fake&lt;/li&gt;
&lt;li&gt;No Sigmoid activation in the final layer of the Critic&lt;/li&gt;
&lt;li&gt;Include a GP term in the loss function for the Critic&lt;/li&gt;
&lt;li&gt;Train the &lt;strong&gt;&lt;u&gt;Critic multiple times&lt;/u&gt;&lt;/strong&gt; for each update of the Generator&lt;/li&gt;
&lt;li&gt;There are &lt;strong&gt;&lt;u&gt;no Batch Norm layers&lt;/u&gt;&lt;/strong&gt; in the Critic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffarqvyv498oeh1bkku1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffarqvyv498oeh1bkku1r.png" alt="II" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>generative</category>
      <category>deeplearning</category>
      <category>wgan</category>
      <category>wgangp</category>
    </item>
    <item>
      <title>GEN-AI 4 : GAN</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Sun, 20 Apr 2025 01:19:40 +0000</pubDate>
      <link>https://dev.to/mitanshgor/gen-ai-4-gan-how-ai-learns-to-generate-realistic-images-17bn</link>
      <guid>https://dev.to/mitanshgor/gen-ai-4-gan-how-ai-learns-to-generate-realistic-images-17bn</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyxk9w3e576t3sa0kefw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyxk9w3e576t3sa0kefw.png" alt="Image description" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
In the world of generative models, Variational Autoencoders (VAEs) were among the first to show us how machines can learn to create new data. But they came with limitations, which sparked the need for a new direction in generative modeling.&lt;/p&gt;

&lt;p&gt;Enter Generative Adversarial Networks (GANs): a powerful, game-theoretic approach that changed the landscape by producing stunningly realistic outputs and pushing the boundaries of what machines can generate.&lt;/p&gt;

&lt;p&gt;Let's start by understanding the limitations of VAEs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disadvantages of VAE
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blurry Outputs:&lt;/strong&gt; In a VAE, two things are being balanced: how well the model can recreate the input data (reconstruction loss) and how closely the latent variables follow the assumed normal distribution (KL divergence). If this balance isn't right, it can cause problems. For example, if the model focuses too much on making the latent space look normal, it might not focus enough on accurately recreating the input data, leading to blurry or low-quality reconstructions. Finding the right balance is tricky, and if it's off, the output can look fuzzy or unclear.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latent Space Constraints:&lt;/strong&gt; When we assume a simple normal distribution for the latent variables in a VAE, it means we expect the data to follow a very basic pattern (like a bell curve). However, real-world data is often more complex and doesn't always fit this simple pattern. Because of this, the VAE might not capture all the details or variations in the data, leading to less accurate results and poor generation of new data. Essentially, the model's assumption limits its ability to understand and recreate the data’s true complexity fully.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mode Averaging:&lt;/strong&gt; When the data has many different patterns or "modes" (called multimodal data), VAEs can struggle to capture all of them. The model tends to average over these modes, meaning it might create a mix of features from different data patterns instead of focusing on the specific details of each mode. This can result in the model generating outputs that don’t accurately reflect the diversity of the data, often leading to a loss of important variations or details in the generated samples. Essentially, the model might not fully capture all the unique aspects of the data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexqc8hqb3nxa8wc90ais.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexqc8hqb3nxa8wc90ais.png" alt="some10.png" width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How is GAN better than VAE?
&lt;/h2&gt;

&lt;p&gt;While Variational Autoencoders (VAEs) offer a powerful framework for generative modeling, they come with certain limitations. To address these challenges, Generative Adversarial Networks (GANs) provide an alternative approach. Unlike VAEs, which rely on a probabilistic framework, GANs use two neural networks—a generator and a discriminator—that compete against each other to improve the quality of generated data. The adversarial setup in GANs enables them to produce sharper, more realistic outputs, particularly in applications like image generation, where detail and realism are crucial.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hsyg8ggkyw0cjw04r11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hsyg8ggkyw0cjw04r11.png" alt="s11.png" width="182" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s how GANs are better than VAEs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sharper Outputs&lt;/strong&gt;: GANs produce clearer and more detailed images because they focus on distinguishing real data from generated data, leading to more realistic results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Blurry Reconstructions&lt;/strong&gt;: Unlike VAEs, which can sometimes produce blurry outputs due to their reliance on a probabilistic framework, GANs avoid this issue by directly optimizing for realism.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Better for High-Quality Generation&lt;/strong&gt;: GANs excel in tasks where the goal is to generate high-quality data, like realistic images, videos, or audio, because the adversarial training encourages the generator to produce more lifelike results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility in Learning Complex Patterns&lt;/strong&gt;: GANs can learn complex data distributions better, especially when data has many modes (variety of patterns), without averaging out the details like VAEs might.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Working of GAN
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnszgqd2lzvg2svs0lo0a.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnszgqd2lzvg2svs0lo0a.jpg" alt="GAN.jpg" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Generative Adversarial Network (GAN) is a machine learning model that consists of two neural networks, a generator and a discriminator, that work against each other in a process called adversarial training. Here's how it works in detail:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qmo8icxg9z835a2knp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qmo8icxg9z835a2knp4.png" alt="image.png" width="790" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Generator:
&lt;/h3&gt;

&lt;p&gt;The generator is like an artist trying to create new data (e.g., images) that looks as real as possible. It starts with random noise and uses this as input to produce a generated output. The goal of the generator is to produce data that can fool the discriminator into thinking it’s real.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Discriminator:
&lt;/h3&gt;

&lt;p&gt;The discriminator is like a critic trying to tell whether a piece of data is real (from the training data) or fake (produced by the generator). It takes an input—either real data or a generated one—and outputs a probability that the input is real or fake. It can be considered as a supervised image classification problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adversarial Training:
&lt;/h3&gt;

&lt;p&gt;The key idea behind GANs is the competition between the generator and the discriminator:&lt;/p&gt;

&lt;p&gt;The generator tries to improve by creating more convincing, realistic data to fool the discriminator.&lt;/p&gt;

&lt;p&gt;The discriminator tries to improve by getting better at distinguishing real data from the fake data created by the generator.&lt;/p&gt;

&lt;p&gt;This process is a game where the generator tries to "cheat" by generating better data, and the discriminator tries to become more skilled at detecting fake data. Over time, both networks improve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vrlbludcpvjllfo9g1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vrlbludcpvjllfo9g1t.png" alt="some1.png" width="265" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Objective:
&lt;/h3&gt;

&lt;p&gt;Generator's Goal: Minimize how often the discriminator correctly identifies fake data. It wants to produce data that the discriminator can't distinguish from real data.&lt;/p&gt;

&lt;p&gt;Discriminator's Goal: Maximize its ability to correctly classify real vs. fake data, helping it get better at spotting fakes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhg52w0ldba0ahgpg9gs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhg52w0ldba0ahgpg9gs.png" alt="some2.png" width="182" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interaction between the discriminator and generator in a GAN requires a delicate balance. If the discriminator is too good, it can easily distinguish real from fake data, leaving the generator with little feedback and hindering its ability to improve. On the other hand, if the generator is too strong, it may exploit weaknesses in the discriminator, producing fake data that the discriminator wrongly classifies as real (false negatives). The ideal situation occurs when the discriminator outputs a value close to 0.5, meaning it cannot distinguish between real and fake data, indicating that the generator is producing high-quality samples. For example, if the discriminator outputs ~1, the fake images are too realistic, and the generator won't be forced to improve. If it outputs ~0, the fake images are too obvious, and the generator needs more training. However, when the discriminator outputs ~0.5, it suggests the generator is performing well, producing convincing, realistic samples.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd47g58th8vmdc4u1489q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd47g58th8vmdc4u1489q.png" alt="some3.png" width="600" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Loss Functions:
&lt;/h3&gt;

&lt;p&gt;The generator's job in a GAN is to create fake data that looks so real that the discriminator can’t tell it apart from the real data. Instead of directly focusing on the quality of the data, the generator’s main goal is to trick the discriminator into thinking the fake data is real. The generator gets feedback from the discriminator, which helps it improve. The loss function for the generator encourages it to produce data that has a high probability of being classified as real by the discriminator&lt;/p&gt;

&lt;p&gt;The discriminator in a GAN's job is to tell whether the data it sees is real (from the true data source) or fake (generated by the generator). It outputs a probability (between 0 and 1) that the input is real.&lt;br&gt;
The goal of the discriminator is to correctly classify Real data as real and Fake data as fake.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscfwv8mldnqp6ejri6s5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscfwv8mldnqp6ejri6s5.png" alt="some4.png" width="299" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Together, the overall training objective for the GAN can be summarized as : &lt;br&gt;
&lt;code&gt;GAN Loss = Discriminator Loss + Generator Loss&lt;/code&gt;&lt;/p&gt;
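As a concrete sketch, the two losses are just binary cross-entropy on the discriminator's probability outputs. The probability values below are made up, and single samples are used for clarity:

```python
import math

# Standard GAN losses as binary cross-entropy on the discriminator's
# probability outputs.
def d_loss(p_real, p_fake):
    # Discriminator wants p_real -> 1 and p_fake -> 0.
    return -math.log(p_real) - math.log(1 - p_fake)

def g_loss(p_fake):
    # Generator wants the discriminator to output 1 on its fakes.
    return -math.log(p_fake)

print(d_loss(0.9, 0.1))  # ≈ 0.21: confident, correct discriminator
print(g_loss(0.1))       # ≈ 2.30: generator getting caught, large loss
```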

&lt;h3&gt;
  
  
  Training Process:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8eefxz86uoy2mr8mx80.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8eefxz86uoy2mr8mx80.png" alt="some5.png" width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The training process of a Generative Adversarial Network (GAN) involves a series of steps that &lt;strong&gt;&lt;u&gt;&lt;em&gt;alternate&lt;/em&gt;&lt;/u&gt;&lt;/strong&gt; between training the discriminator and the generator.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initialize the generator and discriminator.&lt;/li&gt;
&lt;li&gt;Train the discriminator on real and fake data.&lt;/li&gt;
&lt;li&gt;Train the generator to fool the discriminator.&lt;/li&gt;
&lt;li&gt;Alternate between training the discriminator and the generator.&lt;/li&gt;
&lt;li&gt;Repeat until the generator produces convincing fake data.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;We must alternate the training of these two networks, making sure that we only update the weights of one network at a time!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Initially, the generator creates very poor, random data, and the discriminator easily detects it as fake. This adversarial training process continues until the generator has learned to produce realistic data, and the discriminator has become more adept at distinguishing real from fake data.&lt;/p&gt;

&lt;p&gt;Training can be stopped when the generator produces high-quality data that the discriminator cannot reliably distinguish from real data (i.e., the discriminator’s output approaches 0.5, meaning it cannot differentiate between real and fake data).&lt;/p&gt;

&lt;h2&gt;
  
  
  Situations in GAN
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z6msqpfu3z786tw6egc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z6msqpfu3z786tw6egc.png" alt="Imagesome6.png" width="418" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1) - We normalize the input range from [0, 255] (the typical image range) to [-1, 1] instead of [0, 1] because the tanh activation function works better with inputs in the range of [-1, 1]. The tanh function outputs values between -1 and 1, and normalizing the data to this range helps ensure the gradients are stronger and more stable during training. In contrast, the sigmoid activation function has a range of [0, 1] and produces weaker gradients, which can slow down learning. Therefore, using [-1, 1] helps improve the training process.&lt;br&gt;
2) - The training process of GANs can be unstable because the generator and discriminator are constantly competing. Over time, the discriminator may become too good at distinguishing real from fake data, which could cause issues. However, this isn't always a problem, because the generator might have already learned enough by that point. To improve stability, we can add a small amount of random noise to the training labels, which helps prevent the discriminator from becoming too dominant too quickly.&lt;/p&gt;
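The normalization in point 1 can be sketched as a pair of helper functions (the names are mine, not from any library):

```python
# Hypothetical helpers mapping pixel values in [0, 255] to the tanh
# range [-1, 1] and back for display.
def to_tanh_range(px):
    return px / 127.5 - 1.0

def to_pixel_range(x):
    return (x + 1.0) * 127.5

print(to_tanh_range(0), to_tanh_range(255))  # -1.0 1.0
print(to_pixel_range(0.0))                   # 127.5
```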

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4pbeagq47hzvvleztjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4pbeagq47hzvvleztjf.png" alt="Imagesome.png" width="259" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) - When the Discriminator Overpowers the Generator,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The discriminator becomes too good at distinguishing real from fake images.&lt;/li&gt;
&lt;li&gt;This causes the generator to receive weak feedback, and the loss signal becomes too weak to improve the generator.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The discriminator perfectly classifies real and fake images, causing the gradients to vanish, and the generator stops training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solutions to Weaken the Discriminator:&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Increase the Dropout rate in the discriminator to reduce its ability to overfit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reduce the learning rate of the discriminator to slow down its training.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reduce the number of convolutional filters in the discriminator to limit its capacity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add noise to the labels when training the discriminator to make it harder for the discriminator to distinguish real from fake.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Randomly flip labels of some images during training to confuse the discriminator and prevent it from becoming too powerful.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
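The label-noise and label-flipping tricks above can be sketched in a few lines of NumPy (a minimal sketch; the function name and parameter values are illustrative, not from any GAN library):

```python
import numpy as np

rng = np.random.default_rng(42)

def smooth_and_flip(labels, noise=0.1, flip_prob=0.05):
    """Soften hard 0/1 discriminator targets and randomly flip a few of them."""
    smoothed = np.where(labels == 1.0,
                        1.0 - noise * rng.random(labels.shape),  # real: roughly [0.9, 1.0]
                        noise * rng.random(labels.shape))        # fake: roughly [0.0, 0.1]
    flip = rng.random(labels.shape) < flip_prob                  # flip ~5% of the targets
    return np.where(flip, 1.0 - smoothed, smoothed)

real_targets = smooth_and_flip(np.ones(64))   # feed these instead of hard 1s
fake_targets = smooth_and_flip(np.zeros(64))  # feed these instead of hard 0s
```

The softened targets keep the discriminator from driving its loss to zero, so the generator keeps receiving a usable gradient.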

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7q9cy3u8sc04fajzjvn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7q9cy3u8sc04fajzjvn.jpg" alt="Image description" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4) - When the Generator Overpowers the Discriminator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The discriminator becomes too weak, and the generator tricks it with a small set of nearly identical images.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This results in &lt;strong&gt;&lt;u&gt;mode collapse&lt;/u&gt;&lt;/strong&gt;, where the generator produces limited variety in its outputs, focusing on a single observation that fools the discriminator.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solutions to Weaken the Generator:&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you find that your generator is suffering from mode collapse, you can try strengthening the discriminator using the opposite suggestions to those listed in the previous section. Also, you can try reducing the learning rate of both networks and increasing the batch size.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;5) - The generator's loss is evaluated based on the current discriminator, which constantly improves during training. This makes it hard to compare the generator's loss at different stages, as it may not reflect the actual quality of generated images. The loss can even increase over time, despite the images improving, because the discriminator is getting better, making it harder for the generator to fool it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffca9ule00a1mnua6d48q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffca9ule00a1mnua6d48q.png" alt="sonme8.pnh" width="304" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Disadvantages of GANs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Mode Collapse: The generator may start producing limited, identical outputs, tricking the discriminator, and failing to capture the full diversity of the data.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygkvklwy4ememu7pqrty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygkvklwy4ememu7pqrty.png" alt="some9.png" width="268" height="188"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Uninformative Loss: The generator's loss may increase even as image quality improves, making it hard to track progress.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vanishing Gradients: If the discriminator becomes too powerful, the generator may receive weak gradients, preventing meaningful learning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hyperparameter Sensitivity: GANs are sensitive to choices like learning rates, dropout rates, and the number of layers, requiring careful tuning for optimal performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gan</category>
      <category>van</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>GEN-AI-3 : VAE</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Sun, 20 Apr 2025 01:19:30 +0000</pubDate>
      <link>https://dev.to/mitanshgor/gen-ai-5-wgan-and-wgan-gp-2245</link>
      <guid>https://dev.to/mitanshgor/gen-ai-5-wgan-and-wgan-gp-2245</guid>
      <description>&lt;p&gt;So far, we’ve seen how Autoencoders (AEs) can take data—like images, audio, or text—and compress it into a lower-dimensional space, only to reconstruct it again like a digital magician pulling a rabbit out of a hat. Pretty neat, right?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5kvoc1suag2yjgn3667.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5kvoc1suag2yjgn3667.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But there’s a catch: while AEs are great at learning compact representations, they’re not exactly dreamers. Their latent space—the core of their compressed understanding—isn’t built for imagination. Try sampling from it randomly, and you’re more likely to get noise than meaningful data.&lt;/p&gt;

&lt;p&gt;Now, what if we wanted a model that not only compresses data but can also generate new, meaningful samples that look like they came from the original dataset? A model that understands probability, uncertainty, and can dream up new content with finesse?&lt;/p&gt;

&lt;p&gt;🎉 Enter the Variational Autoencoder (VAE).&lt;/p&gt;

&lt;p&gt;In this post, we’re going to unpack how VAEs work, why they’re a major leap forward from traditional autoencoders, and how they lay the groundwork for some of the most exciting generative models in AI today.&lt;/p&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Regular Autoencoders 🧩
&lt;/h2&gt;

&lt;p&gt;Traditional autoencoders compress data by learning a direct mapping from input to a latent vector, and then decompress it using a decoder. While effective for feature learning, they suffer from:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4jb8wgld9qpoxj38sh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4jb8wgld9qpoxj38sh7.png" alt="Image description" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disorganized Latent Space: Nearby latent points don’t necessarily produce similar outputs.&lt;/li&gt;
&lt;li&gt;Poor Generative Ability: Sampling randomly from the latent space usually results in noisy, incoherent outputs.&lt;/li&gt;
&lt;li&gt;No Uncertainty Modeling: The model doesn't capture how confident it is in the latent representation.&lt;/li&gt;
&lt;li&gt;We needed structure, smoothness, and the ability to reason probabilistically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwds1p7qngk33z7lfa6s6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwds1p7qngk33z7lfa6s6.png" alt="Image description" width="641" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Variational Autoencoders
&lt;/h2&gt;

&lt;p&gt;VAEs solve these issues by introducing two key ideas:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Injecting randomness into the encoding process.&lt;/li&gt;
&lt;li&gt;Imposing constraints on the distribution of the latent space.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Instead of encoding an input to a fixed point, the VAE encodes it to a &lt;strong&gt;probability distribution—specifically&lt;/strong&gt;, a multivariate normal distribution centered at a point in the latent space. &lt;br&gt;
Or in other words,&lt;br&gt;
Instead of encoding input into a fixed latent vector, it encodes it into a &lt;strong&gt;distribution&lt;/strong&gt; (usually Gaussian), from which it &lt;strong&gt;samples&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;By doing so, it not only compresses data, but VAEs &lt;strong&gt;learn to imagine variations&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faeyel02ruwfwdwbpnjdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faeyel02ruwfwdwbpnjdq.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 Constraints on the latent space
&lt;/h2&gt;

&lt;p&gt;To bring structure to the latent space, we constrain how encodings are distributed:&lt;/p&gt;

&lt;p&gt;🌀 Centering: Each encoded distribution should be centered as close to the origin (0, 0, ..., 0) as possible.&lt;br&gt;
📏 Unit Variance: The spread (standard deviation) of each distribution should be close to 1.&lt;/p&gt;

&lt;p&gt;The further the encoder deviates from these goals, the higher the loss during training.&lt;/p&gt;

&lt;p&gt;Why? This forces all encoded samples to live near the same area in latent space, enabling smooth interpolation and consistent generation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynnt1sj7uwv3fux9gwrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynnt1sj7uwv3fux9gwrj.png" alt="Image description" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Changes in VAE Encoder
&lt;/h2&gt;

&lt;p&gt;The encoder no longer outputs just a point, but rather, it parameterizes a distribution from which we can sample.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;x → (μ, σ) → z ~ N(μ, σ²)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3wkg1a19i7zfwfu56aj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3wkg1a19i7zfwfu56aj.png" alt="Image description" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We do this because we want to generate new data by sampling from a continuous, meaningful latent space.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;z_mean (μ): The center of the distribution where we want our encoding to be.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;z_log_var (log(σ²)): The (log of) the variance, controlling the spread of that distribution.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, they define a normal distribution, from which we will later sample to get a latent variable z for decoding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now the question is: why log(σ²) and not just σ²?&lt;/strong&gt;&lt;br&gt;
There are two major reasons: the positivity constraint and numerical stability.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Variance Must Be Positive: If we output σ directly from a neural network, we must force it to produce only positive values, yet neural nets naturally output values in (−∞, ∞). Outputting log(σ²) guarantees σ&amp;gt;0, always.&lt;/li&gt;
&lt;li&gt;Numerical Stability: In the log domain, multiplications become additions, which avoids premature underflow for tiny variances and the unstable gradients that come from optimizing σ directly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Imagine we want to allow the encoder to learn variances from 0.0001 to 1000:&lt;br&gt;
If we output σ directly, the network must learn to span that huge dynamic range. But if we output log(σ²), the values range from about:&lt;br&gt;
&lt;code&gt;log(0.0001)=−9.2&lt;/code&gt; to &lt;code&gt;log(1000)=6.9&lt;/code&gt;.&lt;br&gt;
A much more manageable range!&lt;/p&gt;

&lt;p&gt;But wait!&lt;br&gt;
There is one problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl0h0wu3ww3pry99yl3r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpl0h0wu3ww3pry99yl3r.png" alt="Image description" width="404" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's go through it with an example.&lt;/p&gt;

&lt;p&gt;Let’s imagine the encoder outputs:&lt;br&gt;
&lt;code&gt;μ=0.5&lt;/code&gt;&lt;br&gt;
&lt;code&gt;σ=1.0&lt;/code&gt;&lt;br&gt;
If we directly sample&lt;br&gt;
&lt;code&gt;z = np.random.normal(mu, sigma)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This operation picks a random number (say 0.23 or 1.42), with no way to predict which.&lt;/p&gt;

&lt;p&gt;There's no way to know how the output z would change if μ or σ changed, because the randomness hides the function's slope.&lt;/p&gt;

&lt;p&gt;And without a gradient, the network can’t learn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can we deal with it?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Reparameterization Trick
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftp15wzbenoh861qgt3px.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftp15wzbenoh861qgt3px.png" alt="Image description" width="400" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the twist:&lt;/p&gt;

&lt;p&gt;Instead of sampling like this:&lt;br&gt;
&lt;code&gt;z ∼ N(μ, σ²)&lt;/code&gt;&lt;br&gt;
We reparameterize it as:&lt;br&gt;
&lt;code&gt;z = μ + σ⋅ε, where ε ∼ N(0, 1)&lt;/code&gt;&lt;br&gt;
Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ε = Random noise from a fixed distribution&lt;/li&gt;
&lt;li&gt;μ, σ = Output from the encoder (learnable)&lt;/li&gt;
&lt;li&gt;z = Latent vector to feed into the decoder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ε is independent of the network, so its randomness doesn’t interfere with gradient flow.&lt;br&gt;
μ and σ now enter through a deterministic operation (addition and multiplication), so gradients can be calculated.&lt;br&gt;
Now you can compute:&lt;br&gt;
&lt;code&gt;&lt;br&gt;
∂z/∂μ = 1&lt;br&gt;
∂z/∂σ = ε&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8duwhgbungvqbjoygara.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8duwhgbungvqbjoygara.png" alt="Image description" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Backpropagation is happy again. 🎉&lt;/p&gt;

&lt;p&gt;This tiny trick makes backpropagation work through the stochastic layer. Without it, training would collapse.&lt;/p&gt;
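In code, the trick is a one-liner (a NumPy sketch; in a real VAE, mu and log_var would come from the encoder network):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    sigma = np.exp(0.5 * log_var)        # recover sigma from log(sigma^2)
    eps = rng.standard_normal(mu.shape)  # randomness lives outside the learnable params
    return mu + sigma * eps

mu = np.array([0.5, -1.0])
log_var = np.zeros(2)  # log(sigma^2) = 0, i.e. sigma = 1
z = sample_z(mu, log_var)
```

Because ε is drawn independently, gradients flow through μ and σ as ordinary multiply-and-add operations.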

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r6j5cn84w8jssmdq7ju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0r6j5cn84w8jssmdq7ju.png" alt="Image description" width="330" height="220"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Changes in the VAE Decoder vs. the AE
&lt;/h2&gt;

&lt;p&gt;The decoder no longer sees a fixed vector in the latent space.&lt;br&gt;
Instead, it gets:&lt;br&gt;
&lt;code&gt;z = μ + σ⋅ε&lt;/code&gt;&lt;br&gt;
The decoder must be able to take any nearby sample around μ and still reconstruct a very similar output.&lt;/p&gt;




&lt;h2&gt;
  
  
  VAE Loss Function: More Than Just Reconstruction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0x5ro5zk3uubjp1zz7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0x5ro5zk3uubjp1zz7w.png" alt="Image description" width="220" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We used a plain reconstruction loss for autoencoders, but that alone is no longer enough. We’re now dealing with probability distributions, and we need to keep the latent distributions close to a standard normal, N(0, 1).&lt;/p&gt;

&lt;p&gt;So we add a Kullback–Leibler (KL) divergence term:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Loss = ReconstructionLoss + β * KLDivergence&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;KL measures how much our learned distribution (μ, σ²) deviates from the standard normal. A higher KL means our encoding is straying too far.&lt;/p&gt;

&lt;p&gt;Think of it like this:&lt;br&gt;
Reconstruction loss: "How well did we recreate the input?"&lt;br&gt;
KL divergence: "How wild is our latent distribution? Should we calm it down?"&lt;/p&gt;
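For a diagonal Gaussian encoder, the KL term has a closed form, so the full loss is easy to write down (a NumPy sketch using MSE as the reconstruction term; the function name is illustrative):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Reconstruction (MSE) plus beta-weighted closed-form KL to N(0, I)."""
    recon = np.mean((x - x_recon) ** 2)
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims, averaged over the batch
    kl = -0.5 * np.mean(np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1))
    return recon + beta * kl

# the KL term is exactly zero when the encoder outputs a standard normal
mu = np.zeros((4, 2))
log_var = np.zeros((4, 2))
x = np.ones((4, 8))
loss = vae_loss(x, x, mu, log_var)  # recon = 0 and KL = 0
```

Raising beta penalizes encodings that stray from N(0, 1) more heavily, which is exactly the β trade-off the post discusses.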

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hxif5cocmb8k1c9w07i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hxif5cocmb8k1c9w07i.png" alt="Image description" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's the use case of the β Parameter?&lt;/p&gt;

&lt;p&gt;The β coefficient balances reconstruction and regularization.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;If &lt;strong&gt;β is too small&lt;/strong&gt;, the KL divergence is ignored. The latent space becomes disorganized, similar to a vanilla AE. Good reconstructions, bad generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If &lt;strong&gt;β is too large&lt;/strong&gt;, the model prioritizes matching N(0,1) over reconstruction. All samples start looking the same—blurry outputs, poor expressiveness.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;✅ Sweet spot: When β is balanced, we get coherent generation and meaningful reconstructions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ju6apoe8ksc86mhxear.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ju6apoe8ksc86mhxear.jpg" alt="Image description" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🧱 Disadvantages of VAEs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tends to generate &lt;strong&gt;blurry images&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;KL term is &lt;strong&gt;tricky to balance&lt;/strong&gt; with reconstruction loss&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not ideal for high-resolution data&lt;/strong&gt; (GANs often outperform here)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx032anta80601q6lu48r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx032anta80601q6lu48r.png" alt="Image description" width="289" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;TL;DR: VAEs are smarter but also harder to tame.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🧭 Wrap-up: When to Use What and Why It All Matters
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Autoencoders&lt;/strong&gt; when you need compression, denoising, or anomaly detection.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;VAEs&lt;/strong&gt; when you need controlled generation, diversity, and smooth interpolation.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;GANs&lt;/strong&gt; when you need photo-realism.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of generative models lies in &lt;strong&gt;hybrids&lt;/strong&gt; — combining VAEs, GANs, and Diffusion models.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>vae</category>
      <category>autoencoder</category>
      <category>generativeai</category>
    </item>
    <item>
      <title>GEN-AI 2 : Auto Encoders</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Sun, 20 Apr 2025 01:19:10 +0000</pubDate>
      <link>https://dev.to/mitanshgor/gen-ai-3-vae-2hoi</link>
      <guid>https://dev.to/mitanshgor/gen-ai-3-vae-2hoi</guid>
      <description>&lt;p&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  📌 What Are Generative Models?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Generative Models&lt;/strong&gt; are machine learning algorithms that learn the &lt;strong&gt;underlying patterns&lt;/strong&gt; or distribution of a dataset to &lt;strong&gt;generate new data&lt;/strong&gt; that resembles the original.&lt;/p&gt;

&lt;p&gt;In human terms:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;They look at thousands of pictures of cats... and then start &lt;em&gt;imagining&lt;/em&gt; their own cats.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1khlbnbmjo19tesy4k7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1khlbnbmjo19tesy4k7.png" alt="IIII3" width="266" height="189"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🌟 Popular Examples of Generative Models
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Common Uses&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;GANs&lt;/strong&gt; (Generative Adversarial Networks)&lt;/td&gt;
&lt;td&gt;Pit 2 networks (Generator &amp;amp; Discriminator) against each other&lt;/td&gt;
&lt;td&gt;Deepfakes, art generation, face synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;VAEs&lt;/strong&gt; (Variational Autoencoders)&lt;/td&gt;
&lt;td&gt;Probabilistic version of AEs with smooth latent space&lt;/td&gt;
&lt;td&gt;Image reconstruction, data generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Autoregressive Models&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Predict next element based on previous ones&lt;/td&gt;
&lt;td&gt;GPT, music generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Diffusion Models&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Denoise images gradually to generate high-res outputs&lt;/td&gt;
&lt;td&gt;DALL·E 3, Stable Diffusion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🛣️ How Do Generative Models Learn?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3n8chlda40ejddng3m4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3n8chlda40ejddng3m4.png" alt="I0" width="339" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine you’re an artist learning how to draw cats. You spend hours going through cat books, sketching from photos, and watching cat videos on YouTube (because obviously). Over time, without memorizing each specific cat, you start to get a feel for what a “cat” is — four legs, whiskers, maybe some attitude.&lt;/p&gt;

&lt;p&gt;Their mission:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Learn the vibe of the data so well that they can create new things that look like it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Typical Path:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Input Data&lt;/strong&gt; :  This is your model’s training time — a massive gallery of examples.&lt;br&gt;
Think: Thousands of cat pictures (or faces, music, text). No labels needed — just raw, unfiltered examples. The model isn’t told: “This is a cat.” It just sees cats, over and over. And from that, it begins to understand patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latent Representation&lt;/strong&gt; &lt;br&gt;
The model tries to compress all the messy, pixel-by-pixel cat data into something smaller and meaningful, kind of like the artist thinking: “Okay, I don’t remember every cat, but most cats have pointy ears, some fur texture, and this cat-ish outline... I can work with that.” In AI terms, this is the &lt;strong&gt;&lt;u&gt;latent space&lt;/u&gt;&lt;/strong&gt; — a multi-dimensional map where similar things live near each other. Imagine a universe where all cats are clustered in one corner, dogs in another, and so on.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthdz4m86qtq3yb6ntfpe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthdz4m86qtq3yb6ntfpe.png" alt="D3" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reconstruction or Generation&lt;/strong&gt;:
Once the model has this compressed knowledge, it can do either of these:

&lt;ul&gt;
&lt;li&gt;Reconstruct what it saw: “Can I redraw the same cat just from memory?”&lt;/li&gt;
&lt;li&gt;Generate something new: “Can I draw a brand new cat that no one’s ever seen, but still looks legit?”
And here’s the twist — different types of models do this learning differently.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;This learning can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Explicit&lt;/strong&gt; (learning the actual data distribution like VAEs)&lt;br&gt;
These models are like a structured student. They try to understand the actual math behind how your data is spread out, trying to learn the probability distribution. They care about making the latent space smooth, organized, and logical.&lt;br&gt;
You can walk around the space, sample random points, and get meaningful results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Implicit&lt;/strong&gt; (learning to fool a critic like in GANs)&lt;br&gt;
These models are the street-smart artists. They don’t care about formulas — they just want their results to look good and be convincing. GANs do this through a game:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One network (the Generator) tries to create fake data&lt;/li&gt;
&lt;li&gt;Another (the Discriminator) tries to spot the fakes&lt;/li&gt;
&lt;li&gt;They improve together in a loop until the fakes are too good to tell apart.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;They never explicitly learn the actual math of the data, just how to mimic it well enough to fool someone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgdzfxle1iqpvx1ojoq2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgdzfxle1iqpvx1ojoq2.gif" alt="Image description" width="320" height="160"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗️ Autoencoders Architecture (AE): Compress, Learn, Reconstruct
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw5fwby1a9qv7fq5l69p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw5fwby1a9qv7fq5l69p.png" alt="I3" width="638" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s say we’re dealing with image data (like handwritten digits from MNIST — each image is 28x28 pixels).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Input Layer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You feed the model a raw input vector, say a flattened version of the original data (e.g., 28x28 image = 784 dims). This is just pixel intensity values (usually between 0 and 1 if normalized).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Encoder&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The encoder is a neural network that compresses the input into a lower-dimensional representation — the latent vector z.&lt;br&gt;
&lt;code&gt;z = FuncEncoder(x)&lt;/code&gt;&lt;br&gt;
In our example, we squeeze 784 dimensions down to just 2, so the size of the z vector = 2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjbkny3e96uvdoqvzy15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjbkny3e96uvdoqvzy15.png" alt="D1" width="323" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We want the model to learn what matters most. By forcing it to compress, the model has to filter out unimportant noise and capture the essence of the data. This is called the bottleneck. You keep only the information about the image that matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can we do that?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We pass the input image (28×28) through 3 convolutional layers, one after the other.&lt;/li&gt;
&lt;li&gt;We use a stride of 2 (for example), which means each layer halves the spatial size of the image while increasing the number of feature channels (kind of like zooming out but seeing more detail).&lt;/li&gt;
&lt;li&gt;So the image (Col×Row×featureChannel) goes from 28×28×1 → 14×14×32 → 7×7×64 → 4×4×128.&lt;/li&gt;
&lt;li&gt;Flatten (4 × 4 × 128) → 2048-dimensional vector.&lt;/li&gt;
&lt;li&gt;The 2048-dimensional vector feeds into a fully connected layer, which outputs the 2-dimensional latent vector z: a compact 2D representation of your image.&lt;/li&gt;
&lt;/ul&gt;
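&lt;p&gt;A quick way to sanity-check this shape bookkeeping is to run the arithmetic yourself. The sketch below (plain Python, no ML library) assumes each conv layer uses "same" padding with stride 2, so every layer's output side is the input side divided by 2, rounded up:&lt;/p&gt;

```python
import math

def conv_out(side, stride=2):
    # "Same" padding with stride 2: output side is ceil(input / stride)
    return math.ceil(side / stride)

# (side, channels) after each encoder layer
shapes = [(28, 1)]
for channels in (32, 64, 128):
    shapes.append((conv_out(shapes[-1][0]), channels))

print(shapes)  # [(28, 1), (14, 32), (7, 64), (4, 128)]

# Flattening the final 4x4x128 tensor:
flat = shapes[-1][0] ** 2 * shapes[-1][1]
print(flat)    # 2048
```

&lt;p&gt;A final dense layer then maps this 2048-dimensional vector to the 2-dimensional latent vector z.&lt;/p&gt;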

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz7xaay12u4n6ufii2gv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz7xaay12u4n6ufii2gv.png" alt="Stride and Filter" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Decoder&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now the decoder takes that compressed vector z and tries to rebuild the original image. It is essentially a mirror image of the encoder.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;x′ = FuncDecoder(z)&lt;/code&gt;&lt;br&gt;
It converts the 2D latent vector back into a 28×28 image. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can we do that?&lt;/strong&gt;&lt;br&gt;
Instead of using normal convolutional layers that shrink things down, it uses transposed convolutions (or "deconvolutions") to build things back up to the original size. While a regular Conv layer with stride 2 shrinks the image, a Conv2DTranspose with stride 2 does the opposite — it doubles the size of the feature map. You can also use upsampling, which simply stretches the image to a larger size, but unlike transposed convolutions, it doesn't learn anything — it just resizes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here’s how the decoder does that:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with the 2-dimensional latent vector (Z), and pass it through a fully connected (dense) layer to expand it into a larger vector — say, 2048 dimensions.&lt;/li&gt;
&lt;li&gt;Reshape that 2048-dimensional vector back into a 4 × 4 × 128 tensor, like rewinding a movie to the middle scene.&lt;/li&gt;
&lt;li&gt;Now we apply transposed convolutional layers (Conv2DTranspose) one after another — each with a stride of 2 — which gradually upsamples the feature maps while reducing the number of channels: 4×4×128 → 7×7×64 → 14×14×32 → 28×28×1&lt;/li&gt;
&lt;li&gt;The final layer outputs an image that's the same size as the input — in this case, 28×28 pixels with 1 channel (grayscale).&lt;/li&gt;
&lt;li&gt;Optionally, we apply a sigmoid activation at the end so all pixel values are between 0 and 1 — perfect for comparing with the original input during loss calculation!&lt;/li&gt;
&lt;/ul&gt;
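&lt;p&gt;To make the difference between upsampling and transposed convolution concrete: nearest-neighbour upsampling just stretches each pixel into a block, with no weights to learn. A minimal NumPy sketch:&lt;/p&gt;

```python
import numpy as np

def upsample_nn(feature_map, factor=2):
    # Stretch each pixel into a factor-by-factor block.
    # Unlike Conv2DTranspose, there is nothing learnable here: it only resizes.
    return np.repeat(np.repeat(feature_map, factor, axis=0), factor, axis=1)

fm = np.array([[1.0, 2.0],
               [3.0, 4.0]])
up = upsample_nn(fm)
print(up)
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]
```

&lt;p&gt;A Conv2DTranspose layer produces the same kind of size increase, but its output values come from learned filters rather than simple pixel copying.&lt;/p&gt;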

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femu4wpnr26jtk8e1yjyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femu4wpnr26jtk8e1yjyu.png" alt="III4" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the model reconstructs the 28×28 image, it computes the loss by comparing the reconstruction with the original image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loss Function Check
&lt;/h3&gt;

&lt;p&gt;We want the input 𝑥 and the reconstruction 𝑥′ to be as close as possible.&lt;br&gt;
So we minimize a loss function that measures the difference between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7h4qiuarqgyplohdsput.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7h4qiuarqgyplohdsput.png" alt="D2" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two Common Losses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Mean Square Error (MSE / L2 loss):  It computes the square of the difference between each element in the predicted output vector and the corresponding element in the ground truth, then takes the average across all elements. This loss is particularly useful when dealing with continuous data and is known for its sensitivity to outliers.&lt;br&gt;
&lt;code&gt;MSE Loss(J) = (1/n) * Σ(y_true - y_pred)^2&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Binary Cross-Entropy (BCE): used to measure the difference between two probability distributions. We use a sigmoid activation in the final layer of the decoder so that &lt;code&gt;0 &amp;lt; x′ &amp;lt; 1&lt;/code&gt;, which makes it interpretable as a probability — a requirement for BCE to work.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
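&lt;p&gt;Both losses fit in a few lines of NumPy. The pixel values below are made up purely for illustration:&lt;/p&gt;

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def bce(y_true, y_pred, eps=1e-7):
    # y_pred must lie strictly between 0 and 1, which is exactly why the
    # decoder ends with a sigmoid; eps guards against log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

x_orig = np.array([0.0, 1.0, 1.0, 0.0])  # original pixels
x_rec  = np.array([0.1, 0.9, 0.8, 0.2])  # reconstruction
print(round(float(mse(x_orig, x_rec)), 3))  # 0.025
```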

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sj1rzqcxj0vsh4ixmoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sj1rzqcxj0vsh4ixmoh.png" alt="II0" width="800" height="135"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model looks at that loss and works backward (backpropagation). It figures out, layer by layer, how much each connection (weight) contributed to the mistake. Then, using an optimizer like Adam or SGD, it tweaks those weights just a little to do better next time. This process repeats again and again, and over time, the model gets good at understanding what matters in the data, like learning the essence of a handwritten “3” without being told it’s a 3.&lt;/p&gt;
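&lt;p&gt;The whole encode → decode → loss → backpropagate cycle can be shown end to end. The following is a toy single-layer linear autoencoder in NumPy, not the convolutional model above; it is just the smallest thing that exercises the training loop:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((64, 8))            # toy "images": 64 samples, 8 features each

# Encode 8 dims to 2, decode 2 dims back to 8 (no biases, no activations)
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

lr = 0.1
losses = []
for _ in range(200):
    z = x @ W_enc                   # encoder
    x_rec = z @ W_dec               # decoder
    err = x_rec - x
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation: gradients of the MSE w.r.t. each weight matrix
    grad_dec = z.T @ err * (2 / err.size)
    grad_enc = x.T @ (err @ W_dec.T) * (2 / err.size)
    W_dec -= lr * grad_dec          # plain SGD update
    W_enc -= lr * grad_enc

print(losses[0] - losses[-1])       # positive if training reduced the error
```

&lt;p&gt;A real autoencoder swaps the two matrices for the conv/deconv stacks described above and lets an optimizer like Adam handle the updates.&lt;/p&gt;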




&lt;h2&gt;
  
  
  ❌ Limitations of Autoencoders
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0d7miukjze3oczbkrsc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0d7miukjze3oczbkrsc.png" alt="copypasterMeme" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When we try to sample (pick) a random point from the latent space, we often end up with strange or meaningless outputs. That’s because some regions of the space are crowded, while others are empty, so we might accidentally pick a point from a “dead zone” where the model has never seen any training data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Autoencoders don’t enforce any structure in the latent space. There's no rule saying, “the points should follow a neat, smooth pattern like a normal distribution.” So when we try to generate new data by sampling from the space, we don’t really know where to pick from — it's like throwing darts in the dark.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There are gaps or holes in the latent space — areas that never got used during training. The model has no clue how to decode these zones, so it often produces blurry or totally unrecognizable images.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Even points close together in space can produce wildly different outputs. For example, if (–1, –1) creates a nice image, (–1.1, –1.1) might give total noise. That’s because the model doesn’t care about making the space smooth or continuous — it's just trying to reconstruct training images.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In 2D latent spaces (for example, when visualizing MNIST), this problem isn’t too bad — everything is kind of squished together. But when we move to higher dimensions (like when generating complex stuff — faces, clothes, etc.), the gaps and discontinuities get worse. So even though more dimensions are needed to capture complex patterns, they also make the space harder to sample from safely.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>deeplearning</category>
      <category>ai</category>
      <category>generative</category>
      <category>autoencoder</category>
    </item>
    <item>
      <title>Gen-AI 1 : The Generative AI Universe</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Sun, 20 Apr 2025 01:19:02 +0000</pubDate>
      <link>https://dev.to/mitanshgor/gen-ai-1-the-generative-ai-universe-5gmh</link>
      <guid>https://dev.to/mitanshgor/gen-ai-1-the-generative-ai-universe-5gmh</guid>
      <description>&lt;p&gt;Alright, reader — buckle up! You’re about to embark on an exciting, mind-bending journey into the world of &lt;strong&gt;Generative AI&lt;/strong&gt;. This isn't just a technical deep dive; it's a story, a pathway — one that starts simple and grows into the cutting edge of machine creativity.&lt;/p&gt;

&lt;p&gt;Think of this as leveling up in a video game 🎮 — each model you learn is a new power-up!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05lxwg5rditc6gibb5bp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F05lxwg5rditc6gibb5bp.png" alt="I1" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🧱 Level 1: Autoencoders (AE) — &lt;em&gt;The Compression Artists&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;We kick things off with &lt;strong&gt;Autoencoders&lt;/strong&gt;, the humble but powerful networks that learn to &lt;em&gt;compress&lt;/em&gt; and &lt;em&gt;rebuild&lt;/em&gt; data. They’re the foundation — no frills, just the raw ability to reconstruct images like a neural .zip file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8k7e1osr6v85zpralmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8k7e1osr6v85zpralmi.png" alt="I2" width="540" height="535"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🌀 Level 2: Variational Autoencoders (VAE) — &lt;em&gt;Bringing Probability to the Party&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Next, we get a bit more &lt;strong&gt;creative&lt;/strong&gt; with &lt;strong&gt;VAEs&lt;/strong&gt;. Instead of just compressing, they &lt;strong&gt;model distributions&lt;/strong&gt;, letting us &lt;strong&gt;generate&lt;/strong&gt; brand-new samples. You’ll meet latent spaces that look like dreamy universes where cats, dogs, and unicorns each have their own neighborhoods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsagg37mh1xqm1s3cib8j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsagg37mh1xqm1s3cib8j.png" alt="I3" width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🎭 Level 3: Generative Adversarial Networks (GAN) — &lt;em&gt;The Ultimate Fake Artists&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Now things get spicy. &lt;strong&gt;GANs&lt;/strong&gt; introduce an epic &lt;em&gt;duel&lt;/em&gt; between two networks — a &lt;strong&gt;Generator&lt;/strong&gt; and a &lt;strong&gt;Discriminator&lt;/strong&gt;. The Generator tries to fool the Discriminator, and in doing so, creates stunningly realistic data. It’s like a counterfeiter and a detective locked in a neural arms race. 🕵️‍♂️💣&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjirvsljts88zar3n9xy6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjirvsljts88zar3n9xy6.png" alt="I4" width="498" height="315"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🌊 Level 4: Wasserstein GANs (WGAN) — &lt;em&gt;Stability, Finally!&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;But GANs can be moody. That’s where &lt;strong&gt;WGANs&lt;/strong&gt; come in. With &lt;strong&gt;Wasserstein loss&lt;/strong&gt;, they bring calm and stability to the chaotic training process. Less mode collapse, more control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkakgyl9mtmgbuvgev1ws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkakgyl9mtmgbuvgev1ws.png" alt="I5" width="423" height="415"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🌊🧼 Level 5: WGAN-GP — &lt;em&gt;Smooth Operators&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Add a &lt;strong&gt;Gradient Penalty (GP)&lt;/strong&gt; and voilà — &lt;strong&gt;WGAN-GP&lt;/strong&gt; is born. It regularizes training, keeping the Discriminator’s gradients smooth and ensuring even better results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxixck65359ola9ovkod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxixck65359ola9ovkod.png" alt="I6" width="335" height="220"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🎨 Level 6: Conditional GANs (CGAN) — &lt;em&gt;The Mind Readers&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Want a GAN that listens to you? Say hello to &lt;strong&gt;CGANs&lt;/strong&gt;, where you can generate images &lt;strong&gt;based on labels&lt;/strong&gt;. “Hey GAN, draw me a sneaker!” And it does.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e2gsq88b3bv8bp3tv0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e2gsq88b3bv8bp3tv0o.png" alt="I7" width="392" height="220"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  💨 Level 7: Diffusion Models — &lt;em&gt;From Noise to Masterpiece&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Diffusion Models flip the script. They &lt;strong&gt;start with pure noise&lt;/strong&gt; and slowly reverse it to create images — like watching fog form into a painting. These are powering &lt;strong&gt;DALL·E 2&lt;/strong&gt;, &lt;strong&gt;Stable Diffusion&lt;/strong&gt;, and the latest text-to-image breakthroughs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8t10bgatuqc8tmuwtaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8t10bgatuqc8tmuwtaq.png" alt="I8" width="332" height="220"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🔁 Level 8: Recurrent Neural Networks (RNNs) — &lt;em&gt;The Time Travelers&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Now we move into sequence-land. &lt;strong&gt;RNNs&lt;/strong&gt; are the OGs of temporal data, remembering past inputs to generate text, music, and more. Great for early language modeling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favm5ljp0ae14paoxkrul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favm5ljp0ae14paoxkrul.png" alt="I9" width="323" height="220"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  ⚡ Level 9: Transformers — &lt;em&gt;The Reigning Champions&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;And finally… the &lt;strong&gt;Transformers&lt;/strong&gt;. The architecture behind &lt;strong&gt;GPT&lt;/strong&gt;, &lt;strong&gt;BERT&lt;/strong&gt;, &lt;strong&gt;DALL·E&lt;/strong&gt;, and nearly every major AI breakthrough in recent years. With &lt;strong&gt;attention mechanisms&lt;/strong&gt;, they’ve redefined how machines understand and generate language, images, even code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d8y6xur4jdwkk49idfi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d8y6xur4jdwkk49idfi.png" alt="I10" width="352" height="220"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  🌈 What Awaits You
&lt;/h3&gt;

&lt;p&gt;By the end of this journey, you won’t just know how these models work — you’ll &lt;em&gt;see&lt;/em&gt; the connections, appreciate the evolution, and maybe even build your own!&lt;/p&gt;

&lt;p&gt;So…&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Start simple. Go deep. Level up. And let the generative magic begin.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xm9mwg7fz9939uftznb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xm9mwg7fz9939uftznb.png" alt="I" width="392" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>generative</category>
      <category>pathway</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Carla Simulator 1 : How to Set Up CARLA Simulator 🏎️🔥</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Fri, 20 Dec 2024 22:06:12 +0000</pubDate>
      <link>https://dev.to/mitanshgor/carla-simulator-1-how-to-set-up-carla-simulator-toretto-style-3800</link>
      <guid>https://dev.to/mitanshgor/carla-simulator-1-how-to-set-up-carla-simulator-toretto-style-3800</guid>
      <description>&lt;p&gt;Yo, it's Dom here 🏁. Family's everything to me, but today, I'm pulling over from the street races to drop some knowledge about setting up the CARLA simulator. We're about to step into a virtual world where precision meets adrenaline. Buckle up, because we're cruising through what CARLA is, why it's your ultimate driving ally, and how to set it up like a pro.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd0djiwqo7joe6syims1.jpg" alt="Image description" width="800" height="450"&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://drive.google.com/drive/home?dmr=1&amp;amp;ec=wgc-drive-hero-goto" rel="noopener noreferrer"&gt;Check the video&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s CARLA, and Why Do We Need It? 🛠️🚗
&lt;/h3&gt;

&lt;p&gt;CARLA isn't just some fancy tool; it's the NOS of autonomous driving simulators. Built on Unreal Engine, this open-source bad boy helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train perception algorithms 🎯.&lt;/li&gt;
&lt;li&gt;Learn driving policies.&lt;/li&gt;
&lt;li&gt;Validate your autonomous systems safely without denting a single bumper.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From radars to cameras and even LIDAR, CARLA hooks up your virtual ride with every sensor imaginable. Need realistic urban setups? It’s got those. Want to replay every crash like a Fast &amp;amp; Furious movie? There's a recorder for that. 🌆🎥&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Simulators Like CARLA Matter 🚦
&lt;/h3&gt;

&lt;p&gt;So, why the heck do we need CARLA? Well, these simulators are game-changers for anyone in autonomous driving. You wouldn’t take your muscle car out on a high-speed chase without practicing first, right? The same goes for autonomous cars—they need to be trained in different environments and situations, and that's where CARLA comes in.&lt;/p&gt;

&lt;p&gt;We use simulators like CARLA for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Training algorithms that teach cars to drive themselves (without causing a wreck).&lt;/li&gt;
&lt;li&gt;Testing perception systems (sensors that make sure the car sees everything it needs to).&lt;/li&gt;
&lt;li&gt;Learning driving policies (figuring out how to make the car follow the rules of the road and not blow through red lights).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without simulators, self-driving cars would be like a race car driver trying to learn by just jumping into a car and going 200 mph without practice. You need to test the machine before hitting the real world. And trust me, we don’t want any bad surprises when we’re going full throttle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up CARLA - Let's Get to Work 💻💥
&lt;/h3&gt;

&lt;p&gt;Alright, now let’s get down to business. Listen up! You wanna set up CARLA on a Linux VM? It's gonna be a wild ride, but I've got your back. Follow these steps like you're gunning for a 10-second quarter mile. Let's get this show on the road. 🔥&lt;/p&gt;

&lt;h4&gt;
  
  
  Download CARLA
&lt;/h4&gt;

&lt;p&gt;First, hit up the CARLA GitHub repo to grab the carla-0.9.12-linux tarball. This is the start of your journey.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;wget https://tiny.carla.org/carla-0-9-12-linux&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;You can’t drive a car without a motor, right? This tarball's gonna be your engine. Now, let’s crack that tarball open and release the power inside. Run the command.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tar -xzvf carla-0-9-12-linux&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h4&gt;
  
  
  Install NVIDIA Drivers
&lt;/h4&gt;

&lt;p&gt;If you’re running this on a beefy machine with a GPU (and you should be, because we’re going for high-speed performance here), we need to get the right drivers in place. Install the NVIDIA drivers:&lt;br&gt;
&lt;code&gt;sudo apt install nvidia-driver-535&lt;/code&gt;&lt;br&gt;
Once the drivers are installed, you’ve gotta give your system a little restart to make sure everything kicks into gear using &lt;code&gt;sudo reboot&lt;/code&gt;.&lt;br&gt;
After rebooting, make sure your NVIDIA drivers are running smooth. Run &lt;code&gt;nvidia-smi&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Install Vulkan Tools
&lt;/h4&gt;

&lt;p&gt;Vulkan is your performance boost, just like nitrous on a muscle car. Install it with&lt;br&gt;
&lt;code&gt;sudo apt-get install vulkan-tools&lt;/code&gt;&lt;br&gt;
We need all the performance we can get for this simulation. Think of Vulkan as your nitrous system—helps CARLA run smoother and faster.&lt;/p&gt;
&lt;h4&gt;
  
  
  Download and Install XQuartz
&lt;/h4&gt;

&lt;p&gt;You can’t drive the machine if you can’t see the road. For that, we need XQuartz to get the display working on your Mac. Download it from &lt;a href="https://www.xquartz.org/" rel="noopener noreferrer"&gt;url&lt;/a&gt; and install it.&lt;/p&gt;
&lt;h4&gt;
  
  
  SSH into the VM with X11 Forwarding
&lt;/h4&gt;

&lt;p&gt;Now, you need to connect to your Linux VM from your Mac, with some special sauce (X11 forwarding). Use:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ssh -X &amp;lt;studentid&amp;gt;@host&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h4&gt;
  
  
  Export Display Port
&lt;/h4&gt;

&lt;p&gt;You’re not racing if you can’t see the track. Set the display port like this:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export DISPLAY=:0&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;This ensures you’re looking at the right display—your virtual cockpit&lt;/p&gt;

&lt;h4&gt;
  
  
  Export XDG_RUNTIME
&lt;/h4&gt;

&lt;p&gt;For the system to know where to store the runtime files, do this:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export XDG_RUNTIME_DIR=/run/user/$(id -u)&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;vulkaninfo&lt;/code&gt; to make sure everything’s running like a well-oiled machine.&lt;/p&gt;

&lt;h4&gt;
  
  
  CARLA Python client
&lt;/h4&gt;

&lt;p&gt;To make sure your machine can talk to the CARLA simulator server, you're gonna need the CARLA Python client. Let’s get it installed fast, like we're revving up for a race. 🏁&lt;/p&gt;

&lt;p&gt;Run this command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install carla&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Start Simulator
&lt;/h3&gt;

&lt;p&gt;Alright, you’ve done the hard work. Now it’s time to hit the gas! 🏁&lt;br&gt;
Let’s fire up CARLA and see what this machine can do. Run the command below, and watch the magic happen. 🚗💨&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;./CarlaUE4.sh -vulkan&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Boom, you're in the driver’s seat. CARLA’s about to show you what real simulation looks like. Let’s roll! 👊&lt;/p&gt;

&lt;p&gt;If you hit an error saying it can't access the display, it's like your car's stuck in the garage and the door won’t open. 🚗💥 Time to unlock that door with:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;xhost +local:root&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;if you're still facing display issues, let's put the pedal to the metal and get creative. 💥&lt;br&gt;
Hit it with this command:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export DISPLAY=localhost:10.0&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h4&gt;
  
  
  Connect to Your Cloud VM Using VNC
&lt;/h4&gt;

&lt;p&gt;We need to get that simulator running on the cloud VM and beam it over to your local machine. Think of it like driving a car in the clouds and watching it on your own screen. &lt;/p&gt;

&lt;p&gt;We’re gonna use SSH tunneling to forward the port where the CARLA simulator is running in the VM to your local screen. It's like connecting the engine to the wheels. 🏎️&lt;br&gt;
&lt;code&gt;ssh -L &amp;lt;local_port&amp;gt;:localhost:5901 &amp;lt;your_vm_username&amp;gt;@&amp;lt;vm_ip_address&amp;gt;&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;Now, you should be seeing the CARLA simulator on your local screen. It’s like you’ve got the wheel in your hands while the car’s on the cloud. 🌩️&lt;/p&gt;

&lt;h3&gt;
  
  
  What You'll See 🌟
&lt;/h3&gt;

&lt;p&gt;Just like how I always say, “I live my life a quarter mile at a time,” with CARLA, you can take your simulations step by step. It’s not just about fast cars—it’s about using the right tools to get things done the right way. So, whether you’re testing driving policies, training perception algorithms, or just want to feel the thrill of autonomous driving, CARLA’s your ticket.&lt;/p&gt;

&lt;p&gt;Now go out there, hit the gas, and get started with CARLA. It’s a wild ride, but you’ve got the skills to drive it home. 🏁&lt;/p&gt;

&lt;p&gt;Family’s got your back!&lt;/p&gt;




&lt;p&gt;Hold tight because, in the next blog, we’ll dive into adding cars, walkers, and more into CARLA. And who better to take the wheel for that than my close friend, right-hand man, and brother, Brian O’Conner? &lt;/p&gt;

&lt;p&gt;Stay tuned—he’s up next. 🔥&lt;br&gt;
Dominic Toretto, signing off. 🏎️💨&lt;br&gt;
check &lt;a href="https://dev.to/mitanshgor/carla-simulator-2-welcome-to-the-ride-brian-cooper-here-4jbi"&gt;Blog 2&lt;/a&gt; of Carla Simulator.&lt;/p&gt;

</description>
      <category>carla</category>
      <category>simulation</category>
      <category>linux</category>
      <category>ssh</category>
    </item>
    <item>
      <title>Carla Simulator 2 : Welcome to the Ride 🚗🏍️</title>
      <dc:creator>Mitansh Gor</dc:creator>
      <pubDate>Fri, 20 Dec 2024 22:03:50 +0000</pubDate>
      <link>https://dev.to/mitanshgor/carla-simulator-2-welcome-to-the-ride-brian-cooper-here-4jbi</link>
      <guid>https://dev.to/mitanshgor/carla-simulator-2-welcome-to-the-ride-brian-cooper-here-4jbi</guid>
      <description>&lt;p&gt;check &lt;a href="https://dev.to/mitanshgor/carla-simulator-1-how-to-set-up-carla-simulator-toretto-style-3800"&gt;Blog 1&lt;/a&gt; of Carla Simulator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://drive.google.com/drive/home?dmr=1&amp;amp;ec=wgc-drive-hero-goto" rel="noopener noreferrer"&gt;Check Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hey, family! It’s Brian here — yes, the Brian O’Conner, the gearhead who always has your back on the road or in a high-speed chase. Dom handed me the keys to this blog, saying it’s time to dive deep into some next-gen simulator tech that fuels the autonomous revolution. Thanks, Dom, for trusting me. I’ll try not to wreck it (not that I ever wreck anything, right? 😉).&lt;/p&gt;

&lt;p&gt;Dom mentioned how the CARLA simulator lets us test and train autonomous vehicles in a realistic, open-source environment. It’s got the kind of details that make even Letty go, "Damn, that’s sharp." Our mission here? To break it down for you like we would a turbocharged engine. You’ll know everything about CARLA’s actors, vehicles, and how to control them with the finesse of a perfectly-timed NOS burst. And if you stick with me, we’ll even set up a Flask API to bring this simulation to life. Let’s ride! 🏍️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq36abon970vjwbr4g3o0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq36abon970vjwbr4g3o0.png" alt=" " width="612" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Plan for This Blog
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Actors and Blueprints 101: Understand the essentials of what makes CARLA tick, from vehicles to walkers and how blueprints play a role.&lt;/li&gt;
&lt;li&gt;Dive into Vehicle and Walker Management: Learn how actors are spawned, manipulated, and destroyed — think of it like choosing the perfect car for a street race and knowing when to swap gears.&lt;/li&gt;
&lt;li&gt;Flask API Integration: Build Flask APIs step-by-step to control CARLA’s simulation programmatically. By the end, you’ll be setting up car chases, random pedestrians, and traffic with just a few lines of code.&lt;/li&gt;
&lt;li&gt;Hands-on Examples: Showcase how to populate the simulation with vehicles and walkers using our Flask API endpoints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready to start? Let’s shift into gear! 🏎️&lt;/p&gt;

&lt;h3&gt;
  
  
  Actors and Blueprints 101 🎨
&lt;/h3&gt;

&lt;p&gt;In CARLA, actors are the dynamic elements that breathe life into the simulation. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vehicles: Cars, trucks, bikes — the heart of the simulation.&lt;/li&gt;
&lt;li&gt;Walkers: Pedestrians who roam the environment.&lt;/li&gt;
&lt;li&gt;Traffic Entities: Traffic lights and stop signs.&lt;/li&gt;
&lt;li&gt;Sensors: Cameras, LiDAR, radar, and GPS for data collection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Blueprints: The DNA of Actors&lt;/p&gt;

&lt;p&gt;Think of blueprints as pre-configured templates for creating actors. CARLA’s blueprint library includes attributes like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vehicle Colors&lt;/li&gt;
&lt;li&gt;Engine Power&lt;/li&gt;
&lt;li&gt;Walker Speed&lt;/li&gt;
&lt;li&gt;Sensor Ranges&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each blueprint is customizable, letting you tweak the simulation to match real-world scenarios. It’s like choosing a Skyline’s body kit and tuning it for maximum horsepower.&lt;/p&gt;
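&lt;p&gt;As a quick sketch of what that tuning looks like in code (it assumes a CARLA server running on the default &lt;code&gt;localhost:2000&lt;/code&gt; endpoint), you can grab a blueprint and set its attributes before spawning:&lt;/p&gt;

```python
# Sketch only: customizing a vehicle blueprint before spawning.
# Requires a running CARLA server; localhost:2000 is the default endpoint.
import random

import carla

client = carla.Client('localhost', 2000)
client.set_timeout(5.0)
world = client.get_world()

blueprint_library = world.get_blueprint_library()
vehicle_bp = blueprint_library.filter('vehicle.tesla.model3')[0]

# Attribute values are strings; recommended_values lists the valid options
if vehicle_bp.has_attribute('color'):
    color = random.choice(vehicle_bp.get_attribute('color').recommended_values)
    vehicle_bp.set_attribute('color', color)
```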

&lt;h3&gt;
  
  
  Managing Actors: Spawning, Interacting, and Destroying
&lt;/h3&gt;

&lt;p&gt;In CARLA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spawning: Use the spawn_actor API to bring vehicles and walkers into the world.&lt;/li&gt;
&lt;li&gt;Interacting: Control the physics, speed, and direction of actors.&lt;/li&gt;
&lt;li&gt;Destroying: Remove actors once their role in the simulation is done.&lt;/li&gt;
&lt;/ul&gt;
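&lt;p&gt;Put together, the whole lifecycle is only a few calls. A minimal sketch, assuming a running CARLA server on the default port:&lt;/p&gt;

```python
# Sketch: spawn -> interact -> destroy, against a running CARLA server
import random

import carla

client = carla.Client('localhost', 2000)
client.set_timeout(5.0)
world = client.get_world()

# Spawn
bp = world.get_blueprint_library().filter('vehicle.*')[0]
spawn_point = random.choice(world.get_map().get_spawn_points())
vehicle = world.spawn_actor(bp, spawn_point)

# Interact: throttle, steer, and brake all live on carla.VehicleControl
vehicle.apply_control(carla.VehicleControl(throttle=0.5, steer=0.0))

# ...or hand the wheel over entirely
vehicle.set_autopilot(True)

# Destroy once the actor's role is done
vehicle.destroy()
```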

&lt;p&gt;Vehicle Physics 🏎️⚡&lt;br&gt;
Vehicles in CARLA come equipped with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic Control: Accelerate, brake, and steer.&lt;/li&gt;
&lt;li&gt;Collision Response: Simulates crashes and obstacle impacts.&lt;/li&gt;
&lt;li&gt;Environmental Response: Handles weather and road conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Walker Dynamics 🚶‍♂️&lt;br&gt;
Walkers can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow predefined paths.&lt;/li&gt;
&lt;li&gt;React dynamically to traffic.&lt;/li&gt;
&lt;li&gt;Walk at various speeds and animations.&lt;/li&gt;
&lt;/ul&gt;
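&lt;p&gt;For direct control, CARLA exposes &lt;code&gt;carla.WalkerControl&lt;/code&gt;. A small sketch (it assumes &lt;code&gt;walker&lt;/code&gt; is a pedestrian actor you have already spawned):&lt;/p&gt;

```python
# Sketch: steering a walker directly with carla.WalkerControl.
# Assumes `walker` is a spawned walker actor from world.spawn_actor(...).
import carla

control = carla.WalkerControl()
control.direction = carla.Vector3D(x=1.0, y=0.0, z=0.0)  # walk along +x
control.speed = 1.4  # roughly a brisk human walking pace, in m/s
walker.apply_control(control)
```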
&lt;h3&gt;
  
  
  Flask API Integration: Building the Engine 🚒
&lt;/h3&gt;

&lt;p&gt;Now that we’ve got the basics, it’s time to build a Flask API to interact with CARLA. Here’s how:&lt;/p&gt;
&lt;h3&gt;
  
  
  Flask API Endpoints
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Spawn Vehicles
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@app.route('/spawn_vehicle', methods=['POST'])
def spawn_vehicle():
    data = request.get_json()
    model = data.get('model', 'vehicle.tesla.model3')

    blueprint_library = world.get_blueprint_library()
    vehicle_bp = blueprint_library.filter(model)[0]

    spawn_points = world.get_map().get_spawn_points()
    if not spawn_points:
        return jsonify({'error': 'No spawn points available'}), 400

    spawn_point = random.choice(spawn_points)
    vehicle = world.spawn_actor(vehicle_bp, spawn_point)

    return jsonify({'message': 'Vehicle spawned', 'id': vehicle.id})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
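&lt;p&gt;With the Flask app running (port 5000 by default), you can exercise the endpoint from the command line; the model name here is just an example:&lt;/p&gt;

```shell
# Spawn a specific model; omit the body to fall back to the Tesla default
curl -X POST http://localhost:5000/spawn_vehicle \
  -H 'Content-Type: application/json' \
  -d '{"model": "vehicle.audi.tt"}'
```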


&lt;ul&gt;
&lt;li&gt;✨📸 Setting Up the Sensor Cameras&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To capture images from a vehicle's top view and front view, two cameras are mounted on the car. Here's the setup:&lt;/p&gt;

&lt;p&gt;The code retrieves camera blueprints (sensor.camera.rgb) from CARLA's blueprint library. Resolution settings (800x600) are defined for the cameras to ensure clear visuals. 📷&lt;/p&gt;

&lt;p&gt;Top Camera: Positioned directly above the car to capture a bird's-eye view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff25ucce8b8x4z4bmiev8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff25ucce8b8x4z4bmiev8.png" alt=" " width="390" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Front Camera: Placed at the front of the car for a forward-facing perspective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpvkt0fxm4xdyyggign6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpvkt0fxm4xdyyggign6.png" alt=" " width="800" height="346"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;camera_spawn_pointTop = carla.Transform(
    carla.Location(x=0.0, y=0.0, z=5.0),  # 5 meters above the car
    carla.Rotation(pitch=-90.0)  # Looking straight down
)
cameraTop = world.spawn_actor(camera_bp1, camera_spawn_pointTop, attach_to=vehicle)

camera_spawn_pointFront = carla.Transform(
    carla.Location(x=-5.0, y=0.0, z=3.0),  # 5 meters in front and 3 meters high
    carla.Rotation(pitch=-15.0)  # Slight downward tilt
)
cameraFront = world.spawn_actor(camera_bp2, camera_spawn_pointFront, attach_to=vehicle)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each camera is set up with a callback function (e.g., camera_callback1) to process the captured images. These images are saved as files and encoded in Base64 for easy transfer and storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def camera_callback1(image, id, vehicleId, pos):
    array = image.raw_data
    img = Image.frombytes("RGBA", (image.width, image.height), bytes(array))
    img = img.convert("RGB")
    img.save(f"./images/{id}_{vehicleId}_{pos}.png", format="PNG")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
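&lt;p&gt;The callback itself does nothing until it is attached to the sensor. One way to wire it up is to bind the extra identifiers with a lambda (a sketch; &lt;code&gt;cameraTop&lt;/code&gt; and &lt;code&gt;vehicle&lt;/code&gt; come from the earlier snippets):&lt;/p&gt;

```python
# Sketch: attaching the callback. carla.Sensor.listen takes a one-argument
# callable, so the extra identifiers are bound with a lambda here.
cameraTop.listen(lambda image: camera_callback1(image, cameraTop.id, vehicle.id, 'top'))

# The sensor keeps streaming frames until told to stop:
# cameraTop.stop()
```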



&lt;ul&gt;
&lt;li&gt;🎥 Spectator Camera: Follow the Car
To make the simulation more immersive, a spectator camera is configured to follow the vehicle dynamically. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A spectator camera is moved relative to the car's position using the set_transform method.&lt;br&gt;
It stays a few meters behind and slightly above the car, aligned with its orientation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bumatjat1bpb7y8jj32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bumatjat1bpb7y8jj32.png" alt=" " width="800" height="468"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def setup_spectator_camera(world, vehicle):
    vehicle_transform = vehicle.get_transform()  # Get the car's position and rotation
    spectator = world.get_spectator()  # Access the spectator camera

    spectator_transform = carla.Transform(
        vehicle_transform.location + carla.Location(x=-6, y=0, z=2),  # Offset behind and above the car
        vehicle_transform.rotation  # Match the car's orientation
    )
    spectator.set_transform(spectator_transform)  # Update the spectator's position

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The spectator camera is continually updated in a loop, ensuring it tracks the car as it moves. This creates a realistic following effect.&lt;/p&gt;
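&lt;p&gt;The offset math behind that follow effect is plain trigonometry, and you can sanity-check it without a simulator. Here is a hypothetical helper (&lt;code&gt;chase_cam_position&lt;/code&gt; is not part of the CARLA API, just an illustration of the geometry):&lt;/p&gt;

```python
import math

def chase_cam_position(car_x, car_y, car_z, yaw_degrees,
                       distance=6.0, height=2.0):
    """Point `distance` metres behind the car and `height` metres above it.

    The car's yaw (degrees, CARLA convention) gives its forward vector;
    subtracting that vector moves the camera behind the car.
    """
    yaw = math.radians(yaw_degrees)
    forward_x, forward_y = math.cos(yaw), math.sin(yaw)
    return (car_x - distance * forward_x,
            car_y - distance * forward_y,
            car_z + height)

# A car at the origin facing +x puts the camera 6 m straight behind it
print(chase_cam_position(0.0, 0.0, 0.0, 0.0))  # (-6.0, 0.0, 2.0)
```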

&lt;ul&gt;
&lt;li&gt;Spawn Walkers
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@app.route('/spawn_walker', methods=['POST'])
def spawn_walker():
    data = request.get_json()
    walker_count = data.get('count', 1)

    walker_blueprints = world.get_blueprint_library().filter('walker.pedestrian.*')
    spawn_points = [random.choice(world.get_map().get_spawn_points()) for _ in range(walker_count)]

    walkers = []
    for spawn_point in spawn_points:
        walker_bp = random.choice(walker_blueprints)
        walker = world.spawn_actor(walker_bp, spawn_point)
        walkers.append(walker)

    return jsonify({'message': f'{walker_count} walkers spawned', 'ids': [w.id for w in walkers]})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
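&lt;p&gt;Spawned walkers just stand there until something drives them. A sketch of giving one a mind of its own with CARLA's built-in AI controller (it assumes &lt;code&gt;world&lt;/code&gt; is a connected &lt;code&gt;carla.World&lt;/code&gt; and &lt;code&gt;walker&lt;/code&gt; an already-spawned pedestrian):&lt;/p&gt;

```python
# Sketch: attaching the controller.ai.walker blueprint to a spawned walker.
# Assumes `world` is a connected carla.World and `walker` a walker actor.
import carla

controller_bp = world.get_blueprint_library().find('controller.ai.walker')
controller = world.spawn_actor(controller_bp, carla.Transform(), attach_to=walker)

controller.start()
# Wander toward a random reachable point on the sidewalk network
controller.go_to_location(world.get_random_location_from_navigation())
controller.set_max_speed(1.4)  # m/s

# Call controller.stop() (and destroy the controller) before destroying the walker
```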



&lt;ul&gt;
&lt;li&gt;Destroy All Actors
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@app.route('/destroy_actors', methods=['POST'])
def destroy_actors():
    actors = world.get_actors()
    actors_to_destroy = [actor for actor in actors if actor.type_id.startswith('vehicle') or actor.type_id.startswith('walker')]

    for actor in actors_to_destroy:
        actor.destroy()

    return jsonify({'message': 'All actors destroyed'})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
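&lt;p&gt;When you are tearing down dozens of actors at once, destroying them one RPC at a time gets slow. A sketch of the batched alternative (it assumes &lt;code&gt;client&lt;/code&gt; is your &lt;code&gt;carla.Client&lt;/code&gt; and &lt;code&gt;world&lt;/code&gt; its world):&lt;/p&gt;

```python
# Sketch: destroying many actors in one round trip instead of one call each.
# Assumes `client` is the carla.Client and `world` is client.get_world().
import carla

vehicles = world.get_actors().filter('vehicle.*')
client.apply_batch([carla.command.DestroyActor(actor) for actor in vehicles])
```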



&lt;ul&gt;
&lt;li&gt;Add Random Vehicles and Walkers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3pfcp0lm2zea7go9mar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3pfcp0lm2zea7go9mar.png" alt=" " width="400" height="225"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@app.route('/populate_simulation', methods=['POST'])
def populate_simulation():
    data = request.get_json()
    vehicle_count = data.get('vehicles', 10)
    walker_count = data.get('walkers', 10)

    # Spawn Vehicles
    for _ in range(vehicle_count):
        vehicle_bp = random.choice(world.get_blueprint_library().filter('vehicle.*'))
        spawn_point = random.choice(world.get_map().get_spawn_points())
        world.spawn_actor(vehicle_bp, spawn_point)

    # Spawn Walkers
    for _ in range(walker_count):
        walker_bp = random.choice(world.get_blueprint_library().filter('walker.pedestrian.*'))
        spawn_point = random.choice(world.get_map().get_spawn_points())
        world.spawn_actor(walker_bp, spawn_point)

    return jsonify({'message': f'{vehicle_count} vehicles and {walker_count} walkers added'})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Wrapping Up 🏎️
&lt;/h3&gt;

&lt;p&gt;And that’s how you take control of CARLA with Flask! Whether you’re simulating a high-speed chase or a casual Sunday drive, these tools let you shape your virtual world. Stay tuned for more deep dives and, as always, drive safe — or at least make it look cool while you’re not. 🚗🚨&lt;/p&gt;

</description>
      <category>carla</category>
      <category>flask</category>
      <category>python</category>
      <category>simulation</category>
    </item>
  </channel>
</rss>
