<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maureen Muthoni</title>
    <description>The latest articles on DEV Community by Maureen Muthoni (@maureenmuthonihue).</description>
    <link>https://dev.to/maureenmuthonihue</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3506197%2Fdeabd19c-e523-4314-8472-0f61bc48a204.jpg</url>
      <title>DEV Community: Maureen Muthoni</title>
      <link>https://dev.to/maureenmuthonihue</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maureenmuthonihue"/>
    <language>en</language>
    <item>
      <title>Building a Smart Travel Assistant with RAG: A Journey Through Kenya's Tourism Landscape</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Sat, 11 Apr 2026 09:20:32 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/building-a-smart-travel-assistant-with-rag-a-journey-through-kenyas-tourism-landscape-2oeb</link>
      <guid>https://dev.to/maureenmuthonihue/building-a-smart-travel-assistant-with-rag-a-journey-through-kenyas-tourism-landscape-2oeb</guid>
      <description>&lt;h2&gt;
  
  
  How I Built an AI-Powered Q&amp;amp;A System
&lt;/h2&gt;

&lt;p&gt;Have you ever wished you could ask specific questions about a travel destination and get accurate, sourced answers? That's precisely what I set out to build, and in this article I'll walk you through creating a Retrieval-Augmented Generation (RAG) system for Kenya's tourism industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: AI That Makes Things Up
&lt;/h2&gt;

&lt;p&gt;Large Language Models (LLMs) are impressive, but they have a fatal flaw: they confidently generate information that sounds right but might be completely wrong. Ask ChatGPT about the best time to visit the Maasai Mara, and it might give you a reasonable answer, or it might hallucinate facts about wildebeest migration patterns.&lt;/p&gt;

&lt;p&gt;This is where RAG comes in. Instead of relying on what the AI "thinks" it knows, we give it a library of trusted documents and teach it to search through them before answering. Think of it as moving from a student who wings their exam to one who brings a cheat sheet with verified facts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;Our system ingests PDF documents about Kenyan tourism destinations (Maasai Mara, Mombasa, Mount Kenya, etc.) and provides a REST API where users can ask questions like the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What wildlife can I see at Maasai Mara?"&lt;/li&gt;
&lt;li&gt;"What are the best beaches in Mombasa?"&lt;/li&gt;
&lt;li&gt;"How difficult is it to climb Mount Kenya?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Search through the PDF documents for relevant information&lt;/li&gt;
&lt;li&gt;Extract the most pertinent passages&lt;/li&gt;
&lt;li&gt;Use an LLM to generate a natural language answer based only on those passages&lt;/li&gt;
&lt;li&gt;Return the sources so users can verify the information&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Tech Stack
&lt;/h2&gt;

&lt;p&gt;Here's what we're using and why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt;: Lightning-fast Python web framework, perfect for building APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentence Transformers&lt;/strong&gt;: Converts text to embeddings (fancy math that makes similar text have similar numbers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChromaDB&lt;/strong&gt;: Stores and searches through those embeddings efficiently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groq&lt;/strong&gt;: Blazingly fast LLM inference (seriously, it's ridiculously fast)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pypdf&lt;/strong&gt;: Extracts text from PDF documents&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture: The 30,000-Foot View
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDFs → Text Extraction → Chunking → Embeddings → Vector Database
                                                        ↓
User Query → Embedding → Similarity Search → Context → LLM → Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have two main pipelines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion Pipeline&lt;/strong&gt; (run once): Takes PDFs, breaks them into chunks, converts chunks to vectors, stores in a database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Pipeline&lt;/strong&gt; (run every query): Takes question, converts to vector, finds similar chunks, sends to LLM for an answer.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 1: Document Ingestion — Teaching the System to Read
&lt;/h2&gt;

&lt;p&gt;Let's start with the ingestion script. This is where the magic of preparing our knowledge base happens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting Text from PDFs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pypdf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PdfReader&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PdfReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;page_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;page_text&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple enough: we read each page and concatenate the text. But here's the thing: PDFs are notoriously tricky. Some have scanned images (which need OCR), some have weird encodings, and some have tables that don't extract well. For this project, I assumed clean, text-based PDFs. In production, you'd want more robust error handling.&lt;/p&gt;
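
&lt;p&gt;To give a rough idea of what that could look like, here's a minimal sketch (not from the project code) of a more defensive version that skips unreadable pages and unopenable files instead of crashing the whole ingestion run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pypdf import PdfReader

# Sketch: a more defensive variant of extract_text (not from the original project)
def extract_text_safe(path):
    try:
        reader = PdfReader(path)
    except Exception as exc:
        print(f"Skipping {path}: could not open PDF ({exc})")
        return ""

    pages = []
    for i, page in enumerate(reader.pages):
        try:
            page_text = page.extract_text()
        except Exception as exc:
            print(f"Skipping page {i} of {path}: {exc}")
            continue
        if page_text and page_text.strip():
            pages.append(page_text)

    return "\n".join(pages)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;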

&lt;h3&gt;
  
  
  The Chunking Strategy: Why Size Matters
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why chunk at all? LLMs have context windows, and we can't feed them entire books. More importantly, smaller chunks mean more precise retrieval. If your document chunk is an entire chapter about Mombasa and someone asks about beaches, you'll retrieve all of Mombasa's beaches, hotels, restaurants and history. That's too much noise.&lt;/p&gt;

&lt;p&gt;I chose 300 words per chunk through experimentation. Too small (100 words) and you lose context. Too large (1000 words) and your retrieval becomes imprecise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embeddings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-small-en-v1.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;norms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;norms&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's where things get interesting. Embeddings convert text into high dimensional vectors (arrays of numbers). Similar text gets similar vectors. "The lion roared" and "The big cat made a loud sound" will have vectors that are close together in this mathematical space.&lt;/p&gt;

&lt;p&gt;I chose &lt;code&gt;BAAI/bge-small-en-v1.5&lt;/code&gt; because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's small (133M parameters), which means fast inference&lt;/li&gt;
&lt;li&gt;It's good at semantic search tasks&lt;/li&gt;
&lt;li&gt;It's actively maintained and well documented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The normalization step is crucial. It converts vectors to unit length, which makes cosine similarity (how ChromaDB compares vectors) equivalent to the dot product, and the dot product is faster to compute.&lt;/p&gt;
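
&lt;p&gt;To make both ideas concrete (similar sentences get nearby vectors, and unit-length vectors can be compared with a plain dot product), here's a tiny standalone demo, separate from the project code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from sentence_transformers import SentenceTransformer

# Standalone demo (not project code): similar sentences get nearby vectors
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

sentences = ["The lion roared", "The big cat made a loud sound"]
emb = model.encode(sentences)                            # shape (2, embedding_dim)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # normalize to unit length

# For unit-length vectors, the dot product equals cosine similarity
similarity = float(np.dot(emb[0], emb[1]))
print(f"cosine similarity: {similarity:.3f}")            # closer to 1.0 means more similar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;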

&lt;h3&gt;
  
  
  Storing Everything in ChromaDB
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chromadb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PersistentClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./chromadb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_or_create_collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;travel_and_tourism&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Multi PDF Tourism documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;all_metadatas&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ChromaDB is a vector database designed for this exact use case. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores embeddings efficiently&lt;/li&gt;
&lt;li&gt;Provides fast similarity search&lt;/li&gt;
&lt;li&gt;Persists data to disk&lt;/li&gt;
&lt;li&gt;Has a simple Python API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;PersistentClient&lt;/code&gt; means our vectors survive restarts. We don't have to re-embed all our documents every time we start the server.&lt;/p&gt;
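
&lt;p&gt;A quick way to confirm that persistence is to open the store again in a fresh process and count what's already there, something like this sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import chromadb

# Sketch: re-open the same on-disk store, e.g. after a restart
client = chromadb.PersistentClient(path="./chromadb")
collection = client.get_or_create_collection(name="travel_and_tourism")
print(collection.count(), "chunks already stored")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;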

&lt;h2&gt;
  
  
  Step 2: The Query Pipeline
&lt;/h2&gt;

&lt;p&gt;Now for the fun part: answering questions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Converting Questions to Vectors
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the same embedding model we used for documents. This is critical. If you embed documents with Model A and queries with Model B, the vector spaces won't align.&lt;/p&gt;

&lt;h3&gt;
  
  
  Similarity Search
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;metadatas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadatas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ChromaDB finds the 3 most similar document chunks to our query. How does it know what's similar? It computes the distance between the query vector and every document vector, then returns the closest ones.&lt;/p&gt;

&lt;p&gt;Why 3? Another Goldilocks number. Too few (1) and you might miss important context. Too many (10) and you'll include irrelevant information that confuses the LLM. I tested several values and found 3 provided the best balance.&lt;/p&gt;
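
&lt;p&gt;If you want to see how close those matches actually are, ChromaDB can also return the distances alongside the documents. Here's a small diagnostic sketch (reusing the &lt;code&gt;collection&lt;/code&gt; and &lt;code&gt;query_embedding&lt;/code&gt; from above) that can help when tuning &lt;code&gt;n_results&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: include distances to inspect how close each match really is
results = collection.query(
    query_embeddings=query_embedding,
    n_results=3,
    include=["documents", "metadatas", "distances"],
)

for doc, meta, dist in zip(
    results["documents"][0], results["metadatas"][0], results["distances"][0]
):
    print(f"{meta['source']}  distance={dist:.3f}")
    print(doc[:120])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;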

&lt;h3&gt;
  
  
  The LLM
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;groq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Groq&lt;/span&gt;

&lt;span class="n"&gt;groq_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Groq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;groq_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/llama-4-scout-17b-16e-instruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer only using provided context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Question:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where RAG shines. We give the LLM:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A system instruction: "Only use the provided context" (reducing hallucinations)&lt;/li&gt;
&lt;li&gt;The retrieved context&lt;/li&gt;
&lt;li&gt;The user's question&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;temperature=0&lt;/code&gt; setting makes the model deterministic; the same input always produces the same output. This is crucial for reliability.&lt;/p&gt;

&lt;p&gt;Why Groq? Speed. Seriously, it's fast. What takes OpenAI 3-4 seconds, Groq does in under a second. For user-facing applications, this matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Source Attribution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;metadatas&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We return the source PDFs used to generate the answer. This serves two purposes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Users can verify the information&lt;/li&gt;
&lt;li&gt;It builds trust in the system&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Step 3: The FastAPI Application
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.middleware.cors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Travel and Tourism&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CORSMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allow_origins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allow_methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allow_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/ask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QuestionResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_question&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;QuestionRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;QuestionResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FastAPI gives us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic API documentation (visit &lt;code&gt;/docs&lt;/code&gt; to see it)&lt;/li&gt;
&lt;li&gt;Request validation via Pydantic models&lt;/li&gt;
&lt;li&gt;Type hints that actually work&lt;/li&gt;
&lt;li&gt;Easy async support (though we're not using it here)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CORS middleware allows frontend applications from any origin to call our API. In production, you'd restrict this to your specific domain.&lt;/p&gt;
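
&lt;p&gt;That restriction is a one-line change to the middleware setup above; the domain below is just a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend-domain.com"],  # placeholder, swap in your real frontend origin
    allow_methods=["POST"],
    allow_headers=["*"],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;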

&lt;h2&gt;
  
  
  The Results: Does It Actually Work?
&lt;/h2&gt;

&lt;p&gt;Let's test it:&lt;/p&gt;
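
&lt;p&gt;You can use the interactive &lt;code&gt;/docs&lt;/code&gt; page, or a small client script like this sketch (it assumes the server is running locally on port 8000):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Sketch: assumes the API is running locally on port 8000
resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "What wildlife can I see at Maasai Mara?"},
)
resp.raise_for_status()

data = resp.json()
print(data["answer"])
print("Sources:", data["sources"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;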

&lt;p&gt;&lt;strong&gt;Query&lt;/strong&gt;: "What wildlife can I see at Maasai Mara?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"At Maasai Mara, you can see the Big Five: lions, elephants, leopards, rhinos, and buffalo. The park is famous for the annual wildebeest migration between July and October, where millions of wildebeest, zebras, and gazelles cross the Mara River. You can also spot cheetahs, hyenas, giraffes, hippos, crocodiles, and over 450 bird species."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Maasai_Mara.pdf"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Beautiful. The answer is specific, accurate, and sourced.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture: Why RAG Matters
&lt;/h2&gt;

&lt;p&gt;RAG represents a fundamental shift in how we build AI applications. Instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-tuning models (expensive, time-consuming, static)&lt;/li&gt;
&lt;li&gt;Relying on model knowledge (outdated, prone to hallucination)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use any LLM as a reasoning engine&lt;/li&gt;
&lt;li&gt;Plug in our own knowledge dynamically&lt;/li&gt;
&lt;li&gt;Update information without retraining&lt;/li&gt;
&lt;li&gt;Provide source attribution for trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern works for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support bots trained on company documentation&lt;/li&gt;
&lt;li&gt;Legal research tools searching case law&lt;/li&gt;
&lt;li&gt;Medical assistants referencing clinical guidelines&lt;/li&gt;
&lt;li&gt;Internal knowledge bases for enterprises&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building this RAG system taught me that the real challenge isn't the AI; it's the data pipeline, retrieval strategy, and user experience. The LLM is just the final step that ties everything together.&lt;/p&gt;

&lt;p&gt;RAG won't solve all AI problems. But for question-answering over documents, it's incredibly powerful. And as embedding models improve, vector databases get faster, and LLMs become more capable, RAG systems will only get better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Snippets
&lt;/h3&gt;

&lt;p&gt;All code in this article is available in my GitHub repository: &lt;a href="https://github.com/maureenmuthoni-hue/Travel_and_Tourism_RAG_System" rel="noopener noreferrer"&gt;https://github.com/maureenmuthoni-hue/Travel_and_Tourism_RAG_System&lt;/a&gt;. Feel free to star, fork, and adapt it for your own projects!&lt;/p&gt;

</description>
      <category>rag</category>
      <category>beginners</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Customer Lifetime Value (CLV) Prediction with Machine Learning</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Mon, 23 Feb 2026 20:00:50 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/customer-lifetime-value-clv-prediction-with-machine-learning-4545</link>
      <guid>https://dev.to/maureenmuthonihue/customer-lifetime-value-clv-prediction-with-machine-learning-4545</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Customer acquisition is expensive. But do you know which customers will actually generate long-term revenue? That’s where &lt;strong&gt;Customer Lifetime Value (CLV)&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;Instead of focusing on one-off transactions, CLV estimates the total revenue a business expects from a customer over their entire relationship.&lt;/p&gt;

&lt;p&gt;In this project, I built an end-to-end CLV prediction model and then deployed it as a production-ready API.&lt;/p&gt;

&lt;p&gt;In this article, we’ll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business problem
&lt;/li&gt;
&lt;li&gt;Data preprocessing
&lt;/li&gt;
&lt;li&gt;Model development
&lt;/li&gt;
&lt;li&gt;Model evaluation
&lt;/li&gt;
&lt;li&gt;Model deployment with FastAPI
&lt;/li&gt;
&lt;li&gt;Production-ready setup
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Business Problem
&lt;/h3&gt;

&lt;p&gt;Businesses want to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which customers are most valuable?&lt;/li&gt;
&lt;li&gt;Who should receive retention incentives?&lt;/li&gt;
&lt;li&gt;Where should marketing budgets be allocated?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Predicting CLV helps with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Customer segmentation
&lt;/li&gt;
&lt;li&gt; Revenue forecasting
&lt;/li&gt;
&lt;li&gt; Budget optimization
&lt;/li&gt;
&lt;li&gt; Retention strategies
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a regression problem since CLV is a continuous value.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Data Preprocessing
&lt;/h4&gt;

&lt;p&gt;The dataset includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Purchase frequency
&lt;/li&gt;
&lt;li&gt;Recency
&lt;/li&gt;
&lt;li&gt;Average transaction value
&lt;/li&gt;
&lt;li&gt;Tenure
&lt;/li&gt;
&lt;li&gt;Demographic features
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Data Preparation
&lt;/h4&gt;

&lt;p&gt;Before training any model, we need to separate our features from the target variable. In this case, CLV is what we're trying to predict, and everything else in the dataset serves as input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CLV&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CLV&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also check for missing values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean data is non-negotiable. Missing values can silently corrupt a model's performance if left unaddressed.&lt;/p&gt;
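
&lt;p&gt;If the check does turn up gaps, two common options look like this (a sketch, reusing the &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; from above; which option is appropriate depends on how much data is missing and why):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Option 1: drop rows with any missing values (fine when only a few rows are affected)
x_clean = x.dropna()
y_clean = y.loc[x_clean.index]

# Option 2: fill numeric gaps with the column median instead (keeps every row)
x_filled = x.fillna(x.median(numeric_only=True))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;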

&lt;h4&gt;
  
  
  Splitting the Dataset
&lt;/h4&gt;

&lt;p&gt;We divide the data into training and testing sets, with 80% for training and 20% for evaluating performance on unseen data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;random_state=42&lt;/code&gt; ensures reproducibility, so results remain consistent across runs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Model development
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Linear Regression&lt;/strong&gt;&lt;br&gt;
We start with linear regression, a simple but interpretable baseline. It assumes a linear relationship between the features and the target, making it fast to train and easy to explain to stakeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;

&lt;span class="n"&gt;Linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Random Forest Regressor&lt;/strong&gt;&lt;br&gt;
Next, we train a Random Forest, an ensemble method that builds 200 decision trees and averages their predictions. This approach is more robust to non-linear patterns in the data and tends to outperform linear models on complex, real-world datasets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;

&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;random_prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 3: Model evaluation
&lt;/h4&gt;

&lt;p&gt;We evaluate both models using Root Mean Squared Error (RMSE) and R² Score. RMSE tells us the average prediction error in the same units as CLV, while R² tells us how much of the variance in CLV our model explains (1.0 = perfect, 0 = no better than guessing the mean).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2_score&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqrt&lt;/span&gt;

&lt;span class="n"&gt;RMSE_linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Predictions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;r2_linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;RMSE_tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_prediction&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;r2_tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_prediction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RMSE_linear: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;RMSE_linear&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r2_linear:   &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r2_linear&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RMSE_tree:   &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;RMSE_tree&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r2_tree:     &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r2_tree&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In most real-world CLV scenarios, the Random Forest will outperform Linear Regression due to its ability to capture complex, non-linear relationships between customer features and lifetime value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saving the Model&lt;/strong&gt;&lt;br&gt;
Once we're satisfied with model performance, we persist the trained model and feature schema using joblib so they can be reloaded later without retraining:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CLV_model.joblib&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;feature_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;modelfeatures.joblib&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Saving the feature set alongside the model is a great practice. It documents exactly what columns and structure the model expects at inference time, which prevents subtle bugs when deploying.&lt;/p&gt;
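
&lt;p&gt;At inference time, that saved feature list lets you validate incoming records and force the exact column order the model was trained on. A minimal sketch (the helper function here is illustrative, not part of the project code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import joblib
import pandas as pd

model = joblib.load('CLV_model.joblib')
feature_name = joblib.load('modelfeatures.joblib')

# Illustrative helper (not from the original project)
def predict_clv(record):
    """Check an incoming record against the saved schema, then predict."""
    missing = [col for col in feature_name if col not in record]
    if missing:
        raise ValueError(f"Missing required features: {missing}")
    # reindex enforces the exact column order used during training
    row = pd.DataFrame([record]).reindex(columns=feature_name)
    return float(model.predict(row)[0])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;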

&lt;h3&gt;
  
  
  Step 4: Model deployment with FastAPI
&lt;/h3&gt;

&lt;p&gt;Training a model is only half the work. To put it into production, you need an API that other systems can call. Here's how to build a simple REST endpoint using FastAPI:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install Dependencies&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="n"&gt;uvicorn&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt; &lt;span class="n"&gt;scikit&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;learn&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Create the API&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt; 

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Customer Lifetime Value Prediction API&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load the saved model and feature schema
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CLV_model.joblib&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;feature_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joblib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;modelfeatures.joblib&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define the input schema (adjust fields to match your actual dataset columns)
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CLVinput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;Customer_Age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;Annual_Income&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;Tenure_Months&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;Monthly_Spend&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;Visits_Per_Month&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;Avg_Basket_Value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;Support_Tickets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health_check&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API is running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/predict-CLV&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict_CLV&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;CLVinput&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;feature_name&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
    &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;predicted_CLV&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Run the Server Locally&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;uvicorn&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nb"&gt;reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your API will be live at &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt;. You can test it at &lt;a href="http://localhost:8000/docs" rel="noopener noreferrer"&gt;http://localhost:8000/docs&lt;/a&gt;. FastAPI generates interactive API documentation automatically.&lt;/p&gt;
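
&lt;p&gt;You can also call it programmatically. Here's a quick sketch using the &lt;code&gt;requests&lt;/code&gt; library; the field values are made-up examples matching the &lt;code&gt;CLVinput&lt;/code&gt; schema above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Made-up example values matching the CLVinput schema
payload = {
    "Customer_Age": 34,
    "Annual_Income": 52000.0,
    "Tenure_Months": 18,
    "Monthly_Spend": 240.0,
    "Visits_Per_Month": 6,
    "Avg_Basket_Value": 40.0,
    "Support_Tickets": 1,
}

resp = requests.post("http://localhost:8000/predict-CLV", json=payload)
resp.raise_for_status()
print(resp.json())   # returns the predicted_CLV value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;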

&lt;p&gt;&lt;strong&gt;4. Deploy to the Cloud&lt;/strong&gt;&lt;br&gt;
For production, deploy the API to a cloud provider. Here's a quick overview:&lt;br&gt;
&lt;strong&gt;Railway or Render (simplest):&lt;/strong&gt; Push your code to GitHub and connect the repo. Both platforms auto-detect Python apps and handle deployment with minimal configuration. Add a requirements.txt file:&lt;br&gt;
&lt;code&gt;&lt;br&gt;
fastapi&lt;br&gt;
uvicorn&lt;br&gt;
joblib&lt;br&gt;
scikit-learn&lt;br&gt;
pandas&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;p&gt;Here's the end-to-end workflow we covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load and explore the customer dataset&lt;/li&gt;
&lt;li&gt;Prepare features by separating inputs from the CLV target&lt;/li&gt;
&lt;li&gt;Train two models, Linear Regression and Random Forest, and compare them using RMSE and R²&lt;/li&gt;
&lt;li&gt;Save the best model using joblib&lt;/li&gt;
&lt;li&gt;Deploy via FastAPI with a /predict-CLV endpoint that accepts customer data and returns a CLV estimate&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Predicting Customer Lifetime Value turns raw customer data into a strategic business asset. With a deployed model, your sales and marketing teams can make real-time decisions based on predicted value, not just historical behaviour.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>fastapi</category>
    </item>
    <item>
      <title>How Statistics Can Be Used to Drive Business Decisions</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Fri, 06 Feb 2026 17:22:11 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/how-statistics-can-be-used-to-drive-business-decisions-631</link>
      <guid>https://dev.to/maureenmuthonihue/how-statistics-can-be-used-to-drive-business-decisions-631</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today's competitive business landscape, intuition is no longer sufficient for making critical decisions. Companies that leverage statistical analysis to inform their strategies consistently outperform those that rely on experience or instinct. &lt;/p&gt;

&lt;p&gt;The story that follows demonstrates how a systematic statistical approach, from descriptive analytics to hypothesis testing, can provide clear, evidence-based answers to complex business questions. More importantly, it shows how understanding statistical concepts like effect size, statistical power, and potential errors can prevent costly mistakes and unlock growth opportunities.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Business Problem
&lt;/h3&gt;

&lt;p&gt;A retail company operating both online and physical stores wanted to answer three key questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How are sales performing over time?&lt;/li&gt;
&lt;li&gt;How reliable are insights drawn from the data?&lt;/li&gt;
&lt;li&gt;Does running a marketing campaign actually increase revenue per transaction?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The company had three years of transaction data, including revenue, store type, region, and whether a marketing campaign was used. The goal was to use statistics to support decision-making.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step by Step Statistical Analysis
&lt;/h3&gt;

&lt;p&gt;Before testing anything, you need to know what your data looks like. This is called descriptive statistics.&lt;br&gt;
What We Calculated:&lt;br&gt;
&lt;strong&gt;Central Tendency (The "Average")&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean revenue: 8,272 per transaction&lt;/li&gt;
&lt;li&gt;Median revenue: 7,723 per transaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mean is higher than the median, which tells us some transactions are very high (outliers). The median is often more "typical".&lt;/p&gt;
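&lt;p&gt;For readers following along in Python, here is a minimal pandas sketch of these descriptive statistics. It uses synthetic, right-skewed revenue data because the original dataset is not shared in the post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import pandas as pd

# Synthetic stand-in for the transaction data (one value per transaction)
rng = np.random.default_rng(42)
revenue = pd.Series(rng.lognormal(mean=9, sigma=0.4, size=1_000), name="revenue")

print(revenue.mean())    # central tendency: mean
print(revenue.median())  # median; lower than the mean for right-skewed data
print(revenue.skew())    # positive skewness: a few very large transactions
print(revenue.kurt())    # kurtosis: heaviness of the tails
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;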

&lt;p&gt;&lt;strong&gt;Distribution Shape&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Skewness and kurtosis show that most transactions are low to moderate, but there are some very high transactions pulling the average up. The distribution looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspvwex2dsjg73em4wjov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspvwex2dsjg73em4wjov.png" alt="Shape Distribution" width="800" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualize the Data
&lt;/h3&gt;

&lt;p&gt;Numbers are important, but pictures tell stories. We created four key visualizations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revenue Over Time (Line Chart)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fkgup4fcjp8lvdu8k12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fkgup4fcjp8lvdu8k12.png" alt="Revenue Over Time" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We found that revenue has seasonal peaks and valleys: December is high (holidays), and January is low (post-holiday slump).&lt;br&gt;
Why this matters: if we only compared December to January, we'd think campaigns work miracles, but it might just be Christmas shopping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revenue by Store Type (Bar Chart)&lt;/strong&gt;&lt;br&gt;
Online transactions are actually more valuable on average, even though physical stores sell more volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuu140ox73bx0y7rb95tu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuu140ox73bx0y7rb95tu.png" alt="Bar Chart" width="800" height="721"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revenue by Region (Box Plot)&lt;/strong&gt;&lt;br&gt;
Nairobi: Highest median revenue but most variable&lt;br&gt;
Rift Valley: Most consistent (narrow box)&lt;br&gt;
Western &amp;amp; Coast: Lower median but good campaign response&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Insight:&lt;/strong&gt; One marketing strategy won't fit all regions. We need customisation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feofqnpif0ka61ywr4q5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feofqnpif0ka61ywr4q5c.png" alt="Box Chart" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Units Sold vs. Revenue (Scatter Plot)&lt;/strong&gt;&lt;br&gt;
This showed that campaigns don't just increase volume, they increase the value per unit sold. Customers buy more expensive items during campaigns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqe6mhltf6z5s6sr8uafl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqe6mhltf6z5s6sr8uafl.png" alt="Scatter Plot" width="800" height="619"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Check for Bias (Sampling)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;TYPES OF BIAS:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. SELECTION BIAS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Urban areas systematically differ from rural areas&lt;/li&gt;
&lt;li&gt;Higher income, different shopping behaviors&lt;/li&gt;
&lt;li&gt;Better infrastructure and internet connectivity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. GEOGRAPHIC BIAS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rural regions completely excluded&lt;/li&gt;
&lt;li&gt;Cannot generalize findings to entire market&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. SOCIOECONOMIC BIAS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Urban customers have different purchasing power&lt;/li&gt;
&lt;li&gt;Product preferences may differ&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;BUSINESS IMPACT&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Revenue estimates would be overstated&lt;/li&gt;
&lt;li&gt;Marketing effectiveness could be overestimated&lt;/li&gt;
&lt;li&gt;Regional strategy would be incomplete&lt;/li&gt;
&lt;li&gt;Expansion decisions would lack empirical foundation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;RECOMMENDED SAMPLING METHOD:&lt;/strong&gt;&lt;br&gt;
   STRATIFIED RANDOM SAMPLING:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Divide population into strata (regions, store types)&lt;/li&gt;
&lt;li&gt;Randomly sample proportionally from each stratum&lt;/li&gt;
&lt;li&gt;Ensures all segments are represented&lt;/li&gt;
&lt;li&gt;Maintains natural population distribution&lt;/li&gt;
&lt;li&gt;Allows both overall and stratum-specific analysis&lt;/li&gt;
&lt;/ol&gt;
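&lt;p&gt;As a rough illustration of the stratified approach above, here is a minimal pandas sketch. The table and column names (region, store_type) are assumptions for the example, not the company's actual schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import pandas as pd

# Hypothetical transaction table standing in for the company's full data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["Nairobi", "Coast", "Rift Valley", "Western"], size=1_000),
    "store_type": rng.choice(["online", "physical"], size=1_000),
    "revenue": rng.lognormal(mean=9, sigma=0.4, size=1_000),
})

# Sample 10% from every (region, store_type) stratum so each segment stays represented
sample = (
    df.groupby(["region", "store_type"], group_keys=False)
      .apply(lambda g: g.sample(frac=0.1, random_state=42))
)
print(sample["region"].value_counts(normalize=True))  # proportions mirror the population
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;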

&lt;h3&gt;
  
  
  Apply Statistical Theorems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Law of Large Numbers&lt;/strong&gt;&lt;br&gt;
We tested sample sizes from 10 to 1,000 transactions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktww0auusdgrg4k4xpfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktww0auusdgrg4k4xpfu.png" alt="Law of Large Numbers" width="800" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Central Limit Theorem&lt;/strong&gt;&lt;br&gt;
Even though individual transactions are all over the place (skewed distribution), when we take many samples and average them, the averages form a nice, normal bell curve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmnsf7rw71a02kafmbho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmnsf7rw71a02kafmbho.png" alt="CLT" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hypothesis Testing
&lt;/h3&gt;

&lt;p&gt;A key business question examined was:&lt;br&gt;
Does running a marketing campaign increase average revenue per transaction?&lt;/p&gt;

&lt;p&gt;A one-tailed independent samples t-test was conducted to compare revenues from transactions with and without a marketing campaign.&lt;/p&gt;
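&lt;p&gt;The test itself is straightforward to reproduce with scipy. This is a minimal sketch with synthetic revenue arrays standing in for the campaign and non-campaign groups, since the real data is not shown in the post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-ins for the two groups of transaction revenues
campaign = rng.lognormal(mean=9.05, sigma=0.4, size=500)
no_campaign = rng.lognormal(mean=9.00, sigma=0.4, size=500)

# One-tailed independent samples t-test: is mean campaign revenue greater?
t_stat, p_value = stats.ttest_ind(campaign, no_campaign, alternative="greater")
print(t_stat, p_value)  # reject the null hypothesis if p_value &lt; 0.05
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;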

&lt;p&gt;The results showed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A large t-statistic&lt;/li&gt;
&lt;li&gt;A p-value far below the 5% significance level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This led to rejecting the null hypothesis and concluding that marketing campaigns significantly increase average revenue per transaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business implication:&lt;/strong&gt;&lt;br&gt;
Statistical testing provides objective evidence to support or challenge strategic initiatives, reducing reliance on intuition alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Errors and Interpretation
&lt;/h3&gt;

&lt;p&gt;Statistical decisions are subject to error:&lt;br&gt;
A Type I error would mean concluding the campaign works when it does not, leading to wasted marketing budgets.&lt;br&gt;
A Type II error would mean failing to detect a real effect, causing missed revenue opportunities.&lt;br&gt;
Recognizing these risks allows businesses to balance caution with opportunity.&lt;/p&gt;

&lt;p&gt;In this case, a Type II error is arguably worse because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Lost revenue is permanent &lt;/li&gt;
&lt;li&gt;Competitors gain market share&lt;/li&gt;
&lt;li&gt;Recovery is expensive &lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Effect Size and Power
&lt;/h3&gt;

&lt;p&gt;Although the campaign effect was statistically significant, the calculated Cohen’s d indicated a small to medium effect size. This means that while the campaign works, its impact per transaction is modest.&lt;/p&gt;
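&lt;p&gt;Cohen's d itself is just the difference in group means divided by the pooled standard deviation. A minimal sketch, again using synthetic revenue arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
campaign = rng.lognormal(mean=9.05, sigma=0.4, size=500)
no_campaign = rng.lognormal(mean=9.00, sigma=0.4, size=500)

# Conventional benchmarks: ~0.2 small, ~0.5 medium, ~0.8 large
print(cohens_d(campaign, no_campaign))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;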

&lt;p&gt;A statistically non-significant result could still matter in practice if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The effect is small but consistent&lt;/li&gt;
&lt;li&gt;The business operates at large scale&lt;/li&gt;
&lt;li&gt;The sample size is insufficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business implication:&lt;/strong&gt;&lt;br&gt;
Statistical significance should be interpreted alongside effect size and business context. Collecting more data can improve confidence and guide optimization rather than abandonment of strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This case study illustrates that statistics is far more than an academic exercise. When applied correctly, it enables businesses to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand performance realistically&lt;/li&gt;
&lt;li&gt;Measure risk and variability&lt;/li&gt;
&lt;li&gt;Test strategic decisions objectively&lt;/li&gt;
&lt;li&gt;Avoid costly cognitive and sampling biases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By integrating descriptive statistics, visualization, sampling theory, probability laws, and hypothesis testing, organizations can make evidence-based decisions that are both statistically sound and commercially meaningful.&lt;/p&gt;

&lt;p&gt;In an increasingly competitive environment, businesses that leverage statistics effectively gain a decisive advantage not by predicting the future perfectly, but by making better decisions under uncertainty.&lt;/p&gt;

</description>
      <category>statistics</category>
      <category>discuss</category>
      <category>machinelearning</category>
      <category>learning</category>
    </item>
    <item>
      <title>Ridge Regression vs Lasso Regression</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Tue, 03 Feb 2026 20:02:36 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/ridge-regression-vs-lasso-regression-108c</link>
      <guid>https://dev.to/maureenmuthonihue/ridge-regression-vs-lasso-regression-108c</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Linear regression stands as one of the most fundamental tools in a data scientist's toolkit. At its core lies Ordinary Least Squares (OLS), a method that estimates model parameters by minimizing the sum of squared differences between predicted and actual values. In many real-world problems, such as house price prediction, datasets often contain many features, correlated variables, and noisy inputs. In such cases, traditional OLS regression becomes unstable and prone to overfitting. To solve these challenges, regularization techniques are used. The two most important regularization-based models are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ridge Regression (L2 Regularization)&lt;/li&gt;
&lt;li&gt;Lasso Regression (L1 Regularization)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ordinary Least Squares (OLS)
&lt;/h3&gt;

&lt;p&gt;Ordinary Least Squares estimates model parameters by minimizing the sum of squared residuals between predicted and actual values:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RSS = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;where ŷᵢ represents the predicted prices.&lt;br&gt;
OLS works well for small, clean datasets, but struggles when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are many features&lt;/li&gt;
&lt;li&gt;Features are highly correlated (multicollinearity)&lt;/li&gt;
&lt;li&gt;Data contains noise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to overfitting, where the model performs well on training data but poorly on unseen data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Regularization in Linear Regression
&lt;/h3&gt;

&lt;p&gt;By including a penalty term in the loss function, regularisation addresses overfitting by effectively charging the model for complexity. The model now has to weigh accuracy against simplicity rather than just minimising error. Large coefficients are discouraged by this penalty, resulting in models that perform better when applied to new data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;General form: Loss = Error + Penalty&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Ridge Regression (L2 Regularization)
&lt;/h3&gt;

&lt;p&gt;Ridge regression modifies the OLS loss function by adding an L2 penalty term proportional to the sum of squared coefficients.&lt;/p&gt;

&lt;p&gt;Ridge Regression Loss Function:&lt;br&gt;
&lt;strong&gt;Minimize: RSS + λΣβⱼ² = Σ(yᵢ - ŷᵢ)² + λ(β₁² + β₂² + ... + βₚ²)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;λ (lambda) = regularization parameter (λ ≥ 0)&lt;/li&gt;
&lt;li&gt;The penalty term is the sum of squared coefficients&lt;/li&gt;
&lt;li&gt;Note: The intercept β₀ is typically not penalised.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conceptual Effect&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shrinks coefficients smoothly&lt;/li&gt;
&lt;li&gt;Reduces model variance&lt;/li&gt;
&lt;li&gt;Keeps all features&lt;/li&gt;
&lt;li&gt;Handles multicollinearity well&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Property&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ridge regression does not perform feature selection because coefficients are reduced but never become exactly zero.&lt;br&gt;
&lt;strong&gt;Python Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Ridge&lt;/span&gt;

&lt;span class="n"&gt;ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ridge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_pred_ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ridge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lasso Regression (L1 Regularization)
&lt;/h3&gt;

&lt;p&gt;Lasso takes a different approach through L1 regularization. Its loss function penalizes the sum of absolute coefficient values rather than squared values.&lt;/p&gt;

&lt;p&gt;Lasso Regression Loss Function:&lt;br&gt;
&lt;strong&gt;Minimize: RSS + λΣ|βⱼ| = Σ(yᵢ - ŷᵢ)² + λ(|β₁| + |β₂| + ... + |βₚ|)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Where:&lt;br&gt;
The penalty term is the sum of absolute values of coefficients&lt;br&gt;
λ controls the strength of regularization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conceptual Effect&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates sparse models&lt;/li&gt;
&lt;li&gt;Forces some coefficients to exactly zero&lt;/li&gt;
&lt;li&gt;Automatically removes weak features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Property&lt;/strong&gt;&lt;br&gt;
Lasso performs feature selection, producing simpler and more interpretable models.&lt;br&gt;
&lt;strong&gt;Python Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Lasso&lt;/span&gt;

&lt;span class="n"&gt;lasso&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Lasso&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lasso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_pred_lasso&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lasso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Comparing Ridge and Lasso
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Feature Selection Capability&lt;/strong&gt;&lt;br&gt;
Ridge retains all features with shrunken coefficients, while Lasso performs automatic selection by zeroing out irrelevant features.&lt;br&gt;
&lt;strong&gt;2. Coefficient Behavior with Correlated Features&lt;/strong&gt;&lt;br&gt;
When size (sq ft) and number of rooms correlate at r = 0.85:&lt;/p&gt;

&lt;p&gt;Ridge: Size = $120/sq ft, Rooms = $8,000/room (both moderate)&lt;br&gt;
Lasso: Size = $180/sq ft, Rooms = $0 (picks one, drops other)&lt;/p&gt;

&lt;p&gt;Ridge distributes weight smoothly; Lasso makes discrete choices.&lt;br&gt;
&lt;strong&gt;3. Model Interpretability&lt;/strong&gt;&lt;br&gt;
Ridge model: "Price depends on all 10 factors with varying importance."&lt;br&gt;
Lasso model: "Price primarily depends on size, location, and age, other factors don't matter."&lt;br&gt;
Lasso produces simpler, more explainable models for stakeholders.&lt;/p&gt;
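&lt;p&gt;One practical way to see this behaviour is to put the fitted coefficients side by side. Here is a minimal sketch on synthetic data with two strongly correlated features plus an irrelevant one; the feature names are placeholders for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: "rooms" is strongly correlated with "size"; "noise_feature" is irrelevant
rng = np.random.default_rng(0)
size = rng.normal(0, 1, 500)
rooms = 0.85 * size + 0.5 * rng.normal(0, 1, 500)
noise_feature = rng.normal(0, 1, 500)
X = np.column_stack([size, rooms, noise_feature])
y = 3 * size + 1 * rooms + rng.normal(0, 1, 500)

coefs = pd.DataFrame({
    "ridge": Ridge(alpha=1.0).fit(X, y).coef_,
    "lasso": Lasso(alpha=0.1).fit(X, y).coef_,
}, index=["size", "rooms", "noise_feature"])
print(coefs)  # Ridge spreads weight across the correlated pair; Lasso tends to zero out weak features
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;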
&lt;h3&gt;
  
  
  Application Scenario: House Price Prediction
&lt;/h3&gt;

&lt;p&gt;Suppose your dataset includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;House size&lt;/li&gt;
&lt;li&gt;Number of bedrooms&lt;/li&gt;
&lt;li&gt;Distance to the city&lt;/li&gt;
&lt;li&gt;Number of nearby schools&lt;/li&gt;
&lt;li&gt;Several noisy or weak features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use Ridge&lt;/strong&gt;&lt;br&gt;
Choose Ridge if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most features likely influence price&lt;/li&gt;
&lt;li&gt;Multicollinearity exists&lt;/li&gt;
&lt;li&gt;You want stable predictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use Lasso&lt;/strong&gt;&lt;br&gt;
Choose Lasso if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only a few features truly matter&lt;/li&gt;
&lt;li&gt;Many variables add noise&lt;/li&gt;
&lt;li&gt;Interpretability is important&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Python Implementation&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Data Preparation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Lasso&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;


&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrooms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;distance_city&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;schools_nearby&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;noise_feature&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_train_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_test_scaled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OLS Model&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_pred_ols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred_ols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ridge Regression&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Ridge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ridge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_pred_ridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ridge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred_ridge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lasso Regression&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lasso&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Lasso&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lasso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_scaled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_pred_lasso&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lasso&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_scaled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred_lasso&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Choosing the Right Model for House Prices&lt;/strong&gt;&lt;br&gt;
If all features contribute meaningfully (e.g., size, bedrooms, schools, distance):&lt;br&gt;
Ridge Regression is preferred.&lt;br&gt;
If only a few features are truly important and others add noise:&lt;br&gt;
Lasso Regression is more suitable due to its feature selection capability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Evaluation and Overfitting Detection
&lt;/h3&gt;

&lt;p&gt;Overfitting can be detected by comparing training and testing performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High training score but low test score indicates overfitting&lt;/li&gt;
&lt;li&gt;Similar training and test scores suggest good generalization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Residual analysis also plays a key role. Residuals should be randomly distributed; visible patterns may indicate missing variables or non-linear relationships.&lt;/p&gt;
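&lt;p&gt;A quick way to check this in practice is to compare the model's R² on the training set and on the test set, reusing the scaled splits and fitted models from the snippets above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Uses ridge, X_train_scaled, X_test_scaled, y_train, y_test defined in the earlier snippets
train_score = ridge.score(X_train_scaled, y_train)
test_score = ridge.score(X_test_scaled, y_test)
print(f"Train R²: {train_score:.3f}  Test R²: {test_score:.3f}")
# A big gap (high train, low test) suggests overfitting; similar values suggest good generalization
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;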

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;OLS is simple but prone to overfitting in complex datasets. Ridge and Lasso regression introduce regularization to improve stability and generalization. Ridge is best when all features matter, while Lasso is preferred for sparse, interpretable models. Understanding when and how to apply these techniques is essential for both exams and real-world machine learning problems.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Building an Effective Power BI Dashboard: Connection, Cleaning, Modeling &amp; Design</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Wed, 10 Dec 2025 11:24:30 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/building-an-effective-power-bi-dashboard-connection-cleaning-modeling-design-1k0d</link>
      <guid>https://dev.to/maureenmuthonihue/building-an-effective-power-bi-dashboard-connection-cleaning-modeling-design-1k0d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Our hospital management system began with five interconnected tables storing operational data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Patients - Demographics and contact information&lt;/li&gt;
&lt;li&gt;Doctors - Provider profiles and specializations&lt;/li&gt;
&lt;li&gt;Appointments - Scheduling and visit records&lt;/li&gt;
&lt;li&gt;Admissions - Inpatient stay information&lt;/li&gt;
&lt;li&gt;Bills - Financial transactions and payment tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Integration Challenge
&lt;/h3&gt;

&lt;p&gt;Raw data rarely tells a complete story. Our appointment records contained timestamps but lacked easy date grouping. Status fields had inconsistent formatting ("cancelled" vs "Cancelled" vs "CANCELLED"). Most critically, connecting appointment data to patient demographics and doctor specializations required three separate table joins.&lt;br&gt;
For financial analysis, understanding a patient's complete billing history meant traversing from patients → admissions → bills, aggregating along the way.&lt;br&gt;
&lt;strong&gt;The Solution&lt;/strong&gt;: Build staging views, making the data consistently accessible for all downstream analysis &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;View 1: Appointments_Enriched (Operational Hub)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Combines the most frequently accessed data points&lt;/li&gt;
&lt;li&gt;Eliminates repetitive join logic across reports&lt;/li&gt;
&lt;li&gt;Maintains real-time accuracy (dynamic view updates automatically)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgr5v0nh5g3nwxfayh3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgr5v0nh5g3nwxfayh3y.png" alt="Appointment_Enriched" width="800" height="692"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key Decision: Used inner joins to ensure data integrity. Appointments without valid patient/doctor references are excluded, preventing corrupt data from polluting reports.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;View 2: Patient_Balances (Financial Lens)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Enables quick identification of collection priorities&lt;br&gt;
Supports cash flow forecasting and bad debt analysis&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyra9gwenxn7l46cncg4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyra9gwenxn7l46cncg4k.png" alt="Patient_Balances" width="735" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key Decision: Aggregated at "PatientID" level rather than admission level. &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;View 3: Doctor_Monthly_Metrics (Performance Tracker)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf6ar8fbuyrfu04tgpig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvf6ar8fbuyrfu04tgpig.png" alt="Doctor monthly metrics" width="800" height="519"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk through the key steps I followed: connecting to the data, cleaning it, modelling it, and designing a clear dashboard, highlighting the practical decisions that helped shape the final report.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Connecting to the Data
&lt;/h3&gt;

&lt;p&gt;The project began by bringing multiple data sources into Power BI Desktop. These included structured tables that contained records, lookup information, and date-related fields. Using Power BI’s Get Data interface, the sources were imported in Import Mode from a PostgreSQL database.&lt;br&gt;
Once the tables were loaded, I confirmed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data types were detected correctly&lt;/li&gt;
&lt;li&gt;Column names were consistent&lt;/li&gt;
&lt;li&gt;Tables aligned logically (fact vs. dimension)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This initial step set the foundation for all transformations and modelling work.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cleaning &amp;amp; Transforming the Data
&lt;/h3&gt;

&lt;p&gt;Most of the data preparation happened in Power Query, where I carried out cleaning tasks before loading tables into the model.&lt;/p&gt;

&lt;p&gt;Key cleaning steps included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Renaming columns to clear, readable names.&lt;/li&gt;
&lt;li&gt;Merging columns, such as first and last name into full names.&lt;/li&gt;
&lt;li&gt;Fixing incorrect data types (e.g., numbers stored as text, date/time inconsistencies).&lt;/li&gt;
&lt;li&gt;Normalizing categories so that values followed a consistent naming convention.&lt;/li&gt;
&lt;li&gt;Filtering out invalid records and replacing missing values that could distort metrics.&lt;/li&gt;
&lt;li&gt;Removing duplicates to avoid inflated totals.&lt;/li&gt;
&lt;li&gt;Trimming extra spaces from text fields.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These cleaning procedures ensured that the dataset was accurate, consistent, and analytics ready.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Data Modelling Choices
&lt;/h3&gt;

&lt;p&gt;I used a star schema design to keep the model simple, efficient, and easy to scale.&lt;br&gt;
The model included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fact tables holding transactional or event level records.&lt;/li&gt;
&lt;li&gt;Dimension tables for people, categories, products, locations, and calendar data.&lt;/li&gt;
&lt;li&gt;A dedicated Date table, enabling accurate time-based analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Relationships were kept single-directional except where needed for specific behaviours, and unused columns were removed to keep the model lean.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoi6iu7kkne0djwq4xwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpoi6iu7kkne0djwq4xwc.png" alt="Schema" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Dashboard Design &amp;amp; Visual Layout
&lt;/h3&gt;

&lt;p&gt;With the data clean and model optimized, I designed an interactive dashboard intended to provide both summaries and insights.&lt;br&gt;
The dashboard included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KPI cards showing totals and key performance indicators&lt;/li&gt;
&lt;li&gt;Trend charts to show how activity changed over time&lt;/li&gt;
&lt;li&gt;Category comparisons using bar and column charts&lt;/li&gt;
&lt;li&gt;Detailed tables for users who want record-level detail&lt;/li&gt;
&lt;li&gt;Slicers and filters for month, doctor's name, and specialization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Final Dashboard
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Home screen with KPIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjsp4kh7iuy59d3zzk5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjsp4kh7iuy59d3zzk5w.png" alt="KPIs" width="800" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trend charts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwx4ybx5saf6a43mn1x69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwx4ybx5saf6a43mn1x69.png" alt="Charts" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Category breakdowns&lt;/li&gt;
&lt;li&gt;Detailed tables&lt;/li&gt;
&lt;li&gt;Filters panel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ke9z36yllbcvox15yqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ke9z36yllbcvox15yqg.png" alt="Filters" width="175" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dashboards&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexeipsxedt55vqmi53qh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexeipsxedt55vqmi53qh.png" alt="DB_1" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5gjtwoagkiio8bxr5nj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5gjtwoagkiio8bxr5nj.png" alt="DB_2" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqrby2mf0t9o1iamqnff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqrby2mf0t9o1iamqnff.png" alt="DB_3" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfiuynzeakpqllr2mati.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfiuynzeakpqllr2mati.png" alt="DB_4" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This Power BI project successfully transforms raw data into clear, actionable insights. Through careful data cleaning, a well structured star schema model, and thoughtfully designed visuals, the dashboard provides users with an intuitive way to explore trends and compare performance. The result is a clean, interactive, and reliable report that supports quick understanding and informed decision making.&lt;/p&gt;

&lt;p&gt;This was done as a group.&lt;br&gt;
&lt;strong&gt;co- authors:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. - Hilda Chepkirui&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;2. - Asha Siyat&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;3. - Saciid Shaakaal&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;4. - Samuel Irungu&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>discuss</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Connecting PostgreSQL to Power BI</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Sun, 23 Nov 2025 18:57:53 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/connecting-postgresql-to-power-bi-l36</link>
      <guid>https://dev.to/maureenmuthonihue/connecting-postgresql-to-power-bi-l36</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Power BI is one of the most popular business intelligence tools for data visualization and analytics. Combined with PostgreSQL, a powerful open-source relational database, you can create dashboards and reports. This guide will walk you through connecting PostgreSQL to Power BI using two approaches: a local PostgreSQL installation and Aiven's cloud-hosted PostgreSQL service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting Local PostgreSQL to Power BI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1. Installation
&lt;/h3&gt;

&lt;p&gt;Download PostgreSQL from the official PostgreSQL website and follow the installation process. Download Power BI from the Microsoft Store. During installation, note your user password and port number. If yours is a local installation, the default port number is 5432.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2. Preparing your database
&lt;/h3&gt;

&lt;p&gt;Ensure your database contains the data you want to visualize, and make sure your PostgreSQL server is active.&lt;br&gt;
Typical default settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Host&lt;/strong&gt;: localhost&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Port&lt;/strong&gt;: 5432&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Default database&lt;/strong&gt;: postgres&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Username&lt;/strong&gt;: postgres&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then test the connection.&lt;/p&gt;
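&lt;p&gt;If you want to verify the connection outside Power BI first, a minimal Python sketch using the psycopg2 package (installed separately, e.g. with pip install psycopg2-binary) can help; this step is optional and not part of Power BI itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import psycopg2

# Adjust these values to match your local PostgreSQL installation
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgres",
    user="postgres",
    password="your_password_here",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;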

&lt;h3&gt;
  
  
  Step 3. Connect PostgreSQL to Power BI.
&lt;/h3&gt;

&lt;p&gt;Open Power BI and follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on "Get Data" on the home ribbon.&lt;/li&gt;
&lt;li&gt;In the Get Data window, navigate to More &amp;gt; Database &amp;gt; PostgreSQL database.&lt;/li&gt;
&lt;li&gt;Click "Connect".&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fur2dvyh14kesty2x4av6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fur2dvyh14kesty2x4av6.png" alt="Postgres Database" width="800" height="781"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4. Enter local connection details.
&lt;/h3&gt;

&lt;p&gt;Fill the dialog:&lt;br&gt;
&lt;strong&gt;Server&lt;/strong&gt;: localhost:5432&lt;br&gt;
&lt;strong&gt;Database&lt;/strong&gt;: postgres (or name of your DB)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffa2zw7e1x1ps90tshcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffa2zw7e1x1ps90tshcg.png" alt="Dialog Box" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click "OK"&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5. Enter Credentials.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Username&lt;/strong&gt;: your PostgreSQL username&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Password&lt;/strong&gt;: your password&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select&lt;/strong&gt; “Use Encrypted Connection” if available&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Click&lt;/strong&gt; "Connect".&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 6. Load Data.
&lt;/h3&gt;

&lt;p&gt;The Navigator window will display all available tables and views in your database. Select the tables you want to work with by checking the boxes next to them. You can preview the data by clicking on each table name.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting Aiven PostgreSQL to Power BI.
&lt;/h2&gt;

&lt;p&gt;Aiven is a cloud-based platform that provides fully managed services for open-source data technologies like databases and streaming services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1. Set up Aiven PostgreSQL.
&lt;/h3&gt;

&lt;p&gt;If you don't have an Aiven account, sign up at aiven.io. Aiven offers a free trial to test their services.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new PostgreSQL service:&lt;/li&gt;
&lt;li&gt;Log into the Aiven console&lt;/li&gt;
&lt;li&gt;Click "Create a new service"&lt;/li&gt;
&lt;li&gt;Select "PostgreSQL" as the service type&lt;/li&gt;
&lt;li&gt;Choose your cloud provider and region&lt;/li&gt;
&lt;li&gt;Select a service plan based on your needs&lt;/li&gt;
&lt;li&gt;Name your service and click "Create service"&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 2. Retrieve Connection Information.
&lt;/h3&gt;

&lt;p&gt;In the Aiven console, click on your PostgreSQL service to view its details. You'll find the connection information in the "Overview" tab:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Host&lt;/strong&gt;: The service URI host (usually in the format service-name-project-name.aivencloud.com)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port&lt;/strong&gt;: The port number assigned to your service (shown in the console; Aiven typically uses a non-default port)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User&lt;/strong&gt;: Default is "avnadmin"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Password&lt;/strong&gt;: Click the eye icon to reveal the password&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: Default is "defaultdb"&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 3. Download the CA certificate.
&lt;/h3&gt;

&lt;p&gt;Aiven enforces SSL connections for security. Download the CA certificate from the Aiven console:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In your service overview, find the "Download CA cert" button&lt;/li&gt;
&lt;li&gt;Save the certificate file to a known location on your computer&lt;/li&gt;
&lt;li&gt;Note the file path for later use&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4. Connect Power Bi to PostgreSQL.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click "Get Data" from the Home ribbon&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select "PostgreSQL database" and click "Connect"&lt;br&gt;
In the connection dialog, enter:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Server&lt;/strong&gt;: Your Aiven host address followed by the port number (host:port).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;: Your database name ("defaultdb").&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54j72t3y6ppcv03v33zh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54j72t3y6ppcv03v33zh.png" alt="Postgres connection 2" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5. Authentication.
&lt;/h3&gt;

&lt;p&gt;Enter the username and password from your Aiven service.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6. Load data and transform data.
&lt;/h3&gt;

&lt;p&gt;Just like with local PostgreSQL, the Navigator window will show your available tables and views. Select the data you need and click "Load" or "Transform Data" to begin working with your Aiven hosted data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Connecting PostgreSQL to Power BI whether running locally or hosted on Aiven is simple once the correct drivers and SSL configurations are in place.&lt;br&gt;
Local PostgreSQL connects using localhost and standard credentials.&lt;br&gt;
Aiven PostgreSQL requires SSL certificates and cloud connection parameters.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Excel in the Era of Power BI &amp; Python</title>
      <dc:creator>Maureen Muthoni</dc:creator>
      <pubDate>Sat, 04 Oct 2025 06:51:51 +0000</pubDate>
      <link>https://dev.to/maureenmuthonihue/excel-in-the-era-of-power-bi-python-o1a</link>
      <guid>https://dev.to/maureenmuthonihue/excel-in-the-era-of-power-bi-python-o1a</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Whenever I receive a messy dataset, I always reach for Excel first, not Power BI or Python.&lt;/p&gt;

&lt;p&gt;Excel remains the go-to first option for operational analytics. It handles quick analysis, data entry, and rapid data cleaning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Excel Shines
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Excel is ubiquitous: almost everyone knows how to use it, and it is available almost everywhere.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No coding: it does not require a programming language, which can be a barrier for some users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rapid data cleaning: it is quick for cleaning data and data entry, without requiring complex operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interactive dashboards: it is easy to build dashboards, KPIs, and dynamic visuals in Excel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Self-contained: no dependencies, environments, or deployment needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Power BI
&lt;/h2&gt;

&lt;p&gt;Power BI is mostly better for automated reporting dashboards, real-time data connections, and interactive visualizations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Scalability: it handles large data with ease and connects to live data sources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Visual polish: its visuals are sleek and interactive, with modern charts and drill-through, and it is great for storytelling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DAX: a powerful modelling language that unlocks advanced metrics and time intelligence.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python
&lt;/h2&gt;

&lt;p&gt;Python performs complex analysis, automation, machine learning and data engineering. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Automation: it builds repeatable pipelines for cleaning and transformation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Advanced analytics: it enables techniques like regression, clustering, and forecasting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration: it connects to APIs, databases, and cloud services.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practical Stack
&lt;/h3&gt;

&lt;p&gt;Here's how I view modern analytics:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nrr44zdxg1k63oeio9l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nrr44zdxg1k63oeio9l.png" alt="Data Flowchart" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each tool has its strengths, and together they form a flexible workflow.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Power BI and Python are taking over repeatable automated reports, large databases, complex transformations, and production dashboards, but Excel still thrives in business work where flexibility matters more than scalability.&lt;/p&gt;

&lt;p&gt;Most organisations use all three tools, but many analysts still prototype in Excel before building something more formal.&lt;/p&gt;

&lt;p&gt;Excel also continues to evolve: it now includes Python integration, Power Query for ETL tasks, and other sophisticated functions. Rather than becoming obsolete, it has become part of an integrated toolkit.&lt;/p&gt;

&lt;p&gt;Excel isn't going anywhere; it's still the fastest way to clean, model, and explore data.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>discuss</category>
      <category>analytics</category>
      <category>luxdev</category>
    </item>
  </channel>
</rss>
