<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Paulet Wairagu</title>
    <description>The latest articles on DEV Community by Paulet Wairagu (@pauletart).</description>
    <link>https://dev.to/pauletart</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1027631%2F566d21f6-0d69-46e1-a34a-4d0ed530615d.jpeg</url>
      <title>DEV Community: Paulet Wairagu</title>
      <link>https://dev.to/pauletart</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pauletart"/>
    <language>en</language>
    <item>
      <title>QN : Get started with lakehouses in Microsoft Fabric</title>
      <dc:creator>Paulet Wairagu</dc:creator>
      <pubDate>Thu, 04 Jun 2026 17:08:23 +0000</pubDate>
      <link>https://dev.to/pauletart/qn-get-started-with-lakehouses-in-microsoft-fabric-52f6</link>
      <guid>https://dev.to/pauletart/qn-get-started-with-lakehouses-in-microsoft-fabric-52f6</guid>
      <description>&lt;ul&gt;
&lt;li&gt;A lakehouse is a unified platform that combines:

&lt;ul&gt;
&lt;li&gt;The flexible and scalable storage of a data &lt;strong&gt;lake&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The query and analysis capabilities of a data ware&lt;strong&gt;house&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;A lakehouse uses Apache Spark and SQL compute engines to process and analyze data at scale.&lt;/li&gt;

&lt;li&gt;Traditional data warehouses handle structured data well but struggle with semi-structured and unstructured data from sources such as app logs and IoT devices, which leads to data silos and complex integration efforts.&lt;/li&gt;

&lt;li&gt;Data lakes offer flexibility and scalability but lack the structure and performance needed for business analytics.&lt;/li&gt;

&lt;li&gt;Data warehouses have strong analytical capabilities but struggle with varied data formats and are costly to scale.&lt;/li&gt;

&lt;li&gt;Lakehouse design:

&lt;ul&gt;
&lt;li&gt;Tables: Delta Lake tables that provide structured, queryable data

&lt;ul&gt;
&lt;li&gt;Support SQL queries through the SQL analytics endpoint&lt;/li&gt;
&lt;li&gt;Enforce schemas and support ACID transactions&lt;/li&gt;
&lt;li&gt;Can be accessed in Power BI for reporting&lt;/li&gt;
&lt;li&gt;Benefit from automatic optimization and maintenance&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Files: store raw or semi-structured data files in their native format

&lt;ul&gt;
&lt;li&gt;Support any file format (CSV, JSON, Parquet, images, documents)&lt;/li&gt;
&lt;li&gt;Provide flexibility for data exploration and processing&lt;/li&gt;
&lt;li&gt;Can be staged before transformation into tables&lt;/li&gt;
&lt;li&gt;Don't enforce schema or support direct SQL queries&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;Delta Lake is an open source storage layer that brings reliability to data lakes.&lt;/li&gt;

&lt;li&gt;Data is stored in delta format in OneLake storage&lt;/li&gt;

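&lt;li&gt;Delta Lake advantages:

&lt;ul&gt;
&lt;li&gt;ACID transactions: data stays consistent under concurrent reads and writes&lt;/li&gt;
&lt;li&gt;Schema enforcement: incoming data is validated against the table schema&lt;/li&gt;
&lt;li&gt;Time travel: the transaction log keeps a version history, so earlier versions of a table can be queried&lt;/li&gt;
&lt;li&gt;Updates and deletes: existing rows can be modified or removed, not only appended to&lt;/li&gt;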
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;A Delta table consists of Parquet data files plus a transaction log.&lt;/li&gt;

&lt;li&gt;This design supports both batch and streaming workloads.&lt;/li&gt;

&lt;li&gt;Lakehouse access:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workspace roles&lt;/strong&gt; for collaborators who need access to all items in the workspace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Item-level sharing&lt;/strong&gt; to grant read-only access for specific needs, such as analytics or Power BI report development&lt;/li&gt;
&lt;li&gt;SQL analytics endpoint supports &lt;strong&gt;row-level&lt;/strong&gt; and &lt;strong&gt;column-level security&lt;/strong&gt;, so you can restrict what specific users see when they query through SQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema-level permissions&lt;/strong&gt; to control access by business domain&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Well-organized lakehouse data becomes the foundation that intelligent experiences across Microsoft Fabric depend on.&lt;/li&gt;

&lt;li&gt;The investment you make in organizing, naming, and structuring lakehouse data pays dividends beyond your immediate analytics needs. Good data engineering practices in the lakehouse create a reusable foundation for intelligent experiences across the platform.&lt;/li&gt;

&lt;/ul&gt;
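
&lt;p&gt;To make the "Parquet data files + transaction log" idea concrete, here is a toy sketch in plain Python. It is not Delta Lake's actual file format or API: the log structure, the file names, and the files_at helper are all assumptions made up for illustration. It only shows how replaying a log of added and removed files up to a given version yields the table contents at that version, which is the idea behind time travel.&lt;/p&gt;

```python
# Toy commit log (hypothetical structure and file names, not the real Delta format).
# Each commit records which data files it added to or removed from the table.
log = [
    {"version": 0, "add": ["part-0000.parquet"], "remove": []},
    {"version": 1, "add": ["part-0001.parquet"], "remove": []},
    {"version": 2, "add": ["part-0002.parquet"], "remove": ["part-0000.parquet"]},
]


def files_at(commits, version):
    """Replay commits up to and including `version`; return the active data files."""
    active = set()
    for commit in commits:
        if commit["version"] > version:
            break
        active.update(commit["add"])
        active.difference_update(commit["remove"])
    return sorted(active)


latest = files_at(log, 2)    # current state of the table
earlier = files_at(log, 1)   # "time travel" to version 1
print(latest)
print(earlier)
```

&lt;p&gt;Because old data files are only marked as removed in the log, an earlier version can still be reconstructed by stopping the replay sooner.&lt;/p&gt;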

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>microsoft</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>QN:Introduction to end-to-end analytics using Microsoft Fabric</title>
      <dc:creator>Paulet Wairagu</dc:creator>
      <pubDate>Thu, 04 Jun 2026 16:29:34 +0000</pubDate>
      <link>https://dev.to/pauletart/introduction-to-end-to-end-analytics-using-microsoft-fabric-1oec</link>
      <guid>https://dev.to/pauletart/introduction-to-end-to-end-analytics-using-microsoft-fabric-1oec</guid>
      <description>&lt;p&gt;Quick Short notes series&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Microsoft Fabric is an end-to-end analytics platform that provides a single, integrated environment where data professionals and the business collaborate on data projects. Built on a unified data lake called &lt;strong&gt;OneLake&lt;/strong&gt;, Fabric brings together the tools you need across that entire lifecycle.&lt;/li&gt;
&lt;li&gt;Fabric is a unified &lt;em&gt;software-as-a-service&lt;/em&gt; (SaaS) platform where all data is stored in a single open format in OneLake. All analytics engines in the platform can access OneLake, ensuring scalability, cost-effectiveness, and accessibility from anywhere with an internet connection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OneLake&lt;/strong&gt; is Fabric's centralized data storage architecture that enables collaboration by eliminating the need to move or copy data between systems&lt;/li&gt;
&lt;li&gt;OneLake is built on &lt;strong&gt;Azure Data Lake Storage Gen2&lt;/strong&gt; (ADLS Gen2) and supports various formats, including Delta, Parquet, CSV, and JSON&lt;/li&gt;
&lt;li&gt;All compute engines in Fabric automatically store their data in OneLake, making it directly accessible without the need for movement or duplication.&lt;/li&gt;
&lt;li&gt;For tabular data, the analytical engines in Fabric write data in delta-parquet format and all engines interact with the format seamlessly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shortcuts&lt;/strong&gt; are references to files or storage locations within OneLake or external data sources, such as Azure Data Lake Storage, Amazon S3, or Dataverse. Shortcuts allow you to access existing data without copying it, ensuring data consistency and enabling Fabric to stay in sync with the source.&lt;/li&gt;
&lt;li&gt;Workspaces serve as logical containers that help you organize and manage your data, reports, and other assets.&lt;/li&gt;
&lt;li&gt;Each workspace has its own set of permissions, ensuring that only authorized users can view or modify its contents.&lt;/li&gt;
&lt;li&gt;Workspaces allow you to manage compute resources and integrate with Git for version control. You can optimize performance and cost by configuring compute settings, while Git integration helps track changes, collaborate on code, and maintain a history of your work.&lt;/li&gt;
&lt;li&gt;Fabric administration is centralized in the &lt;strong&gt;Admin portal&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;In the admin portal you can manage groups and permissions, configure data sources and gateways, and monitor usage and performance. You can also access the Fabric admin APIs and SDKs in the admin portal, which can automate common tasks and integrate Fabric with other systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OneLake catalog&lt;/strong&gt; helps you analyze, monitor, and maintain data governance. It provides guidance on sensitivity labels, item metadata, and data refresh status, offering insights into the governance status and actions for improvement.&lt;/li&gt;
&lt;li&gt;Fabric increases collaboration between data professionals by removing data silos and the need for multiple systems.&lt;/li&gt;
&lt;li&gt;In &lt;em&gt;Workspace settings&lt;/em&gt;, you can configure:

&lt;ul&gt;
&lt;li&gt;License type to use Fabric features.&lt;/li&gt;
&lt;li&gt;OneDrive access for the workspace.&lt;/li&gt;
&lt;li&gt;Azure Data Lake Gen2 Storage connection.&lt;/li&gt;
&lt;li&gt;Git integration for version control.&lt;/li&gt;
&lt;li&gt;Spark workload settings for performance optimization&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>analytics</category>
      <category>analyticsengineering</category>
    </item>
    <item>
      <title>THINKING GAME DOCUMENTARY: MY REVIEW</title>
      <dc:creator>Paulet Wairagu</dc:creator>
      <pubDate>Wed, 26 Nov 2025 18:47:57 +0000</pubDate>
      <link>https://dev.to/pauletart/thinking-game-documentary-my-review-36c4</link>
      <guid>https://dev.to/pauletart/thinking-game-documentary-my-review-36c4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31ku2jttaxit6k2q4h7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31ku2jttaxit6k2q4h7o.png" alt=" " width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I finally watched AlphaGo, that documentary about the Google DeepMind AI that took on the world champion in Go, and honestly… I didn’t expect to enjoy it this much. I thought it would be one of those “tech bros hype themselves” films, but wueh, it’s actually deep.&lt;/p&gt;

&lt;p&gt;First things first, the game itself. Go is like chess on steroids. Watching those pros talk about it felt like watching athletes explaining how they breathe. The amount of strategy, intuition, and reading the board… it made me respect the game a lot more. And the way the documentary broke it down for normal people? Lovely. Even those of us who have never touched a Go board can follow the tension.&lt;/p&gt;

&lt;p&gt;Then there's Lee Sedol. Man, that guy carried the emotional weight of the whole thing. You feel the pressure on him — not just to win, but to defend human creativity. The guy literally said he felt like he was playing on behalf of everyone. That scene where he loses a game and walks out looking completely defeated? Your heart just sinks. Been there, that feeling of “I did everything and still lost.”&lt;/p&gt;

&lt;p&gt;And then AlphaGo. The AI itself almost feels like a character. Quiet, calculating, no hype, just vibes and probabilities. The wild part is when it makes those “impossible” moves that even the Go masters can’t understand. Move 37 especially — the commentators looked like they’d seen witchcraft. Even Sedol was like, “No human plays like that.” That’s the moment I realized AI isn’t just copying; sometimes it’s genuinely creating.&lt;/p&gt;

&lt;p&gt;But my favourite part is how the doc doesn’t frame it as “humans vs robots.” It shows how the match changed how humans think. After the loss, the pros started studying AlphaGo games and discovering new strategies. Like the AI unlocked creativity instead of killing it. That hit me because we’re in that same AI era now — people thinking AI is coming to take all jobs, yet here we are, learning new ways of thinking from it.&lt;/p&gt;

&lt;p&gt;Cinematography was also clean — the slow, quiet shots, the close-ups, the music. It’s not rushed or over-dramatic. Just calm, like meditation.&lt;/p&gt;

&lt;p&gt;If you like strategy, tech, psychology, or you just want to see a human fight for dignity against a machine, this is a solid watch. It’s not a hype documentary; it’s a thoughtful one. Emotion, tension, and a bit of “eish, surely, how is an algorithm beating a whole world champion?”&lt;/p&gt;

&lt;p&gt;Highly recommend.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>datascience</category>
      <category>alphago</category>
    </item>
    <item>
      <title>The 5 Most Common Data Quality Issues (and How Analysts Can Fix Them)</title>
      <dc:creator>Paulet Wairagu</dc:creator>
      <pubDate>Mon, 24 Nov 2025 10:50:56 +0000</pubDate>
      <link>https://dev.to/pauletart/the-5-most-common-data-quality-issues-and-how-analysts-can-fix-them-3c7p</link>
      <guid>https://dev.to/pauletart/the-5-most-common-data-quality-issues-and-how-analysts-can-fix-them-3c7p</guid>
      <description>&lt;p&gt;Data analysts spend more time cleaning data than analyzing it. In fact, in most real-world projects, 60–80% of your time goes into preparing data for meaningful insights. &lt;br&gt;
Poor data quality leads to incorrect conclusions, broken dashboards, and bad decisions, which is why understanding common issues and knowing how to fix them is a core skill for every analyst.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here are the five most common data quality problems and practical steps to solve each one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Missing or Null Values&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Missing data can distort metrics, create gaps in reports, or lead to inaccurate ML models.&lt;/p&gt;

&lt;p&gt;Causes:&lt;br&gt;
• Manual data entry errors&lt;br&gt;
• Incomplete integrations&lt;br&gt;
• System migration issues&lt;/p&gt;

&lt;p&gt;How to fix it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify missingness patterns by comparing COUNT(*) with COUNT(column) in SQL (COUNT(column) skips NULLs), or with df.isna().sum() in Python.&lt;/li&gt;
&lt;li&gt;Drop rows only when the affected records are not needed for the analysis.&lt;/li&gt;
&lt;li&gt;Impute using averages, medians, or domain logic.&lt;/li&gt;
&lt;li&gt;Use Power Query’s “Replace Errors” or “Fill Down” functions for structured fixes.&lt;/li&gt;
&lt;/ul&gt;
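
&lt;p&gt;The counting and imputing steps above can be sketched in plain Python. This is a minimal illustration that uses only the standard library so it runs without pandas; the sample records and the helper names count_missing and impute_median are hypothetical. In pandas, the counting step is what df.isna().sum() does.&lt;/p&gt;

```python
from statistics import median

# Hypothetical sample records; None marks a missing value.
rows = [
    {"customer": "Amina", "age": 34, "country": "Kenya"},
    {"customer": "Brian", "age": None, "country": "Kenya"},
    {"customer": "Chao", "age": 29, "country": None},
    {"customer": "Dina", "age": 41, "country": "Uganda"},
]


def count_missing(records):
    """Return how many records have None in each column."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


def impute_median(records, column):
    """Return copies of the records with None in `column` replaced by the median."""
    known = [r[column] for r in records if r[column] is not None]
    fill = median(known)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]


missing = count_missing(rows)
filled = impute_median(rows, "age")
print(missing)
print(filled[1])
```

&lt;p&gt;The median is used here only as an example; whether an average, a median, or a domain rule is the right fill depends on the column.&lt;/p&gt;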

&lt;p&gt;&lt;strong&gt;2. Inconsistent Formatting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You’ve probably seen this: “Kenya”, “kenya”, “K E N Y A”, or mismatched date formats in the same column.&lt;/p&gt;

&lt;p&gt;Why it happens:&lt;br&gt;
• Different data sources&lt;br&gt;
• Manual inputs&lt;br&gt;
• Lack of data validation rules&lt;/p&gt;

&lt;p&gt;How to fix it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apply standard casing (upper/lower/title).&lt;/li&gt;
&lt;li&gt;Convert all dates to a unified ISO format (YYYY-MM-DD).&lt;/li&gt;
&lt;li&gt;Use Excel Power Query’s “Transform → Format” options.&lt;/li&gt;
&lt;li&gt;In SQL, standardize with functions like UPPER(), TRIM(), or TO_DATE() (the exact date functions vary by database).&lt;/li&gt;
&lt;/ul&gt;
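
&lt;p&gt;As a rough sketch of the casing and date steps above, the plain Python below normalizes country names and converts dates to ISO format. The KNOWN_COUNTRIES lookup, the list of accepted date formats, and both helper names are assumptions for illustration; day-first dates are assumed, so adjust the formats to match your own sources.&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical lookup of canonical names, keyed by the name with spaces removed.
KNOWN_COUNTRIES = {"kenya": "Kenya", "southafrica": "South Africa"}

# Assumed input formats for this sketch (day-first); extend for your own data.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y"]


def clean_country(value):
    """Map spelling variants such as 'K E N Y A' or ' kenya ' to one canonical name."""
    key = "".join(value.split()).lower()
    fallback = " ".join(value.split()).title()
    return KNOWN_COUNTRIES.get(key, fallback)


def to_iso_date(text):
    """Parse a date in any of the assumed formats and return it as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text!r}")


countries = [clean_country(c) for c in ["Kenya", "kenya", "K E N Y A"]]
dates = [to_iso_date(d) for d in ["2025-11-24", "24/11/2025", "24-11-2025"]]
print(countries)
print(dates)
```

&lt;p&gt;Raising an error for unknown date formats is a deliberate choice in this sketch: a silent guess would hide exactly the kind of inconsistency you are trying to find.&lt;/p&gt;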

&lt;p&gt;&lt;strong&gt;3. Duplicate Records&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Duplicates inflate counts, break KPIs, and cause incorrect aggregations.&lt;/p&gt;

&lt;p&gt;Why it happens:&lt;br&gt;
• Multiple data entry points&lt;br&gt;
• Poor primary key definition&lt;br&gt;
• System sync issues&lt;/p&gt;

&lt;p&gt;How to fix it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify duplicates using the ROW_NUMBER() window function in SQL, partitioned by the columns that define a duplicate.&lt;/li&gt;
&lt;li&gt;Use Power Query’s “Remove Duplicates”.&lt;/li&gt;
&lt;li&gt;Implement unique IDs early in the pipeline.&lt;/li&gt;
&lt;li&gt;In Python, use df.drop_duplicates().&lt;/li&gt;
&lt;/ul&gt;
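
&lt;p&gt;The ROW_NUMBER() idea above can be imitated in plain Python to show what it does: number the rows within each group of key columns, then treat anything numbered above 1 as a duplicate. The sample orders, the choice of key columns, and the number_rows helper are hypothetical; this is a sketch of the technique, not a replacement for doing it in the database.&lt;/p&gt;

```python
# Hypothetical sample data; the third row repeats the first.
orders = [
    {"order_id": 1, "email": "a@example.com", "amount": 100},
    {"order_id": 2, "email": "b@example.com", "amount": 250},
    {"order_id": 1, "email": "a@example.com", "amount": 100},
]


def number_rows(records, key_columns):
    """Add a 1-based row_num within each key group, similar in spirit to
    ROW_NUMBER() OVER (PARTITION BY key_columns)."""
    seen = {}
    numbered = []
    for record in records:
        key = tuple(record[column] for column in key_columns)
        seen[key] = seen.get(key, 0) + 1
        numbered.append({**record, "row_num": seen[key]})
    return numbered


numbered = number_rows(orders, ["order_id", "email"])
duplicates = [r for r in numbered if r["row_num"] > 1]
deduplicated = [r for r in numbered if r["row_num"] == 1]
print(len(duplicates), len(deduplicated))
```

&lt;p&gt;Keeping the first occurrence is an assumption of this sketch. In practice you may prefer the most recent record, which means sorting before numbering.&lt;/p&gt;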

&lt;p&gt;&lt;strong&gt;4. Outliers and Incorrect Values&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some values are valid extreme cases; others are simply errors (like a customer aged 600).&lt;/p&gt;

&lt;p&gt;Why it happens:&lt;br&gt;
• Typographical errors&lt;br&gt;
• Faulty sensors or scraping issues&lt;br&gt;
• Incorrect units (meters vs. feet)&lt;/p&gt;

&lt;p&gt;How to fix it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visualize distributions using box plots or histograms.&lt;/li&gt;
&lt;li&gt;Apply domain thresholds or rule-based logic.&lt;/li&gt;
&lt;li&gt;Use interquartile ranges or z-scores for statistical outlier detection.&lt;/li&gt;
&lt;li&gt;Create automated validations in Power BI or SQL.&lt;/li&gt;
&lt;/ul&gt;
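
&lt;p&gt;Here is a minimal sketch of two of the checks above, a domain threshold and the interquartile range rule, in plain Python. The sample ages, the 0 to 120 age limits, and the conventional 1.5 multiplier are assumptions for illustration; the right limits always depend on the data.&lt;/p&gt;

```python
from statistics import quantiles

# Hypothetical customer ages; 600 is an obvious data entry error.
ages = [23, 31, 35, 38, 41, 44, 52, 600]

# Rule-based check: flag anything outside an assumed valid age range.
MIN_AGE, MAX_AGE = 0, 120
invalid = [a for a in ages if a > MAX_AGE or MIN_AGE > a]

# Statistical check: flag values more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = quantiles(ages, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a > upper_fence or lower_fence > a]

print(invalid)
print(outliers)
```

&lt;p&gt;Both checks flag the same value here, but they answer different questions: the threshold says a value is impossible, while the IQR rule only says it is unusual and worth a second look.&lt;/p&gt;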

&lt;p&gt;&lt;strong&gt;5. Mixed Granularity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mixed granularity means data at different levels of detail is combined in one column or table, for example weekly and monthly figures in the same dataset.&lt;/p&gt;

&lt;p&gt;Why it happens:&lt;br&gt;
• Data integration from multiple systems&lt;br&gt;
• Poorly designed source tables&lt;/p&gt;

&lt;p&gt;How to fix it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split datasets by granularity before analysis.&lt;/li&gt;
&lt;li&gt;Create dimension tables for dates, products, etc.&lt;/li&gt;
&lt;li&gt;Aggregate or disaggregate consistently before joining.&lt;/li&gt;
&lt;li&gt;Use a proper star schema when possible.&lt;/li&gt;
&lt;/ul&gt;
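
&lt;p&gt;To illustrate aggregating to a common level before joining, the sketch below rolls weekly sales up to months and only then compares them with monthly targets. The figures and names are hypothetical, and each week is assigned to the month of its start date, which is a simplifying assumption you would need to confirm for your own reporting rules.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical weekly sales and monthly targets at two different granularities.
weekly_sales = [
    {"week_start": "2025-10-06", "sales": 120},
    {"week_start": "2025-10-13", "sales": 150},
    {"week_start": "2025-10-27", "sales": 90},
    {"week_start": "2025-11-03", "sales": 200},
]
monthly_targets = {"2025-10": 400, "2025-11": 180}


def to_monthly(records):
    """Sum weekly sales per month, using the YYYY-MM prefix of the week start date."""
    totals = defaultdict(int)
    for record in records:
        month = record["week_start"][:7]
        totals[month] += record["sales"]
    return dict(totals)


monthly_sales = to_monthly(weekly_sales)

# Join only after both sides are at monthly granularity.
comparison = {
    month: {"sales": monthly_sales.get(month, 0), "target": target}
    for month, target in monthly_targets.items()
}
print(comparison)
```

&lt;p&gt;Joining the weekly rows directly to the monthly targets would repeat each target once per week and inflate any total computed from it.&lt;/p&gt;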

</description>
      <category>ai</category>
      <category>etl</category>
      <category>data</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
