<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Xin Xu</title>
    <description>The latest articles on DEV Community by Xin Xu (@xin_xu_5c36b5326e7008e281).</description>
    <link>https://dev.to/xin_xu_5c36b5326e7008e281</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3770513%2F808d27bf-4c5e-45b0-95f9-17327a9f1d48.png</url>
      <title>DEV Community: Xin Xu</title>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xin_xu_5c36b5326e7008e281"/>
    <language>en</language>
    <item>
      <title>Project: Building "Mini-C4" — A Production-Grade LLM Pre-training Pipeline 🏗️</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 10:22:36 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/project-building-mini-c4-a-production-grade-llm-pre-training-pipeline-hfl</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/project-building-mini-c4-a-production-grade-llm-pre-training-pipeline-hfl</guid>
      <description>&lt;h1&gt;
  
  
  Project: Building "Mini-C4" Pre-training Corpus 🏗️
&lt;/h1&gt;

&lt;p&gt;This project demonstrates how to build a miniaturized version of the &lt;strong&gt;C4 (Colossal Clean Crawled Corpus)&lt;/strong&gt; pipeline. Our mission: transform chaotic, raw web data (Common Crawl) into low-noise, deduplicated, high-quality text ready for LLM pre-training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7dm8cy8b024b2n313zd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7dm8cy8b024b2n313zd.png" alt=" " width="800" height="626"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Project Brief
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Objective:&lt;/strong&gt; Build a pipeline to process raw Common Crawl data into a clean text corpus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; Raw WARC files (&lt;code&gt;.warc.gz&lt;/code&gt;) containing HTTP headers, HTML source, and binary noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Categorized JSONL files (&lt;code&gt;final_data.jsonl&lt;/code&gt;) featuring clean text, language labels, and &lt;strong&gt;Perplexity (PPL)&lt;/strong&gt; scores.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extremely Low Signal-to-Noise Ratio:&lt;/strong&gt; Over 90% of raw web data consists of navbars, ads, SEO spam, and JS code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fuzzy Deduplication:&lt;/strong&gt; Identifying semantically similar documents across millions of records is computationally expensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Quantification:&lt;/strong&gt; How to distinguish "human-grade prose" from "machine-generated gibberish" without expensive LLM APIs.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Architecture Design
&lt;/h2&gt;

&lt;p&gt;We designed a &lt;strong&gt;Funnel-shaped pipeline&lt;/strong&gt; to filter noise layer by layer:&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech Stack Decisions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parsing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;warcio&lt;/code&gt;, &lt;code&gt;trafilatura&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;trafilatura&lt;/code&gt; excels at extracting main content (removing footers/ads) far better than BeautifulSoup.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ray&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Python's &lt;code&gt;multiprocessing&lt;/code&gt; has high overhead for large shared states. Ray’s &lt;strong&gt;Actor Model&lt;/strong&gt; scales easily from multi-core to clusters.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deduplication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MinHash LSH&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reduces pairwise comparison complexity from O(n²) to near-linear O(n) using Locality Sensitive Hashing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;KenLM&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A lightweight N-gram model used by GPT-3/CCNet to measure text "naturalness" via Perplexity.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  3. Step-by-Step Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase I: Heuristic Cleaning &amp;amp; Extraction
&lt;/h3&gt;

&lt;p&gt;Raw WARC files are a mess. We use &lt;code&gt;warcio&lt;/code&gt; for streaming and &lt;code&gt;trafilatura&lt;/code&gt; to extract the "soul" of the webpage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Insight: Streaming Processor&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;warcio.archiveiterator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ArchiveIterator&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trafilatura&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nc"&gt;ArchiveIterator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rec_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Filter for HTML only
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text/html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract main body, ignoring comments and tables
&lt;/span&gt;        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trafilatura&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content_stream&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;include_comments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;🔍 The Cleaning Rules (Gopher/C4 Standards):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Symbol-to-Word Ratio:&lt;/strong&gt; If symbols like &lt;code&gt;{ } [ ]&lt;/code&gt; make up more than 10% of tokens, the page is likely code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average Word Length:&lt;/strong&gt; High-quality English text usually averages 5-10 characters. Values &amp;gt; 15 suggest minified JS or URL lists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyword Blocklist:&lt;/strong&gt; Drop pages containing "lorem ipsum", "enable cookies", or "403 forbidden".&lt;/li&gt;
&lt;/ol&gt;
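&lt;p&gt;The three rules above can be sketched as a single filter function. This is a minimal illustration of Gopher/C4-style heuristics, not the repo's exact implementation; the 10% and 15-character thresholds come straight from the list above.&lt;/p&gt;

```python
import re

BLOCKLIST = ("lorem ipsum", "enable cookies", "403 forbidden")

def passes_heuristics(text: str) -> bool:
    """Return True if `text` survives the three rules above."""
    words = text.split()
    if not words:
        return False
    # Rule 1: symbol-to-word ratio -- too many { } [ ] suggests code.
    symbols = len(re.findall(r"[{}\[\]]", text))
    if symbols / len(words) > 0.10:
        return False
    # Rule 2: average word length above 15 suggests minified JS or URL lists.
    if sum(len(w) for w in words) / len(words) > 15:
        return False
    # Rule 3: keyword blocklist catches boilerplate and error pages.
    lowered = text.lower()
    return not any(kw in lowered for kw in BLOCKLIST)
```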

&lt;h3&gt;
  
  
  Phase II: Distributed MinHash Deduplication
&lt;/h3&gt;

&lt;p&gt;To handle "mirrored" content, we use &lt;strong&gt;Ray&lt;/strong&gt; to parallelize the computation of MinHash signatures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Insight: Ray Remote Tasks (Map-Reduce)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@ray.remote&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MinHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_perm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;span class="c1"&gt;# Map-Reduce: Dispatch batches to all CPU cores
&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;process_batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batches&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
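&lt;p&gt;The signatures above still need the LSH step that groups near-duplicates into candidate pairs. In the pipeline this is &lt;code&gt;datasketch&lt;/code&gt;'s &lt;code&gt;MinHashLSH&lt;/code&gt;; the pure-Python sketch below (a simplified one-hash-family MinHash plus banding, with an illustrative 128-permutation / 16-band split) shows the mechanics:&lt;/p&gt;

```python
import hashlib
from collections import defaultdict

NUM_PERM, BANDS = 128, 16          # 16 bands x 8 rows per band (illustrative)
ROWS = NUM_PERM // BANDS

def minhash_signature(text: str) -> list:
    """Simplified MinHash: one salted hash per 'permutation' (assumes non-empty text)."""
    words = set(text.split())
    return [
        min(int(hashlib.md5(f"{p}:{w}".encode()).hexdigest(), 16) for w in words)
        for p in range(NUM_PERM)
    ]

def candidate_pairs(docs: dict) -> set:
    """Docs sharing any band of their signature become candidate duplicates."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

&lt;p&gt;Documents that share any band fall into the same bucket and are compared; unrelated documents almost never collide.&lt;/p&gt;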



&lt;h3&gt;
  
  
  Phase III: Quality Filtering (KenLM)
&lt;/h3&gt;

&lt;p&gt;We use a pre-trained &lt;strong&gt;KenLM&lt;/strong&gt; model to calculate Perplexity. Lower perplexity means more "natural" language; the scores below are length-normalized log-probabilities, so values closer to zero indicate lower perplexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📈 Tuning the Threshold:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Score &amp;gt; -5.0:&lt;/strong&gt; Wikipedia-grade, highly fluent content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score -5.0 to -6.0:&lt;/strong&gt; Standard blog posts and forum discussions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score &amp;lt; -6.5:&lt;/strong&gt; Broken sentences, machine translation failures, or SEO keyword lists (Discard).&lt;/li&gt;
&lt;/ul&gt;
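&lt;p&gt;Assuming each document already carries a length-normalized KenLM log-score (the values in the thresholds above), the filter itself reduces to a threshold function. The bucket names and the &lt;code&gt;ppl_score&lt;/code&gt; field are illustrative:&lt;/p&gt;

```python
def quality_bucket(score: float) -> str:
    """Map a per-token KenLM log-score to a quality bucket (thresholds above)."""
    if score > -5.0:
        return "high"      # Wikipedia-grade, highly fluent
    if score >= -6.5:
        return "mid"       # blogs/forums; -6.0 to -6.5 is a gray zone, kept here
    return "discard"       # broken sentences, SEO keyword lists

def filter_docs(docs):
    """Drop discard-grade docs, tagging survivors with their bucket."""
    kept = []
    for doc in docs:
        bucket = quality_bucket(doc["ppl_score"])
        if bucket != "discard":
            kept.append({**doc, "quality": bucket})
    return kept
```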




&lt;h2&gt;
  
  
  4. Performance &amp;amp; Showcase (Data Funnel)
&lt;/h2&gt;

&lt;p&gt;Results from processing a sample 1 GB WARC file:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;In (Docs)&lt;/th&gt;
&lt;th&gt;Out (Docs)&lt;/th&gt;
&lt;th&gt;Retention&lt;/th&gt;
&lt;th&gt;Main Loss Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Raw WARC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~35,000&lt;/td&gt;
&lt;td&gt;~10,000&lt;/td&gt;
&lt;td&gt;28%&lt;/td&gt;
&lt;td&gt;Non-HTML, Empty content.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Heuristics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;td&gt;~6,500&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;td&gt;Code snippets, short text.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deduplication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6,500&lt;/td&gt;
&lt;td&gt;~4,800&lt;/td&gt;
&lt;td&gt;73%&lt;/td&gt;
&lt;td&gt;Mirrored sites, templates.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quality Filter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4,800&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3,900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;td&gt;Gibberish, non-English.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Final Yield&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;35,000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3,900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~11%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Data Purity over Volume.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  5. Scaling to Terabytes (The Next Steps)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State Management:&lt;/strong&gt; Move &lt;code&gt;MinHashLSH&lt;/code&gt; indices from RAM to &lt;strong&gt;Redis&lt;/strong&gt; or &lt;strong&gt;Cassandra&lt;/strong&gt; to handle billions of records.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O Optimization:&lt;/strong&gt; Transition from local files to &lt;strong&gt;S3/MinIO&lt;/strong&gt; using &lt;strong&gt;Apache Arrow&lt;/strong&gt; for columnar streaming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global Sharding:&lt;/strong&gt; Follow the CCNet approach—shard data by hash buckets and deduplicate within shards to minimize cross-node communication.&lt;/li&gt;
&lt;/ol&gt;
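&lt;p&gt;Step 3 can be sketched in a few lines: hashing the normalized text means exact duplicates always land on the same shard, so each shard deduplicates independently with no cross-node traffic. The 64-shard count is illustrative:&lt;/p&gt;

```python
import hashlib

NUM_SHARDS = 64  # illustrative; chosen per cluster size in practice

def shard_for(paragraph: str) -> int:
    """Identical (normalized) paragraphs hash to the same shard."""
    normalized = " ".join(paragraph.lower().split())
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```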




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building Mini-C4 is a masterclass in &lt;strong&gt;Data Funneling&lt;/strong&gt;. It’s not about how much data you have, but how effectively you can discard the garbage.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Full Source Code:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have you ever tried processing Common Crawl? What’s the weirdest thing you’ve found in a raw WARC file? Let’s talk in the comments! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Recaptioning: Upgrading Your Image-Text Data for Better Model Alignment 🚀</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 10:14:31 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/recaptioning-upgrading-your-image-text-data-for-better-model-alignment-4fme</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/recaptioning-upgrading-your-image-text-data-for-better-model-alignment-4fme</guid>
      <description>&lt;h1&gt;
  
  
  Recaptioning: Engineering High-Quality Descriptions for Multi-modal Models 🚀
&lt;/h1&gt;

&lt;p&gt;In multi-modal AI, we often face the "Garbage In, Garbage Out" problem: scraped image captions are often too vague ("a pretty cup"), too long (exceeding the 77-token limit), or simply incorrect. &lt;strong&gt;Recaptioning&lt;/strong&gt; is the process of rewriting or regenerating these descriptions to ensure they are model-ready and semantically dense.&lt;/p&gt;

&lt;p&gt;Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, this post covers why you need recaptioning, the core strategies to implement it, and how to evaluate the results.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe013gub5zbandw5m3tev.png" alt=" " width="800" height="436"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Why Recaptioning is a Game Changer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Improve Semantic Alignment:&lt;/strong&gt; Fix vague or fabricated descriptions so the text matches what is actually in the image.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapt to Model Constraints:&lt;/strong&gt; Shorten long sentences to fit token limits (e.g., CLIP's 77-token bottleneck) without losing core info.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-dimensional Coverage:&lt;/strong&gt; Generate multiple captions covering "Appearance," "Texture," and "Context" to improve retrieval robustness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardize Style:&lt;/strong&gt; Clean up slang, typos, and irregular formatting.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Core Strategies (From Simple to Advanced)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Rule-based Recaptioning (Low Cost)
&lt;/h3&gt;

&lt;p&gt;Best for small datasets where you have metadata (like OCR or Object Detection tags). Use Python and RegEx to standardize and merge tags into a clean string.&lt;/p&gt;
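&lt;p&gt;A minimal sketch of this route (the tag fields and output ordering are hypothetical; the point is normalizing and merging metadata into one clean caption):&lt;/p&gt;

```python
import re

def rule_based_caption(tags: dict) -> str:
    """Merge detector/OCR metadata into a standardized caption string."""
    parts = []
    for field in ("color", "material", "object", "context"):  # hypothetical fields
        value = tags.get(field, "").strip().lower()
        value = re.sub(r"[^a-z0-9 ]", "", value)   # strip stray punctuation
        value = re.sub(r"\s+", " ", value)          # collapse whitespace
        if value:
            parts.append(value)
    return " ".join(parts) if parts else "unlabeled image"
```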

&lt;h3&gt;
  
  
  B. Model-based Recaptioning (High Performance)
&lt;/h3&gt;

&lt;p&gt;Use Vision-Language Models (VLM) like &lt;strong&gt;BLIP-2&lt;/strong&gt; or &lt;strong&gt;LLaVA&lt;/strong&gt; to automatically generate detailed, accurate captions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation Example with BLIP-2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Blip2Processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Blip2ForConditionalGeneration&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Recaptioner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Salesforce/blip2-opt-2.7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Blip2Processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Blip2ForConditionalGeneration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question: Describe this image accurately including color, material, and context. Answer:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Generating 3 diverse captions
&lt;/span&gt;        &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_return_sequences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  C. Human-in-the-Loop (Highest Quality)
&lt;/h3&gt;

&lt;p&gt;For production datasets, use a hybrid approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mass Generation:&lt;/strong&gt; Generate 5 candidates per image using LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLIP Filtering:&lt;/strong&gt; Automatically keep the top 2 captions based on CLIP similarity scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human Audit:&lt;/strong&gt; Randomly sample 5-10% for manual correction.&lt;/li&gt;
&lt;/ol&gt;
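&lt;p&gt;Step 2 boils down to ranking candidates by cosine similarity. In production the vectors come from a CLIP image/text encoder; here they are plain lists so the ranking logic stands on its own:&lt;/p&gt;

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_captions(image_emb, captions, caption_embs, k=2):
    """Keep the k captions whose embeddings best align with the image."""
    scored = sorted(
        zip(captions, caption_embs),
        key=lambda pair: cosine(image_emb, pair[1]),
        reverse=True,
    )
    return [caption for caption, _ in scored[:k]]
```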




&lt;h2&gt;
  
  
  3. Evaluation: Is Your New Caption Better?
&lt;/h2&gt;

&lt;p&gt;Don't guess—measure. Use &lt;strong&gt;CLIP Similarity&lt;/strong&gt; to quantify the alignment between the new text and the image.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic Alignment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLIP Score (Cosine Similarity)&lt;/td&gt;
&lt;td&gt;Higher than the original caption.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text Quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Perplexity / Grammar Check&lt;/td&gt;
&lt;td&gt;Fluent, no hallucinations.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Downstream Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Recall@K in Retrieval Tasks&lt;/td&gt;
&lt;td&gt;Improved retrieval accuracy.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4. Engineering Pitfalls &amp;amp; Tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination:&lt;/strong&gt; Models might describe objects not present in the image. &lt;strong&gt;Solution:&lt;/strong&gt; Use a prompt that restricts the model to "only what you see."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Homogeneity:&lt;/strong&gt; Models often repeat the same phrases. &lt;strong&gt;Solution:&lt;/strong&gt; Increase &lt;code&gt;temperature&lt;/code&gt; (0.7-1.0) and use &lt;code&gt;repetition_penalty&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; Generating millions of captions is slow. &lt;strong&gt;Solution:&lt;/strong&gt; Use FP16/INT8 quantization and batch inference.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Recaptioning transforms "raw data" into "high-octane fuel" for multi-modal models. Whether you use simple rules or advanced VLMs, the goal remains the same: &lt;strong&gt;Precision, Adaptation, and Diversity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the full implementation guide and more multi-modal data tricks, visit the repo:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have you tried recaptioning your datasets? Did you see a jump in model performance? Share your findings below! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Image-Text Pairs: The Fuel for Multi-modal Large Language Models 🖼️✍️</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 10:12:54 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/image-text-pairs-the-fuel-for-multi-modal-large-language-models-4kle</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/image-text-pairs-the-fuel-for-multi-modal-large-language-models-4kle</guid>
      <description>&lt;h1&gt;
  
  
  Image-Text Pairs: Building the Foundation for Multi-modal AI 🖼️✍️
&lt;/h1&gt;

&lt;p&gt;In the era of Multi-modal Large Language Models (like CLIP, BLIP, and LLaVA), &lt;strong&gt;Image-Text Pairs&lt;/strong&gt; are the most critical data assets. Whether it's pre-training, fine-tuning, or evaluation, the quality of your image-text alignment directly determines the model's ability to "see" and "describe."&lt;/p&gt;

&lt;p&gt;Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, this post breaks down how to construct, validate, and pipeline multi-modal data for production-grade AI.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What are Image-Text Pairs?
&lt;/h2&gt;

&lt;p&gt;An image-text pair consists of one image and one or more matching textual descriptions. The core requirement is &lt;strong&gt;Strong Semantic Alignment&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Scenarios
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Data Requirement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Image-Text Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Precise descriptions of core features, zero redundancy.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;V-L Pre-training&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive diversity (People, Landscapes, Goods) and varied styles.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Generative AI (Stable Diffusion)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rich detail (Colors, Textures, Actions) corresponding to every pixel.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. Building High-Quality Datasets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Data Sourcing
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open Datasets:&lt;/strong&gt; Start with standards like COCO Captions, Flickr30k, or LAION-400M. (Always check licenses for commercial use!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual Annotation:&lt;/strong&gt; Use platforms like Label Studio. Rule #1: Describe the subject + attributes (e.g., "An orange tabby cat lying on a gray sofa").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Captioning:&lt;/strong&gt; Use models like BLIP-2 or LLaVA to generate initial descriptions for unlabelled images, followed by human verification.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  B. Quality Validation Checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Semantic Alignment:&lt;/strong&gt; Every claim in the text must be verifiable in the image. No hallucinations.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Uniqueness:&lt;/strong&gt; No identical descriptions for different images.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Length Optimization:&lt;/strong&gt; For CLIP-style models, keep text within 77 tokens (the text encoder's context limit).&lt;/li&gt;
&lt;/ul&gt;
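&lt;p&gt;The uniqueness and length checks can be automated. A minimal sketch in plain Python (the length check here counts words as a crude proxy; a real pipeline would count tokenizer tokens, and semantic alignment still needs a model or a human reviewer):&lt;/p&gt;

```python
from collections import Counter

def validate_captions(records, max_words=77):
    """Flag checklist violations: identical captions reused across
    different images, and captions too long for CLIP-style encoders."""
    issues = []
    counts = Counter(t for r in records for t in r["texts"])
    for r in records:
        for text in r["texts"]:
            if counts[text] > 1:
                issues.append((r["image_id"], "duplicate caption"))
            # Word count is a rough proxy; count real tokenizer tokens
            # in production, since 77 is a token (not word) limit.
            if len(text.split()) > max_words:
                issues.append((r["image_id"], "caption too long"))
    return issues

records = [
    {"image_id": "img_001", "texts": ["A white ceramic mug with blue stripes"]},
    {"image_id": "img_002", "texts": ["A white ceramic mug with blue stripes"]},
]
print(validate_captions(records))
# → [('img_001', 'duplicate caption'), ('img_002', 'duplicate caption')]
```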




&lt;h2&gt;
  
  
  3. Engineering: Storage &amp;amp; Loading
&lt;/h2&gt;

&lt;h3&gt;
  
  
  I. Storage Formats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small Scale:&lt;/strong&gt; JSONL (Easy to read and extend).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Scale:&lt;/strong&gt; Parquet or WebDataset (High compression, supports streaming/mmap).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;JSONL Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"image_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"img_001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"image_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"data/images/img_001.jpg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"texts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"A white ceramic mug with blue stripes, 350ml capacity"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"quality_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.98&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
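&lt;p&gt;Reading JSONL back is one &lt;code&gt;json.loads&lt;/code&gt; per line. A defensive reader sketch that skips malformed lines instead of aborting the whole load (the required keys are just the ones from the example above):&lt;/p&gt;

```python
import json

def read_jsonl(path, required_keys=("image_id", "image_path", "texts")):
    """Stream a JSONL file, yielding only records that parse cleanly
    and carry the expected keys; corrupt lines are skipped."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # in production, log these instead of dropping silently
            if all(k in record for k in required_keys):
                yield record

# Round-trip demo: one good record, one corrupt line
with open("pairs.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"image_id": "img_001",
                        "image_path": "data/images/img_001.jpg",
                        "texts": ["A white ceramic mug"]}) + "\n")
    f.write("{not valid json\n")

print(len(list(read_jsonl("pairs.jsonl"))))  # → 1
```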



&lt;h3&gt;
  
  
  II. High-Efficiency Loader (Python/PyTorch)
&lt;/h3&gt;

&lt;p&gt;Using a CLIP Processor to handle both image resizing and text tokenization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.utils.data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CLIPProcessor&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ImageTextPairDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jsonl_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image_root&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonl_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__getitem__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])).&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Process both modalities at once
&lt;/span&gt;        &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;texts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;77&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;squeeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CLIPProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/clip-vit-base-patch32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dataloader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ImageTextPairDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pairs.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Pitfalls &amp;amp; Solutions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pitfall&lt;/th&gt;
&lt;th&gt;Engineering Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weak Alignment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create an "Annotation Style Guide" and perform 10%+ random spot checks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Format Chaos&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standardize all images to RGB and specific resolutions (e.g., 224x224).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slow Loading&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use &lt;strong&gt;Memory Mapping (mmap)&lt;/strong&gt; for JSONL or switch to &lt;strong&gt;WebDataset&lt;/strong&gt; for sharded binary loading.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
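&lt;p&gt;The mmap idea from the table can be sketched with the standard library alone: scan the file once to record line-start offsets, then memory-map it and seek per lookup. This is a minimal single-file version of what sharded formats do at scale:&lt;/p&gt;

```python
import json
import mmap

class JsonlIndex:
    """Random access into JSONL without loading it into RAM: one pass
    records line-start offsets, then lookups seek into a memory map."""

    def __init__(self, path):
        self.f = open(path, "rb")
        self.mm = mmap.mmap(self.f.fileno(), 0, access=mmap.ACCESS_READ)
        self.offsets = [0]
        while True:
            nl = self.mm.find(b"\n", self.offsets[-1])
            if nl == -1:
                break
            self.offsets.append(nl + 1)
        # If the file ends with a newline, the last offset points past EOF
        if self.offsets[-1] >= len(self.mm):
            self.offsets.pop()

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        start = self.offsets[idx]
        end = self.mm.find(b"\n", start)
        end = len(self.mm) if end == -1 else end
        return json.loads(self.mm[start:end])
```

&lt;p&gt;For multi-machine training, sharded binary formats (WebDataset tar shards, Parquet row groups) apply the same offset idea across many files, trading per-record random access for fast sequential reads.&lt;/p&gt;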




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Image-text pairs are the "fuel" for multi-modal AI. The logic is simple but the execution is hard: &lt;strong&gt;Define Scenario → Standardize Construction → Optimize Data Pipeline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For full source code and advanced multi-modal data strategies, visit our project:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you working with custom image-text data for your models? What's the biggest challenge you've faced—quality or scale? Let's discuss! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Tokenization &amp; Serialization: The Unsung Heroes of LLM Development 🤖</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 10:03:50 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/tokenization-serialization-the-unsung-heroes-of-llm-development-2m3n</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/tokenization-serialization-the-unsung-heroes-of-llm-development-2m3n</guid>
      <description>&lt;h1&gt;
  
  
  Tokenization &amp;amp; Serialization: Mastering the Foundation of LLM Data Engineering 🤖
&lt;/h1&gt;

&lt;p&gt;In the lifecycle of Large Language Model (LLM) development, &lt;strong&gt;Tokenization&lt;/strong&gt; and &lt;strong&gt;Serialization&lt;/strong&gt; are the invisible bridges between raw data and model intelligence. One determines how a model "reads" text, while the other ensures that processed data is stored and transmitted efficiently.&lt;/p&gt;

&lt;p&gt;Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, this guide breaks down these core concepts with hands-on practice using the Hugging Face ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv64oi8q4h2yh6xe5uwfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv64oi8q4h2yh6xe5uwfo.png" alt=" " width="750" height="750"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Core Concepts: Why Do They Matter?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Tokenization: The "Translator" for LLMs
&lt;/h3&gt;

&lt;p&gt;LLMs don't understand words; they understand numbers (integers). Tokenization is the process of converting natural language into discrete &lt;strong&gt;Tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Goal:&lt;/strong&gt; Balance &lt;strong&gt;Vocabulary Size&lt;/strong&gt; and &lt;strong&gt;Text Compression Ratio&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mainstream Algorithms:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BPE (Byte Pair Encoding):&lt;/strong&gt; Used by GPT/LLaMA. Iteratively merges the highest-frequency byte pairs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WordPiece:&lt;/strong&gt; Used by BERT. Greedily splits words into the longest matching subwords.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unigram:&lt;/strong&gt; Used by T5. Selects the most probable subword segmentation based on a unigram language model.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
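&lt;p&gt;The BPE merge loop is easy to demystify with a toy implementation. This sketch works on pre-split symbol tuples rather than raw bytes, so it only illustrates the most-frequent-pair merge step, not a production trainer:&lt;/p&gt;

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a toy corpus. `words` maps a word,
    pre-split into a tuple of symbols, to its frequency."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge: fuse every occurrence of the best pair
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
print(bpe_merges(corpus, 2))  # → [('l', 'o'), ('lo', 'w')]
```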

&lt;h3&gt;
  
  
  B. Serialization: Packaging Your Data
&lt;/h3&gt;

&lt;p&gt;Serialization converts in-memory objects (like tokenized datasets or model weights) into formats (JSON, Pickle, Arrow) for storage or transmission. &lt;strong&gt;Deserialization&lt;/strong&gt; is the reverse.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why use it?&lt;/strong&gt; Avoid repeating expensive preprocessing, enable cross-framework data sharing (PyTorch ↔ TensorFlow), and persist training checkpoints.&lt;/li&gt;
&lt;/ul&gt;
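&lt;p&gt;A quick round-trip with the standard library shows the trade-off: JSON is text-based and safe to load anywhere, while Pickle is compact binary but Python-only (and, as the pitfalls section notes, unsafe on untrusted input). The token IDs here are arbitrary placeholder values:&lt;/p&gt;

```python
import json
import pickle

tokenized = {"input_ids": [[101, 2023, 2003, 102]],
             "attention_mask": [[1, 1, 1, 1]]}

# JSON: text-based, human-readable, safe to deserialize anywhere
json_bytes = json.dumps(tokenized).encode("utf-8")
assert json.loads(json_bytes) == tokenized  # lossless round-trip

# Pickle: binary, Python-only; never load pickles from untrusted sources
pkl_bytes = pickle.dumps(tokenized)
assert pickle.loads(pkl_bytes) == tokenized
```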




&lt;h2&gt;
  
  
  2. Hands-on: Tokenization &amp;amp; Serialization with Hugging Face
&lt;/h2&gt;

&lt;h3&gt;
  
  
  I. Tokenization in Practice
&lt;/h3&gt;

&lt;p&gt;Using the LLaMA-2 tokenizer as an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Load Tokenizer (Use Fast version for speed)
&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-2-7b-hf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_fast&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_special_tokens&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pad_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[PAD]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Encoding Text
&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data Engineering is the backbone of AI!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;encoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Token IDs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decoded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  II. Serialization Strategies
&lt;/h3&gt;

&lt;p&gt;Depending on your scale, you should choose different formats:&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 1: JSON (Human-readable, Cross-platform)
&lt;/h4&gt;

&lt;p&gt;Best for small datasets or debugging.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Option 2: Apache Arrow (High-performance, Scalable)
&lt;/h4&gt;

&lt;p&gt;The industry standard for large-scale LLM training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()})&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_to_disk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tokenized_dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Highly efficient binary format
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. Pitfalls &amp;amp; Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🚨 Common Pitfalls
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokenizer Mismatch:&lt;/strong&gt; Using a different tokenizer during inference than the one used in training leads to "garbage" outputs. &lt;strong&gt;Always&lt;/strong&gt; use &lt;code&gt;save_pretrained()&lt;/code&gt; to bundle the tokenizer with the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incorrect Padding Side:&lt;/strong&gt; Decoder-only models like LLaMA need &lt;code&gt;padding_side="left"&lt;/code&gt; for batched generation (right-padding puts pad tokens between the prompt and the generated text), while encoder models like BERT conventionally pad on the right. Setting this incorrectly can confuse the model's attention mechanism.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pickle Security:&lt;/strong&gt; Never unpickle data from untrusted sources (it can execute malicious code). Use &lt;strong&gt;JSON&lt;/strong&gt; or &lt;strong&gt;Safetensors&lt;/strong&gt; for public data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ✅ Best Practices
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cache Processed Data:&lt;/strong&gt; For large corpora, tokenize once and serialize to &lt;strong&gt;Parquet&lt;/strong&gt; or &lt;strong&gt;Arrow&lt;/strong&gt;. Don't re-tokenize every time you start a training job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify Consistency:&lt;/strong&gt; Always &lt;code&gt;decode&lt;/code&gt; a few serialized samples to ensure the tokens still represent the original text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Special Token Handling:&lt;/strong&gt; Ensure tokens like &lt;code&gt;[PAD]&lt;/code&gt;, &lt;code&gt;[BOS]&lt;/code&gt;, and &lt;code&gt;[EOS]&lt;/code&gt; are correctly defined and mapped in your vocabulary.&lt;/li&gt;
&lt;/ol&gt;
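&lt;p&gt;Best practice #2 can be wired into a test. A sketch with a stand-in word-level vocabulary (swap in your real tokenizer's &lt;code&gt;encode&lt;/code&gt;/&lt;code&gt;decode&lt;/code&gt;; the &lt;code&gt;spot_check&lt;/code&gt; helper is hypothetical, not from any library):&lt;/p&gt;

```python
import random

# Stand-in word-level tokenizer; a real pipeline would call
# tokenizer.encode / tokenizer.decode on the trained tokenizer.
vocab = {"data": 0, "engineering": 1, "is": 2, "fun": 3}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text):
    return [vocab[w] for w in text.lower().split()]

def decode(ids):
    return " ".join(inv_vocab[i] for i in ids)

def spot_check(samples, k=2, seed=0):
    """Decode a random subset of serialized samples and confirm the
    round-trip reproduces the original text exactly."""
    rng = random.Random(seed)
    for s in rng.sample(samples, min(k, len(samples))):
        assert decode(s["input_ids"]) == s["text"], f"round-trip mismatch: {s}"
    return True

samples = [{"text": t, "input_ids": encode(t)}
           for t in ["data engineering is fun", "engineering is data"]]
print(spot_check(samples))  # → True
```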




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Tokenization is the "first gate" for an LLM's understanding, while Serialization is the "infrastructure" that ensures your data pipeline is scalable and reproducible.&lt;/p&gt;

&lt;p&gt;If you found this helpful, check out the full code and advanced docs in our repository:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s your go-to serialization format for large datasets? Parquet, Arrow, or good old JSON? Let’s talk in the comments! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
    </item>
    <item>
      <title>Why 80% of Data Engineering is Cleaning (and How to Do It Right)</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 10:01:28 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/why-80-of-data-engineering-is-cleaning-and-how-to-do-it-right-29nm</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/why-80-of-data-engineering-is-cleaning-and-how-to-do-it-right-29nm</guid>
      <description>&lt;h1&gt;
  
  
  Data Cleaning &amp;amp; Denoising: The "Battlefield" of Data Engineering 🧹
&lt;/h1&gt;

&lt;p&gt;It is an industry consensus that data engineers spend 60% to 80% of their time on data cleaning. Why? Because raw data is messy, and "garbage in, garbage out" is the absolute truth in data science.&lt;/p&gt;

&lt;p&gt;In this post, based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, we’ll deconstruct the logic of industrial-grade data cleaning—moving from "just fixing bugs" to "building robust cleaning pipelines."&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjizkfr6kl4h04hmfflh.png" alt=" " width="800" height="800"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Where Does the "Noise" Hide?
&lt;/h2&gt;

&lt;p&gt;According to the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;Data Engineering Book&lt;/a&gt;, data quality is the prerequisite for data value. Noise typically falls into 5 categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Noise Type&lt;/th&gt;
&lt;th&gt;Symptoms&lt;/th&gt;
&lt;th&gt;Business Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Missing Values&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Null addresses, missing age fields&lt;/td&gt;
&lt;td&gt;Failed deliveries, incomplete user segments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Outliers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1M orders (avg is $100), 1000°C sensors&lt;/td&gt;
&lt;td&gt;Flawed sales forecasts, cost miscalculations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Duplicates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Double-submitted forms, sync errors&lt;/td&gt;
&lt;td&gt;Inflated user counts, duplicate revenue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inconsistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"2024-05-01" vs "05/01/24"&lt;/td&gt;
&lt;td&gt;Aggregation failures, broken time-series&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logic Conflicts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Registration date &lt;em&gt;after&lt;/em&gt; purchase date&lt;/td&gt;
&lt;td&gt;Distorted behavior analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. The Methodology: Diagnosis-Treatment-Validation
&lt;/h2&gt;

&lt;p&gt;The handbook proposes a three-step closed-loop system for industrial data cleaning:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Data Profiling (Diagnosis)
&lt;/h3&gt;

&lt;p&gt;Never start cleaning without measuring. Use &lt;strong&gt;Pandas&lt;/strong&gt; for a quick health check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Missing Value Ratio
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Outlier Detection using Boxplot
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;box&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Logic Check: Reg_date should be before Order_date
&lt;/span&gt;&lt;span class="n"&gt;conflict_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reg_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Logic conflicts found: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;conflict_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Targeted Cleaning (Treatment)
&lt;/h3&gt;

&lt;p&gt;Cleaning should be context-aware. Don't just delete everything.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Missing Values:&lt;/strong&gt; Use the &lt;strong&gt;median&lt;/strong&gt; for skewed numerical data, the &lt;strong&gt;mode&lt;/strong&gt; for categorical fields, or &lt;strong&gt;model-based imputation&lt;/strong&gt; for high-value fields where the missingness itself matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outliers:&lt;/strong&gt; Use &lt;strong&gt;winsorization&lt;/strong&gt; (clipping at, say, the 1st and 99th percentiles) or business-rule-based correction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicates:&lt;/strong&gt; Keep the first or last occurrence per business key, decided by a reliable timestamp.&lt;/li&gt;
&lt;/ul&gt;
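&lt;p&gt;The treatment rules above can be sketched in plain Python. This is a minimal, illustrative sketch; the field names (&lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;updated_at&lt;/code&gt;) and percentile bounds are assumptions, not part of the handbook:&lt;/p&gt;

```python
from statistics import median

def impute_median(values):
    """Fill missing (None) entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    return [med if v is None else v for v in values]

def winsorize(values, lower_pct=0.01, upper_pct=0.99):
    """Clip extreme values to percentile bounds instead of deleting rows."""
    ordered = sorted(values)
    lo = ordered[int(lower_pct * (len(ordered) - 1))]
    hi = ordered[int(upper_pct * (len(ordered) - 1))]
    return [min(max(v, lo), hi) for v in values]

def dedupe_keep_last(records, key="order_id", ts="updated_at"):
    """Keep only the most recent record per business key."""
    latest = {}
    for rec in sorted(records, key=lambda r: r[ts]):
        latest[rec[key]] = rec
    return list(latest.values())

print(impute_median([10, 12, None, 11]))  # the gap is filled with the median
print(winsorize([10, 12, 11, 9999]))      # the 9999 outlier is clipped, not dropped
```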

&lt;h3&gt;
  
  
  Step 3: Validation
&lt;/h3&gt;

&lt;p&gt;Repeat the profiling step. Are the null ratios acceptable? Are the logic conflicts cleared? Does the cleaned data still represent the business reality?&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The 4 Principles of Engineering Excellence
&lt;/h2&gt;

&lt;p&gt;Data cleaning isn't a one-off script; it's a repeatable, automated process. Follow these rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Traceability:&lt;/strong&gt; Log every step. Know exactly how many records were dropped and why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reusability:&lt;/strong&gt; Wrap your logic into functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-intrusive:&lt;/strong&gt; Never modify the source file. Always output to a new "Cleaned" layer (Silver layer).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation:&lt;/strong&gt; Orchestrate your cleaning jobs using &lt;strong&gt;Airflow&lt;/strong&gt; or &lt;strong&gt;Prefect&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
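&lt;p&gt;Principle 1 (traceability) is cheap to implement: wrap each cleaning step so it logs its row counts. A minimal sketch with the standard library; the step and field names are hypothetical:&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("cleaning")

def traced(step_name):
    """Decorator: log how many records a cleaning step received and dropped."""
    def wrap(fn):
        def inner(records, *args, **kwargs):
            before = len(records)
            result = fn(records, *args, **kwargs)
            log.info("%s: %d in, %d out, %d dropped",
                     step_name, before, len(result), before - len(result))
            return result
        return inner
    return wrap

@traced("drop_negative_amounts")
def drop_negative_amounts(records):
    # Hypothetical step: negative order amounts are invalid by business rule.
    return [r for r in records if r["amount"] >= 0]
```

Every pipeline run now leaves an audit trail of exactly what each step removed.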

&lt;h3&gt;
  
  
  Example: A Reusable Cleaning Module
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_missing_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fill_rules&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A reusable function for missing value imputation.
    :param fill_rules: e.g., {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;median&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fill_rules&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;median&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Data cleaning is the foundation of your data "building." If the foundation is weak, everything built on top of it will eventually crack.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt; covers the entire pipeline, from ingestion to deployment, with industrial-grade insights. If you want to move from "writing scripts" to "designing systems," this repo is a goldmine.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Repo Link:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the weirdest "dirty data" you've ever encountered in production? Let's share some horror stories in the comments! 👻👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>High-Performance Data Processing: A Practical Guide from the Data Engineering Book</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 09:58:37 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/high-performance-data-processing-a-practical-guide-from-the-data-engineering-book-2eno</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/high-performance-data-processing-a-practical-guide-from-the-data-engineering-book-2eno</guid>
      <description>&lt;h1&gt;
  
  
  Data Processing &amp;amp; Transformation: Mastering ETL/ELT Workflows with Spark and Flink ⚡
&lt;/h1&gt;

&lt;p&gt;In data engineering, transformation is where raw data becomes valuable insight. Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, this post dives deep into the ETL vs. ELT paradigms, provides hands-on code for Spark (Batch) and Flink (Stream), and shares industry best practices for performance tuning.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. ETL vs. ELT: The Paradigm Shift
&lt;/h2&gt;

&lt;p&gt;The fundamental difference lies in &lt;strong&gt;where&lt;/strong&gt; and &lt;strong&gt;when&lt;/strong&gt; the data is transformed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;ETL (Extract-Transform-Load)&lt;/th&gt;
&lt;th&gt;ELT (Extract-Load-Transform)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extract → Transform (External Engine) → Load&lt;/td&gt;
&lt;td&gt;Extract → Load (Raw Lake) → Transform (In-situ)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate Compute (e.g., Spark Cluster)&lt;/td&gt;
&lt;td&gt;Target Engine (e.g., Snowflake, Delta Lake)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-Write&lt;/strong&gt; (Structured only)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-Read&lt;/strong&gt; (Structured/Unstructured)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (Rigid rules)&lt;/td&gt;
&lt;td&gt;High (Agile exploration)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Traditional BI, Small datasets&lt;/td&gt;
&lt;td&gt;Big Data, ML, Data Lakes, Modern Lakehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. Hands-on: Batch &amp;amp; Stream Transformation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Batch Processing with Spark (ELT Paradigm)
&lt;/h3&gt;

&lt;p&gt;In ELT, we load raw CSV data into a Delta Lake table first, then perform cleaning and aggregation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spark_Batch_ELT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Extract &amp;amp; Load (Raw Layer)
&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;raw_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./delta/raw/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Transform (Cleaning &amp;amp; Aggregation)
&lt;/span&gt;&lt;span class="n"&gt;clean_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./delta/raw/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropDuplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agg_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;agg_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  B. Stream Processing with Flink (Real-time Transformation)
&lt;/h3&gt;

&lt;p&gt;Real-time UV/PV (unique visitors / page views) calculation from a Kafka stream, persisting results to Delta Lake.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Extracting from Kafka&lt;/span&gt;
&lt;span class="nc"&gt;FlinkKafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FlinkKafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"user_behavior"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleStringSchema&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Transform: 5-minute Tumbling Window for UV calculation&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;uvStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;behaviorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;behavior&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;behavior&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TumblingProcessingTimeWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;UvAggregateFunction&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

&lt;span class="c1"&gt;// Load: Sink to Delta Lake&lt;/span&gt;
&lt;span class="n"&gt;uvStream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sinkTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deltaSink&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. Transformation Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ Data Cleaning (The "Minimum Cleaning" Principle)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication:&lt;/strong&gt; Use business keys (Order ID) for batch and time-windowed logic for streams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outlier Handling:&lt;/strong&gt; Instead of deleting records, flag them (e.g., &lt;code&gt;is_valid=false&lt;/code&gt;) to maintain auditability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Null Values:&lt;/strong&gt; Use &lt;code&gt;fillna()&lt;/code&gt; for non-critical fields; discard if primary keys are missing.&lt;/li&gt;
&lt;/ul&gt;
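&lt;p&gt;The "flag, don't delete" rule for outliers can be expressed as a small helper. An illustrative sketch; the &lt;code&gt;amount&lt;/code&gt; bounds are made-up values, not a recommendation:&lt;/p&gt;

```python
def flag_outliers(records, field="amount", lo=0, hi=100_000):
    """Mark out-of-range rows with is_valid=False instead of deleting them,
    so the raw record stays available for audits."""
    flagged = []
    for rec in records:
        rec = dict(rec)  # copy: never mutate the source records
        rec["is_valid"] = (rec[field] >= lo) and (hi >= rec[field])
        flagged.append(rec)
    return flagged
```

Downstream consumers then filter on `is_valid` while auditors can still see every original row.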

&lt;h3&gt;
  
  
  🔒 Data Masking &amp;amp; Privacy
&lt;/h3&gt;

&lt;p&gt;Compliance is non-negotiable (GDPR/CCPA). Use hashing or masking for sensitive fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phone Numbers:&lt;/strong&gt; &lt;code&gt;138****5678&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emails:&lt;/strong&gt; &lt;code&gt;jo****@example.com&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
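&lt;p&gt;The two masking formats above are simple string operations. A sketch; production masking should also validate lengths and handle malformed inputs:&lt;/p&gt;

```python
def mask_phone(phone):
    """Keep the first 3 and last 4 digits: 13812345678 becomes 138****5678."""
    return phone[:3] + "****" + phone[-4:]

def mask_email(email):
    """Keep only the first two characters of the local part."""
    local, domain = email.split("@", 1)
    return local[:2] + "****@" + domain

print(mask_phone("13812345678"))       # 138****5678
print(mask_email("john@example.com"))  # jo****@example.com
```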

&lt;h3&gt;
  
  
  📏 Standardization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Naming:&lt;/strong&gt; Use &lt;code&gt;snake_case&lt;/code&gt; (e.g., &lt;code&gt;user_id&lt;/code&gt;) consistently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timezones:&lt;/strong&gt; Always standardize to &lt;strong&gt;UTC&lt;/strong&gt; (&lt;code&gt;yyyy-MM-dd HH:mm:ss&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Units:&lt;/strong&gt; Explicitly label units (e.g., &lt;code&gt;amount_usd&lt;/code&gt;, &lt;code&gt;weight_kg&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
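&lt;p&gt;Timezone normalization needs nothing beyond the standard library. A sketch; the UTC+8 offset below is just an example input:&lt;/p&gt;

```python
from datetime import datetime, timezone, timedelta

def to_utc_string(ts, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a timezone-aware datetime to the standard UTC string format."""
    return ts.astimezone(timezone.utc).strftime(fmt)

offset_8 = timezone(timedelta(hours=8))  # e.g. an event logged in UTC+8
local = datetime(2026, 2, 13, 18, 30, 0, tzinfo=offset_8)
print(to_utc_string(local))  # 2026-02-13 10:30:00
```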




&lt;h2&gt;
  
  
  4. Performance Tuning 101
&lt;/h2&gt;

&lt;p&gt;Performance is about matching resources to demand. Focus on these two levers:&lt;/p&gt;

&lt;h3&gt;
  
  
  I. Parallelism (Concurrency)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rule of Thumb:&lt;/strong&gt; &lt;code&gt;Parallelism = Total Data Size / Task Processing Capacity&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spark:&lt;/strong&gt; Adjust &lt;code&gt;spark.sql.shuffle.partitions&lt;/code&gt;. For 100GB of shuffled data, 400 to 800 partitions (roughly 128MB to 256MB each) is a good starting point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink:&lt;/strong&gt; Set operator-level parallelism. Use &lt;code&gt;rebalance()&lt;/code&gt; to prevent data skew.&lt;/li&gt;
&lt;/ul&gt;
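&lt;p&gt;The rule of thumb translates directly into arithmetic. A sketch; the 128MB-256MB per-partition target is a common convention, not a hard rule:&lt;/p&gt;

```python
import math

def suggest_shuffle_partitions(total_bytes, target_partition_bytes=256 * 1024**2):
    """Parallelism = total data size / per-task capacity, rounded up."""
    return math.ceil(total_bytes / target_partition_bytes)

# 100GB of shuffle data with a 256MB target gives 400 partitions;
# tightening the target to 128MB gives 800 -- the range quoted above.
print(suggest_shuffle_partitions(100 * 1024**3))
```

The result would then be applied via `spark.conf.set("spark.sql.shuffle.partitions", n)`.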

&lt;h3&gt;
  
  
  II. Resource Allocation
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Spark Config&lt;/th&gt;
&lt;th&gt;Flink Config&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;spark.executor.memory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;taskmanager.memory.process.size&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Cores&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;spark.executor.cores&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;taskmanager.numberOfTaskSlots&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Summary: Correctness → Performance → Cost
&lt;/h2&gt;

&lt;p&gt;As the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt; suggests: &lt;strong&gt;"First ensure correctness, then optimize performance, and finally reduce cost."&lt;/strong&gt; Never sacrifice data availability for a few seconds of speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you team Spark or team Flink for your daily transformations? Let's settle the debate in the comments! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>architecture</category>
    </item>
    <item>
      <title>From Kimball to Lakehouse: The Evolution of Data Storage (with Python Demo)</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 09:54:20 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/from-kimball-to-lakehouse-the-evolution-of-data-storage-with-python-demo-4dih</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/from-kimball-to-lakehouse-the-evolution-of-data-storage-with-python-demo-4dih</guid>
      <description>&lt;h1&gt;
  
  
  Data Storage Architecture: Deconstructing Warehouse, Lake, and Lakehouse 🏛️
&lt;/h1&gt;

&lt;p&gt;In modern data engineering, choosing the right storage architecture is critical. Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, this guide breaks down the core differences between traditional Warehouses, Data Lakes, and the modern Lakehouse, while providing a hands-on Delta Lake demo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz8p54sgv2azxr38wzdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz8p54sgv2azxr38wzdw.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Warehouse vs. Lake vs. Lakehouse
&lt;/h2&gt;

&lt;p&gt;Understanding the core philosophy of each architecture is the first step toward a successful design.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Design Philosophy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Warehouse (Kimball/Inmon)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured, integrated, non-volatile storage using Star/Snowflake schemas.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-Write.&lt;/strong&gt; Optimized for fast BI reporting and business logic.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Lake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A vast repository for raw data (Structured/Unstructured) with no strict schema.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-Read.&lt;/strong&gt; Optimized for data exploration, ML, and low-cost storage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Lakehouse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A hybrid architecture bringing warehouse management to the data lake.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Best of both.&lt;/strong&gt; Retains lake flexibility with warehouse-level ACID transactions and governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. Core Storage Design Principles
&lt;/h2&gt;

&lt;p&gt;According to the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;Data Engineering Book&lt;/a&gt;, a robust storage layer must balance &lt;strong&gt;maintainability, performance, and cost&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  A. Layering Strategy (The Medallion Architecture)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raw (Bronze/ODS):&lt;/strong&gt; Stores data in its original form. Enables reprocessing if logic changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean (Silver/CDM):&lt;/strong&gt; Deduplicated, standardized, and filtered data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated (Gold/DWD):&lt;/strong&gt; Themed data organized by business subjects (User, Order, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregated (Platinum/DM):&lt;/strong&gt; Summarized data ready for BI dashboards.&lt;/li&gt;
&lt;/ul&gt;
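&lt;p&gt;A consistent path convention makes the layers self-documenting. A minimal sketch; the bucket name and key fields are hypothetical:&lt;/p&gt;

```python
# Hypothetical layer-to-path convention; adapt names to your own lake layout.
LAYER_PATHS = {
    "bronze": "s3://lake/bronze/{source}/{table}/dt={dt}/",
    "silver": "s3://lake/silver/{domain}/{table}/dt={dt}/",
    "gold":   "s3://lake/gold/{subject}/{table}/dt={dt}/",
}

def layer_path(layer, **parts):
    """Build the storage path for a table in a given layer."""
    return LAYER_PATHS[layer].format(**parts)

print(layer_path("bronze", source="crm", table="orders", dt="2026-02-13"))
```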

&lt;h3&gt;
  
  
  B. Partitioning Strategy
&lt;/h3&gt;

&lt;p&gt;Partitions reduce the amount of data scanned, directly boosting query performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partition Keys:&lt;/strong&gt; Choose high-frequency filter fields (e.g., &lt;code&gt;dt&lt;/code&gt;, &lt;code&gt;region&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Granularity:&lt;/strong&gt; Avoid the "small file problem" by keeping partitions coarse enough (e.g., partition by &lt;code&gt;day&lt;/code&gt; rather than &lt;code&gt;second&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  C. Data Lifecycle Management (DLM)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hot Data (&amp;lt;7 days):&lt;/strong&gt; High-performance storage (SSD / Delta Lake active partitions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warm Data (7 days - 3 months):&lt;/strong&gt; Standard object storage (S3 Standard).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold Data (&amp;gt;3 months):&lt;/strong&gt; Archival storage (S3 Glacier) to minimize costs.&lt;/li&gt;
&lt;/ul&gt;
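
&lt;p&gt;The tiering rule above can be expressed as a small routing function. This is a sketch: the thresholds mirror the list, and the tier labels are illustrative, not product recommendations:&lt;/p&gt;

```python
# Route data to a storage tier by age, mirroring the thresholds above.
# Tier labels are illustrative.
def storage_tier(age_days):
    if age_days >= 90:       # older than roughly 3 months
        return "cold: archival storage (S3 Glacier)"
    if age_days >= 7:        # between 7 days and 3 months
        return "warm: standard object storage (S3 Standard)"
    return "hot: SSD / active Delta partitions"
```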




&lt;h2&gt;
  
  
  3. Hands-on: Building a Lakehouse with Delta Lake
&lt;/h2&gt;

&lt;p&gt;Delta Lake is the backbone of the Lakehouse architecture, providing ACID transactions and Schema Enforcement. Here is a Python/PySpark snippet to get you started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta.tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Initialize Spark with Delta Support
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DeltaLakehouseDemo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Ingest to Raw Layer (ODS)
&lt;/span&gt;&lt;span class="n"&gt;ods_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./lakehouse/ods/user_behavior&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-05-20&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-05-20&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purchase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ods_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Time Travel Capability
# Access a specific version of your data effortlessly
&lt;/span&gt;&lt;span class="n"&gt;df_v0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;versionAsOf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ods_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. ACID Transaction: Atomic Updates
&lt;/span&gt;&lt;span class="n"&gt;delta_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ods_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;delta_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user1&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="s"&gt;view&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Updated Lakehouse Data:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ods_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Decision Tree: Choosing Your Architecture
&lt;/h2&gt;

&lt;p&gt;Not every project needs a full Lakehouse. Use this decision tree from our handbook to decide:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Do you only need structured data for fixed BI reports?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Yes:&lt;/em&gt; Traditional Data Warehouse (Snowflake/Redshift).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;No:&lt;/em&gt; Proceed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you need to store unstructured data (logs, videos, JSON)?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Yes:&lt;/em&gt; Proceed to a Data Lake or Lakehouse.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you need ACID transactions and schema enforcement?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;No:&lt;/em&gt; Pure Data Lake (S3 + Hive/Glue).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Yes:&lt;/em&gt; &lt;strong&gt;Data Lakehouse&lt;/strong&gt; (Delta Lake / Iceberg).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
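
&lt;p&gt;The same decision tree, expressed as a small function (a sketch; the return strings are just labels):&lt;/p&gt;

```python
# The decision tree above as a function; the return strings are labels.
def choose_architecture(fixed_bi_only, needs_unstructured, needs_acid):
    if fixed_bi_only or not needs_unstructured:
        return "Data Warehouse (Snowflake/Redshift)"
    if needs_acid:
        return "Data Lakehouse (Delta Lake / Iceberg)"
    return "Data Lake (S3 + Hive/Glue)"
```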




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The evolution from Warehouses to Lakehouses represents a shift toward balancing agility with governance. By implementing layering, partitioning, and lifecycle management, you can build a storage layer that scales with your business.&lt;/p&gt;

&lt;p&gt;For the full architectural guide and more hands-on demos, visit our repository:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you still using a traditional Data Warehouse, or have you migrated to a Lakehouse? Share your migration stories below! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>learning</category>
      <category>discuss</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to Build Scalable Data Pipelines: Lessons from the Data Engineering Book</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 09:50:10 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/how-to-build-scalable-data-pipelines-lessons-from-the-data-engineering-book-2afn</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/how-to-build-scalable-data-pipelines-lessons-from-the-data-engineering-book-2afn</guid>
      <description>&lt;h1&gt;
  
  
  Data Ingestion 101: Building Robust Pipelines with CDC, Batch, and APIs 🛠️
&lt;/h1&gt;

&lt;p&gt;Data ingestion is the "first gateway" of data engineering. The stability and efficiency of your ingestion layer directly determine the quality of all downstream processing and analytics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlhrnwbyq4hbkmzl3hg6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlhrnwbyq4hbkmzl3hg6.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this guide, based on the open-source &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, we’ll explore how to handle different data sources, choose the right ingestion patterns, and implement a real-time CDC pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Understanding Your Data Sources
&lt;/h2&gt;

&lt;p&gt;We categorize data sources into two main dimensions: &lt;strong&gt;Form&lt;/strong&gt; and &lt;strong&gt;Latency&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  By Form
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured:&lt;/strong&gt; Databases (MySQL, PostgreSQL), CSVs, or ERP exports with fixed schemas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semi-Structured:&lt;/strong&gt; JSON/XML logs, Kafka messages, and NoSQL (MongoDB). These require schema inference or flattening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured:&lt;/strong&gt; PDFs, images, and audio/video files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  By Latency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch (Offline):&lt;/strong&gt; Daily/weekly reports or full database backups. High latency, but high data integrity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming (Real-time):&lt;/strong&gt; User clickstreams, payment logs, and DB change logs. Requires millisecond-level processing.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Core Ingestion Strategies
&lt;/h2&gt;

&lt;p&gt;Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book/" rel="noopener noreferrer"&gt;Data Engineering Book&lt;/a&gt;, there are three primary patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  A. CDC (Change Data Capture)
&lt;/h3&gt;

&lt;p&gt;The gold standard for database synchronization. It captures row-level changes (Insert/Update/Delete) from database logs (like MySQL Binlog) without impacting the production application.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Top Tool:&lt;/strong&gt; &lt;strong&gt;Flink CDC&lt;/strong&gt; (supports full + incremental sync).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  B. Batch Ingestion
&lt;/h3&gt;

&lt;p&gt;Standardized scheduled pulls for offline scenarios.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; &lt;strong&gt;DataX&lt;/strong&gt;, &lt;strong&gt;Apache Sqoop&lt;/strong&gt;, or even &lt;strong&gt;Python/Pandas&lt;/strong&gt; for smaller datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  C. API Pulling
&lt;/h3&gt;

&lt;p&gt;The go-to method for 3rd-party SaaS (Stripe, Shopify, TikTok Ads).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Challenges:&lt;/strong&gt; Handling OAuth2, pagination logic, and exponential backoff for rate limiting.&lt;/li&gt;
&lt;/ul&gt;
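
&lt;p&gt;Exponential backoff is worth spelling out, since naive retry loops get API clients throttled or banned. Below is a minimal sketch with an injected &lt;code&gt;sleep&lt;/code&gt; and a fake endpoint so the logic runs without a network; the retry parameters and function names are illustrative:&lt;/p&gt;

```python
import random

# Retry a rate-limited call with exponential backoff plus jitter.
# max_retries and base_delay are illustrative defaults; pass time.sleep
# as `sleep` in production (the default is a no-op so this runs instantly).
def fetch_with_backoff(call, max_retries=5, base_delay=1.0, sleep=None):
    sleep = sleep or (lambda seconds: None)
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # jitter de-synchronizes retries across parallel workers
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Fake endpoint that rate-limits the first two calls, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] >= 3:
        return {"status": 200}
    raise RuntimeError("429 Too Many Requests")

result = fetch_with_backoff(flaky_call)   # succeeds on the 3rd attempt
```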




&lt;h2&gt;
  
  
  3. Hands-on: Real-time MySQL to Kafka Pipeline
&lt;/h2&gt;

&lt;p&gt;Let's implement a real-time sync using &lt;strong&gt;Flink CDC&lt;/strong&gt;. This setup captures every change in a MySQL table and streams it to Kafka as a JSON message.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Code (Java/Flink)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySql2KafkaCDC&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;StreamExecutionEnvironment&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;enableCheckpointing&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Critical for preventing data loss&lt;/span&gt;

        &lt;span class="c1"&gt;// 1. Configure MySQL CDC Source&lt;/span&gt;
        &lt;span class="nc"&gt;MySqlSource&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mySqlSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MySqlSource&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3306&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;databaseList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"production_db"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;tableList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"production_db.orders"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cdc_user"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cdc_password"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;deserializer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;JsonDebeziumDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="c1"&gt;// Convert to JSON&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// 2. Configure Kafka Sink&lt;/span&gt;
        &lt;span class="nc"&gt;KafkaSink&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;kafkaSink&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaSink&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setBootstrapServers&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost:9092"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setRecordSerializer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;KafkaRecordSerializationSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTopic&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"db_changes_orders"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setValueSerializationSchema&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBytes&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// 3. Run the Pipeline&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mySqlSource&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;WatermarkStrategy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;noWatermarks&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"MySQL-Source"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sinkTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaSink&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"MySQL to Kafka Real-time Sync"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Common Pitfalls (And How to Avoid Them)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🚨 Data Loss
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scenario:&lt;/strong&gt; A Flink job restarts but doesn't have Checkpointing enabled, losing the current offset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; &lt;strong&gt;Always&lt;/strong&gt; enable persistent Checkpointing (S3/HDFS) and implement &lt;strong&gt;Idempotent Writes&lt;/strong&gt; at the sink.&lt;/li&gt;
&lt;/ul&gt;
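
&lt;p&gt;Idempotent writes are the half of the fix that is easy to get wrong. The sketch below shows the idea with an in-memory store keyed by primary key: replaying the same change event after a restart (at-least-once delivery) cannot create a duplicate. The store and event shape are hypothetical:&lt;/p&gt;

```python
# In-memory sketch of an idempotent sink: writes are keyed by the
# record's primary key, so replaying the same change event after a
# restart cannot create a duplicate. Store and event shape are made up.
store = {}

def idempotent_write(event):
    store[event["order_id"]] = event   # upsert: a replay just overwrites

evt = {"order_id": 42, "status": "paid"}
idempotent_write(evt)
idempotent_write(evt)   # redelivery after a checkpoint restore: harmless
```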

&lt;h3&gt;
  
  
  🐢 Data Lag
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scenario:&lt;/strong&gt; Binlog accumulation or insufficient Kafka partitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Increase Flink parallelism and split synchronization for giant tables into separate jobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧩 Schema Drift
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scenario:&lt;/strong&gt; Upstream DB changes a column from &lt;code&gt;INT&lt;/code&gt; to &lt;code&gt;STRING&lt;/code&gt;, breaking your pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Use &lt;strong&gt;Schema Validation&lt;/strong&gt; tools (like &lt;em&gt;Great Expectations&lt;/em&gt;) at the ingestion layer to catch mismatches early.&lt;/li&gt;
&lt;/ul&gt;
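
&lt;p&gt;The idea behind such a check fits in a few lines of plain Python (this is &lt;em&gt;not&lt;/em&gt; the Great Expectations API, just the concept; the expected schema and the drifted record are invented for the example):&lt;/p&gt;

```python
# Plain-Python schema check at the ingestion boundary (the concept
# Great Expectations productizes; this is NOT its API). The expected
# schema and the drifted record are invented for the example.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "status": str}

def validate(record, schema=EXPECTED_SCHEMA):
    """Return (field, problem) pairs; an empty list means the row is OK."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append((field, "missing"))
        elif not isinstance(record[field], expected_type):
            problems.append((field, f"expected {expected_type.__name__}"))
    return problems

# Upstream drifted order_id from INT to STRING -- caught before loading.
issues = validate({"order_id": "A-1001", "amount": 9.99, "status": "paid"})
```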




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Ingestion is the first line of defense for your data system. Small leaks here become floods downstream.&lt;/p&gt;

&lt;p&gt;For the full Docker-compose environment (MySQL + Kafka + Flink) and complete source code, head over to our repository:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your preferred tool for data ingestion? Are you a Flink CDC fan or do you prefer Airbyte/Meltano? Let's discuss below! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The Modern Data Stack: A Guide from the Open-Source Data Engineering Book</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 09:40:07 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/the-modern-data-stack-a-guide-from-the-open-source-data-engineering-book-34a4</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/the-modern-data-stack-a-guide-from-the-open-source-data-engineering-book-34a4</guid>
      <description>&lt;h1&gt;
  
  
  Data Engineering Fundamentals: Definitions, Tech Stacks, and Mastery Roadmap 🏗️
&lt;/h1&gt;

&lt;p&gt;Data Engineering is the "infrastructure" of the big data world. However, many people still confuse it with Data Analysis or Data Science.&lt;/p&gt;

&lt;p&gt;In this post, we’ll use the open-source &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt; to deconstruct the core logic of data engineering—from its definition and tech stack to the competency model and a quick self-test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feltjgne7962wv38yjwnf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feltjgne7962wv38yjwnf.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub Repo:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What exactly is Data Engineering?
&lt;/h2&gt;

&lt;p&gt;In our handbook, we define Data Engineering as &lt;strong&gt;the engineering practice of turning data into assets.&lt;/strong&gt; The core goal is to build stable, scalable, and efficient pipelines that transform raw, fragmented, and heterogeneous data into structured, reusable, and high-availability assets.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "House" Analogy: DE vs. DA vs. DS
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Data Engineering (DE)&lt;/th&gt;
&lt;th&gt;Data Analytics (DA)&lt;/th&gt;
&lt;th&gt;Data Science (DS)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Goal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build pipelines/foundations&lt;/td&gt;
&lt;td&gt;Interpret data/Business QA&lt;/td&gt;
&lt;td&gt;Build predictive models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data Warehouse, ETL, APIs&lt;/td&gt;
&lt;td&gt;Reports, Insights, Dashboards&lt;/td&gt;
&lt;td&gt;ML Models, AI Systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Analogy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Architect&lt;/strong&gt; (Builds the house)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Interior Designer&lt;/strong&gt; (Uses the house)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Scientist&lt;/strong&gt; (Optimizes house functions)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. Breaking Down the Modern Tech Stack
&lt;/h2&gt;

&lt;p&gt;We categorize the stack based on the "Data Flow Lifecycle" rather than just listing tools:&lt;/p&gt;

&lt;h3&gt;
  
  
  📥 Storage: The "Containers"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured:&lt;/strong&gt; Data Warehouses (Snowflake, ClickHouse, BigQuery).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured:&lt;/strong&gt; Data Lakes (S3, HDFS, MinIO).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified:&lt;/strong&gt; &lt;strong&gt;Lakehouse&lt;/strong&gt; (Delta Lake, Iceberg, Hudi) — Solving the rigidity of warehouses and the chaos of lakes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ⚙️ Compute: The "Processing Center"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch Processing:&lt;/strong&gt; Spark, Flink Batch — For heavy-duty offline processing (e.g., daily syncs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Processing:&lt;/strong&gt; Flink, Kafka Streams — For real-time processing (e.g., live order monitoring).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight Compute:&lt;/strong&gt; Polars, Dask, Trino — High-performance tools for small-to-medium datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🎼 Orchestration: The "Conductor"
&lt;/h3&gt;

&lt;p&gt;The "brain" that ensures tasks run in order (scheduling, retries, dependencies).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Tools:&lt;/strong&gt; &lt;strong&gt;Apache Airflow&lt;/strong&gt; (The industry standard), Dagster, Prefect.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🛡️ Operations &amp;amp; Observability: The "Safety Net"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Prometheus + Grafana (Monitoring), ELK (Logging).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Quality:&lt;/strong&gt; Great Expectations, Soda — Checking for missing values or schema drift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering Standards:&lt;/strong&gt; CI/CD (GitHub Actions), Environment Isolation.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. The Data Engineering Competency Model
&lt;/h2&gt;

&lt;p&gt;One of the highlights of the &lt;code&gt;data_engineering_book&lt;/code&gt; is the &lt;strong&gt;Growth Map&lt;/strong&gt;, moving beyond "tool-watching" to "capability-building":&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Foundational (The Essentials):&lt;/strong&gt; SQL (Window functions, CTEs), Data Modeling (Star/Snowflake schema), Linux/Python basics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Engineering (Mid-Level):&lt;/strong&gt; Designing ETL/ELT pipelines, understanding Batch vs. Stream, and mastering CDC (Change Data Capture).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem &amp;amp; Business (Senior):&lt;/strong&gt; Abstracting business needs into data architectures and managing cross-team data contracts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expert Level:&lt;/strong&gt; Building automated data platforms, cost optimization (FinOps), and ensuring global compliance (GDPR/Data Privacy).&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🧠 Quick Quiz: Are you ready?
&lt;/h2&gt;

&lt;p&gt;These questions are pulled from Part 1 of our book. Can you answer them?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is the core difference between &lt;strong&gt;ETL&lt;/strong&gt; and &lt;strong&gt;ELT&lt;/strong&gt;? When should you use which?&lt;/li&gt;
&lt;li&gt;What are the pros and cons of &lt;strong&gt;Star Schema&lt;/strong&gt; vs. &lt;strong&gt;Snowflake Schema&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;What is a &lt;strong&gt;DAG&lt;/strong&gt; in Airflow, and how does it manage task dependencies?&lt;/li&gt;
&lt;li&gt;What problem does a &lt;strong&gt;Lakehouse&lt;/strong&gt; (e.g., Delta Lake) solve that a traditional Data Lake cannot?&lt;/li&gt;
&lt;li&gt;How do you validate &lt;strong&gt;Data Completeness&lt;/strong&gt; in a production pipeline?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;(Check the answers in our &lt;a href="https://datascale-ai.github.io/" rel="noopener noreferrer"&gt;GitHub Wiki/Docs&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Data Engineering is about moving from being a "tool user" to a "system designer." If you’re looking for a systematic path to master these skills, check out our repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you found this helpful, give us a Star ⭐️ on GitHub to support open-source education!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>showdev</category>
      <category>data</category>
    </item>
    <item>
      <title>Data Engineering for LLMs: A Comprehensive Open-Source Guide 🚀</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 09:32:30 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/data-engineering-for-llms-a-comprehensive-open-source-guide-5b65</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/data-engineering-for-llms-a-comprehensive-open-source-guide-5b65</guid>
      <description>&lt;h1&gt;
  
  
  Data Engineering for LLMs: The Open-Source Guide to High-Quality Data Pipelines 🚀
&lt;/h1&gt;

&lt;p&gt;In the era of Large Language Models (LLMs), we all know that &lt;strong&gt;"Data quality determines the model's upper limit."&lt;/strong&gt; However, most developers are still "crossing the river by feeling the stones" when it comes to LLM data engineering. Finding systematic resources for data collection, cleaning, alignment, and RAG pipelines is surprisingly difficult. Many end up with datasets that are either low quality or impossible to deploy in production.&lt;/p&gt;

&lt;p&gt;That’s why we created &lt;strong&gt;&lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;&lt;/strong&gt; — a one-stop open-source guide for LLM data engineering, covering architecture, algorithms, and real-world projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmn8nose8erq7skahljp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmn8nose8erq7skahljp.png" alt=" " width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;br&gt;
👉 &lt;strong&gt;Live Docs:&lt;/strong&gt; &lt;a href="https://datascale-ai.github.io/" rel="noopener noreferrer"&gt;Read Online Here&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠 Why This Project?
&lt;/h2&gt;

&lt;p&gt;Current industry pain points are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fragmented Knowledge:&lt;/strong&gt; Tutorials are scattered across random blogs and papers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model-Centric Bias:&lt;/strong&gt; Too much focus on "fine-tuning parameters" while ignoring the &lt;strong&gt;Data-Centric AI&lt;/strong&gt; core.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Production Context:&lt;/strong&gt; Theory is great, but how do you scale a cleaning pipeline to billions of tokens?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our goal is to bridge this gap, helping you move from "using tools" to "building robust data lifecycles."&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗 What’s Inside?
&lt;/h2&gt;

&lt;p&gt;The handbook is structured into 6 parts, covering 13 chapters and &lt;strong&gt;5 end-to-end production projects&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  🗺 The Roadmap
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1: Infrastructure &amp;amp; Core Concepts&lt;/strong&gt; (Modern stack selection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: Text Pre-training&lt;/strong&gt; (Scraping, Cleaning, Tokenization)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3: Multi-modal Data&lt;/strong&gt; (Image-text pairs, Audio, Video)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4: Alignment &amp;amp; Synthetic Data&lt;/strong&gt; (SFT, RLHF, and Synthetic generation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 5: Application-level Engineering&lt;/strong&gt; (Advanced RAG pipelines)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 6: Hands-on Projects&lt;/strong&gt; (Runnable enterprise-level code)&lt;/li&gt;
&lt;/ul&gt;
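&lt;p&gt;To make Part 2 concrete: the cleaning stage is mostly line-level heuristics. Below is a tiny C4-style filter sketch — the word-count threshold, terminal-punctuation rule, and "javascript" boilerplate marker are illustrative stand-ins, not the exact C4 rules:&lt;/p&gt;

```python
import re

def c4_style_clean(text):
    # Keep a line only if it looks like real prose rather than page chrome.
    kept = []
    for line in text.splitlines():
        line = line.strip()
        long_enough = len(line.split()) >= 3            # drop very short lines
        ends_sentence = bool(re.search(r'[.!?"]$', line))
        no_boilerplate = "javascript" not in line.lower()
        if long_enough and ends_sentence and no_boilerplate:
            kept.append(line)
    return "\n".join(kept)

raw = ("Click here\n"
       "Enable javascript to view this page.\n"
       "Data quality determines the model's upper limit.\n")
print(c4_style_clean(raw))  # only the last line survives
```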




&lt;h2&gt;
  
  
  💻 The Modern Tech Stack
&lt;/h2&gt;

&lt;p&gt;We don't just talk theory. We focus on tools used in production today:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Tech Stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distributed Computing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ray Data, Apache Spark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parquet, WebDataset, Vector DBs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NLP Processing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Trafilatura, KenLM, MinHash LSH&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-modal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLIP, ColPali, img2dataset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Versioning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DVC, LakeFS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
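&lt;p&gt;Of the tools above, MinHash is the easiest to demystify in a few lines. This toy version approximates Jaccard similarity with salted hashes standing in for true permutations; a production pipeline would reach for a library such as &lt;code&gt;datasketch&lt;/code&gt; and add LSH banding for sub-linear lookup:&lt;/p&gt;

```python
import hashlib

def shingles(text, k=3):
    # Character k-grams as the document's feature set
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(features, num_perm=64):
    # For each "permutation" (a salt), keep the minimum hash over features;
    # matching minima between two signatures estimate Jaccard similarity.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{f}".encode()).hexdigest(), 16)
            for f in features
        ))
    return sig

def similarity(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash(shingles("the quick brown fox jumped over the lazy dog"))
c = minhash(shingles("completely different sentence about data pipelines"))
print(similarity(a, b), similarity(a, c))  # near-duplicates score much higher
```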




&lt;h2&gt;
  
  
  🚀 Hands-on Projects You Can Run
&lt;/h2&gt;

&lt;p&gt;The repo includes 5 full-stack projects with reusable code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mini-C4 Construction:&lt;/strong&gt; Build a pre-training dataset from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal Expert SFT:&lt;/strong&gt; High-quality instruction set generation for vertical domains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-modal Instruction Sets:&lt;/strong&gt; Building visual-language datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic Data Pipeline:&lt;/strong&gt; Using LLMs to generate training data for LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-modal RAG:&lt;/strong&gt; An enterprise-grade financial report assistant.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🌟 Support the Project
&lt;/h2&gt;

&lt;p&gt;This is a community-driven project maintained by the &lt;code&gt;datascale-ai&lt;/code&gt; team. It’s licensed under MIT and supports both &lt;strong&gt;English and Chinese&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you find this resource helpful for your AI journey, we’d love your support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Star the Repo:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt; ⭐️&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contribute:&lt;/strong&gt; Open an Issue or PR if you have better ideas for data cleaning or RAG optimization!&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;What is the biggest challenge you've faced in your LLM data pipeline?&lt;/strong&gt; Let’s discuss in the comments! 👇&lt;/p&gt;




</description>
      <category>ai</category>
      <category>discuss</category>
      <category>architecture</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
