<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gabriel Henrique</title>
    <description>The latest articles on DEV Community by Gabriel Henrique (@gabrielhca).</description>
    <link>https://dev.to/gabrielhca</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3204077%2F15baaf0e-d5bd-4e45-a6b4-4059c8c42f0d.jpg</url>
      <title>DEV Community: Gabriel Henrique</title>
      <link>https://dev.to/gabrielhca</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gabrielhca"/>
    <language>en</language>
    <item>
      <title>Apache Iceberg in Production: Compaction, Catalogs, and the Pitfalls Nobody Warns You About</title>
      <dc:creator>Gabriel Henrique</dc:creator>
      <pubDate>Thu, 25 Jun 2026 00:19:45 +0000</pubDate>
      <link>https://dev.to/gabrielhca/apache-iceberg-in-production-compaction-catalogs-and-the-pitfalls-nobody-warns-you-about-3ml3</link>
      <guid>https://dev.to/gabrielhca/apache-iceberg-in-production-compaction-catalogs-and-the-pitfalls-nobody-warns-you-about-3ml3</guid>
      <description>&lt;p&gt;Apache Iceberg looked like the answer to everything when we first adopted it. Open format, ACID transactions, time travel, schema evolution. We migrated our Hive tables, ran a few queries, and felt good about life.&lt;/p&gt;

&lt;p&gt;Three months later, our S3 costs doubled. Queries that used to take 10 seconds were taking 4 minutes. Metadata operations were timing out. Nobody on the team could explain why.&lt;/p&gt;

&lt;p&gt;That was the beginning of a real education in how Iceberg actually behaves in production. This post covers what I wish someone had told us before we went all-in.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Small Files Problem Is Not Optional
&lt;/h2&gt;

&lt;p&gt;Iceberg is append-friendly by design. Every micro-batch write, every streaming insert, every incremental load creates new Parquet files. Each file also gets its own metadata entry.&lt;/p&gt;

&lt;p&gt;After a week of hourly loads, you might have 10,000 files in a single partition where you wanted 20.&lt;/p&gt;

&lt;p&gt;The result: Iceberg's metadata layer has to plan queries across thousands of file manifests. Planning takes longer than execution. Your 10-second query becomes a 4-minute query, and your users start filing tickets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix: automate compaction from day one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Spark, compaction is called &lt;code&gt;rewrite_data_files&lt;/code&gt;. The basic call looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run this on a schedule, not on-demand&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;iceberg_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rewrite_data_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'analytics.events'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'binpack'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'target-file-size-bytes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'134217728'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- 128MB target per file&lt;/span&gt;
    &lt;span class="s1"&gt;'min-input-files'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'5'&lt;/span&gt;                  &lt;span class="c1"&gt;-- only compact if 5+ small files exist&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Target file size of 128MB to 512MB is the practical sweet spot. Smaller than that, you still have too many files. Larger, and your query engines cannot parallelize reads efficiently.&lt;/p&gt;

&lt;p&gt;If you are not using Spark, PyIceberg exposes compaction through the table maintenance API (as of 0.7.x). For Flink or Trino-only shops, schedule compaction as a separate Spark job. Yes, it is annoying, but it is the right call.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hidden Partitioning Is the Feature You Are Probably Ignoring
&lt;/h2&gt;

&lt;p&gt;Old Hive partitioning was explicit. You wrote &lt;code&gt;PARTITIONED BY (event_date STRING)&lt;/code&gt; and added that column to every query or Hive would scan the entire table.&lt;/p&gt;

&lt;p&gt;Iceberg's hidden partitioning decouples the physical layout from what the query writer sees. You define a partition spec on the table, and the engine automatically applies it during writes and prunes during reads without the query needing to reference the partition column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyiceberg.catalog&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_catalog&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyiceberg.transforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DayTransform&lt;/span&gt;

&lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://your-rest-catalog:8181&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warehouse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://your-bucket/warehouse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Load an existing table and evolve its partition spec
&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics.events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add a day-level partition on event_timestamp
# Iceberg handles the bucketing. No ts_date column needed in your schema.
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_spec&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;source_column_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DayTransform&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;partition_field_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every query that filters on &lt;code&gt;event_timestamp&lt;/code&gt; automatically benefits from partition pruning. The column stays a timestamp in the schema. No &lt;code&gt;WHERE event_date = '2026-06-24'&lt;/code&gt; hack required.&lt;/p&gt;

&lt;p&gt;The bigger win: you can change the partition strategy without rewriting the table. Iceberg supports multiple partition specs across snapshots. Old data stays on the old layout. New data uses the new one. The engine handles both transparently.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Catalog Decision Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;Every Iceberg table lives in a catalog. The catalog tracks which metadata file is current. Get this wrong and you either lock yourself into one vendor or end up with metadata conflicts that corrupt tables.&lt;/p&gt;

&lt;p&gt;The main options in 2026:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Glue Catalog&lt;/strong&gt; works well if your entire stack is AWS. Zero operational overhead. But cross-cloud access is painful, and engine compatibility outside of Spark and Athena requires extra configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nessie / REST Catalog&lt;/strong&gt; is the open standard. Any engine that supports the Iceberg REST spec can read and write. Nessie adds git-like branching for data, which is genuinely useful for staging ETL results before promoting to prod. Slightly more infra to manage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unity Catalog&lt;/strong&gt; is the right choice if you are on Databricks. Tight governance integration, fine-grained access control at the column level. But it is proprietary, and getting data out to non-Databricks engines requires extra work.&lt;/p&gt;

&lt;p&gt;My take: if you are building multi-engine (Spark + Trino + Flink), go REST-compatible from the start. Migrating catalogs later is painful. AWS Glue to REST is doable; Unity to anything else is not fun.&lt;/p&gt;

&lt;p&gt;Here is a rough decision guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Single cloud (AWS only)    → Glue Catalog
Databricks-primary stack   → Unity Catalog
Multi-engine / multi-cloud → REST Catalog (Nessie or Polaris)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Snapshot Management: The Silent Storage Leak
&lt;/h2&gt;

&lt;p&gt;Every write creates a snapshot. Snapshots reference manifest lists. Manifest lists reference manifest files. Manifest files reference data files.&lt;/p&gt;

&lt;p&gt;Without snapshot expiration, you are paying for every historical snapshot indefinitely. The metadata alone can grow into gigabytes. S3 LIST operations against large metadata trees get expensive fast.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Expire snapshots older than 7 days, keep at least 5 for safety&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;iceberg_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expire_snapshots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'analytics.events'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;older_than&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2026-06-17 00:00:00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;retain_last&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After expiring snapshots, orphan files may still exist (files written but never committed to a snapshot):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Remove orphan files older than 3 days&lt;/span&gt;
&lt;span class="c1"&gt;-- The 3-day buffer ensures in-progress writes are not deleted&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;iceberg_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remove_orphan_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'analytics.events'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;older_than&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="s1"&gt;'2026-06-21 00:00:00'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run these on a schedule. Weekly is fine for most tables. Daily for high-volume streaming tables.&lt;/p&gt;




&lt;h2&gt;
  
  
  Time Travel Done Right
&lt;/h2&gt;

&lt;p&gt;One of Iceberg's actual killer features. You can query any historical snapshot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Query the table as it was yesterday at midnight&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_TIME&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="s1"&gt;'2026-06-23 00:00:00'&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'purchase'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Or by snapshot ID (useful when you need a specific pipeline run)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;VERSION&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="mi"&gt;8027658604211071520&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The catch: time travel only works while the snapshot exists. Once you expire it, it is gone. Plan your retention window around your incident response SLA. If your team takes 72 hours to notice a bad pipeline run, keep at least 7 days of snapshots.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Not running compaction at all.&lt;/strong&gt; The default state of most Iceberg tables I have seen is "never been compacted." Set up compaction as part of table creation, not as a fix-it-later task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compacting too aggressively.&lt;/strong&gt; Running &lt;code&gt;rewrite_data_files&lt;/code&gt; too frequently on large tables wastes compute and can block concurrent reads. Once per day for most tables, twice per day for high-volume ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using the wrong partition granularity.&lt;/strong&gt; Partitioning by HOUR makes sense for 10 billion events per day. For 10 million, it creates too many small partitions and kills planning time. Match partition granularity to your data volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Picking Glue catalog for a multi-engine stack.&lt;/strong&gt; You will not feel the pain on day one. You will feel it six months in when you try to add Trino and spend two weeks on catalog configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not setting &lt;code&gt;write.target-file-size-bytes&lt;/code&gt;.&lt;/strong&gt; The default varies by engine. Set it explicitly in your table properties so file sizes stay consistent regardless of which engine is writing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s1"&gt;'write.target-file-size-bytes'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'134217728'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'write.delete.target-file-size-bytes'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'67108864'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What Iceberg Actually Is
&lt;/h2&gt;

&lt;p&gt;Iceberg is a table format specification, not a storage engine. It tells engines how to find data, what schema it has, and which files are current. The engines (Spark, Trino, Flink, Athena) do the actual reading and writing.&lt;/p&gt;

&lt;p&gt;This means Iceberg is only as good as the operational practices around it. The format solves real problems: ACID on object storage, schema evolution without rewriting, partition pruning without partition columns in queries. But you still have to run compaction. You still have to expire snapshots. You still have to pick the right catalog.&lt;/p&gt;

&lt;p&gt;The teams I have seen succeed with Iceberg treated these maintenance tasks as first-class engineering concerns, not afterthoughts. The ones who struggled treated Iceberg like a managed service and were surprised when it needed managing.&lt;/p&gt;

&lt;p&gt;Start with compaction and snapshot expiration automated before you write your first production table. Everything else you can figure out as you go.&lt;/p&gt;




&lt;p&gt;Best regards,&lt;br&gt;
Gabriel Henrique Cardoso Antonio 🔗 gabrielh.dev&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Data Contracts in Production: Stop Trusting Your Upstream Sources</title>
      <dc:creator>Gabriel Henrique</dc:creator>
      <pubDate>Sat, 20 Jun 2026 19:42:39 +0000</pubDate>
      <link>https://dev.to/gabrielhca/data-contracts-in-production-stop-trusting-your-upstream-sources-3gjl</link>
      <guid>https://dev.to/gabrielhca/data-contracts-in-production-stop-trusting-your-upstream-sources-3gjl</guid>
      <description>&lt;p&gt;Your upstream data source changed a column type last night. Your pipeline ran at 2am, ingested everything without a single error, and by the time your stakeholders opened their dashboards at 9am, the revenue numbers were wrong.&lt;/p&gt;

&lt;p&gt;No alert fired. No test failed. The pipeline was technically healthy.&lt;/p&gt;

&lt;p&gt;This is the most common and expensive failure mode in data engineering, and it happens because we build systems that trust the data they receive. Data contracts are the fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Data Contract Actually Is
&lt;/h2&gt;

&lt;p&gt;A data contract is a formal agreement between a data producer and a data consumer that defines what the data looks like, what quality guarantees it carries, and who owns it.&lt;/p&gt;

&lt;p&gt;Not documentation. Not a README. An &lt;strong&gt;executable specification&lt;/strong&gt; that can be validated automatically, versioned like code, and broken like an API contract when violated.&lt;/p&gt;

&lt;p&gt;Think of it like an API contract, but for your data. A REST API fails loudly with a 400 when you send the wrong payload. A data pipeline fails silently with bad numbers. Contracts change that.&lt;/p&gt;

&lt;p&gt;A contract typically covers: schema definition (fields, types, nullability), quality rules (completeness, uniqueness, valid value ranges), SLA metadata (freshness, update frequency), and ownership (who produces this, who consumes it).&lt;/p&gt;

&lt;h2&gt;
  
  
  Anatomy of a Real Data Contract
&lt;/h2&gt;

&lt;p&gt;Here is what a minimal contract looks like using the open &lt;code&gt;datacontract.yaml&lt;/code&gt; format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;dataContractSpecification&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.9.3&lt;/span&gt;
&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders-v2&lt;/span&gt;
&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Orders Contract&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2.0.0&lt;/span&gt;
  &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data-platform-team&lt;/span&gt;
  &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;active&lt;/span&gt;

&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;orders&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;One row per order placed on the platform&lt;/span&gt;
    &lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;order_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;unique&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;total_amount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;decimal&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;minimum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
      &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
        &lt;span class="na"&gt;enum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pending&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;confirmed&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;shipped&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;delivered&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;cancelled&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;timestamp&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;quality&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SodaCL&lt;/span&gt;
  &lt;span class="na"&gt;specification&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;checks for orders:&lt;/span&gt;
      &lt;span class="s"&gt;- row_count &amp;gt; 0&lt;/span&gt;
      &lt;span class="s"&gt;- missing_count(order_id) = 0&lt;/span&gt;
      &lt;span class="s"&gt;- duplicate_count(order_id) = 0&lt;/span&gt;
      &lt;span class="s"&gt;- invalid_count(status) = 0:&lt;/span&gt;
          &lt;span class="s"&gt;valid values: [pending, confirmed, shipped, delivered, cancelled]&lt;/span&gt;
      &lt;span class="s"&gt;- freshness(created_at) &amp;lt; 6h&lt;/span&gt;

&lt;span class="na"&gt;servicelevels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;freshness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Data must not be older than 6 hours&lt;/span&gt;
    &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;6h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This file is checked into Git alongside the dbt models that produce the &lt;code&gt;orders&lt;/code&gt; table. When the schema changes, the contract changes. When the contract breaks, the pipeline stops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Places to Enforce Contracts
&lt;/h2&gt;

&lt;p&gt;Most teams put the enforcement in one place and leave gaps everywhere else. You need all three layers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Producer / Source System]
        |
        v
[Ingestion Layer]  &amp;lt;-- enforce schema + type contracts here
        |
        v
[Transformation Layer (dbt)]  &amp;lt;-- enforce quality contracts here
        |
        v
[Serving Layer / Warehouse]  &amp;lt;-- enforce SLA and freshness here
        |
        v
[Consumer / Dashboard / LLM / API]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;At ingestion&lt;/strong&gt; you catch schema drift early, before bad data poisons your warehouse. Use Pydantic models to validate incoming records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At transformation&lt;/strong&gt; you use dbt tests or Soda checks to enforce business-level quality rules. A row count of zero is not a schema violation, but it is a contract violation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At serving&lt;/strong&gt; you monitor freshness and completeness so consumers know the data they are reading meets SLA guarantees.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Ingestion Contract with Pydantic
&lt;/h2&gt;

&lt;p&gt;This runs at the top of every ingestion job, before writing a single row to the warehouse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;pending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;confirmed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confirmed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;shipped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;delivered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delivered&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;cancelled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cancelled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# must be non-negative
&lt;/span&gt;    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OrderStatus&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
    &lt;span class="n"&gt;promo_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# optional, but we track null rate
&lt;/span&gt;
    &lt;span class="nd"&gt;@validator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;order_id_not_empty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id cannot be blank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_and_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="c1"&gt;# Returns (valid_records, failed_records).
&lt;/span&gt;    &lt;span class="c1"&gt;# Never silently drops failures. Log and route to a dead-letter topic.
&lt;/span&gt;    &lt;span class="n"&gt;valid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;failed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Contract violation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Record: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;

    &lt;span class="c1"&gt;# Fail the pipeline if more than 1% of records are invalid.
&lt;/span&gt;    &lt;span class="n"&gt;failure_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;failure_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Contract breach: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;failure_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; of records failed validation &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; / &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;valid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;failed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two decisions here worth explaining.&lt;/p&gt;

&lt;p&gt;First, the 1% threshold. You do not want to fail the pipeline on a single bad record, but you also do not want to silently ingest garbage. Set a threshold that reflects your tolerance and make it explicit in the code.&lt;/p&gt;

&lt;p&gt;Second, the dead-letter queue. Every failed record should go somewhere observable. If you drop it, it is gone forever. If you log it, you can replay it after fixing the issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Treating contracts as documentation.&lt;/strong&gt; A YAML file that nobody checks is just noise. The contract has to run automatically, fail fast, and block bad data from propagating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Putting all validation at one layer.&lt;/strong&gt; Schema is not the same as quality. You can have perfectly typed data that is 90% null. Both need contracts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Versioning contracts separately from the code.&lt;/strong&gt; When a producer changes a column, the contract and the dbt model and the ingestion code all need to change together. Keep them in the same repo, reviewed in the same PR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using blocking contracts everywhere from day one.&lt;/strong&gt; You will break things. Start with logging-only mode, measure your actual failure rates, then flip to hard-blocking after you understand the baseline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring freshness SLAs.&lt;/strong&gt; A technically correct dataset from 14 hours ago is a broken contract for a real-time dashboard. Freshness is a first-class quality dimension.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Contracts Are Not Worth the Investment
&lt;/h2&gt;

&lt;p&gt;Not every dataset needs a formal contract. Internal scratch tables, exploratory datasets, and one-off analyses do not need this overhead.&lt;/p&gt;

&lt;p&gt;Contracts pay off when the data crosses a team or system boundary. If another team, application, or AI system depends on your data, you need a contract. If it breaks for them, you will spend more time debugging than you saved by skipping the contract in the first place.&lt;/p&gt;

&lt;p&gt;The ROI is clearest in two scenarios: high-value production pipelines (revenue, product metrics, ML features) and AI/LLM systems consuming structured data. An LLM receiving malformed features will not throw an exception. It will just produce worse outputs. Contracts at the feature serving layer are non-negotiable for production AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shift Happening Right Now
&lt;/h2&gt;

&lt;p&gt;The industry is moving toward contracts-first development. Write the contract before you write the pipeline. Define what the output should look like, what quality guarantees it carries, and who owns it. Then build to meet that spec.&lt;/p&gt;

&lt;p&gt;It is the same discipline that made API development more reliable. The data ecosystem is just a few years behind on this.&lt;/p&gt;

&lt;p&gt;In 2026, with AI systems consuming data directly, a schema break is no longer just a broken dashboard. It is a broken model, a wrong recommendation, a compounding error in an automated pipeline that nobody noticed. The cost of trust without verification has gone up significantly.&lt;/p&gt;

&lt;p&gt;If your pipelines have never failed because of an upstream schema change, consider yourself lucky. Put contracts in place before that luck runs out.&lt;/p&gt;




&lt;p&gt;Abs,&lt;br&gt;
Gabriel Henrique Cardoso Antonio 🔗 gabrielh.dev&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>data</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Agentic Data Engineering in 2026: How to Build Pipelines That AI Agents Can Actually Use</title>
      <dc:creator>Gabriel Henrique</dc:creator>
      <pubDate>Wed, 17 Jun 2026 00:50:11 +0000</pubDate>
      <link>https://dev.to/gabrielhca/agentic-data-engineering-in-2026-how-to-build-pipelines-that-ai-agents-can-actually-use-4kgg</link>
      <guid>https://dev.to/gabrielhca/agentic-data-engineering-in-2026-how-to-build-pipelines-that-ai-agents-can-actually-use-4kgg</guid>
      <description>&lt;p&gt;If you've spent the last few years building data pipelines, you know the drill: ingest, transform, load. Maybe some orchestration on top. Solid work — the kind that keeps dashboards green and analysts happy.&lt;/p&gt;

&lt;p&gt;But something changed in 2026. Your pipeline's new consumer isn't a BI tool or a SQL query. It's an &lt;strong&gt;AI agent&lt;/strong&gt; — and agents are a very different kind of hungry.&lt;/p&gt;

&lt;p&gt;Welcome to agentic data engineering. Buckle up.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's an "Agentic" Data System, Exactly?
&lt;/h2&gt;

&lt;p&gt;Let's back up a second. An &lt;strong&gt;AI agent&lt;/strong&gt; is a system that perceives its environment, reasons about it, and takes actions to reach a goal — without needing a human to hold its hand at every step.&lt;/p&gt;

&lt;p&gt;Think of it like the difference between a GPS that tells you turn-by-turn directions (traditional AI) and one that books your hotel, reschedules your meeting, and orders food for when you arrive (agentic AI). One follows instructions. The other &lt;em&gt;acts&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For agents to act, they need data. But not just any data — &lt;strong&gt;context-rich, semantically meaningful, machine-readable data&lt;/strong&gt;. And that's where data engineers come in.&lt;/p&gt;

&lt;p&gt;The cold truth: most existing data pipelines aren't built for this. They were designed for humans (or human-readable BI tools) as the end consumer. Agents need something different.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Engineering Problem
&lt;/h2&gt;

&lt;p&gt;Here's a concrete example. Say you have a &lt;code&gt;sales&lt;/code&gt; table with a column called &lt;code&gt;status&lt;/code&gt;. Values: &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, &lt;code&gt;C&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A human analyst knows that &lt;code&gt;A = active&lt;/code&gt;, &lt;code&gt;B = blocked&lt;/code&gt;, &lt;code&gt;C = churned&lt;/code&gt; because they read the Confluence doc from 2022 (the one that's three Notion migrations out of date). An AI agent? It has no idea. It'll guess — and guessing at 2am during an automated pipeline run is a great way to corrupt a report.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;context engineering problem&lt;/strong&gt;: your data is technically correct but semantically opaque.&lt;/p&gt;

&lt;p&gt;Context engineering is the practice of designing data systems that embed rich, machine-readable context &lt;em&gt;alongside&lt;/em&gt; the data itself. Gartner has already flagged this: over 40% of agentic AI projects are predicted to fail by 2027 — not because the models are bad, but because the &lt;strong&gt;data foundations are missing&lt;/strong&gt;. Bare schemas, unclear ownership, no lineage, inconsistent definitions.&lt;/p&gt;

&lt;p&gt;Sound familiar?&lt;/p&gt;

&lt;h2&gt;
  
  
  What Agents Actually Need From Your Pipeline
&lt;/h2&gt;

&lt;p&gt;Let's get practical. Here's what makes a data system "agent-ready":&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Rich Metadata and Semantic Descriptions
&lt;/h3&gt;

&lt;p&gt;Every table, column, and field should have a description an agent can read and reason about — not just a name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Bad: An agent sees "status" and guesses&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Good: Metadata makes intent explicit&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; 
  &lt;span class="s1"&gt;'Customer lifecycle status. Values: A=active (paying), B=blocked (payment issue), C=churned (cancelled)'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Modern data catalogs (like DataHub, Amundsen, or OpenMetadata) can store this metadata in a way agents can query via API. If you're not using one, now is a very good time to start.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Data Lineage That's Actually Up-to-Date
&lt;/h3&gt;

&lt;p&gt;An agent running a pipeline needs to understand: where did this data come from? What transformations touched it? If something breaks, what else is affected?&lt;/p&gt;

&lt;p&gt;Tools like &lt;strong&gt;dbt&lt;/strong&gt; generate lineage graphs automatically from your SQL models. Here's a minimal dbt model with proper documentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# models/schema.yml&lt;/span&gt;
&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer_lifetime_value&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="s"&gt;Calculates CLV per customer using the last 90 days of transactions.&lt;/span&gt;
      &lt;span class="s"&gt;Refreshed daily at 3am UTC. Source: raw.transactions joined with dim.customers.&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer_id&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Unique identifier. FK to dim.customers.customer_id&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clv_usd&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Estimated lifetime value in USD. Null if customer has &amp;lt; 3 transactions.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;description&lt;/code&gt; block? An agent can read it, understand what the model does, and decide whether it's the right source for a given task. Without it, the agent is flying blind.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Embeddings and Vector-Ready Outputs
&lt;/h3&gt;

&lt;p&gt;This one trips people up. Traditional pipelines output structured tables. Agentic pipelines often need to &lt;em&gt;also&lt;/em&gt; output embeddings — vector representations of your data that LLMs can use for semantic search and RAG (Retrieval-Augmented Generation).&lt;/p&gt;

&lt;p&gt;Here's a simple example using Python and OpenAI's embedding API (or any open-source alternative like &lt;code&gt;sentence-transformers&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Your product catalog as a dataframe
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;products.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate embeddings from a meaningful text representation
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_repr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. Category: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text_repr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Write to a vector store (e.g., pgvector, Pinecone, Weaviate)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;products_embeddings.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key idea: you're not replacing your existing pipeline — you're &lt;strong&gt;extending&lt;/strong&gt; it. The structured table feeds your dashboards. The embeddings feed your agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Schema Drift Detection
&lt;/h3&gt;

&lt;p&gt;Here's a nightmare scenario: an upstream team renames a column. Your pipeline doesn't catch it. The agent downstream starts ingesting garbage. Nobody notices until a report goes out with completely wrong numbers.&lt;/p&gt;

&lt;p&gt;Schema drift detection is one of the highest-impact agentic data engineering tasks identified in the SIGMOD 2026 Data Agents tutorial. Integrate it into your orchestration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Using Great Expectations for schema validation
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;great_expectations&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;gx&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Define expectation: column "user_id" must exist and be non-null
&lt;/span&gt;&lt;span class="n"&gt;suite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_expectation_suite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales_suite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_expectation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnToExist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;suite&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_expectation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;gx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expectations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExpectColumnValuesToNotBeNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run validation before anything touches the data
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales_checkpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Schema validation failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fail fast, fail loud. An agent that ingests bad data quietly is worse than a pipeline that crashes.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Mental Model: The Conveyor Belt vs. The Smart Warehouse
&lt;/h2&gt;

&lt;p&gt;Here's an analogy that might help it click.&lt;/p&gt;

&lt;p&gt;Traditional data pipelines are like a &lt;strong&gt;conveyor belt in a factory&lt;/strong&gt;: raw materials go in one end, finished goods come out the other. Fast, reliable, predictable. But the conveyor belt doesn't know what it's carrying. It doesn't label boxes. It doesn't track where things came from. It just moves.&lt;/p&gt;

&lt;p&gt;An agent-ready data system is more like a &lt;strong&gt;smart warehouse&lt;/strong&gt;: every item has a barcode, a location, a history, and a description. Robots can navigate it because everything is labeled and organized. You can ask "where are all the items from Supplier X that arrived in Q1?" and get an instant answer.&lt;/p&gt;

&lt;p&gt;Your job in 2026? &lt;strong&gt;Build the smart warehouse, not just the conveyor belt.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Do This Week
&lt;/h2&gt;

&lt;p&gt;You don't need to rip out your stack and start over. Here's a practical starting point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audit your most critical tables&lt;/strong&gt;: Do they have column descriptions? Add them in your catalog or directly in dbt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable lineage tracking&lt;/strong&gt;: If you're on dbt, it's already there. Expose it via the dbt API or push it to DataHub.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick one pipeline to make vector-ready&lt;/strong&gt;: Add an embedding generation step as a separate job. Don't break what works — extend it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add a schema validation checkpoint&lt;/strong&gt;: Use Great Expectations, Soda, or dbt tests. Run it before anything hits production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this takes a week. The column descriptions alone can take an afternoon. But six months from now, when your team is deploying AI agents that actually work because your data is clean and semantically rich? You'll be very glad you started today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The rise of agentic AI doesn't make data engineers obsolete — it makes the craft harder and more important. Anyone can wire up an LLM to a database. Making that LLM reliably useful for autonomous agents? That requires real data engineering skill.&lt;/p&gt;

&lt;p&gt;Context engineering, lineage, schema validation, vector outputs — these aren't buzzwords. They're the new checklist. The engineers who build these foundations now are the ones who'll be building the most interesting systems in 2027.&lt;/p&gt;

&lt;p&gt;Go make your pipelines agent-ready. Your future AI coworkers are counting on you.&lt;/p&gt;




&lt;p&gt;Abs,&lt;/p&gt;

&lt;p&gt;Gabriel Henrique Cardoso Antonio&lt;br&gt;
🔗 &lt;a href="https://gabrielh.dev/" rel="noopener noreferrer"&gt;gabrielh.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>ai</category>
      <category>python</category>
      <category>data</category>
    </item>
    <item>
      <title>Your ETL Pipeline Wasn't Built for AI — Here's How to Fix It in 2026</title>
      <dc:creator>Gabriel Henrique</dc:creator>
      <pubDate>Tue, 09 Jun 2026 00:33:51 +0000</pubDate>
      <link>https://dev.to/gabrielhca/your-etl-pipeline-wasnt-built-for-ai-heres-how-to-fix-it-in-2026-51g</link>
      <guid>https://dev.to/gabrielhca/your-etl-pipeline-wasnt-built-for-ai-heres-how-to-fix-it-in-2026-51g</guid>
      <description>&lt;h1&gt;
  
  
  Your ETL Pipeline Wasn't Built for AI — Here's How to Fix It in 2026
&lt;/h1&gt;

&lt;p&gt;You've got a beautiful data pipeline. It extracts from your sources, transforms everything cleanly, loads into the warehouse on schedule. Tests pass. Stakeholders are happy. Life is good.&lt;/p&gt;

&lt;p&gt;Then someone says: "Can we plug this into our LLM?"&lt;/p&gt;

&lt;p&gt;And suddenly your beautiful pipeline is useless.&lt;/p&gt;

&lt;p&gt;Not because it's broken — it works perfectly for what it was designed to do. The problem is that &lt;strong&gt;traditional ETL was designed for SQL queries, dashboards, and human analysts&lt;/strong&gt;. LLMs need something fundamentally different: context, meaning, and vectors. And if your pipeline doesn't produce those, your AI is flying blind.&lt;/p&gt;

&lt;p&gt;This is the silent crisis in data engineering right now. Companies are spending millions on LLM infrastructure while their underlying data pipelines are still shipping rows and columns to a warehouse that an AI can barely reason about.&lt;/p&gt;

&lt;p&gt;Let's fix that.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Does an LLM Actually Need From Your Data?
&lt;/h2&gt;

&lt;p&gt;When a human analyst queries your warehouse, they write SQL. They're smart enough to know that &lt;code&gt;status = 'churned'&lt;/code&gt; means a customer who cancelled their subscription. They bring their own context.&lt;/p&gt;

&lt;p&gt;An LLM doesn't have that luxury — at least not without help. When you ask a model "why are enterprise customers churning?", it can't just run &lt;code&gt;SELECT * FROM churn_events&lt;/code&gt;. It needs &lt;strong&gt;semantically relevant context&lt;/strong&gt; — passages, records, or summaries that are &lt;em&gt;meaning-close&lt;/em&gt; to the question being asked.&lt;/p&gt;

&lt;p&gt;That's where &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;Think of RAG like this: instead of the LLM trying to remember everything (it can't — its context window is finite), you build a library. Every time the LLM needs to answer a question, it walks into that library, finds the most relevant pages, and reads them before answering.&lt;/p&gt;

&lt;p&gt;Your job as a data engineer is to &lt;strong&gt;build and maintain that library&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And a library isn't a database. A library is organized by &lt;em&gt;meaning&lt;/em&gt;, not by rows and columns.&lt;/p&gt;




&lt;h2&gt;
  
  
  The AI-Native Pipeline: What's Different
&lt;/h2&gt;

&lt;p&gt;Here's the shift in mindset. Traditional ETL produces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;raw data → clean tables → warehouse → SQL queries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An AI-native pipeline produces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;raw data → cleaned chunks → embeddings → vector store → semantic retrieval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new ingredients are &lt;strong&gt;chunks&lt;/strong&gt;, &lt;strong&gt;embeddings&lt;/strong&gt;, and a &lt;strong&gt;vector store&lt;/strong&gt;. Let's break each down.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chunks
&lt;/h3&gt;

&lt;p&gt;You can't feed an entire database table into an LLM. Even if you could, it would be wasteful and noisy. Instead, you break your data into &lt;strong&gt;chunks&lt;/strong&gt; — small, meaningful pieces of text that can be retrieved independently.&lt;/p&gt;

&lt;p&gt;A chunk might be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A paragraph from a customer support ticket&lt;/li&gt;
&lt;li&gt;A 3-sentence description of a product&lt;/li&gt;
&lt;li&gt;A summarized row of metadata about a sales event&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The art here is in the chunking strategy. Too small, and a chunk loses its context. Too large, and you're wasting tokens and retrieval precision. In practice, &lt;strong&gt;512–1024 tokens with ~10% overlap&lt;/strong&gt; between chunks is a solid starting point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embeddings
&lt;/h3&gt;

&lt;p&gt;An embedding is a list of numbers — a vector — that represents the &lt;em&gt;meaning&lt;/em&gt; of a piece of text. Two texts with similar meanings will have vectors that are close together in space, even if they use completely different words.&lt;/p&gt;

&lt;p&gt;"Customer stopped paying" and "subscription was cancelled due to billing failure" have very different words. But in vector space, they're neighbors.&lt;/p&gt;

&lt;p&gt;That's the magic. And it's what makes semantic search possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vector Store
&lt;/h3&gt;

&lt;p&gt;A vector store is a database optimized for one special kind of query: "give me the N vectors most similar to this query vector." Systems like &lt;strong&gt;pgvector&lt;/strong&gt;, &lt;strong&gt;Qdrant&lt;/strong&gt;, &lt;strong&gt;Chroma&lt;/strong&gt;, and &lt;strong&gt;Weaviate&lt;/strong&gt; are built exactly for this.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building the Pipeline: A Practical Walkthrough
&lt;/h2&gt;

&lt;p&gt;Let's get concrete. Here's a complete AI-native ingestion pipeline in Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Load and Chunk Your Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="c1"&gt;# Imagine this comes from your warehouse, S3, or an API
&lt;/span&gt;&lt;span class="n"&gt;raw_documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer says dashboard is not loading. Error 502. Happened after the deploy on June 3rd. They&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re on the Enterprise plan.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket_002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User cannot export reports to CSV. The button is greyed out. They say it worked last week. Basic plan.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# ... thousands more
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;length_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw_documents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;splits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;splits&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_chunk_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Created &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;RecursiveCharacterTextSplitter&lt;/code&gt; is smart — it tries to split on paragraph breaks, then sentences, then words. It keeps semantic boundaries intact wherever it can.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Generate Embeddings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# uses OPENAI_API_KEY env var
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-embedding-3-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Embed a batch of texts. Returns list of vectors.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dimensions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;  &lt;span class="c1"&gt;# trade-off: smaller = cheaper, slightly less precise
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Process in batches to respect rate limits
&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;embedded_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;embedded_chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Embedded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedded_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to notice here: we're batching (OpenAI has rate limits and batching is cheaper), and we're using &lt;code&gt;dimensions=1024&lt;/code&gt; instead of the default 3072. For most use cases, 1024 dimensions give you 95% of the precision at a third of the cost. Worth it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Store in a Vector Database
&lt;/h3&gt;

&lt;p&gt;Here's the same code using &lt;strong&gt;pgvector&lt;/strong&gt; (PostgreSQL with vector support) — a great choice if you're already running Postgres and don't want another managed service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://user:password@localhost:5432/mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# One-time setup: enable the extension and create the table
&lt;/span&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CREATE EXTENSION IF NOT EXISTS vector;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE IF NOT EXISTS doc_chunks (
        id TEXT PRIMARY KEY,
        source_id TEXT,
        content TEXT,
        embedding vector(1024),
        created_at TIMESTAMPTZ DEFAULT NOW()
    );
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE INDEX IF NOT EXISTS doc_chunks_embedding_idx 
    ON doc_chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Insert the embedded chunks
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;embedded_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        INSERT INTO doc_chunks (id, source_id, content, embedding)
        VALUES (%s, %s, %s, %s)
        ON CONFLICT (id) DO UPDATE SET
            content = EXCLUDED.content,
            embedding = EXCLUDED.embedding;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All chunks stored in pgvector!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ivfflat&lt;/code&gt; index is what makes queries fast at scale. Without it, every query does a full table scan. With it, Postgres clusters vectors into "lists" and searches only the most promising ones — approximate nearest neighbor search, blazing fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Retrieval at Query Time
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Given a natural language query, find the most relevant stored chunks.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Embed the query using the same model
&lt;/span&gt;    &lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed_batch&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://user:password@localhost:5432/mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT id, source_id, content,
               1 - (embedding &amp;lt;=&amp;gt; %s::vector) AS similarity
        FROM doc_chunks
        ORDER BY embedding &amp;lt;=&amp;gt; %s::vector
        LIMIT %s;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;

&lt;span class="c1"&gt;# Try it out
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;why are enterprise users reporting errors after deploys?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;similarity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt; operator is pgvector's cosine distance. &lt;code&gt;1 - cosine_distance = cosine_similarity&lt;/code&gt;. The results will be the chunks most semantically close to your query — even if they don't share a single keyword with it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Tips for 2026
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Don't skip metadata.&lt;/strong&gt; Store the source ID, timestamp, author, and any other context alongside your vectors. Metadata filtering (e.g., "only search tickets from Enterprise customers in the last 30 days") is often more important than the semantic search itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Re-embed when the model changes.&lt;/strong&gt; If you upgrade from &lt;code&gt;text-embedding-3-small&lt;/code&gt; to &lt;code&gt;text-embedding-3-large&lt;/code&gt;, you need to re-embed everything. Different models produce incompatible vector spaces. Build this into your pipeline versioning from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Evaluate retrieval quality separately from generation quality.&lt;/strong&gt; The #1 mistake is blaming the LLM when the real problem is your retrieval. If the right chunks aren't being retrieved, the best model in the world will give you garbage. Use tools like RAGAS to measure retrieval precision/recall independently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. pgvector is enough for most teams.&lt;/strong&gt; Unless you're storing hundreds of millions of vectors, you don't need a dedicated vector database. pgvector in your existing Postgres is simpler to operate, cheaper, and lets you join vectors with your regular tables. Optimize later if you need to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Chunking is your most impactful lever.&lt;/strong&gt; Changing the LLM might give you 5% better answers. Fixing your chunking strategy might give you 40%. It's unglamorous, but it's where the results are.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The shift to AI-native data engineering isn't about throwing away what you've built. It's about extending it.&lt;/p&gt;

&lt;p&gt;Your bronze/silver/gold lakehouse layers? Still valid — but add a "semantic layer" where data is chunked, embedded, and indexed for retrieval. Your Airflow DAGs? Still valid — add a daily job that re-embeds new documents and updates the vector store. Your data quality checks? Still valid — add checks for embedding freshness and retrieval coverage.&lt;/p&gt;

&lt;p&gt;Think of it as adding a new output format to your pipelines. You've always produced clean tables. Now you also produce vector indexes. Same discipline, new artifact.&lt;/p&gt;

&lt;p&gt;The engineers who learn to build both will be the ones building the AI systems that actually work — the ones where the model has the context it needs to be genuinely useful, not just impressively fluent.&lt;/p&gt;

&lt;p&gt;Your pipeline deserves to be as smart as the AI it's feeding.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Abs,&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Gabriel Henrique Cardoso Antonio&lt;/em&gt;&lt;br&gt;
&lt;em&gt;🔗 &lt;a href="https://gabrielh.dev/" rel="noopener noreferrer"&gt;gabrielh.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Context Engineering: The Skill Replacing Prompt Engineering in 2026</title>
      <dc:creator>Gabriel Henrique</dc:creator>
      <pubDate>Thu, 04 Jun 2026 12:52:06 +0000</pubDate>
      <link>https://dev.to/gabrielhca/context-engineering-the-skill-replacing-prompt-engineering-in-2026-3lgd</link>
      <guid>https://dev.to/gabrielhca/context-engineering-the-skill-replacing-prompt-engineering-in-2026-3lgd</guid>
      <description>&lt;p&gt;If you've been calling yourself a "prompt engineer" for the past two years, it's time to update your vocabulary — and your mental model.&lt;/p&gt;

&lt;p&gt;In 2026, the real leverage when building LLM-powered systems isn't in crafting the perfect sentence. It's in &lt;strong&gt;context engineering&lt;/strong&gt;: designing everything an LLM sees before it ever generates a response. Andrej Karpathy coined the term in mid-2025, and it's since taken over serious AI engineering discussions.&lt;/p&gt;

&lt;p&gt;This article breaks down what context engineering actually is, why it matters more than prompt writing, and gives you concrete techniques you can apply today.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Context Engineering?
&lt;/h2&gt;

&lt;p&gt;Context engineering is the discipline of &lt;strong&gt;systematically designing the information environment&lt;/strong&gt; that surrounds a prompt. Where prompt engineering asks "what should I tell the model to do?", context engineering asks "what does the model need to &lt;em&gt;know&lt;/em&gt; to do it well?"&lt;/p&gt;

&lt;p&gt;Think of it this way: a doctor doesn't just answer the question you ask on the spot. They look at your chart, your history, your vitals, and then respond. Context engineering is building that chart for your LLM.&lt;/p&gt;

&lt;p&gt;The context window is the LLM's working memory — everything it can "see" at once. In 2026, these windows are massive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Opus 4.x&lt;/strong&gt;: 200K tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o&lt;/strong&gt;: 128K tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini 2.5 Flash&lt;/strong&gt;: Up to 1M tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But bigger isn't automatically better. More tokens = more cost, more latency, and a real risk of what researchers call the &lt;strong&gt;"lost-in-the-middle" problem&lt;/strong&gt; — where models process information at the beginning and end of the context more reliably than content buried in the middle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters for Data Engineers
&lt;/h2&gt;

&lt;p&gt;Data engineers are increasingly building pipelines that feed LLMs: RAG systems, AI copilots for data quality, agents that write and review SQL, tools that summarize data lineage. In every one of these systems, the quality of what lands in the context window directly determines output quality.&lt;/p&gt;

&lt;p&gt;A poorly designed context is like feeding a senior analyst a jumbled mess of raw logs and asking for an executive summary. Technically possible — but you'll get garbage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Techniques
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Strategic Positioning
&lt;/h3&gt;

&lt;p&gt;LLMs don't read context uniformly. Research consistently shows they pay more attention to the &lt;strong&gt;beginning and end&lt;/strong&gt; of the context window. So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Put critical instructions and persona definitions &lt;strong&gt;at the start&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Put the most relevant retrieved data &lt;strong&gt;near the end&lt;/strong&gt;, close to the user query&lt;/li&gt;
&lt;li&gt;Move supporting or low-priority content to the middle
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BAD: query buried in the middle
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;system_instructions&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;docs_and_examples&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;more_examples&lt;/span&gt;

&lt;span class="c1"&gt;# GOOD: query at the end, most relevant data just before it
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;system_instructions&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;background_context&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;retrieved_chunks&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Selective Retrieval Over Full Documents
&lt;/h3&gt;

&lt;p&gt;Don't dump entire documents into the context. Use &lt;strong&gt;semantic chunking + vector search&lt;/strong&gt; to retrieve only relevant paragraphs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;query_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;chunk_embs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_embs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;squeeze&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;top_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:][::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Context Caching (Huge Cost Savings)
&lt;/h3&gt;

&lt;p&gt;Both Claude and Gemini support &lt;strong&gt;prompt caching&lt;/strong&gt; — storing repeated context server-side so you only pay full price once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a senior data engineer...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema_definitions.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Prompt caching reduces cost by &lt;strong&gt;75–90%&lt;/strong&gt; on cached tokens. At scale, this is the difference between a viable product and a budget disaster.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Structured Context Formats
&lt;/h3&gt;

&lt;p&gt;Use XML tags or clear delimiters to separate context sections — LLMs respond better to structured input:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_structured_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recent_errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;errors_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent_errors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;schema&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/schema&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;recent_errors&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;errors_str&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/recent_errors&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;question&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/question&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Dynamic Context Compression
&lt;/h3&gt;

&lt;p&gt;As conversations grow, implement &lt;strong&gt;rolling summarization&lt;/strong&gt; instead of truncating from the start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compress_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;estimate_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;summarize_with_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prior summary: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Context Engineering Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] System prompt at the very beginning of context?&lt;/li&gt;
&lt;li&gt;[ ] User query at or near the end?&lt;/li&gt;
&lt;li&gt;[ ] Retrieving relevant chunks instead of full documents?&lt;/li&gt;
&lt;li&gt;[ ] Repeated blocks cached (system prompts, schemas, docs)?&lt;/li&gt;
&lt;li&gt;[ ] Context sections clearly delimited?&lt;/li&gt;
&lt;li&gt;[ ] Compression strategy for long conversations?&lt;/li&gt;
&lt;li&gt;[ ] Measured token usage and cost per request?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Shift in Mindset
&lt;/h2&gt;

&lt;p&gt;Prompt engineering is about &lt;em&gt;what you say&lt;/em&gt;. Context engineering is about &lt;em&gt;what you provide&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The best LLM outputs in production systems today come from engineers who think carefully about information architecture — what goes in the context window, in what order, how much of it, and how it's structured. That's an engineering discipline, not a writing exercise.&lt;/p&gt;

&lt;p&gt;If you're building data pipelines that feed AI systems, this is now part of your stack. Treat context design with the same rigor you'd apply to schema design or query optimization.&lt;/p&gt;




&lt;p&gt;Cheers,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Gabriel Henrique&lt;/strong&gt; — Data Engineer | ETL/ELT | Databricks | Azure&lt;br&gt;&lt;br&gt;
🔗 &lt;a href="https://gabrielh.dev/" rel="noopener noreferrer"&gt;gabrielh.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Unlock AI’s Hidden Power: The Ultimate Guide to Prompt Engineering</title>
      <dc:creator>Gabriel Henrique</dc:creator>
      <pubDate>Sun, 06 Jul 2025 05:47:30 +0000</pubDate>
      <link>https://dev.to/gabrielhca/unlock-ais-hidden-power-the-ultimate-guide-to-prompt-engineering-c41</link>
      <guid>https://dev.to/gabrielhca/unlock-ais-hidden-power-the-ultimate-guide-to-prompt-engineering-c41</guid>
      <description>&lt;h2&gt;
  
  
  Prompt Engineering: The Hidden Power of AI
&lt;/h2&gt;

&lt;p&gt;With the exponential advancement of artificial intelligence, we live in a unique moment in tech history. Millions of people use these powerful tools for coding, creative writing, studying, data analysis, and much more. Yet many still fail to realize they’re squandering AI’s true potential for one simple reason: &lt;strong&gt;they don’t know how to communicate effectively with it&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;This gap between humans and AI is where the “hidden power” of prompt engineering resides—a skill that can completely transform your AI experience and outcomes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dangers of Poor Prompts
&lt;/h2&gt;

&lt;p&gt;Have you ever wondered how many opportunities you miss with vague or poorly structured prompts? Poor prompts are like giving confusing instructions to an extremely capable assistant. Typing “help me with marketing” or “write some code” wastes AI’s potential and creates several problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generic, Irrelevant Responses&lt;/strong&gt;: Vague prompts such as “tell me something interesting” yield superficial, low-value information. AI can’t guess your specific needs, so it produces generic content that adds little real value.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wasted Time and Frustration&lt;/strong&gt;: If you don’t get the desired result on the first try, you must reformulate and retry, creating an unproductive cycle.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated Answers&lt;/strong&gt;: Ambiguous prompts greatly increase the chance that AI will fabricate plausible-sounding but false information—especially dangerous when you need accurate data for decision-making.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Underutilization of Capabilities&lt;/strong&gt;: Without a proper structure, you’re tapping only a fraction of AI’s power—like owning a supercomputer but using it as a basic calculator.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The ASK Framework: Transforming Your AI Interactions
&lt;/h2&gt;

&lt;p&gt;To solve these problems, we introduce the &lt;strong&gt;ASK&lt;/strong&gt; framework, a proven methodology that will revolutionize how you interact with AI:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ASK&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Define precisely what you want the AI to do.&lt;br&gt;&lt;br&gt;
Example:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Generate a social media marketing plan for a small urban retail boutique targeting young adults.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;CONTEXT&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Provide relevant background information to help the AI understand your situation.&lt;br&gt;&lt;br&gt;
Example:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I own a streetwear shop in a college town with a monthly budget of $400.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;CONSTRAINTS&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Specify clear limits on format, length, tone, and style.&lt;br&gt;&lt;br&gt;
Example:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Respond in a bulleted list of 5 items, professional tone, maximum 150 words.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;EXAMPLE&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Offer concrete examples of what you expect (few-shot prompting).&lt;br&gt;&lt;br&gt;
Example:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Classify the sentiment of these reviews:&lt;br&gt;
Example 1: “Loved the fast delivery!” → Positive&lt;br&gt;
Example 2: “Product was defective, terrible service.” → Negative&lt;br&gt;
Example 3: “Item is okay, nothing special.” → Neutral Now classify: “Exceeded my expectations!”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;STYLE&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Define the tone, persona, or writing style you want the AI to adopt.&lt;br&gt;&lt;br&gt;
Example:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Act as a senior marketing consultant with 10 years of experience, using clear, engaging language.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Advanced Prompting Techniques for Maximum Efficiency
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Chain-of-Thought (CoT) Prompting&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Ask AI to show its reasoning step by step for complex problems. This boosts accuracy.  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Solve this problem showing each reasoning step:&lt;br&gt;
“A company has 150 employees. 30% work in sales, 25% in production, and the rest in other departments. If the company grows by 20% next year, how many employees will each department have&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Prompt Chaining&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Break complex tasks into smaller, sequential steps to avoid overwhelming the AI:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Analyze the problem
&lt;/li&gt;
&lt;li&gt;Identify possible solutions
&lt;/li&gt;
&lt;li&gt;Evaluate pros and cons
&lt;/li&gt;
&lt;li&gt;Recommend the best solution
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Self-Consistency&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Have AI generate multiple answers to the same prompt and then choose the most consistent one to improve reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization Strategies
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Iterative Refinement&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Never settle for the first result. Prompt engineering is iterative—review the response, identify improvements, and adjust your prompt accordingly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A/B Testing Prompts&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Compare different versions of the same prompt to see which yields better results, especially for critical applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temperature Control&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Adjust AI creativity as needed:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low temperature (0.1–0.3)&lt;/strong&gt;: Precise, consistent responses
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High temperature (0.7–1.0)&lt;/strong&gt;: Creative, varied outputs
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Avoiding Common Pitfalls
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Excessive Ambiguity&lt;/strong&gt;: Avoid words with multiple meanings—be specific.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Information Overload&lt;/strong&gt;: Don’t include unnecessary details that confuse AI.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unrealistic Expectations&lt;/strong&gt;: Understand AI’s limitations; it’s not magic.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Context&lt;/strong&gt;: Always provide relevant background information.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practical Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Software Development&lt;/strong&gt;  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Act as a senior Python developer specializing in REST APIs.&lt;br&gt;
Context: I need to build an e-commerce API.&lt;br&gt;
Constraints: Use FastAPI, include JWT authentication, document with OpenAPI.&lt;br&gt;
Example: Follow a structure similar to Mercado Livre.&lt;br&gt;
Style: Clean code, comments in English, adhering to PEP 8.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Content Creation&lt;/strong&gt;  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Act as a social media copywriter.&lt;br&gt;
Context: Women’s fashion boutique, audience 18–35, casual style.&lt;br&gt;
Constraints: Instagram post, max 150 characters, include a call-to-action.&lt;br&gt;
Example: “Found the perfect weekend look! 💕 #OOTD”&lt;br&gt;
Style: Casual tone, use emojis, youthful language. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion: Mastering the Hidden Power
&lt;/h2&gt;

&lt;p&gt;Prompt engineering isn’t just a technical skill—it’s an &lt;strong&gt;essential competency&lt;/strong&gt; for thriving in the AI era. By mastering these techniques, you not only improve your results but also communicate more effectively with the technologies shaping the future.&lt;/p&gt;

&lt;p&gt;Remember: &lt;strong&gt;the quality of an AI’s response is directly proportional to the quality of your prompt&lt;/strong&gt;. Investing time to learn these methods is an investment in your professional future.  &lt;/p&gt;

&lt;p&gt;The hidden power of prompt engineering is in your hands. It’s not a question of &lt;em&gt;if&lt;/em&gt; you’ll use it, but &lt;em&gt;when&lt;/em&gt; you’ll start mastering it. The sooner you begin, the greater your competitive edge in a world increasingly integrated with AI.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Start today by applying the ASK framework in your next AI interactions. Test, iterate, refine. Your productivity and result quality will never be the same.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Virtual Learning Festival and Vouchers: An Unmissable Opportunity</title>
      <dc:creator>Gabriel Henrique</dc:creator>
      <pubDate>Sun, 15 Jun 2025 23:38:33 +0000</pubDate>
      <link>https://dev.to/gabrielhca/virtual-learning-festival-and-vouchers-an-unmissable-opportunity-12lm</link>
      <guid>https://dev.to/gabrielhca/virtual-learning-festival-and-vouchers-an-unmissable-opportunity-12lm</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu79jqhzyl4jvgdklrufo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu79jqhzyl4jvgdklrufo.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Virtual Learning Festival?
&lt;/h2&gt;

&lt;p&gt;The Virtual Learning Festival is an online event celebrating the Data + AI Summit 2025, running from June 11 to July 2, 2025. It is designed to help participants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete training,&lt;/li&gt;
&lt;li&gt;Expand data and AI skills,&lt;/li&gt;
&lt;li&gt;Prepare for Databricks certifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How does it work?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://community.databricks.com/t5/events/dais-2025-virtual-learning-festival-11-june-02-july-2025/ev-p/119323" rel="noopener noreferrer"&gt;The Virtual Learning Festival&lt;/a&gt; offers free online sessions, workshops, and content, allowing you to participate at your own pace during the event period. It aligns with the in-person Data + AI Summit in San Francisco (June 9–12), complementing the experience with remote training.&lt;/p&gt;

&lt;h4&gt;
  
  
  Main objectives
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Free training on data and AI topics,&lt;/li&gt;
&lt;li&gt;Certification preparation (with materials and practice),&lt;/li&gt;
&lt;li&gt;Ongoing engagement before and after the main in-person event.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you wish, I can help you find session details, workshop registration links, and information about certificates and discount vouchers—just let me know what interests you!&lt;/p&gt;




&lt;h2&gt;
  
  
  Discount Vouchers: How Do They Work?
&lt;/h2&gt;

&lt;p&gt;During the Virtual Learning Festival, participants have access to exclusive benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50% discount voucher&lt;/strong&gt; for Databricks certification (equivalent to $100 off).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20% discount coupon&lt;/strong&gt; for Databricks Academy Labs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to get them?
&lt;/h3&gt;

&lt;p&gt;Simply complete any course during the virtual festival (June 11 to July 2, 2025) to automatically receive the 50% certification discount voucher and the 20% Academy Labs coupon by email.&lt;/p&gt;

&lt;h4&gt;
  
  
  Quick Summary
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;How to get it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50% off certification&lt;/td&gt;
&lt;td&gt;Complete any course during the festival&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20% off Academy Labs&lt;/td&gt;
&lt;td&gt;Upon receiving the certification voucher&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This dynamic has been confirmed in previous festivals and remains valid for the current event. If you are participating, just complete at least one course to secure your discounts.&lt;/p&gt;

&lt;p&gt;If you need help choosing courses or tracking your completion, I can guide you!  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://community.databricks.com/t5/events/dais-2025-virtual-learning-festival-11-june-02-july-2025/ev-p/119323" rel="noopener noreferrer"&gt;Access The Virtual Learning Festival&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>database</category>
      <category>webassembly</category>
    </item>
    <item>
      <title>Databricks News: Highlights from Data + AI Summit 2025</title>
      <dc:creator>Gabriel Henrique</dc:creator>
      <pubDate>Sun, 15 Jun 2025 23:31:47 +0000</pubDate>
      <link>https://dev.to/gabrielhca/databricks-news-highlights-from-data-ai-summit-2025-1oh1</link>
      <guid>https://dev.to/gabrielhca/databricks-news-highlights-from-data-ai-summit-2025-1oh1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faojst92qjvu742cuh6i2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faojst92qjvu742cuh6i2.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Databricks News and Vouchers: Highlights from Data + AI Summit 2025
&lt;/h2&gt;

&lt;p&gt;On June 12, 2025, the Data + AI Summit, Databricks' flagship annual event, concluded in San Francisco, gathering over 20,000 data and AI professionals from around the world. The event introduced a series of announcements and innovations set to transform the data, artificial intelligence, and cloud collaboration ecosystem. Below, I share a summary of the main news unveiled at the event, with brief descriptions for easy understanding.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Databricks Lakeflow: Unified Data Engineering
&lt;/h2&gt;

&lt;p&gt;Databricks Lakeflow was launched as a comprehensive solution for data ingestion, transformation, and orchestration, integrating managed connectors for enterprise applications, databases, and data warehouses. A highlight is &lt;strong&gt;Zerobus&lt;/strong&gt;, an API enabling real-time event data ingestion with high throughput and low latency, making large-scale data usage for analytics and AI easier.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Unity Catalog: Intelligent Governance and Automation
&lt;/h2&gt;

&lt;p&gt;Unity Catalog received new features to unify data and AI governance across different formats, clouds, and teams. Notable updates include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Attribute-Based Access Control (ABAC):&lt;/strong&gt; Enables flexible access policies using tags, now in beta for AWS, Azure, and GCP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tag Policies:&lt;/strong&gt; Ensure consistency and security in data classification and usage across the platform, also in beta on major clouds.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Data Sharing and Collaboration
&lt;/h2&gt;

&lt;p&gt;Improvements were announced to facilitate secure data sharing between organizations, including “clean rooms” that allow collaboration without compromising data privacy or security.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Full Support for Apache Iceberg™
&lt;/h2&gt;

&lt;p&gt;Databricks now offers full support for Apache Iceberg™, expanding open-format data management possibilities and making integration with various tools and platforms easier.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Spark Declarative Pipelines
&lt;/h2&gt;

&lt;p&gt;The platform introduced &lt;strong&gt;Spark Declarative Pipelines&lt;/strong&gt;, an evolution for developing data pipelines in a declarative, scalable, and open way, boosting productivity and standardization for data engineering teams.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Databricks SQL and Free Edition
&lt;/h2&gt;

&lt;p&gt;General availability of &lt;strong&gt;Databricks SQL&lt;/strong&gt; was announced, along with a new free edition of the platform, democratizing access to advanced data analytics and intelligence resources for organizations of all sizes.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. MLflow 3.0: AI Observability and Governance
&lt;/h2&gt;

&lt;p&gt;MLflow 3.0 arrives with improvements for experimentation, observability, and governance of AI models, streamlining the complete machine learning project lifecycle within the Databricks ecosystem.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Mosaic AI and Agent Bricks
&lt;/h2&gt;

&lt;p&gt;Mosaic AI introduced new features for developing intelligent agents, including &lt;strong&gt;Agent Bricks&lt;/strong&gt;, which enables the creation of self-optimizing agents using proprietary company data, accelerating the practical adoption of generative AI and autonomous agents.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Lakebase: Public Preview
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Lakebase&lt;/strong&gt; concept was presented in public preview, offering an innovative approach for managing transactional and analytical data in a single environment, simplifying operations and accelerating insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Power Platform Connector
&lt;/h2&gt;

&lt;p&gt;The new Azure Databricks connector for Power Platform enables real-time, governed data access for Power Apps, Power Automate, and Copilot Studio, expanding integration possibilities between data platforms and productivity tools.&lt;/p&gt;




&lt;p&gt;These innovations reinforce Databricks' commitment to leading in data and AI, offering increasingly integrated, secure, and accessible solutions for organizations across all sectors. Stay tuned, as these updates are sure to impact the market in the coming months.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>webdev</category>
      <category>database</category>
      <category>programming</category>
    </item>
    <item>
      <title>ETL vs. ELT: A Comprehensive Analysis of Modern Data Integration Strategies</title>
      <dc:creator>Gabriel Henrique</dc:creator>
      <pubDate>Sun, 01 Jun 2025 16:37:41 +0000</pubDate>
      <link>https://dev.to/gabrielhca/etl-vs-elt-a-comprehensive-analysis-of-modern-data-integration-strategies-1ibn</link>
      <guid>https://dev.to/gabrielhca/etl-vs-elt-a-comprehensive-analysis-of-modern-data-integration-strategies-1ibn</guid>
      <description>&lt;p&gt;The evolution of data architectures has sparked a critical debate between two dominant approaches: ETL (&lt;em&gt;Extract, Transform, Load&lt;/em&gt;) and ELT (&lt;em&gt;Extract, Load, Transform&lt;/em&gt;). This article examines their historical contexts, operational advantages, implementation challenges, and optimal use cases, providing actionable insights for organizations navigating modern data management.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Historical Context and Conceptual Foundations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ETL: The Legacy Framework
&lt;/h3&gt;

&lt;p&gt;Developed in the 1990s, ETL emerged as a response to technological constraints, including expensive storage and limited computational resources. Its sequential process—extracting data from heterogeneous sources, transforming it into standardized formats, and loading it into centralized repositories—prioritized storage efficiency by discarding raw data post-transformation. This approach became foundational for legacy systems and regulated industries requiring strict governance.  &lt;/p&gt;

&lt;h3&gt;
  
  
  ELT: The Cloud-Native Paradigm
&lt;/h3&gt;

&lt;p&gt;The advent of scalable cloud infrastructure and cost-effective storage catalyzed ELT's rise. By loading raw data directly into &lt;em&gt;data lakes&lt;/em&gt; or &lt;em&gt;lakehouses&lt;/em&gt; and deferring transformations, ELT leverages modern tools like Apache Spark and Snowflake to enable flexible reprocessing and exploratory analytics. This shift aligns with the growing demand for real-time insights and unstructured data handling in AI/ML applications.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Comparative Analysis and Practical Applications
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ETL Implementation Scenarios
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory Compliance&lt;/strong&gt;: Industries like healthcare (HIPAA) and finance (GDPR) benefit from ETL's pre-load data masking and retention policies.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legacy System Integration&lt;/strong&gt;: Organizations with on-premise infrastructure use ETL to bridge traditional databases with modern BI tools while preserving existing investments.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured Reporting&lt;/strong&gt;: ETL simplifies dimensional modeling for OLAP cubes, ensuring consistency in traditional Business Intelligence workflows.
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  ELT Dominant Use Cases
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Big Data &amp;amp; IoT&lt;/strong&gt;: ELT efficiently handles high-velocity data streams from sensors and logs, enabling real-time analytics in platforms like Databricks Delta Lake.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine Learning Pipelines&lt;/strong&gt;: Data scientists leverage ELT's raw data retention to rebuild &lt;em&gt;feature stores&lt;/em&gt; and retrain models as fraud patterns or consumer behaviors evolve.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medallion Architecture&lt;/strong&gt;: Adopted by 68% of cloud-first enterprises, this structure organizes data into Bronze (raw), Silver (cleaned), and Gold (enriched) layers, reducing pipeline development time by 40%.
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Architectural Patterns and Cost Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Optimizing ETL Workflows
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration Tools&lt;/strong&gt;: Apache Airflow and Talend provide version-controlled pipelines with granular transformation rules.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staging Zones&lt;/strong&gt;: Intermediate validation areas prevent data corruption, addressing the 62% of ETL failures occurring during extraction.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring Systems&lt;/strong&gt;: Checksums and schema validation ensure data integrity, particularly in cross-database migrations.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cloud-Native ELT Strategies
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Functionality&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bronze&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Immutable raw data storage&lt;/td&gt;
&lt;td&gt;AWS S3, Azure Data Lake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Silver&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Schema validation &amp;amp; deduplication&lt;/td&gt;
&lt;td&gt;Delta Lake, Snowflake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gold&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Query-optimized aggregates&lt;/td&gt;
&lt;td&gt;BigQuery, Redshift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Serverless technologies like AWS Glue reduce operational costs by 40% through auto-scaling, while columnar formats (Parquet) improve storage efficiency.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Performance and Economic Trade-offs
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;ETL&lt;/th&gt;
&lt;th&gt;ELT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2-4 hours (batch processing)&lt;/td&gt;
&lt;td&gt;Minutes (real-time ingestion)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.023/GB (processed data)&lt;/td&gt;
&lt;td&gt;$0.036/GB (raw + processed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compute Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited (pre-defined transforms)&lt;/td&gt;
&lt;td&gt;High (on-demand transformations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compliance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ideal for PII handling&lt;/td&gt;
&lt;td&gt;Requires additional governance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Studies show ELT reduces total cost of ownership (TCO) by 15-20% for petabyte-scale operations but remains less efficient than ETL in structured, low-variability environments.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Strategic Recommendations and Future Trends
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hybrid Adoption Framework
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ETL for Core Systems&lt;/strong&gt;: Apply to financial transactions and medical records requiring audit trails.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ELT for Innovation&lt;/strong&gt;: Utilize for social media sentiment analysis and IoT telemetry projects.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified Governance&lt;/strong&gt;: Tools like Collibra manage both paradigms under centralized access policies.
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Migration Checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1&lt;/strong&gt;: Inventory existing ETL pipelines and data dependencies
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2&lt;/strong&gt;: Pilot ELT with non-critical datasets (e.g., marketing analytics)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 3&lt;/strong&gt;: Upskill teams in distributed processing (Spark) and cloud security protocols
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion: Aligning Strategy with Organizational Maturity
&lt;/h2&gt;

&lt;p&gt;The ETL/ELT decision matrix below synthesizes key operational factors:  &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criterion&lt;/th&gt;
&lt;th&gt;ETL&lt;/th&gt;
&lt;th&gt;ELT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Volume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;1 TB/day&lt;/td&gt;
&lt;td&gt;&amp;gt;1 TB/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transformation Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (multi-stage logic)&lt;/td&gt;
&lt;td&gt;Low (SQL-based transformations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;On-premise/ Hybrid&lt;/td&gt;
&lt;td&gt;Cloud-native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Team Skills&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ETL Developers&lt;/td&gt;
&lt;td&gt;Data Engineers + SQL Analysts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regulatory Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (PHI, PCI DSS)&lt;/td&gt;
&lt;td&gt;Moderate (GDPR with add-ons)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As of 2025, 67% of enterprises with &amp;gt;1PB data leverage ELT, while ETL maintains 89% adoption in healthcare and banking. Emerging trends favor adaptive architectures combining ETL's governance with ELT's flexibility, particularly for AI-driven organizations needing both structured reporting and experimental sandboxes. By aligning technical choices with business objectives—rather than chasing industry trends—organizations can build resilient data ecosystems capable of evolving with technological and regulatory landscapes.  &lt;/p&gt;

</description>
      <category>webdev</category>
      <category>webassembly</category>
      <category>discuss</category>
      <category>database</category>
    </item>
    <item>
      <title>A2A and MCP: Revolutionary Protocols for Communication Between AI Agents and Their Impact on the Development Ecosystem</title>
      <dc:creator>Gabriel Henrique</dc:creator>
      <pubDate>Sat, 24 May 2025 23:21:14 +0000</pubDate>
      <link>https://dev.to/gabrielhca/a2a-and-mcp-revolutionary-protocols-for-communication-between-ai-agents-and-their-impact-on-the-a7k</link>
      <guid>https://dev.to/gabrielhca/a2a-and-mcp-revolutionary-protocols-for-communication-between-ai-agents-and-their-impact-on-the-a7k</guid>
      <description>&lt;h2&gt;
  
  
  A2A and MCP: Revolutionary Protocols for Communication Between AI Agents and Their Impact on the Development Ecosystem
&lt;/h2&gt;

&lt;p&gt;Microsoft recently announced support for the &lt;strong&gt;Agent2Agent (A2A)&lt;/strong&gt; protocol in Azure AI Foundry and Copilot Studio, while Anthropic’s &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; continues to gain ground as a standard for tool integration. This post dives into both protocols, compares them, and offers actionable insights for developers based on technical analyses and industry trends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction: A New Era of AI Agent Collaboration
&lt;/h2&gt;

&lt;p&gt;Interoperability among AI systems is critical—43% of global enterprises already use autonomous agents to automate processes (Gartner, 2025). A2A and MCP address two distinct challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A2A:&lt;/strong&gt; Communication and coordination between heterogeneous agents
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP:&lt;/strong&gt; Standardized integration between agents and external tools or data sources
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A recent OpenAI study shows that systems combining both protocols achieve &lt;strong&gt;87% higher efficiency&lt;/strong&gt; on complex tasks compared to standalone solutions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Agent2Agent (A2A): A Universal Language for AI Collaboration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Technical Principles
&lt;/h3&gt;

&lt;p&gt;A2A is built on a &lt;strong&gt;publish-subscribe architecture&lt;/strong&gt; with these core components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Message Broker&lt;/strong&gt; (e.g., Azure Service Bus)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Registry&lt;/strong&gt; (global capability catalog)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task Orchestrator&lt;/strong&gt; (e.g., Azure Logic Apps)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Layer&lt;/strong&gt; (Azure AD + confidential computing)
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example A2A payload:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
"sender": "copilot@microsoft.com",
"task_id": "a2a-9fhd83-2025",
"action": "schedule_meeting",
"parameters": {
"participants": [
"agent1@google.com",
"agent2@anthropic.com"
],
"time_window": "2025-05-25T09:00/17:00"
},
"context": {
"priority": "high",
"deadline": "2025-05-24T23:59"
}
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Source: Azure AI Foundry Technical Docs&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LG Electronics:&lt;/strong&gt; 40% reduction in product development time by integrating design, supply chain, and QA agents via A2A
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;University Hospital Zurich:&lt;/strong&gt; Coordinated 127 medical agents for personalized cancer treatment, achieving 35% higher diagnostic accuracy&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Model Context Protocol (MCP): Bridging AI and the Real World
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architectural Overview
&lt;/h3&gt;

&lt;p&gt;MCP defines a &lt;strong&gt;dynamic plugin system&lt;/strong&gt; with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP Host:&lt;/strong&gt; LLM runtime (e.g., Claude 3)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Client:&lt;/strong&gt; Embedded connector
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Server:&lt;/strong&gt; Tool or data provider (e.g., PostgreSQL, GitHub Actions)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Typical workflow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A[AI Agent] --&amp;gt; B[MCP Client]
B --&amp;gt; C{MCP Server}
C --&amp;gt; D[(Database)]
C --&amp;gt; E[External API]
C --&amp;gt; F[Legacy System]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Performance Benchmarks
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Without MCP&lt;/th&gt;
&lt;th&gt;With MCP&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SQL Query&lt;/td&gt;
&lt;td&gt;1200 ms&lt;/td&gt;
&lt;td&gt;450 ms&lt;/td&gt;
&lt;td&gt;62.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;REST API Call&lt;/td&gt;
&lt;td&gt;800 ms&lt;/td&gt;
&lt;td&gt;300 ms&lt;/td&gt;
&lt;td&gt;62.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PDF Processing&lt;/td&gt;
&lt;td&gt;950 ms&lt;/td&gt;
&lt;td&gt;210 ms&lt;/td&gt;
&lt;td&gt;78%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Data: Anthropic Technical Report Q1/2025&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical Comparison: A2A vs MCP
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;A2A&lt;/th&gt;
&lt;th&gt;MCP&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary Focus&lt;/td&gt;
&lt;td&gt;Agent-to-agent collaboration&lt;/td&gt;
&lt;td&gt;Agent-to-tool integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Communication Model&lt;/td&gt;
&lt;td&gt;Peer-to-peer&lt;/td&gt;
&lt;td&gt;Client-server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Average Latency&lt;/td&gt;
&lt;td&gt;150–300 ms&lt;/td&gt;
&lt;td&gt;50–150 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;OAuth 2.1 + Confidential ML&lt;/td&gt;
&lt;td&gt;TLS 1.3 + Hardware Keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ideal Use Case&lt;/td&gt;
&lt;td&gt;Complex orchestration&lt;/td&gt;
&lt;td&gt;Structured data access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Practical Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Using MCP to fetch market data
current_price = mcp_client.query("stock_api", symbol="MSFT")

Using A2A to coordinate risk calculation
a2a.send_task(
recipient="risk_agent@bank.com",
action="calculate_risk",
params={"portfolio": current_portfolio}
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Trends &amp;amp; Recommendations for Developers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Market Data (2025)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;67% of enterprises plan to adopt A2A by 2026
&lt;/li&gt;
&lt;li&gt;82% of developers consider MCP critical for AI projects
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recommended Stack:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Azure A2A Orchestrator + Anthropic MCP Gateway
Python 3.12+ with asyncio for concurrency
Prometheus + Grafana for monitoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Implementation Checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Define clear use cases for each protocol
&lt;/li&gt;
&lt;li&gt;[ ] Configure Azure Service Bus for A2A messaging
&lt;/li&gt;
&lt;li&gt;[ ] Deploy MCP gateways for critical systems
&lt;/li&gt;
&lt;li&gt;[ ] Unify security policies across protocols
&lt;/li&gt;
&lt;li&gt;[ ] Develop interoperability test suites
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion: The Future Is Multi-Protocol
&lt;/h2&gt;

&lt;p&gt;Combining A2A and MCP enables &lt;strong&gt;360° AI systems&lt;/strong&gt; that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process 5.7× more data per cycle
&lt;/li&gt;
&lt;li&gt;Reduce errors by 68% in complex operations
&lt;/li&gt;
&lt;li&gt;Dynamically adapt to evolving requirements
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For developers, mastering these protocols means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tripling development efficiency
&lt;/li&gt;
&lt;li&gt;Cutting integration costs by 40%
&lt;/li&gt;
&lt;li&gt;Enabling new business models in Web3 and the Metaverse
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“Interoperability is no longer optional—it’s the currency of the AI ecosystem.”&lt;br&gt;&lt;br&gt;
— Satya Nadella, Microsoft CEO (May 2025)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Don’t get left behind: try A2A and MCP today, stay up to date and become a protagonist in this new chapter of artificial intelligence. The future starts now!🚀&lt;/p&gt;

</description>
      <category>ai</category>
      <category>development</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
