<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pat Sienkiewicz</title>
    <description>The latest articles on DEV Community by Pat Sienkiewicz (@sienkiewicz_pat).</description>
    <link>https://dev.to/sienkiewicz_pat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F371733%2F9ef8978d-5ec9-4f46-be95-560ffd78ee6b.jpg</url>
      <title>DEV Community: Pat Sienkiewicz</title>
      <link>https://dev.to/sienkiewicz_pat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sienkiewicz_pat"/>
    <language>en</language>
    <item>
      <title>Building a Scalable Telco CDR Processing Pipeline with Databricks Delta Live Tables - Part 1 [Databricks Free Edition]</title>
      <dc:creator>Pat Sienkiewicz</dc:creator>
      <pubDate>Fri, 29 Aug 2025 11:05:56 +0000</pubDate>
      <link>https://dev.to/sienkiewicz_pat/building-a-scalable-telco-cdr-processing-pipeline-with-databricks-delta-live-tables-part-1-1gmf</link>
      <guid>https://dev.to/sienkiewicz_pat/building-a-scalable-telco-cdr-processing-pipeline-with-databricks-delta-live-tables-part-1-1gmf</guid>
      <description>&lt;p&gt;&lt;em&gt;In this multi-part series, we'll explore how to build a modern, scalable pipeline for processing telecom Call Detail Records (CDRs) using Databricks Delta Live Tables. Part 1 focuses on the foundation: data generation and the bronze layer implementation.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Telecommunications companies process billions of Call Detail Records (CDRs) daily. These records capture every interaction with the network—voice calls, text messages, data sessions, and more. Processing this data efficiently is critical for billing, network optimization, fraud detection, and customer experience management.&lt;/p&gt;

&lt;p&gt;In this series, we'll build a complete Telco CDR processing pipeline using Databricks Delta Live Tables (DLT). We'll follow the medallion architecture pattern, with bronze, silver, and gold layers that progressively refine raw data into valuable business insights.&lt;/p&gt;

&lt;h2&gt;The Challenge&lt;/h2&gt;

&lt;p&gt;Telecom data presents several unique challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: Billions of records generated daily&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variety&lt;/strong&gt;: Multiple CDR types with different schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Velocity&lt;/strong&gt;: Real-time processing requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: Intricate relationships between users, devices, and network elements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt;: Strict regulatory requirements for data retention and privacy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Traditional batch processing approaches struggle with these challenges. We need a modern, streaming-first architecture that can handle the scale and complexity of telecom data.&lt;/p&gt;

&lt;h2&gt;Our Solution&lt;/h2&gt;

&lt;p&gt;Git repo: &lt;a href="https://github.com/cloud-data-engineer/data/blob/main/dlt_telco/README.md" rel="noopener noreferrer"&gt;dlt_telco&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;We're building a solution with two main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Generator&lt;/strong&gt;: A synthetic CDR generator that produces realistic telecom data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DLT Pipeline&lt;/strong&gt;: A Delta Live Tables pipeline that processes the data through medallion architecture layers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In Part 1, we'll focus on the data generator and the bronze layer implementation.&lt;/p&gt;

&lt;h2&gt;Data Generator: Creating Realistic Synthetic Data&lt;/h2&gt;

&lt;p&gt;For development and testing, we need a way to generate realistic CDR data. Our generator creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User Profiles&lt;/strong&gt;: Synthetic subscriber data with identifiers (MSISDN, IMSI, IMEI), plan details, and location information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple CDR Types&lt;/strong&gt;: Voice, data, SMS, VoIP, and IMS records with appropriate attributes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka Integration&lt;/strong&gt;: Direct streaming to Kafka topics for real-time ingestion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The generator ensures referential integrity between users and CDRs, making it possible to perform realistic joins and aggregations in downstream processing.&lt;/p&gt;

&lt;h3&gt;User Profile Generation&lt;/h3&gt;

&lt;p&gt;Our user generator creates profiles with realistic telecom attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Sample user profile structure
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;msisdn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1234567890&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imsi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;310150123456789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imei&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;490154203237518&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Premium Unlimited&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_limit_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voice_minutes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sms_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;registration_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023-05-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Seattle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
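&lt;p&gt;&lt;em&gt;As a rough sketch of how such profiles can be produced (the field names follow the sample above, but the &lt;code&gt;make_user_profile&lt;/code&gt; helper and the numbering schemes are illustrative, not the exact generator from the repo):&lt;/em&gt;&lt;/p&gt;

```python
import random
from datetime import date, timedelta

def make_user_profile(user_index):
    """Build one synthetic subscriber profile (illustrative values)."""
    plans = ["Basic", "Standard", "Premium Unlimited"]
    return {
        "user_id": f"user_{user_index}",
        # MSISDN: 10-digit subscriber number
        "msisdn": "".join(random.choices("0123456789", k=10)),
        # IMSI: MCC 310 (US) + MNC 150 + a 9-digit MSIN
        "imsi": "310150" + "".join(random.choices("0123456789", k=9)),
        # IMEI: 15-digit equipment identifier
        "imei": "".join(random.choices("0123456789", k=15)),
        "plan_name": random.choice(plans),
        "data_limit_gb": random.choice([5, 20, 50]),
        "voice_minutes": random.choice([200, 500, 1000]),
        "sms_count": random.choice([100, 500, 1000]),
        "registration_date": (
            date.today() - timedelta(days=random.randint(0, 730))
        ).isoformat(),
        "active": True,
        "location": {"city": "Seattle", "state": "WA"},
    }

users = [make_user_profile(i) for i in range(100)]
```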



&lt;h3&gt;CDR Generation&lt;/h3&gt;

&lt;p&gt;The CDR generator produces five types of records, each with appropriate attributes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Voice CDRs&lt;/strong&gt;: Call duration, calling/called numbers, cell tower IDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data CDRs&lt;/strong&gt;: Session duration, uplink/downlink volumes, APN information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SMS CDRs&lt;/strong&gt;: Message size, sender/receiver information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VoIP CDRs&lt;/strong&gt;: SIP endpoints, codec information, quality metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IMS CDRs&lt;/strong&gt;: Service type, session details, network elements&lt;/li&gt;
&lt;/ol&gt;
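&lt;p&gt;&lt;em&gt;To preserve referential integrity, each CDR draws its subscriber identifiers from an already generated user profile. A minimal sketch for a voice CDR (field names are illustrative; the real generator covers all five record types):&lt;/em&gt;&lt;/p&gt;

```python
import random
import uuid
from datetime import datetime, timezone

# Assume a pool of previously generated user profiles.
users = [
    {"user_id": f"user_{i}", "msisdn": f"12065550{i:03d}", "imsi": f"310150{i:09d}"}
    for i in range(100)
]

def make_voice_cdr(user_pool):
    """Build one voice CDR whose identifiers come from an existing user."""
    caller = random.choice(user_pool)  # referential integrity: reuse a real subscriber
    callee = random.choice(user_pool)
    return {
        "record_type": "voice",
        "cdr_id": str(uuid.uuid4()),
        "calling_number": caller["msisdn"],
        "called_number": callee["msisdn"],
        "imsi": caller["imsi"],
        "duration_seconds": random.randint(1, 3600),
        "cell_tower_id": f"tower_{random.randint(1, 500)}",
        "start_time": datetime.now(timezone.utc).isoformat(),
    }

cdr = make_voice_cdr(users)
```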

&lt;h3&gt;Kafka Integration&lt;/h3&gt;

&lt;p&gt;The generator streams data to dedicated Kafka topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;telco-users&lt;/code&gt;: User profile data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;telco-voice-cdrs&lt;/code&gt;: Voice call records&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;telco-data-cdrs&lt;/code&gt;: Data usage records&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;telco-sms-cdrs&lt;/code&gt;: SMS message records&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;telco-voip-cdrs&lt;/code&gt;: VoIP call records&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;telco-ims-cdrs&lt;/code&gt;: IMS session records&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This streaming approach mimics real-world telecom environments where CDRs flow continuously from network elements to processing systems.&lt;/p&gt;
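&lt;p&gt;&lt;em&gt;Routing a record to its topic is a simple mapping on the record type; the actual send would then go through a Kafka producer (e.g. &lt;code&gt;confluent_kafka.Producer.produce&lt;/code&gt;, omitted here so the sketch stays dependency-free):&lt;/em&gt;&lt;/p&gt;

```python
import json

# One topic per record type, matching the list above.
TOPIC_BY_TYPE = {
    "user": "telco-users",
    "voice": "telco-voice-cdrs",
    "data": "telco-data-cdrs",
    "sms": "telco-sms-cdrs",
    "voip": "telco-voip-cdrs",
    "ims": "telco-ims-cdrs",
}

def to_kafka_message(record):
    """Return (topic, key, value) for a generated record."""
    topic = TOPIC_BY_TYPE[record["record_type"]]
    # Key by user for profiles, by CDR id for call records.
    key = record.get("user_id") or record.get("cdr_id")
    return topic, key, json.dumps(record)

# With a real producer: producer.produce(topic, key=key, value=value)
topic, key, value = to_kafka_message(
    {"record_type": "sms", "cdr_id": "abc-123", "message_size": 42}
)
```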

&lt;h2&gt;Bronze Layer: Raw Data Ingestion with Delta Live Tables&lt;/h2&gt;

&lt;p&gt;The bronze layer is the foundation of our medallion architecture. It ingests raw data from Kafka with minimal transformation, preserving the original content for compliance and auditability.&lt;/p&gt;

&lt;h3&gt;Key Features&lt;/h3&gt;

&lt;p&gt;Our bronze layer implementation provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Ingestion&lt;/strong&gt;: Real-time data processing from Kafka&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Preservation&lt;/strong&gt;: Maintains original message structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Tracking&lt;/strong&gt;: Captures Kafka metadata (timestamp, topic, key)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Secure credential management via Databricks secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Serverless Delta Live Tables for auto-scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Bronze Tables Structure&lt;/h3&gt;

&lt;p&gt;Our bronze layer includes seven tables:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Table Name&lt;/th&gt;
&lt;th&gt;Source Topic&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_users&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-users&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Raw user profile data with parsed JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_voice_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-voice-cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Voice call detail records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_data_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-data-cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Data session records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_sms_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-sms-cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SMS message records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_voip_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-voip-cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VoIP call records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_ims_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-ims-cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;IMS session records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_all_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All CDR topics&lt;/td&gt;
&lt;td&gt;Multiplexed view of all CDR types&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each table preserves the original Kafka metadata (key, timestamp, topic) alongside the raw data, enabling reprocessing if needed.&lt;/p&gt;
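&lt;p&gt;&lt;em&gt;The multiplexed &lt;code&gt;bronze_all_cdrs&lt;/code&gt; table can be fed by one stream subscribed to every CDR topic at once; with the Spark Kafka source that is a comma-separated &lt;code&gt;subscribe&lt;/code&gt; option (a sketch using standard source option names; the repo's implementation may differ):&lt;/em&gt;&lt;/p&gt;

```python
CDR_TOPICS = [
    "telco-voice-cdrs",
    "telco-data-cdrs",
    "telco-sms-cdrs",
    "telco-voip-cdrs",
    "telco-ims-cdrs",
]

def multiplexed_kafka_options(bootstrap_servers, topics):
    """Kafka source options for a single stream covering all CDR topics."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": ",".join(topics),  # one stream, many topics
        "startingOffsets": "earliest",
    }

opts = multiplexed_kafka_options("broker:9092", CDR_TOPICS)
# spark.readStream.format("kafka").options(**opts).load() keeps the `topic`
# column, so downstream layers can split records back out by CDR type.
```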

&lt;h3&gt;Table Schema&lt;/h3&gt;

&lt;p&gt;All bronze tables follow a consistent schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;bronze_&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;-- Kafka message key&lt;/span&gt;
  &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;-- Kafka message timestamp&lt;/span&gt;
  &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;-- Source Kafka topic&lt;/span&gt;
  &lt;span class="n"&gt;processing_time&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;-- DLT processing timestamp&lt;/span&gt;
  &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;-- Original JSON payload&lt;/span&gt;
  &lt;span class="n"&gt;parsed_data&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;       &lt;span class="c1"&gt;-- Parsed JSON (users table only)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Code Implementation&lt;/h3&gt;

&lt;p&gt;Here's how we define our bronze tables using DLT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze_users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw user data from Kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_bronze_table_properties&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bronze_users&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_standard_bronze_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_from_kafka&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;telco-users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parsed_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;from_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;user_schema&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_bronze_cdr_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdr_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create a Bronze table for a specific CDR type&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@dlt.table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cdr_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_cdrs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cdr_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; CDR data from Kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;table_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_bronze_table_properties&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bronze_cdr_table&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;get_standard_bronze_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_from_kafka&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use helper functions to ensure consistent table properties and column structures across all bronze tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_bronze_table_properties&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return standard bronze table properties&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipelines.autoOptimize.managed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipelines.reset.allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_standard_bronze_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return standardized bronze layer columns&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
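&lt;p&gt;&lt;em&gt;The table definitions above also rely on a &lt;code&gt;read_from_kafka&lt;/code&gt; helper that isn't shown. A plausible sketch, assuming Confluent-style SASL/SSL auth with the &lt;code&gt;api_key&lt;/code&gt;/&lt;code&gt;api_secret&lt;/code&gt; retrieved below (option names follow the Spark Kafka source; the exact helper in the repo may differ):&lt;/em&gt;&lt;/p&gt;

```python
def kafka_source_options(bootstrap_servers, api_key, api_secret, topic):
    """Spark Kafka source options with SASL_SSL auth (Confluent-style)."""
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{api_key}" password="{api_secret}";'
    )
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }

def read_from_kafka(topic):
    """Return a streaming DataFrame for one topic.

    Assumes `spark`, KAFKA_BOOTSTRAP_SERVERS, api_key, and api_secret
    are provided by the surrounding pipeline context.
    """
    opts = kafka_source_options(KAFKA_BOOTSTRAP_SERVERS, api_key, api_secret, topic)
    return spark.readStream.format("kafka").options(**opts).load()
```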



&lt;h3&gt;Secure Credential Management&lt;/h3&gt;

&lt;p&gt;For security, we retrieve Kafka credentials from Databricks secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get Kafka credentials from Databricks secret
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DBUtils&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;kafka_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env_scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;telco-kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Extract values from the secret
&lt;/span&gt;&lt;span class="n"&gt;KAFKA_BOOTSTRAP_SERVERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafka_settings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bootstrap_server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafka_settings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;api_secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafka_settings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_secret&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach ensures that sensitive credentials are never hardcoded in our pipeline code.&lt;/p&gt;
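&lt;p&gt;&lt;em&gt;For reference, the secret stored under &lt;code&gt;telco-kafka&lt;/code&gt; is just a JSON document with the three fields read above (an assumed shape inferred from the keys in the snippet; the placeholder values are illustrative):&lt;/em&gt;&lt;/p&gt;

```python
import json

# Assumed shape of the secret value, matching the keys read by the pipeline.
secret_value = json.dumps({
    "bootstrap_server": "my-cluster.example.com:9092",
    "api_key": "MY_API_KEY",
    "api_secret": "MY_API_SECRET",
})

# The pipeline then unpacks it exactly as shown above:
kafka_settings = json.loads(secret_value)
KAFKA_BOOTSTRAP_SERVERS = kafka_settings["bootstrap_server"]
```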

&lt;h2&gt;Deployment Automation&lt;/h2&gt;

&lt;p&gt;We use Databricks Asset Bundles for deployment automation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy to development&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;dlt_telco
databricks bundle deploy &lt;span class="nt"&gt;--target&lt;/span&gt; dev

&lt;span class="c"&gt;# Deploy to production&lt;/span&gt;
databricks bundle deploy &lt;span class="nt"&gt;--target&lt;/span&gt; prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Asset Bundles give us repeatable deployments across environments, while serverless compute scales the pipeline automatically without any cluster management.&lt;/p&gt;
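&lt;p&gt;&lt;em&gt;A minimal &lt;code&gt;databricks.yml&lt;/code&gt; for the two targets might look like the following (a sketch only; the repo's actual bundle config will have more detail, and names like &lt;code&gt;telco_cdr_pipeline&lt;/code&gt; are illustrative):&lt;/em&gt;&lt;/p&gt;

```yaml
bundle:
  name: dlt_telco

resources:
  pipelines:
    telco_cdr_pipeline:
      name: telco-cdr-pipeline
      serverless: true
      catalog: main
      target: telco           # destination schema for the pipeline's tables
      libraries:
        - notebook:
            path: ./src/bronze_layer

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production
```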

&lt;h2&gt;Results and Benefits&lt;/h2&gt;

&lt;p&gt;With our bronze layer implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Ingestion&lt;/strong&gt;: CDRs are available for analysis seconds after generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Preservation&lt;/strong&gt;: Original records are preserved for compliance and auditability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Serverless compute handles millions of records per minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: All credentials managed through Databricks Secret Scopes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: Asset Bundle deployment simplifies environment management&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;The Power of a Unified Platform&lt;/h3&gt;

&lt;p&gt;What makes this approach particularly exciting is achieving &lt;strong&gt;everything within one platform&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delta Live Tables (DLT)&lt;/strong&gt; for streaming data ingestion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Asset Bundles (DAB)&lt;/strong&gt; for deployment automation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unity Catalog&lt;/strong&gt; for governance and lineage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless Compute&lt;/strong&gt; for auto-scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in monitoring&lt;/strong&gt; and alerting capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated dashboards&lt;/strong&gt; for real-time insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This unified approach eliminates the complexity of managing multiple tools and platforms, allowing teams to focus on building value rather than managing infrastructure.&lt;/p&gt;

&lt;h2&gt;What's Next?&lt;/h2&gt;

&lt;p&gt;In Part 2 of this series, we'll build the silver layer of our medallion architecture. We'll focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data validation and quality enforcement&lt;/li&gt;
&lt;li&gt;Schema standardization across CDR types&lt;/li&gt;
&lt;li&gt;Enrichment with user and reference data&lt;/li&gt;
&lt;li&gt;Error handling and data recovery patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stay tuned as we continue building our Telco CDR processing pipeline!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog post is part of a series on building data processing pipelines for telecommunications using Databricks Delta Live Tables. Follow along as we progress from raw data ingestion to advanced analytics and machine learning.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
