<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pat Sienkiewicz</title>
    <description>The latest articles on DEV Community by Pat Sienkiewicz (@sienkiewicz_pat).</description>
    <link>https://dev.to/sienkiewicz_pat</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F371733%2F9ef8978d-5ec9-4f46-be95-560ffd78ee6b.jpg</url>
      <title>DEV Community: Pat Sienkiewicz</title>
      <link>https://dev.to/sienkiewicz_pat</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sienkiewicz_pat"/>
    <language>en</language>
    <item>
      <title>Building a Scalable Telco CDR Processing Pipeline with Databricks Delta Live Tables - Part 1 [Databricks Free Edition]</title>
      <dc:creator>Pat Sienkiewicz</dc:creator>
      <pubDate>Fri, 29 Aug 2025 11:05:56 +0000</pubDate>
      <link>https://dev.to/sienkiewicz_pat/building-a-scalable-telco-cdr-processing-pipeline-with-databricks-delta-live-tables-part-1-1gmf</link>
      <guid>https://dev.to/sienkiewicz_pat/building-a-scalable-telco-cdr-processing-pipeline-with-databricks-delta-live-tables-part-1-1gmf</guid>
      <description>&lt;p&gt;&lt;em&gt;In this multi-part series, we'll explore how to build a modern, scalable pipeline for processing telecom Call Detail Records (CDRs) using Databricks Delta Live Tables. Part 1 focuses on the foundation: data generation and the bronze layer implementation.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Telecommunications companies process billions of Call Detail Records (CDRs) daily. These records capture every interaction with the network—voice calls, text messages, data sessions, and more. Processing this data efficiently is critical for billing, network optimization, fraud detection, and customer experience management.&lt;/p&gt;

&lt;p&gt;In this series, we'll build a complete Telco CDR processing pipeline using Databricks Delta Live Tables (DLT). We'll follow the medallion architecture pattern, with bronze, silver, and gold layers that progressively refine raw data into valuable business insights.&lt;/p&gt;

&lt;h2&gt;The Challenge&lt;/h2&gt;

&lt;p&gt;Telecom data presents several unique challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: Billions of records generated daily&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variety&lt;/strong&gt;: Multiple CDR types with different schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Velocity&lt;/strong&gt;: Real-time processing requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity&lt;/strong&gt;: Intricate relationships between users, devices, and network elements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt;: Strict regulatory requirements for data retention and privacy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Traditional batch processing approaches struggle with these challenges. We need a modern, streaming-first architecture that can handle the scale and complexity of telecom data.&lt;/p&gt;

&lt;h2&gt;Our Solution&lt;/h2&gt;

&lt;p&gt;Git repo: &lt;a href="https://github.com/cloud-data-engineer/data/blob/main/dlt_telco/README.md" rel="noopener noreferrer"&gt;dlt_telco&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;We're building a solution with two main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Generator&lt;/strong&gt;: A synthetic CDR generator that produces realistic telecom data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DLT Pipeline&lt;/strong&gt;: A Delta Live Tables pipeline that processes the data through medallion architecture layers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In Part 1, we'll focus on the data generator and the bronze layer implementation.&lt;/p&gt;

&lt;h2&gt;Data Generator: Creating Realistic Synthetic Data&lt;/h2&gt;

&lt;p&gt;For development and testing, we need a way to generate realistic CDR data. Our generator creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User Profiles&lt;/strong&gt;: Synthetic subscriber data with identifiers (MSISDN, IMSI, IMEI), plan details, and location information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple CDR Types&lt;/strong&gt;: Voice, data, SMS, VoIP, and IMS records with appropriate attributes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka Integration&lt;/strong&gt;: Direct streaming to Kafka topics for real-time ingestion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The generator ensures referential integrity between users and CDRs, making it possible to perform realistic joins and aggregations in downstream processing.&lt;/p&gt;

&lt;h3&gt;User Profile Generation&lt;/h3&gt;

&lt;p&gt;Our user generator creates profiles with realistic telecom attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Sample user profile structure
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;msisdn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1234567890&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imsi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;310150123456789&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;imei&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;490154203237518&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Premium Unlimited&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_limit_gb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voice_minutes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sms_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;registration_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023-05-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Seattle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
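&lt;p&gt;&lt;em&gt;As a rough sketch of how such profiles can be produced (the field names follow the sample above, but the &lt;code&gt;make_user_profile&lt;/code&gt; helper and the numbering schemes are illustrative, not the exact generator from the repo):&lt;/em&gt;&lt;/p&gt;

```python
import random
from datetime import date, timedelta

def make_user_profile(user_index):
    """Build one synthetic subscriber profile (illustrative values)."""
    plans = ["Basic", "Standard", "Premium Unlimited"]
    return {
        "user_id": f"user_{user_index}",
        # MSISDN: 10-digit subscriber number
        "msisdn": "".join(random.choices("0123456789", k=10)),
        # IMSI: MCC 310 (US) + MNC 150 + a 9-digit MSIN
        "imsi": "310150" + "".join(random.choices("0123456789", k=9)),
        # IMEI: 15-digit equipment identifier
        "imei": "".join(random.choices("0123456789", k=15)),
        "plan_name": random.choice(plans),
        "data_limit_gb": random.choice([5, 20, 50]),
        "voice_minutes": random.choice([200, 500, 1000]),
        "sms_count": random.choice([100, 500, 1000]),
        "registration_date": (
            date.today() - timedelta(days=random.randint(0, 730))
        ).isoformat(),
        "active": True,
        "location": {"city": "Seattle", "state": "WA"},
    }

users = [make_user_profile(i) for i in range(100)]
```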



&lt;h3&gt;CDR Generation&lt;/h3&gt;

&lt;p&gt;The CDR generator produces five types of records, each with appropriate attributes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Voice CDRs&lt;/strong&gt;: Call duration, calling/called numbers, cell tower IDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data CDRs&lt;/strong&gt;: Session duration, uplink/downlink volumes, APN information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SMS CDRs&lt;/strong&gt;: Message size, sender/receiver information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VoIP CDRs&lt;/strong&gt;: SIP endpoints, codec information, quality metrics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IMS CDRs&lt;/strong&gt;: Service type, session details, network elements&lt;/li&gt;
&lt;/ol&gt;
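&lt;p&gt;&lt;em&gt;To preserve referential integrity, each CDR draws its subscriber identifiers from an already generated user profile. A minimal sketch for a voice CDR (field names are illustrative; the real generator covers all five record types):&lt;/em&gt;&lt;/p&gt;

```python
import random
import uuid
from datetime import datetime, timezone

# Assume a pool of previously generated user profiles.
users = [
    {"user_id": f"user_{i}", "msisdn": f"12065550{i:03d}", "imsi": f"310150{i:09d}"}
    for i in range(100)
]

def make_voice_cdr(user_pool):
    """Build one voice CDR whose identifiers come from an existing user."""
    caller = random.choice(user_pool)  # referential integrity: reuse a real subscriber
    callee = random.choice(user_pool)
    return {
        "record_type": "voice",
        "cdr_id": str(uuid.uuid4()),
        "calling_number": caller["msisdn"],
        "called_number": callee["msisdn"],
        "imsi": caller["imsi"],
        "duration_seconds": random.randint(1, 3600),
        "cell_tower_id": f"tower_{random.randint(1, 500)}",
        "start_time": datetime.now(timezone.utc).isoformat(),
    }

cdr = make_voice_cdr(users)
```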

&lt;h3&gt;Kafka Integration&lt;/h3&gt;

&lt;p&gt;The generator streams data to dedicated Kafka topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;telco-users&lt;/code&gt;: User profile data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;telco-voice-cdrs&lt;/code&gt;: Voice call records&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;telco-data-cdrs&lt;/code&gt;: Data usage records&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;telco-sms-cdrs&lt;/code&gt;: SMS message records&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;telco-voip-cdrs&lt;/code&gt;: VoIP call records&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;telco-ims-cdrs&lt;/code&gt;: IMS session records&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This streaming approach mimics real-world telecom environments where CDRs flow continuously from network elements to processing systems.&lt;/p&gt;
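&lt;p&gt;&lt;em&gt;Routing a record to its topic is a simple mapping on the record type; the actual send would then go through a Kafka producer (e.g. &lt;code&gt;confluent_kafka.Producer.produce&lt;/code&gt;, omitted here so the sketch stays dependency-free):&lt;/em&gt;&lt;/p&gt;

```python
import json

# One topic per record type, matching the list above.
TOPIC_BY_TYPE = {
    "user": "telco-users",
    "voice": "telco-voice-cdrs",
    "data": "telco-data-cdrs",
    "sms": "telco-sms-cdrs",
    "voip": "telco-voip-cdrs",
    "ims": "telco-ims-cdrs",
}

def to_kafka_message(record):
    """Return (topic, key, value) for a generated record."""
    topic = TOPIC_BY_TYPE[record["record_type"]]
    # Key by user for profiles, by CDR id for call records.
    key = record.get("user_id") or record.get("cdr_id")
    return topic, key, json.dumps(record)

# With a real producer: producer.produce(topic, key=key, value=value)
topic, key, value = to_kafka_message(
    {"record_type": "sms", "cdr_id": "abc-123", "message_size": 42}
)
```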

&lt;h2&gt;Bronze Layer: Raw Data Ingestion with Delta Live Tables&lt;/h2&gt;

&lt;p&gt;The bronze layer is the foundation of our medallion architecture. It ingests raw data from Kafka with minimal transformation, preserving the original content for compliance and auditability.&lt;/p&gt;

&lt;h3&gt;Key Features&lt;/h3&gt;

&lt;p&gt;Our bronze layer implementation provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Ingestion&lt;/strong&gt;: Real-time data processing from Kafka&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Preservation&lt;/strong&gt;: Maintains original message structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Tracking&lt;/strong&gt;: Captures Kafka metadata (timestamp, topic, key)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Secure credential management via Databricks secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Serverless Delta Live Tables for auto-scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Bronze Tables Structure&lt;/h3&gt;

&lt;p&gt;Our bronze layer includes seven tables:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Table Name&lt;/th&gt;
&lt;th&gt;Source Topic&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_users&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-users&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Raw user profile data with parsed JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_voice_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-voice-cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Voice call detail records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_data_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-data-cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Data session records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_sms_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-sms-cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SMS message records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_voip_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-voip-cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VoIP call records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_ims_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;telco-ims-cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;IMS session records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze_all_cdrs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All CDR topics&lt;/td&gt;
&lt;td&gt;Multiplexed view of all CDR types&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each table preserves the original Kafka metadata (key, timestamp, topic) alongside the raw data, enabling reprocessing if needed.&lt;/p&gt;
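&lt;p&gt;&lt;em&gt;The multiplexed &lt;code&gt;bronze_all_cdrs&lt;/code&gt; table can be fed by one stream subscribed to every CDR topic at once; with the Spark Kafka source that is a comma-separated &lt;code&gt;subscribe&lt;/code&gt; option (a sketch using standard source option names; the repo's implementation may differ):&lt;/em&gt;&lt;/p&gt;

```python
CDR_TOPICS = [
    "telco-voice-cdrs",
    "telco-data-cdrs",
    "telco-sms-cdrs",
    "telco-voip-cdrs",
    "telco-ims-cdrs",
]

def multiplexed_kafka_options(bootstrap_servers, topics):
    """Kafka source options for a single stream covering all CDR topics."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": ",".join(topics),  # one stream, many topics
        "startingOffsets": "earliest",
    }

opts = multiplexed_kafka_options("broker:9092", CDR_TOPICS)
# spark.readStream.format("kafka").options(**opts).load() keeps the `topic`
# column, so downstream layers can split records back out by CDR type.
```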

&lt;h3&gt;Table Schema&lt;/h3&gt;

&lt;p&gt;All bronze tables follow a consistent schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;bronze_&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;-- Kafka message key&lt;/span&gt;
  &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;-- Kafka message timestamp&lt;/span&gt;
  &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;-- Source Kafka topic&lt;/span&gt;
  &lt;span class="n"&gt;processing_time&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;-- DLT processing timestamp&lt;/span&gt;
  &lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;-- Original JSON payload&lt;/span&gt;
  &lt;span class="n"&gt;parsed_data&lt;/span&gt; &lt;span class="n"&gt;STRUCT&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;       &lt;span class="c1"&gt;-- Parsed JSON (users table only)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Code Implementation&lt;/h3&gt;

&lt;p&gt;Here's how we define our bronze tables using DLT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze_users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw user data from Kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_bronze_table_properties&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bronze_users&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_standard_bronze_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_from_kafka&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;telco-users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parsed_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;from_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;user_schema&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_bronze_cdr_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cdr_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create a Bronze table for a specific CDR type&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@dlt.table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cdr_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_cdrs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cdr_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; CDR data from Kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;table_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_bronze_table_properties&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bronze_cdr_table&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;get_standard_bronze_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;read_from_kafka&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use helper functions to ensure consistent table properties and column structures across all bronze tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_bronze_table_properties&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return standard bronze table properties&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipelines.autoOptimize.managed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipelines.reset.allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_standard_bronze_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return standardized bronze layer columns&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
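&lt;p&gt;&lt;em&gt;The table definitions above also rely on a &lt;code&gt;read_from_kafka&lt;/code&gt; helper that isn't shown. A plausible sketch, assuming Confluent-style SASL/SSL auth with the &lt;code&gt;api_key&lt;/code&gt;/&lt;code&gt;api_secret&lt;/code&gt; retrieved below (option names follow the Spark Kafka source; the exact helper in the repo may differ):&lt;/em&gt;&lt;/p&gt;

```python
def kafka_source_options(bootstrap_servers, api_key, api_secret, topic):
    """Spark Kafka source options with SASL_SSL auth (Confluent-style)."""
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{api_key}" password="{api_secret}";'
    )
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }

def read_from_kafka(topic):
    """Return a streaming DataFrame for one topic.

    Assumes `spark`, KAFKA_BOOTSTRAP_SERVERS, api_key, and api_secret
    are provided by the surrounding pipeline context.
    """
    opts = kafka_source_options(KAFKA_BOOTSTRAP_SERVERS, api_key, api_secret, topic)
    return spark.readStream.format("kafka").options(**opts).load()
```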



&lt;h3&gt;Secure Credential Management&lt;/h3&gt;

&lt;p&gt;For security, we retrieve Kafka credentials from Databricks secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get Kafka credentials from Databricks secret
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DBUtils&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;kafka_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env_scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;telco-kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Extract values from the secret
&lt;/span&gt;&lt;span class="n"&gt;KAFKA_BOOTSTRAP_SERVERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafka_settings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bootstrap_server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafka_settings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;api_secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafka_settings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_secret&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach ensures that sensitive credentials are never hardcoded in our pipeline code.&lt;/p&gt;
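&lt;p&gt;&lt;em&gt;For reference, the secret stored under &lt;code&gt;telco-kafka&lt;/code&gt; is just a JSON document with the three fields read above (an assumed shape inferred from the keys in the snippet; the placeholder values are illustrative):&lt;/em&gt;&lt;/p&gt;

```python
import json

# Assumed shape of the secret value, matching the keys read by the pipeline.
secret_value = json.dumps({
    "bootstrap_server": "my-cluster.example.com:9092",
    "api_key": "MY_API_KEY",
    "api_secret": "MY_API_SECRET",
})

# The pipeline then unpacks it exactly as shown above:
kafka_settings = json.loads(secret_value)
KAFKA_BOOTSTRAP_SERVERS = kafka_settings["bootstrap_server"]
```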

&lt;h2&gt;Deployment Automation&lt;/h2&gt;

&lt;p&gt;We use Databricks Asset Bundles for deployment automation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy to development&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;dlt_telco
databricks bundle deploy &lt;span class="nt"&gt;--target&lt;/span&gt; dev

&lt;span class="c"&gt;# Deploy to production&lt;/span&gt;
databricks bundle deploy &lt;span class="nt"&gt;--target&lt;/span&gt; prod
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Asset Bundles give us repeatable deployments across environments, while serverless compute scales the pipeline automatically without any cluster management.&lt;/p&gt;
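&lt;p&gt;&lt;em&gt;A minimal &lt;code&gt;databricks.yml&lt;/code&gt; for the two targets might look like the following (a sketch only; the repo's actual bundle config will have more detail, and names like &lt;code&gt;telco_cdr_pipeline&lt;/code&gt; are illustrative):&lt;/em&gt;&lt;/p&gt;

```yaml
bundle:
  name: dlt_telco

resources:
  pipelines:
    telco_cdr_pipeline:
      name: telco-cdr-pipeline
      serverless: true
      catalog: main
      target: telco           # destination schema for the pipeline's tables
      libraries:
        - notebook:
            path: ./src/bronze_layer

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production
```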

&lt;h2&gt;Results and Benefits&lt;/h2&gt;

&lt;p&gt;With our bronze layer implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Ingestion&lt;/strong&gt;: CDRs are available for analysis seconds after generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Preservation&lt;/strong&gt;: Original records are preserved for compliance and auditability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Serverless compute handles millions of records per minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: All credentials managed through Databricks Secret Scopes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: Asset Bundle deployment simplifies environment management&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;The Power of a Unified Platform&lt;/h3&gt;

&lt;p&gt;What makes this approach particularly exciting is achieving &lt;strong&gt;everything within one platform&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delta Live Tables (DLT)&lt;/strong&gt; for streaming data ingestion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Asset Bundles (DAB)&lt;/strong&gt; for deployment automation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unity Catalog&lt;/strong&gt; for governance and lineage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless Compute&lt;/strong&gt; for auto-scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in monitoring&lt;/strong&gt; and alerting capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated dashboards&lt;/strong&gt; for real-time insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This unified approach eliminates the complexity of managing multiple tools and platforms, allowing teams to focus on building value rather than managing infrastructure.&lt;/p&gt;

&lt;h2&gt;What's Next?&lt;/h2&gt;

&lt;p&gt;In Part 2 of this series, we'll build the silver layer of our medallion architecture. We'll focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data validation and quality enforcement&lt;/li&gt;
&lt;li&gt;Schema standardization across CDR types&lt;/li&gt;
&lt;li&gt;Enrichment with user and reference data&lt;/li&gt;
&lt;li&gt;Error handling and data recovery patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stay tuned as we continue building our Telco CDR processing pipeline!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This blog post is part of a series on building data processing pipelines for telecommunications using Databricks Delta Live Tables. Follow along as we progress from raw data ingestion to advanced analytics and machine learning.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
