
Modernizing Data Movement for the AI-Ready Enterprise

Introduction

No matter what type of artificial intelligence workload your business implements, it requires high-quality data to operate effectively. From recommendation engines to conversational AI assistants, every AI application needs a robust data foundation that guarantees both the timely delivery and the integrity of information.

Despite this requirement, many organizations still rely on legacy batch ETL pipelines built mainly for static reporting needs. While such pipelines may serve a company's traditional reporting strategy, they fall far short when it comes to building AI solutions. That is why modernizing data movement should be among your top priorities.

Why Traditional ETL Fails for AI Workloads

Traditional ETL consists of three basic steps: extracting, transforming, and loading data. Though this process worked fine for classic dashboards, it fails AI and machine learning systems in several crucial aspects, because those systems often rely on near real-time data streams.

AI applications place demands that batch ETL struggles to meet:

  • they must integrate different types of information, combining streaming and historical data

  • large and diverse datasets require scalable processing capabilities

  • upstream changes should not stop pipelines from operating effectively

If your pipeline is inefficient, you can expect poor accuracy from your ML models or late delivery of the insights the system generates.

What Modern Data Movement Requires

Compared to their predecessors, modern data pipelines are much more scalable and flexible because they can accommodate all types of data regardless of its structure, size, or velocity. In addition to supporting traditional batch processing, they allow continuous and event-driven streaming, which is essential in developing AI-driven solutions.

Characteristics of a modern data pipeline include the following:

  • ELT over ETL so raw data can land quickly before transformation

  • streaming ingestion for time-sensitive workloads

  • event-driven design to trigger processing as changes happen

  • lakehouse storage for unified structured and semi-structured data

  • schema evolution to handle changing source systems

  • governance and lineage for trust and compliance
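To make the first point concrete, ELT lands raw data untouched and transforms it only after it is safely stored, so ingestion never blocks on transformation logic. The pure-Python sketch below uses a hypothetical record shape to illustrate the two steps:

```python
import json

# Raw payloads arriving from a source system (hypothetical shape)
raw_events = ['{"order_id": "o1", "quantity": "2"}', '{"order_id": "o2", "quantity": "3"}']

# ELT step 1: land the raw data as-is into a "bronze" landing zone
bronze = [json.loads(e) for e in raw_events]

# ELT step 2: transform afterwards, e.g. casting quantity to an integer
silver = [{"order_id": e["order_id"], "quantity": int(e["quantity"])} for e in bronze]

print(silver)  # [{'order_id': 'o1', 'quantity': 2}, {'order_id': 'o2', 'quantity': 3}]
```

If the transformation in step 2 fails, the raw data is already persisted and can simply be reprocessed, which is the key operational advantage over transform-before-load ETL.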

Reference Architecture of an AI-Ready Data Pipeline

As we discussed above, a typical modern data architecture follows a layered structure similar to that of a lakehouse. Let us review each layer below:

Sources: operational databases, SaaS solutions, RESTful APIs, IoT sensors, log files, external datasets.

Ingestion: batch and streaming intake with Apache Kafka, Azure Event Hubs, or cloud data integration services.

Processing: distributed processing engines like Apache Spark and PySpark.

Storage: cloud lakehouse platforms like Delta Lake and Microsoft Fabric Lakehouse.

Consumption: Power BI dashboards, machine learning models, feature stores, notebooks, AI-driven applications.

With this architecture in place, a single unified data foundation serves both dashboards and machine learning.
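The flow through these layers can be sketched in a few lines of plain Python, with hypothetical in-memory stand-ins replacing the real infrastructure:

```python
def ingest():
    # Ingestion layer: in reality Kafka, Event Hubs, or a cloud integration service
    return [{"product_id": "p1", "quantity": 2}, {"product_id": "p1", "quantity": 3}]

def process(records):
    # Processing layer: in reality a distributed Spark job
    totals = {}
    for r in records:
        totals[r["product_id"]] = totals.get(r["product_id"], 0) + r["quantity"]
    return totals

def store(totals, lakehouse):
    # Storage layer: in reality a Delta Lake or Fabric Lakehouse table,
    # read by the consumption layer (dashboards, models, notebooks)
    lakehouse["product_sales"] = totals

lakehouse = {}
store(process(ingest()), lakehouse)
print(lakehouse)  # {'product_sales': {'p1': 5}}
```

The point of the layering is that each stage can be swapped independently: batch ingestion can become streaming, or the processing engine can change, without touching the consumers.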

Example of Coding Schema Evolution in Delta Lake

As we mentioned previously, schema evolution is key to handling changing data sources properly. Without it, strict schema enforcement will reject writes that carry new columns and break the pipeline.

Delta Lake makes managing evolving schemas straightforward:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SchemaEvolutionExample").getOrCreate()

df = spark.read.json("/data/incoming/retail_events")

# mergeSchema lets columns that are new in the source
# be added to the target Delta table on write
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/data/lakehouse/retail_events")
```

As you can see, this code seamlessly adds columns received from external data sources to your Delta table.
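Conceptually, schema merging is a union of the existing table schema with the incoming one. The simplified pure-Python sketch below, using hypothetical column names, illustrates the idea (it is a stand-in for the behavior, not Delta Lake's actual implementation):

```python
def merge_schema(table_schema, incoming_schema):
    """Return the table schema extended with any new incoming columns.

    Simplified stand-in for mergeSchema: existing columns keep their
    order, columns seen for the first time are appended at the end.
    """
    merged = list(table_schema)
    for column in incoming_schema:
        if column not in merged:
            merged.append(column)
    return merged

# Hypothetical example: the source starts sending a "discount" column
table = ["order_id", "product_id", "quantity"]
incoming = ["order_id", "product_id", "quantity", "discount"]
print(merge_schema(table, incoming))
# ['order_id', 'product_id', 'quantity', 'discount']
```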

Example of Real-Time Data Pipeline

The example in the previous section is pretty straightforward. The next one illustrates the real difference between traditional and modern data pipelines: a continuously running streaming job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RetailStreamingPipeline").getOrCreate()

# Continuously read JSON order events as they arrive
stream_df = spark.readStream \
    .format("json") \
    .schema("order_id STRING, product_id STRING, quantity INT, event_time TIMESTAMP") \
    .load("/data/stream/orders")

# Maintain a running total of units sold per product
aggregated_df = stream_df.groupBy("product_id").sum("quantity")

# Write the aggregates to a Delta table; the checkpoint
# lets the query recover from failures without data loss
query = aggregated_df.writeStream \
    .format("delta") \
    .outputMode("complete") \
    .option("checkpointLocation", "/checkpoints/orders") \
    .start("/data/lakehouse/product_sales")

query.awaitTermination()

The code above runs a streaming pipeline that continuously aggregates sales transactions.
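To make clear what the aggregation computes, here is the same grouping logic applied to a small hypothetical batch of order events in plain Python:

```python
from collections import defaultdict

# Hypothetical order events matching the streaming schema above
events = [
    {"order_id": "o1", "product_id": "p1", "quantity": 2},
    {"order_id": "o2", "product_id": "p2", "quantity": 1},
    {"order_id": "o3", "product_id": "p1", "quantity": 3},
]

# Equivalent of groupBy("product_id").sum("quantity")
totals = defaultdict(int)
for event in events:
    totals[event["product_id"]] += event["quantity"]

print(dict(totals))  # {'p1': 5, 'p2': 1}
```

The difference in the streaming version is that the totals are updated incrementally as each new event arrives, instead of being recomputed over a daily batch.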

Example of Data Quality Verification

Data quality checks belong in every AI pipeline: invalid records should be filtered out before they reach models or dashboards. A simple validation in PySpark looks as follows:

```python
from pyspark.sql.functions import col

# df is a DataFrame of incoming transaction records
validated_df = df.filter(
    col("customer_id").isNotNull() &
    col("transaction_amount").isNotNull() &
    (col("transaction_amount") >= 0)
)
```

This filter keeps only records with a known customer and a non-negative transaction amount, discarding untrustworthy rows before they reach downstream consumers.
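The same rules can be expressed in plain Python. The sketch below, using hypothetical records, also keeps the rejected rows so they can be quarantined for inspection rather than silently dropped:

```python
def split_valid(records):
    """Partition records into (valid, rejected) using the rules above."""
    valid, rejected = [], []
    for r in records:
        ok = (
            r.get("customer_id") is not None
            and r.get("transaction_amount") is not None
            and r["transaction_amount"] >= 0
        )
        (valid if ok else rejected).append(r)
    return valid, rejected

# Hypothetical records: one clean, one missing a customer, one negative
records = [
    {"customer_id": "c1", "transaction_amount": 25.0},
    {"customer_id": None, "transaction_amount": 10.0},
    {"customer_id": "c2", "transaction_amount": -5.0},
]
valid, rejected = split_valid(records)
print(len(valid), len(rejected))  # 1 2
```

Quarantining rejected rows instead of dropping them makes it possible to audit why data was excluded and to fix upstream sources.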

Example of Real-Time Retail Pipeline

Now let us consider a retail enterprise implementing dashboards and AI-based demand forecasting in parallel. The company receives sales transactions, e-commerce data, customer engagement metrics, etc., from different sources.

In a traditional reporting environment, all data might be loaded into the database once a day. For AI tasks like demand forecasting or product recommendations, such latency is unacceptable.

A retail pipeline capable of ingesting sales transactions, merging them with historical inventory and customer data, and then delivering information to various destinations looks as follows:

  • Power BI dashboards for sales monitoring

  • machine learning models for demand forecasting

  • recommendation systems for personalization

  • alerting systems for anomaly detection
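A simplified fan-out to several destinations can be sketched as follows, with hypothetical in-memory handlers standing in for the real Power BI, forecasting, recommendation, and alerting sinks:

```python
def fan_out(record, sinks):
    """Deliver one aggregated record to every registered sink."""
    delivered = []
    for name, handler in sinks.items():
        handler(record)
        delivered.append(name)
    return delivered

# Hypothetical in-memory sinks standing in for the real destinations
dashboard, forecasts, recommendations, alerts = [], [], [], []
sinks = {
    "dashboard": dashboard.append,              # sales monitoring
    "forecasting": forecasts.append,            # demand forecasting model
    "recommendations": recommendations.append,  # personalization
    "alerts": alerts.append,                    # anomaly detection
}

record = {"product_id": "p1", "units_sold": 5}
print(fan_out(record, sinks))
# ['dashboard', 'forecasting', 'recommendations', 'alerts']
```

The design choice here is that the pipeline produces one enriched record and each consumer subscribes to it, rather than every consumer re-ingesting the raw sources.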

Governance, Security, and Control of AI Pipelines

For any data pipeline to work effectively, especially one feeding AI applications, it must comply with strict governance requirements and security controls.

Here is a list of important capabilities that you should implement in your data pipelines:

  • data lineage to trace data from source to model or dashboard

  • role-based access control to secure sensitive datasets

  • audit logging to monitor pipeline activity and usage

  • schema governance to manage changes safely

  • API security for authenticated and authorized access

All these mechanisms are crucial for securing your pipelines and ensuring that ML models are trained on high-quality, trustworthy data.
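As one concrete example, role-based access control can start as simply as checking a role-to-dataset mapping before a read is allowed. The sketch below uses hypothetical roles and dataset names; real platforms enforce this at the catalog or storage layer:

```python
# Hypothetical role-to-dataset permissions
PERMISSIONS = {
    "analyst": {"product_sales"},
    "data_scientist": {"product_sales", "customer_profiles"},
}

def can_read(role, dataset):
    """Return True if the given role is allowed to read the dataset."""
    return dataset in PERMISSIONS.get(role, set())

print(can_read("analyst", "customer_profiles"))         # False
print(can_read("data_scientist", "customer_profiles"))  # True
```

Unknown roles default to no access, which is the safe failure mode for sensitive datasets.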

Conclusion

Modernizing your data movement architecture is arguably the most critical step towards building an AI-ready enterprise. Although classical batch ETL pipelines have proven effective for historical reporting, they hardly meet today's needs.

By introducing lakehouse technologies, event-driven architecture, scalable distributed processing, and proper schema governance, you can create a robust pipeline architecture to fuel analytics and AI.
