<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Harsh Patel</title>
    <description>The latest articles on DEV Community by Harsh Patel (@harsh9410).</description>
    <link>https://dev.to/harsh9410</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2907672%2F5d325946-3cc4-4432-b815-9d9fce182760.jpg</url>
      <title>DEV Community: Harsh Patel</title>
      <link>https://dev.to/harsh9410</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harsh9410"/>
    <language>en</language>
    <item>
      <title>AI-Driven Data Engineering: Building Real-Time Intelligence Pipelines</title>
      <dc:creator>Harsh Patel</dc:creator>
      <pubDate>Wed, 08 Oct 2025 02:15:32 +0000</pubDate>
      <link>https://dev.to/harsh9410/ai-driven-data-engineering-building-real-time-intelligence-pipelines-5an6</link>
      <guid>https://dev.to/harsh9410/ai-driven-data-engineering-building-real-time-intelligence-pipelines-5an6</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Data engineering is changing faster than ever. What once focused on building ETL jobs and managing batch pipelines has become a discipline at the intersection of real-time analytics and artificial intelligence (AI). Businesses no longer have the luxury of waiting for batch reports. They need insights, and often automated decisions, the moment data arrives.&lt;br&gt;
This shift has been driven by the rise of streaming frameworks like Apache Kafka, Spark Streaming, and Delta Lake, and their integration with AI techniques such as anomaly detection, pattern recognition, and reinforcement learning. This combination of tools lets organizations make real-time decisions based on proactive rather than reactive reporting.&lt;br&gt;
In this article, we will explore how AI is reshaping data engineering today by walking through real-world use cases, examining the technology stack, and looking at the challenges and opportunities ahead.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How AI Is Reshaping Data Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Smarter Pipelines Through Automation&lt;/strong&gt;&lt;br&gt;
In the past, pipelines required constant tuning: engineers had to fix bottlenecks manually and write rules to handle exceptions. Today, AI-driven automation is changing that. Modern platforms can predict pipeline failures, rebalance loads, and even suggest schema adjustments in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Machine Learning Inside the Data Layer&lt;/strong&gt;&lt;br&gt;
Data models are no longer static. With ML integrated directly into warehouses and lakes, models can adapt to shifts in customer behavior or data quality. Tools like Google AutoML or H2O.ai let engineers embed predictive logic right into their workflows. Pipelines like these deliver not just clean data but intelligence as part of the stream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-Time Insights as the Default&lt;/strong&gt;&lt;br&gt;
Batch is no longer enough. Businesses like banks, streaming platforms, and airlines can’t wait hours for analysis; they need real-time data. AI-enabled streaming lets engineers build pipelines that process, enrich, and analyze streams on the fly. From fraud detection to churn prediction to price optimization, decisions can now be made in milliseconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Governance and Trust at Scale&lt;/strong&gt;&lt;br&gt;
AI is increasingly being applied to enforce governance and ensure compliance. Data quality checks, anomaly detection for regulatory compliance, and explainability tools are becoming essential: without trust in the pipeline, real-time AI is just a risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Real-Time AI in Action: Use Cases&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For Fraud Detection&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A.    Financial Services Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Kafka ingests transaction events.&lt;/li&gt;
&lt;li&gt; Spark Streaming cleanses and enriches them with user profile data.&lt;/li&gt;
&lt;li&gt; AI models (TensorFlow/PyTorch via MLflow) score transactions for fraud risk.&lt;/li&gt;
&lt;li&gt; Decision layers in Flink or Kafka Streams approve, block, or flag activity.&lt;/li&gt;
&lt;li&gt; Power BI dashboards highlight suspicious activity and trigger fraud alerts.&lt;/li&gt;
&lt;/ol&gt;
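
&lt;p&gt;The five steps above can be sketched in plain Python. The functions below are hypothetical stand-ins for the Kafka, Spark, model, and Flink stages, not any real API, and the fields and thresholds are illustrative only.&lt;/p&gt;

```python
# Minimal sketch of the fraud-scoring flow, with plain Python stand-ins
# for ingestion, enrichment, scoring, and the decision layer.

def enrich(event, profiles):
    """Join a raw transaction with user profile data (the Spark step)."""
    merged = dict(event)
    merged["home_country"] = profiles.get(event["user_id"], {}).get("country", "unknown")
    return merged

def score(event):
    """Toy stand-in for a TensorFlow/PyTorch fraud model."""
    risk = 0.0
    if event["amount"] > 1000:
        risk += 0.5
    if event["country"] != event["home_country"]:
        risk += 0.4
    return risk

def decide(risk):
    """Decision layer: approve, flag, or block (the Flink step)."""
    if risk >= 0.8:
        return "block"
    if risk >= 0.4:
        return "flag"
    return "approve"

profiles = {"u1": {"country": "US"}}
txn = {"user_id": "u1", "amount": 2500, "country": "BR"}
print(decide(score(enrich(txn, profiles))))
```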

&lt;p&gt;&lt;strong&gt;B.    Telecom Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Kafka Connect ingests call detail records and network telemetry.&lt;/li&gt;
&lt;li&gt; Spark Structured Streaming normalizes and enriches data.&lt;/li&gt;
&lt;li&gt; AI models in Databricks flag SIM-box fraud and robocall activity.&lt;/li&gt;
&lt;li&gt; Flink automatically blocks endpoints or opens incident tickets.&lt;/li&gt;
&lt;li&gt; Results appear in Power BI dashboards, with urgent alerts sent to Slack and logs stored in Delta Lake.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;For Customer Churn Prediction&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A.    Subscription Services Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; User logins and cancellations stream through Kafka.&lt;/li&gt;
&lt;li&gt; Spark Streaming aggregates session durations and activity metrics.&lt;/li&gt;
&lt;li&gt; Databricks ML models predict churn probability in near real time.&lt;/li&gt;
&lt;li&gt; If a customer is at risk, Salesforce CRM triggers retention offers instantly.&lt;/li&gt;
&lt;/ol&gt;
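
&lt;p&gt;The churn steps above can be sketched the same way. The score below is a toy rule, not a Databricks ML model, and the CRM trigger is simulated by appending to a list; all names and thresholds are assumptions.&lt;/p&gt;

```python
# Toy churn-risk score from aggregated session metrics, plus a stand-in
# for the CRM trigger that fires when risk crosses a threshold.

def churn_probability(features):
    score = 0.1
    if features["days_since_last_login"] > 14:
        score += 0.4
    if features["cancellation_page_visits"] > 0:
        score += 0.3
    return min(score, 1.0)

def maybe_trigger_retention(customer_id, features, crm_actions, threshold=0.6):
    """If churn risk crosses the threshold, record a retention offer (CRM stand-in)."""
    p = churn_probability(features)
    if p >= threshold:
        crm_actions.append({"customer": customer_id, "action": "retention_offer", "risk": p})
    return p

actions = []
risk = maybe_trigger_retention("c42", {"days_since_last_login": 20, "cancellation_page_visits": 1}, actions)
print(risk, actions)
```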

&lt;p&gt;&lt;strong&gt;B.    Telecom Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Call records and service logs stream into Kafka.&lt;/li&gt;
&lt;li&gt; Delta Lake manages both historical and streaming data.&lt;/li&gt;
&lt;li&gt; Churn models continuously score customers for risk.&lt;/li&gt;
&lt;li&gt; Campaign systems automatically trigger retention actions like discounts or calls.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;For Dynamic Pricing&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A.    E-Commerce Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Kafka streams competitor pricing, user browsing, and inventory data.&lt;/li&gt;
&lt;li&gt; Spark Streaming aggregates demand spikes and trends.&lt;/li&gt;
&lt;li&gt; Reinforcement learning models in Databricks recommend price adjustments.&lt;/li&gt;
&lt;li&gt; Pricing APIs update storefronts dynamically.&lt;/li&gt;
&lt;/ol&gt;
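
&lt;p&gt;As a rough illustration of the pricing step, here is a bounded rule that nudges price with demand. A production system would use a learned reinforcement-learning policy rather than fixed thresholds like these; every number below is illustrative.&lt;/p&gt;

```python
# Simplified stand-in for the price-adjustment step: raise price on a
# demand spike, lower it on a slump, and clamp to a floor/ceiling band.

def adjust_price(current, demand_ratio, floor, ceiling, step=0.05):
    """demand_ratio is recent demand divided by a baseline expectation."""
    if demand_ratio > 1.2:
        current = current * (1 + step)   # demand spike: nudge price up
    elif demand_ratio >= 0.8:
        pass                             # normal band: hold price
    else:
        current = current * (1 - step)   # slump: nudge price down
    return max(floor, min(ceiling, current))

print(adjust_price(100.0, 1.5, 80.0, 150.0))
```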

&lt;p&gt;&lt;strong&gt;B.    Airlines &amp;amp; Hospitality Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Kafka ingests booking and occupancy rates.&lt;/li&gt;
&lt;li&gt; Spark connectors add seasonal and external signals (holidays, weather).&lt;/li&gt;
&lt;li&gt; Predictive models forecast surges in demand.&lt;/li&gt;
&lt;li&gt; Reservation systems update fares or room prices instantly.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Technology Stack&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;• Apache Kafka + Spark Streaming: backbone for ingestion and transformation.&lt;br&gt;
• Delta Lake + Databricks: reliable, scalable storage with integrated ML deployment.&lt;br&gt;
• Industry Platforms: Uber Michelangelo (real-time ML at scale), PayPal AI (stream-based fraud analytics).&lt;/p&gt;

&lt;h2&gt;
  &lt;strong&gt;Challenges for Engineers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;• Latency vs. Accuracy: Do you simplify models for speed, or use complex models that risk delays?&lt;br&gt;
• Scalability: Costs pile up as data volumes grow, so optimizing infrastructure is critical.&lt;br&gt;
• Governance: Transparency, explainability, and regulatory compliance cannot be ignored.&lt;br&gt;
• Model Drift: Retraining is not optional; real-time models degrade quickly without updates.&lt;/p&gt;
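
&lt;p&gt;The model-drift point can be made concrete with a tiny check: compare the recent mean of a feature (or of the model's scores) against its training baseline and flag retraining when the shift is statistically large. The z-score test and threshold below are illustrative assumptions, not a standard.&lt;/p&gt;

```python
# Illustrative drift check: flag retraining when the recent mean of a
# monitored value drifts too far from the training-time baseline.

def needs_retraining(baseline_mean, baseline_std, recent_values, z_threshold=3.0):
    n = len(recent_values)
    recent_mean = sum(recent_values) / n
    # z-score of the recent mean under the training distribution
    z = abs(recent_mean - baseline_mean) / (baseline_std / n ** 0.5)
    return z > z_threshold

print(needs_retraining(100.0, 15.0, [130.0, 128.0, 135.0, 131.0]))
```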

&lt;h2&gt;
  
  
  &lt;strong&gt;The Road Ahead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The role of data engineering is expanding: engineers are no longer just pipeline builders, they are becoming AI systems architects, responsible for both the data flow and the intelligence within it. Those who stay ahead will be the ones who can combine streaming, AI, and governance into unified, scalable platforms.&lt;br&gt;
For businesses, this means faster decisions and a stronger defense against fraud and churn. For engineers, it means mastering data frameworks alongside machine learning, automation, and cloud-native design.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kafka</category>
      <category>automation</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Building Smarter Dashboards: Improve Power BI Copilot Accuracy with Semantic Models and Metadata</title>
      <dc:creator>Harsh Patel</dc:creator>
      <pubDate>Tue, 06 May 2025 03:00:39 +0000</pubDate>
      <link>https://dev.to/harsh9410/building-smarter-dashboards-improve-power-bi-copilot-accuracy-with-semantic-models-and-metadata-2a03</link>
      <guid>https://dev.to/harsh9410/building-smarter-dashboards-improve-power-bi-copilot-accuracy-with-semantic-models-and-metadata-2a03</guid>
      <description>&lt;p&gt;Copilot in Power BI is a powerful advancement in making data analysis accessible to everyone. But the quality of Copilot's output depends heavily on the foundation it sits upon: your Power BI data model and metadata. If Copilot doesn’t understand your data structure clearly, its responses can become vague, inaccurate, or not business friendly.&lt;/p&gt;

&lt;p&gt;This article explains how building a strong semantic model and using rich metadata and descriptions can improve Copilot’s accuracy in Power BI.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;1 - Build a Strong Semantic Model&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Let’s understand what a semantic model is and why it matters&lt;/strong&gt;&lt;br&gt;
In Power BI, the semantic model (also known as the data model) defines how the data is structured, related, and understood. When a user asks a question in Power BI Copilot, it uses this model to interpret natural language queries. &lt;br&gt;
For example, when a user types “Show me the sales revenue for 2023 by region”, Copilot uses the semantic model to identify which table or measure represents "sales revenue", understand how regions are defined, and join the data properly to generate the right result.&lt;br&gt;
On the other hand, a poorly designed data model leads to ambiguity, which results in inaccurate data retrieval, misinterpretation of user queries, and poor visual suggestions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So, what are the best practices to improve the semantic model?&lt;/strong&gt;&lt;br&gt;
A few steps you can take to improve the semantic model so that Copilot gives better results:&lt;br&gt;
• &lt;strong&gt;Use Clear and Descriptive Table and Column Names:&lt;/strong&gt; Business-friendly, intuitive names help Copilot match natural language phrases to fields without confusion, improving its ability to generate correct queries and visuals. Avoid abbreviations.&lt;br&gt;
Example: If you have a table of revenues, name it “Revenue” instead of “tbl_rev”.&lt;br&gt;
• &lt;strong&gt;Define Accurate and Meaningful Relationships:&lt;/strong&gt; Copilot depends on relationships between tables to create joins when generating DAX and visualizations. It is important to define logical connections (one-to-many or many-to-one) and avoid ambiguity.&lt;br&gt;
• &lt;strong&gt;Define Measures:&lt;/strong&gt; Create key business measures such as Total Sales or YoY Growth; Copilot is much better at referencing defined measures than at creating complex aggregations on its own.&lt;br&gt;
• &lt;strong&gt;Star Schema:&lt;/strong&gt; A star schema simplifies the relationship structure, which helps Copilot navigate between related tables. Ideally it has a central fact table connected to dimension tables.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;2 - Use Rich Metadata and Descriptions&lt;/strong&gt;&lt;br&gt;
Metadata provides meaning and context for tables, columns, and measures. Just as a strong semantic model helps Copilot interpret queries, rich metadata greatly enhances its ability to generate business-friendly summaries and narrative visuals. Metadata tells Copilot:&lt;br&gt;
• What each field represents&lt;br&gt;
• How values should be displayed&lt;br&gt;
• How fields relate to each other in business terms&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So, what are the best practices for metadata and descriptions?&lt;/strong&gt;&lt;br&gt;
A few steps you can take to enrich metadata and descriptions so that Copilot gives better results:&lt;br&gt;
• &lt;strong&gt;Add Descriptions to Tables, Columns, and Measures:&lt;/strong&gt; Meaningful descriptions help Copilot generate narrative summaries and respond to natural language queries.&lt;br&gt;
Example: A table named “Revenue” can be given the description “Total revenue generated after discounts and before taxes,” and a column named "Order Date” the description “Date when the customer placed the order.”&lt;br&gt;
• &lt;strong&gt;Format Data Correctly:&lt;/strong&gt; Ensuring each field is formatted properly (dates as a Date type, currency fields with a currency format) helps Copilot present data correctly in generated visuals and improves user understanding. &lt;br&gt;
• &lt;strong&gt;Use Synonyms Wherever Possible:&lt;/strong&gt; Define synonyms when users use names interchangeably; for example, a user may use Sales for Revenue or Client for Customer. This improves Copilot’s ability to match natural language queries with the right data elements.&lt;/p&gt;
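
&lt;p&gt;To see why synonyms matter, consider a toy resolver that maps a user's wording onto model fields. Copilot does this internally from the synonyms you define; the dictionary and function below are a hypothetical illustration, not the Power BI API.&lt;/p&gt;

```python
# Conceptual sketch: resolving the words in a natural language query to
# the semantic model's field names via a synonym map.

SYNONYMS = {
    "sales": "Revenue",
    "revenue": "Revenue",
    "client": "Customer",
    "customer": "Customer",
}

def resolve_fields(query):
    """Return the model fields mentioned (by name or synonym) in a query."""
    words = query.lower().replace("?", "").split()
    return sorted({SYNONYMS[w] for w in words if w in SYNONYMS})

print(resolve_fields("Show sales by client"))
```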




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
1)  &lt;a href="https://learn.microsoft.com/en-us/power-bi/create-reports/copilot-evaluate-data" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/power-bi/create-reports/copilot-evaluate-data&lt;/a&gt;&lt;br&gt;
2)  &lt;a href="https://www.pharossolutions.com/pharos-solutions-blog/microsoft-powerbi-copilot-overview-features-and-best-practices" rel="noopener noreferrer"&gt;https://www.pharossolutions.com/pharos-solutions-blog/microsoft-powerbi-copilot-overview-features-and-best-practices&lt;/a&gt;&lt;/p&gt;

</description>
      <category>powerplatform</category>
      <category>llm</category>
      <category>githubcopilot</category>
      <category>metadata</category>
    </item>
    <item>
      <title>Transforming LLMs for Industry Use: A Guide to Fine-Tuning Methods and Case Studies</title>
      <dc:creator>Harsh Patel</dc:creator>
      <pubDate>Mon, 21 Apr 2025 01:11:59 +0000</pubDate>
      <link>https://dev.to/harsh9410/understanding-fine-tuning-llms-different-types-of-fine-tuning-with-examples-37n9</link>
      <guid>https://dev.to/harsh9410/understanding-fine-tuning-llms-different-types-of-fine-tuning-with-examples-37n9</guid>
      <description>&lt;p&gt;Fine-tuning is the process of taking pre-trained models and further training them on smaller, domain-specific datasets. This process transforms general-purpose models into specialized ones, bridging the gap between generic pre-trained models and the unique requirements of particular applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why use Fine-Tuning over General LLMs?&lt;/strong&gt;&lt;br&gt;
Reasons why fine-tuning models for tasks in new domains is crucial:&lt;br&gt;
• &lt;strong&gt;Domain-Specific Adaptation:&lt;/strong&gt; Fine-tuning allows adaptation to the nuances and characteristics of a new domain, enhancing performance on domain-specific tasks.&lt;br&gt;
• &lt;strong&gt;Shifts in Data Distribution:&lt;/strong&gt; Fine-tuning helps align the model with the distribution of new data, addressing shifts in data characteristics and improving performance on specific tasks.&lt;br&gt;
• &lt;strong&gt;Continual Learning:&lt;/strong&gt; Fine-tuning supports continual learning by allowing models to adapt to evolving data and user requirements over time, enabling them to stay relevant and effective in dynamic environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s understand this with a case study example:&lt;/strong&gt;&lt;br&gt;
The case study referenced &lt;a href="https://www.spaceo.ai/case-study/fine-tuning-llama-2/" rel="noopener noreferrer"&gt;here&lt;/a&gt; fine-tunes a general model (Llama 2 in this case) on COVID-19 patient data for accurate diagnosis and treatment recommendations. The team curated supervised training datasets (more on this later) to help the general LLM understand health patterns, suggest treatments, and respond effectively to clinical queries.&lt;br&gt;
The domain-specific adaptation in this case showed how LLMs like LLaMA can be transformed into powerful, context-aware tools for specialized sectors such as healthcare, and it highlights the broader potential of fine-tuning general-purpose LLMs to meet domain-specific challenges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Fine-Tuning&lt;/strong&gt;&lt;br&gt;
Fine-tuning a language model can be done using two main approaches: unsupervised and supervised. Let’s look at each briefly with some examples.&lt;br&gt;
• &lt;strong&gt;Unsupervised Fine-Tuning Method:&lt;/strong&gt; In the unsupervised method, unlabeled data is used to extract patterns and structures without explicit labels. This method is relevant when you need to update the knowledge base of an LLM without modifying its behavior.&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A pioneering Large Language Model for Law&lt;/strong&gt;&lt;br&gt;
This research &lt;a href="https://arxiv.org/html/2403.03883v1" rel="noopener noreferrer"&gt;paper&lt;/a&gt; introduces SaulLM-7B, a model tailored for the legal domain. It leverages Mistral 7B as its base model and is trained on an extensive English corpus of unlabeled legal documents, including case law, statutes, contracts, and legal opinions.&lt;br&gt;
&lt;strong&gt;Methodology:&lt;/strong&gt;&lt;br&gt;
   o    The base model is Mistral 7B.&lt;br&gt;
   o    The training data is a vast collection of unlabeled legal documents, including case law, statutes, contracts, and legal opinions.&lt;br&gt;
   o    Through continued pre-training on the legal corpus, the model adapts to the specific language patterns, terminology, and structures prevalent in legal texts.&lt;br&gt;
&lt;strong&gt;End Results:&lt;/strong&gt;&lt;br&gt;
This approach enhanced SaulLM-7B’s ability to understand and generate legal text without relying on labeled data. It outperformed previous models on tasks such as legal document summarization and legal question answering, showcasing the effectiveness of unsupervised domain adaptation in specialized fields.&lt;/p&gt;

&lt;p&gt;• &lt;strong&gt;Supervised Fine-Tuning Method:&lt;/strong&gt; This approach uses labeled data, where the model is trained on examples with corresponding desired outputs. During supervised fine-tuning, the pre-trained LLM is fine-tuned on this labeled dataset using supervised learning techniques, allowing it to learn task-specific patterns and nuances present in the data. By adapting its parameters to the specific data distribution and task requirements, the model becomes specialized at the target task.&lt;br&gt;
&lt;strong&gt;Example:&lt;/strong&gt; &lt;br&gt;
&lt;strong&gt;Supervised Fine-Tuning Model for Enhanced Financial Data Analysis&lt;/strong&gt;&lt;br&gt;
The researchers behind this &lt;a href="https://arxiv.org/pdf/2401.15328" rel="noopener noreferrer"&gt;paper&lt;/a&gt; developed a specialized LLM named Raven to improve tabular data analysis in the financial sector. They applied supervised fine-tuning to Meta’s Llama-2 13B chat model, tailoring Raven to handle complex financial tasks that require precise reasoning and data interpretation.&lt;br&gt;
&lt;strong&gt;Methodology:&lt;/strong&gt;&lt;br&gt;
   o    The base model is the Llama-2 13B chat model.&lt;br&gt;
   o    The labeled training data comprised financial question-answer pairs, enabling the model to learn domain-specific knowledge and reasoning patterns.&lt;br&gt;
   o    The tailored model, Raven, effectively acted as both “task router” and “task retriever”, with the capability to utilize external tools for tasks like calculations and data retrieval.&lt;br&gt;
&lt;strong&gt;End Results:&lt;/strong&gt;&lt;br&gt;
Raven, fine-tuned with the supervised method, demonstrated a 35.2% improvement over the base model, showcasing the effectiveness of supervised fine-tuning in addressing the nuanced requirements of financial data analysis.&lt;/p&gt;
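
&lt;p&gt;On the data-curation side, supervised fine-tuning usually starts by turning labeled question-answer pairs into prompt/completion records. The sketch below shows that formatting step; the field names and prompt layout are illustrative assumptions, not the format used in the Raven paper.&lt;/p&gt;

```python
# Turning labeled QA pairs into the prompt/completion records that most
# supervised fine-tuning (SFT) pipelines consume.

import json

def to_sft_records(qa_pairs, system_prompt):
    records = []
    for qa in qa_pairs:
        records.append({
            "prompt": f"{system_prompt}\n\nQuestion: {qa['question']}\nAnswer:",
            "completion": " " + qa["answer"],
        })
    return records

pairs = [{"question": "What was Q3 net revenue?", "answer": "Net revenue was 4.2M."}]
records = to_sft_records(pairs, "You are a financial analysis assistant.")
print(json.dumps(records[0], indent=2))
```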

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
1)  &lt;a href="https://github.com/aishwaryanr/awesome-generative-ai-guide/blob/main/free_courses/Applied_LLMs_Mastery_2024/week3_finetuning_llms.md" rel="noopener noreferrer"&gt;https://github.com/aishwaryanr/awesome-generative-ai-guide/blob/main/free_courses/Applied_LLMs_Mastery_2024/week3_finetuning_llms.md&lt;/a&gt;&lt;br&gt;
2)  &lt;a href="https://medium.com/mantisnlp/supervised-fine-tuning-customizing-llms-a2c1edbf22c3" rel="noopener noreferrer"&gt;https://medium.com/mantisnlp/supervised-fine-tuning-customizing-llms-a2c1edbf22c3&lt;/a&gt;&lt;br&gt;
3)  &lt;a href="https://www.spaceo.ai/case-study/fine-tuning-llama-2/" rel="noopener noreferrer"&gt;https://www.spaceo.ai/case-study/fine-tuning-llama-2/&lt;/a&gt;&lt;br&gt;
4)  &lt;a href="https://arxiv.org/html/2403.03883v1" rel="noopener noreferrer"&gt;https://arxiv.org/html/2403.03883v1&lt;/a&gt;&lt;br&gt;
5)  &lt;a href="https://arxiv.org/abs/2401.15328" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2401.15328&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>finetuning</category>
      <category>supervised</category>
      <category>unsupervised</category>
    </item>
    <item>
      <title>A Practical Guide to Data Architecture: Real-World Use Cases from Lakes to Warehouses</title>
      <dc:creator>Harsh Patel</dc:creator>
      <pubDate>Sun, 13 Apr 2025 03:22:56 +0000</pubDate>
      <link>https://dev.to/harsh9410/a-practical-guide-to-data-architecture-real-world-use-cases-from-lakes-to-warehouses-497j</link>
      <guid>https://dev.to/harsh9410/a-practical-guide-to-data-architecture-real-world-use-cases-from-lakes-to-warehouses-497j</guid>
      <description>&lt;p&gt;In today’s data-driven world, choosing the right architecture is crucial. This article compares Data Warehouses, Data Lakes, Data Lakehouses, and Data Marts through real-world business use cases, exploring how data flows from raw sources to decision-making dashboards. Each serves a unique purpose, and choosing the right one depends on your team's goals, tools, and data maturity.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Data Lakes&lt;/strong&gt;&lt;br&gt;
     A data lake is a large repository that stores huge amounts of raw data in its original format until you need to use it. There are no fixed limitations on data lake storage, so considerations like format, file type, and specific purpose do not apply. Data lakes are used when organizations need flexibility in data processing and analysis. They can store any type of data from multiple sources, whether structured, semi-structured, or unstructured. As a result, data lakes are highly scalable, which makes them ideal for larger organizations that collect vast amounts of data. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s better understand Data Lakes end to end with a real-world example:&lt;/strong&gt;&lt;br&gt;
A real-world example would be a tech company that leverages Data Lakes for storing large-scale logs and unstructured user interaction data for product analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What might the data source for this example look like?&lt;/strong&gt;&lt;br&gt;
Data might come through various sources such as web application logs, mobile application events, social media data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does extracting, transforming, and loading (ETL) the data from the source to the Data Lake look like?&lt;/strong&gt;&lt;br&gt;
The raw data is continuously streamed (real-time processing) into the data lake (usually cloud storage). Note that there is no upfront transformation, since a data lake uses a schema-on-read approach.&lt;/p&gt;
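
&lt;p&gt;Schema-on-read can be shown in miniature: raw events land untouched, and a schema is applied only when the data is read. The JSON lines and field names below are purely illustrative.&lt;/p&gt;

```python
# Schema-on-read sketch: store raw event lines as-is, then project a
# schema over them only at query time.

import json

raw_events = [
    '{"user": "u1", "event": "click", "ts": 1700000000}',
    '{"user": "u2", "event": "purchase", "ts": 1700000050, "amount": 19.99}',
]

def read_with_schema(lines, fields):
    """Parse raw JSON lines and project only the fields the analysis needs."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

purchases = [r for r in read_with_schema(raw_events, ["user", "event", "amount"]) if r["event"] == "purchase"]
print(purchases)
```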

&lt;p&gt;&lt;strong&gt;Data Lake Tools:&lt;/strong&gt;&lt;br&gt;
Amazon S3, Azure Data Lake, or Google Cloud Storage.&lt;br&gt;
&lt;strong&gt;Who might be the end users, and how would they use it?&lt;/strong&gt;&lt;br&gt;
Data scientists use the data for exploratory analysis and machine learning, working in Spark or Python notebooks to identify user behavior patterns and improve product features through ML models.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Data Warehouse&lt;/strong&gt;&lt;br&gt;
    Data in a data warehouse is collected from a variety of sources, but it typically takes the form of processed data from internal and external systems in an organization. This data consists of specific insights such as product, customer, or employee information. A data warehouse is best used for reporting, data analysis, and storing historical data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s better understand Data Warehouse end to end with a real-world example:&lt;/strong&gt;&lt;br&gt;
One of the prime examples where Data Warehouses are used is in very large retail chains where they would like to store and analyze customer purchases and their sales data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What might the data source for this example look like?&lt;/strong&gt;&lt;br&gt;
It could be their Point of Sales (POS) Systems, online transactions, CRM data&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does this data get extracted, transformed and loaded (ETL) from the source to the Data Warehouse?&lt;/strong&gt;&lt;br&gt;
• As a first step, data gets extracted in batches at night from operational databases (quick detour: they are also called Online Transaction Processing – OLTP systems and are used to run day-to-day business operations. These are the systems where data is first created, updated, or deleted in real time during routine transactions).&lt;br&gt;
• The second step in this case would be to transform the data where cleaning, deduplication and normalization takes place.&lt;br&gt;
• The final step would be loading the data into the data warehouse (schema-on-write). &lt;/p&gt;
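
&lt;p&gt;The three steps above can be sketched in plain Python, with extraction mocked as a list of rows (a real job would read from the OLTP database) and the load step left as a comment; all table and field names are illustrative.&lt;/p&gt;

```python
# Batch ETL sketch: the transform step cleans, deduplicates, and
# normalizes rows extracted from an operational (OLTP) system.

def transform(rows):
    seen = set()
    out = []
    for row in rows:
        key = row["order_id"]
        if key in seen:
            continue              # deduplication
        seen.add(key)
        out.append({
            "order_id": row["order_id"],
            "store": row["store"].strip().upper(),   # normalization
            "total": round(float(row["total"]), 2),  # cleaning/typing
        })
    return out

extracted = [
    {"order_id": 1, "store": " nyc-01 ", "total": "19.991"},
    {"order_id": 1, "store": " nyc-01 ", "total": "19.991"},  # duplicate
    {"order_id": 2, "store": "bos-02", "total": "5.5"},
]
warehouse_table = transform(extracted)   # the load step would INSERT these rows
print(warehouse_table)
```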

&lt;p&gt;&lt;strong&gt;Tools for Data Warehouse:&lt;/strong&gt;&lt;br&gt;
Snowflake, Amazon Redshift or Google BigQuery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;End Users of Data Warehouse?&lt;/strong&gt;&lt;br&gt;
As one of the usages, it could be used by Analysts to create PowerBI or Tableau dashboards for creating daily sales reports, profitability analysis or inventory forecast.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Data Lakehouse&lt;/strong&gt;&lt;br&gt;
A data lakehouse is a hybrid approach that combines the best of the data warehouse and the data lake: the management and performance capabilities of warehouses with the scalability of lakes. It supports structured, semi-structured, and unstructured data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s better understand Data Lakehouse end to end with a real-world example:&lt;/strong&gt;&lt;br&gt;
Take the example of a financial services firm that uses a data lakehouse for real-time fraud detection and regulatory reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What might the data source for this example look like?&lt;/strong&gt;&lt;br&gt;
Real-time transactional data from core banking systems, customer profiles with KYC info from CRM systems, fraud alert signals from fraud detection APIs, external data feeds from credit bureaus, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does a hybrid ETL/ELT load from the sources to the Data Lakehouse look like?&lt;/strong&gt;&lt;br&gt;
Loading the data into a data lakehouse may take either an ETL or an ELT route.&lt;br&gt;
• ETL may be used when data must be cleaned and validated before loading, or when there are strict schema and audit requirements; for example, when customer data from CRM systems needs personal information masking, standardization of names/addresses, or aggregation before loading.&lt;br&gt;
• ELT is used when data arrives fast and frequently, or when it is better to land raw data first and clean it later; for example, real-time transactions streamed via Apache Kafka landing immediately in the lakehouse, or fraud alerts from external APIs stored as-is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lakehouse tools:&lt;/strong&gt;&lt;br&gt;
Databricks Lakehouse Platform with Delta Lake, Apache Iceberg.&lt;br&gt;
&lt;strong&gt;End Users of Data Lakehouse?&lt;/strong&gt;&lt;br&gt;
Analysts and data scientists who run real-time queries, used both for regulatory reports and for real-time fraud detection models that can be integrated into BI dashboards.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Data Marts&lt;/strong&gt;&lt;br&gt;
Data marts are specialized and focused. A data mart is a subset of a data warehouse that lets your team access relevant datasets without the pain of dealing with an entire complex warehouse. It is a great solution if you are looking to enable self-service analytics for individual departments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s better understand Data Marts end to end with a real-world example:&lt;/strong&gt;&lt;br&gt;
Let’s take the example of a sales team in a pharmaceutical company that needs specific analytics for its product lines.&lt;br&gt;
&lt;strong&gt;What might the data source for this example look like?&lt;/strong&gt;&lt;br&gt;
Data will come from the enterprise data warehouse (e.g. Snowflake), the sales CRM, and marketing data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What would the loading process (ETL) look like?&lt;/strong&gt;&lt;br&gt;
A subset is created from the main data warehouse, loading pre-aggregated or filtered data relevant specifically to the sales team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Mart Tools:&lt;/strong&gt;&lt;br&gt;
Smaller databases like SQL Server, Snowflake, or simplified Redshift instances.&lt;br&gt;
&lt;strong&gt;End Users of Data Mart:&lt;/strong&gt;&lt;br&gt;
For this use case, the sales team accesses specialized reports through dedicated Tableau or Power BI dashboards.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
1) &lt;a href="https://medium.com/@onliashish/exploring-data-architecture-design-patterns-3a9241862f2e" rel="noopener noreferrer"&gt;https://medium.com/@onliashish/exploring-data-architecture-design-patterns-3a9241862f2e&lt;/a&gt;&lt;br&gt;
2) &lt;a href="https://www.splunk.com/en_us/blog/learn/data-warehouse-vs-data-lake.html" rel="noopener noreferrer"&gt;https://www.splunk.com/en_us/blog/learn/data-warehouse-vs-data-lake.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>datawarehouse</category>
      <category>datelake</category>
      <category>datalakehouse</category>
    </item>
    <item>
      <title>How MCP Can Supercharge GenAI-Powered BI Dashboards with Real-Time Data Access from Your Databases</title>
      <dc:creator>Harsh Patel</dc:creator>
      <pubDate>Mon, 24 Mar 2025 20:06:51 +0000</pubDate>
      <link>https://dev.to/harsh9410/how-mcp-can-supercharge-genai-powered-bi-dashboards-with-real-time-data-access-from-your-databases-59c6</link>
      <guid>https://dev.to/harsh9410/how-mcp-can-supercharge-genai-powered-bi-dashboards-with-real-time-data-access-from-your-databases-59c6</guid>
      <description>&lt;p&gt;We have already seen most of the Business Intelligence dashboards are now powered with GenAI and it has already revolutionized how businesses interact with data. However, LLMs often struggle to provide real-time, context-aware insights since they rely on pre-trained knowledge rather than live database access.&lt;br&gt;
To bridge this gap, MCP server could be used to enable LLMs access business data by accessing in-housed databases, ensuring real-time, actionable insights are delivered directly through BI dashboards like Power BI, Tableau and many more&lt;/p&gt;

&lt;p&gt;What is Model Context Protocol (MCP)?&lt;br&gt;
As per Anthropic’s official website, “MCP is an open protocol that standardizes how applications provide context to LLMs. Think of MCP as a USB-C port for AI applications. Just as USB-C provides a standardized way to connect your devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools.”&lt;/p&gt;

&lt;p&gt;How would MCP work with Business Intelligence dashboards?&lt;br&gt;
1) User Query via BI Dashboard → A user asks a natural language question (e.g., "What were last month’s sales by region?").&lt;br&gt;
2) LLM Consults MCP Server → Instead of relying on static training data, the LLM queries the MCP server, which connects to databases, APIs, or cloud data lakes.&lt;br&gt;
3) MCP Fetches Real-Time Data → The protocol retrieves the latest structured data from SQL, Snowflake, Databricks, or other sources.&lt;br&gt;
4) Data Contextualization &amp;amp; Response Generation → The LLM processes the data, generates insights, and presents them in the BI dashboard as a report, visualization, or narrative summary.&lt;/p&gt;
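<p>The four steps above can be sketched in plain Python. Everything here is illustrative: a real MCP client speaks the protocol to an actual server, and a real LLM would generate the narrative, so the function names and the hard-coded data are stand-ins only:</p>

```python
def mcp_fetch(question):
    # Steps 2-3: the MCP server maps the question to a live database query
    # and returns fresh, structured rows instead of stale training data.
    # A hard-coded dict stands in for the real data source here.
    live_sales = {
        "what were last month's sales by region?": [
            {"region": "East", "sales": 120000},
            {"region": "West", "sales": 95000},
        ]
    }
    return live_sales.get(question.lower(), [])


def answer_with_mcp(question):
    # Step 1: the BI dashboard passes the user's natural-language question in.
    rows = mcp_fetch(question)
    # Step 4: the LLM would turn the rows into a narrative or chart; a plain
    # string format stands in for the generation step.
    parts = [f"{r['region']}: {r['sales']}" for r in rows]
    return "; ".join(parts)


print(answer_with_mcp("What were last month's sales by region?"))
# East: 120000; West: 95000
```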

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcc7rqxpe704emeenl8u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcc7rqxpe704emeenl8u.jpg" alt="Fig 1 - MCP Server as a bridge between LLM models and various data sources" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How is MCP different from RAG?&lt;br&gt;
Looking at MCP’s framework, it might feel very similar to RAG. It is in some ways, but in my opinion there are clear differences. Let’s look at what those are:&lt;br&gt;
1) How data is retrieved&lt;br&gt;
MCP: Direct APIs and database queries&lt;br&gt;
RAG: Semantic search from vector databases&lt;br&gt;
2) LLM interaction:&lt;br&gt;
MCP: Directly queries from data sources (SQL, APIs, Snowflake, etc.)&lt;br&gt;
RAG: Finds and injects text-based context into prompts&lt;/p&gt;
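<p>A toy side-by-side makes that difference concrete. All names and data below are made up: the MCP path does a direct, structured lookup against the live source, while the RAG path ranks stored text chunks (naive word overlap standing in for vector similarity) and injects the best one into the prompt:</p>

```python
# Illustrative data sources for the two retrieval styles.
SALES_DB = {("East", "2025-02"): 120000, ("West", "2025-02"): 95000}
DOC_CHUNKS = [
    "Q4 planning notes: hiring freeze continues.",
    "February sales in the East region reached 120k.",
]


def mcp_style(region, month):
    # MCP path: a direct, structured query against the live source of truth.
    return SALES_DB[(region, month)]


def rag_style(question):
    # RAG path: rank text chunks by word overlap (a stand-in for semantic
    # search over a vector database) and inject the best match as context.
    words = set(question.lower().split())
    best = max(
        DOC_CHUNKS,
        key=lambda chunk: len(words.intersection(chunk.lower().split())),
    )
    return f"Context: {best}\nQuestion: {question}"


print(mcp_style("East", "2025-02"))  # 120000
print(rag_style("What were East sales in February?"))
```

<p>Note that the MCP path returns exact, current values, while the RAG path returns whatever text happens to be indexed, which is the core trade-off between the two.</p>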

&lt;p&gt;References&lt;br&gt;
1) Anthropic Official Blog – Introducing the Model Context Protocol&lt;br&gt;
Describes MCP as an open standard and explains its USB-C analogy for AI tools.&lt;br&gt;
&lt;a href="https://www.anthropic.com/news/model-context-protocol" rel="noopener noreferrer"&gt;https://www.anthropic.com/news/model-context-protocol&lt;/a&gt;&lt;br&gt;
2) Anthropic Developer Docs – MCP Overview&lt;br&gt;
Technical documentation on how MCP works, how to build a server/client, and use it with Claude.&lt;br&gt;
&lt;a href="https://docs.anthropic.com/en/docs/agents-and-tools/mcp" rel="noopener noreferrer"&gt;https://docs.anthropic.com/en/docs/agents-and-tools/mcp&lt;/a&gt;&lt;br&gt;
3) Model Context Protocol GitHub Repositories&lt;br&gt;
GitHub org hosting MCP specifications, sample servers, and client libraries in Python, TypeScript, and more.&lt;br&gt;
&lt;a href="https://github.com/modelcontextprotocol" rel="noopener noreferrer"&gt;https://github.com/modelcontextprotocol&lt;/a&gt;&lt;br&gt;
4) “Claude's Model Context Protocol (MCP): The Standard for AI Interaction” – dev.to&lt;br&gt;
A developer breakdown of how MCP simplifies interactions between AI and external data sources.&lt;br&gt;
&lt;a href="https://dev.to/foxgem/claudes-model-context-protocol-mcp-the-standard-for-ai-interaction-5gko"&gt;https://dev.to/foxgem/claudes-model-context-protocol-mcp-the-standard-for-ai-interaction-5gko&lt;/a&gt;&lt;br&gt;
5) Retrieval-Augmented Generation (RAG): A Survey&lt;br&gt;
Research paper on how RAG architectures retrieve information using semantic search and vector databases.&lt;br&gt;
&lt;a href="https://arxiv.org/abs/2005.11401" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2005.11401&lt;/a&gt;&lt;br&gt;
6) Pinecone – RAG 101: Retrieval-Augmented Generation Explained&lt;br&gt;
Explains how RAG retrieves knowledge from vector databases and integrates it with LLMs.&lt;br&gt;
&lt;a href="https://www.pinecone.io/learn/retrieval-augmented-generation/" rel="noopener noreferrer"&gt;https://www.pinecone.io/learn/retrieval-augmented-generation/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>database</category>
      <category>genai</category>
    </item>
  </channel>
</rss>
