<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Subhasis Das</title>
    <description>The latest articles on DEV Community by Subhasis Das (@nexoperose).</description>
    <link>https://dev.to/nexoperose</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3644509%2Fd0501791-e06c-43dc-a043-28916bd85c48.png</url>
      <title>DEV Community: Subhasis Das</title>
      <link>https://dev.to/nexoperose</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nexoperose"/>
    <language>en</language>
    <item>
      <title>DAY 14 - Final Production-Ready System</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Sat, 14 Mar 2026 09:39:29 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-14-final-production-ready-system-822</link>
      <guid>https://dev.to/nexoperose/day-14-final-production-ready-system-822</guid>
      <description>&lt;p&gt;Day 14 marked the final stage of the Databricks 14 Days AI Challenge – 2 (Advanced), bringing together the various components developed throughout the challenge into a complete production-ready system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijyj48ijuhfyly3ft6zb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijyj48ijuhfyly3ft6zb.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The primary objective was to integrate the data engineering pipeline with the machine learning workflow into a single operational process. Throughout the earlier phases, individual components such as data ingestion, feature engineering, model training, experiment tracking, and inference pipelines were developed separately. Day 14 focused on combining these pieces into an end-to-end architecture capable of generating predictions from raw data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xzcp4ylh29jhutjvja1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xzcp4ylh29jhutjvja1.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pipeline begins with loading data from the Delta table that stores the processed e-commerce event dataset. Feature engineering is applied to transform event-level interactions into user-level behavioral features. These features include metrics such as total user activity, number of purchases, total spending, and average transaction value.&lt;/p&gt;
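
&lt;p&gt;A minimal sketch of this aggregation step is shown below; the table and column names (&lt;strong&gt;ecom_events&lt;/strong&gt;, &lt;strong&gt;user_id&lt;/strong&gt;, &lt;strong&gt;event_type&lt;/strong&gt;, &lt;strong&gt;price&lt;/strong&gt;) are assumptions for illustration, not necessarily the exact names used in the notebook.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import functions as F

# Assumed table and column names for illustration.
events = spark.read.table("ecom_events")

# Collapse event-level rows into one behavioral feature row per user.
user_features = events.groupBy("user_id").agg(
    F.count("*").alias("total_events"),
    F.sum(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("purchase_count"),
    F.sum(F.when(F.col("event_type") == "purchase", F.col("price")).otherwise(0.0)).alias("total_spent"),
    F.avg(F.when(F.col("event_type") == "purchase", F.col("price"))).alias("avg_transaction_value"),
)
&lt;/code&gt;&lt;/pre&gt;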

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj7gyu6w3bz3rte8r9zf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvj7gyu6w3bz3rte8r9zf.png" alt="Notebook" width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, a purchase label is generated to identify whether each user has made a purchase. The feature dataset and label dataset are then joined to produce the final training dataset.&lt;/p&gt;
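
&lt;p&gt;In outline, the labeling and join step could look like the following, reusing the assumed names from the sketch above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Label each user 1 if at least one purchase event exists, else 0.
labels = events.groupBy("user_id").agg(
    F.max(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("label")
)

# Join features and labels into the final training dataset.
training_df = user_features.join(labels, on="user_id", how="inner")
&lt;/code&gt;&lt;/pre&gt;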

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal7caftci2rrg5ieq67j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal7caftci2rrg5ieq67j.png" alt="Notebook" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using this dataset, a Logistic Regression model is trained to predict the probability that a user will make a purchase. The dataset is split into training and testing subsets, and model performance is evaluated using the Area Under the ROC Curve (AUC). The evaluation confirmed that the model could effectively distinguish between purchasing and non-purchasing users based on their interaction patterns.&lt;/p&gt;
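
&lt;p&gt;A condensed version of this training step, under the same assumed column names; the split ratio and seed here are placeholders.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

feature_cols = ["total_events", "purchase_count", "total_spent", "avg_transaction_value"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features", handleInvalid="skip")
train, test = assembler.transform(training_df).randomSplit([0.8, 0.2], seed=42)

lr_model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
print("AUC:", evaluator.evaluate(lr_model.transform(test)))
&lt;/code&gt;&lt;/pre&gt;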

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxeddm55y46waalkgunso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxeddm55y46waalkgunso.png" alt="Notebook" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After training, the final model is saved to a Unity Catalog volume and logged using MLflow. Because the system was executed on a serverless cluster, MLflow required a Unity Catalog temporary directory for model serialization. Adjusting the MLflow configuration allowed the model to be successfully logged and stored for future use.&lt;/p&gt;
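
&lt;p&gt;The post does not show the exact configuration, but a plausible version of the fix uses the &lt;strong&gt;dfs_tmpdir&lt;/strong&gt; argument of &lt;strong&gt;mlflow.spark.log_model&lt;/strong&gt;; the volume path below is a placeholder.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import mlflow

# On serverless compute, Spark ML serialization needs a Unity Catalog
# Volume as scratch space; the path below is an assumed placeholder.
with mlflow.start_run(run_name="day14_final_pipeline"):
    mlflow.spark.log_model(
        lr_model,
        "purchase_model",
        dfs_tmpdir="/Volumes/main/default/mlflow_tmp",
    )
&lt;/code&gt;&lt;/pre&gt;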

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz9myu444q7p0htgwkc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqz9myu444q7p0htgwkc6.png" alt="Notebook" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the model was persisted, batch inference was performed on the dataset to generate purchase probability predictions for each user. The predictions include the user identifier, predicted probability of purchase, and binary prediction label. These results were written to a Gold Delta table, making them accessible for downstream analytics and decision-making processes.&lt;/p&gt;
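
&lt;p&gt;A sketch of the scoring and write-out step; the Gold table name is an assumption, and &lt;strong&gt;vector_to_array&lt;/strong&gt; is used to pull the positive-class probability out of Spark's vector column.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

scored = lr_model.transform(assembler.transform(training_df))

predictions = scored.select(
    "user_id",
    vector_to_array(F.col("probability"))[1].alias("purchase_probability"),
    F.col("prediction").alias("predicted_label"),
)

# Persist to a Gold Delta table (assumed name) for downstream analytics.
predictions.write.format("delta").mode("overwrite").saveAsTable(
    "gold_user_purchase_predictions"
)
&lt;/code&gt;&lt;/pre&gt;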

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgtascxd0igjmgjpybqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgtascxd0igjmgjpybqc.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output dataset revealed a strong separation between users predicted to purchase and those predicted not to purchase, indicating that the engineered features provided meaningful signals for the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5zypd6h0rvio4u72riy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5zypd6h0rvio4u72riy.png" alt="Notebook" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the implementation process, ChatGPT assisted with debugging MLflow configuration issues, refining the pipeline logic, and validating the final prediction extraction steps within the Databricks environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ozz0gutcqs3c8lhyvyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ozz0gutcqs3c8lhyvyj.png" alt="Codes" width="800" height="2613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Completing Day 14 effectively demonstrates how individual data engineering and machine learning tasks can be assembled into a unified system capable of supporting production-style predictive analytics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>DAY 13 - End-to-End Architecture Design</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Fri, 13 Mar 2026 17:25:14 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-13-end-to-end-architecture-design-48m4</link>
      <guid>https://dev.to/nexoperose/day-13-end-to-end-architecture-design-48m4</guid>
      <description>&lt;p&gt;Day 13 of Phase 3: Performance &amp;amp; Production Thinking in the Databricks 14 Days AI Challenge – 2 (Advanced) focused on designing and documenting the end-to-end architecture of the system developed throughout the challenge.&lt;/p&gt;

&lt;p&gt;The first task involved creating an architecture diagram that represents the complete data and machine learning workflow. The architecture illustrates how raw e-commerce event data flows through a layered lakehouse design. Raw CSV data is ingested into the Bronze layer, where it is stored as Delta tables. From there, feature engineering transforms event-level data into curated user-level features within the Silver layer, and these features are used to construct the training dataset for machine learning models.&lt;/p&gt;

&lt;p&gt;Logistic Regression and Random Forest models are trained and evaluated, with experiments tracked using MLflow. The trained model is then used within a batch inference pipeline to score users and generate predictions that are stored in the Gold layer. In parallel, a collaborative filtering recommendation system using ALS generates product recommendations based on user interaction data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjwr8fnsqs5d5wmv79ib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjwr8fnsqs5d5wmv79ib.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second task required documenting the pipeline flow. This step connected the individual components implemented across earlier phases of the challenge. The pipeline begins with data ingestion and Delta table creation, followed by feature engineering and dataset preparation. Model training and evaluation occur after the training dataset is generated, with experiment tracking handled through MLflow. The inference stage then produces prediction outputs for downstream analysis. Supporting layers such as job orchestration, streaming ingestion capability, performance monitoring, and cost optimization were incorporated to reflect how such a pipeline would operate in a real production environment.&lt;/p&gt;

&lt;p&gt;The third task focused on defining a retraining strategy. A production-ready system must continuously adapt to evolving data patterns, so retraining can be triggered through scheduled jobs or changes in data distribution. The retraining workflow rebuilds the training dataset from updated Delta tables, retrains the models, evaluates performance metrics, and logs experiments through MLflow. The best-performing model is then deployed back into the inference pipeline.&lt;/p&gt;
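
&lt;p&gt;As a rough sketch, the retraining loop described here could be orchestrated along these lines; every helper named below is hypothetical and stands in for the corresponding pipeline stage.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import mlflow

def retrain_and_promote():
    # All helpers below are hypothetical placeholders for pipeline stages.
    training_df = rebuild_training_dataset()      # from updated Delta tables

    with mlflow.start_run(run_name="scheduled_retraining"):
        model, auc = train_and_evaluate(training_df)
        mlflow.log_metric("roc_auc", auc)

        # Promote only if the candidate beats the current production model.
        if auc &amp;gt;= current_production_auc():
            deploy_to_inference(model)
&lt;/code&gt;&lt;/pre&gt;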

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g6vvziuf2o1wc1h5igh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g6vvziuf2o1wc1h5igh.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the design and documentation process, ChatGPT assisted with structuring the architecture, organizing the pipeline flow, and refining the retraining strategy within the environment provided by Databricks.&lt;/p&gt;

&lt;p&gt;This exercise highlighted how individual data engineering and machine learning components can be integrated into a cohesive and scalable system architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F849nwhsweqvfxtrg9o7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F849nwhsweqvfxtrg9o7l.png" alt="Diagram generated by ChatGPT" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 12 – Cost Optimization Basics</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Thu, 12 Mar 2026 10:05:33 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-12-cost-optimization-basics-3e2j</link>
      <guid>https://dev.to/nexoperose/day-12-cost-optimization-basics-3e2j</guid>
      <description>&lt;p&gt;Day 12 focused on cost optimization fundamentals in Spark-based data workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwg3kb89dmgs08fkqe1wt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwg3kb89dmgs08fkqe1wt.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The objective was to analyze job runtime behavior and identify common patterns that increase compute cost in distributed processing systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmjhtp6043a2vdfzv76n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmjhtp6043a2vdfzv76n.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first experiment measured runtime consistency for a heavy analytical query. The initial execution took approximately 39.87 seconds, while the second execution completed in about 2.35 seconds, demonstrating the difference between cold and warm query execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bd9zxhybc9wukn3uysi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bd9zxhybc9wukn3uysi.png" alt="Notebook" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, the impact of unnecessary actions was explored. Executing &lt;strong&gt;.show()&lt;/strong&gt;, &lt;strong&gt;.count()&lt;/strong&gt;, and &lt;strong&gt;.collect()&lt;/strong&gt; on the same DataFrame triggered three separate Spark jobs, each scanning approximately 1.08 GB of data. Consolidating to a single action, or persisting the result for reuse, brought runtime down to around 1.22 seconds.&lt;/p&gt;
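
&lt;p&gt;The pattern and its fix, in schematic form (&lt;strong&gt;df&lt;/strong&gt; stands for the event DataFrame; note that persistence may be restricted on serverless compute, as Day 10 found):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Anti-pattern: three actions, three full scans of the same data.
df.show()
df.count()
df.collect()

# Better: materialize once, then reuse (where the runtime allows caching).
df.cache()
df.count()   # first action populates the cache
df.show()    # served from the cache, no second scan
&lt;/code&gt;&lt;/pre&gt;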

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzq6v2ljz3k5zl8233pmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzq6v2ljz3k5zl8233pmi.png" alt="Notebook" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additional experiments highlighted query optimization techniques. Simplifying a complex aggregation query reduced runtime from 7.48 seconds to 1.66 seconds. Avoiding &lt;strong&gt;SELECT *&lt;/strong&gt; and selecting only required columns further cut execution time from 2.85 seconds to 1.44 seconds.&lt;/p&gt;
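
&lt;p&gt;Schematically, with assumed table and column names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Reads every column of the table.
wide = spark.sql("SELECT * FROM ecom_events WHERE event_type = 'purchase'")

# Reads only the columns the analysis needs, letting the columnar
# reader skip the rest (column names are assumptions).
narrow = spark.sql(
    "SELECT user_id, price FROM ecom_events WHERE event_type = 'purchase'"
)
&lt;/code&gt;&lt;/pre&gt;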

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji24aq47p5uh2hh3466x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fji24aq47p5uh2hh3466x.png" alt="Notebook" width="800" height="374"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Throughout the analysis, ChatGPT supported interpretation of runtime results and identification of practical cost-saving strategies within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1y3akg66mcixdqa0pl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1y3akg66mcixdqa0pl3.png" alt="Notebook" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These observations illustrate how query design directly influences compute cost in distributed data processing systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr1pne3yqey1871ku71p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr1pne3yqey1871ku71p.png" alt="Codes" width="800" height="2680"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 11 – Time Travel &amp; Data Recovery</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Wed, 11 Mar 2026 16:07:36 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-11-time-travel-data-recovery-4044</link>
      <guid>https://dev.to/nexoperose/day-11-time-travel-data-recovery-4044</guid>
      <description>&lt;p&gt;Day 11 focused on Delta Lake’s time travel functionality and how historical data versions can be accessed in production data systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yekn6zzgzdvnawvyxsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yekn6zzgzdvnawvyxsb.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two test records were appended to the &lt;strong&gt;ecom_orders&lt;/strong&gt; Delta table to simulate a new ingestion event. Using &lt;strong&gt;DESCRIBE HISTORY&lt;/strong&gt;, the table version history was examined to identify the newly created version. The dataset was then queried using &lt;strong&gt;VERSION AS OF&lt;/strong&gt; to retrieve the table state before the append operation.&lt;/p&gt;
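
&lt;p&gt;In notebook form, the two queries look roughly like this (the version number follows the post):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Inspect the table's version history.
display(spark.sql("DESCRIBE HISTORY ecom_orders"))

# Read the snapshot as it existed before the append operation.
before_append = spark.sql("SELECT * FROM ecom_orders VERSION AS OF 6")
display(before_append)
&lt;/code&gt;&lt;/pre&gt;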

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faufor13jarosykopzudm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faufor13jarosykopzudm.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkf275htdxmquiu4xy45.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkf275htdxmquiu4xy45.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Row counts were compared between Version 6 and Version 7 to validate the append operation. The dataset size increased from 312,456,680 rows to 312,456,682 rows, confirming that two new records were successfully added.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc320oz5z6fry34ynf6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc320oz5z6fry34ynf6j.png" alt="Notebook" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7expsdmadzs0gmrrtvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7expsdmadzs0gmrrtvg.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additional filtering queries isolated the newly inserted rows using high user IDs. Timestamp-based time travel was also demonstrated to retrieve the table snapshot immediately before the append occurred.&lt;/p&gt;
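
&lt;p&gt;The timestamp-based variant takes the same shape; the timestamp below is a placeholder, not the actual value used:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Snapshot immediately before the append; placeholder timestamp.
spark.sql(
    "SELECT COUNT(*) FROM ecom_orders TIMESTAMP AS OF '2026-03-11 15:00:00'"
).show()
&lt;/code&gt;&lt;/pre&gt;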

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81vdtw4pku3ix91ub3u9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81vdtw4pku3ix91ub3u9.png" alt="Notebook" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An earlier attempt to query the initial table version failed due to Delta retention policies and a prior VACUUM operation, highlighting an important production consideration when relying on historical table versions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvedndfoqmazl8t92n8it.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvedndfoqmazl8t92n8it.png" alt="Notebook" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the implementation process, ChatGPT helped diagnose schema mismatches during append operations and guided the correct use of Delta time travel queries within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7tx5mqmvnpxhkqf1l7k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7tx5mqmvnpxhkqf1l7k.png" alt="Codes" width="800" height="3098"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 10 - Query Optimization &amp; Explain Plans</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Wed, 11 Mar 2026 10:52:28 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-10-query-optimization-explain-plans-1dbp</link>
      <guid>https://dev.to/nexoperose/day-10-query-optimization-explain-plans-1dbp</guid>
      <description>&lt;p&gt;Day 10 of Phase 2 focused on Query Optimization &amp;amp; Execution Analysis in Spark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqggq8ipw14koimnoze9n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqggq8ipw14koimnoze9n.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The objective was to run a heavy analytical query on the event dataset, inspect its execution plan, and analyze how query design affects performance. A purchase aggregation query was executed to identify the most active buyers in the dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc31v84ms9zu9560tbzrn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc31v84ms9zu9560tbzrn.png" alt="Notebook" width="800" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using Spark’s &lt;strong&gt;EXPLAIN&lt;/strong&gt; functionality, the parsed, analyzed, optimized, and physical execution plans were examined. The physical plan revealed stages such as Photon scans, hash aggregation, shuffle exchanges, and sorting operations.&lt;/p&gt;
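
&lt;p&gt;A representative version of the query and plan inspection, with assumed table and column names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;top_buyers = spark.sql("""
    SELECT user_id, COUNT(*) AS purchases
    FROM ecom_events
    WHERE event_type = 'purchase'
    GROUP BY user_id
    ORDER BY purchases DESC
    LIMIT 20
""")

# True prints all four plans: parsed, analyzed, optimized, physical.
top_buyers.explain(True)
&lt;/code&gt;&lt;/pre&gt;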

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t9vctqazi4ccpsmb8i1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t9vctqazi4ccpsmb8i1.png" alt="Notebook" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Execution timing demonstrated the effect of query complexity. The aggregation query executed in approximately 2.20 seconds, while a simplified projection query that removed aggregation and sorting completed in approximately 1.41 seconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7zq5x3ro134f4goup4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7zq5x3ro134f4goup4f.png" alt="Notebook" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Caching was attempted as part of the optimization workflow, but serverless compute restrictions prevented persistence operations. Optimization was therefore demonstrated through query simplification and explain-plan interpretation instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqddc76klr99d6tz4byq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqddc76klr99d6tz4byq.png" alt="Notebook" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the process, ChatGPT assisted with explain-plan interpretation and query optimization reasoning within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ekkyf4nhhkxmzys3kpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ekkyf4nhhkxmzys3kpl.png" alt="Codes" width="761" height="1947"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 9 - Recommendation System</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Mon, 09 Mar 2026 11:18:05 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-9-recommendation-system-4md7</link>
      <guid>https://dev.to/nexoperose/day-9-recommendation-system-4md7</guid>
      <description>&lt;p&gt;Day 9 of Phase 2: AI System Building focused on implementing a collaborative filtering Recommendation System using ALS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F327o53owioqtycba39gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F327o53owioqtycba39gi.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;User interactions were mapped into rating values (purchase = 3, cart = 2, view = 1) to simulate implicit feedback strength. An ALS model was trained on a controlled subset of users to prevent memory overflow in a shared/serverless environment.&lt;/p&gt;
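
&lt;p&gt;A compact sketch of the setup described here; column names are assumptions, and a row-level sample stands in for the post's controlled user subset:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import functions as F
from pyspark.ml.recommendation import ALS

# Map interaction types to implicit-feedback ratings (purchase=3, cart=2, view=1).
ratings = events.select(
    F.col("user_id").cast("int").alias("user"),
    F.col("product_id").cast("int").alias("item"),
    F.when(F.col("event_type") == "purchase", 3)
     .when(F.col("event_type") == "cart", 2)
     .otherwise(1)
     .alias("rating"),
)

# Train on a small sample to stay within serverless memory limits.
als = ALS(userCol="user", itemCol="item", ratingCol="rating",
          rank=10, maxIter=5, coldStartStrategy="drop", seed=42)
model = als.fit(ratings.sample(fraction=0.05, seed=42))
&lt;/code&gt;&lt;/pre&gt;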

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5bp8icwm7pwtactj7hr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe5bp8icwm7pwtactj7hr.png" alt="Notebook" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1ol09n8hx1esiyun8cn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1ol09n8hx1esiyun8cn.png" alt="Notebook" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Initial attempts using StringIndexer caused model size overflow due to high cardinality. Numeric casting of user and product IDs resolved this issue. Training on the full dataset resulted in heap memory errors, so user sampling and product pool limitation were applied to stabilize computation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt7y1v7losn28zc1l0uo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt7y1v7losn28zc1l0uo.png" alt="Notebook" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28zbtznlxiuw8qz34id5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28zbtznlxiuw8qz34id5.png" alt="Notebook" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because Unity Catalog restricts nested array rendering, manual candidate scoring and window-based ranking were implemented to generate Top-5 recommendations per user. Historical interactions were removed to ensure novelty in recommendations, which reduced counts for some users due to limited candidate coverage.&lt;/p&gt;
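
&lt;p&gt;The manual ranking step could be sketched as follows; &lt;strong&gt;candidates&lt;/strong&gt; is an assumed DataFrame of user-product pairs to score:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import Window

scored = model.transform(candidates)

# Drop items the user has already interacted with, to keep novelty.
fresh = scored.join(ratings.select("user", "item"), ["user", "item"], "left_anti")

# Rank candidates per user and keep the top 5.
w = Window.partitionBy("user").orderBy(F.desc("prediction"))
top5 = fresh.withColumn("rank", F.row_number().over(w)).filter("rank &amp;lt;= 5")
&lt;/code&gt;&lt;/pre&gt;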

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhwy0dskvassmy8hz7j5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhwy0dskvassmy8hz7j5.png" alt="Notebook" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko0apa0eljz4y8uziaf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko0apa0eljz4y8uziaf9.png" alt="Notebook" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Throughout implementation, ChatGPT supported architectural decisions, memory optimization, and troubleshooting within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17x27j9ysfpz00upcn7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17x27j9ysfpz00upcn7f.png" alt="Notebook" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ihn5521o0vo9jvqg1c4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ihn5521o0vo9jvqg1c4.png" alt="Codes" width="761" height="1947"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 8 - Batch Inference Pipeline</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Sun, 08 Mar 2026 12:03:10 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-8-batch-inference-pipeline-1n0o</link>
      <guid>https://dev.to/nexoperose/day-8-batch-inference-pipeline-1n0o</guid>
      <description>&lt;p&gt;Day 8 of Phase 2: AI System Building focused on implementing a batch inference pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynobbg84q8e519wwz7vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynobbg84q8e519wwz7vg.png" alt="Concept Visual" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the engineered Silver feature table, feature vectors were assembled and applied to the trained Random Forest model to score over 5.3 million users. The model generated prediction probabilities and class outputs, which were then persisted into a managed Gold Delta table to simulate a production-style scoring layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feo6ir92fbyu6vqm1xhj1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feo6ir92fbyu6vqm1xhj1.png" alt="Notebook" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtgo5qgvoc41ji1hrwjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtgo5qgvoc41ji1hrwjg.png" alt="Notebook" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, Spark ML probability outputs were stored as VectorUDT types, requiring explicit conversion before extracting class probabilities. Additionally, notebook schema rendering messages initially appeared as errors but were confirmed to be display-related rather than pipeline failures. These debugging steps reinforced the importance of understanding Spark’s internal data types during inference workflows.&lt;/p&gt;
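
&lt;p&gt;The conversion step in sketch form; &lt;strong&gt;rf_model&lt;/strong&gt; and &lt;strong&gt;feature_df&lt;/strong&gt; are assumed names for the trained model and assembled feature table:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

scored = rf_model.transform(feature_df)

# probability is a VectorUDT column; convert it to an array before
# extracting the positive-class probability.
results = scored.select(
    "user_id",
    vector_to_array(F.col("probability"))[1].alias("purchase_probability"),
    "prediction",
)
&lt;/code&gt;&lt;/pre&gt;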

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fks5ajkktnudd57tm02j3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fks5ajkktnudd57tm02j3.png" alt="Notebook" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu1fi5mn99f8nah8ngb0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftu1fi5mn99f8nah8ngb0.png" alt="Notebook" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The highest-ranked users displayed probabilities close to 1.0, consistent with earlier model evaluation outcomes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzs566nrmy446mtltoef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzs566nrmy446mtltoef.png" alt="Notebook" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Throughout the process, ChatGPT assisted in resolving vector extraction issues and validating inference pipeline logic within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfhae5alne72k1jzlogi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfhae5alne72k1jzlogi.png" alt="Codes" width="800" height="2863"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This exercise completed the transition from experimentation to operational batch scoring in the AI system workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>sql</category>
      <category>database</category>
    </item>
    <item>
      <title>DAY 7 - MLflow Tracking</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Sat, 07 Mar 2026 12:44:06 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-7-mlflow-tracking-33bb</link>
      <guid>https://dev.to/nexoperose/day-7-mlflow-tracking-33bb</guid>
      <description>&lt;p&gt;Day 7 of Phase 2: AI System Building focused on experiment tracking using MLflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fw53fo27l1g8d4soz7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fw53fo27l1g8d4soz7c.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The objective was to log trained model runs, record parameters and evaluation metrics, and store model artifacts for reproducibility and comparison. Both Logistic Regression and Random Forest models were logged along with ROC-AUC scores, which were observed to be close to 1.0.&lt;/p&gt;
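
&lt;p&gt;In outline, a run of this kind logs parameters, the ROC-AUC metric, and the model artifact together; names and values below are illustrative, with &lt;strong&gt;auc&lt;/strong&gt; assumed to be computed earlier in the notebook:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import mlflow

with mlflow.start_run(run_name="rf_purchase_model"):
    mlflow.log_param("numTrees", 100)
    mlflow.log_metric("roc_auc", auc)
    # dfs_tmpdir points Spark ML serialization at a UC Volume, which the
    # shared/serverless workspace required (path is a placeholder).
    mlflow.spark.log_model(rf_model, "model",
                           dfs_tmpdir="/Volumes/main/default/mlflow_tmp")
&lt;/code&gt;&lt;/pre&gt;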

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwmcf8wqkk9z1ooxmjup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqwmcf8wqkk9z1ooxmjup.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8snwiyyinv8k6x8qvgrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8snwiyyinv8k6x8qvgrs.png" alt="Notebook" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, environment constraints in the shared/serverless workspace required specifying a Unity Catalog Volume path for temporary storage when logging Spark ML models. This highlighted how ML lifecycle management depends on infrastructure configuration, not just modeling logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2s9t0zqb2xkctp8i2iq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2s9t0zqb2xkctp8i2iq.png" alt="Notebook" width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The exercise reinforced the importance of experiment traceability, artifact storage, and reproducibility in scalable AI workflows. It also clarified the difference between logging a model and registering it within a model registry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5f63byiyb83ktgvy19h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5f63byiyb83ktgvy19h.png" alt="Notebook &amp;amp; MLflow UI" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During troubleshooting and configuration, ChatGPT supported validation of MLflow setup and interpretation of lifecycle concepts within Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka6fdktmw7q340z5hov2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka6fdktmw7q340z5hov2.png" alt="Codes" width="800" height="2817"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>DAY 6 - Model Training &amp; Tuning</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Fri, 06 Mar 2026 14:15:05 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-6-model-training-tuning-1l67</link>
      <guid>https://dev.to/nexoperose/day-6-model-training-tuning-1l67</guid>
      <description>&lt;p&gt;As part of Day 6 of Phase 2: AI System Building in the Databricks 14 Days AI Challenge – 2 (Advanced), I focused on model training, tuning, and evaluation using the supervised dataset prepared earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52rtv41vq8ueji4tmkpj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52rtv41vq8ueji4tmkpj.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feature vectors were assembled from engineered user-level metrics, and both Logistic Regression and Random Forest classifiers were trained using an 80/20 train-test split. Model performance was evaluated using ROC-AUC to ensure threshold-independent comparison.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzhc51b7ub6dogha1x71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzhc51b7ub6dogha1x71.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev70fficeylvlsynihbn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev70fficeylvlsynihbn.png" alt="Notebook" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Due to workspace limitations in the shared/serverless environment, CrossValidator-based tuning was not supported because of temporary storage configuration restrictions. As a result, hyperparameter tuning for Random Forest was performed manually by iterating over different tree counts and depths.&lt;/p&gt;
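
&lt;p&gt;The manual loop amounts to a small hand-rolled grid search; grid values are illustrative, and &lt;strong&gt;train&lt;/strong&gt;, &lt;strong&gt;test&lt;/strong&gt;, and &lt;strong&gt;evaluator&lt;/strong&gt; are assumed from earlier cells:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.ml.classification import RandomForestClassifier

results = []
for num_trees in [20, 50, 100]:
    for depth in [5, 10]:
        rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                                    numTrees=num_trees, maxDepth=depth, seed=42)
        auc = evaluator.evaluate(rf.fit(train).transform(test))
        results.append((num_trees, depth, auc))

best = max(results, key=lambda r: r[2])
print("best (numTrees, maxDepth, AUC):", best)
&lt;/code&gt;&lt;/pre&gt;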

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2xltis6ra8j2twestz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2xltis6ra8j2twestz5.png" alt="Notebook" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfg8dq1yi801y0fl371d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfg8dq1yi801y0fl371d.png" alt="Notebook" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The observed AUC values were extremely high (≈0.999999 for Logistic Regression and 1.0 for Random Forest). Because the engineered features include purchase counts and spending, the purchase label is almost fully determined by the features themselves, so near-perfect scores point to information leakage rather than genuine predictive power. This highlighted the need to carefully assess feature-label relationships in supervised learning workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz52i5b8ekai9otq2bka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz52i5b8ekai9otq2bka.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, ChatGPT supported validation of model configuration, evaluation logic, environment troubleshooting, and interpretation of performance metrics within scalable AI system design practices inside Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh8mmpavmh8lr5sn8lgz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh8mmpavmh8lr5sn8lgz.png" alt="Codes" width="800" height="3619"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>pyspark</category>
    </item>
    <item>
      <title>DAY 5 - Production-Grade Feature Engineering</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Thu, 05 Mar 2026 19:04:30 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-5-production-grade-feature-engineering-1n63</link>
      <guid>https://dev.to/nexoperose/day-5-production-grade-feature-engineering-1n63</guid>
      <description>&lt;p&gt;As part of Day 5 of Phase 2: AI System Building in the Databricks 14 Days AI Challenge – 2 (Advanced), I focused on preparing a production-ready supervised learning dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdjfkkec2qi378zyvbvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdjfkkec2qi378zyvbvh.png" alt="Visual Concept" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process began by creating a binary purchase label at the user level using event-level data. A user was labeled as 1 if at least one purchase event existed, otherwise 0. This label dataset was then joined with the previously engineered Silver feature table to create a consolidated training dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt5ka0w1n9ebx7xuvix6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt5ka0w1n9ebx7xuvix6.png" alt="Notebook" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An 80/20 train-test split was applied with a fixed seed to ensure reproducibility. Class distributions were then validated on the full dataset and on both splits; the observed class ratio remained stable across all three partitions, confirming that the split preserved the label balance.&lt;/p&gt;
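
&lt;p&gt;Continuing the sketch above with the same hypothetical names, the split and the distribution check reduce to a few lines:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
# Reproducible 80/20 split with a fixed seed.
train_df, test_df = training.randomSplit([0.8, 0.2], seed=42)

# Confirm that class proportions stay consistent across partitions.
for name, df in [("full", training), ("train", train_df), ("test", test_df)]:
    total = df.count()
    positives = df.filter(F.col("label") == 1).count()
    print(name, total, round(positives / total, 4))
&lt;/code&gt;&lt;/pre&gt;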

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6vrk344qdhci23ubux2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6vrk344qdhci23ubux2.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, ChatGPT was used as a technical reference to validate the aggregation logic, review join consistency, and confirm that the class distribution calculations aligned with scalable data engineering workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcglthifohen7uwzyuu5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcglthifohen7uwzyuu5m.png" alt="Codes" width="722" height="1217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>DAY 4 – Structured Streaming (Basic Simulation)</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Wed, 04 Mar 2026 17:12:28 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-4-structured-streaming-basic-simulation-3pl1</link>
      <guid>https://dev.to/nexoperose/day-4-structured-streaming-basic-simulation-3pl1</guid>
      <description>&lt;p&gt;As part of Day 4 of Phase 1: Better Data Engineering in the Databricks 14 Days AI Challenge – 2 (Advanced), I explored the basics of Structured Streaming through a folder-based simulation approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2mpce77l0f0dx2lxlk5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2mpce77l0f0dx2lxlk5.png" alt="Day-4 in Short" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The objective was to simulate incremental data ingestion by monitoring a folder for incoming files and writing processed results into Delta format. Streaming input and checkpoint directories were prepared within Volume storage, and a predefined schema was used to configure streaming reads from curated data.&lt;/p&gt;
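
&lt;p&gt;A minimal sketch of that setup. The Volume paths and the schema fields are hypothetical placeholders chosen for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical Volume paths for the folder-based simulation.
input_path = "/Volumes/main/ecommerce/stream_input"
checkpoint_path = "/Volumes/main/ecommerce/stream_checkpoint"
output_path = "/Volumes/main/ecommerce/stream_output"

# A predefined schema lets the stream start without schema inference.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("price", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Monitor the input folder for newly arriving files.
stream = spark.readStream.schema(schema).format("csv").load(input_path)
&lt;/code&gt;&lt;/pre&gt;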

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k0v9zgoi9c9lbpnoyml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6k0v9zgoi9c9lbpnoyml.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, several practical challenges were encountered: Volume paths had to be validated, input folders prepared, and workspace limitations ruled out continuous streaming triggers. The workflow was therefore adapted to an alternative trigger suited to controlled, run-to-completion execution. Checkpoint behavior also showed that files detected in earlier runs are skipped on subsequent runs, which is precisely how incremental ingestion is maintained.&lt;/p&gt;
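
&lt;p&gt;A trigger suited to this kind of controlled, run-to-completion execution is availableNow, which processes every pending file and then stops; the sketch below assumes it in place of a continuous trigger, building on the reader defined above:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
# availableNow processes everything currently pending, commits progress to
# the checkpoint, then stops; files already recorded in the checkpoint are
# skipped on the next run.
query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .start(output_path)
)
query.awaitTermination()
&lt;/code&gt;&lt;/pre&gt;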

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wshdst6yhjf3bfy7x4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wshdst6yhjf3bfy7x4r.png" alt="Notebook" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although the streaming output could not be consistently demonstrated under the environment's constraints, the exercise provided valuable insight into how storage configuration, checkpoints, and execution environments affect streaming pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrwhkrmrm79ekm92lg7x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrwhkrmrm79ekm92lg7x.png" alt="Codes" width="800" height="3026"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>Day 3 - Job Orchestration Basics</title>
      <dc:creator>Subhasis Das</dc:creator>
      <pubDate>Tue, 03 Mar 2026 10:15:17 +0000</pubDate>
      <link>https://dev.to/nexoperose/day-3-job-orchestration-basics-1hin</link>
      <guid>https://dev.to/nexoperose/day-3-job-orchestration-basics-1hin</guid>
      <description>&lt;p&gt;As part of Day 3 of Phase 1: Better Data Engineering in the Databricks 14 Days AI Challenge – 2 (Advanced), the focus moved toward understanding job orchestration and preparing notebooks for automated execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7b62f5nebokucc98ccn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7b62f5nebokucc98ccn.png" alt="An Overview" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The notebook was first enhanced by introducing widget parameters to support runtime configuration. This allowed the workflow to remain flexible and reusable instead of relying on hardcoded execution logic.&lt;/p&gt;
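
&lt;p&gt;In Databricks this takes only a few lines of dbutils; the widget names and defaults below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
# Define widget parameters with defaults, then read them at runtime.
dbutils.widgets.text("catalog", "main")
dbutils.widgets.text("source_table", "ecommerce.events")

catalog = dbutils.widgets.get("catalog")
source_table = dbutils.widgets.get("source_table")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When the notebook later runs as a Job task, values supplied through the task configuration override these defaults, which is what keeps the workflow free of hardcoded logic.&lt;/p&gt;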

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d5s60bg7jqz1xexvrgb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d5s60bg7jqz1xexvrgb.png" alt="The Notebook" width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The feature engineering logic developed earlier was then modularized into a function. Organizing transformations this way improved readability and made the notebook better suited for pipeline-based execution.&lt;/p&gt;
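
&lt;p&gt;A simplified sketch of that structure, reusing the hypothetical widget value from above and illustrative column names:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
from pyspark.sql import functions as F

def build_user_features(events_df):
    # Aggregate event-level rows into user-level behavioral features.
    return events_df.groupBy("user_id").agg(
        F.count("*").alias("total_events"),
        F.sum(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("purchases"),
    )

# The source table comes from the widget parameter rather than a hardcoded name.
features_df = build_user_features(spark.table(source_table))
&lt;/code&gt;&lt;/pre&gt;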

&lt;p&gt;Following this, a Job was created using the workflow interface in Databricks. The notebook was added as a task, parameters were passed through configuration, and a daily schedule was defined to automate execution.&lt;/p&gt;
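
&lt;p&gt;The Job itself was assembled in the UI, but the same configuration can also be expressed in code against the Jobs 2.1 REST API. Every name, path, URL, and schedule value below is a hypothetical placeholder:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;
import requests

# Placeholder workspace URL and personal access token.
host = "https://adb-1234567890.12.azuredatabricks.net"
token = "dapi..."

payload = {
    "name": "daily-feature-engineering",
    "tasks": [{
        "task_key": "feature_engineering",
        "notebook_task": {
            "notebook_path": "/Workspace/Users/me/feature_engineering",
            "base_parameters": {"catalog": "main", "source_table": "ecommerce.events"},
        },
    }],
    # Quartz cron: run daily at 06:00 UTC.
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    host + "/api/2.1/jobs/create",
    headers={"Authorization": "Bearer " + token},
    json=payload,
)
print(resp.json())
&lt;/code&gt;&lt;/pre&gt;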

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrmbl6slvxhi816jh0ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrmbl6slvxhi816jh0ui.png" alt="Steps in Job Creation" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclgaot1wr3hok7ezar95.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclgaot1wr3hok7ezar95.png" alt="Steps in Job Creation" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During implementation, ChatGPT supported the process as a technical reference for validating orchestration concepts and notebook structuring decisions.&lt;/p&gt;

&lt;p&gt;This exercise helped demonstrate how data workflows evolve from manual notebook runs into repeatable and scheduled data engineering pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pewter-porch-d86.notion.site/Databricks-14-Days-AI-Challenge-2-30947a4b88b880a5a363f2181e43601e?source=copy_link" rel="noopener noreferrer"&gt;Activity Log&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6044h6c3e5lmqk4tp5v7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6044h6c3e5lmqk4tp5v7.png" alt="The Codes" width="683" height="852"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
