DEV Community: Lucy

Shopify Scripts Are Dead — Here's How to Migrate to Shopify Functions Before June 30, 2026

Lucy — Fri, 26 Jun 2026 07:14:33 +0000

If you're running a Shopify Plus store with custom Scripts, you have 4 days left.
June 30, 2026 is Shopify's hard deprecation date for Shopify Scripts — the Ruby-based customization layer powering custom discounts, shipping rules, and payment logic for thousands of merchants. After this date, every Script stops executing. Silently. With no fallback. Your checkout reverts to Shopify's defaults as if your custom logic never existed.

Critical: As of April 15, 2026, the Script Editor is already locked. You can no longer edit or publish Scripts. Any Script still running in production is frozen code — bugs cannot be patched. June 30 is when execution stops entirely. If you haven't started migrating, you are running unmodifiable code in production right now.

This isn't a "we'll get to it eventually" situation. It's a breaking change with a hard deadline days away.
In this guide I'll walk you through:

What's actually changing and why
The direct replacement for each Script type
A step-by-step migration path with corrected CLI commands
Real code examples showing before and after

Let's move fast.

What Are Shopify Scripts?

Shopify Scripts were introduced as a Shopify Plus-only feature that let developers write Ruby code to customize the cart and checkout experience. Three types existed:

Line Item Scripts — modify prices, apply discounts, bundle logic
Shipping Scripts — customize shipping rates, hide or rename options
Payment Scripts — show or hide payment methods based on conditions

They were powerful for their time, but they carried serious limitations: Ruby-only, Plus-only, slow to test, no version control, and fundamentally incompatible with Shopify's Checkout Extensibility architecture that replaced checkout.liquid for Plus merchants in 2023–2024.

Why Shopify Is Deprecating Them

Three reasons, in plain terms:

1. Checkout Extensibility replaced checkout.liquid — Scripts are architecturally incompatible with the new checkout model
2. Performance — Ruby Scripts ran server-side with cold-start delays. Shopify Functions compile to WebAssembly and run under a strict 5ms execution cap. Rust Functions typically execute in 3–5ms; JavaScript Functions run 10–30ms in real-world use and should be used only for simpler logic
3. Platform-wide access — Shopify Functions are available to all plan levels via installed apps, not just Shopify Plus subscribers

Scripts vs. Functions: What Actually Changed

The architectural shift, in brief: Scripts were a Plus-only workaround bolted onto an older platform, while Functions are native, WebAssembly-powered infrastructure available on every plan.

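What Gets Replaced by What?

Line Item Scripts (order-level discounts) → Discount Functions (order discounts)
Line Item Scripts (product or variant logic) → Discount Functions (product discounts)
Line Item Scripts (bundles and kits) → Cart Transform Functions
Shipping Scripts → Delivery Customization Functions
Payment Scripts → Payment Customization Functions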

Enter Shopify Functions

Shopify Functions are the modern replacement. You write them in JavaScript/TypeScript or Rust — the two officially maintained, first-class languages in 2026. Any language that compiles to WebAssembly is technically supported, but Rust and JavaScript are the only paths with active Shopify CLI tooling and official support.

Choose based on your use case:
1. JavaScript/TypeScript — good for prototyping, simpler discount logic, teams without Rust experience. Compiled via Shopify's Javy toolchain.
2. Rust — recommended for complex logic, large carts, public apps, or any Function near the 256KB binary size limit. Runs 3–5ms versus 10–30ms for JavaScript.

Platform limits (as of 2026): Each store can run a maximum of 5 Discount Functions, 1 Cart Transform Function, and 5 Validation Functions. Plan your migration with these caps in mind if you have many Scripts.

Migration: Step by Step

Step 1: Audit your current Scripts

Go to Shopify Admin → Apps → Script Editor. The editor is now read-only (locked since April 15), but you can still view all active Scripts, export the customizations report, and read the source logic.
List every active Script, what business rule it enforces, and every edge case it handles.

Pay attention to:

Is this Script still being used in production?
What customer-facing behaviour does it produce?
Are there conditional rules — customer tags, order thresholds, product exclusions, B2B rules?

Use the Shopify Scripts customizations report (available from the Script Editor page) to export the full list automatically.

Step 2: Set up your Shopify Functions environment

You need Shopify CLI 4.0 or higher:

# Install (or update) the CLI globally, then verify the version
npm install -g @shopify/cli@latest
shopify version  # confirm 4.0+

Create a new app or add a Function extension to an existing one:

# Create a new app (correct CLI 4.0 syntax)
npm init @shopify/app@latest

# Or add an extension to an existing app
shopify app generate extension
# Choose: Discount, Cart Transform, Delivery Customization, or Payment Customization

Step 3: Scaffold the right Function type

shopify app generate extension
# Select "Discount - Order discounts"    → for order-level discount Scripts
# Select "Discount - Product discounts"  → for variant/product-level logic
# Select "Delivery customization"        → for Shipping Scripts
# Select "Payment customization"         → for Payment Scripts
# Select "Cart transform"                → for bundle/kit Line Item Scripts
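If your audit turned up many Scripts, it can help to check the mapping above against the per-store caps quoted earlier before you scaffold anything. The following is a rough planning sketch in Python, not part of Shopify's tooling: the script kinds, the example Script names, and the `plan_migration` helper are all hypothetical, and the caps are simply the figures quoted in this article.

```python
from collections import Counter

# Script kind -> Function extension type, following the Step 3 list above.
TARGET_EXTENSION = {
    "line_item_order_discount": "Discount - Order discounts",
    "line_item_product_discount": "Discount - Product discounts",
    "line_item_bundle": "Cart transform",
    "shipping": "Delivery customization",
    "payment": "Payment customization",
}

# Per-store caps as quoted in this article (assumption: verify against current docs).
CAPS = {"Discount": 5, "Cart transform": 1}


def plan_migration(scripts):
    """Map each audited Script to its target extension and flag cap overruns."""
    plan = [(name, TARGET_EXTENSION[kind]) for name, kind in scripts]
    # "Discount - Order discounts" and "Discount - Product discounts" share one cap.
    counts = Counter(target.split(" - ")[0] for _, target in plan)
    over = {family: n for family, n in counts.items() if n > CAPS.get(family, n)}
    return plan, over


# Hypothetical audit result: (Script name, kind)
audit = [
    ("VIP tiered pricing", "line_item_order_discount"),
    ("Starter kit bundle", "line_item_bundle"),
    ("Gift set bundle", "line_item_bundle"),
    ("Hide express for fragile items", "shipping"),
]
plan, over = plan_migration(audit)
print(over)
```

In this made-up audit, two bundle Scripts map to Cart Transform, which exceeds the single Cart Transform Function quoted above, so their logic would need to be merged into one Function.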

Step 4: Rewrite your logic in JavaScript or Rust

Here's a real before/after for the most common migration case — 10% off all orders over $100:

Before (Ruby — Shopify Script)

# Apply 10% off every line item once the cart subtotal reaches $100
if Input.cart.subtotal_price >= Money.new(cents: 100_00)
  Input.cart.line_items.each do |line_item|
    line_item.change_line_price(
      line_item.line_price * 0.9,
      message: "10% bulk discount"
    )
  end
end

Output.cart = Input.cart

After (JavaScript — Shopify Order Discount Function)

// run.js — Order Discount Function
export function run(input) {
  const subtotal = parseFloat(
    input.cart.cost.subtotalAmount.amount
  );

  if (subtotal >= 100) {
    return {
      discounts: [
        {
          targets: [
            { orderSubtotal: { excludedVariantIds: [] } }
          ],
          value: { percentage: { value: "10.0" } },
          message: "10% bulk discount"
        }
      ],
      discountApplicationStrategy: "FIRST"
    };
  }

  return {
    discounts: [],
    discountApplicationStrategy: "FIRST"
  };
}

The business logic is the same. The architecture is fundamentally better — compiled to WebAssembly, version-controlled, deployable via CI/CD, and accessible on all Shopify plans.

Using Rust instead? The logic structure is similar. When you select Rust at the language prompt, Shopify's CLI scaffolds a full Rust project with the shopify_function crate. For complex discount engines with large catalogs, Rust is the safer choice because JavaScript Functions can exceed the 5ms execution cap on heavy carts.

Step 5: Test in a development store

shopify app dev

This deploys your Function to a development store in draft mode. Create a discount in Shopify Admin that uses your new Function, then test it across all cart scenarios — especially every edge case from the original Script.
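Before clicking through carts by hand, it may help to write down the results you expect at the boundaries. Here is a minimal sketch in Python of an expected-results table for the 10%-over-$100 rule from Step 4. It models the business rule only and does not call Shopify; the `expected_discount` helper is hypothetical, and rounding to the nearest cent is an assumption that may differ from how checkout rounds.

```python
from decimal import Decimal

THRESHOLD = Decimal("100")
RATE = Decimal("0.10")


def expected_discount(subtotal: str) -> Decimal:
    """Discount the 10%-over-$100 rule should produce for a subtotal string."""
    amount = Decimal(subtotal)
    if amount >= THRESHOLD:
        # Assumption: round to the nearest cent.
        return (amount * RATE).quantize(Decimal("0.01"))
    return Decimal("0.00")


# subtotal -> expected discount, focusing on the threshold boundary
CASES = {
    "0.00": "0.00",
    "99.99": "0.00",
    "100.00": "10.00",
    "100.01": "10.00",
    "250.50": "25.05",
}

for subtotal, want in CASES.items():
    got = expected_discount(subtotal)
    assert got == Decimal(want), (subtotal, got)
print("all boundary cases match")
```

Each row becomes one cart to build in the development store. Any row where checkout disagrees with the table points to a difference between the old Script and the new Function.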

Step 6: Deploy to production and monitor

shopify app deploy

Link your Function to a discount or customization in Shopify Admin (or via the Admin API). Monitor checkout conversion rate carefully for the first 48 hours after going live. A measurable drop typically signals a missed edge case, not a platform issue.

Common Migration Pitfalls

1. Scripts that relied on execution order

Scripts ran sequentially and could interact with each other. Functions run independently in parallel. If you had two Line Item Scripts that stacked or modified each other's output, refactor the logic into a single Function or control stacking behaviour explicitly with discountApplicationStrategy.
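To see why execution order matters, consider two rules that used to stack. The sketch below uses made-up rules and prices and only models the arithmetic, not Shopify's discount engine: applied in sequence, the result depends on the order, while rules evaluated independently each see the original price.

```python
from decimal import Decimal


def ten_percent_off(price):
    return price * Decimal("0.90")


def five_dollars_off(price):
    return price - Decimal("5.00")


line_price = Decimal("80.00")

# Script-style stacking: the second rule sees the first rule's output.
percent_then_fixed = five_dollars_off(ten_percent_off(line_price))
fixed_then_percent = ten_percent_off(five_dollars_off(line_price))

# Function-style: each rule is computed against the original price.
independent_savings = [
    line_price - ten_percent_off(line_price),
    line_price - five_dollars_off(line_price),
]

print(percent_then_fixed, fixed_then_percent, independent_savings)
```

Here the same two rules give 67.00 or 67.50 depending on order. If your old Scripts relied on one specific order, that order has to be written explicitly inside a single Function.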

2. Bundle logic mapped to the wrong Function type

Bundle discounts in Line Item Scripts map to Cart Transform Functions, not Discount Functions. These are a completely separate extension type — selecting the wrong one at scaffold time means you're building against the wrong API schema.

3. Assuming Shipping Scripts are low priority

All three Script types — line item, shipping, and payment — stop executing on June 30. Shipping Scripts are often lower revenue impact than discount Scripts, but if your store hides express shipping for fragile items, shows different carrier options by customer tag, or applies shipping discounts conditionally, those rules disappear on the deadline. Migrate them before June 30, not after.

4. Edge cases left uncoded

Ruby Scripts accumulated implicit logic over years — gift card exclusions, B2B pricing tiers, free shipping thresholds, variant-level carve-outs. Functions are explicit. Every condition your business requires must be coded. The audit in Step 1 is your safety net here.

5. JavaScript Functions on complex carts

JavaScript Functions run 10–30ms in real-world use. Shopify's 5ms execution cap means JavaScript Functions can be aborted on heavy carts, leaving the checkout in its default state. For stores with large catalogs or many line items, write in Rust. For simple rules (e.g. flat percentage discount on all orders), JavaScript is fine.

6. Not reading the GraphQL input schema

Each Function type has a defined input schema. You control what data your Function receives by editing run.graphql. If your logic needs metafields, customer tags, or product attributes, add them to the input query — do not assume they arrive automatically.

What If You're Already Behind?

If you're reading this with four days on the clock and active Scripts still in production, here's the prioritisation:

1. Revenue-critical Scripts first — discount Scripts affecting checkout conversion and payment Scripts controlling visible gateways go first
2. Stub before you optimise — a working Function that matches the Script's behaviour is better than a perfectly architected one that isn't deployed
3. Migrate all Script types before June 30 — Line Item, Shipping, and Payment Scripts all stop on the same date. Don't assume Shipping Scripts can wait; schedule them in the same sprint
4. Use Shopify's migration docs at shopify.dev — the customizations report and Function input schemas are well-documented

If you have a complex store with multiple interdependent Scripts and the timeline feels impossible, working with a certified Shopify expert agency is the fastest path. Teams like Lucent Innovation — 10+ years as a Shopify Plus partner, 12+ years in ecommerce, 5,000+ stores delivered — can run a focused Functions migration sprint with proper QA, significantly faster than doing this under pressure in-house.

Quick Recap

Shopify Scripts deprecate June 30, 2026 — no extension, no fallback, no grace period
Script Editor locked since April 15, 2026 — existing Scripts are frozen and cannot be modified
Shopify Functions replace Scripts with WebAssembly execution, all-plan access, and full CI/CD support
Write Functions in JavaScript/TypeScript (simpler logic) or Rust (complex logic, large carts)
Per-store limits apply: 5 Discount Functions, 1 Cart Transform, 5 Validation Functions
Migration path: audit via customizations report → scaffold correct type → rewrite → test → deploy

Start with the audit. Everything else follows from there.

Already migrated? What was the trickiest Script to convert? Drop it in the comments — would love to hear real-world edge cases.

About the author: This post was written by the engineering team at Lucent Innovation, a certified Shopify Plus partner with 10+ years on the platform and 12+ years building and scaling ecommerce stores.

Migrate Your Data to Delta Lake: A Simple Guide for Developers

Lucy — Tue, 16 Jun 2026 09:31:12 +0000

TLDR: Delta Lake is like adding safety rules to your data storage. It stops data accidents, lets you see old versions of your data, and keeps everything organized. If you work with big data and Apache Spark, Delta Lake makes your life easier. It uses ACID transactions (fancy words for "data never breaks") and costs way less than old data warehouses.

What This Article Covers: This post explains what Delta Lake is, why you should move your data to it, and how to start the migration. We look at real problems Delta Lake solves and give you a simple roadmap for getting started. For more in-depth technical details, see Delta Lake Explained on the Lucent Innovation blog.

The Problem With Old Data Lakes

Imagine you have a giant closet to store files. In the old days, data lakes worked like that. You threw all your data files in there, and it was cheap. But there was a big problem.

What happens if someone is reading a file while someone else is writing to it? The file could get messed up. What if the computer crashes while saving? You might lose all your work. There was no good way to fix mistakes or go back to how things were before.

That is the main problem Delta Lake solves.

What Is Delta Lake?

Delta Lake is a layer of software that sits on top of your data files. Think of it like a smart librarian for your data. It keeps track of every change, makes sure nothing gets broken, and lets you fix mistakes fast.

The cool part is that Delta Lake does this without charging you a fortune. It runs on cheap cloud storage like Amazon S3 or Azure Blob Storage, just like regular data lakes. But it adds power that usually costs a lot of money.

The Big Features That Matter

ACID Transactions (No More Broken Data)

ACID is short for Atomicity, Consistency, Isolation, and Durability. That means:

Atomicity: Either the whole write happens or none of it happens. No half-finished work.
Consistency: Data stays clean and organized, no broken records.
Isolation: People can read and write at the same time without getting in each other's way.
Durability: Once data is saved, it stays saved. No losing work if the power goes out.

In plain English: You never have broken data or weird errors from too many people using the system at once.

Time Travel (Go Back in Time)

Did you delete something by accident? Need to check how data looked last week? Delta Lake keeps earlier versions of a table in its transaction log. You can run a query that shows your data as of any version or timestamp that is still within the table's retention period.

This is super helpful for audits, fixing mistakes, or checking when something went wrong.
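The idea behind time travel can be sketched with a toy append-only log in plain Python. This is an illustration only, not how Delta Lake is implemented: the `ToyVersionedTable` class and its methods are hypothetical, and real tables store data files and log entries rather than single rows.

```python
class ToyVersionedTable:
    """Toy append-only commit log. Not Delta Lake's real implementation."""

    def __init__(self):
        self._commits = []  # each commit is ("add" | "remove", row)

    def commit(self, action, row):
        self._commits.append((action, row))
        return len(self._commits) - 1  # version number of this commit

    def read(self, version_as_of=None):
        """Replay commits up to and including the requested version."""
        last = len(self._commits) - 1 if version_as_of is None else version_as_of
        rows = []
        for action, row in self._commits[: last + 1]:
            if action == "add":
                rows.append(row)
            else:
                rows.remove(row)
        return rows


table = ToyVersionedTable()
table.commit("add", {"id": 1})     # version 0
table.commit("add", {"id": 2})     # version 1
table.commit("remove", {"id": 1})  # version 2: the accidental delete

print(table.read())                 # current state
print(table.read(version_as_of=1))  # state before the delete
```

Because nothing in the log is overwritten, reading an older version is just a matter of replaying fewer commits.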

Schema Enforcement (Keep Data Clean)

A schema is a blueprint that says what columns you have and what type of data goes in each one. Delta Lake watches the door and makes sure bad data never gets in. If someone tries to add data that does not match the blueprint, Delta Lake stops it right there.
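As a rough picture of what "watching the door" means, here is a plain-Python sketch that rejects a whole batch when any row fails the blueprint. The schema, the `validate_batch` and `append` helpers, and the error class are hypothetical; Delta Lake performs its own checks inside the write path.

```python
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "country": str}


class SchemaMismatch(ValueError):
    pass


def validate_batch(rows, schema=EXPECTED_SCHEMA):
    """Reject the whole batch if any row has wrong columns or types."""
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            raise SchemaMismatch(f"row {i}: columns {sorted(row)} do not match schema")
        for column, expected_type in schema.items():
            if not isinstance(row[column], expected_type):
                raise SchemaMismatch(f"row {i}: {column!r} is not {expected_type.__name__}")
    return rows


def append(table, rows):
    # Nothing is written if validation raises, so a bad batch leaves no partial data.
    table.extend(validate_batch(rows))


table = []
append(table, [{"order_id": 1, "amount": 9.5, "country": "DE"}])

try:
    append(table, [
        {"order_id": 2, "amount": 5.0, "country": "FR"},
        {"order_id": "3", "amount": 1.0, "country": "US"},  # order_id has the wrong type
    ])
except SchemaMismatch as err:
    print("rejected:", err)

print(len(table))  # the good row from the rejected batch was not written either
```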

Automatic Updates and Deletes

In regular data lakes, updates and deletes are slow and messy. You often have to rewrite whole tables or partitions by hand. Delta Lake supports UPDATE, DELETE, and MERGE directly: it rewrites only the data files that contain affected rows and records the change in its transaction log.
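A toy model in plain Python shows why a targeted delete is cheaper than rewriting everything: only the file that holds a matching row is replaced, and the others are reused. The file names and the `delete_where` helper are made up for illustration and say nothing about Delta Lake's actual file layout.

```python
def delete_where(files, predicate):
    """Return (new_files, rewritten): only files holding matching rows are rewritten."""
    new_files, rewritten = {}, []
    for name, rows in files.items():
        if any(predicate(r) for r in rows):
            new_files[name + ".v2"] = [r for r in rows if not predicate(r)]
            rewritten.append(name)
        else:
            new_files[name] = rows  # untouched file is reused as-is
    return new_files, rewritten


files = {
    "part-000": [{"id": 1}, {"id": 2}],
    "part-001": [{"id": 3}, {"id": 4}],
    "part-002": [{"id": 5}],
}
new_files, rewritten = delete_where(files, lambda r: r["id"] == 3)
print(rewritten)
```

One of three files is rewritten here. On a table with thousands of files, that difference is what makes small deletes practical.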

Why Move to Delta Lake?

Cost Savings

A big data warehouse from twenty years ago costs a fortune to run. Delta Lake gives you warehouse-level reliability but uses cheap cloud storage underneath. You can save up to 50 times on computing costs while still getting fast answers to your questions.

Speed and Trust

Your team can trust the data faster. You spend less time fixing problems and more time using the data. No more mysteries about whether a number is right or wrong.

Real-Time and Batch in One Place

Some data comes in one batch per day. Some comes in live streams all day long. Delta Lake handles both in the same place, with the same tools. You do not need different systems for different types of data.

Easy Audits and Compliance

Every change is tracked in a log. This is gold when you need to follow rules like GDPR or show customers that their data is safe. You can prove exactly who changed what and when.

How to Start Your Migration

Step 1: Check Your Current Setup

Before you move anything, understand what you have now. Answer these questions:

What files do you store? (CSV, JSON, Parquet, something else?)
How big is your data? (1 GB or 1 TB?)
Who uses it? (Engineers, analysts, AI models?)
What problems do you hit most? (Slow updates? Broken data? Hard to audit?)

Step 2: Start Small

Do not move everything at once. Pick one table or one folder that is not critical. Move it to Delta Lake and run it for a week. See if your team likes it. Break things on purpose to understand how Delta Lake handles problems.

Step 3: Set Up Your Infrastructure

Delta Lake works with Databricks (a company built for this), or you can run it open-source with Apache Spark. For most projects, using Databricks is easier because everything just works together.

If you want to go full open-source, you will need:

Apache Spark with the open-source Delta Lake library (the delta-spark package for Python)
Storage like S3 or Azure
A way to run the code (like a Linux server)

Step 4: Copy Your Data Over

For small amounts of data, you can copy directly. For huge amounts, break it into chunks and move one chunk at a time. This way if something breaks, you only fix that chunk, not everything.
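The chunk-at-a-time approach can be sketched as a small loop that retries each chunk on its own and reports what still failed. This is a minimal sketch: `migrate_in_chunks` and `flaky_copy` are hypothetical names, and in a real migration the copy step would read one partition and write it in Delta format.

```python
def migrate_in_chunks(chunks, copy_chunk, max_attempts=3):
    """Copy each chunk independently; return the names that still failed."""
    failed = []
    for name in chunks:
        for attempt in range(1, max_attempts + 1):
            try:
                copy_chunk(name)
                break  # this chunk is done, move to the next one
            except Exception as exc:
                print(f"{name}: attempt {attempt} failed ({exc})")
        else:
            failed.append(name)  # every attempt failed
    return failed


# Stand-in copy function that simulates one flaky and one broken chunk.
attempts = {}


def flaky_copy(name):
    attempts[name] = attempts.get(name, 0) + 1
    if name == "2024-02" and attempts[name] < 2:
        raise IOError("temporary network error")
    if name == "2024-03":
        raise IOError("corrupt source file")


failed = migrate_in_chunks(["2024-01", "2024-02", "2024-03"], flaky_copy)
print(failed)
```

Only the chunk that keeps failing needs attention afterwards. The chunks that succeeded do not have to be copied again.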

Step 5: Test Everything

Before you tell everyone to use the new system, run real queries on it. Check that the numbers match the old system. Have your analysts double-check important reports.
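One simple way to check that the numbers match is to compare row counts and column totals between the two systems. The sketch below works on plain lists of dictionaries so that it stays self-contained. The `reconcile` helper is hypothetical, and on real data you would run the equivalent aggregate queries against each system instead.

```python
import math


def reconcile(old_rows, new_rows, numeric_columns, tolerance=1e-9):
    """Compare row counts and per-column sums; return a list of mismatches."""
    problems = []
    if len(old_rows) != len(new_rows):
        problems.append(f"row count: {len(old_rows)} vs {len(new_rows)}")
    for column in numeric_columns:
        old_sum = sum(r[column] for r in old_rows)
        new_sum = sum(r[column] for r in new_rows)
        if not math.isclose(old_sum, new_sum, rel_tol=tolerance, abs_tol=tolerance):
            problems.append(f"sum({column}): {old_sum} vs {new_sum}")
    return problems


old = [{"amount": 10}, {"amount": 20}]
new_ok = [{"amount": 10}, {"amount": 20}]
new_missing = [{"amount": 10}]

print(reconcile(old, new_ok, ["amount"]))
print(reconcile(old, new_missing, ["amount"]))
```

An empty list means the basic totals agree. It does not prove every row is identical, so important reports still deserve a manual check.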

Step 6: Switch Over and Watch It

Pick a time when not many people are using the system. Switch everyone to Delta Lake. Have people ready to help if something goes wrong. Watch for problems the first few days.

Step 7: Keep the Old System as Backup

Even after you switch, keep your old data around for a few weeks. If something goes really wrong, you can go back.

Simple Example: Your First Delta Lake Table

If you know Python and Spark, it is super easy:

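from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# On Databricks, `spark` already exists; locally, start a session with Delta enabled
builder = (
    SparkSession.builder.appName("first-delta-table")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Read your data
data = spark.read.csv("old_data.csv", header=True, inferSchema=True)

# Write it as Delta Lake
data.write.format("delta").mode("overwrite").save("delta_table")

# Now read it back as Delta Lake
df = spark.read.format("delta").load("delta_table")

# Time travel: read the table as of its first version (version 0)
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("delta_table")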

That is it. Three lines to switch from regular files to Delta Lake.

Real Costs vs. Old Systems

Let's look at numbers:

System	Cost Per Year	Speed	Broken Data	Ease of Use
Old Data Warehouse	$500K+	Medium	Rare	Hard
Regular Data Lake	$50K	Slow	Common	Easy
Delta Lake	$50K-100K	Fast	Rare	Easy
Databricks Lakehouse	$100K-200K	Very Fast	Very Rare	Very Easy

The exact cost depends on how much data you have and how much you use it. But the pattern is clear: Delta Lake gives you warehouse reliability at lake prices.

Common Questions

Q: Do I need to rewrite all my code?

A: Not really. If you use Spark SQL or Python with Spark, you mostly use the same code. The main change is using "delta" as the format instead of "parquet" or "csv."

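Q: What if my company uses other systems alongside Spark, like Flink or Kafka?

A: Delta Lake is an open table format: Parquet data files plus a transaction log. Connectors exist for engines beyond Spark, including Flink, Trino, and Presto, and data from Kafka can be streamed into a Delta table with Spark Structured Streaming. One caution: a tool that reads only the raw Parquet files ignores the transaction log and can return rows that were deleted or replaced, so use a Delta-aware reader.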

Q: Is Delta Lake production-ready?

A: Yes. Thousands of companies run it in production. It handles petabytes of data every day.

Q: How hard is the migration?

A: It depends on your setup. If you have simple CSV or Parquet files, it is easy. If you have a complex system with lots of custom code, it takes more time. Plan for weeks or months, not days.

Next Steps

Delta Lake is a great investment for any team that works with big data. It solves real problems and saves money at the same time. Start small, test it out, and see if it works for your team.

If you want to learn more about the deep technical details like transaction logs, how Delta Lake picks which files to read, and how schema evolution works, check out Delta Lake Explained on Lucent Innovation's technology blog. It goes much deeper into these topics.

The important thing is to start somewhere. Pick one small project. Try Delta Lake. See how it feels. You will probably wonder why you did not switch earlier.

How to Speed Up Your Shopify Store in 5 Easy Steps for Better Performance

Lucy — Mon, 15 Jun 2026 13:25:34 +0000

Introduction

Most online stores lose customers because of slow loading times. Did you know that if your store takes more than 3 seconds to load, people often leave? A slow Shopify store can cost you money and customers. (Check out more #ecommerce strategies on Dev.to.)

In this post, we will cover 5 simple things you can do to make your Shopify store faster. These tips work for everyone, from small businesses to large online stores.

What You Will Learn

In this article, you will find out about:

Image optimization methods
Reducing JavaScript and CSS files
Using a content delivery network (CDN)
Caching strategies
Monitoring your store speed

Let's dive in!

Step 1: Optimize Your Images

Large images slow down your store. Images should be as small as possible but still look good.

Why Images Matter:
Images make up most of a website's file size. If you have 20 large product images on one page, your store gets very slow.

How to Fix It:

Use tools like TinyPNG or ImageOptim to make images smaller. These tools remove extra data from images without making them look bad.

You can also use Shopify's built-in image compression features. Just upload your images to Shopify and let it handle the resizing.

Pro Tip: Use WebP format instead of JPG. WebP images are 25 to 35 percent smaller and look just as good.
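To get a feel for what that percentage means on a real page, here is a back-of-the-envelope calculation in Python. The 20-image page comes from the example above, while the 400 KB average image size is an assumed figure for illustration.

```python
def page_image_weight_kb(image_count, avg_image_kb, saving_percent):
    """Total image payload after shrinking each image by saving_percent."""
    return image_count * avg_image_kb * (100 - saving_percent) / 100


# Assumption: 20 product images averaging 400 KB each as JPG.
jpg_total = page_image_weight_kb(20, 400, 0)
webp_low = page_image_weight_kb(20, 400, 25)   # 25% smaller
webp_high = page_image_weight_kb(20, 400, 35)  # 35% smaller

print(jpg_total, webp_low, webp_high)
```

Under these assumptions the page drops from 8,000 KB of images to somewhere between 5,200 KB and 6,000 KB, before any resizing or lazy loading.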

Step 2: Minify CSS and JavaScript

Minification means removing extra spaces and characters from your code. This makes files smaller and faster to load.

What Gets Removed:

Extra spaces
Line breaks
Comments in the code
Unused characters

Most Shopify themes already do this automatically. But if you are building a custom theme, you should check your code.

Tools You Can Use:

CSS Minifier
JavaScript Minifier
Shopify's built-in minification

These tools take your code and make it shorter without changing what it does.
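To make the idea concrete, here is a toy CSS minifier in Python. It is a sketch for illustration only. It will break some valid stylesheets (for example, selectors that rely on a space before a colon, or strings that contain braces), so use a real minifier on a production theme.

```python
import re


def minify_css(css: str) -> str:
    """Toy minifier: strips comments and whitespace. Not safe for every stylesheet."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)  # remove comments
    css = re.sub(r"\s+", " ", css)                        # collapse whitespace runs
    css = re.sub(r"\s*([{};:,])\s*", r"\1", css)          # trim around punctuation
    css = css.replace(";}", "}")                          # drop last semicolon in a block
    return css.strip()


source = """
/* Product card */
.card {
    color: #333;
    margin: 0 auto;
}
"""

minified = minify_css(source)
print(minified)
print(len(source), "->", len(minified), "characters")
```

The rules the browser sees are the same. Only the comment, line breaks, and padding are gone.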

Step 3: Use a CDN

A CDN is a content delivery network. It stores your images and files in many locations around the world. When someone visits your store, they get files from the closest location.

How It Works:

If you have a customer in Japan and your server is in the USA, they have to download files from far away. With a CDN, a copy of your files sits in Japan too.

Shopify's CDN is run by Cloudflare, which is really good. It is included with every Shopify store automatically, so you already have this feature.

Step 4: Enable Caching

Caching means saving some information so you do not have to load it again. This makes repeat visits much faster.

Browser Caching:

Your customer's browser can save images, CSS, and JavaScript on their computer. When they visit again, these files load instantly.

Server Caching:

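Shopify is a hosted platform, so server-side caching of product information and page data is handled for you. This reduces the work the server has to do on every request.

There is no caching switch to flip in your store settings. What you control is how cache-friendly your theme and apps are, and some apps are designed to help with this.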

Popular Caching Apps:

Cache Cleaner
Bulk Operations
Speed Booster

Step 5: Test Your Speed

Use Google PageSpeed Insights or GTmetrix to test how fast your store loads. Run tests after you make changes to see what works best.

How to Test:

Go to Google PageSpeed Insights
Enter your Shopify store URL
Click "Analyze"
Read the report
Make changes
Test again

Keep testing until your score gets better. Aim for at least 75 out of 100.
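If you want to repeat the test after every change, the same check can be scripted. The sketch below assumes the public PageSpeed Insights v5 endpoint and its Lighthouse response shape, where the performance score is a number from 0 to 1. The helper names and the sample response are made up, and regular use may need an API key.

```python
import json
import urllib.parse
import urllib.request

PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"


def psi_url(store_url: str, strategy: str = "mobile") -> str:
    """Build the request URL for one page and one device strategy."""
    query = urllib.parse.urlencode({"url": store_url, "strategy": strategy})
    return f"{PSI_ENDPOINT}?{query}"


def performance_score(report: dict) -> int:
    """Convert the 0..1 Lighthouse performance score to the familiar 0..100."""
    score = report["lighthouseResult"]["categories"]["performance"]["score"]
    return round(score * 100)


def fetch_score(store_url: str) -> int:
    """Run a live test (network call) and return the 0..100 score."""
    with urllib.request.urlopen(psi_url(store_url)) as response:
        return performance_score(json.load(response))


# Offline example with a made-up response fragment.
sample = {"lighthouseResult": {"categories": {"performance": {"score": 0.82}}}}
print(performance_score(sample), performance_score(sample) >= 75)
```

Calling `fetch_score` with your store URL runs a live test. Recording the result after each change shows which change actually moved the score.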

Key Takeaways

Making your Shopify store faster does not have to be hard. Here are the main points to remember:

Optimize images with tools like TinyPNG
Minify your JavaScript and CSS code
Use a CDN like Cloudflare
Enable caching for faster repeat visits
Test your speed regularly

Each of these steps will help you keep customers happy and increase your sales.

Conclusion

A fast Shopify store means more customers and more money. These five steps will help you speed up your store today. Start with image optimization because that gives you the biggest improvement.

What speed tricks do you use on your Shopify store? Share in the comments below.

Related Resources

Google PageSpeed Insights: https://pagespeed.web.dev/
GTmetrix Speed Test: https://gtmetrix.com/
Shopify Performance Guide: https://shopify.dev/
Cloudflare CDN: https://www.cloudflare.com/

Related Dev.to Posts

If you found this helpful, check out these related tags on Dev.to:

#shopify: Learn more about Shopify development and ecommerce solutions
#performance: Explore other web performance optimization techniques
#ecommerce: Discover more ecommerce development best practices
#webdev: Stay updated with the latest web development trends

Have questions about Shopify performance? Drop them in the comments and I will help you out!

What Does an AI Consultant Actually Do? A 2026 Breakdown for Business Leaders

Lucy — Fri, 12 Jun 2026 12:59:04 +0000

Short Answer
An AI consultant helps your business figure out where AI can help, what to build, and how to make it actually work without the guesswork or wasted budget. They bridge the gap between cutting-edge technology and real business outcomes. In 2026, with the global AI consulting market valued at over $11 billion and growing at 26% annually, knowing exactly what you're paying for matters more than ever.

What Even Is an AI Consultant?

Let's be honest. "AI consultant" sounds like one of those titles that could mean anything.

It could mean someone who builds models. Or someone who just makes slide decks. Or someone who helps you figure out which tools to buy. The real answer? All three and more.

An AI consultant is a specialist who helps organizations identify AI opportunities, design solutions, and make sure those solutions actually work in the real world. They are not just coders. They are not just strategists. They sit right in the middle.

Think of them like a general contractor for a home renovation. The contractor doesn't just swing a hammer. They help you design the plan, pick the right materials, manage the work, and make sure the final result matches what you needed, not just what looked good on paper.

According to McKinsey's 2025 State of AI report, 78% of organizations now use AI in at least one business function. But only around 6% achieve significant, enterprise-wide results. That gap between "we're using AI" and "AI is actually helping our business" is exactly where AI consulting services come in.

What Does an AI Consultant Actually Do Day-to-Day?

Here is where most articles get vague. Let's break this down into real phases.

Phase 1: The AI Opportunity Audit

Before building anything, a good consultant spends time understanding your business.

They look at your current workflows, your data, your tools, and your goals. They ask: Where are the bottlenecks? Where is time being wasted? Where could automation or prediction actually add value?

This is not a one-hour meeting. It is typically a multi-week discovery process. It involves talking to department heads, reviewing data pipelines, and mapping out where AI can realistically help vs. where it would just add unnecessary complexity.

Many businesses skip this step and jump straight to building. That is one of the top reasons AI projects fail.

Phase 2: Strategy and Roadmap Building

Once they understand the business, a consultant builds a roadmap. This is a prioritized list of AI projects, ordered by value and feasibility.

Not every AI idea is a good idea. A roadmap helps you focus on what will move the needle first, instead of chasing the flashiest use case.

A solid roadmap answers these questions:

What AI projects do we tackle first?
What data do we need, and is it ready?
How long will each project take?
What does success actually look like?
How does this fit into our existing tech stack?
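One common way to turn "ordered by value and feasibility" into something concrete is a simple scoring pass. The sketch below is only one possible scheme: the 1 to 5 scales, the multiplication, and the example projects are assumptions, not a standard method.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    value: int        # expected business value, 1 (low) to 5 (high)
    feasibility: int  # data and tech readiness, 1 (hard) to 5 (easy)


def prioritize(projects):
    """Order projects by value x feasibility, highest first."""
    return sorted(projects, key=lambda p: p.value * p.feasibility, reverse=True)


# Hypothetical backlog with made-up scores.
backlog = [
    Project("Customer-facing chatbot", value=4, feasibility=2),
    Project("Invoice data extraction", value=3, feasibility=5),
    Project("Churn prediction", value=5, feasibility=4),
]

for project in prioritize(backlog):
    print(project.value * project.feasibility, project.name)
```

The flashy chatbot lands last here because its feasibility score is low, which is the kind of trade-off a roadmap is meant to surface.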

Phase 3: Picking the Right Tools and Stack

There are thousands of AI tools available in 2026: large language models, vector databases, MLOps platforms, AutoML tools, fine-tuning services, and the list keeps growing.

A good consultant knows which ones are right for your specific problem, not which ones are trending this month.

They look at your cloud provider (AWS, Azure, GCP), your data infrastructure, your team's existing skills, and your budget. Then they recommend a stack that actually fits — not the most expensive or the most popular.

This step alone can save companies from expensive vendor lock-in or over-engineered solutions that nobody ends up using.

Phase 4: Managing the Build and Deployment

This is where the technical work happens. Depending on the team, a consultant might:

Lead a team of data engineers and ML engineers
Review and approve model designs
Oversee integration with production systems
Set up monitoring and alerting for deployed models

They are not just advising from the sidelines. They are in the work — reviewing code, unblocking issues, and making sure the solution is being built correctly.

According to IDC, AI consulting demand grew 40% between 2024 and 2025 — largely because companies realized they needed someone to manage this build process, not just hand over a strategy document and walk away.

Phase 5: Governance, Ethics, and Ongoing Optimization

After a model goes live, the work is not done.

AI systems can drift over time. Their predictions get less accurate as the world changes. They can also produce biased or harmful outputs if they are not monitored carefully.

A consultant puts governance frameworks in place: model monitoring, bias detection, data refresh schedules, and escalation paths for when something goes wrong. In regulated industries like finance, healthcare, or insurance, this step is not optional. It is legally required.
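Model monitoring often starts with a drift check on the input data. One widely used measure is the Population Stability Index (PSI), sketched below in plain Python. The binning scheme, the small floor for empty bins, and the sample data are choices made for this illustration. The thresholds of roughly 0.1 and 0.25 often quoted for PSI are rules of thumb, not guarantees.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a recent sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            index = min(max(int((v - lo) / width), 0), bins - 1)  # clamp outliers
            counts[index] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = list(range(100))           # feature values at training time
recent = [v + 50 for v in baseline]   # same feature, shifted upward

print(population_stability_index(baseline, baseline))
print(population_stability_index(baseline, recent) > 0.25)
```

A PSI near zero means the recent data looks like the training data. A large value is a signal to investigate and possibly retrain, which is where the escalation paths mentioned above come in.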

Only 23% of IT leaders are confident their organizations can manage AI governance when rolling out generative AI tools, per a 2025 Gartner survey. This is a massive gap and it is one of the fastest-growing areas of demand in AI consulting right now.

How Is an AI Consultant Different From a Data Scientist or Software Engineer?

This question comes up a lot. Here is the simplest way to think about it:

Data Scientist: Focused on building models. Highly technical. Not always thinking about business outcomes or how the model gets used in practice.
ML/Software Engineer: Builds and deploys systems. Focused on code and infrastructure. Not always involved in strategy or stakeholder communication.
AI Consultant: Connects business goals to technical solutions. Manages the full lifecycle. Communicates across teams from the CEO to the dev team.

A data scientist might build you a great churn prediction model. But a consultant makes sure it connects to your CRM, that the sales team knows how to use it, that it gets updated every quarter, and that someone is watching for problems when it drifts.

Both roles are valuable. But they do very different jobs.

When Does a Business Actually Need AI Consulting Services?

You probably need a consultant if any of these sound familiar:

1. "We've been talking about AI for 18 months but haven't shipped anything."
You need someone to cut through the noise and create a real plan with clear milestones.

2. "We built something, but it's sitting unused."
You need help with change management, integration, and adoption, not just the model itself.

3. "We don't know if our data is ready for AI."
A consultant will run a data readiness audit and tell you honestly what you have to work with.

4. "We're worried about AI doing something harmful or non-compliant."
You need governance expertise before you go live, not after.

5. "Our team knows how to code, but doesn't know where to start."
Strategy always comes first. A clear roadmap from an experienced team can save months of wasted effort.

For companies in retail, finance, healthcare, or any data-heavy industry, the gap between "wanting AI" and "using AI effectively" is often just a lack of structured guidance. Lucent Innovation's AI consulting experts work through exactly this process, from initial opportunity mapping all the way through to production deployment and ongoing governance.

What Tools Do AI Consultants Actually Use in 2026?

AI consultants are not just using ChatGPT. Here is a practical look at a real toolkit:

Strategy & Discovery

Miro or FigJam for workflow mapping
Notion or Confluence for roadmap documentation
Custom interview frameworks for stakeholder discovery

Data Readiness

Python (pandas, great_expectations) for data audits
dbt for data transformation pipelines
Databricks for large-scale data processing

Model Development

PyTorch or TensorFlow for custom model work
Hugging Face for open-source LLM access
OpenAI or Anthropic APIs for enterprise generative AI

Deployment & Monitoring

MLflow or Weights & Biases for experiment tracking
Kubeflow or AWS SageMaker for production deployment
Grafana or Datadog for monitoring and alerting

Governance

Fairlearn or IBM AI Fairness 360 for bias detection
AWS Macie or Microsoft Purview for data compliance

Here is an example of the kind of simple data readiness check a consultant might run before recommending any ML solution to a client. This is often the very first technical step:

# Basic AI Readiness Check — Data Quality Audit
# Run this before recommending any ML model to a client
# Gives a fast signal on whether the dataset is usable

import pandas as pd

def ai_readiness_check(df: pd.DataFrame) -> dict:
    """
    Quick data quality check before starting an AI project.
    Returns a readiness score and a list of key issues to fix.
    """
    issues = []
    score = 100

    # Check 1: Columns with >20% missing values
    missing_pct = df.isnull().mean() * 100
    high_missing = missing_pct[missing_pct > 20]
    if not high_missing.empty:
        issues.append(f"High missing data in: {list(high_missing.index)}")
        score -= 25

    # Check 2: Minimum row count for a usable ML dataset
    if len(df) < 1000:
        issues.append(
            f"Low row count ({len(df)} rows). "
            "Most ML models need at least 1,000 rows to train reliably."
        )
        score -= 20

    # Check 3: Too many duplicate rows
    dup_pct = df.duplicated().mean() * 100
    if dup_pct > 5:
        issues.append(f"{dup_pct:.1f}% duplicate rows found — clean before training.")
        score -= 15

    # Check 4: No numeric columns (most models need at least some)
    numeric_cols = df.select_dtypes(include="number").columns
    if len(numeric_cols) == 0:
        issues.append("No numeric columns found. Data may need encoding first.")
        score -= 20

    return {
        "readiness_score": max(score, 0),
        "issues_found": issues,
        "rows": len(df),
        "columns": len(df.columns),
        "recommendation": (
            "Good to proceed with model development"
            if score >= 70
            else "Fix data quality issues before building any model"
        ),
    }

# Example usage:
# df = pd.read_csv("your_business_data.csv")
# result = ai_readiness_check(df)
# print(result)

A lot of AI consulting begins exactly here: with the data, not the model. Many companies believe they are "AI-ready" when their data tells a very different story. Running something like this before scoping a project saves weeks of rework.

For businesses that need this kind of structured, end-to-end support, from data readiness assessments all the way through model governance, Lucent Innovation's AI strategy and implementation consulting covers the full lifecycle across industries including retail, healthcare, and financial services.

How to Pick an AI Consulting Partner in 2026 Without Regret

Lucy — Fri, 05 Jun 2026 06:44:07 +0000

Short answer: Hire an AI consulting partner the way you'd hire a senior engineer. Judge them on what they've shipped, not on the buzzwords in their deck. The good ones say no to bad-fit projects, put working code in front of you early, and tell you straight where AI won't help. The rest is theater.

Fair warning before we get going: I run an AI and data shop, so I've got skin in this. I'll flag it where it matters. This isn't a pitch, though; it's the checklist I wish more founders used. Bad engagements are exactly what make this whole field smell like snake oil, and I'm tired of cleaning up after them.

Why does picking wrong hurt so much?

It's not the invoice. It's the quarter you burn, the engineers who quietly stop believing "AI" means anything real, and the brittle demo that folds the second production data touches it.

And the opportunity cost is brutal. While you were untangling someone's over-engineered RAG pipeline, a competitor shipped something boring that just worked. Speed-to-learning beats sophistication here almost every time, and the wrong partner optimizes for the wrong one.

What should I actually look for?

Skip the logo wall. When you're weighing AI consulting services, here's what actually tells you something:

Shipped systems, not slides. Ask to see something running. Real work leaves a trail — repos, dashboards, eval numbers.
Opinionated scoping. A good partner tells you which 80% of your idea to cut for v1. Say-yes-to-everything means they're selling hours, not outcomes.
Data honesty. The first hard question should be about your data: where it lives, how messy it is, who owns it. Nobody asks? Walk.
An exit ramp. You want to own the code, the model choices, the docs. Anyone building you a black box only they can maintain is building themselves a job, not solving your problem.

Here's the thing about good AI strategy consulting: it starts from your business constraint, not from a model. If the opening call is about which LLM to pick rather than which decision you're trying to improve, that's a yellow flag.

How do I avoid getting burned?

Three moves that have saved me and people I trust.

Run a paid pilot first. Two to four weeks, tight scope, one real deliverable. And pay for it: free pilots pull the wrong incentives on both sides. You'll learn more in one honest sprint than in five sales calls.

Then ask for a reference who had a project go sideways. Anyone can hand you a happy logo. The question that actually works is, "Tell me about an engagement that didn't go to plan." How they answer tells you how they'll treat you when something breaks. Because something will.

And make them explain their evals. If they can't tell you how they'll measure whether the thing works (accuracy, latency, cost per call, hallucination rate), they're guessing. Guessing is fine at a hackathon. Not on your budget.
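To make that question concrete, here is a minimal sketch of the kind of scorecard a partner should be able to describe. The `EvalResult` record and its field names are hypothetical, and how "correct" and "hallucinated" get labeled depends entirely on the task.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class EvalResult:
    """One evaluated request (hypothetical record shape)."""
    correct: bool        # did the output match the expected answer?
    hallucinated: bool   # did a reviewer flag unsupported content?
    latency_ms: float    # end-to-end response time
    cost_usd: float      # what this single call cost


def summarize(results: list[EvalResult]) -> dict:
    """Aggregate per-request results into four headline metrics."""
    if len(results) < 2:
        raise ValueError("need at least two results to compute a percentile")
    n = len(results)
    latencies = [r.latency_ms for r in results]
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    # method="inclusive" keeps the estimate inside the observed range.
    p95 = quantiles(latencies, n=20, method="inclusive")[-1]
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "p95_latency_ms": p95,
        "cost_per_call_usd": sum(r.cost_usd for r in results) / n,
    }
```

The useful part is less the arithmetic than the agreement it forces: both sides have to say up front which numbers count as success.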

Teams that work this way are happy to scope a small, honest pilot before asking for the big commitment. For transparency, that's roughly how our own AI consulting practice runs, but honestly, the principle matters more than the vendor. Hold whoever you're evaluating to it.

Build, buy, or partner at all?

Not every problem needs a consultant. Strong ML engineers and a clear use case? Build it. A SaaS tool already covers 90%? Buy that. Partnering earns its keep when the problem is real, the stakes are high, and you need to move faster than hiring allows, or when you want your own engineers learning next to people who've done it before.

If you do bring someone in, treat them like a teammate with an expiry date. Your team should be sharper when they leave, not more dependent. The right AI strategy consulting engagement hands over knowledge on the way out. That's the whole difference between a partner and a crutch.

Takeaway: Judge partners on shipped work, sharp scoping, and data honesty. De-risk with a short paid pilot and a brutal reference check. Insist on owning what gets built. The best one leaves your team stronger — and then leaves.

Batch vs Streaming Pipelines: How I Actually Choose Between Them

Lucy — Fri, 05 Jun 2026 06:43:12 +0000

Every data pipeline starts with one big question before a single line of code gets written.

Should I process data in scheduled chunks? Or should I process it the moment events arrive?

That is the batch vs streaming decision. It sounds simple. But in real projects, it shapes everything: which tools you pick, how much you spend each month, what guarantees you can make about fresh data, and how many nights you spend fixing production incidents.

I have seen teams pick streaming when batch would have worked just fine. I have also seen the opposite. Both mistakes are expensive. This post walks through how I think about it.

What Batch Processing Actually Means

Batch processing collects data over a time window and then processes it all at once when a scheduled trigger fires.

Think about doing laundry. You do not wash one shirt the moment it gets dirty. You wait until you have a full load, then run the machine. The shirts pile up during the week. On Sunday, the machine runs.

Data batch pipelines work the same way. Source data builds up in a staging area. At a set time, usually overnight or hourly, a job picks up everything that arrived, runs the transformations, and loads the results into the destination.

The batch job has a clear start. It has a clear end. When it finishes, the destination has a snapshot of data as of the run time. Between runs, nothing changes.

What batch is great at:

Batch handles complex transformations well because there is zero time pressure per record. A batch job can join across tables with hundreds of millions of rows. It can run expensive multi-level calculations. It can apply feature engineering for machine learning without worrying about processing each event in milliseconds.

Batch pipelines are also much easier to test, debug, and rerun. When a transformation gives wrong results, you fix the logic and reprocess the affected time window. The worst thing that happens is a delayed job, not a production fire.
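The "fix the logic and reprocess the affected window" workflow is safe only when the job is idempotent. Here is a minimal sketch of that idea, using plain Python structures as hypothetical stand-ins for a staging area and a warehouse table.

```python
from collections import defaultdict
from datetime import date

# Hypothetical in-memory stand-ins for real storage.
Staging = list[dict]             # raw rows: {"day": date, "amount": float}
Warehouse = dict[date, float]    # one aggregated value per day partition


def run_batch(staging: Staging, warehouse: Warehouse, start: date, end: date) -> None:
    """Recompute daily totals for [start, end] and overwrite those partitions.

    Overwriting instead of appending makes the job idempotent, so a rerun
    over the same window after a logic fix cannot double-count.
    """
    totals: dict[date, float] = defaultdict(float)
    for row in staging:
        if start <= row["day"] <= end:
            totals[row["day"]] += row["amount"]

    # Drop every existing partition in the window, then write the fresh ones.
    for day in [d for d in warehouse if start <= d <= end]:
        del warehouse[day]
    for day, total in totals.items():
        warehouse[day] = total
```

Running it twice over the same window leaves the warehouse in the same state as running it once, which is what makes reruns boring.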

Where batch falls short:

Batch produces stale data. How stale depends on the schedule. Nightly jobs produce data up to 24 hours old. Hourly jobs produce data up to 60 minutes old.

For use cases where decisions depend on what is happening right now, that staleness is a real problem.

A fraud detection system that runs on a nightly batch schedule is not a fraud detection system. It is a fraud reporting system. The fraud already happened hours ago.

What Streaming Processing Actually Means

Streaming treats data as a continuous flow of individual events. Each event gets processed the moment it arrives, without waiting for others to pile up first.

Think about a moving walkway at an airport. People step onto the walkway as they arrive. Each person moves forward right away. Nobody waits for 500 people to gather before the walkway starts moving. The walkway runs all day whether one person is on it or ten thousand.

A streaming pipeline works the same way. An event source like Apache Kafka, Amazon Kinesis, or Google Pub/Sub delivers events in real time. The stream processor picks up each event, applies the transformation logic, and writes the result downstream within milliseconds to seconds. The pipeline runs 24 hours a day, seven days a week.
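The one-event-at-a-time shape can be sketched without any broker at all. In this illustration a plain iterable stands in for the event source, and `flag_large` with its threshold of 1000 is a made-up transform, not a real fraud rule.

```python
from typing import Callable, Iterable, Iterator


def stream_process(
    events: Iterable[dict],
    transform: Callable[[dict], dict],
) -> Iterator[dict]:
    """Apply the transform to each event as soon as it is pulled from the source.

    `events` stands in for a consumer on a broker; a generator keeps the
    one-at-a-time behaviour without any real infrastructure.
    """
    for event in events:          # waits here until the next event is available
        yield transform(event)    # result is emitted before the next event is read


def flag_large(event: dict) -> dict:
    """Hypothetical transform: mark transactions above an arbitrary threshold."""
    return {**event, "suspicious": event["amount"] > 1000}
```

Contrast this with a batch job: there is no "end of input" to wait for, so nothing in the loop can depend on having seen every event.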

What streaming is great at:

Streaming is right when the output of the pipeline needs to trigger an action or update a system in real time.

Fraud detection needs to check whether a transaction looks suspicious before approving it. That decision cannot wait 60 minutes for the next batch run.

An e-commerce recommendation engine that adapts to clicks, cart additions, and browsing behavior as they happen gives a fundamentally different experience than one running on overnight batch data.

Infrastructure health dashboards that catch CPU spikes, error rate increases, or latency anomalies need second-level data, not hourly summaries.

Where streaming falls short:

Streaming infrastructure is a lot more complex to run than batch.

Stream processing introduces distributed processing requirements, state management, and fault tolerance mechanisms that batch engineers rarely deal with. Systems consume compute resources at all times rather than only during defined job windows.
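State management is the least intuitive of those requirements if you come from batch, so here is a rough sketch of what it looks like. `TumblingCounter` is a hypothetical class written for this post, not an API from any stream processor; it counts events per key in fixed windows and evicts a window once the highest event time seen has moved past it.

```python
from collections import defaultdict


class TumblingCounter:
    """Count events per key in fixed windows, evicting windows once they close."""

    def __init__(self, window_s: int, allowed_lateness_s: int = 0):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        # window start -> key -> count
        self.state: dict[int, dict[str, int]] = defaultdict(lambda: defaultdict(int))
        self.watermark = 0  # highest event time seen so far

    def add(self, key: str, ts: int) -> list[tuple[int, dict[str, int]]]:
        """Record one event and return any windows that are now final."""
        window_start = ts - ts % self.window_s
        self.state[window_start][key] += 1
        self.watermark = max(self.watermark, ts)

        cutoff = self.watermark - self.allowed_lateness_s
        closed = [w for w in self.state if w + self.window_s <= cutoff]
        # Emitting and deleting closed windows is what keeps state bounded.
        return [(w, dict(self.state.pop(w))) for w in sorted(closed)]
```

Even this toy has a gap: an event that arrives after its window was evicted gets emitted as a separate partial result. Deciding what to do with late data is exactly the kind of question batch never asks.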

Two failure modes in streaming catch teams off guard. The first is backpressure: incoming events exceed processing capacity, lag builds up, and outputs start describing events from minutes ago instead of seconds ago.

The second is silent correctness drift. Streaming systems often keep running even when data quality issues occur. Duplicate events, missing events, or schema changes can slowly corrupt outputs while dashboards still show active data.
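Backpressure, the first of those two failure modes, is easy to see in a toy simulation. The sketch below assumes constant arrival and processing rates, which real traffic never has, so treat it as an illustration of the mechanism rather than a capacity model.

```python
def simulate_lag(arrivals_per_sec: int, capacity_per_sec: int, seconds: int) -> list[int]:
    """Return the backlog size at the end of each second.

    Each second, `arrivals_per_sec` events join the queue and at most
    `capacity_per_sec` of them are processed.
    """
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrivals_per_sec
        backlog -= min(backlog, capacity_per_sec)
        history.append(backlog)
    return history


def delay_seconds(backlog: int, capacity_per_sec: int) -> float:
    """Approximate wait for a newly arrived event before it gets processed."""
    return backlog / capacity_per_sec
```

With arrivals only 20% above capacity, the backlog still grows without bound: the lag never recovers on its own until traffic drops or capacity is added.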

The Comparison at a Glance

Dimension	Batch	Streaming
How data moves	Collects over time, processes in one run	Each event processed the moment it arrives
Latency	Minutes to hours	Milliseconds to seconds
Infrastructure	Compute spins up for the job, shuts down after	Always on, always running
Cost	Lower baseline, pay only when jobs run	Higher baseline, persistent infrastructure
Complexity	Lower, simpler error handling	Higher, state management and fault tolerance required
Failure mode	Delayed job, rerun and recover	Production incident, live intervention needed
Debugging	Rerun the job on the failed time window	Replay events from the message queue checkpoint
Schema change	Pipeline breaks loudly on next run	Can cause silent issues if not monitored

The One Question That Decides It

One question cuts through most of the debate: what happens if the data is one hour old?

If the answer is nothing meaningful, batch is probably the right choice.

If the answer is a real business loss, streaming earns its complexity.

Streaming is justified when the output triggers action. If the output only feeds retrospective analysis, batch is usually sufficient.

Four Questions to Ask Before Picking

1. How fresh does the data need to be to be useful?

Most analytics use cases tolerate data that is a few hours old. A weekly revenue report does not need second-level freshness. A fraud detection engine does. Know the actual freshness requirement before assuming you need streaming.

2. Does stale data cause a real business loss?

If a customer gets a product recommendation based on yesterday's browsing instead of what they clicked five minutes ago, does that cost the business money? If yes, streaming may be worth it. If it is a marginal difference, batch is almost certainly the right choice.

3. What is the operational capacity of your team?

Streaming infrastructure needs engineers who understand state management, checkpointing, exactly-once delivery semantics, and how to respond to backpressure incidents at midnight. If your team is small or your use case does not demand real-time results, that complexity is cost without benefit.

4. Is real-time the actual requirement, or is faster batch enough?

Stakeholders often say they want real-time when what they mean is they want data more current than nightly. A pipeline that runs every 15 minutes often satisfies that requirement at a fraction of the cost and complexity of a true streaming system.

When stakeholders say "real-time" but would accept hourly updates without meaningful business impact, they want faster batch, not streaming.

Real Use Cases: When Each Pattern Wins

When Batch Is the Right Answer

Nightly financial reporting. A bank's end-of-day ledger reconciliation processes every transaction from the day against regulatory limits and account balances. The job needs to run across the full day's dataset, apply complex multi-table joins, and produce a validated snapshot. Batch runs at end of day. Streaming adds nothing here.

ML model training. Training a machine learning model requires a large, static dataset processed multiple times. Streaming the training data adds enormous complexity without improving model quality.

Large-scale historical ETL. Migrating three years of transactional data into a new warehouse schema is a batch workload. The data already exists. There is no real-time requirement. Batch processes it once and moves on.

Compliance reporting. Monthly, quarterly, or annual regulatory reports that pull and aggregate data across long time windows are batch workloads. The business cost of a slightly delayed report is low. The complexity of a streaming system is not justified.

When Streaming Is the Right Answer

Fraud detection. Payment authorization systems need to evaluate whether a transaction is fraudulent before it clears, typically in under 500 milliseconds. A batch pipeline running every 30 minutes would approve or deny transactions without the context of what happened in the last 30 minutes.

Real-time feature serving for ML inference. When a deployed ML model needs features computed from recent user behavior to make a prediction, streaming pipelines update the feature store in real time. A recommendation model running on features from last night's batch is operating blind to today's context.

Live operational dashboards. A supply chain control tower showing current inventory levels, in-transit shipments, and order status across hundreds of warehouses needs second-level freshness. An overnight batch job cannot surface a stockout until the next morning.

IoT and sensor telemetry. In manufacturing, logistics, and energy, IoT devices generate continuous streams of sensor data that batch pipelines were not built to ingest or process. Predictive maintenance models that detect equipment issues before failure require streaming ingestion of live sensor data.

The Middle Ground Teams Often Miss: Micro-Batch

Between batch and streaming sits micro-batch processing. It is the pattern that Apache Spark Structured Streaming uses by default, and it solves most "near real-time" requirements without the full complexity of continuous streaming.

Micro-batch runs the same pipeline logic as streaming but on a very short fixed interval: every 30 seconds, every minute, every 5 minutes. Data builds up for the interval, then the batch processes it. Latency is measured in seconds to low minutes rather than hours.

Most use cases that stakeholders describe as "real-time" actually tolerate micro-batch latency. A dashboard that refreshes every minute looks real-time to every user. A data freshness requirement of "under 5 minutes" is achievable with micro-batch at a fraction of the streaming infrastructure cost.
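The grouping step behind micro-batch is small enough to sketch directly. This is only an illustration of the idea, not how Spark implements its triggers; the `ts` field (event time in epoch seconds) is an assumed event shape.

```python
from collections import defaultdict


def micro_batches(events: list[dict], interval_s: int) -> list[list[dict]]:
    """Group events into fixed intervals by their `ts` field (epoch seconds).

    Returns batches in time order; intervals with no events are skipped.
    """
    buckets: dict[int, list[dict]] = defaultdict(list)
    for event in events:
        buckets[event["ts"] // interval_s].append(event)
    return [buckets[key] for key in sorted(buckets)]
```

Each returned batch can then go through ordinary batch logic, which is why micro-batch keeps most of batch's testability.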

Here is how the decision tree actually looks in practice:

Hours of latency are fine: standard batch on a schedule
Minutes of latency are fine: micro-batch with short trigger intervals
Sub-minute latency is required and the output triggers action: true streaming with Spark Structured Streaming
Sub-second latency is required: Real-Time Mode on Databricks Spark Structured Streaming
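The same decision tree can be written as a small function, which is handy when you want the reasoning recorded next to a pipeline definition. The thresholds (one hour, one minute, one second) are illustrative cut-offs I picked for this sketch, and sending sub-minute requirements that do not trigger an action to micro-batch is my own assumption.

```python
def choose_pattern(max_staleness_s: float, triggers_action: bool) -> str:
    """Map a freshness requirement to a processing pattern.

    The thresholds are illustrative cut-offs for this sketch, not fixed rules.
    """
    if max_staleness_s >= 3600:
        return "batch"
    if max_staleness_s >= 60:
        return "micro-batch"
    if max_staleness_s >= 1:
        # Streaming only earns its complexity when the output drives an action.
        return "streaming" if triggers_action else "micro-batch"
    return "real-time streaming"
```

For example, `choose_pattern(300, triggers_action=False)` lands on micro-batch, which matches the "minutes of latency are fine" branch.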

The Real Cost of Streaming: What Teams Underestimate

A simple batch ETL pipeline costs between $15,000 and $50,000 to build. A production streaming pipeline with proper monitoring costs between $50,000 and $200,000 or more. Comparing the low ends and the high ends of those ranges, that is roughly a 3x to 4x difference at the build stage alone.

Operational cost compounds on top of that. Streaming systems need always-on compute, persistent state storage, continuous monitoring for lag and backpressure, and engineers who can respond to incidents at any hour.

Three costs teams consistently underestimate:

State management. Streaming pipelines that compute windowed aggregations, sessionization, or joins across event streams must maintain state across every event. State grows with data volume. Managing state storage, checkpointing, and cleanup is a continuous engineering concern with no equivalent in batch.

Exactly-once delivery. Guaranteeing that each event is processed exactly once, not duplicated or dropped, requires careful coordination between the message queue, the stream processor, and the output destination. Getting this wrong means silent duplicate records or missing events in production.

Schema evolution. When a source system changes its event schema, a batch pipeline fails loudly on the next scheduled run. A streaming pipeline may silently accept the new schema, produce corrupt output, and keep running for days before anyone notices.
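For the schema evolution cost, the cheapest defence is to make the stream fail loudly the way batch does. Here is a minimal sketch of a per-event check; `EXPECTED_SCHEMA` and its fields are hypothetical, and a real pipeline would more likely lean on a schema registry or a validation library.

```python
# Hypothetical contract for one event type.
EXPECTED_SCHEMA = {"event_id": str, "amount": float, "currency": str}


def schema_violations(event: dict, schema: dict[str, type] = EXPECTED_SCHEMA) -> list[str]:
    """Return a list of human-readable problems; an empty list means the event conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(event[field]).__name__}"
            )
    # Fields the contract does not know about are often the first sign of a change.
    for field in sorted(event.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems
```

Routing any event with a non-empty result to a dead-letter queue and alerting on its rate turns a silent failure into a visible one.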

None of this means streaming is wrong. It means streaming should be chosen when the use case justifies the cost, not because it sounds more modern than batch.
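On the exactly-once cost specifically, one common way to keep it manageable is to accept at-least-once delivery and make the sink idempotent. The sketch below is an assumption-laden illustration: `IdempotentSink` is a hypothetical class, and it relies on every event carrying a unique `event_id`. In a real system the set of seen IDs and the write itself would need to be committed atomically.

```python
class IdempotentSink:
    """Apply each event at most once, keyed by a unique event ID."""

    def __init__(self) -> None:
        self.seen_ids: set[str] = set()
        self.balance_by_account: dict[str, float] = {}

    def write(self, event: dict) -> bool:
        """Apply the event unless its ID was already processed.

        Returns True if the event was applied, False if it was a duplicate.
        """
        if event["event_id"] in self.seen_ids:
            return False  # redelivery after a retry or consumer restart: skip
        account = event["account"]
        self.balance_by_account[account] = (
            self.balance_by_account.get(account, 0.0) + event["amount"]
        )
        self.seen_ids.add(event["event_id"])
        return True
```

The broker can then redeliver freely after a crash, and the output still looks as if each event was processed once.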

Lambda vs Kappa: Two Ways to Run Both at Once

Many production systems need both patterns. Two architectural approaches define how teams organize that combination.

Lambda Architecture

Lambda runs two parallel pipelines. A batch layer reprocesses the full historical dataset on a schedule and produces accurate, complete results. A speed layer processes real-time events and produces approximate but current results. A serving layer merges outputs from both and delivers whichever is more current and accurate.

The batch layer produces trusted, complete data. The speed layer fills in the gap between now and the last batch run. When the batch layer catches up, it overrides the speed layer's approximate output.

Lambda works well when accuracy matters for historical data but approximate freshness is acceptable for recent data. The real cost is operational: two separate pipelines to build, test, and maintain.
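The serving layer's merge is the part people find hardest to picture, so here is a rough sketch. It assumes the batch view holds event counts up to a known cutoff timestamp and the speed layer exposes raw recent events; the names `batch_view`, `speed_events`, and `batch_cutoff_ts` are all made up for this example.

```python
def merged_view(
    batch_view: dict[str, int],
    speed_events: list[dict],
    batch_cutoff_ts: int,
) -> dict[str, int]:
    """Combine trusted batch totals with counts from events newer than the cutoff."""
    result = dict(batch_view)
    for event in speed_events:
        # Events at or before the cutoff are already included in the batch view.
        if event["ts"] > batch_cutoff_ts:
            result[event["key"]] = result.get(event["key"], 0) + 1
    return result
```

When the next batch run finishes, the cutoff moves forward and the speed layer's contribution for that period simply drops out.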

Kappa Architecture

Kappa replaces the dual-pipeline design with a single streaming pipeline that handles everything. All data, historical and real-time, flows through the same stream processor.

Historical reprocessing works by replaying events from a durable message queue like Apache Kafka, which retains events for a configurable window. To reprocess, you replay from the beginning of the queue through the same pipeline code. No separate batch layer needed.

Kappa is simpler to maintain but requires your message queue to retain data long enough to support replays. It also requires that your transformation logic works correctly as a streaming pipeline, which rules out certain types of complex, multi-pass batch transformations.
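The core Kappa idea, one code path for live traffic and for reprocessing, fits in a few lines. In this sketch a plain list stands in for a retained Kafka topic, and `process` is a hypothetical per-event function that keeps a running total per user.

```python
def process(state: dict[str, float], event: dict) -> None:
    """The single pipeline function used for both live traffic and replays."""
    state[event["user"]] = state.get(event["user"], 0.0) + event["amount"]


def replay(log: list[dict], from_offset: int = 0) -> dict[str, float]:
    """Rebuild state by pushing retained events through the same `process` code."""
    state: dict[str, float] = {}
    for event in log[from_offset:]:
        process(state, event)
    return state
```

If the live state and `replay(log)` ever disagree, either the log lost events to retention or the logic is not deterministic, and both are worth knowing about.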

Quick Reference: Which Pattern for Which Use Case

Use Case	Pattern	Why
Nightly revenue reporting	Batch	Data freshness within hours is fine
ML model training	Batch	Requires full static dataset
Historical data migration	Batch	Data already exists, no real-time constraint
Fraud detection	Streaming	Decision must happen before transaction clears
Real-time ML feature serving	Streaming	Model inference needs current behavioral context
IoT anomaly detection	Streaming	Equipment failure cannot wait for next batch
Live inventory dashboards	Streaming	Stockout response needs current state
Monthly compliance reports	Batch	Fixed window, no freshness urgency

My Rule of Thumb

Before you write a line of code, ask: does the output of this pipeline trigger an action, or does it inform analysis?

If it triggers an action and that action loses value after a few minutes, build streaming.

If it informs analysis and the insights hold up for a few hours, build batch.

And if your stakeholders say "real-time" but can actually accept updates every few minutes, build micro-batch. It gives you most of the freshness at a fraction of the cost.

The goal is not to use the most impressive technology. The goal is to ship the simplest system that meets the actual latency requirement and does not wake anyone up at 3 AM.

This post is part of a series on modern data engineering. For more on how these patterns connect to ETL vs ELT design choices, how Databricks handles both batch and streaming in one platform, and how to design for schema evolution at scale, check out the Modern Data Engineering Guide.

How to Set Up Local Data Engineering Environments with Docker Compose

Lucy — Thu, 28 May 2026 09:17:33 +0000

TL;DR: Docker Compose lets you spin up a full local data stack — Airflow, PostgreSQL, Spark, Redis — with a single YAML file and one command. This guide walks you through the exact setup, real compose configs, and the mistakes most engineers make along the way.

Why Your Local Data Environment Is Probably a Mess

Here's the thing: most data engineers I know have a local setup that technically works — but only on their machine, on a good day, when the stars align.

You install PostgreSQL manually. Pin a Python version. Struggle to get Airflow running without breaking something else. And then a new teammate joins and spends three days just trying to reproduce your environment.

That's not an engineering problem. It's a tooling problem. And Docker Compose solves it.

Docker Compose lets you describe your entire local data stack as code — services, networks, volumes, environment variables — and spin it up or tear it down with one command. No more "works on my machine." No more three-day onboarding nightmares.

This guide covers the full picture: what Docker Compose actually is (and what it's not), the building blocks you need to understand, and a production-quality example stack with Airflow, PostgreSQL, Redis, and Spark.

What Is Docker Compose (and What It's Not)

Docker Compose is an orchestration tool for defining and running multi-container Docker applications on a single machine. You write a compose.yaml file that describes each service — what image it uses, what ports it exposes, how it connects to other services, and where it stores data.

A quick note before we go further: Docker Compose v1 reached end-of-life in July 2023. The old docker-compose binary (with the hyphen) is gone. You should be using Docker Compose v2, which ships as a built-in CLI plugin. If you see docker-compose anywhere in your scripts or tutorials, replace it with docker compose (space, no hyphen).

Also worth knowing: the version: field at the top of your compose file is now officially deprecated. You don't need it. Drop it entirely from any new file you write.

Docker Compose is NOT:

A replacement for Kubernetes in production
A tool for managing distributed multi-machine deployments
A substitute for proper secrets management in prod

But for local development, CI pipelines, and single-machine staging? It's hard to beat.

Sources: Docker official documentation — docs.docker.com/compose; freeCodeCamp Docker Compose v2 guide (2026); Docker Compose specification at compose-spec.io

Prerequisites

Before we write a single line of YAML, make sure you have:

Docker Desktop (v4.0+) or Docker Engine + docker-compose-plugin on Linux
At least 8GB RAM available (data stacks eat memory)
4 CPU cores minimum — Spark in particular needs headroom
Basic familiarity with the command line

Run this to confirm your setup is current:

docker compose version
# Should show v2.24 or later in 2026

If that command fails or shows v1.x, update Docker before continuing.

Understanding the Core Building Blocks

Before jumping into the full stack, you need a mental model of the four things Docker Compose actually manages.

Services

A service is a running container. Each entry under services: in your compose file becomes one or more containers. For a data engineering stack, your services are things like your database, your orchestrator, your message broker, your transformation tool.

Networks

By default, every service in a compose file can talk to every other service using the service name as the hostname. No IP addresses. No manual DNS. This is one of the most underrated features — your Airflow scheduler connects to Postgres by literally using postgres as the hostname.
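From inside any container on the Compose network, that looks something like the sketch below. The `warehouse_url` helper is hypothetical, the environment variable names match the `.env` file used later in this guide, and the fallback defaults are placeholders.

```python
import os
from urllib.parse import quote_plus


def warehouse_url(host: str = "postgres", port: int = 5432) -> str:
    """Build a SQLAlchemy-style URL using the Compose service name as the host."""
    user = os.environ.get("POSTGRES_USER", "dataeng")
    # Percent-encode the password so characters like "@" do not break the URL.
    password = quote_plus(os.environ.get("POSTGRES_PASSWORD", ""))
    database = os.environ.get("POSTGRES_DB", "warehouse")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"
```

The service name only resolves inside the Compose network. From your host machine, pass `host="localhost"` and use the published port instead.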

Volumes

Volumes are how your data survives container restarts. There are two flavors: named volumes (managed by Docker, recommended for databases) and bind mounts (a folder on your host machine mounted into the container, useful for DAGs, scripts, and code you're actively editing).

Environment Variables

Never hardcode credentials in your compose file. Always use a .env file and reference variables with ${VARIABLE_NAME} syntax. Your .env file stays out of version control. Your compose.yaml doesn't.
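A variable referenced in the compose file but missing from `.env` is one of the easiest mistakes to make, so a small pre-flight check can help. This is a rough sketch with hypothetical helper names; it handles plain `KEY=value` lines and `${NAME}` or `${NAME:-default}` references, not every corner of the `.env` syntax.

```python
import re

# Matches the variable name at the start of ${NAME} or ${NAME:-default}.
VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)")


def referenced_vars(compose_text: str) -> set[str]:
    """Collect every variable name interpolated in the compose file."""
    return set(VAR_PATTERN.findall(compose_text))


def defined_vars(env_text: str) -> set[str]:
    """Collect variable names from KEY=value lines, skipping comments and blanks."""
    names = set()
    for line in env_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            names.add(line.split("=", 1)[0].strip())
    return names


def missing_vars(compose_text: str, env_text: str) -> set[str]:
    """Variables the compose file uses that the .env file never defines."""
    return referenced_vars(compose_text) - defined_vars(env_text)
```

In practice you would call it with `Path("compose.yaml").read_text()` and `Path(".env").read_text()` and fail the check if the result is non-empty.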

Building the Stack: A Real Data Engineering Environment

Let's build something real. This stack covers the tools that appear in most data engineering workflows:

Service	Role	Port
PostgreSQL	Metadata DB + data warehouse	5432
Redis	Message broker for Celery tasks	6379
Apache Airflow	Workflow orchestration	8080
Apache Spark	Distributed data processing	4040, 7077
Adminer	Lightweight DB GUI	8085

Step 1 — Project Structure

Start with a clean folder structure. This matters more than most tutorials admit — messy folders create tangled volume mounts and confusing build contexts.

data-eng-local/
├── compose.yaml
├── .env
├── .env.example
├── airflow/
│   ├── dags/
│   ├── logs/
│   ├── plugins/
│   └── config/
├── spark/
│   └── jobs/
├── postgres/
│   └── init/
│       └── 01_create_schemas.sql
└── README.md

Two rules: put your compose.yaml at the root, and never commit .env to git. Add it to .gitignore now, before you forget.
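If you would rather script the layout than create it by hand, something like this sketch does the job. The `scaffold` function is hypothetical; it creates the folders and empty placeholder files from the tree above and makes sure `.env` is listed in `.gitignore`.

```python
from pathlib import Path

FOLDERS = [
    "airflow/dags", "airflow/logs", "airflow/plugins", "airflow/config",
    "spark/jobs", "postgres/init",
]
FILES = [
    "compose.yaml", ".env", ".env.example", "README.md",
    "postgres/init/01_create_schemas.sql",
]


def scaffold(root: Path) -> None:
    """Create the project skeleton under `root`; safe to run more than once."""
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)   # leaves existing content untouched

    # Add .env to .gitignore exactly once.
    gitignore = root / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    if ".env" not in existing:
        gitignore.write_text("\n".join(existing + [".env"]) + "\n")
```

Calling `scaffold(Path("data-eng-local"))` from the parent directory produces the structure shown above.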

Step 2 — The `.env` File

# .env — DO NOT commit to version control
POSTGRES_USER=dataeng
POSTGRES_PASSWORD=changeme_local
POSTGRES_DB=warehouse

AIRFLOW_UID=50000
AIRFLOW__CORE__FERNET_KEY=your_fernet_key_here
AIRFLOW__WEBSERVER__SECRET_KEY=your_secret_key_here

REDIS_PASSWORD=redis_local_pass

Generate a Fernet key with:

python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

Step 3 — The `compose.yaml` File

Here's the full configuration. Read the inline comments — they explain the decisions, not just the syntax.

# compose.yaml — No version field needed (deprecated in Compose v2)

x-airflow-common: &airflow-common
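  # Pinned to Airflow 2.x: the `webserver` command used below was replaced by
  # `api-server` in Airflow 3, so this file will not start a 3.x image as written.
  image: apache/airflow:2.10.5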
  environment: &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres/${POSTGRES_DB}
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@postgres/${POSTGRES_DB}
    AIRFLOW__CELERY__BROKER_URL: redis://:${REDIS_PASSWORD}@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ${AIRFLOW__CORE__FERNET_KEY}
    AIRFLOW__WEBSERVER__SECRET_KEY: ${AIRFLOW__WEBSERVER__SECRET_KEY}
    AIRFLOW_UID: ${AIRFLOW_UID}
  env_file:
    - .env
  volumes:
    - ./airflow/dags:/opt/airflow/dags        # Bind mount — edit DAGs without rebuilding
    - ./airflow/logs:/opt/airflow/logs
    - ./airflow/plugins:/opt/airflow/plugins
    - ./airflow/config:/opt/airflow/config
  depends_on:
    postgres:
      condition: service_healthy              # Wait for postgres to be ready — not just started
    redis:
      condition: service_healthy

services:
  # ─── DATABASE ──────────────────────────────────────────────────────────────
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data          # Named volume — data survives restarts
      - ./postgres/init:/docker-entrypoint-initdb.d     # SQL files run on first startup
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  # ─── MESSAGE BROKER ────────────────────────────────────────────────────────
  redis:
    image: redis:7-alpine                    # Alpine = smaller image, same functionality
    command: redis-server --requirepass ${REDIS_PASSWORD}
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  # ─── AIRFLOW ───────────────────────────────────────────────────────────────
  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - |
        airflow db migrate &&
        airflow users create \
          --username admin \
          --password admin \
          --firstname Admin \
          --lastname User \
          --role Admin \
          --email admin@example.com
    restart: "no"                            # Run once and exit — not a long-running service

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
    restart: unless-stopped

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    restart: unless-stopped

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    restart: unless-stopped

  # ─── SPARK ─────────────────────────────────────────────────────────────────
  spark-master:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=master
    ports:
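      - "4040:4040"                          # Spark application UI (only up while a driver runs in this container; the master's own web UI listens on 8080)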
      - "7077:7077"                          # Spark master port
    volumes:
      - ./spark/jobs:/opt/spark-jobs
    restart: unless-stopped

  spark-worker:
    image: bitnami/spark:3.5
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
    depends_on:
      - spark-master
    restart: unless-stopped

  # ─── DB GUI ────────────────────────────────────────────────────────────────
  adminer:
    image: adminer:latest
    ports:
      - "8085:8080"
    depends_on:
      - postgres
    restart: unless-stopped

# ─── VOLUMES ───────────────────────────────────────────────────────────────
volumes:
  postgres_data:        # Docker-managed, persists across down/up cycles
  redis_data:

Step 4 — Start the Stack

# First time — initialize Airflow's database and create the admin user
docker compose up airflow-init

# Then start everything else in detached mode
docker compose up -d

# Watch it come up
docker compose ps

# Stream logs for a specific service (useful for debugging)
docker compose logs -f airflow-scheduler

That's it. Once everything is green, your services are at:

Airflow UI: http://localhost:8080 (admin / admin)
Spark UI: http://localhost:4040
Adminer: http://localhost:8085 (server: postgres, user: dataeng)

Health Checks: The Feature Most People Skip

This is the part most tutorials skip — and it's genuinely important.

Without health checks, depends_on is almost useless. By default, Docker considers a container "started" the moment the process launches, not when the service inside it is actually ready to accept connections. PostgreSQL needs a few seconds to initialize. Redis needs a moment to come up. If Airflow tries to connect before they're ready, it crashes — and you end up with a confusing pile of restart loops.

The condition: service_healthy syntax in depends_on fixes this. It tells Docker: don't start this service until that other service's health check is passing. Pair it with a proper healthcheck block on the dependency, and your stack starts in the right order every time.

# This is how you do it properly
depends_on:
  postgres:
    condition: service_healthy  # ← This is the key

Without this, you're relying on timing. Timing is not a strategy.
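
To make the idea concrete, here is a minimal Python sketch of the retry loop a health check implies. It is only a rough analogue of what Docker does, not how Compose implements it. The names wait_until_healthy and tcp_port_open are hypothetical helpers, and the defaults simply mirror the interval and retries values from the postgres healthcheck shown earlier.

```python
import socket
import time


def tcp_port_open(host, port, timeout=5.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_healthy(check, interval=10.0, retries=5, sleep=time.sleep):
    """Call `check` up to `retries` times, pausing `interval` seconds after each failure.

    Returns the number of attempts it took, or raises TimeoutError.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            sleep(interval)
    raise TimeoutError(f"still unhealthy after {retries} checks")


if __name__ == "__main__":
    # Assumes PostgreSQL is published on localhost:5432, as in the compose file.
    try:
        attempts = wait_until_healthy(lambda: tcp_port_open("localhost", 5432), interval=1.0)
        print(f"postgres reachable after {attempts} attempt(s)")
    except TimeoutError as error:
        print(error)
```

An open TCP port is a weaker signal than pg_isready, which is why the compose file uses the database's own readiness command rather than a port probe.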

Common Mistakes and How to Avoid Them

Hardcoding credentials in compose.yaml — Don't. Use .env files. It takes 30 seconds to set up and prevents you from accidentally committing passwords to a public repo. It happens more than you'd think.

Using docker-compose (v1) instead of docker compose (v2) — The old binary is dead. If you're copying configs from tutorials older than mid-2023, check for this.

Including version: in your compose file — This field is obsolete as of Docker Compose v2 and now triggers deprecation warnings in Docker Desktop. Remove it from any file you write or maintain.

Not pinning image versions — postgres:latest is a trap. One upgrade and your init SQL might fail, your connection string might change, your extension might not exist. Pin to postgres:16. Always.

Forgetting to add .env to .gitignore — Seriously. Do this first.

Running docker compose down -v when you meant docker compose down — The -v flag deletes named volumes. That means your database data is gone. There's no undo. Be very intentional with that flag.
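
Several of these mistakes can be caught mechanically. The following is a small, text-based sketch in Python that flags an obsolete version: field, unpinned images, and passwords that look hardcoded. lint_compose is a hypothetical helper, the checks are simple regular expressions rather than a real YAML parser, and false positives are possible.

```python
import re

IMAGE_LINE = re.compile(r"^\s*image:\s*(\S+)")
# A PASSWORD key whose value does not start with "$" (i.e. is not a ${VAR} reference).
HARDCODED_PASSWORD = re.compile(r'''PASSWORD\s*[:=]\s*["']?[^$\s"']''')


def lint_compose(text):
    """Return a list of warnings for a compose file given as a string."""
    warnings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if re.match(r"^version\s*:", line):
            warnings.append(f"line {number}: top-level 'version:' is obsolete")
        match = IMAGE_LINE.match(line)
        if match:
            image = match.group(1)
            # Look for a tag only after the last "/", so registry ports are ignored.
            name = image.rsplit("/", 1)[-1]
            tag = name.split(":", 1)[1] if ":" in name else ""
            if tag in ("", "latest"):
                warnings.append(f"line {number}: image '{image}' is not pinned")
        if HARDCODED_PASSWORD.search(line):
            warnings.append(f"line {number}: password looks hardcoded")
    return warnings


if __name__ == "__main__":
    with open("compose.yaml", encoding="utf-8") as handle:
        for warning in lint_compose(handle.read()):
            print(warning)
```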

Working With Your Stack Day-to-Day

Once it's running, these are the commands you'll reach for most:

# Check what's running and the health status
docker compose ps

# Tail logs from everything
docker compose logs -f

# Tail logs from one service only
docker compose logs -f airflow-worker

# Restart a single service without touching the rest
docker compose restart airflow-scheduler

# Run a one-off command inside a running container
docker compose exec postgres psql -U dataeng -d warehouse

# Open a shell in a container for debugging
docker compose exec airflow-webserver bash

# Stop everything (preserves volumes — safe)
docker compose down

# Nuclear option — stops everything AND deletes all volumes
docker compose down -v

Managing Multiple Environments with Profiles

Here's something worth knowing once your stack gets more complex: Docker Compose Profiles let you define services that only start in certain contexts. You tag a service with profiles: [dev] or profiles: [monitoring] and it only runs when you explicitly request that profile.

services:
  # This only starts when you run: docker compose --profile monitoring up
  prometheus:
    image: prom/prometheus:latest   # Illustrative only; pin a specific tag in real use
    profiles:
      - monitoring
    ports:
      - "9090:9090"

This is how you keep a single compose.yaml that works for all environments — local dev, CI, staging — without maintaining multiple files. It's one of the features that quietly makes Compose much more production-capable than its reputation suggests.
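
The selection rule is easy to model. Below is a minimal Python sketch of it: services with no profiles always start, and services with profiles start only when one of their profiles is active. services_to_start is a hypothetical function, the service dictionary is a hand-written stand-in for a parsed compose file, and the sketch ignores details such as services named explicitly on the command line.

```python
def services_to_start(services, active_profiles=()):
    """Return the names of services Compose would start for the given profiles."""
    active = set(active_profiles)
    started = []
    for name, config in services.items():
        profiles = set(config.get("profiles", []))
        # No profiles: always on. Otherwise at least one profile must be active.
        if not profiles or profiles & active:
            started.append(name)
    return started


services = {
    "postgres": {},
    "prometheus": {"profiles": ["monitoring"]},
    "debug-shell": {"profiles": ["dev"]},
}

print(services_to_start(services))                  # plain "docker compose up"
print(services_to_start(services, ["monitoring"]))  # "--profile monitoring"
```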

E-E-A-T Reference: Tools, Versions, and Sources

Here's a quick reference table with the verified tool versions used in this guide — and where to find official documentation.

Tool	Version Used	Official Docs
Docker Compose	v2.24+	docs.docker.com/compose
Apache Airflow	3.0.4	airflow.apache.org
PostgreSQL	16	postgresql.org/docs
Redis	7 (Alpine)	redis.io/docs
Apache Spark	3.5 (Bitnami)	spark.apache.org/docs

Further reading:

Docker Compose Specification: compose-spec.io — the authoritative reference for all YAML syntax
Apache Airflow official Docker Compose setup: airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html
Data Engineering Zoomcamp (DataTalks.Club): practical hands-on course that covers Docker + Airflow + dbt workflows — datatalks.club/courses/data-engineering-zoomcamp.html
freeCodeCamp — "How to Use Docker Compose for Production Workloads" (March 2026): covers profiles, watch mode, and GPU support in depth

Wrapping Up

Setting up a reliable local data engineering environment used to be a day-long exercise in frustration. With Docker Compose, it's a one-time investment: write the compose.yaml once, commit it to your repo, and everyone on your team gets the exact same environment with docker compose up.

A few things to remember as you go:

Drop the version: field — it's deprecated
Always use condition: service_healthy in depends_on
Pin your image versions, not latest
Keep credentials in .env, never in the compose file
Name your volumes — data you can't recover isn't worth much

The config in this guide is a starting point. As your stack grows, you'll want to look at override files (compose.override.yaml) for environment-specific tweaks, and Compose profiles for toggling monitoring tools, debug containers, or test databases without polluting your main setup.

The whole point is reproducibility. When a teammate clones your repo and runs docker compose up, they should get exactly what you have. That's not a nice-to-have — it's the baseline for any serious data engineering workflow.

Need a Data Engineer Who Already Knows This Stack?

Honestly, this is where a lot of projects stall. The setup is one thing — building production-grade pipelines on top of it, handling schema drift, optimizing Spark jobs, writing reliable Airflow DAGs that don't silently fail at 2am — that's the real work.

If your team is scaling and you need someone who has done this before (not just read about it), that's what we do.

Lucent Innovation helps product teams and growing companies hire experienced data engineers — people who can own the full stack, from local containerized environments to cloud-scale data infrastructure on AWS, GCP, or Azure.

Whether you need to augment your current team with a specialist or are looking to build a data engineering function from scratch, we can help you find and place the right person fast — usually within 2–3 weeks.

💬 Talk to us about hiring a data engineer →

No long intake forms. Just a conversation about what you're building and what you need.

Have questions about this setup or running into issues with a specific service? The comments are open. Also worth checking: the official Apache Airflow Docker Compose documentation gets updated with each release and is often more current than any tutorial you'll find.

Why Your In-House Databricks Team Is Probably Losing You Money

Lucy — Wed, 27 May 2026 10:44:20 +0000

60% of enterprise AI projects get abandoned because of data readiness and infrastructure issues.

Not because of bad ideas. Not because of wrong tooling. Because the foundation wasn't built right and by the time anyone noticed, the cost of fixing it was higher than starting over.

If you're running Databricks in-house, there's a decent chance you're heading toward one of four failure modes. I've seen each of them play out, sometimes in the same org.

1. The "unicorn engineer" job post

You know the one. It asks for someone who can handle platform architecture, complex ETL pipeline design, MLOps, and data governance. Maybe Unity Catalog experience preferred. Definitely Spark optimization. Oh, and some Python.

That person doesn't exist. Or if they do, they're already at a FAANG and not answering your recruiter.

What actually happens is that you hire someone capable, and they spend most of their time on operational noise: manually partitioning tables, babysitting cluster configs, debugging integration issues that have nothing to do with your actual data problems.

Databricks has gotten genuinely complex. Delta Lake, Lakeflow Declarative Pipelines, Unity Catalog: these aren't plug-and-play. A generalist data engineer in 2026 is not the same as a Databricks platform specialist.

A consulting partner brings people who've already built this across multiple clients. You're not buying hours. You're buying what they learned the hard way somewhere else (multi-cloud workspace topology, Liquid Clustering, private endpoint configs) without waiting for your team to acquire those scars.

2. The cloud bill no one is watching

Here's one I've seen kill otherwise solid data platforms quietly.

In-house team gets the pipelines working. Everyone moves on. Nobody sets up auto-termination. Nobody enforces cluster policies. Clusters run indefinitely. Variable workloads stay on always-on compute when they should be hitting Serverless SQL.

[Traditional In-House Setup] ---> Over-provisioned Clusters ---> High Idle Waste & Skyrocketing Bills
[Consulting-Led Framework] ---> Serverless SQL + Cluster Policies ---> Automated Auto-Termination & Controlled Spend

The bill climbs slowly, and then suddenly it's a boardroom conversation.

A proper FinOps setup isn't exciting work, but it has a direct, measurable line to your cloud costs: a mandatory autotermination_minutes setting, enforced instance pool configs, and routing the right workloads away from always-on clusters. This is table stakes; it just often doesn't get done when you're underwater on pipeline work.
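
As a rough illustration, the Python snippet below builds a cluster policy definition in the JSON shape Databricks cluster policies use, where each attribute name maps to a rule of type range or fixed. The Clusters API spells the attribute autotermination_minutes. The pool ID and the limits are placeholders, and violations is a hypothetical local checker, not a Databricks API. The platform enforces policies itself and also applies defaults, which this sketch ignores.

```python
import json

# Policy definition: cap idle time and force clusters onto one instance pool.
policy_definition = {
    "autotermination_minutes": {
        "type": "range",
        "minValue": 10,
        "maxValue": 60,
        "defaultValue": 30,
    },
    "instance_pool_id": {
        "type": "fixed",
        "value": "pool-0123-example",  # placeholder pool ID
        "hidden": True,
    },
}


def violations(cluster_spec, policy):
    """Return human-readable reasons why cluster_spec breaks the policy."""
    problems = []
    for attribute, rule in policy.items():
        value = cluster_spec.get(attribute)
        if rule["type"] == "fixed" and value != rule["value"]:
            problems.append(f"{attribute} must be {rule['value']!r}")
        elif rule["type"] == "range":
            low = rule.get("minValue", float("-inf"))
            high = rule.get("maxValue", float("inf"))
            if value is None or not low <= value <= high:
                problems.append(f"{attribute} must be between {low} and {high}")
    return problems


always_on = {"autotermination_minutes": 0}  # 0 disables auto-termination
print(json.dumps(policy_definition, indent=2))
print(violations(always_on, policy_definition))
```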

3. Governance that gets bolted on after the fact

The pattern is almost universal:

Build the pipelines
Ship the dashboards
Deal with governance "later"

By the time "later" arrives, you've got fragmented data silos, ML models stuck in sandbox environments, inconsistent access controls, and no data lineage. Then someone asks about compliance.

Unity Catalog isn't an afterthought; it's the thing you configure before the pipelines, not after. That means role-based access controls, automated data quality monitoring, and end-to-end lineage tracking. If these aren't in the foundation, your downstream reports are unreliable by design.
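
One way to treat access control as infrastructure is to keep the role-to-privilege mapping in code and generate the grants from it. The Python sketch below does that for a single schema. The group names, the main catalog, and the sales schema are assumptions made up for the example, and grant_statements is a hypothetical helper that only builds SQL strings in Unity Catalog's GRANT syntax; it does not execute anything.

```python
# Role-based access expressed as data: group name -> schema-level privileges.
ROLE_PRIVILEGES = {
    "analysts": ["USE SCHEMA", "SELECT"],
    "data_engineers": ["USE SCHEMA", "SELECT", "MODIFY"],
}


def grant_statements(catalog, schema, role_privileges):
    """Build GRANT statements for every group, in a stable (sorted) order."""
    statements = []
    for group, privileges in sorted(role_privileges.items()):
        statements.append(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`;")
        for privilege in privileges:
            statements.append(
                f"GRANT {privilege} ON SCHEMA {catalog}.{schema} TO `{group}`;"
            )
    return statements


for statement in grant_statements("main", "sales", ROLE_PRIVILEGES):
    print(statement)
```

Because the mapping lives in version control, a change in who can read what becomes a reviewed diff rather than an ad-hoc click in a console.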

The uncomfortable truth: A lot of teams treat governance like a documentation task. It's not. It's infrastructure.

4. The hiring timeline nobody accounts for

Realistic timeline from job post to a team that's onboarded, trained on Databricks, and actually productive:

6–9 months.

That's not pessimism, that's just recruiting + onboarding + platform ramp-up. Most orgs don't factor this in when they're comparing in-house costs against consulting rates.

A consulting firm gets there faster because they're not starting from scratch. Pre-built IaC templates, established Bronze/Silver/Gold ingestion patterns, CI/CD already wired up. Deployment that takes your internal team six months can happen in weeks.

That gap matters if your competitors are already running predictive analytics in production.

So what actually works?

It's not a binary choice, and framing it that way is usually how you end up making the wrong call.

The companies that handle this well use a hybrid model:

Bring in specialists for the hard setup — architecture, Unity Catalog, cluster optimization, MLOps scaffolding
Keep your internal team focused on domain knowledge, custom data products, and the business problems that actually need context to solve

Your internal engineers understand your data, your customers, and your edge cases. That's valuable and hard to transfer. But asking them to also be platform infrastructure experts is how you end up with both things done poorly.

TL;DR

Problem	In-house default	What fixes it
Skill gaps	Overhire, underdeliver	Consulting for platform-specific work
Cloud costs	Idle compute, no policies	FinOps framework from day one
Governance	Bolted on later	Unity Catalog before pipelines
Speed	6–9 months to productivity	Pre-built templates + IaC

The architecture decisions you make in the first few months of a Databricks deployment are surprisingly hard to undo. Getting them right upfront — even with outside help — is almost always cheaper than refactoring a broken foundation at scale.

Have you gone through a Databricks migration or build-out? Curious what broke first — drop it in the comments.

RAG or Fine-Tuning? How We Decide for Our AI Consulting Clients

Lucy — Thu, 21 May 2026 07:22:26 +0000

Choosing the right architecture for an artificial intelligence product is one of the most expensive decisions a business can make. When clients come to Lucent Innovation for AI consulting, they often ask the same core question: should we use RAG or fine-tuning?

Many teams assume they need to train a custom model from scratch to make an AI understand their business. However, making the wrong choice can lead to hundreds of thousands of dollars in wasted cloud computing bills and months of lost development time.

This guide breaks down the choice in simple, plain English. Whether you are a software engineer building the pipeline or a business leader managing the budget, this framework will help you make the right architectural choice.

What is RAG in AI?

To understand your choices, we must begin with the basics of Retrieval-Augmented Generation.

What does RAG stand for in AI?

RAG stands for Retrieval-Augmented Generation. In simple terms, it is an architectural approach that gives a generative AI model an open-book exam.

Instead of relying solely on what the model learned during its initial training, a RAG AI system looks up real-time information from an external database before it answers a user query.

[User Query] ──> [Search External Database] ──> [Retrieve Relevant Text] ──> [Feed into RAG LLM] ──> [Final Accurate Answer]

How does RAG improve the accuracy of generative AI models?

Standard Large Language Models (LLMs) are frozen in time. They only know the data they were trained on. If you ask a standard model about a customer invoice from yesterday, it will either admit it does not know or confidently make up a false answer. This false answer is called a hallucination.

A RAG LLM setup solves this problem by executing a simple multi-step process:

The Retrieval Step: When a user asks a question, the system searches a private corporate database or vector store for matching documents.
The Augmentation Step: The system takes those matching documents and inserts them into the prompt as context, behind the scenes.
The Generation Step: The model reads the question and the inserted documents together and writes an answer grounded in the provided facts.

By grounding the model in verified data, you sharply reduce guessing and let the system draw on real-time, constantly changing information.
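
The first two steps can be sketched in a few lines of Python. This is a toy illustration, not a production design: it swaps the vector store for a bag-of-words cosine similarity over an in-memory list, the sample documents are invented, and the function names are hypothetical. The generation step is left out because it depends on whichever model you call with the resulting prompt.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, top_k=2):
    """Retrieval step: return the top_k documents most similar to the query."""
    query_vector = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vector, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, documents, top_k=2):
    """Augmentation step: place the retrieved text into the prompt as context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


docs = [
    "Invoice 1042 was issued on 3 May for 250 USD.",
    "The office is closed on public holidays.",
    "Refunds are processed within 5 business days.",
]
print(build_prompt("When was invoice 1042 issued?", docs, top_k=1))
```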

The Core Battle: RAG vs Fine Tuning

While RAG gives the model a library card, LLM fine-tuning is completely different. Fine-tuning changes the model itself by updating its internal weights.

Understanding LLM Fine Tuning

When you fine-tune an LLM, you take an existing base model and expose it to a highly specialized dataset for intensive training. This process adjusts the internal weights of the neural network. You are not giving the model an open-book exam: you are sending it back to school to learn a specific style, dialect, or structural format.

RAG vs Fine-Tuning: The Core Differences

To see why this matters for your engineering budget, consider this comparison table of operational trade-offs:

Evaluation Feature	RAG AI Systems	LLM Fine Tuning
Knowledge Base Type	Dynamic and real-time external data	Static snapshot baked into the model
Primary Use Case	Finding specific facts and text chunks	Learning a specific style, tone, or format
Hallucination Control	Very high: sources can be cited directly	Low: can still invent facts if prompt is weak
Upfront Setup Cost	Low to moderate developer hours	High compute costs and specialized data engineering
Data Privacy Boundaries	Easy to restrict data via database permissions	Difficult to restrict access once data is baked in

When to Use Fine Tuning vs RAG?

The choice between fine tuning vs RAG comes down to a simple engineering rule: Use RAG for knowledge, and use fine-tuning for behavior.

The Unique Lucent Innovation Point of View: The Data Lifecycle Reality

Most online guides tell you to evaluate your choice based purely on accuracy. At Lucent Innovation, we tell our enterprise clients to look at something completely different: look at who owns the data and how fast it changes.

If your data changes every hour, every day, or every week, fine tuning LLMs is a terrible operational trap. The moment your business updates a pricing sheet or changes a product feature, your fine-tuned model becomes obsolete. You would have to spend thousands of dollars to retrain it again.

The RAG-versus-fine-tuning decision should follow these operational guidelines:

Choose RAG when:

You need to connect your AI to live business documents, customer support wikis, or internal Slack logs.
You must show users exactly where the information came from by providing source citations and links.
You need to build your product quickly without renting expensive GPU clusters for training cycles.

Choose Fine-Tuning when:

You need the model to reliably output a strict structure, such as JSON that matches a fixed schema.
You want the AI to perfectly mimic a specific person's copywriting style, voice, or industry jargon.
You are working with an ultra-niche domain (like advanced medical pathology reports or ancient legal statutes) that the base model cannot comprehend.

RAG vs Fine Tuning vs Prompt Engineering?

Before jumping into a complex software architecture, engineers should always evaluate the entire spectrum of optimization. This brings us to a three-way comparison: RAG vs fine tuning vs prompt engineering.

[Prompt Engineering] ──> Simple instructions in the text box (Minutes to set up)
[RAG Architecture]   ──> Hooking up a search engine to the text box (Days to set up)
[Fine-Tuning]        ──> Re-wiring the underlying engine itself (Weeks to set up)

Prompt engineering is the foundation. It involves writing clever, descriptive instructions directly inside your system prompt. For instance, telling a model to "act like a professional accountant" is prompt engineering.

The Decision Spectrum

Prompt Engineering: Best for fast prototyping, basic text transformations, and setting up initial rules.
RAG vs Prompt Engineering: When your system prompt gets too full of information, it hits a wall. Standard context windows can become slow and expensive. That is when you step up to RAG, which selectively feeds only the relevant data chunks into the prompt instead of dumping the entire database.
Fine-Tuning: The final step. Once your RAG system knows what to say, you can use fine-tuning to perfect how it says it, shrinking your prompt sizes and reducing latency.
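
The rule of thumb (RAG for knowledge, fine-tuning for behavior, prompt engineering as the default starting point) can be written down as a tiny decision helper. This is a simplified sketch in Python; recommend_approach and its four yes/no inputs are hypothetical and compress the criteria discussed above, so treat the output as a conversation starter rather than a verdict.

```python
def recommend_approach(data_changes_often, needs_citations,
                       needs_strict_style_or_format, niche_domain):
    """Map four yes/no answers to a list of suggested techniques."""
    approaches = []
    # Knowledge problems: fresh facts or traceable sources point to RAG.
    if data_changes_often or needs_citations:
        approaches.append("RAG")
    # Behavior problems: fixed format, voice, or unfamiliar domain language.
    if needs_strict_style_or_format or niche_domain:
        approaches.append("fine-tuning")
    # Neither applies: stay with prompt engineering.
    return approaches or ["prompt engineering"]


# Support assistant: live product docs plus a strict brand voice.
print(recommend_approach(True, True, True, False))
```

When both techniques come back, that corresponds to a hybrid setup where RAG supplies the facts and a fine-tuned model supplies the tone.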

Real World Client Scenario: How We Consult

To make this practical, let us look at a real architecture challenge we solved for one of our enterprise consulting clients.

The client wanted an AI assistant to help their customer success team look up technical product specifications and write email responses in the company's precise tone of voice.

Instead of picking just one path, we deployed a hybrid strategy:

The RAG Layer: We hooked up their product documentation manuals to a vector database pipeline. This ensured that the AI always retrieved 100 percent accurate product specifications, eliminating hallucinations.
The Fine-Tuning Layer: We took the base open-source model and fine-tuned it on 5,000 historical customer service emails that were manually approved by their marketing team. This taught the model's brain to always write responses with a helpful, warm, and structured corporate tone.

By combining the open-book access of RAG with the behavioral habits of fine-tuning, the client achieved a 40 percent reduction in average ticket handling time while keeping errors at absolute zero.

Conclusion: Designing Your AI Roadmap

There is no single winner in the battle of RAG vs fine tuning. They are complementary tools designed for completely different software problems.

If your product goals require access to fresh facts, internal knowledge bases, and clear data source tracking, building a RAG framework is your optimal choice. If your product demands strict adherence to complex code layouts or deep alignment with a specific brand persona, investing in custom weights is the right path forward.

Get Expert Engineering Guidance

Navigating these architectural decisions requires deep hands-on experience. Making a mistake early in your development cycle can result in severe technical debt and bloated maintenance costs.

At Lucent Innovation, we specialize in helping businesses design, build, and optimize high-performance AI systems that drive real business outcomes. We analyze your data dynamics, security requirements, and budget constraints to engineer the perfect pipeline for your platform.

Are you unsure which approach fits your upcoming product? This is exactly what our engineering team helps clients figure out every day. Let us protect your runway and accelerate your deployment timeline. Book a free discovery call with the Lucent Innovation AI consulting team today.

What Does a Databricks Consulting Partner Actually Do? (An Enterprise Buyer's Guide)

Lucy — Wed, 20 May 2026 09:26:49 +0000

You've probably sat through at least one vendor call where someone said "end-to-end Databricks implementation" three times in ten minutes and still left with no idea what they'd actually do after signing.

That's the problem with how most Databricks consulting services are sold. The language is polished. The decks look great. But the specifics? Suspiciously vague.

So let's just say the quiet part out loud: here's what a real partner does, week by week, and what separates a genuinely good one from a well-branded generalist.

The 4 Things a Databricks Partner Is Actually Responsible For

1. Architecture First, Not Notebooks First

The first red flag? A partner who opens a Databricks workspace before they've audited your current data estate.

A good one starts by understanding what you already have: your sources, your pipelines, your governance gaps, and where money is quietly leaking. Only then do they design an environment that fits your workloads.

In practice, that means:

Choosing the right cloud (AWS, Azure, or GCP) based on your existing infrastructure, not on what the partner is most comfortable with
Designing a medallion architecture (Bronze → Silver → Gold) with your actual data volumes in mind
Standing up Unity Catalog for governance from day one, not as an afterthought six months later when things get messy
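
For readers new to the medallion pattern, here is a minimal sketch of the three layers using plain Python lists and dictionaries instead of Delta tables. The order records, field names, and cleaning rules are invented for the example, and to_silver and to_gold are hypothetical function names; a real implementation would run on Spark with far more validation.

```python
def to_silver(bronze_rows):
    """Bronze -> Silver: keep only valid rows and normalize their fields."""
    silver = []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # drop rows with a missing or non-numeric amount
        if not row.get("order_id"):
            continue  # drop rows without an identifier
        region = (row.get("region") or "unknown").strip().lower()
        silver.append({"order_id": row["order_id"], "region": region, "amount": amount})
    return silver


def to_gold(silver_rows):
    """Silver -> Gold: aggregate revenue per region for reporting."""
    totals = {}
    for row in silver_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


# Bronze: raw records exactly as they arrived, including bad ones.
bronze = [
    {"order_id": "A1", "region": " EU ", "amount": "10.5"},
    {"order_id": "A2", "region": "eu", "amount": "4.5"},
    {"order_id": None, "region": "us", "amount": "3"},
    {"order_id": "A4", "region": "US", "amount": "oops"},
]
print(to_gold(to_silver(bronze)))
```

Keeping the raw bronze rows untouched is the point of the design: when a cleaning rule changes, silver and gold can be rebuilt from the original data.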

2. Pipeline Engineering: The Real Heavy Lifting

Most enterprise data sits across five different places: a legacy ERP, a couple of SaaS tools, some flat files someone's been emailing around, and a Snowflake instance that half the team has forgotten the password to.

A Databricks partner consolidates this: building Delta Live Tables pipelines or custom Spark jobs that handle schema evolution, bad data, and SLA expectations. Not "it works on my machine" pipelines. Production-grade ones.

If you're coming from Hadoop or an aging data warehouse, this is where 90% of the real effort lives. It's also where you'll quickly learn whether your partner has actually done this before or just watched the conference talk.

3. Cost and Performance: Ongoing, Not Optional

Here's something vendors rarely lead with: Databricks compute costs can spiral fast if nobody's actively managing them.

A partner worth keeping around puts in:

Auto-scaling cluster policies so you're not paying for idle compute at 2am
Photon engine tuning for SQL-heavy workloads
Cost dashboards that map spend to actual business units, so finance stops asking you to explain the cloud bill
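
The cost-to-business-unit mapping usually comes down to tagging compute and grouping spend by tag. A minimal Python sketch of that aggregation follows. The record shape, the business_unit tag key, and the dollar figures are assumptions for illustration; real usage data would come from your cloud or Databricks billing exports.

```python
from collections import defaultdict


def spend_by_business_unit(usage_records, tag_key="business_unit"):
    """Sum cost per business unit; untagged spend is reported separately."""
    totals = defaultdict(float)
    for record in usage_records:
        unit = record.get("tags", {}).get(tag_key, "untagged")
        totals[unit] += record["cost_usd"]
    return dict(totals)


usage = [
    {"cluster": "etl-nightly", "cost_usd": 120.0, "tags": {"business_unit": "marketing"}},
    {"cluster": "adhoc-sql", "cost_usd": 30.0, "tags": {"business_unit": "marketing"}},
    {"cluster": "scratch", "cost_usd": 12.5, "tags": {}},
]
print(spend_by_business_unit(usage))
```

The untagged bucket is worth watching on its own, since spend nobody has claimed is often where idle clusters hide.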

This isn't a one-time setup. It's a habit. If a partner treats it as a checkbox, your AWS invoice will tell you eventually.

4. ML and AI Enablement: When You're Ready to Go Beyond Dashboards

A lot of enterprise teams reach a point where SQL dashboards aren't enough. They want predictions, recommendations, and anomaly detection: actual ML in production.

A Databricks partner with real ML capability sets up MLflow for experiment tracking, builds feature pipelines through Feature Store, and helps your data science team stop rebuilding infrastructure every time they want to ship a model.

This is genuinely where the Databricks ecosystem shines and where the right partner can save months of engineering time.

How to Actually Vet a Databricks Partner (Beyond the Sales Deck)

Most of this won't be on their website. You have to ask.

Check for Databricks certification at the engineer level, not just a partner tier badge. Certified Data Engineer Associate or Professional means someone on their team has passed a proctored technical exam. That's meaningful.

Ask for vertical-specific references: A partner who's built lakehouse pipelines for a D2C brand thinks about schema design very differently than one who's only done banking compliance reporting. Generic case studies are a yellow flag.

Pin down the post-go-live model: Ask, "What does month three with your team look like?" If the answer is vague or pivots back to the onboarding process, they're not thinking past the implementation phase.

Confirm you own the code: Sounds obvious. Isn't always. Any partner who builds undocumented pipelines or ties you to proprietary tooling is creating dependency, not capability. Get this in writing.

Timing Matters More Than Most People Think

The best moment to bring in a Databricks partner is before your data team has built workarounds they're now defending as architecture.

Before ad-hoc notebooks become your production pipeline. Before cluster policies are an afterthought. Before your engineers are spending more time firefighting than building.

If AI and ML use cases are on your roadmap alongside the data modernization work (and they probably should be), it's worth reading why mid-market enterprises are moving on AI consulting partnerships before 2027. The timelines are more connected than most teams realize.

One Last Thing: Good Partners Ask Uncomfortable Questions

The best Databricks consulting services engagement you'll ever have won't start with a proposal. It'll start with questions that make you think.

Things like:

"What does 'data-ready' actually mean for your business in 12 months?"
"Who currently owns data quality decisions and what happens when something breaks?"
"What's the real blocker for your team right now: skills, tooling, or architecture?"

If a vendor skips all of that and jumps to pricing, pay attention to that instinct telling you something's off.

A grounded look at what structured Databricks consulting services actually cover (certifications, engagement models, and specific deliverables) is a solid benchmark before your next vendor call.

Evaluating Databricks partners? Drop the questions you're struggling to get straight answers on in the comments; happy to help you cut through the noise.

Why Mid-Market Enterprises Need an AI Consulting Partner Before 2027

Lucy — Fri, 15 May 2026 11:09:34 +0000

Let’s strip away the "corporate-speak" for a moment. If you're running a mid-market company right now, AI probably feels less like a "revolutionary tool" and more like a loud, confusing neighbor who won't stop knocking on your door. Everyone’s talking about it, your bigger competitors are already using it, and your team keeps asking, “So… what’s our plan?”

The truth is: You don't have to become an AI expert overnight. But you'll probably need experienced help to get it right, especially before 2027, when things are expected to move much faster.

Most AI Projects Still Fail, and That’s Expensive

Most AI experiments never see real use. Common reasons? Messy data, no clear business goals, integration headaches, or trying to do too much at once.

As a mid-market leader, you don’t have an endless budget to burn on science projects. You need results that show up in the P&L—faster automation in operations, smarter sales tools, better customer experiences, or fewer errors.

This is where a good AI consulting partner makes a big difference. They’ve seen mistakes before, know which use cases really deliver ROI for companies your size, and can help you build on solid data and processes rather than jumping straight to flashy tools.

[Messy Legacy Data] -> [Expensive LLM] -> [Confidently Incorrect Answers to Customers]

You Can’t Just Hire Your Way Out of This

Finding and retaining real AI talent is highly competitive and expensive. Most mid-market companies can’t build the perfect AI dream team, and even if you could, it would take a long time to make them fully productive in your specific environment, systems, and industry.

This is where partners like Lucent Innovation Services become incredibly valuable. They give you immediate access to experienced AI experts without a huge full-time hiring commitment. They work side by side with your team, help your existing people upskill, and create practical solutions that truly fit your technology stack and company culture, not a generic template.

You Need a Strategy That Fits Your Reality

What works for a Fortune 500 company often doesn’t work for you. Different budgets, risk tolerances, and pace of operations.

Good consultants help you create a practical, phased plan:

Start with the problems that have the most impact
Deliver quick wins to build momentum
Avoid the “graveyard of unused AI subscriptions”
Make sure everything is truly connected to your existing technology

They also help you prepare for what’s to come: smarter AI agents, stricter regulations, and higher expectations around responsible use.

The "Boutique" Difference: Why Big Consulting Isn't Always Better

The Final Words

2027 is the year when “AI” will stop being a buzzword and become a core part of being competitive. Working with a partner isn’t about becoming the most high-tech company on the block; it’s about ensuring your business remains agile enough to compete as the rules of the game change.

With most companies still stuck in the “experimentation” phase, is your team more hesitant about the technical setup or the cultural change of adopting AI?

How to Transition from a Traditional Data Warehouse to a Modern Lakehouse

Lucy — Thu, 14 May 2026 09:55:19 +0000

If your data warehouse feels slow, expensive, or hard to scale, you are not alone.

Many teams are hitting the same wall. Reports take too long. Storage costs keep going up. And when the machine learning team asks for raw data, the answer is always "we don't have that here."

The good news? There is a clear path forward. It is called the data lakehouse, and thousands of companies have already made the switch.

This guide will walk you through exactly what a lakehouse is, why it matters, and how to move from your old warehouse to a modern setup without breaking everything along the way.

What Is a Traditional Data Warehouse?

A traditional data warehouse is a structured database that holds cleaned, organized data for reporting and analytics. Tools like Teradata, Netezza, and on-premises SQL servers fall into this group.

What a traditional warehouse does well

Fast SQL queries on structured data
Reliable data for business reports
Strong data quality controls

Where it falls short

Very expensive to store large amounts of data
Hard to handle unstructured data like logs, images, or JSON files
Cannot easily support real-time analytics or AI workloads
Scaling up often means buying more expensive hardware

According to ACL Digital's migration strategy guide, traditional data warehouses are reaching their limits. Rising infrastructure costs, rigid architectures, and the inability to support real-time analytics are slowing down enterprise teams.

What Is a Data Lakehouse?

A data lakehouse is a newer kind of data platform. It combines the best parts of two older systems: the data lake and the data warehouse.

Here is a simple breakdown of all three:

Feature	Data Warehouse	Data Lake	Data Lakehouse
Storage cost	High	Low	Low
Handles unstructured data	No	Yes	Yes
Fast SQL queries	Yes	No	Yes
ACID transactions	Yes	No	Yes
Good for AI/ML	No	Partial	Yes
Data governance	Strong	Weak	Strong
Schema enforcement	Strict	None	Flexible

As Analytics8 explains, a lakehouse stores all your data in one place and reduces costs associated with managing multiple storage systems. It supports everything from traditional transaction records to images, video, and raw text files.

Why Teams Are Moving to a Lakehouse in 2026

The shift is not just about new technology. It is about what your business actually needs to stay competitive.

Here are the biggest reasons teams are making the move:

AI and machine learning need raw data. A traditional warehouse only keeps clean, transformed data. AI tools need the original records too. A lakehouse keeps both.
Real-time analytics are now expected. Batch reports that run once a day are not fast enough for modern decisions. A lakehouse supports streaming data alongside batch loads.
Storage costs are out of control. Cloud-based lakehouse storage costs a fraction of what a traditional warehouse charges for the same volume.
One platform for everything. Data engineers, analysts, and data scientists can all work on the same data without moving copies between systems.

IDC research cited by Kanerika found that over 70% of enterprises have already begun moving workloads from legacy warehouses to lakehouse platforms for better performance and cost efficiency.

If you want to understand the full picture of how modern data platforms are built today, the Modern Data Engineering Guide by Lucent Innovation covers every major concept, from pipelines to Delta Lake to Databricks, in one place.

Before You Start: Things to Check First

Do not rush into a migration. The biggest risk is moving a broken or messy environment and making it worse.

Before you write a single line of migration code, answer these questions:

Understand your current state

What data sources feed your warehouse today?
Which pipelines run daily, weekly, or on demand?
Which workloads are business-critical and which can wait?
What does your current schema look like?

Assess your team

Does your team know tools like Apache Spark, Delta Lake, or Databricks?
Do you have a data governance policy in place?
Who owns each data domain in your organization?

Set success metrics

What does a successful migration look like?
How will you measure data quality before and after?
What is your rollback plan if something goes wrong?

As logiciel.io advises in their enterprise migration guide, migration is about trust and confidence, not speed. If you migrate an unstable or inconsistent environment, you are adding extra risk to the project.

Step-by-Step: How to Transition from a Data Warehouse to a Lakehouse

Step 1: Audit Your Existing Data Environment

Start by making a full map of what you have.

Document the following:

All data sources (databases, APIs, flat files, SaaS tools)
All existing ETL pipelines and how often they run
All tables, schemas, and row counts
All dashboards and reports that depend on warehouse data
All users who query the warehouse regularly

This audit will help you figure out what to migrate first and what can wait.

Step 2: Pick Your Lakehouse Platform

The most widely used lakehouse platform today is Databricks, which is built on open-source tools like Apache Spark, Delta Lake, and MLflow.

Other options include:

Microsoft Fabric for organizations already in the Microsoft ecosystem
Apache Iceberg on AWS or GCP for teams that want open table formats
Snowflake for teams that want a SQL-first approach with some lakehouse features

Databricks documentation explains that replacing your data warehouse with a lakehouse is not about eliminating data warehousing. It is about unifying your data ecosystem so analysts, data scientists, and engineers can all work on the same tables in the same platform.

How to choose the right platform:

Need	Recommended Option
Unified AI and analytics	Databricks
Microsoft tools already in use	Microsoft Fabric
Strong SQL-first team	Snowflake
Multi-cloud with open formats	Apache Iceberg

Step 3: Set Up Your Lakehouse Storage Layer

Once you pick a platform, you need to set up your storage foundation.

What this involves:

Set up a cloud object storage account (AWS S3, Azure Data Lake Storage, or Google Cloud Storage)
Layer Delta Lake or your chosen open table format on top of it
Configure your metadata catalog (Unity Catalog in Databricks is the standard choice)
Set up access controls and permissions from the start

Delta Lake is especially important here. It adds ACID transactions to plain storage files. That means:

Writes either fully complete or fully roll back. No partial or corrupted data.
Schema enforcement rejects bad data before it lands.
Time travel lets you query data as it looked at any point in the past.
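
To make these three guarantees concrete, here is a toy in-memory model in plain Python. It is not the Delta Lake API: `ToyVersionedTable` and its methods are hypothetical names invented for this sketch. It only mimics the behavior described above, where a batch that fails validation changes nothing and every committed version stays readable.

```python
from __future__ import annotations

from typing import Any, Optional


class ToyVersionedTable:
    """Illustrative stand-in for a transactional table (not Delta Lake itself)."""

    def __init__(self, schema: dict[str, type]) -> None:
        self.schema = schema
        # Version 0 is the empty table; each commit adds a full snapshot.
        self._versions: list[tuple[dict[str, Any], ...]] = [()]

    def _validate(self, row: dict[str, Any]) -> None:
        # Schema enforcement: reject unexpected columns or wrong types.
        if set(row) != set(self.schema):
            raise ValueError(f"columns {sorted(row)} do not match the schema")
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise ValueError(f"{column!r} must be {expected.__name__}")

    def append(self, rows: list[dict[str, Any]]) -> int:
        # Validate the whole batch before committing, so one bad row
        # aborts the write and leaves the table untouched.
        for row in rows:
            self._validate(row)
        snapshot = self._versions[-1] + tuple(dict(r) for r in rows)
        self._versions.append(snapshot)
        return len(self._versions) - 1  # the new version number

    def read(self, version: Optional[int] = None) -> list[dict[str, Any]]:
        # Time travel: read the latest snapshot or any earlier version.
        snapshot = self._versions[-1 if version is None else version]
        return [dict(r) for r in snapshot]


orders = ToyVersionedTable({"order_id": int, "amount": float})
v1 = orders.append([{"order_id": 1, "amount": 19.99}])
try:
    # The second row has a string id, so the whole batch is rejected.
    orders.append([{"order_id": 2, "amount": 5.0}, {"order_id": "3", "amount": 1.0}])
except ValueError:
    pass
v2 = orders.append([{"order_id": 2, "amount": 5.0}])
```

After this runs, `orders.read()` returns both valid orders, while `orders.read(version=v1)` still returns only the first one. A real table format achieves the same effect with a transaction log over files rather than in-memory snapshots.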

You can read a full breakdown of how Delta Lake works in the Modern Data Engineering Guide, which explains each capability with real-world context.

Step 4: Design Your Data Layers (Bronze, Silver, Gold)

One of the best practices in a lakehouse is using the Medallion Architecture. This organizes your data into three clear layers.

Layer	What Goes Here	Example
Bronze	Raw data exactly as it arrived from the source	Original CSV files, API responses, database snapshots
Silver	Cleaned and validated data	Duplicates removed, nulls handled, schema enforced
Gold	Business-ready aggregated data	Revenue by region, daily active users, churn metrics

Why this matters:

You can always go back to the raw data if something goes wrong
Each layer has a clear quality standard
Analysts work on Gold. Engineers debug in Bronze. Everyone knows where to look.

This layered approach is one of the most important design patterns in modern data engineering. It keeps your data trustworthy at every stage.
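
The three layers can be sketched with plain Python lists. This is a minimal illustration, not platform code: the sample records and the `to_silver` and `to_gold` helpers are assumptions made up for this example. On a real lakehouse the same steps would usually be Spark SQL or dbt models writing to separate tables.

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived, flaws included.
bronze = [
    {"order_id": "1", "region": "EU", "amount": "100.50"},
    {"order_id": "1", "region": "EU", "amount": "100.50"},  # duplicate
    {"order_id": "2", "region": "US", "amount": None},      # missing amount
    {"order_id": "3", "region": "US", "amount": "40.00"},
]


def to_silver(rows):
    """Clean and validate: drop nulls, remove duplicates, enforce types."""
    seen, silver = set(), []
    for row in rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "region": row["region"],
            "amount": float(row["amount"]),
        })
    return silver


def to_gold(rows):
    """Aggregate to a business-ready metric: revenue by region."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["region"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because `bronze` is never modified, a bug in `to_silver` can be fixed and the later layers rebuilt from the untouched raw records.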

Step 5: Migrate Your Data in Phases

Do not try to move everything at once. A phased migration by domain or workload is much safer.

A common phasing approach:

Phase 1: Migrate non-critical or low-traffic workloads first. Use these to learn the platform.
Phase 2: Migrate medium-priority domains. Validate data quality against the old warehouse in parallel.
Phase 3: Migrate business-critical workloads. Keep the old warehouse running as a fallback until you are confident.
Phase 4: Decommission the old warehouse once all queries and dashboards have been validated.

logiciel.io's enterprise migration playbook notes that an initial migration per domain typically takes 8 to 12 weeks, with a full migration across an organization taking several months. Planning for this timeline is important.

What to check during each phase:

Row counts match between old and new systems
Aggregated totals (revenue, counts, averages) match
Dashboards and reports produce the same numbers
Query performance is equal to or better than before
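
The first two checks are easy to automate. The sketch below assumes you can pull the same table from both systems as lists of dictionaries; `compare_tables` and the sample rows are hypothetical names for illustration, and in practice you would run the equivalent counts and sums as SQL on each side.

```python
import math


def compare_tables(old_rows, new_rows, numeric_column, rel_tol=1e-9):
    """Return check name -> passed for one migrated table."""
    old_total = math.fsum(r[numeric_column] for r in old_rows)
    new_total = math.fsum(r[numeric_column] for r in new_rows)
    return {
        "row_count_matches": len(old_rows) == len(new_rows),
        # Compare with a tolerance: float sums can differ slightly by engine.
        "total_matches": math.isclose(old_total, new_total, rel_tol=rel_tol),
    }


warehouse = [{"id": 1, "revenue": 120.0}, {"id": 2, "revenue": 80.0}]
lakehouse_ok = [{"id": 1, "revenue": 120.0}, {"id": 2, "revenue": 80.0}]
lakehouse_missing_row = [{"id": 1, "revenue": 120.0}]

ok_report = compare_tables(warehouse, lakehouse_ok, "revenue")
bad_report = compare_tables(warehouse, lakehouse_missing_row, "revenue")
```

Here `ok_report` passes both checks and `bad_report` fails both, which is the signal to stop and investigate before moving to the next phase.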

Step 6: Rewrite or Migrate Your Pipelines

Your old ETL pipelines will need to be updated for the new platform.

In a traditional warehouse, most pipelines use the ETL pattern: extract the data, transform it in the middle, then load the clean version.

In a lakehouse, the preferred pattern is ELT: extract the raw data, load it first, then transform it inside the platform using the compute power already available there.

ETL vs ELT at a glance:

Pattern	Transform Location	Best For
ETL	Outside the warehouse	Legacy systems, tightly controlled schemas
ELT	Inside the lakehouse	Cloud-native, large volumes, AI workloads

When rewriting pipelines, focus on:

Moving transformation logic into Spark SQL or dbt
Switching from full loads to incremental loads where possible
Adding data quality checks at each stage
Using Change Data Capture (CDC) for source systems that update records frequently
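
Incremental loading and CDC both come down to the same idea: process only what changed since the last run, then merge it into the target by key. The following is a minimal sketch in plain Python under assumed names (`incremental_batch`, `apply_changes`, and the `op` and `updated_at` fields are hypothetical). On a lakehouse this is typically a `MERGE` statement against the target table.

```python
from datetime import datetime


def incremental_batch(source_rows, last_watermark):
    """Select only rows changed since the previous run."""
    return [r for r in source_rows if r["updated_at"] > last_watermark]


def apply_changes(target, changes):
    """Apply CDC-style changes keyed by id: upsert or delete, oldest first."""
    for change in sorted(changes, key=lambda r: r["updated_at"]):
        if change.get("op") == "delete":
            target.pop(change["id"], None)
        else:
            target[change["id"]] = {
                "name": change["name"],
                "updated_at": change["updated_at"],
            }
    return target


last_watermark = datetime(2026, 1, 1)
target = {
    1: {"name": "Ada", "updated_at": datetime(2026, 1, 1)},
    2: {"name": "Bob", "updated_at": datetime(2026, 1, 1)},
}
source_changes = [
    {"id": 1, "op": "upsert", "name": "Ada L.", "updated_at": datetime(2026, 1, 3)},
    {"id": 2, "op": "delete", "name": None, "updated_at": datetime(2026, 1, 2)},
    {"id": 3, "op": "upsert", "name": "Cy", "updated_at": datetime(2026, 1, 2)},
    {"id": 4, "op": "upsert", "name": "Old", "updated_at": datetime(2025, 12, 31)},
]

batch = incremental_batch(source_changes, last_watermark)
target = apply_changes(target, batch)
# Store the new watermark so the next run starts where this one ended.
new_watermark = max(r["updated_at"] for r in batch)
```

The row for id 4 is older than the watermark, so it is skipped instead of being reloaded. That skip is what makes the run incremental rather than a full load.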

Step 7: Set Up Data Governance from Day One

This is where many migrations go wrong. Teams focus on moving data and forget about governing it.

What governance means in practice:

Every table has a documented owner
Access controls are set at the table and column level
Data lineage tracks where each field came from
Sensitive data is masked or encrypted

In Databricks, Unity Catalog handles all of this in one place. It gives you access control, data lineage, auditing, and discovery across your entire lakehouse.

As Databricks documentation explains, governance configuration is one of the first things admins should complete, not something to add later.
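
Column-level access and masking can be pictured with a few lines of plain Python. This is not Unity Catalog syntax: `COLUMN_POLICY`, the role names, and `read_row` are hypothetical and exist only to show the rule being enforced at read time.

```python
# Hypothetical policy: roles allowed to see each sensitive column unmasked.
COLUMN_POLICY = {
    "email": {"data_steward"},
    "salary": {"data_steward", "finance_analyst"},
}


def mask(value):
    """Hide all but the last two characters of a value."""
    text = str(value)
    return "*" * max(len(text) - 2, 0) + text[-2:]


def read_row(row, role):
    """Return a copy of the row with restricted columns masked for this role."""
    visible = {}
    for column, value in row.items():
        allowed_roles = COLUMN_POLICY.get(column)
        if allowed_roles is None or role in allowed_roles:
            visible[column] = value  # not sensitive, or the role is allowed
        else:
            visible[column] = mask(value)
    return visible


employee = {"name": "Ada", "email": "ada@example.com", "salary": 90000}
analyst_view = read_row(employee, "finance_analyst")
marketing_view = read_row(employee, "marketing")
```

The finance analyst sees the real salary but a masked email, while the marketing role sees both columns masked. Keeping the policy in one place, rather than in each query, is the point of a central catalog.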

Step 8: Add Monitoring and Observability

Once your lakehouse is running, you need to know when something breaks.

Set up alerts and monitoring for:

Pipeline failures or delays
Data quality checks that fail (unexpected nulls, out-of-range values, schema changes)
Cost per pipeline run (cloud compute is not free)
Row count anomalies between runs

Good observability means your team catches problems before downstream users notice them. Without it, broken data quietly reaches dashboards and decisions are made on bad numbers.
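
Two of the checks above, unexpected nulls and row count anomalies, fit in a short Python sketch. The function names and thresholds here are assumptions chosen for illustration; tune them to your own pipelines or use your platform's built-in expectations.

```python
def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def row_count_anomaly(previous_count, current_count, max_change=0.5):
    """Flag a run whose row count moved more than max_change vs the last run."""
    if previous_count == 0:
        return current_count > 0
    return abs(current_count - previous_count) / previous_count > max_change


def run_checks(rows, previous_count, required_columns, max_null_rate=0.01):
    """Return a list of alert messages; an empty list means healthy."""
    alerts = []
    for column in required_columns:
        rate = null_rate(rows, column)
        if rate > max_null_rate:
            alerts.append(f"{column}: null rate {rate:.0%} is above the limit")
    if row_count_anomaly(previous_count, len(rows)):
        alerts.append(f"row count changed from {previous_count} to {len(rows)}")
    return alerts


broken_run = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
healthy_run = [{"id": i, "amount": 1.0} for i in range(10)]

broken_alerts = run_checks(broken_run, previous_count=10, required_columns=["amount"])
healthy_alerts = run_checks(healthy_run, previous_count=10, required_columns=["amount"])
```

The broken run raises two alerts (half the amounts are null and the row count dropped from 10 to 2), while the healthy run raises none. Wire the returned list into whatever alerting channel your team already watches.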

According to N-IX's 2026 data engineering trends analysis, Gartner forecasts that 50% of organizations with distributed data architectures will have adopted data observability platforms by 2026, up from less than 20% in 2024.

Common Mistakes to Avoid

Mistake	Why It Hurts	What to Do Instead
Moving everything at once	High risk, hard to debug	Migrate in phases by domain
Skipping governance setup	Data becomes ungoverned and hard to trust	Set up Unity Catalog or equivalent on day one
Ignoring data quality checks	Bad data reaches analysts	Add quality checks at every pipeline stage
Not training the team	Engineers default to old patterns	Invest in training before the migration starts
Decommissioning the old system too early	No fallback if problems appear	Run both systems in parallel until fully validated

How Long Does a Migration Take?

There is no single answer, but here is a realistic range based on common experience:

Migration Scope	Estimated Timeline
Single data domain (pilot)	8 to 12 weeks
Mid-size organization, 3 to 5 domains	4 to 6 months
Large enterprise, full migration	12 to 18 months

The biggest factor is not the technology. It is the readiness of your data, your team, and your stakeholders.

What You Get on the Other Side

When the migration is done, here is what your team gains:

Lower storage costs. Cloud object storage is much cheaper than traditional warehouse storage for the same volume.
One platform for all workloads. Data engineering, analytics, and AI all work on the same data.
Real-time capabilities. You can now run streaming pipelines alongside batch loads.
AI-ready data. Raw, structured, and unstructured data all live in one governed place. Your ML team can finally access what they need.
Better reliability. Delta Lake's ACID transactions mean no more corrupted or partial writes.
Full data lineage. You can trace any number back to its source.

Frequently Asked Questions

What is the difference between a data lake and a data lakehouse?

A data lake stores raw data cheaply but has no structure or quality controls. A data lakehouse adds ACID transactions, schema enforcement, and fast query support on top of that same low-cost storage. A lakehouse gives you the flexibility of a lake with the reliability of a warehouse.

Do I have to use Databricks for a lakehouse?

No. You can use Apache Iceberg, Microsoft Fabric, or other platforms. Databricks is the most popular choice because it is built on widely used open-source tools and has a complete feature set for data engineering, analytics, and AI.

How do I handle data that cannot be moved?

Not all data needs to move at once. You can query external data sources through a lakehouse using federated query tools while you plan a full migration. Governance and metadata can cover both old and new systems during the transition.

Will my existing SQL queries still work?

Most SQL queries written for traditional warehouses will work in a lakehouse with few or no changes. Databricks notes that most workloads and dashboards can run with minimal code changes after the initial migration and governance setup.

Is a lakehouse good for small teams?

Yes. Serverless compute options mean small teams only pay for what they use. You do not need a large infrastructure team to manage it.

Learn More About Modern Data Engineering

This article covers the migration process, but there is much more to learn about how a modern data platform works.

If you want to understand the full picture, including how data pipelines work, what ETL vs ELT really means, and how tools like Delta Lake and Databricks fit together, the Modern Data Engineering Guide by Lucent Innovation is a great place to start. It covers every layer of a modern data platform from ingestion to governance in one detailed guide.

Wrapping Up

Moving from a traditional data warehouse to a modern lakehouse is not a quick project. But it is one of the most valuable investments a data team can make.

Here is a quick recap of the steps:

Audit your current environment before touching anything
Pick the right lakehouse platform for your team
Set up your storage layer with Delta Lake or an open table format
Design Bronze, Silver, and Gold data layers
Migrate data in phases, domain by domain
Rewrite pipelines from ETL to ELT patterns
Set up governance before you go live, not after
Add monitoring so you catch problems early

Start small. Pick one domain. Prove it works. Then expand.

The teams that build solid data foundations today will have a clear advantage when it comes time to run AI, real-time analytics, and anything else the business needs next.

Have you started a lakehouse migration at your organization? Share what worked or what you would do differently in the comments below.
