<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vijay Ashley Rodrigues</title>
    <description>The latest articles on DEV Community by Vijay Ashley Rodrigues (@vijayrodrigues).</description>
    <link>https://dev.to/vijayrodrigues</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3170056%2F1ab3a2d7-094e-4617-b090-677f0c9e10f3.jpg</url>
      <title>DEV Community: Vijay Ashley Rodrigues</title>
      <link>https://dev.to/vijayrodrigues</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vijayrodrigues"/>
    <language>en</language>
    <item>
      <title>Building with Snowflake Cortex Analyst — What I Learned About Semantic Layers and Guardrails</title>
      <dc:creator>Vijay Ashley Rodrigues</dc:creator>
      <pubDate>Wed, 18 Mar 2026 01:42:52 +0000</pubDate>
      <link>https://dev.to/vijayrodrigues/building-with-snowflake-cortex-analyst-what-i-learned-about-semantic-layers-and-guardrails-23pm</link>
      <guid>https://dev.to/vijayrodrigues/building-with-snowflake-cortex-analyst-what-i-learned-about-semantic-layers-and-guardrails-23pm</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu54omost3vhhdlcyvf4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu54omost3vhhdlcyvf4.jpg" alt="cortex_analyst_banner" width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I started working with Snowflake Cortex Analyst, I assumed the hard part would be getting the system to answer questions correctly.&lt;/p&gt;

&lt;p&gt;It wasn't. The hard part was deciding which questions it shouldn't answer.&lt;/p&gt;

&lt;p&gt;In this post I want to share two things that took more thought than I expected — verified queries and guardrails.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Quick Overview of Cortex Analyst
&lt;/h2&gt;

&lt;p&gt;Snowflake Cortex Analyst lets users ask questions in plain English and get answers from structured data. Under the hood, it uses a semantic model defined in YAML to understand the data and generate SQL in response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum4hs2rjeyyuu9m4iam8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum4hs2rjeyyuu9m4iam8.jpg" alt="cortex_analyst_architecture" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;There are two ways it can respond:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verified queries — pre-validated question-answer pairs you define&lt;/li&gt;
&lt;li&gt;LLM-generated SQL — the model generates SQL on its own when no verified query matches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal of a well-structured semantic model is to maximize verified query hits. The more questions that route through verified queries, the more controlled and reliable your output becomes.&lt;/p&gt;
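&lt;p&gt;Verified queries live in the semantic model YAML itself. Here's a rough sketch of the shape — the table and column names are made up, and I'm paraphrasing the field names from the Snowflake semantic model spec, so check the docs for the exact schema:&lt;/p&gt;

```yaml
verified_queries:
  - name: monthly_revenue_by_region
    question: "What was revenue by region last month?"
    sql: |
      SELECT region,
             SUM(net_revenue) AS revenue
      FROM sales.fct_orders   -- hypothetical table
      GROUP BY region
    verified_by: data_team
```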




&lt;h2&gt;
  
  
  The Verified Queries Trade-off
&lt;/h2&gt;

&lt;p&gt;My first instinct was to add as many verified query variations as possible — cover every way a user might ask the same question.&lt;/p&gt;

&lt;p&gt;That backfired.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Too few variations&lt;/td&gt;
&lt;td&gt;Model misses valid questions, falls back to LLM generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Too many variations&lt;/td&gt;
&lt;td&gt;Introduces noise, wrong query gets matched&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Guardrail Problem — Define What It Shouldn't Do
&lt;/h2&gt;

&lt;p&gt;This is the part most people skip.&lt;/p&gt;

&lt;p&gt;In data engineering we always plan for edge cases. I applied the same thinking here — users will assume this works like any AI tool and ask anything. You can't control that. So instead of trying to restrict users, I put the responsibility on the YAML.&lt;/p&gt;

&lt;p&gt;Cortex Analyst has a &lt;code&gt;question_categorization&lt;/code&gt; block where you explicitly define categories of questions the system should refuse. Here's a simplified example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;question_categorization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unavailable_topics&lt;/span&gt;
    &lt;span class="na"&gt;examples&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;return&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;supplier?"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Show&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;me&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;customer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lifetime&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;value"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;greetings&lt;/span&gt;
    &lt;span class="na"&gt;examples&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hey"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Can&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;you&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;help&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;me?"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;forecast_or_prediction&lt;/span&gt;
    &lt;span class="na"&gt;examples&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;will&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;look&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;like&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;next&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;month?"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Predict&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;inventory&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;needs&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Q4"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ambiguous_queries&lt;/span&gt;
    &lt;span class="na"&gt;examples&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Show&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;me&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;something&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;interesting"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;should&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;look&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;at?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this block, the system will attempt to answer everything — including questions it has no business answering. That doesn't happen by itself. You have to build it in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Structure your semantic model to maximize verified query hits, not just expose data.&lt;/li&gt;
&lt;li&gt;Verified queries need enough variation to be useful — but too many creates noise.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;question_categorization&lt;/code&gt; to explicitly define what the system should refuse.&lt;/li&gt;
&lt;li&gt;Think defensively from day one — don't wait for something to break in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Still early in this build, but these are the decisions I'm glad I made at the start rather than retrofitting later.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>snowflake</category>
      <category>architecture</category>
      <category>mcp</category>
    </item>
    <item>
      <title>From Views to Tables: How I Optimized dbt Models with Delta on Databricks</title>
      <dc:creator>Vijay Ashley Rodrigues</dc:creator>
      <pubDate>Tue, 24 Jun 2025 18:20:03 +0000</pubDate>
      <link>https://dev.to/vijayrodrigues/from-views-to-tables-how-i-optimized-dbt-models-with-delta-on-databricks-f7n</link>
      <guid>https://dev.to/vijayrodrigues/from-views-to-tables-how-i-optimized-dbt-models-with-delta-on-databricks-f7n</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh78k0dcb6dr3wujdbih5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh78k0dcb6dr3wujdbih5.png" alt="databricks-dbt-delta" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I started using dbt with Databricks, I was mostly focused on writing models and building transformations. But as the data volume grew, I realized the performance and cost implications of materializations and storage formats.&lt;/p&gt;

&lt;p&gt;In this post, I want to share some real lessons learned while choosing between &lt;code&gt;view&lt;/code&gt;, &lt;code&gt;table&lt;/code&gt;, and &lt;code&gt;incremental&lt;/code&gt; materializations—and why switching from Parquet to Delta Lake made a huge difference.&lt;/p&gt;




&lt;h2&gt;
  
  
  Views vs Tables vs Incremental: What Actually Works?
&lt;/h2&gt;

&lt;p&gt;Here’s a quick breakdown of how I use each materialization in real projects:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Materialization&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;view&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lightweight models (renaming, filtering)&lt;/td&gt;
&lt;td&gt;Great for development, but recomputes every time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;table&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Medium-size dimensions&lt;/td&gt;
&lt;td&gt;Useful for reference tables that rarely change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;incremental&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Large facts, high-frequency data&lt;/td&gt;
&lt;td&gt;Only processes new or changed rows (ideal for big datasets)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
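&lt;p&gt;In practice I set these defaults per folder in &lt;code&gt;dbt_project.yml&lt;/code&gt; rather than in every model file. A minimal sketch — the project and folder names here are hypothetical:&lt;/p&gt;

```yaml
models:
  my_project:                  # hypothetical project name
    staging:
      +materialized: view      # cheap to build, recomputed on each query
    marts:
      +materialized: table     # persisted for downstream consumers
    facts:
      +materialized: incremental
```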




&lt;h2&gt;
  
  
  DAG and &lt;code&gt;ref()&lt;/code&gt; – How dbt Manages Dependencies
&lt;/h2&gt;

&lt;p&gt;Instead of hardcoding table names, I use &lt;code&gt;ref()&lt;/code&gt; to link models. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_orders'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function does more than just alias a table. Behind the scenes, &lt;code&gt;ref()&lt;/code&gt; helps dbt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build a Directed Acyclic Graph (DAG) of your models&lt;/li&gt;
&lt;li&gt;Determine the correct run order during &lt;code&gt;dbt run&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Resolve fully qualified table names dynamically across dev/prod environments&lt;/li&gt;
&lt;li&gt;Track lineage for documentation and testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This becomes incredibly useful in Databricks where the environment, catalog, and schema may differ between workspaces.&lt;/p&gt;

&lt;p&gt;For example, &lt;code&gt;{{ ref('stg_orders') }}&lt;/code&gt; might compile into:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="nv"&gt;`dev_catalog`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`analytics_schema`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`stg_orders`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes your project &lt;strong&gt;environment-agnostic&lt;/strong&gt; and easier to manage with Git-based workflows.&lt;/p&gt;
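&lt;p&gt;The dev/prod switch itself comes from the dbt profile: each target supplies its own catalog and schema, and &lt;code&gt;ref()&lt;/code&gt; picks them up at compile time. A minimal &lt;code&gt;profiles.yml&lt;/code&gt; sketch for the dbt-databricks adapter — names and environment variables are placeholders:&lt;/p&gt;

```yaml
my_project:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: dev_catalog
      schema: analytics_schema
      host: "{{ env_var('DATABRICKS_HOST') }}"
      http_path: "{{ env_var('DATABRICKS_HTTP_PATH') }}"
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
```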




&lt;h2&gt;
  
  
  Real Problem I Faced with Views on Databricks
&lt;/h2&gt;

&lt;p&gt;In one pipeline, I had several dbt models materialized as &lt;code&gt;view&lt;/code&gt;, assuming it would keep things fast and light. But as models started chaining together—one &lt;code&gt;ref()&lt;/code&gt; leading to another—the final model failed due to &lt;strong&gt;out-of-memory errors&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s because views in dbt are logical; Databricks has to recompute all the upstream SQL each time. As complexity grew, so did query length and memory usage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Solution: Switching from &lt;code&gt;view&lt;/code&gt; to &lt;code&gt;table&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;I changed key intermediate models to use &lt;code&gt;table&lt;/code&gt; materialization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This persisted the results, reduced recomputation, and stabilized the pipeline.&lt;/p&gt;
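&lt;p&gt;For the largest fact models, going one step further than &lt;code&gt;table&lt;/code&gt; to &lt;code&gt;incremental&lt;/code&gt; means only new rows get processed on each run. A sketch of the pattern — model and column names are illustrative:&lt;/p&gt;

```sql
{{ config(
    materialized='incremental',
    file_format='delta',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

SELECT order_id, order_date, amount
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what's already loaded
  WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
```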




&lt;h2&gt;
  
  
  Why I Use Delta Format on Databricks
&lt;/h2&gt;

&lt;p&gt;If you're using Databricks and still materializing models as regular Parquet files, you're missing out.&lt;/p&gt;

&lt;p&gt;Delta Lake adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ACID compliance&lt;/strong&gt; (safe concurrent writes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time travel&lt;/strong&gt; (query past versions)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient upserts and merges&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Z-Ordering&lt;/strong&gt; (for faster filtering)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my dbt config, I now explicitly define:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'delta'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for large tables, I follow up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fct_orders&lt;/span&gt; &lt;span class="n"&gt;ZORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone made queries run significantly faster for date-filtered dashboards.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;view&lt;/code&gt; when developing, but be cautious of deep chains.&lt;/li&gt;
&lt;li&gt;Switch to &lt;code&gt;table&lt;/code&gt; when performance matters or complexity grows.&lt;/li&gt;
&lt;li&gt;Always use &lt;code&gt;ref()&lt;/code&gt; to link models—hardcoding paths will break your project at scale.&lt;/li&gt;
&lt;li&gt;On Databricks, prefer &lt;strong&gt;Delta over Parquet&lt;/strong&gt; for better reliability, query speed, and flexibility.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #dbt #databricks #dataengineering #delta #sql #analyticsengineering&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>dbt</category>
      <category>analyticsengineering</category>
      <category>deltalake</category>
    </item>
    <item>
      <title>🧠 The Hidden Soft Skills That Make You a Better Data Engineer</title>
      <dc:creator>Vijay Ashley Rodrigues</dc:creator>
      <pubDate>Wed, 04 Jun 2025 19:37:37 +0000</pubDate>
      <link>https://dev.to/vijayrodrigues/the-hidden-soft-skills-that-make-you-a-better-data-engineer-44i1</link>
      <guid>https://dev.to/vijayrodrigues/the-hidden-soft-skills-that-make-you-a-better-data-engineer-44i1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9he9dyv8zum7x5hyersr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9he9dyv8zum7x5hyersr.png" alt="data-engineer-soft-skills" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the thing: the most impactful data engineers I’ve worked with (and learned from) weren’t just great with tech. They were great at soft skills — the behind-the-scenes stuff that no one talks about in job descriptions, but makes a huge difference in real work.&lt;/p&gt;

&lt;p&gt;Here are some &lt;strong&gt;soft skills&lt;/strong&gt; I’ve picked up over the years that have made my job easier, helped me work better with teams, and honestly, kept a few projects from going sideways.&lt;/p&gt;




&lt;h2&gt;
  
  
  🗣️ Communication — Ask First, Query Later
&lt;/h2&gt;

&lt;p&gt;Before you write any code, talk to people.&lt;/p&gt;

&lt;p&gt;A lot of misunderstandings in data projects come down to vague requirements. One word — like “active users” or “conversion” — can mean different things to different teams.&lt;/p&gt;

&lt;p&gt;💬 &lt;strong&gt;Real Example:&lt;/strong&gt;&lt;br&gt;
Whenever a project was discussed — especially tasks like converting code from one language to another — I made it a habit to ask as many questions as I could upfront:&lt;br&gt;
“What is the code doing?”&lt;br&gt;
“Why are we converting this?”&lt;br&gt;
“What are the business outcomes?”&lt;br&gt;
This helped me avoid rework and ensured I delivered exactly what was needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  🐞 Debugging — More Detective Work Than Code
&lt;/h2&gt;

&lt;p&gt;When something breaks (and it always does), your mindset matters more than your keyboard.&lt;/p&gt;

&lt;p&gt;Instead of guessing, slow down and look for recent changes. Be systematic.&lt;/p&gt;

&lt;p&gt;💬 &lt;strong&gt;Real Example:&lt;/strong&gt;&lt;br&gt;
I once dealt with a pipeline that kept failing randomly. After a bit of digging, I realized the source table had been updated — new columns were added and a few data types had changed. These subtle schema changes broke downstream processes.&lt;br&gt;
Rather than patching it blindly, I tracked down the exact change and made the necessary schema adjustments to get things running again.&lt;/p&gt;




&lt;h2&gt;
  
  
  📝 Documentation — Future You Will Be Grateful
&lt;/h2&gt;

&lt;p&gt;I used to skip documentation thinking, “I’ll remember this.”&lt;br&gt;
Spoiler: I didn’t.&lt;/p&gt;

&lt;p&gt;Now, I write short notes for my future self and teammates — especially for projects I know I’ll revisit.&lt;/p&gt;

&lt;p&gt;💬 &lt;strong&gt;Real Example:&lt;/strong&gt;&lt;br&gt;
Documentation has saved me so many times — whether it’s writing clean comments in code or understanding a new project.&lt;br&gt;
When joining an ongoing project, even a single-page process flow or doc can help you understand how everything fits together.&lt;br&gt;
Even something as simple as a clear README or inline comment can go a long way.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔄 Scope Changes — Flexibility Beats Frustration
&lt;/h2&gt;

&lt;p&gt;Plans change. Projects evolve. That’s just part of the job.&lt;/p&gt;

&lt;p&gt;Instead of getting frustrated, I’ve learned to pause, re-evaluate, and communicate the new reality.&lt;/p&gt;

&lt;p&gt;💬 &lt;strong&gt;Real Example:&lt;/strong&gt;&lt;br&gt;
I was converting some legacy scripts to SQL. Looked simple at first — until I saw how messy and layered the old logic was.&lt;br&gt;
Instead of rushing through it, I flagged the complexity, broke it down, and explained the extra effort needed to the team. We extended the timeline, prioritized correctly, and ended up with a stable, well-structured result.&lt;/p&gt;




&lt;h2&gt;
  
  
  🤝 Working With Non-Data Folks — Speak Their Language
&lt;/h2&gt;

&lt;p&gt;As a data engineer, you’re often the bridge between raw data and real decisions. That means working with non-technical folks — analysts, PMs, stakeholders — and translating their goals into data logic.&lt;/p&gt;

&lt;p&gt;💬 &lt;strong&gt;Real Example:&lt;/strong&gt;&lt;br&gt;
I’ve worked with stakeholders who were brilliant in their domain, but not technical. Instead of dumping jargon on them, I focused on asking what they were trying to achieve.&lt;br&gt;
Once I understood their goals, I could break down the technical details in plain terms and deliver what they actually needed — not what I assumed they wanted.&lt;/p&gt;




&lt;h2&gt;
  
  
  🙌 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;You can be great at building pipelines, optimizing queries, or scheduling DAGs — but soft skills are what make you reliable, easy to work with, and trusted by your team.&lt;/p&gt;

&lt;p&gt;These aren’t flashy skills, but they matter more than you think.&lt;/p&gt;

&lt;p&gt;If you're starting out in data engineering, don't ignore them. And if you're already deep into it, it’s never too late to improve.&lt;/p&gt;

&lt;p&gt;💬 Got a soft skill that’s helped you in your data journey? Drop it in the comments — would love to hear your take!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>softwareengineering</category>
      <category>career</category>
      <category>learning</category>
    </item>
    <item>
      <title>What Is DBT? A No-Fluff Guide for Data Engineers and Analysts</title>
      <dc:creator>Vijay Ashley Rodrigues</dc:creator>
      <pubDate>Sun, 01 Jun 2025 06:30:40 +0000</pubDate>
      <link>https://dev.to/vijayrodrigues/what-is-dbt-a-no-fluff-guide-for-data-engineers-and-analysts-32k5</link>
      <guid>https://dev.to/vijayrodrigues/what-is-dbt-a-no-fluff-guide-for-data-engineers-and-analysts-32k5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mkwv2gx7j3k3quaeaw8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mkwv2gx7j3k3quaeaw8.png" alt="dbt-cover-image" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’ve been around modern data tools, you’ve probably heard the term &lt;strong&gt;DBT&lt;/strong&gt; pop up more than once. It's one of those tools that gets mentioned in conversations about clean data pipelines, SQL transformations, and analytics engineering — but what is it, really?&lt;/p&gt;

&lt;p&gt;In this post, I’ll break down what DBT (short for &lt;strong&gt;Data Build Tool&lt;/strong&gt;) actually is, how it works, and why it’s become such a big deal in the modern data stack — without the buzzword overload.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Exactly Is DBT?
&lt;/h2&gt;

&lt;p&gt;At its core, DBT is a transformation tool — not an extraction or loading tool. It doesn’t move data in or out of your warehouse (that’s what tools like Fivetran, Airbyte, or custom ETL scripts do). Instead, DBT focuses on the “T” in &lt;strong&gt;ELT&lt;/strong&gt;: turning raw data inside your warehouse into &lt;strong&gt;clean, analytics-ready tables&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You write your transformations as SQL models (literally .sql files), organize them in a folder structure, and DBT runs them in a &lt;strong&gt;defined sequence&lt;/strong&gt; using its built-in dependency graph.&lt;/p&gt;

&lt;p&gt;It handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execution order (via &lt;code&gt;ref()&lt;/code&gt; references)&lt;/li&gt;
&lt;li&gt;Data testing&lt;/li&gt;
&lt;li&gt;Documentation&lt;/li&gt;
&lt;li&gt;Environment configuration&lt;/li&gt;
&lt;li&gt;And even integrates with Git for version control&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Teams Love DBT
&lt;/h2&gt;

&lt;p&gt;Traditional SQL development often turns into spaghetti — duplicate code, inconsistent logic, and barely any testing. DBT introduces &lt;strong&gt;structure&lt;/strong&gt; to that chaos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's what makes it awesome:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modular, reusable SQL files&lt;/li&gt;
&lt;li&gt;Git integration for version control&lt;/li&gt;
&lt;li&gt;Automated data quality tests (nulls, uniqueness, relationships)&lt;/li&gt;
&lt;li&gt;Interactive documentation with lineage graphs&lt;/li&gt;
&lt;li&gt;Dynamic SQL using &lt;strong&gt;Jinja templates&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  How DBT Works (In Plain English)
&lt;/h2&gt;

&lt;p&gt;A DBT project is basically a collection of models and configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt; are &lt;code&gt;.sql&lt;/code&gt; files containing &lt;code&gt;SELECT&lt;/code&gt; statements&lt;/li&gt;
&lt;li&gt;You reference other models using &lt;code&gt;ref('model_name')&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;DBT builds a DAG (dependency graph) to figure out what runs when&lt;/li&gt;
&lt;li&gt;You can test models, define sources, and set materializations (view/table/etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- models/stg_customers.sql&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;normalized_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;signup_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'raw'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'customers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model takes raw customer data and cleans it up. You can later reference it in other models like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_customers'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can even add tests with a simple YAML config like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;customer_id&lt;/span&gt;
    &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;unique&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

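Under the hood, `dbt test` compiles each of these checks into a SQL query that returns failing rows. A minimal sketch of the same `not_null` and `unique` checks, run here against an illustrative in-memory SQLite table (dbt generates the equivalent SQL for your warehouse automatically):

```python
# Sketch: what dbt's built-in not_null/unique tests boil down to.
# Table name, column, and data are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO stg_customers VALUES (?, ?)",
    [(1, "a@x.com"), (2, "b@x.com"), (3, "c@x.com")],
)

# not_null: any row where the column is NULL counts as a failure
not_null_failures = conn.execute(
    "SELECT COUNT(*) FROM stg_customers WHERE customer_id IS NULL"
).fetchone()[0]

# unique: any value appearing more than once counts as a failure
unique_failures = conn.execute(
    "SELECT COUNT(*) FROM (SELECT customer_id FROM stg_customers "
    "GROUP BY customer_id HAVING COUNT(*) > 1)"
).fetchone()[0]

print(not_null_failures, unique_failures)  # 0 0 means both tests pass
```

A test passes when its compiled query returns zero rows, which is exactly what the two counts above check.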






&lt;h2&gt;
  
  
  Key Concepts in DBT
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Models&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQL-based transformations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Raw input tables defined in YAML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tests&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Validate data quality rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Macros&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reusable Jinja + SQL logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto-generated documentation and lineage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Who Should Use DBT?
&lt;/h2&gt;

&lt;p&gt;If you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Know SQL&lt;/li&gt;
&lt;li&gt;Work with data in warehouses like &lt;strong&gt;Snowflake&lt;/strong&gt;, &lt;strong&gt;Databricks&lt;/strong&gt;, &lt;strong&gt;BigQuery&lt;/strong&gt;, or &lt;strong&gt;Redshift&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Want to write cleaner, testable transformation code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then DBT is made for you — whether you’re a solo analyst or part of a larger data team.&lt;/p&gt;

&lt;p&gt;You don’t need to learn a new language. DBT lets you keep working in SQL, but brings in the best parts of software engineering: version control, CI/CD, modularity, and documentation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;You’ve got two ways to use DBT:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;DBT CLI&lt;/strong&gt; — Open-source and terminal-based&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DBT Cloud&lt;/strong&gt; — Hosted version with UI, scheduler, logging, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start with the &lt;strong&gt;Jaffle Shop demo project&lt;/strong&gt; (yes, that’s what it’s called) to see DBT in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your getting-started flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install DBT CLI or sign up for DBT Cloud&lt;/li&gt;
&lt;li&gt;Connect it to your warehouse&lt;/li&gt;
&lt;li&gt;Initialize a project&lt;/li&gt;
&lt;li&gt;Create some models, tests, and sources&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;dbt run&lt;/code&gt;, then &lt;code&gt;dbt docs generate&lt;/code&gt; to see lineage graphs&lt;/li&gt;
&lt;/ul&gt;
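As a concrete sketch, the CLI flow above might look like this (the adapter package and project name are placeholders; install the adapter that matches your warehouse):

```shell
pip install dbt-snowflake        # or dbt-bigquery, dbt-redshift, ...
dbt init my_project              # scaffolds the project and profile
cd my_project
dbt debug                        # verifies the warehouse connection
dbt run                          # builds your models
dbt test                         # runs the YAML-defined tests
dbt docs generate                # builds the docs site and lineage graph
```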




&lt;h2&gt;
  
  
  🙌 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;DBT is changing how data teams think about transformations. It brings the discipline of software engineering to SQL workflows — making your data pipelines more &lt;strong&gt;reliable&lt;/strong&gt;, &lt;strong&gt;documented&lt;/strong&gt;, and &lt;strong&gt;collaborative&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You don’t need to be an expert to get started. If you know SQL and want a smarter way to build and manage data models, &lt;strong&gt;DBT is absolutely worth exploring.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;💬 Tried DBT already? Thinking of learning it? Drop your experience or questions in the comments — I’d love to connect!&lt;/p&gt;

</description>
      <category>sql</category>
      <category>dbt</category>
      <category>dataengineering</category>
      <category>discuss</category>
    </item>
    <item>
      <title>No Docker, No Problem: Run Apache Kafka on Windows</title>
      <dc:creator>Vijay Ashley Rodrigues</dc:creator>
      <pubDate>Sun, 25 May 2025 06:46:05 +0000</pubDate>
      <link>https://dev.to/vijayrodrigues/no-docker-no-problem-run-apache-kafka-on-windows-4kc9</link>
      <guid>https://dev.to/vijayrodrigues/no-docker-no-problem-run-apache-kafka-on-windows-4kc9</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5ku565cguq0f7s3vzfi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5ku565cguq0f7s3vzfi.png" alt="apache-kafka-without-docker" width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting Kafka up and running without Docker on a Windows machine might seem like navigating a maze—most tutorials assume you're working on Linux or using containers. But what if you're working &lt;strong&gt;directly on Windows&lt;/strong&gt;, or inside &lt;strong&gt;WSL&lt;/strong&gt;, and want full visibility into every config and step?&lt;/p&gt;

&lt;p&gt;I ran into the same challenge—and after hours of tweaking, testing, and debugging, I finally got it all working. This article documents the entire process, step-by-step, so &lt;strong&gt;you can skip the trial and error&lt;/strong&gt; and get straight to building.&lt;/p&gt;

&lt;p&gt;Let’s dive into the full local setup: from environment setup to producing and consuming messages with Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Activate Your Python Virtual Environment
&lt;/h2&gt;

&lt;p&gt;Make sure your virtual environment is activated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\python_virtual_environments\orchestration_env\Scripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see your prompt change to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(orchestration_env) F:\python_virtual_environments&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 2: Install the Kafka Python Library
&lt;/h2&gt;

&lt;p&gt;With your virtual environment active, install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install kafka-python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -c "import kafka; print(kafka.__version__)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Configure Kafka and Java Environment Variables
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Kafka Variables:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open System Properties → Environment Variables.&lt;/li&gt;
&lt;li&gt;Add:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;KAFKA_HOME = F:\kafka_2_13__3_4_0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Add to &lt;code&gt;Path&lt;/code&gt;: &lt;code&gt;F:\kafka_2_13__3_4_0\bin\windows&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Java Variables:&lt;/strong&gt;&lt;br&gt;
If not already set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;JAVA_HOME = C:\Program Files\Java\jdk-&amp;lt;version&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Restart your terminal to apply.&lt;/p&gt;
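A quick way to confirm the variables actually took effect is a few lines of Python run in the fresh terminal (variable names as set above; this helper is just a sketch):

```python
# Sanity check that the environment variables from this step are
# visible to new processes. Run it in a *fresh* terminal.
import os

def check_env(names):
    """Return a dict of variable name -> value (or None if unset)."""
    return {name: os.environ.get(name) for name in names}

result = check_env(["KAFKA_HOME", "JAVA_HOME"])
for name, value in result.items():
    print(f"{name} = {value if value else 'NOT SET'}")
```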


&lt;h2&gt;
  
  
  Step 4: Update Kafka Config Files
&lt;/h2&gt;

&lt;p&gt;Navigate to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\kafka_2_13__3_4_0\config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Edit&lt;/strong&gt; &lt;code&gt;zookeeper.properties&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clientPort=2182
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Edit&lt;/strong&gt; &lt;code&gt;server.properties&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;broker.id=0
zookeeper.connect=localhost:2182
log.dirs=F:/kafka_2_13__3_4_0/logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
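If you want to double-check these edits programmatically, a `.properties` file is just `key=value` lines. Here is a small reader (a sketch for sanity-checking, not an official Kafka tool):

```python
# Minimal .properties reader -- handy for verifying config edits.
def parse_properties(text):
    """Parse key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

config = parse_properties(
    "broker.id=0\n"
    "zookeeper.connect=localhost:2182\n"
    "log.dirs=F:/kafka_2_13__3_4_0/logs\n"
)
print(config["zookeeper.connect"])  # localhost:2182
```

In a real check you would pass the contents of `server.properties` read from disk instead of the inline string.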






&lt;h2&gt;
  
  
  Step 5: Delete meta.properties (Fix for InconsistentClusterId)
&lt;/h2&gt;

&lt;p&gt;Stop Kafka:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\kafka_2_13__3_4_0\bin\windows\kafka-server-stop.bat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stop Zookeeper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\kafka_2_13__3_4_0\bin\windows\zookeeper-server-stop.bat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Delete meta.properties:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Navigate to:&lt;/strong&gt; &lt;code&gt;F:\kafka_2_13__3_4_0\logs&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Delete the file:&lt;/strong&gt; &lt;code&gt;meta.properties&lt;/code&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 6: Start Zookeeper
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\kafka_2_13__3_4_0\bin\windows\zookeeper-server-start.bat F:\kafka_2_13__3_4_0\config\zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Wait for Zookeeper to fully initialize.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 7: Start Kafka Broker
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\kafka_2_13__3_4_0\bin\windows\kafka-server-start.bat F:\kafka_2_13__3_4_0\config\server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 8: Create a Kafka Topic
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\kafka_2_13__3_4_0\bin\windows\kafka-topics.bat --create --topic test-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\kafka_2_13__3_4_0\bin\windows\kafka-topics.bat --list --bootstrap-server localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 9: Use CMD for Producer and Consumer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start Producer:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\kafka_2_13__3_4_0\bin\windows\kafka-console-producer.bat --topic test-topic --bootstrap-server localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Type your messages in this window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start Consumer:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\kafka_2_13__3_4_0\bin\windows\kafka-console-consumer.bat --topic test-topic --bootstrap-server localhost:9092 --from-beginning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This window will show the messages received.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 10: Send &amp;amp; Receive Messages Using Python
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Create Scripts:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;kafka_producer.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test-topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Hello from Python!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Message sent.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
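In practice you will usually send structured data rather than raw bytes. A sketch of a JSON-producing variant: the serializer is plain Python, while the commented `KafkaProducer` part needs the broker from Step 7 running (topic and payload are illustrative):

```python
import json

def to_bytes(event):
    """Serialize a dict to UTF-8 JSON bytes for Kafka."""
    return json.dumps(event).encode("utf-8")

# With the broker running, wire the serializer in like this:
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(
#       bootstrap_servers="localhost:9092",
#       value_serializer=to_bytes,
#   )
#   producer.send("test-topic", {"user": "vijay", "action": "signup"})
#   producer.flush()

print(to_bytes({"user": "vijay"}))
```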



&lt;p&gt;&lt;strong&gt;kafka_consumer.py&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaConsumer&lt;/span&gt;

&lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test-topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auto_offset_reset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;earliest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;group_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test-group&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Received: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
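The consumer side of the same JSON convention, plus one useful tweak: kafka-python's `consumer_timeout_ms` makes the loop exit after a quiet period instead of blocking forever. The decoding helper is plain Python; the commented `KafkaConsumer` part needs the broker running:

```python
import json

def from_bytes(raw):
    """Decode UTF-8 JSON bytes from Kafka back into a dict."""
    return json.loads(raw.decode("utf-8"))

# With the broker running, the loop from the article becomes:
#
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer(
#       "test-topic",
#       bootstrap_servers="localhost:9092",
#       auto_offset_reset="earliest",
#       group_id="test-group",
#       consumer_timeout_ms=10000,  # stop iterating after 10s of silence
#   )
#   for message in consumer:
#       print("Received:", from_bytes(message.value))

print(from_bytes(b'{"user": "vijay"}'))
```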



&lt;h2&gt;
  
  
  Run the Scripts:
&lt;/h2&gt;

&lt;p&gt;Before running any Kafka-related Python scripts, first activate your virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\python_virtual_environments\orchestration_env\Scripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once activated, your prompt will look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(orchestration_env) F:\python_virtual_environments&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, navigate to the Kafka scripts directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd F:\kafka-scripts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the &lt;strong&gt;Producer&lt;/strong&gt; script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python kafka_producer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, open &lt;strong&gt;another CMD window&lt;/strong&gt;, activate the virtual environment again, and run the &lt;strong&gt;Consumer&lt;/strong&gt; script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\python_virtual_environments\orchestration_env\Scripts\activate
cd F:\kafka-scripts
python kafka_consumer.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 11: Stop Kafka and Zookeeper Safely
&lt;/h2&gt;

&lt;p&gt;Once you're finished, stop both services gracefully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;F:\kafka_2_13__3_4_0\bin\windows\kafka-server-stop.bat
F:\kafka_2_13__3_4_0\bin\windows\zookeeper-server-stop.bat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🙌 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Setting up Apache Kafka on Windows without Docker can feel tricky at first, but once you walk through the steps, it’s actually a clean and flexible solution — perfect for local development or experimentation.&lt;/p&gt;

&lt;p&gt;If this guide helped you or saved you time, consider leaving a ❤️, bookmarking it for later, or sharing your experience in the comments.&lt;br&gt;
I’d love to hear how it went for you — or answer any questions you might have!&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>docker</category>
      <category>developers</category>
      <category>microsoft</category>
    </item>
  </channel>
</rss>
