<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jag Thind</title>
    <description>The latest articles on DEV Community by Jag Thind (@jag_t8490d471b36c6bb).</description>
    <link>https://dev.to/jag_t8490d471b36c6bb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3137979%2Fdf6e43a3-9ef1-4cb3-91b7-881846b275ac.png</url>
      <title>DEV Community: Jag Thind</title>
      <link>https://dev.to/jag_t8490d471b36c6bb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jag_t8490d471b36c6bb"/>
    <language>en</language>
    <item>
      <title>Setting Up a Robust Local DevX for Snowflake Python Development</title>
      <dc:creator>Jag Thind</dc:creator>
      <pubDate>Fri, 27 Feb 2026 17:04:28 +0000</pubDate>
      <link>https://dev.to/superpayments/setting-up-a-robust-local-devx-for-snowflake-python-development-12pb</link>
      <guid>https://dev.to/superpayments/setting-up-a-robust-local-devx-for-snowflake-python-development-12pb</guid>
      <description>&lt;p&gt;In the evolving world of data engineering, developing Python-based workloads in Snowflake (via &lt;a href="https://docs.snowflake.com/en/developer-guide/snowpark/index" rel="noopener noreferrer"&gt;Snowpark&lt;/a&gt;, Python UDFs, or Stored Procedures) has become increasingly popular. However, as pipelines become more complex, a critical question arises: &lt;strong&gt;How should we develop and maintain our Python code for Snowflake?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While browser-based editors like Snowflake Workspaces are fine for quick scripts, a significant "Developer Experience (DevX) Gap" emerges when you try to build production-grade Python code in a browser tab.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I'm writing this blog
&lt;/h2&gt;

&lt;p&gt;I've seen many Data Engineers and Analytics Engineers fall into the "UI Trap" of writing complex Python logic directly in Snowflake, only to struggle with inconsistent environments, broken dependencies, and the frustration of "it works on my machine, but not on others" problems. This blog is born out of a desire to share a better way.&lt;/p&gt;

&lt;p&gt;My goal is to encourage people to step out of the browser and into a professional local development environment. By establishing repeatable local dev environments where every developer uses the same Python version, the same dependencies, and the same tooling, we can build Python-based features that are not just functional and robust, but most importantly maintainable by others.&lt;/p&gt;

&lt;p&gt;One way to democratize data-rich features in a product is to make it easier to develop and maintain code using consistent tools. This is why we need to focus on local DevX!&lt;/p&gt;

&lt;h3&gt;
  
  
  What we'll cover
&lt;/h3&gt;

&lt;p&gt;We will explore the merits of a local-first approach to Snowflake Python development, specifically focusing on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic Python versions&lt;/strong&gt; with &lt;code&gt;pyenv&lt;/code&gt; and &lt;code&gt;.python-version&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Robust dependency management&lt;/strong&gt; with &lt;code&gt;Poetry&lt;/code&gt; and &lt;code&gt;pyproject.toml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent tooling&lt;/strong&gt; configured in a single file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified task execution&lt;/strong&gt; with &lt;code&gt;Poe the Poet&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python version management with pyenv
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/pyenv/pyenv" rel="noopener noreferrer"&gt;pyenv&lt;/a&gt; is a tool for managing multiple Python versions (the setup below assumes macOS with Homebrew). It allows you to install and switch between different Python versions on a per-project basis by creating a &lt;code&gt;.python-version&lt;/code&gt; file in the project root.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for DevX:&lt;/strong&gt; By pinning the Python version in version control, you ensure that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every developer uses the same Python version for the project.&lt;/li&gt;
&lt;li&gt;Your CI/CD pipeline can install the exact same version.&lt;/li&gt;
&lt;li&gt;You avoid subtle bugs that arise from Python version differences.&lt;/li&gt;
&lt;li&gt;Dependencies work consistently (some packages require specific Python versions).&lt;/li&gt;
&lt;li&gt;Debugging is easier when issues are reproducible across all environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Install &lt;code&gt;pyenv&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;pyenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create a &lt;code&gt;.python-version&lt;/code&gt; file in the project root:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install the Python version specified in the &lt;code&gt;.python-version&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pyenv &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Verify the desired Python version is installed and is set for the project:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pyenv version
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;
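
&lt;p&gt;As a small illustration (this is a sketch, not a pyenv feature), a script like the following could run at the start of a CI job to fail fast when the interpreter does not match the pinned &lt;code&gt;.python-version&lt;/code&gt;:&lt;/p&gt;

```python
# check_python_version.py -- illustrative sketch, not part of pyenv itself.
# Compares the running interpreter against the version pinned in .python-version.
import sys
from pathlib import Path


def pinned_version(root="."):
    """Read the pinned version string (e.g. '3.10') from .python-version."""
    return Path(root, ".python-version").read_text().strip()


def matches(pin, version_info=sys.version_info):
    """True if the running interpreter's version starts with the pinned prefix."""
    running = "%d.%d.%d" % tuple(version_info[:3])
    return running.startswith(pin)
```

&lt;p&gt;Calling &lt;code&gt;matches(pinned_version())&lt;/code&gt; early in a CI step (and exiting non-zero on &lt;code&gt;False&lt;/code&gt;) surfaces version drift before any dependency installation happens.&lt;/p&gt;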

&lt;h2&gt;
  
  
  Dependency management with Poetry and the &lt;code&gt;pyproject.toml&lt;/code&gt; file
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://python-poetry.org/docs" rel="noopener noreferrer"&gt;Poetry&lt;/a&gt; is a tool for dependency management and packaging in Python. It allows you to declare the packages your project depends on and it will manage (install/update) them for you. Poetry offers a lockfile to ensure repeatable installs, and can build your project for distribution.&lt;/p&gt;

&lt;p&gt;It uses the &lt;code&gt;pyproject.toml&lt;/code&gt; file (which we'll explore next) as its source of truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for DevX:&lt;/strong&gt; With &lt;code&gt;pyproject.toml&lt;/code&gt; and Poetry, you've eliminated the "works on my machine, but not on others" problem at the dependency level. Every developer and every CI/CD runner will install the exact same versions of every package, every time!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installing Poetry&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Install Poetry using Homebrew:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;poetry
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Verify the installation:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Configure Poetry to create virtual environments in the project directory (recommended for better DevX). This ensures that when you run &lt;code&gt;poetry install&lt;/code&gt;, it creates a &lt;code&gt;.venv&lt;/code&gt; folder directly in the project, making it easy to activate and manage:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry config virtualenvs.in-project &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The &lt;code&gt;pyproject.toml&lt;/code&gt; file
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;pyproject.toml&lt;/code&gt; file is a standard introduced by &lt;a href="https://peps.python.org/pep-0518/" rel="noopener noreferrer"&gt;PEP 518&lt;/a&gt; that replaces the need for multiple configuration files (&lt;code&gt;setup.py&lt;/code&gt;, &lt;code&gt;requirements.txt&lt;/code&gt;, &lt;code&gt;setup.cfg&lt;/code&gt;, etc.) with one unified file. It uses the &lt;a href="https://toml.io/en/" rel="noopener noreferrer"&gt;TOML (Tom's Obvious, Minimal Language)&lt;/a&gt; format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Single source of truth:&lt;/strong&gt; All project configuration lives in one file.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Version constraints:&lt;/strong&gt; You can specify package versions according to Poetry's &lt;a href="https://python-poetry.org/docs/dependency-specification/#version-constraints" rel="noopener noreferrer"&gt;dependency specification and version constraints&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Deterministic builds:&lt;/strong&gt; Poetry generates a &lt;code&gt;poetry.lock&lt;/code&gt; file that pins every dependency—both direct (what you specify) and transitive (dependencies of your dependencies)—ensuring identical installs across environments.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Tool configuration:&lt;/strong&gt; You can configure multiple tools in the same file (no need for separate config files).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example &lt;code&gt;pyproject.toml&lt;/code&gt; file:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[project]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"PROJECT_NAME"&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.1.0"&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"PROJECT_DESCRIPTION"&lt;/span&gt;
&lt;span class="py"&gt;authors&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="py"&gt;{name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="py"&gt;"YOUR_NAME",email&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"youremail@domain.com"&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;readme&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"README.md"&lt;/span&gt;

&lt;span class="c"&gt;# Production dependencies that your code needs to run.&lt;/span&gt;
&lt;span class="nn"&gt;[tool.poetry.dependencies]&lt;/span&gt;
&lt;span class="py"&gt;python&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="py"&gt;"&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;3.11&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="py"&gt;snowflake-snowpark-python&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"1.33.0"&lt;/span&gt; &lt;span class="c"&gt;# Snowflake Snowpark Python library&lt;/span&gt;
&lt;span class="py"&gt;pydantic&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"2.11.7"&lt;/span&gt;                  &lt;span class="c"&gt;# Data validation library in Python&lt;/span&gt;

&lt;span class="c"&gt;# Development-only tools that aren't needed in production.&lt;/span&gt;
&lt;span class="nn"&gt;[tool.poetry.group.dev.dependencies]&lt;/span&gt;
&lt;span class="py"&gt;black&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^23.0.0"&lt;/span&gt;           &lt;span class="c"&gt;# Code formatter&lt;/span&gt;
&lt;span class="py"&gt;pylint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^3.0.0"&lt;/span&gt;           &lt;span class="c"&gt;# Linter for code quality&lt;/span&gt;
&lt;span class="py"&gt;isort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^5.13.2"&lt;/span&gt;           &lt;span class="c"&gt;# Import statement organiser&lt;/span&gt;
&lt;span class="py"&gt;poethepoet&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^0.27.0"&lt;/span&gt;      &lt;span class="c"&gt;# Task runner for simplifying development tasks&lt;/span&gt;
&lt;span class="py"&gt;pytest&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^8.1.2"&lt;/span&gt;           &lt;span class="c"&gt;# Testing framework for Python&lt;/span&gt;
&lt;span class="py"&gt;pytest-xdist&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^3.0.0"&lt;/span&gt;     &lt;span class="c"&gt;# Run tests in parallel for faster execution&lt;/span&gt;
&lt;span class="py"&gt;pytest-cov&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"^5.0.0"&lt;/span&gt;       &lt;span class="c"&gt;# Generate code coverage reports&lt;/span&gt;

&lt;span class="nn"&gt;[build-system]&lt;/span&gt;
&lt;span class="py"&gt;requires&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="py"&gt;["poetry-core&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;"]&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="py"&gt;build-backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"poetry.core.masonry.api"&lt;/span&gt;

&lt;span class="c"&gt;# Configure all your tools&lt;/span&gt;

&lt;span class="nn"&gt;[tool.black]&lt;/span&gt;
&lt;span class="py"&gt;line-length&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;
&lt;span class="py"&gt;target-version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'py310'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nn"&gt;[tool.isort]&lt;/span&gt;
&lt;span class="py"&gt;profile&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"black"&lt;/span&gt;
&lt;span class="py"&gt;multi_line_output&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

&lt;span class="nn"&gt;[tool.pylint]&lt;/span&gt;
&lt;span class="py"&gt;max-line-length&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;
&lt;span class="py"&gt;fail-under&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;9.5&lt;/span&gt;

&lt;span class="c"&gt;# Configure tasks for Poe the Poet&lt;/span&gt;
&lt;span class="nn"&gt;[tool.poe.tasks]&lt;/span&gt;
&lt;span class="c"&gt;# Private tasks (prefixed with _ to hide from the help menu)&lt;/span&gt;
&lt;span class="py"&gt;_format_black&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"black ."&lt;/span&gt;
&lt;span class="py"&gt;_format_isort&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"isort ."&lt;/span&gt;
&lt;span class="py"&gt;_pylint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"pylint src/"&lt;/span&gt;

&lt;span class="c"&gt;# Public tasks that compose the individual tools&lt;/span&gt;
&lt;span class="py"&gt;format&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"_format_black"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"_format_isort"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;lint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"format"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"_pylint"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;test&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"pytest --cov -vv"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Installing dependencies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once your &lt;code&gt;pyproject.toml&lt;/code&gt; is set up, installing all dependencies (including dev dependencies) is a single command. It will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a virtual environment (in &lt;code&gt;.venv&lt;/code&gt; if you configured Poetry to do so).&lt;/li&gt;
&lt;li&gt;Install all dependencies (including dev dependencies).&lt;/li&gt;
&lt;li&gt;Generate or update &lt;code&gt;poetry.lock&lt;/code&gt; to ensure reproducible installs across environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For new projects where you haven't written code yet, you'll need to use the &lt;code&gt;--no-root&lt;/code&gt; flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-root&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--no-root&lt;/code&gt; is needed initially:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you first create a project manually or with &lt;code&gt;poetry init&lt;/code&gt;, Poetry assumes you're building a package. If you run &lt;code&gt;poetry install&lt;/code&gt; without any code, you'll get an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Installing the current project: example-project (0.1.0)
Error: The current project could not be installed: No file/folder found for package example-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--no-root&lt;/code&gt; flag tells Poetry to skip installing your project as a package and only install the dependencies you've specified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you won't need &lt;code&gt;--no-root&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you've written code and added a &lt;code&gt;packages&lt;/code&gt; section to your &lt;code&gt;pyproject.toml&lt;/code&gt; file like the example below, you can use the standard &lt;code&gt;poetry install&lt;/code&gt; command (without &lt;code&gt;--no-root&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.poetry]&lt;/span&gt;
&lt;span class="py"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="py"&gt;[{include&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;YOUR_PACKAGE_NAME&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;from&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"src"&lt;/span&gt;&lt;span class="err"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuring VS Code (optional):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To use the project's virtual environment in VS Code / Cursor for IntelliSense, debugging, and running code in the IDE:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Press &lt;code&gt;Cmd+Shift+P&lt;/code&gt; (or &lt;code&gt;Ctrl+Shift+P&lt;/code&gt; on Windows/Linux)&lt;/li&gt;
&lt;li&gt;Type "Python: Select Interpreter"&lt;/li&gt;
&lt;li&gt;Select "Enter interpreter path"&lt;/li&gt;
&lt;li&gt;Enter the path to your project's virtual environment: &lt;code&gt;./&amp;lt;PROJECT_ROOT&amp;gt;/.venv/bin/python&lt;/code&gt; (adjust the path to match your project location)&lt;/li&gt;
&lt;li&gt;VS Code will now use the same Python environment as Poetry, giving you access to all installed packages and proper code completion.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Poe the Poet: Simplifying development tasks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://poethepoet.natn.io/" rel="noopener noreferrer"&gt;Poe the Poet&lt;/a&gt; is a task runner that lets you define common development commands in your &lt;code&gt;pyproject.toml&lt;/code&gt; file. Instead of remembering long commands like &lt;code&gt;poetry run black . &amp;amp;&amp;amp; poetry run isort . &amp;amp;&amp;amp; poetry run pylint src/&lt;/code&gt;, you can create a simple alias and run &lt;code&gt;poetry run poe lint&lt;/code&gt;. See the &lt;code&gt;[tool.poe.tasks]&lt;/code&gt; section in the example &lt;code&gt;pyproject.toml&lt;/code&gt; file above for the configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistency:&lt;/strong&gt; Everyone on your team uses the same commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity:&lt;/strong&gt; &lt;code&gt;poe lint&lt;/code&gt; instead of remembering multiple flags&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composability:&lt;/strong&gt; Chain tasks together (e.g., &lt;code&gt;lint&lt;/code&gt; runs &lt;code&gt;format&lt;/code&gt; then &lt;code&gt;pylint&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; Tasks are self-documenting in &lt;code&gt;pyproject.toml&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;You now have a solid foundation for local Snowflake Python development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Deterministic Python versions&lt;/strong&gt; with &lt;code&gt;pyenv&lt;/code&gt; and &lt;code&gt;.python-version&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Robust dependency management&lt;/strong&gt; with &lt;code&gt;Poetry&lt;/code&gt; and &lt;code&gt;pyproject.toml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Consistent tooling&lt;/strong&gt; configured in a single file&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Simplified task execution&lt;/strong&gt; with &lt;code&gt;Poe the Poet&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup eliminates the "it works on my machine, but not on others" problem at its source. Every developer on your team will have the exact same environment, the same dependencies, and the same tooling automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The DevX payoff:&lt;/strong&gt; By investing in these foundations, you're not just setting up tools, you're creating an environment where Data Engineers can focus on building features instead of fighting with configuration. This is how we democratize data development.&lt;/p&gt;

&lt;p&gt;I hope you find this guide helpful. If you have questions or feedback, I'd love to hear from you!&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>snowflake</category>
    </item>
    <item>
      <title>How We Use OpenAI and Gemini Batch APIs to Qualify Thousands of Sales Leads</title>
      <dc:creator>Jag Thind</dc:creator>
      <pubDate>Tue, 09 Sep 2025 11:38:32 +0000</pubDate>
      <link>https://dev.to/superpayments/how-we-use-openai-and-gemini-batch-apis-to-qualify-thousands-of-sales-leads-2knk</link>
      <guid>https://dev.to/superpayments/how-we-use-openai-and-gemini-batch-apis-to-qualify-thousands-of-sales-leads-2knk</guid>
      <description>&lt;p&gt;The following blog details how the Data team used AI to solve a specific problem for our Marketing and Sales teams - &lt;strong&gt;Qualify 3000 websites (Salesforce Accounts) to determine if they are ecommerce and can take payments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is broken down into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Problem at hand and what are we trying to solve?&lt;/li&gt;
&lt;li&gt;Process design&lt;/li&gt;
&lt;li&gt;Why use LLMs from 2 AI providers&lt;/li&gt;
&lt;li&gt;Prompt engineering and using prompt templates&lt;/li&gt;
&lt;li&gt;Scaling up using the OpenAI Batch API and Google batch predictions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;We implemented a batch data enrichment pipeline that uses OpenAI and Gemini Large Language Models (LLMs) via the &lt;a href="https://platform.openai.com/docs/guides/batch" rel="noopener noreferrer"&gt;OpenAI Batch API&lt;/a&gt; and &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini" rel="noopener noreferrer"&gt;Google Batch Predictions&lt;/a&gt; as a cost-effective way to enrich data using the power of LLMs.&lt;/p&gt;

&lt;p&gt;To ensure maximum accuracy and minimise the effects of hallucinations from the LLMs, we use a simple consensus system: each website is checked by both AIs &lt;strong&gt;3 times each&lt;/strong&gt;, and only results where they agree are accepted. Yes, this makes it more expensive, but we optimised for &lt;em&gt;time to value&lt;/em&gt; and getting good leads into the hands of the Sales team.&lt;/p&gt;

&lt;p&gt;We used a prompt template configured to use the &lt;a href="https://platform.openai.com/docs/guides/tools-web-search" rel="noopener noreferrer"&gt;web search tool&lt;/a&gt; to ground the LLM with real-time information about the website, overcoming the model's static knowledge cutoff date.&lt;/p&gt;

&lt;p&gt;We trained the Marketing team in writing effective prompts for the LLMs before we scaled up using the batch mode.&lt;/p&gt;

&lt;p&gt;A great example of tech and the business working together to achieve a shared outcome and spreading the use of AI in the business.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem at Hand
&lt;/h2&gt;

&lt;p&gt;The Marketing team periodically builds lists of potential merchants that can integrate Super as a payment method on their website checkout. These leads are then provided to Account Executives (AEs) to sign up.&lt;/p&gt;

&lt;p&gt;When assigned a website, the first thing AEs do is manually double-check &lt;em&gt;is the website ecommerce&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can you buy products on the website?&lt;/li&gt;
&lt;li&gt;Is there a checkout on the website?&lt;/li&gt;
&lt;li&gt;Does it accept card payments?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚠️ Many websites were not ecommerce ⚠️ resulting in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AEs wasting time doing manual checking&lt;/li&gt;
&lt;li&gt;Many leads getting dis-qualified at the top of the sales funnel&lt;/li&gt;
&lt;li&gt;AEs getting frustrated with leads they were assigned&lt;/li&gt;
&lt;li&gt;AEs resorting to self-sourcing leads and taking them away from their core responsibilities of closing deals&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What are we trying to solve?
&lt;/h3&gt;

&lt;p&gt;Questions we asked ourselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can we increase the number of leads at the top of the sales funnel?&lt;/li&gt;
&lt;li&gt;Can we automate the &lt;em&gt;is ecommerce&lt;/em&gt; check instead of manually qualifying each website?&lt;/li&gt;
&lt;li&gt;Can we scale this check across &lt;em&gt;N&lt;/em&gt; (hundreds/thousands) websites?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Process Design
&lt;/h2&gt;

&lt;p&gt;Before we dive into the details of prompt engineering and how the batch pipeline works, the diagrams below illustrate the process and its two parts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffa17ev2njumarmfekk2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffa17ev2njumarmfekk2a.png" alt="Process Design 1" width="648" height="764"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dnzo4lipqpmf9ylv7e8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dnzo4lipqpmf9ylv7e8.png" alt="Process Design 2" width="800" height="734"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why use LLMs from 2 AI Providers?
&lt;/h2&gt;

&lt;p&gt;Even though it was more costly to do so, we needed to be confident in the accuracy of what we were telling the AEs in the Sales team. Instead of relying on a single AI, we used LLMs from two different AI providers, then based our final decision on their consensus.&lt;/p&gt;

&lt;p&gt;Think of it like getting a second opinion from a trusted expert. If two independent specialists examine the same data and come to the same conclusion, your confidence in that outcome increases dramatically.&lt;/p&gt;

&lt;p&gt;Some benefits include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Accuracy Through Consensus&lt;/code&gt;: The core of our strategy is built on consensus. An ecommerce qualification is only confirmed if both LLMs independently agree. This simple but powerful rule acts as a powerful filter, significantly reducing the risk of a single LLM making a mistake, hallucinating, or misinterpreting a site.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Mitigating Model-Specific Weaknesses&lt;/code&gt;: Every LLM has its own unique architecture, training data, and inherent biases. One LLM might be brilliant at identifying traditional retail sites but struggle with subscription services, while the other might have the opposite strengths. Using a single LLM means you also inherit all of its blind spots. By using two, we diversify our "cognitive portfolio," allowing the strengths of one LLM to compensate for the weaknesses of the other, leading to a more balanced and consistently accurate outcome.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Automatic Quality Control&lt;/code&gt;: Perhaps the most valuable benefit is what happens when the LLMs disagree. A disagreement is a critical signal. It tells us that a website is ambiguous, an edge case, or complex in a way that could have easily fooled a single AI. Our system automatically flags these disagreements for manual review.&lt;/li&gt;
&lt;/ul&gt;
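
&lt;p&gt;The consensus rule is simple enough to sketch in a few lines of Python. The helper names and the strict-majority policy below are illustrative assumptions, not our production code:&lt;/p&gt;

```python
# Illustrative sketch of the consensus rule described above -- the function
# names and the exact acceptance policy are assumptions, not production code.
from collections import Counter


def provider_verdict(answers):
    """Majority vote over one provider's repeated runs (e.g. 3 'Y'/'N' answers).

    Returns the winning answer, or None when no strict majority exists.
    """
    vote, count = Counter(answers).most_common(1)[0]
    if count * 2 > len(answers):  # require a strict majority
        return vote
    return None


def consensus(openai_answers, gemini_answers):
    """Accept a qualification only when both providers independently agree.

    Returns 'Y' or 'N' on agreement, or None to flag the site for manual review.
    """
    a = provider_verdict(openai_answers)
    b = provider_verdict(gemini_answers)
    if a is not None and a == b:
        return a
    return None
```

&lt;p&gt;With this policy, cross-provider agreement like &lt;code&gt;consensus(["Y", "Y", "Y"], ["Y", "Y", "N"])&lt;/code&gt; is accepted, while any disagreement between providers returns &lt;code&gt;None&lt;/code&gt; and is routed to manual review.&lt;/p&gt;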




&lt;h2&gt;
  
  
  Prompt Engineering
&lt;/h2&gt;

&lt;p&gt;Prompt engineering is the process of writing effective instructions for an LLM, such that it consistently generates content that meets your requirements.&lt;/p&gt;

&lt;p&gt;We used the &lt;a href="https://platform.openai.com/" rel="noopener noreferrer"&gt;OpenAI developer platform&lt;/a&gt; to iteratively develop a &lt;a href="https://platform.openai.com/docs/guides/text?api-mode=responses#reusable-prompts" rel="noopener noreferrer"&gt;reusable prompt&lt;/a&gt; template that could be used in the &lt;a href="https://platform.openai.com/docs/api-reference/responses/create" rel="noopener noreferrer"&gt;responses API&lt;/a&gt;. The platform allows testing different versions of a prompt side-by-side to evaluate changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages of using a Prompt Template
&lt;/h3&gt;

&lt;p&gt;You can use variables via &lt;code&gt;{{placeholder}}&lt;/code&gt; and your integration code remains the same, e.g.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = client.responses.create(
    model="gpt-4.1",
    prompt={
        "id": "pmpt_abc123",
        "version": "2",
        "variables": {
            "website_url": "xyz.com"
        }
    }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also configure the prompt to use the &lt;a href="https://platform.openai.com/docs/guides/tools-web-search" rel="noopener noreferrer"&gt;web search tool&lt;/a&gt; to allow the LLM to search the web for the latest information before generating a response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "type": "web_search_preview",
    "user_location": {
        "type": "approximate",
        "country": "GB",
        "search_context_size": "high",
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prompt Template
&lt;/h3&gt;

&lt;p&gt;The Marketing team produced a prompt template with clear instructions for the LLM to check whether a &lt;em&gt;single&lt;/em&gt; website URL is an ecommerce site.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please research the website {{url}} provided by the user. You must only return the data requested in the "InformationRequested" section and in a format according to the "OutputFormat" section. Do not include any explanations, reasoning, or commentary.

## InformationRequested
- url: {{url}}
- is_url_valid: Y/N - Is the URL valid and accessible?
- is_ecommerce: Y/N - You MUST use rules from section "Evaluation Rules for column is_ecommerce"

## OutputFormat
Output as JSON with the following fields. Do not include markdown around the JSON:
- url
- is_url_valid
- is_ecommerce

## Evaluation Rules for column is_ecommerce

*Mark "Y" only if all of the following are true, based on explicit evidence available*:
* rule 1
* rule 2
* etc

*Mark "N" in any of the following cases*:
* rule 1
* rule 2
* etc

## Final Reminder

- You must only return the data requested in the "InformationRequested" section.
- You must only return it in the format according to the "OutputFormat" section.
- You must not include any explanations, reasoning, or commentary.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
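&lt;p&gt;Because the prompt pins down the output contract (bare JSON, three fields, Y/N values), each response can be validated defensively before it touches Salesforce. A minimal sketch of such a check (the validation logic is ours; only the field names come from the prompt above):&lt;/p&gt;

```python
import json

REQUIRED_FIELDS = {"url", "is_url_valid", "is_ecommerce"}


def parse_verdict(raw_text: str) -> dict:
    """Parse an LLM reply and enforce the OutputFormat contract from the prompt."""
    data = json.loads(raw_text)  # raises ValueError if the model wrapped the JSON in markdown
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for field in ("is_url_valid", "is_ecommerce"):
        if data[field] not in ("Y", "N"):
            raise ValueError(f"{field} must be Y or N, got {data[field]!r}")
    return data


verdict = parse_verdict('{"url": "xyz.com", "is_url_valid": "Y", "is_ecommerce": "N"}')
print(verdict["is_ecommerce"])  # N
```

&lt;p&gt;Replies that fail validation can be routed to the same manual-review queue as LLM disagreements.&lt;/p&gt;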






&lt;h2&gt;
  
  
  Scaling it up - OpenAI Batch API
&lt;/h2&gt;

&lt;p&gt;OpenAI has a &lt;a href="https://platform.openai.com/docs/guides/batch" rel="noopener noreferrer"&gt;Batch API&lt;/a&gt; that allows you to send asynchronous groups of requests with 50% lower costs, a separate pool of significantly higher rate limits, and a clear 24-hour turnaround time. The workflow is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg4da93exwhgb5py4eu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg4da93exwhgb5py4eu2.png" alt="OpenAI Batch API Workflow" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The uploaded batch file containing the requests will have one line per website, as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"custom_id": "request-[1756480801.159196]-xyz.com", "method": "POST", "url": "/v1/responses", "body": {"model": "gpt-4.1", "input": "Run the following prompt", "prompt": {"id": "pmpt_XXX", "version": "2", "variables": {"url": "xyz.com"}}}}
{"custom_id": "request-[1756480802.1434196]-abc.com", "method": "POST", "url": "/v1/responses", "body": {"model": "gpt-4.1", "input": "Run the following prompt", "prompt": {"id": "pmpt_XXX", "version": "2", "variables": {"url": "abc.com"}}}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
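&lt;p&gt;Building that file and submitting it takes only a few calls with the official &lt;code&gt;openai&lt;/code&gt; Python SDK. A sketch matching the request format above (the helper names and file path are illustrative):&lt;/p&gt;

```python
import json


def build_request(url: str, prompt_id: str, ts: float) -> dict:
    """One line of the batch input JSONL file, matching the format shown above."""
    return {
        "custom_id": f"request-[{ts}]-{url}",
        "method": "POST",
        "url": "/v1/responses",
        "body": {
            "model": "gpt-4.1",
            "input": "Run the following prompt",
            "prompt": {"id": prompt_id, "version": "2", "variables": {"url": url}},
        },
    }


def submit_batch(path: str) -> str:
    """Upload the JSONL file and start a batch job; returns the batch id."""
    from openai import OpenAI  # official SDK; needs OPENAI_API_KEY set

    client = OpenAI()
    with open(path, "rb") as f:
        batch_file = client.files.create(file=f, purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/responses",
        completion_window="24h",
    )
    return batch.id


line = json.dumps(build_request("xyz.com", "pmpt_XXX", 1.5))
```

&lt;p&gt;Once the job reports &lt;code&gt;completed&lt;/code&gt;, the results are downloaded from the batch's output file.&lt;/p&gt;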



&lt;p&gt;The benefits of this are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Significant Cost Reduction&lt;/code&gt;: The 50% discount on pricing is a major advantage for processing thousands of URLs, leading to substantial cost savings compared to using the real-time API.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Increased Throughput&lt;/code&gt;: The much higher rate limits allow for processing a large volume of requests in parallel, drastically reducing the overall time it takes to enrich a large dataset.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Asynchronous "Fire-and-Forget" Workflow&lt;/code&gt;: You can submit a large batch job and not have to wait for it to complete. This is perfect for non-time-sensitive, offline processing tasks, as you can retrieve the results later without keeping a connection open.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Simplified Client-Side Logic&lt;/code&gt;: It removes the need for you to build and maintain complex logic to handle rate limiting, concurrent requests, and retries. You simply prepare and upload a file.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Enhanced Resilience and Error Handling&lt;/code&gt;: Since requests are independent, the success or failure of one doesn't impact others. The output file clearly indicates the status of each request, making it easy to identify and retry only the failed ones.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Up to date context&lt;/code&gt;: The prompt template is configured to use the &lt;a href="https://platform.openai.com/docs/guides/tools-web-search" rel="noopener noreferrer"&gt;web search tool&lt;/a&gt; to ground the LLM with real-time information about the website. This search is performed independently for each website.&lt;/li&gt;
&lt;/ul&gt;
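&lt;p&gt;That last point is easy to act on: each line of the batch output file carries the request's &lt;code&gt;custom_id&lt;/code&gt;, a &lt;code&gt;response&lt;/code&gt; with its HTTP status code, and an &lt;code&gt;error&lt;/code&gt; field, so splitting successes from retries is a few lines of code. A sketch (the function name is ours):&lt;/p&gt;

```python
import json


def split_results(output_jsonl: str):
    """Split Batch API output lines into successful records and custom_ids to retry."""
    succeeded, to_retry = [], []
    for line in output_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") is None and response.get("status_code") == 200:
            succeeded.append(record)
        else:
            to_retry.append(record["custom_id"])
    return succeeded, to_retry
```

&lt;p&gt;The &lt;code&gt;custom_id&lt;/code&gt;s in the retry list map straight back to lines in the original input file, so a retry batch is just a filtered copy of it.&lt;/p&gt;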




&lt;h2&gt;
  
  
  Scaling it up - Google Batch Predictions
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini" rel="noopener noreferrer"&gt;Google Batch Predictions&lt;/a&gt; also allows you to generate predictions from Gemini models using a &lt;em&gt;Batch Job&lt;/em&gt;, the workflow is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c0oxojcpbag5emls3lg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3c0oxojcpbag5emls3lg.png" alt="Google Batch Predictions Workflow" width="800" height="116"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similar to OpenAI, the batch job file contains one request per line, but you cannot use a prompt template, so each request in the file carries the fully rendered prompt. Also, Gemini's web search tools are not available via Batch Predictions, but we still found the results to be accurate.&lt;/p&gt;
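&lt;p&gt;With no prompt template available, each line of the batch input file embeds the fully rendered prompt in a Gemini &lt;code&gt;request&lt;/code&gt; body. A sketch of building one line (the request shape follows the Vertex AI batch prediction docs as we understand them; the template text here is a shortened stand-in for the full prompt shown earlier):&lt;/p&gt;

```python
import json

# Shortened stand-in for the full ecommerce-check prompt shown earlier.
PROMPT_TEMPLATE = "Please research the website {url} and return JSON with url, is_url_valid and is_ecommerce."


def build_gemini_request(url: str) -> str:
    """One JSONL line for a Gemini batch prediction job, with the prompt inlined."""
    body = {
        "request": {
            "contents": [
                {"role": "user", "parts": [{"text": PROMPT_TEMPLATE.format(url=url)}]}
            ]
        }
    }
    return json.dumps(body)


print(build_gemini_request("xyz.com"))
```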




&lt;h2&gt;
  
  
  Where we Ended Up
&lt;/h2&gt;

&lt;p&gt;We now have a repeatable way to enrich data for a large number of websites using the power of LLMs. We have already started using it to conduct other checks.&lt;/p&gt;

&lt;p&gt;The Salesforce Accounts we enriched with &lt;code&gt;is_ecommerce = Y/N&lt;/code&gt; were used to create a better-qualified list at the top of the sales funnel.&lt;/p&gt;

&lt;p&gt;AEs were no longer reporting websites as &lt;strong&gt;not&lt;/strong&gt; ecommerce.&lt;/p&gt;

&lt;p&gt;A job well done by AI and Humans!&lt;/p&gt;

</description>
      <category>openai</category>
      <category>gemini</category>
      <category>ai</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Improve DBT Incremental Performance on Snowflake using Custom Incremental Strategy</title>
      <dc:creator>Jag Thind</dc:creator>
      <pubDate>Thu, 29 May 2025 11:34:29 +0000</pubDate>
      <link>https://dev.to/superpayments/improve-dbt-incremental-performance-on-snowflake-using-custom-incremental-strategy-3ag3</link>
      <guid>https://dev.to/superpayments/improve-dbt-incremental-performance-on-snowflake-using-custom-incremental-strategy-3ag3</guid>
      <description>&lt;p&gt;The following presents how to improve the performance of the DBT built-in &lt;code&gt;delete-insert&lt;/code&gt; &lt;a href="https://docs.getdbt.com/docs/build/incremental-strategy" rel="noopener noreferrer"&gt;incremental strategy&lt;/a&gt; on &lt;strong&gt;snowflake&lt;/strong&gt; so we can control snowflake query costs. It is broken down into:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Defining the problem, with supporting performance statistics&lt;/li&gt;
&lt;li&gt;Desired solution requirements&lt;/li&gt;
&lt;li&gt;Solution implementation, with supporting performance statistics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We implemented a DBT &lt;a href="https://docs.getdbt.com/docs/build/incremental-strategy#custom-strategies" rel="noopener noreferrer"&gt;custom incremental strategy&lt;/a&gt;, along with &lt;a href="https://docs.getdbt.com/docs/build/incremental-strategy#about-incremental_predicates" rel="noopener noreferrer"&gt;incremental predicates&lt;/a&gt;, to improve Snowflake query performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced MBs scanned by ~99.68%&lt;/li&gt;
&lt;li&gt;Reduced &lt;a href="https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions#what-are-micro-partitions" rel="noopener noreferrer"&gt;micro-partitions&lt;/a&gt; scanned by ~99.53%&lt;/li&gt;
&lt;li&gt;Reduced query time from &lt;code&gt;19&lt;/code&gt; seconds to &lt;code&gt;1.3&lt;/code&gt; seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Less data is scanned, so the Snowflake warehouse spends less time waiting on I/O and the query completes faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Custom incremental strategies and incremental predicates are more advanced uses of DBT for incremental processing. But I suppose that’s where you have the most fun, so let’s get stuck in!&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;When using the DBT built-in &lt;code&gt;delete+insert&lt;/code&gt; &lt;a href="https://docs.getdbt.com/docs/build/incremental-strategy" rel="noopener noreferrer"&gt;incremental strategy&lt;/a&gt; on large volumes of data, you can get inefficient queries on Snowflake when the &lt;code&gt;delete&lt;/code&gt; statement is executed. This means queries take longer and &lt;strong&gt;increase warehouse costs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Taking an example target table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With &lt;code&gt;~458 million&lt;/code&gt; rows&lt;/li&gt;
&lt;li&gt;Is &lt;code&gt;~26 GB&lt;/code&gt; in size&lt;/li&gt;
&lt;li&gt;Has &lt;code&gt;~2560&lt;/code&gt; micro-partitions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a DBT model that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is running every &lt;code&gt;30 minutes&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Typically there are &lt;code&gt;~100K&lt;/code&gt; rows to merge into the target table on every run. As data can arrive out-of-order, a subsequent run will pick it up, which means a run can include rows that were already processed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With DBT model config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  - name: model_name
    config:
      materialized: "incremental"
      incremental_strategy: "delete+insert"
      on_schema_change: "append_new_columns"
      unique_key: ["dw_order_created_skey"] # varchar(100)
      cluster_by: ["to_date(order_created_at)"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Default &lt;code&gt;delete&lt;/code&gt; SQL generated by DBT, before it inserts data in the same transaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;delete from target_table as DBT_INTERNAL_DEST
where (dw_order_created_skey) in (
  select distinct dw_order_created_skey
  from source_temp_table as DBT_INTERNAL_SOURCE
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Performance Statistics
&lt;/h3&gt;

&lt;p&gt;To find the rows in the target table to delete with the matching &lt;code&gt;dw_order_created_skey&lt;/code&gt; (see node profile overview image below), Snowflake has to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scan &lt;code&gt;~11 GB&lt;/code&gt; of the target table&lt;/li&gt;
&lt;li&gt;Scan all &lt;code&gt;~2560 micro-partitions&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Query takes &lt;code&gt;~19 seconds&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; - The query does &lt;strong&gt;not&lt;/strong&gt; filter on &lt;code&gt;order_created_at&lt;/code&gt;, so Snowflake cannot use the &lt;code&gt;clustering key&lt;/code&gt; of &lt;code&gt;to_date(order_created_at)&lt;/code&gt; to prune micro-partitions when finding the matching rows to delete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query plan&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimzno0g6qh0rjiv3pglt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimzno0g6qh0rjiv3pglt.png" alt="delete query plan"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4xw5sqxxzu9mmj7mq11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4xw5sqxxzu9mmj7mq11.png" alt="node profile overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Desired Solution
&lt;/h2&gt;

&lt;p&gt;To limit the data read from the target table above, we can make use of &lt;a href="https://docs.getdbt.com/docs/build/incremental-strategy#about-incremental_predicates" rel="noopener noreferrer"&gt;incremental_predicates&lt;/a&gt; in the model config. This adds SQL that filters the target table.&lt;/p&gt;

&lt;p&gt;DBT model config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  - name: model_name
    config:
      materialized: "incremental"
      incremental_strategy: "delete+insert"
      on_schema_change: "append_new_columns"
      unique_key: ["dw_order_created_skey"]
      cluster_by: ["to_date(order_created_at)"]
      incremental_predicates:
        - "order_created_at &amp;gt;= (select dateadd(hour, -24, min(order_created_at)) from DBT_INTERNAL_SOURCE)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Issues with this&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://docs.getdbt.com/docs/build/incremental-strategy#about-incremental_predicates" rel="noopener noreferrer"&gt;incremental_predicates&lt;/a&gt; docs state &lt;em&gt;dbt does not check the syntax of the SQL statements&lt;/em&gt;, so the predicate is injected into the generated SQL verbatim.&lt;/li&gt;
&lt;li&gt;We get an error when it executes on Snowflake: &lt;code&gt;Object 'DBT_INTERNAL_SOURCE' does not exist or not authorized.&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;We cannot hardcode the Snowflake table name in the incremental_predicates, as it's dynamically generated by DBT.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Solution Implementation
&lt;/h2&gt;

&lt;p&gt;We need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-process each element of &lt;code&gt;incremental_predicates&lt;/code&gt; to replace &lt;code&gt;DBT_INTERNAL_SOURCE&lt;/code&gt; with the actual &lt;code&gt;source_temp_table&lt;/code&gt; name, so that DBT generates SQL like the below for better performance:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;delete from target_table as DBT_INTERNAL_DEST
where (dw_order_created_skey) in (
  select distinct dw_order_created_skey
  from source_temp_table as DBT_INTERNAL_SOURCE
)
-- Added by incremental_predicates
and order_created_at &amp;gt;= (select dateadd(hour, -24, min(order_created_at)) from source_temp_table)
;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Continue to call the &lt;strong&gt;default&lt;/strong&gt; DBT &lt;code&gt;delete+insert&lt;/code&gt; incremental strategy with the new value for &lt;code&gt;incremental_predicates&lt;/code&gt; in the arguments dictionary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How&lt;/strong&gt; - The below macro implements a &lt;em&gt;lightweight&lt;/em&gt; &lt;a href="https://docs.getdbt.com/docs/build/incremental-strategy#custom-strategies" rel="noopener noreferrer"&gt;custom incremental strategy&lt;/a&gt; to do this. You can see at the end it calls the default &lt;code&gt;get_incremental_delete_insert_sql&lt;/code&gt; DBT code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% macro get_incremental_custom_delete_insert_sql(arg_dict) %}
  {% set custom_arg_dict = arg_dict.copy() %}
  {% set source_relation = custom_arg_dict.get('temp_relation') %}
  {% set target_relation = custom_arg_dict.get('target_relation') %}

  {% if source_relation is none %}
    {{ exceptions.raise_compiler_error('temp_relation is not present in arguments!') }}
  {% endif %}

  {% if target_relation is none %}
    {{ exceptions.raise_compiler_error('target_relation is not present in arguments!') }}
  {% endif %}

  {# Cast to string only after the presence checks; casting first would turn a
     missing relation into the string 'None' and the checks would never fire #}
  {% set source = source_relation | string %}
  {% set target = target_relation | string %}

  {% set raw_predicates = custom_arg_dict.get('incremental_predicates', []) %}

  {% if raw_predicates is string %}
    {% set predicates = [raw_predicates] %}
  {% else %}
    {% set predicates = raw_predicates %}
  {% endif %}

  {% if predicates %}
    {% set replaced_predicates = [] %}
    {% for predicate in predicates %}
      {% set replaced = predicate
        | replace('DBT_INTERNAL_SOURCE', source)
        | replace('DBT_INTERNAL_DEST', target)
      %}
      {% do replaced_predicates.append(replaced) %}
    {% endfor %}
    {% do custom_arg_dict.update({'incremental_predicates': replaced_predicates}) %}
  {% endif %}

  {{ log('Calling get_incremental_delete_insert_sql with args: ' ~ custom_arg_dict, info=False) }}
  {{ get_incremental_delete_insert_sql(custom_arg_dict) }}
{% endmacro %}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is now callable from the DBT model config by setting &lt;code&gt;incremental_strategy&lt;/code&gt; to &lt;code&gt;custom_delete_insert&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  - name: model_name
    config:
      materialized: "incremental"
      incremental_strategy: "custom_delete_insert"
      on_schema_change: "append_new_columns"
      unique_key: ["dw_order_created_skey"]
      cluster_by: ["to_date(order_created_at)"]
      incremental_predicates:
        - "order_created_at &amp;gt;= (select dateadd(hour, -24, min(order_created_at)) from DBT_INTERNAL_SOURCE)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Performance Improvement Statistics
&lt;/h3&gt;

&lt;p&gt;To find the &lt;code&gt;~100K&lt;/code&gt; rows to delete in the target table, Snowflake now only has to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scan &lt;code&gt;~35 MB&lt;/code&gt; of the target table, 11 GB → 35 MB = &lt;strong&gt;~99.68%&lt;/strong&gt; improvement&lt;/li&gt;
&lt;li&gt;Scan &lt;code&gt;12 micro-partitions&lt;/code&gt;, 2560 → 12 = &lt;strong&gt;~99.53%&lt;/strong&gt; improvement&lt;/li&gt;
&lt;li&gt;Query takes &lt;code&gt;~1.3 seconds&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
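&lt;p&gt;As a quick sanity check, the headline percentages follow directly from the scan sizes above (taking 1 GB as 1000 MB):&lt;/p&gt;

```python
# Scan sizes before and after the incremental predicate was added.
before_mb, after_mb = 11_000, 35        # ~11 GB -> ~35 MB
before_parts, after_parts = 2_560, 12   # micro-partitions scanned

mb_reduction = (1 - after_mb / before_mb) * 100
part_reduction = (1 - after_parts / before_parts) * 100

print(f"{mb_reduction:.2f}% less data, {part_reduction:.2f}% fewer micro-partitions")
# 99.68% less data, 99.53% fewer micro-partitions
```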

&lt;p&gt;Less data is scanned, so the Snowflake warehouse spends less time waiting on I/O and the query completes faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query plan&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqdvdbdos892fe5urwru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqdvdbdos892fe5urwru.png" alt="delete query plan"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ieoslvjkw1e2j9ep52n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ieoslvjkw1e2j9ep52n.png" alt="node profile overview"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;If you're interested in hearing more about how we use DBT at &lt;a href="https://www.superpayments.com" rel="noopener noreferrer"&gt;Super Payments&lt;/a&gt;, feel free to reach out!&lt;/p&gt;

</description>
      <category>dbt</category>
      <category>snowflake</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
