<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Debashis Adak</title>
    <description>The latest articles on DEV Community by Debashis Adak (@dadak5).</description>
    <link>https://dev.to/dadak5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1547320%2F8aa695ce-d17d-4e34-aac0-a6dd51748aaf.png</url>
      <title>DEV Community: Debashis Adak</title>
      <link>https://dev.to/dadak5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dadak5"/>
    <language>en</language>
    <item>
      <title>From Prompt to Pixel: My First Hands-On with Stable Diffusion XL</title>
      <dc:creator>Debashis Adak</dc:creator>
      <pubDate>Sun, 27 Jul 2025 09:04:51 +0000</pubDate>
      <link>https://dev.to/dadak5/from-prompt-to-pixel-my-first-hands-on-with-stable-diffusion-xl-2gjj</link>
      <guid>https://dev.to/dadak5/from-prompt-to-pixel-my-first-hands-on-with-stable-diffusion-xl-2gjj</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The Setup: Building a Reproducible SDXL Pipeline&lt;/li&gt;
&lt;li&gt;The Prompt&lt;/li&gt;
&lt;li&gt;Why This Was Exciting&lt;/li&gt;
&lt;li&gt;The Output&lt;/li&gt;
&lt;li&gt;Code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a while now, I’ve been fascinated by the idea of creating high-quality images purely from text. The growing capabilities of generative models like Stable Diffusion had me curious — how exactly does one go from a simple sentence to a detailed piece of art?&lt;/p&gt;

&lt;p&gt;This weekend, I decided to roll up my sleeves and try it out.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🔧 The Setup: Building a Reproducible SDXL Pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I started by creating a Python notebook on Google Colab, designed to be reproducible, efficient, and GPU-aware. Here's what the pipeline does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mounts Google Drive to store models and outputs persistently&lt;/li&gt;
&lt;li&gt;Sets up local caching directories for huggingface_hub, diffusers, and PyTorch wheels&lt;/li&gt;
&lt;li&gt;Downloads required packages (like diffusers, transformers, accelerate) ahead of time into a local wheelhouse for fast, dependency-safe installs&lt;/li&gt;
&lt;li&gt;Detects GPU availability and automatically selects:
&lt;ul&gt;
&lt;li&gt;stabilityai/stable-diffusion-xl-base-1.0 (SDXL) if CUDA is available&lt;/li&gt;
&lt;li&gt;runwayml/stable-diffusion-v1-5 as a fallback on CPU (work in progress)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Initializes a DiffusionPipeline with memory-efficient options like vae_slicing and attention_slicing&lt;/li&gt;
&lt;li&gt;Generates an image using a carefully designed prompt and saves it to Drive&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🖼️ The Prompt&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt = (
    "Ultra-detailed portrait of a mysterious traveler walking through "
    "a neon-lit cyberpunk city at night, reflective puddles, cinematic "
    "lighting, intricate textures, hyper-realistic, depth of field"
)
neg = "blurry, distorted, watermark, text, extra limbs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💡 Why This Was Exciting&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While tools like Midjourney and DALL·E abstract away the complexity, building your own Stable Diffusion pipeline gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full control over versions, dependencies, and parameters&lt;/li&gt;
&lt;li&gt;A better understanding of how diffusion models work under the hood&lt;/li&gt;
&lt;li&gt;A great foundation for future projects like fine-tuning, LoRA, or prompt chaining&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🧠 What I Learned&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SDXL is powerful — but managing model size and inference speed is key&lt;/li&gt;
&lt;li&gt;Prompt design is half the battle — you need to balance creativity with CLIP token limits (~77 tokens)&lt;/li&gt;
&lt;li&gt;Using Hugging Face + Google Drive makes it easy to cache, persist, and share your experiments&lt;/li&gt;
&lt;/ul&gt;
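The token-limit point above can be checked cheaply before burning GPU time. A minimal sketch, assuming whitespace word count as a rough stand-in for CLIP's actual BPE token count (the real tokenizer usually yields somewhat more tokens than words):

```python
# Crude prompt-length sanity check: CLIP truncates prompts at ~77 tokens.
# Word count is only a rough lower bound on the BPE token count, but it
# catches obviously oversized prompts before any model is loaded.
prompt = ("Ultra-detailed portrait of a mysterious traveler walking through "
          "a neon-lit cyberpunk city at night, reflective puddles, cinematic "
          "lighting, intricate textures, hyper-realistic, depth of field")

words = prompt.split()
# keep at most 77 words, mimicking what CLIP truncation would do
kept = words[:77]
print(f"~{len(words)} words; CLIP keeps roughly the first 77 tokens")
assert len(kept) == len(words)  # this prompt fits comfortably
```

If the assertion fires, the tail of the prompt is likely being silently dropped by the tokenizer, which is worth knowing before tuning wording.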

&lt;h2&gt;
  
  
  &lt;strong&gt;💡 The Output&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Image - 1&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkqxv9aw0jk5ehn9hzlk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkqxv9aw0jk5ehn9hzlk.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💡 Notebook Code&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ╔═══════════════════════════════════════════════════════════════╗
# ║  0.  MOUNT DRIVE + PREP CACHE FOLDERS                         ║
# ╚═══════════════════════════════════════════════════════════════╝
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

import os, pathlib, subprocess, textwrap

DRIVE_ROOT = "/content/drive/MyDrive/Study/ai_models"   # 👈 change if you like
HF_CACHE   = f"{DRIVE_ROOT}/hf_cache"
WHEELHOUSE = f"{DRIVE_ROOT}/wheelhouse"

for p in (HF_CACHE, WHEELHOUSE):
    pathlib.Path(p).mkdir(parents=True, exist_ok=True)

# Redirect every 🤗 cache to Drive
os.environ["HF_HOME"]               = HF_CACHE
os.environ["HUGGINGFACE_HUB_CACHE"] = HF_CACHE
os.environ["TRANSFORMERS_CACHE"]    = HF_CACHE
os.environ["DIFFUSERS_CACHE"]       = HF_CACHE

# ╔═══════════════════════════════════════════════════════════════╗
# ║  1.  DEFINE EXACT VERSIONS                                    ║
# ╚═══════════════════════════════════════════════════════════════╝
TORCH_WHEEL = "torch==2.6.0+cu124"           # matches Colab's CUDA 12.4 tool-chain
EXTRAS      = [
    "torchvision==0.21.0+cu124",
    "torchaudio==2.6.0+cu124",
]
PKGS = [                                      # all installed with --no-deps later
    "diffusers==0.27.2",
    "transformers==4.34.0",
    "accelerate==1.9.0",
    "huggingface_hub==0.24.1",                # provides cached_download
    "safetensors==0.5.3",
    "invisible_watermark==0.2.0",
    "tokenizers==0.14.1"
]

# ╔═══════════════════════════════════════════════════════════════╗
# ║  2.  DOWNLOAD WHEELS ONCE (SKIPPED IF THEY EXIST)             ║
# ╚═══════════════════════════════════════════════════════════════╝
def wheel_present(name_version: str) -&amp;gt; bool:
    name, ver = name_version.split("==")
    return bool(list(pathlib.Path(WHEELHOUSE).glob(f"{name}-{ver}*.whl")))

# Torch + friends come from the PyTorch wheel index
if not wheel_present("torch==2.6.0+cu124"):
    !pip download $TORCH_WHEEL {" ".join(EXTRAS)} -d "$WHEELHOUSE" \
        --index-url https://download.pytorch.org/whl/cu124

# Other packages from PyPI
for pkg in PKGS:
    if not wheel_present(pkg):
        !pip download $pkg -d "$WHEELHOUSE"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ╔═══════════════════════════════════════════════════════════════╗
# ║  3.  INSTALL FROM LOCAL WHEELS ONLY                           ║
# ╚═══════════════════════════════════════════════════════════════╝
# 3-a  Torch first (allows ABI-compatible extras)
!pip install --quiet --no-index --find-links="$WHEELHOUSE" \
    $TORCH_WHEEL {" ".join(EXTRAS)}

# 3-b  Everything else, but **--no-deps** so nothing tries to upgrade torch
!pip install --quiet --no-index --find-links="$WHEELHOUSE" --no-deps \
    diffusers==0.27.2 transformers==4.34.0 accelerate==1.9.0 \
    huggingface_hub==0.24.1 tokenizers==0.14.1 \
    safetensors==0.5.3 invisible_watermark==0.2.0

!pip uninstall -y peft
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ╔══════════════════════════════════════════════════════════════════╗
# ║  1.  LOAD A PIPELINE – SDXL on GPU, SD-v1.5 on CPU fallback     ║
# ╚══════════════════════════════════════════════════════════════════╝
import importlib.metadata, torch
print("torch        :", torch.__version__)
print("diffusers    :", importlib.metadata.version("diffusers"))
print("hub          :", importlib.metadata.version("huggingface_hub"))
# should show a consistent trio matching the pins above, e.g.:
# torch 2.6.0+cu124 | diffusers 0.27.2 | hub 0.24.1

from diffusers import DiffusionPipeline

has_cuda = torch.cuda.is_available()
if has_cuda:
    MODEL_ID  = "stabilityai/stable-diffusion-xl-base-1.0"
    DTYPE     = torch.float16
    DEVICE    = "cuda"
    LOAD_KW   = {}                      # GPU loads everything
else:
    MODEL_ID  = "runwayml/stable-diffusion-v1-5"
    DTYPE     = torch.float32
    DEVICE    = "cpu"
    LOAD_KW   = dict(device_map="balanced", max_memory={"cpu": "10GiB"})

pipe = DiffusionPipeline.from_pretrained(
    MODEL_ID,
    torch_dtype=DTYPE,
    use_safetensors=True,
    **LOAD_KW,
).to(DEVICE)

# VRAM / RAM savers
if has_cuda:
    #pipe.enable_xformers_memory_efficient_attention()
    pipe.enable_vae_slicing()
else:
    pipe.enable_attention_slicing()
    pipe.enable_vae_slicing()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ╔══════════════════════════════════════════════════════════════════╗
# ║  2.  GENERATE A TEST IMAGE                                       ║
# ╚══════════════════════════════════════════════════════════════════╝
prompt = ("Ultra-detailed portrait of a mysterious traveler walking through "
          "a neon-lit cyberpunk city at night, reflective puddles, cinematic "
          "lighting, intricate textures, hyper-realistic, depth of field")
neg    = "blurry, distorted, watermark, text, extra limbs"

height = width = 1024 if has_cuda else 512
steps  = 30    if has_cuda else 20

image = pipe(
    prompt              = prompt,
    negative_prompt     = neg,
    height              = height,
    width               = width,
    num_inference_steps = steps,
    guidance_scale      = 7.5,
    generator           = torch.Generator(DEVICE).manual_seed(42),
).images[0]

# ╔══════════════════════════════════════════════════════════════════╗
# ║  3.  SAVE TO DRIVE                                               ║
# ╚══════════════════════════════════════════════════════════════════╝
out_dir  = pathlib.Path(DRIVE_ROOT) / "outputs"
out_dir.mkdir(parents=True, exist_ok=True)
fname    = out_dir / f"traveler_{'sdxl' if has_cuda else 'v15'}.png"
image.save(fname)

print(f"✅ Render complete → {fname}")
image   # displays inline in Colab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>stablediffusion</category>
      <category>learning</category>
      <category>python</category>
    </item>
    <item>
      <title>Pulumi-Day1- Getting Started</title>
      <dc:creator>Debashis Adak</dc:creator>
      <pubDate>Mon, 01 Jul 2024 10:20:21 +0000</pubDate>
      <link>https://dev.to/dadak5/pulumi-day1-getting-started-1ieg</link>
      <guid>https://dev.to/dadak5/pulumi-day1-getting-started-1ieg</guid>
      <description>&lt;p&gt;Hi All,&lt;/p&gt;

&lt;p&gt;I am starting to learn Pulumi. Pulumi's infrastructure-as-code SDK helps you create, deploy, and manage AWS containers, serverless functions, and infrastructure using programming languages like TypeScript, Python, Go, C#, and Java, and markup languages like YAML. The Pulumi AWS provider packages and CLI help you accomplish all of this within minutes.&lt;/p&gt;

&lt;p&gt;Docs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.pulumi.com/docs/clouds/aws/get-started/create-project/"&gt;https://www.pulumi.com/docs/clouds/aws/get-started/create-project/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pulumi.com/docs/clouds/aws/get-started/begin/"&gt;https://www.pulumi.com/docs/clouds/aws/get-started/begin/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pulumi.com/docs/clouds/aws/get-started/create-project/"&gt;https://www.pulumi.com/docs/clouds/aws/get-started/create-project/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scope of this blog is just to check how a Pulumi project works. It's a learning journey of infrastructure as code with AWS.&lt;/p&gt;

&lt;p&gt;I selected Pulumi for infrastructure as code because it provides programming-language flexibility (which means you can write test cases) &amp;amp; integrates with many cloud providers (AWS, Azure, GCP, etc.).&lt;/p&gt;

&lt;p&gt;As I am comfortable with Python, it will be my language of choice.&lt;br&gt;
Also, I am using my personal Windows machine.&lt;/p&gt;
&lt;h2&gt;
  
  
  Environment Setup
&lt;/h2&gt;

&lt;p&gt;Following this link: &lt;a href="https://www.pulumi.com/docs/clouds/aws/get-started/begin/"&gt;https://www.pulumi.com/docs/clouds/aws/get-started/begin/&lt;/a&gt;&lt;br&gt;
As per the documentation, PowerShell is the preferred way to install Pulumi on Windows.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To install pulumi&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;choco install pulumi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To check the installed Pulumi version&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pulumi version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Version installed: v3.121.0&lt;/p&gt;

&lt;p&gt;You need to install Python or whichever language you are using. I already have Python installed on my system, so I don't need to install it again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure Pulumi to access your AWS account
&lt;/h2&gt;

&lt;p&gt;Following page: &lt;a href="https://www.pulumi.com/registry/packages/aws/installation-configuration/"&gt;https://www.pulumi.com/registry/packages/aws/installation-configuration/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pulumi requires cloud credentials to manage and provision resources. You must use an IAM user account that has programmatic access with rights to deploy and manage resources handled through Pulumi.&lt;/p&gt;

&lt;p&gt;I already have the AWS CLI installed on my system. For Pulumi, we need either a new or an existing IAM user with programmatic access.&lt;/p&gt;

&lt;p&gt;I followed the steps below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Created a new IAM user "pulumi-user"&lt;/li&gt;
&lt;li&gt;Gave AdministratorAccess to the user (note: admin access only for ease of experimentation)&lt;/li&gt;
&lt;li&gt;Added a "pulumi-dev" profile with its aws_access_key_id &amp;amp; aws_secret_access_key to ~/.aws/credentials (screenshot below)&lt;/li&gt;
&lt;li&gt;Configured Pulumi to use the profile from the terminal (command: pulumi config set aws:profile pulumi-dev)&lt;/li&gt;
&lt;/ol&gt;
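For reference, the ~/.aws/credentials entry from step 3 looks like the following (the key values here are placeholders, not real credentials):

```ini
# ~/.aws/credentials
[pulumi-dev]
aws_access_key_id     = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

The profile name in the section header is what "pulumi config set aws:profile pulumi-dev" refers to.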

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs40rcunmtfqpf768cm2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs40rcunmtfqpf768cm2n.png" alt="Image description" width="713" height="66"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create a Pulumi Project
&lt;/h2&gt;

&lt;p&gt;Reference link: &lt;a href="https://www.pulumi.com/docs/clouds/aws/get-started/create-project/"&gt;https://www.pulumi.com/docs/clouds/aws/get-started/create-project/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I created a GitLab repository called "pulumi", then created the Pulumi project "awsproj" following the steps given in the documentation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PS &amp;gt; pulumi new aws-python
This command will walk you through creating a new Pulumi project.

Enter a value or leave blank to accept the (default), and press &amp;lt;ENTER&amp;gt;.
Press ^C at any time to quit.

project name (project): awsproj
project description (A minimal AWS Python Pulumi program):
Created project 'awsproj'

stack name (dev):
Created stack 'dev'
Enter your passphrase to protect config/secrets:
Re-enter your passphrase to confirm:

The toolchain to use for installing dependencies and running the program pip
aws:region: The AWS region to deploy into (us-east-1):
Saved config

Installing dependencies...

Creating virtual environment...
Finished creating virtual environment
Updating pip, setuptools, and wheel in virtual environment...
Requirement already satisfied: pip in c:\users\debashis\git_projects\pulumi\venv\lib\site-packages (22.0.4)
Collecting pip
  Downloading pip-24.1.1-py3-none-any.whl (1.8 MB)
     ---------------------------------------- 1.8/1.8 MB 11.6 MB/s eta 0:00:00
Requirement already satisfied: setuptools in c:\users\debashis\git_projects\pulumi\venv\lib\site-packages (58.1.0)
Collecting setuptools
  Downloading setuptools-70.1.1-py3-none-any.whl (883 kB)
     ------------------------------------- 883.3/883.3 KB 28.2 MB/s eta 0:00:00
Collecting wheel
  Using cached wheel-0.43.0-py3-none-any.whl (65 kB)
Installing collected packages: wheel, setuptools, pip
  Attempting uninstall: setuptools
    Found existing installation: setuptools 58.1.0
    Uninstalling setuptools-58.1.0:
      Successfully uninstalled setuptools-58.1.0
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-24.1.1 setuptools-70.1.1 wheel-0.43.0
Finished updating
Installing dependencies in virtual environment...
Collecting pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0 (from -r requirements.txt (line 1))
  Downloading pulumi-3.121.0-py3-none-any.whl.metadata (11 kB)
Collecting pulumi-aws&amp;lt;7.0.0,&amp;gt;=6.0.2 (from -r requirements.txt (line 2))
  Downloading pulumi_aws-6.42.1-py3-none-any.whl.metadata (8.4 kB)
Collecting protobuf~=4.21 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Downloading protobuf-4.25.3-cp310-abi3-win_amd64.whl.metadata (541 bytes)
Collecting grpcio~=1.60.1 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Downloading grpcio-1.60.1-cp310-cp310-win_amd64.whl.metadata (4.2 kB)
Collecting dill~=0.3 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting six~=1.12 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Downloading six-1.16.0-py2.py3-none-any.whl.metadata (1.8 kB)
Collecting semver~=2.13 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Downloading semver-2.13.0-py2.py3-none-any.whl.metadata (5.0 kB)
Collecting pyyaml~=6.0 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Using cached PyYAML-6.0.1-cp310-cp310-win_amd64.whl.metadata (2.1 kB)
Collecting parver&amp;gt;=0.2.1 (from pulumi-aws&amp;lt;7.0.0,&amp;gt;=6.0.2-&amp;gt;-r requirements.txt (line 2))
  Downloading parver-0.5-py3-none-any.whl.metadata (2.7 kB)
Collecting typing-extensions&amp;gt;=4.11 (from pulumi-aws&amp;lt;7.0.0,&amp;gt;=6.0.2-&amp;gt;-r requirements.txt (line 2))
  Downloading typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
Collecting arpeggio&amp;gt;=1.7 (from parver&amp;gt;=0.2.1-&amp;gt;pulumi-aws&amp;lt;7.0.0,&amp;gt;=6.0.2-&amp;gt;-r requirements.txt (line 2))
  Downloading Arpeggio-2.0.2-py2.py3-none-any.whl.metadata (2.4 kB)
Collecting attrs&amp;gt;=19.2 (from parver&amp;gt;=0.2.1-&amp;gt;pulumi-aws&amp;lt;7.0.0,&amp;gt;=6.0.2-&amp;gt;-r requirements.txt (line 2))
  Downloading attrs-23.2.0-py3-none-any.whl.metadata (9.5 kB)
Downloading pulumi-3.121.0-py3-none-any.whl (263 kB)
   ---------------------------------------- 263.4/263.4 kB 5.5 MB/s eta 0:00:00
Downloading pulumi_aws-6.42.1-py3-none-any.whl (9.3 MB)
   ---------------------------------------- 9.3/9.3 MB 14.8 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ---------------------------------------- 116.3/116.3 kB 6.6 MB/s eta 0:00:00
Downloading grpcio-1.60.1-cp310-cp310-win_amd64.whl (3.7 MB)
   ---------------------------------------- 3.7/3.7 MB 3.2 MB/s eta 0:00:00
Downloading parver-0.5-py3-none-any.whl (15 kB)
Downloading protobuf-4.25.3-cp310-abi3-win_amd64.whl (413 kB)
   ---------------------------------------- 413.4/413.4 kB ? eta 0:00:00
Using cached PyYAML-6.0.1-cp310-cp310-win_amd64.whl (145 kB)
Downloading semver-2.13.0-py2.py3-none-any.whl (12 kB)
Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Downloading Arpeggio-2.0.2-py2.py3-none-any.whl (55 kB)
   ---------------------------------------- 55.3/55.3 kB ? eta 0:00:00
Downloading attrs-23.2.0-py3-none-any.whl (60 kB)
   ---------------------------------------- 60.8/60.8 kB ? eta 0:00:00
Installing collected packages: arpeggio, typing-extensions, six, semver, pyyaml, protobuf, grpcio, dill, attrs, pulumi, parver, pulumi-aws
Successfully installed arpeggio-2.0.2 attrs-23.2.0 dill-0.3.8 grpcio-1.60.1 parver-0.5 protobuf-4.25.3 pulumi-3.121.0 pulumi-aws-6.42.1 pyyaml-6.0.1 semver-2.13.0 six-1.16.0 typing-extensions-4.12.2
Finished installing dependencies
Finished installing dependencies

Your new project is ready to go!

To perform an initial deployment, run `pulumi up`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, it created a virtual environment "venv".&lt;/p&gt;

&lt;p&gt;Note: I used pip as the package installer, as I ran into errors while using Poetry.&lt;/p&gt;

&lt;p&gt;What does the project structure look like?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----        01-07-2024     14:29                venv
-a----        01-07-2024     14:29             14 .gitignore
-a----        01-07-2024     14:29            117 Pulumi.dev.yaml                                                                                                      
-a----        01-07-2024     14:29            206 Pulumi.yaml
-a----        01-07-2024     14:29             48 requirements.txt
-a----        01-07-2024     14:29            229 __main__.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Contents inside requirements.txt&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pulumi&amp;gt;=3.0.0,&amp;lt;4.0.0
pulumi-aws&amp;gt;=6.0.2,&amp;lt;7.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;u&gt;Activate the Virtual Environment&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To activate the virtual environment&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.\venv\Scripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check pip dependencies&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip freeze

Arpeggio==2.0.2
attrs==23.2.0
dill==0.3.8
grpcio==1.60.1
parver==0.5
protobuf==4.25.3
pulumi==3.121.0
pulumi_aws==6.42.1
PyYAML==6.0.1
semver==2.13.0
six==1.16.0
typing_extensions==4.12.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running a Pulumi Project
&lt;/h2&gt;

&lt;p&gt;When you create a Pulumi project, it generates a __main__.py by default.&lt;/p&gt;

&lt;p&gt;Content of __main__.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""An AWS Python Pulumi program"""

import pulumi
from pulumi_aws import s3

# Create an AWS resource (S3 Bucket)
bucket = s3.Bucket('my-bucket')

# Export the name of the bucket
pulumi.export('bucket_name', bucket.id)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It means if we run this script, it should create an S3 bucket with the prefix "my-bucket".&lt;/p&gt;
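The word "prefix" matters here: Pulumi auto-names resources by appending a random suffix to the logical name, so the physical bucket name will not be exactly "my-bucket". A small Python sketch that mimics the idea (an illustration only, not Pulumi's actual implementation):

```python
import secrets

def physical_name(logical_name):
    # Pulumi-style auto-naming: logical name plus a short random suffix,
    # so repeated deployments never collide on globally unique S3 names.
    return f"{logical_name}-{secrets.token_hex(3)}"

name = physical_name("my-bucket")
assert name.startswith("my-bucket-")
```

This is why the bucket visible in the AWS console carries a suffix after "my-bucket".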

&lt;p&gt;Let's run it.&lt;br&gt;
Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pulumi up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Screenshot of the run&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6h6stwhc5w7yuupm7sq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6h6stwhc5w7yuupm7sq0.png" alt="Image description" width="761" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, it created an S3 bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq55ns2g0blqe866w9uzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq55ns2g0blqe866w9uzx.png" alt="Image description" width="800" height="40"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Destroying Resources
&lt;/h2&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pulumi destroy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oc1psw7w80ud3a78yvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oc1psw7w80ud3a78yvr.png" alt="Image description" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the bucket was destroyed.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>python</category>
      <category>aws</category>
      <category>pulumi</category>
    </item>
    <item>
      <title>Databricks - Variant Type Analysis</title>
      <dc:creator>Debashis Adak</dc:creator>
      <pubDate>Sat, 29 Jun 2024 12:34:57 +0000</pubDate>
      <link>https://dev.to/dadak5/databricks-variant-type-analysis-1bh1</link>
      <guid>https://dev.to/dadak5/databricks-variant-type-analysis-1bh1</guid>
      <description>&lt;p&gt;The VARIANT data type is a recent introduction in Databricks &lt;strong&gt;(available in Databricks Runtime 15.3 and above)&lt;/strong&gt; designed specifically for handling semi-structured data. It offers an efficient and flexible way to store and process this kind of data, which often has a dynamic or evolving schema.&lt;/p&gt;

&lt;p&gt;Here's a quick rundown of its key features (As per the documentation of Databricks):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Flexibility: VARIANT can store various data structures within a single column, including structs, arrays, maps, and scalars. This eliminates the need for pre-defined schemas, making it adaptable to changing data formats.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance: Compared to storing semi-structured data as JSON strings, VARIANT offers significant performance improvements. This is because VARIANT uses a binary encoding scheme for data representation, allowing for faster processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ease of Use: Databricks recommends VARIANT as the preferred choice over JSON strings for working with semi-structured data. It provides familiar syntax for querying fields and elements within the VARIANT column using dot notation and array indexing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, the VARIANT data type streamlines working with semi-structured data in Databricks, enhancing flexibility, performance, and ease of use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake&lt;/strong&gt; has long offered the VARIANT data type, allowing you to store semi-structured data without pre-defining a schema. This eliminates the burden of schema design upfront.&lt;/p&gt;

&lt;p&gt;In contrast, for Delta Lake previously we relied on the MAP data type, which requires a defined schema. However, semi-structured data often exhibits schema variations across rows, creating challenges for data engineers. Parsing the data correctly before storage was a necessary but tedious step.&lt;/p&gt;
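To make the schema-variation point concrete, here is a plain-Python illustration (using the standard json module as a stand-in for the engine) of rows whose shapes differ from one record to the next; this is exactly the data VARIANT is built to hold without a pre-defined schema:

```python
import json

# Two rows of semi-structured data with different shapes: one has a
# "tags" array, the other a nested "extra" object. A rigid MAP schema
# struggles here, while VARIANT stores both rows in a single column.
rows = [
    '{"id": 1, "tags": ["a", "b"]}',
    '{"id": 2, "extra": {"nested": true}}',
]
parsed = [json.loads(r) for r in rows]

assert parsed[0]["tags"] == ["a", "b"]
assert parsed[1]["extra"]["nested"] is True
```

VARIANT gives you the same per-row flexibility inside Delta Lake, with dot-notation and array-indexing access instead of ad-hoc string parsing.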

&lt;p&gt;In this exploration, I'll try to uncover the VARIANT data type in Databricks and its underlying mechanisms.&lt;/p&gt;

&lt;p&gt;Some important Databricks documentation links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/delta/variant.html" rel="noopener noreferrer"&gt;https://docs.databricks.com/en/delta/variant.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/semi-structured/variant.html" rel="noopener noreferrer"&gt;https://docs.databricks.com/en/semi-structured/variant.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/semi-structured/variant-json-diff.html" rel="noopener noreferrer"&gt;https://docs.databricks.com/en/semi-structured/variant-json-diff.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2440252792644019/398134129842206/8684924100662862/latest.html" rel="noopener noreferrer"&gt;https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2440252792644019/398134129842206/8684924100662862/latest.html&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps To Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Provision a Databricks cluster with runtime 15.3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I created a Test Cluster with runtime 15.3 Beta version (Apache Spark 3.5.0, Scala: 2.12)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmb1okjxfq5hrd0uj7zz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmb1okjxfq5hrd0uj7zz5.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Verify that your notebook is running on 15.3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvnkbv5mxwd85263koot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvnkbv5mxwd85263koot.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Create a schema&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, we create the schema (you can use your own). Once created, it should appear in the catalog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh32r5eeaimkmjswi95og.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh32r5eeaimkmjswi95og.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Verify that the parse_json function is present (it is available from runtime 15.3 onwards)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As per the documentation (&lt;a href="https://docs.databricks.com/en/semi-structured/variant.html" rel="noopener noreferrer"&gt;https://docs.databricks.com/en/semi-structured/variant.html&lt;/a&gt;), you can use the parse_json function to parse a column's JSON data. It validates that the incoming string is well-formed JSON, and if you create the table with CREATE TABLE ... AS SELECT, the resulting column gets the VARIANT data type.&lt;/p&gt;
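&lt;p&gt;The validate-or-fail behavior can be sketched outside Databricks with Python's stdlib json module. This is only an analogy, not parse_json itself:&lt;/p&gt;

```python
import json

# Rough stand-in for parse_json's validation behavior: well-formed JSON
# is accepted, anything else raises an error (parse_json similarly fails
# on malformed input). Illustration only, not Databricks code.
def parse_json_like(raw):
    return json.loads(raw)

doc = parse_json_like('{"fb": "abc", "insta": "xyz"}')

try:
    parse_json_like("definitely not json")
    rejected = False
except ValueError:
    rejected = True
```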

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmwnwe1x7btohr7brpvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmwnwe1x7btohr7brpvs.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Create a table&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, we create a table variant_data_exploration in the schema myschema by parsing a JSON object. As per the query, it creates three columns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;id: int&lt;/li&gt;
&lt;li&gt;name: string&lt;/li&gt;
&lt;li&gt;raw: variant&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0ph9be1s62l3rjbj2oi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0ph9be1s62l3rjbj2oi.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Table Schema Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As you can see below, under the # Delta Statistics Columns section:&lt;/p&gt;

&lt;p&gt;Column Names: id, name, raw&lt;/p&gt;

&lt;p&gt;Column Selection Method: first-32 (the default for a Delta Lake table: statistics are gathered for the first 32 columns)&lt;/p&gt;

&lt;p&gt;Location: dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqkfs1fhv8kuo083vyqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqkfs1fhv8kuo083vyqr.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Verify Files in Table Location&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[
FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/', name='_delta_log/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet', name='part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet', size=3943, modificationTime=1719567376000)
]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gyfn29jv7qr3ol2m04q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gyfn29jv7qr3ol2m04q.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, there is a _delta_log directory (holding Delta table metadata and statistics files) and one Parquet file (part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet) containing a single row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8: Verify Files in the _delta_log Location&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/00000000000000000000.crc', name='00000000000000000000.crc', size=2616, modificationTime=1719567381000),
 FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/00000000000000000000.json', name='00000000000000000000.json', size=1741, modificationTime=1719567377000),
 FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/_commits/', name='_commits/', size=0, modificationTime=0)]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0voadurwmzsrota0ydec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0voadurwmzsrota0ydec.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It mainly contains two file types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;00000000000000000000.json (holds column statistics and drives data pruning/file skipping; each commit creates a new JSON file with an incremented version number)&lt;/li&gt;
&lt;li&gt;00000000000000000000.crc (a checksum file that validates the table state for that version, guarding against corruption. Separately, every 10 commits the JSON log entries are compacted into a Parquet checkpoint file.)&lt;/li&gt;
&lt;/ol&gt;
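&lt;p&gt;To make the commit-file layout concrete: each line of a commit JSON is one action, and the add action embeds the per-file statistics as a nested JSON string. A minimal stdlib-Python sketch (the path and sizes are illustrative, only the shape mirrors what Delta writes):&lt;/p&gt;

```python
import json

# One simplified "add" action line from a _delta_log commit file.
# The path is illustrative; the stats shape mirrors Delta's output.
commit_line = json.dumps({
    "add": {
        "path": "part-00000-illustrative.snappy.parquet",
        "size": 3943,
        "stats": json.dumps({
            "numRecords": 1,
            "minValues": {"id": 1, "name": "abc"},
            "maxValues": {"id": 1, "name": "abc"},
            "nullCount": {"id": 0, "name": 0, "raw": 0},
        }),
    }
})

action = json.loads(commit_line)
# Note: stats arrive as a JSON string inside the JSON action,
# so they need a second parse.
file_stats = json.loads(action["add"]["stats"])
```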

&lt;h2&gt;
  
  
  Explore
&lt;/h2&gt;

&lt;p&gt;Let's look at the content of 00000000000000000000.json, since this is the main driver of data skipping and query performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk496s96n4i0u4dkp9ahr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk496s96n4i0u4dkp9ahr.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

|add                                                                                                                                                                                                                                                                                            |commitInfo                                                                                                                                                                                                                                                                                            |metaData                                                                                                                                                                                                                                                                                   |protocol                                            |
+---------------------------------------------------------------------------------------------
|NULL                                                                                                                                                                                                                                                                                           |{0628-070916-m4o60ack, Databricks-Runtime/15.3.x-scala2.12, false, WriteSerializable, {398134129842206}, CREATE OR REPLACE TABLE AS SELECT, {1, 3943, 1}, {[], NULL, true, [], {}, false}, {true, false}, 1719567376334, 472f6c8b-cd1d-4347-acd4-c49c2ebd8072, 8684924100662862, abc@gmail.com}|NULL                                                                                                                                                                                                                                                                                       |NULL                                                |
|NULL                                                                                                                                                                                                                                                                                           |NULL                                                                                                                                                                                                                                                                                                  |{1719567374807, {parquet}, c22760ef-1595-4a59-974a-2a4dbb3a1386, [], {"type":"struct","fields":[{"name":"id","type":"integer","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"raw","type":"variant","nullable":true,"metadata":{}}]}}|NULL                                                |
|NULL                                                                                                                                                                                                                                                                                           |NULL                                                                                                                                                                                                                                                                                                  |NULL                                                                                                                                                                                                                                                                                       |{3, 7, [variantType-preview], [variantType-preview]}|
|{true, 1719567376000, part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet, 3943, {"numRecords":1,"minValues":{"id":1,"name":"abc"},"maxValues":{"id":1,"name":"abc"},"nullCount":{"id":0,"name":0,"raw":0}}, {1719567376000000, 1719567376000000, 1719567376000000, 268435456}}|NULL                                                                                                                                                                                                                                                                                                  |NULL                                                                                                                                                                                                                                                                                       |NULL                                                |
+---------------------------------------------------------------------------------------------



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Observations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Look at this stats section in the delta log:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

{"numRecords":1,"minValues":{"id":1,"name":"abc"},"maxValues":{"id":1,"name":"abc"},"nullCount":{"id":0,"name":0,"raw":0}}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Delta table gathered min and max value statistics only for the id and name columns, not for &lt;strong&gt;raw. This means a FILTER condition on a VARIANT column in a SELECT query shouldn't contribute to file skipping&lt;/strong&gt;. You would still depend on other non-complex columns for file-level data skipping.&lt;/p&gt;
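&lt;p&gt;The pruning rule can be stated as a tiny predicate over the captured stats. A sketch in plain Python, using the stats blob shown above:&lt;/p&gt;

```python
# Stats as captured in the first commit of the table.
stats = {
    "numRecords": 1,
    "minValues": {"id": 1, "name": "abc"},
    "maxValues": {"id": 1, "name": "abc"},
    "nullCount": {"id": 0, "name": 0, "raw": 0},
}

def has_min_max_stats(stats, column):
    # Min/max-based file skipping is only possible for columns whose
    # min/max values were captured; the VARIANT column "raw" was not.
    return column in stats["minValues"] and column in stats["maxValues"]
```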

&lt;p&gt;&lt;em&gt;What if we insert a NULL value instead? Will that contribute to data skipping?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbq4zv82ogcs2ayk1mwqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbq4zv82ogcs2ayk1mwqn.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The delta log now has version 00000000000000000001 available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh01eprcl89pcbghq7gt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh01eprcl89pcbghq7gt.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Content of 00000000000000000001.json file&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|add                                                                                                                                                                                                                                                                                            |commitInfo                                                                                                                                                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|NULL                                                                                                                                                                                                                                                                                           |{0628-070916-m4o60ack, Databricks-Runtime/15.3.x-scala2.12, true, WriteSerializable, {398134129842206}, WRITE, {1, 1036, 1}, {Append, [], false}, 0, {true, false}, 1719572454185, 1b7a5e88-9f4b-4c9e-8af3-39a1b808b5cc, 8684924100662862, abc@gmail.com}|
|{true, 1719572454000, part-00000-477f86d9-19d7-462d-ab1d-e7891348b2a3-c000.snappy.parquet, 1036, {"numRecords":1,"minValues":{"id":2,"name":"def"},"maxValues":{"id":2,"name":"def"},"nullCount":{"id":0,"name":0,"raw":1}}, {1719572454000000, 1719572454000000, 1719572454000000, 268435456}}|NULL                                                                                                                                                                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Look at the plan of this query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

select * from myschema.variant_data_exploration where raw is not null


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gz8yygogsdoz29cvbu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gz8yygogsdoz29cvbu1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here the number of files read = 1 and files pruned = 1, meaning the file from the second commit (the one whose raw values are all NULL) was skipped. &lt;strong&gt;So the NULL value did contribute to file skipping.&lt;/strong&gt; Why?&lt;/p&gt;

&lt;p&gt;Look at this section of the 00000000000000000001.json file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

"nullCount":{"id":0,"name":0,"raw":1}}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;During the second commit we inserted a row whose raw field is NULL, and Delta Lake captured that in the stats. That is why it can skip the file scan.&lt;/p&gt;
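&lt;p&gt;The skip decision for a "raw IS NOT NULL" filter reduces to comparing nullCount with numRecords. A sketch of the reasoning (not Delta's actual code):&lt;/p&gt;

```python
def can_skip_for_is_not_null(stats):
    # If every record in the file has raw set to NULL, no row can
    # satisfy "raw IS NOT NULL", so the whole file can be skipped.
    return stats["nullCount"]["raw"] == stats["numRecords"]

# Stats from the two commits above.
first_commit_stats = {"numRecords": 1, "nullCount": {"id": 0, "name": 0, "raw": 0}}
second_commit_stats = {"numRecords": 1, "nullCount": {"id": 0, "name": 0, "raw": 1}}
```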

&lt;p&gt;&lt;em&gt;What if we insert a {} value instead? Will that contribute to data skipping?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In many systems, NULL semi-structured data is persisted as {}, a kind of JSON representation of NULL. Let's see how VARIANT responds to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyccasqa9k1ul0quflu0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyccasqa9k1ul0quflu0b.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Content of 00000000000000000002.json file&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|add                                                                                                                                                                                                                                                                                            |commitInfo                                                                                                                                                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|NULL                                                                                                                                                                                                                                                                                           |{0628-070916-m4o60ack, Databricks-Runtime/15.3.x-scala2.12, true, WriteSerializable, {398134129842206}, WRITE, {1, 1102, 1}, {Append, [], false}, 1, {true, false}, 1719573911083, 6a04ec9a-9a61-4875-a47d-7d26d14877cf, 8684924100662862, abc@gmail.com}|
|{true, 1719573911000, part-00000-4d9aa00d-2c82-4e96-b15c-16cba8b374a4-c000.snappy.parquet, 1102, {"numRecords":1,"minValues":{"id":3,"name":"ghi"},"maxValues":{"id":3,"name":"ghi"},"nullCount":{"id":0,"name":0,"raw":0}}, {1719573911000000, 1719573911000000, 1719573911000000, 268435456}}|NULL                                                                                                                                                                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As the section below shows, Delta Lake did not count {} as a NULL value:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

"nullCount":{"id":0,"name":0,"raw":0}}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So if we run a query like the one below, it will scan two files (from the 1st and 3rd commits). See the explain plan below.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

select * from myschema.variant_data_exploration where raw:fb::string ='abc';


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
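&lt;p&gt;Since no statistics exist for fields inside raw, the only files Delta can rule out for this query are those where raw is entirely NULL. The three commits' stats make that concrete (a sketch, with only the relevant fields):&lt;/p&gt;

```python
def must_scan_for_variant_field_filter(stats):
    # With no stats for fields inside a VARIANT, a file can be skipped
    # only when the raw column itself is NULL in every record.
    return stats["nullCount"]["raw"] != stats["numRecords"]

# Simplified stats for the table's three commits.
commit_stats = [
    {"numRecords": 1, "nullCount": {"raw": 0}},  # commit 0: real JSON object
    {"numRecords": 1, "nullCount": {"raw": 1}},  # commit 1: NULL
    {"numRecords": 1, "nullCount": {"raw": 0}},  # commit 2: {}
]
files_scanned = [s for s in commit_stats if must_scan_for_variant_field_filter(s)]
```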

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4a5krmof14rfwwo8yeu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4a5krmof14rfwwo8yeu.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;VARIANT provides great flexibility for storing semi-structured data.&lt;/li&gt;
&lt;li&gt;For the VARIANT data type, file pruning is possible only if missing data is stored as NULL, not as {} or any other placeholder.&lt;/li&gt;
&lt;li&gt;Since Delta Lake doesn't capture statistics for the internal fields of a VARIANT column, querying them loads all the Parquet files (within a partition, for a partitioned table) that contain non-NULL variant data.&lt;/li&gt;
&lt;li&gt;When modelling the Delta table, if your ETL pushes only non-NULL values into the VARIANT column, you can keep those columns outside the first 32 columns; performance is expected to be the same.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Things to Follow-up
&lt;/h2&gt;

&lt;p&gt;A few years back, I was debugging a VARIANT column performance issue in Snowflake. At the time, either the Snowflake team or a Snowflake forum (I can't remember exactly which) claimed that Snowflake persists VARIANT data in a &lt;strong&gt;very flattened format&lt;/strong&gt; in its storage, and that it gathers statistics for all the internal fields of a variant. That would make Snowflake unique: no performance difference between querying a normal column and an attribute within a VARIANT column. I'm not sure whether that still holds today. Need to follow up :)&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>datalake</category>
    </item>
  </channel>
</rss>
