<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Debashis Adak</title>
    <description>The latest articles on DEV Community by Debashis Adak (@dadak5).</description>
    <link>https://dev.to/dadak5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1547320%2F8aa695ce-d17d-4e34-aac0-a6dd51748aaf.png</url>
      <title>DEV Community: Debashis Adak</title>
      <link>https://dev.to/dadak5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dadak5"/>
    <language>en</language>
    <item>
      <title>From Prompt to Pixel: My First Hands-On with Stable Diffusion XL</title>
      <dc:creator>Debashis Adak</dc:creator>
      <pubDate>Sun, 27 Jul 2025 09:04:51 +0000</pubDate>
      <link>https://dev.to/dadak5/from-prompt-to-pixel-my-first-hands-on-with-stable-diffusion-xl-2gjj</link>
      <guid>https://dev.to/dadak5/from-prompt-to-pixel-my-first-hands-on-with-stable-diffusion-xl-2gjj</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The Setup: Building a Reproducible SDXL Pipeline&lt;/li&gt;
&lt;li&gt;The Prompt&lt;/li&gt;
&lt;li&gt;Why This Was Exciting&lt;/li&gt;
&lt;li&gt;The Output&lt;/li&gt;
&lt;li&gt;Code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a while now, I’ve been fascinated by the idea of creating high-quality images purely from text. The growing capabilities of generative models like Stable Diffusion had me curious — how exactly does one go from a simple sentence to a detailed piece of art?&lt;/p&gt;

&lt;p&gt;This weekend, I decided to roll up my sleeves and try it out.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🔧 The Setup: Building a Reproducible SDXL Pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I started by creating a Python notebook on Google Colab, designed to be reproducible, efficient, and GPU-aware. Here's what the pipeline does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mounts Google Drive to store models and outputs persistently&lt;/li&gt;
&lt;li&gt;Sets up local caching directories for huggingface_hub, diffusers, and PyTorch wheels&lt;/li&gt;
&lt;li&gt;Downloads required packages (like diffusers, transformers, accelerate) ahead of time into a local wheelhouse for fast, dependency-safe installs&lt;/li&gt;
&lt;li&gt;Detects GPU availability and automatically selects:
&lt;ul&gt;
&lt;li&gt;stabilityai/stable-diffusion-xl-base-1.0 (SDXL) if CUDA is available&lt;/li&gt;
&lt;li&gt;runwayml/stable-diffusion-v1-5 as a fallback on CPU (work in progress)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Initializes a DiffusionPipeline with memory-efficient options like vae_slicing and attention_slicing&lt;/li&gt;
&lt;li&gt;Generates an image using a carefully designed prompt and saves it to Drive&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🖼️ The Prompt&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt = (
    "Ultra-detailed portrait of a mysterious traveler walking through "
    "a neon-lit cyberpunk city at night, reflective puddles, cinematic "
    "lighting, intricate textures, hyper-realistic, depth of field"
)
neg = "blurry, distorted, watermark, text, extra limbs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💡 Why This Was Exciting&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While tools like Midjourney and DALL·E abstract away the complexity, building your own Stable Diffusion pipeline gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full control over versions, dependencies, and parameters&lt;/li&gt;
&lt;li&gt;A better understanding of how diffusion models work under the hood&lt;/li&gt;
&lt;li&gt;A great foundation for future projects like fine-tuning, LoRA, or prompt chaining&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🧠 What I Learned&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SDXL is powerful — but managing model size and inference speed is key&lt;/li&gt;
&lt;li&gt;Prompt design is half the battle — you need to balance creativity with CLIP token limits (~77 tokens)&lt;/li&gt;
&lt;li&gt;Using Hugging Face + Google Drive makes it easy to cache, persist, and share your experiments&lt;/li&gt;
&lt;/ul&gt;
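The token-limit point above can be checked cheaply before burning GPU time. A minimal sketch, assuming whitespace word count as a rough stand-in for CLIP's actual BPE token count (the real tokenizer usually yields somewhat more tokens than words):

```python
# Crude prompt-length sanity check: CLIP truncates prompts at ~77 tokens.
# Word count is only a rough lower bound on the BPE token count, but it
# catches obviously oversized prompts before any model is loaded.
prompt = ("Ultra-detailed portrait of a mysterious traveler walking through "
          "a neon-lit cyberpunk city at night, reflective puddles, cinematic "
          "lighting, intricate textures, hyper-realistic, depth of field")

words = prompt.split()
# keep at most 77 words, mimicking what CLIP truncation would do
kept = words[:77]
print(f"~{len(words)} words; CLIP keeps roughly the first 77 tokens")
assert len(kept) == len(words)  # this prompt fits comfortably
```

If the assertion fires, the tail of the prompt is likely being silently dropped by the tokenizer, which is worth knowing before tuning wording.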

&lt;h2&gt;
  
  
  &lt;strong&gt;💡 The Output&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Image - 1&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkqxv9aw0jk5ehn9hzlk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkqxv9aw0jk5ehn9hzlk.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💡 Notebook Code&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ╔═══════════════════════════════════════════════════════════════╗
# ║  0.  MOUNT DRIVE + PREP CACHE FOLDERS                         ║
# ╚═══════════════════════════════════════════════════════════════╝
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

import os, pathlib, subprocess, textwrap

DRIVE_ROOT = "/content/drive/MyDrive/Study/ai_models"   # 👈 change if you like
HF_CACHE   = f"{DRIVE_ROOT}/hf_cache"
WHEELHOUSE = f"{DRIVE_ROOT}/wheelhouse"

for p in (HF_CACHE, WHEELHOUSE):
    pathlib.Path(p).mkdir(parents=True, exist_ok=True)

# Redirect every 🤗 cache to Drive
os.environ["HF_HOME"]               = HF_CACHE
os.environ["HUGGINGFACE_HUB_CACHE"] = HF_CACHE
os.environ["TRANSFORMERS_CACHE"]    = HF_CACHE
os.environ["DIFFUSERS_CACHE"]       = HF_CACHE

# ╔═══════════════════════════════════════════════════════════════╗
# ║  1.  DEFINE EXACT VERSIONS                                    ║
# ╚═══════════════════════════════════════════════════════════════╝
TORCH_WHEEL = "torch==2.6.0+cu124"           # matches Colab's CUDA 12.4 tool-chain
EXTRAS      = [
    "torchvision==0.21.0+cu124",
    "torchaudio==2.6.0+cu124",
]
PKGS = [                                      # all installed with --no-deps later
    "diffusers==0.27.2",
    "transformers==4.34.0",
    "accelerate==1.9.0",
    "huggingface_hub==0.24.1",                # provides cached_download
    "safetensors==0.5.3",
    "invisible_watermark==0.2.0",
    "tokenizers==0.14.1"
]

# ╔═══════════════════════════════════════════════════════════════╗
# ║  2.  DOWNLOAD WHEELS ONCE (SKIPPED IF THEY EXIST)             ║
# ╚═══════════════════════════════════════════════════════════════╝
def wheel_present(name_version: str) -&amp;gt; bool:
    name, ver = name_version.split("==")
    return bool(list(pathlib.Path(WHEELHOUSE).glob(f"{name}-{ver}*.whl")))

# Torch + friends come from the PyTorch wheel index
if not wheel_present("torch==2.6.0+cu124"):
    !pip download $TORCH_WHEEL {" ".join(EXTRAS)} -d "$WHEELHOUSE" \
        --index-url https://download.pytorch.org/whl/cu124

# Other packages from PyPI
for pkg in PKGS:
    if not wheel_present(pkg):
        !pip download $pkg -d "$WHEELHOUSE"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ╔═══════════════════════════════════════════════════════════════╗
# ║  3.  INSTALL FROM LOCAL WHEELS ONLY                           ║
# ╚═══════════════════════════════════════════════════════════════╝
# 3-a  Torch first (allows ABI-compatible extras)
!pip install --quiet --no-index --find-links="$WHEELHOUSE" \
    $TORCH_WHEEL {" ".join(EXTRAS)}

# 3-b  Everything else, but **--no-deps** so nothing tries to upgrade torch
!pip install --quiet --no-index --find-links="$WHEELHOUSE" --no-deps \
    diffusers==0.27.2 transformers==4.34.0 accelerate==1.9.0 \
    huggingface_hub==0.24.1 tokenizers==0.14.1 \
    safetensors==0.5.3 invisible_watermark==0.2.0

!pip uninstall -y peft
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ╔══════════════════════════════════════════════════════════════════╗
# ║  1.  LOAD A PIPELINE – SDXL on GPU, SD-v1.5 on CPU fallback     ║
# ╚══════════════════════════════════════════════════════════════════╝
import importlib.metadata, torch
print("torch        :", torch.__version__)
print("diffusers    :", importlib.metadata.version("diffusers"))
print("hub          :", importlib.metadata.version("huggingface_hub"))
# should show a consistent trio matching the pins above, e.g.:
# torch 2.6.0+cu124 | diffusers 0.27.2 | hub 0.24.1

from diffusers import DiffusionPipeline

has_cuda = torch.cuda.is_available()
if has_cuda:
    MODEL_ID  = "stabilityai/stable-diffusion-xl-base-1.0"
    DTYPE     = torch.float16
    DEVICE    = "cuda"
    LOAD_KW   = {}                      # GPU loads everything
else:
    MODEL_ID  = "runwayml/stable-diffusion-v1-5"
    DTYPE     = torch.float32
    DEVICE    = "cpu"
    LOAD_KW   = dict(device_map="balanced", max_memory={"cpu": "10GiB"})

pipe = DiffusionPipeline.from_pretrained(
    MODEL_ID,
    torch_dtype=DTYPE,
    use_safetensors=True,
    **LOAD_KW,
).to(DEVICE)

# VRAM / RAM savers
if has_cuda:
    #pipe.enable_xformers_memory_efficient_attention()
    pipe.enable_vae_slicing()
else:
    pipe.enable_attention_slicing()
    pipe.enable_vae_slicing()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ╔══════════════════════════════════════════════════════════════════╗
# ║  2.  GENERATE A TEST IMAGE                                       ║
# ╚══════════════════════════════════════════════════════════════════╝
prompt = ("Ultra-detailed portrait of a mysterious traveler walking through "
          "a neon-lit cyberpunk city at night, reflective puddles, cinematic "
          "lighting, intricate textures, hyper-realistic, depth of field")
neg    = "blurry, distorted, watermark, text, extra limbs"

height = width = 1024 if has_cuda else 512
steps  = 30    if has_cuda else 20

image = pipe(
    prompt              = prompt,
    negative_prompt     = neg,
    height              = height,
    width               = width,
    num_inference_steps = steps,
    guidance_scale      = 7.5,
    generator           = torch.Generator(DEVICE).manual_seed(42),
).images[0]

# ╔══════════════════════════════════════════════════════════════════╗
# ║  3.  SAVE TO DRIVE                                               ║
# ╚══════════════════════════════════════════════════════════════════╝
out_dir  = pathlib.Path(DRIVE_ROOT) / "outputs"
out_dir.mkdir(parents=True, exist_ok=True)
fname    = out_dir / f"traveler_{'sdxl' if has_cuda else 'v15'}.png"
image.save(fname)

print(f"✅ Render complete → {fname}")
image   # displays inline in Colab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>stablediffusion</category>
      <category>learning</category>
      <category>python</category>
    </item>
    <item>
      <title>Pulumi-Day1- Getting Started</title>
      <dc:creator>Debashis Adak</dc:creator>
      <pubDate>Mon, 01 Jul 2024 10:20:21 +0000</pubDate>
      <link>https://dev.to/dadak5/pulumi-day1-getting-started-1ieg</link>
      <guid>https://dev.to/dadak5/pulumi-day1-getting-started-1ieg</guid>
      <description>&lt;p&gt;Hi All,&lt;/p&gt;

&lt;p&gt;I am starting to learn Pulumi. Pulumi's infrastructure-as-code SDK helps you create, deploy, and manage AWS containers, serverless functions, and infrastructure using programming languages like TypeScript, Python, Go, C#, and Java, and markup languages like YAML. The Pulumi AWS provider packages and CLI help you accomplish all of this within minutes.&lt;/p&gt;

&lt;p&gt;Docs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.pulumi.com/docs/clouds/aws/get-started/create-project/"&gt;https://www.pulumi.com/docs/clouds/aws/get-started/create-project/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pulumi.com/docs/clouds/aws/get-started/begin/"&gt;https://www.pulumi.com/docs/clouds/aws/get-started/begin/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pulumi.com/docs/clouds/aws/get-started/create-project/"&gt;https://www.pulumi.com/docs/clouds/aws/get-started/create-project/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scope of this blog is just to check how a Pulumi project works. It's a learning journey of infrastructure as code with AWS.&lt;/p&gt;

&lt;p&gt;I selected Pulumi for infrastructure as code because it provides programming-language flexibility (which means you can write test cases) &amp;amp; integrates with many cloud providers (AWS, Azure, GCP, etc.).&lt;/p&gt;

&lt;p&gt;As I am comfortable with Python, it will be my language of choice.&lt;br&gt;
Also, I am using my personal Windows machine.&lt;/p&gt;
&lt;h2&gt;
  
  
  Environment Setup
&lt;/h2&gt;

&lt;p&gt;Following this link: &lt;a href="https://www.pulumi.com/docs/clouds/aws/get-started/begin/"&gt;https://www.pulumi.com/docs/clouds/aws/get-started/begin/&lt;/a&gt;&lt;br&gt;
As per the documentation, PowerShell is the preferred way to install Pulumi on Windows.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To install pulumi&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;choco install pulumi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To check the installed Pulumi version&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pulumi version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Version installed: v3.121.0&lt;/p&gt;

&lt;p&gt;You need to install Python or whichever language you are using. I already have Python installed on my system, so I don't need to install it again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure Pulumi to access your AWS account
&lt;/h2&gt;

&lt;p&gt;Following page: &lt;a href="https://www.pulumi.com/registry/packages/aws/installation-configuration/"&gt;https://www.pulumi.com/registry/packages/aws/installation-configuration/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pulumi requires cloud credentials to manage and provision resources. You must use an IAM user account that has programmatic access with rights to deploy and manage resources handled through Pulumi.&lt;/p&gt;

&lt;p&gt;I already have the AWS CLI installed on my system. For Pulumi, we need either a new or an existing IAM user with programmatic access.&lt;/p&gt;

&lt;p&gt;I followed the steps below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Created a new IAM user "pulumi-user"&lt;/li&gt;
&lt;li&gt;Gave AdministratorAccess to the user (note: admin access only for ease of experimentation)&lt;/li&gt;
&lt;li&gt;Added a "pulumi-dev" profile with its aws_access_key_id &amp;amp; aws_secret_access_key to ~/.aws/credentials (screenshot below)&lt;/li&gt;
&lt;li&gt;Configured Pulumi to use the profile from the terminal (command: pulumi config set aws:profile pulumi-dev)&lt;/li&gt;
&lt;/ol&gt;
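For reference, the ~/.aws/credentials entry from step 3 looks like the following (the key values here are placeholders, not real credentials):

```ini
# ~/.aws/credentials
[pulumi-dev]
aws_access_key_id     = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

The profile name in the section header is what "pulumi config set aws:profile pulumi-dev" refers to.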

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs40rcunmtfqpf768cm2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs40rcunmtfqpf768cm2n.png" alt="Image description" width="713" height="66"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create a Pulumi Project
&lt;/h2&gt;

&lt;p&gt;Reference link: &lt;a href="https://www.pulumi.com/docs/clouds/aws/get-started/create-project/"&gt;https://www.pulumi.com/docs/clouds/aws/get-started/create-project/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I created a GitLab repository called "pulumi", then created the Pulumi project "awsproj" following the steps given in the documentation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PS &amp;gt; pulumi new aws-python
This command will walk you through creating a new Pulumi project.

Enter a value or leave blank to accept the (default), and press &amp;lt;ENTER&amp;gt;.
Press ^C at any time to quit.

project name (project): awsproj
project description (A minimal AWS Python Pulumi program):
Created project 'awsproj'

stack name (dev):
Created stack 'dev'
Enter your passphrase to protect config/secrets:
Re-enter your passphrase to confirm:

The toolchain to use for installing dependencies and running the program pip
aws:region: The AWS region to deploy into (us-east-1):
Saved config

Installing dependencies...

Creating virtual environment...
Finished creating virtual environment
Updating pip, setuptools, and wheel in virtual environment...
Requirement already satisfied: pip in c:\users\debashis\git_projects\pulumi\venv\lib\site-packages (22.0.4)
Collecting pip
  Downloading pip-24.1.1-py3-none-any.whl (1.8 MB)
     ---------------------------------------- 1.8/1.8 MB 11.6 MB/s eta 0:00:00
Requirement already satisfied: setuptools in c:\users\debashis\git_projects\pulumi\venv\lib\site-packages (58.1.0)
Collecting setuptools
  Downloading setuptools-70.1.1-py3-none-any.whl (883 kB)
     ------------------------------------- 883.3/883.3 KB 28.2 MB/s eta 0:00:00
Collecting wheel
  Using cached wheel-0.43.0-py3-none-any.whl (65 kB)
Installing collected packages: wheel, setuptools, pip
  Attempting uninstall: setuptools
    Found existing installation: setuptools 58.1.0
    Uninstalling setuptools-58.1.0:
      Successfully uninstalled setuptools-58.1.0
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-24.1.1 setuptools-70.1.1 wheel-0.43.0
Finished updating
Installing dependencies in virtual environment...
Collecting pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0 (from -r requirements.txt (line 1))
  Downloading pulumi-3.121.0-py3-none-any.whl.metadata (11 kB)
Collecting pulumi-aws&amp;lt;7.0.0,&amp;gt;=6.0.2 (from -r requirements.txt (line 2))
  Downloading pulumi_aws-6.42.1-py3-none-any.whl.metadata (8.4 kB)
Collecting protobuf~=4.21 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Downloading protobuf-4.25.3-cp310-abi3-win_amd64.whl.metadata (541 bytes)
Collecting grpcio~=1.60.1 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Downloading grpcio-1.60.1-cp310-cp310-win_amd64.whl.metadata (4.2 kB)
Collecting dill~=0.3 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting six~=1.12 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Downloading six-1.16.0-py2.py3-none-any.whl.metadata (1.8 kB)
Collecting semver~=2.13 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Downloading semver-2.13.0-py2.py3-none-any.whl.metadata (5.0 kB)
Collecting pyyaml~=6.0 (from pulumi&amp;lt;4.0.0,&amp;gt;=3.0.0-&amp;gt;-r requirements.txt (line 1))
  Using cached PyYAML-6.0.1-cp310-cp310-win_amd64.whl.metadata (2.1 kB)
Collecting parver&amp;gt;=0.2.1 (from pulumi-aws&amp;lt;7.0.0,&amp;gt;=6.0.2-&amp;gt;-r requirements.txt (line 2))
  Downloading parver-0.5-py3-none-any.whl.metadata (2.7 kB)
Collecting typing-extensions&amp;gt;=4.11 (from pulumi-aws&amp;lt;7.0.0,&amp;gt;=6.0.2-&amp;gt;-r requirements.txt (line 2))
  Downloading typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
Collecting arpeggio&amp;gt;=1.7 (from parver&amp;gt;=0.2.1-&amp;gt;pulumi-aws&amp;lt;7.0.0,&amp;gt;=6.0.2-&amp;gt;-r requirements.txt (line 2))
  Downloading Arpeggio-2.0.2-py2.py3-none-any.whl.metadata (2.4 kB)
Collecting attrs&amp;gt;=19.2 (from parver&amp;gt;=0.2.1-&amp;gt;pulumi-aws&amp;lt;7.0.0,&amp;gt;=6.0.2-&amp;gt;-r requirements.txt (line 2))
  Downloading attrs-23.2.0-py3-none-any.whl.metadata (9.5 kB)
Downloading pulumi-3.121.0-py3-none-any.whl (263 kB)
   ---------------------------------------- 263.4/263.4 kB 5.5 MB/s eta 0:00:00
Downloading pulumi_aws-6.42.1-py3-none-any.whl (9.3 MB)
   ---------------------------------------- 9.3/9.3 MB 14.8 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ---------------------------------------- 116.3/116.3 kB 6.6 MB/s eta 0:00:00
Downloading grpcio-1.60.1-cp310-cp310-win_amd64.whl (3.7 MB)
   ---------------------------------------- 3.7/3.7 MB 3.2 MB/s eta 0:00:00
Downloading parver-0.5-py3-none-any.whl (15 kB)
Downloading protobuf-4.25.3-cp310-abi3-win_amd64.whl (413 kB)
   ---------------------------------------- 413.4/413.4 kB ? eta 0:00:00
Using cached PyYAML-6.0.1-cp310-cp310-win_amd64.whl (145 kB)
Downloading semver-2.13.0-py2.py3-none-any.whl (12 kB)
Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Downloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Downloading Arpeggio-2.0.2-py2.py3-none-any.whl (55 kB)
   ---------------------------------------- 55.3/55.3 kB ? eta 0:00:00
Downloading attrs-23.2.0-py3-none-any.whl (60 kB)
   ---------------------------------------- 60.8/60.8 kB ? eta 0:00:00
Installing collected packages: arpeggio, typing-extensions, six, semver, pyyaml, protobuf, grpcio, dill, attrs, pulumi, parver, pulumi-aws
Successfully installed arpeggio-2.0.2 attrs-23.2.0 dill-0.3.8 grpcio-1.60.1 parver-0.5 protobuf-4.25.3 pulumi-3.121.0 pulumi-aws-6.42.1 pyyaml-6.0.1 semver-2.13.0 six-1.16.0 typing-extensions-4.12.2
Finished installing dependencies
Finished installing dependencies

Your new project is ready to go!

To perform an initial deployment, run `pulumi up`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, it created a virtual environment "venv".&lt;/p&gt;

&lt;p&gt;Note: I used pip as the package installer, as I ran into errors while using Poetry.&lt;/p&gt;

&lt;p&gt;What does the project structure look like?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d-----        01-07-2024     14:29                venv
-a----        01-07-2024     14:29             14 .gitignore
-a----        01-07-2024     14:29            117 Pulumi.dev.yaml                                                                                                      
-a----        01-07-2024     14:29            206 Pulumi.yaml
-a----        01-07-2024     14:29             48 requirements.txt
-a----        01-07-2024     14:29            229 __main__.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Contents inside requirements.txt&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pulumi&amp;gt;=3.0.0,&amp;lt;4.0.0
pulumi-aws&amp;gt;=6.0.2,&amp;lt;7.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;u&gt;Activate the Virtual Environment&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To activate the virtual environment&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.\venv\Scripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check pip dependencies&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip freeze

Arpeggio==2.0.2
attrs==23.2.0
dill==0.3.8
grpcio==1.60.1
parver==0.5
protobuf==4.25.3
pulumi==3.121.0
pulumi_aws==6.42.1
PyYAML==6.0.1
semver==2.13.0
six==1.16.0
typing_extensions==4.12.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running a Pulumi Project
&lt;/h2&gt;

&lt;p&gt;When you create a Pulumi project, it generates a __main__.py by default.&lt;/p&gt;

&lt;p&gt;Content of __main__.py&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""An AWS Python Pulumi program"""

import pulumi
from pulumi_aws import s3

# Create an AWS resource (S3 Bucket)
bucket = s3.Bucket('my-bucket')

# Export the name of the bucket
pulumi.export('bucket_name', bucket.id)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It means if we run this script, it should create an S3 bucket with the prefix "my-bucket".&lt;/p&gt;
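The word "prefix" matters here: Pulumi auto-names resources by appending a random suffix to the logical name, so the physical bucket name will not be exactly "my-bucket". A small Python sketch that mimics the idea (an illustration only, not Pulumi's actual implementation):

```python
import secrets

def physical_name(logical_name):
    # Pulumi-style auto-naming: logical name plus a short random suffix,
    # so repeated deployments never collide on globally unique S3 names.
    return f"{logical_name}-{secrets.token_hex(3)}"

name = physical_name("my-bucket")
assert name.startswith("my-bucket-")
```

This is why the bucket visible in the AWS console carries a suffix after "my-bucket".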

&lt;p&gt;Let's run it.&lt;br&gt;
Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pulumi up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Screenshot of the run&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6h6stwhc5w7yuupm7sq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6h6stwhc5w7yuupm7sq0.png" alt="Image description" width="761" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, it created an S3 bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq55ns2g0blqe866w9uzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq55ns2g0blqe866w9uzx.png" alt="Image description" width="800" height="40"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Destroying Resources
&lt;/h2&gt;

&lt;p&gt;Command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pulumi destroy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oc1psw7w80ud3a78yvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oc1psw7w80ud3a78yvr.png" alt="Image description" width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the bucket was destroyed.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>python</category>
      <category>aws</category>
      <category>pulumi</category>
    </item>
    <item>
      <title>Databricks - Variant Type Analysis</title>
      <dc:creator>Debashis Adak</dc:creator>
      <pubDate>Sat, 29 Jun 2024 12:34:57 +0000</pubDate>
      <link>https://dev.to/dadak5/databricks-variant-type-analysis-1bh1</link>
      <guid>https://dev.to/dadak5/databricks-variant-type-analysis-1bh1</guid>
      <description>&lt;p&gt;The VARIANT data type is a recent introduction in Databricks &lt;strong&gt;(available in Databricks Runtime 15.3 and above)&lt;/strong&gt; designed specifically for handling semi-structured data. It offers an efficient and flexible way to store and process this kind of data, which often has a dynamic or evolving schema.&lt;/p&gt;

&lt;p&gt;Here's a quick rundown of its key features (As per the documentation of Databricks):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Flexibility: VARIANT can store various data structures within a single column, including structs, arrays, maps, and scalars. This eliminates the need for pre-defined schemas, making it adaptable to changing data formats.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance: Compared to storing semi-structured data as JSON strings, VARIANT offers significant performance improvements. This is because VARIANT uses a binary encoding scheme for data representation, allowing for faster processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ease of Use: Databricks recommends VARIANT as the preferred choice over JSON strings for working with semi-structured data. It provides familiar syntax for querying fields and elements within the VARIANT column using dot notation and array indexing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, the VARIANT data type streamlines working with semi-structured data in Databricks, enhancing flexibility, performance, and ease of use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowflake&lt;/strong&gt; has long offered the VARIANT data type, allowing you to store semi-structured data without pre-defining a schema. This eliminates the burden of schema design upfront.&lt;/p&gt;

&lt;p&gt;In contrast, for Delta Lake previously we relied on the MAP data type, which requires a defined schema. However, semi-structured data often exhibits schema variations across rows, creating challenges for data engineers. Parsing the data correctly before storage was a necessary but tedious step.&lt;/p&gt;
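To make the schema-variation point concrete, here is a plain-Python illustration (using the standard json module as a stand-in for the engine) of rows whose shapes differ from one record to the next; this is exactly the data VARIANT is built to hold without a pre-defined schema:

```python
import json

# Two rows of semi-structured data with different shapes: one has a
# "tags" array, the other a nested "extra" object. A rigid MAP schema
# struggles here, while VARIANT stores both rows in a single column.
rows = [
    '{"id": 1, "tags": ["a", "b"]}',
    '{"id": 2, "extra": {"nested": true}}',
]
parsed = [json.loads(r) for r in rows]

assert parsed[0]["tags"] == ["a", "b"]
assert parsed[1]["extra"]["nested"] is True
```

VARIANT gives you the same per-row flexibility inside Delta Lake, with dot-notation and array-indexing access instead of ad-hoc string parsing.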

&lt;p&gt;In this exploration, I'll try to uncover the VARIANT data type in Databricks and its underlying mechanisms.&lt;/p&gt;

&lt;p&gt;Some important Databricks documentation links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/delta/variant.html" rel="noopener noreferrer"&gt;https://docs.databricks.com/en/delta/variant.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/semi-structured/variant.html" rel="noopener noreferrer"&gt;https://docs.databricks.com/en/semi-structured/variant.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.databricks.com/en/semi-structured/variant-json-diff.html" rel="noopener noreferrer"&gt;https://docs.databricks.com/en/semi-structured/variant-json-diff.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2440252792644019/398134129842206/8684924100662862/latest.html" rel="noopener noreferrer"&gt;https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2440252792644019/398134129842206/8684924100662862/latest.html&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps To Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Provision a Databricks cluster with runtime 15.3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I created a Test Cluster with runtime 15.3 Beta version (Apache Spark 3.5.0, Scala: 2.12)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmb1okjxfq5hrd0uj7zz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmb1okjxfq5hrd0uj7zz5.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Verify that your notebook is running on 15.3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvnkbv5mxwd85263koot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbvnkbv5mxwd85263koot.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Create a schema&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, we create the schema (you can use your own). Once created, it should appear in the catalog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh32r5eeaimkmjswi95og.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh32r5eeaimkmjswi95og.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Verify that the parse_json function is present (it is available from runtime 15.3 onwards)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As per the documentation (&lt;a href="https://docs.databricks.com/en/semi-structured/variant.html" rel="noopener noreferrer"&gt;https://docs.databricks.com/en/semi-structured/variant.html&lt;/a&gt;), you can use the parse_json function to parse a column's JSON data. It validates that the incoming string is well-formed JSON, and if you create the table with CREATE TABLE ... AS SELECT, the resulting column gets the VARIANT data type.&lt;/p&gt;
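&lt;p&gt;The validate-or-fail behavior can be sketched outside Databricks with Python's stdlib json module. This is only an analogy, not parse_json itself:&lt;/p&gt;

```python
import json

# Rough stand-in for parse_json's validation behavior: well-formed JSON
# is accepted, anything else raises an error (parse_json similarly fails
# on malformed input). Illustration only, not Databricks code.
def parse_json_like(raw):
    return json.loads(raw)

doc = parse_json_like('{"fb": "abc", "insta": "xyz"}')

try:
    parse_json_like("definitely not json")
    rejected = False
except ValueError:
    rejected = True
```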

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmwnwe1x7btohr7brpvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmwnwe1x7btohr7brpvs.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Create a table&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, we create a table variant_data_exploration in the schema myschema by parsing a JSON object. As per the query, it creates three columns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;id: int&lt;/li&gt;
&lt;li&gt;name: string&lt;/li&gt;
&lt;li&gt;raw: variant&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0ph9be1s62l3rjbj2oi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0ph9be1s62l3rjbj2oi.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Table Schema Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As you can see below, under the # Delta Statistics Columns section:&lt;/p&gt;

&lt;p&gt;Column Names: id, name, raw&lt;/p&gt;

&lt;p&gt;Column Selection Method: first-32 (the default for a Delta Lake table: statistics are gathered for the first 32 columns)&lt;/p&gt;

&lt;p&gt;Location: dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqkfs1fhv8kuo083vyqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqkfs1fhv8kuo083vyqr.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 7: Verify Files in Table Location&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[
FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/', name='_delta_log/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet', name='part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet', size=3943, modificationTime=1719567376000)
]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gyfn29jv7qr3ol2m04q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gyfn29jv7qr3ol2m04q.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, there is a _delta_log directory (holding Delta table metadata and statistics files) and one Parquet file (part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet) containing a single row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 8: Verify Files in the _delta_log Location&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/00000000000000000000.crc', name='00000000000000000000.crc', size=2616, modificationTime=1719567381000),
 FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/00000000000000000000.json', name='00000000000000000000.json', size=1741, modificationTime=1719567377000),
 FileInfo(path='dbfs:/user/hive/warehouse/myschema.db/variant_data_exploration/_delta_log/_commits/', name='_commits/', size=0, modificationTime=0)]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0voadurwmzsrota0ydec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0voadurwmzsrota0ydec.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It mainly contains two file types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;00000000000000000000.json (holds column statistics and drives data pruning/file skipping; each commit creates a new JSON file with an incremented version number)&lt;/li&gt;
&lt;li&gt;00000000000000000000.crc (a checksum file that validates the table state for that version, guarding against corruption. Separately, every 10 commits the JSON log entries are compacted into a Parquet checkpoint file.)&lt;/li&gt;
&lt;/ol&gt;
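&lt;p&gt;To make the commit-file layout concrete: each line of a commit JSON is one action, and the add action embeds the per-file statistics as a nested JSON string. A minimal stdlib-Python sketch (the path and sizes are illustrative, only the shape mirrors what Delta writes):&lt;/p&gt;

```python
import json

# One simplified "add" action line from a _delta_log commit file.
# The path is illustrative; the stats shape mirrors Delta's output.
commit_line = json.dumps({
    "add": {
        "path": "part-00000-illustrative.snappy.parquet",
        "size": 3943,
        "stats": json.dumps({
            "numRecords": 1,
            "minValues": {"id": 1, "name": "abc"},
            "maxValues": {"id": 1, "name": "abc"},
            "nullCount": {"id": 0, "name": 0, "raw": 0},
        }),
    }
})

action = json.loads(commit_line)
# Note: stats arrive as a JSON string inside the JSON action,
# so they need a second parse.
file_stats = json.loads(action["add"]["stats"])
```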

&lt;h2&gt;
  
  
  Explore
&lt;/h2&gt;

&lt;p&gt;Let's look at the content of 00000000000000000000.json, since this is the main driver of data skipping and query performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk496s96n4i0u4dkp9ahr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk496s96n4i0u4dkp9ahr.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

|add                                                                                                                                                                                                                                                                                            |commitInfo                                                                                                                                                                                                                                                                                            |metaData                                                                                                                                                                                                                                                                                   |protocol                                            |
+---------------------------------------------------------------------------------------------
|NULL                                                                                                                                                                                                                                                                                           |{0628-070916-m4o60ack, Databricks-Runtime/15.3.x-scala2.12, false, WriteSerializable, {398134129842206}, CREATE OR REPLACE TABLE AS SELECT, {1, 3943, 1}, {[], NULL, true, [], {}, false}, {true, false}, 1719567376334, 472f6c8b-cd1d-4347-acd4-c49c2ebd8072, 8684924100662862, abc@gmail.com}|NULL                                                                                                                                                                                                                                                                                       |NULL                                                |
|NULL                                                                                                                                                                                                                                                                                           |NULL                                                                                                                                                                                                                                                                                                  |{1719567374807, {parquet}, c22760ef-1595-4a59-974a-2a4dbb3a1386, [], {"type":"struct","fields":[{"name":"id","type":"integer","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"raw","type":"variant","nullable":true,"metadata":{}}]}}|NULL                                                |
|NULL                                                                                                                                                                                                                                                                                           |NULL                                                                                                                                                                                                                                                                                                  |NULL                                                                                                                                                                                                                                                                                       |{3, 7, [variantType-preview], [variantType-preview]}|
|{true, 1719567376000, part-00000-603c8a87-dfdd-41a0-817d-9226cef0ab8a-c000.snappy.parquet, 3943, {"numRecords":1,"minValues":{"id":1,"name":"abc"},"maxValues":{"id":1,"name":"abc"},"nullCount":{"id":0,"name":0,"raw":0}}, {1719567376000000, 1719567376000000, 1719567376000000, 268435456}}|NULL                                                                                                                                                                                                                                                                                                  |NULL                                                                                                                                                                                                                                                                                       |NULL                                                |
+---------------------------------------------------------------------------------------------



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Observations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Look at this stats section in the delta log:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

{"numRecords":1,"minValues":{"id":1,"name":"abc"},"maxValues":{"id":1,"name":"abc"},"nullCount":{"id":0,"name":0,"raw":0}}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Delta table gathered min and max value statistics only for the id and name columns, not for &lt;strong&gt;raw. This means a FILTER condition on a VARIANT column in a SELECT query shouldn't contribute to file skipping&lt;/strong&gt;. You would still depend on other non-complex columns for file-level data skipping.&lt;/p&gt;
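&lt;p&gt;The pruning rule can be stated as a tiny predicate over the captured stats. A sketch in plain Python, using the stats blob shown above:&lt;/p&gt;

```python
# Stats as captured in the first commit of the table.
stats = {
    "numRecords": 1,
    "minValues": {"id": 1, "name": "abc"},
    "maxValues": {"id": 1, "name": "abc"},
    "nullCount": {"id": 0, "name": 0, "raw": 0},
}

def has_min_max_stats(stats, column):
    # Min/max-based file skipping is only possible for columns whose
    # min/max values were captured; the VARIANT column "raw" was not.
    return column in stats["minValues"] and column in stats["maxValues"]
```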

&lt;p&gt;&lt;em&gt;What if we insert a NULL value instead? Will that contribute to data skipping?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbq4zv82ogcs2ayk1mwqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbq4zv82ogcs2ayk1mwqn.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The delta log now has version 00000000000000000001 available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh01eprcl89pcbghq7gt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh01eprcl89pcbghq7gt.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Content of 00000000000000000001.json file&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|add                                                                                                                                                                                                                                                                                            |commitInfo                                                                                                                                                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|NULL                                                                                                                                                                                                                                                                                           |{0628-070916-m4o60ack, Databricks-Runtime/15.3.x-scala2.12, true, WriteSerializable, {398134129842206}, WRITE, {1, 1036, 1}, {Append, [], false}, 0, {true, false}, 1719572454185, 1b7a5e88-9f4b-4c9e-8af3-39a1b808b5cc, 8684924100662862, abc@gmail.com}|
|{true, 1719572454000, part-00000-477f86d9-19d7-462d-ab1d-e7891348b2a3-c000.snappy.parquet, 1036, {"numRecords":1,"minValues":{"id":2,"name":"def"},"maxValues":{"id":2,"name":"def"},"nullCount":{"id":0,"name":0,"raw":1}}, {1719572454000000, 1719572454000000, 1719572454000000, 268435456}}|NULL                                                                                                                                                                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Look at the plan of this query:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

select * from myschema.variant_data_exploration where raw is not null


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gz8yygogsdoz29cvbu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gz8yygogsdoz29cvbu1.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here the number of files read = 1 and files pruned = 1, meaning the file from the second commit (the one whose raw values are all NULL) was skipped. &lt;strong&gt;So the NULL value did contribute to file skipping.&lt;/strong&gt; Why?&lt;/p&gt;

&lt;p&gt;Look at this section of the 00000000000000000001.json file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

"nullCount":{"id":0,"name":0,"raw":1}}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;During the second commit we inserted a row whose raw field is NULL, and Delta Lake captured that in the stats. That is why it can skip the file scan.&lt;/p&gt;
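&lt;p&gt;The skip decision for a "raw IS NOT NULL" filter reduces to comparing nullCount with numRecords. A sketch of the reasoning (not Delta's actual code):&lt;/p&gt;

```python
def can_skip_for_is_not_null(stats):
    # If every record in the file has raw set to NULL, no row can
    # satisfy "raw IS NOT NULL", so the whole file can be skipped.
    return stats["nullCount"]["raw"] == stats["numRecords"]

# Stats from the two commits above.
first_commit_stats = {"numRecords": 1, "nullCount": {"id": 0, "name": 0, "raw": 0}}
second_commit_stats = {"numRecords": 1, "nullCount": {"id": 0, "name": 0, "raw": 1}}
```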

&lt;p&gt;&lt;em&gt;What if we insert a {} value instead? Will that contribute to data skipping?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In many systems, NULL semi-structured data is persisted as {}, a kind of JSON representation of NULL. Let's see how VARIANT responds to it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyccasqa9k1ul0quflu0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyccasqa9k1ul0quflu0b.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Content of 00000000000000000002.json file&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|add                                                                                                                                                                                                                                                                                            |commitInfo                                                                                                                                                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|NULL                                                                                                                                                                                                                                                                                           |{0628-070916-m4o60ack, Databricks-Runtime/15.3.x-scala2.12, true, WriteSerializable, {398134129842206}, WRITE, {1, 1102, 1}, {Append, [], false}, 1, {true, false}, 1719573911083, 6a04ec9a-9a61-4875-a47d-7d26d14877cf, 8684924100662862, abc@gmail.com}|
|{true, 1719573911000, part-00000-4d9aa00d-2c82-4e96-b15c-16cba8b374a4-c000.snappy.parquet, 1102, {"numRecords":1,"minValues":{"id":3,"name":"ghi"},"maxValues":{"id":3,"name":"ghi"},"nullCount":{"id":0,"name":0,"raw":0}}, {1719573911000000, 1719573911000000, 1719573911000000, 268435456}}|NULL                                                                                                                                                                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As the section below shows, Delta Lake did not count {} as a NULL value:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

"nullCount":{"id":0,"name":0,"raw":0}}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So if we run a query like the one below, it will scan two files (from the 1st and 3rd commits). See the explain plan below.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

select * from myschema.variant_data_exploration where raw:fb::string ='abc';


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
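&lt;p&gt;Since no statistics exist for fields inside raw, the only files Delta can rule out for this query are those where raw is entirely NULL. The three commits' stats make that concrete (a sketch, with only the relevant fields):&lt;/p&gt;

```python
def must_scan_for_variant_field_filter(stats):
    # With no stats for fields inside a VARIANT, a file can be skipped
    # only when the raw column itself is NULL in every record.
    return stats["nullCount"]["raw"] != stats["numRecords"]

# Simplified stats for the table's three commits.
commit_stats = [
    {"numRecords": 1, "nullCount": {"raw": 0}},  # commit 0: real JSON object
    {"numRecords": 1, "nullCount": {"raw": 1}},  # commit 1: NULL
    {"numRecords": 1, "nullCount": {"raw": 0}},  # commit 2: {}
]
files_scanned = [s for s in commit_stats if must_scan_for_variant_field_filter(s)]
```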

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4a5krmof14rfwwo8yeu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4a5krmof14rfwwo8yeu.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;VARIANT provides great flexibility for storing semi-structured data.&lt;/li&gt;
&lt;li&gt;For the VARIANT data type, file pruning is possible only if missing data is stored as NULL, not as {} or any other placeholder.&lt;/li&gt;
&lt;li&gt;Since Delta Lake doesn't capture statistics for the internal fields of a VARIANT column, querying them loads all the Parquet files (within a partition, for a partitioned table) that contain non-NULL variant data.&lt;/li&gt;
&lt;li&gt;When modelling the Delta table, if your ETL pushes only non-NULL values into the VARIANT column, you can keep those columns outside the first 32 columns; performance is expected to be the same.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Things to Follow-up
&lt;/h2&gt;

&lt;p&gt;A few years back, I was debugging a VARIANT column performance issue in Snowflake. At the time, either the Snowflake team or a Snowflake forum (I can't remember exactly which) claimed that Snowflake persists VARIANT data in a &lt;strong&gt;very flattened format&lt;/strong&gt; in its storage, and that it gathers statistics for all the internal fields of a variant. That would make Snowflake unique: no performance difference between querying a normal column and an attribute within a VARIANT column. I'm not sure whether that still holds today. Need to follow up :)&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>spark</category>
      <category>bigdata</category>
      <category>datalake</category>
    </item>
  </channel>
</rss>
