<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vishwa.P</title>
    <description>The latest articles on DEV Community by Vishwa.P (@vishwa_p_c4dc415de5).</description>
    <link>https://dev.to/vishwa_p_c4dc415de5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1551442%2F2be926ad-45eb-4afa-9a3c-60c1a813595d.png</url>
      <title>DEV Community: Vishwa.P</title>
      <link>https://dev.to/vishwa_p_c4dc415de5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vishwa_p_c4dc415de5"/>
    <language>en</language>
    <item>
      <title>Stop Copy-Pasting Notes: Building an AI-Powered Pipeline from Obsidian to Anki</title>
      <dc:creator>Vishwa.P</dc:creator>
      <pubDate>Fri, 20 Feb 2026 21:48:06 +0000</pubDate>
      <link>https://dev.to/vishwa_p_c4dc415de5/stop-copy-pasting-notes-building-an-ai-powered-pipeline-from-obsidian-to-anki-195f</link>
      <guid>https://dev.to/vishwa_p_c4dc415de5/stop-copy-pasting-notes-building-an-ai-powered-pipeline-from-obsidian-to-anki-195f</guid>
      <description>&lt;h2&gt;
  
  
  The Problem: The Friction of Remembering
&lt;/h2&gt;

&lt;p&gt;Taking detailed notes in Obsidian is fantastic for deep understanding and connecting concepts. But there's a huge problem: building long-term memory for certifications, new tools, or complex system architectures requires active recall and spaced repetition. Obsidian is a powerful "factory" for building knowledge, but it's a terrible "delivery truck" for rote memorization.&lt;/p&gt;

&lt;p&gt;Manually copying and pasting notes from Obsidian into a spaced repetition system (SRS) like Anki creates massive friction. You end up either abandoning your flashcards entirely or spending hours duplicating data instead of learning. I needed a way to capture atomic facts in my daily notes and have them automatically flow into my flashcard decks without leaving my editor.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup: The "Second Brain" Data Pipeline
&lt;/h2&gt;

&lt;p&gt;I decided to treat my personal knowledge system like a Data Engineering pipeline. I needed a reliable way to ingest "raw" daily journals, transform them into structured domain notes, and extract flashcards as data marts for Anki.&lt;/p&gt;

&lt;p&gt;The setup leverages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Obsidian&lt;/strong&gt;: Long-form connected notes.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Anki&lt;/strong&gt;: The gold standard for spaced repetition (using the FSRS scheduler).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Obsidian_to_Anki Plugin&lt;/strong&gt;: For syncing notes into decks.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Architecture: How It Works
&lt;/h2&gt;

&lt;p&gt;The architecture is built around three distinct stages—much like a medallion data architecture (Bronze, Silver, Gold).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Daily Journaling] --&amp;gt; [01_Raw Journal : Bronze]
                             |
                   +---------v---------+
                   | Automation Agent  |
                   | - YAML Parser     |
                   | - Routing Logic   |
                   +---------+---------+
                             |
         +-------------------+-------------------+
         | (move_to_brain)   | (move_to_mart)    | (audit)
         v                   v                   v
   [02_Brain : Silver]  [03_Mart : Gold]  [movement_log.md]
                             |
                             v
              [Obsidian_to_Anki Integration]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;01_Raw&lt;/strong&gt;: The ingestion layer where daily notes and raw thoughts are jotted down.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;02_Brain&lt;/strong&gt;: The processed, structured domain knowledge (e.g., &lt;code&gt;05_Warehousing&lt;/code&gt; or &lt;code&gt;09_AI&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;03_Mart&lt;/strong&gt;: Specifically extracted, atomic flashcards ready for Anki sync.&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  The Solution: Frontmatter Routing and CurlyCloze
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The "Secret Sauce": Frontmatter Automation
&lt;/h3&gt;

&lt;p&gt;The key to completely removing the manual work was utilizing YAML frontmatter to give the notes "state" and instruct a background pipeline where to send the content. &lt;/p&gt;

&lt;p&gt;By simply tagging my daily journal, the background automation parses the file and moves it strictly based on the taxonomy defined in the tags.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;05_Warehousing&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;Explain&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="na"&gt;move_journal_to_brain&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;move_brain_to_mart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;processed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When my automation runs, it searches for files where &lt;code&gt;processed: false&lt;/code&gt;. If &lt;code&gt;move_journal_to_brain&lt;/code&gt; is true, the script builds a proper page under the &lt;code&gt;02_Brain&lt;/code&gt; folder. If &lt;code&gt;move_brain_to_mart&lt;/code&gt; is true, it extracts the target text specifically for flashcard ingestion.&lt;/p&gt;
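
&lt;p&gt;To make that routing concrete, here is a minimal sketch of the pass described above. It assumes the &lt;code&gt;python-frontmatter&lt;/code&gt; library, and the helper name &lt;code&gt;route_note&lt;/code&gt; is mine for illustration -- the actual agent in the repo may differ:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import frontmatter  # pip install python-frontmatter
from pathlib import Path

VAULT = Path("~/SecondBrain").expanduser()

def route_note(path: Path) -&gt; None:
    """Move a note according to its YAML frontmatter flags."""
    note = frontmatter.load(str(path))
    if note.metadata.get("processed", True):
        return  # only files explicitly marked processed: false are touched
    tag = note.metadata.get("tags", ["inbox"])[0]  # e.g. "05_Warehousing"
    if note.metadata.get("move_journal_to_brain"):
        dest = VAULT / "02_Brain" / tag / path.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        path = path.rename(dest)
    if note.metadata.get("move_brain_to_mart"):
        mart = VAULT / "03_Mart" / tag / path.name
        mart.parent.mkdir(parents=True, exist_ok=True)
        # naive extraction: keep only the START ... END flashcard blocks
        blocks = [b for b in note.content.split("\n\n") if b.startswith("START")]
        mart.write_text("\n\n".join(blocks))
    note.metadata["processed"] = True
    path.write_text(frontmatter.dumps(note))

for md_file in (VAULT / "01_Raw").rglob("*.md"):
    route_note(md_file)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;movement_log.md&lt;/code&gt; audit step from the diagram is omitted here for brevity.&lt;/p&gt;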
&lt;h3&gt;
  
  
  Writing Flashcards: The CurlyCloze Syntax
&lt;/h3&gt;

&lt;p&gt;Instead of using a separate app to write Anki cards, I use the &lt;code&gt;CurlyCloze&lt;/code&gt; syntax directly in my markdown blocks and rely on the &lt;code&gt;Obsidian_to_Anki&lt;/code&gt; integration to parse them.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# My note on Snowflake&lt;/span&gt;

TARGET DECK: SecondBrain::05_Warehousing

START
Cloze
Snowflake micro-partitions are between {50 MB} and {500 MB} of uncompressed data.
END

START
Cloze
{Time Travel} allows access to historical data in Snowflake.
END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The automation moves these cloze blocks to &lt;code&gt;03_Mart&lt;/code&gt;, and the Anki plugin effortlessly scans the folder and updates my main database. Creating flashcards simply becomes a byproduct of formatting my daily notes!&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Threw Away (And Why)
&lt;/h2&gt;

&lt;p&gt;I tried these approaches first, but they failed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Obsidian SRS Plugins&lt;/strong&gt;: While functional, the mobile experience for Obsidian is clunky compared to AnkiMobile. Mixing note-taking UI with rapid-fire spaced repetition just diluted my cognitive focus.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Manual Staging Folders&lt;/strong&gt;: I initially had a "Needs Review" folder where I manually sorted notes. This required a human-in-the-loop and quickly became a bottleneck. Moving to a frontmatter-driven configuration solved this instantly.&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;p&gt;By treating my personal notes as a data engineering pipeline, I removed the friction from spaced repetition. Notes are written once, dynamically routed based on markdown metadata, and seamlessly delivered to my Anki decks for long-term retention. &lt;/p&gt;

&lt;p&gt;You can find the full repository structure, automation workflow logic, and taxonomy design here:&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/visuvishwa99" rel="noopener noreferrer"&gt;
        visuvishwa99
      &lt;/a&gt; / &lt;a href="https://github.com/visuvishwa99/second_brain" rel="noopener noreferrer"&gt;
        second_brain
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Issue&lt;/strong&gt;: Copying structured notes from Obsidian into Anki for spaced repetition is tedious and unsustainable.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Fix&lt;/strong&gt;: A metadata-driven pipeline that moves daily journals into structured domain folders and extracts cloze-deletion flashcards automatically using Obsidian-to-Anki.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Outcome&lt;/strong&gt;: Zero friction between learning a new concept and creating a testable flashcard, saving countless hours of manual data entry.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>obsidian</category>
      <category>anki</category>
      <category>productivity</category>
      <category>ai</category>
    </item>
    <item>
      <title>From Silent None to Insight: Debugging PySpark UDFs on AWS Glue with Decorators</title>
      <dc:creator>Vishwa.P</dc:creator>
      <pubDate>Fri, 20 Feb 2026 03:14:48 +0000</pubDate>
      <link>https://dev.to/vishwa_p_c4dc415de5/from-silent-none-to-insight-debugging-pyspark-udfs-on-aws-glue-with-decorators-1o7m</link>
      <guid>https://dev.to/vishwa_p_c4dc415de5/from-silent-none-to-insight-debugging-pyspark-udfs-on-aws-glue-with-decorators-1o7m</guid>
      <description>&lt;p&gt;Last month I was debugging a PySpark UDF that was silently returning &lt;code&gt;None&lt;/code&gt; for about 2% of rows in a 10-million-row dataset. No error. No exception. Just... &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I couldn't reproduce it locally because I didn't have the exact row that caused it. I couldn't add &lt;code&gt;print()&lt;/code&gt; statements because -- as I painfully discovered -- &lt;strong&gt;print() inside a UDF doesn't show up anywhere useful&lt;/strong&gt;. The output vanishes into executor logs that are buried three clicks deep in the Spark UI, if they exist at all.&lt;/p&gt;

&lt;p&gt;That frustration led me to build a small set of PySpark debugging decorators. Some of them turned out to be genuinely useful. Others taught me more about Spark's architecture than I expected. And the whole thing sent me down a rabbit hole about how AWS Glue's Docker image actually works under the hood.&lt;/p&gt;

&lt;p&gt;This post covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Three decorators I actually use in production debugging&lt;/li&gt;
&lt;li&gt;Why &lt;code&gt;print()&lt;/code&gt; inside a UDF doesn't do what you think&lt;/li&gt;
&lt;li&gt;How AWS Glue's local Docker environment works (Livy, Sparkmagic, and the stdout black hole)&lt;/li&gt;
&lt;li&gt;How to set up and test Glue jobs locally with Docker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's get into it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup: Testing Glue Jobs Locally with Docker
&lt;/h2&gt;

&lt;p&gt;Before we get to the decorators, let me explain the environment. AWS Glue provides a Docker image that lets you develop and test ETL jobs on your own machine without spinning up cloud resources. This is a massive time-saver -- no waiting for Glue job cold starts, no paying for dev endpoint hours.&lt;/p&gt;

&lt;p&gt;Here's how to get it running:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pull the image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Start the container
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Linux/Mac:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;WORKSPACE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/your/project

docker run &lt;span class="nt"&gt;-itd&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; glue_jupyter &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 8888:8888 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 4040:4040 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$WORKSPACE&lt;/span&gt;&lt;span class="s2"&gt;:/home/glue_user/workspace/jupyter_workspace"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.aws:/home/glue_user/.aws:ro"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;DISABLE_SSL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 &lt;span class="se"&gt;\&lt;/span&gt;
    /home/glue_user/jupyter/jupyter_start.sh &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--NotebookApp&lt;/span&gt;.token&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--NotebookApp&lt;/span&gt;.password&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Windows (Git Bash / PowerShell):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-itd&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; glue_jupyter &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 8888:8888 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 4040:4040 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"C:/your/project://home/glue_user/workspace/jupyter_workspace"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"C:/Users/YourName/.aws://home/glue_user/.aws:ro"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;DISABLE_SSL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 &lt;span class="se"&gt;\&lt;/span&gt;
    //home/glue_user/jupyter/jupyter_start.sh &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--NotebookApp&lt;/span&gt;.token&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--NotebookApp&lt;/span&gt;.password&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then open &lt;a href="http://localhost:8888" rel="noopener noreferrer"&gt;http://localhost:8888&lt;/a&gt; in your browser. Navigate to the &lt;code&gt;jupyter_workspace&lt;/code&gt; folder. Your local project files are right there, mounted into the container. Any changes you make are reflected on your host filesystem.&lt;/p&gt;

&lt;p&gt;You can monitor Spark jobs in real-time at &lt;a href="http://localhost:4040" rel="noopener noreferrer"&gt;http://localhost:4040&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're using VS Code, connect to the Jupyter server at &lt;code&gt;http://127.0.0.1:8888&lt;/code&gt; and select the PySpark kernel.&lt;/p&gt;


&lt;h2&gt;
  
  
  Wait, How Does This Actually Work? The Livy Architecture
&lt;/h2&gt;

&lt;p&gt;Here's something that tripped me up for hours and is worth understanding before we talk about decorators.&lt;/p&gt;

&lt;p&gt;When you run a PySpark cell in the Glue Docker notebook, your code &lt;strong&gt;does not run inside the Jupyter kernel process&lt;/strong&gt;. Here's what actually happens:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; [Your Browser]
       |
 [Jupyter Server]
       |
 [Sparkmagic Kernel]  -- sends your code over HTTP --&amp;gt;  [Livy Server :8998]
                                                              |
                                                       [Spark Driver JVM]
                                                              |
                                                       [Spark Executors]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Sparkmagic&lt;/strong&gt; is the Jupyter kernel. It doesn't run your Python code directly. Instead, it serializes your cell's code as a string and sends it via HTTP POST to &lt;strong&gt;Livy&lt;/strong&gt;, which is an open-source REST server for Spark.&lt;/p&gt;

&lt;p&gt;Livy creates a Spark session, executes your code in a separate JVM-hosted Python process, captures whatever gets written to stdout, and sends that text back to Sparkmagic, which displays it in your notebook cell.&lt;/p&gt;
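
&lt;p&gt;You can watch this round trip yourself. Here is a rough sketch of what Sparkmagic effectively does under the hood, assuming Livy's standard REST API on its default port 8998 (Sparkmagic's real payloads carry more fields, but the flow is the same):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import requests

LIVY = "http://localhost:8998"  # Livy's default port

# 1. One session per notebook: Livy spins up a Spark driver for it
#    (polling for the session to become idle is omitted here)
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()

# 2. Every cell is POSTed as a plain string of source code
stmt = requests.post(
    f"{LIVY}/sessions/{session['id']}/statements",
    json={"code": "print('hello from the driver')"},
).json()

# 3. Sparkmagic polls until the statement finishes, then renders whatever
#    Livy captured from the driver's top-level stdout
while stmt["state"] != "available":
    time.sleep(1)
    stmt = requests.get(
        f"{LIVY}/sessions/{session['id']}/statements/{stmt['id']}"
    ).json()

print(stmt["output"]["data"]["text/plain"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;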

&lt;p&gt;This architecture has a critical consequence: &lt;strong&gt;Livy only reliably captures stdout from the top-level cell execution.&lt;/strong&gt; If your code calls a function, which calls a decorator, which calls &lt;code&gt;print()&lt;/code&gt; three stack frames deep -- that output often gets lost in transit. Livy captures what Spark's driver process writes to stdout at the top level. Nested &lt;code&gt;print()&lt;/code&gt; calls inside wrapper functions? Hit or miss. Usually miss.&lt;/p&gt;

&lt;p&gt;This is why &lt;code&gt;df.show()&lt;/code&gt; works (Spark's JVM writes directly to stdout at the top level) but &lt;code&gt;print("hello")&lt;/code&gt; inside a decorator wrapper gets swallowed.&lt;/p&gt;

&lt;p&gt;Understanding this saved me from a lot of "why doesn't this work" frustration. It's not a bug. It's the architecture.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Decorators
&lt;/h2&gt;

&lt;p&gt;With that context, here are the three decorators I actually kept after throwing away the ones that weren't pulling their weight.&lt;/p&gt;
&lt;h3&gt;
  
  
  Decorator 1: &lt;code&gt;@measure_time&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The simplest one, and probably the one I use most.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;measure_time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Decorator to measure execution time of a function.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nd"&gt;@functools.wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;
        &lt;span class="nf"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[measure_time] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; completed in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wrapper&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; When you're chaining five transformations and the job takes 12 minutes, you need to know &lt;em&gt;which&lt;/em&gt; transformation is the bottleneck. The Spark UI gives you stage-level timings, but this gives you function-level timings at a glance.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@measure_time&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(...).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(...).&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(...).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;show_report&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# [measure_time] build_features completed in 4.32 seconds
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Decorator 2: &lt;code&gt;@show_sample_output&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Quick peek at what your transformation actually produced.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;show_sample_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Decorator to show first n rows from resulting DataFrame.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decorator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nd"&gt;@functools.wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="nf"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;[show_sample_output] First &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_jdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;showString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wrapper&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decorator&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A note on &lt;code&gt;_jdf&lt;/code&gt;: yes, &lt;code&gt;_jdf&lt;/code&gt; is a private API -- it could break in a future Spark version. For production logging I'd use &lt;code&gt;result.limit(n).toPandas().to_string()&lt;/code&gt; instead. But for a debugging decorator that you strip before deploy, I'm fine with it. It avoids the Livy stdout capture problem that &lt;code&gt;.show()&lt;/code&gt; has, and it's fast.&lt;/p&gt;
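
&lt;p&gt;For completeness, a sketch of that public-API variant -- it costs one extra Spark job, which is exactly the trade-off:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def sample_as_string(df, n=5):
    # limit(n) keeps the collect small; toPandas() pulls the rows to the driver
    return df.limit(n).toPandas().to_string(index=False)

_log(sample_as_string(result, n=3))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;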

&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; When you're building a pipeline with multiple transformation stages, you often want to verify the output shape at each step without manually adding &lt;code&gt;.show()&lt;/code&gt; everywhere. Slap this decorator on, see your data, remove it when you're done.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@show_sample_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;clean_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;show_report&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Decorator 3: &lt;code&gt;debug_udf&lt;/code&gt; (The One That Actually Solved My Problem)
&lt;/h3&gt;

&lt;p&gt;This is the one that came out of real pain. Here's the problem: UDFs run on executors, not the driver. You cannot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add &lt;code&gt;print()&lt;/code&gt; statements (output goes to executor logs, not your notebook)&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;pdb&lt;/code&gt; or any debugger (it's a serialized function running on a remote JVM worker)&lt;/li&gt;
&lt;li&gt;Read accumulator values inside the task (throws &lt;code&gt;RuntimeError&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Easily figure out &lt;em&gt;which input&lt;/em&gt; caused a failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution: use a Spark Accumulator to ship (input, output) samples from executors back to the driver.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ListAccumulatorParam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AccumulatorParam&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;zero&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial_value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;addInPlace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;acc1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;acc2&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;debug_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;sparkContext&lt;/span&gt;
    &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;accumulator&lt;/span&gt;&lt;span class="p"&gt;([],&lt;/span&gt; &lt;span class="nc"&gt;ListAccumulatorParam&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="nd"&gt;@functools.wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Can't read acc.value here (throws RuntimeError inside tasks).
&lt;/span&gt;        &lt;span class="c1"&gt;# Just always add. Limit to n when printing on the driver.
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;input_repr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No Args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_repr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;input_repr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_repr&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;([{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input_repr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)}])&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_debug_samples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UDF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Sample &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Input=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  -&amp;gt;  Output=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Usage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_udf_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;debug_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;debug_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_udf_logic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;my_udf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debug_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;my_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print_debug_samples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_udf_logic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------+---+------------------+
|   Name|Age|            Result|
+-------+---+------------------+
|  Alice| 34|  Processed: Alice|
|    Bob| 45|    Processed: Bob|
|Charlie| 29|Processed: Charlie|
|  David| 30|  Processed: David|
+-------+---+------------------+

==================================================
[debug_udf] my_udf_logic: 3 sample(s)
==================================================
  Sample 1: Input=Row(Name='Alice', Age=34)  -&amp;gt;  Output=Processed: Alice
  Sample 2: Input=Row(Name='Bob', Age=45)    -&amp;gt;  Output=Processed: Bob
  Sample 3: Input=Row(Name='Charlie', Age=29)-&amp;gt;  Output=Processed: Charlie
==================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now you can see exactly what your UDF received and what it returned. When that UDF is producing &lt;code&gt;None&lt;/code&gt; for mysterious rows, you can bump &lt;code&gt;n&lt;/code&gt; up to 1000, scan the samples, and find the problematic input.&lt;/p&gt;
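
&lt;p&gt;In the silent-&lt;code&gt;None&lt;/code&gt; case, that scan is a one-liner on the driver -- the wrapper stringifies outputs, so a failing row shows up literally as &lt;code&gt;"None"&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Run after the action has executed, so the accumulator is populated
bad = [s for s in acc.value if s["output"] == "None"]
print(f"{len(bad)} suspicious row(s)")
for s in bad[:10]:
    print(s["input"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;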

&lt;p&gt;&lt;strong&gt;One important gotcha:&lt;/strong&gt; I originally tried to limit samples inside the wrapper using &lt;code&gt;if counter.value &amp;lt; limit&lt;/code&gt;. Spark throws &lt;code&gt;RuntimeError: Accumulator.value cannot be accessed inside tasks&lt;/code&gt;. You can only &lt;em&gt;write&lt;/em&gt; to accumulators on executors, never &lt;em&gt;read&lt;/em&gt;. The limiting has to happen on the driver side in &lt;code&gt;print_debug_samples&lt;/code&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Livy stdout Problem (and the Solution)
&lt;/h2&gt;

&lt;p&gt;If you're using the Glue Docker image and wondering why &lt;code&gt;print()&lt;/code&gt; inside decorators doesn't show up, here's the full explanation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Sparkmagic sends your code to Livy, which runs it in a separate Spark process. Livy captures stdout, but only reliably from top-level execution. &lt;code&gt;print()&lt;/code&gt; buried inside nested function calls (like decorators) gets lost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;df.show()&lt;/code&gt; -- Spark's JVM writes directly to stdout&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;df.printSchema()&lt;/code&gt; -- same thing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;print()&lt;/code&gt; at the top level of a cell&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What doesn't work:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;print()&lt;/code&gt; inside a decorator wrapper function&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;logger.info()&lt;/code&gt; from anywhere (goes to Python logging, not stdout)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sys.stdout.write()&lt;/code&gt; inside nested calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The solution I landed on:&lt;/strong&gt; Buffer all decorator output into a list, then print everything in one shot at the end of the cell.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_report_lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;_report_lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;show_report&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_report_lines&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_report_lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_report_lines&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;_report_lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Every decorator calls &lt;code&gt;_log()&lt;/code&gt; instead of &lt;code&gt;print()&lt;/code&gt;. At the end of your cell, you call &lt;code&gt;show_report()&lt;/code&gt; and the whole buffer gets printed as a single top-level &lt;code&gt;print()&lt;/code&gt; that Livy reliably captures.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@measure_time&lt;/span&gt;
&lt;span class="nd"&gt;@show_sample_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;show_report&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# everything appears in cell output
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What I Threw Away (And Why)
&lt;/h2&gt;

&lt;p&gt;I originally built more decorators. Here's what didn't survive, why, and what I do instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;@cache_and_count&lt;/code&gt;&lt;/strong&gt; -- Called &lt;code&gt;.count()&lt;/code&gt; after caching to log the row count. The problem: &lt;code&gt;.count()&lt;/code&gt; forces a full materialization of the DataFrame. On a large dataset, you're adding minutes of compute just to log a number. If I need the count, I call &lt;code&gt;.count()&lt;/code&gt; explicitly once at a point where I know I need it, not on every function call via a decorator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;@log_checkpoint&lt;/code&gt;&lt;/strong&gt; -- Printed schema, row count, and partition count. Same &lt;code&gt;.count()&lt;/code&gt; problem, plus &lt;code&gt;.rdd.getNumPartitions()&lt;/code&gt; triggers another action. Two extra Spark actions per decorated function. When I need schema info, I call &lt;code&gt;.printSchema()&lt;/code&gt; inline. When I need partition counts for skew analysis, I check the Spark UI at localhost:4040 -- it has the information without triggering extra compute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;@log_partitions&lt;/code&gt;&lt;/strong&gt; -- Used &lt;code&gt;.rdd.glom().map(len).collect()&lt;/code&gt; to show rows per partition. This collects partition metadata to the driver. On a large, heavily partitioned dataset, this can OOM the driver. For partition skew analysis, the Spark UI's stage detail page shows you task-level input sizes -- same information, no extra actions, no OOM risk.&lt;/p&gt;

&lt;p&gt;The pattern here: anything that triggers a Spark &lt;strong&gt;action&lt;/strong&gt; (count, collect, show) inside a decorator is silently doubling your compute cost. Decorators should be lightweight. If you need that data, call it explicitly so the cost is visible in your code.&lt;/p&gt;
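
&lt;p&gt;For reference, this is roughly the shape of the discarded decorator (reconstructed, not the exact original):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cache_and_count(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs).cache()
        # .count() materializes the entire DataFrame: a full extra Spark
        # action on every decorated call, just to log one number
        _log(f"[cache_and_count] {func.__name__}: {result.count()} rows")
        return result
    return wrapper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;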


&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;p&gt;Here's the complete, minimal version of &lt;code&gt;spark_analyser.ipynb&lt;/code&gt; that I actually use. Three decorators, one UDF debugger, and the Livy-compatible output buffer.&lt;/p&gt;

&lt;p&gt;I keep this as a separate notebook and load it at the top of my working notebook:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/home/glue_user/workspace/jupyter_workspace/spark_analyser.ipynb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;as_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cell&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cell_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Side note: &lt;code&gt;%run spark_analyser.ipynb&lt;/code&gt; exists but it runs in a separate scope in the Glue Docker environment. The variables don't carry over to your notebook. The &lt;code&gt;exec()&lt;/code&gt; approach runs everything in the current namespace, which is what you actually want.&lt;/p&gt;
&lt;h2&gt;
  
  
  Full Source
&lt;/h2&gt;

&lt;p&gt;The full source is available on GitHub:&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/visuvishwa99" rel="noopener noreferrer"&gt;
        visuvishwa99
      &lt;/a&gt; / &lt;a href="https://github.com/visuvishwa99/pyspark-debug-toolkit" rel="noopener noreferrer"&gt;
        pyspark-debug-toolkit
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Essential decorators and utilities for debugging PySpark.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; pull this image from Docker "https://hub.docker.com/layers/amazon/aws-glue-libs/"&lt;/span&gt;
docker pull public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; for linux / inside docker&lt;/span&gt;
docker run -itd \
    --name glue_jupyter_v2 \
    -p 8888:8888 \
    -p 4040:4040 \
    -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/home/your_user/local/&amp;lt;folder_name&amp;gt;:/home/glue_user/workspace/jupyter_workspace&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/home/your_user/.aws:/home/glue_user/.aws:ro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    -e DISABLE_SSL=true \
    public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 \
    /home/glue_user/jupyter/jupyter_start.sh \
    --NotebookApp.token=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    --NotebookApp.password=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; windows (vscode)&lt;/span&gt;
docker run -itd \
    --name glue_jupyter_v2 \
    -p 8888:8888 \
    -p 4040:4040 \
    -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;C:/local/&amp;lt;folder_name&amp;gt;://home/glue_user/workspace/jupyter_workspace&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;C:/Users/&amp;lt;your_username&amp;gt;/.aws://home/glue_user/.aws:ro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    -e DISABLE_SSL=true \
    public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 \
    //home/glue_user/jupyter/jupyter_start.sh \
    --NotebookApp.token=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    --NotebookApp.password=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;In the browser:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Open &lt;a href="http://localhost:8888" rel="nofollow noopener noreferrer"&gt;http://localhost:8888&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Navigate to the "jupyter_workspace" folder.&lt;/li&gt;
&lt;li&gt;Open "&amp;lt;notebook_name&amp;gt;.ipynb" and run it.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In VS Code:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Open "&amp;lt;notebook_name&amp;gt;.ipynb" in VS Code.&lt;/li&gt;
&lt;li&gt;Point the kernel picker at the existing Jupyter server "&lt;a href="http://127.0.0.1:8888" rel="nofollow noopener noreferrer"&gt;http://127.0.0.1:8888&lt;/a&gt;".&lt;/li&gt;
&lt;li&gt;Select the PySpark / Python 3 kernel.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Monitor Spark jobs in real time through the Spark UI at &lt;a href="http://localhost:4040" rel="nofollow noopener noreferrer"&gt;http://localhost:4040&lt;/a&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@measure_time&lt;/code&gt;&lt;/strong&gt; -- Lightweight, always useful, zero overhead (a minimal sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@show_sample_output(n)&lt;/code&gt;&lt;/strong&gt; -- Quick peek at transformation output during development.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;debug_udf&lt;/code&gt;&lt;/strong&gt; -- Uses Spark Accumulators to capture UDF inputs/outputs on the driver (sketched below). Solves the "why is my UDF returning None" problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid decorators that trigger Spark actions&lt;/strong&gt; (&lt;code&gt;.count()&lt;/code&gt;, &lt;code&gt;.collect()&lt;/code&gt;) -- they silently double your compute cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glue Docker uses Livy&lt;/strong&gt; -- &lt;code&gt;print()&lt;/code&gt; inside nested functions gets swallowed. Buffer output and print at the top level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accumulators are write-only on executors and readable only on the driver&lt;/strong&gt; -- cap and slice your captured samples on the driver side.&lt;/li&gt;
&lt;/ul&gt;
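
&lt;p&gt;As a quick illustration of the first item, here's a minimal sketch of a &lt;code&gt;@measure_time&lt;/code&gt;-style decorator -- the shape of the idea, not the toolkit's exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from functools import wraps

def measure_time(func):
    # Wall-clock timing around a pipeline step. For lazy Spark
    # transformations this measures only driver-side planning time;
    # it never triggers an action, hence "zero overhead".
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"[measure_time] {func.__name__}: {time.perf_counter() - start:.2f}s")
        return result
    return wrapper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;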

&lt;p&gt;The UDF debugger alone has saved me hours.&lt;/p&gt;
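
&lt;p&gt;And here's the accumulator pattern behind the UDF debugger, boiled down to a sketch -- assuming an active SparkSession named &lt;code&gt;spark&lt;/code&gt;, and treating this as an illustration rather than the repo's exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.accumulators import AccumulatorParam

class ListAccumulator(AccumulatorParam):
    # Accumulator that merges lists: executors may only add() to it,
    # and its value can only be read back on the driver.
    def zero(self, value):
        return []
    def addInPlace(self, acc, other):
        acc.extend(other)
        return acc

samples = spark.sparkContext.accumulator([], ListAccumulator())

def debug_udf(func):
    # Wrap a plain Python function before registering it as a UDF.
    def wrapper(*args):
        result = func(*args)
        samples.add([(args, result)])  # write-only on the executor side
        return result
    return wrapper

# e.g. spark.udf.register("my_udf", debug_udf(my_func))
# After an action has run, inspect a bounded slice on the driver:
# print(samples.value[:20])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;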

</description>
      <category>pyspark</category>
      <category>aws</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>Adding Audit Columns to Existing Tables: Comparing Approaches for Large Datasets</title>
      <dc:creator>Vishwa.P</dc:creator>
      <pubDate>Wed, 16 Apr 2025 18:57:39 +0000</pubDate>
      <link>https://dev.to/vishwa_p_c4dc415de5/adding-audit-columns-to-existing-tables-comparing-approaches-for-large-datasets-a6l</link>
      <guid>https://dev.to/vishwa_p_c4dc415de5/adding-audit-columns-to-existing-tables-comparing-approaches-for-large-datasets-a6l</guid>
      <description>&lt;h1&gt;
  
  
  Adding Audit Columns to Existing Tables: Comparing Approaches for Large Datasets
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In data engineering, adding audit columns like &lt;code&gt;bd_insert_dtm&lt;/code&gt; and &lt;code&gt;bd_updated_dtm&lt;/code&gt; to track when records are created or modified is a common requirement. When dealing with large datasets (2-5GB files), choosing the right approach becomes critical for performance and resource utilization.&lt;/p&gt;

&lt;p&gt;This post compares four different methods to implement this seemingly simple task, helping you choose the right tool for your specific needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;We need to add audit timestamp columns to existing tables with file sizes ranging from 2GB to 5GB. Let's explore our options:&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 1: PySpark
&lt;/h2&gt;

&lt;p&gt;PySpark leverages distributed computing, making it ideal for large datasets. While it might seem like overkill for 2-5GB files, it scales beautifully as your data grows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;current_timestamp&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Spark session
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add Audit Columns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Read your data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add audit columns
&lt;/span&gt;&lt;span class="n"&gt;df_with_audit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_insert_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
                  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_updated_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Write the result
&lt;/span&gt;&lt;span class="n"&gt;df_with_audit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly scalable to much larger datasets&lt;/li&gt;
&lt;li&gt;Parallelized processing&lt;/li&gt;
&lt;li&gt;Built-in functions for timestamps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Startup overhead&lt;/li&gt;
&lt;li&gt;Requires Spark environment&lt;/li&gt;
&lt;li&gt;More complex for simple tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Approach 2: Pandas
&lt;/h2&gt;

&lt;p&gt;Pandas offers simplicity and ease of use, loading the entire dataset into memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="c1"&gt;# Read your data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add audit columns
&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_insert_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_updated_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;

&lt;span class="c1"&gt;# Write the result
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_path.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple and straightforward&lt;/li&gt;
&lt;li&gt;Familiar API for data scientists&lt;/li&gt;
&lt;li&gt;Great for quick iterations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads entire dataset into memory&lt;/li&gt;
&lt;li&gt;May struggle with 2GB+ files on machines with limited RAM (chunked reading, sketched below, is a partial workaround)&lt;/li&gt;
&lt;li&gt;Single-threaded operations&lt;/li&gt;
&lt;/ul&gt;
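
&lt;p&gt;If you want to stay in pandas but RAM is the bottleneck, chunked reading is a reasonable middle ground. A minimal sketch, assuming the same CSV input as above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from datetime import datetime

current_time = datetime.now()

# Stream the file in 1M-row chunks so only one chunk sits in memory
for i, chunk in enumerate(pd.read_csv("your_file.csv", chunksize=1_000_000)):
    chunk["bd_insert_dtm"] = current_time
    chunk["bd_updated_dtm"] = current_time
    # First chunk creates the file with a header; later chunks append
    chunk.to_csv("output_path.csv", mode="w" if i == 0 else "a",
                 header=(i == 0), index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;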

&lt;h2&gt;
  
  
  Approach 3: Dask
&lt;/h2&gt;

&lt;p&gt;Dask combines the familiar Pandas API with out-of-core processing for larger-than-memory datasets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dask.dataframe&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="c1"&gt;# Read your data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add audit columns
&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_insert_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_updated_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;

&lt;span class="c1"&gt;# Write the result
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_path/*.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pandas-like API&lt;/li&gt;
&lt;li&gt;Handles larger-than-memory datasets&lt;/li&gt;
&lt;li&gt;Parallel execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More complex than pandas&lt;/li&gt;
&lt;li&gt;Output is split across multiple files by default (see the single-file note below)&lt;/li&gt;
&lt;li&gt;Some operations require careful consideration&lt;/li&gt;
&lt;/ul&gt;
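
&lt;p&gt;On the multiple-files point: Dask can write a single CSV if you pass &lt;code&gt;single_file=True&lt;/code&gt; to &lt;code&gt;to_csv&lt;/code&gt;, at the cost of funneling the write through one task:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Collapse the partitioned output into one file (slower, but convenient)
df.to_csv("output_path.csv", single_file=True, index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;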

&lt;h2&gt;
  
  
  Approach 4: Using Generators
&lt;/h2&gt;

&lt;p&gt;A generator-style streaming approach -- reading and writing the file row by row with the &lt;code&gt;csv&lt;/code&gt; module -- provides the most memory-efficient solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_audit_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;infile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;infile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Handle header
&lt;/span&gt;        &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_insert_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_updated_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Process rows
&lt;/span&gt;        &lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d %H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="nf"&gt;add_audit_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimal memory footprint&lt;/li&gt;
&lt;li&gt;Works on any machine&lt;/li&gt;
&lt;li&gt;Simple to understand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential processing&lt;/li&gt;
&lt;li&gt;Limited functionality compared to dataframe libraries&lt;/li&gt;
&lt;li&gt;Manual handling of data types&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison and Recommendations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Memory Usage&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Scalability&lt;/th&gt;
&lt;th&gt;Ease of Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PySpark&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pandas&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dask&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generator&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For 2-5GB files:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With sufficient RAM: Pandas offers the simplest solution&lt;/li&gt;
&lt;li&gt;With limited RAM: Dask or generators are better choices&lt;/li&gt;
&lt;li&gt;With an existing Spark environment: PySpark makes sense&lt;/li&gt;
&lt;li&gt;For absolute memory efficiency: Go with generators&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;When adding audit columns to existing tables, the best approach depends on your specific constraints. For most cases with 2-5GB files, Dask provides an excellent balance between ease of use and performance. However, generators shine when working in extremely memory-constrained environments, while PySpark is the go-to solution if you anticipate scaling to much larger datasets in the future.&lt;/p&gt;

&lt;p&gt;What approach are you using for adding audit columns to your tables? Let me know in the comments!&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>pyspark</category>
      <category>python</category>
    </item>
  </channel>
</rss>
