<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vishwa.P</title>
    <description>The latest articles on DEV Community by Vishwa.P (@vishwa_p_c4dc415de5).</description>
    <link>https://dev.to/vishwa_p_c4dc415de5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1551442%2F2be926ad-45eb-4afa-9a3c-60c1a813595d.png</url>
      <title>DEV Community: Vishwa.P</title>
      <link>https://dev.to/vishwa_p_c4dc415de5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vishwa_p_c4dc415de5"/>
    <language>en</language>
    <item>
      <title>Stop Copy-Pasting Notes: Building an AI-Powered Pipeline from Obsidian to Anki</title>
      <dc:creator>Vishwa.P</dc:creator>
      <pubDate>Fri, 20 Feb 2026 21:48:06 +0000</pubDate>
      <link>https://dev.to/vishwa_p_c4dc415de5/stop-copy-pasting-notes-building-an-ai-powered-pipeline-from-obsidian-to-anki-195f</link>
      <guid>https://dev.to/vishwa_p_c4dc415de5/stop-copy-pasting-notes-building-an-ai-powered-pipeline-from-obsidian-to-anki-195f</guid>
      <description>&lt;h2&gt;
  
  
  The Problem: The Friction of Remembering
&lt;/h2&gt;

&lt;p&gt;Taking detailed notes in Obsidian is fantastic for deep understanding and connecting concepts. But there's a huge problem: building long-term memory for certifications, new tools, or complex system architectures requires active recall and spaced repetition. Obsidian is a powerful "factory" for building knowledge, but it's a terrible "delivery truck" for rote memorization.&lt;/p&gt;

&lt;p&gt;Manually copying and pasting notes from Obsidian into a spaced repetition system (SRS) like Anki creates massive friction. You end up either abandoning your flashcards entirely or spending hours duplicating data instead of learning. I needed a way to capture atomic facts in my daily notes and have them automatically flow into my flashcard decks without leaving my editor.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup: The "Second Brain" Data Pipeline
&lt;/h2&gt;

&lt;p&gt;I decided to treat my personal knowledge system like a Data Engineering pipeline. I needed a reliable way to ingest "raw" daily journals, transform them into structured domain notes, and extract flashcards as data marts for Anki.&lt;/p&gt;

&lt;p&gt;The setup leverages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Obsidian&lt;/strong&gt;: Long-form connected notes.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Anki&lt;/strong&gt;: The gold standard for spaced repetition (using the FSRS scheduler).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Obsidian_to_Anki Plugin&lt;/strong&gt;: For syncing notes into decks.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Architecture: How It Works
&lt;/h2&gt;

&lt;p&gt;The architecture is built around three distinct stages—much like a medallion data architecture (Bronze, Silver, Gold).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Daily Journaling] --&amp;gt; [01_Raw Journal : Bronze]
                             |
                   +---------v---------+
                   | Automation Agent  |
                   | - YAML Parser     |
                   | - Routing Logic   |
                   +---------+---------+
                             |
         +-------------------+-------------------+
         | (move_to_brain)   | (move_to_mart)    | (audit)
         v                   v                   v
   [02_Brain : Silver]  [03_Mart : Gold]  [movement_log.md]
                             |
                             v
              [Obsidian_to_Anki Integration]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;01_Raw&lt;/strong&gt;: The ingestion layer where daily notes and raw thoughts are jotted down.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;02_Brain&lt;/strong&gt;: The processed, structured domain knowledge (e.g., &lt;code&gt;05_Warehousing&lt;/code&gt; or &lt;code&gt;09_AI&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;03_Mart&lt;/strong&gt;: Specifically extracted, atomic flashcards ready for Anki sync.&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  The Solution: Frontmatter Routing and CurlyCloze
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The "Secret Sauce": Frontmatter Automation
&lt;/h3&gt;

&lt;p&gt;The key to completely removing the manual work was utilizing YAML frontmatter to give the notes "state" and instruct a background pipeline where to send the content. &lt;/p&gt;

&lt;p&gt;By simply tagging my daily journal, the background automation parses the file and moves it strictly based on the taxonomy defined in the tags.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;05_Warehousing&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;Explain&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="na"&gt;move_journal_to_brain&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;move_brain_to_mart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;processed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When my automation runs, it searches for files where &lt;code&gt;processed: false&lt;/code&gt;. If &lt;code&gt;move_journal_to_brain&lt;/code&gt; is true, the script builds a proper page under the &lt;code&gt;02_Brain&lt;/code&gt; folder. If &lt;code&gt;move_brain_to_mart&lt;/code&gt; is true, it extracts the target text specifically for flashcard ingestion.&lt;/p&gt;
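
&lt;p&gt;To make that routing concrete, here is a minimal sketch of the pass described above. It assumes the &lt;code&gt;python-frontmatter&lt;/code&gt; library, and the helper name &lt;code&gt;route_note&lt;/code&gt; is mine for illustration -- the actual agent in the repo may differ:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import frontmatter  # pip install python-frontmatter
from pathlib import Path

VAULT = Path("~/SecondBrain").expanduser()

def route_note(path: Path) -&gt; None:
    """Move a note according to its YAML frontmatter flags."""
    note = frontmatter.load(str(path))
    if note.metadata.get("processed", True):
        return  # only files explicitly marked processed: false are touched
    tag = note.metadata.get("tags", ["inbox"])[0]  # e.g. "05_Warehousing"
    if note.metadata.get("move_journal_to_brain"):
        dest = VAULT / "02_Brain" / tag / path.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        path = path.rename(dest)
    if note.metadata.get("move_brain_to_mart"):
        mart = VAULT / "03_Mart" / tag / path.name
        mart.parent.mkdir(parents=True, exist_ok=True)
        # naive extraction: keep only the START ... END flashcard blocks
        blocks = [b for b in note.content.split("\n\n") if b.startswith("START")]
        mart.write_text("\n\n".join(blocks))
    note.metadata["processed"] = True
    path.write_text(frontmatter.dumps(note))

for md_file in (VAULT / "01_Raw").rglob("*.md"):
    route_note(md_file)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;movement_log.md&lt;/code&gt; audit step from the diagram is omitted here for brevity.&lt;/p&gt;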
&lt;h3&gt;
  
  
  Writing Flashcards: The CurlyCloze Syntax
&lt;/h3&gt;

&lt;p&gt;Instead of using a separate app to write Anki cards, I use the &lt;code&gt;CurlyCloze&lt;/code&gt; syntax directly in my markdown blocks and rely on the &lt;code&gt;Obsidian_to_Anki&lt;/code&gt; integration to parse them.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# My note on Snowflake&lt;/span&gt;

TARGET DECK: SecondBrain::05_Warehousing

START
Cloze
Snowflake micro-partitions are between {50 MB} and {500 MB} of uncompressed data.
END

START
Cloze
{Time Travel} allows access to historical data in Snowflake.
END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The automation moves these cloze blocks to &lt;code&gt;03_Mart&lt;/code&gt;, and the Anki plugin effortlessly scans the folder and updates my main database. Creating flashcards simply becomes a byproduct of formatting my daily notes!&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Threw Away (And Why)
&lt;/h2&gt;

&lt;p&gt;I tried these approaches first, but they failed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Obsidian SRS Plugins&lt;/strong&gt;: While functional, the mobile experience for Obsidian is clunky compared to AnkiMobile. Mixing note-taking UI with rapid-fire spaced repetition just diluted my cognitive focus.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Manual Staging Folders&lt;/strong&gt;: I initially had a "Needs Review" folder where I manually sorted notes. This required a human-in-the-loop and quickly became a bottleneck. Moving to a frontmatter-driven configuration solved this instantly.&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;p&gt;By treating my personal notes as a data engineering pipeline, I removed the friction from spaced repetition. Notes are written once, dynamically routed based on markdown metadata, and seamlessly delivered to my Anki decks for long-term retention. &lt;/p&gt;

&lt;p&gt;You can find the full repository structure, automation workflow logic, and taxonomy design here:&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/visuvishwa99" rel="noopener noreferrer"&gt;
        visuvishwa99
      &lt;/a&gt; / &lt;a href="https://github.com/visuvishwa99/second_brain" rel="noopener noreferrer"&gt;
        second_brain
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
&lt;/div&gt;





&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Issue&lt;/strong&gt;: Copying structured notes from Obsidian into Anki for spaced repetition is tedious and unsustainable.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Fix&lt;/strong&gt;: A metadata-driven pipeline that moves daily journals into structured domain folders and extracts cloze-deletion flashcards automatically using Obsidian-to-Anki.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Outcome&lt;/strong&gt;: Zero friction between learning a new concept and creating a testable flashcard, saving countless hours of manual data entry.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>obsidian</category>
      <category>anki</category>
      <category>productivity</category>
      <category>ai</category>
    </item>
    <item>
      <title>From Silent None to Insight: Debugging PySpark UDFs on AWS Glue with Decorators</title>
      <dc:creator>Vishwa.P</dc:creator>
      <pubDate>Fri, 20 Feb 2026 03:14:48 +0000</pubDate>
      <link>https://dev.to/vishwa_p_c4dc415de5/from-silent-none-to-insight-debugging-pyspark-udfs-on-aws-glue-with-decorators-1o7m</link>
      <guid>https://dev.to/vishwa_p_c4dc415de5/from-silent-none-to-insight-debugging-pyspark-udfs-on-aws-glue-with-decorators-1o7m</guid>
      <description>&lt;p&gt;Last month I was debugging a PySpark UDF that was silently returning &lt;code&gt;None&lt;/code&gt; for about 2% of rows in a 10-million-row dataset. No error. No exception. Just... &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I couldn't reproduce it locally because I didn't have the exact row that caused it. I couldn't add &lt;code&gt;print()&lt;/code&gt; statements because -- as I painfully discovered -- &lt;strong&gt;print() inside a UDF doesn't show up anywhere useful&lt;/strong&gt;. The output vanishes into executor logs that are buried three clicks deep in the Spark UI, if they exist at all.&lt;/p&gt;

&lt;p&gt;That frustration led me to build a small set of PySpark debugging decorators. Some of them turned out to be genuinely useful. Others taught me more about Spark's architecture than I expected. And the whole thing sent me down a rabbit hole about how AWS Glue's Docker image actually works under the hood.&lt;/p&gt;

&lt;p&gt;This post covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Three decorators I actually use in production debugging&lt;/li&gt;
&lt;li&gt;Why &lt;code&gt;print()&lt;/code&gt; inside a UDF doesn't do what you think&lt;/li&gt;
&lt;li&gt;How AWS Glue's local Docker environment works (Livy, Sparkmagic, and the stdout black hole)&lt;/li&gt;
&lt;li&gt;How to set up and test Glue jobs locally with Docker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's get into it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup: Testing Glue Jobs Locally with Docker
&lt;/h2&gt;

&lt;p&gt;Before we get to the decorators, let me explain the environment. AWS Glue provides a Docker image that lets you develop and test ETL jobs on your own machine without spinning up cloud resources. This is a massive time-saver -- no waiting for Glue job cold starts, no paying for dev endpoint hours.&lt;/p&gt;

&lt;p&gt;Here's how to get it running:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pull the image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Start the container
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Linux/Mac:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;WORKSPACE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/your/project

docker run &lt;span class="nt"&gt;-itd&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; glue_jupyter &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 8888:8888 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 4040:4040 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$WORKSPACE&lt;/span&gt;&lt;span class="s2"&gt;:/home/glue_user/workspace/jupyter_workspace"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.aws:/home/glue_user/.aws:ro"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;DISABLE_SSL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 &lt;span class="se"&gt;\&lt;/span&gt;
    /home/glue_user/jupyter/jupyter_start.sh &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--NotebookApp&lt;/span&gt;.token&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--NotebookApp&lt;/span&gt;.password&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Windows (Git Bash / PowerShell):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-itd&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; glue_jupyter &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 8888:8888 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 4040:4040 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"C:/your/project://home/glue_user/workspace/jupyter_workspace"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"C:/Users/YourName/.aws://home/glue_user/.aws:ro"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;DISABLE_SSL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 &lt;span class="se"&gt;\&lt;/span&gt;
    //home/glue_user/jupyter/jupyter_start.sh &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--NotebookApp&lt;/span&gt;.token&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--NotebookApp&lt;/span&gt;.password&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then open &lt;a href="http://localhost:8888" rel="noopener noreferrer"&gt;http://localhost:8888&lt;/a&gt; in your browser. Navigate to the &lt;code&gt;jupyter_workspace&lt;/code&gt; folder. Your local project files are right there, mounted into the container. Any changes you make are reflected on your host filesystem.&lt;/p&gt;

&lt;p&gt;You can monitor Spark jobs in real-time at &lt;a href="http://localhost:4040" rel="noopener noreferrer"&gt;http://localhost:4040&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you're using VS Code, connect to the Jupyter server at &lt;code&gt;http://127.0.0.1:8888&lt;/code&gt; and select the PySpark kernel.&lt;/p&gt;


&lt;h2&gt;
  
  
  Wait, How Does This Actually Work? The Livy Architecture
&lt;/h2&gt;

&lt;p&gt;Here's something that tripped me up for hours and is worth understanding before we talk about decorators.&lt;/p&gt;

&lt;p&gt;When you run a PySpark cell in the Glue Docker notebook, your code &lt;strong&gt;does not run inside the Jupyter kernel process&lt;/strong&gt;. Here's what actually happens:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; [Your Browser]
       |
 [Jupyter Server]
       |
 [Sparkmagic Kernel]  -- sends your code over HTTP --&amp;gt;  [Livy Server :8998]
                                                              |
                                                       [Spark Driver JVM]
                                                              |
                                                       [Spark Executors]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Sparkmagic&lt;/strong&gt; is the Jupyter kernel. It doesn't run your Python code directly. Instead, it serializes your cell's code as a string and sends it via HTTP POST to &lt;strong&gt;Livy&lt;/strong&gt;, which is an open-source REST server for Spark.&lt;/p&gt;

&lt;p&gt;Livy creates a Spark session, executes your code in a separate JVM-hosted Python process, captures whatever gets written to stdout, and sends that text back to Sparkmagic, which displays it in your notebook cell.&lt;/p&gt;
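
&lt;p&gt;You can watch this round trip yourself. Here is a rough sketch of what Sparkmagic effectively does under the hood, assuming Livy's standard REST API on its default port 8998 (Sparkmagic's real payloads carry more fields, but the flow is the same):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import requests

LIVY = "http://localhost:8998"  # Livy's default port

# 1. One session per notebook: Livy spins up a Spark driver for it
#    (polling for the session to become idle is omitted here)
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()

# 2. Every cell is POSTed as a plain string of source code
stmt = requests.post(
    f"{LIVY}/sessions/{session['id']}/statements",
    json={"code": "print('hello from the driver')"},
).json()

# 3. Sparkmagic polls until the statement finishes, then renders whatever
#    Livy captured from the driver's top-level stdout
while stmt["state"] != "available":
    time.sleep(1)
    stmt = requests.get(
        f"{LIVY}/sessions/{session['id']}/statements/{stmt['id']}"
    ).json()

print(stmt["output"]["data"]["text/plain"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;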

&lt;p&gt;This architecture has a critical consequence: &lt;strong&gt;Livy only reliably captures stdout from the top-level cell execution.&lt;/strong&gt; If your code calls a function, which calls a decorator, which calls &lt;code&gt;print()&lt;/code&gt; three stack frames deep -- that output often gets lost in transit. Livy captures what Spark's driver process writes to stdout at the top level. Nested &lt;code&gt;print()&lt;/code&gt; calls inside wrapper functions? Hit or miss. Usually miss.&lt;/p&gt;

&lt;p&gt;This is why &lt;code&gt;df.show()&lt;/code&gt; works (Spark's JVM writes directly to stdout at the top level) but &lt;code&gt;print("hello")&lt;/code&gt; inside a decorator wrapper gets swallowed.&lt;/p&gt;

&lt;p&gt;Understanding this saved me from a lot of "why doesn't this work" frustration. It's not a bug. It's the architecture.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Decorators
&lt;/h2&gt;

&lt;p&gt;With that context, here are the three decorators I actually kept after throwing away the ones that weren't pulling their weight.&lt;/p&gt;
&lt;h3&gt;
  
  
  Decorator 1: &lt;code&gt;@measure_time&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The simplest one, and probably the one I use most.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;measure_time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Decorator to measure execution time of a function.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nd"&gt;@functools.wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;
        &lt;span class="nf"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[measure_time] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; completed in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wrapper&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; When you're chaining five transformations and the job takes 12 minutes, you need to know &lt;em&gt;which&lt;/em&gt; transformation is the bottleneck. The Spark UI gives you stage-level timings, but this gives you function-level timings at a glance.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@measure_time&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(...).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(...).&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(...).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;show_report&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# [measure_time] build_features completed in 4.32 seconds
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Decorator 2: &lt;code&gt;@show_sample_output&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Quick peek at what your transformation actually produced.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;show_sample_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Decorator to show first n rows from resulting DataFrame.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;decorator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nd"&gt;@functools.wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="nf"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;[show_sample_output] First &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_jdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;showString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wrapper&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;decorator&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A note on &lt;code&gt;_jdf&lt;/code&gt;: yes, &lt;code&gt;_jdf&lt;/code&gt; is a private API -- it could break in a future Spark version. For production logging I'd use &lt;code&gt;result.limit(n).toPandas().to_string()&lt;/code&gt; instead. But for a debugging decorator that you strip before deploy, I'm fine with it. It avoids the Livy stdout capture problem that &lt;code&gt;.show()&lt;/code&gt; has, and it's fast.&lt;/p&gt;
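
&lt;p&gt;For completeness, a sketch of that public-API variant -- it costs one extra Spark job, which is exactly the trade-off:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def sample_as_string(df, n=5):
    # limit(n) keeps the collect small; toPandas() pulls the rows to the driver
    return df.limit(n).toPandas().to_string(index=False)

_log(sample_as_string(result, n=3))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;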

&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; When you're building a pipeline with multiple transformation stages, you often want to verify the output shape at each step without manually adding &lt;code&gt;.show()&lt;/code&gt; everywhere. Slap this decorator on, see your data, remove it when you're done.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@show_sample_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;clean_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;show_report&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Decorator 3: &lt;code&gt;debug_udf&lt;/code&gt; (The One That Actually Solved My Problem)
&lt;/h3&gt;

&lt;p&gt;This is the one that came out of real pain. Here's the problem: UDFs run on executors, not the driver. You cannot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add &lt;code&gt;print()&lt;/code&gt; statements (output goes to executor logs, not your notebook)&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;pdb&lt;/code&gt; or any debugger (it's a serialized function running on a remote JVM worker)&lt;/li&gt;
&lt;li&gt;Read accumulator values inside the task (throws &lt;code&gt;RuntimeError&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Easily figure out &lt;em&gt;which input&lt;/em&gt; caused a failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The solution: use a Spark Accumulator to ship (input, output) samples from executors back to the driver.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ListAccumulatorParam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AccumulatorParam&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;zero&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial_value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;addInPlace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;acc1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;acc2&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;debug_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;sparkContext&lt;/span&gt;
    &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;accumulator&lt;/span&gt;&lt;span class="p"&gt;([],&lt;/span&gt; &lt;span class="nc"&gt;ListAccumulatorParam&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="nd"&gt;@functools.wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Can't read acc.value here (throws RuntimeError inside tasks).
&lt;/span&gt;        &lt;span class="c1"&gt;# Just always add. Limit to n when printing on the driver.
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;input_repr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No Args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_repr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;input_repr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_repr&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;([{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input_repr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)}])&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wrapper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_debug_samples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UDF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Sample &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Input=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  -&amp;gt;  Output=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Usage:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_udf_logic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;debug_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;debug_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_udf_logic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;my_udf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;debug_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;my_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print_debug_samples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_udf_logic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------+---+------------------+
|   Name|Age|            Result|
+-------+---+------------------+
|  Alice| 34|  Processed: Alice|
|    Bob| 45|    Processed: Bob|
|Charlie| 29|Processed: Charlie|
|  David| 30|  Processed: David|
+-------+---+------------------+

==================================================
[debug_udf] my_udf_logic: 3 sample(s)
==================================================
  Sample 1: Input=Row(Name='Alice', Age=34)  -&amp;gt;  Output=Processed: Alice
  Sample 2: Input=Row(Name='Bob', Age=45)    -&amp;gt;  Output=Processed: Bob
  Sample 3: Input=Row(Name='Charlie', Age=29)-&amp;gt;  Output=Processed: Charlie
==================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now you can see exactly what your UDF received and what it returned. When that UDF is producing &lt;code&gt;None&lt;/code&gt; for mysterious rows, you can bump &lt;code&gt;n&lt;/code&gt; up to 1000, scan the samples, and find the problematic input.&lt;/p&gt;
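
&lt;p&gt;In the silent-&lt;code&gt;None&lt;/code&gt; case, that scan is a one-liner on the driver -- the wrapper stringifies outputs, so a failing row shows up literally as &lt;code&gt;"None"&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Run after the action has executed, so the accumulator is populated
bad = [s for s in acc.value if s["output"] == "None"]
print(f"{len(bad)} suspicious row(s)")
for s in bad[:10]:
    print(s["input"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;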

&lt;p&gt;&lt;strong&gt;One important gotcha:&lt;/strong&gt; I originally tried to limit samples inside the wrapper using &lt;code&gt;if counter.value &amp;lt; limit&lt;/code&gt;. Spark throws &lt;code&gt;RuntimeError: Accumulator.value cannot be accessed inside tasks&lt;/code&gt;. You can only &lt;em&gt;write&lt;/em&gt; to accumulators on executors, never &lt;em&gt;read&lt;/em&gt;. The limiting has to happen on the driver side in &lt;code&gt;print_debug_samples&lt;/code&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Livy stdout Problem (and the Solution)
&lt;/h2&gt;

&lt;p&gt;If you're using the Glue Docker image and wondering why &lt;code&gt;print()&lt;/code&gt; inside decorators doesn't show up, here's the full explanation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Sparkmagic sends your code to Livy, which runs it in a separate Spark process. Livy captures stdout, but only reliably from top-level execution. &lt;code&gt;print()&lt;/code&gt; buried inside nested function calls (like decorators) gets lost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;df.show()&lt;/code&gt; -- Spark's JVM writes directly to stdout&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;df.printSchema()&lt;/code&gt; -- same thing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;print()&lt;/code&gt; at the top level of a cell&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What doesn't work:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;print()&lt;/code&gt; inside a decorator wrapper function&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;logger.info()&lt;/code&gt; from anywhere (goes to Python logging, not stdout)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sys.stdout.write()&lt;/code&gt; inside nested calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The solution I landed on:&lt;/strong&gt; Buffer all decorator output into a list, then print everything in one shot at the end of the cell.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_report_lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;_report_lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;show_report&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_report_lines&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_report_lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_report_lines&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;_report_lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Every decorator calls &lt;code&gt;_log()&lt;/code&gt; instead of &lt;code&gt;print()&lt;/code&gt;. At the end of your cell, you call &lt;code&gt;show_report()&lt;/code&gt; and the whole buffer gets printed as a single top-level &lt;code&gt;print()&lt;/code&gt; that Livy reliably captures.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@measure_time&lt;/span&gt;
&lt;span class="nd"&gt;@show_sample_output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;my_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;show_report&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# everything appears in cell output
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What I Threw Away (And Why)
&lt;/h2&gt;

&lt;p&gt;I originally built more decorators. Here's what didn't survive, why, and what I do instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;@cache_and_count&lt;/code&gt;&lt;/strong&gt; -- Called &lt;code&gt;.count()&lt;/code&gt; after caching to log the row count. The problem: &lt;code&gt;.count()&lt;/code&gt; forces a full materialization of the DataFrame. On a large dataset, you're adding minutes of compute just to log a number. If I need the count, I call &lt;code&gt;.count()&lt;/code&gt; explicitly once at a point where I know I need it, not on every function call via a decorator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;@log_checkpoint&lt;/code&gt;&lt;/strong&gt; -- Printed schema, row count, and partition count. Same &lt;code&gt;.count()&lt;/code&gt; problem, plus &lt;code&gt;.rdd.getNumPartitions()&lt;/code&gt; triggers another action. Two extra Spark actions per decorated function. When I need schema info, I call &lt;code&gt;.printSchema()&lt;/code&gt; inline. When I need partition counts for skew analysis, I check the Spark UI at localhost:4040 -- it has the information without triggering extra compute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;@log_partitions&lt;/code&gt;&lt;/strong&gt; -- Used &lt;code&gt;.rdd.glom().map(len).collect()&lt;/code&gt; to show rows per partition. This collects partition metadata to the driver. On a large, heavily partitioned dataset, this can OOM the driver. For partition skew analysis, the Spark UI's stage detail page shows you task-level input sizes -- same information, no extra actions, no OOM risk.&lt;/p&gt;

&lt;p&gt;The pattern here: anything that triggers a Spark &lt;strong&gt;action&lt;/strong&gt; (count, collect, show) inside a decorator is silently doubling your compute cost. Decorators should be lightweight. If you need that data, call it explicitly so the cost is visible in your code.&lt;/p&gt;
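
&lt;p&gt;For reference, this is roughly the shape of the discarded decorator (reconstructed, not the exact original):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def cache_and_count(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs).cache()
        # .count() materializes the entire DataFrame: a full extra Spark
        # action on every decorated call, just to log one number
        _log(f"[cache_and_count] {func.__name__}: {result.count()} rows")
        return result
    return wrapper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;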


&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;p&gt;Here's the complete, minimal version of &lt;code&gt;spark_analyser.ipynb&lt;/code&gt; that I actually use. Three decorators, one UDF debugger, and the Livy-compatible output buffer.&lt;/p&gt;

&lt;p&gt;I keep this as a separate notebook and load it at the top of my working notebook:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/home/glue_user/workspace/jupyter_workspace/spark_analyser.ipynb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;nb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nbformat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;as_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cell&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cell_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Side note: &lt;code&gt;%run spark_analyser.ipynb&lt;/code&gt; exists but it runs in a separate scope in the Glue Docker environment. The variables don't carry over to your notebook. The &lt;code&gt;exec()&lt;/code&gt; approach runs everything in the current namespace, which is what you actually want.&lt;/p&gt;
&lt;h2&gt;
  
  
  Full Source
&lt;/h2&gt;

&lt;p&gt;The full source is available on GitHub:&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/visuvishwa99" rel="noopener noreferrer"&gt;
        visuvishwa99
      &lt;/a&gt; / &lt;a href="https://github.com/visuvishwa99/pyspark-debug-toolkit" rel="noopener noreferrer"&gt;
        pyspark-debug-toolkit
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Essential decorators and utilities for debugging PySpark.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; pull this image from Docker "https://hub.docker.com/layers/amazon/aws-glue-libs/"&lt;/span&gt;
docker pull public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; for linux / inside docker&lt;/span&gt;
docker run -itd \
    --name glue_jupyter_v2 \
    -p 8888:8888 \
    -p 4040:4040 \
    -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/home/your_user/local/&amp;lt;folder_name&amp;gt;:/home/glue_user/workspace/jupyter_workspace&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;/home/your_user/.aws:/home/glue_user/.aws:ro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    -e DISABLE_SSL=true \
    public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 \
    /home/glue_user/jupyter/jupyter_start.sh \
    --NotebookApp.token=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    --NotebookApp.password=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; windows (vscode)&lt;/span&gt;
docker run -itd \
    --name glue_jupyter_v2 \
    -p 8888:8888 \
    -p 4040:4040 \
    -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;C:/local/&amp;lt;folder_name&amp;gt;://home/glue_user/workspace/jupyter_workspace&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    -v &lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;"&lt;/span&gt;C:/Users/&amp;lt;your_username&amp;gt;/.aws://home/glue_user/.aws:ro&lt;span class="pl-pds"&gt;"&lt;/span&gt;&lt;/span&gt; \
    -e DISABLE_SSL=true \
    public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 \
    //home/glue_user/jupyter/jupyter_start.sh \
    --NotebookApp.token=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt; \
    --NotebookApp.password=&lt;span class="pl-s"&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;span class="pl-pds"&gt;'&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;In the browser:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Open &lt;a href="http://localhost:8888" rel="nofollow noopener noreferrer"&gt;http://localhost:8888&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Navigate to the "jupyter_workspace" folder.&lt;/li&gt;
&lt;li&gt;Open "&amp;lt;notebook_name&amp;gt;.ipynb" and run it.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In VS Code:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Open "&amp;lt;notebook_name&amp;gt;.ipynb" in VS Code.&lt;/li&gt;
&lt;li&gt;Point the kernel picker at the existing Jupyter server "&lt;a href="http://127.0.0.1:8888" rel="nofollow noopener noreferrer"&gt;http://127.0.0.1:8888&lt;/a&gt;".&lt;/li&gt;
&lt;li&gt;Select the PySpark / Python 3 kernel.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Monitor Spark jobs in real time through the Spark UI at &lt;a href="http://localhost:4040" rel="nofollow noopener noreferrer"&gt;http://localhost:4040&lt;/a&gt;.&lt;/p&gt;



&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@measure_time&lt;/code&gt;&lt;/strong&gt; -- Lightweight, always useful, zero overhead (a minimal sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@show_sample_output(n)&lt;/code&gt;&lt;/strong&gt; -- Quick peek at transformation output during development.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;debug_udf&lt;/code&gt;&lt;/strong&gt; -- Uses Spark Accumulators to capture UDF inputs/outputs on the driver (sketched below). Solves the "why is my UDF returning None" problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid decorators that trigger Spark actions&lt;/strong&gt; (&lt;code&gt;.count()&lt;/code&gt;, &lt;code&gt;.collect()&lt;/code&gt;) -- they silently double your compute cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glue Docker uses Livy&lt;/strong&gt; -- &lt;code&gt;print()&lt;/code&gt; inside nested functions gets swallowed. Buffer output and print at the top level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accumulators are write-only on executors and readable only on the driver&lt;/strong&gt; -- cap and slice your captured samples on the driver side.&lt;/li&gt;
&lt;/ul&gt;
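
&lt;p&gt;As a quick illustration of the first item, here's a minimal sketch of a &lt;code&gt;@measure_time&lt;/code&gt;-style decorator -- the shape of the idea, not the toolkit's exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from functools import wraps

def measure_time(func):
    # Wall-clock timing around a pipeline step. For lazy Spark
    # transformations this measures only driver-side planning time;
    # it never triggers an action, hence "zero overhead".
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"[measure_time] {func.__name__}: {time.perf_counter() - start:.2f}s")
        return result
    return wrapper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;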

&lt;p&gt;The UDF debugger alone has saved me hours.&lt;/p&gt;
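
&lt;p&gt;And here's the accumulator pattern behind the UDF debugger, boiled down to a sketch -- assuming an active SparkSession named &lt;code&gt;spark&lt;/code&gt;, and treating this as an illustration rather than the repo's exact code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.accumulators import AccumulatorParam

class ListAccumulator(AccumulatorParam):
    # Accumulator that merges lists: executors may only add() to it,
    # and its value can only be read back on the driver.
    def zero(self, value):
        return []
    def addInPlace(self, acc, other):
        acc.extend(other)
        return acc

samples = spark.sparkContext.accumulator([], ListAccumulator())

def debug_udf(func):
    # Wrap a plain Python function before registering it as a UDF.
    def wrapper(*args):
        result = func(*args)
        samples.add([(args, result)])  # write-only on the executor side
        return result
    return wrapper

# e.g. spark.udf.register("my_udf", debug_udf(my_func))
# After an action has run, inspect a bounded slice on the driver:
# print(samples.value[:20])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;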

</description>
      <category>pyspark</category>
      <category>aws</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>Adding Audit Columns to Existing Tables: Comparing Approaches for Large Datasets</title>
      <dc:creator>Vishwa.P</dc:creator>
      <pubDate>Wed, 16 Apr 2025 18:57:39 +0000</pubDate>
      <link>https://dev.to/vishwa_p_c4dc415de5/adding-audit-columns-to-existing-tables-comparing-approaches-for-large-datasets-a6l</link>
      <guid>https://dev.to/vishwa_p_c4dc415de5/adding-audit-columns-to-existing-tables-comparing-approaches-for-large-datasets-a6l</guid>
      <description>&lt;h1&gt;
  
  
  Adding Audit Columns to Existing Tables: Comparing Approaches for Large Datasets
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In data engineering, adding audit columns like &lt;code&gt;bd_insert_dtm&lt;/code&gt; and &lt;code&gt;bd_updated_dtm&lt;/code&gt; to track when records are created or modified is a common requirement. When dealing with large datasets (2-5GB files), choosing the right approach becomes critical for performance and resource utilization.&lt;/p&gt;

&lt;p&gt;This post compares four different methods to implement this seemingly simple task, helping you choose the right tool for your specific needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;We need to add audit timestamp columns to existing tables with file sizes ranging from 2GB to 5GB. Let's explore our options:&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 1: PySpark
&lt;/h2&gt;

&lt;p&gt;PySpark leverages distributed computing, making it ideal for large datasets. While it might seem like overkill for 2-5GB files, it scales beautifully as your data grows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;current_timestamp&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Spark session
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add Audit Columns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Read your data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add audit columns
&lt;/span&gt;&lt;span class="n"&gt;df_with_audit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_insert_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
                  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_updated_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Write the result
&lt;/span&gt;&lt;span class="n"&gt;df_with_audit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly scalable to much larger datasets&lt;/li&gt;
&lt;li&gt;Parallelized processing&lt;/li&gt;
&lt;li&gt;Built-in functions for timestamps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Startup overhead&lt;/li&gt;
&lt;li&gt;Requires Spark environment&lt;/li&gt;
&lt;li&gt;More complex for simple tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Approach 2: Pandas
&lt;/h2&gt;

&lt;p&gt;Pandas offers simplicity and ease of use, loading the entire dataset into memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="c1"&gt;# Read your data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add audit columns
&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_insert_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_updated_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;

&lt;span class="c1"&gt;# Write the result
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_path.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple and straightforward&lt;/li&gt;
&lt;li&gt;Familiar API for data scientists&lt;/li&gt;
&lt;li&gt;Great for quick iterations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads entire dataset into memory&lt;/li&gt;
&lt;li&gt;May struggle with 2GB+ files on machines with limited RAM (chunked reading, sketched below, is a partial workaround)&lt;/li&gt;
&lt;li&gt;Single-threaded operations&lt;/li&gt;
&lt;/ul&gt;
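
&lt;p&gt;If you want to stay in pandas but RAM is the bottleneck, chunked reading is a reasonable middle ground. A minimal sketch, assuming the same CSV input as above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from datetime import datetime

current_time = datetime.now()

# Stream the file in 1M-row chunks so only one chunk sits in memory
for i, chunk in enumerate(pd.read_csv("your_file.csv", chunksize=1_000_000)):
    chunk["bd_insert_dtm"] = current_time
    chunk["bd_updated_dtm"] = current_time
    # First chunk creates the file with a header; later chunks append
    chunk.to_csv("output_path.csv", mode="w" if i == 0 else "a",
                 header=(i == 0), index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;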

&lt;h2&gt;
  
  
  Approach 3: Dask
&lt;/h2&gt;

&lt;p&gt;Dask combines the familiar Pandas API with out-of-core processing for larger-than-memory datasets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dask.dataframe&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="c1"&gt;# Read your data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add audit columns
&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_insert_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_updated_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;

&lt;span class="c1"&gt;# Write the result
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_path/*.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pandas-like API&lt;/li&gt;
&lt;li&gt;Handles larger-than-memory datasets&lt;/li&gt;
&lt;li&gt;Parallel execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More complex than pandas&lt;/li&gt;
&lt;li&gt;Output is split across multiple files by default (see the single-file note below)&lt;/li&gt;
&lt;li&gt;Some operations require careful consideration&lt;/li&gt;
&lt;/ul&gt;
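
&lt;p&gt;On the multiple-files point: Dask can write a single CSV if you pass &lt;code&gt;single_file=True&lt;/code&gt; to &lt;code&gt;to_csv&lt;/code&gt;, at the cost of funneling the write through one task:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Collapse the partitioned output into one file (slower, but convenient)
df.to_csv("output_path.csv", single_file=True, index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;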

&lt;h2&gt;
  
  
  Approach 4: Using Generators
&lt;/h2&gt;

&lt;p&gt;A generator-style streaming approach -- reading and writing the file row by row with the &lt;code&gt;csv&lt;/code&gt; module -- provides the most memory-efficient solution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_audit_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;infile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;infile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outfile&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Handle header
&lt;/span&gt;        &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_insert_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bd_updated_dtm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Process rows
&lt;/span&gt;        &lt;span class="n"&gt;current_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d %H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_time&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writerow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="nf"&gt;add_audit_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimal memory footprint&lt;/li&gt;
&lt;li&gt;Works on any machine&lt;/li&gt;
&lt;li&gt;Simple to understand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential processing&lt;/li&gt;
&lt;li&gt;Limited functionality compared to dataframe libraries&lt;/li&gt;
&lt;li&gt;Manual handling of data types&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparison and Recommendations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Memory Usage&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Scalability&lt;/th&gt;
&lt;th&gt;Ease of Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PySpark&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pandas&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dask&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generator&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For 2-5GB files:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With sufficient RAM: Pandas offers the simplest solution&lt;/li&gt;
&lt;li&gt;With limited RAM: Dask or generators are better choices&lt;/li&gt;
&lt;li&gt;With an existing Spark environment: PySpark makes sense&lt;/li&gt;
&lt;li&gt;For absolute memory efficiency: Go with generators&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;When adding audit columns to existing tables, the best approach depends on your specific constraints. For most cases with 2-5GB files, Dask provides an excellent balance between ease of use and performance. However, generators shine when working in extremely memory-constrained environments, while PySpark is the go-to solution if you anticipate scaling to much larger datasets in the future.&lt;/p&gt;

&lt;p&gt;What approach are you using for adding audit columns to your tables? Let me know in the comments!&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>pyspark</category>
      <category>python</category>
    </item>
  </channel>
</rss>
