<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Google Jr</title>
    <description>The latest articles on DEV Community by Google Jr (@andilejaden).</description>
    <link>https://dev.to/andilejaden</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F461863%2Ff0e4d22f-3af9-4522-bb86-744285af514e.jpg</url>
      <title>DEV Community: Google Jr</title>
      <link>https://dev.to/andilejaden</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/andilejaden"/>
    <language>en</language>
    <item>
      <title>Messy vCards with Pure Python, Pandas and Polars</title>
      <dc:creator>Google Jr</dc:creator>
      <pubDate>Mon, 19 May 2025 12:42:31 +0000</pubDate>
      <link>https://dev.to/andilejaden/messy-vcards-with-pure-python-pandas-and-polars-2i79</link>
      <guid>https://dev.to/andilejaden/messy-vcards-with-pure-python-pandas-and-polars-2i79</guid>
      <description>&lt;h4&gt;
  
  
  The Problem That Started It All
&lt;/h4&gt;

&lt;p&gt;I had never thought to write an article about a problem while I was still solving it. Today was different. In what may be my last code contribution with District 4 Labs, I felt encouraged to write about this particular experience. It began with a database dump that looked like a digital archaeological site. Buried inside a field called &lt;code&gt;value&lt;/code&gt; was an assortment of contact details stored in vCard format: names, titles, prefixes, emoji-filled nicknames, and the occasional vCard tag salad.&lt;/p&gt;

&lt;p&gt;The mission? Clean this mess. Extract just the useful data (name, phone, email, company, position, and so on) from a sea of structured chaos.&lt;/p&gt;

&lt;p&gt;But cleaning up isn’t the only concern in the data world. Efficiency matters too. So, I decided not only to build a solution but to benchmark it using three different approaches: pure Python, Pandas, and Polars.&lt;/p&gt;
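
&lt;p&gt;For anyone who wants to run a similar comparison, here is a minimal timing sketch using only the standard library. The two cleaning functions are hypothetical stand-ins, not the scripts behind the numbers in this post, and the best-of-five strategy is just one reasonable choice.&lt;/p&gt;

```python
import time


def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical stand-ins for the real cleaning functions.
def clean_with_comprehension(rows):
    return [row.strip().upper() for row in rows]


def clean_with_map(rows):
    return list(map(lambda row: row.strip().upper(), rows))


rows = ["  fn:jane doe  "] * 10_000
timings = {
    func.__name__: time_call(func, rows)
    for func in (clean_with_comprehension, clean_with_map)
}

# Print the fastest approach first.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f} s")
```

&lt;p&gt;Taking the best of several runs reduces noise from whatever else the machine is doing. Feeding every approach the same input keeps the comparison fair.&lt;/p&gt;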

&lt;p&gt;&lt;strong&gt;Phase 1: The Straightforward Python Attempt&lt;/strong&gt;&lt;br&gt;
The first approach was a clean and readable pure Python script. Using &lt;code&gt;csv.reader&lt;/code&gt;, some well-placed string manipulation, and regex, it looped through each row, parsed the vCard string, and extracted the fields I was looking for: name, phone, email, and so on.&lt;/p&gt;
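
&lt;p&gt;The original script isn’t reproduced here, but a minimal sketch of this kind of parser might look like the following. The property list, the regex, the sample row, and the helper names &lt;code&gt;parse_vcard&lt;/code&gt; and &lt;code&gt;clean_rows&lt;/code&gt; are assumptions for illustration, not the benchmarked code.&lt;/p&gt;

```python
import csv
import io
import re

# Assumed mapping of output column to vCard property name.
FIELDS = {
    "name": "FN",
    "phone": "TEL",
    "email": "EMAIL",
    "company": "ORG",
    "position": "TITLE",
}

# One compiled pattern per property: the property name at the start of a
# line, optional ";PARAM=..." parts, a colon, then the value up to line end.
PATTERNS = {
    key: re.compile(rf"^{prop}(?:;[^:\r\n]*)?:([^\r\n]*)", re.MULTILINE | re.IGNORECASE)
    for key, prop in FIELDS.items()
}


def parse_vcard(text):
    """Return the first value of each wanted property, or None if absent."""
    record = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[key] = match.group(1).strip() if match else None
    return record


def clean_rows(csv_file, column="value"):
    """Parse the vCard stored in `column` of every CSV row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    index = header.index(column)
    return [parse_vcard(row[index]) for row in reader]


# Made-up sample: one CSV row whose quoted field holds a multi-line vCard.
sample = io.StringIO(
    'id,value\n'
    '1,"BEGIN:VCARD\nFN:Jane Doe\nTEL;TYPE=CELL:+1 555 0100\n'
    'EMAIL:jane@example.com\nORG:Acme\nTITLE:Engineer\nEND:VCARD"\n'
)
rows = clean_rows(sample)
print(rows[0])
```

&lt;p&gt;This sketch keeps only the first occurrence of each property and ignores vCard line folding and escaping, which a genuinely messy dump would likely need.&lt;/p&gt;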

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy to write and reason about.&lt;/li&gt;
&lt;li&gt;Zero dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slower as dataset size increased.&lt;/li&gt;
&lt;li&gt;Lacks built-in optimizations for columnar operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Time to clean a 20MB dataset&lt;/strong&gt;: ~2.45 seconds&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: When in Doubt, Call Pandas&lt;/strong&gt;&lt;br&gt;
Next up was my all-time trusted data wrangling friend, Pandas. I brought in &lt;code&gt;DataFrame&lt;/code&gt;, used &lt;code&gt;groupby&lt;/code&gt; and &lt;code&gt;apply&lt;/code&gt;, and did some gymnastics with lambda functions.&lt;/p&gt;

&lt;p&gt;It worked — but it wheezed.&lt;/p&gt;
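
&lt;p&gt;To illustrate the row-wise style described above, here is a small sketch, not the original script. It assumes a DataFrame with a &lt;code&gt;value&lt;/code&gt; column of vCard strings; the regex, the sample rows, and the &lt;code&gt;first_match&lt;/code&gt; helper are made up for the example. It also shows &lt;code&gt;Series.str.extract&lt;/code&gt; as an alternative to a lambda.&lt;/p&gt;

```python
import re

import pandas as pd

# Assumed pattern shape: property name, optional parameters, colon, value.
NAME_PATTERN = r"^FN(?:;[^:\r\n]*)?:([^\r\n]*)"
EMAIL_PATTERN = r"^EMAIL(?:;[^:\r\n]*)?:([^\r\n]*)"
NAME_RE = re.compile(NAME_PATTERN, re.MULTILINE)


def first_match(pattern, text):
    """Return the first captured value, or None when the property is missing."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


# Made-up sample rows standing in for the database dump.
df = pd.DataFrame({
    "value": [
        "BEGIN:VCARD\nFN:Jane Doe\nEMAIL;TYPE=INTERNET:jane@example.com\nEND:VCARD",
        "BEGIN:VCARD\nFN:Sam Roe\nEND:VCARD",
    ]
})

# Row-wise: one Python function call per row via apply and a lambda.
df["name"] = df["value"].apply(lambda text: first_match(NAME_RE, text))

# String accessor: the same idea without a hand-written lambda.
df["email"] = df["value"].str.extract(EMAIL_PATTERN, flags=re.MULTILINE, expand=False)

print(df[["name", "email"]])
```

&lt;p&gt;Both forms still run the regex once per row on Python string objects, so neither is guaranteed to be fast on large, string-heavy columns.&lt;/p&gt;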

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Powerful for structured data.&lt;/li&gt;
&lt;li&gt;Excellent ecosystem and documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory-heavy.&lt;/li&gt;
&lt;li&gt;Slower due to single-threaded execution and object-based operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Time to clean a 20MB dataset&lt;/strong&gt;: ~12.02 seconds&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Enter Polars, the Silent Speed Demon&lt;/strong&gt;&lt;br&gt;
Polars is the Rust-powered DataFrame library that feels like Pandas went to the gym and started eating clean. I hadn’t heard of Polars until a data engineer colleague mentioned it recently. I was impressed the moment I read the introduction on their website.&lt;/p&gt;

&lt;p&gt;With lazy evaluation, native multi-threading, and optimized memory usage, the same operation that made Pandas sweat ran like lightning in Polars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Super fast.&lt;/li&gt;
&lt;li&gt;Built for modern hardware and large data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ecosystem is still maturing.&lt;/li&gt;
&lt;li&gt;Smaller community than Pandas (for now).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Time to clean a 20MB dataset&lt;/strong&gt;: ~1.31 seconds&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results Recap: The Showdown&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n6413kernraxv3r2h6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4n6413kernraxv3r2h6g.png" alt="Image description" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Don’t underestimate pure Python. It may not win races, but it shows up.&lt;/li&gt;
&lt;li&gt;Pandas is great, but not always performant, especially on larger, string-heavy datasets. In this benchmark it was the slowest of the three (~12.02 seconds).&lt;/li&gt;
&lt;li&gt;Polars is the future. It was the fastest of the three here (~1.31 seconds), and if you’re working with large-scale data pipelines or cleaning up gnarly datasets, it’s a game-changer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Closing Thoughts&lt;/strong&gt;&lt;br&gt;
Data cleaning isn’t glamorous, but it’s where real-world projects live and breathe. Choosing the right tool can make the difference between an afternoon well spent and one spent watching your laptop fan spin like a Merlin engine.&lt;/p&gt;

&lt;p&gt;If you’re wrangling vCards or any structured text data at scale, give Polars a try. Your CPU will thank you.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;Got questions or want a copy of the scripts? Drop a comment or connect with me on GitHub. Let’s clean up the mess — fast.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
