Google Jr

Messy vCards with Pure Python, Pandas and Polars

The Problem That Started It All

I had never thought to write an article about a problem while I was still solving it. Today was different. In what may be my last code contribution with District 4 Labs, I felt encouraged to write about this particular experience. It began with a database dump that looked like a digital archaeological site. Buried inside a field called value was an assortment of contact details stored in vCard format: names, titles, prefixes, emoji-filled nicknames, and the occasional vCard tag salad.

The mission? Clean this mess. Extract just the useful data (name, phone, email, company, position, etc.) from a sea of structured chaos.

But cleaning up isn’t the only concern in the data world. Efficiency matters too. So, I decided not only to build a solution but to benchmark it using three different approaches: pure Python, Pandas, and Polars.
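The article does not say how the timings were taken, so the following is only a sketch of one common way to time a cleaning function with the standard library. The `time_it` helper and the `busy_work` stand-in are hypothetical names used for illustration.

```python
import time


def time_it(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload (an assumption); in practice this would be one of the
# three cleaning functions being compared.
def busy_work(n):
    return sum(i * i for i in range(n))


elapsed = time_it(busy_work, 100_000)
print(f"best of 3: {elapsed:.4f} s")
```

Taking the best of a few runs reduces noise from other processes, which matters when the differences between approaches are only a second or two.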

Phase 1: The Straightforward Python Attempt
The first approach was a clean and readable pure Python script. Using csv.reader along with some well-placed string manipulation and regex, it looped through each row, parsed the vCard string, and extracted the fields I was after: name, phone, email, and so on.
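A minimal sketch of what such a script could look like, assuming the dump is a CSV whose value column holds one vCard per row. The regex patterns, the `parse_vcard` and `clean_rows` helpers, and the sample data are illustrative assumptions, not the original script.

```python
import csv
import io
import re

# Hypothetical patterns for a few common vCard properties; real dumps
# usually need more cases (grouped properties, folded lines, encodings).
FIELD_PATTERNS = {
    "name": re.compile(r"^FN[^:\n]*:(.*)$", re.MULTILINE),
    "phone": re.compile(r"^TEL[^:\n]*:(.*)$", re.MULTILINE),
    "email": re.compile(r"^EMAIL[^:\n]*:(.*)$", re.MULTILINE),
    "company": re.compile(r"^ORG[^:\n]*:(.*)$", re.MULTILINE),
    "position": re.compile(r"^TITLE[^:\n]*:(.*)$", re.MULTILINE),
}


def parse_vcard(text):
    """Return the first match for each field, or None when it is absent."""
    # Normalise CRLF line endings so the line-anchored patterns behave.
    text = text.replace("\r\n", "\n")
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1).strip() if match else None
    return record


def clean_rows(csv_file, column="value"):
    """Yield one cleaned record per CSV row (column name is an assumption)."""
    reader = csv.reader(csv_file)
    header = next(reader)
    index = header.index(column)
    for row in reader:
        yield parse_vcard(row[index])


# Made-up sample standing in for the real dump.
sample = io.StringIO(
    'id,value\n'
    '1,"BEGIN:VCARD\nVERSION:3.0\nFN:Ada Lovelace\n'
    'TEL;TYPE=CELL:+1 555 0100\nEMAIL:ada@example.com\n'
    'ORG:Analytical Engines\nTITLE:Engineer\nEND:VCARD"\n'
)
records = list(clean_rows(sample))
```

Because csv.reader handles quoted fields that span several lines, each vCard arrives as a single string, and the per-field regexes only have to worry about the vCard syntax itself.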

Pros

  • Easy to write and reason about.
  • Zero dependencies.

Cons

  • Slower as dataset size increased.
  • Lacks built-in optimizations for columnar operations.

Time to clean a 20MB dataset: ~2.45 seconds

Phase 2: When in Doubt, Call Pandas
Next up was my all-time trusted data wrangling friend, Pandas. I brought in DataFrame, used groupby and apply, and did some gymnastics with lambda functions.

It worked — but it wheezed.
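The row-wise style described above might look roughly like the sketch below. It shows only the apply-with-lambda part, not the groupby step, and the `extract` helper, the `PROPERTIES` mapping, and the sample rows are assumptions made for illustration.

```python
import re

import pandas as pd

# Hypothetical mapping from output column to vCard property name.
PROPERTIES = {"name": "FN", "phone": "TEL", "email": "EMAIL"}


def extract(text, prop):
    """Return the first value of a vCard property, or None if missing."""
    match = re.search(rf"^{prop}[^:\n]*:(.*)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


# Made-up sample rows standing in for the real value column.
df = pd.DataFrame(
    {
        "value": [
            "BEGIN:VCARD\nFN:Ada Lovelace\nTEL;TYPE=CELL:+1 555 0100\nEND:VCARD",
            "BEGIN:VCARD\nFN:Alan Turing\nEMAIL:alan@example.com\nEND:VCARD",
        ]
    }
)

# One Python-level function call per row and per field: readable, but every
# call goes through the interpreter, a likely source of the slowdown.
for column, prop in PROPERTIES.items():
    df[column] = df["value"].apply(lambda text, p=prop: extract(text, p))
```

Each apply call here runs ordinary Python code once per row, so Pandas adds DataFrame overhead without contributing any vectorized speedup.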

Pros

  • Powerful for structured data.
  • Excellent ecosystem and documentation.

Cons

  • Memory-heavy.
  • Slower due to single-threaded execution and object-based operations.

Time to clean a 20MB dataset: ~12.02 seconds

Phase 3: Enter Polars, the Silent Speed Demon
Polars is the Rust-powered DataFrame library that feels like Pandas went to the gym and started eating clean. I hadn’t heard of Polars until a Data Engineer colleague mentioned it recently. I was impressed the moment I read the introduction on their website.

With lazy evaluation, native multi-threading, and optimized memory usage, the same operation that made Pandas sweat ran like lightning in Polars.

Pros

  • Super fast.
  • Built for modern hardware and large data.

Cons

  • Still maturing ecosystem.
  • Smaller community than Pandas (for now).

Time to clean a 20MB dataset: ~1.31 seconds

Results Recap: The Showdown

Time to clean a 20MB dataset:

  • Pure Python: ~2.45 seconds
  • Pandas: ~12.02 seconds
  • Polars: ~1.31 seconds

Lessons Learned

  1. Don’t underestimate pure Python. It may not win races, but it shows up.
  2. Pandas is great, but not always performant, especially on larger, string-heavy datasets.
  3. Polars is the future. If you’re working with large-scale data pipelines or cleaning up gnarly datasets, it’s a game-changer.

Closing Thoughts
Data cleaning isn’t glamorous, but it’s where real-world projects live and breathe. Choosing the right tool can make the difference between an afternoon well spent and one spent watching your laptop fan spin like a Merlin engine.

If you’re wrangling vCards or any structured text data at scale, give Polars a try. Your CPU will thank you.


Got questions or want a copy of the scripts? Drop a comment or connect with me on GitHub. Let’s clean up the mess — fast.
