Alberto Cardenas

Hello World, PardoX: Why I Built a Data Engine in Rust (and Why I Need You to Break It)

Introduction: The Leap into the Void

I am not writing this from a boardroom in Silicon Valley, nor am I backed by a team of fifty senior engineers with unlimited budgets. I write this from my desk, surrounded by empty coffee mugs, feeling that specific blend of pride and terror that any developer feels before hitting git push on a public repository of this magnitude for the first time.

PardoX is my leap into the void.

For the past few months, I have immersed myself in a solitary obsession: performance. As a Data Engineer, I have experienced firsthand the frustration of watching pipelines collapse not because of logic complexity, but due to tool inefficiency. I’ve watched servers run out of RAM simply trying to read a poorly optimized CSV. That frustration turned into curiosity, and that curiosity transformed into PardoX. But I need to be brutally honest with you from line one: this is my first large-scale Open Source project. I am not a corporation; I am just an engineer obsessed with the idea that we can process data faster and with fewer resources.

The road to this v0.1 Beta has been intense, technical, and often overwhelming. Rust is a wonderful language, but its learning curve is a vertical wall. Fighting the borrow checker, understanding the unsafe memory management needed to interact with Python, and designing a robust FFI (Foreign Function Interface) architecture are not trivial tasks. And this brings me to the second honest confession of this launch: PardoX is a child of its time.

If I had attempted to build this engine three years ago, writing every single line of code, every unit test, and every piece of documentation manually, I would probably be releasing this in 2028. To be efficient and realistic, I have used Artificial Intelligence as a force multiplier. AI didn’t design the architecture—that vision is mine—nor did it decide on the memory trade-offs, but it was the tireless co-pilot that helped me translate complex ideas into Rust syntax, debug obscure compilation errors, and generate the necessary boilerplate to make the Python wrapper feel native. Without this symbiosis between human architect and digital assistant, PardoX would still remain just a diagram in my notebook.

What I present today is not a finished product wrapped in a bow. It is not an immaculate final version. The v0.1 Beta is, by design, the beginning of a journey. It is an invitation to enter the construction site. It is very likely you will find bugs. You might try to load a dataset with strange encoding, and the engine might panic. And that is exactly what I need.

I am publishing this because I firmly believe that software does not improve in the dark. I need you—Data Engineer, Backend Developer, or performance enthusiast—to take this engine and push it to its limits. I need your eyes on the code and your real-world experience. I am not looking for applause for a perfect job; I am looking for constructive criticism from colleagues who understand that building hard software requires courage.

So, welcome to PardoX. This represents months of work, learning, and sleepless nights. It is imperfect, it is fast, and it is mine. Now, it is yours too.

Chapter 1: The Obsession with "Zero-Copy"

If you are a Data Engineer or Data Scientist, you know this horror story: you have a 5 GB CSV file. You try to open it in Pandas. Your RAM jumps from 2 GB to 18 GB. Your laptop fan sounds like it's about to take off. And you haven't even started cleaning the data yet.

Why does this happen? The silent culprit is called "Serialization Overhead".

In the traditional Python ecosystem, reading data is a painfully bureaucratic process. The engine reads bytes from the disk, decodes them into Python strings (heavy PyObject wrappers), then tries to infer whether they are numbers, and finally, if you are lucky, converts them into a NumPy array. In this process, data is copied and transformed multiple times. It's as if, to move furniture from one house to another, you had to disassemble it, put it in boxes, take it out of the boxes, and reassemble it in every room it passes through. It is inefficient and it is slow.

PardoX was born from an obsession with eliminating those middlemen. The core philosophy is called Zero-Copy.

When I designed the PardoX ingestion engine, the rule was simple: data must move from disk to operational memory exactly once. No intermediate objects. No dynamically growing Python lists.

We use a technique called Memory Mapping (mmap). Instead of "reading" the file in the traditional sense, we tell the Operating System: "Map this file directly into the process's virtual address space." PardoX then uses raw unsafe pointers in Rust to navigate those bytes.
When you execute px.read_csv("data.csv"), what really happens under the hood is a low-level choreography:

  1. Rust reserves a contiguous block of memory (the "HyperBlock").
  2. A Thread Pool scans the file in parallel, detecting newlines and commas without ever creating string objects.
  3. Numeric values are parsed directly from ASCII bytes into primitive f64 or i64 types and written directly into the final HyperBlock.

Python never sees the data during this process. Python only receives a pointer, a "reference" to that memory block. This means you can load massive datasets in a fraction of the time and, most importantly, using a fraction of the RAM. It's not magic; it's efficient resource management. It's treating your hardware with the respect it deserves.
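To make the idea concrete, here is a minimal Python sketch of memory-mapped CSV parsing. This is illustrative only: the real engine does this in Rust with raw pointers, and the file name and helper below are my own inventions. Python still materializes small byte slices where Rust would not, but the key point survives: no line-by-line string objects, just bytes mapped by the OS.

```python
import mmap
import os

# Write a tiny stand-in CSV to disk (illustrative, not real pardox usage).
with open("demo.csv", "wb") as f:
    f.write(b"3.5,2\n1.5,4\n")

def parse_csv_mmap(path):
    """Parse numeric fields straight out of a memory-mapped file.

    The OS maps the file into the process's address space; we scan for
    newlines and commas in the raw bytes and convert tokens to floats
    without ever building a list of decoded text lines.
    """
    rows = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start, size = 0, len(mm)
        while start < size:
            end = mm.find(b"\n", start)
            if end == -1:
                end = size
            # float() accepts ASCII bytes directly: no str decoding step.
            rows.append([float(tok) for tok in mm[start:end].split(b",") if tok])
            start = end + 1
    return rows

rows = parse_csv_mmap("demo.csv")
print(rows)  # [[3.5, 2.0], [1.5, 4.0]]
os.remove("demo.csv")
```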

Chapter 2: Hybrid Architecture (The Marriage between Rust and Python)

There is a question I was constantly asked during the development of PardoX: “If Rust is so fast and safe, why didn’t you make a pure Rust library? Why drag Python into the equation?”

The short answer is: Because I am a realist.

Rust is, without a doubt, an engineering marvel. Its type system and memory management are the modern gold standard. But let’s be honest: nobody wants to write 50 lines of strict code, fight the borrow checker, and define explicit lifetimes just to sum two columns in an Excel sheet. In the world of data analysis, human iteration speed is just as important as machine execution speed. Python won that war years ago because of its readability and simplicity.

However, Python has an “original sin”: the GIL (Global Interpreter Lock). It doesn’t matter how many cores your state-of-the-art server has; the standard Python interpreter (CPython) can only execute one thread of bytecode at a time. For CPU-intensive tasks, like processing millions of records, Python is like trying to drive a Ferrari in a school zone: you have the engine, but you’re not allowed to use it.

PardoX is a marriage of convenience between these two worlds, designed under a strict hybrid architecture.

The Brain (Python): We use Python for what it does best: the interface. When you write df.filter(...), you are using the friendly syntax we all know. Python acts as the “Remote Control.” It doesn’t process data; it just sends orders.

The Muscle (Rust): This is where the truth lives. PardoX compiles a shared dynamic library (.so on Linux, .dll on Windows, .dylib on Mac). When Python sends a command, PardoX crosses the FFI (Foreign Function Interface) bridge using ctypes.

What happens at that moment is critical: the GIL is released.

Once we enter Rust territory, we escape Python’s constraints. Suddenly, we can spawn 16, 32, or 64 threads in real parallelism. We can use SIMD instructions to add four numbers in a single clock cycle. We can manage memory manually, bit by bit.

Maintaining this marriage is not easy. It requires unsafe code blocks in Rust, where we tell the compiler: “Trust me, I know what I’m doing with this pointer.” A mistake here doesn’t throw a pretty Python exception; it causes a Segmentation Fault that kills the entire process. That is the tightrope I have walked these past months. But the result is worth it: the ergonomics of a Python script with the “bare-metal” performance of a compiled system. It’s having the steering wheel of a family sedan and the engine of a rocket ship.
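The ctypes side of that bridge looks roughly like the sketch below. PardoX's actual exported symbols are not shown here (I don't document them in this post), so the snippet calls a function from the C runtime, labs, as a stand-in; the marshalling pattern is the same. Note the GIL point: ctypes releases the GIL around every foreign call, which is exactly what lets a native core run real threads while the interpreter keeps going.

```python
import ctypes

# On POSIX, CDLL(None) exposes symbols already loaded into the process,
# including the C runtime. PardoX instead ships its own .so/.dll/.dylib,
# which you would load by path the same way.
libc = ctypes.CDLL(None)

# Declare the foreign function's signature so ctypes marshals correctly.
libc.labs.argtypes = [ctypes.c_long]
libc.labs.restype = ctypes.c_long

# While this call executes in native code, CPython has released the GIL.
result = libc.labs(-42)
print(result)  # 42
```

A real binding would pass ctypes.c_void_p buffer pointers instead of a single integer, but the load/declare/call choreography is identical.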

Chapter 3: The Native Format (.prdx)

I know what you’re thinking. “Seriously, Alberto? Yet another file format? Didn’t we have enough with CSV, JSON, Parquet, Avro, Feather, and ORC?”

Believe me, the last thing I wanted to do was reinvent the wheel. But when you are chasing extreme performance, you realize that existing formats are designed with other priorities in mind. CSV is for human readability. JSON is for the web. Parquet is amazing for long-term storage because it compresses data aggressively, but that compression comes at a cost: your CPU has to work overtime to decompress before you can read the first byte of data.

The .prdx format was born from a specific need: Instant Persistence.

To understand why .prdx is fast, we must first understand why the others are slow. The enemies here are called “Parsing” and “Deserialization”. Imagine saving a DataFrame to CSV. Your computer has to convert binary numbers (like 3.14159) into ASCII text ("3.14159"), byte by byte. When you want to read it back, the engine has to read the text, hunt for commas, handle quotes, and convert that text back into binary. It is a massive waste of clock cycles.
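You can see the two round-trips side by side in a few lines of Python. The text path converts binary to ASCII and back; the binary path just copies the 8 raw bytes of the IEEE-754 double, with nothing to parse:

```python
import struct

value = 3.14159

# Text round-trip (what CSV forces on every cell):
# binary float -> ASCII digits -> parse back to binary.
text = repr(value)                     # '3.14159', 7 bytes of ASCII
parsed_from_text = float(text)

# Binary round-trip (what a columnar binary format does):
# copy the raw 8 bytes of the double, no conversion at all.
raw = struct.pack("<d", value)         # exactly 8 bytes
(parsed_from_raw,) = struct.unpack("<d", raw)

print(parsed_from_text == value, parsed_from_raw == value)  # True True
```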

Parquet is better; it is binary. But Parquet is designed to save space. It uses complex algorithms (Run-Length Encoding, Dictionary Encoding, Snappy/Zstd). To read a Parquet file, your CPU has to “inflate” the data. It is fast, but it is still CPU-bound.

The .prdx format works differently. We don’t parse. We don’t decompress. We teleport.

Technically, a .prdx file is a structured Core Dump of RAM. I designed the file layout to be identical to how Rust organizes data in memory (Columnar Layout). When you execute df.to_prdx("data.prdx"), the engine takes the memory block and flushes it to disk exactly as it is.

But the real magic happens when reading. When using px.read_prdx(), we use a system call named mmap (Memory Map). Instead of saying, “Operating System, read this file and put it into RAM,” we say, “Operating System, trick the process into believing that this file on disk IS the RAM.”

This has three brutal consequences:

  1. Instant Load: Startup time is almost zero. The file is not “loaded”; it is mapped.
  2. On-Demand Paging: If you have a 100 GB file but only read the “Price” column, the Operating System will only fetch the memory pages corresponding to that column. You don’t waste RAM on what you don’t use.
  3. Hardware Speed: By eliminating the CPU from the equation (no parsing, no decompression), the only limit is the physical speed of your hard drive.
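The mmap trick itself can be demonstrated from pure Python. The sketch below writes a "column" of doubles to disk in its in-memory layout, maps the file, and reinterprets the mapped bytes as doubles without copying them. The actual .prdx layout is more involved (headers, multiple columns); the file name and shape here are illustrative only.

```python
import mmap
import os
import struct

# Persist one column of doubles exactly as it sits in memory
# (native byte order, contiguous, no delimiters).
column = [1.0, 2.5, 4.0]
with open("col.bin", "wb") as f:
    f.write(struct.pack("%dd" % len(column), *column))

with open("col.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Zero-copy reinterpretation: the memoryview casts the mapped bytes
    # to doubles; pages are faulted in by the OS only as they are touched.
    view = memoryview(mm).cast("d")
    loaded = list(view)
    view.release()
    mm.close()

print(loaded)  # [1.0, 2.5, 4.0]
os.remove("col.bin")
```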

In my tests with NVMe PCIe Gen 4 drives, I have achieved sustained read speeds of 4.6 GB/s. The bottleneck is no longer my code, nor Python, nor Rust. The bottleneck is the silicon physics of my SSD. And that, my friends, is the only barrier I am willing to accept.

Chapter 4: User Experience (DX) - Familiarity Above All

There is an unwritten rule in software development that I learned the hard way: If you build the fastest engine in the world, but you need a 500-page manual to turn it on, no one will use it.

When I started designing the Python interface for PardoX, I faced a massive temptation. I wanted to create new, “better,” more logical function names. Instead of read_csv, I wanted the user to write pardox.ingest_stream(). Instead of df.head(), I wanted to use df.peek(). I felt very smart reinventing the wheel.

Fortunately, my pragmatic (and lazy) side won. I understood that developers’ “muscle memory” is sacred. Millions of people already know how to use Pandas. They know that filtering is done with brackets [] and that summing columns uses the + sign. Changing that is not innovation; it is arrogance.

The premise of the Developer Experience (DX) in PardoX is simple: If you know DataFrames, you know PardoX.

My goal was to create an “illusion of simplicity.” I want you to feel like you are writing the same old Python code, while underneath, the ground is moving at breakneck speeds.

Let me give you a technical example of this duality. When you write something as innocent as:

```python
df['total'] = df['price'] * df['quantity']
```

To you, it is a multiplication. To PardoX, it is major surgery.

What You See (The Surface): A * operator. Simple, clean, readable.

What I See (The Basement):

  1. Python invokes the magic method __mul__.
  2. The wrapper intercepts the call and verifies that both columns are numeric and have the same length (schema validation).
  3. Memory pointers (ctypes.c_void_p) are extracted from both underlying arrays in the HyperBlock.
  4. The FFI border is crossed into Rust.
  5. Rust detects your CPU architecture (Do you have AVX2? Do you have NEON?).
  6. Rust divides the arrays into “chunks” that fit into your processor’s L1 cache.
  7. SIMD instructions are executed to multiply 8 numbers at once.
  8. A new pointer is returned to Python, encapsulated in a new Series.

All that chaos, all that unsafe memory management and hardware detection, happens in microseconds and is completely invisible to you. That is my responsibility, not yours.
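The interception pattern in steps 1 and 2 of the list above is plain operator overloading. Here is a toy stand-in (these class and method names are mine, not the real pardox API): Python owns only a handle, __mul__ intercepts the * operator and validates the schema, and the arithmetic would then cross the FFI instead of running in Python.

```python
class Series:
    """Toy stand-in for a PardoX column (illustrative names only)."""

    def __init__(self, data):
        # The real engine holds a raw pointer into the HyperBlock instead.
        self._data = list(data)

    def __mul__(self, other):
        # Steps 1-2: intercept `*` and validate the schema.
        if len(self._data) != len(other._data):
            raise ValueError("length mismatch")
        # Steps 3-8 would extract pointers and cross the FFI into Rust;
        # here we emulate the result in pure Python.
        return Series(a * b for a, b in zip(self._data, other._data))

    def to_list(self):
        return self._data

price = Series([2.0, 3.0])
quantity = Series([4, 5])
total = price * quantity          # dispatches to Series.__mul__
print(total.to_list())            # [8.0, 15.0]
```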

PardoX is complex on the inside so it can be simple on the outside. I don’t want you to learn a new API. I want you to take your current scripts, change import pandas as pd to import pardox as px, and watch your execution times collapse. That is the true demonstration of technology: when the tool disappears, and only the result remains.

Chapter 5: The Future and Universality

If you’ve read this far, you might think PardoX is just “another fast library for Python.” And you would be right, but only partially. My vision for this engine is far broader. Python is merely the first client, the first guest at the party.

The beauty of having written the Core in Rust and exposing it through a standard C binary interface (C-ABI) is that PardoX doesn’t belong to any specific language. It belongs to the operating system.

The Road to Universality (Roadmap v0.1.x)

We are not waiting for version 2.0 to expand. The strategy for the coming weeks is to release incremental updates within this very Beta phase. I want to democratize performance.

  • v0.1.1 (PHP): Yes, PHP. Often ignored by the “Big Data” community, yet it powers half the web. I want a Laravel developer to process a 1GB CSV in milliseconds without blocking the server.
  • v0.1.2 (Node.js): For the modern backend. We will bring native bindings so that the JavaScript Event Loop never freezes while processing data.
  • The Horizon (Go, Java... and COBOL): We will move down the tech stack. And yes, I am serious about COBOL. There are terabytes of banking data trapped in mainframes that need modern speed. If we can compile a compatible binary, PardoX will be there.

Looking Ahead: What’s Coming in v0.2

While we stabilize the current Beta, my mind is already architecting version 0.2. This isn’t just about bug fixes; it’s about new offensive capabilities:

  • Native and Agnostic Connectivity: Currently, we read CSV and Postgres. In v0.2, the Rust engine will speak natively with MySQL, SQL Server, and MongoDB. But more importantly, we are going to support legacy flat files like .dat and .fixedwidth. I want PardoX to be the Swiss Army knife that connects modern databases with files from 20 years ago.
  • Advanced Types & Compiled Regex: Text manipulation is slow. In v0.2, we will introduce accelerated string manipulation kernels. Imagine running Regular Expressions (Regex) or splitting millions of text strings, but executed by Rust’s Regex engine (which is incredibly fast) instead of Python’s engine.
  • ML Bridge (The AI Bridge): This is the “Holy Grail.” We are designing a Zero-Copy export to NumPy and Apache Arrow. The goal is for you to load and clean data with PardoX, and then pass the memory pointer directly to PyTorch or TensorFlow to train models, without duplicating a single byte of RAM.
  • Testing Tools (Fake Postgres & Fake API): As an engineer, I hate spinning up heavy Docker containers just to validate a pipeline. We are going to implement a “Fake Postgres” and a “Fake API” inside PardoX. You will be able to simulate receiving data from a real database or a REST endpoint for your unit tests, all simulated in-memory at lightning speed.

PardoX is not just a DataFrame; it is portable data infrastructure. Today we start with Python. Tomorrow, the world.

Final Reflection: The Vertigo of Releasing

Releasing this gives me a sense of vertigo that is hard to explain. There is a perfectionist part of me that wants to keep the repository private forever—polishing that aggregation function one more time, refactoring that unsafe block again, ensuring the documentation reads like pure literature. But I have learned that software that isn’t released simply does not exist.

PardoX is, in many ways, like a newborn. It is loud, sometimes unpredictable, and requires constant attention. But it also holds infinite potential. What you see today is the foundation, the concrete slab upon which we will build data skyscrapers. It is my personal bet on a future where extreme performance is not exclusive to systems experts but an accessible tool in every developer’s backpack.

  • GitHub Repository: github.com/betoalien/PardoX
  • Official Documentation: betoalien.github.io/PardoX/

The Manifesto

On Noise and Opinions

On this journey, I have learned to filter out the noise. The internet is full of opinions on which tool is “the best.” Twitter and Reddit are battlefields where people theoretically argue whether one language is superior to another, whether static types are better than dynamic ones, or if you should rewrite everything in the latest trendy framework.

But honestly, I try not to get distracted by theoretical debates or synthetic benchmark wars. I focus on what builds. If you come to tell me that Rust is better than C++ or vice versa just to be right, I probably won’t answer. I don’t have time for holy wars.

But if you come with an idea, with a weird use case, with a bug you found processing data from a pharmacy in a remote village with an unstable connection… then we are on the same team. If you come with your hands dirty with code and a desire to solve a real problem, this is your home.

Join the Resistance (Call to Data Engineers)

This is my formal invitation. Join the beta.

I am specifically looking for Data Engineers and backend developers who deal with slow pipelines, maintenance windows that close too fast, and “Out of Memory” errors at 3 AM. Help me break this so I can build it better.

Download the engine:

```shell
pip install pardox
```

Throw your worst CSVs at it, those 50GB monsters that make your RAM weep. Try to build a complex ETL, connect it to that legacy database no one dares to touch, and tell me where it breaks. Tell me what function is missing to make your life easier.

The code is compiled. The tests have passed. The coffee is ready.

Alberto Cárdenas.

📬 Contact Me: Tell Me Your Horror Story

I need to get out of my head and into your reality. Send me your use cases and your frustrations.

  • Direct Email: iam@albertocardenas.com (I read every email that adds value or proposes solutions).
  • LinkedIn: linkedin.com/in/albertocardenasd (Let’s connect. Mention you read the “pardoX” series so I accept you quickly).
  • X (PardoX Official): x.com/pardox_io (News and releases).
  • X (Personal): x.com/albertocardenas (My day-to-day in the trenches).
  • BlueSky: bsky.app/profile/pardoxio.bsky.social

See you in the compiler.
