<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aaron Steers</title>
    <description>The latest articles on DEV Community by Aaron Steers (@aaronsteers).</description>
    <link>https://dev.to/aaronsteers</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F343034%2Fedb2f3a6-dbc2-4de1-b10c-32eff5b6ede1.png</url>
      <title>DEV Community: Aaron Steers</title>
      <link>https://dev.to/aaronsteers</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aaronsteers"/>
    <language>en</language>
    <item>
      <title>Introducing ReelTrust: What if data engineering could solve our AI deepfakes problem?</title>
      <dc:creator>Aaron Steers</dc:creator>
      <pubDate>Mon, 27 Oct 2025 19:13:05 +0000</pubDate>
      <link>https://dev.to/aaronsteers/introducing-reeltrust-what-if-data-engineering-could-solve-our-ai-deepfake-problem-5ba3</link>
      <guid>https://dev.to/aaronsteers/introducing-reeltrust-what-if-data-engineering-could-solve-our-ai-deepfake-problem-5ba3</guid>
      <description>&lt;p&gt;&lt;em&gt;TL;DR: This weekend I built &lt;a href="https://github.com/aaronsteers/ReelTrust" rel="noopener noreferrer"&gt;ReelTrust&lt;/a&gt; a new type of video authentication software, which I hope others will &lt;strong&gt;either&lt;/strong&gt; continue to expand upon it &lt;strong&gt;or&lt;/strong&gt; build something better - ASAP.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;ReelTrust is designed to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prove a video's source and authenticity.&lt;/li&gt;
&lt;li&gt;Protect that video against accusations of manipulation.&lt;/li&gt;
&lt;li&gt;Ensure that future doctoring is clearly detectable.&lt;/li&gt;
&lt;li&gt;Do all of the above on an open-source foundation, with transparency built in.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Under the hood, I treated this as a data engineering metadata-pipeline problem: build the simplest possible solution that handles huge audiovisual datasets efficiently, respects privacy and security, and can be leveraged at scale across multiple audiences and use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I built ReelTrust
&lt;/h2&gt;

&lt;p&gt;For at least two years now, we've all known that AI technologies would soon be creating videos that look identical to real-life footage. And with Sora 2 now launched, we've woken up to find that future is here. The news media (CNN, NYT, Reuters, etc.) and media distribution industries (YouTube, TikTok, etc.) have collectively failed to prepare for our current reality. It's not only a topic in the news itself: I now regularly overhear people in coffee shops bemoaning that they "just can't tell what is real anymore". &lt;/p&gt;

&lt;p&gt;With things getting worse, and no plan in place, I finally felt I needed to invest some of my weekend hours and take some action myself.&lt;/p&gt;

&lt;h2&gt;
  
  
  How ReelTrust works
&lt;/h2&gt;

&lt;p&gt;For those who just want to see some code, here's what the CLI looks like today. (Note: these code examples will work as written once the package is uploaded to PyPI.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install uv if it's not installed already&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;uv

&lt;span class="c"&gt;# Create a verification package&lt;/span&gt;
uvx reeltrust create-package my-orig-video.mp4

&lt;span class="c"&gt;# Verify the original video (passes)&lt;/span&gt;
uvx reeltrust verify my-orig-video.mp4 .data/outputs/packages/my-orig-video

&lt;span class="c"&gt;# Verify a compressed/re-encoded video (passes)&lt;/span&gt;
uvx reeltrust verify my-video-compressed.mp4 .data/outputs/packages/my-orig-video

&lt;span class="c"&gt;# Verify a clip taken from the original video (passes)&lt;/span&gt;
uvx reeltrust verify clip-from-my-video.mp4 .data/outputs/packages/my-orig-video

&lt;span class="c"&gt;# Verify a doctored version of the original video (fails)&lt;/span&gt;
uvx reeltrust verify my-video-tampered.mp4 .data/outputs/packages/my-orig-video

&lt;span class="c"&gt;# Verify a doctored clip taken from the original video (fails)&lt;/span&gt;
uvx reeltrust verify clip-from-my-video-tampered.mp4 .data/outputs/packages/my-orig-video
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ReelTrust Rollout
&lt;/h2&gt;

&lt;p&gt;Once an MVP (minimum viable product) is complete, ReelTrust would operate as a partnership across several parties:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Video Content Creators (CNN, New York Times, White House Correspondents, etc.)
&lt;/h3&gt;

&lt;p&gt;While capturing video, content creators pre-process their video and create a ReelTrust verification package. The verification package is a robust set of fingerprint metadata on the audio and video files.&lt;/p&gt;

&lt;p&gt;Creators invest in some additional processing and storage costs, in exchange for an improved trust relationship with their customers and their end users (i.e. content consumers).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Package Hosting and Indexing Providers (AWS S3)
&lt;/h3&gt;

&lt;p&gt;In its simplest form, the verification package hosting service hosts packages in a publicly accessible manner, such as a public S3 bucket or a public static site where users can list and download verification resources.&lt;/p&gt;

&lt;p&gt;More advanced forms could provide a "find-by-clip" feature via advanced indexing, as well as services to privately store higher-res original video without surfacing that original content on the public internet.&lt;/p&gt;

&lt;p&gt;All verification assets would be digitally signed at time of creation and/or at time of upload, providing proof of the video verification package's creation time, geolocation, and contents.&lt;/p&gt;

&lt;p&gt;Importantly, while fingerprint information may be freely hosted on public servers, the high-fidelity content itself is not publicly shared. This allows the public to freely verify authenticity and detect manipulation, without enabling anyone to steal the creator's content.&lt;/p&gt;
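&lt;p&gt;The publish-the-fingerprint-not-the-content idea can be sketched in a few lines of Python. (A toy illustration only: a cryptographic hash like this one matches only bit-exact copies, whereas ReelTrust's actual fingerprints are perceptual and survive re-encoding.)&lt;/p&gt;

```python
import hashlib

def make_fingerprint(content: bytes) -> str:
    """Create a compact digest that is safe to publish publicly."""
    return hashlib.sha256(content).hexdigest()

def verify(candidate: bytes, published_fingerprint: str) -> bool:
    """Anyone can check a candidate copy against the public digest,
    without ever needing access to the original content."""
    return make_fingerprint(candidate) == published_fingerprint

original = b"...high-fidelity video bytes (never published)..."
fingerprint = make_fingerprint(original)  # only this string is published

assert verify(original, fingerprint)               # authentic copy passes
assert not verify(b"tampered bytes", fingerprint)  # altered copy fails
```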

&lt;h3&gt;
  
  
  3. Video Distributors (YouTube, Facebook, TikTok)
&lt;/h3&gt;

&lt;p&gt;Video distributors like YouTube would read their content creators' metadata during the video upload process, compare it against the attached ReelTrust package URIs, and accept or reject the upload based on the verification results. Alternatively, rather than rejecting outright, videos could be published regardless of verification status, with the verification results displayed inline alongside the videos being consumed.&lt;/p&gt;

&lt;p&gt;When no ReelTrust verification information is available, the experience for users may be exactly the same as today - except for a blanket caveat that &lt;em&gt;zero&lt;/em&gt; verification has been performed on the content's authenticity. For positively confirmed verifications, the user would see something akin to a green checkmark or lock symbol (✅🔒) - with an option to view additional metadata on who created the original video, where and when it was authored, and how exactly the video's contents were confirmed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does a future look like with ReelTrust?
&lt;/h2&gt;

&lt;p&gt;The future with a system like ReelTrust would entail every important national event having signed, verified ReelTrust media certificates. Elections, interviews, news footage - it would all be certified and signed. Authentic footage of these events would be marked clearly as "✅🔒 ReelTrust Certified Video". Dubious or unsigned footage would be marked as such, and even edited footage that was still authentic would be clearly marked as edited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All of this would be built on lessons learned from the same distributed trust foundations that we already rely on daily for safe online shopping and for keeping malware out of our email.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What does the future look like if we do nothing?
&lt;/h2&gt;

&lt;p&gt;If we do nothing, things continue to get much worse. Fake video will be presented as real, while authentic videos will be slandered as AI fakes. If we do nothing, we lose our ability as a society to believe our eyes. It's already happening, so we need to act fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is included in the Proof of Concept (POC)
&lt;/h2&gt;

&lt;p&gt;I'm hosting on GitHub at &lt;a href="https://github.com/aaronsteers/ReelTrust" rel="noopener noreferrer"&gt;https://github.com/aaronsteers/ReelTrust&lt;/a&gt; under the MIT license.&lt;/p&gt;

&lt;p&gt;What it can do today is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create verification packages that combine a low-res video digest along with a variety of video and audio fingerprints.&lt;/li&gt;
&lt;li&gt;Compare a given video against the verification package, printing a verification report and "PASS" or "FAIL" status of verification.&lt;/li&gt;
&lt;li&gt;Work with re-encoded or compressed videos as well as original-fidelity files.&lt;/li&gt;
&lt;li&gt;Create 5-second side-by-side comparison clips where deviations are detected - comparing the evaluated video against the recorded low-res digest.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Tackling video authenticity as a data engineering problem
&lt;/h2&gt;

&lt;p&gt;If you've worked with me, you know that video analysis is not my personal area of expertise - far from it! But data engineering &lt;strong&gt;is&lt;/strong&gt; my expertise - so my approach was simply to treat video authenticity as a data engineering problem.&lt;/p&gt;

&lt;p&gt;What's being built:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Composable pipelines to capture video/audio fingerprints.&lt;/li&gt;
&lt;li&gt;Frame-level metadata slices (SSIM, perceptual hashes, stats).&lt;/li&gt;
&lt;li&gt;CLI tools for verification, clip matching, and offset detection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No AI/ML required (yet). Just reliable data structures, smart comparisons, and a heavy reliance on hashing, fingerprints, and clever vector embeddings.&lt;/p&gt;
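&lt;p&gt;As a rough illustration of the fingerprint-and-compare idea (a toy average-hash over tiny grayscale frames - not ReelTrust's actual algorithm):&lt;/p&gt;

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Toy perceptual hash: each bit records whether a pixel is brighter
    than the frame's mean, so similar frames yield similar bit patterns."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(a: int, b: int) -> int:
    """Count the differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

frame      = [[10, 200], [30, 220]]   # original 2x2 grayscale frame
compressed = [[12, 198], [29, 221]]   # slight noise from re-encoding
tampered   = [[200, 10], [220, 30]]   # content actually changed

assert hamming_distance(average_hash(frame), average_hash(compressed)) == 0
assert hamming_distance(average_hash(frame), average_hash(tampered)) > 0
```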

&lt;h2&gt;
  
  
  Want to get involved?
&lt;/h2&gt;

&lt;p&gt;My primary motivation in creating this project was to show (to myself, my friends, and the industry) that this is indeed possible, and that it is worth more investment.&lt;/p&gt;

&lt;p&gt;Everything in the project is shared freely as open source. Feel free to fork the repo and improve it. Feel free to drop me a line if you want to talk. This was entirely created during my personal time, and I do want to see this become a reality.&lt;/p&gt;

&lt;p&gt;Chime in below - do you think this project is overly ambitious, or too boring/obvious? Any fatal flaws in the approach? Is anyone you know of already building something similar?&lt;/p&gt;

&lt;p&gt;Like or share so this can reach more eyeballs - and hopefully the industry partners that need to act will eventually do so.&lt;/p&gt;

&lt;p&gt;Thanks!&lt;br&gt;
-Aaron (AJ) Steers&lt;/p&gt;

&lt;p&gt;Repo link here: &lt;a href="https://github.com/aaronsteers/ReelTrust" rel="noopener noreferrer"&gt;https://github.com/aaronsteers/ReelTrust&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>dataengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The Modern Ways to (not) Install Python and Virtual Environments</title>
      <dc:creator>Aaron Steers</dc:creator>
      <pubDate>Wed, 01 Oct 2025 20:29:21 +0000</pubDate>
      <link>https://dev.to/aaronsteers/the-modern-ways-to-not-install-python-1e85</link>
      <guid>https://dev.to/aaronsteers/the-modern-ways-to-not-install-python-1e85</guid>
      <description>&lt;p&gt;&lt;em&gt;TL;DR: You can use &lt;code&gt;uv&lt;/code&gt; for everything now. Just &lt;code&gt;brew install uv&lt;/code&gt; and then you can forget about &lt;code&gt;brew&lt;/code&gt;, &lt;code&gt;pyenv&lt;/code&gt;, &lt;code&gt;pipx&lt;/code&gt;, &lt;code&gt;pip&lt;/code&gt;, and all the others.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where we are today
&lt;/h2&gt;

&lt;p&gt;The Python ecosystem has changed a lot over the past few years. What used to be best practices are quickly becoming anti-patterns. The biggest challenge I see with folks using Python today is that they are stuck in old patterns that cost them both time and pain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Outlawed Patterns List
&lt;/h2&gt;

&lt;p&gt;If you are using any of these patterns, please stop immediately.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Everything related to &lt;code&gt;pyenv&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Calling &lt;code&gt;pip&lt;/code&gt; directly, for any reason.&lt;/li&gt;
&lt;li&gt;Calling &lt;code&gt;python&lt;/code&gt; directly, and/or expecting &lt;code&gt;python&lt;/code&gt; on &lt;code&gt;PATH&lt;/code&gt; to be a specific version.&lt;/li&gt;
&lt;li&gt;Anything related to &lt;code&gt;venv&lt;/code&gt;, &lt;code&gt;virtualenv&lt;/code&gt;, or &lt;code&gt;activate&lt;/code&gt;, or manually creating or activating virtual environments in any way.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Modern Way to Install Python
&lt;/h2&gt;

&lt;p&gt;Note: You probably don't actually &lt;em&gt;need&lt;/em&gt; to install Python before you start. In less than 2 seconds, the &lt;code&gt;uv&lt;/code&gt; tool (as you'll see below) can install Python versions on-demand for apps, projects, and scripts. So you can &lt;em&gt;probably&lt;/em&gt; skip this step entirely. But if not:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install uv or update it if it's already installed:
brew install uv

# Show the versions of Python available and/or installed:
uv python list

# Install a couple Python versions
uv python install 3.13
uv python install 3.12

# Set your global python version (`python` on `PATH`)
uv python install 3.13 --default --preview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Modern Way to Manage Virtual Environments
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;You should never need to manually create a virtual environment again.&lt;/strong&gt; The manner in which you create your virtual environment will depend on whether you are running an app, writing code for a project, or writing a script. But for all of the below patterns, the virtual environment is created automatically, with no way to mess up, and no way to forget to activate or deactivate an environment.&lt;/p&gt;

&lt;p&gt;Identify your use case from one of these three patterns, and apply the suggested pattern:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Installing a Python App
&lt;/h3&gt;

&lt;p&gt;Simply &lt;code&gt;uv tool install my-tool&lt;/code&gt; and then you'll have &lt;code&gt;my-tool&lt;/code&gt; on your PATH.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Running a Python App
&lt;/h3&gt;

&lt;p&gt;Skip the install step and use &lt;code&gt;uvx&lt;/code&gt;: &lt;code&gt;uvx my-tool --version&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This works from anywhere you have &lt;code&gt;uv&lt;/code&gt; installed.&lt;/p&gt;

&lt;p&gt;You can optionally set the Python version explicitly, and &lt;code&gt;uv&lt;/code&gt; will bootstrap that version of Python lightning-fast if it isn't already installed:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;uvx --python=3.12 my-tool --version&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can also pin to a specific version of your tool or force the "latest" on each run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uvx my-tool@latest --version
uvx my-tool@3.2.1 --version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Write Python Code for a Project
&lt;/h3&gt;

&lt;p&gt;Simply use &lt;code&gt;uv init&lt;/code&gt; to create a new project directory, &lt;code&gt;uv add&lt;/code&gt; to add dependencies, and &lt;code&gt;uv run&lt;/code&gt; to run your code. Unlike its predecessor, Poetry, you don't even have to explicitly install anything, since &lt;code&gt;uv run&lt;/code&gt; implies &lt;code&gt;uv sync&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a `repos` directory if you don't have one yet:
mkdir ~/repos
cd ~/repos

# Create the `uv` project scaffold:
uv init my-new-project
cd my-new-project

# Optionally set the Python version you want for this project:
uv python pin 3.13

# You can immediately run the sample without any extra steps
uv run main.py

# You can add dependencies (replaces `pip install`):
uv add my-other-dependency-a my-other-dependency-b

# Dependencies are auto-installed whenever you call `uv run`:
uv run main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Write a Portable Python Script
&lt;/h4&gt;

&lt;p&gt;Sometimes you don't want a full project, or you need to have many scripts each with their own dependencies. This is easy to do with &lt;code&gt;uv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Take this example &lt;code&gt;my_helper_script.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env -S uv run --script
#
# Inline dependency declaration (PEP 723):
# /// script
# requires-python = "&amp;gt;=3.10"
# dependencies = [
#   "my-dependency-a",
#   "my-dependency-b",
# ]
# ///
#
# Usage with uv CLI:
#    uv run --script ./my_helper_script.py
#
# With the custom shebang, you can also invoke directly:
#    ./my_helper_script.py
#
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;my_dep_a&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;my_dep_b&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hi, there!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have a script that can run anywhere, on any version of Python (no Python install necessary), with your dependencies automatically bootstrapped and fully isolated from other Python environments and potential version conflicts. 🎉&lt;/p&gt;

&lt;h2&gt;
  
  
  What did I miss?
&lt;/h2&gt;

&lt;p&gt;I'll update the article with any tips or tricks that others share in comments. Comment down below if I missed anything.&lt;/p&gt;

</description>
      <category>python</category>
      <category>uv</category>
    </item>
    <item>
      <title>Escaping Your Java Habits in Python: Writing Clean, Pythonic Code</title>
      <dc:creator>Aaron Steers</dc:creator>
      <pubDate>Fri, 29 Aug 2025 03:22:20 +0000</pubDate>
      <link>https://dev.to/aaronsteers/escaping-your-java-habits-in-python-writing-clean-pythonic-code-2en</link>
      <guid>https://dev.to/aaronsteers/escaping-your-java-habits-in-python-writing-clean-pythonic-code-2en</guid>
      <description>&lt;p&gt;As engineers, many of us migrate between languages. &lt;br&gt;
Fun fact: 20 years ago - what was the first language I was ever certified in? &lt;strong&gt;Java.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;But now, a dozen languages later, I want to pull my hair out when I feel like I'm reading Java code inside a file that ends in a "&lt;code&gt;.py&lt;/code&gt;" extension.&lt;/p&gt;

&lt;p&gt;If you’ve spent significant time in Java, it’s natural to bring those habits along when coding in Python. Unfortunately, some of those habits can lead to over-engineered or awkward code that doesn’t feel at all &lt;a href="https://peps.python.org/pep-0020/" rel="noopener noreferrer"&gt;&lt;strong&gt;Pythonic&lt;/strong&gt;&lt;/a&gt;. Worse: your habits may be contributing to bugs and, worse yet, slowing down code review.&lt;/p&gt;

&lt;p&gt;Not to bring shame, but to improve everyone's lives: I think these patterns are worth calling out — especially for developers making the jump from Java to Python.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Overcompensating for Dependency Injection (DI)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The Java mindset
&lt;/h3&gt;

&lt;p&gt;Java is not particularly strong at dependency injection (DI). Without additional frameworks (e.g. Spring, Dagger, Guice), it's actually quite difficult. Unbeknownst to many, DI in Python is trivial. To manage dependencies, Java developers writing Python often build &lt;em&gt;layers&lt;/em&gt; of abstractions, factories, and injection systems where none are truly needed.&lt;/p&gt;
&lt;h3&gt;
  
  
  How it leaks into Python
&lt;/h3&gt;

&lt;p&gt;When we carry this mindset into Python, we often over-engineer DI using generics, abstract base classes, and unnecessary indirection. While Python &lt;em&gt;can&lt;/em&gt; do this, it usually isn't needed - and our future selves will thank us if we keep things simple.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Pythonic alternative
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prefer passing simple functions, Callables, and Unions of types.&lt;/li&gt;
&lt;li&gt;Embrace Python’s duck typing: if an object behaves the way you need, you don’t need to enforce a generic type hierarchy.&lt;/li&gt;
&lt;li&gt;Use default arguments or keyword arguments for flexibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Java-style mindset in Python
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataFetcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Generic&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;NotImplementedError&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HttpDataFetcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DataFetcher&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;fetcher&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFetcher&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HttpDataFetcher&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fetcher&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Pythonic mindset
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;fetch_data&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second version is shorter, clearer, and easier to maintain.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The “Everything Must Be a Class” Habit
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Java mindset
&lt;/h3&gt;

&lt;p&gt;In Java, basically everything lives inside a class. Utility methods go into static classes. Even trivial helpers often get wrapped into objects because free functions aren’t idiomatic.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it leaks into Python
&lt;/h3&gt;

&lt;p&gt;When carried into Python, we end up with tiny, boilerplate-heavy classes that don’t add real value. For example, you might see a &lt;code&gt;StringUtils&lt;/code&gt; or &lt;code&gt;ConnectionBuilder&lt;/code&gt; class in Python, which is entirely unnecessary and adds needless friction for your callers.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Pythonic alternative
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Write standalone helper functions if a class is not truly needed.&lt;/li&gt;
&lt;li&gt;Use modules as namespaces (a Python file &lt;em&gt;is already&lt;/em&gt; a container).&lt;/li&gt;
&lt;li&gt;Only create classes when state or behavior needs to be encapsulated, or when instantiating the object adds something meaningful to your workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Java-style mindset in Python
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MathUtils&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MathUtils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Pythonic mindset
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, the second version is shorter, more natural, and easier to maintain.&lt;/p&gt;

&lt;p&gt;Best yet: the person reading or reviewing the code &lt;em&gt;knows&lt;/em&gt; (via static code analysis) that calling the function is correct - no prior knowledge of how to work with a class is needed in order to review the code. This is an area where Java classically falls short, both for code review and for AI agent applications: the amount of context needed to review or create code is much higher when you must fully understand a class's structure. &lt;/p&gt;

&lt;p&gt;Compare this with helper functions: there's no "wrong way" to call a helper function; you only need to read the docstring and function signature (i.e., use tooltips and IntelliSense) to confirm whether the function call is correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Overusing Interfaces and Abstract Base Classes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Java mindset
&lt;/h3&gt;

&lt;p&gt;Every service has an interface, every implementation must be bound to it. This enforces structure, but at the cost of boilerplate.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it leaks into Python
&lt;/h3&gt;

&lt;p&gt;Developers sometimes mimic this pattern with abstract base classes and heavy use of &lt;code&gt;Generic&lt;/code&gt; types, even for simple use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Pythonic alternative
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use duck typing: if an object supports the methods you need, that’s enough.&lt;/li&gt;
&lt;li&gt;When structure matters, consider &lt;code&gt;typing.Protocol&lt;/code&gt; for lightweight contracts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Java-style mindset in Python
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PrintService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pythonic mindset
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PrintService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PrintService&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second version uses less code and is easier to maintain.&lt;/p&gt;

&lt;p&gt;🤫 &lt;em&gt;Psst!&lt;/em&gt; Don't worry: the type checker will always tell you if you call a method that doesn't exist on the class! Adding type checks to your CI means this is &lt;em&gt;always&lt;/em&gt; safe, with often 50% less code and a much more readable and maintainable implementation.&lt;/p&gt;
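&lt;p&gt;&lt;em&gt;And when structure genuinely matters, here's a minimal sketch of the &lt;code&gt;typing.Protocol&lt;/code&gt; option mentioned above. (The &lt;code&gt;Runnable&lt;/code&gt; and &lt;code&gt;launch&lt;/code&gt; names are illustrative, not from any particular library.)&lt;/em&gt;&lt;/p&gt;

```python
# Sketch: a lightweight structural contract via typing.Protocol.
# Unlike an ABC, implementations never have to inherit from it.
from typing import Protocol, runtime_checkable


@runtime_checkable
class Runnable(Protocol):
    def run(self) -> None: ...


class PrintService:  # no base class required
    def run(self) -> None:
        print("running")


def launch(service: Runnable) -> None:
    # A type checker verifies structurally that `service` has .run()
    service.run()


launch(PrintService())                       # prints "running"
print(isinstance(PrintService(), Runnable))  # True (structural runtime check)
```

&lt;p&gt;Type checkers like mypy or pyright will flag any argument to &lt;code&gt;launch()&lt;/code&gt; that lacks a compatible &lt;code&gt;run()&lt;/code&gt; method - still with zero inheritance required.&lt;/p&gt;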




&lt;h2&gt;
  
  
  4. Verbose Builders Instead of Simple Keyword Arguments
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Java mindset
&lt;/h3&gt;

&lt;p&gt;Builders are everywhere for object construction with optional arguments.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it leaks into Python
&lt;/h3&gt;

&lt;p&gt;Developers sometimes reimplement builder-style classes just to avoid long &lt;code&gt;__init__&lt;/code&gt; signatures.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Pythonic alternative
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use keyword arguments with defaults.&lt;/li&gt;
&lt;li&gt;For structured objects, use &lt;code&gt;dataclasses&lt;/code&gt; or &lt;code&gt;Pydantic&lt;/code&gt; models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Java-style builder in Python
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserBuilder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_age&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_age&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;UserBuilder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;set_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;set_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Pythonic with dataclasses
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, the Pythonic version requires 75% less code, while being more readable, more maintainable, and much less likely to have unexpected bugs creeping in over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Not Using Keyword Arguments When You Should
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Java mindset
&lt;/h3&gt;

&lt;p&gt;Function calls are almost always positional (Java doesn't support keyword args), and IDEs enforce correctness through signatures and tooling - while function and method args inevitably devolve into long lists of fragile, hard-to-verify positional inputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it leaks into Python
&lt;/h3&gt;

&lt;p&gt;Ex-Java developers often continue to use positional arguments in Python - even for long, many-parameter functions. This makes code fragile and hard to review: without knowing the implementation signature by heart, a reviewer cannot confirm that a call is correct.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Pythonic alternative
&lt;/h3&gt;

&lt;p&gt;Use keyword arguments for clarity and maintainability. They make function calls self-documenting and trivial to verify.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fragile and hard to review
create_source(name, config, catalog, state)

# Clear and verifiable
create_source(
    source_name=name,
    config=config,
    catalog=catalog,
    state=state,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The latter is &lt;em&gt;obviously&lt;/em&gt; correct and clear to its reader, whereas the former cannot be verified from the code alone. The Pythonic version takes under two seconds to reach full confidence (if it passes lint checks, it's correct), whereas the Java-esque version offers almost &lt;em&gt;zero&lt;/em&gt; ability to reach high confidence at all. You are fully leaning on tests and type checks - and if &lt;em&gt;any&lt;/em&gt; of the argument types match each other, you are leaning entirely on tests.&lt;/p&gt;

&lt;p&gt;🧠 Ironically, even though the Pythonic version has more characters, your eye passes over it &lt;em&gt;faster&lt;/em&gt;, subconsciously confirming the call in a second or two, whereas your brain freezes (or just gives up) on the first version. Without docs listing the argument order, your ability to review that admittedly shorter snippet is close to zero - unless you &lt;em&gt;really&lt;/em&gt; &lt;strong&gt;really&lt;/strong&gt; suspect something is wrong with it.&lt;/p&gt;

&lt;p&gt;What's worse: our code changes and evolves over time. Named args ensure your code breaks loudly and cleanly when something does go wrong. Relying on positional args is like setting a time bomb that you know will &lt;em&gt;eventually&lt;/em&gt; go off - you just don't know when. 💣&lt;/p&gt;
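&lt;p&gt;&lt;em&gt;You can even have Python enforce this style for you: a bare &lt;code&gt;*&lt;/code&gt; in a signature makes every parameter after it keyword-only. A sketch, using an illustrative &lt;code&gt;create_source&lt;/code&gt; signature:&lt;/em&gt;&lt;/p&gt;

```python
# Sketch: a bare * makes all following parameters keyword-only,
# so callers cannot pass them positionally. (Illustrative signature.)
def create_source(*, source_name: str, config: dict, catalog: dict, state: dict) -> dict:
    return {"name": source_name, "config": config, "catalog": catalog, "state": state}


source = create_source(
    source_name="my-source",
    config={"api_key": "..."},
    catalog={},
    state={},
)

try:
    create_source("my-source", {"api_key": "..."}, {}, {})  # positional call
except TypeError as exc:
    # Positional calls fail loudly at call time, not silently at runtime
    print(f"rejected: {exc}")
```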




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;p&gt;If you’re coming from Java:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Don’t over-engineer dependency injection—Python’s simplicity usually covers most use cases.&lt;/li&gt;
&lt;li&gt;Don’t create classes for everything—standalone functions are fine (and often preferred).&lt;/li&gt;
&lt;li&gt;Skip unnecessary interfaces—use duck typing or protocols only if truly needed.&lt;/li&gt;
&lt;li&gt;Use dataclasses and simple constructors instead of builders.&lt;/li&gt;
&lt;li&gt;Prefer keyword arguments in function calls for clarity and reviewability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The beauty of Python is its flexibility and minimalism. Lean into that. By shedding some habits from Java, your code will not only &lt;em&gt;feel&lt;/em&gt; more Pythonic but will also be easier to read, maintain, and extend.&lt;/p&gt;

&lt;p&gt;It's not about which language is better - it's about making your code readable, maintainable, and intuitive to leverage and support. In the age of AI, these can mean the difference between keeping up, or falling behind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters for the future of AI
&lt;/h2&gt;

&lt;p&gt;In the future, AI will write much more of the code that we all write and maintain today - but it has the same limitations we have: it relies on limited context windows and on context clues to correctly read and interpret code. If you want to future-proof your code in 2025 and 2026: write code that even a mindless robot can review and maintain for you.&lt;/p&gt;

&lt;p&gt;In other words, keep it &lt;a href="https://peps.python.org/pep-0020" rel="noopener noreferrer"&gt;Pythonic&lt;/a&gt; - regardless of whether you are writing Python, Java, or Kotlin. 😅&lt;/p&gt;

</description>
      <category>python</category>
      <category>java</category>
      <category>cleancode</category>
      <category>programming</category>
    </item>
    <item>
      <title>"Dismiss Review": The Underrated GitHub Feature You're *Probably* Not Using</title>
      <dc:creator>Aaron Steers</dc:creator>
      <pubDate>Sun, 13 Apr 2025 00:31:27 +0000</pubDate>
      <link>https://dev.to/aaronsteers/the-most-underrated-github-feature-youre-probably-not-using-dismiss-review-1h0e</link>
      <guid>https://dev.to/aaronsteers/the-most-underrated-github-feature-youre-probably-not-using-dismiss-review-1h0e</guid>
      <description>&lt;p&gt;If you’ve ever hesitated to hit “Request Changes” on a GitHub pull request (PR), you’re not alone.&lt;/p&gt;

&lt;p&gt;In so many teams I've worked with, there's a quiet tension around that big red "❌". It feels heavy, negative, maybe even a hit to the author's morale - like you've just pulled the emergency brake on someone else's momentum.&lt;/p&gt;

&lt;p&gt;But what if I told you that GitHub has a built-in feature that &lt;em&gt;solves&lt;/em&gt; this problem — and almost no one uses it?&lt;/p&gt;

&lt;p&gt;Let me introduce you to: &lt;strong&gt;Dismiss Review&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Request Changes" Feels Like a Last Resort
&lt;/h2&gt;

&lt;p&gt;Here’s what often happens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You're reviewing a PR and notice a few things worth changing. They're important and they do need to be addressed, but they are small, easy fixes. And as a reviewer, you may also be aware that you could be wrong about your requested change - it may already be addressed elsewhere or you could be misunderstanding the scenario.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You &lt;em&gt;could&lt;/em&gt; leave a comment saying “Please fix this” without the formal red “Request Changes” option - but posting as a comment communicates that the feedback is non-blocking, meaning you'd be okay with no change at all.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the other hand, hitting “Request Changes” puts a red "❌" on the PR—and that feels like you're putting up a roadblock. There's a very significant risk that you will be busy later and unable to come back to re-review in a timely fashion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As a reviewer, you worry: Will this discourage the author? Will it delay merging until you can come back and review it &lt;em&gt;again&lt;/em&gt;?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So what do most people do?  &lt;/p&gt;

&lt;p&gt;They avoid using “Request Changes” at all unless absolutely necessary - leaving feedback as a "Comment" even when it contains what they believe are &lt;em&gt;important&lt;/em&gt;, necessary changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  But Here’s the Thing…
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You can dismiss a review.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Let me say that again: &lt;strong&gt;PR authors with "write" access can dismiss a review once the feedback has been addressed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is built-in GitHub functionality: simply click the &lt;code&gt;...&lt;/code&gt; menu next to a review and hit “Dismiss Review”: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnly7289vo4oc139lu7vp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnly7289vo4oc139lu7vp.png" alt="Screenshot: dismissing review" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After dismissing, the original feedback stays visible for transparency, but it’s no longer blocking the merge.&lt;/p&gt;

&lt;p&gt;It’s like saying:  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Hey, this was useful feedback and it needed to be addressed, but now that it's resolved, we’re good to go—even if the reviewer doesn't come back to confirm.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In fact, there's even a place for you as the PR author to explain why you are dismissing. And ALL of these are &lt;strong&gt;perfect&lt;/strong&gt; reasons to dismiss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Changes applied, per feedback. Dismissing."&lt;/li&gt;
&lt;li&gt;"Changes applied and original reviewer is on PTO. Dismissing."&lt;/li&gt;
&lt;li&gt;"Per my reply, the concern raised is addressed elsewhere in the code. Dismissing."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh296cj2u7j82yolzpz4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh296cj2u7j82yolzpz4y.png" alt="Screenshot: providing reason to dismiss" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  So Why Aren’t More People Doing This?
&lt;/h2&gt;

&lt;p&gt;The short and sad answer seems to be: hardly anyone knows this feature exists.&lt;br&gt;&lt;br&gt;
And without knowing that, reviewers are hesitant to be honest. Authors are hesitant to move forward.&lt;/p&gt;

&lt;p&gt;That’s bad for everyone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's Normalize a Better Workflow
&lt;/h3&gt;

&lt;p&gt;Here’s what I suggest:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;As reviewers, feel empowered to “Request Changes”&lt;/strong&gt; when something actually needs to change—even small things.&lt;br&gt;&lt;br&gt;
It's clearer and more actionable for the author.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;As authors, feel empowered to address the feedback and dismiss the review&lt;/strong&gt; (with a reason) once you’ve done the work.&lt;br&gt;&lt;br&gt;
Especially if the reviewer is senior and busy—it’s not disrespectful, it’s responsible. You can &lt;strong&gt;also&lt;/strong&gt; re-request their review, but that subsequent re-review won't block your merge if others have approved.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;As reviewers, communicate your intent.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I often say: &lt;em&gt;“Feel free to dismiss this review once these are addressed.”&lt;/em&gt;&lt;br&gt;&lt;br&gt;
This lets the author know I don’t need to come back and re-review—it’s not about red tape, it’s about getting things right.&lt;br&gt;
Conversely, if I say "let's talk about this" or "I'm a bit worried about this", I would hope it's a clear signal that allowing time for a re-review is probably expected on my part, rather than dismissing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;As teams, normalize trust and ownership.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The GitHub workflow is a tool. At the end of the day, it comes down to people acting responsibly, clearly, and with mutual trust.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;“Request Changes” is valuable - but often avoided.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Dismiss Review” is the missing puzzle piece that unlocks a healthier review culture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can give and receive better feedback &lt;em&gt;without&lt;/em&gt; slowing down merges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Let’s use the tools GitHub gives us—and trust each other to use them well.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  WDYT?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Are you already using the &lt;strong&gt;Dismiss Review&lt;/strong&gt; feature?&lt;/li&gt;
&lt;li&gt;Are you worried about abuse/misuse?&lt;/li&gt;
&lt;li&gt;Have you run into this conundrum yourself? &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Lmk in the comments. And thanks for reading!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>github</category>
      <category>programming</category>
    </item>
    <item>
      <title>Separation of Storage and Compute, Part Deux</title>
      <dc:creator>Aaron Steers</dc:creator>
      <pubDate>Sun, 23 Feb 2025 23:21:12 +0000</pubDate>
      <link>https://dev.to/aaronsteers/separation-of-storage-and-compute-part-deux-15g6</link>
      <guid>https://dev.to/aaronsteers/separation-of-storage-and-compute-part-deux-15g6</guid>
      <description>&lt;p&gt;&lt;em&gt;Disclaimer: These opinions are mine and mine alone, not a reflection on my employer.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A bit of history
&lt;/h2&gt;

&lt;p&gt;The first "Separation of Storage and Compute" revolution started in 2010s, as all database platforms learned to scale up and down their compute independently from the underlying storage.&lt;/p&gt;

&lt;p&gt;The first to market with separation of storage and compute was Spark in 2009, followed by SparkSQL in 2014 - building on the Hadoop platform and promising extreme portability, with the ability to run SQL against virtually any storage medium. While SparkSQL was open source and showed the world that decoupling storage and compute was possible, Spark failed to deliver on ease of use and ubiquity.&lt;/p&gt;

&lt;p&gt;Snowflake launched in 2014, around the same time as SparkSQL, and delivered a better, more user-friendly experience - with built-in separation of storage and compute. Snowflake is internally architected as a managed data lake that "feels" like a traditional database, with auto spin-down of compute enabled &lt;em&gt;by default&lt;/em&gt;. This means users pay just for the compute they need, with dirt-cheap storage backed by S3. But unlike Spark, Snowflake is not open source and can only be run via a paid Snowflake account. This soon became a non-negligible cost-optimization problem for its customers, and today Snowflake compute costs represent &lt;em&gt;the&lt;/em&gt; primary limiting factor to Snowflake data warehouse scalability.&lt;/p&gt;

&lt;p&gt;SQL Server, Redshift, and others similarly followed along in the "separation of storage and compute" revolution. The benefit of this first generation of compute and storage isolation was that you as a user could scale up &lt;em&gt;and&lt;/em&gt; down according to your query needs, with a &lt;strong&gt;"near-zero always-on cost"&lt;/strong&gt;. If you need a lot of compute, you can spin it up on demand and run it only as long as you need it. For workloads that scale linearly, you can run 100 CPUs for 1 minute instead of 1 CPU for 100 minutes - 100x the speed for the same cost! And because there's no hard limit to how many compute workers you can spin up in the cloud, contention between queries largely disappears and it becomes practically impossible for the warehouse to be overloaded or "too small" for your scaling requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second revolution is here
&lt;/h2&gt;

&lt;p&gt;Today we are seeing a second revolution of separation of storage and compute, which is separation of the &lt;em&gt;vendor&lt;/em&gt; you use for compute and the &lt;em&gt;vendor&lt;/em&gt; you use for storage. Now, you can connect your BI tool to your lake &lt;em&gt;without&lt;/em&gt; spinning up a Snowflake cluster. You can migrate from Databricks to Snowflake and back without paperwork and without any vendor lock-in.&lt;/p&gt;

&lt;p&gt;As an analogy, consider Apple Music with its DRM versus DRM-free MP3s that you own free and clear. Every music app can read your MP3s, but &lt;em&gt;no other music app&lt;/em&gt; can play your iTunes purchases. This limits your freedom and makes you think twice about where and how to buy your music. Similarly for data: vendor lock-in is a serious challenge for data engineers, and the lack of portability between architectures can easily become a multi-million-dollar migration initiative for companies that want to move between platforms.&lt;/p&gt;

&lt;p&gt;This is not a knock on Apple or commercial Data Warehouse vendors - it's just an analogy for the revolution we are seeing today with the widespread acceptance and adoption of Iceberg. Now your lake lives in the cloud, readable and writable by &lt;em&gt;every&lt;/em&gt; tool in your toolbox. Meaning: you can store data in Iceberg with the peace of mind that &lt;em&gt;every&lt;/em&gt; tool you buy off the shelf will be able to read and write to it freely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the impact
&lt;/h2&gt;

&lt;p&gt;To understand the impact of this new paradigm, just consider your situation: where you currently have vendor lock-in, and where you are paying vendors just for "SELECT *" access to your data...&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is your org fully dependent on MS SQL and SQL Azure? No problem - it can read from Iceberg.&lt;/li&gt;
&lt;li&gt;Are you a hard-core Spark enthusiast? Again, no problem - you can run Spark on Iceberg, reading and writing with whichever commodity hardware or commercial service you prefer in the moment.&lt;/li&gt;
&lt;li&gt;Does your BI tool want to query data every day at 5am, but you don't want to pay Snowflake just for being an intermediate "SELECT *" compute passthrough? No problem. Just bypass Snowflake entirely and have your BI tool read directly from Iceberg.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What it means for database service providers
&lt;/h2&gt;

&lt;p&gt;In short, for database service providers like Snowflake, MS SQL, Redshift, and Spark - they have to make the case that they are the compute you &lt;em&gt;want&lt;/em&gt; to use. Emphasis on great UI/UX, great performance, and great features will make all the difference. And even still, they can no longer rely on being the &lt;em&gt;only&lt;/em&gt; or even the &lt;em&gt;primary&lt;/em&gt; query interface for their existing users. They should expect that their users will increasingly mix-and-match, and that their users will (smartly) bypass them in simple "SELECT *" use cases where there's no reason to pay for the compute spin-up. Wherever write operations are expensive or lacking features, they should expect users to mix-and-match write-providers as well, leaning on services - or self-managed commodity compute - that can write data cheaper or more effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it means for application service providers
&lt;/h2&gt;

&lt;p&gt;While the rest of this article might seem like a "race to the bottom" in terms of pricing, there's another huge and positive impact to be had in this second revolution: now every application, web service, service provider, and startup can provide a &lt;em&gt;direct&lt;/em&gt;, &lt;em&gt;fast&lt;/em&gt;, &lt;em&gt;cheap&lt;/em&gt;, and &lt;em&gt;best-in-class&lt;/em&gt; data architecture for their users. Rather than leaning entirely on REST APIs, which are slow, cumbersome, and expensive to build and maintain, they can offer their users their own personal data lake: free to query however the user likes, "zero-copy interoperable" with every major DB platform, and easily scalable &lt;em&gt;down&lt;/em&gt; to zero and &lt;em&gt;up&lt;/em&gt; to near-infinite concurrency.&lt;/p&gt;

&lt;p&gt;This last part is what gets me personally very excited about Iceberg and other data lake storage providers that transcend vendor lock-in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What do you think?
&lt;/h2&gt;

&lt;p&gt;Do you share my enthusiasm, or do you think this is just more complexity in an already complex space? Are you worried about the race to the bottom, or are you excited (like me) that we'll soon all be free from vendor lock-in?&lt;/p&gt;

</description>
      <category>iceberg</category>
    </item>
    <item>
      <title>Quantum Computing and LLMs: Match Made in Heaven?</title>
      <dc:creator>Aaron Steers</dc:creator>
      <pubDate>Sat, 01 Feb 2025 04:07:55 +0000</pubDate>
      <link>https://dev.to/aaronsteers/quantum-computing-and-llms-match-made-in-heaven-35fp</link>
      <guid>https://dev.to/aaronsteers/quantum-computing-and-llms-match-made-in-heaven-35fp</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Quantum computers are amazing and fast but don't work for traditional computing. LLMs and amazing and slow, reliant on vector analytics, GPUs, and crazy amounts of compute power. Were these two destined for passion? were they forever destined to join forces and change everything, FOREVER??&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Large Language Models (LLMs) have revolutionized the way we process and generate human-like text. They power chatbots, coding assistants, search engines, and countless other applications. At the core of LLMs are large-scale vector computations: they leverage embeddings and transform multiple dimensions of data into vector spaces. This vector-centric foundation means that LLMs inherently function with some degree of approximation—sampling tokens, generating probabilities, and learning from fuzzy data distributions.&lt;/p&gt;

&lt;p&gt;On the other hand, &lt;em&gt;quantum computing&lt;/em&gt; has been the buzzword of the last few years when it comes to next-generation computing paradigms. Quantum machines promise exponential speedups on certain classes of problems, often those that require intense parallelism or optimization. They also naturally embrace phenomena like superposition and entanglement, which encode states in ways that classical bits simply cannot.&lt;/p&gt;

&lt;p&gt;But how do these two buzzworthy worlds intersect? The question arises: &lt;strong&gt;Are quantum computers the perfect implementation for LLMs and vectors?&lt;/strong&gt; Let’s dig in.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Vectors: The Lifeblood of LLMs
&lt;/h2&gt;

&lt;p&gt;LLMs like GPT, BERT, and others revolve around vector operations at their core. Here’s why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Embeddings&lt;/strong&gt;: Each token (word or sub-word unit) is assigned a vector of weights that capture semantic meaning.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Transformers&lt;/strong&gt;: The self-attention mechanism in Transformers repeatedly projects vectors into different subspaces, making billions—sometimes trillions—of vector multiplications per training iteration.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Approximation&lt;/strong&gt;: Many LLMs rely on approximate algorithms to speed up training (e.g., approximate nearest neighbor searches, low-rank approximations).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, &lt;em&gt;massive vector operations&lt;/em&gt; define the workload of LLMs. If a computing paradigm handles these huge vector and matrix multiplications efficiently—especially in an approximate sense—it becomes interesting for powering next-generation language models.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Failure Intolerance of Traditional Computing
&lt;/h2&gt;




&lt;p&gt;Classical computing is designed around the notion of &lt;em&gt;deterministic, precise&lt;/em&gt; operations. Every bit flip is a problem if unintended. At scale, you need advanced error-correction or you risk introducing subtle flaws that can derail entire computations. Of course, GPUs and specialized hardware (like TPUs) are extremely fast for matrix operations, but they still lean on deterministic outcomes.&lt;/p&gt;

&lt;p&gt;However, LLMs do not necessarily require absolute precision for every multiplication. Training and inference can be surprisingly robust to noise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Noise in training&lt;/strong&gt;: Some degree of randomness or noise (like dropout, quantization, or approximate matrix multiplication) can even help generalization.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Approximate results&lt;/strong&gt;: Modern LLM pipelines often use half-precision or lower precision calculations—because “close enough” is generally &lt;em&gt;good enough&lt;/em&gt; for gradient-based learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, while classical computing seeks to minimize every error, LLMs thrive in a domain where &lt;em&gt;some&lt;/em&gt; error is acceptable as long as the overall performance is strong.&lt;/p&gt;
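&lt;p&gt;&lt;em&gt;As a rough illustration (a NumPy sketch, not how real mixed-precision kernels run), the representation error from storing values in half precision is tiny relative to what gradient-based learning tolerates:&lt;/em&gt;&lt;/p&gt;

```python
# Sketch: measure the relative error introduced by quantizing a
# float32 vector to float16. fp16 carries roughly 3 decimal digits of
# precision, which is typically "close enough" for training and inference.
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(10_000, dtype=np.float32) + 0.5    # values in [0.5, 1.5)

x_half = x.astype(np.float16).astype(np.float32)  # round-trip through fp16
rel_err = float(np.max(np.abs(x - x_half) / np.abs(x)))

print(f"max relative error: {rel_err:.2e}")  # bounded by fp16 unit roundoff (~4.9e-4)
```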

&lt;h2&gt;
  
  
  3. Quantum Computing and Approximation
&lt;/h2&gt;

&lt;p&gt;Quantum computing, in its current forms, faces the challenge of being prone to errors (due to decoherence, imperfect gate operations, etc.). We’re in the &lt;strong&gt;NISQ (Noisy Intermediate-Scale Quantum)&lt;/strong&gt; era, which means systems are relatively small and noisy. Advanced error-corrected quantum computers are still on the horizon.&lt;/p&gt;

&lt;p&gt;At first glance, this might seem problematic for large-scale matrix operations. But interestingly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Approximate Algorithms&lt;/strong&gt;: Quantum algorithms often produce approximate answers with a certain probability distribution. This might naturally align with LLM architectures, which already operate probabilistically.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Quantum Parallelism&lt;/strong&gt;: Quantum bits (qubits) can be prepared in superpositions over many basis states at once. In principle, this could yield more efficient ways to process huge vector sets or to sample from complex probability distributions (core to LLM next-token sampling).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Variational Circuits&lt;/strong&gt;: Already popular in quantum machine learning, these circuits aim to find parameter sets that optimize a certain objective—very reminiscent of training an LLM.&lt;/li&gt;
&lt;/ul&gt;
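
&lt;p&gt;To see the shape of a variational-circuit optimization loop without any quantum SDK, here is a hypothetical single-qubit sketch in plain NumPy: an Ry(θ) rotation whose parameter is trained, via the parameter-shift rule used in quantum ML, to minimize the expectation value of Pauli-Z. (This is a toy simulation, not a claim about any specific framework's API.)&lt;/p&gt;

```python
# One-qubit variational circuit, simulated classically:
# state |psi> = Ry(theta)|0> = [cos(theta/2), sin(theta/2)],
# objective <psi|Z|psi> = cos(theta), minimized by gradient descent.
import numpy as np

def expval_z(theta: float) -> float:
    psi = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return psi[0] ** 2 - psi[1] ** 2  # <Z> = cos(theta)

theta, lr = 0.1, 0.4
for _ in range(100):
    # Parameter-shift rule: the exact gradient comes from two
    # extra circuit evaluations, shifted by +/- pi/2.
    grad = (expval_z(theta + np.pi / 2) - expval_z(theta - np.pi / 2)) / 2
    theta -= lr * grad

print(f"optimized <Z> = {expval_z(theta):.4f}")  # approaches -1
```

&lt;p&gt;The same optimize-a-parameterized-circuit loop, scaled up and run on real hardware, is very reminiscent of training a neural network layer.&lt;/p&gt;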

&lt;p&gt;So rather than seeing noise as an irredeemable flaw, the LLM space might treat it as a &lt;em&gt;feature&lt;/em&gt;, or at least a &lt;em&gt;manageable limitation&lt;/em&gt;. If the final result is “good enough,” that may suffice; “perfect” is rarely necessary for language generation tasks.&lt;/p&gt;




&lt;h2&gt;
  4. Why “Good Enough, Very Fast” is Actually Perfect in Vectors
&lt;/h2&gt;

&lt;p&gt;Vector-based applications—be they LLM embeddings or neural network layers—are especially forgiving for near-enough solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Dimensional Redundancy&lt;/strong&gt;: High-dimensional embeddings often encode overlapping or redundant features. A small error in one dimension can be compensated by signals in others.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Probabilistic Outcomes&lt;/strong&gt;: LLMs produce distributions over the next token. It’s about capturing the &lt;em&gt;shape&lt;/em&gt; of that distribution, not necessarily nailing every micro detail.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Efficiency Gains&lt;/strong&gt;: Minimizing the exact floating-point error at scale can be computationally prohibitive. If quantum systems can solve or approximate high-dimensional vector operations more efficiently, that’s a big win.&lt;/li&gt;
&lt;/ul&gt;
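
&lt;p&gt;The dimensional-redundancy point is easy to demonstrate with a toy example: corrupt a single coordinate of a hypothetical high-dimensional embedding, and its cosine similarity to the original barely moves, because the signal is spread across many dimensions.&lt;/p&gt;

```python
# A large error in one dimension of a 1024-d vector leaves the
# cosine similarity to the original vector close to 1.
import numpy as np

rng = np.random.default_rng(42)
emb = rng.standard_normal(1024)   # a toy 1024-d "embedding"

noisy = emb.copy()
noisy[0] += 5.0                   # sizable error in one dimension

cos = emb @ noisy / (np.linalg.norm(emb) * np.linalg.norm(noisy))
print(f"cosine similarity after corruption: {cos:.4f}")
```

&lt;p&gt;Downstream tasks that rank or compare embeddings by cosine similarity are essentially unaffected by this kind of localized error.&lt;/p&gt;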

&lt;p&gt;In short, the approximate nature of quantum hardware could align well with the approximate nature of LLM computing tasks—especially as we push vector embeddings to ever-larger scales.&lt;/p&gt;




&lt;h2&gt;
  5. The (Big) Challenges
&lt;/h2&gt;

&lt;p&gt;While the synergy sounds great on paper, there are still some big caveats:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Hardware Scalability&lt;/strong&gt;
Quantum devices are still limited in qubit count, gate fidelity, and coherence times. Training a large-scale LLM on a quantum computer is nowhere near feasible with today’s hardware.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Error Correction Overheads&lt;/strong&gt;
True fault-tolerant quantum computing is an active research area. Error correction schemes require many additional qubits to encode a single “logical qubit.”&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Complexity of Implementation&lt;/strong&gt;
Even if we can do approximate calculations, designing and implementing a quantum-based LLM pipeline is extremely non-trivial. Tools, frameworks, and mental models are still evolving.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cost and Accessibility&lt;/strong&gt;
Quantum systems are expensive and specialized. Cloud-based quantum computing might help, but it’s still not as plug-and-play as a GPU-based environment.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  Quantum ML Tools to Explore
&lt;/h2&gt;

&lt;p&gt;If you’re curious to explore quantum machine learning in a hands-on way, consider checking out some of the emerging frameworks and libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://pennylane.ai/" rel="noopener noreferrer"&gt;&lt;strong&gt;PennyLane&lt;/strong&gt;&lt;/a&gt; (by Xanadu) – A platform for differentiable quantum computing, letting you integrate quantum circuits into machine learning workflows.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://qiskit-community.github.io/qiskit-machine-learning/" rel="noopener noreferrer"&gt;&lt;strong&gt;Qiskit Machine Learning&lt;/strong&gt; (by IBM)&lt;/a&gt; – Tools for building and training quantum ML models using Qiskit’s quantum simulators or real hardware.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.tensorflow.org/quantum" rel="noopener noreferrer"&gt;&lt;strong&gt;TensorFlow Quantum&lt;/strong&gt;&lt;/a&gt; – A quantum extension of TensorFlow that facilitates hybrid quantum-classical machine learning experiments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if you’re not ready to dive into production-grade quantum computing, these libraries are great for experimenting with smaller-scale quantum ML concepts.&lt;/p&gt;




&lt;h2&gt;
  6. Pathways to Quantum-LLM Hybrid Approaches
&lt;/h2&gt;

&lt;p&gt;We may not see a pure quantum LLM tomorrow, but in the meantime, hybrid approaches are emerging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Hybrid Classical-Quantum Workflows&lt;/strong&gt;: Run classical steps (like data preprocessing or embedding) on CPUs/GPUs, then offload certain optimization or sampling tasks to a quantum coprocessor.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Quantum-Inspired Algorithms&lt;/strong&gt;: Some research in “quantum-inspired” linear algebra methods is already benefiting classical HPC (e.g., random projection, approximate matrix factorization).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Approximate Nearest Neighbor on Quantum Hardware&lt;/strong&gt;: Vector search is core to semantic retrieval for LLMs. If quantum systems excel at certain approximate similarity calculations, that might become the first big application.&lt;/li&gt;
&lt;/ul&gt;
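
&lt;p&gt;The “quantum-inspired” bullet can be made concrete with a purely classical sketch: a Johnson–Lindenstrauss-style random projection that compresses 4096-d vectors to 256-d while approximately preserving pairwise distances. It is “good enough, very fast” linear algebra of exactly the kind those methods exploit (the dimensions here are arbitrary choices for illustration).&lt;/p&gt;

```python
# Random projection: distances survive compression, approximately.
import numpy as np

rng = np.random.default_rng(7)
d, k, n = 4096, 256, 20

x = rng.standard_normal((n, d))                  # toy high-dim vectors
proj = rng.standard_normal((d, k)) / np.sqrt(k)  # random projection matrix
y = x @ proj                                     # compressed to 256-d

orig = np.linalg.norm(x[0] - x[1])
comp = np.linalg.norm(y[0] - y[1])
ratio = comp / orig
print(f"distance ratio after 16x compression: {ratio:.3f}")
```

&lt;p&gt;The ratio stays near 1.0 despite a 16x reduction in dimensionality, which is why approximate projections are a workhorse of large-scale vector search.&lt;/p&gt;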




&lt;h2&gt;
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So, &lt;strong&gt;are quantum computers the perfect implementation for LLMs and vectors?&lt;/strong&gt; Not yet, but the fit is striking: LLMs thrive on approximate computations in massive vector spaces, and quantum computing intrinsically deals in probabilistic, error-prone operations. The synergy of “good enough, very fast” lines up well with both the promise and the constraints of quantum hardware.&lt;/p&gt;

&lt;p&gt;As quantum technology matures, we’ll likely see incremental integrations and specialized workflows rather than a total quantum takeover. The fact that quantum computers can fail "gracefully" in a domain where perfect accuracy isn’t the ultimate goal is a compelling argument for their future role in powering sophisticated language models and other vector-heavy, approximate tasks.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What do you think?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Are there quantum-inspired ways to speed up your vector operations today? Have you experimented with quantum frameworks or approximate solutions in your LLM projects? Feel free to share your insights in the comments below!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Happy coding!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
