Quote of the day
The key isn't blindly following "best practices" - it's about understanding your tools well enough to know exactly when to use them.
We continue exploring ways to improve the Tinybird developer experience on the command line. In my last article, we explored enhancing the Tinybird CLI with Oh My Posh, creating a custom prompt that displays critical workspace and environment information. That improvement was aimed at preventing accidental deployments and keeping you aware of the current terminal session's context.
Today, we're diving deeper into another crucial aspect of Tinybird development - managing data in your downstream data sources efficiently.
The Data Management Dilemma
Ever found yourself neck-deep in Tinybird (or any other data-centric) development, making sweeping changes to your processing pipelines and materialization logic, only to realize you need a fresh start? Early-stage feature development is prone to changing requirements, and that makes data structures a moving target.
Or perhaps you've encountered that heart-stopping moment when production data becomes corrupted, and you need a reliable and fast way to recover?
Can you guess my age from this image...?
I've been there, and today I'll share how we've tackled these challenges at my eSUB mothership.
What Can Trigger a Full Refresh
Analytics platforms such as Tinybird - where you ingest and transform data to produce materialized views tuned for performance and cost efficiency - all come with their own set of challenges. During rapid development cycles, you might find yourself:
- Making changes to pipelines
- Implementing major shifts in business logic
- Recovering from unexpected data corruption in an environment
- Periodically refreshing development environments
All of that may demand a clean slate, as it's no longer possible to support new data models or changed requirements when there are breaking changes involved.
The Pitfalls of Destructive Full Refreshes
Before we get into the weeds and I explain the tool I've built, let me get one thing straight - full data rebuilds are super bad, and YOU SHOULD AVOID THEM AT ALL COSTS.
I just threw a wrench into my whole "the-best-ever-o-matic new CLI tool coming up!" sales pitch, didn't I? 😅
Jokes aside (and I am a jolly person), it has to be said loud and clear - when faced with the need to refresh data, performing a destructive full refresh in a production environment carries significant risks that should make you pause and reconsider your approach.
- Service Interruption - during a full refresh, your analytical queries and dashboards may become unavailable or return incomplete results, potentially impacting business operations and decision-making.
- Resource Intensive - full refreshes typically consume substantial computational resources, which leads to increased costs, a performance hit for other running processes, and potentially even exceeding service quotas or limits
- Data Consistency Challenges - a full refresh throws your whole system into a partially inconsistent state; if new real-time events arrive during that window, that data may be lost, and downstream pipes or even other systems may receive incomplete or incorrect data, leaving different parts of your system out of sync
- Recovery Complexity - if something goes wrong during the refresh, you may not have an easy way to roll back, and the original state could be permanently lost
That's just to name a few.
Instead of destructive full refreshes, consider these alternatives:
- Implementing incremental updates where possible. Sometimes all it takes to avoid a breaking change is a little thinking ahead and preparation. It's possible more often than you think.
- Versioning your pipes and datasets. Can't emphasize this enough. If it hits production, then it's your contract, and unless you really know what you are doing (like, really REALLY) making any breaking changes to an already published contract is a big no-no. Think of your Tinybird data-carrying structures (pipes, data sources) as APIs. In the end, more often than not you are going to publish the results of your user-facing pipes as APIs. And what do you do when you have to change a contract in an existing API? You version it. The same principle applies in Tinybird, or really to any part of the system you are building.
With proper versioning you can maintain parallel data structures during transitions, empowering you to do many cool things (there's a tiny naming sketch right after this list), such as:
- A/B testing and canaries (NOT the birds!)
- super easy rollback to proven-and-tested previous versions
- making sure that your new contracts are stable before fully committing
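To make that concrete with a purely hypothetical example - the pipe names below are made up, but the pattern is simply suffixing a version and running both generations side by side until the new contract has proven itself:

```bash
# Hypothetical file names - keep the old contract serving traffic
# while the new, breaking version is rolled out in parallel.
tb push pipes/user_activity_v1.pipe   # existing published endpoint, untouched
tb push pipes/user_activity_v2.pipe   # new schema shipped as a parallel version
```

Once v2 has been validated against real traffic, you can migrate consumers at your own pace and retire v1 when nothing depends on it anymore.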
Going Against the Flow
You must be pretty confused at this point, asking yourself... why build the tool at all?! After all, I just said that triggering a full data rebuild is the last thing you should do!
Look... while I just spent a good chunk of this post telling you why full rebuilds are generally a terrible idea (and they really are!), sometimes you just need the right tool for the job. You know what they say about rules being made to be broken? Well, not exactly broken, but more like... carefully bent when you really know what you're doing! 😉
Sometimes you DO need that nuclear option - maybe you're in development, testing some wild new ideas, or dealing with a situation where a full rebuild is actually the safest bet. The key isn't blindly following "best practices" - it's about understanding your tools well enough to know exactly when to use them.
And hey, that's exactly why I built this tool - for those specific situations where a full rebuild is exactly what the doctor has prescribed!
The Manual Console-driven Method
For simple scenarios where all you need is to trash the data in a single data source, it's perfectly fine to trigger rebuilds from the Tinybird Console - all it takes is a few clicks to get data flowing again.
But let's be real - when you're dealing with anything but a sample app, you will be dealing with dozens if not hundreds of data sources and a complex web of dependencies. Clicking through the UI becomes a pain real quick. You also need to consider orchestrating the order of operations as you definitely don't want to refresh that downstream pipe before its dependencies are ready, right?
With more than just a handful of pipes and data sources present in a project these scenarios can quickly turn into time-consuming ordeals. Lack of proper tooling can turn into a significant operational burden and before you know it, you're spending half your day just clicking through the UI or trying to remember which pipe needs to be refreshed first. Been there, done that, got the t-shirt!
There's an old engineering wisdom that states: if you find yourself doing something manually more than twice, it's time to automate it. This principle became crystal clear when I was recently working on a Tinybird implementation, where I found myself repeatedly resetting data sources during rapid prototyping stages. The first time, it was a novelty. The second time, it felt like déjà vu. By the third time, it was clear:
I Needed Automation
That's when the data re-population script was born - forged from the necessity of automation, refined and polished through repeated use.
One tool to bring them all and in the darkness bind them 💍
Utilizing the native Tinybird CLI capability to manage pipes and data sources from the command line, I crafted a specialized script that handled truncating and repopulating Tinybird data sources in a carefully orchestrated pattern.
My preciousss... ๐
Let's Dive In
After acknowledging all the risks and considerations we discussed earlier, I am giving you a tool that transforms what could be a hazardous operation into a controlled, deliberate process. Think of it as your "break glass in case of emergency" tool - but one that actually knows what it's doing!
The tool provides two core capabilities, each wrapped in multiple layers of safety controls. First, it manages your data sources through selective truncation based on configurable rules, with built-in protection for critical sources through prefix-based exclusions. Second, it handles orchestrated repopulation using a three-phase strategy that respects data dependencies: starting with core reference data, moving through standard operational pipes, and finishing with dependent calculations (that part you should adjust to match your use-case).
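To give you a feel for it, here is a minimal sketch of what that configuration could look like. The protected prefixes mirror the defaults described later in this post, but the pipe names are placeholders I made up for illustration - the actual script's configuration may look different:

```bash
#!/usr/bin/env bash
# Hypothetical configuration sketch - adjust prefixes and pipe names to your project.

# Data sources starting with these prefixes are never truncated.
PROTECTED_PREFIXES=("source_" "ops_" "snapshot_")

# Three-phase repopulation order: core reference data first,
# standard operational pipes next, dependent calculations last.
PHASE_1_PIPES=("populate_dim_customers" "populate_dim_projects")
PHASE_2_PIPES=("populate_fact_events")
PHASE_3_PIPES=("populate_rollup_daily_totals")
```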
Getting started is straightforward:
```bash
# Clone the repository
git clone https://github.com/sebekz/tinybird-devex-plus.git

# Navigate to the script location
cd tinybird-devex-plus/tinybird/scripts

# Make it executable (Unix/Linux/macOS)
chmod +x repopulateAllDataSources.sh
```
You can also pull just the script itself from the GitHub repo - both approaches are fine.
One must not forget about the required dependencies. You'll need the following tools installed (a quick sanity check is sketched right after the list):
- Tinybird CLI - obviously...
- jq - for all the JSON parsing that the script is doing
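If you want to verify both prerequisites before running anything, a quick snippet like this (my own addition, not part of the script) will do:

```bash
# Verify that both required CLIs are on the PATH before going any further.
for cmd in tb jq; do
  command -v "$cmd" >/dev/null 2>&1 || { echo "Missing dependency: $cmd"; exit 1; }
done
```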
Once that's done, you are ready to run it 🔥!
```bash
# Usage: . scripts/repopulateAllDataSources.sh [COMMAND]
#
# Commands:
#   dryrun     - Performs a dry run without mutating any data sources
#   repopulate - DESTRUCTIVE OPERATION! Truncates and populates
#                matching data sources
#
# Example: . scripts/repopulateAllDataSources.sh dryrun
#          . scripts/repopulateAllDataSources.sh repopulate
#
# Note: You must be in the /tinybird folder for this script to be able to
#       use tb authentication details
```
But wait... didn't we forget about something?
Safety First: The Art of Not Breaking Things
Remember how we talked about the risks of full rebuilds? Remember that scene in Jurassic Park where Samuel L. Jackson's character says "Hold onto your butts" before rebooting the park's systems? Well, data repopulation isn't quite as dramatic, but it deserves the same level of respect.
I added a bunch of extra safety measures ensuring that, anytime you run the script, you have to turn all the required safety keys first.
By default, this script protects your critical data sources - landing sources (with the source_ prefix), operational data (ops_ prefix), and snapshots (snapshot_ prefix) remain untouched. These particular prefixes come from the use-case scenarios I was working on; yours may (and most likely will) be different, so feel free to adjust that part to your liking!
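In spirit, the filtering boils down to something like the sketch below. Treat it as a simplification: the real script is more elaborate, and the `--format json` flag and the JSON shape of the `tb datasource ls` output are assumptions on my part - swap in however you enumerate data sources in your setup.

```bash
# Same protected prefixes as in the earlier configuration sketch.
PROTECTED_PREFIXES=("source_" "ops_" "snapshot_")

# Assumption: `tb datasource ls --format json` returns a JSON document with a
# `datasources` array - adjust the jq path to whatever your CLI version emits.
for ds in $(tb datasource ls --format json | jq -r '.datasources[].name'); do
  skip=false
  for prefix in "${PROTECTED_PREFIXES[@]}"; do
    [[ "$ds" == "$prefix"* ]] && { skip=true; break; }
  done
  "$skip" && { echo "Protected, skipping: $ds"; continue; }
  echo "Would truncate: $ds"
done
```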
Protection goes beyond just prefixes. The tool implements what I like to call "progressive do-not-break-everything safety" - you need to explicitly confirm your intentions at several checkpoints before any action is taken.
The script enforces several safety measures:
- Mandatory workspace verification before execution
- Configuration review prompts that demand explicit confirmation
- Dry run capability for risk-free testing
- Final warning and confirmation
- Built-in exclusions for critical data sources
Each step requires explicit "y" or "Y" confirmation. Any other response (including empty) aborts the operation.
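Conceptually, every checkpoint is just a tiny confirmation gate along these lines (a sketch, not the script's literal code):

```bash
# Abort unless the user explicitly answers y/Y - any other input (or just Enter) bails out.
confirm() {
  read -r -p "$1 [y/N] " answer
  [[ "$answer" =~ ^[Yy]$ ]] || { echo "Aborted."; exit 1; }
}

confirm "You are about to TRUNCATE matching data sources. Continue?"
```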
As you can see, it's not just a simple reset button - it's a safe, configurable way of dropping nukes on your data. Think of it as a "reset button with guardrails".
Usage
Run the script without any parameters to see all the available commands.
YES. I have OCD when it comes to nice-looking tools. Can't help myself with all the colors, formatting, etc. Some call it an utter waste of time, but I find that a good-looking tool or good-looking code is just better - as you put more thought into how it looks, you end up putting more thought into how it WORKS.
I guess that's a good topic for a separate article 😄 and I digressed again. That's also something I am good at 😄
When you run the tool, it first walks you through a configuration review. You'll see exactly what data sources would be affected and what the execution plan looks like. It verifies your current workspace context and makes sure you're operating in the environment you intend to.
As mentioned earlier - do not rush to press Y; take your time and triple-check the settings.
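The workspace check itself doesn't need to be fancy - something along these lines works (a rough sketch; I'm deliberately not parsing the `tb workspace current` output, since its exact format may differ between CLI versions):

```bash
# Show which workspace the CLI is currently authenticated against and require
# an explicit acknowledgement before anything destructive happens.
echo "Current workspace:"
tb workspace current
read -r -p "Is this the workspace you intend to modify? [y/N] " answer
[[ "$answer" =~ ^[Yy]$ ]] || { echo "Aborted."; exit 1; }
```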
Always start with dryrun. It's not just a suggestion - it's best practice. This mode lets you validate every step of the process without touching your data. Only when you're absolutely certain about the changes should you proceed with the actual repopulation.
During execution, the tool orchestrates all the required steps. It first truncates the selected data sources, then executes priority pipes to establish your core data foundation. From there, it processes standard operational pipes, and finally handles any dependent calculations. This ordered approach ensures data consistency throughout the rebuild process.
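Stripped of all the safety scaffolding, the execution phase conceptually boils down to the sketch below. The data source and pipe names are placeholders, and `tb datasource truncate` / `tb pipe populate --wait` are, to the best of my knowledge, the standard CLI commands doing the heavy lifting:

```bash
# Placeholder lists - in the real script these come from configuration
# and from discovering the data sources in the workspace.
DATASOURCES_TO_TRUNCATE=("mv_fact_events" "mv_rollup_daily_totals")
PHASE_1_PIPES=("populate_dim_customers")
PHASE_2_PIPES=("populate_fact_events")
PHASE_3_PIPES=("populate_rollup_daily_totals")

# Step 1: truncate the selected (non-protected) data sources.
for ds in "${DATASOURCES_TO_TRUNCATE[@]}"; do
  tb datasource truncate "$ds" --yes
done

# Step 2: repopulate in dependency order, waiting for each populate job to finish.
for pipe in "${PHASE_1_PIPES[@]}" "${PHASE_2_PIPES[@]}" "${PHASE_3_PIPES[@]}"; do
  tb pipe populate "$pipe" --wait
done
```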
Here is how it looks in action, starting with the truncation
after which data sources are re-populated from their respective pipes
The result? A tool that respects the destructive nature of data rebuilds while providing the automation we need for those rare but necessary occasions. It's not about making rebuilds easy - it's about making them safe when they're unavoidable.
Room for improvement
Like any tool born from practical needs, this one has its own wishlist for the future. A few items are keeping me up at night (along with my usual vampire schedule 🧛):
- Moving configuration to script parameters - because hardcoding configuration is so last century
- Migrating the code to Python - as much as I enjoy the old-school Bash routine, Python would make this tool much more maintainable and extensible
- Removing repetitions, shortcuts and unnecessary code - a by-product of the iterative approach and YAGNI; I stopped as soon as I got what I needed, back when I needed it
I'll probably never make the time to tackle any of these improvements, so if you're feeling adventurous - go for it and share! I would love to see what you come up with! But for now, the current version gets the job done while keeping your data safe from accidental nuclear launches.
Where to Next?
In the upcoming articles, I will explore:
- Tinybird Console Pro Tips - unlocking the full-width view and most importantly, the dark mode - vampires like me 🧛 fear the light!
- IaC Blueprint - structuring your Tinybird projects like a pro
- Multi-tenancy Guide - implementing secure and scalable multi-tenant analytics with Tinybird and AWS
- Production Checklist - everything you need for truly production-ready analytics
- DynamoDB with Tinybird - tips and tricks learned when working on a real-world implementation
- Node.js SDK - Tinybird SDK for Node.js that you never knew you needed!
- and many more!
Stay tuned!
Enjoying this content? 🧛
If you found this article helpful, consider following me for more similarly engaging posts!
Note
I'd love to hear your thoughts, suggestions, or even gentle corrections - don't hesitate to drop a comment.
Your feedback, whether it's a nod of approval or pointing out areas for improvement, will help me craft better content in the future!
Disclaimer
This article is an independent developer guide. I am not affiliated with, sponsored by, or officially connected to Tinybird in any way. All views and recommendations expressed are my own.
While these customizations are generally safe, if your coding adventures somehow result in your workstation spontaneously combusting, I must respectfully decline any responsibility for damage to you or your property. However, as a courtesy to your friendly Engineering Vampire 🧛 please do give advance notice of any potential... mishaps. It would be terribly wasteful to let good blood 🩸 go to waste. Just saying!