<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Datamonk</title>
    <description>The latest articles on DEV Community by Datamonk (@datamonk_).</description>
    <link>https://dev.to/datamonk_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2746984%2F1f918d97-0fc5-4474-bba9-89bb2146bf37.jpg</url>
      <title>DEV Community: Datamonk</title>
      <link>https://dev.to/datamonk_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datamonk_"/>
    <language>en</language>
    <item>
      <title>MariaDB to PostgreSQL: Lessons &amp; Challenges</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Fri, 14 Feb 2025 12:18:44 +0000</pubDate>
      <link>https://dev.to/datamonk_/mariadb-to-postgresql-lessons-challenges-42oa</link>
      <guid>https://dev.to/datamonk_/mariadb-to-postgresql-lessons-challenges-42oa</guid>
      <description>&lt;h2&gt;
  
  
  MariaDB to PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Recently, we migrated a Rails application from MariaDB to PostgreSQL. This article goes into the why and how of this migration.&lt;/p&gt;

&lt;p&gt;K1 &lt;a href="https://mariadb.com/newsroom/press-releases/k1-acquires-a-leading-database-software-company-mariadb-and-appoints-new-ceo/" rel="noopener noreferrer"&gt;acquired&lt;/a&gt; MariaDB and appointed a new CEO. This signaled a shift in the database's direction. A private equity firm was now in control of the company overseeing the open-source development of a database that our application heavily relied upon. &lt;/p&gt;

&lt;p&gt;Over the past few years, MariaDB's feature development has stagnated, largely due to its commitment to maintaining MySQL compatibility and a stronger focus on sales rather than innovation. Meanwhile, PostgreSQL has been on a relentless path of growth, continuously closing the gap with Oracle by rolling out feature after feature. &lt;/p&gt;

&lt;p&gt;PostgreSQL’s journey started in 1996, and its development is spearheaded by the PostgreSQL Global Development Group, with contributions from major players like AWS, Microsoft, NTT Data, EnterpriseDB, Crunchy Data, and PgEdge. The core contributor team is distributed across the globe, with experts from Europe, Japan, the US, China, India, and Latin America. Much like Linux, PostgreSQL is likely to remain open-source for the foreseeable future. For our client, this acquisition was the straw that broke the camel's back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why move from MariaDB to PostgreSQL?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The feature gap:&lt;/strong&gt; The feature gap between PostgreSQL and MariaDB has grown significantly over the last 5-7 years. With each release, PostgreSQL has integrated valuable features inspired by Oracle, making it an increasingly attractive choice for enterprises. As a result, it has evolved into a one-stop database solution for all kinds of application needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postgres's rich extension system:&lt;/strong&gt; If you need Postgres to manage your cron jobs, there is &lt;strong&gt;pg_cron&lt;/strong&gt;. If you want to store spatial data in the database, there is &lt;strong&gt;PostGIS&lt;/strong&gt;. For time series, there is &lt;strong&gt;TimescaleDB&lt;/strong&gt;; for vector storage, there is &lt;strong&gt;pgvector&lt;/strong&gt;. Moreover, many startups are modifying Postgres in unexpected ways to bring new kinds of databases to the market. For instance, Neon Tech has separated the storage and compute layers, allowing databases to scale to zero. This enables independent scaling of storage and compute, much like Databricks and Snowflake do for data lakes. The rapid pace of innovation in PostgreSQL presents immense potential benefits for any application.&lt;/p&gt;
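&lt;p&gt;As a small taste of the extension system, here is roughly what scheduling a job with &lt;strong&gt;pg_cron&lt;/strong&gt; looks like (the job name and schedule below are illustrative):&lt;/p&gt;

```sql
-- Requires the pg_cron extension to be installed and enabled
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Run VACUUM every night at 3am; 'nightly-vacuum' is an arbitrary job name
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM');
```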


&lt;h3&gt;Corporate interest vs community interest&lt;/h3&gt;

&lt;p&gt;MariaDB operates as a corporate entity, prioritizing business interests, while PostgreSQL thrives on its vibrant open-source community. The PostgreSQL Global Development Group’s mission is to advance the database based on the needs of its developers and users, ensuring a steady flow of meaningful enhancements.&lt;/p&gt;
&lt;h2&gt;
  
  
  High Risk
&lt;/h2&gt;

&lt;p&gt;While this migration looks very attractive, it is not without its own risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long time to finish:&lt;/strong&gt; These migrations take a long time to finish, even in simple cases. A seemingly straightforward upgrade from MySQL 5 to MySQL 8 can take up to nine months. This process includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Migrating the schema and data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updating application code&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conducting regression and load testing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Coordination between different application teams:&lt;/strong&gt; Teams need to make sure that the new database doesn't break any features and that all the tests pass. Good &lt;strong&gt;test&lt;/strong&gt; coverage is crucial to such a migration project. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fundamental change to the application:&lt;/strong&gt; Since everything sits on top of the database, a change to it impacts everything: the application, background jobs, BI, etc. &lt;/p&gt;

&lt;p&gt;Migrating requires adjustments to the schema, data, and application code. Given that application code can run into millions of lines, the risk of something breaking is significant.&lt;/p&gt;

&lt;h2&gt;
  
  
  High Rewards
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;More features:&lt;/strong&gt; With a thriving open-source community behind the project, new features are continuously being added. The ecosystem of extensions is expanding rapidly, bringing innovative capabilities to the database core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simpler application logic makes it easier to maintain:&lt;/strong&gt; A feature-rich database like PostgreSQL allows developers to offload logic to the database itself, reducing complexity at the application level. This not only improves performance but also enhances maintainability and throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  Yandex data &lt;a href="https://www.youtube.com/watch?v=-SS4R1sFH3c" rel="noopener noreferrer"&gt;migration&lt;/a&gt; case study
&lt;/h2&gt;

&lt;p&gt;Yandex Mail initially stored its metadata in an Oracle database. By 2012, the growing feature demands and high licensing costs of Oracle became unsustainable, with expenses running into millions of dollars. Additionally, Oracle's restrictive licensing prohibited publishing benchmarks comparing its database performance, which led to its absence from the &lt;a href="https://benchmark.clickhouse.com/" rel="noopener noreferrer"&gt;ClickBench benchmark&lt;/a&gt;. So, they tried migrating to a different database that could support their needs at a reasonable cost. It took them ten years to finish the migration. &lt;/p&gt;

&lt;p&gt;The first time, they tried switching to MySQL. Along the way, they also tried to fix everything that was wrong with the codebase, increasing the scope of the project and eventually setting it up for failure. The second time, they tried writing their own custom DBMS (perhaps attempting to reverse engineer Oracle). That failed as well. The third time was the charm.&lt;/p&gt;

&lt;h2&gt;
  
  
  How?
&lt;/h2&gt;

&lt;p&gt;We started with the schema migration, then moved on to data export and import, and finished with code changes to make the application compatible with the new PostgreSQL database. &lt;/p&gt;

&lt;h2&gt;
  
  
  Schema transfer
&lt;/h2&gt;

&lt;p&gt;There are many nuances one has to keep in mind while migrating from one database to another. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datatype challenges:&lt;/strong&gt; PostgreSQL does not support unsigned integers, unlike MariaDB. It also does not allow text fields with size attributes, nor does it accept backticks (`) in queries. Once we jumped through these hoops, we were able to generate the matching schema in the PostgreSQL database. &lt;/p&gt;
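&lt;p&gt;A sketch of the kind of translation involved (the table and column names are made up for illustration):&lt;/p&gt;

```sql
-- MariaDB original: unsigned integer, sized text, backticks
-- CREATE TABLE `events` (id INT UNSIGNED PRIMARY KEY, body TEXT(500));

-- PostgreSQL equivalent: widen to BIGINT to cover the unsigned INT range,
-- drop the size attribute on text, use double quotes for identifiers
CREATE TABLE "events" (id BIGINT PRIMARY KEY, body TEXT);
```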

&lt;p&gt;&lt;strong&gt;Fulltext indexes:&lt;/strong&gt; The application relied on full-text search for keyword retrieval in entity titles and bodies. While MariaDB offers simple full-text search indexing, PostgreSQL provides richer configurations for advanced search capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary fields:&lt;/strong&gt; Binary fields are a little tricky in how they deal with null characters. MariaDB allows null characters in string fields, while PostgreSQL's text types do not; binary data containing null bytes has to live in a &lt;code&gt;bytea&lt;/code&gt; column. This is not a feature either database decided to offer or withhold: the difference arises from their storage mechanisms. MariaDB uses a length-prefixed buffer, while PostgreSQL's text types rely on C-style null-terminated strings.&lt;/p&gt;
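&lt;p&gt;A minimal illustration of the difference in PostgreSQL (the &lt;code&gt;blobs&lt;/code&gt; table is hypothetical):&lt;/p&gt;

```sql
CREATE TABLE blobs (id integer, data bytea);

-- Null bytes are fine in bytea...
INSERT INTO blobs VALUES (1, decode('00ff00', 'hex'));

-- ...but a text value cannot contain them:
-- SELECT convert_from(data, 'UTF8') FROM blobs;  -- fails on the null byte
```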

&lt;h2&gt;
  
  
  Data export and import
&lt;/h2&gt;

&lt;p&gt;We will cover three ways in which data can be exported from MariaDB to PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pgloader:&lt;/strong&gt; An automated tool for loading data into PostgreSQL from different sources, such as MySQL, SQLite, and CSV. It creates the corresponding schema in PostgreSQL, inserts the data into the relevant tables, and then rebuilds the indexes after loading the dataset. It is good for simple datasets with simple data types. &lt;/p&gt;
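&lt;p&gt;In its simplest form, pgloader takes a source and a target connection string (the credentials and database names below are placeholders):&lt;/p&gt;

```shell
# Migrate schema and data in one shot; requires pgloader to be installed
pgloader mysql://maria_user:secret@localhost/app_db \
         pgsql://pg_user:secret@localhost/app_db
```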

&lt;p&gt;&lt;strong&gt;Dumping SQL and rerunning on the database:&lt;/strong&gt; This process involves dumping raw SQL statements from the source database and editing the SQL to match the target database. It's a good option when upgrading the database to a newer version. &lt;/p&gt;
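&lt;p&gt;A rough sketch of this workflow (the database name is a placeholder; the exact edits needed depend on your schema):&lt;/p&gt;

```shell
# Dump the MariaDB schema and data as plain SQL
mariadb-dump --skip-extended-insert app_db > dump.sql

# Hand-edit dump.sql: backticks, unsigned types, ENGINE clauses, etc.

# Replay the edited dump against PostgreSQL
psql -d app_db -f dump.sql
```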

&lt;p&gt;&lt;strong&gt;CSV export and import:&lt;/strong&gt; This option is good for data that doesn't contain binary fields, as those cause various issues while dumping and parsing CSV. Since CSV is a widely supported format for importing and exporting, the rest transfers cleanly.&lt;/p&gt;
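&lt;p&gt;A sketch of this round trip for a single table (paths and table names are illustrative; PostgreSQL's &lt;code&gt;COPY&lt;/code&gt; needs server-side file access, otherwise use &lt;code&gt;\copy&lt;/code&gt; in psql):&lt;/p&gt;

```sql
-- MariaDB: export a table to CSV
SELECT * FROM customers
INTO OUTFILE '/tmp/customers.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';

-- PostgreSQL: import the same file
COPY customers FROM '/tmp/customers.csv' WITH (FORMAT csv);
```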

&lt;h2&gt;
  
  
  Code changes
&lt;/h2&gt;

&lt;p&gt;Apart from simple changes in the codebase to replace quotes and remove certain keywords, some other changes were needed in the SQL that ran directly against the database, as opposed to going through an ORM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw SQL in code:&lt;/strong&gt; The codebase used the &lt;code&gt;interval&lt;/code&gt; keyword; PostgreSQL expects the interval value as a quoted string, unlike MariaDB. Easy change.&lt;/p&gt;
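&lt;p&gt;For example (the &lt;code&gt;orders&lt;/code&gt; table is illustrative):&lt;/p&gt;

```sql
-- MariaDB
SELECT * FROM orders WHERE created_at > NOW() - INTERVAL 7 DAY;

-- PostgreSQL: the interval value goes in quotes
SELECT * FROM orders WHERE created_at > now() - interval '7 days';
```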

&lt;p&gt;&lt;strong&gt;Search:&lt;/strong&gt; Search used the &lt;code&gt;MATCH ... AGAINST&lt;/code&gt; clause, which had to be replaced with a &lt;code&gt;to_tsvector(...) @@ to_tsquery(...)&lt;/code&gt; clause after creating the relevant index on the appropriate columns.&lt;/p&gt;
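&lt;p&gt;A sketch of such a rewrite (the table and column names are illustrative; real setups often precompute the tsvector in a dedicated indexed column):&lt;/p&gt;

```sql
-- MariaDB full-text search
SELECT id FROM articles WHERE MATCH(title, body) AGAINST ('database migration');

-- PostgreSQL: create a GIN index on the tsvector expression...
CREATE INDEX articles_fts ON articles
USING gin (to_tsvector('english', title || ' ' || body));

-- ...then query with the same expression
SELECT id FROM articles
WHERE to_tsvector('english', title || ' ' || body)
      @@ plainto_tsquery('english', 'database migration');
```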

&lt;p&gt;&lt;strong&gt;Binary fields:&lt;/strong&gt; The syntax is a little different in Postgres compared to MariaDB, but that was easily managed. &lt;/p&gt;

&lt;h2&gt;
  
  
  Learning from this project
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Good test coverage allows the team to make code changes with confidence and iterate faster. This is probably the single most important thing. If the codebase lacks test coverage for some modules, the team may want to start by writing tests. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Limit the scope of the migration. It may be tempting to fix everything wrong with the codebase that you see along the way, but limiting the scope to just the migration will help the team stay focused and deliver results. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run small experiments independently on a toy dataset. We did this with the full-text search module. These experiments allowed us to find the right search solution before migrating to the new database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run load testing and benchmarks. The old database may be tuned for performance with a lot of custom indexes, so take your time to build indexes in the destination database and tune them for your workload.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Whether it is Yandex Mail moving off Oracle or a Rails application moving off MariaDB, migrating to PostgreSQL is a long and complex process, but it can unlock greater scalability, flexibility, and cost savings. Both experiences reinforce the importance of thorough planning, controlled scope, and rigorous testing in large-scale database migrations.&lt;/p&gt;

</description>
      <category>mariadb</category>
      <category>postgres</category>
      <category>database</category>
      <category>sql</category>
    </item>
    <item>
      <title>Tenant Based Filtering: Apache Superset</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Fri, 07 Feb 2025 19:01:14 +0000</pubDate>
      <link>https://dev.to/datamonk_/tenant-based-filtering-apache-superset-5db7</link>
      <guid>https://dev.to/datamonk_/tenant-based-filtering-apache-superset-5db7</guid>
      <description>&lt;p&gt;In this post, we will explore how we successfully implemented Row-Level Security &lt;a href="https://superset.apache.org/docs/security/" rel="noopener noreferrer"&gt;(RLS)&lt;/a&gt; in Apache Superset to create a multi-tenant dashboard that dynamically filters data based on the logged-in user’s company.&lt;/p&gt;

&lt;p&gt;The dvdrental database, based on the &lt;a href="https://dev.mysql.com/doc/sakila/en/sakila-introduction.html" rel="noopener noreferrer"&gt;Sakila&lt;/a&gt; dataset, is a well-known sample database used for learning SQL and database management. It represents a DVD rental store, with tables for customers, rentals, payments, and films. To transform this database into a &lt;strong&gt;multi-tenant system&lt;/strong&gt;, we needed multiple companies to use the same database while ensuring each could only access its own data.&lt;/p&gt;

&lt;p&gt;To achieve this, a &lt;strong&gt;company_id&lt;/strong&gt; column was added to key tables such as &lt;strong&gt;rental&lt;/strong&gt;, &lt;strong&gt;payment&lt;/strong&gt;, and &lt;strong&gt;customer&lt;/strong&gt;. This allowed the system to associate customers and transactions with specific companies. The main objective was to create a single dashboard in &lt;strong&gt;Apache Superset&lt;/strong&gt; that could be used by different users while dynamically filtering data based on their assigned company.&lt;/p&gt;

&lt;p&gt;The key requirements were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Admins should have access to all data across all companies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Company Users should only see data relevant to their specific company.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Row-Level Security (RLS)&lt;/strong&gt; should be implemented to automatically filter data based on the logged-in user's company_id, retrieved from their email.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With RLS in place, data was filtered at the query level, ensuring that a single dashboard could be used securely by multiple companies. Admins had a complete view of all data, while regular users could only see company-specific information. The company_id was linked to user emails, allowing for seamless access control without requiring manual input from users.&lt;/p&gt;

&lt;p&gt;This approach provided a &lt;strong&gt;secure&lt;/strong&gt;, &lt;strong&gt;scalable&lt;/strong&gt;, and &lt;strong&gt;efficient multi-tenant dashboard&lt;/strong&gt;, enabling different companies to operate within the same system without exposing their data to others.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Roles For Users
&lt;/h2&gt;

&lt;p&gt;The configuration process begins in the &lt;strong&gt;Superset User Interface (UI)&lt;/strong&gt;, where roles and users are created and assigned specific permissions.&lt;/p&gt;

&lt;p&gt;In the Security section of Superset, roles can be defined with specific access levels. Each role determines what a user can do within the system, such as accessing datasets, viewing dashboards, and reading charts. By assigning permissions at the role level, different users can have varying levels of access. For example, an Admin role is configured to have full access to all data, while Company User roles are restricted to seeing only data associated with their company.&lt;/p&gt;

&lt;p&gt;Once roles are created, users are added in the Superset UI and assigned these roles. When a user logs in, the system automatically applies the RLS filter based on their company_id. Superset achieves this by dynamically injecting a &lt;strong&gt;WHERE clause&lt;/strong&gt; into queries executed by the user.&lt;/p&gt;
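&lt;p&gt;As a rough sketch, an RLS filter in Superset is just a SQL predicate attached to a role, something along these lines (the &lt;code&gt;user_companies&lt;/code&gt; lookup table is an assumption for illustration, while &lt;code&gt;current_username()&lt;/code&gt; is one of Superset's Jinja macros):&lt;/p&gt;

```sql
-- Clause Superset appends to every query for the restricted role
company_id = (
  SELECT company_id FROM user_companies
  WHERE email = '{{ current_username() }}'
)
```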

&lt;p&gt;This &lt;strong&gt;dynamic filtering&lt;/strong&gt; ensures that users have access only to the data they are authorized to see, making Superset an efficient tool for multi-tenant dashboards while maintaining strict data security and access control.&lt;/p&gt;

&lt;p&gt;After applying Row-Level Security (RLS) to each user and assigning them access to their specific rows, the admin retains full access to all the data across companies. For example, if the admin sees 1,000 records in a table, one user might have access to only 200 rows, while another might see 250, depending on their company's data. &lt;/p&gt;

&lt;h2&gt;
  
  
  Dashboards
&lt;/h2&gt;

&lt;p&gt;This is clearly reflected in the dashboards. The admin dashboard displays the complete dataset, including the total sales from all companies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrexcart3mzd2h0r1h1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrexcart3mzd2h0r1h1v.png" alt="Admin" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For instance, in the pie chart, you can see that the total sales for DVD rentals amount to 1.95 million, representing data from all over the country. &lt;/p&gt;

&lt;p&gt;USER 1&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qs8v2lo2hl1hf3w9qq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qs8v2lo2hl1hf3w9qq0.png" alt="User1" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;USER 2&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl0x67kid43zsij0g93l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl0x67kid43zsij0g93l.png" alt="User2" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, when viewing the dashboards of individual users, the total sales figure is noticeably reduced. This happens because their dashboards only reflect company-specific data.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Conclusion
&lt;/h2&gt;

&lt;p&gt;By implementing RLS, we achieve effective tenant-based filtering, ensuring that data is securely and efficiently segmented for different users. This not only enhances security but also improves data management, making it a powerful tool for multi-tenant applications.&lt;/p&gt;

</description>
      <category>superset</category>
      <category>apache</category>
      <category>dataengineering</category>
      <category>analyst</category>
    </item>
    <item>
      <title>A Tale of Two Databases</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Fri, 07 Feb 2025 05:59:25 +0000</pubDate>
      <link>https://dev.to/datamonk_/a-tale-of-two-databases-hf3</link>
      <guid>https://dev.to/datamonk_/a-tale-of-two-databases-hf3</guid>
      <description>&lt;h2&gt;
  
  
  SQLite vs PostgreSQL
&lt;/h2&gt;

&lt;p&gt;SQLite is one of the &lt;a href="https://www.sqlite.org/mostdeployed.html" rel="noopener noreferrer"&gt;most deployed&lt;/a&gt; databases in the world. It's small (under 1 MB), fast, and reliable (supported until 2050). PostgreSQL is the world's most advanced open-source object-relational database, developed in the open for over 35 years by its community. Out of 177 mandatory ANSI SQL features, it complies with &lt;a href="https://www.postgresql.org/docs/current/features.html" rel="noopener noreferrer"&gt;170&lt;/a&gt;, the most of any database. While SQLite offers 5 datatypes (NULL, integer, real, text, and blob), Postgres boasts &lt;a href="https://www.postgresql.org/docs/current/datatype.html" rel="noopener noreferrer"&gt;43 datatypes&lt;/a&gt;, including XML, JSON, and IP addresses, and more (time series, geometry, raster) can be added through extensions. It is clear that while SQLite is a fast rabbit, PostgreSQL is the elephant in the room. &lt;/p&gt;

&lt;p&gt;Rather than comparing the documentation of these two wonderful databases, we will look at some of the hoops we need to jump through in order to support SQLite for a web application.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Flexible typing
&lt;/h3&gt;

&lt;p&gt;This &lt;a href="https://www.twilio.com/en-us/blog/sqlite-postgresql-complicated" rel="noopener noreferrer"&gt;article&lt;/a&gt; talks about how SQLite's flexible typing caught the authors off guard: they ended up storing slugs longer than 256 bytes in a 256-byte string column and UUIDs in an integer column. They only found out once they migrated their database from SQLite to Postgres. This could have caused bugs that would be difficult to reproduce and track down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wc&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hello world'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hello world 2'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'coca cola'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above commands run totally fine in SQLite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;sqlite&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;┌────┬───────────────┬───────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;     &lt;span class="n"&gt;title&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="n"&gt;wc&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────┼───────────────┼───────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;hello&lt;/span&gt; &lt;span class="n"&gt;world&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;hello&lt;/span&gt; &lt;span class="n"&gt;world&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;hello&lt;/span&gt; &lt;span class="n"&gt;world&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;coca&lt;/span&gt; &lt;span class="n"&gt;cola&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;hello&lt;/span&gt; &lt;span class="n"&gt;world&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;        &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└────┴───────────────┴───────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this may not cause any problems as is, because aggregate functions (max, sum) simply ignore values that cannot be cast, migrating this data to a strictly typed database may be a problem.&lt;/p&gt;

&lt;p&gt;However, this can all be avoided by just creating &lt;a href="https://www.sqlite.org/stricttables.html" rel="noopener noreferrer"&gt;strict tables&lt;/a&gt;. &lt;br&gt;
The second insert below will throw an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;articles_strict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wc&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;STRICT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;articles_strict&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hello world'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;articles_strict&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hello world 2'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Single thread writer
&lt;/h2&gt;

&lt;p&gt;SQLite allows only one writer at a time. In its default rollback-journal mode, it uses a locking mechanism where a write operation locks the entire database and other read/write operations have to wait in a queue. Similarly, a write operation has to wait for in-flight reads to finish. &lt;/p&gt;

&lt;p&gt;However, much of this can be avoided with WAL (write-ahead logging) mode, where readers do not block the writer and the writer does not block readers. A writer will still &lt;a href="https://www.sqlite.org/whentouse.html#:~:text=If%20many%20threads%20and%2For,a%20time%20per%20database%20file." rel="noopener noreferrer"&gt;block&lt;/a&gt; other writers, which have to wait in a queue.&lt;/p&gt;
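&lt;p&gt;Switching a database file to WAL mode is a one-liner:&lt;/p&gt;

```sql
-- Persistent setting, stored in the database file itself
PRAGMA journal_mode=WAL;
```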

&lt;h2&gt;
  
  
  3. Recursive depth-first search
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Depth-first recursive queries: SQLite &lt;a href="https://modern-sql.com/caniuse/search_(recursion)" rel="noopener noreferrer"&gt;does NOT&lt;/a&gt; support the standard search clause in its recursive queries. However, a depth-first or breadth-first traversal can be implemented in SQLite using an &lt;code&gt;order by&lt;/code&gt; clause.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;boss&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tenure&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITHOUT&lt;/span&gt; &lt;span class="n"&gt;ROWID&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Cindy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Dave'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Emma'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Fred'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Cindy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Gail'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Cindy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When querying, we can traverse the tree in depth-first order while also sorting siblings by tenure. Amazing. This is exactly what is needed to solve &lt;a href="https://github.com/lobsters/lobsters/blob/master/app/models/comment.rb#L695-L728" rel="noopener noreferrer"&gt;comment ordering&lt;/a&gt; in Lobsters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt;
  &lt;span class="n"&gt;under_alice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenure&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;under_alice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenure&lt;/span&gt;
      &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;under_alice&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;under_alice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
     &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;substr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'..........'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'  '&lt;/span&gt;&lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;tenure&lt;/span&gt;  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;under_alice&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│ substr('..........',1,level*3) || name || '  '|| tenure │
├─────────────────────────────────────────────────────────┤
│ Alice  10                                               │
│ ...Cindy  10                                            │
│ ......Gail  10                                          │
│ ......Fred  1                                           │
│ ...Bob  1                                               │
│ ......Emma  10                                          │
│ ......Dave  1                                           │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
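&lt;p&gt;As a sanity check, the same traversal can be reproduced from Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. This is a minimal sketch using the &lt;code&gt;org&lt;/code&gt; table above:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE org(
  name TEXT PRIMARY KEY,
  boss TEXT REFERENCES org,
  tenure INTEGER
) WITHOUT ROWID;
INSERT INTO org VALUES
  ('Alice', NULL, 10), ('Bob', 'Alice', 1), ('Cindy', 'Alice', 10),
  ('Dave', 'Bob', 1), ('Emma', 'Bob', 10), ('Fred', 'Cindy', 1),
  ('Gail', 'Cindy', 10);
""")

# ORDER BY level DESC makes the recursion depth-first; tenure DESC
# orders siblings by tenure within each subtree.
rows = conn.execute("""
WITH RECURSIVE under_alice(name, level, tenure) AS (
  VALUES('Alice', 0, 10)
  UNION ALL
  SELECT org.name, under_alice.level + 1, org.tenure
    FROM org JOIN under_alice ON org.boss = under_alice.name
   ORDER BY 2 DESC, 3 DESC
)
SELECT name FROM under_alice
""").fetchall()

names = [r[0] for r in rows]
print(names)  # ['Alice', 'Cindy', 'Gail', 'Fred', 'Bob', 'Emma', 'Dave']
```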



&lt;h2&gt;
  
  
  4. Materialized views
&lt;/h2&gt;

&lt;p&gt;There are &lt;a href="https://news.ycombinator.com/item?id=40074323" rel="noopener noreferrer"&gt;no materialized views&lt;/a&gt; in SQLite. The closest workaround suggested in the thread above is to &lt;code&gt;CREATE TABLE AS&lt;/code&gt; and run a job that refreshes the table every hour or day.&lt;/p&gt;
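&lt;p&gt;A minimal sketch of that workaround, using a toy &lt;code&gt;sales&lt;/code&gt; table (the table names here are illustrative): the "materialized view" is just a snapshot table that a scheduled job rebuilds.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales(region TEXT, amount INTEGER);
INSERT INTO sales VALUES('north', 100), ('north', 50), ('south', 70);
-- Emulate a materialized view: a plain table built from a query.
CREATE TABLE sales_by_region AS
  SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")

def refresh_sales_by_region(conn):
    # The "refresh" a cron job would run: drop and rebuild the snapshot.
    conn.executescript("""
    DROP TABLE sales_by_region;
    CREATE TABLE sales_by_region AS
      SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    """)

# The snapshot is stale until the next refresh runs.
conn.execute("INSERT INTO sales VALUES('south', 30)")
refresh_sales_by_region(conn)
rows = dict(conn.execute("SELECT region, total FROM sales_by_region"))
print(rows)  # {'north': 150, 'south': 100}
```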

&lt;h2&gt;
  
  
  5. Full text search
&lt;/h2&gt;

&lt;p&gt;For this experiment, we will use the &lt;a href="https://dev.mysql.com/doc/sakila/en/sakila-introduction.html" rel="noopener noreferrer"&gt;Sakila&lt;/a&gt; database to showcase how to run a full-text search query. The simplest way to run a search query is to use the &lt;code&gt;like&lt;/code&gt; operator, but this triggers a full table scan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;sqlite&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt; 
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;film_id&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="nb"&gt;BLOB&lt;/span&gt; &lt;span class="n"&gt;SUB_TYPE&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;sqlite&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;explain&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="s1"&gt;'%shark%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;QUERY&lt;/span&gt; &lt;span class="n"&gt;PLAN&lt;/span&gt;
&lt;span class="nv"&gt;`--SCAN film
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fortunately, SQLite provides &lt;a href="https://www.sqlite.org/fts5.html" rel="noopener noreferrer"&gt;FTS5&lt;/a&gt;. If the query below returns 1, FTS5 is enabled in the SQLite installation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sqlite_version&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;sqlite_compileoption_used&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'ENABLE_FTS5'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the virtual FTS table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;VIRTUAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;fts5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'film'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_rowid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'film_id'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'rebuild'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, query the table using the &lt;code&gt;MATCH&lt;/code&gt; operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;sqlite&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="s1"&gt;'suit'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;┌────────────┬──────────────────────────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="n"&gt;title&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt;                         &lt;span class="n"&gt;description&lt;/span&gt;                          &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────────────┼──────────────────────────────────────────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;CORE&lt;/span&gt; &lt;span class="n"&gt;SUIT&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;Unbelieveable&lt;/span&gt; &lt;span class="n"&gt;Tale&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Car&lt;/span&gt; &lt;span class="k"&gt;And&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Explorer&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt; &lt;span class="n"&gt;must&lt;/span&gt; &lt;span class="n"&gt;Confro&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;nt&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Boat&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;Manhattan&lt;/span&gt; &lt;span class="n"&gt;Penthouse&lt;/span&gt;                           &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────────────┼──────────────────────────────────────────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;SPEED&lt;/span&gt; &lt;span class="n"&gt;SUIT&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;Brilliant&lt;/span&gt; &lt;span class="n"&gt;Display&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Frisbee&lt;/span&gt; &lt;span class="k"&gt;And&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Mad&lt;/span&gt; &lt;span class="n"&gt;Scientist&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt; &lt;span class="n"&gt;mus&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;Succumb&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Robot&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Ancient&lt;/span&gt; &lt;span class="n"&gt;China&lt;/span&gt;                           &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────────────┼──────────────────────────────────────────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;SUIT&lt;/span&gt; &lt;span class="n"&gt;WALLS&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;Touching&lt;/span&gt; &lt;span class="n"&gt;Panorama&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Lumberjack&lt;/span&gt; &lt;span class="k"&gt;And&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Frisbee&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt; &lt;span class="n"&gt;must&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;uild&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Dog&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Australia&lt;/span&gt;                                      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└────────────┴──────────────────────────────────────────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, if we insert a new record into &lt;code&gt;film&lt;/code&gt;, it won't automatically appear in &lt;code&gt;film_fts&lt;/code&gt;. To make the new row searchable, we would need to run the rebuild again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;release_year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_language_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rental_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rental_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replacement_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;special_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_update&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'CORE SUIT 2'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'A Unbelievable Tale of a Car And a Explorer who must Confront a Boat in A Manhattan Penthouse'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2006&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;86&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'PG'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Deleted Scenes,Behind the Scenes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2006-02-15 05:03:42'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, running the rebuild over and over is expensive; a better solution is to keep the index in sync with triggers.&lt;/p&gt;

&lt;p&gt;Trigger for insert&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;film_insert&lt;/span&gt; &lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trigger for update&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;film_update&lt;/span&gt; &lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'delete'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trigger for delete&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;film_delete&lt;/span&gt; &lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'delete'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
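&lt;p&gt;Putting it together, here is a condensed sketch of the trigger approach from Python's &lt;code&gt;sqlite3&lt;/code&gt; module, with a simplified &lt;code&gt;film&lt;/code&gt; schema and only the insert and delete triggers:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film(film_id INTEGER PRIMARY KEY, title TEXT, description TEXT);

-- External-content FTS5 table backed by film.
CREATE VIRTUAL TABLE film_fts USING fts5(
  title, description, content='film', content_rowid='film_id');

CREATE TRIGGER film_insert AFTER INSERT ON film BEGIN
  INSERT INTO film_fts(rowid, title, description)
  VALUES (new.film_id, new.title, new.description);
END;

CREATE TRIGGER film_delete AFTER DELETE ON film BEGIN
  INSERT INTO film_fts(film_fts, rowid, title, description)
  VALUES ('delete', old.film_id, old.title, old.description);
END;
""")

# New rows become searchable immediately, with no manual rebuild.
conn.execute("INSERT INTO film VALUES (1, 'CORE SUIT 2', 'A tale of a shark')")
hits = conn.execute(
    "SELECT title FROM film_fts WHERE film_fts MATCH 'shark'").fetchall()
print(hits)  # [('CORE SUIT 2',)]
```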



&lt;h2&gt;
  
  
  In Conclusion
&lt;/h2&gt;

&lt;p&gt;SQLite and PostgreSQL both serve their purposes, and you can technically use either for almost anything. However, the real question is: why force one to handle something beyond its capabilities when there’s an option that excels at it? Yes, you can configure and tweak SQLite to make it work, but why shoehorn it when you can let PostgreSQL handle your load efficiently?&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>sqlite</category>
      <category>database</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How DeepSeek is Making High-Performance AI Accessible to All</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Fri, 31 Jan 2025 16:33:03 +0000</pubDate>
      <link>https://dev.to/datamonk_/how-deepseek-is-making-high-performance-ai-accessible-to-all-26fp</link>
      <guid>https://dev.to/datamonk_/how-deepseek-is-making-high-performance-ai-accessible-to-all-26fp</guid>
      <description>&lt;p&gt;AI research is evolving fast, but training massive models is still a tough challenge because of the huge computing power needed. That’s where DeepSeek is changing the game. They’ve found a way to build top-tier AI models without burning through an enormous number of GPUs. By using a smart mix of cost-effective training strategies, Nvidia’s PTX assembly, and reinforcement learning, they’ve created cutting-edge models like DeepSeek-R1-Zero and DeepSeek-R1—proving that innovation doesn’t always have to come with an extreme price tag.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing GPU Usage: Cost-Effective Training at Scale
&lt;/h2&gt;

&lt;p&gt;DeepSeek has set a new benchmark in efficient AI model training. For instance, DeepSeek trained its DeepSeek-V3 Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster of 2,048 Nvidia H800 GPUs in just two months—totaling 2.8 million GPU hours, according to its research paper.&lt;/p&gt;

&lt;p&gt;In comparison, OpenAI’s GPT-4, one of the most advanced language models, is estimated to have been trained using tens of thousands of Nvidia A100/H100 GPUs over several months, with a significantly higher compute cost. Similarly, Meta’s Llama 3, which has 405 billion parameters, required 30.8 million GPU hours—11 times more compute power than DeepSeek-V3—using 16,384 H100 GPUs over 54 days.&lt;/p&gt;

&lt;p&gt;DeepSeek’s approach stands out because of its cost-effective training strategies and efficient utilization of Nvidia’s PTX assembly and reinforcement learning techniques. By optimizing GPU usage and computational efficiency, it is proving that cutting-edge AI models don’t have to come with an astronomical price tag. This shift challenges the traditional belief that only massive GPU clusters can produce state-of-the-art models, making high-performance AI development more accessible and sustainable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost of DeepSeek-R1: Beyond the $6M Hype
&lt;/h2&gt;

&lt;p&gt;There's been a lot of hype around DeepSeek's claim that they trained their latest model for just $6 million. But let’s be real—this number only covers the GPU rental costs for the final pre-training run. The actual investment behind DeepSeek R1 is much, much bigger.&lt;/p&gt;

&lt;p&gt;Born from High-Flyer, a Chinese hedge fund that embraced AI early, DeepSeek made a bold move in 2021—acquiring 10,000 A100 GPUs before export restrictions tightened. This early bet secured a massive computational advantage. By 2023, they spun off as an independent AI lab, self-funded and ahead of the curve.&lt;/p&gt;

&lt;p&gt;Today, DeepSeek operates with around 50,000 Hopper GPUs, including H800s, H100s, and incoming H20s, shared with High-Flyer’s trading operations. Their actual investment? Likely over $500 million, with total infrastructure costs nearing $1.3 billion.&lt;/p&gt;

&lt;p&gt;Beyond hardware, DeepSeek’s strength lies in its elite 150-person team, handpicked for skill over credentials. Top engineers earn over $1.3 million annually—outpacing salaries at Chinese tech giants. Free from corporate bureaucracy, they optimize everything in-house, running their own data centers to push AI research further.&lt;/p&gt;

&lt;p&gt;DeepSeek isn’t just cost-efficient—it’s strategically built for dominance. The "$6M" figure is a footnote in a much bigger story of foresight, risk-taking, and deep R&amp;amp;D.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unlocking Maximum Efficiency: How DeepSeek Used PTX to Push GPU Limits
&lt;/h2&gt;

&lt;p&gt;While most AI companies stick to Nvidia’s CUDA framework to train large models, DeepSeek took a bold and unconventional approach—leveraging PTX (Parallel Thread Execution) assembly to unlock previously untapped efficiency in GPU operations. This decision played a crucial role in the success of DeepSeek-R1-Zero and DeepSeek-R1, allowing them to be trained with fewer GPUs while maintaining high performance.&lt;/p&gt;

&lt;p&gt;But what exactly is PTX, and why does it matter?&lt;/p&gt;

&lt;h3&gt;PTX: The Assembly Language of Nvidia GPUs&lt;/h3&gt;

&lt;p&gt;CUDA is often the go-to framework for AI development because it provides an easy-to-use interface for GPU programming. However, CUDA is essentially a high-level abstraction—it translates code into PTX, which is Nvidia’s intermediate assembly language for GPU execution.&lt;/p&gt;

&lt;p&gt;Think of it like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CUDA is like writing in Python—easy to use, but not the fastest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PTX is like writing in assembly language—harder to master, but gives full control over hardware performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DeepSeek went beyond CUDA, rewriting key computational operations directly in PTX. This allowed them to optimize memory access, reduce instruction overhead, and execute GPU instructions with greater precision, pushing performance to its limits.&lt;/p&gt;
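&lt;p&gt;To make the idea concrete, here is a hypothetical sketch of what dropping from CUDA C++ down to inline PTX looks like. The kernel and the instruction choice are illustrative only, not DeepSeek's actual code:&lt;/p&gt;

```cuda
// Illustrative only: a CUDA kernel that issues one PTX instruction
// directly via inline assembly instead of relying on nvcc's codegen.
__global__ void add_vec(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        // "add.f32" is the raw PTX add; hand-tuned kernels schedule
        // loads, stores, and fused multiply-adds the same way.
        asm volatile("add.f32 %0, %1, %2;" : "=f"(r) : "f"(a[i]), "f"(b[i]));
        out[i] = r;
    }
}
```

&lt;p&gt;A single instruction gains nothing by itself; the payoff comes from hand-scheduling whole kernels this way, which is exactly the control CUDA's compiler abstracts away.&lt;/p&gt;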

&lt;h3&gt; Why PTX Gave DeepSeek-R1-Zero and R1 an Edge &lt;/h3&gt;

&lt;p&gt;By tapping into PTX, DeepSeek unlocked three major advantages that traditional CUDA-based models often miss:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ultra-Fine Hardware Control&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CUDA automatically optimizes code, but it doesn’t always make the best choices for efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With PTX, DeepSeek manually fine-tuned GPU instructions, ensuring every computational cycle was used efficiently.     &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimized Memory Utilization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;One of the biggest bottlenecks in training AI models is memory overhead. CUDA’s default memory allocation can be inefficient, leading to wasted GPU memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DeepSeek restructured tensor operations at the PTX level, reducing memory bottlenecks and increasing throughput.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Better Instruction Scheduling &amp;amp; Parallel Execution&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPUs are designed to process thousands of operations in parallel, but CUDA’s compiler doesn’t always schedule instructions optimally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DeepSeek rewrote key computational kernels in PTX, achieving faster execution times and fewer processing stalls.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;PTX in Action: The DeepSeek Difference&lt;/h3&gt;

&lt;p&gt;Most AI companies throw more GPUs at the problem to speed up training. DeepSeek, on the other hand, focused on efficiency. By bypassing CUDA in key areas and directly optimizing PTX execution, they maximized GPU utilization without increasing hardware costs.&lt;/p&gt;

&lt;p&gt;This shift in approach redefines what’s possible in AI training. Instead of relying on brute-force compute power, DeepSeek proved that smart software optimizations can be just as impactful as expensive hardware upgrades.&lt;/p&gt;

&lt;p&gt;By mastering PTX, DeepSeek is not just developing AI models—it’s reshaping how AI is built, proving that next-generation models can be trained smarter, not harder.&lt;/p&gt;

&lt;h2&gt;
  
  
  DeepSeek-R1-Zero: Reinforcement Learning Without Supervised Fine-Tuning
&lt;/h2&gt;

&lt;h3&gt;A Bold Step in AI Development&lt;/h3&gt;

&lt;p&gt;DeepSeek-R1-Zero is an AI model trained purely through Reinforcement Learning (RL), without any Supervised Fine-Tuning (SFT). Traditionally, AI models undergo supervised fine-tuning before reinforcement learning to improve their reasoning skills. However, DeepSeek-R1-Zero skipped this step entirely, proving that a model can develop strong reasoning abilities autonomously through trial and feedback. This unconventional method challenges the long-held belief that supervised fine-tuning is essential for high-performance AI.&lt;/p&gt;

&lt;h3&gt;Exceptional Reasoning Without Supervised Training&lt;/h3&gt;

&lt;p&gt;One of the most remarkable aspects of DeepSeek-R1-Zero is its ability to generalize and solve complex problems using RL alone. Unlike traditional models that rely on large amounts of labeled data, this model learned organically by interacting with its environment. Additionally, its performance could be further improved using majority voting, a technique that refines responses by selecting the most common answer across multiple attempts. On the AIME benchmark, DeepSeek-R1-Zero’s accuracy increased from 71.0% to 86.7% when majority voting was applied, even surpassing OpenAI-o1-0912. This achievement highlights the true potential of reinforcement learning in building highly capable AI systems.&lt;/p&gt;
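
&lt;p&gt;Majority voting (often called self-consistency) is simple to implement: sample several answers to the same question and return the most frequent one. A minimal sketch in Python, with hard-coded strings standing in for sampled model outputs:&lt;/p&gt;

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among several sampled attempts."""
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

# Stand-ins for five sampled responses to one AIME-style question
samples = ["42", "42", "17", "42", "17"]
print(majority_vote(samples))  # prints 42, the consensus answer
```

&lt;p&gt;In practice the gain comes from sampling with a nonzero temperature, so the attempts actually differ and errors are outvoted by the consistent answer.&lt;/p&gt;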

&lt;h3&gt;The Self-Evolution Process&lt;/h3&gt;

&lt;p&gt;A key aspect of DeepSeek-R1-Zero’s development was its ability to self-evolve without human intervention. Since the model was trained purely with RL, researchers could closely observe how it progressed and refined its reasoning over time. Instead of improving based on human-provided examples, DeepSeek-R1-Zero learned from reinforcement feedback alone. By increasing test-time computation—giving itself more time to process and generate reasoning tokens—the model naturally improved its problem-solving strategies. This process demonstrated that AI can teach itself to think more deeply without requiring external adjustments.&lt;/p&gt;

&lt;h3&gt;Emergent Behaviors: Reflection and Alternative Problem-Solving&lt;/h3&gt;

&lt;p&gt;One of the most fascinating discoveries during DeepSeek-R1-Zero’s training was the spontaneous emergence of advanced reasoning behaviors. The model began to reflect on its own responses, revisiting and improving previous answers. It also started exploring multiple ways to solve a problem, rather than sticking to a single fixed approach. These behaviors weren’t explicitly programmed but emerged organically as a result of the RL training process. This milestone suggests that reinforcement learning can lead AI to develop structured thinking on its own, a significant step toward more autonomous and intelligent models.&lt;/p&gt;

&lt;h3&gt;The “Aha Moment” – A Breakthrough in AI Reasoning&lt;/h3&gt;

&lt;p&gt;One of the most intriguing moments in DeepSeek-R1-Zero’s evolution was the so-called "aha moment". At a certain stage in its development, the model realized that allocating more time to difficult problems led to better solutions. Instead of rushing to generate responses, it started pausing, reconsidering, and refining its reasoning process. This shift wasn’t directly taught to the model—it emerged naturally as a result of reinforcement learning optimizing for better problem-solving strategies. For researchers, witnessing this shift was just as exciting as it was for the model itself. It highlighted the power of reinforcement learning to drive independent intelligence and showed how AI can develop strategies beyond what was explicitly programmed.&lt;/p&gt;

&lt;h3&gt;Challenges: Readability and Language Mixing&lt;/h3&gt;

&lt;p&gt;Despite its impressive reasoning capabilities, DeepSeek-R1-Zero was not without flaws. One major issue was readability, as the model’s reasoning process was often difficult to follow. Additionally, it sometimes suffered from language mixing, blending multiple languages in its responses, which reduced clarity. These challenges showed that while RL alone can drive strong reasoning development, a balance between autonomous learning and structured human guidance is still necessary for a more practical AI system.&lt;/p&gt;

&lt;h3&gt;Refining the Model: The Introduction of DeepSeek-R1&lt;/h3&gt;

&lt;p&gt;To address these shortcomings, DeepSeek introduced DeepSeek-R1, a refined version that combines RL with a human-friendly “cold-start” dataset. This hybrid approach maintains the strong reasoning capabilities of DeepSeek-R1-Zero while improving readability and response structure. By integrating some level of human supervision, DeepSeek-R1 ensures that its reasoning remains strong while also making its output more coherent and accessible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Upgrading to DeepSeek-R1: Reinforcement Learning with SFT and Cold Start Data
&lt;/h2&gt;

&lt;p&gt;DeepSeek took a significant leap forward from DeepSeek-R1-Zero by integrating Supervised Fine-Tuning (SFT) and Cold Start Data into its training pipeline, leading to the development of DeepSeek-R1. This upgrade enhanced the model’s reasoning capabilities and alignment with human preferences, setting a new standard for open-source AI models.&lt;/p&gt;

&lt;h3&gt;The Two-Stage RL and SFT Pipeline&lt;/h3&gt;

&lt;p&gt;DeepSeek-R1 improved upon its predecessor by incorporating a refined two-stage reinforcement learning (RL) and SFT process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two RL Stages&lt;/strong&gt;: Focused on refining reasoning abilities while ensuring outputs align with human expectations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two SFT Stages&lt;/strong&gt;: Built a strong foundation for both reasoning and non-reasoning tasks to improve overall model performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Addressing the Cold Start Problem&lt;/h3&gt;

&lt;p&gt;One of the major challenges in AI training is the cold start problem, where models struggle in early training phases due to a lack of initial guidance. DeepSeek tackled this by carefully curating high-quality, diverse datasets for the first SFT stage. This ensured the model acquired solid foundational knowledge before reinforcement learning took over.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Readability Improvements&lt;/strong&gt;: Unlike DeepSeek-R1-Zero, which sometimes generated unreadable or mixed-language responses, DeepSeek-R1’s cold start data was designed with structured formatting, including a clear reasoning process and summary for each response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Boost&lt;/strong&gt;: By strategically crafting cold-start data with human-guided patterns, DeepSeek-R1 exhibited superior reasoning abilities compared to DeepSeek-R1-Zero.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Enhancing Reasoning with Reinforcement Learning&lt;/h3&gt;

&lt;p&gt;After establishing a solid foundation with cold-start data, DeepSeek-R1 employed large-scale reinforcement learning to further enhance its reasoning skills. This phase focused on areas requiring structured logic, such as coding, mathematics, and science.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Language Consistency Rewards&lt;/strong&gt;: To prevent responses from mixing multiple languages, DeepSeek introduced a reward mechanism that prioritized target-language consistency, ensuring more user-friendly outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized Reasoning Tasks&lt;/strong&gt;: The model balanced accuracy in logic-driven tasks with human readability, refining its problem-solving approach through iterative reinforcement learning.&lt;/li&gt;
&lt;/ul&gt;
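
&lt;p&gt;The paper does not spell out the exact formula for the language consistency reward, but the idea can be approximated as the fraction of output tokens written in the target language, which is then combined with the task reward. A hedged Python sketch (the token-level language check here is a toy ASCII test, not DeepSeek’s actual classifier):&lt;/p&gt;

```python
def language_consistency_reward(tokens, is_target_language):
    """Fraction of tokens in the target language/script (0.0 to 1.0).

    A higher fraction means fewer mixed-language tokens, so this value
    can be added to the task reward to discourage language mixing.
    """
    if not tokens:
        return 0.0
    matches = sum(1 for tok in tokens if is_target_language(tok))
    return matches / len(tokens)

# Illustrative check: treat ASCII tokens as "target language"
tokens = ["The", "answer", "is", "42", "因此"]
reward = language_consistency_reward(tokens, str.isascii)
print(round(reward, 2))  # 0.8: four of the five tokens pass the check
```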

&lt;h3&gt;Supervised Fine-Tuning for Diverse Capabilities&lt;/h3&gt;

&lt;p&gt;Once RL training reached convergence, the next step was supervised fine-tuning (SFT) to further refine the model across reasoning and non-reasoning tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning Data&lt;/strong&gt;: Leveraging rejection sampling, DeepSeek-R1 curated a dataset of 600k reasoning-focused training samples, ensuring only high-quality responses were included.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-Reasoning Data&lt;/strong&gt;: A separate dataset of 200k samples covered diverse areas like writing, factual Q&amp;amp;A, self-cognition, and translation, enabling DeepSeek-R1 to perform well beyond just logic-based tasks.&lt;/li&gt;
&lt;/ul&gt;
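
&lt;p&gt;Conceptually, rejection sampling for data curation means: generate several candidate responses per prompt, keep only those that pass a quality check (for example, a verifiable final answer), and discard the rest. A toy Python sketch, with a hypothetical exact-match check standing in for DeepSeek’s actual filters:&lt;/p&gt;

```python
import itertools

def curate_by_rejection_sampling(prompts, generate, is_acceptable, k=4):
    """For each prompt, sample k candidates and keep only acceptable ones."""
    dataset = []
    for prompt in prompts:
        for _ in range(k):
            response = generate(prompt)
            if is_acceptable(prompt, response):
                dataset.append({"prompt": prompt, "response": response})
    return dataset

# Toy stand-ins: a "model" that sometimes answers wrong, and an exact-match check
answers = itertools.cycle(["4", "5", "4", "4"])
generate = lambda prompt: next(answers)
is_acceptable = lambda prompt, response: response == "4"

curated = curate_by_rejection_sampling(["What is 2 + 2?"], generate, is_acceptable)
print(len(curated))  # prints 3: three of the four sampled responses pass
```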

&lt;h3&gt;Reinforcement Learning for Holistic Improvement&lt;/h3&gt;

&lt;p&gt;To ensure DeepSeek-R1 aligned with human preferences while maintaining strong reasoning, an additional reinforcement learning phase was introduced. This phase prioritized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Helpfulness&lt;/strong&gt;: Ensuring responses were relevant and user-friendly, with a focus on clear and useful summaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harmlessness&lt;/strong&gt;: Filtering out biases, harmful content, or misleading information while maintaining logical accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Balanced Training&lt;/strong&gt;: Integrating reasoning and general-purpose training data to create a well-rounded model capable of excelling in both structured problem-solving and open-ended tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Final Outcome: A Breakthrough in Open-Source AI&lt;/strong&gt;
&lt;/h3&gt;


&lt;p&gt;By combining reinforcement learning, supervised fine-tuning, and strategically curated cold start data, DeepSeek-R1 emerged as a groundbreaking model, outperforming its predecessors. Its distilled versions (DeepSeek-R1-Distill) achieved state-of-the-art results in reasoning benchmarks, proving the effectiveness of this hybrid training approach. DeepSeek-R1 not only pushes the boundaries of AI reasoning but also ensures outputs are more user-friendly, readable, and aligned with human expectations.&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>openai</category>
      <category>ai</category>
    </item>
    <item>
      <title>Simplifying Data Governance with DataHub and PostgreSQL Integration</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Tue, 28 Jan 2025 12:42:25 +0000</pubDate>
      <link>https://dev.to/datamonk_/simplifying-data-governance-with-datahub-and-postgresql-integration-2pc9</link>
      <guid>https://dev.to/datamonk_/simplifying-data-governance-with-datahub-and-postgresql-integration-2pc9</guid>
      <description>&lt;p&gt;In the digital age, data is growing at an unprecedented rate. Terabytes of new data are created—almost as effortlessly as you take a breath. This rapid influx of data presents both opportunities and challenges for businesses, organizations, and data professionals alike. The need to manage, understand, and make sense of this data has never been more crucial.&lt;/p&gt;

&lt;p&gt;As data continues to grow, tools for managing and governing it, like &lt;a href="https://datahubproject.io/" rel="noopener noreferrer"&gt;DataHub&lt;/a&gt;, become essential. DataHub empowers organizations to centralize and organize their data, making it more accessible and useful. Whether you’re working with small datasets or large enterprise databases, DataHub offers the features needed to ensure your data is not only discoverable but also governed and well-documented.&lt;/p&gt;

&lt;p&gt;Managing a database is just a small part of the larger picture, much like how DataHub plays a key role in the vast ocean of data management. In this blog, we’ll take a closer look at how you can ingest a PostgreSQL database into DataHub, explore its key features, and see how it can help you manage and visualize your data with ease.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisite
&lt;/h2&gt;

&lt;p&gt;(Note: &lt;em&gt;the steps below assume Ubuntu&lt;/em&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ubuntu.com/server/docs/install-and-configure-postgresql" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;DataHub running locally; refer to &lt;a href="https://datahubproject.io/docs/quickstart" rel="noopener noreferrer"&gt;the quickstart guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;When ingesting a PostgreSQL database into DataHub, the process begins by connecting DataHub to the database through a configured source connector (defined in a .yml file that acts as a bridge between DataHub and the PostgreSQL database). DataHub then extracts metadata such as the database structure, schemas, tables, columns, and relationships. Key features like table-level lineage, data profiling, and classification are enabled during this process to enhance data discovery and governance.&lt;/p&gt;

&lt;p&gt;The ingestion process also captures additional metadata such as column types, foreign key relationships, and constraints. Once ingested, this metadata is indexed and made available for searching, browsing, and visualization within the DataHub platform.&lt;/p&gt;

&lt;p&gt;The overall goal is to centralize the metadata, improve data discoverability, and facilitate collaboration across teams by providing a comprehensive view of the database structure and data flow.&lt;/p&gt;

&lt;p&gt;After running the command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;datahub docker quickstart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;you will be able to access the DataHub UI at &lt;a href="http://localhost:9002" rel="noopener noreferrer"&gt;http://localhost:9002&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Source Connector
&lt;/h2&gt;

&lt;p&gt;To ingest metadata from PostgreSQL into DataHub, a source connector is used, defined in a configuration file, typically a .yml file. For instance, create a file called postgresql_ingestion.yml that contains all the necessary connection details and parameters required for the ingestion process.&lt;/p&gt;

&lt;p&gt;The configuration file acts as a bridge between your PostgreSQL database and DataHub. It includes essential details like the host, database name, username, and the DataHub server link to ensure that metadata is successfully extracted and ingested into the platform.&lt;br&gt;
like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqx8pdl5r19c5p3lelrjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqx8pdl5r19c5p3lelrjh.png" alt=".yml file" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;
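
&lt;p&gt;For reference, a minimal recipe along these lines might look like the following (host, credentials, and database name are placeholders; check the DataHub PostgreSQL source documentation for the full option list):&lt;/p&gt;

```yaml
# postgresql_ingestion.yml -- placeholder values, adjust for your setup
source:
  type: postgres
  config:
    host_port: "localhost:5432"
    database: weather_db
    username: datahub_user
    password: "${POSTGRES_PASSWORD}"
    profiling:
      enabled: true

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```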

&lt;p&gt;After setting up the source connector in the .yml configuration file, the next step is to ingest your database into DataHub. This process involves running the ingestion command, which uses the configuration file to extract metadata from your database and push it to your DataHub instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;datahub ingest &lt;span class="nt"&gt;-c&lt;/span&gt; postgres2_ingestion.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the dataset now ingested into DataHub, all its information is seamlessly organized and made available for efficient management and governance through a suite of advanced features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;Let's understand these features using the weather data project example I discussed in the Data Orchestration blog post. This project involves a schema containing hourly weather data, daily weather summaries, and global averages, now seamlessly integrated into DataHub for enhanced data management and governance.&lt;/p&gt;

&lt;h3&gt;Containers&lt;/h3&gt;

&lt;p&gt;These can be used to represent Weather data into logical groupings like databases, schemas, and tables. For example, hourly weather data can be in one schema, while daily and global data occupy others, enabling clear categorization for streamlined navigation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdc4hqtux05k31rxni6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdc4hqtux05k31rxni6a.png" alt="containers" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Classification&lt;/h3&gt;

&lt;p&gt;By tagging sensitive weather data or business-critical metrics (e.g., identifying PII (Personally Identifiable Information) such as location, date, or address, or other compliance-relevant data), we can classify the data effectively to enforce security, regulatory compliance, and usage guidelines.&lt;/p&gt;

&lt;h3&gt;Data Profiling&lt;/h3&gt;

&lt;p&gt;Through SQL profiling, DataHub can generate comprehensive statistics for tables, such as average temperatures, the frequency of anomalies, and missing data percentages.&lt;/p&gt;
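
&lt;p&gt;Under the hood, profiling boils down to aggregate statistics per column. A rough Python sketch of the kind of numbers involved, using toy in-memory values rather than a live table:&lt;/p&gt;

```python
def profile_column(values):
    """Basic column profile: row count, null percentage, and mean of non-nulls."""
    non_null = [v for v in values if v is not None]
    null_pct = 100.0 * (len(values) - len(non_null)) / len(values)
    mean = sum(non_null) / len(non_null) if non_null else None
    return {
        "rows": len(values),
        "null_pct": round(null_pct, 1),
        "mean": round(mean, 2) if mean is not None else None,
    }

# Toy hourly temperatures with one missing reading
temps = [21.5, 22.0, None, 23.5]
print(profile_column(temps))  # {'rows': 4, 'null_pct': 25.0, 'mean': 22.33}
```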

&lt;h3&gt;Description&lt;/h3&gt;

&lt;p&gt;Each table, column, or dataset can include metadata descriptions (e.g., "Hourly Weather Data includes temperature, humidity, and wind speed per hour"). This ensures clear context, enabling users and systems to understand the data easily without repeated explanations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3vej9vcixqz1wpppbl7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3vej9vcixqz1wpppbl7.png" alt="description" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Detect Deleted Entities &lt;/h3&gt;

&lt;p&gt;If data is removed (e.g., outdated weather stations or deprecated data tables) from the PostgreSQL database, DataHub detects and reflects these changes, reducing clutter and ensuring that outdated references no longer impact the system.&lt;/p&gt;

&lt;h3&gt;Domains&lt;/h3&gt;

&lt;p&gt;Domains allow grouping datasets by their purpose, like "Forecasting," "Historical Weather," and "Climate Analytics." This structure simplifies data governance by enabling domain-specific visibility and control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yo870u6s84982nyp786.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yo870u6s84982nyp786.png" alt="Domains" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Platform Instance&lt;/h3&gt;

&lt;p&gt;Platform-specific metadata allows you to tag and distinguish between different instances of the PostgreSQL database, such as production and staging environments. This makes it easier to track where the data is coming from and avoids any confusion between test data and live production data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdluzylm7hg03cxwfodmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdluzylm7hg03cxwfodmp.png" alt="tagging" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Schema Metadata &lt;/h3&gt;

&lt;p&gt;The weather project’s database schema, with details on tables, columns, and relationships, is automatically extracted and indexed, enabling quick discovery of how data is organized. For instance, knowing the relationship between hourly data and daily summaries aids in analysis automation.&lt;/p&gt;

&lt;h3&gt;Table-Level Lineage&lt;/h3&gt;

&lt;p&gt;Tracking lineage maps the entire journey of the data, showing how the hourly weather dataset is aggregated into daily summaries and further processed to generate global averages. This transparency ensures trust in the results and makes it easier to identify and resolve any issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcehin6etywv5jm9ykad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcehin6etywv5jm9ykad.png" alt="lineage" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;
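
&lt;p&gt;This lineage mirrors the underlying transformations: hourly rows roll up into daily summaries, which roll up into a global average. A toy sketch of that flow (illustrative values, not the project’s real tables):&lt;/p&gt;

```python
from statistics import mean

# Hourly readings per (city, date) -- stand-ins for the hourly weather table
hourly = {
    ("Alwar", "2025-01-19"): [20.0, 22.0, 24.0],
    ("Jaipur", "2025-01-19"): [18.0, 20.0],
}

# Stage 1: hourly readings become daily summaries
daily = {key: mean(temps) for key, temps in hourly.items()}

# Stage 2: daily summaries become a global average
global_avg = mean(daily.values())

print(daily[("Alwar", "2025-01-19")], global_avg)  # 22.0 20.5
```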

&lt;h2&gt;
  
  
  So...
&lt;/h2&gt;

&lt;p&gt;After exploring all the features, it's clear that integrating DataHub with your PostgreSQL database can really take data management to the next level. For example, lineage tracking allows you to easily trace how your weather data flows, from hourly temperature readings to daily summaries and even global averages. This transparency not only builds trust in the data but also makes it much easier to fix any issues if something goes wrong.&lt;/p&gt;

&lt;p&gt;Another powerful feature is data profiling, which helps you monitor the quality of your data. Whether it's checking for missing values or spotting unusual patterns, it ensures the data you're working with is reliable—something that's especially important when you're forecasting weather trends or working with large datasets.&lt;/p&gt;

&lt;p&gt;By using these features, you're not just managing data—you're ensuring that it’s secure, accurate, and trustworthy. DataHub is like an extra layer of safety and efficiency for your weather project, streamlining workflows while giving you confidence that the data you’re using is always up to par.&lt;/p&gt;

&lt;p&gt;Adopting DataHub isn't just about tools, it's about improving how we handle data so that we can make better decisions and keep things running smoothly.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>datahub</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>A detailed study on three popular ETL tools for Workflow Orchestration.</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Thu, 23 Jan 2025 13:51:13 +0000</pubDate>
      <link>https://dev.to/datamonk_/a-detailed-study-on-three-popular-etl-tools-for-workflow-orchestration-7hn</link>
      <guid>https://dev.to/datamonk_/a-detailed-study-on-three-popular-etl-tools-for-workflow-orchestration-7hn</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/datamonk_" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2746984%2F1f918d97-0fc5-4474-bba9-89bb2146bf37.jpg" alt="datamonk_"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/datamonk_/different-orchestration-tool-analysis-airflow-dagster-flyte-192d" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Data Orchestration Tool Analysis: Airflow, Dagster, Flyte&lt;/h2&gt;
      &lt;h3&gt;Datamonk ・ Jan 23&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#dataengineering&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tooling&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#opensource&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>dataengineering</category>
      <category>devops</category>
      <category>softwaredevelopment</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Data Orchestration Tool Analysis: Airflow, Dagster, Flyte</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Thu, 23 Jan 2025 13:09:19 +0000</pubDate>
      <link>https://dev.to/datamonk_/different-orchestration-tool-analysis-airflow-dagster-flyte-192d</link>
      <guid>https://dev.to/datamonk_/different-orchestration-tool-analysis-airflow-dagster-flyte-192d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Data orchestration tools are key for managing data pipelines in modern workflows. Apache Airflow, Dagster, and Flyte are three popular tools serving this need, but they serve different purposes and follow different philosophies. Choosing the right tool for your requirements is essential for scalability and efficiency. In this blog, I will compare &lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Apache Airflow&lt;/a&gt;, &lt;a href="https://dagster.io/" rel="noopener noreferrer"&gt;Dagster&lt;/a&gt;, and &lt;a href="https://flyte.org/" rel="noopener noreferrer"&gt;Flyte&lt;/a&gt;, exploring their evolution, features, and unique strengths, while sharing insights from my hands-on experience with these tools in a weather data pipeline project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In the weather data project, I got the chance to work with all three of these tools, Airflow, Dagster, and Flyte, and gained an understanding of what makes each one unique. In this blog, I’ll share my experience comparing them, break down how each one works, and highlight what sets them apart.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Airflow
&lt;/h2&gt;

&lt;p&gt;Apache Airflow got its start at Airbnb back in October 2014 as a Python-based orchestrator with a web interface, designed to handle the company’s growing workflow challenges. It joined the Apache Incubator in 2016 and earned its spot as a top-level Apache Software Foundation project in 2019, marking a major milestone in its journey.&lt;/p&gt;

&lt;p&gt;Airflow proved to be a blessing for Airbnb, simplifying the management and scheduling of the company’s complex tasks.&lt;/p&gt;

&lt;p&gt;In the weather data project, I used Airflow to automate the data pipeline, ensuring tasks like fetching, processing, and storing weather data ran in the correct order. Each task depended on the successful completion of the previous one, ensuring smooth and sequential execution from start to finish.&lt;/p&gt;

&lt;p&gt;An Airflow DAG file consists of three main components: the DAG instantiation, the task definitions, and the task dependencies. It looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Dag Instance
&lt;/span&gt;&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_dag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 0 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Daily at midnight
&lt;/span&gt;    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tzinfo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;IST&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dagrun_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Task Definitions
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;weather_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_tables&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;         
        &lt;span class="nf"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;         
        &lt;span class="nf"&gt;fetch_and_store_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;     
        &lt;span class="nf"&gt;fetch_day_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;global_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;     
        &lt;span class="nf"&gt;fetch_global_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  

&lt;span class="c1"&gt;# Task Dependencies
&lt;/span&gt;    &lt;span class="n"&gt;create_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_tables&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;fetch_weather_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alwar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-01-19&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fetch_daily_weather_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alwar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;global_average_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;global_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alwar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Task Order
&lt;/span&gt;    &lt;span class="n"&gt;create_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fetch_weather_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fetch_daily_weather_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;global_average_task&lt;/span&gt;

&lt;span class="n"&gt;weather_dag_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;weather_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And it’s all managed through the Airflow UI, which provides a way to monitor and track the progress of the entire pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7rfciwb4j9nt25uw4m0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7rfciwb4j9nt25uw4m0.png" alt="Apache Airflow UI" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  DAGSTER
&lt;/h2&gt;

&lt;p&gt;Dagster was developed by Elementl, a company founded by Nick Schrock in April 2019. With a vision to reshape the data management ecosystem, Nick introduced Dagster, a fresh programming model for data processing.&lt;/p&gt;

&lt;p&gt;Unlike traditional tools that focus primarily on tasks or jobs, Dagster emphasizes the relationships between inputs and outputs. Its asset-centric approach focuses on treating data assets as the central units of computation in a pipeline.&lt;/p&gt;

&lt;p&gt;Each asset is represented as a dataset, and the pipeline revolves around how the assets depend on each other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Table Creation for the Weather Data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Creates databse tables needed for weather data.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_database&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;setup_database&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# setup_database is a dependency
&lt;/span&gt;        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The hourly data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;city and date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_and_store_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;MaterializeResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;number of rows&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;       
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# fetch_weather is a dependency
&lt;/span&gt;        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Day Average&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;city and date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_day_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="c1"&gt;# asset based graphs
&lt;/span&gt;    &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max Temp (°C)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Min Temp (°C)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Condition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avg Humidity (%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;weather_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;MaterializeResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Row added&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MetadataValue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;to_markdown&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# fetch_daily_weather is a dependency
&lt;/span&gt;        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Whole Average&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;      
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;global_weather&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;fetch_global_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dagster builds a clear dependency graph, making the pipeline transparent and easy to debug.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0wg8ms17t0sagfoilhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0wg8ms17t0sagfoilhy.png" alt="Dagster UI" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traditional Task-Based Workflow&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task 1: Fetch weather data.
Task 2: Clean the data.
Task 3: Store the cleaned data in a database.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Asset-Centric Workflow in Dagster&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Asset 1: Raw weather data (produced by fetching from an API).
Asset 2: Cleaned weather data (transformed from raw weather data).
Asset 3: Stored weather dataset (created from cleaned data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With Dagster, you can build custom asset &lt;a href="https://docs.dagster.io/tutorial/building-an-asset-graph" rel="noopener noreferrer"&gt;graphs&lt;/a&gt;, linking them directly to pipeline steps. This feature stands out because it helps you monitor data as it evolves through each pipeline stage. It adds clarity and interactivity to the workflow, making debugging and monitoring far more intuitive, a capability I didn’t encounter in Airflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna3eqpqay9yet3jfy7jl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna3eqpqay9yet3jfy7jl.png" alt="Assets-Graphs" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it’s not just asset-centric: if you prefer the task-based approach used in Airflow, Dagster has you covered too. You can define tasks using the @op decorator (&lt;a href="https://docs.dagster.io/concepts/ops-jobs-graphs/ops" rel="noopener noreferrer"&gt;operations&lt;/a&gt;) in Dagster, much as you’d use @task in Airflow. Whether you prefer working with assets or tasks, you have the flexibility to choose the approach that works best for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  FLYTE
&lt;/h2&gt;

&lt;p&gt;Flyte, a workflow orchestration tool, was initially developed by Lyft in 2016 as an internal platform to manage complex machine learning and data processing pipelines. It was open-sourced in 2020, making it accessible to other companies. It leverages Kubernetes, allowing businesses to scale and manage their data and ML workflows reliably and efficiently. &lt;/p&gt;

&lt;p&gt;Designed to handle both machine learning and data engineering workflows, Flyte is built on Kubernetes and leverages its &lt;strong&gt;containerized infrastructure&lt;/strong&gt; to run large-scale jobs, enabling efficient resource scaling and management. &lt;/p&gt;

&lt;p&gt;In Flyte, tasks are defined as Python functions and then composed into workflows. Each task represents a unit of work, and tasks can be connected with dependencies that indicate their execution order. This is broadly similar to Airflow’s task-centric approach.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_database&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  
    &lt;span class="nf"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="nf"&gt;fetch_and_store_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   

&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="nf"&gt;fetch_day_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;global_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;   
    &lt;span class="nf"&gt;fetch_global_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="nd"&gt;@workflow&lt;/span&gt;         &lt;span class="c1"&gt;#defining the workflow
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Noida&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2025-01-17&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# The workflow will execute the tasks in the order they are defined
&lt;/span&gt;    &lt;span class="nf"&gt;setup_database&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;global_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Workflow executed successfully for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running wf() &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flyte makes local execution easy with &lt;a href="https://docs.flyte.org/projects/flytectl/en/latest/" rel="noopener noreferrer"&gt;flytectl&lt;/a&gt;, which sets up a sandbox container for testing workflows. Plus, it lets you run Python code locally, so you can test and debug your workflows before deploying them to the cloud.&lt;/p&gt;

&lt;p&gt;Flyte emerges as a modern solution for virtually every aspect of tech workflows, offering a number of key &lt;a href="https://flyte.org/airflow-alternative" rel="noopener noreferrer"&gt;benefits&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;h3&gt; DAG Versioning &lt;/h3&gt;

&lt;p&gt;While working on the weather data project in Airflow, one of the challenges I encountered was managing changes in the pipeline over time, a common issue known as DAG versioning. If you update a pipeline to add or modify tasks, there’s no native way to version those changes, run different versions side by side, or roll back to a previous version. Users face extra complexity from workarounds like appending version numbers to DAG IDs, using Git to track code, or maintaining separate environments. &lt;/p&gt;
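&lt;p&gt;The DAG-ID workaround mentioned above can be as simple as baking a version suffix into the DAG’s identifier, so old and new definitions can be deployed side by side (the names here are hypothetical):&lt;/p&gt;

```python
# Hypothetical convention: bump DAG_VERSION whenever the pipeline's
# structure changes, so both versions can coexist in the scheduler.
DAG_VERSION = "v2"
dag_id = f"weather_pipeline_{DAG_VERSION}"

# with DAG(dag_id=dag_id, ...):   # the DAG body itself stays unchanged
print(dag_id)
```

&lt;p&gt;The obvious downside is that run history gets split across IDs, which is exactly the pain point described here.&lt;/p&gt;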

&lt;p&gt;In contrast, Dagster solves this problem effectively with its asset-centric approach, built-in support for backfills, and asset snapshots. Each asset is versioned independently, so updating one asset doesn’t disrupt the others. &lt;/p&gt;

&lt;p&gt;As the modern data stack has grown, and is still growing, tools can’t limit themselves to merely executing, managing, and optimizing data assets. They need to fit into the entire development workflow, from local testing to production deployments, while being cloud-native to support scalability and flexibility.&lt;/p&gt;

&lt;p&gt;Flyte, by contrast, addresses DAG versioning natively by supporting versioned workflows. When you update or modify a task, Flyte lets you track and manage different workflow versions without disrupting ongoing processes, and you can test updated tasks without affecting the entire workflow, ensuring smoother iteration and flexibility.&lt;/p&gt;

&lt;h3&gt; Scaling &lt;/h3&gt;

&lt;p&gt;In data engineering, Dagster handles scaling up well thanks to its flexible architecture for large data workflows, whereas managing resources and scaling in Airflow can be challenging. When it comes to machine learning, however, Flyte stands out as the more favorable choice, thanks to its built-in support for &lt;a href="https://flyte.org/machine-learning" rel="noopener noreferrer"&gt;ML workflows&lt;/a&gt;,&lt;a href="https://docs.flyte.org/en/latest/user_guide/concepts/main_concepts/versioning.html" rel="noopener noreferrer"&gt; model versioning&lt;/a&gt;, and Kubernetes-based &lt;a href="https://flyte.org/data" rel="noopener noreferrer"&gt;scalability&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt; Modern AI Orchestration &lt;/h3&gt;

&lt;p&gt;Comparing Airflow, Dagster, and Flyte makes clear how they suit different project needs, as in our weather data project. &lt;strong&gt;Airflow&lt;/strong&gt; excels at scheduling tasks but falls short when you need AI-specific tasks or have to handle high-volume, real-time data, as in weather prediction models. &lt;strong&gt;Dagster&lt;/strong&gt; focuses heavily on the data-driven approach, which is great, but it lacks some of the dynamic scalability a complex project like weather forecasting requires. &lt;strong&gt;Flyte&lt;/strong&gt;, however, shines at AI orchestration: it handles intensive workloads, scales effectively for complex data processing, and automates the workflow, making it ideal for predicting weather patterns or managing large sets of weather-related data, all while staying efficient and reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  CONCLUSION
&lt;/h2&gt;

&lt;p&gt;When deciding between the tools, consider the scale and focus of your workflows. If you need flexibility in pipeline structure and asset management, Dagster is a strong contender. For machine learning workflows with the added benefit of seamless scaling, Flyte should be your go-to solution. Meanwhile, if you are managing straightforward, traditional data engineering tasks, Airflow’s simplicity will still make it a valuable tool. Each of these tools brings unique features and advantages, so understanding your project’s needs will guide you toward the optimal choice.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>tooling</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
