<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Datamonk</title>
    <description>The latest articles on DEV Community by Datamonk (@datamonk_).</description>
    <link>https://dev.to/datamonk_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2746984%2F1f918d97-0fc5-4474-bba9-89bb2146bf37.jpg</url>
      <title>DEV Community: Datamonk</title>
      <link>https://dev.to/datamonk_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datamonk_"/>
    <language>en</language>
    <item>
      <title>MariaDB to PostgreSQL: Lessons &amp; Challenges</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Fri, 14 Feb 2025 12:18:44 +0000</pubDate>
      <link>https://dev.to/datamonk_/mariadb-to-postgresql-lessons-challenges-42oa</link>
      <guid>https://dev.to/datamonk_/mariadb-to-postgresql-lessons-challenges-42oa</guid>
      <description>&lt;h2&gt;
  
  
  MariaDB to PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Recently, we migrated a Rails application from MariaDB to PostgreSQL. This article goes into the why and how of this migration.&lt;/p&gt;

&lt;p&gt;K1 &lt;a href="https://mariadb.com/newsroom/press-releases/k1-acquires-a-leading-database-software-company-mariadb-and-appoints-new-ceo/" rel="noopener noreferrer"&gt;acquired&lt;/a&gt; MariaDB and appointed a new CEO. This signaled a shift in the database's direction. A private equity firm was now in control of the company overseeing the open-source development of a database that our application heavily relied upon. &lt;/p&gt;

&lt;p&gt;Over the past few years, MariaDB's feature development has stagnated, largely due to its commitment to maintaining MySQL compatibility and a stronger focus on sales rather than innovation. Meanwhile, PostgreSQL has been on a relentless path of growth, continuously closing the gap with Oracle by rolling out feature after feature. &lt;/p&gt;

&lt;p&gt;PostgreSQL’s journey started in 1996, and its development is spearheaded by the PostgreSQL Global Development Group, with contributions from major players like AWS, Microsoft, NTT Data, EnterpriseDB, Crunchy Data, and PgEdge. The core contributor team is distributed across the globe, with experts from Europe, Japan, the US, China, India, and Latin America. Much like Linux, PostgreSQL is likely to remain open-source for the foreseeable future. For our client, this acquisition was the straw that broke the camel's back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why move from MariaDB to PostgreSQL?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The feature gap:&lt;/strong&gt; The feature gap between PostgreSQL and MariaDB has grown significantly over the last 5-7 years. With each release, PostgreSQL has integrated valuable features inspired by Oracle, making it an increasingly attractive choice for enterprises. As a result, it has evolved into a one-stop database solution for all kinds of application needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postgres's rich extension system:&lt;/strong&gt; If you need Postgres to manage your cron jobs, there is &lt;strong&gt;pg_cron&lt;/strong&gt;. If you want to store spatial data in the database, there is &lt;strong&gt;PostGIS&lt;/strong&gt;. For time series, there is &lt;strong&gt;TimescaleDB&lt;/strong&gt;; for vector storage, there is &lt;strong&gt;pgvector&lt;/strong&gt;. Moreover, many startups are modifying Postgres in unexpected ways to bring new kinds of databases to the market. For instance, Neon Tech has separated the storage and compute layers, allowing databases to scale to zero. This enables independent scaling of storage and compute, much like Databricks and Snowflake do for data lakes. The rapid pace of innovation in PostgreSQL presents immense potential benefits for any application.&lt;/p&gt;
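&lt;p&gt;As a small taste of the extension system, here is roughly what scheduling a job with &lt;strong&gt;pg_cron&lt;/strong&gt; looks like (the job name and schedule below are illustrative):&lt;/p&gt;

```sql
-- Requires the pg_cron extension to be installed and enabled
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Run VACUUM every night at 3am; 'nightly-vacuum' is an arbitrary job name
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM');
```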


&lt;h3&gt;Corporate interest vs community interest&lt;/h3&gt;

&lt;p&gt;MariaDB operates as a corporate entity, prioritizing business interests, while PostgreSQL thrives on its vibrant open-source community. The PostgreSQL Global Development Group’s mission is to advance the database based on the needs of its developers and users, ensuring a steady flow of meaningful enhancements.&lt;/p&gt;
&lt;h2&gt;
  
  
  High Risk
&lt;/h2&gt;

&lt;p&gt;While this migration looks very attractive, it is not without its own risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long time to finish:&lt;/strong&gt; These migrations take a long time to finish, even in simple cases. A seemingly straightforward upgrade from MySQL 5 to MySQL 8 can take up to nine months. This process includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Migrating the schema and data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updating application code&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conducting regression and load testing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Coordination between different application teams:&lt;/strong&gt; Teams need to make sure that the new database doesn't break any features and that all the tests pass. Good &lt;strong&gt;test&lt;/strong&gt; coverage is crucial to such a migration project. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fundamental change to the application:&lt;/strong&gt; Since everything sits on top of the database, a change to it impacts everything: the application, background jobs, BI, etc. &lt;/p&gt;

&lt;p&gt;Migrating requires adjustments to the schema, data, and application code. Given that application code can run into millions of lines, the risk of something breaking is significant.&lt;/p&gt;

&lt;h2&gt;
  
  
  High Rewards
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;More features:&lt;/strong&gt; With a thriving open-source community behind the project, new features are continuously being added. The ecosystem of extensions is expanding rapidly, bringing innovative capabilities to the database core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simpler application logic makes it easier to maintain:&lt;/strong&gt; A feature-rich database like PostgreSQL allows developers to offload logic to the database itself, reducing complexity at the application level. This not only improves performance but also enhances maintainability and throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  Yandex data &lt;a href="https://www.youtube.com/watch?v=-SS4R1sFH3c" rel="noopener noreferrer"&gt;migration&lt;/a&gt; case study
&lt;/h2&gt;

&lt;p&gt;Yandex Mail initially stored its metadata in an Oracle database. By 2012, the growing feature demands and high licensing costs of Oracle became unsustainable, with expenses running into millions of dollars. Additionally, Oracle's restrictive licensing prohibited publishing benchmarks comparing its database performance, which led to its absence from the &lt;a href="https://benchmark.clickhouse.com/" rel="noopener noreferrer"&gt;ClickBench benchmark&lt;/a&gt;. So, they tried migrating to a different database that could support their needs at a reasonable cost. It took them ten years to finish the migration. &lt;/p&gt;

&lt;p&gt;The first time, they tried switching to MySQL. Along the way, they also tried to fix everything that was wrong with the codebase, increasing the scope of the project and eventually setting it up for failure. The second time, they tried writing their own custom DBMS (perhaps attempting to reverse engineer Oracle). That failed as well. The third time was the charm.&lt;/p&gt;

&lt;h2&gt;
  
  
  How?
&lt;/h2&gt;

&lt;p&gt;We started with the schema migration, then moved on to data export and import, and finished with code changes to make the application compatible with the new PostgreSQL database. &lt;/p&gt;

&lt;h2&gt;
  
  
  Schema transfer
&lt;/h2&gt;

&lt;p&gt;There are many nuances one has to keep in mind while migrating from one database to another. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Datatype challenges:&lt;/strong&gt; PostgreSQL does not support unsigned integers, unlike MariaDB. It also does not allow text fields with size attributes, nor does it accept backticks (`) in queries. Once we jumped through these hoops, we were able to generate the matching schema in the PostgreSQL database. &lt;/p&gt;
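&lt;p&gt;A sketch of the kind of translation involved (the table and column names are made up for illustration):&lt;/p&gt;

```sql
-- MariaDB original: unsigned integer, sized text, backticks
-- CREATE TABLE `events` (id INT UNSIGNED PRIMARY KEY, body TEXT(500));

-- PostgreSQL equivalent: widen to BIGINT to cover the unsigned INT range,
-- drop the size attribute on text, use double quotes for identifiers
CREATE TABLE "events" (id BIGINT PRIMARY KEY, body TEXT);
```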

&lt;p&gt;&lt;strong&gt;Fulltext indexes:&lt;/strong&gt; The application relied on full-text search for keyword retrieval in entity titles and bodies. While MariaDB offers simple full-text search indexing, PostgreSQL provides richer configurations for advanced search capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary fields:&lt;/strong&gt; Binary fields are a little tricky in how they deal with null characters. MariaDB allows null characters in string fields, while PostgreSQL's text types do not; binary data containing null bytes has to live in a &lt;code&gt;bytea&lt;/code&gt; column. This is not a feature either database decided to offer or withhold: the difference arises from their storage mechanisms. MariaDB uses a length-prefixed buffer, while PostgreSQL's text types rely on C-style null-terminated strings.&lt;/p&gt;
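&lt;p&gt;A minimal illustration of the difference in PostgreSQL (the &lt;code&gt;blobs&lt;/code&gt; table is hypothetical):&lt;/p&gt;

```sql
CREATE TABLE blobs (id integer, data bytea);

-- Null bytes are fine in bytea...
INSERT INTO blobs VALUES (1, decode('00ff00', 'hex'));

-- ...but a text value cannot contain them:
-- SELECT convert_from(data, 'UTF8') FROM blobs;  -- fails on the null byte
```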

&lt;h2&gt;
  
  
  Data export and import
&lt;/h2&gt;

&lt;p&gt;We will cover three ways in which data can be exported from MariaDB to PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pgloader:&lt;/strong&gt; An automated tool for loading data into PostgreSQL from different sources, such as MySQL, SQLite, and CSV. It creates the corresponding schema in PostgreSQL, inserts the data into the relevant tables, and then rebuilds the indexes after loading the dataset. It is good for simple datasets with simple data types. &lt;/p&gt;
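&lt;p&gt;In its simplest form, pgloader takes a source and a target connection string (the credentials and database names below are placeholders):&lt;/p&gt;

```shell
# Migrate schema and data in one shot; requires pgloader to be installed
pgloader mysql://maria_user:secret@localhost/app_db \
         pgsql://pg_user:secret@localhost/app_db
```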

&lt;p&gt;&lt;strong&gt;Dumping SQL and rerunning on the database:&lt;/strong&gt; This process involves dumping raw SQL statements from the source database and editing the SQL to match the target database. It's a good option when upgrading the database to a newer version. &lt;/p&gt;
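&lt;p&gt;A rough sketch of this workflow (the database name is a placeholder; the exact edits needed depend on your schema):&lt;/p&gt;

```shell
# Dump the MariaDB schema and data as plain SQL
mariadb-dump --skip-extended-insert app_db > dump.sql

# Hand-edit dump.sql: backticks, unsigned types, ENGINE clauses, etc.

# Replay the edited dump against PostgreSQL
psql -d app_db -f dump.sql
```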

&lt;p&gt;&lt;strong&gt;CSV export and import:&lt;/strong&gt; This option is good for data that doesn't contain binary fields, as those cause various issues while dumping and parsing CSV. Since CSV is a widely supported format for importing and exporting, the rest transfers cleanly.&lt;/p&gt;
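&lt;p&gt;A sketch of this round trip for a single table (paths and table names are illustrative; PostgreSQL's &lt;code&gt;COPY&lt;/code&gt; needs server-side file access, otherwise use &lt;code&gt;\copy&lt;/code&gt; in psql):&lt;/p&gt;

```sql
-- MariaDB: export a table to CSV
SELECT * FROM customers
INTO OUTFILE '/tmp/customers.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';

-- PostgreSQL: import the same file
COPY customers FROM '/tmp/customers.csv' WITH (FORMAT csv);
```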

&lt;h2&gt;
  
  
  Code changes
&lt;/h2&gt;

&lt;p&gt;Apart from simple changes in the codebase to replace quotes and remove certain keywords, some other changes were needed in the SQL that ran directly against the database, as opposed to going through an ORM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw SQL in code:&lt;/strong&gt; The codebase used the &lt;code&gt;interval&lt;/code&gt; keyword; PostgreSQL expects the interval value as a quoted string, unlike MariaDB. Easy change.&lt;/p&gt;
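&lt;p&gt;For example (the &lt;code&gt;orders&lt;/code&gt; table is illustrative):&lt;/p&gt;

```sql
-- MariaDB
SELECT * FROM orders WHERE created_at > NOW() - INTERVAL 7 DAY;

-- PostgreSQL: the interval value goes in quotes
SELECT * FROM orders WHERE created_at > now() - interval '7 days';
```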

&lt;p&gt;&lt;strong&gt;Search:&lt;/strong&gt; Search used the &lt;code&gt;MATCH ... AGAINST&lt;/code&gt; clause, which had to be replaced with a &lt;code&gt;to_tsvector(...) @@ to_tsquery(...)&lt;/code&gt; clause after creating the relevant index on the appropriate columns.&lt;/p&gt;
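&lt;p&gt;A sketch of such a rewrite (the table and column names are illustrative; real setups often precompute the tsvector in a dedicated indexed column):&lt;/p&gt;

```sql
-- MariaDB full-text search
SELECT id FROM articles WHERE MATCH(title, body) AGAINST ('database migration');

-- PostgreSQL: create a GIN index on the tsvector expression...
CREATE INDEX articles_fts ON articles
USING gin (to_tsvector('english', title || ' ' || body));

-- ...then query with the same expression
SELECT id FROM articles
WHERE to_tsvector('english', title || ' ' || body)
      @@ plainto_tsquery('english', 'database migration');
```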

&lt;p&gt;&lt;strong&gt;Binary fields:&lt;/strong&gt; The syntax is a little different in Postgres compared to MariaDB, but that was easily managed. &lt;/p&gt;

&lt;h2&gt;
  
  
  Learning from this project
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Good test coverage allows the team to make code changes with confidence and iterate faster. This is probably the single most important thing. If the codebase lacks test coverage for some modules, the team may want to start by writing tests. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Limit the scope of the migration. It may be tempting to fix everything wrong with the codebase that you see along the way, but limiting the scope to just the migration will help the team stay focused and deliver results. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run small experiments independently on a toy dataset. We did this with the full-text search module. These experiments allowed us to find the right search solution before migrating to the new database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run load testing and benchmarks. The old database may be tuned for performance with a lot of custom indexes, so take your time to build indexes in the destination database and tune them for your workload.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Whether it is Yandex Mail moving off Oracle or a Rails application moving off MariaDB, migrating to PostgreSQL is a long and complex process, but it can unlock greater scalability, flexibility, and cost savings. Both experiences reinforce the importance of thorough planning, controlled scope, and rigorous testing in large-scale database migrations.&lt;/p&gt;

</description>
      <category>mariadb</category>
      <category>postgres</category>
      <category>database</category>
      <category>sql</category>
    </item>
    <item>
      <title>Tenant Based Filtering: Apache Superset</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Fri, 07 Feb 2025 19:01:14 +0000</pubDate>
      <link>https://dev.to/datamonk_/tenant-based-filtering-apache-superset-5db7</link>
      <guid>https://dev.to/datamonk_/tenant-based-filtering-apache-superset-5db7</guid>
      <description>&lt;p&gt;In this post, we will explore how we successfully implemented Row-Level Security &lt;a href="https://superset.apache.org/docs/security/" rel="noopener noreferrer"&gt;(RLS)&lt;/a&gt; in Apache Superset to create a multi-tenant dashboard that dynamically filters data based on the logged-in user’s company.&lt;/p&gt;

&lt;p&gt;The dvdrental database, based on the &lt;a href="https://dev.mysql.com/doc/sakila/en/sakila-introduction.html" rel="noopener noreferrer"&gt;Sakila&lt;/a&gt; dataset, is a well-known sample database used for learning SQL and database management. It represents a DVD rental store, with tables for customers, rentals, payments, and films. To transform this database into a &lt;strong&gt;multi-tenant system&lt;/strong&gt;, we needed multiple companies to use the same database while ensuring each could only access its own data.&lt;/p&gt;

&lt;p&gt;To achieve this, a &lt;strong&gt;company_id&lt;/strong&gt; column was added to key tables such as &lt;strong&gt;rental&lt;/strong&gt;, &lt;strong&gt;payment&lt;/strong&gt;, and &lt;strong&gt;customer&lt;/strong&gt;. This allowed the system to associate customers and transactions with specific companies. The main objective was to create a single dashboard in &lt;strong&gt;Apache Superset&lt;/strong&gt; that could be used by different users while dynamically filtering data based on their assigned company.&lt;/p&gt;

&lt;p&gt;The key requirements were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Admins should have access to all data across all companies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Company Users should only see data relevant to their specific company.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Row-Level Security (RLS)&lt;/strong&gt; should be implemented to automatically filter data based on the logged-in user's company_id, retrieved from their email.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With RLS in place, data was filtered at the query level, ensuring that a single dashboard could be used securely by multiple companies. Admins had a complete view of all data, while regular users could only see company-specific information. The company_id was linked to user emails, allowing for seamless access control without requiring manual input from users.&lt;/p&gt;

&lt;p&gt;This approach provided a &lt;strong&gt;secure&lt;/strong&gt;, &lt;strong&gt;scalable&lt;/strong&gt;, and &lt;strong&gt;efficient multi-tenant dashboard&lt;/strong&gt;, enabling different companies to operate within the same system without exposing their data to others.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Roles For Users
&lt;/h2&gt;

&lt;p&gt;The configuration process begins in the &lt;strong&gt;Superset User Interface (UI)&lt;/strong&gt;, where roles and users are created and assigned specific permissions.&lt;/p&gt;

&lt;p&gt;In the Security section of Superset, roles can be defined with specific access levels. Each role determines what a user can do within the system, such as accessing datasets, viewing dashboards, and reading charts. By assigning permissions at the role level, different users can have varying levels of access. For example, an Admin role is configured to have full access to all data, while Company User roles are restricted to seeing only data associated with their company.&lt;/p&gt;

&lt;p&gt;Once roles are created, users are added in the Superset UI and assigned these roles. When a user logs in, the system automatically applies the RLS filter based on their company_id. Superset achieves this by dynamically injecting a &lt;strong&gt;WHERE clause&lt;/strong&gt; into queries executed by the user.&lt;/p&gt;
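&lt;p&gt;As a rough sketch, an RLS filter in Superset is just a SQL predicate attached to a role, something along these lines (the &lt;code&gt;user_companies&lt;/code&gt; lookup table is an assumption for illustration, while &lt;code&gt;current_username()&lt;/code&gt; is one of Superset's Jinja macros):&lt;/p&gt;

```sql
-- Clause Superset appends to every query for the restricted role
company_id = (
  SELECT company_id FROM user_companies
  WHERE email = '{{ current_username() }}'
)
```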

&lt;p&gt;This &lt;strong&gt;dynamic filtering&lt;/strong&gt; ensures that users have access only to the data they are authorized to see, making Superset an efficient tool for multi-tenant dashboards while maintaining strict data security and access control.&lt;/p&gt;

&lt;p&gt;After applying Row-Level Security (RLS) to each user and assigning them access to their specific rows, the admin retains full access to all the data across companies. For example, if the admin sees 1,000 records in a table, one user might have access to only 200 rows, while another might see 250, depending on their company's data. &lt;/p&gt;

&lt;h2&gt;
  
  
  Dashboards
&lt;/h2&gt;

&lt;p&gt;This is clearly reflected in the dashboards. The admin dashboard displays the complete dataset, including the total sales from all companies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrexcart3mzd2h0r1h1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrexcart3mzd2h0r1h1v.png" alt="Admin" width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For instance, in the pie chart, you can see that the total sales for DVD rentals amount to 1.95 million, representing data from all over the country. &lt;/p&gt;

&lt;p&gt;USER 1&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qs8v2lo2hl1hf3w9qq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qs8v2lo2hl1hf3w9qq0.png" alt="User1" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;USER 2&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl0x67kid43zsij0g93l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl0x67kid43zsij0g93l.png" alt="User2" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, when viewing the dashboards of individual users, the total sales figure is noticeably reduced. This happens because their dashboards only reflect company-specific data.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Conclusion
&lt;/h2&gt;

&lt;p&gt;By implementing RLS, we achieve effective tenant-based filtering, ensuring that data is securely and efficiently segmented for different users. This not only enhances security but also improves data management, making it a powerful tool for multi-tenant applications.&lt;/p&gt;

</description>
      <category>superset</category>
      <category>apache</category>
      <category>dataengineering</category>
      <category>analyst</category>
    </item>
    <item>
      <title>A Tale of Two Databases</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Fri, 07 Feb 2025 05:59:25 +0000</pubDate>
      <link>https://dev.to/datamonk_/a-tale-of-two-databases-hf3</link>
      <guid>https://dev.to/datamonk_/a-tale-of-two-databases-hf3</guid>
      <description>&lt;h2&gt;
  
  
  SQLite vs PostgreSQL
&lt;/h2&gt;

&lt;p&gt;SQLite is one of the &lt;a href="https://www.sqlite.org/mostdeployed.html" rel="noopener noreferrer"&gt;most deployed&lt;/a&gt; databases in the world. It's small (under 1 MB), fast, and reliable (supported until 2050). PostgreSQL is the world's most advanced open-source object-relational database, developed in the open for over 35 years by its community. Out of 177 mandatory ANSI SQL features, it complies with &lt;a href="https://www.postgresql.org/docs/current/features.html" rel="noopener noreferrer"&gt;170&lt;/a&gt;, the most of any database. While SQLite offers 5 datatypes (NULL, integer, real, text, and blob), Postgres boasts &lt;a href="https://www.postgresql.org/docs/current/datatype.html" rel="noopener noreferrer"&gt;43 datatypes&lt;/a&gt;, including XML, JSON, and IP addresses, and more (time series, geometry, raster) can be added through extensions. It is clear that while SQLite is a fast rabbit, PostgreSQL is the elephant in the room. &lt;/p&gt;

&lt;p&gt;Rather than comparing the documentation of these two wonderful databases, we will look at some of the hoops we need to jump through in order to support SQLite for a web application.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Flexible typing
&lt;/h3&gt;

&lt;p&gt;This &lt;a href="https://www.twilio.com/en-us/blog/sqlite-postgresql-complicated" rel="noopener noreferrer"&gt;article&lt;/a&gt; talks about how SQLite's flexible typing caught the authors off guard: they ended up storing slugs longer than 256 bytes in a 256-byte string column and UUIDs in an integer column. They only found out once they migrated their database from SQLite to Postgres. This could have caused bugs that would be difficult to reproduce and track down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wc&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hello world'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hello world 2'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'coca cola'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above commands run totally fine in SQLite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;sqlite&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;┌────┬───────────────┬───────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;     &lt;span class="n"&gt;title&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="n"&gt;wc&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────┼───────────────┼───────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;hello&lt;/span&gt; &lt;span class="n"&gt;world&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;hello&lt;/span&gt; &lt;span class="n"&gt;world&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;hello&lt;/span&gt; &lt;span class="n"&gt;world&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;coca&lt;/span&gt; &lt;span class="n"&gt;cola&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;hello&lt;/span&gt; &lt;span class="n"&gt;world&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;        &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└────┴───────────────┴───────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this may not cause any problems as is, because aggregate functions (max, sum) simply ignore values that cannot be cast, migrating this data to a strictly typed database may be a problem.&lt;/p&gt;

&lt;p&gt;However, this can all be avoided by just creating &lt;a href="https://www.sqlite.org/stricttables.html" rel="noopener noreferrer"&gt;strict tables&lt;/a&gt;. &lt;br&gt;
The second insert below will throw an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;articles_strict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wc&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;STRICT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;articles_strict&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hello world'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;articles_strict&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'hello world 2'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Single thread writer
&lt;/h2&gt;

&lt;p&gt;SQLite allows only one writer at a time. In its default rollback-journal mode, it uses a locking mechanism where a write operation locks the entire database and other read/write operations have to wait in a queue. Similarly, a write operation has to wait for in-flight reads to finish. &lt;/p&gt;

&lt;p&gt;However, much of this can be avoided with WAL (write-ahead logging) mode, where readers do not block the writer and the writer does not block readers. A writer will still &lt;a href="https://www.sqlite.org/whentouse.html#:~:text=If%20many%20threads%20and%2For,a%20time%20per%20database%20file." rel="noopener noreferrer"&gt;block&lt;/a&gt; other writers, which have to wait in a queue.&lt;/p&gt;
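&lt;p&gt;Switching a database file to WAL mode is a one-liner:&lt;/p&gt;

```sql
-- Persistent setting, stored in the database file itself
PRAGMA journal_mode=WAL;
```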

&lt;h2&gt;
  
  
  3. Recursive depth-first search
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Depth-first recursive queries: SQLite &lt;a href="https://modern-sql.com/caniuse/search_(recursion)" rel="noopener noreferrer"&gt;does NOT&lt;/a&gt; support the standard search clause in its recursive queries. However, a depth-first or breadth-first traversal can be implemented in SQLite using an &lt;code&gt;order by&lt;/code&gt; clause.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;boss&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tenure&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITHOUT&lt;/span&gt; &lt;span class="n"&gt;ROWID&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Cindy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Dave'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Emma'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Fred'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Cindy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Gail'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'Cindy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When querying, we can traverse the tree in depth-first order while also sorting siblings by tenure. Amazing. This is exactly what is needed to solve &lt;a href="https://github.com/lobsters/lobsters/blob/master/app/models/comment.rb#L695-L728" rel="noopener noreferrer"&gt;comment ordering&lt;/a&gt; in Lobsters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt;
  &lt;span class="n"&gt;under_alice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenure&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;under_alice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenure&lt;/span&gt;
      &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;under_alice&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;under_alice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
     &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;substr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'..........'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'  '&lt;/span&gt;&lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;tenure&lt;/span&gt;  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;under_alice&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│ substr('..........',1,level*3) || name || '  '|| tenure │
├─────────────────────────────────────────────────────────┤
│ Alice  10                                               │
│ ...Cindy  10                                            │
│ ......Gail  10                                          │
│ ......Fred  1                                           │
│ ...Bob  1                                               │
│ ......Emma  10                                          │
│ ......Dave  1                                           │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
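&lt;p&gt;As a sanity check, the same traversal can be reproduced from Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. This is a minimal sketch using the &lt;code&gt;org&lt;/code&gt; table above:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE org(
  name TEXT PRIMARY KEY,
  boss TEXT REFERENCES org,
  tenure INTEGER
) WITHOUT ROWID;
INSERT INTO org VALUES
  ('Alice', NULL, 10), ('Bob', 'Alice', 1), ('Cindy', 'Alice', 10),
  ('Dave', 'Bob', 1), ('Emma', 'Bob', 10), ('Fred', 'Cindy', 1),
  ('Gail', 'Cindy', 10);
""")

# ORDER BY level DESC makes the recursion depth-first; tenure DESC
# orders siblings by tenure within each subtree.
rows = conn.execute("""
WITH RECURSIVE under_alice(name, level, tenure) AS (
  VALUES('Alice', 0, 10)
  UNION ALL
  SELECT org.name, under_alice.level + 1, org.tenure
    FROM org JOIN under_alice ON org.boss = under_alice.name
   ORDER BY 2 DESC, 3 DESC
)
SELECT name FROM under_alice
""").fetchall()

names = [r[0] for r in rows]
print(names)  # ['Alice', 'Cindy', 'Gail', 'Fred', 'Bob', 'Emma', 'Dave']
```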



&lt;h2&gt;
  
  
  4. Materialized views
&lt;/h2&gt;

&lt;p&gt;There are &lt;a href="https://news.ycombinator.com/item?id=40074323" rel="noopener noreferrer"&gt;no materialized views&lt;/a&gt; in SQLite. The closest workaround suggested in the thread above is to &lt;code&gt;CREATE TABLE AS&lt;/code&gt; and run a job that refreshes the table every hour or day.&lt;/p&gt;
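&lt;p&gt;A minimal sketch of that workaround, using a toy &lt;code&gt;sales&lt;/code&gt; table (the table names here are illustrative): the "materialized view" is just a snapshot table that a scheduled job rebuilds.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales(region TEXT, amount INTEGER);
INSERT INTO sales VALUES('north', 100), ('north', 50), ('south', 70);
-- Emulate a materialized view: a plain table built from a query.
CREATE TABLE sales_by_region AS
  SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")

def refresh_sales_by_region(conn):
    # The "refresh" a cron job would run: drop and rebuild the snapshot.
    conn.executescript("""
    DROP TABLE sales_by_region;
    CREATE TABLE sales_by_region AS
      SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    """)

# The snapshot is stale until the next refresh runs.
conn.execute("INSERT INTO sales VALUES('south', 30)")
refresh_sales_by_region(conn)
rows = dict(conn.execute("SELECT region, total FROM sales_by_region"))
print(rows)  # {'north': 150, 'south': 100}
```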

&lt;h2&gt;
  
  
  5. Full text search
&lt;/h2&gt;

&lt;p&gt;For this experiment, we will use the &lt;a href="https://dev.mysql.com/doc/sakila/en/sakila-introduction.html" rel="noopener noreferrer"&gt;Sakila&lt;/a&gt; database to showcase how to run a full-text search query. The simplest way to run a search query is to use the &lt;code&gt;like&lt;/code&gt; operator, but this triggers a full table scan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;sqlite&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;schema&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt; 
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;film_id&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="nb"&gt;BLOB&lt;/span&gt; &lt;span class="n"&gt;SUB_TYPE&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;sqlite&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;explain&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="s1"&gt;'%shark%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;QUERY&lt;/span&gt; &lt;span class="n"&gt;PLAN&lt;/span&gt;
&lt;span class="nv"&gt;`--SCAN film
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fortunately, SQLite provides &lt;a href="https://www.sqlite.org/fts5.html" rel="noopener noreferrer"&gt;FTS5&lt;/a&gt;. If the query below returns 1, FTS5 is enabled in the SQLite installation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sqlite_version&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;sqlite_compileoption_used&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'ENABLE_FTS5'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the virtual FTS table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;VIRTUAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;fts5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'film'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_rowid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'film_id'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'rebuild'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, query the table using the &lt;code&gt;MATCH&lt;/code&gt; operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;sqlite&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="s1"&gt;'suit'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;┌────────────┬──────────────────────────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="n"&gt;title&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt;                         &lt;span class="n"&gt;description&lt;/span&gt;                          &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────────────┼──────────────────────────────────────────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;CORE&lt;/span&gt; &lt;span class="n"&gt;SUIT&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;Unbelieveable&lt;/span&gt; &lt;span class="n"&gt;Tale&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Car&lt;/span&gt; &lt;span class="k"&gt;And&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Explorer&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt; &lt;span class="n"&gt;must&lt;/span&gt; &lt;span class="n"&gt;Confro&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;nt&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Boat&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;Manhattan&lt;/span&gt; &lt;span class="n"&gt;Penthouse&lt;/span&gt;                           &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────────────┼──────────────────────────────────────────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;SPEED&lt;/span&gt; &lt;span class="n"&gt;SUIT&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;Brilliant&lt;/span&gt; &lt;span class="n"&gt;Display&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Frisbee&lt;/span&gt; &lt;span class="k"&gt;And&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Mad&lt;/span&gt; &lt;span class="n"&gt;Scientist&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt; &lt;span class="n"&gt;mus&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;Succumb&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Robot&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Ancient&lt;/span&gt; &lt;span class="n"&gt;China&lt;/span&gt;                           &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;├────────────┼──────────────────────────────────────────────────────────────┤&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;SUIT&lt;/span&gt; &lt;span class="n"&gt;WALLS&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;Touching&lt;/span&gt; &lt;span class="n"&gt;Panorama&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Lumberjack&lt;/span&gt; &lt;span class="k"&gt;And&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Frisbee&lt;/span&gt; &lt;span class="n"&gt;who&lt;/span&gt; &lt;span class="n"&gt;must&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;uild&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;Dog&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Australia&lt;/span&gt;                                      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└────────────┴──────────────────────────────────────────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, if we insert a new record into &lt;code&gt;film&lt;/code&gt;, it won't automatically appear in &lt;code&gt;film_fts&lt;/code&gt;. To make the new row searchable, we would need to run the rebuild again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;release_year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_language_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rental_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rental_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replacement_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;special_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_update&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'CORE SUIT 2'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'A Unbelievable Tale of a Car And a Explorer who must Confront a Boat in A Manhattan Penthouse'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2006&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;86&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'PG'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Deleted Scenes,Behind the Scenes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2006-02-15 05:03:42'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, running the rebuild over and over is expensive; a better solution is to keep the index in sync with triggers.&lt;/p&gt;

&lt;p&gt;Trigger for insert&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;film_insert&lt;/span&gt; &lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trigger for update&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;film_update&lt;/span&gt; &lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'delete'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trigger for delete&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TRIGGER&lt;/span&gt; &lt;span class="n"&gt;film_delete&lt;/span&gt; &lt;span class="k"&gt;AFTER&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;film&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;film_fts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'delete'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;film_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;old&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
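&lt;p&gt;Putting it together, here is a condensed sketch of the trigger approach from Python's &lt;code&gt;sqlite3&lt;/code&gt; module, with a simplified &lt;code&gt;film&lt;/code&gt; schema and only the insert and delete triggers:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film(film_id INTEGER PRIMARY KEY, title TEXT, description TEXT);

-- External-content FTS5 table backed by film.
CREATE VIRTUAL TABLE film_fts USING fts5(
  title, description, content='film', content_rowid='film_id');

CREATE TRIGGER film_insert AFTER INSERT ON film BEGIN
  INSERT INTO film_fts(rowid, title, description)
  VALUES (new.film_id, new.title, new.description);
END;

CREATE TRIGGER film_delete AFTER DELETE ON film BEGIN
  INSERT INTO film_fts(film_fts, rowid, title, description)
  VALUES ('delete', old.film_id, old.title, old.description);
END;
""")

# New rows become searchable immediately, with no manual rebuild.
conn.execute("INSERT INTO film VALUES (1, 'CORE SUIT 2', 'A tale of a shark')")
hits = conn.execute(
    "SELECT title FROM film_fts WHERE film_fts MATCH 'shark'").fetchall()
print(hits)  # [('CORE SUIT 2',)]
```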



&lt;h2&gt;
  
  
  In Conclusion
&lt;/h2&gt;

&lt;p&gt;SQLite and PostgreSQL both serve their purposes, and you can technically use either for almost anything. However, the real question is: why force one to handle something beyond its capabilities when there’s an option that excels at it? Yes, you can configure and tweak SQLite to make it work, but why shoehorn it when you can let PostgreSQL handle your load efficiently?&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>sqlite</category>
      <category>database</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How DeepSeek is Making High-Performance AI Accessible to All</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Fri, 31 Jan 2025 16:33:03 +0000</pubDate>
      <link>https://dev.to/datamonk_/how-deepseek-is-making-high-performance-ai-accessible-to-all-26fp</link>
      <guid>https://dev.to/datamonk_/how-deepseek-is-making-high-performance-ai-accessible-to-all-26fp</guid>
      <description>&lt;p&gt;AI research is evolving fast, but training massive models is still a tough challenge because of the huge computing power needed. That’s where DeepSeek is changing the game. They’ve found a way to build top-tier AI models without burning through an enormous number of GPUs. By using a smart mix of cost-effective training strategies, Nvidia’s PTX assembly, and reinforcement learning, they’ve created cutting-edge models like DeepSeek-R1-Zero and DeepSeek-R1—proving that innovation doesn’t always have to come with an extreme price tag.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing GPU Usage: Cost-Effective Training at Scale
&lt;/h2&gt;

&lt;p&gt;DeepSeek has set a new benchmark in efficient AI model training. For instance, DeepSeek trained its DeepSeek-V3 Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster of 2,048 Nvidia H800 GPUs in just two months—totaling 2.8 million GPU hours, according to its research paper.&lt;/p&gt;

&lt;p&gt;In comparison, OpenAI’s GPT-4, one of the most advanced language models, is estimated to have been trained using tens of thousands of Nvidia A100/H100 GPUs over several months, with a significantly higher compute cost. Similarly, Meta’s Llama 3, which has 405 billion parameters, required 30.8 million GPU hours—11 times more compute power than DeepSeek-V3—using 16,384 H100 GPUs over 54 days.&lt;/p&gt;

&lt;p&gt;DeepSeek’s approach stands out because of its cost-effective training strategies and efficient utilization of Nvidia’s PTX assembly and reinforcement learning techniques. By optimizing GPU usage and computational efficiency, it is proving that cutting-edge AI models don’t have to come with an astronomical price tag. This shift challenges the traditional belief that only massive GPU clusters can produce state-of-the-art models, making high-performance AI development more accessible and sustainable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost of DeepSeek-R1: Beyond the $6M Hype
&lt;/h2&gt;

&lt;p&gt;There's been a lot of hype around DeepSeek's claim that they trained their latest model for just $6 million. But let’s be real—this number only covers the GPU rental costs for the final pre-training run. The actual investment behind DeepSeek R1 is much, much bigger.&lt;/p&gt;

&lt;p&gt;Born from High-Flyer, a Chinese hedge fund that embraced AI early, DeepSeek made a bold move in 2021—acquiring 10,000 A100 GPUs before export restrictions tightened. This early bet secured a massive computational advantage. By 2023, they spun off as an independent AI lab, self-funded and ahead of the curve.&lt;/p&gt;

&lt;p&gt;Today, DeepSeek operates with around 50,000 Hopper GPUs, including H800s, H100s, and incoming H20s, shared with High-Flyer’s trading operations. Their actual investment? Likely over $500 million, with total infrastructure costs nearing $1.3 billion.&lt;/p&gt;

&lt;p&gt;Beyond hardware, DeepSeek’s strength lies in its elite 150-person team, handpicked for skill over credentials. Top engineers earn over $1.3 million annually—outpacing salaries at Chinese tech giants. Free from corporate bureaucracy, they optimize everything in-house, running their own data centers to push AI research further.&lt;/p&gt;

&lt;p&gt;DeepSeek isn’t just cost-efficient—it’s strategically built for dominance. The "$6M" figure is a footnote in a much bigger story of foresight, risk-taking, and deep R&amp;amp;D.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unlocking Maximum Efficiency: How DeepSeek Used PTX to Push GPU Limits
&lt;/h2&gt;

&lt;p&gt;While most AI companies stick to Nvidia’s CUDA framework to train large models, DeepSeek took a bold and unconventional approach—leveraging PTX (Parallel Thread Execution) assembly to unlock previously untapped efficiency in GPU operations. This decision played a crucial role in the success of DeepSeek-R1-Zero and DeepSeek-R1, allowing them to be trained with fewer GPUs while maintaining high performance.&lt;/p&gt;

&lt;p&gt;But what exactly is PTX, and why does it matter?&lt;/p&gt;

&lt;h3&gt;PTX: The Assembly Language of Nvidia GPUs&lt;/h3&gt;

&lt;p&gt;CUDA is often the go-to framework for AI development because it provides an easy-to-use interface for GPU programming. However, CUDA is essentially a high-level abstraction—it translates code into PTX, which is Nvidia’s intermediate assembly language for GPU execution.&lt;/p&gt;

&lt;p&gt;Think of it like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CUDA is like writing in Python—easy to use, but not the fastest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PTX is like writing in assembly language—harder to master, but gives full control over hardware performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DeepSeek went beyond CUDA, rewriting key computational operations directly in PTX. This allowed them to optimize memory access, reduce instruction overhead, and execute GPU instructions with greater precision, pushing performance to its limits.&lt;/p&gt;
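&lt;p&gt;To make the idea concrete, here is a hypothetical sketch of what dropping from CUDA C++ down to inline PTX looks like. The kernel and the instruction choice are illustrative only, not DeepSeek's actual code:&lt;/p&gt;

```cuda
// Illustrative only: a CUDA kernel that issues one PTX instruction
// directly via inline assembly instead of relying on nvcc's codegen.
__global__ void add_vec(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        // "add.f32" is the raw PTX add; hand-tuned kernels schedule
        // loads, stores, and fused multiply-adds the same way.
        asm volatile("add.f32 %0, %1, %2;" : "=f"(r) : "f"(a[i]), "f"(b[i]));
        out[i] = r;
    }
}
```

&lt;p&gt;A single instruction gains nothing by itself; the payoff comes from hand-scheduling whole kernels this way, which is exactly the control CUDA's compiler abstracts away.&lt;/p&gt;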

&lt;h3&gt; Why PTX Gave DeepSeek-R1-Zero and R1 an Edge &lt;/h3&gt;

&lt;p&gt;By tapping into PTX, DeepSeek unlocked three major advantages that traditional CUDA-based models often miss:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ultra-Fine Hardware Control&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CUDA automatically optimizes code, but it doesn’t always make the best choices for efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With PTX, DeepSeek manually fine-tuned GPU instructions, ensuring every computational cycle was used efficiently.     &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Optimized Memory Utilization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;One of the biggest bottlenecks in training AI models is memory overhead. CUDA’s default memory allocation can be inefficient, leading to wasted GPU memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DeepSeek restructured tensor operations at the PTX level, reducing memory bottlenecks and increasing throughput.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Better Instruction Scheduling &amp;amp; Parallel Execution&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GPUs are designed to process thousands of operations in parallel, but CUDA’s compiler doesn’t always schedule instructions optimally.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DeepSeek rewrote key computational kernels in PTX, achieving faster execution times and fewer processing stalls.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;PTX in Action: The DeepSeek Difference&lt;/h3&gt;

&lt;p&gt;Most AI companies throw more GPUs at the problem to speed up training. DeepSeek, on the other hand, focused on efficiency. By bypassing CUDA in key areas and directly optimizing PTX execution, they maximized GPU utilization without increasing hardware costs.&lt;/p&gt;

&lt;p&gt;This shift in approach redefines what’s possible in AI training. Instead of relying on brute-force compute power, DeepSeek proved that smart software optimizations can be just as impactful as expensive hardware upgrades.&lt;/p&gt;

&lt;p&gt;By mastering PTX, DeepSeek is not just developing AI models—it’s reshaping how AI is built, proving that next-generation models can be trained smarter, not harder.&lt;/p&gt;

&lt;h2&gt;
  
  
  DeepSeek-R1-Zero: Reinforcement Learning Without Supervised Fine-Tuning
&lt;/h2&gt;

&lt;h3&gt;A Bold Step in AI Development&lt;/h3&gt;

&lt;p&gt;DeepSeek-R1-Zero is an AI model trained purely through Reinforcement Learning (RL), without any Supervised Fine-Tuning (SFT). Traditionally, AI models undergo supervised fine-tuning before reinforcement learning to improve their reasoning skills. However, DeepSeek-R1-Zero skipped this step entirely, proving that a model can develop strong reasoning abilities autonomously through trial and feedback. This unconventional method challenges the long-held belief that supervised fine-tuning is essential for high-performance AI.&lt;/p&gt;

&lt;h3&gt;Exceptional Reasoning Without Supervised Training&lt;/h3&gt;

&lt;p&gt;One of the most remarkable aspects of DeepSeek-R1-Zero is its ability to generalize and solve complex problems using RL alone. Unlike traditional models that rely on large amounts of labeled data, this model learned organically by interacting with its environment. Additionally, its performance could be further improved using majority voting, a technique that refines responses by selecting the most common answer across multiple attempts. On the AIME benchmark, DeepSeek-R1-Zero’s accuracy increased from 71.0% to 86.7% when majority voting was applied, even surpassing OpenAI-o1-0912. This achievement highlights the true potential of reinforcement learning in building highly capable AI systems.&lt;/p&gt;
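
&lt;p&gt;Majority voting (often called self-consistency) is simple to implement: sample several answers to the same question and return the most frequent one. A minimal sketch in Python, with hard-coded strings standing in for sampled model outputs:&lt;/p&gt;

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among several sampled attempts."""
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

# Stand-ins for five sampled responses to one AIME-style question
samples = ["42", "42", "17", "42", "17"]
print(majority_vote(samples))  # prints 42, the consensus answer
```

&lt;p&gt;In practice the gain comes from sampling with a nonzero temperature, so the attempts actually differ and errors are outvoted by the consistent answer.&lt;/p&gt;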

&lt;h3&gt;The Self-Evolution Process&lt;/h3&gt;

&lt;p&gt;A key aspect of DeepSeek-R1-Zero’s development was its ability to self-evolve without human intervention. Since the model was trained purely with RL, researchers could closely observe how it progressed and refined its reasoning over time. Instead of improving based on human-provided examples, DeepSeek-R1-Zero learned from reinforcement feedback alone. By increasing test-time computation—giving itself more time to process and generate reasoning tokens—the model naturally improved its problem-solving strategies. This process demonstrated that AI can teach itself to think more deeply without requiring external adjustments.&lt;/p&gt;

&lt;h3&gt;Emergent Behaviors: Reflection and Alternative Problem-Solving&lt;/h3&gt;

&lt;p&gt;One of the most fascinating discoveries during DeepSeek-R1-Zero’s training was the spontaneous emergence of advanced reasoning behaviors. The model began to reflect on its own responses, revisiting and improving previous answers. It also started exploring multiple ways to solve a problem, rather than sticking to a single fixed approach. These behaviors weren’t explicitly programmed but emerged organically as a result of the RL training process. This milestone suggests that reinforcement learning can lead AI to develop structured thinking on its own, a significant step toward more autonomous and intelligent models.&lt;/p&gt;

&lt;h3&gt;The “Aha Moment” – A Breakthrough in AI Reasoning&lt;/h3&gt;

&lt;p&gt;One of the most intriguing moments in DeepSeek-R1-Zero’s evolution was the so-called "aha moment". At a certain stage in its development, the model realized that allocating more time to difficult problems led to better solutions. Instead of rushing to generate responses, it started pausing, reconsidering, and refining its reasoning process. This shift wasn’t directly taught to the model—it emerged naturally as a result of reinforcement learning optimizing for better problem-solving strategies. For researchers, witnessing this shift was just as exciting as it was for the model itself. It highlighted the power of reinforcement learning to drive independent intelligence and showed how AI can develop strategies beyond what was explicitly programmed.&lt;/p&gt;

&lt;h3&gt;Challenges: Readability and Language Mixing&lt;/h3&gt;

&lt;p&gt;Despite its impressive reasoning capabilities, DeepSeek-R1-Zero was not without flaws. One major issue was readability, as the model’s reasoning process was often difficult to follow. Additionally, it sometimes suffered from language mixing, blending multiple languages in its responses, which reduced clarity. These challenges showed that while RL alone can drive strong reasoning development, a balance between autonomous learning and structured human guidance is still necessary for a more practical AI system.&lt;/p&gt;

&lt;h3&gt;Refining the Model: The Introduction of DeepSeek-R1&lt;/h3&gt;

&lt;p&gt;To address these shortcomings, DeepSeek introduced DeepSeek-R1, a refined version that combines RL with a human-friendly “cold-start” dataset. This hybrid approach maintains the strong reasoning capabilities of DeepSeek-R1-Zero while improving readability and response structure. By integrating some level of human supervision, DeepSeek-R1 ensures that its reasoning remains strong while also making its output more coherent and accessible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Upgrading to DeepSeek-R1: Reinforcement Learning with SFT and Cold Start Data
&lt;/h2&gt;

&lt;p&gt;DeepSeek took a significant leap forward from DeepSeek-R1-Zero by integrating Supervised Fine-Tuning (SFT) and Cold Start Data into its training pipeline, leading to the development of DeepSeek-R1. This upgrade enhanced the model’s reasoning capabilities and alignment with human preferences, setting a new standard for open-source AI models.&lt;/p&gt;

&lt;h3&gt;The Two-Stage RL and SFT Pipeline&lt;/h3&gt;

&lt;p&gt;DeepSeek-R1 improved upon its predecessor by incorporating a refined two-stage reinforcement learning (RL) and SFT process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two RL Stages&lt;/strong&gt;: Focused on refining reasoning abilities while ensuring outputs align with human expectations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two SFT Stages&lt;/strong&gt;: Built a strong foundation for both reasoning and non-reasoning tasks to improve overall model performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Addressing the Cold Start Problem&lt;/h3&gt;

&lt;p&gt;One of the major challenges in AI training is the cold start problem, where models struggle in early training phases due to a lack of initial guidance. DeepSeek tackled this by carefully curating high-quality, diverse datasets for the first SFT stage. This ensured the model acquired solid foundational knowledge before reinforcement learning took over.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Readability Improvements&lt;/strong&gt;: Unlike DeepSeek-R1-Zero, which sometimes generated unreadable or mixed-language responses, DeepSeek-R1’s cold start data was designed with structured formatting, including a clear reasoning process and summary for each response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Boost&lt;/strong&gt;: By strategically crafting cold-start data with human-guided patterns, DeepSeek-R1 exhibited superior reasoning abilities compared to DeepSeek-R1-Zero.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Enhancing Reasoning with Reinforcement Learning&lt;/h3&gt;

&lt;p&gt;After establishing a solid foundation with cold-start data, DeepSeek-R1 employed large-scale reinforcement learning to further enhance its reasoning skills. This phase focused on areas requiring structured logic, such as coding, mathematics, and science.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Language Consistency Rewards&lt;/strong&gt;: To prevent responses from mixing multiple languages, DeepSeek introduced a reward mechanism that prioritized target-language consistency, ensuring more user-friendly outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized Reasoning Tasks&lt;/strong&gt;: The model balanced accuracy in logic-driven tasks with human readability, refining its problem-solving approach through iterative reinforcement learning.&lt;/li&gt;
&lt;/ul&gt;
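
&lt;p&gt;The paper does not spell out the exact formula for the language consistency reward, but the idea can be approximated as the fraction of output tokens written in the target language, which is then combined with the task reward. A hedged Python sketch (the token-level language check here is a toy ASCII test, not DeepSeek’s actual classifier):&lt;/p&gt;

```python
def language_consistency_reward(tokens, is_target_language):
    """Fraction of tokens in the target language/script (0.0 to 1.0).

    A higher fraction means fewer mixed-language tokens, so this value
    can be added to the task reward to discourage language mixing.
    """
    if not tokens:
        return 0.0
    matches = sum(1 for tok in tokens if is_target_language(tok))
    return matches / len(tokens)

# Illustrative check: treat ASCII tokens as "target language"
tokens = ["The", "answer", "is", "42", "因此"]
reward = language_consistency_reward(tokens, str.isascii)
print(round(reward, 2))  # 0.8: four of the five tokens pass the check
```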

&lt;h3&gt;Supervised Fine-Tuning for Diverse Capabilities&lt;/h3&gt;

&lt;p&gt;Once RL training reached convergence, the next step was supervised fine-tuning (SFT) to further refine the model across reasoning and non-reasoning tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning Data&lt;/strong&gt;: Leveraging rejection sampling, DeepSeek-R1 curated a dataset of 600k reasoning-focused training samples, ensuring only high-quality responses were included.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-Reasoning Data&lt;/strong&gt;: A separate dataset of 200k samples covered diverse areas like writing, factual Q&amp;amp;A, self-cognition, and translation, enabling DeepSeek-R1 to perform well beyond just logic-based tasks.&lt;/li&gt;
&lt;/ul&gt;
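
&lt;p&gt;Conceptually, rejection sampling for data curation means: generate several candidate responses per prompt, keep only those that pass a quality check (for example, a verifiable final answer), and discard the rest. A toy Python sketch, with a hypothetical exact-match check standing in for DeepSeek’s actual filters:&lt;/p&gt;

```python
import itertools

def curate_by_rejection_sampling(prompts, generate, is_acceptable, k=4):
    """For each prompt, sample k candidates and keep only acceptable ones."""
    dataset = []
    for prompt in prompts:
        for _ in range(k):
            response = generate(prompt)
            if is_acceptable(prompt, response):
                dataset.append({"prompt": prompt, "response": response})
    return dataset

# Toy stand-ins: a "model" that sometimes answers wrong, and an exact-match check
answers = itertools.cycle(["4", "5", "4", "4"])
generate = lambda prompt: next(answers)
is_acceptable = lambda prompt, response: response == "4"

curated = curate_by_rejection_sampling(["What is 2 + 2?"], generate, is_acceptable)
print(len(curated))  # prints 3: three of the four sampled responses pass
```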

&lt;h3&gt;Reinforcement Learning for Holistic Improvement&lt;/h3&gt;

&lt;p&gt;To ensure DeepSeek-R1 aligned with human preferences while maintaining strong reasoning, an additional reinforcement learning phase was introduced. This phase prioritized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Helpfulness&lt;/strong&gt;: Ensuring responses were relevant and user-friendly, with a focus on clear and useful summaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harmlessness&lt;/strong&gt;: Filtering out biases, harmful content, or misleading information while maintaining logical accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Balanced Training&lt;/strong&gt;: Integrating reasoning and general-purpose training data to create a well-rounded model capable of excelling in both structured problem-solving and open-ended tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Final Outcome: A Breakthrough in Open-Source AI&lt;/strong&gt;
&lt;/h3&gt;


&lt;p&gt;By combining reinforcement learning, supervised fine-tuning, and strategically curated cold start data, DeepSeek-R1 emerged as a groundbreaking model, outperforming its predecessors. Its distilled versions (DeepSeek-R1-Distill) achieved state-of-the-art results in reasoning benchmarks, proving the effectiveness of this hybrid training approach. DeepSeek-R1 not only pushes the boundaries of AI reasoning but also ensures outputs are more user-friendly, readable, and aligned with human expectations.&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>openai</category>
      <category>ai</category>
    </item>
    <item>
      <title>Simplifying Data Governance with DataHub and PostgreSQL Integration</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Tue, 28 Jan 2025 12:42:25 +0000</pubDate>
      <link>https://dev.to/datamonk_/simplifying-data-governance-with-datahub-and-postgresql-integration-2pc9</link>
      <guid>https://dev.to/datamonk_/simplifying-data-governance-with-datahub-and-postgresql-integration-2pc9</guid>
      <description>&lt;p&gt;In the digital age, data is growing at an unprecedented rate. Terabytes of new data are created—almost as effortlessly as you take a breath. This rapid influx of data presents both opportunities and challenges for businesses, organizations, and data professionals alike. The need to manage, understand, and make sense of this data has never been more crucial.&lt;/p&gt;

&lt;p&gt;As data continues to grow, tools for managing and governing it, like &lt;a href="https://datahubproject.io/" rel="noopener noreferrer"&gt;DataHub&lt;/a&gt;, become essential. DataHub empowers organizations to centralize and organize their data, making it more accessible and useful. Whether you’re working with small datasets or large enterprise databases, DataHub offers the features needed to ensure your data is not only discoverable but also governed and well-documented.&lt;/p&gt;

&lt;p&gt;Managing a database is just a small part of the larger picture, much like how DataHub plays a key role in the vast ocean of data management. In this blog, we’ll take a closer look at how you can ingest a PostgreSQL database into DataHub, explore its key features, and see how it can help you manage and visualize your data with ease.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisite
&lt;/h2&gt;

&lt;p&gt;(Note: &lt;em&gt;the steps below assume Ubuntu&lt;/em&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ubuntu.com/server/docs/install-and-configure-postgresql" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;DataHub running locally; refer to &lt;a href="https://datahubproject.io/docs/quickstart" rel="noopener noreferrer"&gt;the quickstart guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;When ingesting a PostgreSQL database into DataHub, the process begins by connecting DataHub to the database through a configured source connector (defined in a .yml file that acts as a bridge between DataHub and the PostgreSQL database). DataHub then extracts metadata such as the database structure, schemas, tables, columns, and relationships. Key features like table-level lineage, data profiling, and classification are enabled during this process to enhance data discovery and governance.&lt;/p&gt;

&lt;p&gt;The ingestion process also captures additional metadata such as column types, foreign key relationships, and constraints. Once ingested, this metadata is indexed and made available for searching, browsing, and visualization within the DataHub platform.&lt;/p&gt;

&lt;p&gt;The overall goal is to centralize the metadata, improve data discoverability, and facilitate collaboration across teams by providing a comprehensive view of the database structure and data flow.&lt;/p&gt;

&lt;p&gt;After running the command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;datahub docker quickstart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;you will be able to access the DataHub UI at &lt;a href="http://localhost:9002" rel="noopener noreferrer"&gt;http://localhost:9002&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Source Connector
&lt;/h2&gt;

&lt;p&gt;To ingest metadata from PostgreSQL into DataHub, a source connector is used, defined in a configuration file, typically a .yml file. For instance, create a file called postgresql_ingestion.yml that contains all the necessary connection details and parameters required for the ingestion process.&lt;/p&gt;

&lt;p&gt;The configuration file acts as a bridge between your PostgreSQL database and DataHub. It includes essential details like the host, database name, username, and the DataHub server link to ensure that metadata is successfully extracted and ingested into the platform.&lt;br&gt;
like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqx8pdl5r19c5p3lelrjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqx8pdl5r19c5p3lelrjh.png" alt=".yml file" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;
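
&lt;p&gt;For reference, a minimal recipe along these lines might look like the following (host, credentials, and database name are placeholders; check the DataHub PostgreSQL source documentation for the full option list):&lt;/p&gt;

```yaml
# postgresql_ingestion.yml -- placeholder values, adjust for your setup
source:
  type: postgres
  config:
    host_port: "localhost:5432"
    database: weather_db
    username: datahub_user
    password: "${POSTGRES_PASSWORD}"
    profiling:
      enabled: true

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```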

&lt;p&gt;After setting up the source connector in the .yml configuration file, the next step is to ingest your database into DataHub. This process involves running the ingestion command, which uses the configuration file to extract metadata from your database and push it to your DataHub instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;datahub ingest &lt;span class="nt"&gt;-c&lt;/span&gt; postgres2_ingestion.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the dataset now ingested into DataHub, all its information is seamlessly organized and made available for efficient management and governance through a suite of advanced features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;Let's understand these features using the weather data project example I discussed in the Data Orchestration blog post. This project involves a schema containing hourly weather data, daily weather summaries, and global averages, now seamlessly integrated into DataHub for enhanced data management and governance.&lt;/p&gt;

&lt;h3&gt;Containers&lt;/h3&gt;

&lt;p&gt;These can be used to represent Weather data into logical groupings like databases, schemas, and tables. For example, hourly weather data can be in one schema, while daily and global data occupy others, enabling clear categorization for streamlined navigation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdc4hqtux05k31rxni6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdc4hqtux05k31rxni6a.png" alt="containers" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Classification&lt;/h3&gt;

&lt;p&gt;By tagging sensitive weather data or business-critical metrics (e.g., identifying PII (Personally Identifiable Information) such as location, date, or address, or other compliance-relevant data), we can classify the data effectively to enforce security, regulatory compliance, and usage guidelines.&lt;/p&gt;

&lt;h3&gt;Data Profiling&lt;/h3&gt;

&lt;p&gt;Through SQL profiling, DataHub can generate comprehensive statistics for tables, such as average temperatures, the frequency of anomalies, and missing data percentages.&lt;/p&gt;
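
&lt;p&gt;Under the hood, profiling boils down to aggregate statistics per column. A rough Python sketch of the kind of numbers involved, using toy in-memory values rather than a live table:&lt;/p&gt;

```python
def profile_column(values):
    """Basic column profile: row count, null percentage, and mean of non-nulls."""
    non_null = [v for v in values if v is not None]
    null_pct = 100.0 * (len(values) - len(non_null)) / len(values)
    mean = sum(non_null) / len(non_null) if non_null else None
    return {
        "rows": len(values),
        "null_pct": round(null_pct, 1),
        "mean": round(mean, 2) if mean is not None else None,
    }

# Toy hourly temperatures with one missing reading
temps = [21.5, 22.0, None, 23.5]
print(profile_column(temps))  # {'rows': 4, 'null_pct': 25.0, 'mean': 22.33}
```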

&lt;h3&gt;Description&lt;/h3&gt;

&lt;p&gt;Each table, column, or dataset can include metadata descriptions (e.g., "Hourly Weather Data includes temperature, humidity, and wind speed per hour"). This ensures clear context, enabling users and systems to understand the data easily without repeated explanations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3vej9vcixqz1wpppbl7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3vej9vcixqz1wpppbl7.png" alt="description" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Detect Deleted Entities &lt;/h3&gt;

&lt;p&gt;If data is removed (e.g., outdated weather stations or deprecated data tables) from the PostgreSQL database, DataHub detects and reflects these changes, reducing clutter and ensuring that outdated references no longer impact the system.&lt;/p&gt;

&lt;h3&gt;Domains&lt;/h3&gt;

&lt;p&gt;Domains allow grouping datasets by their purpose, like "Forecasting," "Historical Weather," and "Climate Analytics." This structure simplifies data governance by enabling domain-specific visibility and control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yo870u6s84982nyp786.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4yo870u6s84982nyp786.png" alt="Domains" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Platform Instance&lt;/h3&gt;

&lt;p&gt;Platform-specific metadata allows you to tag and distinguish between different instances of the PostgreSQL database, such as production and staging environments. This makes it easier to track where the data is coming from and avoids any confusion between test data and live production data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdluzylm7hg03cxwfodmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdluzylm7hg03cxwfodmp.png" alt="tagging" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Schema Metadata &lt;/h3&gt;

&lt;p&gt;The weather project’s database schema, with details on tables, columns, and relationships, is automatically extracted and indexed, enabling quick discovery of how data is organized. For instance, knowing the relationship between hourly data and daily summaries aids in analysis automation.&lt;/p&gt;

&lt;h3&gt;Table-Level Lineage&lt;/h3&gt;

&lt;p&gt;Tracking lineage maps the entire journey of the data, showing how the hourly weather dataset is aggregated into daily summaries and further processed to generate global averages. This transparency ensures trust in the results and makes it easier to identify and resolve any issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcehin6etywv5jm9ykad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcehin6etywv5jm9ykad.png" alt="lineage" width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;
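
&lt;p&gt;This lineage mirrors the underlying transformations: hourly rows roll up into daily summaries, which roll up into a global average. A toy sketch of that flow (illustrative values, not the project’s real tables):&lt;/p&gt;

```python
from statistics import mean

# Hourly readings per (city, date) -- stand-ins for the hourly weather table
hourly = {
    ("Alwar", "2025-01-19"): [20.0, 22.0, 24.0],
    ("Jaipur", "2025-01-19"): [18.0, 20.0],
}

# Stage 1: hourly readings become daily summaries
daily = {key: mean(temps) for key, temps in hourly.items()}

# Stage 2: daily summaries become a global average
global_avg = mean(daily.values())

print(daily[("Alwar", "2025-01-19")], global_avg)  # 22.0 20.5
```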

&lt;h2&gt;
  
  
  So...
&lt;/h2&gt;

&lt;p&gt;After exploring all the features, it's clear that integrating DataHub with your PostgreSQL database can really take data management to the next level. For example, lineage tracking allows you to easily trace how your weather data flows, from hourly temperature readings to daily summaries and even global averages. This transparency not only builds trust in the data but also makes it much easier to fix any issues if something goes wrong.&lt;/p&gt;

&lt;p&gt;Another powerful feature is data profiling, which helps you monitor the quality of your data. Whether it's checking for missing values or spotting unusual patterns, it ensures the data you're working with is reliable—something that's especially important when you're forecasting weather trends or working with large datasets.&lt;/p&gt;

&lt;p&gt;By using these features, you're not just managing data—you're ensuring that it’s secure, accurate, and trustworthy. DataHub is like an extra layer of safety and efficiency for your weather project, streamlining workflows while giving you confidence that the data you’re using is always up to par.&lt;/p&gt;

&lt;p&gt;Adopting DataHub isn't just about tools, it's about improving how we handle data so that we can make better decisions and keep things running smoothly.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>datahub</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>A detailed study on three popular ETL tools for Workflow Orchestration.</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Thu, 23 Jan 2025 13:51:13 +0000</pubDate>
      <link>https://dev.to/datamonk_/a-detailed-study-on-three-popular-etl-tools-for-workflow-orchestration-7hn</link>
      <guid>https://dev.to/datamonk_/a-detailed-study-on-three-popular-etl-tools-for-workflow-orchestration-7hn</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/datamonk_" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2746984%2F1f918d97-0fc5-4474-bba9-89bb2146bf37.jpg" alt="datamonk_"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/datamonk_/different-orchestration-tool-analysis-airflow-dagster-flyte-192d" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Data Orchestration Tool Analysis: Airflow, Dagster, Flyte&lt;/h2&gt;
      &lt;h3&gt;Datamonk ・ Jan 23&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#dataengineering&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tooling&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#opensource&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>dataengineering</category>
      <category>devops</category>
      <category>softwaredevelopment</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Data Orchestration Tool Analysis: Airflow, Dagster, Flyte</title>
      <dc:creator>Datamonk</dc:creator>
      <pubDate>Thu, 23 Jan 2025 13:09:19 +0000</pubDate>
      <link>https://dev.to/datamonk_/different-orchestration-tool-analysis-airflow-dagster-flyte-192d</link>
      <guid>https://dev.to/datamonk_/different-orchestration-tool-analysis-airflow-dagster-flyte-192d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Data orchestration tools are key for managing data pipelines in modern workflows. Apache Airflow, Dagster, and Flyte are three popular tools serving this need, but they serve different purposes and follow different philosophies. Choosing the right tool for your requirements is essential for scalability and efficiency. In this blog, I will compare &lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Apache Airflow&lt;/a&gt;, &lt;a href="https://dagster.io/" rel="noopener noreferrer"&gt;Dagster&lt;/a&gt;, and &lt;a href="https://flyte.org/" rel="noopener noreferrer"&gt;Flyte&lt;/a&gt;, exploring their evolution, features, and unique strengths, while sharing insights from my hands-on experience with these tools in a weather data pipeline project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In the weather data project, I got the chance to work with all three of these tools, Airflow, Dagster, and Flyte, and gained an understanding of what makes each one unique. In this blog, I’ll share my experience comparing them, break down how each one works, and highlight what sets them apart.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Airflow
&lt;/h2&gt;

&lt;p&gt;Apache Airflow got its start at Airbnb back in October 2014 as a Python-based orchestrator with a web interface, designed to handle the company’s growing workflow challenges. It joined the Apache Incubator in 2016 and earned its spot as a top-level Apache Software Foundation project in 2019, marking a major milestone in its journey.&lt;/p&gt;

&lt;p&gt;Airflow proved to be a blessing for Airbnb, simplifying the management and scheduling of the company’s complex tasks.&lt;/p&gt;

&lt;p&gt;In the weather data project, I used Airflow to automate the data pipeline, ensuring tasks like fetching, processing, and storing weather data ran in the correct order. Each task depended on the successful completion of the previous one, ensuring smooth and sequential execution from start to finish.&lt;/p&gt;

&lt;p&gt;An Airflow DAG file consists of three main components: the DAG instantiation, the task definitions, and the task dependencies. It looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Dag Instance
&lt;/span&gt;&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather_dag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 0 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Daily at midnight
&lt;/span&gt;    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tzinfo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;IST&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dagrun_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Task Definitions
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;weather_dag&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_tables&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;         
        &lt;span class="nf"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;         
        &lt;span class="nf"&gt;fetch_and_store_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;     
        &lt;span class="nf"&gt;fetch_day_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  

    &lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;global_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;     
        &lt;span class="nf"&gt;fetch_global_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  

&lt;span class="c1"&gt;# Task Dependencies
&lt;/span&gt;    &lt;span class="n"&gt;create_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_tables&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;fetch_weather_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alwar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-01-19&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fetch_daily_weather_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alwar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;global_average_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;global_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alwar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Task Order
&lt;/span&gt;    &lt;span class="n"&gt;create_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fetch_weather_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fetch_daily_weather_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;global_average_task&lt;/span&gt;

&lt;span class="n"&gt;weather_dag_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;weather_dag&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And it’s all managed through the Airflow UI, which provides a way to monitor and track the progress of the entire pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7rfciwb4j9nt25uw4m0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7rfciwb4j9nt25uw4m0.png" alt="Apache Airflow UI" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  DAGSTER
&lt;/h2&gt;

&lt;p&gt;Dagster was developed by Elementl, a company founded by Nick Schrock in April 2019. With a vision to reshape the data management ecosystem, Nick introduced Dagster, a fresh programming model for data processing.&lt;/p&gt;

&lt;p&gt;Unlike traditional tools that focus primarily on tasks or jobs, Dagster emphasizes the relationships between inputs and outputs. Its asset-centric approach focuses on treating data assets as the central units of computation in a pipeline.&lt;/p&gt;

&lt;p&gt;Each asset is represented as a dataset, and the pipeline revolves around how the assets depend on each other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Table Creation for the Weather Data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Creates databse tables needed for weather data.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_database&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;setup_database&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# setup_database is a dependency
&lt;/span&gt;        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The hourly data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;city and date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_and_store_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;MaterializeResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;number of rows&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;       
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# fetch_weather is a dependency
&lt;/span&gt;        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Day Average&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;city and date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_day_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="c1"&gt;# asset based graphs
&lt;/span&gt;    &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;City&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max Temp (°C)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Min Temp (°C)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Condition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avg Humidity (%)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;weather_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;MaterializeResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Row added&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MetadataValue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;to_markdown&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;deps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# fetch_daily_weather is a dependency
&lt;/span&gt;        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Whole Average&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;      
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;global_weather&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;fetch_global_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dagster builds a clear dependency graph, making the pipeline transparent and easy to debug.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0wg8ms17t0sagfoilhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0wg8ms17t0sagfoilhy.png" alt="Dagster UI" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traditional Task-Based Workflow&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task 1: Fetch weather data.
Task 2: Clean the data.
Task 3: Store the cleaned data in a database.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Asset-Centric Workflow in Dagster&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Asset 1: Raw weather data (produced by fetching from an API).
Asset 2: Cleaned weather data (transformed from raw weather data).
Asset 3: Stored weather dataset (created from cleaned data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With Dagster, you can build custom asset &lt;a href="https://docs.dagster.io/tutorial/building-an-asset-graph" rel="noopener noreferrer"&gt;graphs&lt;/a&gt;, linking them directly to pipeline steps. This feature stands out because it helps you monitor data as it evolves through each pipeline stage. It adds clarity and interactivity to the workflow, making debugging and monitoring far more intuitive, a capability I didn’t encounter in Airflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna3eqpqay9yet3jfy7jl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fna3eqpqay9yet3jfy7jl.png" alt="Assets-Graphs" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it’s not just asset-centric: if you prefer the task-based approach used in Airflow, Dagster has you covered too. You can define tasks using the @op decorator (&lt;a href="https://docs.dagster.io/concepts/ops-jobs-graphs/ops" rel="noopener noreferrer"&gt;operations&lt;/a&gt;) in Dagster, much as you’d use @task in Airflow. Whether you prefer working with assets or tasks, you have the flexibility to choose the approach that works best for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  FLYTE
&lt;/h2&gt;

&lt;p&gt;Flyte, a workflow orchestration tool, was initially developed by Lyft in 2016 as an internal platform to manage complex machine learning and data processing pipelines. It was open-sourced in 2020, making it accessible to other companies. It leverages Kubernetes, allowing businesses to scale and manage their data and ML workflows reliably and efficiently. &lt;/p&gt;

&lt;p&gt;Designed to handle both machine learning and data engineering workflows, Flyte is built on Kubernetes and leverages its &lt;strong&gt;containerized infrastructure&lt;/strong&gt; to run large-scale jobs, enabling efficient resource scaling and management. &lt;/p&gt;

&lt;p&gt;In Flyte, tasks are defined as Python functions and then composed into workflows. Each task represents a unit of work, and tasks can be connected with dependencies that indicate their execution order. This is broadly similar to Airflow’s task-centric approach.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_database&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  
    &lt;span class="nf"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; 
    &lt;span class="nf"&gt;fetch_and_store_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   

&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="nf"&gt;fetch_day_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;global_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;   
    &lt;span class="nf"&gt;fetch_global_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="nd"&gt;@workflow&lt;/span&gt;         &lt;span class="c1"&gt;#defining the workflow
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Noida&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2025-01-17&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# The workflow will execute the tasks in the order they are defined
&lt;/span&gt;    &lt;span class="nf"&gt;setup_database&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;fetch_daily_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;global_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Workflow executed successfully for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running wf() &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flyte makes local execution easy with &lt;a href="https://docs.flyte.org/projects/flytectl/en/latest/" rel="noopener noreferrer"&gt;flytectl&lt;/a&gt;, which sets up a sandbox container for testing workflows. Plus, it lets you run Python code locally, so you can test and debug your workflows before deploying them to the cloud.&lt;/p&gt;

&lt;p&gt;Flyte emerges as a modern solution for virtually every aspect of tech workflows, offering a number of key &lt;a href="https://flyte.org/airflow-alternative" rel="noopener noreferrer"&gt;benefits&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;h3&gt; DAG Versioning &lt;/h3&gt;

&lt;p&gt;While working on the weather data project in Airflow, one of the challenges I encountered was managing changes in the pipeline over time, a common issue known as DAG versioning. If you update a pipeline to add or modify tasks, there’s no native way to version those changes, run different versions side by side, or roll back to a previous version. Users face extra complexity from workarounds like appending version numbers to DAG IDs, using Git to track code, or maintaining separate environments. &lt;/p&gt;
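&lt;p&gt;The DAG-ID workaround mentioned above can be as simple as baking a version suffix into the DAG’s identifier, so old and new definitions can be deployed side by side (the names here are hypothetical):&lt;/p&gt;

```python
# Hypothetical convention: bump DAG_VERSION whenever the pipeline's
# structure changes, so both versions can coexist in the scheduler.
DAG_VERSION = "v2"
dag_id = f"weather_pipeline_{DAG_VERSION}"

# with DAG(dag_id=dag_id, ...):   # the DAG body itself stays unchanged
print(dag_id)
```

&lt;p&gt;The obvious downside is that run history gets split across IDs, which is exactly the pain point described here.&lt;/p&gt;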

&lt;p&gt;In contrast, Dagster solves this problem effectively with its asset-centric approach, built-in support for backfills, and asset snapshots. Each asset is versioned independently, so updating one asset doesn’t disrupt the others. &lt;/p&gt;

&lt;p&gt;As the modern data stack has grown, and is still growing, tools can’t limit themselves to merely executing, managing, and optimizing data assets. They need to fit into the entire development workflow, from local testing to production deployments, while being cloud-native to support scalability and flexibility.&lt;/p&gt;

&lt;p&gt;Flyte, by contrast, addresses DAG versioning natively by supporting versioned workflows. When you update or modify a task, Flyte lets you track and manage different workflow versions without disrupting ongoing processes, and you can test updated tasks without affecting the entire workflow, ensuring smoother iteration and flexibility.&lt;/p&gt;

&lt;h3&gt; Scaling &lt;/h3&gt;

&lt;p&gt;In data engineering, Dagster handles scaling up well thanks to its flexible architecture for large data workflows, whereas managing resources and scaling in Airflow can be challenging. When it comes to machine learning, however, Flyte stands out as the more favorable choice, thanks to its built-in support for &lt;a href="https://flyte.org/machine-learning" rel="noopener noreferrer"&gt;ML workflows&lt;/a&gt;,&lt;a href="https://docs.flyte.org/en/latest/user_guide/concepts/main_concepts/versioning.html" rel="noopener noreferrer"&gt; model versioning&lt;/a&gt;, and Kubernetes-based &lt;a href="https://flyte.org/data" rel="noopener noreferrer"&gt;scalability&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt; Modern AI Orchestration &lt;/h3&gt;

&lt;p&gt;Comparing Airflow, Dagster, and Flyte makes clear how they suit different project needs, as in our weather data project. &lt;strong&gt;Airflow&lt;/strong&gt; excels at scheduling tasks but falls short when you need AI-specific tasks or have to handle high-volume, real-time data, as in weather prediction models. &lt;strong&gt;Dagster&lt;/strong&gt; focuses heavily on the data-driven approach, which is great, but it lacks some of the dynamic scalability a complex project like weather forecasting requires. &lt;strong&gt;Flyte&lt;/strong&gt;, however, shines at AI orchestration: it handles intensive workloads, scales effectively for complex data processing, and automates the workflow, making it ideal for predicting weather patterns or managing large sets of weather-related data, all while staying efficient and reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  CONCLUSION
&lt;/h2&gt;

&lt;p&gt;When deciding between the tools, consider the scale and focus of your workflows. If you need flexibility in pipeline structure and asset management, Dagster is a strong contender. For machine learning workflows with the added benefit of seamless scaling, Flyte should be your go-to solution. Meanwhile, if you are managing straightforward, traditional data engineering tasks, Airflow’s simplicity will still make it a valuable tool. Each of these tools brings unique features and advantages, so understanding your project’s needs will guide you toward the optimal choice.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>tooling</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
