<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ludovic DEHON</title>
    <description>The latest articles on DEV Community by Ludovic DEHON (@tchiotludo).</description>
    <link>https://dev.to/tchiotludo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F559509%2Fd6e37240-54b3-4592-826e-95e84e3621e0.jpeg</url>
      <title>DEV Community: Ludovic DEHON</title>
      <link>https://dev.to/tchiotludo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tchiotludo"/>
    <language>en</language>
    <item>
      <title>Kestra Secures $8 Million to Simplify and Unify Orchestration for All Engineers</title>
      <dc:creator>Ludovic DEHON</dc:creator>
      <pubDate>Mon, 23 Sep 2024 14:35:05 +0000</pubDate>
      <link>https://dev.to/kestra/kestra-secures-8-million-to-simplify-and-unify-orchestration-for-all-engineers-hki</link>
      <guid>https://dev.to/kestra/kestra-secures-8-million-to-simplify-and-unify-orchestration-for-all-engineers-hki</guid>
      <description>&lt;p&gt;Orchestration is at the core of the modern business infrastructure, and today, we're taking a huge step toward transforming how it's done. &lt;strong&gt;We’re thrilled to announce Kestra's $8 million Seed round&lt;/strong&gt;, led by &lt;a href="https://alven.co/" rel="noopener noreferrer"&gt;Alven&lt;/a&gt; (Stripe, Dataiku, Qonto, Algolia) with participation from &lt;a href="https://www.isai.fr/" rel="noopener noreferrer"&gt;ISAI&lt;/a&gt;, &lt;a href="https://www.axc.vc/" rel="noopener noreferrer"&gt;Axeleo&lt;/a&gt;, and key tech leaders such as &lt;a href="https://www.linkedin.com/in/olivierpomel/" rel="noopener noreferrer"&gt;Olivier Pomel&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/tristanhandy" rel="noopener noreferrer"&gt;Tristan Handy&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/micheltricot/" rel="noopener noreferrer"&gt;Michel Tricot&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/clementdelangue" rel="noopener noreferrer"&gt;Clément Delangue&lt;/a&gt;. This funding marks the next chapter in our mission to redefine orchestration for enterprises worldwide, empowering engineers to simplify the most complex workflows at an unprecedented scale.&lt;/p&gt;

&lt;p&gt;This milestone wouldn’t have been possible without the trust of our growing community. Since raising $3 million in pre-seed funding last year, Kestra has surpassed every expectation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We’ve expanded our use cases far beyond what we initially envisioned.&lt;/li&gt;
&lt;li&gt;We’ve proven our platform’s resilience across large-scale, mission-critical workloads.&lt;/li&gt;
&lt;li&gt;We’ve heard from countless users who confirm that Kestra delivers on its promise of simplicity, transparency, and reliability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;A Growing Trust in Kestra&lt;/h2&gt;

&lt;p&gt;Today, Kestra’s adoption has &lt;strong&gt;skyrocketed by 10x&lt;/strong&gt;. Thousands of companies, from ambitious &lt;strong&gt;startups to Fortune 100 companies&lt;/strong&gt;, use Kestra to orchestrate their most critical workflows. This drives us to keep pushing boundaries and simplifying orchestration in ways that were previously unimaginable.&lt;/p&gt;

&lt;p&gt;Our $8 million Seed round is a testament to the confidence our investors have in Kestra’s future. In addition to Alven, ISAI, and Axeleo, we’re proud to be supported by an impressive lineup of private investors, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/olivierpomel/" rel="noopener noreferrer"&gt;Olivier Pomel&lt;/a&gt; (Co-founder and CEO of Datadog),&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/tristanhandy" rel="noopener noreferrer"&gt;Tristan Handy&lt;/a&gt; (Founder and CEO of dbt Labs),&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/micheltricot/" rel="noopener noreferrer"&gt;Michel Tricot&lt;/a&gt; (Co-founder and CEO of Airbyte),&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/clementdelangue" rel="noopener noreferrer"&gt;Clément Delangue&lt;/a&gt; (Co-founder and CEO of Hugging Face),&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/bertranddiard" rel="noopener noreferrer"&gt;Bertrand Diard&lt;/a&gt; (Co-founder of Talend),&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/nicolasdessaigne/" rel="noopener noreferrer"&gt;Nicolas Dessaigne&lt;/a&gt; (Co-founder of Algolia &amp;amp; Group Partner at Y Combinator),&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/fplais/" rel="noopener noreferrer"&gt;Frédéric Plais&lt;/a&gt; (Co-founder and CEO of Platform.sh)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/david-perry-8ab707/" rel="noopener noreferrer"&gt;David Perry&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/johndbritton/" rel="noopener noreferrer"&gt;John Britton&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/antoineballiet/" rel="noopener noreferrer"&gt;Antoine Balliet&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/zsmith/" rel="noopener noreferrer"&gt;Zachary Smith&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/arnaudferreri/" rel="noopener noreferrer"&gt;Arnaud Ferreri&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This funding enables us to accelerate our growth, expand our team, and continue delivering exceptional value to engineers and enterprises alike.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/kf1kSEyjErA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;Why We Built Kestra: Bridging the Orchestration Gap&lt;/h2&gt;

&lt;p&gt;Existing tools often solve specific needs—whether it's automating data pipelines, managing IT tasks, or coordinating business processes—but they &lt;strong&gt;tend to operate in silos&lt;/strong&gt;. These siloed solutions introduce complexity, demand specialized skills, and ultimately create unnecessary risks, where what’s needed is efficiency, transparency, and reliability.&lt;/p&gt;

&lt;p&gt;At Kestra, we recognized this gap and set out to build a &lt;strong&gt;unified orchestration platform&lt;/strong&gt; that simplifies workflows across any infrastructure, cloud, or application. Our vision was to create an orchestration tool that engineers love to use, &lt;strong&gt;one that breaks down barriers&lt;/strong&gt; and scales seamlessly.&lt;/p&gt;

&lt;h2&gt;Simplifying Complexity, Elevating Workflows&lt;/h2&gt;

&lt;p&gt;Kestra is designed to simplify complexity. With a &lt;strong&gt;declarative&lt;/strong&gt;, &lt;strong&gt;language-agnostic&lt;/strong&gt; framework, an &lt;strong&gt;API-first&lt;/strong&gt; approach, and an &lt;strong&gt;Everything as Code&lt;/strong&gt; philosophy that works equally well from the UI or through GitOps, Kestra is intuitive yet incredibly powerful. It’s a platform engineers can adopt quickly, customize extensively, and rely on for any workflow, no matter how complex.&lt;/p&gt;
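&lt;p&gt;To make this concrete, here is a minimal sketch of what a declarative Kestra flow looks like. The flow ID, namespace, and message are placeholders, and the &lt;code&gt;Log&lt;/code&gt; task simply prints a message:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Minimal illustrative flow: one task that logs a message
id: hello-world
namespace: company.team

tasks:
  - id: say-hello
    type: io.kestra.core.tasks.log.Log
    message: Everything as Code, from the UI, or via GitOps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The same YAML structure scales from this one-task example to complex multi-step workflows.&lt;/p&gt;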

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Kestra has been essential in designing complex execution flows while enhancing our Infrastructure-as-Code best practices."&lt;/em&gt;, Gorgias.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;A Platform Without Limits&lt;/h2&gt;

&lt;p&gt;Kestra isn’t just another orchestration tool. It’s a platform built to handle workflows of &lt;strong&gt;any type, across any domain&lt;/strong&gt;. Whether it’s automating infrastructure, transforming and transporting data, coordinating microservices, or real-time network monitoring, Kestra’s flexibility and extensibility make it a trusted solution for a wide range of challenges.&lt;/p&gt;

&lt;p&gt;And we’re not stopping there. We continue to expand Kestra’s capabilities with new features and integrations, constantly &lt;strong&gt;pushing the boundaries&lt;/strong&gt; of what’s possible.&lt;/p&gt;

&lt;h2&gt;Built on the Power of Open Source&lt;/h2&gt;

&lt;p&gt;At the heart of Kestra’s success is our &lt;strong&gt;global open-source community&lt;/strong&gt;. We empower engineers to easily adopt and integrate Kestra into their workflows by embracing open-source principles. This openness accelerates innovation, fosters collaboration, and ensures that Kestra is continuously improving based on &lt;strong&gt;real-world needs&lt;/strong&gt;. It’s why thousands of production environments around the world depend on Kestra to power their &lt;strong&gt;most critical workflows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39h7ojlvyz31bqkxebjm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39h7ojlvyz31bqkxebjm.jpg" alt="Image description" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Scaling Beyond Limits: Orchestrating the Future&lt;/h2&gt;

&lt;p&gt;In just one year, Kestra has grown by 10x, and today, hundreds of millions of workflows are powered by our platform. We’ve become the &lt;strong&gt;orchestration layer of choice&lt;/strong&gt; for software engineers, DevOps experts, and IT leaders across the tech landscape.&lt;/p&gt;

&lt;p&gt;Kestra enables users to &lt;strong&gt;orchestrate workflows in under 5 minutes&lt;/strong&gt; through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seamless workflow design&lt;/strong&gt;: Easily build workflows through our intuitive UI or directly within your favorite IDE.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;500+ integrations&lt;/strong&gt;: Connect with major technologies including AWS, GCP, Azure, Terraform, Docker, GitHub, Kafka, Postgres, Redis, MongoDB, SQL Server, Databricks, Snowflake, dbt, Airbyte, and many more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy anywhere&lt;/strong&gt;: Run Kestra on any cloud (AWS, Azure, GCP), on-premises, or even on your laptop using Docker.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Kestra is the unifying layer for our data and workflows. You can start small, but scale without limits."&lt;/em&gt;, Leroy Merlin.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Resilient Orchestration at Scale for Critical Workflows&lt;/h2&gt;

&lt;p&gt;As organizations grow, so do their orchestration needs. For businesses managing mission-critical workflows, Kestra elevates its capabilities to meet these demands.&lt;/p&gt;

&lt;p&gt;Building on Kestra’s core strengths, we offer enhanced &lt;strong&gt;security&lt;/strong&gt;, &lt;strong&gt;governance&lt;/strong&gt;, and &lt;strong&gt;monitoring&lt;/strong&gt; to support &lt;strong&gt;large-scale&lt;/strong&gt;, &lt;strong&gt;business-critical&lt;/strong&gt; operations. It also ensures real-time performance and seamless integration with features like SSO, CI/CD pipelines, and secret managers.&lt;/p&gt;

&lt;p&gt;With Kestra, organizations gain the &lt;strong&gt;reliability&lt;/strong&gt; required to scale their workflows confidently, no matter the complexity or size of their operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10gxjcvkuiilatv2qe97.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F10gxjcvkuiilatv2qe97.jpg" alt="Image description" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Orchestrating with Kestra Enterprise: Trusted by Industry Leaders for Mission-Critical Workflows.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Looking Ahead: Innovating and Expanding&lt;/h2&gt;

&lt;p&gt;With this $8 million Seed round, our second round of funding, we’re excited to enter &lt;strong&gt;the next phase of Kestra's growth&lt;/strong&gt;. Our commitment to continuous improvement drives us to expand the platform’s capabilities — enhancing our ecosystem with more third-party integrations, simplifying plugin management, and improving the orchestration experience for all engineers.&lt;/p&gt;

&lt;p&gt;We’re planning to &lt;strong&gt;expand into the U.S.&lt;/strong&gt; to better support our North American clients, bringing us closer to them for improved collaboration and service. Meanwhile, &lt;strong&gt;we’re growing our team across Europe and North America&lt;/strong&gt;, hiring for key roles such as Software Engineers, DevOps Engineers, Architects, Solution Engineers, and go-to-market and Marketing professionals.&lt;/p&gt;

&lt;p&gt;Most importantly, we know that &lt;strong&gt;Kestra’s success is driven by the talented and dedicated people&lt;/strong&gt; behind the platform. &lt;strong&gt;Thank you Team ❤️&lt;/strong&gt;, your commitment to &lt;strong&gt;pushing the boundaries of orchestration&lt;/strong&gt; is what sets us apart, and we look forward to growing this exceptional group to take on the challenges ahead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thank you to our users, customers, and investors&lt;/strong&gt; for driving Kestra forward in our mission to transform orchestration. &lt;strong&gt;Help us build the leading platform for Unified Orchestration&lt;/strong&gt; by &lt;a href="https://go.kestra.io/github-fundraise" rel="noopener noreferrer"&gt;starring us on GitHub&lt;/a&gt; and joining this exciting journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestrate Everything, Everywhere, All at Once&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>data</category>
      <category>opensource</category>
      <category>tooling</category>
    </item>
    <item>
      <title>End-to-End Data Ingestion, Transformation and Orchestration with Airbyte, dbt and Kestra</title>
      <dc:creator>Ludovic DEHON</dc:creator>
      <pubDate>Sat, 24 Jun 2023 21:35:35 +0000</pubDate>
      <link>https://dev.to/kestra/end-to-end-data-ingestion-transformation-and-orchestration-with-airbyte-dbt-and-kestra-1lmo</link>
      <guid>https://dev.to/kestra/end-to-end-data-ingestion-transformation-and-orchestration-with-airbyte-dbt-and-kestra-1lmo</guid>
      <description>&lt;p&gt;The key benefit of the Modern Data Stack is that you can avoid vendor lock-in by selecting best-of-breed tools rather than paying expensive license fees for one inflexible solution. However, assembling your modular stack based on multiple SaaS solutions will only marginally improve that situation. It’s undoubtedly easier to swap only one component by, e.g., migrating between two SaaS BI solutions than migrating an entire all-in-one stack. However, the vendor lock-in issue remains unresolved — all you’ll achieve is multiple vendor lock-ins at a smaller scale rather than one giant lock-in. Open-core technologies and SaaS offerings built on top of them can help avoid these issues and future-proof your data stack.&lt;/p&gt;

&lt;p&gt;This post will discuss a selection of open-core tools that you can use to build your data stack in a modular way. Specifically, we'll leverage &lt;strong&gt;Airbyte&lt;/strong&gt;, &lt;strong&gt;dbt&lt;/strong&gt;, and &lt;strong&gt;Kestra&lt;/strong&gt; — all of them are either &lt;strong&gt;source-available&lt;/strong&gt; (&lt;em&gt;Airbyte&lt;/em&gt;) or entirely &lt;strong&gt;open-source&lt;/strong&gt; (&lt;em&gt;dbt and Kestra&lt;/em&gt;).&lt;/p&gt;




&lt;h2&gt;Airbyte&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://airbyte.com/" rel="noopener noreferrer"&gt;Airbyte&lt;/a&gt; is a data integration platform that simplifies and standardizes replicating data from source systems to desired destinations, such as a data warehouse or a data lake. It provides many pre-built connectors to various databases, APIs, storage systems, and a &lt;a href="https://docs.airbyte.com/connector-development/config-based/low-code-cdk-overview/" rel="noopener noreferrer"&gt;low-code CDK&lt;/a&gt; for creating new custom connectors.&lt;/p&gt;

&lt;p&gt;To start using Airbyte, you only need a workstation with Docker and a terminal. You can follow &lt;a href="https://docs.airbyte.com/quickstart/deploy-airbyte/" rel="noopener noreferrer"&gt;the quickstart guide&lt;/a&gt; to launch Airbyte on your local machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/airbytehq/airbyte.git
&lt;span class="nb"&gt;cd &lt;/span&gt;airbyte
./run-ab-platform.sh 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, you can begin creating your sources, destinations, and connections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sources&lt;/strong&gt; describe source systems, such as your applications and databases. This is usually the raw data that serves as a basis for analytics.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;destination&lt;/strong&gt; is typically your data warehouse or a data lake platform (&lt;em&gt;such as BigQuery, Snowflake, Redshift, Databricks, Dremio, Starburst, Azure Synapse, and more&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;Finally, &lt;strong&gt;connections&lt;/strong&gt; tie sources and destinations together by specifying how and when to load data from a specific source to a specific destination.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkestra.io%2Fblogs%2F2023-06-26-end-to-end-data-orchestration%2Fairbyte1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkestra.io%2Fblogs%2F2023-06-26-end-to-end-data-orchestration%2Fairbyte1.png" alt="airbyte1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;dbt&lt;/h2&gt;

&lt;p&gt;dbt is a &lt;strong&gt;data build tool&lt;/strong&gt; that enables data analysts and engineers to &lt;strong&gt;transform&lt;/strong&gt; data in a cloud analytics warehouse using templated SQL and a Command Line Interface (CLI). After you've ingested raw data with Airbyte, dbt can query it and transform it into analytics-ready datasets by executing SQL (&lt;em&gt;or Python&lt;/em&gt;) code within your warehouse. Thanks to dbt, you can organize your business logic into modular components (models) and document and test those models. Finally, you can version control your dbt transformations so that others on your team can collaborate on the same project, improving the auditability and rollback of your cloud analytics warehouse.&lt;/p&gt;

&lt;p&gt;The easiest way to get started with dbt is to leverage the &lt;a href="https://github.com/jwills/dbt-duckdb" rel="noopener noreferrer"&gt;dbt-duckdb&lt;/a&gt; package, allowing you to use an in-memory DuckDB database and run everything locally. However, you can also start using dbt for free with &lt;a href="https://docs.getdbt.com/docs/core/connect-data-platform/bigquery-setup" rel="noopener noreferrer"&gt;Google BigQuery&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Kestra&lt;/h2&gt;

&lt;p&gt;Orchestration is a critical aspect of end-to-end data management, as it helps you coordinate all steps, control and monitor their execution, and respond to failure scenarios. For instance, you can ensure that transformation steps run only after your data has been successfully ingested. If something fails, you can retry it and get notified about that failure.&lt;/p&gt;

&lt;p&gt;Kestra is a simple, event-driven orchestrator that helps to maintain orchestration logic as code while bridging the gap between engineers and domain experts. The declarative YAML syntax opens the process of building data workflows to domain experts who are not programmers, as well as programmers working on a different stack (&lt;em&gt;a language-agnostic interface rather than only Python&lt;/em&gt;).&lt;/p&gt;
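&lt;p&gt;As a minimal illustration of that declarative syntax (the IDs, namespace, and message below are placeholders), a complete Kestra flow can be as short as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# A complete, runnable flow: YAML only, no programming required
id: getting-started
namespace: dev

tasks:
  - id: hello
    type: io.kestra.core.tasks.log.Log
    message: Hello from a declarative YAML flow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the whole definition is data rather than code, domain experts can read and edit it directly, while engineers can still version and review it like any other code.&lt;/p&gt;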




&lt;h2&gt;Hands-on demo&lt;/h2&gt;

&lt;p&gt;Let's look at how you can combine Airbyte, dbt, and Kestra to build an end-to-end workflow.&lt;/p&gt;

&lt;p&gt;First, download Kestra's &lt;a href="https://raw.githubusercontent.com/kestra-io/kestra/develop/docker-compose.yml" rel="noopener noreferrer"&gt;Docker Compose file&lt;/a&gt;, for example, using &lt;code&gt;curl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-o&lt;/span&gt; docker-compose.yml https://raw.githubusercontent.com/kestra-io/kestra/develop/docker-compose.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, run &lt;code&gt;docker compose up -d&lt;/code&gt; and &lt;a href="http://localhost:8080/" rel="noopener noreferrer"&gt;navigate to the UI&lt;/a&gt;. You can start &lt;a href="https://kestra.io/docs/getting-started" rel="noopener noreferrer"&gt;building your first flows&lt;/a&gt; using the integrated code editor in the UI.&lt;/p&gt;

&lt;p&gt;The UI ships with Blueprints, which provide ready-to-use flow examples. For instance, you can use &lt;a href="https://demo.kestra.io/ui/blueprints/community/30" rel="noopener noreferrer"&gt;the following Blueprint&lt;/a&gt; that combines all steps needed to orchestrate Airbyte, dbt, and Kestra:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkestra.io%2Fblogs%2F2023-06-26-end-to-end-data-orchestration%2Fairbyte2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkestra.io%2Fblogs%2F2023-06-26-end-to-end-data-orchestration%2Fairbyte2.png" alt="airbyte2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a copy of that Blueprint. Make sure that you copy the ID of each connection from the Airbyte UI and paste those into the Airbyte task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airbyteDbtCore&lt;/span&gt;
&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev&lt;/span&gt;

&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data-ingestion&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.core.tasks.flows.Parallel&lt;/span&gt;
    &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;psyduck&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.airbyte.connections.Sync&lt;/span&gt;
        &lt;span class="na"&gt;connectionId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4de8ab1e-50ef-4df0-aa01-7f21491081f1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sample-data&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.airbyte.connections.Sync&lt;/span&gt;
        &lt;span class="na"&gt;connectionId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;71291950-ccc1-4875-91b7-e801376c549e&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;charizard&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.airbyte.connections.Sync&lt;/span&gt;
        &lt;span class="na"&gt;connectionId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;9bb96539-73e7-4b9a-9937-6ce861b49cb9&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pikachu&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.airbyte.connections.Sync&lt;/span&gt;
        &lt;span class="na"&gt;connectionId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;39c38950-b0b9-4fce-a303-06ced3dbfa75&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.core.tasks.flows.WorkingDirectory&lt;/span&gt;
    &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloneRepository&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.git.Clone&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/jwills/jaffle_shop_duckdb&lt;/span&gt;
        &lt;span class="na"&gt;branch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;duckdb&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt-build&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.dbt.cli.Build&lt;/span&gt;
        &lt;span class="na"&gt;debug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;runner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DOCKER&lt;/span&gt;
        &lt;span class="na"&gt;dockerOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/kestra-io/dbt-duckdb:latest&lt;/span&gt;
        &lt;span class="na"&gt;dbtPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/usr/local/bin/dbt&lt;/span&gt;
        &lt;span class="na"&gt;inputFiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;.profile/profiles.yml&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;jaffle_shop:&lt;/span&gt;
              &lt;span class="s"&gt;outputs:&lt;/span&gt;
                &lt;span class="s"&gt;dev:&lt;/span&gt;
                  &lt;span class="s"&gt;type: duckdb&lt;/span&gt;
                  &lt;span class="s"&gt;path: ':memory:'&lt;/span&gt;
                  &lt;span class="s"&gt;extensions:&lt;/span&gt;
                    &lt;span class="s"&gt;- parquet&lt;/span&gt;
              &lt;span class="s"&gt;target: dev&lt;/span&gt;
&lt;span class="na"&gt;taskDefaults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.airbyte.connections.Sync&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://host.docker.internal:8000/&lt;/span&gt;
      &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{envs.airbyte_username}}"&lt;/span&gt;
      &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{envs.airbyte_password}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you execute that workflow, you should see a similar output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkestra.io%2Fblogs%2F2023-06-26-end-to-end-data-orchestration%2Fairbyte3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fkestra.io%2Fblogs%2F2023-06-26-end-to-end-data-orchestration%2Fairbyte3.png" alt="airbyte3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flow runs four Airbyte data ingestion processes in parallel. Once raw data is successfully ingested, the flow clones a Git repository and runs dbt CLI commands that build models committed to that Git repository.&lt;/p&gt;

&lt;p&gt;The topology view shows the duration of each task. The entire flow, executing four Airbyte syncs and &lt;code&gt;dbt build&lt;/code&gt;, took just one minute to run — all that without having to manually install any dependencies or build a CI/CD pipeline. To schedule that flow, you can add a simple CRON-based trigger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;everyMinute&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.core.models.triggers.types.Schedule&lt;/span&gt;
    &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*/1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;Next steps&lt;/h2&gt;

&lt;p&gt;This post covered how combining Airbyte, dbt, and Kestra can simplify data management. Using these open-core technologies, you can avoid vendor lock-in and get the most out of the Modern Data Stack. If you encounter anything unexpected while reproducing this demo, you can open &lt;a href="https://github.com/kestra-io/kestra" rel="noopener noreferrer"&gt;a GitHub issue&lt;/a&gt; or &lt;a href="https://kestra.io/slack" rel="noopener noreferrer"&gt;ask via Kestra Community Slack&lt;/a&gt;. Lastly, give us &lt;a href="https://github.com/kestra-io/kestra" rel="noopener noreferrer"&gt;a GitHub star&lt;/a&gt; if you like the project.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>data</category>
      <category>dbt</category>
      <category>airbyte</category>
    </item>
    <item>
      <title>Snowflake data pipeline with Kestra</title>
      <dc:creator>Ludovic DEHON</dc:creator>
      <pubDate>Wed, 05 Oct 2022 19:48:02 +0000</pubDate>
      <link>https://dev.to/kestra/snowflake-data-pipeline-with-kestra-4kbm</link>
      <guid>https://dev.to/kestra/snowflake-data-pipeline-with-kestra-4kbm</guid>
      <description>&lt;p&gt;The 21st century is the golden age of data. Today, modern data stacks offer numerous tools and workloads for data warehousing, data lakes, data analytics, and data engineering to manage massive amounts of data inflow. One such platform is Snowflake, a robust data cloud tool that streamlines many operations in a modern data stack.&lt;/p&gt;

&lt;p&gt;In this article, we'll discuss how Snowflake integrates with &lt;a href="https://github.com/kestra-io/kestra" rel="noopener noreferrer"&gt;Kestra&lt;/a&gt; (an &lt;strong&gt;open-source, scalable orchestration and scheduling platform&lt;/strong&gt;) using the Snowflake plugin to create a seamless and scalable data pipeline.&lt;/p&gt;

&lt;p&gt;First, let's discuss Snowflake briefly.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Snowflake?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sjz58zff2ivn9sfpc6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sjz58zff2ivn9sfpc6e.png" alt="Photo by Lidia Estaban on Unsplash"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Leveraged by thousands of organizations worldwide, &lt;a href="https://www.snowflake.com/" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt; is a top-of-the-line SaaS cloud data warehousing and data lake platform offering reliable data pipelines with high performance, concurrency, and scale across multiple cloud providers like AWS, Azure, and GCP.&lt;/p&gt;

&lt;p&gt;Snowflake is a unified cloud-native platform that enables teams to avoid data silos by sharing and collaborating on data pipelines securely. It offers a flexible &lt;a href="https://docs.snowflake.com/en/user-guide/intro-key-concepts.html" rel="noopener noreferrer"&gt;architecture&lt;/a&gt; with three components: robust cloud storage, a query processing engine, and a layer of cloud services with numerous third-party integrations.&lt;/p&gt;

&lt;p&gt;Snowflake is a &lt;strong&gt;multi-cloud, multi-cluster warehouse platform&lt;/strong&gt; that can &lt;em&gt;auto-scale&lt;/em&gt; to manage virtual warehouses dynamically. In the &lt;a href="https://docs.snowflake.com/en/user-guide/warehouses-multicluster.html#maximized-vs-auto-scale" rel="noopener noreferrer"&gt;&lt;em&gt;auto-scale&lt;/em&gt; mode&lt;/a&gt;, it starts and stops warehouses to govern cloud resource usage and their corresponding cost credits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is Snowflake Popular?
&lt;/h3&gt;

&lt;p&gt;In the last few years, Snowflake has gained huge traction. Today, more than 5,900 companies (as of March 2022) run their critical data pipelines on Snowflake.&lt;/p&gt;

&lt;p&gt;With Snowflake, companies don't have to pay for huge and expensive licenses for data warehousing tools. Unlike conventional data warehousing tools, Snowflake charges on a &lt;strong&gt;pay-as-you-go basis for storage and compute&lt;/strong&gt;, enabling small and medium-scale companies to leverage cloud data pipelines for improved data management.&lt;/p&gt;

&lt;p&gt;Now, companies can build &lt;strong&gt;scalable data workloads&lt;/strong&gt; that can perform strong data analysis on structured, unstructured, and semi-structured data to derive valuable business insights and make data-driven decisions. Additionally, &lt;a href="https://www.snowflake.com/data-marketplace/" rel="noopener noreferrer"&gt;Snowflake Data Marketplace&lt;/a&gt; allows customers to access numerous ready-to-query datasets, further reducing integration costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where does Kestra fit in?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4ud26o5hqgtce0dhr8k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4ud26o5hqgtce0dhr8k.png" alt="Photo by rodrigomullercwb on Pixabay"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Typically, data warehouse workloads are part of a bigger technology stack within an organization or across organizations. To streamline various operations, data pipelines should be orchestrated and scheduled. Data orchestration unifies data from disparate sources and transforms it to enable analytics. This is where &lt;a href="https://kestra.io/" rel="noopener noreferrer"&gt;Kestra&lt;/a&gt; jumps in.&lt;/p&gt;

&lt;p&gt;Kestra is a state-of-the-art, open-source, cloud-native platform built to orchestrate &amp;amp; schedule scalable data workflows, aiming to improve the productivity of DataOps teams. Kestra's orchestration platform can build, run, manage, and monitor all &lt;a href="https://kestra.io/docs/developer-guide/flowable/" rel="noopener noreferrer"&gt;kinds of complex workflows&lt;/a&gt; sequentially and in parallel. It offers numerous &lt;a href="https://kestra.io/plugins/" rel="noopener noreferrer"&gt;plugin integrations&lt;/a&gt; to build advanced data workflows, including our &lt;a href="https://kestra.io/plugins/plugin-jdbc-snowflake/" rel="noopener noreferrer"&gt;Snowflake plugin&lt;/a&gt;, discussed below.&lt;/p&gt;

&lt;p&gt;Moreover, Kestra can execute workflows based on event-based, time-based, and API-based scheduling, giving complete control to the data teams. With a unified end-to-end platform, companies save time and money on storage, compute, and development costs.&lt;/p&gt;
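
&lt;p&gt;As a minimal sketch of the time-based scheduling mentioned above (the trigger id and cron expression are placeholders), a &lt;em&gt;Schedule&lt;/em&gt; trigger can be attached to any flow to run it at a fixed interval:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;triggers:
  - id: daily
    type: io.kestra.core.models.triggers.types.Schedule
    cron: "0 6 * * *"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;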

&lt;p&gt;Snowflake already offers many cost optimization processes like data compression and auto-scaling. However, Kestra makes it simpler to &lt;a href="https://kestra.io/plugins/plugin-jdbc-snowflake/tasks/io.kestra.plugin.jdbc.snowflake.Download.html" rel="noopener noreferrer"&gt;download&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-jdbc-snowflake/tasks/io.kestra.plugin.jdbc.snowflake.Upload.html" rel="noopener noreferrer"&gt;upload&lt;/a&gt;, and &lt;a href="https://kestra.io/plugins/plugin-jdbc-snowflake/tasks/io.kestra.plugin.jdbc.snowflake.Query.html" rel="noopener noreferrer"&gt;query&lt;/a&gt; data by integrating with Snowflake's storage and compute resources. Let's discuss Kestra's Snowflake plugin in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dissecting Kestra's Snowflake Plugin
&lt;/h3&gt;

&lt;p&gt;Kestra's Snowflake plugin can create complex data pipelines, whether Snowflake is the final target or simply stores data as part of a larger workflow for downstream applications.&lt;/p&gt;

&lt;p&gt;Besides the Snowflake plugin, Kestra offers numerous JDBC plugin integrations, including &lt;a href="https://kestra.io/plugins/plugin-jdbc-clickhouse/" rel="noopener noreferrer"&gt;ClickHouse&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-jdbc-duckdb/" rel="noopener noreferrer"&gt;DuckDb&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-jdbc-mysql/" rel="noopener noreferrer"&gt;MySQL&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-jdbc-oracle/" rel="noopener noreferrer"&gt;Oracle&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-jdbc-pinot/" rel="noopener noreferrer"&gt;Apache Pinot&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-jdbc-postgres/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-jdbc-redshift/" rel="noopener noreferrer"&gt;Redshift&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-jdbc-rockset/" rel="noopener noreferrer"&gt;Rockset&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-jdbc-sqlserver/" rel="noopener noreferrer"&gt;SQL Server&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-jdbc-trino/" rel="noopener noreferrer"&gt;Trino&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-jdbc-vectorwise/" rel="noopener noreferrer"&gt;Vectorwise&lt;/a&gt;, and &lt;a href="https://kestra.io/plugins/plugin-jdbc-vertica/" rel="noopener noreferrer"&gt;Vertica&lt;/a&gt;. These plugins can effectively process and transform tabular data within relational databases, reducing the processing cost of platforms like Snowflake.&lt;/p&gt;

&lt;p&gt;Using Kestra, data teams are free to decide where certain data, plugins, and connectors fit best. As mentioned above, the Snowflake plugin can perform download, upload, and query tasks. Let's discuss each of them below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query Snowflake table
&lt;/h3&gt;

&lt;p&gt;Using this task, Kestra can query the Snowflake server to insert, update, and delete data. The &lt;em&gt;Query&lt;/em&gt; task offers numerous properties, including auto-committing SQL statements, different fetching operations, specifying access-control roles, and storing fetch results. When the &lt;code&gt;store&lt;/code&gt; property is &lt;code&gt;true&lt;/code&gt;, Kestra allows storing large results as an output of the &lt;em&gt;Query&lt;/em&gt; task.&lt;/p&gt;
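
&lt;p&gt;For instance, setting &lt;code&gt;store&lt;/code&gt; instead of &lt;code&gt;fetch&lt;/code&gt; writes the full result set to Kestra's internal storage rather than returning it inline; in this sketch, the connection details are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;id: select_large
type: io.kestra.plugin.jdbc.snowflake.Query
url: jdbc:snowflake://&amp;lt;account_identifier&amp;gt;.snowflakecomputing.com
username: snowflake
password: snowflake_passwd
sql: select * from source
store: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;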

&lt;p&gt;The plugin supports multiple SQL statements in the same transaction, executed as a full SQL script with isolation support. It also handles simple queries, fetching results with the &lt;code&gt;fetch&lt;/code&gt; or &lt;code&gt;fetchOne&lt;/code&gt; properties so teams can reuse the output in subsequent tasks, much like the outputs of &lt;a href="https://kestra.io/plugins/plugin-kafka/tasks/io.kestra.plugin.kafka.Consume.html" rel="noopener noreferrer"&gt;Kafka Consume&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-elasticsearch/tasks/io.kestra.plugin.elasticsearch.Search.html" rel="noopener noreferrer"&gt;Elastic Search&lt;/a&gt;, &lt;a href="https://kestra.io/plugins/plugin-mongodb/tasks/io.kestra.plugin.mongodb.Find.html" rel="noopener noreferrer"&gt;Mongo Find&lt;/a&gt;, and more. Some &lt;em&gt;Query&lt;/em&gt; task use cases are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Fetch a row from the database, and define multiple flows depending on the output&lt;/li&gt;
&lt;li&gt;  Fetch the count of a store and iterate through the list. If an item doesn't exist, perform a particular task.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following code snippet executes a query against the &lt;code&gt;source&lt;/code&gt; table and fetches the results so they are available as the task's output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;select&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.jdbc.snowflake.Query&lt;/span&gt;
&lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jdbc:snowflake://&amp;lt;account_identifier&amp;gt;.snowflakecomputing.com&lt;/span&gt;
&lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;snowflake&lt;/span&gt;
&lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;snowflake_passwd&lt;/span&gt;
&lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;select * from source&lt;/span&gt;
&lt;span class="na"&gt;fetch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Download from Snowflake stage
&lt;/h3&gt;

&lt;p&gt;As the name suggests, this task downloads data from a Snowflake stage to Kestra's internal storage, which is based on &lt;a href="https://amzn.github.io/ion-docs/" rel="noopener noreferrer"&gt;Amazon Ion&lt;/a&gt;. The &lt;a href="https://kestra.io/plugins/plugin-jdbc-snowflake/tasks/io.kestra.plugin.jdbc.snowflake.Download.html" rel="noopener noreferrer"&gt;Download&lt;/a&gt; task offers properties such as data compression and an access control role to streamline downloads from the connected database, and it outputs the URL of the downloaded file available on the Kestra storage server.&lt;/p&gt;

&lt;p&gt;The following code snippet downloads the file at the specified &lt;code&gt;fileName&lt;/code&gt; location from the stage to Kestra's internal storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;download"&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.kestra.plugin.jdbc.snowflake.Download"&lt;/span&gt;
&lt;span class="na"&gt;stageName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MYSTAGE&lt;/span&gt;
&lt;span class="na"&gt;fileName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prefix/destFile.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Upload to Snowflake stage
&lt;/h3&gt;

&lt;p&gt;This task uploads data to an internal &lt;a href="https://docs.snowflake.com/en/user-guide/data-load-considerations-stage.html" rel="noopener noreferrer"&gt;Snowflake stage&lt;/a&gt;. Similar to the &lt;em&gt;Download&lt;/em&gt; task, &lt;a href="https://kestra.io/plugins/plugin-jdbc-snowflake/tasks/io.kestra.plugin.jdbc.snowflake.Upload.html" rel="noopener noreferrer"&gt;Upload&lt;/a&gt; can perform data compression and set an access control role. Snowflake also supports &lt;a href="https://docs.snowflake.com/en/user-guide/data-load-transform.html" rel="noopener noreferrer"&gt;data transformation&lt;/a&gt; while loading data, which simplifies the ETL process.&lt;/p&gt;

&lt;p&gt;The following code snippet uploads data to the specified &lt;code&gt;fileName&lt;/code&gt; location.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upload"&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.kestra.plugin.jdbc.snowflake.Upload"&lt;/span&gt;
&lt;span class="na"&gt;stageName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MYSTAGE&lt;/span&gt;
&lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;testUploadStream&lt;/span&gt;
&lt;span class="na"&gt;fileName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;destFile.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Building Dynamic Data Workflows With Kestra
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9ya7sg9hp5q8zk5j7kr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9ya7sg9hp5q8zk5j7kr.png" alt="Kestra user interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kestra.io/" rel="noopener noreferrer"&gt;Kestra&lt;/a&gt; has quickly become an essential piece of the modern data stack. Kestra provides unprecedented flexibility and control to data teams. It can orchestrate any kind of workflow with ease using a rich UI that monitors all flows.&lt;/p&gt;

&lt;p&gt;Kestra's Snowflake plugin makes data warehousing simple. Even non-developers can write relevant flows using descriptive YAML. Your Snowflake storage pipeline accommodates raw data from multiple sources and transforms it using ETL operations. Additionally, you can skip the transformation and directly load data into the warehouse using the &lt;a href="https://kestra.io/blogs/2022-04-27-etl-vs-elt.html" rel="noopener noreferrer"&gt;ELT pipeline&lt;/a&gt;. Kestra can manage both workflows simultaneously. In any case, Kestra ensures that the data is readily available to perform analysis and learn valuable patterns.&lt;/p&gt;

&lt;p&gt;If you need any guidance with your Snowflake deployment, our experts at Kestra would love to hear from you. &lt;a href="https://api.kestra.io/v1/communities/slack/redirect" rel="noopener noreferrer"&gt;Let us know&lt;/a&gt; if you would like us to add more plugins to the &lt;a href="https://kestra.io/plugins" rel="noopener noreferrer"&gt;list&lt;/a&gt;. Or start building your &lt;a href="https://kestra.io/docs/plugin-developer-guide/" rel="noopener noreferrer"&gt;custom Kestra plugin&lt;/a&gt; today and send it our way. We always welcome &lt;a href="https://github.com/kestra-io/kestra" rel="noopener noreferrer"&gt;contributions&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>sql</category>
      <category>cloud</category>
      <category>news</category>
    </item>
    <item>
      <title>Debezium Change Data Capture without Kafka Connect</title>
      <dc:creator>Ludovic DEHON</dc:creator>
      <pubDate>Tue, 05 Apr 2022 11:12:21 +0000</pubDate>
      <link>https://dev.to/kestra/debezium-change-data-capture-without-kafka-connect-4i48</link>
      <guid>https://dev.to/kestra/debezium-change-data-capture-without-kafka-connect-4i48</guid>
      <description>&lt;p&gt;There are several products on the market to help you with your data pipeline management and data orchestration. Each handles data differently, and it can be difficult to determine the differences because many perform the same tasks, but with different methods. In this article, we will be discussing the differences between Debezium with Kafka Connect and our own product, &lt;a href="https://github.com/kestra-io/kestra" rel="noopener noreferrer"&gt;Kestra&lt;/a&gt; (an &lt;strong&gt;open-source orchestration and scheduling platform&lt;/strong&gt; with a robust set of plugins) and how they can integrate to provide a &lt;strong&gt;more efficient and cost-effective&lt;/strong&gt; hybrid solution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg3k30tbahw33w0cxxig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg3k30tbahw33w0cxxig.png" alt="Debezium architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, let’s discuss Debezium. &lt;a href="https://debezium.io/" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt; is an open-source change data capture platform from Red Hat, offering a set of distributed services that captures row-level changes in your databases so that connected applications can see and respond to those changes in real-time. All changes (row-level) are recorded in a transaction log, and each application simply reads the transaction logs that are pertinent to them.&lt;/p&gt;

&lt;p&gt;Debezium ensures &lt;strong&gt;exactly-once delivery&lt;/strong&gt; of all changes and ensures that changes arrive in the order they were sent.&lt;/p&gt;

&lt;p&gt;In other words, Debezium is essentially a modernized method of &lt;a href="https://en.wikipedia.org/wiki/Change_data_capture" rel="noopener noreferrer"&gt;Change Data Capture (CDC)&lt;/a&gt;.  CDC is the process of identifying and capturing changes made to data in a database, and then delivering those changes in real-time to a downstream process or system. Debezium allows the monitoring of changes in data from multiple sources and can monitor multiple databases of different types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-Time Delivery
&lt;/h2&gt;

&lt;p&gt;The key selling point of Debezium is the &lt;strong&gt;real-time delivery of data changes&lt;/strong&gt; whether from streaming sources or databases with heavy workloads. To support modern high-volume workloads, particularly streaming workloads, sources require constant monitoring, which means that connectors for Debezium must operate continuously. Debezium leverages &lt;a href="https://docs.confluent.io/platform/current/connect/index.html" rel="noopener noreferrer"&gt;Kafka Connect&lt;/a&gt; to establish this always-on connection, capturing data changes whenever they occur. This can consume a fair bit of bandwidth, as well as resources (CPU, memory) to process the events pushed through the pipeline.&lt;/p&gt;

&lt;p&gt;Debezium &lt;strong&gt;continuously monitors upstream databases&lt;/strong&gt;, and for each row-level change, produces a corresponding event that completely describes those changes. Because the Kafka Connect connectors operate continuously, and because events have to make sense even if the structure of the tables change over time, events can grow quite large. The larger and more complex an event, the more resources it requires.&lt;/p&gt;

&lt;p&gt;Add to this that even in the simplest Debezium deployment, there are &lt;strong&gt;at least two Kafka Connect connectors&lt;/strong&gt; running at any given time. One pulls data from the upstream source, whatever that may be, and the second pushes data changes out to various destinations (data warehouses, databases, applications, etc.). These connectors operate continuously, at a steady bandwidth, with &lt;strong&gt;dedicated processing and memory power&lt;/strong&gt; to ensure that data is both received and delivered as close to instantaneously as possible. When this capacity is warranted, Debezium is an efficient and performance-oriented machine, perfect for real-time CDC use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trouble in Paradise
&lt;/h2&gt;

&lt;p&gt;While Debezium offers clear advantages for real-time CDC use cases, it may not be efficient if the &lt;strong&gt;requirements are not strictly real-time&lt;/strong&gt;. Depending on the complexity of your deployment, the &lt;strong&gt;number of persistent connectors&lt;/strong&gt; (all Kafka Connect) required can become a drain on system resources. This is because for each connector that produces Debezium events into Kafka, a corresponding process consumes those same events from Kafka into the data warehouse. Each additional source therefore requires two Kafka Connect connectors, and each process consumes a set minimum amount of resources, regardless of traffic.&lt;/p&gt;

&lt;p&gt;For example, from the &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-connect-connectors.html" rel="noopener noreferrer"&gt;Amazon MSK Connect&lt;/a&gt; documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each MCU represents 1 vCPU of compute and 4 GiB of memory.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
  &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oqpo8wxhdhus9s3krl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oqpo8wxhdhus9s3krl5.png" alt="MSK connect pricing on aws"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;This leads to $160 per month for one source and one destination.&lt;/p&gt;

&lt;p&gt;Chances are, in a complex deployment, a fair number of sources may &lt;strong&gt;not generate much traffic&lt;/strong&gt;. It might only be a few rows per hour. For such a source, having a persistent, always-on process does not make much sense.&lt;/p&gt;

&lt;p&gt;This situation can be &lt;strong&gt;exacerbated by your company's ACLs&lt;/strong&gt;. Because you do not want every user in your organization consuming the entirety of your dataset, you may need to define fine-grained role-based access control, and these rules, once applied, can necessitate numerous additional Kafka Connect connectors, each requiring and competing for the same system resources.&lt;/p&gt;

&lt;p&gt;In short, the same features that make Debezium performant in streaming and high-volume scenarios can quickly become inefficient if the &lt;strong&gt;requirements are less stringent&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kestra and Micro-Batch to the Rescue
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kestra.io" rel="noopener noreferrer"&gt;Kestra&lt;/a&gt; is an orchestration and scheduling platform that is designed to simplify the building, running, scheduling, and monitoring of complex data pipelines. Data pipelines can be built in real-time, no matter how complex the workflow, and can connect to &lt;strong&gt;multiple resources&lt;/strong&gt; as needed (including Debezium).&lt;/p&gt;

&lt;p&gt;Real-time change data capture is an &lt;strong&gt;amazing accomplishment, and a valuable tool&lt;/strong&gt; to have in your toolbox, to be sure. But a racecar is not very useful on a highway, or in a school zone, and in the same way, real-time delivery of data changes is not required for every use case. A dashboard or KPI might only need to be &lt;strong&gt;refreshed once a day&lt;/strong&gt;, or every few hours, for example. Whether it is cloud services or on-premises, resources cost money, and the challenge is to make the &lt;strong&gt;most efficient use of resources&lt;/strong&gt; based on your requirements. Bandwidth, compute resources, or services based on throughput (such as BigQuery) are all expensive commodities, especially if they are always running.  If real-time performance is not necessary, then why waste money on resources you do not need?&lt;/p&gt;

&lt;p&gt;
  &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hapab2ij2lajwpqpb2c.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hapab2ij2lajwpqpb2c.gif" alt="waste money"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;Kestra is perfect for such situations and can scale performance up or down as needed from periodic updates to &lt;strong&gt;near-real-time scenarios&lt;/strong&gt;. This functionality is possible due to the use of batch or micro-batch processing. Batch processing sends data at intervals, rather than in real-time. It is typically used when data freshness &lt;strong&gt;is not a mission-critical issue&lt;/strong&gt;, and when you are working with large datasets and running complex algorithms that require a full dataset (sorting for example). &lt;strong&gt;Micro-batch processing&lt;/strong&gt; is a similar process but on much smaller data sets, typically about a minute or so’s worth of data. This allows for near-real-time processing of datasets and is perfect for low-flow situations where a few minutes of delay is acceptable. In many cases, micro-batch processing and stream processing are used interchangeably in data architecture descriptions, because, depending on configuration, they can offer nearly the same performance.&lt;/p&gt;

&lt;p&gt;A complex use case involving multiple data sources might have varying requirements, some real-time, some more forgiving. For these, a &lt;strong&gt;hybrid solution&lt;/strong&gt; might be advisable. Debezium can be used (with the Kafka Connect service) for those streams that require real-time CDC. For near-real-time or batch processing, you can leverage Kestra. Kestra can consume events directly (without configuring a Kafka Connect service) by leveraging the &lt;a href="https://debezium.io/documentation/reference/stable/development/engine.html" rel="noopener noreferrer"&gt;Debezium Engine&lt;/a&gt; and forward them to &lt;strong&gt;any destination supported&lt;/strong&gt; by Kestra (BigQuery, JDBC, Cloud Storage, and more), without a streaming pipeline. Changes/transfers can be scheduled at any interval: every 5 minutes, every hour, every day, whatever is required. &lt;a href="///plugins/plugin-debezium-mysql/triggers/io.kestra.plugin.debezium.mysql.Trigger.html"&gt;Triggers&lt;/a&gt; can also be used to create an execution whenever there is data available. Kestra can also be leveraged to transform data before sending it to the destination.&lt;/p&gt;
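
&lt;p&gt;As an illustrative sketch (the hostname and credentials are placeholders), such a trigger can start an execution whenever Debezium detects new changes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;triggers:
  - id: onChange
    type: io.kestra.plugin.debezium.mysql.Trigger
    hostname: 192.168.0.1
    username: root
    password: mysql_passwd
    maxDuration: "PT1M"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;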

&lt;p&gt;Debezium leverages Kafka and Kafka Connect to deliver streaming performance, and the larger and more complex the deployment, the more challenging it can be to deliver enterprise-grade 99.9% availability while &lt;strong&gt;still managing resources and costs&lt;/strong&gt;. Managing it in the cloud can be even more challenging. This involves picking the appropriate compute instance for the brokers, sizing the non-ephemeral storage accordingly, applying end-to-end security, ensuring high availability across availability zones, and more. These same challenges apply when Kafka is a component of another service as well - there is a reason that many organizations turn to managed services rather than deploying their own instance on-premise. There are challenges to visibility as well - users must be familiar with Kafka eccentricities to troubleshoot issues via logs and dashboards.&lt;/p&gt;

&lt;p&gt;By leveraging Kestra for &lt;strong&gt;near-real-time or batch workloads&lt;/strong&gt;, and Debezium for streaming, some of the advantages are obvious. This allows for a solution that leverages &lt;strong&gt;only the resources required&lt;/strong&gt; for the use case in question, rather than applying resource-intensive streaming resources to every process. For workflows that are not real-time, CPU and memory resources are limited or shut down when not in use. Services that charge based on the throughput, such as BigQuery, are &lt;strong&gt;only charged when in use&lt;/strong&gt;. All of this combines to create an efficient solution that wastes no resources and &lt;strong&gt;saves money&lt;/strong&gt;. All row-level changes are still captured, and with Debezium, a built-in feature snapshots the database on the first start.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj4m2lbj8eri07i3acnd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj4m2lbj8eri07i3acnd.png" alt="Kestra Topology"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But there are less obvious advantages to adding Kestra to the mix. Pipelines &lt;strong&gt;are visually presented&lt;/strong&gt;, dependencies are &lt;strong&gt;continuously monitored&lt;/strong&gt;, and you can see exactly where in a data pipeline a problem lies. This monitoring capacity provides a great deal of peace of mind when managing different data flow requirements, and mitigates the complexity of clustered Kafka deployments (such as those that are part of more complex Debezium deployments).&lt;/p&gt;

&lt;p&gt;Changes to pipelines can also be iterative with Kestra. A working data pipeline can be modified on the fly with a few lines of YAML, adding new components and integrations without disrupting a working flow. A new data pipeline can be &lt;strong&gt;applied in minutes&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;debezium-mysql&lt;/span&gt;
&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;com.kestra.lde&lt;/span&gt;

&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;capture&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.debezium.mysql.Capture&lt;/span&gt;
    &lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;192.168.0.1&lt;/span&gt;
    &lt;span class="na"&gt;maxDuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PT1M"&lt;/span&gt;
    &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mysql_passwd&lt;/span&gt;
    &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;root&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fileTransform&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.scripts.nashorn.FileTransform&lt;/span&gt;
    &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;outputs.capture.uris.users&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
    &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;if (row['contactName']) {&lt;/span&gt;
        &lt;span class="s"&gt;row['contactName'] = "*".repeat(row['contactName'].length);&lt;/span&gt;
      &lt;span class="s"&gt;}&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jsonWriter&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.serdes.json.JsonWriter&lt;/span&gt;
    &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;outputs.fileTransform.uri&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;load&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.gcp.bigquery.Load&lt;/span&gt;
    &lt;span class="na"&gt;destinationTable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-project.demo.users&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JSON&lt;/span&gt;
    &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{outputs.jsonWriter.uri&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
    &lt;span class="na"&gt;writeDisposition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WRITE_APPEND&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
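
&lt;p&gt;To make the capture above near-real-time, the flow can be completed with a schedule trigger that launches it every minute, matching the &lt;code&gt;PT1M&lt;/code&gt; capture window. This is an illustrative sketch; the trigger type name and cron expression should be checked against the documentation for your Kestra version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;triggers:
  - id: every-minute
    type: io.kestra.core.models.triggers.types.Schedule
    # each run captures up to one minute of changes (maxDuration: PT1M)
    cron: "* * * * *"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;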



&lt;p&gt;Kestra’s flexibility is key to this solution and many others. With its &lt;a href="https://dev.to/plugins/"&gt;numerous plugins&lt;/a&gt;, Kestra offers deep integrations with &lt;strong&gt;multiple systems&lt;/strong&gt; to create complex workflows. Systems without existing plugins can be integrated through containers, using Docker or Kubernetes tasks. The only limit is your imagination. We hope to highlight many such possibilities in the coming weeks.&lt;/p&gt;
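
&lt;p&gt;As a sketch of the container route, a system without a dedicated plugin can be driven from a Bash task running inside a Docker image. Everything here is illustrative: the image name and script are hypothetical, and the exact task options should be checked against the Bash task documentation for your Kestra version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- id: custom-tool
  type: io.kestra.core.tasks.scripts.Bash
  runner: DOCKER
  dockerOptions:
    image: my-registry/my-custom-tool:latest  # hypothetical image
  commands:
    - ./run-export.sh  # hypothetical entry script shipped in the image
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;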

&lt;p&gt;Our newly created Debezium plugins include connectors for &lt;a href="https://kestra.io/plugins/plugin-debezium-postgres/" rel="noopener noreferrer"&gt;Postgres&lt;/a&gt; and &lt;a href="https://kestra.io/plugins/plugin-debezium-mysql/" rel="noopener noreferrer"&gt;MySQL&lt;/a&gt;, and we are steadily working to include additional connectors to improve our product. Our current roadmap includes connectors for MongoDB, Oracle, SQL Server, Cassandra, and more. Which connectors do you think we should work on first? We’d love to hear from you!&lt;/p&gt;

&lt;p&gt;Give us your opinion on this Twitter poll&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-1511296199025991680-388" src="https://platform.twitter.com/embed/Tweet.html?id=1511296199025991680"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>etl</category>
    </item>
    <item>
      <title>Kestra, infinitely scalable open source orchestration and scheduling platform.</title>
      <dc:creator>Ludovic DEHON</dc:creator>
      <pubDate>Wed, 02 Feb 2022 12:57:29 +0000</pubDate>
      <link>https://dev.to/kestra/kestra-infinitely-scalable-open-source-orchestration-and-scheduling-platform-6p2</link>
      <guid>https://dev.to/kestra/kestra-infinitely-scalable-open-source-orchestration-and-scheduling-platform-6p2</guid>
      <description>&lt;p&gt;Today, our team is proud to announce the first public release of Kestra, an open-source platform to orchestrate &amp;amp; schedule any kind of workflow at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kestra?
&lt;/h2&gt;

&lt;p&gt;Kestra is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;an orchestrator&lt;/strong&gt;: Build a complex pipeline in a couple of minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;a scheduler&lt;/strong&gt;: Launch your flows whenever you need!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;a rich UI&lt;/strong&gt;: Create, run, and monitor all your flows with a real-time user interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;a data orchestrator&lt;/strong&gt;: With its many plugins, build your data orchestration directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cloud native &amp;amp; scalable&lt;/strong&gt;: Scale to millions of executions without stress or hassle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;an all-in-one platform&lt;/strong&gt;: No need to use multiple tools to deliver a complete pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;a pluggable platform&lt;/strong&gt; with the option to choose from several plugins or to build your own.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, Kestra will handle &lt;strong&gt;all your pipelines&lt;/strong&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  The History of Kestra!
&lt;/h2&gt;

&lt;p&gt;Kestra started in 2019 with this &lt;a href="https://github.com/kestra-io/kestra/commit/d57e30c0c0d450590a1eaac5df0e82e1ea94e562"&gt;initial commit&lt;/a&gt;. At this time, Kestra was at the proof-of-concept stage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1w6FHqH1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/owxa3qcb970c99u42ih4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1w6FHqH1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/owxa3qcb970c99u42ih4.jpg" alt="Initial commit" width="779" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To provide a bit of background: I was working for Leroy Merlin as a consultant. We needed to build a new cloud-based data platform from scratch (targeting mostly Google Cloud Platform). We tried a &lt;a href="https://kestra.io/blogs/2022-02-01-leroy-merlin-usage-kestra"&gt;lot of things&lt;/a&gt; and failed with some of our attempts. The &lt;strong&gt;biggest setback was the orchestration&lt;/strong&gt; software that we tried to deliver with Apache Airflow: a lot of instability (tasks that failed simply due to the Airflow scheduler), performance issues (unable to handle even a light workload), and a lack of features (scaling, data processing). After many tests (Google Cloud Composer, open source Airflow on Kubernetes), the decision was final: &lt;strong&gt;Airflow was rejected by Leroy Merlin&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I did some research on the orchestrator ecosystem: most options are &lt;strong&gt;proprietary and license-based&lt;/strong&gt; (far from my mindset), and few are open source (at the time, only Apache Airflow seemed to be active, and it had been rejected). I was really surprised by this discovery, and then faced this challenge from a co-worker:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you think Airflow is bad, do better!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was decided: I set myself the task of producing a proof of concept for our own open-source workflow management system. It took a lot of time to build this software, and the task seemed to be never ending; but I continued to work on it for several months by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choosing &lt;a href="https://github.com/kestra-io/kestra/commit/b4d026574c2fb141a3c7dd5b7f1481a31063acb2"&gt;Kafka as database and queue&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Implementing &lt;a href="https://github.com/kestra-io/kestra/commit/bcc5798d7fdcbe3afe95c019c41ddc546b24f62d"&gt;storage&lt;/a&gt; for task processing&lt;/li&gt;
&lt;li&gt;Choosing &lt;a href="https://github.com/kestra-io/kestra/commit/2ede1e692be50999bc16f011f6a4796ffbbb9e1a"&gt;ElasticSearch as a repository for UI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Adding some dynamic templating with &lt;a href="https://github.com/kestra-io/kestra/commit/05f1e20a3cb1e9a623024f5674144b3934cd5874"&gt;Handlebars&lt;/a&gt;, later switching to Pebble&lt;/li&gt;
&lt;li&gt;Starting some &lt;a href="https://github.com/kestra-io/kestra/commit/14e3384be2144a2bf6698439b5ae22106ac83914"&gt;Google Cloud&lt;/a&gt; plugins&lt;/li&gt;
&lt;li&gt;Introducing &lt;a href="https://github.com/kestra-io/kestra/commit/1fef7509bb2d04b24bf66fce19b35dd01411a1db"&gt;the UI&lt;/a&gt; — built with &lt;a href="https://vuejs.org/"&gt;Vue.js&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so on!&lt;/p&gt;

&lt;p&gt;During a thirty-month period I built a variety of features, numerous plugins, and countless bug fixes — mostly during the night as I was still working as a full-time consultant for Leroy Merlin. It took a lot of effort, investment, and time that I could have spent with my family.&lt;/p&gt;

&lt;p&gt;But now we are really proud of what we’ve achieved!&lt;/p&gt;

&lt;h2&gt;
  
  
  Kestra is Open Source!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zarW33Gm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cmdn3zykqtl7unoduwj5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zarW33Gm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cmdn3zykqtl7unoduwj5.gif" alt="Kestra user interface" width="600" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm a real open-source enthusiast. As an architect, I’ve been interested in open source solutions in IT for twenty years. I started as an open source consumer (using it without contributing, as most users do). With Kestra, I decided the time was right to give back, releasing it under the permissive &lt;a href="https://github.com/kestra-io/kestra/blob/develop/LICENSE"&gt;Apache License&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Three years ago, I started another open source project, &lt;a href="https://github.com/tchiotludo/akhq"&gt;AKHQ&lt;/a&gt;, with the same license. Working with a successful project was an invaluable experience for me as I was able to learn how to build a community around a project. I've also learnt that an open source system won't pay the bills on its own. AKHQ required a lot of personal investment; Kestra has required a lot more and will continue to do so in the future! This means you will have to ensure that you have the financial resources in place to enable your project to be viable and sustainable — we decided to create a company alongside Kestra in order to raise the required funds to support the development of the open source software.&lt;/p&gt;

&lt;p&gt;The open source license is not restrictive: it allows you to install and run Kestra however you want, on-premise or in your cloud. We have also built our &lt;strong&gt;Enterprise Edition&lt;/strong&gt;, bringing added security and productivity to your Kestra clusters. In addition, we plan to deliver Kestra as software as a service in the near future (don't hesitate to &lt;a href="https://kestra.io/company/contact"&gt;contact us&lt;/a&gt; for more information).&lt;/p&gt;

&lt;h2&gt;
  
  
  Kestra Plugins are also Open Source!
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M0PCKt2m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/msf5hsko62kins3t8rg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M0PCKt2m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/msf5hsko62kins3t8rg9.png" alt="Some plugins" width="800" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When implementing deep integration with the tools and databases you use, the connectors (what we call “plugins”) can present the biggest challenge. Most orchestrators (even proprietary, license-based ones) only speak bash or cmd: you have to manage all of your needs with simple commands, often requiring another tool (such as Talend) to access the underlying resource. With Kestra, we want deep integration with your tools, and let &lt;a href="https://kestra.io/plugins/core/tasks/scripts/io.kestra.core.tasks.scripts.Bash"&gt;bash&lt;/a&gt; deal solely with the edge cases a plugin can't cover.&lt;/p&gt;

&lt;p&gt;An example for a query to Google BigQuery:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;with Bash&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;DATE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;--iso-8601&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;seconds&lt;span class="si"&gt;)&lt;/span&gt;
bq &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;json query &lt;span class="s1"&gt;'SELECT name FROM \`project.dataset.table\` WHERE shippedDate=${DATE} AND shippedCountry = \'&lt;/span&gt;FR&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt;' &amp;gt; /tmp/query.json
jq -r '&lt;/span&gt;.name&lt;span class="s1"&gt;' /tmp/query.json
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;with Kestra&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;query&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.gcp.bigquery.Query&lt;/span&gt;
  &lt;span class="na"&gt;fetchOne&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;SELECT name&lt;/span&gt;
    &lt;span class="s"&gt;FROM `kestra-prd.demo.salesOrder` AS s&lt;/span&gt;
    &lt;span class="s"&gt;WHERE shippedDate = '{{ now() }}'&lt;/span&gt;
    &lt;span class="s"&gt;AND shippedCountry = 'FR'&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;return"&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.kestra.core.tasks.debugs.Return"&lt;/span&gt;
  &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;outputs.query.row.name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kestra avoids the rigmarole of installing the software on the system, handling dependencies and conflicts, dealing with Python, etc. — just install a plugin (a simple jar) and speak directly with your database.&lt;/p&gt;

&lt;p&gt;We have a &lt;a href="https://kestra.io/plugins/"&gt;number of plugins&lt;/a&gt; and the process of &lt;a href="https://kestra.io/docs/plugin-developer-guide/"&gt;developing your own&lt;/a&gt; is very simple. We also hope that a community will help us to maintain new plugins/connectors (&lt;a href="https://kestra.io/company/contact"&gt;contact us&lt;/a&gt; if you require help or support).&lt;/p&gt;

&lt;h2&gt;
  
  
  First Public Release &lt;em&gt;and&lt;/em&gt; Production Ready!
&lt;/h2&gt;

&lt;p&gt;First public release doesn't mean that Kestra is not production ready. In fact, it has been &lt;strong&gt;used in production since August 2020 at Leroy Merlin&lt;/strong&gt; — take a deeper look at the &lt;a href="https://kestra.io/blogs/2022-02-01-leroy-merlin-usage-kestra"&gt;case study&lt;/a&gt; if you want more detail. Here are some figures to give a picture of Kestra’s credentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4 clusters&lt;/strong&gt;, one for every environment&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;200+ users/developers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2000+ flows&lt;/strong&gt; in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;350,000 executions&lt;/strong&gt; every month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3,000,000 tasks&lt;/strong&gt; every month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Equivalent of 1,500 days of task processing time&lt;/strong&gt; every month (yeah, that’s the equivalent of fifty days of task processing every single day)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, your next question is: &lt;strong&gt;why did you wait so long for the first public release?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer is simple: we wanted to make the best possible first impression, and that meant a lot of work on missing features, missing plugins, a new UI design, and polishing the documentation and website. Now we are proud of our product and confident enough to show the result of our labor.&lt;/p&gt;

&lt;p&gt;The road is not finished; we still have a lot to do. Stay tuned for the journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  Want to keep in touch, or need more information?
&lt;/h2&gt;

&lt;p&gt;⭐️ star us on GitHub: &lt;a href="https://github.com/kestra-io/kestra"&gt;https://github.com/kestra-io/kestra&lt;/a&gt;&lt;br&gt;
⏩ retweet us: &lt;a href="https://twitter.com/kestra_io"&gt;https://twitter.com/kestra_io&lt;/a&gt;&lt;br&gt;
🗣️ chat with us on Slack: &lt;a href="https://api.kestra.io/v1/communities/slack/redirect"&gt;https://api.kestra.io/v1/communities/slack/redirect&lt;/a&gt;&lt;br&gt;
🧑‍🤝‍🧑 follow us on LinkedIn: &lt;a href="https://www.linkedin.com/company/kestra"&gt;https://www.linkedin.com/company/kestra&lt;/a&gt;&lt;br&gt;
📧 contact us by email: &lt;a href="https://kestra.io/company/contact.html"&gt;https://kestra.io/company/contact.html&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>data</category>
      <category>etl</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
