<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Smriti S</title>
    <description>The latest articles on DEV Community by Smriti S (@smritisatyan).</description>
    <link>https://dev.to/smritisatyan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1115114%2F4714166b-ae24-4939-ba6b-18bd4dada5b2.png</url>
      <title>DEV Community: Smriti S</title>
      <link>https://dev.to/smritisatyan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/smritisatyan"/>
    <language>en</language>
    <item>
      <title>Understand security, scalability and decentralization in Blockchain</title>
      <dc:creator>Smriti S</dc:creator>
      <pubDate>Sat, 20 Sep 2025 11:04:13 +0000</pubDate>
      <link>https://dev.to/smritisatyan/understand-security-scalability-and-decentralization-in-blockchain-nkk</link>
      <guid>https://dev.to/smritisatyan/understand-security-scalability-and-decentralization-in-blockchain-nkk</guid>
      <description>&lt;p&gt;If you are a blockchain designer, you would have faced a fundamental challenge: how to balance security, scalability, and decentralization. This is also known as the Blockchain Trilemma, a term popularized by Ethereum co-founder Vitalik Buterin.&lt;/p&gt;

&lt;p&gt;This blog describes the three pillars, along with methods to solve the trilemma.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c82eoqhpqkeg4mvppe7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1c82eoqhpqkeg4mvppe7.png" alt="trilemma" width="800" height="929"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Pillars of the Trilemma
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Security
&lt;/h3&gt;

&lt;p&gt;A blockchain must resist attacks and ensure the integrity of data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transactions should be immutable once confirmed.&lt;/li&gt;
&lt;li&gt;Malicious actors should not be able to double-spend or alter the history of transactions.&lt;/li&gt;
&lt;li&gt;Consensus mechanisms (Proof of Work, Proof of Stake, etc.) are designed to make attacks costly and impractical.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Security is non-negotiable; without it, trust in the system collapses.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Scalability
&lt;/h3&gt;

&lt;p&gt;Scalability refers to the ability of a blockchain to handle high transaction throughput with low latency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bitcoin handles ~7 transactions per second.&lt;/li&gt;
&lt;li&gt;Ethereum handles ~15–30 transactions per second in its base layer.&lt;/li&gt;
&lt;li&gt;Compare this with Visa, which can process thousands of transactions per second.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scalability is critical if blockchains want to support mass adoption for payments, gaming, or enterprise applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Decentralization
&lt;/h3&gt;

&lt;p&gt;Decentralization ensures no single party controls the system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nodes are distributed globally, making censorship difficult.&lt;/li&gt;
&lt;li&gt;Anyone can participate in validating transactions (the network is permissionless).&lt;/li&gt;
&lt;li&gt;Power is spread among users, not concentrated in a few hands.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The more decentralized a system, the harder it is to shut it down or manipulate it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is it a trilemma?
&lt;/h2&gt;

&lt;p&gt;The challenge is that achieving all three pillars at once is extremely difficult:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you maximize &lt;strong&gt;decentralization&lt;/strong&gt; and &lt;strong&gt;security&lt;/strong&gt; (like Bitcoin), you often sacrifice &lt;strong&gt;scalability&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If you prioritize &lt;strong&gt;scalability&lt;/strong&gt; and &lt;strong&gt;security&lt;/strong&gt; (like some private/permissioned blockchains), you reduce &lt;strong&gt;decentralization&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If you chase &lt;strong&gt;decentralization&lt;/strong&gt; and &lt;strong&gt;scalability&lt;/strong&gt;, you may compromise &lt;strong&gt;security&lt;/strong&gt; due to weaker validation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, most blockchain designs can excel at two dimensions but must compromise on the third.&lt;/p&gt;

&lt;h2&gt;
  
  
  How can you solve the trilemma?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Layer 2 Solutions&lt;/strong&gt;&lt;br&gt;
    - Rollups (Optimistic, ZK-Rollups) move transactions off-chain and settle on the main chain. For example: Arbitrum, zkSync.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Sharding&lt;/strong&gt;&lt;br&gt;
    - Splitting the blockchain into smaller “shards” that process subsets of transactions in parallel. For example, Ethereum’s upcoming roadmap includes sharding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Hybrid Consensus Models&lt;/strong&gt;&lt;br&gt;
    - Combining Proof of Stake with other mechanisms for efficiency. For example, Polkadot’s nominated Proof of Stake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Sidechains and App-Specific Chains&lt;/strong&gt;&lt;br&gt;
    - Specialized chains connected to a main network via bridges. For example, Polygon, Cosmos zones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For developers, architects, and decision-makers, the trilemma isn’t just theoretical. It directly impacts the user experience (slow or costly transactions), system resilience (how hard it is to attack), and governance (who really controls the network).&lt;/p&gt;

&lt;p&gt;Every blockchain project implicitly makes trade-offs. Understanding the trilemma helps you evaluate whether a platform is suited to your use case or not.&lt;/p&gt;

</description>
      <category>blockchain</category>
      <category>security</category>
      <category>web3</category>
    </item>
    <item>
      <title>Event-Driven Architecture with Blockchain: Use Kafka/MSK and Blockchain Logs</title>
      <dc:creator>Smriti S</dc:creator>
      <pubDate>Sat, 20 Sep 2025 10:51:47 +0000</pubDate>
      <link>https://dev.to/smritisatyan/event-driven-architecture-with-blockchain-use-kafkamsk-and-blockchain-logs-f5n</link>
      <guid>https://dev.to/smritisatyan/event-driven-architecture-with-blockchain-use-kafkamsk-and-blockchain-logs-f5n</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9toxtiukoqulpbq55uki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9toxtiukoqulpbq55uki.png" alt="blockchain" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The intersection of event-driven architecture (EDA), built on platforms such as Kafka, and blockchain results in trusted events with scalable distribution. In this blog, you will understand the importance of creating and using event-driven systems that connect decentralized data with enterprise infrastructure.&lt;/p&gt;

&lt;p&gt;If you are building real-time dashboards, supply-chain apps, or fintech services, this blog is for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does Event-Driven Architecture Matter?
&lt;/h2&gt;

&lt;p&gt;In an event-driven architecture setup, systems communicate via events: immutable records of something that happened. Instead of constantly polling for changes, services react as soon as an event is published.&lt;/p&gt;

&lt;p&gt;For example, consider the following chain of events.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A payment service emits an “OrderPaid” event.&lt;/li&gt;
&lt;li&gt;A shipping service reacts to it and dispatches the package.&lt;/li&gt;
&lt;li&gt;An analytics service consumes the same event for reporting.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This decoupled model provides scalability, real-time processing, and extensibility.&lt;/p&gt;
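&lt;p&gt;The decoupled model above can be sketched with a minimal in-memory event bus. This is a toy illustration of the pattern, not Kafka; the topic and field names are hypothetical:&lt;/p&gt;

```python
from collections import defaultdict

# Minimal publish/subscribe bus: producers emit events by topic,
# and every subscribed handler reacts independently.
class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.handlers[topic]:
            handler(event)

bus = EventBus()
shipped, reported = [], []

# Shipping and analytics both react to the same "OrderPaid" event,
# without the payment service knowing that either of them exists.
bus.subscribe("OrderPaid", lambda e: shipped.append(e["order_id"]))
bus.subscribe("OrderPaid", lambda e: reported.append(e["order_id"]))

bus.publish("OrderPaid", {"order_id": 42, "amount": 99.0})
```

&lt;p&gt;Kafka plays the role of the bus here, with durable, replayable topics instead of an in-process dictionary.&lt;/p&gt;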

&lt;h2&gt;
  
  
  Blockchain as an Event Source
&lt;/h2&gt;

&lt;p&gt;You can view a blockchain as a stream of ordered events (or transactions). Every transaction represents an action, such as transferring tokens or updating a smart contract.&lt;/p&gt;

&lt;p&gt;But blockchains don’t push data outward; they expect clients to poll. This is where event-streaming platforms come in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Event-Streaming Platform
&lt;/h2&gt;

&lt;p&gt;Apache Kafka is a distributed event-streaming platform built for high-throughput, fault-tolerant event pipelines. Amazon MSK (Managed Streaming for Kafka) provides the same power without the ops overhead.&lt;/p&gt;

&lt;p&gt;By integrating blockchain logs with Kafka/MSK, you can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture on-chain events as Kafka topics.&lt;/li&gt;
&lt;li&gt;Stream these events to consumers in real time.&lt;/li&gt;
&lt;li&gt;Process them with stream processors (Kafka Streams, Flink, ksqlDB).&lt;/li&gt;
&lt;li&gt;Fan these events out to multiple downstream systems (databases, microservices, analytics tools).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An example flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A smart contract emits a transfer event.&lt;/li&gt;
&lt;li&gt;A Web3 listener service subscribes to these logs.&lt;/li&gt;
&lt;li&gt;The listener publishes the events into a Kafka topic named &lt;code&gt;token-transfers&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Downstream services consume the topic:

&lt;ul&gt;
&lt;li&gt;Analytics service updates dashboards.&lt;/li&gt;
&lt;li&gt;Notification service sends alerts to users.&lt;/li&gt;
&lt;li&gt;Fraud detection pipeline checks anomalies.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
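&lt;p&gt;Steps 2 and 3 of this flow can be sketched in Python; a hedged example assuming web3.py and kafka-python are installed, where the node URL, contract address, and topic name are placeholders:&lt;/p&gt;

```python
import json

def to_kafka_record(log: dict) -> tuple:
    """Map a raw blockchain log to a Kafka (key, value) pair.
    Keying by transaction hash keeps all events of one transaction
    in the same partition, preserving their relative order."""
    key = log["transactionHash"].encode()
    value = json.dumps({"block": log["blockNumber"],
                        "data": log["data"]}).encode()
    return key, value

def run_listener():
    # Requires: pip install web3 kafka-python (not run here).
    from web3 import Web3
    from kafka import KafkaProducer

    w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))   # node URL (assumed)
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Fetch logs emitted by a (hypothetical) token contract and forward them.
    logs = w3.eth.get_logs({"fromBlock": "latest",
                            "address": "0xYourTokenContract"})
    for log in logs:
        key, value = to_kafka_record({
            "transactionHash": log["transactionHash"].hex(),
            "blockNumber": log["blockNumber"],
            "data": str(log["data"]),
        })
        producer.send("token-transfers", key=key, value=value)
```

&lt;p&gt;Downstream consumers then subscribe to &lt;code&gt;token-transfers&lt;/code&gt; without knowing anything about the listener or the chain.&lt;/p&gt;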

&lt;p&gt;&lt;strong&gt;Benefits of integrating blockchain logs with Kafka/MSK&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scalability: Kafka/MSK can handle millions of events per second.&lt;/li&gt;
&lt;li&gt;Decoupling: Producers (blockchain) don’t need to know about consumers.&lt;/li&gt;
&lt;li&gt;Replay: Kafka retains events, so new consumers can catch up from history.&lt;/li&gt;
&lt;li&gt;Real-time insights: Blockchain + Kafka enables instant reaction to on-chain activity.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Challenges&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Event ordering: Kafka guarantees order within partitions, but you must partition carefully (for example, based on transaction hash).&lt;/li&gt;
&lt;li&gt;Data volume: Popular blockchains generate high-volume logs; filtering is critical.&lt;/li&gt;
&lt;li&gt;Latency: Blockchain finality (the point at which a transaction becomes permanent and irreversible) introduces delay; on Ethereum, new blocks arrive roughly every 12 seconds, and full finality takes several minutes.&lt;/li&gt;
&lt;li&gt;Security: Ensure Kafka/MSK pipelines don’t become central trust bottlenecks.&lt;/li&gt;
&lt;/ol&gt;
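&lt;p&gt;For challenge 1, ordering is typically preserved by deriving the partition from a stable key; a small sketch (the hash value and partition count below are illustrative):&lt;/p&gt;

```python
import hashlib

def partition_for(tx_hash: str, num_partitions: int) -> int:
    """Deterministically map a transaction hash to a Kafka partition.
    All events carrying the same transaction hash land in the same
    partition, so Kafka's per-partition ordering applies to them."""
    digest = hashlib.sha256(tx_hash.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

p = partition_for("0xdeadbeef", 12)
```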

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Overall, weigh the benefits and challenges described above before choosing to integrate blockchain with an event-driven service.&lt;/p&gt;

</description>
      <category>blockchain</category>
      <category>kafka</category>
      <category>msk</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>Introduction to Geth (Go Ethereum)</title>
      <dc:creator>Smriti S</dc:creator>
      <pubDate>Tue, 15 Jul 2025 11:40:24 +0000</pubDate>
      <link>https://dev.to/smritisatyan/introduction-to-geth-go-ethereum-2hk</link>
      <guid>https://dev.to/smritisatyan/introduction-to-geth-go-ethereum-2hk</guid>
      <description>&lt;p&gt;This topic provides a technical overview of Geth for users getting accustomed to protocol-level infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Familiarity with the command line.&lt;/li&gt;
&lt;li&gt;Basic understanding of Ethereum and the Ethereum mainnet.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Geth?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Geth&lt;/strong&gt; (short for &lt;strong&gt;Go Ethereum&lt;/strong&gt;) is an open-source Ethereum execution client maintained by the Ethereum Foundation. Written in the Go programming language, Geth implements the Ethereum protocol and provides robust communication between the client and the Ethereum network.&lt;/p&gt;

&lt;p&gt;Running Geth turns your machine into a full Ethereum node, capable of syncing with the blockchain and participating in the network.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;After Ethereum’s transition to &lt;strong&gt;Proof of Stake (PoS)&lt;/strong&gt;, each Ethereum node is now composed of two clients:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;strong&gt;execution client&lt;/strong&gt; (like Geth)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;consensus client&lt;/strong&gt; (like Prysm or Lighthouse)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Functionalities of Geth
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Transaction Execution and EVM Support
&lt;/h3&gt;

&lt;p&gt;As an &lt;strong&gt;execution client&lt;/strong&gt;, Geth handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing transactions&lt;/li&gt;
&lt;li&gt;Managing blockchain state&lt;/li&gt;
&lt;li&gt;Executing smart contracts&lt;/li&gt;
&lt;li&gt;Supporting the Ethereum Virtual Machine (EVM)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ An execution client must be paired with a consensus client to form a complete Ethereum node.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Staking and PoS Support
&lt;/h3&gt;

&lt;p&gt;Paired with a consensus client, Geth helps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validate blocks and manage staking operations&lt;/li&gt;
&lt;li&gt;Enforce protocol rules (e.g., signatures, balances)&lt;/li&gt;
&lt;li&gt;Store a complete copy of the Ethereum blockchain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PoS reduces energy consumption compared to the PoW model, and Geth keeps its copy of the blockchain history up to date.&lt;/p&gt;

&lt;h3&gt;
  
  
  State Management
&lt;/h3&gt;

&lt;p&gt;Geth maintains the &lt;strong&gt;state trie&lt;/strong&gt; — a data structure that tracks the current state of all accounts and smart contracts. It ensures state consistency by updating this trie as new transactions are processed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;p&gt;Geth uses a &lt;strong&gt;peer-to-peer (P2P)&lt;/strong&gt; protocol to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discover and connect with peers&lt;/li&gt;
&lt;li&gt;Propagate transactions and blocks&lt;/li&gt;
&lt;li&gt;Synchronize with the rest of the Ethereum network&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Does Geth Work?
&lt;/h2&gt;

&lt;p&gt;Geth is executed via the &lt;strong&gt;command line&lt;/strong&gt; and enables you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start/manage Ethereum nodes&lt;/li&gt;
&lt;li&gt;Communicate with the blockchain&lt;/li&gt;
&lt;li&gt;Interact with smart contracts&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  JSON-RPC Communication
&lt;/h3&gt;

&lt;p&gt;Geth communicates with consensus clients and developer tooling via &lt;strong&gt;JSON-RPC APIs&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consensus clients request &lt;strong&gt;execution payloads&lt;/strong&gt; from Geth&lt;/li&gt;
&lt;li&gt;Developers use &lt;code&gt;web3.js&lt;/code&gt; or &lt;code&gt;web3.py&lt;/code&gt; libraries to send transactions, query balances, deploy contracts, etc.&lt;/li&gt;
&lt;/ul&gt;
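&lt;p&gt;Under the hood, these libraries send plain JSON-RPC requests over HTTP. A standard-library-only sketch, assuming a local node started with HTTP-RPC enabled (for example, &lt;code&gt;geth --http&lt;/code&gt;, which listens on port 8545 by default); the request shown is the standard &lt;code&gt;eth_blockNumber&lt;/code&gt; call:&lt;/p&gt;

```python
import json
import urllib.request

def rpc_request(method: str, params=None, request_id: int = 1) -> bytes:
    """Build a JSON-RPC 2.0 request body for an Ethereum node."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params or [],
        "id": request_id,
    }).encode()

def latest_block_number(url: str = "http://localhost:8545") -> int:
    """Query a running node for its latest block number (network call)."""
    req = urllib.request.Request(
        url,
        data=rpc_request("eth_blockNumber"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())["result"]  # hex string, e.g. "0x10"
    return int(result, 16)
```

&lt;p&gt;Libraries like &lt;code&gt;web3.py&lt;/code&gt; wrap exactly this request/response cycle behind a friendlier API.&lt;/p&gt;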

&lt;h3&gt;
  
  
  Attach Console
&lt;/h3&gt;

&lt;p&gt;To interact with your running Geth node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;geth attach
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Geth Commands
&lt;/h2&gt;

&lt;p&gt;Listed below are commonly used commands to configure and operate Geth:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;geth --help&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Displays the list of available commands and options, along with descriptions to help understand usage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;geth&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Starts a Geth Ethereum node. By default, it connects to mainnet and synchronizes with the blockchain.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ctrl + C&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Gracefully stops the running Geth Ethereum node.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;For more information on flags and configuration options, refer to the &lt;a href="https://geth.ethereum.org/docs/" rel="noopener noreferrer"&gt;Geth official documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Geth Security and Best Practices
&lt;/h2&gt;

&lt;p&gt;To ensure the security and reliability of your Geth node, follow these best practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regularly check official updates and install the latest version of Geth.&lt;/li&gt;
&lt;li&gt;Use package managers like &lt;code&gt;homebrew&lt;/code&gt;, &lt;code&gt;apt-get&lt;/code&gt;, etc., depending on your OS, for efficient updates.&lt;/li&gt;
&lt;li&gt;Verify the integrity of downloaded files using checksums or other verification methods.&lt;/li&gt;
&lt;/ul&gt;
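&lt;p&gt;File verification can be scripted; a hedged Python sketch, where the archive name is a placeholder and the expected digest comes from the published checksums on the official downloads page:&lt;/p&gt;

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the hex SHA-256 digest of a downloaded file in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in for a downloaded Geth release archive (name is a placeholder).
with open("geth-release.tar.gz", "wb") as f:
    f.write(b"release contents")

digest = sha256_of("geth-release.tar.gz")
# In practice, compare against the checksum published for that release:
# assert digest == PUBLISHED_CHECKSUM
```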

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Geth is a foundational component of Ethereum’s infrastructure. As the execution client, it processes transactions, maintains state, and works in tandem with a consensus client to run a full Ethereum node. This collaboration is essential for maintaining the integrity and functionality of the Ethereum network.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Zero to observable with OpenTelemetry and SigNoz</title>
      <dc:creator>Smriti S</dc:creator>
      <pubDate>Tue, 15 Jul 2025 11:32:10 +0000</pubDate>
      <link>https://dev.to/smritisatyan/zero-to-observable-with-opentelemetry-and-signoz-d7o</link>
      <guid>https://dev.to/smritisatyan/zero-to-observable-with-opentelemetry-and-signoz-d7o</guid>
      <description>&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt; is an important concept in software which provides &lt;strong&gt;visibility&lt;/strong&gt; into your application/services. It describes how the services function in real-time and what is happening inside your application. &lt;/p&gt;

&lt;p&gt;Observability has three main pillars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traces&lt;/strong&gt;: Tell you which requests are slow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt;: Monitor latency, failures, and so on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs&lt;/strong&gt;: Help debug errors with context.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;How do you ensure your logs, metrics, and traces are in place? Observability is the answer. &lt;strong&gt;OpenTelemetry&lt;/strong&gt; helps &lt;strong&gt;collect traces, metrics, and logs&lt;/strong&gt; in a vendor-agnostic way and sends them to a backend like &lt;strong&gt;SigNoz&lt;/strong&gt; for &lt;strong&gt;visualization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This topic describes how you can enforce observability in a simple application using &lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt; and &lt;a href="https://signoz.io/" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://opentelemetry.io/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt;, aka OTel, is a vendor-neutral open source Observability framework to instrument, generate, collect, and export traces, metrics, and logs (aka telemetry data).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://signoz.io/docs/what-is-signoz/" rel="noopener noreferrer"&gt;SigNoz&lt;/a&gt; is an open-source observability tool powered by OpenTelemetry that helps monitor and gain insights into your application and infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;A microservices application. Here, I am using a Flask-based service, instrumented with OpenTelemetry, that simulates real-world behavior like successful requests, errors, and metrics exposure. Download the app &lt;a href="https://github.com/SmritiSatyan/simple-app" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;This blog assumes that you are hands-on with concepts like containerization and Docker.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Steps to set up observability
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/SmritiSatyan/simple-app/blob/main/Dockerfile" rel="noopener noreferrer"&gt;Dockerize your application&lt;/a&gt; and build the application by specifying correct ports. In this case, the local host uses port 8000.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build -t simple-app .
docker run -p 8000:8000 \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:4318 \
  -e OTEL_SERVICE_NAME=simple-app \
  simple-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set up self-hosted &lt;a href="https://signoz.io/docs/install/docker/#install-signoz-using-the-install-script" rel="noopener noreferrer"&gt;SigNoz on Docker using the install script&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://signoz.io/docs/install/docker/#install-signoz-using-the-install-script" rel="noopener noreferrer"&gt;Verify the SigNoz installation&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This installation spins up:

&lt;ul&gt;
&lt;li&gt;ClickHouse (for storing telemetry)&lt;/li&gt;
&lt;li&gt;OTEL Collector (for receiving traces/metrics)&lt;/li&gt;
&lt;li&gt;SigNoz login page (on port 8080)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Ensure port 4318 is open; this is the port on which the OTel Collector receives traces over HTTP.&lt;/p&gt;
&lt;/blockquote&gt;
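&lt;p&gt;The app container above points OpenTelemetry at this port via &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt;. A standard-library-only sketch of how the OTLP/HTTP traces URL is derived from that variable (a simplified illustration of the exporter's spec-defined behavior, not actual SDK or SigNoz code):&lt;/p&gt;

```python
import os

def otlp_traces_url(default: str = "http://localhost:4318") -> str:
    """Derive the OTLP/HTTP endpoint the traces exporter will POST to.
    The signal-specific path /v1/traces is appended to the base
    endpoint configured via OTEL_EXPORTER_OTLP_ENDPOINT."""
    base = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", default)
    return base.rstrip("/") + "/v1/traces"

# Mirroring the docker run command above:
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://host.docker.internal:4318"
print(otlp_traces_url())  # http://host.docker.internal:4318/v1/traces
```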

&lt;ol start="4"&gt;
&lt;li&gt;
&lt;p&gt;Hit some endpoints from the terminal to generate traffic and build traces. For example:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://localhost:8000/
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;The above command displays a simple welcome message.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://localhost:8000/fail
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;The above command throws an error.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg227yyjj0sn3xf2wwl75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg227yyjj0sn3xf2wwl75.png" alt="Screenshot 2025-07-01 at 9.18.33 AM" width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To run SigNoz on your local machine, go to &lt;a href="http://localhost:8080/" rel="noopener noreferrer"&gt;http://localhost:8080/&lt;/a&gt; from your browser after verifying the installation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Sign up to SigNoz, and navigate to "Traces" in the left sidebar.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzcse3k9tpg6ypkbdtt7.png" alt="Screenshot 2025-07-01 at 9.43.03 AM" width="800" height="311"&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Go to "Service Name" and select the app name to display the traces associated with the application.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faky0s608126teid2bkx1.png" alt="Screenshot 2025-06-30 at 4.23.38 PM" width="800" height="480"&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Depending on the type of metric, you can view a variety of data insights in different formats (table, visual, and so on).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This blog demonstrated an observability setup using powerful, open-source tools: OpenTelemetry and SigNoz. Together, they help you build a robust, scalable observability pipeline that makes it easier to catch bugs and reduce latency.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Best Practices for Creating Custom Chaos Experiments</title>
      <dc:creator>Smriti S</dc:creator>
      <pubDate>Tue, 25 Feb 2025 07:26:09 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/best-practices-for-creating-custom-chaos-experiments-4963</link>
      <guid>https://dev.to/litmus-chaos/best-practices-for-creating-custom-chaos-experiments-4963</guid>
      <description>&lt;p&gt;Chaos Engineering is all about &lt;strong&gt;breaking things with a purpose&lt;/strong&gt;- to build &lt;strong&gt;resilient systems&lt;/strong&gt;. Many chaos engineering tools offer &lt;strong&gt;predefined faults&lt;/strong&gt;, but where do you start if you want to build custom chaos experiments to suit your infrastructure? And how do you design chaos experiments that are both &lt;strong&gt;effective and safe&lt;/strong&gt;? Let’s break it down.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Define Clear Objectives
&lt;/h2&gt;

&lt;p&gt;Answer this question: &lt;strong&gt;What are you testing?&lt;/strong&gt;&lt;br&gt;
Chaos experiments should have a &lt;strong&gt;specific goal&lt;/strong&gt;. Before creating a custom experiment, ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What failure scenario are we simulating?&lt;/strong&gt; (CPU spikes, network latency, disk failures, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What is the expected behavior of the system?&lt;/strong&gt; (Auto-recovery, failover, degraded performance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What metrics define success or failure?&lt;/strong&gt; (Response time, error rate, downtime)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example: "I want to test how my application’s microservices handle sudden node failures and verify if traffic automatically shifts to healthy nodes."&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Choose the Right Disruption
&lt;/h2&gt;

&lt;p&gt;Answer this question: &lt;strong&gt;For your application/system, what faults are relevant? What resources should be disrupted?&lt;/strong&gt;&lt;br&gt;
All faults have different consequences (or blast radius). It is recommended to start with low-risk faults (such as network latency, CPU/memory spike, pod restarts) before moving towards destructive tests (such as node failures, killing database instances, packet loss in critical services).&lt;br&gt;
Start executing chaos experiments in a staging environment, analyze results, and gradually increase the chaos impact to different environments (QA, pre-production, production).&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Ensure RBAC Is in Place
&lt;/h2&gt;

&lt;p&gt;Chaos experiments are disruptive if not controlled. Use RBAC policies to prevent unauthorized users from running high-risk experiments. This way, only authorized users/engineers will have permission to execute chaos experiments.&lt;br&gt;
The manifest below describes the rules for a user (“chaos-engineer”) who can perform the operations (create, get, and list) on specified resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Role&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chaos-engineer&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;litmuschaos.io"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chaosengines"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chaosexperiments"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Monitor System Behavior
&lt;/h2&gt;

&lt;p&gt;When a chaos experiment is defined, it is important to audit what happens while the experiment is running. Set up observability tools to track CPU, memory, and response time, and to detect unusual errors. This way, you can answer questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did auto-scaling trigger correctly?&lt;/li&gt;
&lt;li&gt;Did users experience downtime?&lt;/li&gt;
&lt;li&gt;Were alerts sent to engineers?&lt;/li&gt;
&lt;/ul&gt;
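&lt;p&gt;Those questions translate naturally into automated checks. Below is a minimal sketch of a steady-state verdict over collected metrics; the field names and thresholds are hypothetical (a real setup would query Prometheus or a similar backend):&lt;/p&gt;

```python
def steady_state_ok(metrics: dict,
                    max_error_rate: float = 0.05,
                    max_p99_ms: float = 500.0) -> bool:
    """Pass/fail verdict for a chaos experiment window.
    The experiment 'passes' only if the system stayed within its
    declared steady-state bounds while the fault was injected,
    and alerting actually fired to notify engineers."""
    return (metrics["error_rate"] <= max_error_rate
            and metrics["p99_latency_ms"] <= max_p99_ms
            and metrics["alerts_fired"] > 0)

# Hypothetical metrics captured during the chaos window:
during_chaos = {"error_rate": 0.02, "p99_latency_ms": 310.0, "alerts_fired": 3}
print(steady_state_ok(during_chaos))  # True
```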

&lt;h2&gt;
  
  
  5. Have a Recovery Plan in Place
&lt;/h2&gt;

&lt;p&gt;Sometimes, chaos experiments do not go as expected and may cause more disruption than intended. To address such cases, set up rollback and recovery mechanisms.&lt;br&gt;
For example, Kubernetes self-healing, restarting pods after executing the experiment, and so on.&lt;br&gt;
The manifest below uses Kyverno to auto-restart pods after chaos so that the affected pods don’t persist and affect other resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kyverno.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restart-failed-pods&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;validationFailureAction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enforce&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restart-on-failure&lt;/span&gt;
      &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;kinds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pod"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;mutate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;patchStrategicMerge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Always&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Define clear objectives&lt;/strong&gt; → Know what failure you’re testing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start small, then scale&lt;/strong&gt; → Begin with low-risk chaos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use RBAC and monitoring&lt;/strong&gt; → Keep experiments safe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor system behavior&lt;/strong&gt; → Audit what goes on when the experiment runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always have a recovery plan&lt;/strong&gt; → Have rollback/recovery plans in case something doesn’t go as expected.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Want to create custom experiments but don’t know where to start? Join the &lt;a href="https://slack.litmuschaos.io/" rel="noopener noreferrer"&gt;Litmus Slack channel&lt;/a&gt;, check out the &lt;a href="https://github.com/litmuschaos/litmus-docs" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; and &lt;a href="https://dev.to/t/litmuschaos"&gt;blogs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>litmus</category>
      <category>litmuschaos</category>
      <category>chaosengineering</category>
    </item>
    <item>
      <title>The Hidden Risks of Relying Solely on Testing: Why Chaos Engineering is Essential</title>
      <dc:creator>Smriti S</dc:creator>
      <pubDate>Thu, 06 Feb 2025 04:46:35 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/the-hidden-risks-of-relying-solely-on-testing-why-chaos-engineering-is-essential-57</link>
      <guid>https://dev.to/litmus-chaos/the-hidden-risks-of-relying-solely-on-testing-why-chaos-engineering-is-essential-57</guid>
      <description>&lt;p&gt;This blog explores the challenges organizations face when relying solely on testing to validate deployments, highlights how chaos engineering can bridge these gaps, and showcases various adopters of chaos engineering practices along with their use cases.&lt;/p&gt;

&lt;p&gt;Organizations are rolling out changes almost every day with the help of CI/CD. You can perform automated, manual, unit, and integration tests to validate features before deployment. But what if your system encounters unexpected failures?&lt;/p&gt;

&lt;p&gt;What if network disruptions, application crashes, or resource exhaustion impact the application, resulting in downtime or cascading failures?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjveynuhqt0ryy544rsc6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjveynuhqt0ryy544rsc6.jpg" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These unforeseen failures can’t be addressed using testing because traditional testing methodologies aren’t designed to simulate such complex, real-world conditions.&lt;br&gt;
This is where &lt;strong&gt;chaos engineering&lt;/strong&gt; comes into play. &lt;/p&gt;

&lt;p&gt;You can address the gaps mentioned earlier by incorporating chaos engineering experimentation in your application to build resilient and reliable systems.&lt;/p&gt;

&lt;p&gt;Described below are some organizations (by sector) that leverage chaos engineering to enhance their cloud-native applications and operations:&lt;/p&gt;

&lt;h3&gt;
  
  
  Technology &amp;amp; E-Commerce: The Challenges and Solutions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Flipkart: Handling High-Traffic Events Without Downtime&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; During peak shopping events, Flipkart experienced traffic spikes that led to performance degradation and downtime. Traditional load testing failed to capture the full complexity of real-world failures like network congestion and database latency.&lt;br&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Shoppers faced checkout failures, slow page loads, and abandoned carts, affecting revenue and customer satisfaction.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Flipkart integrated chaos engineering to simulate failures in production-like environments, testing how microservices and databases handled stress conditions. This approach helped them optimize auto-scaling strategies and build robust failover mechanisms, ensuring seamless shopping experiences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Delivery Hero: Ensuring Reliability in Food Delivery Services&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; Delivery Hero operates a high-demand food delivery service, where real-time order processing and delivery tracking are critical. Network failures and API downtime led to order failures and frustrated customers.&lt;br&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Customers faced delays or lost orders, affecting restaurant partnerships and overall brand trust.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Using chaos engineering, Delivery Hero injected failures into their APIs, databases, and network connections to identify weak points. By proactively fixing these, they improved system redundancy, reduced downtime, and ensured smooth operations even during peak hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Talend: Improving Data Pipeline Resiliency&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; Talend, a data integration company, processes massive datasets across multiple cloud environments. Issues like database failures or unexpected API rate limits disrupted data workflows, affecting analytics and reporting.&lt;br&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Businesses relying on Talend’s data pipelines faced inconsistencies in analytics, leading to misinformed decisions.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Talend used chaos engineering to introduce controlled disruptions in their ETL (Extract, Transform, Load) processes. This approach helped them refine retry logic, optimize failover configurations, and ensure consistent data processing even under failure conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Kitopi: Enhancing Reliability in Cloud Kitchens&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; Kitopi, a cloud kitchen platform, depends on efficient order routing and delivery logistics. Unforeseen infrastructure failures, like database slowdowns or application crashes, led to delayed or missed orders.&lt;br&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Customers experienced long wait times, and restaurant partners faced reduced efficiency.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Kitopi adopted chaos engineering to simulate database latencies and system crashes. By analyzing system responses, they improved recovery mechanisms, reduced downtime, and ensured reliable operations for food preparation and delivery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Lenskart: Ensuring E-Commerce Stability During Scaling&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; As Lenskart scaled its e-commerce operations, performance bottlenecks emerged due to unpredictable user behavior and traffic surges.&lt;br&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Checkout failures and slow load times led to cart abandonment and lost sales.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Lenskart employed chaos engineering to test their microservices against network failures and sudden traffic bursts. These experiments helped them fine-tune their cloud infrastructure for better scalability and higher uptime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. iFood: Strengthening Food Delivery Operations&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; iFood operates a large-scale food delivery service that depends on real-time coordination between customers, restaurants, and delivery partners. System failures disrupted order processing and driver dispatching.&lt;br&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Delayed or canceled deliveries hurt customer satisfaction and restaurant partnerships.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; iFood used chaos engineering to test service degradations, ensuring their systems could handle API timeouts and database failures. By enhancing their error-handling strategies, they minimized order disruptions and maintained seamless delivery services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Wingie Enuygun Company: Improving Online Travel &amp;amp; Finance Platform Resilience&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; &lt;a href="https://dev.to/litmus-chaos/fintech-leaders-journey-to-resilience-with-litmus-chaos-3k19"&gt;Wingie Enuygun&lt;/a&gt;, an online travel and finance platform, faced outages due to third-party API failures and unpredictable network issues.&lt;br&gt;
&lt;strong&gt;Impact:&lt;/strong&gt; Users encountered booking failures and delayed transactions, leading to loss of trust.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; By integrating chaos engineering, they simulated API slowdowns and network failures, allowing them to improve fallback strategies and system stability. As a result, they delivered a more reliable user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Organizations that do not use chaos engineering risk falling into a cycle of operational inefficiencies and poor customer experiences. By injecting controlled failures, chaos engineering helps address these challenges, thereby building robust and scalable systems.&lt;/p&gt;

</description>
      <category>litmus</category>
      <category>litmuschaos</category>
      <category>chaos</category>
    </item>
    <item>
      <title>Fintech Leader's journey to resilience with Litmus Chaos</title>
      <dc:creator>Smriti S</dc:creator>
      <pubDate>Thu, 19 Dec 2024 05:40:06 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/fintech-leaders-journey-to-resilience-with-litmus-chaos-3k19</link>
      <guid>https://dev.to/litmus-chaos/fintech-leaders-journey-to-resilience-with-litmus-chaos-3k19</guid>
      <description>&lt;p&gt;In this blog, you will understand the importance of chaos engineering in Fintech and see how &lt;strong&gt;Wingie Enuygun&lt;/strong&gt; has leveraged &lt;a href="https://docs.litmuschaos.io/" rel="noopener noreferrer"&gt;Litmus Chaos&lt;/a&gt;, an open-source CNCF-hosted platform to build and enhance their application’s resilience. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Chaos Engineering Essential in Fintech?
&lt;/h2&gt;

&lt;p&gt;In the fintech industry, resilience is crucial. Millions of users rely on financial platforms for transactions, payments, and investments, so system downtime can lead to significant financial loss, regulatory scrutiny, and loss of customer trust. Fintech applications often involve microservices, third-party integrations, and real-time data processing, making them vulnerable to unexpected failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Chaos Engineering?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Chaos Engineering&lt;/strong&gt; addresses these challenges by intentionally injecting failure into a system before it is in production, to test its ability to withstand and recover from unplanned failures. This proactive approach allows companies to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mitigate Financial Risks&lt;/strong&gt;: Detect issues before they escalate into costly outages or downtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ensure Compliance&lt;/strong&gt;: Maintain uninterrupted service to comply with strict financial regulations and SLAs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Customer Trust&lt;/strong&gt;: Provide a seamless, uninterrupted user experience, which is vital for customer confidence in financial transactions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By testing for failures before they happen, organizations can achieve operational resilience, reduce downtime, and gain the confidence to innovate faster. &lt;/p&gt;
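&lt;p&gt;As a toy illustration of the principle, a failure can be injected into a call path and the recovery behaviour asserted; the service, retry helper, and numbers below are all hypothetical rather than taken from any specific platform:&lt;/p&gt;

```python
import time

class TransientFailure(Exception):
    """Stands in for a network blip or a briefly unavailable dependency."""

def make_flaky_service(failures_before_success: int):
    """Chaos injection: returns a service that fails N times, then recovers."""
    state = {"calls": 0}
    def service() -> str:
        state["calls"] += 1
        if state["calls"] > failures_before_success:
            return "payment processed"
        raise TransientFailure("injected fault")
    return service

def call_with_retries(fn, attempts: int = 5, backoff_s: float = 0.0) -> str:
    """The resilience mechanism under test: retry with optional backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientFailure:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff_s)

service = make_flaky_service(failures_before_success=2)
print(call_with_retries(service))  # → payment processed
```

&lt;p&gt;The same idea scales up: inject the fault at the infrastructure level (pods, networks, APIs) and assert that user-facing objectives still hold.&lt;/p&gt;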

&lt;p&gt;One such organization that uses &lt;strong&gt;LitmusChaos&lt;/strong&gt; is &lt;strong&gt;Wingie Enuygun&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Wingie Enuygun Group, a leader in travel and fintech, employs chaos engineering practices to enhance the resilience of their applications. Using LitmusChaos, they conduct controlled failure experiments during quality assurance (QA) cycles in pre-production environments. This proactive approach allows them to identify and address potential system weaknesses before deployment, ensuring a resilient application and reliable user experience. &lt;/p&gt;

&lt;p&gt;By simulating real-world failure scenarios, such as server crashes or network outages, Wingie analyzes how their application responds and recovers. This process helps in uncovering vulnerabilities that might not be evident through traditional testing methods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Wingie Enuygun Uses Litmus Chaos
&lt;/h2&gt;

&lt;p&gt;Travelers expect instant access to booking platforms, and even minor disruptions can have significant consequences. Recognizing this, Wingie Enuygun incorporates Litmus Chaos Engineering to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identify Bottlenecks&lt;/strong&gt;: By pinpointing performance and scalability issues, Litmus helps the company optimize its infrastructure to handle peak loads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect Issues Early&lt;/strong&gt;: Controlled failures reveal potential issues before they impact users, enabling the company to address problems preemptively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foresee Potential Errors&lt;/strong&gt;: Proactive testing allows teams to predict and mitigate issues, reducing the likelihood of costly downtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach gives the organization a strategic advantage, allowing teams to take preventive measures that keep their systems resilient, and highly available.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Wingie Enuygun Uses Litmus Chaos
&lt;/h2&gt;

&lt;p&gt;To maximize the benefits of chaos engineering, Wingie Enuygun has integrated Litmus into its Quality Assurance (QA) cycles. This process ensures that every update, change, or new feature undergoes chaos testing for resilience before going into the production environment. Here’s how it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Controlled Chaos in Pre-Production&lt;/strong&gt;: Before releasing changes into production, controlled disruptions are introduced to stress-test the systems. These disruptions simulate real-world failures, such as server crashes or network interruptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Resilience Testing&lt;/strong&gt;: By automating chaos experiments, the company validates its system’s ability to recover from critical failures like network latency, resource depletion, and service interruptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug Detection and Verification&lt;/strong&gt;: Chaos experiments force failures that might not otherwise surface while testing or in production. This approach helps identify and resolve bugs before they impact users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By embedding chaos engineering within its QA process, Wingie Enuygun strengthens its infrastructure and builds confidence in its system’s ability to withstand adverse conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;From QA cycle integration to proactive resilience testing and custom chaos experiments, Wingie Enuygun’s use of chaos engineering has driven continuous improvement and system innovation. &lt;/p&gt;

&lt;p&gt;By leveraging LitmusChaos, the company enhances its ability to detect issues early, optimize performance, and ensure uninterrupted service for its users by improving the resilience of the application.&lt;/p&gt;

</description>
      <category>litmuschaos</category>
    </item>
    <item>
      <title>'Hello World' in Flyte</title>
      <dc:creator>Smriti S</dc:creator>
      <pubDate>Fri, 21 Jul 2023 07:31:59 +0000</pubDate>
      <link>https://dev.to/smritisatyan/hello-world-in-flyte-48of</link>
      <guid>https://dev.to/smritisatyan/hello-world-in-flyte-48of</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/smritisatyan/10-minutes-to-flyte-4ppk"&gt;previous article&lt;/a&gt;, you understood the salient features of Flyte, which could help you decide if Flyte is the right orchestration platform for you. &lt;/p&gt;

&lt;p&gt;In this article, you will understand how tasks and workflows in Flyte can be used to implement 'k' nearest neighbours in Python.&lt;/p&gt;

&lt;p&gt;The building blocks of Flyte are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt;: A versioned, shareable unit of execution that encapsulates your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflows&lt;/strong&gt;: A directed acyclic graph (DAG) of units of work, encapsulated by nodes, that describes the order in which tasks execute.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Let's dive into the implementation details!&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Import the required packages.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import List, NamedTuple

import pandas as pd
from flytekit import task, workflow
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Define a &lt;a href="https://realpython.com/python-namedtuple/"&gt;NamedTuple&lt;/a&gt; that provides a name for the output (useful while displaying the output on the console).
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;split_data = NamedTuple(
    "split_data",
    train_features=pd.DataFrame,
    test_features=pd.DataFrame,
    train_labels=pd.DataFrame,
    test_labels=pd.DataFrame,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  3. Define a task that loads the wine dataset into your environment and splits it into train and test data. Notice the '@task' decorator specified at the beginning of the method.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@task
def data_processing() -&amp;gt; split_data:
    # load wine dataset
    wine = load_wine()

    # convert features and target (numpy arrays) into pandas DataFrames
    wine_features = pd.DataFrame(data=wine.data, columns=wine.feature_names)
    wine_target = pd.DataFrame(data=wine.target, columns=["species"])

    # split the dataset
    X_train, X_test, y_train, y_test = train_test_split(
        wine_features, wine_target, test_size=0.4, random_state=101
    )
    print("Sample data:")
    print(X_train.head(5))
    return split_data(
        train_features=X_train,
        test_features=X_test,
        train_labels=y_train,
        test_labels=y_test,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Define another task that creates a K-nearest neighbour model and fits the model to the data. The predict function is used to predict values from the test data and store them in a list.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@task
def fit_and_predict(
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.DataFrame,
) -&amp;gt; List[int]:
    lr = KNeighborsClassifier()  # create a KNeighborsClassifier model
    lr.fit(X_train, y_train)  # fit the model to the data
    predicted_vals = lr.predict(X_test)  # predict values for test data
    return predicted_vals.tolist()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  5. Define another task that determines the accuracy of the model based on the actual values and predicted values using the "accuracy_score" method.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@task
def calc_accuracy(y_test: pd.DataFrame, predicted_vals_list: List[int]) -&amp;gt; float:
    return accuracy_score(y_test, predicted_vals_list)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  6. Define a workflow (annotated with the @workflow decorator). This workflow lists the tasks in the order in which they are to be executed.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@workflow
def pipeline() -&amp;gt; float:
    split_data_vals = data_processing()
    predicted_vals_output = fit_and_predict(
        X_train=split_data_vals.train_features,
        X_test=split_data_vals.test_features,
        y_train=split_data_vals.train_labels,
    )
    return calc_accuracy(
        y_test=split_data_vals.test_labels, predicted_vals_list=predicted_vals_output
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  7. The pipeline is invoked, which displays the accuracy of the model.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == "__main__":
    print(f"Accuracy of the model is {pipeline() * 100:.2f}%")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full code can be found &lt;a href="https://github.com/SmritiSatyan/Flyte-HelloWorld/blob/main/knn.py"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>10 minutes to Flyte</title>
      <dc:creator>Smriti S</dc:creator>
      <pubDate>Sat, 15 Jul 2023 05:29:12 +0000</pubDate>
      <link>https://dev.to/smritisatyan/10-minutes-to-flyte-4ppk</link>
      <guid>https://dev.to/smritisatyan/10-minutes-to-flyte-4ppk</guid>
      <description>&lt;p&gt;An important aspect of building a machine learning pipeline is the 'workflow'. Workflow refers to the phases of a machine learning model defined and executed by the developer. A typical workflow involves phases like data collection, data cleaning, pre-processing, model building, and making predictions.&lt;/p&gt;

&lt;p&gt;A workflow's dependencies on data and infrastructure complicate its maintenance and reproducibility. These dependencies can prevent the user from focusing on the business logic, thereby reducing the performance of the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/flyteorg"&gt;Flyte&lt;/a&gt; is your one-stop solution to resolve these data and infrastructure dependencies, focus on the business logic, and improve the efficiency of your machine learning model.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Flyte?
&lt;/h2&gt;

&lt;p&gt;Flyte is a distributed workflow automation platform that is&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open source&lt;/li&gt;
&lt;li&gt;Kubernetes-built&lt;/li&gt;
&lt;li&gt;Container-native&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can build complex, mission-critical data and machine learning pipelines at scale. You can also accelerate models to production by building highly concurrent, scalable, and maintainable pipelines that scale to millions of executions and containers.&lt;/p&gt;

&lt;p&gt;Flyte bridges the gap between creating machine learning models and setting them in production. You can execute the model and Flyte will seamlessly orchestrate it in the production environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; A 'workflow' in Flyte's context is different from a machine learning workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building blocks of Flyte
&lt;/h2&gt;

&lt;p&gt;The building block of Flyte is a &lt;a href="https://docs.flyte.org/projects/cookbook/en/latest/auto_examples/basics/task.html#basics-of-tasks"&gt;task&lt;/a&gt;. Think of it as a function annotated with the &lt;code&gt;@task&lt;/code&gt; decorator: an independent unit of execution that encapsulates your code. Tasks are versioned and shareable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@task
def some_function(arguments) -&amp;gt; return_type:
    # function body
    return value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single workflow in Flyte can be associated with multiple tasks; for example, one task might generate a dataframe and a second compute descriptive statistics on it.&lt;br&gt;
Tasks in Flyte are combined and executed in a specific order using a &lt;a href="https://docs.flyte.org/projects/cookbook/en/latest/auto_examples/basics/basic_workflow.html#workflows"&gt;workflow&lt;/a&gt;. A 'workflow' is a directed acyclic graph (DAG) of units of work, encapsulated by nodes, that describes the order of execution of tasks. It is specified using the &lt;code&gt;@workflow&lt;/code&gt; decorator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@workflow
def my_workflow(arguments) -&amp;gt; return_type:
    # call tasks in the desired order
    return value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A workflow like the one above specifies the order in which tasks execute. More about it &lt;a href="https://docs.flyte.org/projects/cookbook/en/latest/index.html"&gt;here&lt;/a&gt;.&lt;br&gt;
A &lt;a href="https://docs.flyte.org/en/latest/concepts/nodes.html#divedeep-nodes"&gt;node&lt;/a&gt; represents a unit of work within a workflow. Generally, a node encapsulates an instance of a task and can coordinate task inputs and outputs. It behaves as a wrapper for tasks and workflows and can be visualised on the user interface.&lt;br&gt;
&lt;a href="https://docs.flyte.org/en/latest/concepts/launchplans.html#divedeep-launchplans"&gt;Launchplans&lt;/a&gt; invoke workflow executions. You can bind a partial or complete list of inputs to be passed as arguments to create an 'execution' (workflow execution). Flyte automatically creates a default launch plan (with no inputs) when a workflow is serialised and registered.&lt;/p&gt;

&lt;p&gt;Instead of using a default launch plan, you can create one too! This way, you can supply the parameters to a launch plan instead of using the default one.&lt;br&gt;
An &lt;a href="https://docs.flyte.org/en/latest/concepts/executions.html"&gt;execution&lt;/a&gt; is an instance of a workflow, node, or task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; You can invoke tasks and workflows as regular Python methods, import them, and use them in other Python modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other features of Flyte
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Versioned code, containerized with all its dependencies.&lt;/li&gt;
&lt;li&gt;Versioned entities, with the option to roll back to a specific version.&lt;/li&gt;
&lt;li&gt;Multi-tenant, scalable service to allow users to work in their own isolated repository without affecting other parts of the platform.&lt;/li&gt;
&lt;li&gt;Cached output for every job trigger, allowing reuse (instead of re-compute), thereby saving execution time and resources.&lt;/li&gt;
&lt;li&gt;Metadata capture (logs) for every workflow to backtrack to the source of errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Type system
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Support for multiple data types such as Blobs, Directories, Schema, and more.&lt;/li&gt;
&lt;li&gt;Strongly typed and parameterized tasks.&lt;/li&gt;
&lt;li&gt;Typesafe pipeline construction, that is, every task has an interface characterized by an input and output. This means illegal pipeline construction fails during declaration, rather than at runtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Customization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Out of the box support to run Spark jobs on K8s, Hive queries, and more.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.flyte.org/en/latest/deployment/configuration/customizable_resources.html#deployment-configuration-customizable-resources"&gt;Customizable&lt;/a&gt; user classes to suit specific requirements.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.flyte.org/projects/cookbook/en/latest/integrations.html"&gt;Integrations&lt;/a&gt; (&lt;a href="https://docs.flyte.org/projects/cookbook/en/latest/auto_examples/sql_plugin/index.html"&gt;SQL&lt;/a&gt;, &lt;a href="https://docs.flyte.org/projects/cookbook/en/latest/auto_examples/pandera_plugin/index.html"&gt;Pandera&lt;/a&gt;, &lt;a href="https://docs.flyte.org/projects/cookbook/en/latest/auto_examples/modin_plugin/index.html"&gt;Modin&lt;/a&gt;) to extend Flyte, if you are ready to take a deep-dive!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Miscellaneous
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ability to iterate through models.&lt;/li&gt;
&lt;li&gt;User-friendly &lt;a href="https://docs.flyte.org/projects/flytekit/en/latest/"&gt;SDKs&lt;/a&gt; to run the workflows.&lt;/li&gt;
&lt;li&gt;Ability to generate dynamic graphs to take decisions during run-time (such as tuning hyperparameters during execution, programmatically stopping the training in case of surging errors, modifying logic of code dynamically, and building AutoML pipelines).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, you were introduced to Flyte and its features.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
