<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dima</title>
    <description>The latest articles on DEV Community by Dima (@oooodiboooo).</description>
    <link>https://dev.to/oooodiboooo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1064294%2Fe92a426f-4fd2-4b73-b4d7-932783e8a45c.png</url>
      <title>DEV Community: Dima</title>
      <link>https://dev.to/oooodiboooo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/oooodiboooo"/>
    <language>en</language>
    <item>
      <title>Integrating Agile Methodologies and DevOps for Effective Software Development</title>
      <dc:creator>Dima</dc:creator>
      <pubDate>Wed, 14 Jun 2023 14:54:10 +0000</pubDate>
      <link>https://dev.to/oooodiboooo/integrating-agile-methodologies-and-devops-for-effective-software-development-48c</link>
      <guid>https://dev.to/oooodiboooo/integrating-agile-methodologies-and-devops-for-effective-software-development-48c</guid>
      <description>&lt;p&gt;In today's fast-paced digital world, businesses must adapt quickly to ever-changing customer demands and market conditions. As a project manager at Nyoka, an international company specializing in AI solutions, technology services, and outsourcing, I've witnessed firsthand that traditional waterfall approaches often fall short in terms of rapid adaptation and iterative development.&lt;/p&gt;

&lt;p&gt;On the other hand, Agile and DevOps have emerged as transformative practices that allow teams to deliver high-quality software faster and more efficiently. By integrating Agile and DevOps, organizations can further amplify the benefits of each, leading to improved productivity, better product quality, and increased customer satisfaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sFtE-slD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r0hy15kuc4ata2708k6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sFtE-slD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r0hy15kuc4ata2708k6p.png" alt="Image description" width="650" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Agile and DevOps
&lt;/h2&gt;

&lt;p&gt;Agile is a project management and product development strategy that is rooted in iterative progress, self-organization, and accountability. The Agile Manifesto outlines several key principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Individuals and interactions over processes and tools: Agile emphasizes the importance of the team and how its members work together.&lt;/li&gt;
&lt;li&gt;Working software over comprehensive documentation: Agile prioritizes delivering a working product as the primary measure of progress, without disregarding documentation.&lt;/li&gt;
&lt;li&gt;Customer collaboration over contract negotiation: Agile stresses the importance of maintaining a close relationship with the customer throughout the development process.&lt;/li&gt;
&lt;li&gt;Responding to change over following a plan: Agile is all about adaptability and being able to respond to changes quickly and efficiently.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;DevOps, a combination of the terms "development" and "operations," is a set of practices aimed at reducing the systems development life cycle time and providing continuous delivery with high software quality. Key principles of DevOps include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Continuous Integration and Continuous Delivery (CI/CD): This encourages frequent code integration and ensures that software can be reliably released at any time.&lt;/li&gt;
&lt;li&gt;Infrastructure as Code (IaC): This involves managing and provisioning computing infrastructure through machine-readable definition files, replacing the need for physical hardware configuration or interactive configuration tools.&lt;/li&gt;
&lt;li&gt;Monitoring and Logging: Detailed performance tracking and logging to help identify and resolve problems.&lt;/li&gt;
&lt;li&gt;Collaboration and Communication: Like Agile, DevOps also emphasizes breaking down silos and encouraging teams to work together.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Benefits of Integrating Agile and DevOps
&lt;/h2&gt;

&lt;p&gt;Integration of Agile and DevOps can result in a seamless pipeline that increases efficiency, reduces errors, and accelerates time to market. Some of the benefits are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster Time to Market: Agile's iterative approach combined with DevOps' automation capabilities can significantly speed up the development cycle.&lt;/li&gt;
&lt;li&gt;Improved Quality and Performance: Automated testing, a key component of DevOps, can be seamlessly integrated into Agile sprints, providing immediate feedback on the code quality.&lt;/li&gt;
&lt;li&gt;Increased Collaboration: Agile and DevOps integration encourages cross-functional collaboration, leading to better communication and understanding.&lt;/li&gt;
&lt;li&gt;Enhanced Customer Satisfaction: Faster delivery of high-quality software leads to increased customer satisfaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Successful Integrations of Agile and DevOps
&lt;/h2&gt;

&lt;p&gt;Several companies have successfully integrated Agile and DevOps, such as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spotify:&lt;/strong&gt; This music streaming giant adopted an Agile approach to support its rapidly growing business. Regular small updates to its platform, instead of launching big releases, allowed Spotify to quickly respond to user feedback and stay ahead of the competition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adobe:&lt;/strong&gt; By taking a DevOps approach, Adobe removed silos between development and operations teams. This resulted in faster releases and higher customer satisfaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon:&lt;/strong&gt; A mandate from CEO Jeff Bezos to expose all data and functionality through service interfaces triggered Amazon's journey towards DevOps. This engineering process shift incentivized engineers to build for desired quality, resulting in a mature and robust set of APIs that all teams within Amazon use.&lt;/p&gt;

&lt;p&gt;At Nyoka, we've also successfully implemented Agile practices, working in sprints, and maintaining a backlog of tasks prioritized based on customer needs. Our operations team, in sync with developers, has utilized DevOps practices such as Infrastructure as Code (IaC) and Continuous Integration and Continuous Delivery (CI/CD) to create a seamless pipeline from development to production. This integration has not only reduced our deployment time from weeks to days but also provided our customers with a transparent view of the entire development process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Success Metrics for Agile and DevOps
&lt;/h2&gt;

&lt;p&gt;Continuous improvement is at the heart of both Agile and DevOps. Therefore, it's important to measure implementation success and make adjustments as needed. Here's how you can track progress and measure success using Key Performance Indicators (KPIs):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Lead time: The time from the start of development to feature deployment. Agile and DevOps aim to reduce lead times.&lt;/li&gt;
&lt;li&gt;Frequency of deployment: The goal is to increase deployment frequency without sacrificing quality.&lt;/li&gt;
&lt;li&gt;Change Failure Rate: The percentage of changes that result in failure. A lower percentage indicates more reliable deployments.&lt;/li&gt;
&lt;li&gt;Mean Time to Recovery (MTTR): The average time it takes to recover from a failure. The aim is to minimize MTTR.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tracking these KPIs over time can provide a clear picture of the progress and success of Agile and DevOps implementations. Review these metrics regularly and discuss them with your team. Adjust your practices based on these metrics as needed. For example, if turnaround times are too long, consider improving automation or streamlining processes.&lt;/p&gt;
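
&lt;p&gt;As a rough illustration, here is a small Python sketch of how these four KPIs could be computed. The deployment records and their field names are made up; substitute data from your own tooling.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime, timedelta

# Hypothetical deployment records; replace with data from your own tooling
deployments = [
    {"started": datetime(2023, 5, 1), "deployed": datetime(2023, 5, 4), "failed": False, "recovery": None},
    {"started": datetime(2023, 5, 3), "deployed": datetime(2023, 5, 6), "failed": True, "recovery": timedelta(hours=2)},
    {"started": datetime(2023, 5, 7), "deployed": datetime(2023, 5, 8), "failed": False, "recovery": None},
]

# Lead time: average time from start of development to deployment
lead_times = [d["deployed"] - d["started"] for d in deployments]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Deployment frequency: deployments per week over the observed period
period_days = (deployments[-1]["deployed"] - deployments[0]["started"]).days or 1
deploys_per_week = len(deployments) * 7 / period_days

# Change failure rate: share of deployments that caused a failure
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

# MTTR: average time to recover from failed changes
recoveries = [d["recovery"] for d in deployments if d["failed"]]
mttr = sum(recoveries, timedelta()) / len(recoveries) if recoveries else timedelta()

print(avg_lead_time, deploys_per_week, change_failure_rate, mttr)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;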

&lt;h2&gt;
  
  
  Our Journey to Agile and DevOps Integration
&lt;/h2&gt;

&lt;p&gt;At Nyoka, we tackled several challenges during our Agile and DevOps integration process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inconsistency of decisions between teams: Our teams are scattered around the world, working on different components of the project. It's crucial to ensure consistency in goals, end results, and deadlines. Improved processes have played a key role in achieving consistency across different scrum teams.&lt;/li&gt;
&lt;li&gt;Agile with Waterfall Integration: For some projects, we used a hybrid approach combining Agile and Waterfall methodologies. This required careful planning and management to ensure a smooth workflow and communication between teams.&lt;/li&gt;
&lt;li&gt;Differences in time zones, customs, and inter-team activities: Initially, these differences created coordination and communication problems, which we resolved with clear instructions, regular synchronization, and the use of collaboration tools.&lt;/li&gt;
&lt;li&gt;Teams using different DevOps tools: Different teams were using a variety of DevOps tools, necessitating a common platform for better management of the development process.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The introduction of Agile and DevOps has resulted in a substantial increase in our delivery speed and responsiveness to market needs. Our tracked efforts show a 50% improvement in merges and updates, a 60% improvement in software configuration management, a 59% reduction in quality costs, and a 90% improvement in assembly and deployment.&lt;/p&gt;

&lt;p&gt;Moreover, we noticed qualitative benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improved demand management and traceability from portfolio to Agile delivery teams&lt;/li&gt;
&lt;li&gt;Granular configuration management and traceability&lt;/li&gt;
&lt;li&gt;Integration with Agile lifecycle tools allowing story-based, configuration management driven from metadata&lt;/li&gt;
&lt;li&gt;Real-time traceability of status for build and deployment&lt;/li&gt;
&lt;li&gt;Automated build and deployments, including “one-button deployment”&lt;/li&gt;
&lt;li&gt;Developer efficiencies as a consequence of improved tool interaction times and processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agile and DevOps are powerful methodologies that, when integrated, can significantly improve the productivity, speed, and quality of software development in an organization. While the transition may not be without its challenges, the benefits can be substantial.&lt;/p&gt;

&lt;p&gt;The journey we took as a team, from recognizing the need for change, understanding Agile and DevOps principles, to the successful integration of these practices, has been a significant part of our growth story. Our customer satisfaction, speed of delivery, and overall team collaboration have improved remarkably. These positive outcomes are a testament to the power of Agile and DevOps when implemented in harmony.&lt;/p&gt;

&lt;p&gt;We believe that our journey provides valuable lessons for other organizations looking to adopt Agile and DevOps practices. It's important to note that successful integration requires commitment, flexibility, and patience. It's not a one-size-fits-all solution, but a philosophy that can be tailored to meet your organization's unique needs and challenges.&lt;/p&gt;

&lt;p&gt;In the words of Jeff Sutherland, one of the creators of Scrum, "The future is here, it's just not evenly distributed yet." It's up to us to distribute it. Agile and DevOps are the tools that can help us do that, and we are more than ready to take on the future, one sprint at a time.&lt;/p&gt;

</description>
      <category>agile</category>
      <category>devops</category>
      <category>management</category>
      <category>development</category>
    </item>
    <item>
      <title>Building a Robust Data Lake with Python and DuckDB</title>
      <dc:creator>Dima</dc:creator>
      <pubDate>Thu, 18 May 2023 10:42:15 +0000</pubDate>
      <link>https://dev.to/oooodiboooo/building-a-robust-data-lake-with-python-and-duckdb-4le3</link>
      <guid>https://dev.to/oooodiboooo/building-a-robust-data-lake-with-python-and-duckdb-4le3</guid>
<description>&lt;p&gt;Every morning, when I wake up, the weight of responsibility settles on my shoulders. It’s a weight that fuels me, driving me to deliver for my customers and for myself. Today, that responsibility involves crafting a powerful data lake from scratch using Python and DuckDB.&lt;/p&gt;

&lt;p&gt;First, let’s define the architecture. My plan is to store Parquet files in S3, using Dagster to orchestrate the Python application and the DuckDB engine.&lt;/p&gt;

&lt;p&gt;I will immerse you in the world of Data Lake, Python, and DuckDB. I will provide a step-by-step, practical guide full of examples.&lt;/p&gt;

&lt;p&gt;I’ll use the HTTP Range header to read parts of the Parquet files stored in S3 at random. This allows me to access the specific information I need without having to download entire files, saving both time and resources. Here, Python will be my tool of choice, a reliable ally in navigating the complexity of data manipulation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Handling of Parquet Files
&lt;/h2&gt;

&lt;p&gt;For efficient handling of large Parquet files in S3, we’ll use the HTTP Range header to read file chunks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
from io import BytesIO
import pandas as pd
import pyarrow.parquet as pq

s3 = boto3.client('s3', region_name='your_region', aws_access_key_id='your_key_id', aws_secret_access_key='your_access_key')

def read_parquet_file(bucket, key, start_range, end_range):
    response = s3.get_object(Bucket=bucket, Key=key, Range=f'bytes={start_range}-{end_range}')
    data_chunk = response['Body'].read()
    df = pq.read_table(source=BytesIO(data_chunk)).to_pandas()
    return df

# Now you can use the function to read a chunk of the Parquet file
df = read_parquet_file('mybucket', 'file.parquet', 0, 10000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here comes DuckDB, our analytical database, tasked with converting and processing the data. DuckDB, although an incredibly powerful tool, is not without its limitations. Its largely in-memory processing model can struggle with datasets that don’t fit in RAM, and its ecosystem is still younger than that of long-established databases such as SQLite or PostgreSQL. Yet, for our data lake, DuckDB’s column-oriented design and vectorized query execution make it a potent choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sophisticated DuckDB Processing
&lt;/h2&gt;

&lt;p&gt;Processing data in DuckDB can go beyond simple queries. We can handle advanced operations, including joins, aggregations, and window functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;con = duckdb.connect('duckdb_file.db')

# Assume we have two tables, orders and customers, in our database
query = """
    SELECT c.name, COUNT(o.id) OVER (PARTITION BY c.id) as num_orders
    FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id
"""

df = con.execute(query).fetch_df()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For orchestration, I turn to Dagster, employing its Software-Defined Assets to model and manage the data. With Dagster, I create pipelines to run my Python application, allowing me to define dependencies and track the lineage of my data. This way, I know where everything is, and where it came from — a priceless asset in the ever-changing world of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Utilizing S3 for Advanced Storage Needs
&lt;/h2&gt;

&lt;p&gt;When dealing with S3, we often need more than just simple file operations. Boto3 supports advanced features like multipart uploads for large files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def upload_large_file(bucket, key, local_file):
    s3 = boto3.client('s3')
    transfer = boto3.s3.transfer.S3Transfer(s3)
    transfer.upload_file(local_file, bucket, key)

upload_large_file('mybucket', 'largefile.parquet', '/path/to/largefile.parquet')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deep Dive into DuckDB Engine Set-Up
&lt;/h2&gt;

&lt;p&gt;Setting up the DuckDB engine can include advanced options such as configuring the number of threads, memory limit, and enabling or disabling specific optimizations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;con = duckdb.connect('my_database.duckdb')

# Configure settings
con.execute("PRAGMA threads=4") # Use 4 threads
con.execute("PRAGMA memory_limit='4GB'") # Limit memory usage to 4GB

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Orchestrating Complex Tasks with Dagster
&lt;/h2&gt;

&lt;p&gt;Using Dagster, we can orchestrate complex, dependent tasks. Here’s an example of a pipeline where one task prepares data and another analyzes it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@asset
def prepare_data():
    # Some complex data preparation here
    data = load_data()  # Assume this function loads data
    prepared_data = data * 2
    return prepared_data

@asset(ins={"prepared_data": AssetIn(prepare_data)})
def analyze_data(prepared_data):
    # Some complex analysis here
    results = prepared_data.sum()
    return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running Advanced Python Application with Dagster
&lt;/h2&gt;

&lt;p&gt;With Dagster, you can also establish more complex pipelines that involve dependencies, conditionals, and more. Here is an example of a pipeline with two solids (Dagster’s older task abstraction), where one depends on the other:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dagster import pipeline, solid

@solid
def process_data(context, df):
    df_processed = df * 2  # Some complex data processing
    return df_processed

@solid
def analyze_data(context, df):
    result = df.sum()  # Some complex data analysis
    context.log.info(f"Result: {result}")
    return result

@pipeline
def complex_pipeline():
    analyze_data(process_data())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Integration of All Components
&lt;/h2&gt;

&lt;p&gt;Let’s tie all the components together in a more complex pipeline that reads a Parquet file from S3, processes it using DuckDB, and uploads the result back to S3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dagster import execute_pipeline, ModeDefinition, fs_io_manager

@pipeline(
    mode_defs=[
        ModeDefinition(resource_defs={"io_manager": fs_io_manager}),
    ]
)
def data_lake_pipeline():
    data = read_parquet_file('mybucket', 'file.parquet', 0, 10000)
    processed_data = process_data(data)
    result = analyze_data(processed_data)
    upload_large_file('mybucket', 'result.parquet', result)

result = execute_pipeline(data_lake_pipeline)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitor Your System
&lt;/h2&gt;

&lt;p&gt;To ensure that our data lake runs smoothly, we need to incorporate monitoring. Let’s use AWS CloudWatch for logging and monitoring our application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

def log_to_cloudwatch(message):
    client = boto3.client('logs', region_name='your_region')
    log_group = 'my_log_group'
    log_stream = 'my_log_stream'
    response = client.put_log_events(
        logGroupName=log_group,
        logStreamName=log_stream,
        logEvents=[{'timestamp': 1000, 'message': message}]
    )
    return response

# Now you can log any event or error in your application
log_to_cloudwatch('Data processing started')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Scale Up Your System
&lt;/h2&gt;

&lt;p&gt;As our data grows, our system should scale accordingly. We can create multiple DuckDB instances, each handling a portion of our data, or hand the distribution off to Apache Spark or a similar system. The Spark variant looks like this (a DuckDB-per-partition sketch follows after it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession

# Initialize Spark
spark = SparkSession.builder.getOrCreate()

# Read data into a Spark DataFrame
df = spark.read.parquet("s3://mybucket/file.parquet")

# Create a temporary view for SQL queries
df.createOrReplaceTempView("my_data")

# Execute a SQL query with Spark SQL and keep the result as a Spark DataFrame
result = spark.sql("SELECT * FROM my_data WHERE value &amp;gt; 100")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
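
&lt;p&gt;If you would rather keep the query engine itself as DuckDB, one possible pattern (a sketch that assumes the Parquet partitions are accessible as local files, with hypothetical file names) is to give each worker process its own DuckDB connection and a slice of the files:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import duckdb
from concurrent.futures import ProcessPoolExecutor

def process_partition(parquet_path):
    # Each worker process gets its own in-memory DuckDB instance
    con = duckdb.connect(database=':memory:')
    # read_parquet scans the file directly; the filter mirrors the Spark example above
    df = con.execute(f"SELECT * FROM read_parquet('{parquet_path}') WHERE value &amp;gt; 100").fetch_df()
    con.close()
    return df

if __name__ == '__main__':
    files = ['part-0.parquet', 'part-1.parquet', 'part-2.parquet']  # hypothetical partition files
    with ProcessPoolExecutor() as pool:
        partial_results = list(pool.map(process_partition, files))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;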



&lt;p&gt;Remember that complexity is the essence of the game. It encourages us to explore, expand our knowledge, and reach new horizons. As in business, data science is not about making quick decisions. It’s about taking responsibility, striving for more, and gradually improving every day.&lt;/p&gt;

</description>
      <category>python</category>
      <category>duckdb</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Distributed Deep Learning in TensorFlow</title>
      <dc:creator>Dima</dc:creator>
      <pubDate>Fri, 05 May 2023 13:48:32 +0000</pubDate>
      <link>https://dev.to/oooodiboooo/distributed-deep-learning-in-tensorflow-1oh7</link>
      <guid>https://dev.to/oooodiboooo/distributed-deep-learning-in-tensorflow-1oh7</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NWoP95Bh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0q0leeyfd2233rissjjw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NWoP95Bh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0q0leeyfd2233rissjjw.png" alt="Image description" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I know that solving complex problems and processing large-scale deep learning problems can be quite a challenge. Fortunately, distributed deep learning comes to our rescue, allowing us to leverage the power of multiple devices and computing resources to better train our models. And what better way to discuss this than with TensorFlow, which offers built-in support for distributed learning using the tf.distribute package.&lt;/p&gt;

&lt;p&gt;In this article, I’ll dive into distributed deep learning in TensorFlow, covering both model and data parallelism. We’ll explore synchronous and asynchronous learning strategies and walk through practical examples to help you implement them in your projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distributed learning strategies in TensorFlow
&lt;/h2&gt;

&lt;p&gt;Distributed learning is an important aspect of training deep learning models on large data sets, as it allows us to share the computational load across multiple devices or even clusters of devices. TensorFlow, being a popular and versatile deep learning framework, offers us the tf.distribute package, which is equipped with various strategies to seamlessly implement distributed learning.&lt;/p&gt;

&lt;p&gt;In the following sections, we will look at these strategies in detail, understand their inner workings, and analyze their suitability for different use cases. By the end, you will have a good understanding of TensorFlow’s distributed learning strategies and will be well prepared to implement them in your projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Synchronous learning strategies
&lt;/h2&gt;

&lt;p&gt;Synchronous learning strategies are characterized by simultaneous model updates that ensure consistency and accuracy in the learning process. TensorFlow offers us three main synchronous strategies: MirroredStrategy, MultiWorkerMirroredStrategy, and CentralStorageStrategy. Let’s take a look at each of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MirroredStrategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MirroredStrategy is a standard TensorFlow synchronous learning strategy that provides data parallelism by replicating the model across multiple devices, usually GPUs. In this strategy, each device processes a different mini-batch of data and computes gradients independently. Once all devices have completed their computations, the gradients are combined and applied to update the model parameters.&lt;/p&gt;

&lt;p&gt;Consider an example. In this example, we will use a more complex model architecture, the deep residual network (ResNet) for image classification. This model consists of several residual blocks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, Add, MaxPooling2D, GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

# Define the residual block
def residual_block(x, filters, strides=1):
    shortcut = x

    x = Conv2D(filters, kernel_size=(3, 3), strides=strides, padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)

    x = Conv2D(filters, kernel_size=(3, 3), strides=1, padding='same')(x)
    x = BatchNormalization()(x)

    if strides != 1:
        shortcut = Conv2D(filters, kernel_size=(1, 1), strides=strides, padding='same')(shortcut)
        shortcut = BatchNormalization()(shortcut)

    x = Add()([x, shortcut])
    x = Activation('relu')(x)

    return x

# Define the ResNet model
def create_resnet_model(input_shape, num_classes):
    inputs = Input(shape=input_shape)

    x = Conv2D(64, kernel_size=(7, 7), strides=2, padding='same')(inputs)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = MaxPooling2D(pool_size=(3, 3), strides=2, padding='same')(x)

    x = residual_block(x, filters=64)
    x = residual_block(x, filters=64)

    x = residual_block(x, filters=128, strides=2)
    x = residual_block(x, filters=128)

    x = residual_block(x, filters=256, strides=2)
    x = residual_block(x, filters=256)

    x = residual_block(x, filters=512, strides=2)
    x = residual_block(x, filters=512)

    x = GlobalAveragePooling2D()(x)
    outputs = Dense(num_classes, activation='softmax')(x)

    model = Model(inputs=inputs, outputs=outputs)

    return model

# Instantiate the MirroredStrategy
strategy = tf.distribute.MirroredStrategy()

# Create the ResNet model and compile it within the strategy scope
with strategy.scope():
    input_shape = (224, 224, 3)
    num_classes = 10
    resnet_model = create_resnet_model(input_shape, num_classes)
    resnet_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the ResNet model using the strategy
resnet_model.fit(train_dataset, epochs=10, validation_data=val_dataset)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we first define a residual block function, which is a building block for the ResNet architecture. We then create a ResNet model with multiple residual blocks, increasing its complexity compared to the previous example. The rest of the code remains unchanged, with the MirroredStrategy being instantiated and used to train the ResNet model on multiple GPUs.&lt;/p&gt;
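
&lt;p&gt;One note on the snippets in this article: &lt;code&gt;train_dataset&lt;/code&gt; and &lt;code&gt;val_dataset&lt;/code&gt; are assumed to already exist. As a minimal sketch, they could be built with tf.data, here using random tensors as stand-in data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf

# Stand-in data: replace with your real images and one-hot labels
num_samples, input_shape, num_classes = 256, (224, 224, 3), 10
images = tf.random.uniform((num_samples, *input_shape))
labels = tf.one_hot(tf.random.uniform((num_samples,), maxval=num_classes, dtype=tf.int32), num_classes)

dataset = tf.data.Dataset.from_tensor_slices((images, labels)).shuffle(num_samples)
train_dataset = dataset.take(200).batch(32).prefetch(tf.data.AUTOTUNE)
val_dataset = dataset.skip(200).batch(32).prefetch(tf.data.AUTOTUNE)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;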

&lt;p&gt;&lt;strong&gt;MultiWorkerMirroredStrategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MultiWorkerMirroredStrategy extends the capabilities of MirroredStrategy to support training across multiple workers, each with potentially multiple devices. This strategy is particularly useful when you need to scale your training process beyond a single machine.&lt;/p&gt;

&lt;p&gt;In this example we will use the same complex ResNet model as before, but we will train it using MultiWorkerMirroredStrategy. This will allow us to distribute the learning process across multiple machines, each with multiple GPUs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import json
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, Add, MaxPooling2D, GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

# Define the residual block and create_resnet_model functions as shown in the previous example

# Define the strategy and worker configurations
num_workers = 2
worker_ip_addresses = ['192.168.1.100:12345', '192.168.1.101:12345']  # each entry is host:port
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': worker_ip_addresses
    },
    'task': {'type': 'worker', 'index': 0}
})

# Instantiate the MultiWorkerMirroredStrategy
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Create the ResNet model and compile it within the strategy scope
with strategy.scope():
    input_shape = (224, 224, 3)
    num_classes = 10
    resnet_model = create_resnet_model(input_shape, num_classes)
    resnet_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the ResNet model using the strategy
resnet_model.fit(train_dataset, epochs=10, validation_data=val_dataset)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we use the same ResNet model architecture as in the previous MirroredStrategy example. The primary difference is that we now define the number of workers and their IP addresses, and set up the &lt;code&gt;TF_CONFIG&lt;/code&gt; environment variable to configure the distributed training. We then instantiate the MultiWorkerMirroredStrategy and train the ResNet model on multiple machines with multiple GPUs.&lt;/p&gt;
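
&lt;p&gt;Keep in mind that the TF_CONFIG shown above describes only the first worker (index 0). Every machine in the cluster runs the same script but must export its own copy with its own index before the strategy is created. A small helper (hypothetical, just to make the idea concrete) could look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import os

def set_tf_config(worker_hosts, worker_index):
    # worker_hosts are "host:port" strings shared by all machines;
    # only the index differs from machine to machine
    os.environ['TF_CONFIG'] = json.dumps({
        'cluster': {'worker': worker_hosts},
        'task': {'type': 'worker', 'index': worker_index},
    })

# On the second machine (192.168.1.101) you would call:
set_tf_config(['192.168.1.100:12345', '192.168.1.101:12345'], worker_index=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;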

&lt;p&gt;&lt;strong&gt;CentralStorageStrategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CentralStorageStrategy is another synchronous learning strategy provided by TensorFlow. Unlike MirroredStrategy and MultiWorkerMirroredStrategy, this strategy stores the model’s variables in a centralized location (usually the CPU). The gradients are still computed independently on each device, but they are aggregated and applied to the centrally stored variables.&lt;/p&gt;

&lt;p&gt;In this example, we will use the same complex ResNet model as before, but we will train it using the CentralStorageStrategy strategy. This strategy allows us to store the model variables in a centralized location (usually the CPU), but to compute gradients independently on each device.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, Add, MaxPooling2D, GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

# Define the residual block and create_resnet_model functions as shown in the previous examples

# Instantiate the CentralStorageStrategy
strategy = tf.distribute.experimental.CentralStorageStrategy()

# Create the ResNet model and compile it within the strategy scope
with strategy.scope():
    input_shape = (224, 224, 3)
    num_classes = 10
    resnet_model = create_resnet_model(input_shape, num_classes)
    resnet_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the ResNet model using the strategy
resnet_model.fit(train_dataset, epochs=10, validation_data=val_dataset)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we use the same ResNet model architecture as in the previous MirroredStrategy and MultiWorkerMirroredStrategy examples. The main difference is that we instantiate the CentralStorageStrategy instead of the other strategies. The rest of the code remains unchanged, and we train the ResNet model using the CentralStorageStrategy. This strategy can be particularly useful when memory constraints on the devices are a concern, as it stores the model’s variables in a centralized location.&lt;/p&gt;

&lt;h2&gt;
  
  
  Asynchronous learning strategies
&lt;/h2&gt;

&lt;p&gt;Asynchronous learning strategies allow devices to update model parameters independently and without waiting for other devices to complete their computations. TensorFlow offers ParameterServerStrategy for implementing asynchronous learning with data and model parallelism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ParameterServerStrategy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ParameterServerStrategy uses a set of parameter servers storing model variables and a set of worker tasks responsible for computing gradients. The worker tasks asynchronously retrieve the latest model parameters from the parameter servers, compute gradients using their local data, and pass the gradients back to the parameter servers, which then update the model parameters.&lt;/p&gt;

&lt;p&gt;In this example, we will use the same complex ResNet model as before, but train it using ParameterServerStrategy. This strategy allows us to implement asynchronous learning with data and model parallelism, using a set of parameter servers storing model variables and a set of worker tasks responsible for calculating gradients.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, Add, MaxPooling2D, GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

# Define the residual block and create_resnet_model functions as shown in the previous examples

# Define the strategy and cluster configurations
num_ps = 2
num_workers = 4
cluster_spec = tf.train.ClusterSpec({
    'ps': ['ps0.example.com:2222', 'ps1.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222', 'worker2.example.com:2222', 'worker3.example.com:2222']
})
task_type = 'worker'  # or 'ps' for parameter servers
task_index = 0  # index of the current task (e.g., worker or parameter server)

# Instantiate the ParameterServerStrategy
strategy = tf.distribute.experimental.ParameterServerStrategy()

# Create the ResNet model and compile it within the strategy scope
with strategy.scope():
    input_shape = (224, 224, 3)
    num_classes = 10
    resnet_model = create_resnet_model(input_shape, num_classes)
    resnet_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the ResNet model using the strategy
resnet_model.fit(train_dataset, epochs=10, validation_data=val_dataset)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we use the same ResNet model architecture as in the previous MirroredStrategy, MultiWorkerMirroredStrategy, and CentralStorageStrategy examples. The main differences are that we define the number of parameter servers and workers, as well as the cluster specification that includes their addresses. We also set the task type and index for the current task. After that, we instantiate the ParameterServerStrategy and train the ResNet model as we did with the other strategies. This strategy is especially effective when both data and model parallelism are required, as well as when there is a tolerance for higher communication overhead.&lt;/p&gt;
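
&lt;p&gt;One caveat: in the snippet above, cluster_spec, task_type, and task_index are defined but never handed to the strategy, because older TensorFlow versions read the cluster configuration from TF_CONFIG. In more recent TensorFlow 2.x releases, ParameterServerStrategy instead expects a cluster resolver, roughly like this (a sketch; the exact setup depends on your version and how you launch the tasks):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf

# Wrap the cluster description in a resolver and pass it to the strategy
cluster_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
    cluster_spec, task_type=task_type, task_id=task_index, rpc_layer='grpc'
)
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;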

&lt;h2&gt;
  
  
  Choosing the Right Strategy
&lt;/h2&gt;

&lt;p&gt;Selecting the most suitable distributed learning strategy in TensorFlow depends on various factors, including the scale of your deep learning tasks, the available hardware resources, and the communication overhead between devices or workers. Here are some guidelines to help you choose between synchronous and asynchronous strategies based on specific use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you have a single machine with multiple GPUs, consider using MirroredStrategy, as it allows you to achieve data parallelism with minimal communication overhead.&lt;/li&gt;
&lt;li&gt;If you need to scale your training process across multiple machines, each with multiple devices, MultiWorkerMirroredStrategy can be an excellent choice.&lt;/li&gt;
&lt;li&gt;If memory constraints on the devices are a concern, CentralStorageStrategy might be a suitable option, as it stores the model’s variables in a centralized location.&lt;/li&gt;
&lt;li&gt;For scenarios that require both data and model parallelism, as well as tolerance for higher communication overhead, ParameterServerStrategy can be an effective asynchronous learning solution.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;In this article, we delved into the world of distributed deep learning in TensorFlow, exploring various strategies for model and data parallelism. We examined synchronous learning strategies like MirroredStrategy, MultiWorkerMirroredStrategy, and CentralStorageStrategy, as well as asynchronous learning strategies like ParameterServerStrategy. By providing practical examples, we demonstrated how to implement these strategies in TensorFlow and discussed factors to consider when choosing the right strategy for your use case.&lt;/p&gt;

&lt;p&gt;You now have a solid understanding of TensorFlow’s distributed learning strategies and can confidently apply them to your projects. So, go ahead and explore the tf.distribute package, experiment with different strategies, and optimize your deep learning tasks.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>tensorflow</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Writing a Search Engine from Scratch using FastAPI and Tantivy</title>
      <dc:creator>Dima</dc:creator>
      <pubDate>Sat, 29 Apr 2023 08:32:54 +0000</pubDate>
      <link>https://dev.to/oooodiboooo/writing-a-search-engine-from-scratch-using-fastapi-and-tantivy-4ldb</link>
      <guid>https://dev.to/oooodiboooo/writing-a-search-engine-from-scratch-using-fastapi-and-tantivy-4ldb</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82fk07tc4w4ghc5v3uin.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82fk07tc4w4ghc5v3uin.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
The importance of search engines in our daily lives cannot be overstated. They help us navigate the vast ocean of information available on the internet and make it accessible at our fingertips. In this article, I’ll guide you through the process of building a custom search engine from scratch using FastAPI, a high-performance web framework for Python, and Tantivy, a fast, full-text search engine library written in Rust.&lt;/p&gt;

&lt;p&gt;Before diving into the code, we need to set up our development environment. First, ensure that you have Python installed on your system; the examples in this article use built-in generic types such as list[...], so Python 3.9 or higher is recommended. Next, install FastAPI and its dependencies. You can do this using the following command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

pip install fastapi[all]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This command will install FastAPI and all the optional dependencies needed for its various features. Tantivy is a Rust library, so we’ll need to use its Python bindings, the “tantivy-py” project, which is published on PyPI as the tantivy package. Install it using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

pip install tantivy


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now that we have the necessary tools and libraries installed, create a virtual environment for your project and set up your preferred IDE or text editor.&lt;/p&gt;

&lt;h2&gt;
  
  
  FastAPI
&lt;/h2&gt;

&lt;p&gt;FastAPI is a modern, high-performance web framework for building APIs with Python. It’s designed to be easy to use and has built-in support for type hints, which allows for automatic data validation, serialization, and documentation generation. FastAPI also has excellent support for asynchronous programming, which improves the performance of I/O-bound operations.&lt;/p&gt;

&lt;p&gt;To create a FastAPI application, you’ll need to define routes, add parameters, and create request and response models. Routes are the different URLs or paths that your API can handle, while parameters are the variables passed in the URL, query string, or request body. Request and response models describe the data structure used for input and output, respectively.&lt;/p&gt;

&lt;p&gt;Here’s an example of a FastAPI application with routes, parameters, and request/response models:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class UserIn(BaseModel):
    name: str
    age: int
    city: str

class UserOut(BaseModel):
    id: int
    name: str
    age: int
    city: str

users = []

@app.post("/users", response_model=UserOut)
def create_user(user: UserIn):
    user_id = len(users)
    new_user = UserOut(id=user_id, **user.dict())
    users.append(new_user)
    return new_user

@app.get("/users/{user_id}", response_model=UserOut)
def get_user(user_id: int):
    if user_id &amp;gt;= len(users) or user_id &amp;lt; 0:
        raise HTTPException(status_code=404, detail="User not found")
    return users[user_id]

@app.get("/users", response_model=list[UserOut])
def search_users(city: str = None, min_age: int = None, max_age: int = None):
    filtered_users = users

    if city:
        filtered_users = [user for user in filtered_users if user.city.lower() == city.lower()]

    if min_age:
        filtered_users = [user for user in filtered_users if user.age &amp;gt;= min_age]

    if max_age:
        filtered_users = [user for user in filtered_users if user.age &amp;lt;= max_age]

    return filtered_users


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this example, we define a FastAPI application with three routes: create_user, get_user, and search_users. We use the UserIn and UserOut classes as request and response models to validate and serialize the input/output data. We also use parameters in the URL path (e.g., user_id), query string (e.g., city, min_age, max_age), and request body (e.g., user).&lt;/p&gt;

&lt;h2&gt;
  
  
  Tantivy
&lt;/h2&gt;

&lt;p&gt;Tantivy is a full-text search engine library written in Rust. It is designed to be fast and efficient, making it a great choice for building search engines. Tantivy provides indexing and searching capabilities, allowing you to create a schema, add documents to the index, and execute search queries.&lt;/p&gt;

&lt;p&gt;To work with Tantivy, you’ll need to create a schema, which is a description of the structure of your documents. The schema defines the fields in each document, their data types, and any additional options or settings. Once you have a schema, you can add documents to the index, store and retrieve data, and perform searches using basic or advanced query features, such as fuzzy search, filters, and pagination.&lt;/p&gt;

&lt;p&gt;Here’s an example of creating a schema, indexing documents, and performing basic and advanced searches using Tantivy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from tantivy import Collector, Index, QueryParser, SchemaBuilder, Term

# Create a schema
schema_builder = SchemaBuilder()
title_field = schema_builder.add_text_field("title", stored=True)
body_field = schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Create an index with the schema
index = Index(schema)

# Add documents to the index
with index.writer() as writer:
    writer.add_document({"title": "First document", "body": "This is the first document."})
    writer.add_document({"title": "Second document", "body": "This is the second document."})
    writer.commit()

# Create a query parser
query_parser = QueryParser(schema, ["title", "body"])

# Basic search
query = query_parser.parse_query("first")
collector = Collector.top_docs(10)
search_result = index.searcher().search(query, collector)

print("Basic search results:")
for doc in search_result.docs():
    print(doc)

# Fuzzy search
fuzzy_query = query_parser.parse_query("frst~1")  # Allows one edit distance
fuzzy_collector = Collector.top_docs(10)
fuzzy_search_result = index.searcher().search(fuzzy_query, fuzzy_collector)

print("Fuzzy search results:")
for doc in fuzzy_search_result.docs():
    print(doc)

# Filtered search
title_term = Term(title_field, "first")
body_term = Term(body_field, "first")
filter_query = schema.new_boolean_query().add_term(title_term).add_term(body_term)
filtered_collector = Collector.top_docs(10)
filtered_search_result = index.searcher().search(filter_query, filtered_collector)

print("Filtered search results:")
for doc in filtered_search_result.docs():
    print(doc)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this example, we first create a schema with two text fields: “title” and “body”. Then, we create an index and add documents to it. We also create a query parser to parse queries for searching the index. We demonstrate basic search, fuzzy search (with a specified edit distance), and filtered search (using boolean queries to combine terms).&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Search Engine
&lt;/h2&gt;

&lt;p&gt;Now that we have an understanding of FastAPI and Tantivy, it’s time to build our search engine. We’ll start by designing the search engine architecture, which includes the FastAPI application and the Tantivy index.&lt;/p&gt;

&lt;p&gt;First, create a FastAPI application by defining search and indexing endpoints. These endpoints will be responsible for processing search queries and indexing new documents, respectively. Implement request and response models for each endpoint to describe the data structure used for input and output.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from tantivy import Collector, Index, QueryParser, SchemaBuilder

app = FastAPI()

# Create a schema
schema_builder = SchemaBuilder()
title_field = schema_builder.add_text_field("title", stored=True)
body_field = schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Create an index with the schema
index = Index(schema)

# Create a query parser
query_parser = QueryParser(schema, ["title", "body"])

class DocumentIn(BaseModel):
    title: str
    body: str

class DocumentOut(BaseModel):
    title: str
    body: str

@app.post("/index", response_model=None)
def index_document(document: DocumentIn):
    with index.writer() as writer:
        writer.add_document(document.dict())
        writer.commit()

@app.get("/search", response_model=list[DocumentOut])
def search_documents(q: str):
    query = query_parser.parse_query(q)
    collector = Collector.top_docs(10)
    search_result = index.searcher().search(query, collector)

    documents = [DocumentOut(**doc) for doc in search_result.docs()]
    return documents


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this example, we create a FastAPI application with two endpoints: index_document and search_documents. The index_document endpoint is responsible for indexing new documents, while the search_documents endpoint is responsible for processing search queries. We use the DocumentIn and DocumentOut classes as request and response models to describe the data structure for input and output.&lt;/p&gt;

&lt;p&gt;Next, index documents using Tantivy. Write a function that processes and stores data in the Tantivy index. This function should take input data, create a document based on the schema, and add it to the index.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from typing import Dict
from tantivy import Index, SchemaBuilder

# Create a schema
schema_builder = SchemaBuilder()
title_field = schema_builder.add_text_field("title", stored=True)
body_field = schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Create an index with the schema
index = Index(schema)

def index_document(document_data: Dict[str, str]) -&amp;gt; None:
    with index.writer() as writer:
        writer.add_document(document_data)
        writer.commit()

# Example usage
document = {"title": "Example document", "body": "This is an example document."}
index_document(document)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this example, we define a function called index_document that takes a dictionary as input data, representing a document to be indexed. This function creates a document based on the schema and adds it to the Tantivy index. The example usage demonstrates how to use the function to index a sample document.&lt;/p&gt;

&lt;p&gt;Finally, implement the search functionality. Use Tantivy to execute search queries, and handle search results by processing and returning them in a format that can be easily consumed by the client.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

from typing import Dict, List
from tantivy import Collector, Index, QueryParser, SchemaBuilder

# Create a schema
schema_builder = SchemaBuilder()
title_field = schema_builder.add_text_field("title", stored=True)
body_field = schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Create an index with the schema
index = Index(schema)

# Create a query parser
query_parser = QueryParser(schema, ["title", "body"])

def search_documents(query_str: str) -&amp;gt; List[Dict[str, str]]:
    query = query_parser.parse_query(query_str)
    collector = Collector.top_docs(10)
    search_result = index.searcher().search(query, collector)

    documents = [doc.as_json() for doc in search_result.docs()]
    return documents

# Example usage
search_query = "example"
results = search_documents(search_query)
print(f"Search results for '{search_query}':")
for doc in results:
    print(doc)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this example, we define a function called search_documents that takes a query string as input and uses Tantivy to execute the search query. The function processes the search results by converting each result document to a dictionary and returns a list of dictionaries that can be easily consumed by the client. The example usage demonstrates how to use the function to perform a search and display the results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing and Optimization
&lt;/h2&gt;

&lt;p&gt;To ensure that your search engine works correctly and efficiently, write unit tests for the FastAPI and Tantivy components. Test the functionality of each endpoint and the proper interaction between FastAPI and Tantivy. Additionally, benchmark your search engine to assess its performance and identify any bottlenecks or areas for improvement. Optimize your code by addressing these bottlenecks and making any necessary adjustments.&lt;/p&gt;

&lt;p&gt;Here’s an example of unit tests for the FastAPI and Tantivy components using pytest and httpx libraries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import pytest
import httpx
from fastapi import FastAPI
from fastapi.testclient import TestClient
from .main import app, index_document, search_documents

client = TestClient(app)

# Test FastAPI endpoints
def test_index_document():
    response = client.post("/index", json={"title": "Test document", "body": "This is a test document."})
    assert response.status_code == 200

def test_search_documents():
    response = client.get("/search", params={"q": "test"})
    assert response.status_code == 200
    assert len(response.json()) &amp;gt; 0
    assert response.json()[0]["title"] == "Test document"

# Test Tantivy components
def test_tantivy_index_document():
    document = {"title": "Tantivy test document", "body": "This is a Tantivy test document."}
    index_document(document)

def test_tantivy_search_documents():
    search_query = "tantivy"
    results = search_documents(search_query)
    assert len(results) &amp;gt; 0
    assert results[0]["title"] == "Tantivy test document"

# Performance benchmarking and optimization can be done using profiling tools,
# such as cProfile, Py-Spy, or others, depending on the specific bottlenecks and areas
# for improvement identified in your application.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To benchmark the search engine and identify bottlenecks, you can use profiling tools, such as cProfile or Py-Spy. Once you’ve identified the areas for improvement, optimize your code by addressing the bottlenecks and making necessary adjustments. Performance optimization is an iterative process and may require multiple rounds of profiling and optimization.&lt;/p&gt;
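
&lt;p&gt;As a starting point, here’s a small sketch of profiling the search path with cProfile, reusing the search_documents function from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import cProfile
import pstats

# Profile repeated searches and print the functions with the highest cumulative time
profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    search_documents("test")
profiler.disable()

stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)  # show the top 10 entries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;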

&lt;h2&gt;
  
  
  Deploying the Search Engine
&lt;/h2&gt;

&lt;p&gt;Once your search engine is fully functional and optimized, it’s time to deploy it. Choose a deployment platform that best suits your needs, such as a cloud provider or a dedicated server. Configure the deployment environment by setting up necessary components, such as a web server, application server, and any required databases or storage systems.&lt;/p&gt;

&lt;p&gt;After deployment, monitor and maintain your search engine to ensure its smooth operation. Keep an eye on performance metrics, such as response times and resource utilization, and address any issues that arise.&lt;/p&gt;

&lt;p&gt;In this article, we’ve explored the process of building a search engine from scratch using FastAPI and Tantivy. We’ve covered the fundamentals of both FastAPI and Tantivy, as well as the steps needed to create, test, optimize, and deploy a custom search engine. By following this guide, you should now have a working search engine that can be tailored to your specific needs.&lt;/p&gt;

&lt;p&gt;The possibilities with this custom search engine are vast, and you can extend its functionality to accommodate various applications, such as site search, document search, or even powering a custom search service. As you continue to experiment and explore, you’ll discover the true power and flexibility of using FastAPI and Tantivy to create search solutions that meet your unique requirements.&lt;/p&gt;

</description>
      <category>python</category>
      <category>fastapi</category>
      <category>tantivy</category>
      <category>rust</category>
    </item>
    <item>
      <title>Connecting Your Flutter Application to a Backend</title>
      <dc:creator>Dima</dc:creator>
      <pubDate>Thu, 13 Apr 2023 08:43:21 +0000</pubDate>
      <link>https://dev.to/oooodiboooo/connecting-your-flutter-application-to-a-backend-26of</link>
      <guid>https://dev.to/oooodiboooo/connecting-your-flutter-application-to-a-backend-26of</guid>
      <description>&lt;p&gt;When developing a Flutter application, you probably want to keep development costs to a minimum, make the solution fast, reliable, and secure. In this guide (or rather a technical overview with comments), I will explain how to connect your Flutter application to a backend using HTTP REST API and gRPC. You will learn how to choose the right backend for your application, how to set it up, and how to connect your application to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to choose?
&lt;/h2&gt;

&lt;p&gt;It is no secret that REST APIs are the most common way of communication between the frontend and backend in modern web applications and microservices-based infrastructures. However, it is worth noting that microservices can use other communication methods as well.&lt;/p&gt;

&lt;p&gt;An HTTP REST API uses the HTTP protocol to transfer data between the client and the server. It is easy to use and familiar to most developers. To use it, you define the requests that your application should send to the server, as well as the structure of the responses you expect to receive. This article provides an example using the Dio package.&lt;/p&gt;

&lt;p&gt;gRPC (Google Remote Procedure Call) is a newer approach to client-server communication, based on an open-source RPC framework. It serializes structured data with Protobuf, a compact binary message format, and runs over HTTP/2, which provides multiplexing and streaming. In some cases, a gRPC API can be more efficient than a REST API.&lt;/p&gt;

&lt;p&gt;When developing an application, the frontend and backend are usually created by different people with different competencies. For REST, tools such as Swagger and Redoc are used to document the API for its consumers; with gRPC, the frontend receives almost ready-to-use client code generated from the proto files. It is also worth considering that if the project includes a web application, where browser support for gRPC is limited, maintaining gRPC only for the mobile application may be too expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connecting with HTTP REST
&lt;/h2&gt;

&lt;p&gt;To work with the REST API, I recommend using the Dio package because it is more functional and convenient than the standard http package. If you have built web applications before, you will find it similar to axios.&lt;/p&gt;

&lt;p&gt;To use Dio, you need to create a class for working with network connections, in which you will define the requests that your application should send to the server, as well as the structure of the responses you expect to receive.&lt;/p&gt;

&lt;p&gt;Add the dio package to your project:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;flutter pub add dio&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's create a class for working with network connections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import 'package:dio/dio.dart';

class NetworkService {
  late final Dio _dio;
  final JsonEncoder _encoder = const JsonEncoder();
  static final NetworkService _instance = NetworkService.internal();

  NetworkService.internal();

  static NetworkService get instance =&amp;gt; _instance;

  Future&amp;lt;void&amp;gt; initClient() async {
    _dio = Dio(
      BaseOptions(
        baseUrl: Constant.baseUrl,
        connectTimeout: 60000,
        receiveTimeout: 6000,
      ),
    );
// A place for interceptors. For example, for authentication and logging
  }

  Future&amp;lt;dynamic&amp;gt; get(
    String url, {
    Map&amp;lt;String, dynamic&amp;gt;? queryParameters,
  }) async {
    try {
      final response = await _dio.get(url, queryParameters: queryParameters);
      return response.data;
    } on DioError catch (e) {
      final data = Map&amp;lt;String, dynamic&amp;gt;.from(e.response?.data);
      throw Exception(data['message'] ?? "Error while fetching data");
    } catch (e) {
      rethrow;
    }
  }

  Future&amp;lt;dynamic&amp;gt; download(String url, String path) async {
    return _dio.download(url, path).then((Response response) {
      if (response.statusCode! &amp;lt; 200 || response.statusCode! &amp;gt;= 400) {
        throw Exception("Error while fetching data");
      }
      return response.data;
    }).onError((error, stackTrace) {
      throw Exception(error);
    });
  }

  Future&amp;lt;dynamic&amp;gt; delete(String url) async {
    return _dio.delete(url).then((Response response) {
      if (response.statusCode! &amp;lt; 200 || response.statusCode! &amp;gt;= 400) {
        throw Exception("Error while fetching data");
      }
      return response.data;
    }).onError((DioError error, stackTrace) {
      _log(error.response);
    });
  }

  Future&amp;lt;dynamic&amp;gt; post(String url, {body, encoding}) async {
    try {
      final response = await _dio.post(url, data: _encoder.convert(body));
      return response.data;
    } on DioError catch (e) {
      throw Exception(e.response?.data['detail'] ?? e.toString());
    } catch (e) {
      rethrow;
    }
  }

  Future&amp;lt;dynamic&amp;gt; postFormData(String url, {required FormData data}) async {
    try {
      final response = await _dio.post(url, data: data);
      return response.data;
    } on DioError catch (e) {
      throw Exception(e.response?.data['detail'] ?? e.toString());
    } catch (e) {
      rethrow;
    }
  }

  Future&amp;lt;dynamic&amp;gt; patch(String url, {body, encoding}) async {
    try {
      final response = await _dio.patch(url, data: _encoder.convert(body));
      return response.data;
    } on DioError catch (e) {
      throw Exception(e.response?.data['detail'] ?? e.toString());
    } catch (e) {
      rethrow;
    }
  }

  Future&amp;lt;dynamic&amp;gt; put(String url, {body, encoding}) async {
    try {
      final response = await _dio.put(url, data: _encoder.convert(body));
      return response.data;
    } on DioError catch (e) {
      throw e.toString();
    } catch (e) {
      rethrow;
    }
  }
  // Simple logger so failed requests can be inspected in the debug console.
  void _log(Object? message) {
    log(message.toString(), name: 'NetworkService');
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
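
&lt;p&gt;The initClient() method above leaves a placeholder for interceptors. Below is a minimal sketch of what that spot could look like for authentication and logging; getToken is a hypothetical helper standing in for wherever your app stores credentials:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import 'package:dio/dio.dart';

/// Sketch only: attach interceptors to the Dio instance created in initClient().
void addInterceptors(Dio dio, String Function() getToken) {
  dio.interceptors.add(
    InterceptorsWrapper(
      onRequest: (options, handler) {
        // Attach an auth header to every outgoing request.
        options.headers['Authorization'] = 'Bearer ${getToken()}';
        handler.next(options);
      },
      onError: (DioError error, handler) {
        // Central place to inspect failed requests before they reach callers.
        print('Request to ${error.requestOptions.uri} failed: ${error.message}');
        handler.next(error);
      },
    ),
  );

  // Dio also ships a ready-made logging interceptor.
  dio.interceptors.add(LogInterceptor(requestBody: true, responseBody: true));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;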



&lt;p&gt;Usage example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final NetworkService _client;

Future&amp;lt;String&amp;gt; login(LoginRequest loginRequest) async {
  try {
    final jsonData = await _client.post(
      "${Constant.baseUrl}/v1/auth/login",
      body: loginRequest.toJson()
    );
    return jsonData['access_token'];
  } catch (e) {
    rethrow;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
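
&lt;p&gt;For completeness, here is a minimal sketch of reading data with the get() helper and mapping it to a model. The /v1/articles endpoint and the Article class are hypothetical placeholders; substitute your own API routes and data classes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Article {
  final int id;
  final String title;

  Article({required this.id, required this.title});

  factory Article.fromJson(Map&amp;lt;String, dynamic&amp;gt; json) =&amp;gt;
      Article(id: json['id'] as int, title: json['title'] as String);
}

Future&amp;lt;List&amp;lt;Article&amp;gt;&amp;gt; fetchArticles() async {
  // The relative path is resolved against the baseUrl set in initClient().
  final jsonData = await NetworkService.instance.get(
    '/v1/articles',
    queryParameters: {'limit': 20},
  );
  return (jsonData as List)
      .map((item) =&amp;gt; Article.fromJson(item as Map&amp;lt;String, dynamic&amp;gt;))
      .toList();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;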



&lt;h2&gt;
  
  
  Connecting with gRPC
&lt;/h2&gt;

&lt;p&gt;To work with gRPC, you use the client code generated from your proto files. Create a service class that wraps HelloClient, the client class generated by the gRPC tooling. When you create an instance of HelloClient, you pass it a channel that will be used to send requests.&lt;/p&gt;

&lt;p&gt;Now, integrate the generated gRPC client into your Flutter application:&lt;/p&gt;

&lt;p&gt;Add the grpc package to your project:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;flutter pub add grpc&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create the service class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import 'package:grpc/grpc.dart';
// import your autogenerated code
import '../services/proto/hello.pbgrpc.dart';

class HelloService {

  /// Enter your host here, without the scheme (no http:// or https:// prefix).
  String baseUrl = "example.com";

  HelloService._internal();
  static final HelloService _instance = HelloService._internal();

  factory HelloService() =&amp;gt; _instance;

  /// Static HelloService instance that we will use when making requests.
  static HelloService get instance =&amp;gt; _instance;
  /// HelloClient is the class that was generated for us when we ran the generation command.
  /// We will pass a channel to it to initialize it.
  late HelloClient _helloClient;

  /// Creates the channel used by the client.
  /// Call HelloService().init() before making any call.
  Future&amp;lt;void&amp;gt; init() async {
    final channel = ClientChannel(
      baseUrl,
      port: 443,
      options: const ChannelOptions(), // defaults to TLS credentials, matching port 443
    );
    _helloClient = HelloClient(channel);
  }

  /// Provides public access to the HelloClient instance.
  HelloClient get helloClient {
    return _helloClient;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
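
&lt;p&gt;Note that with port 443 the default ChannelOptions uses TLS credentials. If you are experimenting against a local plaintext gRPC server instead, the channel can be created with insecure credentials; the host and port below are placeholders for your own setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import 'package:grpc/grpc.dart';

/// Sketch for local development only: connect to a gRPC server without TLS.
ClientChannel createDevChannel() {
  return ClientChannel(
    'localhost',
    port: 50051,
    options: ChannelOptions(
      credentials: ChannelCredentials.insecure(),
    ),
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;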



&lt;p&gt;In your main function, initialize the HelloService class as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HelloService().init();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
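
&lt;p&gt;Since init() is asynchronous, a minimal sketch of main() would await it before running the app; MyApp below is just a placeholder for your root widget:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import 'package:flutter/widgets.dart';

Future&amp;lt;void&amp;gt; main() async {
  // Create the gRPC channel once, before any call is made.
  await HelloService().init();
  runApp(MyApp()); // MyApp is a placeholder for your root widget
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;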



&lt;p&gt;Make a gRPC call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Future&amp;lt;void&amp;gt; sayHello() async {
  try {
    HelloRequest helloRequest = HelloRequest();
    helloRequest.name = "Itachi";
    final res = await HelloService.instance.helloClient.sayHello(helloRequest);
    print(res); // the response message returned by the server
  } catch (e) {
    print(e);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Comparing REST and gRPC for Flutter Applications
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Advantages of REST: Simple, easy to understand, widely supported, cacheable, and suitable for most applications.&lt;/li&gt;
&lt;li&gt;Disadvantages of REST: Higher latency, less efficient serialization, limited streaming capabilities.&lt;/li&gt;
&lt;li&gt;Advantages of gRPC: Low latency, efficient serialization, streaming support, strong client libraries, and ideal for real-time applications and microservices.&lt;/li&gt;
&lt;li&gt;Disadvantages of gRPC: Steeper learning curve, not as human-readable, less support in web browsers, and may be overkill for simple applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When choosing between REST and gRPC for your Flutter application, consider the specific needs and requirements of your app. REST is a solid choice for most applications, while gRPC is well-suited for real-time applications or those with complex communication patterns between services.&lt;/p&gt;

&lt;p&gt;In this article, we explored how to organize a Flutter application with an API backend, focusing on two main approaches: HTTP REST and gRPC. We discussed how to choose between them and how to implement the corresponding clients in Flutter.&lt;/p&gt;

&lt;p&gt;By understanding the advantages and disadvantages of each approach, you can make an informed decision when designing your Flutter application with an API backend. Choose the best approach for your app based on its specific needs, and build efficient, scalable, and maintainable applications.&lt;/p&gt;

</description>
      <category>flutter</category>
      <category>mobile</category>
      <category>backend</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
