DEV Community: Tomoya Oda

PyTorch Distributed Data Parallel (DDP) using Hugging Face Accelerate

Tomoya Oda — Tue, 25 Jul 2023 20:42:51 +0000

This article is translated from my Japanese tech blog.
https://tmyoda.hatenablog.com/entry/20210314/1615712115

Introduction

Recently, PyTorch has been recommending DDP (Distributed Data Parallel). However, using it requires several changes to the source code.

So, I'll introduce an implementation of DDP using Hugging Face Accelerate.
It's very handy, and I think it's the best DDP library so far.

Preparation

I used this image dataset.
https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/overview

Data -> Download All

Training Code

I used the VGG16 model from torchvision.
The train_model function here is modified for DDP (Distributed Data Parallel).

https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html

I referred to this kernel for the Dataloader.
https://www.kaggle.com/alpaca0984/dog-vs-cat-with-pytorch#Generate-submittion.csv

nn.DataParallel

import time
import copy
from tqdm import tqdm
import multiprocessing as mp
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms
import torchvision.models as models
from torchvision import datasets
import matplotlib.pyplot as plt

from src.datasets import DogCatDataset

# Config
IMAGE_SIZE = 224
NUM_CLASSES = 2
BATCH_SIZE = 50
NUM_EPOCH = 1

# Seed
torch.manual_seed(42)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_dir = './data/train'
test_dir = './data/test'

# Data preprocessing
transform = transforms.Compose(
    [
        # Resize
        transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
        # To tensor
        transforms.ToTensor(),
        # Std
        transforms.Normalize(mean=[0.4883, 0.4551, 0.4170],
                             std=[0.2257, 0.2211, 0.2214])
    ]
)


# Get training data
train_dataset = DogCatDataset(
    csv_file="./data/train.csv",
    root_dir=train_dir,
    transform=transform
)

# Split train and val
n_samples = len(train_dataset)  # n_samples is 25000
train_size = int(len(train_dataset) * 0.8)  # train_size is 20000
val_size = n_samples - train_size  # val_size is 5000

train_dataset, val_dataset = torch.utils.data.random_split(
    train_dataset, [train_size, val_size])
datasets = {'train': train_dataset, 'val': val_dataset}

# Create training and validation dataloaders
dataloaders = {
    x: torch.utils.data.DataLoader(
        datasets[x],
        batch_size=BATCH_SIZE,
        shuffle=True,
        pin_memory=True,
        num_workers=mp.cpu_count()) for x in ['train', 'val']
}


# Fine-tuning
model = models.vgg16(pretrained=True, progress=True)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)
# Transfer to GPU
model = model.to(device)
# Use multi-GPU: uncomment the next line to wrap the model in nn.DataParallel
# model = nn.DataParallel(model)

# optimizer SGD
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Loss function
criterion = nn.CrossEntropyLoss()

# Start training 
# https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html
since = time.time()
history = {'accuracy': [],
            'val_accuracy': [],
            'loss': [],
            'val_loss': []}


best_model_wts = copy.deepcopy(model.state_dict())
best_acc = 0.0

for epoch in range(NUM_EPOCH):
    print('Epoch {}/{}'.format(epoch, NUM_EPOCH - 1))
    print('-' * 10)

    # Each epoch has a training and validation phase
    for phase in ['train', 'val']:
        if phase == 'train':
            model.train()  # Set model to training mode
        else:
            model.eval()   # Set model to evaluate mode

        running_loss = 0.0
        running_corrects = 0

        # Iterate over data.
        for inputs, labels in tqdm(dataloaders[phase]):
            inputs = inputs.to(device)
            labels = labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward
            # track history if only in train
            with torch.set_grad_enabled(phase == 'train'):
                # Get model outputs and calculate loss
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                _, preds = torch.max(outputs, 1)

                # backward + optimize only if in training phase
                if phase == 'train':
                    loss.backward()
                    optimizer.step()

            # statistics
            running_loss += loss * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data)

        epoch_loss = running_loss.item() / len(dataloaders[phase].dataset)
        epoch_acc = running_corrects.double() / len(dataloaders[phase].dataset)

        print(
            '{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase,
                epoch_loss,
                epoch_acc))

        # deep copy the model
        if phase == 'val' and epoch_acc > best_acc:
            best_acc = epoch_acc
            best_model_wts = copy.deepcopy(model.state_dict())

        if phase == 'train':
            history['accuracy'].append(epoch_acc.item())
            history['loss'].append(epoch_loss)
        else:
            history['val_accuracy'].append(epoch_acc.item())
            history['val_loss'].append(epoch_loss) 

    print()

time_elapsed = time.time() - since
print(
    'Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed //
        60,
        time_elapsed %
        60))
print('Best val Acc: {:4f}'.format(best_acc))

# load best model weights
model.load_state_dict(best_model_wts)

model = model.to('cpu')
torch.save(model.state_dict(), './model/best.pth')


# plot
acc = history['accuracy']
val_acc = history['val_accuracy']
loss = history['loss']
val_loss = history['val_loss']
epochs_range = range(NUM_EPOCH)

plt.figure(figsize=(24, 8))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.savefig("training_results.png")

Hugging Face Accelerate

https://huggingface.co/docs/accelerate/

pip install accelerate

Source Code

import time
import copy
from tqdm import tqdm
import multiprocessing as mp
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms
import torchvision.models as models
from torchvision import datasets
import matplotlib.pyplot as plt
from accelerate import Accelerator

from src.datasets import DogCatDataset

# Config
IMAGE_SIZE = 224
NUM_CLASSES = 2
BATCH_SIZE = 50
NUM_EPOCH = 2

# Seed
torch.manual_seed(42)
np.random.seed(42)

# Device
accelerator = Accelerator()
device = accelerator.device

train_dir = './data/train'
test_dir = './data/test'

# Data preprocessing
transform = transforms.Compose(
    [
        # Resize
        transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),
        # To tensor
        transforms.ToTensor(),
        # Std
        transforms.Normalize(mean=[0.4883, 0.4551, 0.4170],
                             std=[0.2257, 0.2211, 0.2214])
    ]
)


# Get training data
train_dataset = DogCatDataset(
    csv_file="./data/train.csv",
    root_dir=train_dir,
    transform=transform
)

# Split train and val
n_samples = len(train_dataset)  # n_samples is 25000
train_size = int(len(train_dataset) * 0.8)  # train_size is 20000
val_size = n_samples - train_size  # val_size is 5000

train_dataset, val_dataset = torch.utils.data.random_split(
    train_dataset, [train_size, val_size])
datasets = {'train': train_dataset, 'val': val_dataset}

# Create training and validation dataloaders
dataloaders = {
    x: torch.utils.data.DataLoader(
        datasets[x],
        batch_size=BATCH_SIZE,
        shuffle=True,
        pin_memory=True,
        drop_last=False if x == 'val' else True,
        num_workers=mp.cpu_count()) for x in ['train', 'val']
}


# Fine-tuning
model = models.vgg16(pretrained=True, progress=True)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)

# optimizer SGD
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Loss function
criterion = nn.CrossEntropyLoss()

# Prepare everything
# There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
# prepare method.
model, optimizer, dataloaders['train'], dataloaders['val'] = accelerator.prepare(
    model, optimizer, dataloaders['train'], dataloaders['val'])

# Start training
# https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html
since = time.time()
history = {'accuracy': [],
           'val_accuracy': [],
           'loss': [],
           'val_loss': []}


best_model_wts = copy.deepcopy(model.state_dict())
best_acc = 0.0

for epoch in range(NUM_EPOCH):
    # Use accelerator.print to print only on the main process.
    accelerator.print('Epoch {}/{}'.format(epoch, NUM_EPOCH - 1))
    accelerator.print('-' * 10)

    # Each epoch has a training and validation phase
    for phase in ['train', 'val']:
        if phase == 'train':
            model.train()  # Set model to training mode
        else:
            model.eval()   # Set model to evaluate mode

        running_loss = 0.0
        running_corrects = 0

        # Iterate over data.
        for inputs, labels in tqdm(dataloaders[phase]):
            # zero the parameter gradients
            optimizer.zero_grad()

            # forward
            # track history if only in train
            with torch.set_grad_enabled(phase == 'train'):
                # Get model outputs and calculate loss
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                _, preds = torch.max(outputs, 1)

                # backward + optimize only if in training phase
                if phase == 'train':
                    accelerator.backward(loss)
                    optimizer.step()

            # statistics
            running_loss += loss * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data)

        all_running_loss = accelerator.gather(running_loss)
        all_running_corrects = accelerator.gather(running_corrects)

        if accelerator.is_local_main_process:
            epoch_loss = all_running_loss.sum().item() / len(dataloaders[phase].dataset)
            epoch_acc = all_running_corrects.sum().double() / len(dataloaders[phase].dataset)

            print(
                '{} Loss: {:.4f} Acc: {:.4f}'.format(
                    phase,
                    epoch_loss,
                    epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                unwrapped_model = accelerator.unwrap_model(model)
                best_model_wts = copy.deepcopy(unwrapped_model.state_dict())

            if phase == 'train':
                history['accuracy'].append(epoch_acc.item())
                history['loss'].append(epoch_loss)
            else:
                history['val_accuracy'].append(epoch_acc.item())
                history['val_loss'].append(epoch_loss)

    accelerator.print()

if accelerator.is_local_main_process:
    time_elapsed = time.time() - since
    print(
        'Training complete in {:.0f}m {:.0f}s'.format(
            time_elapsed //
            60,
            time_elapsed %
            60))
    print('Best val Acc: {:4f}'.format(best_acc))

    torch.save(best_model_wts, './model/best.pth')

    # plot
    acc = history['accuracy']
    val_acc = history['val_accuracy']
    loss = history['loss']
    val_loss = history['val_loss']
    epochs_range = range(NUM_EPOCH)

    plt.figure(figsize=(24, 8))
    plt.subplot(1, 2, 1)
    plt.plot(epochs_range, acc, label='Training Accuracy')
    plt.plot(epochs_range, val_acc, label='Validation Accuracy')
    plt.legend(loc='lower right')
    plt.title('Training and Validation Accuracy')

    plt.subplot(1, 2, 2)
    plt.plot(epochs_range, loss, label='Training Loss')
    plt.plot(epochs_range, val_loss, label='Validation Loss')
    plt.legend(loc='upper right')
    plt.title('Training and Validation Loss')
    plt.savefig("training_results.png")

Command for training

Example for single node, 4 GPU

$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]:1
Do you want to use DeepSpeed? [yes/NO]: NO
How many processes in total will you use? [1]:4
Do you wish to use FP16 (mixed precision)? [yes/NO]: NO

accelerate launch train.py

DDP (accelerate) Explanation

Specify the device as follows.

accelerator = Accelerator()
device = accelerator.device

Pass the model, optimizer, and dataloader to the prepare function.

model, optimizer, dataloaders['train'], dataloaders['val'] = accelerator.prepare(
    model, optimizer, dataloaders['train'], dataloaders['val'])

When aggregating information from the other processes (during evaluation, etc.), use the gather function as follows.

        all_running_loss = accelerator.gather(running_loss)
        all_running_corrects = accelerator.gather(running_corrects)

The model is saved as follows.

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
accelerator.save(unwrapped_model.state_dict(), filename)

The following process is only performed in the main process.

if accelerator.is_local_main_process:

Time comparison

This is not a rigorous comparison, so take it as a reference.

GPU: 4
Batch size per 1 GPU: 50
Epoch: 1
Fine-tuning the pretrained VGG16 model from torchvision

cuda: 0

4m 35s

train Loss: 0.0136 Acc: 0.9951
val Loss: 0.0240 Acc: 0.9912

Training complete in 4m 35s

nn.DataParallel

1m 39s

train Loss: 0.0459 Acc: 0.9817
val Loss: 0.0235 Acc: 0.9908

Training complete in 1m 39s

DDP (accelerate)

0m 42s

train Loss: 0.0750 Acc: 0.9682
val Loss: 0.0265 Acc: 0.9908

Training complete in 0m 42s

DDP is much faster!

Implementing DDP was quite difficult, but using accelerate allows for an easy implementation of DDP. I highly recommend giving it a try.

Kaggle SETI 59th Solution

Tomoya Oda — Tue, 25 Jul 2023 20:25:37 +0000

This article is translated from my Japanese tech blog.
https://tmyoda.hatenablog.com/entry/20210819/1629384283

About the SETI Competition

https://www.kaggle.com/competitions/seti-breakthrough-listen

In this competition, you are given a spectrogram of a signal and must predict whether it contains an anomaly.
(The data used in this competition was artificially generated by a simulator.)

Pipeline

Augmentation

I didn't have enough time to investigate augmentation thoroughly. For now, I used the following four, plus mixup. I don't know which of them is actually effective...

vflip
shift_scale_rotate
motion_blur
spec_augment

I wanted to use SpecAug in albumentations, so I created a class as follows.

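import numpy as np
from albumentations.core.transforms_interface import ImageOnlyTransform


class SpecAugment(ImageOnlyTransform):
    def __init__(self, alpha=0.1, **kwargs):
        super(SpecAugment, self).__init__(**kwargs)
        self.spec_alpha = alpha  # max mask width as a fraction of each axis

    def apply(self, img, **params):
        x = img.copy()  # avoid modifying the input array in place
        # Zero out a random band of rows; max(1, ...) keeps randint's range valid
        t0 = np.random.randint(0, x.shape[0])
        delta = np.random.randint(0, max(1, int(x.shape[0] * self.spec_alpha)))
        x[t0:min(t0 + delta, x.shape[0])] = 0
        # Zero out a random band of columns in the same way
        t0 = np.random.randint(0, x.shape[1])
        delta = np.random.randint(0, max(1, int(x.shape[1] * self.spec_alpha)))
        x[:, t0:min(t0 + delta, x.shape[1])] = 0
        return x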

Test Time Augmentation (TTA)

Since there are four augmentations I applied this time, I decided to perform the TTA 16 times. The number 16 was chosen because I wanted to apply all the augmentations at least once for each image during the TTA.

For example, with n TTA runs and k = 4 types of augmentation, each applied independently with probability p = 0.5, the probability that every augmentation is applied at least once is (1 - (1 - p)^n)^k.

TTA: 16, Augmentation 4: (1 - 0.5^16)^4 ≈ 0.9999

TTA: 4, Augmentation 4: (1 - 0.5^4)^4 ≈ 0.7725
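
As a quick numerical check of this at-least-once probability, here is a minimal sketch. It assumes each augmentation fires independently with probability p in every TTA run; the function name and the intermediate value n = 8 are only for illustration.

```python
def prob_all_applied(n_tta: int, n_aug: int = 4, p: float = 0.5) -> float:
    """Probability that each of n_aug augmentations fires at least once in n_tta runs.

    Assumes every augmentation is applied independently with probability p per run.
    """
    # A single augmentation is never applied in n_tta runs with probability (1 - p) ** n_tta.
    p_at_least_once = 1.0 - (1.0 - p) ** n_tta
    # All n_aug augmentations must each be applied at least once.
    return p_at_least_once ** n_aug


for n in (4, 8, 16):
    print(f"TTA={n:2d}: {prob_all_applied(n):.4f}")
```

With only 4 runs there is a noticeable chance that some augmentation never fires, while at 16 runs that chance is negligible.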

Resizing Network

notebook
https://www.kaggle.com/swimmy/seti-learned-image-resizing
paper
https://arxiv.org/abs/2103.09950

This model gave my best score so far.
I believe it would be better to feed in the images without resizing, but my GPU does not have enough memory.
To feed in the images without resizing, I would need to reduce the batch size.

However, with imbalanced data like this (9:1), a small batch size often leads to batches that contain only one class.

So, I decided to train with the largest possible image size using this model.
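
To get a feel for how often a small batch ends up with only one class at a 9:1 ratio, here is a rough sketch. It assumes samples are drawn independently, and the batch sizes are illustrative, not the ones I used in the competition.

```python
def prob_single_class_batch(batch_size: int, minority_ratio: float = 0.1) -> float:
    """Probability that a randomly drawn batch contains only one class."""
    majority_ratio = 1.0 - minority_ratio
    # Either every sample is from the majority class, or every sample is from the minority class.
    return majority_ratio ** batch_size + minority_ratio ** batch_size


for bs in (2, 4, 8, 16, 32):
    print(f"batch_size={bs:2d}: {prob_single_class_batch(bs):.3f}")
```

The probability drops quickly as the batch size grows, which is why keeping the batch size reasonably large matters here.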

Training

During this competition, the dataset was reset once and completely replaced. So, I decided to use the previous data for pre-training. With this, the score increased slightly on both LB and CV.

Also, the pre-training used a hold-out split, and the fine-tuning used 4-fold CV.

Model

I ran into a problem where the model would not learn when I made it larger (probably due to a bad learning rate and scheduler), even though I tried various models (nfnet, volo, swin, ...).

So, I decided to use efficientnetv2_s and efficientnetv2_m, which scored well.

What I tried

AST: Audio Spectrogram Transformer (https://arxiv.org/pdf/2104.01778v3.pdf): No change
Weighted CE loss: No change
Temperature scaling (https://github.com/gpleiss/temperature_scaling): Slight increase in private
Dark magic trick (https://www.kaggle.com/c/seti-breakthrough-listen/discussion/238722): Score decreased
Not applying augmentation in the last few epochs: Score decreased
Applying mixup probabilistically instead of every time: No change
Adversarial validation: The distributions of train and test were too different, and no training samples were predicted to be test-like with high confidence
Pseudo Label: No change
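
Of the items above, temperature scaling is simple enough to sketch: the logits are divided by a scalar T before the softmax. This is a generic illustration, not my competition code. The logits and the value of T are made up; in practice T is fitted on a validation set, as in the linked repository.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (temperature > 1 softens the probabilities)."""
    scaled = [z / temperature for z in logits]
    # Subtract the max for numerical stability before exponentiating.
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5]  # made-up logits for a two-class problem
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=2.0))
```

The predicted class does not change; only the confidence is rescaled, which matters for metrics that depend on the probabilities.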

1st Place Solution

I was surprised by the first place solution.
I think the idea of removing the background can be used in other competitions dealing with spectrograms.

https://www.kaggle.com/c/seti-breakthrough-listen/discussion/266385

Kaggle Coleridge 52nd Solution

Tomoya Oda — Tue, 25 Jul 2023 19:01:23 +0000

This article is translated from my Japanese tech blog.
https://tmyoda.hatenablog.com/entry/20210628/1624883322

About the Coleridge Competition

https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data

This is a competition to predict the dataset names mentioned in academic papers. The only data provided is the text of the papers and the ground-truth (GT) labels.

How I Split the Dataset for Validation

In this competition, there are about 130 dataset names (targets) in the training set, but the test set includes dataset names that do not appear in the training phase.

Therefore, the data must be split so that no dataset name appears in both the training and validation sets. So, I implemented a BFS and split the data at an 8:2 ratio without any such overlap.
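
The idea of such a split can be sketched as follows. This is not my original code: the input format (a dict from paper id to a set of dataset names), the function name, and the greedy assignment at the end are assumptions for illustration. Papers that share a dataset name are connected, BFS collects each connected group, and whole groups are assigned to one side.

```python
from collections import defaultdict, deque


def split_by_label_groups(paper_labels, val_ratio=0.2):
    """Split paper ids into (train, val) so that no dataset name appears in both."""
    # Index: dataset name -> papers that mention it.
    label_to_papers = defaultdict(set)
    for paper, labels in paper_labels.items():
        for label in labels:
            label_to_papers[label].add(paper)

    # BFS over papers; two papers are neighbours if they share a dataset name.
    seen, components = set(), []
    for start in paper_labels:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            paper = queue.popleft()
            component.append(paper)
            for label in paper_labels[paper]:
                for neighbour in label_to_papers[label]:
                    if neighbour not in seen:
                        seen.add(neighbour)
                        queue.append(neighbour)
        components.append(component)

    # Greedily fill the validation set with the smallest groups first.
    target = val_ratio * len(paper_labels)
    train, val = [], []
    for component in sorted(components, key=len):
        if len(val) + len(component) <= target:
            val.extend(component)
        else:
            train.extend(component)
    return train, val


# Toy data with placeholder names.
papers = {
    "p1": {"dataset_a"},
    "p2": {"dataset_a", "dataset_b"},
    "p3": {"dataset_b"},
    "p4": {"dataset_c"},
    "p5": {"dataset_a"},
}
train_ids, val_ids = split_by_label_groups(papers)
print(sorted(train_ids), sorted(val_ids))
```

Because whole groups are moved together, the resulting ratio is only approximately 8:2.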

Pipeline

Classifier

This classifier worked better than I expected, and most of our team's top submissions included it.

It simply classifies whether a dataset name is present or not.

MLM

We reused the kernel below almost as-is.
https://www.kaggle.com/tungmphung/coleridge-predict-with-masked-dataset-modeling

Jaccard filter

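This is also reused from the kernel.

import re


def clean_text(txt):
    # Assumed stand-in for the kernel's helper: lowercase and keep alphanumerics only
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()


def jaccard(s1, s2):
    # Assumed stand-in for the kernel's helper: Jaccard similarity of the word sets
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def jaccard_filter(org_labels, threshold=0.75):
    assert isinstance(org_labels, list)

    filtered_labels = []
    for labels in org_labels:
        filtered = []

        # Shorter labels first; drop any label too similar to one already kept
        for label in sorted(labels, key=len):
            label = clean_text(label)
            if len(filtered) == 0 or all(jaccard(label, got_label)
                                         < threshold for got_label in filtered):
                filtered.append(label)

        filtered_labels.append('|'.join(filtered))

    return filtered_labels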

What I tried

Using DiceLoss and FocalLoss, which are good at handling imbalanced data: The score decreased
NER (Named Entity Recognition): It didn't seem to be effective
SciBERT: No change
Increasing the external dataset CSV: Extraneous strings were matched, and the score decreased
Switching BERT to Electra: The score decreased
Changing CONNECTION_TOKEN: The number of target documents increased, and the score decreased
Beam search with k-fold: It was hard for us to run because of time constraints

Spark on AWS Glue: Performance Tuning 5 (Using Cache)

Tomoya Oda — Sun, 16 Jul 2023 21:56:10 +0000

This is a continuation of my previous posts in this series.

Using Cache

Spark RDDs are re-computed each time an action is performed on them. You can avoid this by using cache() or persist(), which keep the RDD in memory.

Comparison between using cache and no cache

Please note that cache() and persist() are transformations, not actions, so they are evaluated lazily.

https://kb.databricks.com/scala/best-practice-cache-count-take#:~:text=Since%20cache()%20is%20a,RDD%20in%20a%20single%20action

Since cache() is a transformation, the caching operation takes place only when a Spark action (for example, count(), show(), take(), or write()) is also used on the same DataFrame, Dataset, or RDD in a single action.
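
The lazy behaviour can be illustrated without Spark. The following is a toy model in plain Python, not Spark's implementation; the class and attribute names are made up. It only shows that calling cache() by itself computes nothing, and that the next action both computes the data and fills the cache.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataset (not Spark's implementation)."""

    def __init__(self, compute):
        self._compute = compute        # function that produces the rows
        self._cache_requested = False
        self._cached_rows = None
        self.compute_calls = 0         # how many times the rows were actually computed

    def cache(self):
        # Lazy: only marks the data for caching, nothing is computed here.
        self._cache_requested = True
        return self

    def count(self):
        # An "action": triggers computation and fills the cache if one was requested.
        if self._cached_rows is not None:
            return len(self._cached_rows)
        self.compute_calls += 1
        rows = self._compute()
        if self._cache_requested:
            self._cached_rows = rows
        return len(rows)


frame = LazyFrame(lambda: list(range(1000)))
frame.count()   # first action: computed
frame.count()   # second action: computed again, nothing is cached
frame.cache()   # marks the frame for caching, still no computation
frame.count()   # computed once more, and the result is now cached
frame.count()   # served from the cache
print(frame.compute_calls)
```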

Let's try cache!

with timer('before cache'):
    part_df.select("backend_port").distinct().count()

part_df.cache()
part_df.count() # execute cache (cache is a transformation)
with timer('after cache'):
    part_df.select("backend_port").distinct().count()

[before cache] done in 4.5241 s
[after cache] done in 1.6293 s

It's faster with cache()!

Summary

RDDs are re-computed for each action, so caching makes them faster
Since cache() and persist() are evaluated lazily, an action such as count() is needed to actually populate the cache

Spark on AWS Glue: Performance Tuning 4 (Spark Join)

Tomoya Oda — Sun, 16 Jul 2023 21:49:52 +0000

This is a continuation of my previous posts in this series.

Spark Join

Apache Spark has a type of join called Broadcast Join, which avoids shuffle processing. This method is effective when one table is small and the other is large. Essentially, it distributes the small table to all worker nodes, allowing each node to perform the join. This experiment will test the effectiveness of Broadcast Join for speed optimization using a small dataframe (small enough to fit in the memory of each worker node) and a large dataframe.

https://sparkbyexamples.com/spark/broadcast-join-in-spark/?expand_article=1

With broadcast join, Spark broadcast the smaller DataFrame to all executors and the executor keeps this DataFrame in memory and the larger DataFrame is split and distributed across all executors so that Spark can perform a join without shuffling any data from the larger DataFrame as the data required for join colocated on every executor.
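
The mechanism described in the quote can be illustrated without Spark. The following toy sketch in plain Python is not Spark's implementation, and the data and function name are made up. Each partition of the large table is joined against a full in-memory copy of the small table, so rows of the large table never have to move between partitions. It assumes the join key is unique in the small table.

```python
def broadcast_left_join(large_partitions, small_rows, key):
    """Left-join every partition of a large table against a small table held in memory."""
    # "Broadcast": build one lookup table from the small side; every worker would get a copy.
    lookup = {row[key]: row for row in small_rows}

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is joined where it already lives, without any shuffle.
        joined = [{**lookup.get(row[key], {}), **row} for row in partition]
        joined_partitions.append(joined)
    return joined_partitions


# Made-up example data: a "large" table split into two partitions and a small lookup table.
large = [
    [{"request_port": 80, "bytes": 10}, {"request_port": 443, "bytes": 20}],
    [{"request_port": 80, "bytes": 30}, {"request_port": 8080, "bytes": 40}],
]
small = [{"request_port": 80, "random": 7.0}, {"request_port": 443, "random": 9.0}]

result = broadcast_left_join(large, small, key="request_port")
print(result[1])
```

Rows without a match (port 8080 here) are kept as they are, which corresponds to a left join.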

Broadcast Join

from pyspark.sql import functions as F

join_df = part_df.select(part_df['request_port']).distinct().withColumn("random", F.round(F.rand()*(10-5)+5,0))

with timer('broadcast join dataframe'):
    broadcast_df = part_df.join(join_df.hint('BROADCAST'), part_df.request_port == join_df.request_port, how='left')
    broadcast_df.count()

with timer('sortmerge join dataframe'):
    merge_df = part_df.join(join_df.hint('MERGE'), part_df.request_port == join_df.request_port, how='left')
    merge_df.count()

with timer('shuffle hash join dataframe'):
    shuffle_df = part_df.join(join_df.hint('SHUFFLE_HASH'), part_df.request_port == join_df.request_port, how='left')
    shuffle_df.count()

I used Join Hints to suggest the join strategy. You can find more about JOIN hints here.

https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html#:~:text=The%20join%20side%20with%20the,BROADCAST%20are%20BROADCASTJOIN%20and%20MAPJOIN%20.&text=Suggests%20that%20Spark%20use%20shuffle%20sort%20merge%20join

Although not shown here, I checked the physical plan using explain() to verify that the hints took effect.

[shuffle hash join dataframe] done in 23.2022 s
[broadcast join dataframe] done in 11.7729 s
[sortmerge join dataframe] done in 38.4018 s

The broadcast join is the fastest!

Summary

When performing a JOIN operation between a small df and a large df, the broadcast join is the fastest strategy.

Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)

Tomoya Oda — Sun, 16 Jul 2023 21:38:08 +0000

This is a continuation of my previous posts in this series.

Impact of Partition Quantity

We will compare the speeds of three different partition counts: 1, default (unspecified), and 300.
We will also compare with and without shuffling.

DataFrame Preparation

df = spark.read.format("parquet").load("s3://.../parquet-chunk-high/")
one_part_df = df.coalesce(1)
print(one_part_df.rdd.getNumPartitions())
one_part_df.count()

part_df = spark.read.format("parquet").load("s3://.../parquet-chunk-high/")
print(part_df.rdd.getNumPartitions())
part_df.count()

df = spark.read.format("parquet").load("s3://.../parquet-chunk-high/")
part_300_df = df.repartition(300)
print(part_300_df.rdd.getNumPartitions())
part_300_df.count()

1
94
300

By default (unspecified) 94 partitions were read.

Without shuffling

with timer('one part filter'):
    result = one_part_df.filter(one_part_df['request_processing_time'] < 0.0008).count()
    print(result)

with timer('part filter'):
    result = part_df.filter(part_df['request_processing_time'] < 0.0008).count()
    print(result)

with timer('part 300 filter'):
    result = part_300_df.filter(part_300_df['request_processing_time'] < 0.0008).count()
    print(result)

9
[one part filter] done in 45.5252 s
9
[part filter] done in 1.4579 s
9
[part 300 filter] done in 3.5410 s

94 partitions is the fastest.

With shuffling

from pyspark.sql import functions as F

with timer('one part shuffle'):
    result = one_part_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)

with timer('part shuffle'):
    result = part_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)

with timer('part 300 shuffle'):
    result = part_300_df.groupBy('elb_name').agg(F.mean('request_processing_time').alias("mean_time")).orderBy('mean_time').count()
    print(result)

9
[one part shuffle] done in 78.1068 s
9
[part shuffle] done in 2.6624 s
9
[part 300 shuffle] done in 12.2829 s

Again, 94 partitions is the fastest.

Summary

Basically, more partitions are better than too few (a single partition was by far the slowest).
However, too many partitions (300) was also slower than the default; a "just right" number of partitions is the most efficient.

Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)

Tomoya Oda — Sun, 16 Jul 2023 21:26:01 +0000

This is a continuation of my previous posts in this series.

Glue DynamicFrame vs Spark DataFrame

Let's compare them using the Parquet files which I created in part 1.

Data Read Speed Comparison

We will read a single large Parquet file and a highly partitioned Parquet file.

with timer('df'):
    dyf = glueContext.create_dynamic_frame.from_options(
        "s3",
        {
            "paths": [
                "s3://.../parquet-chunk-high/"
            ]
        },
        "parquet",
    )
    print(dyf.count())

with timer('df partition'): 
    dyf = glueContext.create_dynamic_frame.from_options(
        "s3",
        {
            "paths": [
                "s3://.../parquet-partition-high/"
            ]
        },
        "parquet",
    )
    print(dyf.count())

324917265
[df] done in 125.9965 s
324917265
[df partition] done in 55.9798 s

DynamicFrame is too slow...

Summary

For comparison, in part 1 (Reading Speed Comparison), spark.read took 31.9 s for the single large Parquet file and 32.6 s for the highly partitioned one, so DynamicFrame is quite slow.
Interestingly, with DynamicFrame, reading the partitioned data was faster than reading the single large Parquet file.

Spark on AWS Glue: Performance Tuning 1 (CSV vs Parquet)

Tomoya Oda — Sun, 16 Jul 2023 21:11:07 +0000

These are my posts for Spark on AWS Glue: Performance Tuning

Introduction

I recently started reading Learning Spark 2nd Edition from O'Reilly.¹

According to the book, performance seems to vary depending on how you use Spark. So, let's try a few things.

I will casually record the results of comparing execution speeds in my own environment.

Data Preparation

I am going to use some random ELB access logs that were available. The total size of the data is 94.5 GB.

Measurement Function

from contextlib import contextmanager
import time

@contextmanager
def timer(name):
    t0 = time.time()
    yield
    print(f'[{name}] done in {time.time() - t0:.4f} s')

CSV vs Parquet

Generally, columnar formats such as Parquet have two advantages: better data compression and faster reads, because only the necessary columns are retrieved.²

Parquet

When reading a Parquet file, Spark first references the metadata and obtains the positions of the blocks to be read. Each block carries statistical information such as the min/max values of that block.

For example, if you want data with a condition of value > 5.0, Spark can speed up the process by using each block's statistics to skip blocks that cannot contain matching rows.
This is called Predicate Pushdown.
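
The skipping logic can be sketched in a few lines of plain Python. This is a toy model, not how Parquet or Spark is actually implemented: the "blocks" are just dicts with made-up values, and the min/max entries stand in for the statistics stored in the file metadata.

```python
def scan_with_pushdown(row_groups, threshold):
    """Return values > threshold, skipping blocks whose max value rules them out."""
    matches, groups_read = [], 0
    for group in row_groups:
        # The statistics give the block's max without reading its values.
        if group["max"] <= threshold:
            continue  # no value in this block can satisfy value > threshold
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


# Made-up blocks with their min/max statistics.
row_groups = [
    {"min": 0.1, "max": 4.9, "values": [0.1, 2.5, 4.9]},
    {"min": 3.0, "max": 7.2, "values": [3.0, 5.5, 7.2]},
    {"min": 6.0, "max": 9.8, "values": [6.0, 8.1, 9.8]},
]
matches, groups_read = scan_with_pushdown(row_groups, threshold=5.0)
print(matches, groups_read)
```

Here the first block is never read, because its max (4.9) already shows that no row can satisfy value > 5.0.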

Reading Speed Comparison

Let's see if there is any difference in reading speed between CSV and Parquet.
We will also create partitioned data.

(I have added an hours column.)

# add hour column
from pyspark.sql.functions import hour
df = df.withColumn("hours", hour("request_timestamp"))

df.coalesce(1).write.mode('append').csv('s3://.../csv-chunk-high/')

df.write.mode('append')\
    .partitionBy('hours')\
    .csv('s3://.../csv-partition-high/')

df.coalesce(1).write.mode('append').parquet('s3://..../parquet-chunk-high/')

df.write.mode('append')\
    .partitionBy('hours')\
    .parquet('s3://.../parquet-partition-high/')

Reading the Data

with timer('csv'):
    df = spark.read.format("csv").load("s3://.../csv-chunk-high/")
    print(df.count())

with timer('csv partition'): 
    df = spark.read.format("csv").load("s3://.../csv-partition-high/")
    print(df.count())

with timer('parquet'):
    df = spark.read.format("parquet").load("s3://.../parquet-chunk-high/")
    print(df.count())

with timer('parquet partition'): 
    df = spark.read.format("parquet").load("s3://.../parquet-partition-high/")
    print(df.count())

324917265
[csv] done in 27.1925 s

324917265
[csv partition] done in 36.3690 s

324917265
[parquet] done in 31.8977 s

324917265
[parquet partition] done in 32.5805 s

There seems to be no big difference in read speed between Parquet and CSV.

Reading Part of the Data

The CSV files have no column names, so columns are referenced by position (for example, _c6).

with timer('csv'):
    df = spark.read.format("csv").load("s3://.../csv-chunk-high/")
    df = df.filter(df['_c6'] < 0.0008)
    print(df.count())

with timer('csv partition'): 
    df = spark.read.format("csv").load("s3://.../csv-partition-high/")
    df = df.filter(df['_c6'] < 0.0008)
    print(df.count())

with timer('parquet'):
    df = spark.read.format("parquet").load("s3://.../parquet-chunk-high/")
    df = df.filter(df['request_processing_time'] < 0.0008)
    print(df.count())

with timer('parquet partition'): 
    df = spark.read.format("parquet").load("s3://.../parquet-partition-high/")
    df = df.filter(df['request_processing_time'] < 0.0008)
    print(df.count())

119627151
[csv] done in 44.2805 s
119627151
[csv partition] done in 48.3934 s

119627151
[parquet] done in 32.7956 s
119627151
[parquet partition] done in 37.8519 s

Parquet is faster!

Data Size Comparison

Snappy.parquet is smaller in data size compared to CSV

aws s3 ls s3://.../csv-chunk-high/  --recursive --human --sum
   Total Size:  94.5 GB

aws s3 ls s3://.../parquet-chunk-high/  --recursive --human --sum
   Total Size:  11.7 GB

How much is CSV gzip?

As CSV is uncompressed, the above comparison would not be fair.
So let's compress CSV using gzip by adding the arguments as follows

df.coalesce(1).write.mode('append').csv('s3://.../csv-chunk-high-compress/', compression="gzip")

aws s3 ls s3://.../csv-chunk-high-compress/  --recursive --human --sum
   Total Size: 13.3 GiB

11.7 GB vs 13.3 GiB: Snappy.parquet is smaller than gzip-compressed CSV.
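
Note that the two figures are labelled with different units. A small sketch of the conversion, assuming the labels are literal (1 GB = 10^9 bytes, 1 GiB = 2^30 bytes); if both listings were in fact printed in the same unit, the comparison is simply 11.7 vs 13.3.

```python
GB = 10 ** 9    # decimal gigabyte
GIB = 2 ** 30   # binary gibibyte


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in GiB to GB."""
    return size_gib * GIB / GB


csv_gzip_gb = gib_to_gb(13.3)
print(f"CSV gzip: {csv_gzip_gb:.1f} GB")
```

Either way, the Parquet output is the smaller of the two.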

Summary

Reading speed for the entire data is no different between CSV and Parquet
Parquet reads faster when a filter is applied
Snappy.parquet has good compression efficiency

References

Configuring dind (docker in docker) with VSCode Remote Development

Tomoya Oda — Mon, 17 Apr 2023 00:30:59 +0000

It was as simple as adding a single line to the devcontainer.json configuration file.

devcontainer.json

    "features":{
        "ghcr.io/devcontainers/features/docker-in-docker:2": {}
    },

There are various features available for dev containers, such as AWS CLI and Python.
It seems that in future container development, there may no longer be any need for trial and error when creating Dockerfiles.

https://containers.dev/features