Kaggle Getting Started Competition -- Petals to the Metal

Tyler Hu — Tue, 19 May 2026 08:41:57 +0000

This project runs in a Kaggle notebook.

Competition link: https://www.kaggle.com/competitions/tpu-getting-started

First, let's look at the data. The competition provides 193 files: 192 tfrec files and one csv file. The csv file is a sample submission, provided for reference only.

data/
├── tfrecords-jpeg-192x192/
├── tfrecords-jpeg-224x224/
│   ├── test/
│   │   ├── 00-224x224-462.tfrec
│   │   ├── 01-224x224-462.tfrec
│   │   ├── ...
│   │   └── 15-224x224-452.tfrec
│   ├── train/
│   │   ├── 00-224x224-798.tfrec
│   │   ├── 01-224x224-798.tfrec
│   │   ├── ...
│   │   └── 15-224x224-783.tfrec
│   └── val/
│       ├── 00-224x224-232.tfrec
│       ├── 01-224x224-232.tfrec
│       ├── ...
│       └── 15-224x224-232.tfrec
├── tfrecords-jpeg-331x331/
├── tfrecords-jpeg-512x512/
└── sample_submission.csv

All of the data files are tfrec files. TFRecord is TensorFlow's binary record format, but this experiment uses PyTorch, so we need a data-processing step that turns the records in the tfrec files into tensors.

Let's take a look at the libraries needed for this experiment.

import io
import timm
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from pathlib import Path
from PIL import Image
import tfrecord
import glob

Reading Files

We need to know what the data looks like. A tfrec file is a binary file, so we can't view the images and labels directly. We need to parse it.

train_files = sorted(glob.glob("/kaggle/input/competitions/tpu-getting-started/tfrecords-jpeg-224x224/train/*.tfrec"))
print(len(train_files))   # output: 16
print(train_files[0])

We collect the paths of all training files in the tfrecords-jpeg-224x224 folder into the train_files list, then print its length to verify. It should be 16 because there are 16 files in the train folder: the training data is split into 16 shards. Each shard contains 798 samples except the last one, which contains 783.
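
To make "binary file" a bit more concrete, here is a minimal sketch that counts the records in a TFRecord stream by walking its framing, without any TensorFlow dependency. It relies on the record layout described in the TFRecord format documentation (length, CRC of the length, payload, CRC of the payload). The helper names count_records and make_fake_shard are my own, and the CRCs are skipped rather than verified, so treat it as an illustration and not a full reader.

```python
import io
import struct


def count_records(stream):
    """Count records in a TFRecord byte stream by walking its framing.

    Each record is laid out as:
      uint64 length | uint32 CRC of length | data[length] | uint32 CRC of data
    The CRCs are skipped here, not verified.
    """
    count = 0
    while True:
        header = stream.read(8)
        if len(header) < 8:          # end of file
            break
        (length,) = struct.unpack("<Q", header)
        stream.seek(4 + length + 4, io.SEEK_CUR)   # skip both CRCs and the payload
        count += 1
    return count


def make_fake_shard(payloads):
    """Build an in-memory stream with the same framing (CRCs zeroed, demo only)."""
    buf = io.BytesIO()
    for data in payloads:
        buf.write(struct.pack("<Q", len(data)))
        buf.write(b"\x00" * 4)
        buf.write(data)
        buf.write(b"\x00" * 4)
    buf.seek(0)
    return buf


demo = make_fake_shard([b"rose", b"tulip", b"daisy"])
print(count_records(demo))  # 3
```

On Kaggle, the same function could be pointed at a real shard opened with open(train_files[0], "rb"). If the number in the file name is indeed the record count, it should report 798 for the first file.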

reader = tfrecord.tfrecord_loader(
    data_path=train_files[0],
    index_path=None,
    description={}   # empty dict, reads out raw field name
)

for record in reader:
    print(record.keys())
    break

Next, let's go through the three parameters of the tfrecord_loader function:

data_path: the file to read. We only use the first shard, the 00-224x224-798.tfrec file, so we pass the first element of the train_files list.
index_path: TFRecord supports an index file that speeds up random access, but we don't have one, so we set it to None.
description: the fields to read and the type of each one. Because a tfrec file stores binary data, we tell the library which fields to read and which type to expect for each. Passing an empty { } is a handy first step: it reads every field, so we can see which fields exist and then declare them properly.

Printing the keys gives the following output:

dict_keys(['id', 'class', 'image'])

Now, let's formally parse a file:

description = {
    "image": "byte",
    "class": "int",
    "id": "byte",
}

reader = tfrecord.tfrecord_loader(
    data_path=train_files[0],
    index_path=None,
    description=description,
)

for record in reader:
    print(record["class"])
    print(type(record["image"]))
    print(len(record["image"]))
    break

Now that we know the field names, we can define the description dictionary properly. But what about the types? Simple: set everything to byte and see whether it throws an error. The error message tells you the right type. The output is as follows:

[57]
<class 'bytes'>
25512

Dataset

OK, now that we understand the data structure, we can define the dataset class.

class PetalsDataset(Dataset):
    def __init__(self, root_dir, type, transform):
        self.type = type  # 'train', 'val' or 'test'
        self.root_dir = root_dir
        self.type_path = Path(root_dir) / type
        self.files_list = sorted(glob.glob(f"{self.type_path}/*.tfrec"))
        self.transform = transform
        self.samples = []
        # The test split has no 'class' field, so leave it out of the description.
        if self.type != 'test':
            description = {"image": "byte", "class": "int", "id": "byte"}
        else:
            description = {"image": "byte", "id": "byte"}
        # Read every record of every shard into memory once.
        for path in self.files_list:
            reader = tfrecord.tfrecord_loader(path, None, description)
            for record in reader:
                self.samples.append(record)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        record: dict = self.samples[idx]
        # Decode the JPEG bytes into an RGB PIL image.
        img = Image.open(io.BytesIO(record['image'])).convert('RGB')
        if self.transform:
            img = self.transform(img)
        if self.type == 'test':
            # Return the id too: the submission loop unpacks (img_ids, inputs).
            # Assumes the id field holds UTF-8 text.
            img_id = bytes(record['id']).decode('utf-8')
            return img_id, img
        label = record['class'][0]
        return img, label

Let's look at the data-loading part. As mentioned before, the training data is split into 16 shards stored in 16 tfrec files. If every shard were full, the training set would have 16 x 798 = 12,768 samples, but the last shard has only 783, so the actual total is 15 x 798 + 783 = 12,753. The code uses two nested loops. The outer loop opens each tfrec file and assigns the parsed file to reader. The inner loop iterates over reader record by record and appends each sample to the samples list, so the length of samples should be 12,753.
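
The arithmetic above can be checked with a few lines of Python. This sketch assumes, as the text does, that the trailing number in each file name is the number of records in that shard. The helper count_from_name and the hand-built names list are illustrative and not part of the notebook.

```python
import re


def count_from_name(path):
    """Return the trailing number in names like '00-224x224-798.tfrec'."""
    match = re.search(r"-(\d+)\.tfrec$", path)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return int(match.group(1))


# Hypothetical list mirroring the train folder: 15 shards of 798 and one of 783.
names = [f"{i:02d}-224x224-798.tfrec" for i in range(15)] + ["15-224x224-783.tfrec"]
total = sum(count_from_name(n) for n in names)
print(total)  # 12753
```

In the notebook, the same sum over train_files could be compared with len(train_dataset) as a quick sanity check that no shard was skipped.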

I also define an input parameter, type, which specifies whether the dataset is the training, validation or test split. Note that the test split doesn't have a class field, so we leave that field out of the description.

Then we define the transforms that convert the images into tensors for the model input.

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(45),
    transforms.ColorJitter(0.4, 0.4, 0.4, hue=0.2),
    transforms.RandomGrayscale(p=0.1),


    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

Here is what each augmentation in train_transform does:

RandomResizedCrop: randomly crops a portion of the image and resizes it to 224x224. scale=(0.5, 1.0) means the cropped area is between 50% and 100% of the original image. (This encourages the model to recognize flowers from local features rather than relying on the full image.)
RandomHorizontalFlip: flips the image horizontally with 50% probability. (This helps the model become invariant to orientation.)
RandomVerticalFlip: flips the image vertically with 50% probability. (This helps the model become invariant to orientation.)
RandomRotation: rotates the image by a random angle between -45 and 45 degrees. (This helps the model become invariant to orientation.)
ColorJitter: randomly changes the image colors. The arguments are (brightness, contrast, saturation, hue). (This prevents the model from relying on fixed colors to identify flowers, since the same species can look different under various lighting conditions and shooting angles.)
RandomGrayscale: converts the image to grayscale with 10% probability. (This encourages the model to identify flowers not only by color, but also by shape and texture.)
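
The list covers the random augmentations. The last two steps of both pipelines, ToTensor and Normalize, are deterministic: ToTensor scales 8-bit pixel values to the range 0 to 1, and Normalize applies (value - mean) / std per channel. The sketch below redoes that arithmetic for a single pixel in plain Python. The helper names are mine, and it only mirrors the per-channel formula, not the real tensor code.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def to_unit_range(rgb_uint8):
    """Mimic what ToTensor does to 8-bit values: scale 0..255 down to 0..1."""
    return [v / 255.0 for v in rgb_uint8]


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (value - mean) / std per channel to one RGB pixel in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]


white = normalize_pixel(to_unit_range([255, 255, 255]))
mean_colour = normalize_pixel(IMAGENET_MEAN)
print(white)        # every channel ends up above 2
print(mean_colour)  # [0.0, 0.0, 0.0]
```

A pixel equal to the mean maps to zero, which is the point of the step: the pretrained weights used later expect inputs centered this way.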

Next, we just need to create the dataset and dataloader objects.

root_dir = '/kaggle/input/competitions/tpu-getting-started/tfrecords-jpeg-224x224'
train_dataset = PetalsDataset(root_dir, type='train', transform=train_transform)
val_dataset = PetalsDataset(root_dir, type='val', transform=val_transform)

train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=False)

Model

With the datasets and dataloaders in place, we can define the model. Let's start by building a simple CNN:

class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )

    def forward(self, x):
        return self.block(x)

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.feature = nn.Sequential(
            ConvBlock(3, 32),
            ConvBlock(32, 64),
            ConvBlock(64, 128)
        )

        self.classification = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 28 * 28, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, 104)
        )

    def forward(self, x):
        x = self.feature(x)
        x = self.classification(x)
        return x

model = CNN()

This model is based on the CNN code from Andrew Ng's PyTorch basics course.
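
The 128 * 28 * 28 in the first Linear layer comes from tracing a 224x224 input through the three ConvBlocks. Here is a small sketch of that calculation using the standard output-size formula for convolution and pooling layers. The function names are made up for this example.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output side length of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def cnn_feature_shape(side=224, channels=(32, 64, 128)):
    """Trace (channels, height, width) through the three ConvBlocks."""
    out_channels = 3
    for out_channels in channels:
        side = conv_out(side, kernel=3, padding=1)   # Conv2d with padding=1 keeps the size
        side = conv_out(side, kernel=2, stride=2)    # MaxPool2d(2) halves it
    return out_channels, side, side


c, h, w = cnn_feature_shape()
print((c, h, w), c * h * w)  # (128, 28, 28) 100352
```

If you switch to another image folder such as tfrecords-jpeg-192x192, calling cnn_feature_shape(side=192) gives the new flattened size for the Linear layer.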

Model Training

We need to define the device, loss function, optimizer and learning-rate scheduler for model training.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
model = model.to(device)

loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)
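
To get a feel for what CosineAnnealingLR does with T_max=20, here is a sketch of the closed-form cosine schedule in plain Python. It assumes the scheduler's default eta_min of 0 and that nothing else changes the learning rate. The function cosine_lr is my own helper, not a PyTorch API.

```python
import math


def cosine_lr(epoch, base_lr=1e-4, t_max=20, eta_min=0.0):
    """Closed form: eta_min + (base_lr - eta_min) * (1 + cos(pi * epoch / t_max)) / 2."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


for epoch in (0, 5, 10, 15, 20):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

The learning rate starts at 1e-4, is halved at epoch 10 and approaches 0 at epoch 20, so the last epochs make only small updates.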

Then we write the training loop.

def train_epoch(train_dataloader, model, device, loss_function, optimizer):
    model.train()
    running_loss = 0.0
    current = 0   # number of correct predictions so far
    total = 0     # number of samples seen so far
    log_every = 20
    num_samples = len(train_dataloader.dataset)

    for batch_idx, (data, target) in enumerate(train_dataloader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        outputs = model(data)
        loss = loss_function(outputs, target)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += target.size(0)
        current += predicted.eq(target).sum().item()

        # Report every log_every batches, so the average covers exactly that many.
        if (batch_idx + 1) % log_every == 0:
            avg_loss = running_loss / log_every
            accuracy = 100. * current / total
            print(f"[{total} / {num_samples}] "
                  f"LOSS : {avg_loss:.3f} | accuracy : {accuracy:.1f}%")
            running_loss = 0.0

Note that I print the average running loss every 20 batches. When working with other datasets, adjust this interval to the total number of batches in the dataloader. This train_dataloader contains 399 batches, so every 20 batches is a reasonable interval.
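
The figure of 399 batches follows from the 12,753 training samples and the batch size of 32. A quick check, assuming the DataLoader default of keeping the last partial batch:

```python
import math

num_samples = 12_753   # training samples counted earlier
batch_size = 32

# DataLoader keeps the last, smaller batch unless drop_last=True is passed.
num_batches = math.ceil(num_samples / batch_size)
last_batch = num_samples - (num_batches - 1) * batch_size

print(num_batches, last_batch)  # 399 17
```

So the final batch of each epoch holds only 17 images, which is worth remembering for any code that assumes every batch has 32 samples.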

Next, we define the evaluation loop.

def evaluate(model, test_loader, device):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, targets in test_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
    return 100. * correct / total

Now we can start training the model.

num_epochs = 20
best_accuracy = 0
for epoch in range(num_epochs):
    print(f'\nEpoch: {epoch+1}')
    train_epoch(train_dataloader, model, device, loss_function, optimizer)
    accuracy = evaluate(model, val_dataloader, device)
    scheduler.step()
    print(f'Validation Accuracy: {accuracy:.2f}%')

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        torch.save(model.state_dict(), '/kaggle/working/best_model.pth')
        print(f'  save the best model with acc={accuracy:.2f}%')

Running the code above saves the model with the highest validation accuracy. But the highest accuracy is only about 26%. I was quite dispirited, but after some research I realized this is normal: it is a simple model I threw together without any hyperparameter tuning, so it only serves as a baseline.

Enhancement

If we want higher accuracy, we need to fine-tune a pretrained model. The CNN we built contains only three convolutional layers, which is far from sufficient for complex image classification. Pretrained models are common in Kaggle competitions. Next, we use the convnext_base model. ConvNeXt is a modernized CNN architecture proposed by Meta.

model = timm.create_model('convnext_base', pretrained=True, num_classes=104, drop_rate=0.4, drop_path_rate=0.2)

A pretrained model has already learned many image features, so it reaches a high accuracy quickly. Here I reduce the number of epochs to 10.

Run the code again with the new model. The validation accuracy rises to about 91%, and the competition score reaches 0.90.

But now we have another problem: after training, the accuracy on the training set is very high, even reaching 100%, while the accuracy on the validation set is only about 90%. That is a gap of about 10 percentage points, a classic case of overfitting. Let's address it. I used a common and effective technique: Mixup.

from torchvision.transforms.v2 import MixUp

mixup = MixUp(alpha=0.2, num_classes=104)

The principle is to blend two images together like a semi-transparent overlay and to mix their labels in the same proportion. Without Mixup, the model tends to memorize individual images rather than learn general features.
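
Here is a toy sketch of that blending in plain Python, with lists standing in for images and labels. The mixing weight lam is fixed by hand so the numbers are easy to follow. As far as I understand, the real MixUp(alpha=0.2) draws it at random from a Beta(0.2, 0.2) distribution for each batch. The helper names are illustrative only.

```python
def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def mix_pair(x_a, y_a, x_b, y_b, lam):
    """Blend two samples: lam of the first plus (1 - lam) of the second."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


# Two toy "images" of three pixels each and a toy 4-class problem.
img_a, label_a = [1.0, 1.0, 1.0], one_hot(0, 4)
img_b, label_b = [0.0, 0.0, 0.0], one_hot(2, 4)

mixed_img, mixed_label = mix_pair(img_a, label_a, img_b, label_b, lam=0.75)
print(mixed_img)    # [0.75, 0.75, 0.75]
print(mixed_label)  # [0.75, 0.0, 0.25, 0.0]
```

The mixed label says "75% class 0, 25% class 2", so the model is rewarded for hedging in proportion to what is actually in the blended image.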

We also need to add the following code to the training loop:

def train_epoch(train_dataloader, model, device, loss_function, optimizer):
    model.train()
    running_loss = 0.0
    current = 0
    total = 0

    for batch_idx, (data, target) in enumerate(train_dataloader):
        data, target = data.to(device), target.to(device)
        data, target = mixup(data, target)      # <--- new: blend images and labels
        # ......
        total += target.size(0)
        current += predicted.eq(target.argmax(1)).sum().item()  # target is now soft, so compare with target.argmax(1)

This is because after Mixup, the target is no longer a batch of class indices (e.g. [48, 77, 0, ...]) but a probability distribution over classes for each sample (e.g. [[0, 0, ..., 1, ..., 0], ...]).
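
nn.CrossEntropyLoss accepts such probability targets directly, so the loss line in the loop does not need to change. The sketch below shows, in plain Python, the quantity that cross-entropy computes against a soft target, and why argmax recovers the dominant class for the accuracy count. The logits and the soft target are made-up numbers.

```python
import math


def log_softmax(logits):
    peak = max(logits)                       # subtract the max for numerical stability
    log_sum = math.log(sum(math.exp(z - peak) for z in logits))
    return [z - peak - log_sum for z in logits]


def soft_cross_entropy(logits, target_probs):
    """Cross-entropy against a probability distribution: -sum(p_i * log q_i)."""
    return -sum(p * lq for p, lq in zip(target_probs, log_softmax(logits)))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


logits = [2.0, 0.5, -1.0]
soft_target = [0.8, 0.0, 0.2]    # hypothetical mixed label

print(soft_cross_entropy(logits, soft_target))
print(argmax(logits) == argmax(soft_target))  # True
```

With a one-hot target the formula reduces to the usual negative log-probability of the true class, so hard labels are just a special case.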

Now the competition score reaches 0.96! It surpasses 95% and meets my expectations perfectly.

Submission

Finally, let's take a look at the code for submission.

import pandas as pd

all_ids = []
all_preds = []
# test_dataloader is expected to yield (ids, images) batches from the test split.
with torch.no_grad():   # inference only, no gradients needed
    for img_ids, inputs in test_dataloader:
        inputs = inputs.to(device)
        outputs = test_model(inputs)
        _, predicted = outputs.max(1)

        all_ids.extend(img_ids)
        all_preds.extend(predicted.cpu().numpy())

df = pd.DataFrame({
    'id': all_ids,
    'label': all_preds
})
df.to_csv('submission.csv', index=False)
print(df.head())
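
To make the file layout explicit, here is a dependency-free sketch that writes the same two columns, id and label, with Python's csv module. The ids and predictions are made up, and write_submission is a hypothetical helper, not something the competition provides.

```python
import csv
import io


def write_submission(ids, labels, stream):
    """Write id/label pairs in the two-column layout used by the code above."""
    if len(ids) != len(labels):
        raise ValueError("ids and labels must have the same length")
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["id", "label"])
    for image_id, label in zip(ids, labels):
        writer.writerow([image_id, int(label)])


# Made-up ids and predictions, for illustration only.
buffer = io.StringIO()
write_submission(["abc123", "def456"], [57, 3], buffer)
print(buffer.getvalue())
```

Comparing the header and the first few rows against sample_submission.csv before uploading is a cheap way to catch a swapped column or a missing id.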

If you want to use the saved model, use the following code:

# pretrained=False: the weights come from the checkpoint, so the ImageNet weights are not needed.
test_model = timm.create_model('convnext_base', pretrained=False, num_classes=104, drop_rate=0.4, drop_path_rate=0.2)
# map_location=device also works on a CPU-only session.
state_dict = torch.load('/kaggle/input/models/huuhgodona/convnext-model/pytorch/default/1/best_convnext.pth', map_location=device)
test_model.load_state_dict(state_dict)
test_model = test_model.to(device)
test_model.eval()

A pth file only stores the model's parameters, so we first need to build a model whose architecture matches the pth file and then load the parameters into it. The same applies to our own custom CNN.
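
The sketch below illustrates that matching requirement with a tiny stand-in network instead of ConvNeXt. The state dict is saved to an in-memory buffer, loaded into a model with the same architecture, and then into one with a different output size to show the failure. make_model is a hypothetical helper and the layer sizes are arbitrary.

```python
import io

import torch
import torch.nn as nn


def make_model(num_classes):
    """Tiny stand-in for the real network; only the layer shapes matter here."""
    return nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, num_classes))


trained = make_model(num_classes=104)
checkpoint = io.BytesIO()
torch.save(trained.state_dict(), checkpoint)   # parameters only, no architecture

# Same architecture: every key and shape matches, so loading succeeds.
checkpoint.seek(0)
restored = make_model(num_classes=104)
restored.load_state_dict(torch.load(checkpoint, map_location="cpu"))

# Different head size: the shapes no longer match and load_state_dict raises.
checkpoint.seek(0)
mismatched = make_model(num_classes=10)
try:
    mismatched.load_state_dict(torch.load(checkpoint, map_location="cpu"))
    load_failed = False
except RuntimeError as err:
    load_failed = True
    print("load failed:", str(err).splitlines()[0])
```

This is why the create_model call used for loading has to repeat the same model name and num_classes that were used for training.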

My first Post of Kaggle

Tyler Hu — Fri, 15 May 2026 05:39:07 +0000

Hi, I'm a data science undergraduate from China.🤗 I'm very happy to be here on DEV.TO to share my own thread.

I'm taking AI-related courses on DeepLearning.AI and I'm really looking forward to connecting with some study partners! As I said, I'm from China, so I'm still learning and improving my English. If my writing isn't natural enough, I hope you won't mind and will point out any mistakes I've made. I really appreciate the feedback.❤️

Also, if you're curious about anything regarding China, feel free to reach out.

Anyway, back to the point. I'll be publishing a series of posts covering Kaggle competitions on DEV to document my learning process. Stay tuned!
