Discover how to overcome the repetitive setup process when working on Kaggle projects. This blog post guides you through building an AI agent to streamline PyTorch training pipelines. Learn to define steps, automate data preprocessing, and generate training loops effortlessly.
If you've spent time on Kaggle, you're likely familiar with the initial hurdles of setting up your first lines of code. The process often involves repetitive tasks, from loading and preprocessing the data to defining the model architecture, selecting loss functions, setting up optimizers, and running training loops.
While these steps are essential, they can be quite boilerplate in nature, demanding your attention each time you embark on a new project. This repetitive process not only consumes time but also requires effort that could otherwise be directed toward the core creative aspects of your project – designing innovative models, exploring unique data insights, and pushing the boundaries of what AI can achieve.
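To make that concrete, here is the kind of setup skeleton that gets rewritten for almost every tabular project – a minimal sketch with dummy tensors standing in for a real dataset:

```python
# The usual boilerplate: data, model, loss, optimizer, training loop.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Dummy tensors standing in for "load and preprocess the dataset"
X, y = torch.randn(256, 10), torch.randn(256, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# Placeholder architecture, loss function, and optimizer
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# The training loop itself
for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```

None of this is hard, but typing it out for the twentieth time is exactly the kind of work we'd rather hand off.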
Today, I'm excited to share my journey of creating an AI agent that helps me get started quickly on Kaggle competitions by building a training pipeline in PyTorch for me. If you're new to the world of deep learning and data science, don't worry – I'll guide you through the process step by step. Let's dive in and embark on this exciting adventure together!
Step 1: Gathering Resources
Before we start coding, let's gather the tools we'll need for our journey:
- Python and PyTorch: Make sure you have Python installed (version 3.8 or higher, which recent PyTorch releases require). To install PyTorch, head over to the official website and choose the version compatible with your system. A quick sanity check follows this list.
- Kaggle Account: If you don't have one, create an account on Kaggle. It's an incredible platform filled with datasets and competitions that you can explore.
- Fine Account: Similarly, create an account on Fine if you don't have one. Fine lets you build, deploy, and run agents quickly and easily.
- Fine's CLI tool: Install the CLI tool that will allow the agents to operate. You can do it by running `npm i @fine-dev/cli`.
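Once Python and PyTorch are in place, a two-line check confirms the install (and whether a GPU is visible):

```python
# Sanity check: confirm PyTorch is importable and whether CUDA is available.
import torch

print(torch.__version__)          # e.g. "2.1.0"
print(torch.cuda.is_available())  # False on CPU-only machines, which is fine for this tutorial
```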
Step 2: Choose Your Dataset
For our first training pipeline, let's select a dataset from Kaggle. Choose something that interests you – whether it's images, text, or tabular data. Once you've found the perfect dataset, download it and unzip the files to a dedicated folder on your machine.
In this tutorial, I will use the famous Titanic dataset.
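Before handing the file to an agent, it's worth a quick look at what's inside. For the Titanic data (assuming the training split is saved as train.csv), something like this shows the shape and column types:

```python
# Take a quick look at the Titanic training data before automating anything.
import pandas as pd

df = pd.read_csv("train.csv")
print(df.shape)    # (891, 12) for the standard Titanic training split
print(df.dtypes)   # which columns are numeric vs. strings/categories
print(df.head())
```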
Step 3: Building the Agent
It's time to introduce some automation magic into our process. Enter the agent – a trusty companion that will carry out our predefined tasks, sparing us the repetitive setup and execution steps.
In this step, I'll guide you through the process of building your agent using a `workflow.yaml` file. This file will serve as a roadmap for our agent, detailing the sequence of tasks it should perform. It's like providing your agent with a to-do list that it can follow diligently.
Let's get started:
Create a Workflow File
In your project directory, create a new file named `workflow.yaml`. This is where we'll define the steps that our agent will execute.
Define Steps
Next, inside `workflow.yaml`, give your agent an id, a name, and an identity, followed by the agent's tasks as a list of steps. Each step has a name, an id, the action to run, and that action's inputs (under with):
```yaml
id: csv-pytorch-basic
name: PyTorch Starter Agent
identity: |
  You are a senior data scientist, specializing in PyTorch and experienced in working with tabular data.
  You are proficient in data exploration and in setting up the right training loops.
steps:
  - name: Search csv
    id: find-file
    run: locate-files
    with:
      search: Search for a file named `*.csv` in the project files.
  - name: Read csv
    id: read-csv
    run: read-file
    with:
      file_path: ${{ steps.find-file.outputs.first }}
  - name: Transform csv
    id: transform
    run: transform-text
    with:
      input: ${{ steps.read-csv.outputs.content }}
      instructions: keep only the first row
  - name: Write Training Loop
    run: write-code
    with:
      instructions: |
        Given a csv located at ${{ steps.find-file.outputs.first }}, with the following columns:
        ${{ steps.transform.outputs.content }}
        Create a .py file containing a training loop in PyTorch, that predicts the most suitable column in the CSV.
        Make sure to follow best practices of Python and of data science.
```
Our `workflow.yaml` sets up an agent that locates a .csv file, reads its content, keeps only the first row (the header, which lists the column names), and then generates a PyTorch training loop that predicts the most suitable column in the CSV, following Python and data science best practices throughout the process.
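The transform step is the one that might look odd at first: keeping only the first row of a CSV means keeping the header, so the code-writing step receives the column names rather than the whole file. In plain Python it's roughly equivalent to the following (shown only to illustrate what the step produces – the agent handles this for you):

```python
# Rough Python equivalent of the "keep only the first row" transform step:
# grab just the header line so the code-writing step knows the column names.
with open("train.csv") as f:
    header = f.readline().strip()

print(header)
# For the Titanic data this prints:
# PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
```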
Step 4: Deploying the Agent
Now that we have our agent’s workflow defined, it’s time to deploy it to Fine. To do so, follow these steps:
Login to Fine
First, we need to connect our local environment to our Fine account. Run the command:
$ fine-dev login
Deploy the agent
Run the following commands from your project’s directory:
$ fine-dev deploy -p workflow.yaml
Step 5: Running the Agent
After building and deploying our agent, it's time to take it for a spin! Before we can run it, we need to take care of two things:
Download the dataset of choice
I will be using the Titanic dataset, so I downloaded the train.csv file and put it in my project's directory.
Run the proxy
To allow the agent to operate, we need to set up the proxy. Run the following command from your project's directory:
$ fine-dev proxy
Run the Agent
From the Fine web interface, open a notebook and hit Ctrl+P to open the agent palette. If everything went well, you should find our PyTorch Starter Agent waiting for you in the list. Select it, and our agent will start running!
Step 6: Validate the Results
Our agent has finished running, so let's see what it created for us:
In the file `src/main.py`, we can find the following code:
```python
# Importing necessary libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import Dataset, DataLoader

# Column roles for the Titanic data (adjust these for your own dataset)
output_col = 'Survived'
cat_cols = ['Pclass', 'Sex']
cont_cols = ['Age', 'SibSp', 'Parch', 'Fare']


# Define the PyTorch Dataset
class TabularDataset(Dataset):
    def __init__(self, data, cat_cols=None, output_col=None):
        self.n = data.shape[0]
        if output_col:
            self.y = data[output_col].astype(np.float32).values.reshape(-1, 1)
        else:
            self.y = np.zeros((self.n, 1))
        self.cat_cols = cat_cols if cat_cols else []
        self.cont_cols = [col for col in data.columns
                          if col not in self.cat_cols + [output_col]]
        if self.cont_cols:
            self.cont_X = data[self.cont_cols].astype(np.float32).values
        else:
            self.cont_X = np.zeros((self.n, 1))
        if self.cat_cols:
            self.cat_X = data[cat_cols].astype(np.int64).values
        else:
            self.cat_X = np.zeros((self.n, 1))

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return [self.y[idx], self.cont_X[idx], self.cat_X[idx]]


# Define the Neural Network
class FeedForwardNN(nn.Module):
    def __init__(self, emb_dims, no_of_cont, lin_layer_sizes,
                 output_size, emb_dropout, lin_layer_dropouts):
        super(FeedForwardNN, self).__init__()
        # Embedding layers
        self.emb_layers = nn.ModuleList([nn.Embedding(x, y) for x, y in emb_dims])
        no_of_embs = sum([y for x, y in emb_dims])
        self.no_of_embs = no_of_embs
        self.no_of_cont = no_of_cont
        # Linear layers
        first_lin_layer = nn.Linear(self.no_of_embs + self.no_of_cont, lin_layer_sizes[0])
        self.lin_layers = nn.ModuleList(
            [first_lin_layer] + [nn.Linear(lin_layer_sizes[i], lin_layer_sizes[i + 1])
                                 for i in range(len(lin_layer_sizes) - 1)])
        for lin_layer in self.lin_layers:
            nn.init.kaiming_normal_(lin_layer.weight.data)
        # Output layer
        self.output_layer = nn.Linear(lin_layer_sizes[-1], output_size)
        nn.init.kaiming_normal_(self.output_layer.weight.data)
        # Dropout layers
        self.first_emb_dropout_layer = nn.Dropout(emb_dropout)
        self.dropout_layers = nn.ModuleList([nn.Dropout(size) for size in lin_layer_dropouts])

    def forward(self, cont_data, cat_data):
        if self.no_of_embs != 0:
            x = [emb_layer(cat_data[:, i]) for i, emb_layer in enumerate(self.emb_layers)]
            x = torch.cat(x, 1)
            x = self.first_emb_dropout_layer(x)
        if self.no_of_cont != 0:
            if self.no_of_embs != 0:
                x = torch.cat([x, cont_data], 1)
            else:
                x = cont_data
        for lin_layer, dropout_layer in zip(self.lin_layers, self.dropout_layers):
            x = F.relu(lin_layer(x))
            x = dropout_layer(x)
        x = self.output_layer(x)
        return x


def preprocess_data(data_path):
    # Load the data
    data = pd.read_csv(data_path)
    # Preprocess the data: keep only the modelled columns, encode Sex, fill missing values
    data = data[cat_cols + cont_cols + [output_col]].copy()
    data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
    data[cont_cols] = data[cont_cols].fillna(data[cont_cols].median())
    # Split the data into train and test sets
    train, test = train_test_split(data, test_size=0.2, random_state=42)
    # Standardize the continuous features
    scaler = StandardScaler()
    train[cont_cols] = scaler.fit_transform(train[cont_cols])
    test[cont_cols] = scaler.transform(test[cont_cols])
    # Convert the data into PyTorch Datasets
    train_dataset = TabularDataset(data=train, cat_cols=cat_cols, output_col=output_col)
    test_dataset = TabularDataset(data=test, cat_cols=cat_cols, output_col=output_col)
    return train_dataset, test_dataset


def train_model(train_dataset, model, criterion, optimizer, batch_size, epochs):
    # Create a DataLoader
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    # Training loop
    for epoch in range(epochs):
        for y, cont_x, cat_x in train_dataloader:
            # Zero the gradients
            optimizer.zero_grad()
            # Forward pass
            preds = model(cont_x, cat_x)
            loss = criterion(preds, y)
            # Backward pass and optimization
            loss.backward()
            optimizer.step()


def save_model(model, model_path):
    # Save the model
    torch.save(model.state_dict(), model_path)


def main():
    # Define the data path and model path
    data_path = 'train.csv'
    model_path = 'model.pth'
    # Define the model parameters
    emb_dims = [(10, 50), (5, 25)]            # one (cardinality, embedding size) pair per categorical column
    no_of_cont = len(cont_cols)
    lin_layer_sizes = [50, 100, 50]
    output_size = 1
    emb_dropout = 0.04
    lin_layer_dropouts = [0.001, 0.01, 0.01]  # one dropout rate per hidden linear layer
    # Initialize the model
    model = FeedForwardNN(emb_dims, no_of_cont, lin_layer_sizes, output_size,
                          emb_dropout, lin_layer_dropouts)
    # Define the training parameters
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.1)  # note: 0.1 is aggressive for Adam; 1e-3 is a more typical start
    batch_size = 64
    epochs = 10
    # Preprocess the data
    train_dataset, test_dataset = preprocess_data(data_path)
    # Train the model
    train_model(train_dataset, model, criterion, optimizer, batch_size, epochs)
    # Save the model
    save_model(model, model_path)


if __name__ == '__main__':
    main()
```
Nice! That's a great starter for our Kaggle competition.
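There is plenty of room to build on it, too. The pipeline trains and saves a model but never checks it against the held-out split, so an obvious next step is an evaluation loop. Here's a minimal sketch, assuming the TabularDataset, model, and criterion from the generated src/main.py are in scope:

```python
# Minimal evaluation sketch for the generated pipeline: average loss on the test split.
import torch
from torch.utils.data import DataLoader

def evaluate(test_dataset, model, criterion, batch_size=64):
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for y, cont_x, cat_x in test_loader:
            preds = model(cont_x, cat_x)
            total_loss += criterion(preds, y).item()
            n_batches += 1
    return total_loss / max(n_batches, 1)
```

Since the Titanic target (Survived) is binary, you would also likely swap nn.MSELoss for nn.BCEWithLogitsLoss and report accuracy, but that's exactly the kind of tweak that becomes easy once the boilerplate is already in place.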
Step 7: Celebrate Your Achievement!
Congratulations! You've successfully set up your first training pipeline in PyTorch using data from Kaggle. This is just the beginning of your journey into the exciting world of AI and machine learning.
Remember, every great accomplishment starts with a single step. Keep learning, experimenting, and don't be afraid to ask questions. Happy coding and may your AI adventures be filled with curiosity and discovery! 🚀🧠