Jonathan Harel for Fine • Originally published at fine.dev

Using AI Agents To Get Started Quickly On Kaggle

Discover how to overcome the repetitive setup process when working on Kaggle projects. This blog post guides you through building an AI agent to streamline PyTorch training pipelines. Learn to define steps, automate data preprocessing, and generate training loops effortlessly.

If you've spent time on Kaggle, you're likely familiar with the initial hurdles of setting up your first lines of code. The process often involves repetitive tasks, from loading and preprocessing the data to defining the model architecture, selecting loss functions, setting up optimizers, and running training loops.

While these steps are essential, they're largely boilerplate, demanding your attention each time you start a new project. This repetition not only consumes time but also diverts effort from the core creative aspects of your work: designing innovative models, exploring unique data insights, and pushing the boundaries of what AI can achieve.

Today, I'm excited to share my journey of creating an AI agent that helps me get started quickly on Kaggle competitions by building a training pipeline in PyTorch for me. If you're new to the world of deep learning and data science, don't worry – I'll guide you through the process step by step. Let's dive in and embark on this exciting adventure together!

Step 1: Gathering Resources

Before we start coding, let's gather the tools we'll need for our journey:

  • Python and PyTorch: Make sure you have Python installed (version 3.6 or higher). To install PyTorch, head over to the official website and choose the version compatible with your system (you can verify the install with the snippet after this list).
  • Kaggle Account: If you don't have one, create an account on Kaggle. It's an incredible platform filled with datasets and competitions that you can explore.
  • Fine Account: Similarly, create an account on Fine if you don't have one. Fine lets you build, deploy, and run agents quickly and easily.
  • Fine's CLI tool: Install the CLI tool that the agents need in order to operate. You can do so by running npm i @fine-dev/cli
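
Once the prerequisites are installed, a quick sanity check never hurts. Here's a minimal snippet to confirm PyTorch is importable (the CUDA line simply prints False on a CPU-only machine):

import torch

print(torch.__version__)          # e.g. 2.1.0
print(torch.cuda.is_available())  # True only if a GPU build is installed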

Step 2: Choose Your Dataset

For our first training pipeline, let's select a dataset from Kaggle. Choose something that interests you – whether it's images, text, or tabular data. Once you've found the perfect dataset, download it and unzip the files to a dedicated folder on your machine.

In this tutorial, I'll use the famous Titanic dataset.
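
Before wiring up any automation, it's worth peeking at the data. A couple of lines of pandas will show you what the agent will be working with (this assumes you saved the file as train.csv in your project directory):

import pandas as pd

# Load the Titanic training data and inspect its structure
df = pd.read_csv('train.csv')
print(df.columns.tolist())  # e.g. ['PassengerId', 'Survived', 'Pclass', ...]
print(df.head())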

Step 3: Building the Agent

It's time to introduce some automation magic into our process. Enter the agent – a trusty companion that will carry out our predefined tasks, sparing us the repetitive setup and execution steps.

In this step, I'll guide you through the process of building your agent using a workflow.yaml file. This file will serve as a roadmap for our agent, detailing the sequence of tasks it should perform. It's like providing your agent with a to-do list that it can follow diligently.

Let's get started:

Create a workflow File

In your project directory, create a new file named workflow.yaml. This is where we'll define the steps that our agent will execute.

Define Steps

Next, inside workflow.yaml, give your agent an id, a name, and an identity, followed by the agent's tasks as a list of steps. Each step gets a name, an id, the action to run, and its inputs under with:

id: csv-pytorch-basic
name: PyTorch Starter Agent
identity: |
  You are a senior data scientist, specializing in PyTorch and experienced in working with tabular data.
  You are proficient in data exploration and in setting up the right training loops.
steps:
  - name: Search csv
    id: find-file
    run: locate-files
    with:
      search: Search for a file named `*.csv` in the project files.
  - name: Read csv
    id: read-csv
    run: read-file
    with:
      file_path: ${{ steps.find-file.outputs.first }}
  - name: Transform csv
    id: transform
    run: transform-text
    with:
      input: ${{ steps.read-csv.outputs.content }}
      instructions: keep only the first row
  - name: Write Training Loop
    run: write-code
    with:
      instructions: |
        Given a csv located at ${{ steps.find-file.outputs.first }}, with the following columns:
        ${{ steps.transform.outputs.content }}
        Create a .py file containing a training loop in PyTorch, that predicts the most suitable column in the CSV.
        Make sure to follow best practices of Python and of data science.

Our workflow.yaml sets up an agent that locates a .csv file, reads its content, extracts the header row (so the agent knows the column names), and then generates a PyTorch training loop that predicts the most suitable column in the CSV, following Python and data science best practices throughout.

Step 4: Deploying the Agent

Now that we have our agent’s workflow defined, it’s time to deploy it to Fine. To do so, follow these steps:

Login to Fine

First, we need to connect our local environment to our Fine account. Run the command:

$ fine-dev login

Deploy the agent

Run the following commands from your project’s directory:

$ fine-dev deploy -p workflow.yaml

Step 5: Running the Agent

After building and deploying our agent, it's time to take it for a spin! To do that, we first need two things in place:

Download the dataset of choice

I will be using the Titanic dataset, so I downloaded the train.csv file and put it in my project's directory.

Run the proxy

To allow the agent to operate, we need to set up the proxy. Run the following command from your project's directory:

$ fine-dev proxy

Run the Agent

From the Fine web interface, open a notebook and press Ctrl+P to open the agent palette. If everything went well, you should find our PyTorch Starter Agent waiting in the list. Select it, and the agent will start running!

Step 6: Validate the Results

Our agent has finished running; let's see what it created for us:

In the file src/main.py we can find the following code:

# Importing necessary libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import Dataset, DataLoader


# Define the PyTorch Dataset
class TabularDataset(Dataset):
    def __init__(self, data, cat_cols=None, output_col=None):
        self.n = data.shape[0]
        if output_col:
            self.y = data[output_col].astype(np.float32).values.reshape(-1, 1)
        else:
            self.y = np.zeros((self.n, 1), dtype=np.float32)
        self.cat_cols = cat_cols if cat_cols else []
        self.cont_cols = [col for col in data.columns
                          if col not in self.cat_cols + [output_col]]
        if self.cont_cols:
            self.cont_X = data[self.cont_cols].astype(np.float32).values
        else:
            self.cont_X = np.zeros((self.n, 1), dtype=np.float32)
        if self.cat_cols:
            self.cat_X = data[cat_cols].astype(np.int64).values
        else:
            self.cat_X = np.zeros((self.n, 1), dtype=np.int64)

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return [self.y[idx], self.cont_X[idx], self.cat_X[idx]]


# Define the Neural Network
class FeedForwardNN(nn.Module):
    def __init__(self, emb_dims, no_of_cont, lin_layer_sizes,
                 output_size, emb_dropout, lin_layer_dropouts):
        super(FeedForwardNN, self).__init__()
        # Embedding layers, one per categorical column
        self.emb_layers = nn.ModuleList([nn.Embedding(x, y) for x, y in emb_dims])
        no_of_embs = sum([y for x, y in emb_dims])
        self.no_of_embs = no_of_embs
        self.no_of_cont = no_of_cont
        # Linear Layers
        first_lin_layer = nn.Linear(self.no_of_embs + self.no_of_cont, lin_layer_sizes[0])
        self.lin_layers = nn.ModuleList(
            [first_lin_layer] + [nn.Linear(lin_layer_sizes[i], lin_layer_sizes[i + 1])
                                 for i in range(len(lin_layer_sizes) - 1)])
        for lin_layer in self.lin_layers:
            nn.init.kaiming_normal_(lin_layer.weight.data)
        # Output Layer
        self.output_layer = nn.Linear(lin_layer_sizes[-1], output_size)
        nn.init.kaiming_normal_(self.output_layer.weight.data)
        # Dropout Layers (one rate per linear layer)
        self.first_emb_dropout_layer = nn.Dropout(emb_dropout)
        self.dropout_layers = nn.ModuleList([nn.Dropout(size) for size in lin_layer_dropouts])

    def forward(self, cont_data, cat_data):
        if self.no_of_embs != 0:
            # Look up an embedding for each categorical column and concatenate
            x = [emb_layer(cat_data[:, i]) for i, emb_layer in enumerate(self.emb_layers)]
            x = torch.cat(x, 1)
            x = self.first_emb_dropout_layer(x)
        if self.no_of_cont != 0:
            if self.no_of_embs != 0:
                x = torch.cat([x, cont_data], 1)
            else:
                x = cont_data
        for lin_layer, dropout_layer in zip(self.lin_layers, self.dropout_layers):
            x = F.relu(lin_layer(x))
            x = dropout_layer(x)
        x = self.output_layer(x)
        return x


def preprocess_data(data_path, cont_cols, cat_cols, output_col):
    # Load the data and keep only the columns we use
    data = pd.read_csv(data_path)
    data = data[cont_cols + cat_cols + [output_col]].copy()
    # Fill missing continuous values and integer-encode the categoricals
    data[cont_cols] = data[cont_cols].fillna(data[cont_cols].median())
    for col in cat_cols:
        data[col] = data[col].astype('category').cat.codes
    # Split the data into train and test sets
    train, test = train_test_split(data, test_size=0.2, random_state=42)
    train, test = train.copy(), test.copy()
    # Standardize the continuous features
    scaler = StandardScaler()
    train[cont_cols] = scaler.fit_transform(train[cont_cols])
    test[cont_cols] = scaler.transform(test[cont_cols])
    # Convert the data into PyTorch Datasets
    train_dataset = TabularDataset(data=train, cat_cols=cat_cols, output_col=output_col)
    test_dataset = TabularDataset(data=test, cat_cols=cat_cols, output_col=output_col)
    return train_dataset, test_dataset


def train_model(train_dataset, model, criterion, optimizer, batch_size, epochs):
    # Create a DataLoader
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    # Training loop
    for epoch in range(epochs):
        for y, cont_x, cat_x in train_dataloader:
            # Zero the gradients
            optimizer.zero_grad()
            # Forward pass
            preds = model(cont_x, cat_x)
            loss = criterion(preds, y)
            # Backward pass and optimization
            loss.backward()
            optimizer.step()
        print(f'Epoch {epoch + 1}/{epochs} - loss: {loss.item():.4f}')


def save_model(model, model_path):
    # Save the model
    torch.save(model.state_dict(), model_path)


def main():
    # Define the data path and model path
    data_path = 'train.csv'
    model_path = 'model.pth'
    # Define the feature columns (must match the CSV)
    cont_cols = ['Age', 'SibSp', 'Parch', 'Fare']
    cat_cols = ['Pclass', 'Sex']
    output_col = 'Survived'
    # Define the model parameters
    emb_dims = [(10, 50), (5, 25)]  # (cardinality upper bound, embedding size) per categorical column
    no_of_cont = len(cont_cols)
    lin_layer_sizes = [50, 100, 50]
    output_size = 1
    emb_dropout = 0.04
    lin_layer_dropouts = [0.001, 0.01, 0.01]  # one dropout rate per linear layer
    # Initialize the model
    model = FeedForwardNN(emb_dims, no_of_cont, lin_layer_sizes, output_size,
                          emb_dropout, lin_layer_dropouts)
    # Define the training parameters
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.1)
    batch_size = 64
    epochs = 10
    # Preprocess the data
    train_dataset, test_dataset = preprocess_data(data_path, cont_cols, cat_cols, output_col)
    # Train the model
    train_model(train_dataset, model, criterion, optimizer, batch_size, epochs)
    # Save the model
    save_model(model, model_path)


if __name__ == '__main__':
    main()

Nice! That's a great starter for our Kaggle competition.
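
One thing the generated script doesn't do yet is use the held-out split: preprocess_data returns a test_dataset that main() never touches. As a natural next step, here's a minimal evaluation sketch (the evaluate_model helper is my own addition, not agent output; it assumes the model, criterion, and test_dataset from main are in scope):

def evaluate_model(test_dataset, model, criterion, batch_size=64):
    # Measure average loss on the held-out split, with gradients disabled
    test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for y, cont_x, cat_x in test_dataloader:
            preds = model(cont_x, cat_x)
            total_loss += criterion(preds, y).item()
            n_batches += 1
    print(f'Test loss: {total_loss / n_batches:.4f}')

Calling evaluate_model(test_dataset, model, criterion) at the end of main gives you a first read on how well the starter model generalizes.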

Step 7: Celebrate Your Achievement!

Congratulations! You've successfully set up your first training pipeline in PyTorch using data from Kaggle. This is just the beginning of your journey into the exciting world of AI and machine learning.

Remember, every great accomplishment starts with a single step. Keep learning, experimenting, and don't be afraid to ask questions. Happy coding and may your AI adventures be filled with curiosity and discovery! 🚀🧠
