
Matias Kreder

Train DeepRacer models on EC2 Spot instances


DeepRacer is an AWS service that lets you create Machine Learning models that run on virtual or physical autonomous 1/18 scale racing cars.

You would typically log in to the AWS Management Console and set up training jobs to build those ML models. Be aware, however, that this training method has a free tier of just 10 hours of training; after that, it costs $3.50 for each additional hour.

Ten hours should be enough if you just want to play around with DeepRacer. However, if you would like to compete in any DeepRacer league, 10 hours likely won't be enough, and you should consider more cost-effective training options.
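To put the console pricing in perspective, here is a rough cost sketch. The Spot rate below is a made-up figure for illustration only, not a quote; check current pricing for your region.

```shell
# Rough cost comparison for 50 hours of training.
# The Spot rate is an assumption for illustration; check current pricing.
hours=50
console_rate=3.50   # DeepRacer console, per hour after the 10 free hours
spot_rate=0.30      # hypothetical g4dn.2xlarge Spot price, per hour

console_cost=$(awk -v h="$hours" -v r="$console_rate" 'BEGIN{printf "%.2f", (h-10)*r}')
spot_cost=$(awk -v h="$hours" -v r="$spot_rate" 'BEGIN{printf "%.2f", h*r}')
echo "Console: \$$console_cost (10 of the 50 hours are free)"   # $140.00
echo "Spot:    \$$spot_cost"                                    # $15.00
```

Even with generous assumptions, the gap widens quickly as training hours grow.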

There are a couple of options you should consider if you want to spend less on the training:

  • You can do the training locally on your computer: This may seem like the cheapest option, but you will most likely need a powerful GPU with at least 8 GB of VRAM. You could use a gaming GPU like the 3060 or the 1080 Ti, or pick up a used data center GPU like the M40 or K20 at an affordable price on eBay. However, a data center GPU requires some hardware hacks to keep your system running cool. You could also opt for CPU training, which usually slows down training significantly.
  • You can do the training on the Cloud: Running the training on AWS EC2 or any other cloud is an excellent option if you want to avoid investing in a GPU or dealing with hardware hacks.

Instance type and pricing model

Before starting, you must choose the appropriate instance type and pricing model for your DeepRacer training.

  • Instance type: While a GPU instance will complete your DeepRacer training faster, it is less affordable than a CPU instance.
  • Pricing model: You can run your workloads on On-Demand or Spot instances. While On-Demand will probably complete your training faster, Spot instances let you use AWS idle capacity at a lower price. If you go that route, be aware that Spot instances can be interrupted unexpectedly, and your system needs to be prepared for that.

For this particular guide, I'm going to use a g4dn.2xlarge GPU instance running with Spot pricing. To do that, I had to open a support ticket to request a quota increase to 8 vCPUs for Spot instances, a quota AWS introduced recently.
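If you want to check current Spot pricing before committing to an instance type, the AWS CLI can pull recent price history. A sketch (run with credentials configured for your region):

```shell
# Show the most recent Spot prices for g4dn.2xlarge across availability zones.
aws ec2 describe-spot-price-history \
  --instance-types g4dn.2xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice]' \
  --output table
```

Prices vary by region and availability zone, so it can be worth comparing a few before launching.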

Instance Creation

The first step is to log into the AWS Management Console and launch a new instance.

Select "Ubuntu 20.04" as the operating system. The tools we will use only work with this Linux distribution.

Operating System selection menu

According to the deepracer-for-cloud documentation, each Robomaker worker requires 2-4 vCPUs. A c5.4xlarge instance can run 3 workers plus SageMaker without a drop in performance. Using OpenGL images reduces the number of vCPUs required per worker.

Then, select the instance type. I selected "g4dn.2xlarge" because I plan to do GPU training. This instance type has 8 vCPUs, so I will try to run 4 workers on it. The GPU only does compute during a critical portion of the training called policy training, while the CPU is in use most of the time. For that reason, GPU instances offer minimal improvement in the number of workers you can run, but their training times are better than CPU instances.

Instance type selection menu
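The worker count above follows directly from the vCPU guidance. A quick sanity check, assuming the 2-vCPU lower bound per worker from the DRFC docs:

```shell
# 8 vCPUs on a g4dn.2xlarge, at roughly 2 vCPUs per Robomaker worker.
vcpus=8
vcpus_per_worker=2
echo $(( vcpus / vcpus_per_worker ))   # 4 workers
```

If your workers end up CPU-starved, drop back toward 3-4 vCPUs per worker.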

Create your instance with at least 50 GB of storage. That should be enough, as the instance storage is only used to save logs and Docker containers. The training data files are stored in AWS S3.

Storage capacity screen

Lastly, under "Advanced details," choose to request Spot instances and select an IAM role that allows the instance to access your S3 bucket, where training files will be uploaded.

Advanced details section
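The same launch can also be scripted with the AWS CLI. This is a hedged sketch rather than a tested launch command: the AMI ID, key pair, and instance profile name are placeholders you must replace with your own values.

```shell
# Launch a 50 GB Spot instance with an S3-capable instance profile.
# AMI_ID, KEY_NAME, and PROFILE_NAME are placeholders for your own values.
AMI_ID="ami-xxxxxxxxxxxxxxxxx"       # an Ubuntu 20.04 AMI for your region
KEY_NAME="my-key"
PROFILE_NAME="deepracer-s3-access"   # IAM instance profile with S3 access

aws ec2 run-instances \
  --image-id "$AMI_ID" \
  --instance-type g4dn.2xlarge \
  --key-name "$KEY_NAME" \
  --iam-instance-profile Name="$PROFILE_NAME" \
  --instance-market-options 'MarketType=spot' \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=50,VolumeType=gp3}'
```

Scripting the launch makes it easy to re-create the instance after a Spot interruption terminates it.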

DRFC Setup

The next step is to connect to our instance through SSH and set up DeepRacer For Cloud (DRFC). The steps are well documented in the DRFC documentation.

To summarize, you should basically:

  1. SSH into your new instance
  2. Clone the deepracer-for-cloud repo: git clone
  3. Launch the script: cd deepracer-for-cloud/bin/; ./
  4. Reboot the instance
  5. Launch the script: cd deepracer-for-cloud/bin/; ./ -c aws -a gpu
  6. Source the script to use the DRFC helper scripts: source; cd ..
  7. Update system.env: change the bucket names and set DR_WORKERS to the number of workers you plan to run. I used 4.
  8. Change the run.env configuration to match the track you are training on and any other parameters, such as the race type
  9. Update model_metadata.json and hyperparameters.json in the custom_files folder
  10. Upload your custom files and start the training: dr-upload-custom-files; dr-start-training
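Steps 7 and 8 can be scripted. The files below are minimal stand-ins so the edits can be demonstrated end to end; the bucket and track names are placeholders, and the real system.env/run.env that ship with deepracer-for-cloud contain many more variables.

```shell
# Minimal stand-ins for system.env and run.env; in practice you edit the
# files that ship with deepracer-for-cloud.
printf 'DR_LOCAL_S3_BUCKET=bucket\nDR_WORKERS=1\n' > system.env
printf 'DR_WORLD_NAME=reinvent_base\nDR_RACE_TYPE=TIME_TRIAL\n' > run.env

# Bucket and track names below are placeholders for your own values.
sed -i 's/^DR_WORKERS=.*/DR_WORKERS=4/' system.env
sed -i 's/^DR_LOCAL_S3_BUCKET=.*/DR_LOCAL_S3_BUCKET=my-deepracer-bucket/' system.env
sed -i 's/^DR_WORLD_NAME=.*/DR_WORLD_NAME=2022_april_open/' run.env

cat system.env run.env
```

Keeping these edits in a script means a freshly launched replacement instance can be configured identically in seconds.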

Spot configuration

Lastly, configure the instance to start DRFC on boot. This small script will clone the existing model and start the training from there at boot.

cat << EOF > /home/ubuntu/
source /home/ubuntu/deepracer-for-cloud/bin/
dr-increment-training -f
dr-start-training -q -w
EOF

chmod +x /home/ubuntu/

You also need to configure rc.local as root, to launch the start script you created in the previous step:

sudo su -

cat << EOF > /etc/rc.local
su - ubuntu -c /home/ubuntu/ > /tmp/dr-start.log
exit 0
EOF

chmod +x /etc/rc.local
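One heredoc subtlety worth knowing when writing these boot scripts: with an unquoted EOF delimiter, the shell expands $variables and $(commands) while writing the file. Quote the delimiter to write the script verbatim (demo-start.sh is just an illustrative name):

```shell
# With << 'EOF' (quoted), $(date) is written literally into the script,
# so it runs at boot time rather than at file-creation time.
cat << 'EOF' > demo-start.sh
#!/bin/bash
echo "started at $(date)" >> /tmp/dr-start-demo.log
EOF
chmod +x demo-start.sh
```

The scripts above happen to contain nothing the shell would expand, but quoting the delimiter is a safe habit.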


Terraform alternative

If you are familiar with Terraform, you can also use this Terraform method that Nalbam created.

Top comments (2)

Carlos Ortega González

Excellent post, Matias! Any chance you can offer some pointers on how to train a model locally on my PC? I know it might take some more time, but I'm not in a hurry. Thanks in advance.

Matias Kreder

@carlos, I recommend you go through this guide. It explains everything you will need to set up local training on your computer.