Darren Broderick (DBro)

2024 : AWS DeepRacer Local Training On DRFC


Training Locally on DRFC -> Troubleshooting

Handy Commands
dr-upload-model -b -f -i
dr-upload-model -b -f -I "name of model"
If you run low on disk space, run "docker system prune"

So you’re training locally on DRFC for AWS DeepRacer, great!

But it’s not always straightforward. It’s easy to forget how to run or update your stack after the initial setup, and you can hit new errors when you start training again, especially after a season break.

This article can be used as a supplement to the main DRFC guide.
https://aws-deepracer-community.github.io/deepracer-for-cloud

It’s a list of the commands and steps I follow, plus the problems I’ve faced when training locally and how I solved them.

Hopefully it can help you too, but bear in mind it is tailored to how I run things.

Contents

  1. Handy monthly items
  2. General Training Starting Steps
  3. Virtual DRFC Upload
  4. Physical DRFC Upload
  5. Container Update Links
  6. Open GL Robomaker
  7. New Sagemaker -> M40 Tagging
  8. Log Analysis
  9. Run Second DRFC Instance
  10. Steps for fresh DRFC
  11. Troubleshooting DRFC (List of issues & solutions)
  12. Miscellaneous

Handy monthly items
Latest Robomaker Container (For Training)
https://hub.docker.com/r/awsdeepracercommunity/deepracer-robomaker/tags?page=1&ordering=last_updated

All Track Files & Details (For DR_WORLD_NAME & Log Analysis)
https://github.com/aws-deepracer-community/deepracer-race-data/tree/main/raw_data/tracks

Commands

  1. docker ps -a
  2. docker images
  3. docker service ls

General Training Starting Steps
These are commands I run if starting from a reboot

source bin/activate.sh
sudo liquidctl set fan1 speed 30
(This is my own fan setting)
dr-increment-training -f
dr-update OR dr-update-env (I tend to favour -env)
dr-start-training OR dr-start-training -w
dr-start-viewer OR dr-update-viewer
http://127.0.0.1:8100 OR http://localhost:8100
dr-logs-robomaker (use dr-logs-robomaker -n2 for worker 2, etc.)
dr-logs-sagemaker
nvidia-smi (check temperatures)
htop to check threads and memory usage
(I try to maximise my worker count, but keep usage below 75%)
dr-start-evaluation -c and dr-stop-evaluation
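The 75% rule of thumb can also be checked from the shell instead of by eye in htop. This is a minimal sketch, assuming the 75% refers to memory usage and that you are on Linux with /proc/meminfo available. mem_used_pct is a hypothetical helper name, not a DRFC command.

```shell
# Hypothetical helper (not part of DRFC): print used memory as a whole
# percentage, computed from MemTotal and MemAvailable in a meminfo file.
# Assumption: the "<75%" rule of thumb is applied to memory usage.
mem_used_pct() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' "$meminfo"
}

# Example: check before adding another worker.
# [ "$(mem_used_pct)" -lt 75 ] || echo "Memory above 75%, do not add workers"
```

CPU load is not covered here, so htop is still worth a look for threads.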

Virtual DRFC Upload
aws configure
dr-upload-model -b -f
Uploads best checkpoint to S3

Physical DRFC Upload
dr-upload-car-zip -f
Sagemaker must be running for this to work
Only uses last checkpoint, not best

Container Update Links
Check your version with the command "docker images"

Run docker service ls to make sure you see s3_minio.

Sagemaker
https://hub.docker.com/r/awsdeepracercommunity/deepracer-sagemaker/tags?page=1&ordering=last_updated
For new Sagemaker images follow this guide:
https://github.com/aws-deepracer-community/deepracer-for-cloud/blob/master/docs/multi_gpu.md
Robomaker
https://hub.docker.com/r/awsdeepracercommunity/deepracer-robomaker/tags?page=1&ordering=last_updated
RL Coach
https://hub.docker.com/r/awsdeepracercommunity/deepracer-rlcoach/tags
Linux terminal startup script is called “.bashrc”

Open GL Robomaker
https://aws-deepracer-community.github.io/deepracer-for-cloud/opengl.html

  • example -> docker pull awsdeepracercommunity/deepracer-robomaker:4.0.12-gpu-gl
  • system.env: (Below bullet points)
  • DR_HOST_X=True; uses the local X server rather than starting one within the docker container.
  • DR_ROBOMAKER_IMAGE; choose the tag for an OpenGL enabled image - e.g. cpu-gl-avx for an image where Tensorflow will use the CPU, or gpu-gl for an image where Tensorflow will also use the GPU.
  • Do echo $DISPLAY and see what that is, should be :0 but might be :1
  • Make the DR_DISPLAY value in system.env the same as the echo value
  • dr-reload
  • source utils/setup-xorg.sh
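Copying the $DISPLAY value into system.env by hand is easy to get wrong, so it can be scripted. This is a minimal sketch, assuming system.env uses plain KEY=value lines and that the value contains no | or & characters. set_env_value is a hypothetical helper, not a DRFC command.

```shell
# Hypothetical helper (not part of DRFC): set KEY=value in an env file,
# replacing an existing KEY= line or appending one if it is missing.
# Assumption: the value contains no '|' or '&' (both are special to sed here).
set_env_value() {
    file="$1"; key="$2"; value="$3"
    if grep -q "^${key}=" "$file"; then
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" &&
            mv "${file}.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example, run from the DRFC folder:
# set_env_value system.env DR_DISPLAY "$DISPLAY" && dr-reload
```

Commented-out lines (starting with #) are left alone, so a new line is appended if the key only exists as a comment.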

  1. source utils/start-xorg.sh
  2. You should see the Xorg process listed in nvidia-smi once you run the start-xorg.sh script
  3. sudo pkill x11vnc
  4. sudo pkill Xorg

New Sagemaker -> M40 Tagging (redundant from v5.1.1)
With the latest images you don’t need to compile a specific image (like your -m40 image)

run -> docker tag 2b4e84b8c10a awsdeepracercommunity/deepracer-sagemaker:gpu-m40

Log Analysis

  1. run -> dr-start-loganalysis
  2. Only change needed is for model_logs_root
  3. e.g. 'minio/bucket/model-name/0'
  4. All track files & details: https://github.com/aws-deepracer-community/deepracer-race-data/tree/main/raw_data/tracks
  5. Might have to upload the new track to the tracks folder
  6. Repo for all racer data: https://github.com/aws-deepracer-community/deepracer-race-data/tree/main/raw_data/leaderboards

Run Second DRFC Instance

  1. Create 2 different run.env files or use 2 folders
  2. The DR_RUN_ID keeps things separate
  3. Only 1 minio should be running
  4. Use a unique model name
  5. Run source bin/activate.sh run-1.env to activate a separate environment
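The second environment file can be derived from the first rather than written by hand. This is a minimal sketch, assuming your run.env already contains DR_RUN_ID and DR_LOCAL_S3_MODEL_PREFIX lines (the latter being where I take the model name to live). make_second_env is a hypothetical helper, not a DRFC command.

```shell
# Hypothetical helper (not part of DRFC): copy a run.env to a new file with
# its own DR_RUN_ID and model name.
# Assumption: both keys already exist as KEY=value lines in the source file;
# lines for keys that are missing are not added.
make_second_env() {
    src="$1"; dst="$2"; run_id="$3"; model="$4"
    sed -e "s|^DR_RUN_ID=.*|DR_RUN_ID=${run_id}|" \
        -e "s|^DR_LOCAL_S3_MODEL_PREFIX=.*|DR_LOCAL_S3_MODEL_PREFIX=${model}|" \
        "$src" > "$dst"
}

# Example, run from the DRFC folder:
# make_second_env run.env run-1.env 1 my-second-model
# source bin/activate.sh run-1.env
```

The original run.env is left untouched, so the first instance keeps running with its own settings.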

Steps for fresh DRFC

  1. https://aws-deepracer-community.github.io/deepracer-for-cloud/installation.html
  2. ./bin/prepare.sh && sudo reboot
  3. docker start
  4. ARCH=gpu
  5. Run LARS script -> source bin/lars_one.sh
  6. docker swarm init (if this fails, run ifconfig -a to grab your IP, then pass it with --advertise-addr as in the next steps)
  7. ifconfig -a
  8. docker swarm init
  9. docker swarm init --advertise-addr 000.000.0.000
  10. sudo ./bin/init.sh -a gpu -c local
  11. docker images
  12. docker tag xxxxxxx awsdeepracercommunity/deepracer-sagemaker:gpu-m40
  13. source bin/activate.sh
  14. vim run.env
  15. vim system.env
  16. dr-update
  17. aws configure --profile minio
  18. aws configure (use real AWS IAM details here to allow upload of models)
  19. dr-reload
  20. docker ps -a
  21. Set up multiple GPUs
  22. cd custom-files
  23. vim on the 3 files (hyperparameters, reward function and model_metadata)
  24. dr-upload-custom-files
  25. Alternative editor to vim: gedit
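For the docker swarm init step that needs an IP, the address can be picked out of hostname -I instead of being read from ifconfig -a. This is a minimal sketch, assuming a Linux hostname command that supports -I. first_ipv4 is a hypothetical helper, not a DRFC or Docker command.

```shell
# Hypothetical helper (not part of DRFC or Docker): print the first IPv4
# address from a space-separated list such as the output of `hostname -I`.
first_ipv4() {
    for addr in $1; do
        case "$addr" in
            *:*) ;;                                # skip IPv6 addresses
            *.*.*.*) echo "$addr"; return 0 ;;     # first dotted address wins
        esac
    done
    return 1
}

# Example:
# docker swarm init --advertise-addr "$(first_ipv4 "$(hostname -I)")"
```

The first address is not always the one you want (it could belong to a Docker bridge or a VPN), so check it against ifconfig -a before using it.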

Troubleshooting DRFC (List of issues & solutions)

General Tip

It’s always worth checking whether anything new has been added to the default files that DRFC would then be expecting.

In particular, compare the default system.env and template-run.env files with your own.
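That comparison can be done with a few lines of shell. This is a minimal sketch, assuming both files use plain KEY=value lines. missing_env_keys is a hypothetical helper, not a DRFC command, and the paths in the example are assumptions to adjust to your own checkout.

```shell
# Hypothetical helper (not part of DRFC): list variable names that are set in
# a template env file but absent from your own env file.
# Commented-out lines (starting with '#') in the template are ignored.
missing_env_keys() {
    template="$1"; mine="$2"
    grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$template" | cut -d= -f1 | sort -u |
    while read -r key; do
        grep -q "^${key}=" "$mine" || echo "$key"
    done
}

# Example, run from the DRFC folder (paths are assumptions):
# missing_env_keys defaults/template-run.env run.env
# missing_env_keys defaults/template-system.env system.env
```

No output means every variable in the template also appears in your file. It does not compare values, only names.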

Troubleshooting Docker Start

Docker failed to start

docker ps -a
docker service ls
sudo service docker status
sudo service --status-all
sudo systemctl status docker.service

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker
sudo systemctl restart docker
sudo service docker restart
snap list
sudo su THEN apt-get install docker.io
Re-run Installing Docker (From Lars)
cat /etc/docker/daemon.json
apt-cache policy docker-ce
sudo tail /var/log/syslog
sudo cat /var/log/syslog | grep dockerd | tail
"For me it was a missing file"

sudo gedit /etc/docker/daemon.json
Make /etc/docker/daemon.json look like below:
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}

sudo systemctl stop docker then sudo systemctl start docker
test with -> docker images
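Curly quotes picked up when copying JSON from a web page are not valid JSON, and Docker will refuse to start with them in daemon.json. The sketch below is a rough sanity check for that and for the default runtime setting. It is not a JSON parser, and check_daemon_json is a hypothetical helper, not a Docker or DRFC command.

```shell
# Hypothetical helper (not part of Docker or DRFC): basic sanity check of a
# daemon.json. It only catches a missing file, curly quotes, and a missing
# "default-runtime": "nvidia" entry; it does not validate the JSON itself.
check_daemon_json() {
    file="${1:-/etc/docker/daemon.json}"
    [ -f "$file" ] || { echo "missing: $file"; return 1; }
    # Curly quotes pasted from a web page are not valid JSON.
    if grep -qF -e '“' -e '”' "$file"; then
        echo "curly quotes found in $file"; return 1
    fi
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$file" ||
        { echo "default-runtime is not nvidia in $file"; return 1; }
    echo "ok: $file"
}

# Example:
# check_daemon_json && sudo systemctl restart docker
```

If jq is installed, running jq . /etc/docker/daemon.json is a stricter check of the JSON syntax.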

Troubleshooting Docker Swarm

Could not connect to the endpoint URL: "http://localhost:9000/bucket"

Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.

You might have to disable IPv6 to stop Docker pulling from multiple addresses.

Here’s how to disable IPv6 on Linux if you’re running a Red Hat-based system (the same sysctl commands work on Ubuntu):

Open the terminal window.
Change to the root user.
Type these commands:
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1
To re-enable IPv6, type these commands:

sudo sysctl -w net.ipv6.conf.all.disable_ipv6=0
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=0
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=0
sysctl -p

run -> ./bin/init.sh (Resets run.env, system.env, hyperparameters, reward function & model_metadata)

run -> docker pull minio/minio:RELEASE.2022-10-24T18-35-07Z

DR_MINIO_IMAGE in system.env, make sure it’s set to:
RELEASE.2022-10-24T18-35-07Z

Useful Links

  1. Full Guide — https://aws-deepracer-community.github.io/deepracer-for-cloud
  2. Sudo — https://phpraxis.wordpress.com/2016/09/27/enable-sudo-without-password-in-ubuntudebian
  3. Training on multiple GPU — https://github.com/aws-deepracer-community/deepracer-for-cloud/blob/master/docs/multi_gpu.md
  4. nvidia monitor — https://stackoverflow.com/questions/8223811/a-top-like-utility-for-monitoring-cuda-activity-on-a-gpu
  5. Tesla M40 24GB specs — https://www.microway.com/hpc-tech-tips/nvidia-tesla-m40-24gb-gpu-accelerator-maxwell-gm200-close
  6. Complex shutdown — https://www.maketecheasier.com/schedule-ubuntu-shutdown
  7. Sudo shutdown — https://sdet.ro/blog/shutdown-ubuntu-with-timer
  8. Video trimmer — https://launchpad.net/~kdenlive/+archive/ubuntu/kdenlive-stable
  9. Flatpak — https://flatpak.org/setup/Ubuntu

Installation commands

  • sudo snap install jupyter
  • sudo apt install git
  • sudo apt install nvidia-cuda-toolkit
  • sudo apt install curl
  • sudo apt install jq
  • sudo pip install liquidctl (to install fan controller globally)
  • sudo apt install net-tools
  • sudo apt install vim
  • sudo apt-get install htop
  • sudo apt install hddtemp
  • sudo apt install lm-sensors
  • pip install --user pipenv
  • sudo apt install pipenv
  • pipenv install jupyterlab

Installing Docker

  1. sudo su (run from root)
  2. curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
  3. sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
  4. sudo apt-get update && sudo apt-get install -y --no-install-recommends docker-ce docker-ce-cli containerd.io
  5. sudo apt-get install -y --no-install-recommends nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime
  6. sudo apt-get upgrade

Steps for Cuda upgrade
