Synthetic Data Experiments: Package Detection

#python #datascience #machinelearning #deeplearning

Having a package stolen is frustrating. As Mark Rober has demonstrated, it can drive people to the edge of madness. But what if you could build your own package detection model using exclusively synthetic data? Below are the few short steps we took to go from synthetic data generation to a working detector.

Generate Synthetic Data

Synthetic data is generated from a simulation, or “sim”: typically a scene built from custom or stock 3D models. Sims can run in parallel in the cloud to create a virtually unlimited amount of training data. We created a sim for package detection using the open-source 3D graphics software Blender and zpy [1]. In this sim, assorted 3D packages are spawned while the camera angle and lighting conditions are randomized. The resulting synthetic dataset is visually diverse and perfectly labeled, because the sim knows exactly where each package sits in every image.
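The randomization described above can be sketched without Blender at all. The snippet below is a minimal illustration, not the actual sim: it only samples the kind of per-image parameters a sim might randomize. The `SceneParams` fields, the value ranges, and the helper names are all hypothetical, and none of this is zpy or Blender API.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    """Hypothetical per-image settings a sim could randomize."""
    camera_azimuth_deg: float    # rotation of the camera around the scene
    camera_elevation_deg: float  # camera height angle above the ground plane
    camera_distance_m: float     # distance from the camera to the scene origin
    light_energy_w: float        # brightness of the main light
    num_packages: int            # how many packages to spawn


def sample_scene(rng):
    """Draw one random scene configuration (ranges are illustrative)."""
    return SceneParams(
        camera_azimuth_deg=rng.uniform(0.0, 360.0),
        camera_elevation_deg=rng.uniform(10.0, 80.0),
        camera_distance_m=rng.uniform(1.0, 4.0),
        light_energy_w=rng.uniform(100.0, 1000.0),
        num_packages=rng.randint(1, 5),
    )


def camera_position(params):
    """Convert the spherical camera parameters to an (x, y, z) position."""
    az = math.radians(params.camera_azimuth_deg)
    el = math.radians(params.camera_elevation_deg)
    d = params.camera_distance_m
    return (
        d * math.cos(el) * math.cos(az),
        d * math.cos(el) * math.sin(az),
        d * math.sin(el),
    )


# A fixed seed makes the "random" dataset reproducible between runs.
rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(1000)]
```

In a real sim, each sampled configuration would be applied to the scene before rendering one image and writing out its labels.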

Figure 1: Synthetic images of packages generated from a sim.

Collect the Test Data

To test a model trained on synthetic data, we need some real images. We found some on the internet and manually labeled them using a DIY labeling platform called RoboFlow [2]. Give it a try. After spending an hour drawing bounding boxes on images, take a moment to appreciate that nearly all training data has to be painstakingly labeled by hand like that. It’s the sort of tedious work that folks in developing countries wind up being paid pennies for. Talk about a dystopian future…

Train the Model

Armed with our synthetic training dataset and our real test dataset, we are ready to do some model training. We used a ResNet variant implemented in PyTorch, from the Detectron2 GitHub repo [3]. This network was pre-trained on ImageNet, so we only need to fine-tune it a little longer on our synthetic dataset before it is capable of making decent predictions. Not bad for such a small dataset (1000 synthetic images) and such a short training time (30 minutes).
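For readers who want a starting point, here is a rough sketch of what fine-tuning with Detectron2 can look like. It is not our exact training script. It assumes the synthetic dataset has been exported as COCO-style annotations, uses a Faster R-CNN with a ResNet-50 FPN backbone as a stand-in for "a ResNet variant", and picks illustrative hyperparameters. The dataset name and file paths are hypothetical. The sketch leaves the model weights at the config's default, which as far as we can tell points at an ImageNet-pretrained ResNet-50 backbone. Detectron2 must be installed for `train()` to run.

```python
# Hypothetical dataset name and paths; adjust to wherever your data lives.
TRAIN_NAME = "packages_synthetic_train"
TRAIN_JSON = "synthetic/annotations.json"   # assumed COCO-style annotations
TRAIN_IMAGES = "synthetic/images"
# Assumed model choice from the Detectron2 model zoo.
CONFIG_NAME = "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"


def build_cfg(output_dir="./output", max_iter=1000):
    """Build a Detectron2 config for single-class package detection."""
    # Imported lazily so this module can be read without Detectron2 installed.
    from detectron2 import model_zoo
    from detectron2.config import get_cfg

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(CONFIG_NAME))
    cfg.DATASETS.TRAIN = (TRAIN_NAME,)
    cfg.DATASETS.TEST = ()
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1   # one class: "package"
    cfg.SOLVER.IMS_PER_BATCH = 2          # illustrative values, not tuned
    cfg.SOLVER.BASE_LR = 0.00025
    cfg.SOLVER.MAX_ITER = max_iter
    cfg.OUTPUT_DIR = output_dir
    return cfg


def train():
    """Register the synthetic dataset and fine-tune on it."""
    import os
    from detectron2.data.datasets import register_coco_instances
    from detectron2.engine import DefaultTrainer

    register_coco_instances(TRAIN_NAME, {}, TRAIN_JSON, TRAIN_IMAGES)
    cfg = build_cfg()
    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()

# Call train() to start fine-tuning once the paths above point at real data.
```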

Figure 2: Predictions from our neural network trained on synthetic data. False positives shown for context.

Closing Thoughts

These are great results for a first iteration. To improve model performance further, we could increase the size of the dataset, add more variety to the sim, or pick better hyperparameters for our model. Evaluating model performance on real test data and iterating is core to the synthetic data workflow. After all, the coolest thing about synthetic training data is that it’s dynamic: when the model falls short, you can change the sim and regenerate the dataset.
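Evaluating on the real test set comes down to comparing predicted boxes against the hand-labeled ones. The snippet below is a minimal sketch of one common way to do that, not necessarily the metric we used: it computes intersection over union (IoU) and counts true positives, false positives, and misses. It assumes boxes are given as (x_min, y_min, x_max, y_max) and uses an IoU threshold of 0.5. The function names and the example boxes are made up for illustration.

```python
def box_area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def iou(a, b):
    """Intersection over union of two boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = box_area(overlap)  # zero when the boxes do not overlap
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truth, iou_threshold=0.5):
    """Greedily match predictions (highest score first) to labeled boxes.

    predictions: list of (box, score) pairs; ground_truth: list of boxes.
    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(ground_truth)
    tp = fp = 0
    for box, _score in sorted(predictions, key=lambda p: p[1], reverse=True):
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            unmatched.remove(best)  # each labeled box can be matched only once
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)


# Made-up example: one good detection, one false positive, one missed package.
predictions = [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)]
ground_truth = [(1, 1, 10, 10), (100, 100, 110, 110)]
tp, fp, fn = count_matches(predictions, ground_truth)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.50
```

Tracking these numbers across iterations of the sim is one way to tell whether a change to the synthetic data actually helped on real images.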

For your next computer vision project, whether it be a hobby or your job, spare those poor manual data labelers and consider trying out the synthetic approach. We’ve made it easy for you: we’ve released our data development toolkit zpy [1] under an open-source license. Now everything you need to generate and iterate on synthetic data for computer vision is available for free. Your feedback, commits, and feature requests will be invaluable as we continue to build a more robust set of tools for generating synthetic data. Meanwhile, if you could use hands-on support with a particularly tricky problem, please reach out!