DEV Community: Zumo Labs

Will Self-Supervised Visual Transformers Replace Pre-Trained CNNs?

Hugo — Tue, 01 Jun 2021 14:58:40 +0000

Pre-trained CNNs are still king when training models for computer vision use cases. However, the emerging popularity of Visual Transformers (ViTs), and subsequent consensus about their unsupervised learning capabilities, gives unexpected space for ViTs to usurp the throne.

Pre-Trained CNNs

Convolutional Neural Networks (CNNs) work by sliding a small pattern (formally known as a kernel, and also referred to as a "filter") across an image (Slide 1); the resulting grid of responses is called a "feature map." This sliding strategy is effective because it acts as a natural form of translation invariance: once a CNN can recognize something in one part of the image, it can recognize it in any part of the image [1]. However, this approach leads to a kind of fragility: the learned filters are often overfit to a particular texture or object size.
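To make the sliding idea concrete, here is a minimal sketch in NumPy. The kernel is a hand-picked 2x2 template rather than a learned one, and the image sizes are toy values chosen for illustration. It shows that when the pattern moves in the image, the peak response moves by the same amount.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map.

    Like most deep learning libraries, this computes a cross-correlation:
    the kernel is not flipped before sliding.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The kernel is a template for a 2x2 bright square.
kernel = np.ones((2, 2))

image = np.zeros((6, 6))
image[1:3, 1:3] = 1.0      # pattern near the top-left

shifted = np.zeros((6, 6))
shifted[3:5, 2:4] = 1.0    # same pattern, moved 2 rows down and 1 column right

response = conv2d_valid(image, kernel)
response_shifted = conv2d_valid(shifted, kernel)

# Location of the strongest response in each map.
peak = np.unravel_index(np.argmax(response), response.shape)
peak_shifted = np.unravel_index(np.argmax(response_shifted), response_shifted.shape)
```

The same kernel finds the pattern in both images; only the location of the peak changes.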

Learning these filters requires a ton of data, so CNNs are usually pre-trained on a large generic dataset like COCO or ImageNet, the latter boasting over one million images and 1,000 categories. A pre-trained CNN can then be fine-tuned to a new task by cutting off the model head and retraining with a new, often much smaller, dataset (Slide 2).
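The fine-tuning recipe can be sketched without any deep learning framework. In the toy example below, a fixed random projection stands in for the pre-trained backbone (an assumption made purely for illustration; a real backbone would be the convolutional layers of a pre-trained CNN), and only a new linear head is trained on a small dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone: a fixed projection followed by ReLU.
backbone_weights = rng.normal(size=(8, 16)) / np.sqrt(8)

def backbone(x):
    return np.maximum(x @ backbone_weights, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Small new dataset: 40 samples, 3 new classes (hypothetical toy data).
x = rng.normal(size=(40, 8))
y = rng.integers(0, 3, size=40)
one_hot = np.eye(3)[y]

features = backbone(x)     # the backbone is frozen, so features are computed once
head = np.zeros((16, 3))   # the new head, trained from scratch

def cross_entropy(head_weights):
    probs = softmax(features @ head_weights)
    return float(-np.mean(np.log(probs[np.arange(len(y)), y])))

initial_loss = cross_entropy(head)
for _ in range(200):
    probs = softmax(features @ head)
    grad = features.T @ (probs - one_hot) / len(y)   # gradient of the cross-entropy
    head -= 0.1 * grad                               # only the head is updated
final_loss = cross_entropy(head)
```

Because the backbone never changes, each training step only touches the small head, which is why fine-tuning can get by with far less data and compute than pre-training.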

Transformers

Transformers have been popular in natural language processing (NLP) for quite some time. They work through a concept known as "self-attention," which weights certain parts of the input more heavily than others [2]. In NLP, this allows for specific words within a sentence to be identified as more important. There are different types of attention and plenty of nuance for the experts to argue over, but the words "attention" and "focus" are good mental models of how these networks learn.
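For readers who prefer code, the sketch below shows scaled dot-product self-attention, the variant used in the original Transformer, in NumPy. The token embeddings and projection matrices are random toy values, not trained weights.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token vectors."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # similarity of every pair of tokens
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))              # 5 tokens with 8-dim embeddings (toy data)
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
output, attn = self_attention(tokens, w_q, w_k, w_v)
```

Row i of `attn` says how much token i "pays attention" to every other token when building its output vector.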

Self-Supervised ViT

Self-supervised training is a little different in that it does not require labels: you don't need to tell the model that the object in an image belongs to the category "cat," for example. Instead, a self-supervised training technique might involve taking several crops of an image, feeding them through multiple networks, and then getting the networks to agree on which features in the image are essential (Slide 3). This type of learning technique, called DINO [3], has been used to successfully train visual transformers (transformers for visual tasks, e.g., on images). The ViTs trained with DINO ended up surprisingly effective for classification tasks, reaching about 80% top-1 accuracy on ImageNet in linear evaluation. Inspecting the self-attention maps of these ViTs also shows that they can very precisely segment out objects in an image (Slide 4).
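The "getting networks to agree" step can be sketched as follows. This is a simplified illustration of the DINO objective, not the authors' implementation: the network outputs are random stand-ins, the temperatures and momentum are illustrative values, and the multi-crop logic is omitted. A student network is trained to match a sharpened teacher distribution, while the teacher's weights slowly follow the student's.

```python
import numpy as np

def softmax(z, temp):
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_out, teacher_out, center, student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher distribution and the student's."""
    p_teacher = softmax(teacher_out - center, teacher_temp)   # treated as a fixed target
    log_p_student = np.log(softmax(student_out, student_temp))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights track an exponential moving average of the student's weights."""
    return momentum * teacher_params + (1.0 - momentum) * student_params

rng = np.random.default_rng(0)
teacher_out = rng.normal(size=(4, 16))   # stand-in outputs for 4 image crops
center = np.zeros(16)                    # running center, starts at zero

agree = dino_loss(teacher_out, teacher_out, center)      # student matches the teacher
disagree = dino_loss(-teacher_out, teacher_out, center)  # student prefers the opposite
```

The loss is low when the two networks agree on a crop and high when they do not, and no label ever enters the computation.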

Now, the bold prediction: self-supervised ViTs will eventually replace pre-trained CNNs as the go-to feature encoders for computer vision tasks. There are still unanswered questions, such as whether ViTs will generalize outside the training distribution better than CNNs. But one thing is sure: not requiring labels during training will enable using much larger datasets. Consider the difference in capacity between ImageNet and a self-supervised ViT trained on the entire internet of images…

Conclusion

Thanks for reading our latest paper exploration. If you love computer vision, check out zpy [4], our open-source synthetic data development toolkit. It's everything you need to generate and iterate on synthetic training data for computer vision. Your feedback, commits, and feature requests are invaluable as we continue to build a more robust set of tools for generating synthetic data. Meanwhile, if you could use support with a particularly tricky problem, please reach out.

References

[1] CS231n Convolutional Neural Networks for Visual Recognition - Convolutional Neural Networks (https://cs231n.github.io/convolutional-networks/)
[2] Transformer: A Novel Neural Network Architecture for Language Understanding (https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html)
[3] Emerging Properties in Self-Supervised Vision Transformers (https://arxiv.org/pdf/2104.14294.pdf)
[4] zpy (github.com/ZumoLabs/zpy)

Synthetic Data Experiments: Package Detection

Hugo — Mon, 24 May 2021 14:34:15 +0000

Having a package stolen is frustrating. As Mark Rober has demonstrated, it can drive people to the edge of madness. But what if you could build your own package detection model using exclusively synthetic data? We’ve outlined a few short steps we took to go from synthetic data generation to a working detector.

Generate Synthetic Data

Synthetic data is generated from a simulation or “sim”—typically a scene that has been created from custom or stock 3D models. Sims can run in the cloud in parallel to create virtually infinite training data. I created a sim for package detection using open-source 3D graphics software Blender and zpy [1]. In this sim, assorted 3D packages are spawned while the camera angle and lighting conditions are randomized. The resulting synthetic dataset is visually diverse and perfectly labeled.
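The randomization step can be sketched in plain Python. The names below (`SceneParams`, `sample_scene`) and the parameter ranges are hypothetical and are not part of zpy or Blender; the point is only that each rendered image gets its own randomly drawn camera and lighting settings.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneParams:
    """Hypothetical per-image settings for a package detection sim."""
    camera_azimuth_deg: float
    camera_elevation_deg: float
    light_intensity: float
    num_packages: int

def sample_scene(rng):
    """Draw one random scene configuration (ranges are illustrative assumptions)."""
    return SceneParams(
        camera_azimuth_deg=rng.uniform(0.0, 360.0),
        camera_elevation_deg=rng.uniform(10.0, 80.0),
        light_intensity=rng.uniform(0.2, 2.0),
        num_packages=rng.randint(1, 5),
    )

rng = random.Random(42)   # seeded so the dataset can be regenerated exactly
scenes = [sample_scene(rng) for _ in range(1000)]
```

Because the sim places every object itself, the labels for each image come for free alongside these parameters.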

Figure 1: Synthetic images of packages generated from a sim.

Collect the Test Data

To test our model trained on the synthetic data, we are going to need to collect some real images. We found some on the internet, and manually labeled them using a DIY labeling platform called RoboFlow [2]. Give it a try. After spending an hour drawing bounding boxes on images, take a moment to appreciate that nearly all training data has to be painstakingly manually labeled like that. It’s the sort of tedious work that folks in developing countries wind up being paid pennies for. Talk about a dystopian future…

Train the Model

Armed with our synthetic training dataset and our real test dataset, we are ready to do some model training. We used a ResNet variant implemented in PyTorch, from the Detectron2 GitHub repo [3]. This network was pre-trained on ImageNet, so we only need to fine-tune it briefly on our synthetic dataset before it is capable of making decent predictions. Not bad for such a small dataset (1,000 synthetic images) and such a short training time (30 minutes).

Figure 2: Predictions from our neural network trained on synthetic data. False positives shown for context.

Closing Thoughts

These are great results for the first iteration. To improve model performance further we could increase the size of the dataset, add more variety to the sim, or pick better hyperparameters for our model. Evaluating model performance on real test data and iterating is core to the synthetic data workflow. After all, the coolest thing about synthetic training data is that it’s ultimately dynamic data.
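One common way to score a detector against hand-labeled boxes is intersection over union (IoU). The post does not say which metric was used, so treat the sketch below as one reasonable option rather than a description of our evaluation code. Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples.

```python
def iou(box_a, box_b):
    """Intersection over union for two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero so disjoint boxes have no overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

predicted = (10, 10, 50, 50)   # hypothetical model output
labeled = (30, 30, 70, 70)     # hypothetical hand-drawn label
score = iou(predicted, labeled)
```

A prediction is typically counted as correct when its IoU with a labeled box exceeds a chosen threshold, and tracking that number across iterations of the sim shows whether a change actually helped.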

For your next computer vision project, whether it be a hobby or your job, spare those poor manual data labelers and consider trying out the synthetic approach. We’ve made it easy for you: we’ve released our data development toolkit zpy [1] under an open source license. Now everything you need to generate and iterate on synthetic data for computer vision is available for free. Your feedback, commits, and feature requests will be invaluable as we continue to build a more robust set of tools for generating synthetic data. Meanwhile, if you could use hands-on support with a particularly tricky problem, please reach out!

References

[1] zpy (github.com/ZumoLabs/zpy)
[2] RoboFlow (roboflow.com)
[3] Detectron2 (github.com/facebookresearch/detectron2)

What is Neural Rendering?

Hugo — Tue, 04 May 2021 17:07:42 +0000

As our world becomes increasingly digitized, the methods by which we render these virtual worlds are rapidly changing. Neural rendering has huge potential in improving many aspects of the rendering pipeline by leveraging generative machine learning techniques. What is neural rendering? In this article we'll introduce the concept, compare it to classical computer graphics, and discuss what it means for the future.

Classic Rendering

Creating 3D virtual worlds today is a complicated and involved process. Each item, or asset, in a virtual scene is represented by a polygon mesh (Slide 1). This polygon mesh can either be modeled by an artist or scanned into existence; both of these processes are manual and time-consuming. The more detailed we want an asset to be, the more polygons its mesh will have.

The polygon mesh is only the beginning. Each surface in this 3D world also has a corresponding material, which determines the appearance of the mesh. At runtime, the material and mesh of the object are used as inputs to shader programs, which calculate the appearance of the object under given lighting conditions and a specific camera angle (Slide 2). Over the years, many different shader programs have been developed, though the fundamental principle is the same: use the laws of physics to calculate the appearance of an object. This is most evident in the approach known as ray tracing, where light rays are traced between the light sources, the surfaces they bounce off, and the camera.
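As a small illustration of what a shader computes, here is a sketch of Lambertian (diffuse) shading in NumPy. It is one of the simplest physically motivated shading models and is used here only as an example; real shader programs run on the GPU and handle far more effects. The brightness of a surface point depends on the angle between the surface normal and the direction to the light.

```python
import numpy as np

def lambert_shade(normal, light_dir, albedo, light_intensity=1.0):
    """Diffuse color of a surface point; `light_dir` points from the surface to the light."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    # Surfaces facing away from the light receive nothing, hence the clamp at zero.
    return albedo * light_intensity * max(0.0, float(n @ l))

albedo = np.array([0.8, 0.2, 0.2])   # a reddish material
normal = np.array([0.0, 0.0, 1.0])   # surface facing straight up

facing = lambert_shade(normal, np.array([0.0, 0.0, 1.0]), albedo)    # light overhead
angled = lambert_shade(normal, np.array([1.0, 0.0, 1.0]), albedo)    # light at 45 degrees
behind = lambert_shade(normal, np.array([0.0, 0.0, -1.0]), albedo)   # light below the surface
```

Here the material is reduced to a single albedo color and the mesh to a single normal; a full pipeline evaluates something like this for every visible point, every frame.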

This render pipeline can create amazing results: every CGI effect in every movie you have seen, and every game you have ever played uses some form of this "classical computer graphics" pipeline. The main pain point for this pipeline is in the huge amount of work required to explicitly define every object and every material, and the large computation required to render a realistic or complex scene. Which leads us to the question: what if we didn't have to define every object and calculate every light bounce?

Enter Neural Rendering

So, what is neural rendering? Though still a very young field, it's one which has grown to encompass a large number of techniques (GANs, for example, are a form of neural rendering). The key concept behind neural rendering approaches is that they are differentiable. A differentiable function is one whose derivative exists at each point in its domain. This is important because machine learning is basically the chain rule with extra steps: a differentiable rendering function can be learned from data, one gradient descent step at a time. Learning a rendering function statistically through data is fundamentally different from the classic rendering methods we described above, which calculate and extrapolate from the known laws of physics.
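Here is a deliberately tiny example of learning through a differentiable renderer. The "renderer" is a one-line diffuse brightness model with a single unknown parameter, chosen so the gradient can be written by hand; real neural renderers have millions of parameters and rely on automatic differentiation, but the loop has the same shape.

```python
import numpy as np

def render(albedo, cos_theta):
    """A tiny differentiable 'renderer': diffuse brightness at several light angles."""
    return albedo * cos_theta

cos_theta = np.array([1.0, 0.8, 0.5, 0.2])   # four lighting conditions
true_albedo = 0.7
observed = render(true_albedo, cos_theta)     # the "training images"

albedo = 0.1                                  # initial guess
learning_rate = 0.5
for _ in range(100):
    residual = render(albedo, cos_theta) - observed
    # Derivative of the mean squared error with respect to albedo, via the chain rule.
    grad = np.mean(2.0 * residual * cos_theta)
    albedo -= learning_rate * grad
```

After the loop, `albedo` has been recovered from the observations alone, without anyone specifying the material by hand.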

One of the coolest flavors of neural rendering is novel view synthesis. In this problem, a neural network learns to render a scene from an arbitrary viewpoint. Slides 3 and 4 are figures from two great papers on this topic: one from Google Research [1] and the other from Facebook Reality Labs [2]. Both of these works use a volume rendering technique known as ray marching. Ray marching is when you shoot out a ray from the observer (camera) through a 3D volume in space and ask a function: what is the color and opacity at this particular point in space? Neural rendering takes the next step by using a neural network to approximate this function.
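A minimal ray marcher can be sketched in NumPy as below. The function being queried is a hand-written red sphere, a stand-in for the neural network that a method like NeRF would train; the sample count, density, and scene layout are illustrative assumptions.

```python
import numpy as np

def march_ray(origin, direction, field, near=0.0, far=4.0, num_samples=128):
    """Composite color along one ray by querying `field` at evenly spaced points."""
    direction = direction / np.linalg.norm(direction)
    step = (far - near) / num_samples
    color = np.zeros(3)
    transmittance = 1.0                       # fraction of light not yet absorbed
    for i in range(num_samples):
        t = near + (i + 0.5) * step
        rgb, density = field(origin + t * direction)
        alpha = 1.0 - np.exp(-density * step)  # opacity of this small segment
        color += transmittance * alpha * rgb
        transmittance *= 1.0 - alpha
    return color, 1.0 - transmittance          # final color and total opacity

def red_sphere(point):
    """Stand-in for the learned function: a dense red unit sphere centered at z = 2."""
    inside = np.linalg.norm(point - np.array([0.0, 0.0, 2.0])) < 1.0
    return np.array([1.0, 0.0, 0.0]), (50.0 if inside else 0.0)

origin = np.zeros(3)
hit_color, hit_opacity = march_ray(origin, np.array([0.0, 0.0, 1.0]), red_sphere)
miss_color, miss_opacity = march_ray(origin, np.array([1.0, 0.0, 0.0]), red_sphere)
```

Swapping `red_sphere` for a trained network that maps a 3D point to color and density is the essential move in this family of methods; the compositing loop stays the same.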

The Future of Rendering

We really just scratched the surface when it comes to neural rendering. If you want to learn more, we recommend this super extensive summary paper [3]. But before we go, what could this mean for the future?

With neural rendering, we no longer need to physically model the scene and simulate the light transport, as this knowledge is now stored implicitly inside the weights of a neural network. This means that it will be possible to render your face, while it is inside a VR headset (Slide 5), without ever having to store or distort a 3D polygon mesh of your face!

With neural rendering, the compute required to render an image is also no longer tied to the complexity of the scene (the number of objects, lights, and materials), but rather to the size of the neural network (the time required to perform a forward pass). This opens the door to really high quality rendering at a blazingly fast frame rate.

If you're interested in the intersection of machine learning and 3D, please check out our open source synthetic data toolkit zpy [4]. Your feedback, commits, and feature requests will be invaluable as we continue to build a more robust set of tools for generating synthetic data. Who knows? Perhaps the next great neural rendering model will be trained using data generated with zpy.

References

[1] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (arxiv.org/pdf/2003.08934.pdf)
[2] Neural Volumes: Learning Dynamic Renderable Volumes from Images (arxiv.org/pdf/1906.07751.pdf)
[3] State of the Art on Neural Rendering (arxiv.org/pdf/2004.03805.pdf)
[4] zpy (github.com/ZumoLabs/zpy)