<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tyler Kim</title>
    <description>The latest articles on DEV Community by Tyler Kim (@tylertaewook).</description>
    <link>https://dev.to/tylertaewook</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F603852%2F6f6cc1c9-ead1-49e2-8951-3e8a43e52ca2.png</url>
      <title>DEV Community: Tyler Kim</title>
      <link>https://dev.to/tylertaewook</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tylertaewook"/>
    <language>en</language>
    <item>
      <title>Understanding and Implementing Proximal Policy Optimization (Schulman et al., 2017)</title>
      <dc:creator>Tyler Kim</dc:creator>
      <pubDate>Thu, 06 May 2021 00:56:42 +0000</pubDate>
      <link>https://dev.to/tylertaewook/understanding-and-implementing-proximal-policy-optimization-schulman-et-al-2017-20on</link>
      <guid>https://dev.to/tylertaewook/understanding-and-implementing-proximal-policy-optimization-schulman-et-al-2017-20on</guid>
      <description>&lt;p&gt;Research in policy gradient methods has been prevalent in recent years, with algorithms such as TRPO, GAE, and A2C/A3C showing state-of-the-art performance over traditional methods such as Q-learning. One of the core algorithms in this policy gradient/actor-critic field is &lt;strong&gt;Proximal Policy Optimization Algorithm&lt;/strong&gt; implemented by OpenAI.&lt;/p&gt;

&lt;p&gt;In this post, I try to accomplish the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discuss the motives behind PPO by providing a beginner-friendly overview of Policy Gradient Methods and Trust Region Methods (TRPO)&lt;/li&gt;
&lt;li&gt;Understand the core contributions of PPO: the &lt;strong&gt;Clipped Surrogate Objective&lt;/strong&gt; and &lt;strong&gt;Multiple-Epoch Policy Updates&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  &lt;strong&gt;Motives&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Destructive Policy Updates
&lt;/h2&gt;

&lt;p&gt;We first need to understand the optimization objective of Policy Gradient methods, defined as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0pTL8J8D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lxdvqu1lno5xulbujb9l.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0pTL8J8D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lxdvqu1lno5xulbujb9l.jpeg" alt="policyloss"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The policy is our neural network: it takes a state observation from the environment as input and suggests actions as output. The advantage is an estimate (hence the hat over A) of the relative value of the selected action in the current state. It is computed as the &lt;em&gt;discounted reward (Q) minus the value function&lt;/em&gt;, where the value function gives an estimate of the discounted sum of future rewards. During training, the neural net representing the value function is frequently updated using the experience our agent collects in the environment. However, that also means the &lt;strong&gt;value estimate will be quite noisy due to the variance introduced by the network&lt;/strong&gt;; the network will not always predict the exact value of a state.&lt;/p&gt;

&lt;p&gt;Multiplying the log probability of the policy's output by the advantage function gives us a clever optimization objective. If the advantage is positive, meaning the actions the agent took in the sampled trajectory resulted in a better-than-average return, the policy gradient is positive, increasing the probability of selecting those actions again when we encounter a similar state. If the advantage is negative, the policy gradient is negative and does the exact opposite.&lt;/p&gt;
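&lt;p&gt;To make the sign argument concrete, here is a minimal sketch of this objective in Python (the array names and values are purely illustrative, not from the paper):&lt;/p&gt;

```python
import numpy as np

def pg_loss(log_probs, advantages):
    """Vanilla policy gradient objective: the mean of log pi(a|s) * A_hat.

    Returns the negated objective so that minimizing the loss
    performs gradient ascent on the expected return.
    """
    return -np.mean(log_probs * advantages)

# A positive advantage pushes the action's probability up;
# a negative advantage pushes it down.
loss = pg_loss(np.log(np.array([0.5, 0.25])), np.array([1.0, -2.0]))
```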

&lt;p&gt;As appealing as it is to keep performing gradient steps on one batch of collected experience, doing so will often move the parameters so far from the original policy that it leads to &lt;strong&gt;"destructively large policy updates."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Trust Region Policy Optimization
&lt;/h2&gt;

&lt;p&gt;One approach to preventing such destructive policy updates was &lt;em&gt;&lt;a href="https://arxiv.org/abs/1502.05477"&gt;Trust Region Policy Optimization (Schulman et al., 2015)&lt;/a&gt;&lt;/em&gt;. In this paper, the authors limit the policy gradient step so it does not move too far away from the original policy, since overly large updates often ruin the policy altogether.&lt;/p&gt;

&lt;p&gt;First, we define r(θ) as the ratio between the probability of an action under the current policy and its probability under the previous policy.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YHE3B0cw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fxmddi7va3pwkvsl4ua4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YHE3B0cw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fxmddi7va3pwkvsl4ua4.jpeg" alt="trpo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given a sequence of sampled actions and states, r(θ) will be greater than one if the particular action is more probable under the current policy than under the old policy, and between 0 and 1 when it is less probable under the current policy.&lt;/p&gt;
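&lt;p&gt;As a small illustration (the probabilities below are made up), the ratio is usually computed from log-probabilities for numerical stability:&lt;/p&gt;

```python
import numpy as np

def prob_ratio(logp_new, logp_old):
    """r(theta): probability of the action under the current policy
    divided by its probability under the old policy."""
    return np.exp(logp_new - logp_old)

# The action is twice as probable under the new policy, so r(theta) is 2.
r = prob_ratio(np.log(0.6), np.log(0.3))
```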

&lt;p&gt;Now, if we multiply this r(θ) by the previously mentioned advantage function, we get TRPO's objective in a more readable format:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I44U93xP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cqnlql1bjcvmvskrr7tj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I44U93xP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cqnlql1bjcvmvskrr7tj.jpeg" alt="trpo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that this TRPO objective is actually quite similar to the vanilla policy gradient objective on the left. In fact, the only difference is that the log operator is replaced by the probability of the action under the current policy divided by its probability under the previous policy. Optimizing this objective is otherwise identical.&lt;/p&gt;

&lt;p&gt;Additionally, TRPO added a KL-divergence constraint to keep the gradient step from moving the policy too far away from the original policy. As a result, the gradient stays in the region where we know everything works fine, hence the name 'trust region.' However, this KL constraint is known to add overhead to the optimization process, which can sometimes lead to undesirable training behavior.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;PPO&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Clipped Surrogate Objective
&lt;/h2&gt;

&lt;p&gt;With the motives mentioned above, Proximal Policy Optimization attempts to simplify the optimization process while retaining the advantages of TRPO. One of this paper's main contributions is the clipped surrogate objective:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_skoR4hu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ei8qt1hdgo9ewfgyjiw0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_skoR4hu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ei8qt1hdgo9ewfgyjiw0.jpeg" alt="trpo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we compute an expectation over the minimum of two terms: the &lt;em&gt;normal PG objective&lt;/em&gt; and a &lt;em&gt;clipped PG objective&lt;/em&gt;. The key component is the second term, where the normal PG objective is truncated by a clipping operation between 1-epsilon and 1+epsilon, epsilon being a hyperparameter (0.2 in the paper).&lt;/p&gt;

&lt;p&gt;Because of the min operation, this objective behaves differently depending on whether the advantage estimate is positive or negative.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4CSRc-Fw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p559riwdtpb94jxchrf2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4CSRc-Fw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p559riwdtpb94jxchrf2.jpeg" alt="trpo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's first look at the left figure, depicting a positive advantage: the case when the selected action had a better-than-expected effect on the outcome. In the graph, the loss function flattens out when &lt;em&gt;r&lt;/em&gt; gets too high, i.e., when the action is a lot more likely under the current policy than it was under the old policy. We do not want to overdo the update by taking a step too far, so we 'clip' the objective, which also blocks the gradient along the flat line.&lt;/p&gt;

&lt;p&gt;The same applies to the right figure, where the advantage estimate is negative. The loss function flattens out when &lt;em&gt;r&lt;/em&gt; goes near zero, meaning the particular action is much less likely under the current policy.&lt;/p&gt;

&lt;p&gt;As clever as this approach is, the clipping operation also helps us 'undo' the policy's mistakes. For example, the highlighted part of the right figure shows the region where the last gradient step made the selected action a lot more probable while also making the policy worse, as shown by the negative advantage. Thankfully, the clipping operation will kindly tell the gradient to walk in the other direction, in proportion to the amount we messed up. This is the only region where the first term inside &lt;code&gt;min()&lt;/code&gt; is lower than the second term, acting as a backup plan. And the most beautiful part is that PPO does all of this without having to compute additional KL constraints.&lt;/p&gt;
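&lt;p&gt;A minimal sketch of the clipped surrogate objective in Python (epsilon and the inputs here are illustrative defaults, not prescribed by the text above):&lt;/p&gt;

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO objective: mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # For a positive advantage, the clip caps the reward for raising r too far;
    # for a negative advantage with a large r, min() keeps the unclipped term,
    # giving a strong corrective gradient (the "undo" behavior).
    return np.mean(np.minimum(unclipped, clipped))
```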

&lt;p&gt;All of these ideas can be summarized in the final loss function by summing this clipped PPO objective and two additional terms:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3fE1P1dg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b0li9mepq9kejekuuek0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3fE1P1dg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b0li9mepq9kejekuuek0.jpeg" alt="trpo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, c1 and c2 are hyperparameter coefficients. The first additional term is the mean squared error of the value function, in charge of updating the baseline network. The second, which may look unfamiliar, is an entropy term used to ensure enough exploration for our agents; it pushes the policy to behave more spontaneously until the other parts of the objective start dominating.&lt;/p&gt;
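&lt;p&gt;Putting the three terms together, a hedged sketch of the final loss (the c1 and c2 values below are common choices for illustration, not values taken from this post):&lt;/p&gt;

```python
import numpy as np

def ppo_loss(ratio, advantage, value_pred, value_target, entropy,
             epsilon=0.2, c1=0.5, c2=0.01):
    """Combined objective: L_clip - c1 * value MSE + c2 * entropy bonus.

    Returned negated so it can be minimized with standard optimizers.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    l_clip = np.mean(np.minimum(unclipped, clipped))
    value_loss = np.mean((value_pred - value_target) ** 2)  # baseline update
    entropy_bonus = np.mean(entropy)                        # exploration push
    return -(l_clip - c1 * value_loss + c2 * entropy_bonus)
```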

&lt;h2&gt;
  
  
  Multiple Epochs for Policy Updating
&lt;/h2&gt;

&lt;p&gt;Finally, let's look at the algorithm as a whole and the beauty of its parallel actors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mRB0fmmk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8w5kbcyezwdqj3m5ce7g.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mRB0fmmk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8w5kbcyezwdqj3m5ce7g.jpeg" alt="algorithm"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The algorithm consists of two large threads: the beige thread and the green thread. The beige threads collect data, calculate advantage estimates, and sample mini-batches for the green thread to use. One special take: these tasks are done by &lt;em&gt;N&lt;/em&gt; parallel actors, each working independently.&lt;/p&gt;

&lt;p&gt;Running multiple epochs of gradient descent on the same samples used to be uncommon because of the risk of destructively large policy updates. However, with the help of PPO's Clipped Surrogate Objective, we can take advantage of parallel actors and improve sample efficiency.&lt;/p&gt;

&lt;p&gt;Every once in a while, the green thread fires and runs stochastic gradient descent on our clipped loss function. Another special take? We can run &lt;em&gt;K&lt;/em&gt; epochs of optimization on the same trajectory sample. This was also hard to do pre-PPO because of the risk of taking large steps on local samples, but PPO prevents this while letting us learn more from each trajectory.&lt;/p&gt;
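&lt;p&gt;The outer loop can be sketched as follows; the actor and timestep counts and the commented-out gradient step are placeholders, not the paper's settings:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

N, T = 4, 8            # parallel actors, timesteps collected per actor
K, minibatch = 3, 8    # optimization epochs, minibatch size

# Stand-ins for the N*T transitions the beige threads would collect.
ratios = np.ones(N * T)
advantages = rng.normal(size=N * T)

updates = 0
for epoch in range(K):                        # K epochs over the SAME sample
    idx = rng.permutation(N * T)              # shuffle before slicing batches
    for start in range(0, N * T, minibatch):
        batch = idx[start:start + minibatch]
        # One SGD step on the clipped objective would go here, e.g.
        # theta += lr * grad_clipped_surrogate(ratios[batch], advantages[batch])
        updates += 1
```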

</description>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Orbitron: Reinventing the wheels and its control algorithm</title>
      <dc:creator>Tyler Kim</dc:creator>
      <pubDate>Fri, 02 Apr 2021 04:50:42 +0000</pubDate>
      <link>https://dev.to/tylertaewook/orbitron-reinventing-the-wheels-and-its-control-algorithm-24ba</link>
      <guid>https://dev.to/tylertaewook/orbitron-reinventing-the-wheels-and-its-control-algorithm-24ba</guid>
      <description>&lt;p&gt;Being a heavy Sci-fi fan myself, I always wondered: how would those spherical wheels from &lt;em&gt;Tron&lt;/em&gt; and &lt;em&gt;I-Robot&lt;/em&gt; work in real life?&lt;br&gt;
And this simple thought began the 6-month journey of &lt;strong&gt;Project Orbitron&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now, this project began with two major goals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building a vehicle with spherical wheels that implements a 4-wheel independent steering/driving (4WIS/D) system using Arduino&lt;/li&gt;
&lt;li&gt;Developing an intuitive control algorithm for 4WIS/D vehicle in Mathematica&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article will showcase my vehicle prototype Orbitron along with a short story behind the building scene. Then, I'll introduce you to the highlight: a clever algorithm I built to control Orbitron seamlessly.&lt;/p&gt;

&lt;p&gt;You can also check out the maker portfolio video I made for my college application, or the GitHub repo containing the full code.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/WXjisSnfGTI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i3JOwpme--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-ba8488d21cd8ee1fee097b8410db9deaa41d0ca30b004c0c63de0a479114156f.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/tylertaewook"&gt;
        tylertaewook
      &lt;/a&gt; / &lt;a href="https://github.com/tylertaewook/project-orbitron"&gt;
        project-orbitron
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A unique control algorithm in Mathematica for 4WIS/WID vehicles; patent-pending
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;br&gt;
&lt;p&gt;
  &lt;a href="https://github.com/tylertaewook/project-orbitron"&gt;
     &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--raNlKwEg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/tylertaewook/project-orbitron/main/images/logo.png" alt="Logo" width="100" height="100"&gt;
  &lt;/a&gt;
  &lt;/p&gt;
&lt;h3&gt;
Project Orbitron&lt;/h3&gt;
  &lt;p&gt;
    A 4WIS/4WID Vehicle with Spherical Wheels &amp;amp; A unique control algorithm in Mathematica
    &lt;/p&gt;
&lt;p&gt;
      December 2017 - May 2018
    &lt;br&gt;
    &lt;a href="https://github.com/tylertaewook/project-orbitron"&gt;&lt;strong&gt;Explore the docs »&lt;/strong&gt;&lt;/a&gt;
    &lt;br&gt;
    &lt;br&gt;
    &lt;a href="https://youtu.be/WXjisSnfGTI" rel="nofollow"&gt;View Demo&lt;/a&gt;
    ·
    &lt;a href="https://github.com/tylertaewook/project-orbitron/issues"&gt;Report Bug&lt;/a&gt;
    ·
    &lt;a href="https://github.com/tylertaewook/project-orbitron/issues"&gt;Request Feature&lt;/a&gt;
  &lt;/p&gt;
&lt;h2&gt;
Table of Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://raw.githubusercontent.com/tylertaewook/project-orbitron/main/#about-the-project"&gt;About the Project&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://raw.githubusercontent.com/tylertaewook/project-orbitron/main/#orbitron"&gt;Orbitron&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://raw.githubusercontent.com/tylertaewook/project-orbitron/main/#algorithm"&gt;Algorithm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://raw.githubusercontent.com/tylertaewook/project-orbitron/main/#build-notes"&gt;Build Notes&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://raw.githubusercontent.com/tylertaewook/project-orbitron/main/#spherical-wheels"&gt;Spherical Wheels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://raw.githubusercontent.com/tylertaewook/project-orbitron/main/#main-body"&gt;Main Body&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://raw.githubusercontent.com/tylertaewook/project-orbitron/main/#research"&gt;Research&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://raw.githubusercontent.com/tylertaewook/project-orbitron/main/#contact"&gt;Contact&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://raw.githubusercontent.com/tylertaewook/project-orbitron/main/#acknowledgements"&gt;Acknowledgements&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;================================================================&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://youtu.be/WXjisSnfGTI" rel="nofollow"&gt;[LINK - MAKER PORTFOLIO]&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;================================================================&lt;/p&gt;
&lt;h2&gt;
About The Project&lt;/h2&gt;
&lt;p&gt;Project Orbitron is an independent build/research project featuring ORBITRON, a vehicle with spherical wheels, and a unique control algorithm.
The project began as a simple yet powerful urge to build a vehicle with spherical wheels, inspired by the &lt;a href="https://youtu.be/oSFYwDDVgac" rel="nofollow"&gt;Goodyear 360&lt;/a&gt;, but it soon evolved into a research project under Kent Guild, the academic society at Kent School, once the vehicle's potential became clear. After a project proposal was presented to the Kent Pre-Engineering Department, Project Orbitron was granted $1,000 in funding.&lt;/p&gt;
&lt;p&gt;The finished algorithm is going through a patent process as of October 2020…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/tylertaewook/project-orbitron"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h1&gt;
  
  
  Orbitron
&lt;/h1&gt;

&lt;p&gt;As I mentioned above, ORBITRON is a vehicle with spherical wheels, hence the name 'ORB'itron. Unfortunately, I was &lt;em&gt;a bit&lt;/em&gt; under-qualified to suspend the wheels in mid-air with electromagnets, as many sci-fi movies suggest. Instead, I implemented a 4 Wheel Independent Steering/Driving (4WIS/D) system: a steering system for a four-wheeled vehicle that allows separate speed and direction control for each wheel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qvpr5F3q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/njfaaykq0a4e6md4yr20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qvpr5F3q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/njfaaykq0a4e6md4yr20.png" alt="preview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Structure
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Wheels
&lt;/h3&gt;

&lt;p&gt;After an initial sketch of the wheel's frame, I modeled the same design in Fusion 360. I designed the frame to house two separate motors, one controlling speed and the other direction, so each wheel could be steered and driven independently of the others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MfbMw6EB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/me49fpx81wvg4gxgd82z.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MfbMw6EB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/me49fpx81wvg4gxgd82z.jpg" alt="designwheel"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I used 60mm EVA-foam balls as wheels, since they were light yet sturdy enough to support the vehicle.&lt;br&gt;
The HS-785HB servo with a built-in gearbox on top controls the wheel’s direction by turning the motor's entire rectangular structure. The 170-RPM Econ gear motor, directly connected to the sphere's shaft, drives the wheel and controls its speed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J0AOcyJH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lvok6be2jiqc0jwtcm6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J0AOcyJH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lvok6be2jiqc0jwtcm6x.png" alt="wheel"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Body
&lt;/h3&gt;

&lt;p&gt;Designing the body was relatively easy, as it was simply a rectangular board supporting the wheel frames.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ygMcuP2---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6fwd2aid20n1y96mg7ie.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ygMcuP2---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6fwd2aid20n1y96mg7ie.jpg" alt="designbody"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I built the body out of MDF board at first, but it proved to be too heavy. So I switched to a Foamex board supported by PVC pipes, which was much lighter and stronger.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NlRm2Ize--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t5epdp0b6exqsrg1rw4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NlRm2Ize--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t5epdp0b6exqsrg1rw4c.png" alt="structure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To keep working on this project during summer vacation, when I flew back to Korea, I designed the board to be foldable so overseas shipping would be easier. This way, I only had to detach the wheel frames, fold the board up, and cover it with bubble wrap for shipping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Electronics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Arduino
&lt;/h3&gt;

&lt;p&gt;I won't go into too much detail on the wiring here. Put shortly, the Arduino Mega is connected to an XBee shield for wireless communication, two motor drivers for controlling the driving motors, and four servo motors for steering each wheel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bglzFTUy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x479heffwftaalj2167e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bglzFTUy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x479heffwftaalj2167e.jpg" alt="electronics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Controls
&lt;/h3&gt;

&lt;p&gt;While building, I developed a simple C# WinForms application to ensure each component was functioning properly. The app sends single-character signals through the XBee wireless module, and Orbitron performs preset movements, such as rotating all servos 180 degrees upon receiving the character 'r'.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--izIAroK4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/45s7hjpt5ol8dtpzfk0p.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--izIAroK4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/45s7hjpt5ol8dtpzfk0p.jpeg" alt="test_GUI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Algorithm
&lt;/h1&gt;

&lt;p&gt;The real beauty of this project was the algorithm development. The following summarizes my paper: &lt;strong&gt;&lt;a href="https://tylertaewook.github.io/static/media/orbitron-paper.506633ba.pdf"&gt;"Intuitive Control Algorithm Development of 4WIS/4WID Using a SpaceMouse"&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Orbitron's 4WIS/D system enables more versatile motion for vehicles that need to navigate tight spaces, but controlling two parameters, direction and speed, for each wheel results in eight parameters that must be controlled simultaneously.&lt;/p&gt;

&lt;p&gt;So our goal was simple: &lt;strong&gt;to develop an algorithm that achieves intuitive control by abstracting away this complexity, allowing a full realization of the vehicle's capabilities.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Algorithm Setup
&lt;/h2&gt;

&lt;p&gt;We chose 3Dconnexion's SpaceMouse (SpaceNavigator) as the controller, since it was designed for intuitive navigation in 3D CAD space.&lt;/p&gt;

&lt;p&gt;Connecting the SpaceMouse to Mathematica gave us six numbers, each ranging from -1 to +1 based on the mouse's position; these became the raw input data for our algorithm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5ZX38oLh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0l6u06nnhfkvpgpo4vdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5ZX38oLh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0l6u06nnhfkvpgpo4vdw.png" alt="Algorithm Setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The algorithm’s main job is to translate these six variables from the SpaceMouse into eight variables, each representing either the angle or the speed of a wheel. The algorithm calculates the variable transformations and records sets of timestamped variables in a CSV file. We then use a serial regulator, a C# application I developed, to deliver each set of variables at the proper time without overfeeding data to the prototype.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--enePst1t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w6qcjjywwm7hfowx90ad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--enePst1t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w6qcjjywwm7hfowx90ad.png" alt="alg_GUI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the Mathematica notebook is executed, the interface on the right is continuously updated based on the user’s input. The arrows in this interface are color-coded to make them easier to tell apart, and a label next to each wheel displays its updated angle and speed values.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Green Concentric Circles&lt;/strong&gt;: each represents the turning circle of a wheel or of the vehicle's center. This applies to vehicles with any number of wheels, just with more curvature circles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blue Arrow&lt;/strong&gt;: motion of the vehicle's center&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gray Arrow&lt;/strong&gt;: acts as base for red/pink arrows; always fixed along the vehicle body's angle&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pink Arrow&lt;/strong&gt;: tangent line of the green circle; angle between the gray arrow and pink arrow is used to determine each wheel's angle&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Red Arrow&lt;/strong&gt;: the actual trajectory of the vehicle; as the user moves the mouse, the arrows' lengths always represent the relative velocity of each wheel.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementing Different Steering Modes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GNhy-v8W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7str7w6qungjltntfld5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GNhy-v8W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7str7w6qungjltntfld5.png" alt="steeringmodes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While a conventional steering system only involves Ackermann steering, 4WIS/4WID allows three different steering modes: AFRS, Crab Steering, and Spinning. Our algorithm supports all three modes and lets us control all four wheels simultaneously without any danger of movement conflicts. We do so by computing both the wheel’s direction and its speed based on the wheel’s angular velocity during a turn, which prevents conflicting signals that could cause the vehicle to slip.&lt;/p&gt;

&lt;h3&gt;
  
  
  Crab Steering
&lt;/h3&gt;

&lt;p&gt;Crab steering is a special type of active four-wheel steering that operates by steering all wheels in the same direction and at the same angle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tiyG4QBl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jkwo40w9ubculwinlogu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tiyG4QBl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jkwo40w9ubculwinlogu.png" alt="Crab Steering"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The algorithm uses the crab steering mode whenever the user slides the SpaceMouse along the plane. In this specific example, when the mouse is pushed toward the upper-right corner, all four wheels are angled in the direction the mouse points. All four wheels share the same speed, computed as the tangential velocity of a circle with a very large radius.&lt;/p&gt;
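&lt;p&gt;A rough Python sketch of this mapping (the original algorithm was written in Mathematica, so the function below is an illustration under my own assumptions, not the published code):&lt;/p&gt;

```python
import math

def crab_steering(x, y):
    """Map the SpaceMouse planar deflection (x, y), each in [-1, +1],
    to one shared (angle, speed) pair applied to all four wheels."""
    angle = math.degrees(math.atan2(y, x))  # every wheel points the same way
    speed = min(math.hypot(x, y), 1.0)      # deflection magnitude as speed
    return [(angle, speed)] * 4             # identical settings per wheel

# Sliding the mouse toward the upper-right angles all wheels at 45 degrees.
wheels = crab_steering(0.5, 0.5)
```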

&lt;h3&gt;Active Front and Rear Steering&lt;/h3&gt;

&lt;p&gt;The AFRS mechanism involves the front and rear wheels turning independently for a smaller turning radius and better cornering stability. In this steering mode, the rear wheels change how the vehicle turns based on driving parameters. Each wheel’s velocity is computed as the tangential velocity of a circle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9nEF9h1B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hs1j8508h9y4idafog6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9nEF9h1B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hs1j8508h9y4idafog6g.png" alt="AFRS1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The twisting motion of the SpaceMouse is responsible for changing the radius of the vehicle’s curvature. The core of the algorithm is that it considers every motion as a circular motion and computes each wheel’s velocity and angle tangent to that circle. The radius of curvature (𝒓) is calculated by the equation shown above, and the angle value (θ) is controlled with the twisting motion of the mouse.&lt;/p&gt;
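&lt;p&gt;The per-wheel computation can be sketched as follows (a simplified model, not the published algorithm: the wheel positions, dimensions, and the placement of the turning center on the lateral axis are my assumptions). Each wheel is steered tangent to the circle around the instantaneous center of rotation, and its speed is proportional to its distance from that center:&lt;/p&gt;

```python
import math

# Wheel positions relative to the vehicle center (x forward, y left),
# for an assumed 1.0 m x 0.8 m wheelbase/track. These numbers are
# illustrative, not the prototype's actual dimensions.
WHEELS = {
    "front_left":  (0.5,  0.4),
    "front_right": (0.5, -0.4),
    "rear_left":   (-0.5,  0.4),
    "rear_right":  (-0.5, -0.4),
}

def afrs_commands(radius, omega):
    """Steer each wheel tangent to the circle of the given radius.

    radius: distance from vehicle center to the turning center,
            on the vehicle's lateral (y) axis; positive = left turn.
    omega:  vehicle angular velocity (rad/s) about that center.
    Returns {wheel: (steer_angle_rad, speed)}.
    """
    out = {}
    for name, (x, y) in WHEELS.items():
        # Distance from this wheel to the turning center at (0, radius).
        dist = math.hypot(x, y - radius)
        # Wheel velocity is perpendicular to the line to the center,
        # i.e. tangent to the circle through the wheel.
        out[name] = (math.atan2(x, radius - y), abs(omega) * dist)
    return out

cmd = afrs_commands(radius=5.0, omega=0.5)
# Wheels on the outside of the turn move faster than inner wheels,
# and front/rear steering angles mirror each other.
```

&lt;p&gt;This single rule is what keeps the four wheels cooperative: because every angle and speed is derived from one shared circle, no wheel ever fights another.&lt;/p&gt;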

&lt;p&gt;When the mouse is twisted clockwise, θ increases, resulting in a smaller curvature radius; when the mouse is twisted counterclockwise, θ decreases, resulting in a larger curvature radius. A negative θ places the turning circle on the opposite side of the vehicle.&lt;/p&gt;

&lt;p&gt;This principle also applies to straight motion in scenarios such as the Crab Steering mode. Continuously twisting the mouse counterclockwise makes θ very small, so the turning radius becomes almost infinite. At this point, the vehicle’s motion is treated as straight motion.&lt;/p&gt;
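&lt;p&gt;One way to realize this mapping in code (a sketch only; the actual relation between θ and the radius is the equation from the figure, and the half-wheelbase value and straight-motion threshold here are assumptions) is to treat θ as a steering angle, derive the radius from it, and fall back to straight motion when θ is near zero:&lt;/p&gt;

```python
import math

HALF_WHEELBASE = 0.5       # assumed half wheelbase (m)
STRAIGHT_THRESHOLD = 1e-3  # |theta| below this is treated as straight

def radius_from_theta(theta):
    """Map the twist-controlled angle theta to a turning radius.

    Small |theta| -> very large radius (near-straight motion);
    negative theta puts the circle on the other side of the vehicle.
    Returns None to signal straight motion.
    """
    if abs(theta) < STRAIGHT_THRESHOLD:
        return None  # effectively infinite radius: drive straight
    return HALF_WHEELBASE / math.tan(theta)

print(radius_from_theta(0.0))   # None (straight motion)
print(radius_from_theta(0.1))   # ~4.98 m
print(radius_from_theta(-0.1))  # ~-4.98 m (circle on the other side)
```

&lt;p&gt;Returning an explicit straight-motion signal instead of a huge finite number avoids numerical blow-ups in the downstream tangential-velocity computation.&lt;/p&gt;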

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TZHAJau0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ce4u44801qcaz1owp0it.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TZHAJau0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ce4u44801qcaz1owp0it.png" alt="AFRS2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In these examples, straight motion and curving motion are accomplished simultaneously. When the mouse is twisted and shifted at the same time, the vehicle can perform more complicated motions, such as moving forward with a gradually decreasing turning radius.&lt;/p&gt;

&lt;h3&gt;Spinning&lt;/h3&gt;

&lt;p&gt;Also known as Zero Turn mode, Spinning is the motion of a vehicle rotating in place with zero turning radius. It is accomplished by turning each wheel perpendicular to the diagonal line connecting it to the vehicle’s center and driving all wheels in the same rotational direction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WNvOAtis--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/52gqguh4x7hi1du7vf79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WNvOAtis--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/52gqguh4x7hi1du7vf79.png" alt="Spinning"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When one of the two buttons on the side of the SpaceMouse is pressed, the vehicle rotates in the corresponding direction. As shown in the screenshots above, the vehicle turns clockwise when the right button is pressed and counterclockwise when the left button is pressed.&lt;/p&gt;
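&lt;p&gt;Zero Turn mode drops out of the same circle-tangent picture when the turning center coincides with the vehicle’s center. A sketch (the wheel positions, dimensions, and sign conventions are assumptions mirroring the geometry described above):&lt;/p&gt;

```python
import math

# Wheel positions relative to the vehicle center (x forward, y left);
# dimensions are illustrative, not the prototype's.
WHEELS = {
    "front_left":  (0.5,  0.4),
    "front_right": (0.5, -0.4),
    "rear_left":   (-0.5,  0.4),
    "rear_right":  (-0.5, -0.4),
}

def spin_commands(omega):
    """Rotate the vehicle in place about its center.

    Each wheel is steered perpendicular to its diagonal to the center
    and driven at speed omega * distance, so all four wheels sweep the
    same rotation without slipping. Positive omega = counterclockwise
    (left button); negative omega = clockwise (right button).
    """
    out = {}
    for name, (x, y) in WHEELS.items():
        # Velocity direction for omega > 0: tangent to the circle
        # through the wheel, i.e. perpendicular to the diagonal.
        angle = math.atan2(x, -y)
        out[name] = (angle, abs(omega) * math.hypot(x, y))
    return out

cmd = spin_commands(1.0)
# All wheels get the same speed; each angle is perpendicular
# to that wheel's diagonal to the center.
```

&lt;p&gt;Note that this is just the AFRS tangent rule evaluated with the turning center at the vehicle’s origin, so the same no-conflict guarantee carries over.&lt;/p&gt;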

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Over six months, I built a prototype vehicle and experimentally confirmed that our algorithm successfully processes the driver’s intention conveyed through the SpaceMouse and cooperatively controls all four wheels, without conflicts between them, to accomplish the intended motion.&lt;/p&gt;

&lt;p&gt;I had worked on several for-fun Arduino projects before, but Project Orbitron was by far the largest and most complex one I have ever done. Building the prototype alone took the entire summer vacation and over $1,000 in budget. I spent another three months developing the algorithm while teaching myself Mathematica and constantly tweaking Orbitron's settings.&lt;/p&gt;

&lt;p&gt;Upon finishing the project, I participated in a &lt;a href="https://kentnews.org/2382/features/an-outstanding-performance-at-the-71st-annual-connecticut-science-and-engineering-fair/"&gt;local science fair&lt;/a&gt; and presented my work to the &lt;a href="https://kentnews.org/2525/features/a-guild-presentation-by-tyler-kim-20/"&gt;Kent Guild&lt;/a&gt;, an academic society at my school.&lt;/p&gt;

&lt;p&gt;Project Orbitron has become the core experience/project of my journey and ended up being the main topic for my college essay. The prototype is currently displayed on the first floor of Kent School's Pre-Engineering Center, and my algorithm is going through a patent process in Korea. &lt;em&gt;(Application Number: KR 10-2019-0087022)&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
