<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hirotaka Ishihara</title>
    <description>The latest articles on DEV Community by Hirotaka Ishihara (@jerryishihara).</description>
    <link>https://dev.to/jerryishihara</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F467091%2F2585a3f3-afeb-4296-a451-f59d92f54bf6.jpeg</url>
      <title>DEV Community: Hirotaka Ishihara</title>
      <link>https://dev.to/jerryishihara</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jerryishihara"/>
    <language>en</language>
    <item>
      <title>Kaggle: Lyft Motion Prediction for Autonomous Vehicles</title>
      <dc:creator>Hirotaka Ishihara</dc:creator>
      <pubDate>Mon, 30 Nov 2020 02:16:26 +0000</pubDate>
      <link>https://dev.to/jerryishihara/kaggle-lyft-motion-prediction-for-autonomous-vehicles-47dp</link>
      <guid>https://dev.to/jerryishihara/kaggle-lyft-motion-prediction-for-autonomous-vehicles-47dp</guid>
<description>&lt;p&gt;In this Kaggle competition, I built motion prediction models for self-driving vehicles to predict how cars, cyclists, and pedestrians move in an autonomous vehicle’s (AV’s) environment, supported by the largest &lt;a href="https://self-driving.lyft.com/level5/prediction/" rel="noopener noreferrer"&gt;Prediction Dataset&lt;/a&gt; [1] ever released for training and testing such models.&lt;/p&gt;

&lt;h1&gt;
  
  
  Competition Description
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0d7nvihnhqdanph97hoi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0d7nvihnhqdanph97hoi.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In summary, the goal of this competition is to predict the motion of other cars, cyclists, and pedestrians (called “agents”) over the next 5 seconds, using past frames from the AV’s point of view. A rasterizer generates a bird’s-eye-view (BEV) top-down raster that encodes all agents and the map, and the network infers the future coordinates of the agent from this raster.&lt;/p&gt;




&lt;h1&gt;
  
  
  Lyft Level 5 Prediction Dataset
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw3p4wgjmmy5bzk3mr5kx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw3p4wgjmmy5bzk3mr5kx.png" alt="Image from the paper: One Thousand and One Hours: Self-driving Motion Prediction Dataset&amp;lt;br&amp;gt;
"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dataset was collected along a fixed route in Palo Alto, California. It consists of 170,000 scenes capturing the environment around the autonomous vehicle. Each scene encodes the state of the vehicle’s surroundings at a given point in time.&lt;/p&gt;

&lt;p&gt;The dataset consists of frames and agent states. A frame is a snapshot in time, consisting of ego pose, time, and multiple agent states. Each agent state describes the position, orientation, bounds, and type.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4vryz1qloc9oqfxmx52d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4vryz1qloc9oqfxmx52d.png" alt="data format"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A detailed exploratory data analysis is available in this &lt;a href="https://hirotaka-ishihara.netlify.app/project/lyft/lyft-first-data-exploration.html" rel="noopener noreferrer"&gt;Jupyter Notebook&lt;/a&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Evaluation &amp;amp; Score
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;This is a brief summary of the evaluation, please refer to the &lt;a href="https://github.com/lyft/l5kit/blob/master/competition.md" rel="noopener noreferrer"&gt;metrics page in the L5Kit repository&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After the positions of a trajectory are predicted, a negative log-likelihood of the ground truth data given these multi-modal predictions is calculated. Assume the ground truth positions of a sample trajectory are&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Feknant15udr1hnc3reas.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Feknant15udr1hnc3reas.gif" alt="ground truth"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and the predicted K hypotheses, represented by means&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk65qda34niomc9cj7oe8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk65qda34niomc9cj7oe8.gif" alt="hypothesis"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For each hypothesis, the model also generates a confidence value c (for a single-mode model, c is simply 1). The ground truth positions are assumed to be modeled by a mixture of multi-dimensional independent Normal distributions over time, and the goal is to maximize the following likelihood:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3bgvddy2tev0rr1y51o7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3bgvddy2tev0rr1y51o7.gif" alt="likelihood"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To obtain the loss, we simply take the negative log of the likelihood:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzbjcjxne2smuppg7jgk1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzbjcjxne2smuppg7jgk1.gif" alt="loss"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8t89vruy09xeskxodiaw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8t89vruy09xeskxodiaw.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For numerical stability (preventing underflow caused by extremely small values), the &lt;a href="https://en.wikipedia.org/wiki/LogSumExp#log-sum-exp_trick_for_log-domain_calculations" rel="noopener noreferrer"&gt;log-sum-exp trick&lt;/a&gt; is applied to the equation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp8wj2e9m4o1uwgnb7mwz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp8wj2e9m4o1uwgnb7mwz.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A huge thanks to the competition host for providing the &lt;a href="https://github.com/lyft/l5kit/blob/20ab033c01610d711c3d36e1963ecec86e8b85b6/l5kit/l5kit/evaluation/metrics.py" rel="noopener noreferrer"&gt;implementation&lt;/a&gt; of this loss function.&lt;/p&gt;




&lt;h1&gt;
  
  
  Image Raster &amp;amp; Pixel Size Selection
&lt;/h1&gt;

&lt;p&gt;In general, it is hardly feasible to implement and train state-of-the-art motion prediction models in just 2 months. Instead, working with the input data and applying feature engineering wisely is the key to winning the competition.&lt;/p&gt;

&lt;p&gt;In this competition, the key factors for gaining higher accuracy are the image raster size and the pixel size.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;raster size: the image size in pixels&lt;/li&gt;
&lt;li&gt;pixel size: spatial resolution (meters/pixel)&lt;/li&gt;
&lt;/ul&gt;
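&lt;p&gt;In l5kit these two parameters live in the rasterizer configuration; a minimal sketch (the values are illustrative, and the key names follow the l5kit config format):&lt;/p&gt;

```python
# Illustrative l5kit-style raster configuration
raster_params = {
    "raster_size": [350, 350],  # image size in pixels
    "pixel_size": [0.4, 0.4],   # spatial resolution in meters per pixel
    "ego_center": [0.25, 0.5],  # relative position of the agent in the raster
}

# The field of view covered by the raster, in meters per side:
fov_m = [s * p for s, p in zip(raster_params["raster_size"],
                               raster_params["pixel_size"])]
# a 350 px raster at 0.4 m/px covers roughly 140 m per side
```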

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frhke93c9fxsidt8njcn5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frhke93c9fxsidt8njcn5.png" alt="raster size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a fixed pixel size, a larger raster size means more surrounding information. At the same time, it also means longer training (more computation).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fixys1650o4d7iwn7qspu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fixys1650o4d7iwn7qspu.png" alt="pixel size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a fixed raster size, a smaller pixel size means higher resolution.&lt;/p&gt;




&lt;h1&gt;
  
  
  Models
&lt;/h1&gt;

&lt;p&gt;I tried several baseline models. Each model was trained for roughly 3 days on a Tesla V100 GPU. The following is a summary of their performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffchvkz36l7zrmsiysv1f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffchvkz36l7zrmsiysv1f.png" alt="models"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It turns out that more layers do not necessarily mean better scores. The trade-off between model/input size and training speed further restricted my model choices, so I decided to explore ResNet18 and ResNet34 in more depth.&lt;/p&gt;




&lt;h1&gt;
  
  
  Ensemble
&lt;/h1&gt;

&lt;p&gt;In the end, I finished the competition with a score of 19.02 on the private leaderboard and 18.938 on the public leaderboard, ranking 94th out of 937.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The private leaderboard is calculated with approximately 50% of the test data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ensembling improved my score from 19.823 (the best single-model score) to 18.938, using the following models; each model’s ensemble weight is derived from its score:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[(sum of scores - individual score) / (sum of scores * 3)]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;ResNet34 / Raster Size 512 / Pixel Size 0.2&lt;/li&gt;
&lt;li&gt;ResNet34 / Raster Size 350 / Pixel Size 0.4&lt;/li&gt;
&lt;li&gt;ResNet18 / Raster Size 512 / Pixel Size 0.2&lt;/li&gt;
&lt;li&gt;ResNet18 / Raster Size 448 / Pixel Size 0.3&lt;/li&gt;
&lt;/ul&gt;
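&lt;p&gt;As a sketch, the weight formula above (where 3 is the number of models minus one) assigns a larger weight to a lower, i.e. better, score, and the weights sum to 1. The scores below are placeholders, not my exact model scores:&lt;/p&gt;

```python
def ensemble_weights(scores):
    # (sum of scores - individual score) / (sum of scores * (n - 1));
    # lower (better) scores receive larger weights
    total = sum(scores)
    n = len(scores)
    return [(total - s) / (total * (n - 1)) for s in scores]

weights = ensemble_weights([19.8, 20.1, 20.5, 21.0])  # illustrative scores
```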




&lt;h1&gt;
  
  
  Summary of Some Great Insights from the Top Rankers
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Ensemble with Gaussian Mixture Model (GMM).&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;At the final ensemble stage, GMM with 3 components was used to fit the multiple trajectory positions generated by the trained models.&lt;/p&gt;
&lt;/blockquote&gt;
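&lt;p&gt;A minimal sketch of this idea with scikit-learn, assuming the predictions pooled from several models are stacked into one array:&lt;/p&gt;

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_ensemble(trajectories, n_modes=3):
    """trajectories: (N, T, 2) predictions pooled from several models."""
    n, t, d = trajectories.shape
    gmm = GaussianMixture(n_components=n_modes, random_state=0)
    gmm.fit(trajectories.reshape(n, t * d))
    # the component means become the ensembled trajectories,
    # and the mixture weights become their confidences
    return gmm.means_.reshape(n_modes, t, d), gmm.weights_
```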

&lt;ul&gt;
&lt;li&gt;Rasterizing based on the agent’s speed.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;A slow agent speed means the prediction only needs a small raster size (since the vehicle probably won’t travel far in the next 5 seconds), so the “slow model” instead increases the number of history frames to improve accuracy. Conversely, the “fast model” increases the raster size and reduces the number of frames.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Extracting metadata from the agent dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;There is some useful information in the agent dataset, such as centroid, rotation, velocity, etc. A second head with two fully connected layers could be used to encode this information, and then concatenate the output vector to the output of the ResNet pooling layer.&lt;/p&gt;
&lt;/blockquote&gt;
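&lt;p&gt;A hypothetical PyTorch sketch of such a second head (the layer sizes and metadata dimension are my assumptions, not taken from the write-ups):&lt;/p&gt;

```python
import torch
from torch import nn

class MetaHead(nn.Module):
    """Encodes agent metadata (centroid, rotation, velocity, ...) with two
    fully connected layers and concatenates the result with the pooled
    ResNet features before the final regression layer."""

    def __init__(self, meta_dim, feat_dim, out_dim):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(meta_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim + 64, out_dim)

    def forward(self, resnet_features, meta):
        z = self.encode(meta)  # (B, 64) metadata embedding
        return self.head(torch.cat([resnet_features, z], dim=1))
```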




&lt;h1&gt;
  
  
  My Experiments that Didn’t Work
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Lane encoder&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;I separately trained a Conv2d-Autoencoder on the semantic lane channels and then fine-tuned with the ResNet model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GRU over time-stacked models&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adding more layers on top of the ResNet head&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Graph Convolutional Network on the lane nodes&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;I treated each 4x4 grid cell in the lane map as a single node and built the adjacency matrix from the pixel values (a 4x4 cell has a maximum of 8 neighbors; a neighbor is excluded if its value is below a threshold, meaning there is no lane there).&lt;/p&gt;
&lt;/blockquote&gt;
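&lt;p&gt;A rough sketch of that adjacency construction (the cell size and threshold are illustrative):&lt;/p&gt;

```python
import numpy as np

def lane_adjacency(lane_map, cell=4, thr=0.1):
    """Pool the lane map into cell-by-cell nodes and connect each active
    node to its (up to 8) active neighbors; cells whose mean lane value
    falls below the threshold count as having no lane."""
    h, w = lane_map.shape
    gh, gw = h // cell, w // cell
    grid = lane_map[: gh * cell, : gw * cell]
    grid = grid.reshape(gh, cell, gw, cell).mean(axis=(1, 3))
    active = np.greater(grid, thr)
    adj = np.zeros((gh * gw, gh * gw))
    for i in range(gh):
        for j in range(gw):
            if not active[i, j]:
                continue
            # visit the (up to 8) neighboring cells without leaving the grid
            for ni in range(max(i - 1, 0), min(i + 2, gh)):
                for nj in range(max(j - 1, 0), min(j + 2, gw)):
                    if (ni, nj) != (i, j) and active[ni, nj]:
                        adj[i * gw + j, ni * gw + nj] = 1.0
    return adj
```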




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This was my first Kaggle competition, and it’s really encouraging to see my efforts over the past three months earn a bronze medal. I also saw many novel uses of traditional machine learning techniques that delivered outstanding performance. My Kaggle journey has just begun.&lt;/p&gt;




&lt;h1&gt;
  
  
  Reference
&lt;/h1&gt;

&lt;p&gt;[1] J. Houston, G. Zuidhof, L. Bergamini, Y. Ye, A. Jain, S. Omari, V. Iglovikov, and P. Ondruska, “One Thousand and One Hours: Self-driving Motion Prediction Dataset,” &lt;a href="https://arxiv.org/abs/2006.14480v2" rel="noopener noreferrer"&gt;arXiv:2006.14480v2&lt;/a&gt;, 2020.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
