
(Unofficial) 1st Place Lyft Perception Challenge Writeup


This is my submission writeup for the image segmentation challenge organized by Lyft and Udacity in May 2018.

Competition

The competition provided dashcam images and per-pixel labels generated by the CARLA simulator.
Goal: achieve the highest accuracy on vehicle and road detection while maintaining an FPS greater than 10 on an Nvidia K80 GPU.

Results

I achieved the highest "unofficial" score (95.97) in the competition by exploiting a bug in the competition's test set. This score was 2.75 points higher than the leader's 93.22.

However, after reporting this exploit, I was asked to remove my results. Without the exploit, this model placed 16th.

A deeper explanation of the bug is in the Exploits section below.

Summary

Library: fast.ai + PyTorch
Data: 600x800 dashcam images with pixel-by-pixel labels
Data preprocessing: trimmed the sky and the car hood (to 384x800px); processed target labels into 3 categories: Vehicle, Road, Everything Else
Data augmentation: random resized crop, random horizontal flip, random color jitter
Architecture: U-Net with a ResNet34 backbone, inspired by fast.ai's Carvana implementation
Loss function: custom F-beta loss function
Training: progressive resizing, cyclical learning rates

Many of the techniques used in this project were taken from fast.ai's Stanford DAWNBench submission.


Data

I generated an additional 10,000 segmentation images by running the CARLA simulator on a Windows machine.

I ran CARLA's autopilot script with a semantic segmentation camera added:

```python
from carla.sensor import Camera  # CARLA 0.8 PythonClient API

camera1 = Camera('CameraSeg', PostProcessing='SemanticSegmentation')
camera1.set_image_size(800, 600)      # matches the competition's 800x600 frames
camera1.set_position(1.30, 0, 1.30)   # roughly dashcam position relative to the car
settings.add_sensor(camera1)          # settings: the autopilot script's CarlaSettings
```

Output: (example segmentation frame)

Data Preprocessing

Trimmed the sky and the car hood from the images (down to 384x800px). Image dimensions needed to be divisible by 32 due to the 5 downsampling stages in ResNet34. Preprocessed target labels into 3 categories: Vehicle, Road, Everything Else.
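As a rough illustration, the cropping and label mapping might look like the sketch below. The crop offset and the CARLA class IDs (6 = lane markings, 7 = road, 10 = vehicles) are assumptions for illustration, not values taken from this repo.

```python
import numpy as np

CROP_TOP = 190        # rows of sky trimmed off the top (illustrative value)
CROP_HEIGHT = 384     # 384 is divisible by 32, as the encoder requires
LANE_ID, ROAD_ID, VEHICLE_ID = 6, 7, 10          # assumed CARLA class IDs
EVERYTHING_ELSE, ROAD, VEHICLE = 0, 1, 2         # the 3 training categories

def preprocess(img, seg):
    """img: (600, 800, 3) RGB frame; seg: (600, 800) map of CARLA class IDs."""
    img = img[CROP_TOP:CROP_TOP + CROP_HEIGHT]   # trim sky/hood -> 384x800
    seg = seg[CROP_TOP:CROP_TOP + CROP_HEIGHT]
    target = np.full(seg.shape, EVERYTHING_ELSE, dtype=np.uint8)
    target[np.isin(seg, (ROAD_ID, LANE_ID))] = ROAD   # treat lane markings as road
    target[seg == VEHICLE_ID] = VEHICLE
    return img, target
```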

Data augmentation

Random resized crop: trained with both square crops (384x384) and rectangular crops (384x800)
Random horizontal flip
Random color jitter (brightness, contrast, saturation): 0.2, 0.2, 0.2
Normalization: ImageNet stats
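The project used fast.ai's transform pipeline; the sketch below shows the same set of augmentations written with recent torchvision, applying identical spatial transforms to the image and the mask. Crop parameters other than the sizes listed above are illustrative.

```python
import random
import torchvision.transforms.functional as TF
from torchvision import transforms

IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

def augment(img, mask, size=(384, 384)):
    """img, mask: PIL images of the frame and its label map."""
    # Random resized crop: sample one set of parameters and apply it to both.
    i, j, h, w = transforms.RandomResizedCrop.get_params(
        img, scale=(0.5, 1.0), ratio=(0.75, 1.33))
    img = TF.resized_crop(img, i, j, h, w, size)
    mask = TF.resized_crop(mask, i, j, h, w, size,
                           interpolation=transforms.InterpolationMode.NEAREST)
    # Random horizontal flip, applied jointly.
    if random.random() < 0.5:
        img, mask = TF.hflip(img), TF.hflip(mask)
    # Color jitter (brightness/contrast/saturation = 0.2) on the image only.
    img = transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2)(img)
    # To tensor + ImageNet normalization.
    img = TF.normalize(TF.to_tensor(img), IMAGENET_MEAN, IMAGENET_STD)
    mask = TF.pil_to_tensor(mask).squeeze(0).long()
    return img, mask
```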

Architecture

Model
U-Net with a ResNet34 backbone, inspired by fast.ai's Carvana implementation.

I tried several different architectures in this notebook and chose the one best suited for the speed/accuracy tradeoff.

Original U-Net: 3x slower to train
U-Net with a VGG11 backbone: 1.5x slower and not as accurate. This paper also suggests ResNets work better as a backbone
U-Net with a ResNet50 backbone: 2x slower
U-Net + LSTM with a ResNet34 backbone:

  • Ran the encoder output through an LSTM before sending it to the decoder, using the RNN to encode temporal video data. Inspired by the STFCN paper.
  • Implementation notebook
  • Ran out of patience, as accuracy did not fare much better than my submitted model

I wanted the fastest architecture so I could iterate as quickly as possible. I probably could have achieved higher accuracy towards the end by switching to a deeper network.
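For reference, here is a minimal sketch of the general idea: a ResNet34 encoder with a U-Net style decoder that upsamples and concatenates the encoder's skip connections. This is not the fastai implementation used in the project, just an illustration of the shape of the model.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34, ResNet34_Weights

class UpBlock(nn.Module):
    """Upsample decoder features and fuse them with the matching encoder skip."""
    def __init__(self, up_in, skip_in, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(up_in, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_in, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        return self.conv(torch.cat([x, skip], dim=1))

class ResNet34UNet(nn.Module):
    """U-Net decoder on top of an ImageNet-pretrained ResNet34 encoder."""
    def __init__(self, n_classes=3):
        super().__init__()
        m = resnet34(weights=ResNet34_Weights.IMAGENET1K_V1)
        self.stem = nn.Sequential(m.conv1, m.bn1, m.relu)   # 1/2 scale, 64 ch
        self.pool = m.maxpool                                # 1/4 scale
        self.enc1, self.enc2 = m.layer1, m.layer2            # 64 ch, 128 ch
        self.enc3, self.enc4 = m.layer3, m.layer4            # 256 ch, 512 ch
        self.up1 = UpBlock(512, 256, 256)
        self.up2 = UpBlock(256, 128, 128)
        self.up3 = UpBlock(128, 64, 64)
        self.up4 = UpBlock(64, 64, 64)
        self.head = nn.Sequential(                           # back to full resolution
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),
            nn.Conv2d(32, n_classes, kernel_size=1),
        )

    def forward(self, x):                 # x: (N, 3, 384, 800)
        s0 = self.stem(x)                 # 1/2
        e1 = self.enc1(self.pool(s0))     # 1/4
        e2 = self.enc2(e1)                # 1/8
        e3 = self.enc3(e2)                # 1/16
        e4 = self.enc4(e3)                # 1/32 -- why dims must divide by 32
        d = self.up1(e4, e3)
        d = self.up2(d, e2)
        d = self.up3(d, e1)
        d = self.up4(d, s0)
        return self.head(d)               # (N, n_classes, 384, 800) logits
```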

Loss Function

Used custom loss function here

The competition score was a weighted F score:

  1. Car F(beta=2)
  2. Road F(beta=0.5)
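For reference, the F-beta score combines precision (P) and recall (R) as F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), so beta > 1 weights recall more heavily and beta < 1 weights precision more heavily.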

Losses tried: Weighted Cross-Entropy, Weighted Dice Loss, Custom F-beta Loss

I went through most of the competition using weighted dice loss before realizing I could just use the F-beta score directly.

This in turn helped me realize that recall was the more important metric for detecting cars, and the opposite for detecting roads: a car beta of 2 means recall is weighted about 2x more than precision, while a road beta of 0.5 weights precision more heavily.

It turns out that simple weighted cross-entropy (car weight of 5, road weight of 2) performed just as well as the custom F-beta loss: 91.2 vs. 91.3.

Sigmoid vs. softmax: sigmoid worked better for me, but I did not do enough testing to verify this.
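A minimal sketch of such a per-channel soft F-beta loss follows. The channel layout (vehicle and road as separate sigmoid channels) is an assumption for illustration, not copied from the repo's loss function.

```python
import torch

def fbeta_loss(logits, targets, beta=2.0, eps=1e-7):
    """Soft (differentiable) F-beta loss for one binary channel.
    logits, targets: tensors of shape (N, H, W); targets are 0/1 masks."""
    probs = torch.sigmoid(logits)
    tp = (probs * targets).sum()
    fp = (probs * (1 - targets)).sum()
    fn = ((1 - probs) * targets).sum()
    b2 = beta ** 2
    fbeta = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp + eps)
    return 1 - fbeta                      # minimizing the loss maximizes F-beta

def competition_loss(logits, targets):
    """Mirror of the competition metric: F2 on the vehicle channel,
    F0.5 on the road channel (channel order assumed: 0 = vehicle, 1 = road)."""
    car = fbeta_loss(logits[:, 0], targets[:, 0], beta=2.0)
    road = fbeta_loss(logits[:, 1], targets[:, 1], beta=0.5)
    return (car + road) / 2
```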

FP16

I could not get half-precision training to work in PyTorch. However, I did use it for evaluation, which allowed a larger batch size and slightly faster inference. Half precision specifically fails on the segmentation loss function. The mean-reduced cross entropy is the sum of the per-pixel losses divided by the number of target elements, and with segmentation that element count (batch x width x height) is already larger than the maximum finite fp16 value (65504).
As a result, when calling backward, you'll get: RuntimeError: value cannot be converted to type Half without overflow
This is not a problem for classification, where y is only (batch x classes): both the sum of losses and the number of elements stay small, since you are usually computing cross entropy over fewer than 10 classes.
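A quick way to see the scale of the problem (the batch size of 8 is an arbitrary example):

```python
import torch

print(torch.finfo(torch.float16).max)   # 65504.0 -- the largest finite fp16 value
n_elements = 8 * 384 * 800              # target elements in one segmentation batch
print(n_elements)                       # 2457600 -- the mean-reduction divisor,
                                        # far beyond what fp16 can represent
```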


Exploits

Discovering the location of the answers
Udacity provided contestants with a GPU server and 50 hours of compute time. Results were submitted by uploading your model to the server and running a script.
While trying to understand how the submission process worked, I discovered that the answers were downloaded to a temporary directory on the server and evaluated against your model. The trick was just to figure out where that temporary directory was located, and copy it before the obfuscated script deleted it.

With this knowledge, I was able to achieve a perfect score of 10 with an FPS of 1000 by just submitting the answers I found.
Udacity knew about this loophole and left it open. To prevent bogus submissions, they would evaluate the final results on another private dataset instead.

However, due to the next bug, the final results were not actually evaluated on a private test set...

Discovering that these answers are actually off
While submitting my results, I noticed something strange: I would achieve a weighted F score of 9.6 on my validation set, but only 9.2 on the public leaderboard.
Because of this, I started looking at the example video they provided and overlaid the segmentation labels on top. It turns out that halfway through the video, the labels are off by one frame.

Overlay comparison: Frame 0 vs. Frame 30.

This means the vehicles in the answers they provided were not where they were supposed to be.

I did the same overlaying technique on the private test set I found and realized it had the same problem.
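Overlaying here just means alpha-blending a color-coded label map onto each video frame. A small self-contained sketch (the palette and class indices are illustrative):

```python
import numpy as np

# 0 = everything else, 1 = road (green), 2 = vehicle (red) -- illustrative palette
PALETTE = np.array([[0, 0, 0], [0, 255, 0], [255, 0, 0]], dtype=np.uint8)

def overlay(frame, labels, alpha=0.5):
    """frame: (H, W, 3) uint8 RGB; labels: (H, W) ints in {0, 1, 2}.
    Returns the frame with the label colors blended on top."""
    colored = PALETTE[labels]
    return (alpha * colored + (1 - alpha) * frame).astype(np.uint8)
```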

To confirm that the Udacity private test set had this bug, I submitted one of my old models with a few of the frames corrected so they were no longer off by one:

```python
# Frame indices where the provided labels are off by one frame
mismatched_idxs = list(range(15, 44)) + list(range(200, 750))
```

The score jumped from 9.2 to 9.6, much more in line with my own evaluation metrics. This meant that, if the bug were corrected, scores across the leaderboard could potentially jump by about 4 points (on the 100-point scale).

Reporting exploits to Udacity
I figured that in the spirit of fairness (after all, this was a Lyft competition, not Uber), I should report this to Udacity. It turns out they knew about exploit #1 (discovering the answers). However, exploit #2 (the wrong answers) turned out to be a really obscure encoding/decoding error in a third-party library they were using. With only a few days left in the competition, it was too late to correct the error in the original test set. Udacity thanked me and kindly asked me not to submit my 9.6 score, which is totally understandable: the exploit wasn't exactly fair to begin with.

Which means my ranking dropped from 1st place on the leaderboard (with the exploit) to 16th (without it).

To my knowledge and theirs, none of the other contestants knew about these bugs, or at least none used them to exploit the leaderboard.
Because of this disclosure, the final results were no longer evaluated on the private test set. Fixing the mistake there could have changed the leaderboard dramatically, since some models could have been less affected by the misplaced cars than others...

The final first place winner may have unintentionally used this as an advantage
AsadZia won the whole thing using a very clever technique: dilation, which expands the detected area around each car. This allowed him to jump from 11th to 1st.
It's interesting to wonder whether part of dilation's effectiveness was due to the video encoding bug. No doubt he had one of the best models regardless, but since vehicles were not always where they were supposed to be, increasing the detected area makes a lot of sense in this case.

Final thoughts

It was a really fun competition hosted by Lyft and Udacity. Having to balance accuracy against speed was a great way to test engineering skills, not just research knowledge. Joining this competition forced me to understand every part of the pipeline to gain an edge, and even to find bugs I didn't expect to exist =)

About

My "unofficial" winning submission to Lyft's image segmentation challenge - https://www.udacity.com/lyft-challenge
