simmim.md


November 2021

tl;dr: Large-scale pretraining based on Masked Image Modeling (MIM). Similar to MAE.

Overall impression

This paper was published a week after MAE and was evidently rushed out in response to it. The ideas are very similar, but the execution (hyperparameter tuning, paper writing) is considerably weaker than MAE's.

Differences between MAE and SimMIM:

  • MAE uses an asymmetric encoder/decoder design, in which the encoder does not see masked patches at all. SimMIM uses a symmetric design, feeding mask tokens through the full encoder.
  • SimMIM stresses the difference between prediction (loss on only the masked patches) and reconstruction (loss on all patches), and reports that the former yields better performance. MAE observes the same trend (in a footnote). However, MAE also demonstrates a middle ground: training with no loss on visible patches while still predicting all patches.
  • SimMIM was not validated on finer-grained downstream tasks such as object detection and segmentation.
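The prediction-vs-reconstruction distinction above boils down to which patches the loss is computed over. A minimal sketch (function name and shapes are hypothetical, not either paper's code; both papers use a per-pixel regression loss of this general form):

```python
import numpy as np

def masked_l1_loss(pred, target, mask, masked_only=True):
    """L1 loss over image patches.

    pred, target: (num_patches, patch_dim) predicted / ground-truth pixel values.
    mask: (num_patches,) boolean, True where the patch was masked out.
    masked_only=True  -> "prediction": loss only on masked patches.
    masked_only=False -> "reconstruction": loss on all patches.
    """
    err = np.abs(pred - target).mean(axis=1)  # per-patch L1 error
    if masked_only:
        return err[mask].mean()               # ignore visible patches
    return err.mean()
```

SimMIM's ablation (and MAE's footnote) find the `masked_only=True` variant transfers better, presumably because reconstructing visible patches is closer to trivial copying than to representation learning.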

Similarities between MAE and SimMIM:

  • both directly regress raw pixel values
  • both use a lightweight decoder (prediction head) design
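The symmetric-vs-asymmetric encoder difference listed earlier is essentially about how the encoder input sequence is built. A minimal sketch under assumed shapes (function name is hypothetical; real implementations also handle positional embeddings and batching):

```python
import numpy as np

def encoder_input(tokens, mask, mask_token, asymmetric=True):
    """Build the encoder input sequence from patch embeddings.

    tokens: (num_patches, dim) patch embeddings.
    mask: (num_patches,) boolean, True where a patch is masked.
    asymmetric=True  -> MAE-style: drop masked tokens; the encoder sees
                        only visible patches (shorter, cheaper sequence).
    asymmetric=False -> SimMIM-style: replace masked patches with a
                        learned mask token; full-length sequence.
    """
    if asymmetric:
        return tokens[~mask]
    out = tokens.copy()
    out[mask] = mask_token
    return out
```

With a 75% mask ratio, the asymmetric variant shrinks the encoder's sequence length by 4x, which is a large part of MAE's pretraining speedup.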

Key ideas

  • Summaries of the key ideas

Technical details

  • Summary of technical details

Notes

  • Questions and notes on how to improve/revise the current work