Vision Transformer

Projecting the NLP Transformer to the Image Domain

This section describes the Vision Transformer (ViT), a transformer-based model for image classification. The architecture was introduced in the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy et al. (2020).