Visual Language Models
In this chapter, we start with CLIP, the classic image-text alignment model trained with contrastive learning.
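To make the contrastive idea concrete, here is a minimal zero-shot classification sketch using the Hugging Face transformers CLIP API; the checkpoint name, image path, and candidate captions are illustrative choices rather than details fixed by this chapter.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any public CLIP checkpoint works; this is a common ViT-B/32 variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # hypothetical local image
captions = ["a photo of a cat", "a photo of a dog", "a diagram of a transformer"]

# The processor tokenizes the captions and resizes/normalizes the image.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image
# embedding and each caption embedding; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
```

Because both encoders project into the same embedding space, zero-shot classification reduces to picking the caption most similar to the image, with no task-specific training.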
We then cover BLIP-2, which connects a frozen vision encoder (typically a CLIP-style ViT) to a powerful frozen LLM (such as OPT or Flan-T5) through a lightweight querying module, the Q-Former.
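From the user's side, the whole pipeline behaves like a single generation model. The sketch below uses the transformers BLIP-2 classes for visual question answering; the checkpoint, image path, and prompt are illustrative, and device placement and half precision are omitted for brevity.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Illustrative checkpoint: a frozen ViT bridged to Flan-T5-XL via the Q-Former.
model_id = "Salesforce/blip2-flan-t5-xl"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id)

image = Image.open("street.jpg")  # hypothetical local image
prompt = "Question: what is happening in this picture? Answer:"

# Only the Q-Former (plus a small projection) was trained to bridge the two
# frozen backbones; generation itself is an ordinary generate() call.
inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```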
Finally, we present LLaVA, which goes a step further: it directly combines a vision encoder (a CLIP ViT) with a large language model (Vicuna, Llama 2, etc.) and instruction-tunes the pair for instruction following, dialogue, and rich vision-language reasoning.
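A usage sketch with the transformers LLaVA integration follows; the checkpoint, image path, prompt wording, and the USER/ASSISTANT template are assumptions based on the llava-hf releases, not details established above.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative checkpoint; larger variants and quantized loading are left out.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("receipt.jpg")  # hypothetical local image
# The <image> placeholder marks where the projected visual tokens are inserted.
prompt = "USER: <image>\nDescribe this image and point out anything unusual. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Because the language model sees the projected image tokens inline with the text, the same generate() interface supports free-form dialogue and instruction following.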