ViT, CLIP, SigLIP
Use this notebook and the video below as references to answer the following questions.
Q1: What trade-offs come with smaller vs. larger patch sizes in ViT?
In [3]:
# Your answer or notes here
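One concrete way to see the trade-off is to count tokens. A minimal sketch, assuming a square image and the usual `[CLS]` token convention: smaller patches give more tokens (finer spatial detail, but self-attention cost grows roughly quadratically in sequence length), while larger patches give fewer, coarser tokens.

```python
# Sketch: how patch size affects a ViT's token sequence length and attention cost.
def vit_seq_stats(image_size: int, patch_size: int) -> dict:
    """Token count and rough self-attention cost for a square image (plus [CLS])."""
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    n_patches = (image_size // patch_size) ** 2
    seq_len = n_patches + 1  # +1 for the [CLS] token
    return {
        "patches": n_patches,
        "seq_len": seq_len,
        "attention_pairs": seq_len ** 2,  # self-attention scales ~O(seq_len^2)
    }

for p in (32, 16, 8):  # e.g. ViT-B/32 vs ViT-B/16 vs ViT-B/8
    print(p, vit_seq_stats(224, p))
```

At 224×224, going from /32 to /16 to /8 takes the patch count from 49 to 196 to 784, so attention cost grows far faster than the gain in resolution.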
Q2: What inductive biases do CNNs have that ViTs lack? What are the consequences?
In [5]:
# Your answer or notes here
Q3: Why is positional encoding necessary in ViT, and how is it implemented?
In [7]:
# Your answer or notes here
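A minimal numpy sketch of the mechanism (shapes assume a 224×224 image, 16×16 patches, and ViT-B width; numpy arrays stand in for learned parameters): the flattened patch sequence is order-free by itself, so a learnable positional embedding, one vector per position, is added to the tokens before the encoder.

```python
# Sketch: adding a learnable positional embedding to ViT patch tokens.
import numpy as np

rng = np.random.default_rng(0)
n_patches, dim = 196, 768                 # assumed: 224x224 image, 16x16 patches, ViT-B width
patch_tokens = rng.normal(size=(n_patches, dim))
cls_token = rng.normal(size=(1, dim))

tokens = np.concatenate([cls_token, patch_tokens], axis=0)  # (197, 768)
pos_embed = rng.normal(size=(n_patches + 1, dim)) * 0.02    # learned in the real model
tokens = tokens + pos_embed                                 # one vector added per position
print(tokens.shape)
```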
Q4: What are the two separate encoders in CLIP, and what is their purpose?
In [9]:
# Your answer or notes here
Q5: Explain CLIP’s contrastive loss. How does it align image and text representations?
In [11]:
# Your answer or notes here
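A toy numpy sketch of the symmetric InfoNCE objective (toy embeddings and dimensions are assumptions, not CLIP's real sizes): matched image/text pairs sit on the diagonal of the similarity matrix, and the loss averages cross-entropy over rows (image→text) and columns (text→image).

```python
# Sketch of CLIP's symmetric contrastive loss on toy embeddings.
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    logits = l2norm(img_emb) @ l2norm(txt_emb).T / temperature  # (N, N) similarities
    n = logits.shape[0]
    log_sm_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    loss_i2t = -log_sm_rows[np.arange(n), np.arange(n)].mean()  # image -> text
    loss_t2i = -log_sm_cols[np.arange(n), np.arange(n)].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs -> low loss
print(clip_loss(img, txt))
```

Shuffling the text embeddings (so diagonal pairs no longer match) raises the loss, which is exactly the pressure that pulls matched pairs together and pushes mismatched pairs apart.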
Q6: How does CLIP enable zero-shot classification? What role do prompts like “a photo of a ___” play?
In [13]:
# Your answer or notes here
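A hedged sketch of the zero-shot recipe, with a hypothetical hash-seeded `embed_text` standing in for CLIP's real encoders: class names are wrapped in a prompt template (a full caption is closer to CLIP's training distribution than a bare word), each prompt is embedded, and the image is assigned the label whose text embedding is most similar.

```python
# Sketch: CLIP-style zero-shot classification with stand-in encoders.
import zlib
import numpy as np

def embed_text(s: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for CLIP's text encoder (deterministic hash-seeded vector)."""
    rng = np.random.default_rng(zlib.crc32(s.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]  # prompt template
text_embs = np.stack([embed_text(p) for p in prompts])   # (3, 16)

# Pretend the image encoder produced something near the "cat" prompt:
image_emb = text_embs[0] + 0.05 * np.random.default_rng(1).normal(size=16)
image_emb = image_emb / np.linalg.norm(image_emb)

scores = text_embs @ image_emb                # cosine similarities
print(labels[int(np.argmax(scores))])
```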
Q7: What is the main difference in loss function between CLIP and SigLIP?
In [15]:
# Your answer or notes here
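A numpy sketch contrasting the two objectives (toy embeddings; the scale and bias values are assumptions, roughly matching SigLIP's reported initialization): CLIP's softmax couples every pair in the batch, while SigLIP scores each image-text pair with an independent sigmoid as a binary matched/unmatched problem, so no batch-wide normalization is needed.

```python
# Sketch of SigLIP's pairwise sigmoid loss on toy embeddings.
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def siglip_loss(img_emb, txt_emb, scale=10.0, bias=-10.0):
    logits = scale * (l2norm(img_emb) @ l2norm(txt_emb).T) + bias  # (N, N)
    n = logits.shape[0]
    labels = 2 * np.eye(n) - 1           # +1 for matched pairs, -1 elsewhere
    # mean of -log sigmoid(labels * logits), written stably via logaddexp
    return np.mean(np.logaddexp(0.0, -labels * logits))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
print(siglip_loss(img, img))                        # aligned pairs
print(siglip_loss(img, np.roll(img, 1, axis=0)))    # mismatched pairs -> larger
```

Because each pair is scored independently, the loss decomposes over pairs, which is what makes it friendlier to very large or distributed batches than a softmax over the whole batch.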
Q8: What might be the impact of removing the softmax normalization across the batch in SigLIP?
In [17]:
# Your answer or notes here
Q9: What are potential advantages of SigLIP when deploying models in low-latency environments?
In [19]:
# Your answer or notes here