ViT, CLIP, SigLIP
Use this notebook and the video below as references to answer the following questions.
Q1: What trade-offs come with smaller vs. larger patch sizes in ViT?
In [3]:
# Your answer or notes here
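One concrete way to see the trade-off is to count tokens. A minimal sketch, assuming a square image and the usual `[CLS]` token convention: smaller patches give more tokens (finer spatial detail, but self-attention cost grows roughly quadratically in sequence length), while larger patches give fewer, coarser tokens.

```python
# Sketch: how patch size affects a ViT's token sequence length and attention cost.
def vit_seq_stats(image_size: int, patch_size: int) -> dict:
    """Token count and rough self-attention cost for a square image (plus [CLS])."""
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    n_patches = (image_size // patch_size) ** 2
    seq_len = n_patches + 1  # +1 for the [CLS] token
    return {
        "patches": n_patches,
        "seq_len": seq_len,
        "attention_pairs": seq_len ** 2,  # self-attention scales ~O(seq_len^2)
    }

for p in (32, 16, 8):  # e.g. ViT-B/32 vs ViT-B/16 vs ViT-B/8
    print(p, vit_seq_stats(224, p))
```

At 224×224, going from /32 to /16 to /8 takes the patch count from 49 to 196 to 784, so attention cost grows far faster than the gain in resolution.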
Q2: What inductive biases do CNNs have that ViTs lack? What are the consequences?
In [5]:
# Your answer or notes here
Q3: Why is positional encoding necessary in ViT, and how is it implemented?
In [7]:
# Your answer or notes here
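A minimal numpy sketch of the mechanism (shapes assume a 224×224 image, 16×16 patches, and ViT-B width; numpy arrays stand in for learned parameters): the flattened patch sequence is order-free by itself, so a learnable positional embedding, one vector per position, is added to the tokens before the encoder.

```python
# Sketch: adding a learnable positional embedding to ViT patch tokens.
import numpy as np

rng = np.random.default_rng(0)
n_patches, dim = 196, 768                 # assumed: 224x224 image, 16x16 patches, ViT-B width
patch_tokens = rng.normal(size=(n_patches, dim))
cls_token = rng.normal(size=(1, dim))

tokens = np.concatenate([cls_token, patch_tokens], axis=0)  # (197, 768)
pos_embed = rng.normal(size=(n_patches + 1, dim)) * 0.02    # learned in the real model
tokens = tokens + pos_embed                                 # one vector added per position
print(tokens.shape)
```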
Q4: What are the two separate encoders in CLIP, and what is their purpose?
In [9]:
# Your answer or notes here
Q5: Explain CLIP’s contrastive loss. How does it align image and text representations?
In [11]:
# Your answer or notes here
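A toy numpy sketch of the symmetric InfoNCE objective (toy embeddings and dimensions are assumptions, not CLIP's real sizes): matched image/text pairs sit on the diagonal of the similarity matrix, and the loss averages cross-entropy over rows (image→text) and columns (text→image).

```python
# Sketch of CLIP's symmetric contrastive loss on toy embeddings.
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    logits = l2norm(img_emb) @ l2norm(txt_emb).T / temperature  # (N, N) similarities
    n = logits.shape[0]
    log_sm_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    loss_i2t = -log_sm_rows[np.arange(n), np.arange(n)].mean()  # image -> text
    loss_t2i = -log_sm_cols[np.arange(n), np.arange(n)].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs -> low loss
print(clip_loss(img, txt))
```

Shuffling the text embeddings (so diagonal pairs no longer match) raises the loss, which is exactly the pressure that pulls matched pairs together and pushes mismatched pairs apart.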
Q6: How does CLIP enable zero-shot classification? What role do prompts like “a photo of a ___” play?
In [13]:
# Your answer or notes here
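A hedged sketch of the zero-shot recipe, with a hypothetical hash-seeded `embed_text` standing in for CLIP's real encoders: class names are wrapped in a prompt template (a full caption is closer to CLIP's training distribution than a bare word), each prompt is embedded, and the image is assigned the label whose text embedding is most similar.

```python
# Sketch: CLIP-style zero-shot classification with stand-in encoders.
import zlib
import numpy as np

def embed_text(s: str, dim: int = 16) -> np.ndarray:
    """Hypothetical stand-in for CLIP's text encoder (deterministic hash-seeded vector)."""
    rng = np.random.default_rng(zlib.crc32(s.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]  # prompt template
text_embs = np.stack([embed_text(p) for p in prompts])   # (3, 16)

# Pretend the image encoder produced something near the "cat" prompt:
image_emb = text_embs[0] + 0.05 * np.random.default_rng(1).normal(size=16)
image_emb = image_emb / np.linalg.norm(image_emb)

scores = text_embs @ image_emb                # cosine similarities
print(labels[int(np.argmax(scores))])
```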
Q7: What is the main difference in loss function between CLIP and SigLIP?
In [15]:
# Your answer or notes here
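A numpy sketch contrasting the two objectives (toy embeddings; the scale and bias values are assumptions, roughly matching SigLIP's reported initialization): CLIP's softmax couples every pair in the batch, while SigLIP scores each image-text pair with an independent sigmoid as a binary matched/unmatched problem, so no batch-wide normalization is needed.

```python
# Sketch of SigLIP's pairwise sigmoid loss on toy embeddings.
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def siglip_loss(img_emb, txt_emb, scale=10.0, bias=-10.0):
    logits = scale * (l2norm(img_emb) @ l2norm(txt_emb).T) + bias  # (N, N)
    n = logits.shape[0]
    labels = 2 * np.eye(n) - 1           # +1 for matched pairs, -1 elsewhere
    # mean of -log sigmoid(labels * logits), written stably via logaddexp
    return np.mean(np.logaddexp(0.0, -labels * logits))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
print(siglip_loss(img, img))                        # aligned pairs
print(siglip_loss(img, np.roll(img, 1, axis=0)))    # mismatched pairs -> larger
```

Because each pair is scored independently, the loss decomposes over pairs, which is what makes it friendlier to very large or distributed batches than a softmax over the whole batch.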
Q8: What might be the impact of removing the softmax normalization across the batch in SigLIP?
In [17]:
# Your answer or notes here
Q9: What are potential advantages of SigLIP when deploying models in low-latency environments?
In [19]:
# Your answer or notes here