import torch
import torch.nn as nn
import torchvision.models as models
from torch.utils.data import DataLoader, Dataset
# 1. Data Preparation (Minimal example - replace with your dataset)
class MyDataset(Dataset):
    def __init__(self, num_samples=100):
        self.data = torch.randn(num_samples, 3, 224, 224)  # Example image data
        self.labels = torch.randint(0, 2, (num_samples,))  # 2 classes

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]
# 2. ViT Model (Simplified)
class ViT(nn.Module):
    def __init__(self, patch_size=16, num_classes=2):
        super(ViT, self).__init__()
        self.patch_size = patch_size
        # Utilize a pretrained ResNet for the feature extractor
        # (the weights= API replaces the deprecated pretrained=True flag).
        self.base_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.num_features = self.base_model.fc.in_features  # Feature width of the ResNet trunk (512 for ResNet-18)
        self.linear = nn.Linear(self.num_features, num_classes)

    def forward(self, x):
        # Extract features with the pre-trained ResNet stem
        x = self.base_model.conv1(x)
        x = self.base_model.bn1(x)
        x = self.base_model.relu(x)
        x = self.base_model.maxpool(x)

        # Pass through the ResNet's four residual stages; we only use the last stage's output
        x = self.base_model.layer1(x)
        x = self.base_model.layer2(x)
        x = self.base_model.layer3(x)
        x = self.base_model.layer4(x)

        x = self.base_model.avgpool(x)  # Global average pool so the flattened size matches num_features
        x = torch.flatten(x, 1)         # Flatten to prepare for the linear layer
        x = self.linear(x)
        return x
# 3. Training Setup
batch_size = 16
num_epochs = 3
learning_rate = 0.001
# 4. Instantiate & Train the Model
dataset = MyDataset()
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
model = ViT()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()  # For multi-class classification
for epoch in range(num_epochs):
    for inputs, labels in data_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")
print("Training complete!")
Explanation and Critical Considerations:
- ResNet as Feature Extractor: I’ve used a pre-trained ResNet-18 model as the feature extractor. This is a common practice. You can start with a smaller ResNet or even a Vision Transformer pre-trained on a massive dataset (e.g., ImageNet) and fine-tune it for your specific task.
- Patchify: The torch.flatten(x, 1) call stands in for the patch-embedding step here, collapsing the pooled feature map into a single feature vector per image. In a real ViT you would patchify the image itself, splitting it into fixed-size patches and linearly projecting each one, which gives more control and potential optimization (see the sketch after this list).
- Positional Encoding: Positional encoding has been omitted for simplicity. Adding positional encodings is a key component of a real ViT and requires careful design.
- Data: This example uses a very small, randomly generated dataset for demonstration.
- Scalability: This is extremely simplified. Real ViT implementations often involve techniques like memory-efficient attention mechanisms and parallelization to handle larger images and larger batch sizes.
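For concreteness, here is a minimal sketch of the patchify and positional-encoding steps that the ResNet shortcut above replaces. The class name PatchEmbedding, the Conv2d-based projection, and the learned positional-encoding parameter are illustrative assumptions, not the only way to do it:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A Conv2d with kernel_size == stride == patch_size is equivalent to
        # cutting the image into non-overlapping patches and linearly projecting each one.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learned positional encoding: one trainable vector per patch position.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):
        x = self.proj(x)                  # (B, embed_dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        return x + self.pos_embed         # inject positional information

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])

A full ViT would additionally prepend a learnable class token before adding positional embeddings and then feed the resulting sequence through a stack of transformer encoder blocks.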
Next Steps and Further Exploration:
- Learn More about Attention Mechanisms: Dive deep into the workings of self-attention (a minimal single-head sketch follows this list).
- Explore Different Patch Sizes: Experiment with different patch sizes to see their impact on performance.
- Positional Encoding Strategies: Investigate various positional encoding methods (learned, sinusoidal, etc.).
- Implement Efficient Attention: Research techniques like linear attention or sparse attention to reduce the computational cost of self-attention.
- Fine-Tuning: Fine-tune a pre-trained ViT model on a real dataset.
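As a companion to the first bullet above, here is a minimal sketch of single-head scaled dot-product self-attention over a sequence of patch tokens. It is deliberately bare-bones (real implementations use multiple heads, dropout, and fused kernels such as torch.nn.functional.scaled_dot_product_attention), and the class name SelfAttention is assumed for illustration:

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        self.scale = embed_dim ** -0.5
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # joint projection to queries, keys, values
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                              # x: (B, N, D) patch tokens
        q, k, v = self.qkv(x).chunk(3, dim=-1)         # each (B, N, D)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N, N) pairwise similarity scores
        attn = attn.softmax(dim=-1)                    # each token attends over all N tokens
        return self.out(attn @ v)                      # attention-weighted mix of values

out = SelfAttention()(torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 196, 768])

Note that the attn matrix is N x N, which is exactly the quadratic cost that the linear- and sparse-attention techniques mentioned above aim to reduce. For the fine-tuning bullet, torchvision also ships pre-trained ViTs (e.g., models.vit_b_16) whose classification head can be swapped out for your own task, much like the linear layer in the earlier code.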
This provides a foundational understanding of ViT and how to get started with it in PyTorch. It’s a revolutionary architecture pushing the boundaries of computer vision, and I encourage you to explore further! Let me know if you’d like to delve into a particular aspect in more detail.