Syllabus
Books
SZELINSKI - Computer Vision: Algorithms and Applications, 2nd Edition, by Richard Szeliski. This book is required and is free to download for personal use.
BISHOP - Deep Learning - Foundations and Concepts, by C Bishop and H Bishop. This book can be accessed and viewed online on the book’s website.
TIF - Foundations of Computer Vision, Antonio Torralba, Phillip Isola and William T. Freeman. This book is not free and is not required, but it covers very recent applications of deep learning for computer vision, such as diffusion models.
Planned Schedule
Lecture | Title | Details |
---|---|---|
1 | Introduction and the foundations of vision | We start with an introduction to Computer Vision for the general application area of agents with egomotion. Throughout this course we will assume that a monocular or stereo camera is mounted on an agent that can, in general, move in a 3D environment, and we focus on its perception system assuming only the presence of camera sensing. In this lecture we review prerequisites on programming (Python) as well as linear algebra, probability theory and the basics of how cameras work. With the help of the TAs and other tutorial videos, we also ensure that students have set up the programming environment necessary for the projects and assignments of the course. Reading: Selected pages from the course website. |
Part I: Detection and Segmentation | ||
2 | Introduction to Prediction and Neural Networks | The computer vision system is now dissected into its parts, with the very first part being featurization. We introduce the end-to-end prediction problem using simpler learning architectures and subsequently fully connected neural architectures. Our focus here is to understand how prediction can be engineered by taking the maximum likelihood optimization principle and its associated cross-entropy loss function and applying it to neural networks for supervised regression and classification tasks. Both of these tasks will come together when we discuss object detection in a future lecture. Reading: BISHOP Chapter 4, Chapter 5. |
3 | Convolutional Neural Networks (CNNs) | We then introduce CNNs with their innate ability to efficiently learn spatial hierarchies of features through backpropagation. We treat simple tasks such as image classification and then quickly dive into architectures that were designed specifically for image featurization, such as Residual Networks (ResNets), explaining why they are so popular, especially for real-time perception. Reading: BISHOP Chapter 10. |
4 | Object Detection | As a first task in scene understanding, we now design object detectors, initially from CNNs, that identify and locate objects of interest. We treat two main architectures: YOLO and Faster R-CNN. YOLO is known for its speed and efficiency, while Faster R-CNN focuses on higher detection accuracy, often resulting in better precision and recall in complex scenes. Reading: SZELINSKI Chapter 6. |
5 | Semantic Segmentation | Many computer vision applications require far finer granularity than a bounding box around the object(s). Here we extend CNN-based object detectors with heads that label the specific pixels each object occupies in the scene, and we also cover panoptic segmentation, which labels everything in the scene. Reading: SZELINSKI Chapter 6. |
6 | Vision Transformers (ViT) | At this point, we introduce transformer-based architectures that will be the basis for more advanced tasks later on in this course. We focus on Vision Transformers (ViT), which leverage a self-attention mechanism to model global dependencies within an image by treating the image as a sequence of patches. We will see that, compared to their CNN counterparts, ViT models suffer from increased latency that inhibits real-time applications, while in non-real-time settings they improve performance on tasks requiring an understanding of the whole image context. |
Part II: Moving Pictures and Object Tracking | ||
7 | Single-Object Tracking (SOT) | In this lecture we focus entirely on video streams and on the requirement of many computer vision applications, such as video surveillance, to track objects within the geometrical boundaries of a single camera. We look at various architectures that can correct the reflexive nature of the CNN and ViT detectors built earlier, so that an object can be tracked despite challenges such as occlusion, motion blur, and changes in appearance. Reading: Course notes based on SOT/MOT competition papers and benchmarks. |
8 | Multi-Object Tracking (MOT) | We now extend single-object tracking with robust object association techniques to maintain consistent identities for all objects throughout the video scene under the scope of a single sensor. The extension to multiple sensors, where a featured object such as a person can be seen across different streams, is also treated using representation learning and vector similarity search approaches. Reading: Course notes based on SOT/MOT competition papers and benchmarks. |
Part III: Multimodal Vision Models | ||
9 | Learning Image Captions | In Parts I and II we developed models and systems that can perceive the environment purely from visual information. There is, however, a wide range of applications that mandate interactivity with humans or between multiple agents. In this lecture we start with the task of describing an image in natural language, a generative task that is based on the attention models we developed in an earlier lecture. Reading: Lecture notes based on the “Show, Attend and Tell” paper and others. |
10 | Visual Question Answering (VQA) | We continue with an interactive application where the goal is to answer questions about the contents of an image. We will train models on datasets specifically designed for VQA, or use the ability of OpenAI’s CLIP to understand and relate visual content to textual descriptions, which provides a foundational model that can be fine-tuned for VQA tasks. Reading: Course notes based on VQA and CLIP papers. |
11 | Prompted Scalable Vision Models | Closing the topic of multimodal reasoning, we demonstrate one attribute that has been shown to offer groundbreaking performance improvements in the NLP domain: model scaling and prompting. Meta’s release of the Segment Anything Model (SAM), scaled to very large parameter sizes, paved the way to execute tasks such as image segmentation for object classes that the model has never seen before. Apart from scaling, this “zero-shot” learning ability is achieved by guiding the model to perform specific segmentation tasks using textual or visual cues. Reading: Course notes based on SAM paper. |
Part IV: Generative Vision Models | ||
12 | Representing Scenes as Neural Radiance Fields | In the last few years we have seen significant advances in the ability of models to generate realistic scenes. We are now at a point in this course where we have all the tools available to us to study the first generative method that, amazingly, creates 3D scenes from a set of 2D images. We cover concepts such as volume rendering and others, paired with fully connected neural networks to achieve photorealistic generation of 3D scenes. Reading: Course notes based on NeRF paper. |
13 | Diffusion Models and DALL-E | Finally, we look at diffusion, a general modeling approach inspired by thermodynamics (an instance of physics-inspired learning), and use it for conditional image generation given a textual description. In essence, we are trying to reverse the image captioning task we treated earlier. In this exciting last lecture we will combine the textual representation learning of the CLIP model we have seen earlier with a conditional diffusion process to create photorealistic images guided by our prompts. Reading: Course notes on DALL-E/2 and Stable Diffusion. |
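
As a small preview of the maximum-likelihood view introduced in Lecture 2, the following minimal Python sketch computes the cross-entropy loss as the negative log-likelihood of the true class under a softmax model. The function names and the three-class scores are hypothetical, chosen only for illustration; they are not taken from the course materials.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max score for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-likelihood of the true class under the softmax model."""
    probs = softmax(logits)
    return -math.log(probs[label])


# Hypothetical scores for one image over three classes (illustrative values).
logits = [2.0, 0.5, -1.0]
loss_if_class_0 = cross_entropy(logits, 0)  # true class has the highest score
loss_if_class_2 = cross_entropy(logits, 2)  # true class has the lowest score
print(loss_if_class_0, loss_if_class_2)
```

Minimizing this loss over a training set is the same as maximizing the likelihood of the observed labels, which is why the loss is small when the true class receives the highest score and large when it receives the lowest.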