Cameras and Image Processing

Figure 1: It's no coincidence that if you ask people what the first thing they notice in this image is, most will say the yellow cab. This is because our brains are wired to detect salient features in the environment and use them to build a mental model of the world around us.

As elaborated here, humans build up a more schematic version of the environment across eye fixations than was previously thought. This schematic version of the environment is typically known as scene gist. It contains conceptual information about the scene's basic category – is it natural, human-made, a cityscape perhaps – and its general layout, perhaps limited to a few objects and/or features. This schematic representation is a far cry from the "picture in the head" scenario, but it is what guides us from one eye fixation to the next, during which more detailed information can be sampled. In the picture above, the brain first detects the cityscape schematic and then processes individual fixations - e.g. the yellow cab.

Neuroscientists David Hubel and Torsten Wiesel characterized the so-called V1 region of the brain - the region at the back of our head that is responsible for processing the visual sensory signals coming from the eye's retina - and researchers have since modelled this region as stacks of convolutional neural layers. Before going into the details of such layers, we will first look at camera sensors and how they capture images. Further, since no sensor is ideal, we will also look at camera calibration, which is used to account for such imperfections and improve the quality of the captured images and, by extension, the quality of the geometric information that we can extract from them.

Color images as functions

Figure 2: A grayscale image can be seen as a matrix of pixel intensities. Each pixel is represented by a single value that indicates the intensity of light at that point, typically ranging from 0 (black) to 255 (white) in an 8-bit encoding.

We typically refer to the size of the image as \(w \times h\), where \(w\) is the width (number of columns) and \(h\) is the height (number of rows). We write the image as a tensor - a matrix in this grayscale case - \(\mathbf x(i,j)\), where the convention is to refer to the columns \(j\) of the matrix with \(x\) and to the rows \(i\) with \(y\); this may cause some confusion at first.

A function \(f(i,j)\) maps the pixel coordinates to the intensity value at that pixel. The dynamic range of the intensity values depends on how many bits the sensor provides per pixel. Typically we see 8 bits per pixel, limiting the dynamic range to [0, 255], although some cameras can capture images with 10, 12 or even 16 bits per pixel.
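As a minimal sketch of this image-as-function view, the snippet below loads an 8-bit grayscale image with Pillow and NumPy and reads a single intensity value; the file name is just a placeholder.

```python
import numpy as np
from PIL import Image

# Placeholder file name; any 8-bit grayscale (or convertible) image works.
img = np.array(Image.open("street.png").convert("L"))  # dtype uint8, values in [0, 255]

h, w = img.shape           # NumPy stores the image as (rows, columns) = (height, width)
intensity = img[10, 20]    # f(i=10, j=20): intensity at row i=10, column j=20
print(h, w, intensity)
```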

Most of the real world is, however, captured in color. The naturally colored image in Figure 1 can be seen as a vector-valued function:

\[\mathbf f(i,j)= \begin{bmatrix} r(i,j) \\ g(i,j) \\ b(i,j) \end{bmatrix}\]

with the 3rd dimension assigned to the image channels. If these channels correspond to the well-known Red, Green and Blue primary colors, we call this the RGB color space encoding. The mixture or superposition of these channels generates the color of each pixel of the original image.
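A short sketch of this vector-valued view, assuming an 8-bit RGB image loaded with Pillow; the file name is again a placeholder.

```python
import numpy as np
from PIL import Image

# Placeholder file name; any RGB image works.
rgb = np.array(Image.open("street.png").convert("RGB"))  # shape (h, w, 3), dtype uint8

# Each channel is itself a scalar function of the pixel coordinates (i, j).
r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

# Stacking the three channel functions back together recovers the original image.
assert np.array_equal(np.stack([r, g, b], axis=-1), rgb)
```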

Image Normalization

Before processing images we typically convert them to floating-point values in the range [0, 1]. This is called normalization and it is done by dividing each pixel value by the maximum representable value of the encoding. For example, if we have an 8-bit RGB image with pixel values in the range [0, 255], we normalize it as follows:

\[\mathbf x_{norm}(i,j) = \begin{bmatrix} \frac{r(i,j)}{255} \\ \frac{g(i,j)}{255} \\ \frac{b(i,j)}{255} \end{bmatrix}\]

Figure 3: Normalized RGB channels of the image in Figure 1.
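A minimal normalization sketch, assuming 8-bit input so that the maximum representable value is 255:

```python
import numpy as np

def normalize(img_uint8: np.ndarray) -> np.ndarray:
    """Map 8-bit pixel values from [0, 255] to floats in [0, 1]."""
    return img_uint8.astype(np.float32) / 255.0

x = np.array([[0, 128, 255]], dtype=np.uint8)
print(normalize(x))  # [[0.         0.5019608  1.        ]]
```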

In the next section, we look at how cameras capture images and how we can model the camera sensor as a function that maps the 3D world to a 2D image.

Camera Models

Camera models are mathematical representations of how a camera captures the 3D world and projects it onto a 2D image plane. There are several different camera models, each with its own assumptions and characteristics. Some of the most common include:

  1. The Pinhole Camera Model: This is the simplest camera model, which assumes that light rays pass through a single point (the pinhole) and project onto the image plane. It is characterized by its focal length and the position of the pinhole.

  2. The Perspective Camera Model: This model extends the pinhole camera model by incorporating lens distortion and other optical effects. It is commonly used in computer vision and graphics to simulate realistic camera behavior.

  3. The Orthographic Camera Model: In this model, parallel lines in the 3D world remain parallel in the 2D image. This is useful for technical drawings and architectural visualizations, where accurate measurements are important.

  4. The Spherical Camera Model: This model captures a 360-degree view of the scene by projecting it onto a spherical image surface rather than a plane. It is commonly used in virtual reality and panoramic photography.

  5. The Omnidirectional Camera Model: Similar to the spherical camera model, this model captures a wide field of view (FOV) by using multiple lenses or a fisheye lens. It is often used in robotics and surveillance applications.

Each of these camera models has its own set of equations and parameters that describe how 3D points are projected onto the 2D image plane. Understanding these models is essential for tasks such as image rectification, 3D reconstruction, and camera calibration.

Here we focus on the pinhole camera model.
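Before deriving the model in detail, here is a minimal projection sketch under the usual simplifying assumptions: the 3D points are already expressed in the camera frame with positive depth, there is no lens distortion, and the parameter names fx, fy, cx, cy (focal lengths and principal point in pixels) are chosen here for illustration.

```python
import numpy as np

def project_pinhole(points_3d: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Project Nx3 camera-frame points onto the image plane (pinhole model)."""
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    uvw = points_3d @ K.T             # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide -> pixel coordinates

# Example: a point 2 m in front of the camera, slightly off-axis.
pts = np.array([[0.1, -0.05, 2.0]])
print(project_pinhole(pts, fx=800.0, fy=800.0, cx=320.0, cy=240.0))
# -> [[360. 220.]]
```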