January 10, 2020
Computer vision is a field of artificial intelligence and is generally used to describe methods for extracting information from images. Computer vision is needed when machines interact with the physical world, such as in robotics, or when machines are used to extract useful information from images (and images are everywhere).
Transforming an image into a description is not obvious; it is an ill-posed problem. One image can have many interpretations, and one object can produce many images, so the space of interpretations is exponentially large:
imaging: the mapping from 3D to 2D is unique, so the problem is well-posed (a forward problem)
vision: the mapping from 2D to 3D is not unique, so the problem is ill-posed (an inverse problem); many objects can generate similar images
Almost all production-level vision systems use constraints and priors (prior knowledge) to make some interpretations more likely than others. Our brain produces one interpretation from many possible ones.
Inference
Human perception uses inference, drawing on prior information about the world, to constrain results and (sometimes) produce a more accurate interpretation:
a checkerboard with an added shadow (illumination) can produce a perceived intensity that does not reflect the actual image intensity, making shadowed tiles look lighter than they are
Ames Room illusion (perspective)
knowing what an image looks like makes it easier to see something that was harder to see before (prior knowledge)
expecting an image to look a certain way makes it hard to spot variations (prior expectation), such as common brand with typo or different text
exposure to an image makes it hard to spot changes to same image (prior exposure, change blindness), such as removing an object from a scene in image
knowing context makes it easier to complete otherwise incomplete images, such as letter missing in common word
Human perception is influenced by prior expectations, or simply priors, and the examples listed above are illusions, which arise from assumptions that our vision system makes to solve an under-constrained problem:
prior knowledge, or familiarity, such as previous experience with certain objects or knowledge about formation process in general
prior exposure, motion, or priming, which is recent or preceding input
current context, such as objects surrounding visual scene, and concurrent input
An image is formed when a sensor registers radiation that interacted with a physical object. An image that is formed is affected by two types of parameters (note that camera and computer vision systems can also work with non-human-visible wavelengths, such as infra-red and x-ray):
radiometric, to determine intensity: color (the intensity spectrum at the earth's surface is concentrated between roughly 400-700 nm), illumination (type and location), surface reflectance properties (material and orientation), and sensor properties (sensitivity)
geometric, to determine where a scene point is in the image, camera position and orientation in space, camera optics (focal length), and projection geometry (mapping 3D to 2D)
Colors are created either by mixing light (additive), which adds illumination to increase intensity and results in white at the extreme, or by mixing pigments (subtractive), which absorbs illumination to decrease the reflected intensity and results in black at the extreme.
A color is determined by luminance, which is the amount of light striking the sensor, illuminance, which is the amount of light striking the surface, and reflectance, which is the proportion of light reflected rather than absorbed (depends on the surface material). So, luminance at a location, $(x, y)$, and wavelength, $w$, is a function of the illuminance and reflectance at the same location and wavelength:
if illuminance of light changes, intensity of light also changes (luminance changes with illuminance)
if reflectance of light changes, intensity of light also changes (luminance changes with reflectance)
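Written compactly (using L, E, and R as shorthand for luminance, illuminance, and reflectance; the symbol names are mine, not from the text): L(x, y, w) = E(x, y, w) * R(x, y, w).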
The human eye seems to recover the surface color, i.e. reflectance, since perceived colors remain unchanged under changes to illumination, such as looking at the color of fruit under different lighting.
Light spreads out from a point; focus means the light converges (by restricting the flow of light) to a single image point, and exposure is the time needed to allow enough light through to form an image. In the pinhole camera model, a small pinhole gives sharp focus but a dim image (long exposure), and a large pinhole gives a bright but blurred image (short exposure).
Lenses
A lens can be used with large pinhole cameras to produce a bright and sharp image, the lens keeps light from spreading out (light is refracted):
a lens is positioned between object and image
the thin lens equation is 1/f = 1/ABS(z) + 1/ABS(z'), where f is the focal length, z is the object distance from the lens, and z' is the image distance from the lens
A lens follows the pinhole model (also called the perspective camera model) for objects that are in focus (a projection sketch follows the list):
P is a point with coordinates (x, y, z), P' is its image with coordinates (x', y', z'), and O is the origin (or pinhole, center of lens)
the image plane is located at distance f' from the pinhole, the optical axis is the line perpendicular to the image plane passing through O, and C' is the image center (intersection of the optical axis and the image plane)
by similar triangles, x'/x = f'/z, so x' = f'/z * x, y' = f'/z * y, and z' = f'
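A minimal Python sketch of this projection, assuming points are already expressed in the camera reference frame with z > 0 (the function name and example values are illustrative):

```python
import numpy as np

def project_pinhole(points_3d, f_prime):
    """Project Nx3 camera-frame points onto the image plane at distance f_prime."""
    points_3d = np.asarray(points_3d, dtype=float)
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    x_img = f_prime * x / z   # x' = f'/z * x
    y_img = f_prime * y / z   # y' = f'/z * y
    return np.stack([x_img, y_img], axis=1)

print(project_pinhole([[1.0, 2.0, 4.0]], f_prime=2.0))  # [[0.5 1. ]]
```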
External reference frame
An external reference frame is used when camera is moving or with more than one camera (stereo vision):
O_XYZ is the world reference frame and O_xyz is the camera reference frame
P is the coordinates of a point in frame O_XYZ and t is the coordinates of the origin of frame O_xyz in frame O_XYZ (translation vector)
R is the rotation matrix describing the orientation of frame O_xyz with respect to O_XYZ
p is the coordinates of the point in frame O_xyz
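A minimal sketch of mapping a world point into the camera frame; the relation p = R(P - t) is a common convention consistent with the definitions above, assumed here since the text does not give the formula explicitly (names and values are illustrative):

```python
import numpy as np

def world_to_camera(P_world, R, t):
    """Map a world-frame point P into the camera frame: p = R @ (P - t).

    R describes the orientation of the camera frame O_xyz with respect to the
    world frame O_XYZ, and t is the camera origin expressed in world coordinates.
    """
    return R @ (np.asarray(P_world, dtype=float) - np.asarray(t, dtype=float))

# Identity rotation, camera shifted 1 unit along X: point (2, 0, 5) -> (1, 0, 5)
print(world_to_camera([2, 0, 5], np.eye(3), [1, 0, 0]))
```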
In Euclidean geometry, objects are described as they are: transformations within a 3D world are translations and rotations, and shape is preserved. In projective geometry, objects are described as they appear: transformations from a 3D world to a 2D image include scaling and shear in addition to translation and rotation. A vanishing point is the projection of a point at infinity, such as looking at train tracks in the distance.
A digital image is represented as a 2D array of numbers sampled at discrete points (pixels), and the value at each pixel is the light intensity at that point (0 is black and 255 is white, sometimes 1 is white). An image is usually denoted as I, the origin is in the top-left corner, a point on the image is denoted as p, such as p = (x, y)^T (transposed vector), and the pixel value is denoted as I(p), or I(x, y).
An intensity value is averaged at sampling point, also called pixelization, and represented using finite discrete values (quantization):
1 bit per pixel gives 2 gray levels (binary)
2 bits per pixel gives 4 gray levels
8 bits per pixel gives 256 gray levels
8 bits per pixel in each of 3 color channels (RGB) gives 24 bits (true color)
Charge-coupled devices
A charge-coupled device (CCD) is a semiconductor with a two-dimensional matrix of photo-sensors (photodiodes), where each sensor is a small, isolated capacitive region that can accumulate charge.
The photoelectric effect converts photons on sensors into electrons, the accumulated charge is proportional to light intensity and exposure time. CCDs can be used with colored filters to make different pixels selective to different colors, such as Bayer filter.
Bayer filter
A Bayer filter, or mask, has twice as many green filters as red or blue, samples each color at specific locations, and uses demosaicing, an algorithm that computes the color at each pixel (based on local red, green, and blue values in the subsampled images) to fill in the missing values (a green-channel interpolation sketch follows the list):
nearest neighbor interpolation copies adjacent pixel value in same color channel and is fast but inaccurate
bilinear interpolation takes the average of the nearest two to four pixel values in the same color channel; it is fast and accurate in smooth regions, but inaccurate at edges
smooth hue transition interpolation (only red and blue; green is the same as bilinear interpolation) takes the ratio of bilinear interpolations between red/blue and green
edge-directed interpolation performed on axis where change in value is lowest
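A rough sketch of bilinear interpolation for the green channel only, assuming the green sample locations are given as a boolean mask (the function and variable names are illustrative):

```python
import numpy as np
from scipy.ndimage import convolve

def interpolate_green(mosaic, green_mask):
    """Bilinear interpolation of the green channel from a Bayer mosaic.

    mosaic: 2D array of raw sensor values; green_mask: boolean array that is
    True where the Bayer filter is green. Missing green values are estimated
    as the average of the neighbouring green samples.
    """
    green = np.where(green_mask, mosaic, 0.0)
    kernel = np.array([[0, 1, 0],
                       [1, 0, 1],
                       [0, 1, 0]], dtype=float)
    sums = convolve(green, kernel, mode="mirror")
    counts = convolve(green_mask.astype(float), kernel, mode="mirror")
    estimate = sums / np.maximum(counts, 1)
    return np.where(green_mask, mosaic, estimate)

raw = np.arange(16, dtype=float).reshape(4, 4)
green_mask = (np.add.outer(np.arange(4), np.arange(4)) % 2 == 0)  # checkerboard green sites
print(interpolate_green(raw, green_mask))
```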
The human eye has several parts (listing the relevant ones):
cornea, performs initial refraction at a fixed focus
lens, performs further refraction and can be stretched to change focal length
iris, allows regulation of exposure to light, both as protection and to improve focus
optic nerve, transmits information to the brain
retina, contains light-sensitive photoreceptors, which are farthest from the light and transduce (transform one form of energy into another) light into electric signals as voltage changes
Photoreceptors
A photoreceptor is either a rod, which is highly sensitive and can operate in low light levels, or a cone, which has low sensitivity but is selective for different wavelengths. Blue cones are short-wavelength with peak sensitivity around 440 nm, green cones are medium-wavelength with peak sensitivity around 545 nm, and red cones are long-wavelength with peak sensitivity around 580 nm.
A blind spot has no photoreceptors (it is blind), the fovea has no rods but many cones, and periphery has many rods and few cones. The fovea is high resolution (high density of photoreceptors), color (cones), and low sensitivity (no rods), whereas the periphery is low resolution (low density of photoreceptors), monochrome (rods), and high sensitivity (rods).
Centre-surround RF function
A center-surround receptive field (RF) is an area of visual space from which a neuron receives input:
on-center/off-surround is active if stimulus is brighter than background
off-center/on-surround is active if stimulus is darker than background
A center-surround RF measures change in intensity, or contrast, between adjacent locations. The relative contrast should be independent of lighting conditions, so illuminance should be irrelevant.
The process of applying a filter, or mask, to an image is called convolution, where each location (i, j) in the output image is the weighted sum of pixel values in adjacent locations (the range is defined by k and l), so I'(i, j) = I * H = SUM[I(i - k, j - l) * H(k, l)]. Convolution is commutative, so I * H = H * I, and associative, so (I * H) * G = I * H * G.
So, for each pixel (note that a 2D convolution is separable if H is a convolution of two vectors, H = h_1 * h_2, which is much more efficient; a from-scratch sketch follows below):
center a rotated mask on pixel and multiply each mask element by corresponding image value (assume image values outside boundary is zero)
sum products to get new pixel value
For more convolution examples, see: Example of 2D Convolution.
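A from-scratch sketch of the convolution sum above, assuming zero values outside the image boundary as stated (function and variable names are illustrative):

```python
import numpy as np

def convolve2d(image, mask):
    """Direct 2D convolution with zero padding, following
    I'(i, j) = sum_k sum_l I(i - k, j - l) * H(k, l) for a (2a+1)x(2b+1) mask."""
    image = np.asarray(image, dtype=float)
    mask = np.asarray(mask, dtype=float)
    a, b = mask.shape[0] // 2, mask.shape[1] // 2
    padded = np.pad(image, ((a, a), (b, b)))
    out = np.zeros_like(image)
    for k in range(-a, a + 1):
        for l in range(-b, b + 1):
            # shift the image by (k, l) and weight by H(k, l)
            out += mask[k + a, l + b] * padded[a - k:a - k + image.shape[0],
                                               b - l:b - l + image.shape[1]]
    return out

mean_filter = np.full((3, 3), 1 / 9)   # the box mask from the examples below
print(convolve2d(np.eye(4) * 9, mean_filter))
```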
A filter, or mask, acts as a point-spread function: an image of isolated white dots on a black background would reproduce the mask superimposed at each dot.
A mask is also a template: the convolution output is maximum where large image values line up with large mask values (it responds most strongly at features that look like the rotated mask), so a rotated mask can be used as a template to find image features:
each pixel replaced by itself
0 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 0 |
each pixel replaced by the one to the left (rotated)
0 | 0 | 0 |
0 | 0 | 1 |
0 | 0 | 0 |
each pixel replaced by average of itself and its eight neighbors, also called smoothing, mean filter, or box mask (weights add up to one to preserve average grey levels)
1/9 | 1/9 | 1/9 |
1/9 | 1/9 | 1/9 |
1/9 | 1/9 | 1/9 |
Gaussian mask
A Gaussian mask gives more weight to nearby pixels to reduce hard edges (it is separable): G(x, y) = 1/(2 * pi * sigma ** 2) * exp(-(x ** 2 + y ** 2)/(2 * sigma ** 2)), where sigma is the scale (standard deviation). A sketch follows.
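A minimal sketch that builds the mask from its separable form (the size and sigma values are illustrative):

```python
import numpy as np

def gaussian_mask(size, sigma):
    """Build a normalized 2D Gaussian mask of shape (size, size).

    Because G(x, y) separates into G(x) * G(y), it is built here as an
    outer product of two 1D Gaussians (the separable form mentioned above).
    """
    x = np.arange(size) - (size - 1) / 2
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()                       # weights add up to one
    return np.outer(g, g)

print(gaussian_mask(5, sigma=1.0).round(3))
```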
Difference mask
A difference mask is used to calculate differences instead of averages. The difference between pixel values gives the gradient of intensity values (smoothing approximates mathematical integration, differencing approximates differentiation) and highlights locations with intensity changes:
weights add up to zero to generate zero response to constant image regions
-1 | -1 | -1 |
-1 | 8 | -1 |
-1 | -1 | -1 |
The Laplacian mask is a combination of difference masks in each direction and detects intensity discontinuities at all orientations (sensitive to noise).
An edge, or discontinuity, is often where the intensity changes in an image. Edges can usually be found with difference masks and often correspond to physical characteristics, or features:
depth discontinuity is due to surface at different levels
orientation discontinuity is due to changes in orientation of surface
reflectance discontinuity is due to changes in surface material properties
illuminance discontinuity is due to changes in light (shadow not helpful to recognize object)
Edges are most efficiently detected using both differencing mask and smoothing mask, where one approach is Gaussian smoothing combined with Laplacian, also called Laplacian of Gaussian (LoG):
An example Laplacian of Gaussian mask (weights sum to zero):
-1/8 | -1/8 | -1/8 |
-1/8 | 1 | -1/8 |
-1/8 | -1/8 | -1/8 |
Other methods to find edges include:
Difference of Gaussian (DoG), difference operator combined with Gaussian smoothing (approximation to LoG)
Canny edge detector, convolution with derivatives of Gaussian masks to calculate magnitude and direction, non-maximum suppression, and thresholding using both a low and a high threshold (a DoG sketch follows this list)
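As a rough illustration of the DoG idea, a sketch that subtracts two Gaussian-smoothed versions of the image and thresholds the magnitude (the sigma, k, and threshold values are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_edges(image, sigma=1.0, k=1.6, threshold=0.05):
    """Difference-of-Gaussian response: smooth at two scales, subtract,
    and keep locations where the magnitude is above a threshold."""
    image = np.asarray(image, dtype=float)
    dog = gaussian_filter(image, sigma) - gaussian_filter(image, k * sigma)
    return np.abs(dog) > threshold

# A step edge: left half dark, right half bright
step = np.zeros((16, 16)); step[:, 8:] = 1.0
print(dog_edges(step).astype(int))
```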
Image feature
An image feature can be found at different scales by applying filters of different sizes, or by applying filters of a fixed size to the image at different sizes (most common). A down-sampling algorithm is often used to scale an image by taking every n-th pixel and can be applied recursively to create images at various sizes. However, the results can be good or bad (aliased, not representative of the image), and smoothing can be used to fix bad sampling (a pyramid sketch follows the list below):
Gaussian pyramid, Gaussian smoothing applied recursively to a scaled image, which can be used to create scaled images that are representative to the original image
Laplacian pyramid, Laplacian mask applied recursively to a scaled image, which can be used to detect intensity discontinuities in scaled images.
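A minimal Gaussian pyramid sketch, smoothing before taking every second pixel (the number of levels and sigma are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, levels=4, sigma=1.0):
    """Repeatedly smooth with a Gaussian and down-sample by taking every
    second pixel; smoothing before sampling avoids aliasing."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        smoothed = gaussian_filter(pyramid[-1], sigma)
        pyramid.append(smoothed[::2, ::2])
    return pyramid

for level in gaussian_pyramid(np.random.rand(64, 64)):
    print(level.shape)   # (64, 64), (32, 32), (16, 16), (8, 8)
```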
In our brain, the cortex is responsible for all higher cognitive functions, such as perception, learning, language, memory, and reasoning, and the pathway from our retina to the cortex is not straightforward:
the right visual field (RVF) projects to left side on each retina
ganglion cells on left side in left eye projects to left lateral geniculate nucleus (LGN), where LGN cells have center-surround RFs (like retinal ganglion cells)
ganglion cells on left side in right eye cross over at optic chiasm to left LGN, so RVF projects to left LGN
the left LGN projects to left primary visual cortex (V1), striate cortex, which performs initial low-level processing
V1 to parietal cortex for spatial and motion information
V1 to inferotemporal cortex for identity and categorical information
Primary visual cortex
The primary visual cortex (V1) has receptive fields (some similar to retinal ganglion cells) selective for:
color
orientation, with a stronger response at a specific angle: simple cells act as edge and bar detectors, complex cells act as edge and bar detectors with tolerance to location, and hyper-complex cells are selective to length (optimal when matching the width of the RF)
direction
spatial frequency
eye of origin: monocular cells take input from one eye only, binocular cells respond best to the same input from both eyes
binocular disparity, the difference in location in each eye, which gives the depth of the stimulus
position
V1 has center-surround RFs (like LGN and retina), double-opponent (DO) cells to detect locations where color changes, and hypercolumns, where a hypercolumn is a collection of all neurons with RFs at the same location on the retina (each hypercolumn can process every image attribute).
A 2D Gabor function is a Gaussian multiplied with a sinusoid (a sketch follows the list below):
a cosine function with gaussian, where gaussian function goes towards zero on the edges
simulate simple cells in one orientation, convolve image with Gabor as mask
simulate complex cells in one orientation, sum output from convolving image with Gabor mask at different phases (multiple simple cells)
simulate complex cells in multiple orientations (detect edges), sum output from convolving image with Gabor mask at different phases and orientations
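A sketch of a 2D Gabor mask as a Gaussian envelope times a cosine carrier (parameter values are illustrative); convolving an image with it approximates a simple-cell response at one orientation:

```python
import numpy as np

def gabor_mask(size, sigma, wavelength, theta, phase=0.0):
    """2D Gabor: a cosine grating of given wavelength and orientation theta,
    multiplied by a Gaussian envelope that falls towards zero at the edges."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_rot / wavelength + phase)
    return envelope * carrier

# Summing responses over several phases (and orientations) would approximate
# complex cells, as described in the list above.
mask = gabor_mask(size=15, sigma=3.0, wavelength=6.0, theta=np.pi / 4)
print(mask.shape)
```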
Image components
An image component is a small part of different images that is selected (as a mask) and summed to produce an image, such as a set of handwritten parts used to generate a number as if written by different people.
A set of randomly selected image components in natural images are like Gabor functions, which seem to capture the intrinsic structure of natural images. A set of Gabor functions can represent every image (with sparsity constraint, such as using as few as possible), which is useful for:
image compression, such as jpeg, to save storage space, where jpeg reconstruct each image as linear combination of set of image components
image denoising to reconstruct a noisy image with less noise, components do not include noise
image inpainting to fill-in missing parts in image using a sparse-subset of image components (non-corrupted)
The mid-level vision process involves grouping elements that belong together (intensities, colors, edges, features) and segmenting elements in groups from each other (differentiate between groups). A top-down approach is grouping elements that are on the same object based on internal knowledge or prior experience, whereas bottom-up approach is grouping elements that are similar based on image properties.
Gestalt laws
An object grouped with another object is influenced by Gestalt laws, bottom-up approach, to increase simplicity or likelihood (note that the border-ownership of an object is decided in V2, which comes after V1):
proximity, objects close to each other tend to be grouped together
similarity, objects that are similar tend to be grouped together (shape, colors, size)
closure, objects resulting in a closure tend to be grouped together (circles, boxes)
continuity, objects that can continue tend to be grouped together (straight lines, curves, surfaces)
common fate, objects that have coherent motion tend to be grouped together
symmetry, objects that form symmetric groups tend to be grouped together (shaped object on black or white background)
common region, objects within a region tend to be grouped together (signs, labels)
connectivity, objects that are connected tend to be grouped together (chain, ladder)
simplicity (or Prägnanz, overrides all other), objects tend to be grouped so that the structure is as simple as possible (such as 12 individual dots vs columns, triangle and box stacked on top of each other vs a strange shape)
In the context of vision, features are used to determine which elements belong together (individually or in combination, based on Gestalt laws), such as location (proximity), color (similarity), texture (similarity), size (similarity), depth, motion (common fate), not being separated by a contour (common region), and forming a known shape (top-down approach). A feature space is a coordinate system with image elements, or features, as points, where similarity is determined by the distance between points in the feature space:
similarity is measured by affinity (such as a Gaussian of the distance), cross-correlation, normalized cross-correlation (NCC), or correlation coefficient
distance is measured by Euclidean distance, sum of squared differences (SSD), or sum of absolute differences (SAD)
Features can be weighted differently based on relative importance (finding the best weighting is non-trivial) and can be scaled to make calculations easier, such as into the range 0 to 1.
A region-based method for segmentation tries to group image elements with similar feature vectors (that look similar), such as thresholding (applied to intensity), region growing, region merging, split and merge, k-means clustering, hierarchical clustering, and graph cutting.
An edge-based method partitions an image based on changes in feature values (intensity discontinuities), such as thresholding (applied to edge detectors), Hough transform (model-based, fit data to a predefined model), and active contours (also model-based).
Thresholding
Thresholding methods define regions by differences in intensity values, and the feature space is one-dimensional: I'(x, y) = 1 if I(x, y) is above the threshold, otherwise I'(x, y) = 0 (note that results often have missing parts in the figure, such as edges with missing pixels, or unwanted pixels in the figure):
average intensity, plot the intensity histogram, find the peaks, and place the threshold between the peaks
hysteresis thresholding, define two thresholds: above the high threshold is figure, below the low threshold is background, and between the thresholds is figure only if adjacent to other figure pixels
set threshold so that fraction of image is figure (if fraction of image with objects is known)
local thresholding, thresholds applied to blocks of image that is split into regions, local contrast used to determine threshold (useful when image has shadow or changes to illumination)
Morphological operations can be used to clean up the results of thresholding: the neighborhood is defined by a structuring element (matrix), dilation expands the area of foreground pixels to fill gaps (background pixels that neighbor a foreground pixel are changed from 0 to 1), and erosion shrinks the area of foreground pixels to remove unwanted pixels such as bridges and branches (foreground pixels that neighbor a background pixel are changed from 1 to 0). They can be used in combination to fill and remove gaps (see the sketch after this list):
close gaps, dilation followed by erosion (closing, foreground gaps are filled)
remove noise, erosion followed by dilation (opening, background noise is removed)
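A minimal cleanup sketch combining thresholding with opening and closing, using SciPy's binary morphology (the threshold and structuring element are illustrative):

```python
import numpy as np
from scipy.ndimage import binary_opening, binary_closing

def clean_binary_mask(image, threshold):
    """Threshold an intensity image, then clean the result: opening
    (erosion followed by dilation) removes isolated noise pixels, and
    closing (dilation followed by erosion) fills small gaps in the figure."""
    figure = image > threshold
    structure = np.ones((3, 3), dtype=bool)   # the structuring element
    figure = binary_opening(figure, structure)
    figure = binary_closing(figure, structure)
    return figure

noisy = np.zeros((8, 8)); noisy[2:6, 2:6] = 200; noisy[0, 7] = 255   # a blob plus a speck
print(clean_binary_mask(noisy, threshold=128).astype(int))           # the speck is removed
```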
Example: region growing
seed pixel chosen randomly, set region label
check unlabeled neighboring pixels and add them to the region if they are similar (set the same label)
repeat until the region stops growing
pick another unlabeled seed pixel
repeat until all pixels are assigned to a region (a sketch follows these steps)
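A rough sketch of these region-growing steps; joining a neighbour when its intensity is within a threshold of the seed value is an assumption, since the text leaves the similarity test open:

```python
import numpy as np
from collections import deque

def region_growing(image, threshold):
    """Label pixels by growing regions from unlabeled seeds: a neighbour joins
    the region if its intensity is within `threshold` of the seed value."""
    image = np.asarray(image, dtype=float)
    labels = np.zeros(image.shape, dtype=int)
    next_label = 0
    for seed in zip(*np.nonzero(labels == 0)):       # visit every pixel as a potential seed
        if labels[seed]:
            continue
        next_label += 1
        labels[seed] = next_label
        queue = deque([seed])
        while queue:                                 # grow until the region stops growing
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < image.shape[0] and 0 <= nc < image.shape[1]
                        and labels[nr, nc] == 0
                        and abs(image[nr, nc] - image[seed]) <= threshold):
                    labels[nr, nc] = next_label
                    queue.append((nr, nc))
    return labels

img = np.zeros((5, 5)); img[3:, 3:] = 10
print(region_growing(img, threshold=1))   # two regions: background and the bright block
```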
Example: region merging
set unique label to each pixel, or region (note that result depends on order of merging due to averaging merged pixels)
compare region properties with neighbor regions
continue merging until no more match, mark as final
repeat until all image regions are final
Example: split and merge
set the same label to all pixels, i.e. one region (all pixels belong to the same region)
for each region, split it into four regions if its pixels are not similar, and repeat until no more splits
for each region, compare with neighbors and merge similar regions, and repeat until no more similar regions
A partitional clustering algorithm divide data into non-overlapping subsets, or clusters, where each data point is in exactly one cluster. A few methods to determine similarity between clusters:
single-link, distance between clusters is shortest distance between any of its elements
complete-link, distance between clusters is longest distance between any of its elements
group-average, distance between clusters is average distance between its elements
centroid, distance between clusters is distance between average of feature vectors in each cluster (centroid is point in cluster)
K-means
The k-means clustering algorithm takes the number of clusters, k, as input (it works best with equally sized clusters); a minimal sketch follows the steps below:
randomly pick k cluster centers
allocate each element to the nearest cluster center
compute each new cluster center as the mean position of the elements in the cluster
repeat from step 2 until the cluster centers are unchanged (no new allocations)
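A minimal k-means sketch following these steps (the iteration cap and random seeding details are illustrative):

```python
import numpy as np

def kmeans(features, k, n_iter=100, rng=None):
    """Basic k-means: pick k random centers, assign each feature vector to the
    nearest center, recompute centers as cluster means, repeat until stable."""
    rng = np.random.default_rng(rng)
    features = np.asarray(features, dtype=float)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every feature vector to every center
        distances = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)
        new_centers = np.array([features[assignment == i].mean(axis=0)
                                if np.any(assignment == i) else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assignment

pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(pts, k=2, rng=0)
print(centers.round(1))
```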
Hierarchical
Hierarchical clustering produces a set of nested clusters organized as a tree, such as divisive clustering, where data is regarded as single cluster and then recursively split, or agglomerative clustering, where each data point is regarded as cluster and then recursively merged with most similar cluster.
each data point is cluster
compute proximity matrix (different approaches to determine distance)
merge two closest clusters, update proximity matrix
repeat step 3 until predefined number of clusters
A feature space can be represented as a graph, G = (V, E), where the vertices V represent image elements (feature vectors), the edges E represent connections between pairs of vertices, and each edge is weighted by the similarity between the two connected elements.
A graph cutting process involves:
finding the optimal partitioning (cutting graph into disjoint subgraphs)
similarity between subgraphs is minimized, edges that are cut are weak
similarity within subgraphs is maximized, strong interior links
Normalized cuts (Ncuts) are used to avoid a bias towards small subgraphs (in the cost of cutting the graph). Finding the set of nodes that produces the minimum cut is an NP-hard problem, so only approximations are possible for real images, with a bias towards partitioning into equal segments.
Fitting algorithms use mathematical models to represent a set of elements. A model of the outline of an object can be rotated and translated to compare with an image, and used to represent closely fit data points, or line segments, in an image, where data points that fit the model is grouped together.
Hough transform
The Hough transform is useful for fitting lines, where data points vote on which line to belong to (there are often a lot of potential straight lines):
any point (x, y) in an image could be part of a set of lines that pass through that point
a line has the relationship r = y * cos(a) - x * sin(a)
an accumulator array is used to count votes for each set of parameter values (see the sketch after this list)
can be generalized to fit any shape that can be expressed parametrically
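A sketch of the accumulator idea using the line parameterization above (the array sizes and angle sampling are illustrative):

```python
import numpy as np

def hough_lines(edge_image, n_angles=180):
    """Accumulate votes in (angle, r) space for every edge pixel, using the
    line parameterization r = y*cos(a) - x*sin(a) from the text."""
    ys, xs = np.nonzero(edge_image)
    angles = np.linspace(0, np.pi, n_angles, endpoint=False)
    r_max = int(np.ceil(np.hypot(*edge_image.shape)))
    accumulator = np.zeros((n_angles, 2 * r_max + 1), dtype=int)
    for x, y in zip(xs, ys):
        r = y * np.cos(angles) - x * np.sin(angles)   # one r per candidate angle
        accumulator[np.arange(n_angles), np.round(r).astype(int) + r_max] += 1
    # the strongest line is the cell with the most votes
    a_idx, r_idx = np.unravel_index(accumulator.argmax(), accumulator.shape)
    return angles[a_idx], r_idx - r_max

edges = np.eye(32)              # a diagonal line of edge pixels
print(hough_lines(edges))       # roughly (pi/4, 0)
```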
A generalized Hough transform is an extension to express shapes that cannot be expressed parametrically:
using a table to describe location of each edge pixel in shape relative to some reference point (such as centroid)
for each point in image, assume it is at each edge location and vote for location of reference point
Active contours
The active contour algorithm, also called snakes, should produce result that is near the edge and smooth, where a snake is a curve that moves to minimize energy:
internal energy is determined by shape, bending and stretching increase energy
external energy is determined by proximity to other image features, large intensity gradients decrease energy.
Minimizing the energy of a snake results in a curve that is short, smooth, and close to intensity discontinuities, but it only works on shapes that are closed (no gaps), is sensitive to parameters (smoothness vs length vs proximity), and depends on the initial position, which is often set by placing the contour around the object manually.
Typically, multiple images are used when multiple cameras take two or more images at same time (such as stereo, recover 3D information), one camera take two or more images at different times (such as video, motion tracking), or for object recognition, such as current image and training images.
The correspondence problem refers to the problem of finding matching image elements across different views (need to decide what to search for, intensity, edges). Correspondence require that most scene points are visible in both images and corresponding regions must appear similar, some issues might be:
occlusions, some elements may not have a corresponding element, or not visible
false matches, several elements are similar
changing element characteristics, feature values of corresponding elements may differ between images due to lighting (intensity) and viewpoint (size and shape)
large search space, each element in an image may have many possible matches in another image
A correlation-based method tries to match image intensities over window of pixels to find correspondence:
start from raw image intensity values, match image windows, and compare matches using similarity measure for intensity values
for each region r_1 in I_1, and for each region r_2 in I_2 (both regions are the same size), compute the similarity between r_1 and r_2; the correspondence point is the center of the region in I_2 with the highest similarity, repeated for all regions in I_1
Correlation-based methods
Correlation-based methods are easy to implement and have dense correspondence map (calculate at all points), but computationally expensive, constant or repetitive regions give false matches, and viewpoints cannot be too different.
A feature-based method find correspondence by matching sparse sets of image features (detect interest points in both images):
start from image features extracted from preprocessing, match image features, and compare matches using distance between feature descriptions
for each interest point ip_1 in I_1, and for each interest point ip_2 in I_2, compute the similarity between their feature descriptors and select the ip_2 that maximizes the similarity
Feature-based methods
Feature-based methods are less sensitive to illumination and appearance, computationally cheaper than correlation-based methods (only need to match selected locations instead of every pixel), but have sparse correspondence maps, which is still sufficient for many tasks, and bad with constant or random regions.
The choice of interest points is important, where an interest point is typically a corner, so need to detect same point independently in each image (repeatable detector, invariant to scaling, rotation), and need to recognize corresponding points in each image (distinctive descriptor, sufficiently complex to map with high probability).
Typically, corners are selected as interest points and can be detected by computing intensity gradients (I_x and I_y), convolving the image with derivative of Gaussian masks, then summing the gradient products over a small area (Gaussian window) around each pixel:
build the matrix H from these windowed sums of gradient products; its eigenvalues h_1 and h_2 correspond to the maximum slope of the intensity gradient in two orthogonal directions
mark a corner where min(h_1, h_2) is greater than some threshold
Harris corner detector
The Harris corner detector is invariant to translation and rotation, partly invariant to illumination and viewpoint, but not invariant to scale.
First find the matrix H (as above), then find R:
R = det(H) - k * trace(H) ** 2, where k is a constant in the range 0.04-0.06 (note that k is empirically determined)
R is large at a corner, negative with large magnitude at an edge, and abs(R) is small in flat regions
keep points where R is greater than some threshold and take the local maxima of R as interest points (see the sketch after the example below)
1 | 2 | 2 | 2 |
0 | 1 | [4] | 3 |
0 | 2 | 2 | 2 |
0 | 1 | 1 | 0 |
use non-maximum suppression to find local maxima
0 | 0 | 0 | 0 |
0 | 0 | 4 | 0 |
0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
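A sketch of the Harris response and non-maximum suppression described above, using a Gaussian window for the summed gradient products (the sigma, k, and threshold values are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_response(image, sigma=1.0, k=0.05):
    """Harris corner response R = det(H) - k * trace(H)^2, where H is built
    from Gaussian-weighted sums of gradient products around each pixel."""
    image = np.asarray(image, dtype=float)
    Iy, Ix = np.gradient(image)                  # intensity gradients I_y, I_x
    Sxx = gaussian_filter(Ix * Ix, sigma)        # windowed sums of products
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det = Sxx * Syy - Sxy**2
    trace = Sxx + Syy
    return det - k * trace**2

def harris_corners(image, threshold=1e-4):
    """Keep local maxima of R above a threshold (non-maximum suppression)."""
    R = harris_response(image)
    local_max = maximum_filter(R, size=3) == R
    return np.argwhere(local_max & (R > threshold))

corners = harris_corners(np.pad(np.ones((8, 8)), 8))   # a bright square on a dark background
print(corners)
```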
An image pyramid can be used to detect interest points at different scales (scale invariant interest points). The Harris-Laplacian find local maximum in space and scale, where descriptor (list of features) is small window around interest point, or set of pixel intensity values.
A Scale Invariant Feature Transform (SIFT) finds local maxima using a difference of Gaussians in space and scale:
convolve image with DoG mask and repeat for different resolutions (create Laplacian image pyramid)
detect maxima and minima of difference of Gaussian across scale space
eliminate edge responses by requiring (trace(H) ** 2) / det(H) < [(r + 1) ** 2] / r, where r is usually about 10
The descriptor can be found using this method:
calculate magnitude and orientation of intensity gradient at all pixels around interest point (using Gaussian smoothed image at scale where interest point is found, approximate using pixel differences)
create histogram of all orientations around the interest point
create separate histograms for all orientations in sub-windows
A correct match is an inlier and an incorrect match is an outlier; the approach is to extract features (feature-based methods), compute matches, and find the most likely transformation (the one with the most inliers and fewest outliers).
RANSAC (random sample consensus) can be used to fit a model to a data set with outliers, where the data consist of inliers and outliers and a parameterized model explains the inliers:
randomly pick minimal subset of data points (sample)
fit model to subset
count the data points that are within some threshold t of the model prediction (inliers)
repeat n times and keep the model with the most inliers
RANSAC is simple and effective and works for different model fitting problems, such as segmentation, camera transformation, and object trajectories, but can sometimes require many iterations when there are many parameters, which can be computationally expensive.
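A sketch of RANSAC applied to 2D line fitting, as one concrete instance of the steps above (the threshold t and iteration count are illustrative):

```python
import numpy as np

def ransac_line(points, n_iterations=200, t=1.0, rng=None):
    """Repeatedly sample a minimal subset (two points), fit a line, count
    points within distance t of the line, and keep the best set of inliers."""
    rng = np.random.default_rng(rng)
    points = np.asarray(points, dtype=float)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iterations):
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        direction = p2 - p1
        normal = np.array([-direction[1], direction[0]])
        length = np.linalg.norm(normal)
        if length == 0:
            continue
        normal = normal / length
        distances = np.abs((points - p1) @ normal)    # point-to-line distance
        inliers = distances < t
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

rng = np.random.default_rng(0)
xs = np.linspace(0, 10, 50)
points = np.column_stack([xs, 2 * xs + rng.normal(0, 0.1, 50)])
points[::10] += 20                            # a few gross outliers
print(ransac_line(points, rng=0).sum())       # roughly 45 inliers
```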
A camera projects a 3D point onto a 2D plane, so all 3D points on the same line-of-sight have the same 2D image location and depth information is lost. A stereo image (two images) can be used to recover depth information, where points project to the same location in one image but to different locations in the other image.
Two images can be used to measure how far each scene point's projections are from each other in the two images (different viewpoints; the correspondence problem must be solved), which is useful for:
path planning and collision avoidance
virtual advertising
3D model building from 2D images (such as in google maps)
Disparity, denoted as d, is the difference between the coordinates of two corresponding points (a 2D vector).
A pair of stereo images defines a field of disparity vectors, or disparity map. Coplanar cameras have disparity in the x-coordinates only, where d = x_L' - x_R' = f(x / z) - f[(x - B) / z] = f(B / z), so z = f(B / d) (a numeric sketch follows the list below):
disparity is measured by finding corresponding points in images
if baseline distance is known, measure disparity and calculate depth of point
if baseline distance is not known, calculate relative depths of points from relative disparities
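A tiny numeric sketch of depth from disparity using z = f * B / d (the numbers are made up for illustration):

```python
def depth_from_disparity(disparity, focal_length, baseline):
    """Return depth z given disparity d, z = f * B / d (coplanar cameras)."""
    return focal_length * baseline / disparity

# e.g. f = 700 (pixels), B = 0.12 m, d = 21 pixels  ->  z = 4.0 m
print(depth_from_disparity(21, focal_length=700, baseline=0.12))
```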
An epipolar constraint arises when the cameras are on the same plane, y_L' = y_R', so it is possible to search along a straight line to find the corresponding point in the other image. A maximum disparity constraint limits the length of the search region by the maximum expected disparity, so d_max = f(B / z_min), where z_min is the smallest expected depth.
Other constraints are:
continuity constraint, neighboring points have similar disparities, or relative distance between different corresponding points the same
uniqueness constraint, location in image should only match single point in the other image (exception is line-of-sight)
ordering constraint, matching points along corresponding epipolar (straight line) is same order in images (exception is difference in depth)
A field of view (FOV) is points in image visible to camera, where one camera has one FOV and two cameras have two FOV, and one common FOV visible to both cameras (coplanar):
short baseline gives large common FOV and large depth error, where changes in depth result in small changes in disparity
long baseline gives small common FOV and small depth error, where changes in depth result in large changes in disparity
Non-coplanar cameras are when cameras are at different planes with intersecting optical axes, which gives large common FOV and small depth error, fixation point determined by convergence angle (note that corresponding points still occur on straight lines, or epipolar lines, but not strictly horizontal, so search can be reduced to line):
disparity is measured using angles instead of distances, a_L - a_R, and is zero at the fixation point and at points on the curve where equal angles intersect (the horopter, or zero-disparity curve)
magnitude increases with distance from the horopter and needs to be consistent with signs and order, so a point is outside the horopter if a_L - a_R > 0 and inside if a_L - a_R < 0
In epipolar geometry:
baseline is line through camera projection centers
epipole is projection of optic center of one camera in the image plane of the other camera
epipolar plane is plane going through a particular 3D point and optic centers of both cameras
epipolar lines are intersection of epipolar plane and each image plane
conjugated epipolar lines is epipolar lines generated by the same 3D point in both image planes
epipolar constraints, corresponding points must be on conjugated epipolar lines
A rectification is a transform to make epipolar lines parallel to rows of an image and warps both images so all epipolar lines are horizontal (free transform to treat as coplanar cameras).
Recognising depth
Depth in images can be recognized using different methods:
interposition, view of one object is interrupted by another object (relative depth), manipulating interposition can produce impossible objects, such as Penrose triangle
size familiarity, different sizes of same object appear closer or further away
texture gradients, uniformly textured surfaces get smaller and closely spaced at distance (tiles, waves)
linear perspective, property of parallel lines converging at infinity, manipulating perspective can produce unusual perceptions, such as street 3D paintings and Trompe L'oeil art
aerial perspective, scattering of light by particles in the atmosphere, where distant objects look less sharp, lower contrast, and lower color saturation
shading, distribution of light and shadow on objects (usually light comes from above)
motion parallax, objects closer than fixation point appear to move opposite to observer, and further away move in same direction (twisting)
optic flow, camera moving forward or backward and pattern of stimulation across visual field changes, where points closer to camera move faster
accretion and deletion, parts of object appear and disappear as observer moves relative to two surfaces at different depths
structure from motion (kinetic depth), movement of object or camera can induce perception of 3D structure (spinning cylinder in flat image)
A video is a series of n images, also referred to as frames, at discrete time instants. A static image has the intensity of a pixel as a function of spatial coordinates x and y, so I(x, y), whereas a video has the intensity of a pixel as a function of spatial coordinates x, y and time t, so I(x, y, t).
An optic flow is the change in position from one image to the next: an optic flow vector is the image motion of a scene point, and an optic flow field is the collection of all optic flow vectors, which can be sparse or dense (defined only at specified features or defined everywhere). An optic flow field approximates the motion field, the true image motion of scene points given by the actual projection of relative motion between camera and 3D scene, but is not always accurate (smooth surfaces, moving light sources). It is measured by finding corresponding points in different frames (note that a discontinuity in an optic flow field indicates different depths, so different objects):
feature-based methods extract features from around point to find similar features in next frame (like stereo correspondence)
direct methods recover image motion at each pixel from temporal variations of image brightness (convolve image with spatio-temporal filters)
Constraints for finding corresponding points in video:
spatial coherence, similar neighboring flow vectors are preferred over dissimilar, assuming scene points are smooth surfaces and closer points more likely to belong to same surface
small motion, small optic flow vectors are preferred over large, assuming relative velocities are slow compared to frame rate and motion between frames is likely small compared to size of image
An optic flow is measured to estimate layout of environment, such as depth and orientation of surfaces, estimating ego motion (camera velocity relative to visual frame of reference), estimating object motion relative to visual frame of reference or environment frame of reference, and to predict information for control of action.
The aperture problem is the inability to determine optic flow along direction of brightness pattern (no edges or corners along straight lines), where any movement with a component perpendicular to an edge is possible. It is solved by combining local motion measurements across space.
Example: recovering depth (perpendicular)
Below is an example of recovering depth from velocity if direction of motion is perpendicular to optical axis and velocity of camera is known:
x = f(X / Z)
Z = f(X_1 / x_1) = f(X_2 / x_2)
x_dot = (x_2 - x_1) / t (image velocity)
X_1 * x_2 = X_2 * x_1
X_1 * x_2 = (X_1 - V * t) * x_1
X_1 * (x_2 - x_1) = -V * t * x_1
X_1 = -V * x_1 / x_dot
=> Z = f(X_1 / x_1) = -f(V / x_dot)
Example: recovering depth (along optical axis)
Below is an example of recovering depth from velocity if direction of motion is along optical axis and velocity of camera is known:
x = f(X / Z)
fX = x_1 * Z_1 = x_2 * Z_2
x_dot = (x_2 - x_1) / t (image velocity)
x_1 * (Z_2 + V * t) = x_2 * Z_2
x_1 * V * t = (x_2 - x_1) * Z_2
=> Z_2 = V * x_1 / x_dot
Parallel and radial optic flow fields
A parallel optic flow field, where depth velocity is zero, is when all optic flow vectors are parallel, direction of camera movement is opposite to direction of optic flow field, speed of camera movement is proportional to length of optic flow vectors, and depth is inversely proportional to magnitude of optic flow vector (like motion parallax with fixation on infinity).
A radial optic flow field, where depth velocity is not zero, is when all optic flow vectors point towards or away from vanishing point, direction of camera movement is determined by focus of expansion or focus of contraction, destination of movement is focus of expansion, depth is inversely proportional to magnitude of optic flow vector and proportional to distance from point to vanishing point.
Tracking algorithms
A tracking algorithm matches high-level features across several frames to track objects. It uses previous frames to predict the location in the next frame (Kalman filter, particle filter) to restrict the search and do less work. Measurement noise is averaged to get better estimates, where a matched object should be near the predicted location.
Segmentation from motion is typically done using optic flow discontinuities, optic flow and depth, or Gestalt law of common fate:
image differencing subtracts successive frames pixel by pixel to create a binary image: given frames I(x, y, t) and I(x, y, t + 1), a pixel is marked when abs[I(x, y, t + 1) - I(x, y, t)] > T, where T is some threshold; intensity levels change the most in regions with motion (note that it is not optimal when the background matches the object, or for smooth objects)
background subtraction uses the background as a reference image and subtracts it from each frame; the background model B can be the previous frame (adjacent frame difference, like image differencing), B(x, y) = I(x, y, t - 1), an average of previous frames, B(x, y) = 1/N * sum[I(x, y, t)], or a moving average, B(x, y) = (1 - beta) * B(x, y) + beta * I(x, y, t)
Background subtraction
In background subtraction, new objects that are temporarily stationary are seen as foreground, and dilation and erosion can be used to clean up the result: given background B(x, y) and frame I(x, y, t), a pixel is foreground if abs[I(x, y, t) - B(x, y)] > T, where T is some threshold (a minimal sketch follows the list):
for each frame t = 1 .. N, update the background model, B(x, y) = (1 - beta) * B(x, y) + beta * I(x, y, t)
compute the frame difference abs[I(x, y, t) - B(x, y)] and threshold it, * > T
remove noise by closing the thresholded difference, close(*_T)
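A minimal sketch of moving-average background subtraction following the update rule above (the beta and threshold values are illustrative; the noise-removal closing step from the list is omitted here):

```python
import numpy as np

def moving_average_subtraction(frames, beta=0.05, threshold=30):
    """Running-average background subtraction: update
    B = (1 - beta) * B + beta * I, mark pixels with |I - B| > T as foreground."""
    frames = np.asarray(frames, dtype=float)
    background = frames[0].copy()
    masks = []
    for frame in frames[1:]:
        background = (1 - beta) * background + beta * frame
        masks.append(np.abs(frame - background) > threshold)
    return masks

frames = np.zeros((5, 4, 4)); frames[3:, 1, 1] = 255     # an object appears in frame 3
print([int(m[1, 1]) for m in moving_average_subtraction(frames)])   # [0, 0, 1, 1]
```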
An object recognition task is often to determine the identity of an individual instance of an object, such as recognizing different phone models, or people.
A classification task is to determine category of an object, such as human vs ape, or phone vs calculator, where each category has a different level:
abstract categories are objects, such as animal, man-made, and mammal
basic level is lowest level where objects have distinct features (apple or banana is easier than Rex vs Annie), humans are fast at recognizing objects in category and start at basic level categorization before identification
specific categories are names, such as poodle, doberman, Rex, and Annie
A localization task determines presence and location of an object in image, such as finding faces and cars, where image segmentation is used to determine location of multiple different objects in image (semantic segmentation).
Object recognition should be sensitive to small image differences that are relevant to different categories of objects (phone vs calculator) and insensitive to large image differences that do not affect the identity or category of the object (mobile phone vs old phone), such as background clutter and occlusion, viewpoint, lighting, non-rigid deformations (the same object changing shape, e.g. wind blowing in a tree), or variation within a category (different types of chairs).
Note that it is generally hard to distinguish between similar objects (many false positives), and global representation is sensitive to viewpoint and occlusion (many false negatives), so one approach to object recognition is to use intermediate complexity between local and global, or hierarchy of features with range of complexities, or sensitivity.
The object recognition process is to associate information extracted from images with objects, which requires accurate image data, representations of objects, and some matching technique:
extract representations from training examples, local vs global features, 2D vs 3D, pixel intensities vs other features
extract representation from image and match with training examples to determine category or identity, top-down vs bottom-up, measure of similarity, classification criteria
Note that an image can be represented as:
pixel intensities, feature vectors, or geometry
2D (image-based), 3D (object-based)
local features, global features
The most common methods used for object recognition are sliding window, ISM, SIFT, and bag-of-words, where matching procedures are:
top-down, which is generative and expectation-driven, model object to find in image, such as model-based methods, templates, and edge matching
bottom-up, which is discriminative and stimulus-driven, extract description and match to descriptions in database, such as SIFT, bag-of-words, intensity histograms, and ISM
Template matching
Template matching is a general technique for object recognition, where an image of the object to be recognized is represented as a template, which is an array of pixel intensities (note that the template needs to be very similar to the target object, so not scaled or rotated, and is sensitive to occlusion); a sketch follows the steps below:
search each image region
calculate similarity between template and image region using similarity measures
pick best match or above threshold
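A sketch of template matching with normalized cross-correlation as the similarity measure, written as a direct, unoptimized loop (function names are illustrative):

```python
import numpy as np

def match_template_ncc(image, template):
    """Slide the template over the image and score each region with the
    normalized cross-correlation (NCC); higher scores mean better matches."""
    image = np.asarray(image, dtype=float)
    template = np.asarray(template, dtype=float)
    th, tw = template.shape
    t_norm = (template - template.mean()) / (template.std() + 1e-9)
    scores = np.full((image.shape[0] - th + 1, image.shape[1] - tw + 1), -1.0)
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            region = image[i:i + th, j:j + tw]
            r_norm = (region - region.mean()) / (region.std() + 1e-9)
            scores[i, j] = (t_norm * r_norm).mean()
    # best match is the top-left corner of the highest scoring region
    return np.unravel_index(scores.argmax(), scores.shape), scores

rng = np.random.default_rng(0)
img = rng.random((20, 20))
template = img[5:9, 7:12]                      # cut a patch out of the image itself
print(match_template_ncc(img, template)[0])    # expected (5, 7)
```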
Sliding window
A sliding window method is a template matching method with a classifier for each image region, where each image region is warped to increase tolerance to changes in appearance. The classifier, which could be a deep neural network (CNN), then determines if image region contains the object instead of simple intensity value comparison. However, it is very computationally expensive (pre-processing using image segmentation can reduce number of image regions to classify, sky not likely to have cows in it).
Edge-matching
In edge matching, an image of the object is represented as a template that is pre-processed to extract edges. Search each image region and calculate the average minimum distance between points on the edge template, T, and points on the edge image, I, so D(T, I) = 1/|T| * sum[d_I(t)], where d_I(t) is the distance from template point t to the closest edge point in I (the minimum value is the best match).
Model-based methods
A model-based method is model of object identity and pose, where object is rendered in image, or back-project, and compared to image (edges in model match edges in image). It is represented as 2D or 3D model of object shape, and compared using edge score, image edges near predicted object edges (unreliable), or oriented edge score, image edges near predicted object edges with correct orientation. It is very computationally expensive.
Intensity histogram
An intensity histogram method represents an object as a histogram of pixel intensity values (grayscale or color); color histograms are commonly used in face detection and recognition algorithms. Histograms are compared to find the closest match, which is fast and easy to compute, and matches are insensitive to small viewpoint changes. However, it is sensitive to lighting and within-category changes to appearance, and insensitive to spatial configuration, where images with different configurations but similar color distributions produce the same histogram.
ISM
An implicit shape model (ISM) method is where 2D image fragments, or parts, are extracted from image regions around each interest point (Harris detector):
image fragments can be collected from training images and clustered into sets with similar image regions (use cluster center as key)
image fragments match set features to training images using template matching, where each set feature finds possible object centers (features on car points towards middle)
find interest points in input image, match extracted features with features in set, and then matched features vote on object center
Feature-based methods
A feature-based method is when training image content is transformed into local features, so invariant to translation, rotation, and scale. Local features are extracted from input image and then features are matched with training image features.
In SIFT feature matching, the descriptor is a 128-element histogram of orientations of intensity gradients, binned into 8 orientations over a 4x4 grid of windows around the interest point (4 x 4 x 8 = 128), normalized with the dominant orientation vertical (a matching sketch follows the list below):
locality, features are local and insensitive to occlusion and clutter; distinctiveness, features can be matched to a database of objects; quantity, many features can be generated for small objects; efficiency, close to real-time performance
set of interest points obtained from each training image, each has descriptor (128 components), all sets stored in database with key as interest point
an input image gives a new set of interest point and descriptor pairs; for each pair, find the two best matching descriptors in the database, and accept the match if the ratio of the distance to the first nearest descriptor over the distance to the second is within some threshold (RANSAC can be used to check that matches are consistent)
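A sketch of the ratio test for matching descriptors (the 0.8 ratio is illustrative, and the database here is simply an array of descriptors):

```python
import numpy as np

def ratio_test_matches(descriptors_a, descriptors_b, ratio=0.8):
    """Match descriptors (e.g. 128-element SIFT vectors) by keeping a match
    only if the nearest neighbour is clearly closer than the second nearest."""
    descriptors_a = np.asarray(descriptors_a, dtype=float)
    descriptors_b = np.asarray(descriptors_b, dtype=float)
    matches = []
    for i, d in enumerate(descriptors_a):
        distances = np.linalg.norm(descriptors_b - d, axis=1)
        first, second = np.argsort(distances)[:2]
        if distances[first] < ratio * distances[second]:
            matches.append((i, int(first)))
    return matches

db = np.eye(4)                                  # four toy 4-element descriptors
print(ratio_test_matches([[1, 0, 0, 0.1]], db))  # [(0, 0)]
```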
Bag-of-words
A bag-of-words method is when different objects have distinct set of features that occur in different frequencies:
find interest points using regular grid, interest point detectors, or randomly, then extract descriptor from image regions around interest points (also referred to as words)
objects have a distribution of feature occurrences, or histogram, and cluster descriptors can be used to find cluster center (word stem, set of clusters represent vocabulary), where most frequent words that occur in most images are removed
input image is matched by comparing distance between histograms
A geometric invariant is some property of an object in a scene that does not change with viewpoint (note that geometric invariant methods compare values of the cross-ratio in an image with cross-ratios in training images, but are sensitive to occlusion and to the availability and distinctiveness of points):
in Euclidean space, which include translation and rotation, invariants are lengths, angles, and areas
in similarity space, which include translation, rotation, and scale, invariants are ratios of length and angles
in affine space, which include translation, rotation, scale, and shear, invariants are parallelism, ratios of lengths along lines, and ratio of areas
in projective space, which include translation, rotation, scale, shear, and foreshortening, invariants are cross-ratio, which is ratio of ratios of lengths, and is constant from any viewpoint
Object-based theory, or recognition-by-components, is a concept where an object is represented as 3D model with an object-centered reference frame.
Objects are stored in our brain as structural descriptions (parts and configuration), and each object is a collection of shapes (or geometric components), such as human head and body could be represented as a cube shape above cylinder shape.
Geometric components are primitive elements (or geons, letters forming words), such as cube, wedge, pyramid, cylinder, barrel, arch, cone, expanded cylinder, handle, expanded handle, and different combinations of geons can be used to represent a large variety of objects:
each geon is sufficiently different from each other, so robust to noise and occlusion, view-invariant (same set of geons in same arrangement in different views)
objects can be matched by finding elements and configuration
Image-based theory is a concept where an object is represented by multiple 2D views with a viewer-centered reference frame and matched using template matching, which is an early version of an image-based approach (less flexible compared to human recognition), or multiple views approach, which encode multiple views of an object through experience (templates for recognition).
Classification
Classification assigns objects to different categories based on features in a feature space, where category membership can be defined by abstract rules (such as "has three sides" or "has four legs and barks"):
prototypes, calculate average for all instances in each category, where each new instance is compared and assigned to nearest one (nearest mean classifier, decision boundary is linear)
exemplars, specific instances in each category are stored as exemplars, where each new instance is compared and assigned to nearest one (nearest neighbor classifier), or best match
k-nearest exemplars, assign to the most common category among the k nearest stored exemplars (such as the k-nearest neighbors classifier, with k small and odd to break ties)
A nearest neighbor classifier has a non-linear decision boundary and does not deal well with outliers. The k-nearest neighbors classifier also has a non-linear decision boundary but reduces the effect of outliers.
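A minimal k-nearest-neighbours sketch of the exemplar-based approach (the value of k and the Euclidean distance measure are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(train_features, train_labels, query, k=3):
    """k-nearest neighbours: assign the query to the most common label among
    the k closest training exemplars (k small and odd to break ties)."""
    train_features = np.asarray(train_features, dtype=float)
    distances = np.linalg.norm(train_features - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(np.asarray(train_labels)[nearest])
    return votes.most_common(1)[0][0]

print(knn_classify([[0, 0], [0, 1], [5, 5], [6, 5]], ["a", "a", "b", "b"], query=[5, 4]))  # "b"
```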
To determine similarity (using similarity measures):
minimum distance measures
maximum similarity measures
In the cortical visual system, there are pathways for different kinds of information, such as spatial and motion (the where and how), which goes from V1 to parietal cortex, or identity and category (the what), which goes from V1 to inferotemporal cortex.
The receptive fields become greater further down a pathway, like a hierarchy, or progression:
center-surround cells, such as in the retina and LGN, respond to isolated spots of contrasting intensity
simple cells, such as in V1, respond when multiple co-aligned center-surround cells are active, and to edges and bars at a specific orientation, contrast, and location
complex cells, also in V1, respond when multiple similarly oriented simple cells are active and to edges and bars at a specific orientation, but at any contrast and location
Feedforward model
A feedforward model is a stack of layers of neurons with progressively more complex receptive fields at progressively less specific locations, where features at one stage are built from features at earlier stages (hierarchical).
Hierarchical max-pooling (HMAX) is a deep neural network in which different mathematical operations increase the complexity of receptive fields, using alternating layers of neurons with different properties, simple (S) and complex (C). The simple cells are sums (similar to a logical and operation) and are used to increase selectivity, whereas the complex cells are max operations (similar to a logical or operation) and are used to increase invariance.
A convolutional neural network (CNN) is a deep neural network similar to HMAX, but alternating layers with convolution and sub-sampling.