
Image Recognition: Computers Learn to See

The human ability to process visual information in as little as 150 milliseconds is outstanding. Humans naturally distinguish between different objects or regions in their field of vision (e.g. beach and ocean) and also recognize what objects they see. Imitating this ability is a large research area in artificial intelligence and computer vision, and recent advances in computing power and algorithm design have produced astonishing results. A neural network by Szegedy et al. called Inception-v4 reached a top-1 error rate of 20.0% and a top-5 error rate of 5.0% on a dataset (ILSVRC-2012) with 1000 different classes (dog, flower, car etc.). In other words, it assigns an image to exactly the correct class 4 out of 5 times, and if we look at the 5 most probable classes the network outputs, the correct class is among them 95% of the time. The neural network YOLOv3, which can even be run in real time, shows equally impressive results in localizing and classifying objects:

Figure 1: Object detection by YOLOv3 [1]
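To make the top-1 and top-5 metrics above concrete, here is a minimal NumPy sketch of how both are computed from a network's class scores (the scores, the toy setting of 6 classes and the labels below are made up purely for illustration):

    import numpy as np

    # Hypothetical class scores for 4 images over 6 classes (one row per image).
    scores = np.array([
        [0.10, 0.60, 0.05, 0.10, 0.10, 0.05],
        [0.30, 0.20, 0.25, 0.05, 0.10, 0.10],
        [0.05, 0.05, 0.70, 0.10, 0.05, 0.05],
        [0.15, 0.15, 0.15, 0.40, 0.10, 0.05],
    ])
    true_classes = np.array([1, 2, 2, 0])  # ground-truth class index per image

    top1 = scores.argmax(axis=1)               # single most probable class
    top5 = np.argsort(scores, axis=1)[:, -5:]  # 5 most probable classes

    top1_acc = (top1 == true_classes).mean()
    top5_acc = np.array([t in row for t, row in zip(true_classes, top5)]).mean()

    print(f"top-1 accuracy: {top1_acc:.2f}")  # best guess is exactly right
    print(f"top-5 accuracy: {top5_acc:.2f}")  # truth is among the best 5 guesses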


At this point you might be wondering: what information are computers extracting in image recognition, and how do they actually extract it from raw pixels?

Image recognition by computers can be divided into several subcategories, whose definitions vary a bit. We're going to look at the two main classes: image classification and image segmentation. Image classification is the task where the output is a class: the network will, for example, output "dog" if there's a dog in the image, or "cat", "bird" and so on (not restricted to animals). It can also output multiple classes if several recognizable objects are in the image. Image segmentation goes further: instead of treating the pixels of an image as a whole, as in image classification, it assigns each individual pixel to a class. This is a powerful concept for distinguishing between different segments of an image and localizing objects in it. Partitioning an image into segments can be done with several methods, such as thresholding, edge detection or deep learning methods.
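As a taste of the simplest of these methods, here is a minimal NumPy sketch of segmentation by thresholding (a synthetic image and an illustrative cutoff value): every pixel is assigned a class purely based on its intensity.

    import numpy as np

    # Synthetic 6x6 grayscale image: a bright "object" on a dark background.
    image = np.array([
        [ 10,  12,  11,  10,  13,  10],
        [ 11, 200, 210, 205,  12,  10],
        [ 10, 198, 220, 215,  11,  12],
        [ 12, 205, 210, 202,  10,  11],
        [ 10,  11,  12,  10,  13,  10],
        [ 11,  10,  10,  12,  11,  10],
    ])

    # Thresholding: every pixel above the cutoff is class 1 (object), else class 0.
    threshold = 100
    segmentation = (image > threshold).astype(int)
    print(segmentation)  # a per-pixel class map with the same shape as the input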

But how does all of this actually work? In the following, we're going to focus on deep learning methods for image classification and segmentation.

A common class of neural network used for image recognition tasks is the Convolutional Neural Network (CNN). The goal of a CNN is to extract high-level features (shapes, lines etc.) from low-level features (pixels) by applying a series of mathematical operations. The idea is to extract features that characterize the objects in the image as closely as possible. The usual way of going from low-level to high-level features is to pass the image to a convolutional layer, which slides (convolves) a filter/kernel (e.g. of size 5x5, 3x3 or 2x2 pixels) over the whole image. The input area covered by the filter is called the receptive field; it is multiplied element-wise with the filter and then summed up. Doing this for all receptive fields yields a new array, called the feature or activation map. Afterwards, a max pooling layer is applied, which also slides over the feature map and returns the highest value in each receptive field of the activation map. Max pooling is used to avoid overfitting and to reduce the spatial dimensions while retaining as much information as possible. A simplified example with a 3x3 pixel input can be seen below:

Figure 2: Series of convolution, activation function (ReLU) and max pooling
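The same chain can be written out in a few lines of NumPy. This is a rough sketch with illustrative filter weights, mirroring Figure 2: each receptive field is multiplied element-wise with the filter and summed, ReLU zeroes out negative responses, and max pooling keeps the largest remaining value.

    import numpy as np

    image = np.array([[1., 0., 2.],
                      [3., 1., 0.],
                      [0., 2., 1.]])    # 3x3 input
    kernel = np.array([[ 1., -1.],
                       [-1.,  1.]])     # 2x2 filter (made-up weights)

    # Convolution: slide the filter over every 2x2 receptive field,
    # multiply element-wise and sum up.
    feature_map = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            receptive_field = image[i:i+2, j:j+2]
            feature_map[i, j] = np.sum(receptive_field * kernel)

    activated = np.maximum(feature_map, 0)  # ReLU: negative responses become 0
    pooled = activated.max()                # 2x2 max pooling over the 2x2 map
    print(feature_map, activated, pooled, sep="\n")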
This series of convolution, activation and max pooling can be repeated several times, which yields a convolutional neural network. For image classification, a fully connected layer is added at the end, connecting every neuron in the previous layer to every neuron in the following layer. This can be seen in the illustration below:
Figure 3: Convolutional Neural Network for classification [2]
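In code, such a network can be stacked up quickly. Below is a minimal PyTorch sketch (layer sizes, input size and class count are arbitrary choices, not the network from Figure 3): two rounds of convolution, ReLU and max pooling, followed by a fully connected layer that maps the flattened feature maps to class scores.

    import torch
    import torch.nn as nn

    class SimpleCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, padding=1),  # 1 channel -> 8 feature maps
                nn.ReLU(),
                nn.MaxPool2d(2),                            # halve the spatial dimensions
                nn.Conv2d(8, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            # Fully connected layer: every flattened feature connects to every class score.
            self.classifier = nn.Linear(16 * 7 * 7, num_classes)  # assumes 28x28 inputs

        def forward(self, x):
            x = self.features(x)
            x = x.flatten(start_dim=1)
            return self.classifier(x)

    model = SimpleCNN()
    scores = model(torch.randn(1, 1, 28, 28))  # one 28x28 grayscale image
    print(scores.shape)                        # torch.Size([1, 10]), one score per class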
An essential part of a CNN are its filters and their weights. A filter can be seen as a shape detector: it will produce a strong response if the receptive field passed to it contains a similar shape. In the example below, the response is strong where the receptive field resembles the vertical line of the filter, and weak where the shapes do not match:
Figure 4: Strong and weak resemblance of receptive field and filter
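The effect in Figure 4 can be reproduced numerically. In this sketch (hand-picked 3x3 patches and filter weights), a vertical-line filter responds strongly to a matching patch and weakly to a horizontal one:

    import numpy as np

    # A 3x3 filter that detects a vertical line in the middle column.
    vertical_filter = np.array([[-1., 2., -1.],
                                [-1., 2., -1.],
                                [-1., 2., -1.]])

    vertical_patch = np.array([[0., 1., 0.],
                               [0., 1., 0.],
                               [0., 1., 0.]])    # receptive field with a vertical line
    horizontal_patch = np.array([[0., 0., 0.],
                                 [1., 1., 1.],
                                 [0., 0., 0.]])  # receptive field with a horizontal line

    print(np.sum(vertical_patch * vertical_filter))    # 6.0 -> strong response
    print(np.sum(horizontal_patch * vertical_filter))  # 0.0 -> weak response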
The weights in the filters are adapted during the training of the network. Simply speaking, we show the CNN many images and tell the network which class(es) it is seeing in each image. Through an optimizer and a loss function, which measures the deviation of the actual output from the true output, the weights in the network are updated during training according to how strongly each weight contributes to the loss. After a certain number of training cycles, the CNN will be able to recognize the classes (e.g. a car) in new images it hasn't seen before. If you are further interested in the training of neural networks in general, have a look at the concepts of backpropagation, loss functions and optimizers.
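To make the training step concrete, here is a rough PyTorch sketch of such a training loop (random stand-in data; cross-entropy loss and SGD are illustrative choices, not the only options):

    import torch
    import torch.nn as nn

    # A tiny stand-in classifier: one conv/ReLU/maxpool round plus a fully connected layer.
    model = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(8 * 14 * 14, 10),               # assumes 28x28 inputs, 10 classes
    )
    loss_fn = nn.CrossEntropyLoss()               # deviation of output from true labels
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Stand-in batch: 4 random "images" with made-up class labels.
    images = torch.randn(4, 1, 28, 28)
    labels = torch.tensor([3, 1, 0, 7])

    for step in range(100):                       # a few training cycles
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)     # how wrong is the network right now?
        loss.backward()                           # backpropagation: each weight's contribution
        optimizer.step()                          # update the weights accordingly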

For image segmentation, the feature extraction part is the same as in image classification, but since the desired output is not a class but an image with the segment regions, it is necessary to upsample the high-level feature maps (which, through the max pooling, have become smaller in their dimensions than the original image). Upsampling increases the resolution in each step and, after several steps, gives back an output image with the same dimensions as the input image, but containing the desired segmentation. This can be done with transposed convolutions, which are essentially the reverse operation of a convolution: they increase the spatial dimensions instead of reducing them. One popular network used for image segmentation is U-Net, shown in the figure below:

Figure 5: Visualization of U-Net [3]
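The upsampling step can be sketched in a few lines of PyTorch (channel counts chosen for illustration): a transposed convolution with stride 2 doubles the spatial dimensions of a feature map, the opposite of what a stride-2 pooling does.

    import torch
    import torch.nn as nn

    feature_map = torch.randn(1, 16, 8, 8)  # 16 low-resolution feature maps

    # Transposed convolution: stride 2 doubles height and width.
    upsample = nn.ConvTranspose2d(in_channels=16, out_channels=8,
                                  kernel_size=2, stride=2)
    upsampled = upsample(feature_map)
    print(upsampled.shape)                  # torch.Size([1, 8, 16, 16])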
Another concept introduced in U-Net are skip connections: the high-resolution features from the downsampling path are concatenated with the upsampled features for better localization and a more accurate output. To train a CNN for image segmentation, it is fed images as features and segmentation maps as labels, whereas in image classification the features are the images and the labels are the classes.
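The skip connection itself is just a concatenation along the channel dimension, as in this small sketch (shapes chosen for illustration):

    import torch

    # High-resolution features saved from the downsampling path...
    encoder_features = torch.randn(1, 8, 16, 16)
    # ...and the upsampled features arriving from below (same spatial size).
    decoder_features = torch.randn(1, 8, 16, 16)

    # Skip connection: concatenate along the channel dimension, so later
    # layers see both coarse and fine-grained information.
    combined = torch.cat([encoder_features, decoder_features], dim=1)
    print(combined.shape)  # torch.Size([1, 16, 16, 16])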

As you can see from this introduction, image recognition and CNNs rely on a lot of math!



Sources:
[1]: https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b
[2]: https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html
[3]: Ronneberger, Olaf; Fischer, Philipp; Brox, Thomas (2015). "U-Net: Convolutional Networks for Biomedical Image Segmentation". arXiv:1505.04597