The field of AI image processing is composed of several branches. In this article we will discuss the classification of medical images such as CT or MRI, in grayscale, by deep learning methods.
The main families of image classification models
Image classification has seen a major breakthrough in performance thanks to the rise of convolutional neural networks (CNNs).
Before convolutional neural networks, the most commonly used machine learning classification methods for images were the algorithms:
- k-Nearest Neighbor (k-NN),
- Support Vector Machine (SVM), and
- Random Forest (RF) [1].
How do convolutional neural networks work?
Artificial intelligence is the use of methods and techniques to imitate human intelligence. In this part we will explain the functioning of an artificial neuron, then present the architecture of convolutional neural networks and its main components.
What is an artificial neuron?
An artificial neuron is a set of mathematical operations. First, a weight and a bias are applied in an affine way to an input value; in computer vision, this input is the value of a pixel. Then, an activation function is applied to the intermediate result to represent the data in that function's data space. This activation function is usually non-linear, because non-linearity makes it possible to represent complex data for which a linear combination does not work.
Credits: [2]
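As a minimal sketch of this affine-then-activation computation (the input, weight and bias values below are illustrative, and tanh is just one common choice of non-linear activation):

```python
import numpy as np

def neuron(x, w, b, activation=np.tanh):
    """A single artificial neuron: an affine combination of the inputs
    (weighted sum plus bias), followed by a non-linear activation."""
    z = np.dot(w, x) + b   # affine step
    return activation(z)   # non-linear step

# Illustrative input: three pixel intensities in [0, 1].
x = np.array([0.2, 0.5, 0.1])
w = np.array([0.4, -0.3, 0.8])  # weights (arbitrary values)
b = 0.1                         # bias
y = neuron(x, w, b)
```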
What shape does the convolutional neural network architecture have?
This network architecture is inspired by the functioning of the visual cortex of animals. The visual field analysis is performed through a set of overlapping sub-regions that tile the image. Each sub-region is analyzed by a neuron in the animal’s brain to pre-process small amounts of information. This is called convolutional processing [2].
The architecture of a convolutional neural network is formed by a succession of building blocks that extract the features which discriminate the class of the image from the others. A building block consists of one or more:
- convolutional layers (CONV) that process the data of a receptor field;
- correction layers, commonly called "ReLU" by abuse of language, after the activation function they apply (Rectified Linear Unit);
- pooling layers (POOL), which compress information by reducing the size of the intermediate image (often by sub-sampling).
The building blocks succeed one another until the final layers of the network, which perform the image classification and compute the error between the prediction and the target value:
- fully connected (FC) layer, which is a perceptron-like layer;
- loss layer (LOSS) [2].
What characterizes a network's architecture is the way the convolutional, correction and pooling layers succeed one another within the building blocks, and the way the building blocks themselves succeed one another. These arrangements are the product of applied research work. In another article, we will see which CNN architectures are most commonly used in medical image classification.
How does a convolution work?
The image is:
- cut into sub-regions, called tiles
- analyzed by a convolutional kernel
This convolutional kernel is the size of a tile, often 3*3 or 5*5. The area analyzed (the receptive field) is slightly larger than the tile, as an overlap margin is added so that neighboring receptive fields overlap. This trick gives a better representation of the image and improves the consistency of its processing.
The analysis of the image characteristics by the convolutional kernel is a filtering operation with weights associated to each pixel. Applying the filter to the image is called a convolution [2].
After a convolution, a feature map is obtained: an abstract representation of the image. Its values depend on the parameters of the applied convolution kernel and on the pixel values of the input image.
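The convolution described above can be sketched as follows (the 4*4 image is illustrative; the kernel is a vertical Sobel filter, a classic hand-crafted edge detector, and this is the "valid", no-padding variant with a configurable stride):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """'Valid' 2-D convolution (no padding): slide the kernel across the
    image and compute a weighted sum of pixels at each position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A vertical Sobel kernel responds strongly to vertical contours.
sobel = np.array([[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]])
# Illustrative 4*4 image: dark left half, bright right half.
image = np.tile([0.0, 0.0, 1.0, 1.0], (4, 1))
fmap = convolve2d(image, sobel)  # 2*2 feature map with strong responses
```

Every value of the resulting feature map lights up because the vertical edge between the dark and bright halves falls inside each receptive field.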
What is a convolutional layer (CONV)?
A convolutional layer is a stack of convolutions. Indeed, several convolutional kernels run through the image, leading to several output feature maps. Each convolutional kernel has parameters specific to the information sought in the image (for example, a Sobel-type convolutional kernel has parameters tuned to detect contours in the image).
The choice of convolutional kernel parameters depends on the task to be solved. With deep learning methods, these parameters are learned automatically by the algorithm from the training data [3], in particular thanks to the gradient backpropagation technique, which adjusts the parameters according to the gradient of the loss function. The loss function calculates the error between the predicted value and the target value.
Diagram of a convolution with a kernel size 3*3 and a stride of 2
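As a toy illustration of how backpropagation adjusts a parameter from the gradient of the loss (a single weight, a squared-error loss and an arbitrary learning rate, all illustrative):

```python
# One gradient-descent step on a single weight.
x, target = 2.0, 1.0   # input and target value (illustrative)
w, lr = 0.3, 0.1       # initial weight and learning rate (illustrative)

pred = w * x                    # forward pass: the prediction
loss = (pred - target) ** 2     # squared error between prediction and target
grad = 2 * (pred - target) * x  # dloss/dw, by the chain rule
w = w - lr * grad               # step against the gradient

new_loss = (w * x - target) ** 2  # the error has decreased
```

In a real network the same chain rule is applied layer by layer, from the loss layer back to the first convolutional kernels.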
How does a correction layer (ReLU) work?
The correction or activation layer applies a non-linear function to the feature maps at the output of the convolutional layer. This non-linearity facilitates the extraction of complex features that cannot be modeled by a linear combination, as in a regression algorithm.
The most commonly used non-linear functions are:
- sigmoid or logistic,
- hyperbolic tangent,
- Rectified Linear Unit (ReLU).
Very often, the ReLU function is chosen because it is inexpensive to compute and does not saturate for positive values, which preserves the strong activations of the affine step and speeds up training.
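The three activation functions can be sketched as follows (the input pre-activation values are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent: squashes any value into (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])  # illustrative pre-activations
a = relu(z)                     # negative activations are clipped to zero
```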
What is a pooling layer (POOL)?
The pooling step is a sub-sampling technique. Generally, a pooling layer is inserted at regular intervals between the correction and convolutional layers. By reducing the size of the feature maps, and thus the number of network parameters, it shortens the computation time and reduces the risk of over-fitting.
The most common pooling operation is the maximum: MaxPool(2*2, 2). It is more effective than average pooling because it preserves the strongest activations. It is applied to the output of the previous layer like a convolution filter of size 2*2, moving with a stride of 2. At the output of the pooling layer, one obtains a feature map compressed by a factor of 4.
Diagram of a pooling operation with a MaxPool kernel of size 2*2 and a stride of 2
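The MaxPool(2*2, 2) operation above can be sketched as follows (the 4*4 feature map values are illustrative):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """MaxPool(size*size, stride): keep only the strongest activation
    in each window of the feature map."""
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out

# Illustrative 4*4 feature map, compressed to 2*2 (a factor of 4).
fmap = np.array([[1.0, 3.0, 2.0, 0.0],
                 [4.0, 2.0, 1.0, 5.0],
                 [0.0, 1.0, 8.0, 2.0],
                 [3.0, 2.0, 1.0, 0.0]])
pooled = max_pool(fmap)
```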
How does a "fully connected" (FC) layer work?
This layer is at the end of the network. It performs the classification of the image based on the features extracted by the succession of processing blocks. It is called fully connected because all the inputs of the layer are connected to all of its output neurons, so each neuron has access to all the input information. Each output neuron i assigns the image a probability of belonging to class i among the C possible classes.
Diagram of a fully connected layer with 6 classes
In contrast, during the feature extraction phase, the processing neurons are independent of each other and only have access to the information of the receptive field they process.
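A fully connected layer with C = 6 classes, as in the diagram, can be sketched as follows (the feature vector and weights are random illustrative values; softmax is the usual way to turn the scores into probabilities, though the article does not name it):

```python
import numpy as np

def fully_connected(features, W, b):
    """Every input is connected to every output neuron. A softmax turns
    the C raw scores into one probability per class."""
    scores = W @ features + b            # affine map: one score per class
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()               # probabilities summing to 1

rng = np.random.default_rng(0)
features = rng.random(8)          # flattened features from earlier blocks
W = rng.standard_normal((6, 8))   # 6 classes, as in the diagram
b = np.zeros(6)
probs = fully_connected(features, W, b)  # one probability per class
```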
What is a loss layer (LOSS)?
The loss layer is the last layer of the network. It calculates the error between the network's prediction and the actual value. In a classification task, the random variable is discrete: it can only take the values 0 or 1, 1 meaning the image belongs to the class and 0 that it does not. This is why the most common and most suitable loss function is the cross-entropy function.
This function comes from the field of information theory and measures the overall difference between two probability distributions (that of the model's prediction and that of reality) for a random variable or a set of events [4]. Formally, it is written:

H(p, y) = − Σ_{i=1..C} p_i · log(y_i)

where y_i is the estimated probability that x belongs to class i, p_i is the real probability that x belongs to class i, given that there are C classes.
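The cross-entropy loss can be sketched as follows (the predicted distributions are illustrative; the small eps guards against log(0)):

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Cross-entropy between the real distribution p and the predicted
    distribution y; eps avoids taking the log of zero."""
    return -np.sum(p * np.log(y + eps))

p = np.array([0.0, 1.0, 0.0])  # reality: the image belongs to class 1
y = np.array([0.1, 0.8, 0.1])  # prediction: confident and correct
loss = cross_entropy(p, y)     # small error

y_bad = np.array([0.7, 0.2, 0.1])   # prediction: confident and wrong
loss_bad = cross_entropy(p, y_bad)  # larger error
```

The loss grows as the predicted probability of the true class shrinks, which is exactly the behavior the network's training procedure exploits.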
The hyper-parameters
Hyper-parameters make it possible to control machine learning in high-dimensional settings. They are divided into two categories:
- model hyper-parameters;
- algorithm hyper-parameters [5].
The former define the size of the network (i.e. width, depth) and its type (e.g. auto-encoder). The latter influence the learning process (i.e. speed, quality) rather than the model itself. We will detail machine learning strategies and discuss the choice of hyper-parameter values in more detail in another article.
[1] Loussaief S., Abdelkrim A. "Machine learning framework for image classification," 2016 7th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT), Hammamet, 2016, pp. 58-61, https://doi.org/10.1109/SETIT.2016.7939841.
[2] https://fr.wikipedia.org/wiki/R%C3%A9seau_neuronal_convolutif
[4] https://machinelearningmastery.com/cross-entropy-for-machine-learning/