CNN MODEL ARCHITECTURES WITH THEIR STRENGTHS AND WEAKNESSES
Over the past few years, the IT industry has seen enormous demand for one specific skill set: Deep Learning. Deep Learning is a subset of Machine Learning comprising algorithms inspired by the working of the human brain, and hence categorized as Neural Networks. There are various Deep Learning models, including Artificial Neural Networks (ANN), Reinforcement Learning, and Recurrent Neural Networks (RNN). However, one particular model has contributed a great deal to the field of computer vision: the Convolutional Neural Network (CNN).
CNNs are Deep Neural Networks that can recognize and classify particular features in images, and they are widely used for analyzing visual imagery. Their application areas include image classification, computer vision, and image analysis. The term 'Convolution' in CNN refers to the mathematical operation of convolution, a special kind of linear operation in which two functions are multiplied to produce an output that expresses how the shape of one function is modified by the other.
A CNN is a mathematical model built from three kinds of layers or building blocks: convolution, pooling, and fully connected layers. The first two, convolution and pooling, perform feature extraction, while the fully connected layer maps the extracted features onto the final output, such as a classification. The convolution layer, on which the CNN relies heavily, consists of linear mathematical operations such as convolution. In a digital image, pixel values are stored in a two-dimensional grid of numbers, and a small grid of parameters called a kernel is applied across it.
CNNs are very useful in image processing because a feature can appear anywhere in the image. The kernel acts as a simple feature extractor applied to each pixel region of the image. The output of one layer in a CNN is passed on to the next, so increasingly complex features can be learned. Training is the process of adjusting parameters such as the kernel weights to reduce the difference between the actual and predicted outputs, using algorithms such as gradient descent and backpropagation. Concretely, two matrices, one holding the image region and one the kernel, are multiplied to give an output that is used to extract features.
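To make this concrete, here is a minimal NumPy sketch of that sliding dot product; the 6x6 image and the 3x3 vertical-edge kernel are made-up examples:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid (no padding) convolution: slide the kernel over the image
    and take the dot product at every position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i:i + kh, j:j + kw]           # current pixel region
            feature_map[i, j] = np.sum(region * kernel)  # dot product
    return feature_map

image = np.random.rand(6, 6)          # made-up 6x6 grayscale image
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])       # simple vertical-edge detector
print(convolve2d(image, kernel).shape)  # (4, 4) feature map
```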
CNN Architecture
The fundamental parts of a CNN architecture are as follows:
- Feature extraction: a convolution operation that isolates and identifies the distinct features of the image for analysis.
- Classification: a fully connected layer that takes the result of the convolution process and predicts the class to which the image belongs, based on the features extracted in the earlier stages.
A CNN is made up of three kinds of layers: convolutional layers, pooling layers, and fully connected layers, as shown in Figure 1.
Figure 1: Basic CNN Architecture
When CNN layers are stacked, the shape of the model develops. In addition to these three layers, there are two other significant components, the dropout layer and the activation function, which are described below.
1. Convolutional Layer
This is the most important layer and is used to extract the various features of the input image. Here, the mathematical operation of convolution is performed between a filter and the image. As the filter slides over the image, the dot product is computed between the filter and the corresponding pixel values of the image. The result is stored as a feature map, which carries information about the image. The feature map is then passed to subsequent layers so the network can learn several other features of the input image.
2. Pooling Layer
Most of the time, a convolutional layer is followed by a pooling layer. The primary aim of pooling is to reduce the dimensions of the convolved feature map, thereby cutting computational cost. It does this by reducing the connections between layers, and it operates independently on each feature map. Depending on the method used, there are several types of pooling operations.
In Max Pooling, the largest element is taken from each region of the feature map. Average Pooling computes the average of the elements in each region of the feature map.
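As a rough NumPy sketch (assuming non-overlapping 2x2 windows and a feature map whose sides divide evenly by the pool size), the two variants differ only in how each block is reduced:

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window (stride = size)."""
    h, w = feature_map.shape
    # Group the map into (size x size) blocks, then reduce each block
    blocks = feature_map.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))    # max pooling: keep largest element
    return blocks.mean(axis=(1, 3))       # average pooling: take the mean

fmap = np.arange(16).reshape(4, 4).astype(float)
print(pool2d(fmap, mode="max"))   # 2x2 output, dimensions halved
print(pool2d(fmap, mode="avg"))
```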
3. FC (Fully Connected Layer)
This layer consists of the weights and biases used to connect neurons between two different layers, and it forms the final part of the CNN model. Here, the input from the previous layers is flattened and fed into the layer. The flattened vector then passes through a few more FC layers, where the usual mathematical operations take place.
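In Keras terms, a minimal sketch of such a classification head might look like the following (the 7x7x64 input shape and layer widths are illustrative assumptions, not taken from any particular model in this post):

```python
import tensorflow as tf

# Hypothetical head: flatten the last feature maps, then two FC layers
fc_head = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(7, 7, 64)),  # flatten feature maps
    tf.keras.layers.Dense(128, activation="relu"),    # weights + biases
    tf.keras.layers.Dense(10, activation="softmax"),  # per-class scores
])
```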
4. Dropout
Usually, when all the features are connected to the FC layer, the model tends to overfit the training dataset. Overfitting occurs when a particular model works so well on the training data that it hurts the model's performance when used on new data.
A dropout layer is used to overcome this problem: a few neurons are dropped from the network during the training process, resulting in a reduced model size. If a dropout rate of 0.4 is passed, 40% of the nodes are dropped at random from the network.
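For example, here is a small Keras sketch matching that 0.4 rate (the layer sizes are arbitrary placeholders); dropout is only active during training, not at inference:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(128,))
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
x = tf.keras.layers.Dropout(0.4)(x)   # randomly drops 40% of units in training
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```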
5. Activation Function
Finally, one of the most important parameters of a CNN is the activation function. Activation functions are used to learn and approximate continuous or complex relationships between variables of the network. Essentially, an activation function decides which information is fired in the forward direction, thereby adding non-linearity to the network. A few commonly used functions are ReLU, Softmax, Sigmoid, and tanh, and each of them has a particular use. In binary classification CNN models, the softmax and sigmoid functions are preferred; for the most part, softmax is used.
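A quick NumPy illustration of the difference: sigmoid squashes each score independently into (0, 1), while softmax turns a whole vector of scores into probabilities that sum to 1, which is why it suits multi-class outputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(sigmoid(logits))   # element-wise, each value in (0, 1)
print(softmax(logits))   # sums to 1 across the classes
```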
Common Architectures of CNN:
Different types of CNN architectures include:
- LeNet 5
- AlexNet
- VGG16
- Inception-v1
- Inception-v3
- ResNet-50
- Xception
- Inception-v4
- ResNeXt-50
1. LeNet 5 (1998)
LeNet-5 is perhaps the simplest architecture of the group. It comprises 2 convolutional layers and 3 fully connected layers (the '5' simply denotes the number of layers with learnable weights), as shown in Figure 2. The subsampling layer is an average pooling layer with trainable weights. The model uses around 60,000 parameters.
This architecture has become the standard template: convolutional layers with activation functions are stacked, interleaved with pooling layers, and the network finishes with one or more fully connected layers.
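A hedged Keras sketch of this template follows; the layer sizes track the classic LeNet-5 description, though details such as activations and padding vary between write-ups:

```python
import tensorflow as tf

lenet5 = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, 5, activation="tanh",
                           input_shape=(32, 32, 1)),   # C1: 6 filters of 5x5
    tf.keras.layers.AveragePooling2D(2),               # S2: subsampling
    tf.keras.layers.Conv2D(16, 5, activation="tanh"),  # C3: 16 filters of 5x5
    tf.keras.layers.AveragePooling2D(2),               # S4: subsampling
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="tanh"),     # FC layers
    tf.keras.layers.Dense(84, activation="tanh"),
    tf.keras.layers.Dense(10, activation="softmax"),   # 10 digit classes
])
lenet5.summary()   # roughly 60,000 trainable parameters
```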
Figure 2: LeNet Architecture
Advantages:
- The architecture was able to classify hand-written digits efficiently.
- It was one of the first architectures to learn features automatically from raw pixels.
Disadvantages:
- The model does not scale well to classifying larger, more complex images.
- The model suffers from the problem of overfitting.
2. AlexNet (2012)
AlexNet comprises 8 layers, out of which 5 are convolutional and 3 are fully connected. A few more layers were stacked onto LeNet-5, forming AlexNet as demonstrated in Figure 3. This architecture was the first to use the ReLU activation function and utilised Dropout layers. The architecture uses around 60 million parameters.
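A simplified Keras sketch of that 5-convolution + 3-FC layout is below; the filter sizes follow the original paper, but the two-GPU split and local response normalisation of the 2012 implementation are omitted:

```python
import tensorflow as tf
from tensorflow.keras import layers

alexnet = tf.keras.Sequential([
    layers.Conv2D(96, 11, strides=4, activation="relu",
                  input_shape=(227, 227, 3)),                  # conv 1
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(256, 5, padding="same", activation="relu"),  # conv 2
    layers.MaxPooling2D(3, strides=2),
    layers.Conv2D(384, 3, padding="same", activation="relu"),  # conv 3
    layers.Conv2D(384, 3, padding="same", activation="relu"),  # conv 4
    layers.Conv2D(256, 3, padding="same", activation="relu"),  # conv 5
    layers.MaxPooling2D(3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),     # FC 1
    layers.Dropout(0.5),                       # dropout, as AlexNet introduced
    layers.Dense(4096, activation="relu"),     # FC 2
    layers.Dropout(0.5),
    layers.Dense(1000, activation="softmax"),  # FC 3: ImageNet classes
])
```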
Figure 3: AlexNet Architecture
Advantages:
- The model performs image classification efficiently.
- Computation is fast.
- It is more computation- and memory-efficient than its predecessors.
- It is robust.
Disadvantages:
- It is difficult to apply to high-resolution images.
3. VGG-16 (2014)
VGG is a classical convolutional neural network architecture. It was based on an investigation of how to increase the depth of such networks. The network uses small 3 x 3 filters. Otherwise, the network is characterised by its simplicity, the only other components being pooling layers and a fully connected layer.
The VGG-16 model was developed by the Visual Geometry Group (VGG) and comprises 13 convolutional and 3 fully connected layers, carrying over the ReLU tradition from AlexNet. More and more layers are stacked onto AlexNet to obtain the VGG model. It comprises 138M parameters and occupies about 500MB of storage space. The group also designed a deeper variant, VGG-19.
Figure 4: VGG Model
VGG 19
It is a variant of the VGG model with 19 weight layers (16 convolutional layers and 3 fully connected layers), plus 5 MaxPool layers and 1 SoftMax layer.
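Because VGG-16 ships with most deep learning frameworks, a common pattern is transfer learning from its pretrained ImageNet weights. A minimal Keras sketch (the 10-class head is a hypothetical example):

```python
import tensorflow as tf

# Pretrained VGG-16 backbone without its 3 fully connected layers
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False   # freeze the 13 convolutional layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # hypothetical 10 classes
])
```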
Advantages:
- The model works well for transfer learning and small classification tasks.
- It is more robust
Disadvantages:
- The large number of parameters, 138 million, increases the computational cost.
- The model is not well suited to very deep networks: the deeper it goes, the more prone it is to the vanishing gradient problem.
4. Inception-v1 (2014)
The model with a 22-layer architecture and 5 million parameters is known as Inception-v1. Here, the Network-In-Network approach is heavily used, realised through 'Inception modules'. Each module embodies three ideas, illustrated in the sketch after this list:
- Parallel towers of convolutions with different filter sizes, followed by concatenation, capture features at 1×1, 3×3 and 5×5 scales, thereby 'clustering' them.
- 1×1 convolutions are used for dimensionality reduction, removing computational bottlenecks.
- The added 1×1 convolutions also introduce extra non-linearity into the model.
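Here is a hedged sketch of one such module in the Keras functional API; the filter counts are illustrative, not the published GoogLeNet values:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3=128, f5=32, fpool=32):
    """Parallel 1x1 / 3x3 / 5x5 towers concatenated along the channel axis,
    with 1x1 convolutions reducing dimensionality before the larger filters."""
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(96, 1, padding="same", activation="relu")(x)   # reduce
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)   # reduce
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fpool, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])  # 'cluster' the features

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs)                 # 64+128+32+32 channels out
model = tf.keras.Model(inputs, outputs)
```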
Advantages:
- It performed well even on low-contrast images.
- It makes proper use of the available computing resources.
Disadvantages:
- A representational bottleneck can shrink the feature space in the next layer, which can result in the loss of useful information.
5. Inception-v3 (2015)
Inception-v3 uses 24 million parameters and is a successor to Inception-v1, as shown in Figure 5. Inception-v2 is similar to v3 but is not commonly used. The Inception-v3 network includes certain changes to the loss function and optimiser, and also adds batch normalisation to the network.
Figure 5: GoogleNet Architecture
Source: https://hacktildawn.com/2016/09/25/inception-modules-explained-and-implemented/
The motivation behind Inception-v2 and Inception-v3 is to avoid representational bottlenecks (that is, drastically reducing the input dimensions) and to use factorisation methods for more efficient computation. This design was also among the first to use batch normalisation.
Improvements with context to Inception-v1:
- n×n convolutions are factorised into 1×n followed by n×1 convolutions
- 5×5 convolutions are factorised into two 3×3 convolutions (sketched below)
- 7×7 convolutions are replaced by a series of 3×3 convolutions
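As a hedged Keras sketch of the first two factorisations (the input shape and filter count are arbitrary): two stacked 3×3 convolutions cover the same receptive field as one 5×5 but with fewer weights, and a 3×3 can likewise be split into 1×3 and 3×1:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(32, 32, 64))
# Instead of one 5x5 convolution, stack two 3x3 convolutions:
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)  # same 5x5 field
# Asymmetric factorisation: replace 3x3 with 1x3 followed by 3x1
y = layers.Conv2D(64, (1, 3), padding="same", activation="relu")(x)
y = layers.Conv2D(64, (3, 1), padding="same", activation="relu")(y)
model = tf.keras.Model(inputs, y)
```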
Advantages:
- It was able to perform well on low-contrast images.
Disadvantage:
- In this network, the feature space of succeeding layers can be reduced, which can further lead to the loss of useful information.
6. ResNet-50 (2015)
With earlier CNNs, the pattern was that gradually adding layers led to better performance. However, as network depth increases, accuracy becomes saturated and then degrades rapidly. The team at Microsoft Research resolved this issue with ResNet, using skip connections. ResNet, with 26M parameters, was also an early adopter of batch normalisation, which helps overcome the vanishing gradient problem. The fundamental building blocks of the model are the convolutional and identity blocks, in which the input to a layer is passed directly, as a shortcut, to a later layer, as demonstrated in Figure 6. This architecture popularised skip connections and makes it possible to design deeper CNNs: networks of 152 layers can be built with no compromise in model power.
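A hedged sketch of the identity block idea in the Keras functional API; the real ResNet-50 uses a three-layer 1×1/3×3/1×1 bottleneck, which is simplified here to two 3×3 convolutions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def identity_block(x, filters=64):
    """Residual block: the input skips over two convolutional layers
    and is added back to their output."""
    shortcut = x                                        # the skip connection
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                     # add the input back in
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = identity_block(inputs)
model = tf.keras.Model(inputs, outputs)
```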
Figure 6: ResNet Architecture
Source : https://neurohive.io/en/popular-networks/resnet/
Advantages:
- Training converges faster than in comparably deep plain networks
- Minimized vanishing gradient problem
- It can train deeper networks
Disadvantages:
- Training the very deep variants still takes a large amount of time, which can make them impractical for some real-world applications.
7. Xception (2016)
The transformation of the Inception model in which the Inception modules are replaced with depthwise separable convolutions is called the Xception model. It possesses roughly the same number of parameters as Inception-v1, about 23M. The Inception hypothesis holds that:
- First, cross-channel correlations are captured using 1×1 convolutions.
- Then, spatial correlations within each channel are captured by means of the ordinary 3×3 or 5×5 convolutions.
Figure 7 : Xception Model
Source: https://www.kdnuggets.com/2017/08/intuitive-guide-deep-network-architectures.html/2
Taking this idea to the extreme means performing a 1×1 convolution for each channel and then a 3×3 convolution on each output. This is identical to replacing the Inception module with depthwise separable convolutions, as in the sketch below.
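Keras exposes this directly as SeparableConv2D; a brief sketch contrasting it with a regular convolution (the shapes are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 128))
# Regular convolution: spatial and cross-channel mixing in a single step
regular = layers.Conv2D(256, 3, padding="same")(inputs)
# Depthwise separable: a per-channel 3x3 spatial filter, then a 1x1
# pointwise convolution for cross-channel mixing (far fewer parameters)
separable = layers.SeparableConv2D(256, 3, padding="same")(inputs)
model = tf.keras.Model(inputs, [regular, separable])
model.summary()   # compare the parameter counts of the two layers
```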
Advantages:
- The performance of this model is better than the Inception model.
Disadvantages:
- This model is expensive to train
8. Inception-v4 (2016)
Inception-v4, with 43M parameters, is another release from the team at Google. This model is an improvement over Inception-v3. The principal difference is a set of minor changes in the Inception-C blocks and the Stem group. All in all, it has been noted that Inception-v4 performs better largely because of its increased model size.
9. ResNeXt-50 (2017)
This model has 25M parameters. The ResNeXt architecture is an extension of the deep residual network in which the residual block is replaced with the 'split-transform-merge' strategy used in the Inception models. Essentially, rather than performing convolutions over the full input feature map, the block's input is projected into a series of lower (channel) dimensional representations, a few convolutional filters are applied to each of them separately, and the results are then merged, as in the sketch below.
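A hedged sketch of that strategy using the groups argument of Keras Conv2D (available in TensorFlow 2.3+); the cardinality of 32 follows the paper, but the channel widths here are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def resnext_block(x, cardinality=32, bottleneck=128, out_channels=256):
    """Split-transform-merge: project down, run a grouped 3x3 convolution
    (i.e. `cardinality` parallel low-dimensional paths), project up, and
    merge through the residual addition."""
    shortcut = x
    y = layers.Conv2D(bottleneck, 1, activation="relu")(x)         # project down
    y = layers.Conv2D(bottleneck, 3, padding="same",
                      groups=cardinality, activation="relu")(y)    # grouped conv
    y = layers.Conv2D(out_channels, 1)(y)                          # project up
    return layers.Activation("relu")(layers.Add()([shortcut, y]))  # merge

inputs = tf.keras.Input(shape=(56, 56, 256))
outputs = resnext_block(inputs)
model = tf.keras.Model(inputs, outputs)
```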
Advantages:
- High accuracy
Disadvantages:
- It is very complex, which contributes to higher cost.
CONCLUSION
In this blog we have covered how Deep Learning has become important in the IT industry. The famous Deep Learning architecture, the Convolutional Neural Network, is one of the most popular techniques for classifying images. We also discussed the different types of CNN architectures that were introduced for computer vision applications, along with the strengths and weaknesses of each. This broad family of architectures has many applications in the IT field and has gained great importance across all domains.