## EECS at UC Berkeley

Search form.

- Technical Reports

## Efficiently Designing Efficient Deep Neural Networks

Eecs department university of california, berkeley technical report no. ucb/eecs-2022-69 may 11, 2022, http://www2.eecs.berkeley.edu/pubs/techrpts/2022/eecs-2022-69.pdf.

To tackle this sisyphean task, modern methods expend gargantuan amounts of computation to design solutions, exceeding thousands of GPU hours or years of GPU compute to design a single neural network. Not to mention, these works maximize just one performance metric – accuracy – under a single set of resource constraints. What if the set of resource constraints changes? If additional performance metrics rise to the forefront, such as explainability or generalization? Modern methods for designing efficient neural networks are handicapped by excessive computation requirements for goals too singularly and narrowly sighted.

This thesis tackles the bottlenecks of modern methods directly, achieving state-of-the-art performance by efficiently designing efficient deep neural networks. These improvements don’t only reduce computation or only improve accuracy; instead, our methods improve performance and reduce computational requirements, despite increasing search space size by orders of magnitude. We also demonstrate missed opportunities with performance metrics beyond accuracy, redesigning the task so that accuracy, explainability, and generalization improve jointly, an impossibility by conventional wisdom, which suggests explainability and accuracy participate in a zero-sum game.

This thesis culminates in a set of models that set new flexibility and performance standards for production-ready models: those that are state-of-the-art accurate, explainable, generalizable, and configurable for any set of resource constraints in just CPU minutes.

Advisor: Joseph Gonzalez

BibTeX citation:

EndNote citation:

## MIT Libraries home DSpace@MIT

- DSpace@MIT Home
- MIT Libraries
- Doctoral Theses

## Unlocking the potential of neural networks in resource and data constrained environments

## Other Contributors

Terms of use, description, date issued, collections.

- Reference Manager
- Simple TEXT file

## People also looked at

Review article, an introductory review of deep learning for prediction models with big data.

- 1 Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- 2 Institute of Biosciences and Medical Technology, Tampere, Finland
- 3 School of Management, University of Applied Sciences Upper Austria, Steyr, Austria
- 4 Department of Biomedical Computer Science and Mechatronics, University for Health Sciences, Medical Informatics and Technology (UMIT), Hall in Tyrol, Austria
- 5 College of Artificial Intelligence, Nankai University, Tianjin, China

Deep learning models stand for a new learning paradigm in artificial intelligence (AI) and machine learning. Recent breakthrough results in image analysis and speech recognition have generated a massive interest in this field because also applications in many other domains providing big data seem possible. On a downside, the mathematical and computational methodology underlying deep learning models is very challenging, especially for interdisciplinary scientists. For this reason, we present in this paper an introductory review of deep learning approaches including Deep Feedforward Neural Networks (D-FFNN), Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Autoencoders (AEs), and Long Short-Term Memory (LSTM) networks. These models form the major core architectures of deep learning models currently used and should belong in any data scientist's toolbox. Importantly, those core architectural building blocks can be composed flexibly—in an almost Lego-like manner—to build new application-specific network architectures. Hence, a basic understanding of these network architectures is important to be prepared for future developments in AI.

## 1. Introduction

We are living in the big data era where all areas of science and industry generate massive amounts of data. This confronts us with unprecedented challenges regarding their analysis and interpretation. For this reason, there is an urgent need for novel machine learning and artificial intelligence methods that can help in utilizing these data. Deep learning (DL) is such a novel methodology currently receiving much attention ( Hinton et al., 2006 ). DL describes a family of learning algorithms rather than a single method that can be used to learn complex prediction models, e.g., multi-layer neural networks with many hidden units ( LeCun et al., 2015 ). Importantly, deep learning has been successfully applied to several application problems. For instance, a deep learning method set the record for the classification of handwritten digits of the MNIST data set with an error rate of 0.21% ( Wan et al., 2013 ). Further application areas include image recognition ( Krizhevsky et al., 2012a ; LeCun et al., 2015 ), speech recognition ( Graves et al., 2013 ), natural language understanding ( Sarikaya et al., 2014 ), acoustic modeling ( Mohamed et al., 2011 ) and computational biology ( Leung et al., 2014 ; Alipanahi et al., 2015 ; Zhang S. et al., 2015 ; Smolander et al., 2019a , b ).

Models of artificial neural networks have been used since about the 1950s ( Rosenblatt, 1957 ); however, the current wave of deep learning neural networks started around 2006 ( Hinton et al., 2006 ). A common characteristic of the many variations of supervised and unsupervised deep learning models is that these models have many layers of hidden neurons learned, e.g., by a Restricted Boltzmann Machine (RBM) in combination with Backpropagation and error gradients of the Stochastic Gradient Descent ( Riedmiller and Braun, 1993 ). Due to the heterogeneity of deep learning approaches a comprehensive discussion is very challenging, and for this reason, previous reviews aimed at dedicated sub-topics. For instance, a bird's eye view without detailed explanations can be found in LeCun et al. (2015) , a historic summary with many detailed references in Schmidhuber (2015) and reviews about application domains, e.g., image analysis ( Rawat and Wang, 2017 ; Shen et al., 2017 ), speech recognition ( Yu and Li, 2017 ), natural language processing ( Young et al., 2018 ), and biomedicine ( Cao et al., 2018 ).

In contrast, our review aims at an intermediate level, providing also technical details usually omitted. Given the interdisciplinary interest in deep learning, which is part of data science ( Emmert-Streib and Dehmer, 2019a ), this makes it easier for people new to the field to get started. The topics we selected are focused on the core methodology of deep learning approaches including Deep Feedforward Neural Networks (D-FFNN), Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Autoencoders (AEs), and Long Short-Term Memory (LSTM) networks. Further network architectures which we discuss help in understanding these core approaches.

This paper is organized as follows. In the section 2, we provide a historical overview of general developments of neural networks. Then in section 3, we discuss major architectures distinguishing neural networks. Thereafter, we discuss Deep Feedforward Neural Networks (section 4), Convolutional Neural Networks (section 5), Deep Belief Networks (section 6), Autoencoders (section 7) and Long Short-Term Memory networks (section 8) in detail. In section 9, we provide a discussion of important issues when learning neural network models. Finally, this paper finishes in section 10 with conclusions.

## 2. Key Developments of Neural Networks: A Time Line

The history of neural networks is long, and many people have contributed toward their development over the decades. Given the recent explosion of interest in deep learning, it is not surprising that the assignment of credit for key developments is not uncontroversial. In the following, we were aiming at an unbiased presentation highlighting only the most distinguished contributions.

In 1943 , the first mathematical model of a neuron was created by McCulloch and Pitts (1943) . This model aimed at providing an abstract formulation for the functioning of a neuron without mimicking the biophysical mechanism of a real biological neuron. It is interesting to note that this model did not consider learning.

In 1949 , the first idea about biologically motivated learning in neural networks was introduced by Hebb (1949) . Hebbian learning is a form of unsupervised learning of neural networks.

In 1957 , the Perceptron was introduced by Rosenblatt (1957) . The Perceptron is a single-layer neural network serving as a linear binary classifier. In the modern language of ANNs, a Perceptron uses the Heaviside function as an activation function (see Table 1 ).

Table 1 . An overview of frequently used activation functions for neuron models.

In 1960 , the Delta Learning rule for learning a Perceptron was introduced by Widrow and Hoff (1960) . The Delta Learning rule, also known as Widrow & Hoff Learning rule or the Least Mean Square rule, is a gradient descent learning rule for updating the weights of the neurons. It is a special case of the backpropagation algorithm.

In 1968 , a method called Group Method of Data Handling (GMDH) for training neural networks was introduced by Ivakhnenko (1968) . These networks are widely considered the first deep learning networks of the Feedforward Multilayer Perceptron type. For instance, the paper ( Ivakhnenko, 1971 ) used a deep GMDH network with 8 layers. Interestingly, the numbers of layers and units per layer could be learned and were not fixed from the beginning.

In 1969 , an important paper by Minsky and Papert (1969) was published which showed that the XOR problem cannot be learned by a Perceptron because it is not linearly separable. This triggered a pause phase for neural networks called the “AI winter.”

In 1974 , error backpropagation (BP) has been suggested to use in neural networks ( Werbos, 1974 ) for learning the weighted in a supervised manner and applied in Werbos (1981) . However, the method itself is older (see e.g., Linnainmaa, 1976 ).

In 1980 , a hierarchical multilayered neural network for visual pattern recognition called Neocognitron was introduced by Fukushima (1980) . After the deep GMDH networks (see above), the Neocognitron is considered the second artificial NN that deserved the attribute deep . It introduced convolutional NNs (today called CNNs). The Neocognitron is very similar to the architecture of modern, supervised , deep Feedforward Neural Networks (D-FFNN) ( Fukushima, 2013 ).

In 1982 , Hopfield introduced a content-addressable memory neural network, nowadays called Hopfield Network ( Hopfield, 1982 ). Hopfield Networks are an example for recurrent neural networks.

In 1986 , backpropagation reappeared in a paper by Rumelhart et al. (1986) . They showed experimentally that this learning algorithm can generate useful internal representations and, hence, be of use for general neural network learning tasks.

In 1987 , Terry Sejnowski introduced the NETtalk algorithm ( Sejnowski and Rosenberg, 1987 ). The program learned how to pronounce English words and was able to improve over time.

In 1989 , a Convolutional Neural Network was trained with the backpropagation algorithm to learn handwritten digits ( LeCun et al., 1989 ). A similar system was later used to read handwritten checks and zip codes, processing cashed checks in the United States in the late 90s and early 2000s.

Note: In the 1980s, the second wave of neural network research emerged in great part via a movement called connectionism ( Fodor and Pylyshyn, 1988 ). This wave lasted until the mid 1990s.

In 1991 , Hochreiter studied a fundamental problem of any deep learning network, which relates to the problem of not being trainable with the backpropagation algorithm ( Hochreiter, 1991 ). His study revealed that the signal propagated by backpropagation either decreases or increases without bounds. In case of a decay, this is proportional to the depth of the network. This is now known as the vanishing or exploding gradient problem.

In 1992 , a first partial remedy to this problem has been suggested by Schmidhuber (1992) . The idea was to pre-train a RNN in an unsupervised way to accelerate subsequent supervised learning. The studied network had more than 1,000 layers in the recurrent neural network.

In 1995 , oscillatory neural networks have been introduced in Wang and Terman (1995) . They have been used in various applications like image and speech segmentation and generating complex time series ( Wang and Terman, 1997 ; Hoppensteadt and Izhikevich, 1999 ; Wang and Brown, 1999 ; Soman et al., 2018 ).

In 1997 , the first supervised model for learning RNN was introduced by Hochreiter and Schmidhuber (1997) , which was called Long Short-Term Memory (LSTM). A LSTM prevents the decaying error signal problem between layers by making the LSTM networks “remember” information for a longer period of time.

In 1998 , the Stochastic Gradient Descent algorithm (gradient-based learning) was combined with the backpropagation algorithm for improving learning in CNN ( LeCun et al., 1989 ). As a result, LeNet-5, a 7-level convolutional network, was introduced for classifying hand-written numbers on checks.

In 2006 , is widely considered a breakthrough year because in Hinton et al. (2006) it was shown that neural networks called Deep Belief Networks can be efficiently trained by using a strategy called greedy layer-wise pre-training. This initiated the third wave of neural networks that made also the use of the term deep learning popular.

In 2012 , Alex Krizhevsky won the ImageNet Large Scale Visual Recognition Challenge by using AlexNet, a Convolutional Neural Network utilizing a GPU and improved upon LeNet5 (see above) ( LeCun et al., 1989 ). This success started a convolutional neural network renaissance in the deep learning community (see Neocognitron).

In 2014 , generative adversarial networks were introduced in Goodfellow et al. (2014) . The idea is that two neural networks compete with each other in a game-like manner. Overall, this establishes a generative model that can produce new data. This has been called “the coolest idea in machine learning in the last 20 years” by Yann LeCun.

In 2019 , Yoshua Bengio, Geoffrey Hinton, and Yann LeCun were awarded the Turing Award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.

The reader interested in a more detailed early history of neural networks is referred to Schmidhuber (2015) .

In Figure 1 , we show the evolution of publications related to deep learning from the Web of Science publication database. Specifically, the figure shows the number of publications in dependence on the publication year for DL, deep learning; CNN, convolutional neural network; DBN, deep belief network; LSTM, long short-term memory; AEN, autoencoder; and MLP, multilayer perceptron. The two dashed lines are scaled by a factor of 5 (deep learning) and 3 (convolutional neural network), i.e., overall, for deep learning we found the majority of publications (in total 30, 230). Interestingly, most of these are in computer science (52.1%) and engineering (41.5%). In application areas, medical imaging (6.2%), robotics (2.6%), and computational biology (2.5%) received most attention. These observations are a reflection of the brief history of deep learning indicating that the methods are still under development.

Figure 1 . Number of publications in dependence on the publication year for DL, deep learning; CNN, convolutional neural network; DBN, deep belief network; LSTM, long short-term memory; AEN, autoencoder; and MLP, multilayer perceptron. The legend shows the search terms used to query the Web of Science publication database. The two dashed lines are scaled by a factor of 5 (deep learning) and 3 (convolutional neural network).

In the following sections, we will discuss all of these methods in more detail because they represent the core methodology of deep learning. In addition, we present background information about general artificial neural networks as far as this is needed for a better understanding of the DL methods.

## 3. Architectures of Neural Networks

Artificial Neural Networks (ANNs) are mathematical models that have been motivated by the functioning of the brain. However, the models we discuss in the following do not aim at providing biologically realistic models. Instead, the purpose of these models is to analyze data.

## 3.1. Model of an Artificial Neuron

The basic entity of any neural network is a model of a neuron. In Figure 2A , we show such a model of an artificial neuron.

Figure 2. (A) Representation of a mathematical artificial neuron model. The input to the neuron is summed up and filtered by activation function ϕ (for examples see Table 1 ). (B) Simplified Representation of an artificial neuron model. Only the key elements are depicted, i.e., the input, the output, and the weights.

The basic idea of a neuron model is that an input, x , together with a bias, b is weighted by, w , and then summarized together. The bias, b , is a scalar value whereas the input x and the weights w are vector valued, i.e., x ∈ ℝ n and w ∈ ℝ n with n ∈ ℕ corresponding to the dimension of the input. Note that the bias term is not always present but is sometimes omitted. The sum of these terms, i.e., z = w T x + b forms then the argument of an activation function, ϕ, resulting in the output of the neuron model,

Considering only the argument of ϕ one obtains a linear discriminant function ( Webb and Copsey, 2011 ).

The activation function, ϕ, (also known as unit function or transfer function) performs a non-linear transformation of z . In Table 1 , we give an overview of frequently used activation functions.

The ReLU activation function is called Rectified Linear Unit or rectifier ( Nair and Hinton, 2010 ). The ReLU activation function is the most popular activation function for deep neural networks. Another useful activation function is the softmax function ( Lawrence et al., 1997 ):

The softmax maps a n -dimensional vector x into a n -dimensional vector y having the property ∑ i y i = 1 . Hence, the components of y represent probabilities for each of the n elements. The softmax is often used in the final layer of a network. If the Heaviside step function is used as activation function, the neuron model is known as perceptron ( Rosenblatt, 1957 ).

Usually, the model neuron shown in Figure 2A is represented in a more ergonomic way by limiting the focus on its key elements. In Figure 2B , we show such a representation that highlights merely the input part.

## 3.2. Feedforward Neural Networks

In order to build neural networks (NNs), the neurons need to be connected with each other. The simplest architecture of a NN is a feedforward structure. In Figures 3A,B , we show examples for a shallow and a deep architecture.

Figure 3 . Two examples for Feedforward Neural Networks. (A) A shallow FFNN. (B) A Deep Feedforward Neural Network (D-FFNN) with 3 hidden layers.

In general, the depth of a network denotes the number of non-linear transformations between the separating layers whereas the dimensionality of a hidden layer, i.e., the number of hidden neurons, is called its width. For instance, the shallow architecture in Figure 3A has a depth of 2 whereas Figure 3B has a depth of 4 [total number of layers minus one (input layer)]. The required number to call a Feedforward Neural Network (FFNN) architecture deep is debatable, but architectures with more than two hidden layers are commonly considered as deep ( Yoshua, 2009 ).

A Feedforward Neural Network, also called a Multilayer Perceptron (MLP), can use linear or non-linear activation functions ( Goodfellow et al., 2016 ). Importantly, there are no cycles in the NN that would allow a direct feedback. Equation (3) defines how the output of a MLP is obtained from the input ( Webb and Copsey, 2011 ).

Equation (3) is the discriminant function of the neural network ( Webb and Copsey, 2011 ). For finding the optimal parameters one needs a learning rule. A common approach is to define an error function (or cost function) together with an optimization algorithm to find the optimal parameters by minimizing the error for training data.

## 3.3. Recurrent Neural Networks

The family of Recurrent Neural Network (RNN) models has two subclasses that can be distinguished based on their signal processing behavior. The first contains finite impulse recurrent networks (FRNs) and the second infinite impulse recurrent networks (IIRNs). That difference is that a FRN is given by a directed acyclic graph (DAG) that can be unrolled in time and replaced with a Feedforward Neural Network, whereas an IIRN is a directed cyclic graph (DCG) for which such an unrolling is not possible.

## 3.3.1. Hopfield Networks

A Hopfield Network (HN) ( Hopfield, 1982 ) is an example for a FRN. A HN is defined as a fully connected network consisting of McCulloch-Pitts neurons. A McCulloch-Pitts neuron is a binary model with an activation function given by

The activity of the neurons x i , i.e.,

is either updated synchronously or asynchronously. To be precise, x j refers to x j t and x i to x i t + 1 (time progression).

Hopfield Networks have been introduced to serve as a model of a content-addressable (“associative”) memory, i.e., for storing patterns. In this case, it has been shown that the weights are obtained by

whereas P is the number of patterns, t ( k ) is the k-th pattern and t i ( k ) its i-th component. From Equation (6), one can see that the weights are symmetrical. An interesting question in this context is what is the maximal value of P or P / N , called the network capacity (here N is the total number of patterns). In Hertz et al. (1991) it was shown that the network capacity is ≈0.138. It is interesting to note that the neurons in a Hopfield Network cannot be distinguished as input neurons, hidden neurons and output neurons because at the beginning every neuron is an input neuron, during the processing every neuron is a hidden neuron and at the end every neuron is an output neuron.

## 3.3.2. Boltzmann Machine

A Boltzmann Machine ( Hinton and Sejnowski, 1983 ) can be described as a noisy Hopfield network because it uses a probabilistic activation function

whereas x i is obtained as in Equation (5). This model is important because it is one of the first neural networks that uses hidden units (latent variables). For learning the weights, the Contrastive Divergence algorithm (see Algorithm 9) can be used to train Boltzmann Machines. Put simply, Boltzmann Machines are neural networks consisting of two layers—a visible layer and a hidden layer. Each edge between the two layers is undirected, implying that information can flow in a bi-directional way. The whole network is fully connected, which means that each neuron in the network is connected to all other neurons via undirected edges (see Figures 8A,B ).

## 3.4. An Overview of Network Architectures

There is a large variety of different network architectures used as deep learning models. The following Table 2 does not aim to provide a comprehensive list, but it includes the most popular models currently used ( Yoshua, 2009 ; LeCun et al., 2015 ).

Table 2 . List of popular deep learning models, available learning algorithms (unsupervised, supervised) and software implementations in R or python.

It is interesting to note that some of the models in Table 2 are composed by other networks. For instance, CDBNs are based on RBMs and CNNs ( Lee et al., 2009 ); DBMs are based on RBMs ( Salakhutdinov and Hinton, 2009 ); DBNs are based on RBMs and MLPs; dAEs are stochastic Autoencoders that can be stacked on top of each other to build stacked denoising Autoencoders (SdAEs).

In the following sections, we discuss the major core architectures Deep Feedforward Neural Networks (D-FFNN), Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Autoencoders (AEs), and Long Short-Term Memory networks (LSTMs) in more detail.

## 4. Deep Feedforward Neural Networks

It can be proven that a Feedforward Neural Network with one hidden layer and a finite number of neurons can approximate any continuous function on a compact subset of ℝ n ( Hornik, 1991 ). This is called the universal approximation theorem . The reason for using a FFNN with more than one hidden layer is that the universal approximation theorem does not provide information on how to learn such a network, which turned out to be very difficult. A related issue that contributes to the difficulty of learning such networks is that their width can become exponentially large. Interestingly, the universal approximation theorem can also be proven for FFNN with many hidden layers and a bounded number of hidden neurons ( Lu et al., 2017 ) for which learning algorithms have been found. Hence, D-FFNNs are used instead of (shallow) FFNNs for practical reasons of learnability.

Formally, the idea of approximating an unknown function f * can be written as

Here f is a function from a specific family that depends on the parameters θ, and ϕ is a non-linear activation function with one layer. For many hidden layers ϕ has the form

Instead of guessing the correct family of functions from which f should be chosen, D-FFNNs learn this function by approximating it via ϕ , which itself is approximated by the n hidden layers.

The practical learning of the parameters of a D-FFNN (see Figure 3B ) can be accomplished with the backpropagation algorithm, although for computational efficiency nowadays the Stochastic Gradient Descent is used ( Bottou, 2010 ). The Stochastic Gradient Descent calculates a gradient for a set of randomly chosen training samples (batch) and updates the parameters for this batch sequentially. This results in a faster learning. A drawback is an increase in imprecision. However, for data sets with a large number of samples (big data), the speed advantage outweighs this drawback.

## 5. Convolutional Neural Networks

A Convolutional Neural Network (CNN) is a special Feedforward Neural Network utilizing convolution, ReLU and pooling layers. Standard CNNs are normally composed of several Feedforward Neural Network layers including convolution, pooling, and fully-connected layers.

Typically, in traditional ANNs, each neuron in a layer is connected to all neurons in the next layer, whereas each connection is a parameter in the network. This can result in a very large number of parameters. Instead of using fully connected layers, a CNN uses a local connectivity between neurons, i.e., a neuron is only connected to nearby neurons in the next layer. This can significantly reduce the total number of parameters in the network.

Furthermore, all the connections between local receptive fields and neurons use a set of weights, and we denote this set of weights as a kernel. A kernel will be shared with all the other neurons that connect to their local receptive fields, and the results of these calculations between the local receptive fields and neurons using the same kernel will be stored in a matrix denoted as activation map . The sharing property is referred to as weight sharing of CNNs ( Le Cun, 1989 ). Consequently, different kernels will result in different activation maps, and the number of kernels can be adjusted with hyper-parameters. Thus, regardless of the total number of connections between the neurons in a network, the total number of weights corresponds only to the size of the local receptive field, i.e., the size of the kernel. This is visualized in Figure 4B , where the total number of connections between the two layers is 9 but the size of the kernel is only 3.

Figure 4. (A) An example for a Convolutional Neural Network. The red edges highlight the fact that hidden layers are connected in a “local” way, i.e., only very few neurons connect the succeeding layers. (B) An example for shared weights and local connectivity in CNN. The red edges highlight the fact that hidden layers are connected in a “local” way, i.e., only very few neurons connect the succeeding layers. The labels w 1 , w 2 , w 3 indicate the assigned weight for each connection, three hidden nodes share the same set of weights w 1 , w 2 , w 3 when connecting to three local patches.

By combining weight sharing and the local connectivity property, a CNN is able to handle data with high dimensions. See Figure 4A for a visualization of a CNN with three hidden layers. In Figure 4A , the red edges highlight the locality property of hidden neurons, i.e., only very few neurons connect to the succeeding layers. This locality property of CNN makes the network sparse compared to a FFNN which is fully connected.

## 5.1. Basic Components of CNN

5.1.1. convolutional layer.

A convolutional layer is an essential part in building a convolutional neural network. Similar to a hidden layer of an ordinary neural network, a convolutional layer has the same goal, which is to convert the input into a representation of a more abstract level. However, instead of using a full connectivity, the convolutional layer uses a local connectivity to perform the calculations between input and the hidden neurons. A convolutional layer uses at least one kernel to slide across the input, performing a convolution operation between each input region and the kernel. The results are stored in the activation maps, which can be seen as the output of the convolutional layer. Importantly, the activation maps can contain features extracted by different kernels. Each kernel can act as a feature extractor and will share its weights with all neurons.

For the convolution process, some spatial arguments need to be defined in order to produce the activation maps of a certain size. Essential attributes include:

1. Size of kernels (N). Each kernel has a window size, which is also referred to as receptive field. The kernel will perform a convolution operation with a region matching its window size from the input, and produce results in its activation map.

2. Stride (S). This parameter defines the number of pixels the kernel will move for the next position. If it is set to 1, each kernel will make convolution operations around the input volume and then shift 1 pixel at a time until it reaches the specified border of the input. Hence, the stride can be used to downsize the dimension of the activation maps as the larger the stride the smaller the activation maps.

3. Zero-padding (P). This parameter is used to specify how many zeros one wants to pad around the border of the input. This is very useful for preserving the dimension of the input.

These three parameters are the most common hyper-parameters used for controlling the output volume of a convolutional layer. Specifically, for an input of dimension W input × H input × Z , for the hyper-parameters size of the kernel (N), Stride (S), and Zero-padding (P) the dimension of the activation map, i.e., W out × H out × D can be calculated by:

An example of how to calculate the result between an input matrix and a kernel can be seen in Figure 5 .

Figure 5 . An example for calculating the values in the activation map. Here, the stride is 1 and the zero-padding is 0. The kernel slides by 1 pixel at a time from left to right starting from the left top position, after reaching the boarder the kernel will start from the second row and repeat the process until the whole input is covered. The red area indicates the local patch to be convoluted with the kernel, and the result is stored in the green field in the activation map.

The shared weights and the local connectivity help significantly in reducing the total number of parameters of the network. For example, assuming that an input has dimension 100 × 100 × 3, and that the convolutional layer and the number of kernels is 2 and each kernel has a local receptive field of size 4, then the dimension of each kernel is 4 × 4 × 3 (3 is the depth of the kernel which will be the same as the depth of the input volume). For 100 neurons in the layer there will be in total only 4 × 4 × 3 × 2 = 96 parameters in this layer because all the 100 neurons will share the same weights for each kernel. This considers only the number of kernels and the size of the local connectivity but does not depend on the number neurons in the layer.

In addition to reducing the number of parameters, shared weights and a local connectivity are important in processing images efficiently. The reason therefore is that local convolutional operations in an image result in values that contain certain characteristics of the image, because in images local values are generally highly correlated and the statistics formed by the local values are often invariant in the location ( LeCun et al., 2015 ). Hence, using a kernel that shares the same weights can detect patterns from all the local regions in the image, and different kernels can extract different types of patterns from the image.

A non-linear activation function (for instance ReLu, tanh, sigmoid, etc.) is often applied to the values from the convolutional operations between the kernel and the input. These values are stored in the activation maps, which will be later passed to the next layer of the network.

## 5.1.2. Pooling Layer

A pooling layer is usually inserted between a convolutional layer and the following layer. Pooling layers aim at reducing the dimension of the input with some pre-specified pooling method, resulting in a smaller input by conserving as much information as possible. Also, a pooling layer is able to introduce spatial invariance into the network ( Scherer et al., 2010 ), which can help to improve the generalization of the model. In order to perform pooling, a pooling layer uses stride, zero-padding, and a pooling window size as hyper-parameters. The pooling layer will scan the entire input with the specified pooling window size in the same manner as the kernel in a convolutional layer. For instance, using a stride of 2, window size of 2 and 0 zeros-padding for pooling will half the size of the input dimension.

There are many types of pooling methods, e.g., averaging-pooling, min-pooling and some advanced pooling methods, such as fractional max-pooling and stochastic pooling. The most common used pooling method is max-pooling, as it has been shown to be superior in dealing with images by capturing invariances efficiently ( Scherer et al., 2010 ). Max-pooling extracts the maximum value within each specified sub-window across the activation map. The max-pooling can be formulated as A i, j, k = max ( R i − n : i + n, j − n : j + n, k ), where A i, j, k is the maximum activation value from the matrix R of size n × n centered at index i, j in the kth activation map with n is the window size.

## 5.1.3. Fully-Connected Layer

A fully-connected layer is the basic hidden layer unit in FFNN (see section 3.2). Interestingly, also for traditional CNN architectures, a fully connected layer is often added between the penultimate layer and the output layer to further model non-linear relationships of the input features ( Krizhevsky et al., 2012b ; Simonyan and Zisserman, 2014 ; Szegedy et al., 2015 ). However, recently the benefit of this has been questioned because of the many parameters introduced by this, leading potentially to overfitting ( Simonyan and Zisserman, 2014 ). As a result, more and more researchers started to construct CNN architecture without such a fully connected layer using other techniques like max-over-time pooling ( Lin et al., 2013 ; Kim, 2014 ) to replace the role of linear layers.

## 5.2. Important Variants of CNN

5.2.1. vggnet.

VGGNet ( Simonyan and Zisserman, 2014 ) was a pioneer in exploring how the depth of the network influences the performance of a CNN. VGGNet was proposed by the Visual Geometry Group and Google DeepMind, and they studied architectures with a depth of 19 (e.g., compared to 11 for AlexNet Krizhevsky et al., 2012b ).

VGG19 extended the network from eight weight layers (a structure proposed by AlexNet) to 19 weights layers by adding 11 more convolutional layers. In total, the parameters increased from 61 million to 144 million, however, the fully connected layer takes up most of the parameters. According to their reported results, the error rate dropped from 29.6 to 25.5 regrading top-1 val.error (percentage of times the classifier did not give the correct class with the highest score) on the ILSVRC dataset, and from 10.4 to 8.0 regarding top-5 val.error (percentage of times the classifier did not include the correct class among its top 5) on the ILSVRC dataset in ILSVRC2014. This indicates that a deeper CNN structure is able to achieve better results than shallower networks. In addition, they stacked multiple 3 × 3 convolutional layers without a pooling layer placed in between to replace the convolutional layer with a large filter sizes, e.g., 7 × 7 or 11 × 11. They suggested such an architecture is capable of receiving the same receptive fields as those composed of larger filter sizes. Consequently, two stacked 3 × 3 layers can learn features from a 5 × 5 receptive field, but with less parameters and more non-linearity.

## 5.2.2. GoogLeNet With Inception

The most intuitive way for improving the performance of a Convolutional Neural Network is to stack more layers and add more parameters to the layers ( Simonyan and Zisserman, 2014 ). However, this will impose two major problems. One is that too many parameters will lead to overfitting, and the other is that the model becomes hard to train.

GoogLeNet ( Szegedy et al., 2015 ) was introduced by Google. Until the introduction of inception, traditional state-of-the-art CNN architectures mainly focused on increasing the size and depth of the neural network, which also increased the computation cost of the network. In contrast, GoogLeNet introduced an architecture to achieve state-of-the-art performance with a light-weight network structure.

The idea underlying an inception network architecture is to keep the network as sparse as possible while utilizing the fast matrix computation feature provided by a computer. This idea facilitates the first inception structure (see Figure 6 ).

Figure 6 . Inception block structure. Here multiple blocks are stacked on top of each other, forming the input layer for the next block.

As one can see in the Figure 6 , several parallel layers including 1 × 1 convolution and 3 × 3 max pooling operate at the same level on the input. Each tunnel (namely one separated sequential operation) has a different child layer, including 3 × 3 convolutions, 5 × 5 convolutions and 1 × 1 convolution layer. All the results from each tunnel are concatenated together at the output layer. In this architecture, a 1x1 convolution is used to downscale the input image while reserving input information ( Lin et al., 2013 ). They argued that concatenating all the features extracted by different filters corresponds to the idea that image information should be processed at different scales and only the aggregated features should be sent to the next level. Hence, the next level can extract features from different scales. Moreover, this sparse structure introduced by an inception block requires much fewer parameters and, hence, is much more efficient.

By stacking the inception structure throughout the network, GoogLeNet won first place in the classification task of ILSVRC2014, demonstrating the quality of the inception structure. Followed by the inception v1, inception v2, v3, and the latest version v4 were introduced. Each generation introduced some new features, making the network faster, more light-weight and more powerful.

## 5.2.3. ResNet

In principle, CNNs with a deeper structure perform better than shallow ones ( Simonyan and Zisserman, 2014 ). In theory, deeper networks have a better ability to represent high level features from the input, therefore improving the accuracy of predictions ( Donahue et al., 2014 ). However, one cannot simply stack more and more layers. In the paper ( He et al., 2016 ), the authors observed the phenomena that more layers can actually hurt the performance. Specifically, in their experiment, network A had N layers, and network B had N + M layers, while the initial N layers had the same structure. Interestingly, when training on the CIFAR-10 and ImageNet dataset, network B showed a higher training error than network B. In theory, the extra M layers should result in a better performance, but instead they obtained higher errors which cannot be explained by overfitting. The reason for this is that the loss is getting optimized to local minima, which is different to the vanishing gradient phenomena. This is referred to as the degradation problem ( He et al., 2016 ).

ResNet ( He et al., 2016 ) was introduced to overcome the degradation problem of CNNs to push the depth of a CNN to its limit. In ( He et al., 2016 ), the authors proposed a novel structure of a CNN, which is in theory capable of being extended to an infinite depth without losing accuracy. In their paper, they proposed a deep residual learning framework, which consists of multiple residual blocks to address the degradation problem. The structure of a residual block is shown in the Figure 7 .

Figure 7 . The structure of a residual block. Inside a block there can be as many weight layers as desired.

Instead of trying to learn the desired underlying mapping H ( x ) from each few stacked layers, they used an identity mapping for input x from input to the output of the layer, and then let the network learn the residual mapping F ( x ) = H ( x ) − x . After adding the identity mapping, the original mapping can be reformulated as H ( x ) = F ( x ) + x . The identity mapping is realized by making shortcut connections from the input node directly to the output node. This can help to address the degradation problem as well as the vanishing (exploding) gradient issue of deep networks. In extreme cases, deeper layers can just learn the identity map of the input to the output layer, by simply calculating the residuals as 0. This enables the ability for a deep network to perform at least not worse than shallow ones. Also, in practice, the residuals are never 0, which makes it possible for very deeper layers to always learn something new from the residuals therefore producing better results. The implementation of ResNet helped to push the layers of CNNs to 152 by stacking so-called residual blocks through out the network. ResNet achieved the best result in the ILSVRC2016 competition, with an error rate of 3.57.

## 6. Deep Belief Networks

A Deep Belief Network (DBN) is a model that combines different types of neural networks with each other to form a new neural network model. Specifically, DBNs integrate Restricted Boltzmann Machines (RBMs) with Deep Feedforward Neural Networks (D-FFNN). The RBMs form the input unit whereas the D-FFNNs form the output unit. Frequently, RBMs are stacked on top of each other, which means more than one RBM is used sequentially. This adds to the depth of the DBN.

Due to the different nature of the networks RBM and D-FFNN, two different types of learning algorithms are used. Practically, the Restricted Boltzmann Machines are used for initializing a model in an unsupervised way. Thereafter, a supervised method is applied for the fine tuning of the parameters ( Yoshua, 2009 ). In the following, we describe these two phases of the training of a DBN in more detail.

## 6.1. Pre-training Phase: Unsupervised

Theoretically, neural networks can be learned by using supervised methods only. However, in practice it was found that such a learning process can be very slow. For this reason, unsupervised learning is used to initialize the model parameters. The standard neural network learning algorithm (backpropagation) was initially only able to learn shallow architectures. However, by using a Restricted Boltzmann Machine for the unsupervised initialization of the parameters one obtains a more efficient training of the neural network ( Hinton et al., 2006 ).

A Restricted Boltzmann Machine is a special type of a Boltzmann Machine (BM), see section 3.3.2. The difference between a Restricted Boltzmann Machine and a Boltzmann Machine is that Restricted Boltzmann Machines (RBMs) have constraints in the connectivity of their structure ( Fischer and Igel, 2012 ). Specifically, there can be no connections between nodes in the same layer. For an example, see Figure 8C .

Figure 8 . Examples for Boltzmann Machines. (A) The neurons are arranged on a circle. (B) The neurons are separated according to their type. Both Boltzmann Machines are identical and differ only in their visualization. (C) Transition from a Boltzmann Machine (left) to a Restricted Boltzmann Machine (right).

The values of neurons, v , in the visible layer are known, but the neuron values, h , in the hidden layer are unknown. The parameters of the network are learned by defining an energy function, E , of the model which is then minimized.

Frequently, a RBM is used with binary values, i.e., v i ∈ {0, 1} and h i ∈ {0, 1}. The energy function for such a network is given by ( Hinton, 2012 ):

whereas Θ = { a, b , W } is the set of model parameters.

Each configuration of the system corresponds to a probability defined via the Boltzmann distribution in Equation (11):

In Equation (12), Z is the partition function given by:

The probability for the network assigning to a visible vector v is given by summing over all possible hidden vectors:

Maximum-likelihood estimation (MLE) is used for estimating the optimal parameters of the probabilistic model ( Hayter, 2012 ). For a training data set D = D t r a i n = { v 1 , … , v l } consisting of l patterns, assuming that the patterns are iid (independent and identical) distributed, the log-likelihood function is given by:

For simple cases, one may be able to find an analytical solution for Equation (15) by solving ∂ ∂ θ ln L ( θ | D ) = 0 . However, usually the parameters need to be found numerically. For this, the gradient of the log-likelihood is a typical approach for estimating the optimal parameters:

In Equation (16), the constant, η, in front of the gradient is the learning rate and the first regularization term, −λθ ( t ) , is the weight-decay. The weight-decay is used to constrain the optimization problem by penalizing large values of θ ( Hinton, 2012 ). The parameter λ is also called the weight-cost . The second regularization term in Equation (16) is called momentum. The purpose of the momentum is to make learning faster and to reduce possible oscillations. Overall, this should stabilize the learning process.

For the optimization, the Stochastic Gradient Ascent (SGA) is utilized using mini-batches . That means one selects randomly a number of samples from the training set, k , which are much smaller than the total sample size, and then estimates the gradient. The parameters, θ, are then updated for the mini-batch. This process is repeated iteratively until an epoch is completed. An epoch is characterized by using the whole training set once. A common problem is encountered when using mini-batches that are too large, because this can slow down the learning process considerably. Frequently, k is chosen between 10 and 100 ( Hinton, 2012 ).

Before the gradient can be used, one needs to approximate the gradient of Equation (16). Specifically, the derivatives with respect to the parameters can be written in the following form:

In Equation (17), H i denotes the value of hidden unit i and p ( v ) is the probability defined in Equation (14). For the conditional probability, one finds

and correspondingly

Using the above equations in the presented form would be inefficient because these equations require a summation over all visible vectors. For this reason, the Contrastive Divergence (CD) method is used for increasing the speed for the estimation of the gradient. In Figure 9A , we show pseudocode of the CD algorithm.

Figure 9. (A) Contrastive Divergence k-step algorithm using Gibbs sampling. (B) Backpropagation algorithm. (C) iRprop + algorithm.

The CD uses Gibbs sampling for drawing samples from conditional distributions, so that the next value depends only on the previous one. This generates a Markov chain ( Hastie et al., 2009 ). Asymptotically, for k → ∞ the distribution becomes the true stationary distribution. In this case, the CD → ML . Interestingly, already k = 1 can lead to satisfactory approximations for the pre-training ( Carreira-Perpinan and Hinton, 2005 ).

In general, pre-training of DBNs consists of stacking RBMs. That means the next RBM is trained using the hidden layer of the previous RBM as visible layer. This initializes the parameters for each layer ( Hinton and Salakhutdinov, 2006 ). Interestingly, the order of this training is not fixed but can vary. For instance, first, the last layer can be trained and then the remaining layers can be trained ( Hinton et al., 2006 ). In Figure 10 , we show an example for the stacking of RBMs.

Figure 10 . Visualizing the stacking of RBMs in order to learn the parameters Θ of a model in an unsupervised way.

## 6.2. Fine-Tuning Phase: Supervised

After the initialization of the parameters of the neural network, as described in the previous step, these can now be fine-tuned. For this step, a supervised learning approach is used, i.e., the labels of the samples, omitted in the pre-training phase, are now utilized.

For learning the model, one minimizes an error function (also called loss function or sometimes objective function). An example for such an error function is the mean squared error (MSE).

In Equation (20), o i = ϕ ( x i ) is the i th output from the network function ϕ :ℝ m → ℝ n given the i th input x i from the training set D = D t r a i n = { ( x 1 , t 1 ) , … ( x l , t l ) } and t i is the target output.

Similarly, for maximizing the log-likelihood function of a RBM (see Equation 16), one uses gradient descent to find the parameters that minimize the error function.

Here, the parameters (η, λ and ν) have the same meaning as explained above. Again, the gradient is typically not used for the entire training data D , but instead smaller batches are used via the Stochastic Gradient Descent (SGD).

The gradient of the RBM log-likelihood can be approximated using the CD algorithm (see Figure 9A ). For this, the backpropagation algorithm is used ( LeCun et al., 2015 ).

Let us denote by a i l the activation of the ith unit in the lth layer ( l ∈ {2, …, L }), b i t the corresponding bias and w i j l the weight for the edge between the jth unit of the ( l − 1)th layer and the ith unit of the lth layer. For activation function, φ, the activation of the lth layer with the (l - 1)th layer as input is a l = φ( z ( l ) ) = φ( w ( l ) a ( l −1) + b ( l ) ).

Application of the chain rule leads to ( Nielsen, 2015 ):

In Equation (22), the vector δ L contains the errors of the output layer ( L ), whereas the vector δ l contains the errors of the lth layer. Here, · indicates the element-wise product of vectors. From this the gradient of the error of the output layer is given by

In general, the result of this depends on E . For instance, for the MSE we obtain ∂ E ∂ a j ( L ) = ( a j - t j ) . As a result, the pseudocode for the backpropagation algorithm can be formulated as shown in Figure 9B ( Nielsen, 2015 ). The estimated gradients from Figure 9B are then used to update the parameters (weights and biases) via SGD (see Equation 21). More updates are performed using mini-batches until all training data have been used ( Smolander, 2016 ).

The resilient backpropagation algorithm (Rprop) is a modification of the backpropagation algorithm that was originally introduced to speed up the basic backpropagation (Bprop) algorithm ( Riedmiller and Braun, 1993 ). There exist at least four different versions of Rprop ( Igel and Hüsken, 2000 ) and in Algorithm 9 pseudocode for the iRprop + algorithm (which improves Rprop with weight-backtracking) is shown ( Smolander, 2016 ).

As one can see in Algorithm 9, iRprop + uses information about the sign of the partial derivative from time step ( t − 1) to make a decision for the update of the parameter. Importantly, the results of comparisons have shown that the iRprop + algorithm is faster than Bprop ( Igel and Hüsken, 2000 ).

It has been shown that the backpropagation algorithm with SGD can learn good neural network models even without a pre-training stage when the training data are sufficiently large ( LeCun et al., 2015 ).

In Figure 11 , we show an example of the overall DBN learning procedure. The left-hand side shows the pre-training phase and the right-hand side the fine-tuning.

Figure 11 . The two stages of DBN learning. (Left) The hidden layer (purple) of one RBM is the input of the next RBM. For this reason their dimensions are equal. (Right) The two edges in fine-tuning denote the two stages of the backpropagation algorithm: the input feedforwarding and the error backpropagation. The orange layer indicated the output.

DBNs have been used successfully for many application tasks, e.g., natural language processing ( Sarikaya et al., 2014 ), acoustic modeling ( Mohamed et al., 2011 ), image recognition ( Hinton et al., 2006 ) and computational biology ( Zhang S. et al., 2015 ).

## 7. Autoencoder

An Autoencoder is an unsupervised neural network model used for representation learning, e.g., feature selection or dimension reduction. A common property of autoencoders is that the size of the input and output layer is the same with a symmetric architecture ( Hinton and Salakhutdinov, 2006 ). The underlying idea is to learn a mapping from an input pattern x to a new encoding c = h ( x ), which ideally gives as output pattern the same as the input pattern, i.e., x ≈ y = g ( c ). Hence, the encoding c , which has usually lower dimension than x , allows to reproduce (or code for) x .

The construction of Autoencoders is similar to DBNs. Interestingly, the original implementation of an autoencoder ( Hinton and Salakhutdinov, 2006 ) pre-trained only the first half of the network with RBMs and then unrolled the network, creating in this way the second part of the network. Similar to DBNs, a pre-training phase is followed by a fine-tuning phase. In Figure 12 , an illustration of the learning process is shown. Here, the coding layer corresponds to the new encoding c providing, e.g., a reduced dimension of x .

Figure 12 . Visualizing the idea of autoencoder learning. The learned new encoding of the input is represented in the code layer (shown in blue).

An Autoencoder does not utilize labels and, hence, it is an unsupervised learning model. In applications, the model has been successfully used for dimensionality reduction. Autoencoders can achieve a much better two-dimensional representation of array data, when an adequate amount of data is available ( Hinton and Salakhutdinov, 2006 ). Importantly, PCAs implement a linear transformation, whereas Autoencoders are non-linear. Usually, this results in a better performance. We would like to highlight that there are many extensions of these models, e.g., sparse autoencoder, denoising autoencoder or variational autoencoder ( Vincent et al., 2010 ; Deng et al., 2013 ; Pu et al., 2016 ).

## 8. Long Short-Term Memory Networks

Long short-term memory (LSTM) networks were introduced by Hochreiter and Schmidhuber in 1997 ( Hochreiter and Schmidhuber, 1997 ). LSTM is a variant of a RNN that has the ability to address the shortcomings of RNNs which do not perform well, e.g., when handling long-term dependencies ( Graves, 2013 ). Furthermore, LSTMs avoid the gradient vanishing or exploding problem ( Hochreiter, 1998 ; Gers et al., 1999 ). In 1999, a LSTM with a forget gate was introduced which could reset the cell memory. This improved the initial LSTM and became the standard structure of LSTM networks ( Gers et al., 1999 ). In contrast to Deep Feedforward Neural Networks, LSTMs contain feedback connections. Furthermore, they can not only process single data points, such as vectors or arrays, but sequences of data. For this reason, LSTMs are particularly useful for analyzing speech or video data.

## 8.1. LSTM Network Structure With Forget Gate

Figure 13 shows an unrolled structure of a LSTM network model ( Wang et al., 2016 ). In this model, the input and output are organized vertically, while information is delivered horizontally over the time series.

Figure 13. (Left) A folded structure of a LSTM network model. (Right) An unfolded structure of a LSTM network model. x i is the input data at time i and y i is the corresponding output ( i is the time step starting from ( t − 1)). In this network, only y t + 2 ′ activated by softmax function is the final network output.

In a standard LSTM network, the basic entity is called LSTM unit or a memory block ( Gers et al., 1999 ). Each unit is composed of a cell, the memory part of the unit, and three gates: an input gate, an output gate and a forget gate (also called keep gate) ( Gers et al., 2002 ). A LSTM unit can remember values over arbitrary time intervals and the three gates control the flow of information through the cell. The central feature of a LSTM cell is a part called “constant error carousel” (CEC) ( Lipton et al., 2015 ). In general, a LSTM network is formed exactly like a RNN, except that the neurons in the hidden layers are replaced by memory blocks.

In the following, we discuss some core concepts and the corresponding technicalities ( W and U stand for the weights and b for the bias). In Figure 14 , we show a schematic description of a LSTM block with one cell.

• Input gate: A unit with sigmoidal function that controls the flow of information into the cell. It receives its activation from both output of the previous time h ( t −1) and current input x ( t ) . Under the effect of the sigmoid function, an input gate i t generates values between zero and one. Zero indicates it blocks the information entirely, whereas values of one allow all the information to pass.

• Cell input layer: The cell input has a similar flow as the input gate, receiving h ( t −1) and x ( t ) as input. However, a tanh activation is used to squish input values to a range between -1 and 1 (denoted by l t in Equation 25).

• Forget gate: A unit with a sigmoidal function determines which information from previous steps of the cell should be memorized or forgotten. The forget gate f t assumes values between zero and one based on the input, h ( t −1) and x ( t ) . In the next step, f t is given by a Hadamard product with an old cell state c t −1 to update to a new cell state c t (Equation 26). In this case, a value of zero means the gate is closed, so it will completely forget the information of the old cell state c t −1 , whereas values of one will make all information memorable. Therefore, a forget gate has the right to reset the cell state if the old information is considered meaningless.

• Cell state: A cell state stores the memory of a cell over a longer time period ( Ming et al., 2017 ). Each cell has a recurrently self-connected linear unit which is called Constant Error Carousel (CEC) ( Hochreiter and Schmidhuber, 1997 ). The CEC mechanism ensures that a LSTM network does not suffer from the vanishing or exploding gradient problem ( Elsayed et al., 2018 ). The CEC is regulated by a forget gate and it can also be reset by the forget gate. At time t , the current cell state c t is updated by the previous cell state c t −1 controlled by the forget gate and the product of the current input and the cell input, i.e., ( i t ∘ l t ). Overall, Equation (27) describes the combined update of a cell state,

• Output gate: A unit with a sigmoidal function can control the flow of information out of the cell. A LSTM uses the values of the output gate at time t (denoted by o t ) to control the current cell state c t activated by a tanh function, to obtain the final output vector h ( t ) ,

Figure 14 . Internal connectivity pattern of a standard LSTM unit (blue rectangle). The output from the previous time step, h ( t −1) , and x ( t ) , are the input to the block at time t , then the output h ( t ) at time t will be an input to the same block in the next time step ( t + 1).

## 8.2. Peephole LSTM

A Peephole LSTM is a variant of a LSTM proposed by Gers and Schmidhuber (2000) . In contrast to a standard LSTM discussed above, a Peephole LSTM uses the cell state c , instead of h for regulating the forget gate, input gate and output gate. In Figure 15 , we show the internal connectivity of a Peephole LSTM unit whereas the red arrows represent the new peephole connections.

Figure 15 . Internal connectivity of a Peephole LSTM unit (blue rectangle). Here x ( t ) is the input to the cell at time t , and h ( t ) is its output. The red arrows are the new peephole connections added, compared to the standard LSTM in Figure 14 .

The key difference between a Peephole LSTM and a standard LSTM is that the forget gate f t , input gate i t and output gate o t do not use h ( t −1) as input. Instead, these gates use the cell state c t −1 . In order to understand the base idea behind a Peephole LSTM, let us assume the output gate o t −1 in a traditional LSTM network is closed. Then the output of the network h ( t −1) at time ( t − 1) will be 0, according to Equation (29), and in the next time step t , the regulating mechanism of all three gates will only depend on the network input x ( t −1) . Therefore, the historical information will be lost completely. A Peephole LSTM avoids this problem by using a cell state instead of output h to control the gates. The following equations describe a Peephole LSTM formally.

Aside from these main forms of LSTMs described above, there are further variants. For instance, a Bidirectional LSTM Network (BLSTM) has been introduced by ( Graves and Schmidhuber, 2005 ), which can access long-range context in both input directions. Furthermore, in 2014, the concept of “Gated Recurrent Unit” was proposed, which is viewed as a simplified version of LSTM ( Cho et al., 2014 ) and in 2015, Wai-kin Wong and Wang-chun Woo introduced a Convolutional LSTM Network (ConvLSTM) for precipitation nowcasting ( Xingjian et al., 2015 ). There are further variants of LSTM networks; however, most of them are designed for specific application domains without clear performance advantage.

## 8.3. Applications

LSTMs have a wide range of applications in text generation, text classification, language translation or image captioning ( Hwang and Sung, 2015 ; Vinyals et al., 2015 ). In Figure 16 , an LSTM classifier model for text classification is shown. In this figure, the input of the LSTM structure at each time step is a word embedding vector V i , which is a common choice for text classification problems. A word embedding technique maps the words or phrases in the vocabulary to vectors consisting of real numbers. Some common word embedding techniques include word2vec, GloVe, FastText, etc. Zhou (2019) . The output y N is the corresponding output at the Nth time step and y N ′ is the final output after softmax activation of y N , which will determine the classification of the input text.

Figure 16 . An LSTM classifier model for text classification. N is the sequence length of the input text (the number of words). Input from V 1 to V N is a sequence of word embedding vectors used as input to the model at different time steps. y N ′ is the final prediction result.

## 9. Discussion

9.1. general characteristics of deep learning.

A property common to all deep learning models is that they perform so-called representation learning. Sometimes this is also called feature learning. This denotes a model that learns new and better representations compared to the raw data. Importantly, deep learning models do not learn the final representation within one step but multiple ones corresponding to multi-level representation transformations between the hidden layers ( LeCun et al., 2015 ).

Another common property of deep learning models is that the subsequent transformations between layers are non-linear (see Figure 3 ). This increases the expressive power of the model ( Duda et al., 2000 ). Furthermore, individual representations are not designed manually, but learned via training data ( LeCun et al., 2015 ). This makes deep learning models very flexible.

## 9.2. Differences Between Models

Currently, CNNs are the dominating deep learning models for computer vision tasks ( LeCun et al., 2015 ). They are effective when the data consist of arrays where nearby values in an array are correlated with each other, e.g., as is the case for images, videos, and sound data. A convolutional layer can easily process high-dimensional input by using the local connectivity and shared weights, while a pooling layer can down-sample the input without losing essential information. Each convolutional layer is capable of converting the input image into groups of more abstract features using different kernels; therefore, by stacking multiple convolution layers, the network is able to transform the input image to a representation that captures essential patterns from the input, thus making precise predictions.

However, also in other areas, CNNs have shown very competitive results compared to other deep learning architectures, e.g., in natural language processing ( Kim, 2014 ; Yang et al., 2020 ). Specifically, CNNs can be good at extracting local information from text and exploring meaningful semantic and syntactic meanings between phrases and words. Also, the natural composition of text data can be easily handled by a CNN architecture. Hence, CNNs show very strong potential in performing classification tasks where successful predictions heavily rely on extracting key information from input text ( Yin et al., 2017 ).

The classical network architecture is fully connected and feedforward corresponding to a D-FFNN. Interestingly, in ( Mayr et al., 2016 ), it has been shown that a D-FFNN outperformed other methods for predicting the toxicity of drugs. Also for drug target predictions, a D-FFNN has been shown to be superior compared to other methods ( Mayr et al., 2018 ). This shows that even such an architecture can be successfully used in modern applications.

Commonly, RNNs are used for problems with sequential data, such as speech and language processing or modeling ( Sundermeyer et al., 2012 ; Graves et al., 2013 ; Luong and Manning, 2015 ). While DBNs and CNNs are feedforward networks, connections in RNNs can form cycles. This allows the modeling of dynamical changes over time ( LeCun et al., 2015 ).

A problem with finding the right application for a deep learning model is that their application domains are not mutually exclusive from each other. Instead, as the discussion above shows, there is a considerable overlap and the best model can in many cases only be found by conducting a comparative study. In Table 3 , we show several examples of different applications involving images, audio, text, and genomics data.

Table 3 . Overview of applications of deep learning methods.

## 9.3. Interpretable Models vs. Black-Box Models

Any model in data science can be categorized either as an inferential model or a prediction model ( Breiman, 2001 ; Shmueli, 2010 ). An inferential model does not only make predictions but provides also an interpretable structure. Hence, it is a model of the prediction process itself, e.g., a causal model. In contrast, a prediction model is merely a black-box model for making predictions.

The models discussed in this review neither aim at providing physiological models of biological neurons nor offer an interpretable structure. Instead, they are prediction models. An example for a biologically motivated learning rule for neural networks is the Hebbian learning rule ( Hebb, 1949 ). Hebbian learning is a form of unsupervised learning of neural networks that does not use global information about the error as backpropagation. Instead, only local information is used from adjacent neurons. There are many extensions of Hebb's basic learning rule that have been introduced based on new biological insights (see e.g., Emmert-Streib, 2006 ).

Recently, there is great interest in interpretable or explainable AI (XAI) ( Biran and Cotton, 2017 ; Doshi-Velez and Kim, 2017 ). Especially in the clinical and medical area, one would like to have understandable decisions of statistical prediction models because patients are affected ( Holzinger et al., 2017 ). The field is still in its infancy, but if meaningful interpretations of general deep learning models could be found this would certainly revolutionize the field.

As a note, we would like to add that the distinction between an explainable AI model and a non-explainable model is not well-defined. For instance, the sparse coding model by Olshausen and Field (1997) was shown to be similar to the coding of images in the human visual cortex ( Tosic and Frossard, 2011 ) and an application of this model can be found in Charles et al. (2011) , where an unsupervised learning approach was used to learn an optimal sparse coding dictionary for the classification of high spectral imagery (HIS) data. Some may consider this model as an XAI model because of the similarity to the working mechanism of the human cortex, whereas others may question this explanation.

## 9.4. Big Data vs. Small Data

In statistics, the field of experimental design is concerned with assessing if the available sample sizes are sufficient to conduct a particular analysis (for a practical example see Stupnikov et al., 2016 ). In contrast, for all methods discussed in this paper, we assumed that we are in the big data domain implying sufficient samples. This corresponds to the ideal case. However, we would like to point out that for practical applications, one needs to assess this situation case-by-case to ensure the available data (respectively the sample sizes) are sufficient to use deep learning models. Unfortunately, this issue is not well-represented in the current literature. As a rule-of-thumb , deep learning models usually perform well for tens of thousands of samples but it is largely unclear how they perform in a small data setting. This leaves it to the user to estimate learning curves of the generalization error for a given model to avoid spurious results ( Emmert-Streib and Dehmer, 2019b ).

As an example to demonstrate this problem, we conducted an analysis to explore the influence of the sample size on the accuracy of the classification of the EMNIST data. EMNIST (Extended MNIST) ( Cohen et al., 2017 ) consists of 280, 000 handwritten characters (240, 000 training samples and 40, 000 test samples) for 10 balanced classes (0–9). We used a multilayered Long Short-Term Memory (LSTM) model for the 10-class handwritten digit classification task. The model we used is a four-layer network (three hidden layers and one fully connected layer), and each hidden layer contains 200 neurons. For this analysis, we set the batch size to 100 and the training samples were randomly drawn if the number of training samples was < 240, 000 (subsampling).

From the results in Figure 17 , one can see that thousands of training samples are needed to achieve a classification error below 5% (blue dashed line). Specifically, more than 25, 000 training samples are needed. Given the relative simplicity of the problem—classification of ten digits, compared to classification or diagnosis of cancer patients—the severity of this issue should become clear. Also, these results show that a deep learning model cannot do miracles. If the number of samples is too small, the method breaks down. Hence, the combination of a model and data is crucial for solving a task.

Figure 17 . Classification error of the EMNIST data in dependence on the number of training samples. The standard errors are shown in red and the horizontal dashed line corresponds to an error of 5% (reference). The results are averaged over 10 independent runs.

## 9.5. Data Types

A related problem to the sample size issue discussed above is the type of data. Examples for different data types are text data, image data, audio data, network data or measurement/sensor data (for instance from genomics) to name just a few. One can further subdivide these data according to the application domain from which these originate, e.g., text data from medical publications, text data from social media or text data from novels. Considering such categorizations, it becomes clear that the information content of 'one sample' does not have the same meaning for each data type and for each application domain. Hence, the assessment of deep learning models needs to be always conducted in a domain specific manner because the transfer of knowledge between such models is not straight forward.

## 9.6. Further Advanced Models

Finally, we would like to emphasize that there are additional but more advanced models of deep learning networks, which are outside the core architectures. For instance, deep learning and reinforcement learning have been combined with each other to form deep reinforcement learning ( Mnih et al., 2015 ; Arulkumaran et al., 2017 ; Henderson et al., 2018 ). Such models have found application in problems from robotics, games and healthcare.

Another example for an advanced model is a graph CNN, which is particularly suitable when data have the form of graphs ( Henaff et al., 2015 ; Wu et al., 2019 ). Such models have been used in natural language processing, recommender systems, genomics and chemistry ( Li et al., 2018 ; Yao et al., 2019 ).

Lastly, a further advanced model is a Variational Autoencoder (VAE) ( An and Cho, 2015 ; Doersch, 2016 ). Put simply, a VAR is a regularized Autoencoder that uses a distribution over the latent spaces as encoding for the input, instead of a single point. The major application of VAE is as a generative model for generating similar data in an unsupervised manner, e.g., for image or text generation.

## 10. Conclusion

In this paper, we provided an introductory review for deep learning models including Deep Feedforward Neural Networks, (D-FFNN), Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Autoencoders (AE) and Long Short-Term Memory networks (LSTMs). These models can be considered the core architectures that currently dominate deep learning. In addition, we discussed related concepts needed for a technical understanding of these models, e.g., Restricted Boltzmann Machines and resilient backpropagation. Given the flexibility of network architectures allowing a “ Lego-like ” construction of new models, an unlimited number of neural network models can be constructed by utilizing elements of the core architectural building blocks discussed in this review. Hence, a basic understanding of these elements is key to be equipped for future developments in AI.

## Author Contributions

FE-S conceived the study. All authors contributed to all aspects of the preparation and the writing of the manuscript.

MD thanks the Austrian Science Funds for supporting this work (project P 30031).

## Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

## Acknowledgments

We would like to thank Johannes Smolander for discussions about Deep Belief Networks.

Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. (2015). Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838. doi: 10.1038/nbt.3300

PubMed Abstract | CrossRef Full Text | Google Scholar

An, J., and Cho, S. (2015). Variational Autoencoder Based Anomaly Detection Using Reconstruction Probability . Special Lecture on IE 2.

Google Scholar

Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. (2017). Deep reinforcement learning: a brief survey. IEEE Signal Process. Mag. 34, 26–38. doi: 10.1109/MSP.2017.2743240

CrossRef Full Text | Google Scholar

Bergmeir, C., and Benítez, J. M. (2012). Neural networks in R using the stuttgart neural network simulator: RSNNS. J. Stat. Softw. 46, 1–26. doi: 10.18637/jss.v046.i07

Biran, O., and Cotton, C. (2017). “Explanation and justification in machine learning: a survey,” in IJCAI-17 Workshop on Explainable AI (XAI) . Vol. 8, 1.

Bottou, L. (2010). “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT'2010 (Springer), 177–186.

Breiman, L. (2001). Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat. Sci. 16, 199–231. doi: 10.1214/ss/1009213726

Cao, C., Liu, F., Tan, H., Song, D., Shu, W., Li, W., et al. (2018). Deep learning and its applications in biomedicine. Genomics Proteomics Bioinform. 16, 17–32. doi: 10.1016/j.gpb.2017.07.003

Cao, S., Lu, W., and Xu, Q. (2016). “Deep neural networks for learning graph representations,” in Thirtieth AAAI Conference on Artificial Intelligence .

Carreira-Perpinan, M. A., and Hinton, G. E. (2005). “On contrastive divergence learning,” in Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (Citeseer), 33–40.

Charles, A. S., Olshausen, B. A., and Rozell, C. J. (2011). Learning sparse codes for hyperspectral imagery. IEEE J. Select. Top. Signal Process. 5, 963–978. doi: 10.1109/JSTSP.2011.2149497

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., et al. (2015). Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems.

Chimera (2019). Pydbm . arXiv:1512.01274.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv [Preprint] . arXiv:1406.1078. doi: 10.3115/v1/D14-1179

Chollet, F. (2015). Keras . Available online at: https://github.com/fchollet/keras

Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. (2017). Emnist: an extension of mnist to handwritten letters. arXiv[Preprint]. arXiv:1702.05373. doi: 10.1109/IJCNN.2017.7966217

Dai, J., Wang, Y., Qiu, X., Ding, D., Zhang, Y., Wang, Y., et al. (2018). BigDL: a distributed deep learning framework for big data. arXiv:1804.05839.

[Dataset] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467.

[Dataset] Bondarenko, Y. (2017). Boltzman-Machines .

[Dataset] Candel, A., Pramar, V., LeDell, E., and Arora, A. (2015). Deep Learning With H2O .

[Dataset] Dieleman, S., Schlüter, J., Raffel, C., Olson, E., Sonderby, S. K., Nouri, D., et al. (2015). Lasagne: First Release .

[Dataset] Howard J., and Gugger S. (2018). fastai: A Layered API for Deep Learning . arXiv:2002.04688.

Deng, J., Zhang, Z., Marchi, E., and Schuller, B. (2013). “Sparse autoencoder-based feature transfer learning for speech emotion recognition,” in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (IEEE), 511–516.

Dixon, M., Klabjan, D., and Wei, L. (2017). Ostsc: over sampling for time series classification in R.

Doersch, C. (2016). Tutorial on variational autoencoders. arXiv [Preprint] . arXiv:1606.05908.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., et al. (2014). “Decaf: a deep convolutional activation feature for generic visual recognition,” in International Conference on Machine Learning , 647–655.

Doshi-Velez, F., and Kim, B. (2017). Towards a rigorous science of interpretable machine learning. arXiv [Preprint] . arXiv:1702.08608.

Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification. 2nd Edn. Wiley.

Elsayed, N., Maida, A. S., and Bayoumi, M. (2018). Reduced-gate convolutional LSTM using predictive coding for spatiotemporal prediction. arXiv [Preprint] . arXiv:1810.07251.

Emmert-Streib, F. (2006). A heterosynaptic learning rule for neural networks. Int. J. Mod. Phys. C 17, 1501–1520. doi: 10.1142/S0129183106009916

Emmert-Streib, F., and Dehmer, M. (2019a). Defining data science by a data-driven quantification of the community. Mach. Learn. Knowl. Extract. 1, 235–251. doi: 10.3390/make1010015

Emmert-Streib, F., and Dehmer, M. (2019b). Evaluation of regression models: model assessment, model selection and generalization error. Mach. Learn. Knowl. Extract. 1, 521–551. doi: 10.3390/make1010032

Enarvi, S., and Kurimo, M. (2016). TheanoLM–an extensible toolkit for neural network language modeling. Proc. Interspeech 3052–3056 doi: 10.21437/Interspeech.2016-618

Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature 542:115. doi: 10.1038/nature21056

Fischer, A., and Igel, C. (2012). “An introduction to restricted boltzmann machines,” in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (Springer), 14–36.

Fodor, J. A., and Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: a critical analysis. Cognition 28, 3–71.

PubMed Abstract | Google Scholar

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybernet. 36, 193–202.

Fukushima, K. (2013). Training multi-layered neural network neocognitron. Neural Netw. 40, 18–31. doi: 10.1016/j.neunet.2013.01.001

Gers, F. A., and Schmidhuber, J. (2000). “Recurrent nets that time and count,” in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium (IEEE), Vol. 3, 189–194.

Gers, F. A., Schmidhuber, J., and Cummins, F. (1999). Learning to forget: continual prediction with LSTM. Neural Comput . 12, 2451–2471. doi: 10.1162/089976600300015015

PubMed Abstract | CrossRef Full Text

Gers, F. A., Schraudolph, N. N., and Schmidhuber, J. (2002). Learning precise timing with lstm recurrent networks. J. Mach. Learn. Res. 3, 115–143. Available online at: http://www.jmlr.org/papers/v3/gers02a.html

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning . MIT Press.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). “Generative adversarial nets,” in Advances in Neural Information Processing Systems , 2672–2680.

Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., et al. (2013). Pylearn2: a machine learning research library. arXiv:1308.4214.

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv [Preprint]. arXiv:1308.0850.

Graves, A., Mohamed, A., and Hinton, G. E. (2013). “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) .

Graves, A., and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18, 602–610. doi: 10.1016/j.neunet.2005.06.042

Hastie, T. J., Tibshirani, R. J., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction . Springer Series in Statistics. Springer.

Hayter, H. O. (2012). Probability and Statistics for Engineers and Scientists. 4th Edn. Duxbury Press).

He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 770–778.

Hebb, D. (1949). The Organization of Behavior . New York, NY: Wiley.

Henaff, M., Bruna, J., and LeCun, Y. (2015). Deep convolutional networks on graph-structured data. arXiv [Preprint] . arXiv:1506.05163.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018). “Deep reinforcement learning that matters,” in Thirty-Second AAAI Conference on Artificial Intelligence .

Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theory of Neural Compuation . Addison-Wesley.

Hinton, G. E. (2012). Neural Networks: Tricks of the Trade. 2nd Edn. Chapter. A Practical Guide to Training Restricted Boltzmann Machines. Berlin; Heidelberg: Springer Berlin Heidelberg, 599–619.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554. doi: 10.1162/neco.2006.18.7.1527

Hinton, G. E., and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science 313, 504–507. doi: 10.1126/science.1127647

Hinton, G. E., and Sejnowski, T. J. (1983). “Optimal perceptual inference,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (Citeseer), 448–453.

Hochreiter, S. (1991). Untersuchungen zu Dynamischen Neuronalen Netzen . Diploma, Technische Universität München 91.

Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertainty Fuzziness Knowl. Based Syst. 6, 107–116.

Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780.

Holzinger, A., Biemann, C., Pattichis, C. S., and Kell, D. B. (2017). What do we need to build explainable AI systems for the medical domain? arXiv [Preprint] . arXiv:1712.09923.

Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554–2558.

Hoppensteadt, F. C., and Izhikevich, E. M. (1999). Oscillatory neurocomputers with dynamic connectivity. Phys. Rev. Lett. 82:2983.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Netw. 4, 251–257.

Hosny, A., Parmar, C., Coroller, T. P., Grossmann, P., Zeleznik, R., Kumar, A., et al. (2018). Deep learning for lung cancer prognostication: a retrospective multi-cohort radiomics study. PLoS Med. 15:e1002711. doi: 10.1371/journal.pmed.1002711

Hwang, K., and Sung, W. (2015). “Single stream parallelization of generalized LSTM-like rnns on a GPU,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE), 1047–1051.

Igel, C., and Hüsken, M. (2000). “Improving the RPROP learning algorithm,” in Proceedings of the Second International ICSC Symposium on Neural Computation (NC 2000) , Vol. 2000 (Citeseer), 115–121.

Ivakhnenko, A. G. (1968). The group method of data of handling; a rival of the method of stochastic approximation. Soviet Autom. Control 13, 43–55.

Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Trans. Syst. Man Cybernet. SMC-1, 364–378.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., et al. (2014). “Caffe: convolutional architecture for fast feature embedding,” in Proceedings of the 22Nd ACM International Conference on Multimedia , MM '14 (New York, NY: ACM), 675–678.

Jiang, M., Liang, Y., Feng, X., Fan, X., Pei, Z., Xue, Y., et al. (2018). Text classification based on deep belief network and softmax regression. Neural Comput. Appl. 29, 61–70. doi: 10.1007/s00521-016-2401-x

Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv [Preprint]. arXiv:1408.5882. doi: 10.3115/v1/D14-1181

Kou, Q., and Sugomori, Y. (2014). Rcppdl .

Kraemer, G., and Reichstein, M., and D., M. M. (2018). dimRed and coRanking—unifying dimensionality reduction in R. R J. 10, 342–358. doi: 10.32614/RJ-2018-039

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012a). ImageNet Classification with Deep Convolutional Neural Networks . Curran Associates, Inc.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012b). “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems , 1097–1105.

Lai, S., Xu, L., Liu, K., and Zhao, J. (2015). “Recurrent convolutional neural networks for text classification,” in Twenty-Ninth AAAI Conference on Artificial Intelligence .

Lawrence, S., Giles, C. L., Tsoi, A. C., and Back, A. D. (1997). Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Netw. 8, 98–113.

Le Cun, Y. (1989). Generalization and Network Design Strategies . Technical Report CRG-TR-89-4, Connectionism in Perspective. University of Toronto Connectionist Research Group, Toronto, ON.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521:436.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 541–551.

Lee, H., Pham, P., Largman, Y., and Ng, A. Y. (2009). “Unsupervised feature learning for audio classification using convolutional deep belief networks,” in Advances in Neural Information Processing Systems , 1096–1104.

Leung, M. K. K., Xiong, H. Y., Lee, L. J., and Frey, B. J. (2014). Deep learning of the tissue-regulated splicing code. Bioinformatics 30, 121–129. doi: 10.1093/bioinformatics/btu277

Li, R., Wang, S., Zhu, F., and Huang, J. (2018). “Adaptive graph convolutional neural networks,” in Thirty-Second AAAI Conference on Artificial Intelligence .

Lin, M., Chen, Q., and Yan, S. (2013). Network in network. arXiv [Preprint] . arXiv:1312.4400.

Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Numer. Math. 16, 146–160.

Liou, C.-Y., Cheng, W.-C., Liou, J.-W., and Liou, D.-R. (2014). Autoencoder for words. Neurocomputing 139, 84–96. doi: 10.1016/j.neucom.2013.09.055

Lipton, Z. C., Berkowitz, J., and Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv [Preprint] . arXiv:1506.00019.

Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). “The expressive power of neural networks: a view from the width,” in Advances in Neural Information Processing Systems , 6231–6239.

Luong, M.-T., and Manning, C. D. (2015). “Stanford neural machine translation systems for spoken language domains,” in Proceedings of the International Workshop on Spoken Language Translation , 76–79.

Mayr, A., Klambauer, G., Unterthiner, T., and Hochreiter, S. (2016). Deeptox: toxicity prediction using deep learning. Front. Environ. Sci. 3:80. doi: 10.3389/fenvs.2015.00080

Mayr, A., Klambauer, G., Unterthiner, T., Steijaert, M., Wegner, J. K., Ceulemans, H., et al. (2018). Large-scale comparison of machine learning methods for drug target prediction on chembl. Chem. Sci. 9, 5441–5451. doi: 10.1039/C8SC00148K

McCulloch, W., and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133.

Ming, Y., Cao, S., Zhang, R., Li, Z., Chen, Y., Song, Y., et al. (2017). “Understanding hidden memories of recurrent neural networks,” in 2017 IEEE Conference on Visual Analytics Science and Technology (VAST) (IEEE), 13–24.

Minsky, M., and Papert, S. (1969). Perceptrons . MIT Press.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature 518:529. doi: 10.1038/nature14236

Mohamed, A.-R., Dahl, G. E., and Hinton, G. (2011). Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20, 14–22. doi: 10.1109/TASL.2011.2109382

Nair, V., and Hinton, G. E. (2010). “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10) , 807–814.

Nielsen, M. A. (2015). Neural Networks and Deep Learning . Determination Press.

Olshausen, B. A., and Field, D. J. (1997). Sparse coding with an overcomplete basis set: a strategy employed by v1? Vision Res. 37, 3311–3325.

Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., et al. (2016). Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Trans. Audio Speech Lang. Process. 24, 694–707. doi: 10.1109/TASLP.2016.2520371

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., et al. (2017). Automatic differentiation in pytorch . Available online at: https://www.semanticscholar.org/paper/Automatic-differentiation-in-PyTorch-Paszke-Gross/b36a5bb1707bb9c70025294b3a310138aae8327a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830. Available online at: http://www.jmlr.org/papers/v12/pedregosa11a

Pham, T., Tran, T., Phung, D., and Venkatesh, S. (2016). “Deepcare: a deep dynamic memory model for predictive medicine,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining (Springer), 30–41.

Pu, Y., Gan, Z., Henao, R., Yuan, X., Li, C., Stevens, A., et al. (2016). “Variational autoencoder for deep learning of images, labels and captions,” in Advances in Neural Information Processing Systems , 2352–2360.

Quast, B. (2016). RNN: A Recurrent Neural Network in R . Working Papers.

Rawat, W., and Wang, Z. (2017). Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 29, 2352–2449. doi: 10.1162/neco_a_00990

Riedmiller, M., and Braun, H. (1993). “A direct adaptive method for faster backpropagation learning: the rprop algorithm,” in IEEE International Conference on Neural Networks (IEEE), 586–591.

Rong, X. (2014). Deep Learning Toolkit in R .

Rosenblatt, F. (1957). The Perceptron, A Perceiving and Recognizing Automaton Project Para . Cornell Aeronautical Laboratory.

Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating errors. Nature 323, 533–536.

Sahu, S. K., and Anand, A. (2018). Drug-drug interaction extraction from biomedical texts using long short-term memory network. J. Biomed. Inform. 86, 15–24. doi: 10.1016/j.jbi.2018.08.005

Salakhutdinov, R., and Hinton, G. E. (2009). “Deep boltzmann machines,” in International conference on artificial intelligence and statistics , 448–455.

Sarikaya, R., Hinton, G. E., and Deoras, A. (2014). Application of deep belief networks for natural language understanding. IEEE/ACM Trans. Audio Speech Lang. Process. 22, 778–784. doi: 10.1109/TASLP.2014.2303296

Scherer, D., Müller, A., and Behnke, S. (2010). “Evaluation of pooling operations in convolutional architectures for object recognition,” in International Conference on Artificial Neural Networks (Springer), 92–101.

Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history compression. Neural Comput. 4, 234–242.

Schmidhuber, J. (2015). Deep learning in neural networks: an overview. Neural Netw. 61, 85–117. doi: 10.1016/j.neunet.2014.09.003

Sejnowski, T. J., and Rosenberg, C. R. (1987). Parallel networks that learn to pronounce english text. Complex Syst. 1, 145–168.

Shen, D., Wu, G., and Suk, H.-I. (2017). Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 19, 221–248. doi: 10.1146/annurev-bioeng-071516-044442

Shmueli, G. (2010). To explain or to predict? Stat. Sci. 25, 289–310. doi: 10.1214/10-STS330

Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv [Preprint] . arXiv:1409.1556.

Smolander, J. (2016). Deep learning classification methods for complex disorders (Master's thesis), The School of the thesis, Tampere University of Technology, Tampere, Finland . Available online at: https://dspace.cc.tut.fi/dpub/handle/123456789/23845

Smolander, J., Dehmer, M., and Emmert-Streib, F. (2019a). Comparing deep belief networks with support vector machines for classifying gene expression data from complex disorders. FEBS Open Bio 9, 1232–1248. doi: 10.1002/2211-5463.12652

Smolander, J., Stupnikov, A., Glazko, G., Dehmer, M., and Emmert-Streib, F. (2019b). Comparing biological information contained in mRNA and non-coding RNAs for classification of lung cancer patients. BMC Cancer 19:1176. doi: 10.1186/s12885-019-6338-1

Soman, K., Muralidharan, V., and Chakravarthy, V. S. (2018). An oscillatory neural autoencoder based on frequency modulation and multiplexing. Front. Comput. Neurosci. 12:52. doi: 10.3389/fncom.2018.00052

Stupnikov, A., Tripathi, S., de Matos Simoes, R., McArt, D., Salto-Tellez, M., Glazko, G., et al. (2016). samExploreR: exploring reproducibility and robustness of RNA-seq results based on SAM files. Bioinformatics 32, 3345–3347. doi: 10.1093/bioinformatics/btw475

Sundermeyer, M., Schlüter, R., and Ney, H. (2012). “LSTM neural networks for language modeling,” in Thirteenth Annual Conference of the International Speech Communication Association .

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 1–9.

Theano Development Team (2016). Theano: a Python framework for fast computation of mathematical expressions. arXiv [Preprint] . arXiv:abs/1605.02688.

Tosic, I., and Frossard, P. (2011). Dictionary learning. IEEE Signal Process. Mag. 28, 27–38.

Venkataraman, S., Yang, Z., Liu, D., Liang, E., Falaki, H., Meng, X., et al. (2016). “Sparkr: Scaling R programs with spark,” in Proceedings of the 2016 International Conference on Management of Data , SIGMOD '16 (New York, NY: ACM), 1099–1104. doi: 10.1145/2882903.2903740

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P. -A. (2010). Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408. Available online at: http://www.jmlr.org/papers/v11/vincent10a.html

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). “Show and tell: a neural image caption generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 3156–3164.

Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., and Fergus, R. (2013). “Regularization of neural networks using dropconnect,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13) , 1058–1066.

Wang, D., and Terman, D. (1995). Locally excitatory globally inhibitory oscillator networks. IEEE Trans. Neural Netw. 6, 283–286.

Wang, D., and Terman, D. (1997). Image segmentation based on oscillatory correlation. Neural Comput. 9, 805–836.

Wang, D. L., and Brown, G. J. (1999). Separation of speech from interfering sounds based on oscillatory correlation. IEEE Trans. Neural Netw. 10, 684–697.

Wang, Y., Huang, M., Zhao, L., et al. (2016). “Attention-based lstm for aspect-level sentiment classification,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , 606–615.

Webb, A. R., and Copsey, K. D. (2011). Statistical Pattern Recognition. 3rd Edn. Wiley.

Werbos, P. (1974). Beyond regression: new tools for prediction and analysis in the behavioral sciences (Ph.D. thesis), Harvard University, Harvard, MA, United States.

Werbos, P. J. (1981). “Applications of advances in nonlinear sensitivity analysis,” in Proceedings of the 10th IFIP Conference, 31.8–4.9 , New York, 762–770.

Widrow, B., and Hoff, M. E. (1960). Adaptive Switching Circuits . Technical Report, Stanford University, California; Stanford Electronics Labs.

Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. (2019). A comprehensive survey on graph neural networks. arXiv [Preprint] . arXiv:1901.00596.

Xingjian, S., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., and Woo, W.-C. (2015). “Convolutional lstm network: a machine learning approach for precipitation nowcasting,” in Advances in Neural Information Processing Systems , 802–810.

Yang, Z., Dehmer, M., Yli-Harja, O., and Emmert-Streib, F. (2020). Combining deep learning with token selection for patient phenotyping from electronic health records. Sci. Rep. 10:1432. doi: 10.1038/s41598-020-58178-1

Yao, L., Mao, C., and Luo, Y. (2019). “Graph convolutional networks for text classification,” in Proceedings of the AAAI Conference on Artificial Intelligence , Vol. 33, 7370–7377.

Yin, W., Kann, K., Yu, M., and Schütze, H. (2017). Comparative study of cnn and rnn for natural language processing. arXiv [Preprint] . arXiv:1702.01923.

Yoshua, B. (2009). Learning deep architectures for AI. Foundat. Trends Mach. Learn. 2, 1–127. doi: 10.1561/2200000006

Young, T., Hazarika, D., Poria, S., and Cambria, E. (2018). Recent trends in deep learning based natural language processing. IEEE Comput. Intell. Mag. 13, 55–75. doi: 10.1109/MCI.2018.2840738

Yu, D., and Li, J. (2017). Recent progresses in deep learning based acoustic models. IEEE/CAA J. Autom. Sinica 4, 396–409. doi: 10.1109/JAS.2017.7510508

Zhang, S., Zhou, J., Hu, H., Gong, H., Chen, L., Cheng, C., et al. (2015). A deep learning framework for modeling structural features of rna-binding protein targets. Nucleic Acids Res. 43:e32. doi: 10.1093/nar/gkv1025

Zhang, X., Zhao, J., and LeCun, Y. (2015). “Character-level convolutional networks for text classification,” in Advances in Neural Information Processing Systems , 649–657.

Zhou, Y. (2019). Sentiment classification with deep neural networks (Master's thesis). Tampere University, Tampere, Finland.

Keywords: deep learning, artificial intelligence, machine learning, neural networks, prediction models, data science

Citation: Emmert-Streib F, Yang Z, Feng H, Tripathi S and Dehmer M (2020) An Introductory Review of Deep Learning for Prediction Models With Big Data. Front. Artif. Intell. 3:4. doi: 10.3389/frai.2020.00004

Received: 24 October 2019; Accepted: 31 January 2020; Published: 28 February 2020.

Reviewed by:

Copyright © 2020 Emmert-Streib, Yang, Feng, Tripathi and Dehmer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Frank Emmert-Streib, v@bio-complexity.com

## Artificial neural networks and their application to sequence recognition

Downloadable content.

- Bengio, Yoshua
- This thesis studies the introduction of a priori structure into the design of learning systems based on artificial neural networks applied to sequence recognition, in particular to phoneme recognition in continuous speech. Because we are interested in sequence analysis, algorithms for training recurrent networks are studied and an original algorithm for constrained recurrent networks is proposed and test results are reported. We also discuss the integration of connectionist models with other analysis tools that have been shown to be useful for sequences, such as dynamic programming and hidden Markov models. We introduce an original algorithm to perform global optimization of a neural network/hidden Markov model hybrid, and show how to perform such a global optimization on all the parameters of the system. Finally, we consider some alternatives to sigmoid networks: Radial Basis Functions, and a method for searching for better learning rules using a priori knowledge and optimization algorithms.
- Computer Science.
- Artificial Intelligence.
- McGill University
- https://escholarship.mcgill.ca/concern/theses/qv33rx48q
- All items in eScholarship@McGill are protected by copyright with all rights reserved unless otherwise indicated.
- School of Computer Science
- Doctor of Philosophy
- Theses & Dissertations

## University of Minnesota

Digital conservancy.

- University Digital Conservancy Home
- University of Minnesota Twin Cities
- Dissertations and Theses
- Dissertations

## View/ Download file

Persistent link to this item, appears in collections, description, suggested citation, udc services.

- About the UDC
- How to Deposit
- Policies and Terms of Use

## Related Services

- University Archives
- U of M Web Archive
- UMedia Archive
- Copyright Services
- Digital Library Services
- News & Events
- Staff Directory
- Subject Librarians
- Vision, Mission, & Goals

## Graph Matching Based on a Few Seeds: Theoretical Algorithms and Graph Neural Network Approaches

Since graphs are natural representations for encoding relational data, the problem of graph matching is an emerging task and has attracted increasing attention, which could potentially impact various domains such as social network de-anonymization and computer vision. Our main interest is designing polynomial-time algorithms for seeded graph matching problems where a subset of pre-matched vertex-pairs (seeds) is revealed.

However, the existing work does not fully investigate the pivotal role of seeds and falls short of making the most use of the seeds. Notably, the majority of existing hand-crafted algorithms only focus on using ``witnesses'' in the 1-hop neighborhood. Although some advanced algorithms are proposed to use multi-hop witnesses, their theoretical analysis applies only to \ER random graphs and requires seeds to be all correct, which often do not hold in real applications. Furthermore, a parallel line of research, Graph Neural Network (GNN) approaches, typically employs a semi-supervised approach, which requires a large number of seeds and lacks the capacity to distill knowledge transferable to unseen graphs.

In my dissertation, I have taken two approaches to address these limitations. In the first approach, we study to design hand-crafted algorithms that can properly use multi-hop witnesses to match graphs. We first study graph matching using multi-hop neighborhoods when partially-correct seeds are provided. Specifically, consider two correlated graphs whose edges are sampled independently from a parent \ER graph $\mathcal{G}(n,p)$. A mapping between the vertices of the two graphs is provided as seeds, of which an unknown fraction is correct. We first analyze a simple algorithm that matches vertices based on the number of common seeds in the $1$-hop neighborhoods, and then further propose a new algorithm that uses seeds in the $D$-hop neighborhoods. We establish non-asymptotic performance guarantees of perfect matching for both $1$-hop and $2$-hop algorithms, showing that our new $2$-hop algorithm requires substantially fewer correct seeds than the $1$-hop algorithm when graphs are sparse. Moreover, by combining our new performance guarantees for the $1$-hop and $2$-hop algorithms, we attain the best-known results (in terms of the required fraction of correct seeds) across the entire range of graph sparsity and significantly improve the previous results. We then study the role of multi-hop neighborhoods in matching power-law graphs. Assume that two edge-correlated graphs are independently edge-sampled from a common parent graph with a power-law degree distribution. A set of correctly matched vertex-pairs is chosen at random and revealed as initial seeds. Our goal is to use the seeds to recover the remaining latent vertex correspondence between the two graphs. Departing from the existing approaches that focus on the use of high-degree seeds in $1$-hop neighborhoods, we develop an efficient algorithm that exploits the low-degree seeds in suitably-defined $D$-hop neighborhoods. Our result achieves an exponential reduction in the seed size requirement compared to the best previously known results.

In the second approach, we study GNNs for seeded graph matching. We propose a new supervised approach that can learn from a training set how to match unseen graphs with only a few seeds. Our SeedGNN architecture incorporates several novel designs, inspired by our theoretical studies of seeded graph matching: 1) it can learn to compute and use witness-like information from different hops, in a way that can be generalized to graphs of different sizes; 2) it can use easily-matched node-pairs as new seeds to improve the matching in subsequent layers. We evaluate SeedGNN on synthetic and real-world graphs and demonstrate significant performance improvements over both non-learning and learning algorithms in the existing literature. Furthermore, our experiments confirm that the knowledge learned by SeedGNN from training graphs can be generalized to test graphs of different sizes and categories.

## Degree Type

- Doctor of Philosophy
- Electrical and Computer Engineering

## Campus location

- West Lafayette

## Advisor/Supervisor/Committee Chair

Additional committee member 2, additional committee member 3, additional committee member 4, usage metrics.

- Machine learning not elsewhere classified
- Statistics not elsewhere classified

## Matteo Tiezzi

PostDoc Researcher @ SAILAB

- Siena, Italy
- ResearchGate
- Google Scholar

## PhD Thesis: Local Propagation in Neural Network Learning by Architectural Constraints

6 minute read

Published: March 26, 2021

A crucial role for the success of the Artificial Neural Networks (ANN) processing scheme has been played by the feed-forward propagation of signals. The input patterns undergo a series of stacked parametrized transformations, which foster deep feature extraction and an increasing representational power. Each artificial neural network layer aggregates information from its incoming connections, projects it to another space, and immediately propagates it to the next layer.

Since its introduction in the ’80s, BackPropagation (BP) is considered to be the de facto algorithm for training neural nets. The weights associated to the connections between the network layers are updated due to the backward pass, that is a straightforward derivation of the chain rule for the computation of the derivatives in a composition of functions. This computation requires to store all the intermediate values of the process. Moreover, it implies the use of non-local information, since the activity of one neuron has the ability to affect all the subsequent units up to the last output layer.

However, learning in the human brain can be considered a continuous, life-long and gradual process in which neuron activations fire, leveraging local information, both in space, e.g neighboring neurons, and time, e.g. previous states.

Following this principle, this thesis is inspired by the ideas of decoupling the computational scheme, behind the standard processing of ANNs, in order to decompose its overall structure into local components. Such local parts are put into communication leveraging the unifying notion of constraint . In particular, a set of additional variables are added to the learning problem, in order to store the information on the status of the constrained neural units. Therefore, it is possible to describe the computations performed by the network itself guiding the evolution of these auxiliary variables via constraints. This choice allows us to setup an optimization procedure that is ``local’’, i.e., it does not require (1) to query the whole network, (2) to accomplish the diffusion of the information, or (3) to bufferize data streamed over time in order to be able to compute gradients. The thesis investigates three different learning settings that are instances of the aforementioned scheme: (1) constraints among layers in feed-forward neural networks, (2) constraints among the states of neighboring nodes in Graph Neural Networks, and (3) constraints among predictions over time.

Download the slides here

Download the thesis here

## Research questions and contributions

Constraint-based neural networks.

In the proposed approach, Local Propagation , DAGs can be decomposed into local components. The processing scheme of neural architecture is enriched with auxiliary variables corresponding to the neural units, and therefore can be regarded as a set of constraints that correspond with the neural equations. Constraints enforce and encode the message passing scheme among neural units, and in particular the consistency between the input and the output variables by means of the corresponding weights of the synaptic connections. The proposed scheme leverages completely local update rules, revealing the opportunity to parallelize the computation.

Based on “Local Propagation in Constraint-based Neural Network” , In International Joint Conference on Neural Networks(IJCNN2020)

## Constraining the Information Diffusion in Graph Neural Networks

The original GNN model encode the state of the nodes of the graph by means of an iterative diffusion procedure that, during the learning stage, must be computed at every epoch, until the fixed point of a learnable state transition function is reached, propagating the information among the neighbouring nodes. Lagrangian Propagation GNNs decompose this costly operation, proposing a novel approach to learning in GNNs, based on constrained optimization in the Lagrangian framework. Learning both the transition function and the node states is the outcome of a joint process, in which the state convergence procedure is implicitly expressed by a constraint satisfaction mechanism, avoiding iterative epoch-wise procedures and the network unfolding.

Based on “A Lagrangian Approach to Information Propagation in Graph Neural Networks” , In European Conference of Artificial Intelligence (ECAI2020)

and on “Deep Constraint-based Propagation in Graph Neural Networks” , In IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

## Constraint-based Mutual Information Computation in Salient Areas of Video Streams

We consider the problem of transferring information from an input visual stream to the output space of a neural architecture that performs pixel-wise predictions. This problem consists in maximizing the Mutual Information (MI) index. Most approaches of learning commonly assume uniform probability density of the input. Actually, devising an appropriate spatio-temporal distribution of the visual data can foster the information transfer. In the proposed approach, a human-like focus of attention model takes care of filtering the spatial component of the visual information, restricting the analysis on the salient areas. On the other side, various temporal locality criteria can be explored. In particular, the analysis sweeps over the probability estimates obtained in subsequent time instants. Global changes in the entropy of the output space are approximated by introducing a specific constraint. The probability predictions obtained at each time instant can once more be regarded as local components, that are put into relation by soft-constraints enforcing a temporal estimate not limited to the current frame.

Based on “Focus of Attention Improves Information Transfer in Visual Features” , In Advances in Neural Information Processing Systems (NeurIPS2020)

We use cookies to ensure that we give you the best experience on our website. By continuing to browse this repository, you give consent for essential cookies to be used. You can read more about our Privacy and Cookie Policy .

- Departments
- University Research
- About the University

## Theory and applications of artificial neural networks

Chen, Jian-Rong (1991) Theory and applications of artificial neural networks. Doctoral thesis, Durham University.

In this thesis some fundamental theoretical problems about artificial neural networks and their application in communication and control systems are discussed. We consider the convergence properties of the Back-Propagation algorithm which is widely used for training of artificial neural networks, and two stepsize variation techniques are proposed to accelerate convergence. Simulation results demonstrate significant improvement over conventional Back-Propagation algorithms. We also discuss the relationship between generalization performance of artificial neural networks and their structure and representation strategy. It is shown that the structure of the network which represent a priori knowledge of the environment has a strong influence on generalization performance. A Theorem about the number of hidden units and the capacity of self-association MLP (Multi-Layer Perceptron) type network is also given in the thesis. In the application part of the thesis, we discuss the feasibility of using artificial neural networks for nonlinear system identification. Some advantages and disadvantages of this approach are analyzed. The thesis continues with a study of artificial neural networks applied to communication channel equalization and the problem of call access control in broadband ATM (Asynchronous Transfer Mode) communication networks. A final chapter provides overall conclusions and suggestions for further work.

## Quick links

- Latest additions
- Browse by year
- Browse by department
- Deposit thesis
- Usage statistics

## Prospective students

- International students
- Research degrees
- Durham e-Theses
- Deposit Guide

Last Modified: Summer 2013 | Disclaimer | Trading name | Powered by EPrints 3

Help | Advanced Search

## Quantum Physics

Title: quantum neural networks.

Abstract: This PhD thesis combines two of the most exciting research areas of the last decades: quantum computing and machine learning. We introduce dissipative quantum neural networks (DQNNs), which are designed for fully quantum learning tasks, are capable of universal quantum computation and have low memory requirements while training. These networks are optimised with training data pairs in form of input and desired output states and therefore can be used for characterising unknown or untrusted quantum devices. We not only demonstrate the generalisation behaviour of DQNNs using classical simulations, but also implement them successfully on actual quantum computers. To understand the ultimate limits for such quantum machine learning methods, we discuss the quantum no free lunch theorem, which describes a bound on the probability that a quantum device, which can be modelled as a unitary process and is optimised with quantum examples, gives an incorrect output for a random input. Moreover we expand the area of applications of DQNNs in two directions. In the first case, we include additional information beyond just the training data pairs: since quantum devices are always structured, the resulting data is always structured as well. We modify the DQNN's training algorithm such that knowledge about the graph-structure of the training data pairs is included in the training process and show that this can lead to better generalisation behaviour. Both the original DQNN and the DQNN including graph structure are trained with data pairs in order to characterise an underlying relation. However, in the second extension of the algorithm we aim to learn characteristics of a set of quantum states in order to extend it to quantum states which have similar properties. Therefore we build a generative adversarial model where two DQNNs, called the generator and discriminator, are trained in a competitive way.

## Submission history

Access paper:.

- Download PDF
- Other Formats

## References & Citations

- INSPIRE HEP
- Google Scholar
- Semantic Scholar

## BibTeX formatted citation

## Bibliographic and Citation Tools

Code, data and media associated with this article, recommenders and search tools.

- Institution

## arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs .

Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms. Their name and structure are inspired by the human brain, mimicking the way that biological neurons signal to one another.

Artificial neural networks (ANNs) are comprised of a node layers, containing an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to another and has an associated weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network.

Neural networks rely on training data to learn and improve their accuracy over time. However, once these learning algorithms are fine-tuned for accuracy, they are powerful tools in computer science and artificial intelligence , allowing us to classify and cluster data at a high velocity. Tasks in speech recognition or image recognition can take minutes versus hours when compared to the manual identification by human experts. One of the most well-known neural networks is Google’s search algorithm.

Subscribe to the IBM newsletter

Think of each individual node as its own linear regression model, composed of input data, weights, a bias (or threshold), and an output. The formula would look something like this:

∑wixi + bias = w1x1 + w2x2 + w3x3 + bias

output = f(x) = 1 if ∑w1x1 + b>= 0; 0 if ∑w1x1 + b < 0

Once an input layer is determined, weights are assigned. These weights help determine the importance of any given variable, with larger ones contributing more significantly to the output compared to other inputs. All inputs are then multiplied by their respective weights and then summed. Afterward, the output is passed through an activation function, which determines the output. If that output exceeds a given threshold, it “fires” (or activates) the node, passing data to the next layer in the network. This results in the output of one node becoming in the input of the next node. This process of passing data from one layer to the next layer defines this neural network as a feedforward network.

Let’s break down what one single node might look like using binary values. We can apply this concept to a more tangible example, like whether you should go surfing (Yes: 1, No: 0). The decision to go or not to go is our predicted outcome, or y-hat. Let’s assume that there are three factors influencing your decision-making:

- Are the waves good? (Yes: 1, No: 0)
- Is the line-up empty? (Yes: 1, No: 0)
- Has there been a recent shark attack? (Yes: 0, No: 1)

Then, let’s assume the following, giving us the following inputs:

- X1 = 1, since the waves are pumping
- X2 = 0, since the crowds are out
- X3 = 1, since there hasn’t been a recent shark attack

Now, we need to assign some weights to determine importance. Larger weights signify that particular variables are of greater importance to the decision or outcome.

- W1 = 5, since large swells don’t come around often
- W2 = 2, since you’re used to the crowds
- W3 = 4, since you have a fear of sharks

Finally, we’ll also assume a threshold value of 3, which would translate to a bias value of –3. With all the various inputs, we can start to plug in values into the formula to get the desired output.

Y-hat = (1*5) + (0*2) + (1*4) – 3 = 6

If we use the activation function from the beginning of this section, we can determine that the output of this node would be 1, since 6 is greater than 0. In this instance, you would go surfing; but if we adjust the weights or the threshold, we can achieve different outcomes from the model. When we observe one decision, like in the above example, we can see how a neural network could make increasingly complex decisions depending on the output of previous decisions or layers.

In the example above, we used perceptrons to illustrate some of the mathematics at play here, but neural networks leverage sigmoid neurons, which are distinguished by having values between 0 and 1. Since neural networks behave similarly to decision trees, cascading data from one node to another, having x values between 0 and 1 will reduce the impact of any given change of a single variable on the output of any given node, and subsequently, the output of the neural network.

As we start to think about more practical use cases for neural networks, like image recognition or classification, we’ll leverage supervised learning, or labeled datasets, to train the algorithm. As we train the model, we’ll want to evaluate its accuracy using a cost (or loss) function. This is also commonly referred to as the mean squared error (MSE). In the equation below,

- i represents the index of the sample,
- y-hat is the predicted outcome,
- y is the actual value, and
- m is the number of samples.

𝐶𝑜𝑠𝑡 𝐹𝑢𝑛𝑐𝑡𝑖𝑜𝑛= 𝑀𝑆𝐸=1/2𝑚 ∑129_(𝑖=1)^𝑚▒(𝑦 ̂^((𝑖) )−𝑦^((𝑖) ) )^2

Ultimately, the goal is to minimize our cost function to ensure correctness of fit for any given observation. As the model adjusts its weights and bias, it uses the cost function and reinforcement learning to reach the point of convergence, or the local minimum. The process in which the algorithm adjusts its weights is through gradient descent, allowing the model to determine the direction to take to reduce errors (or minimize the cost function). With each training example, the parameters of the model adjust to gradually converge at the minimum.

See this IBM Developer article for a deeper explanation of the quantitative concepts involved in neural networks .

Most deep neural networks are feedforward, meaning they flow in one direction only, from input to output. However, you can also train your model through backpropagation; that is, move in the opposite direction from output to input. Backpropagation allows us to calculate and attribute the error associated with each neuron, allowing us to adjust and fit the parameters of the model(s) appropriately.

The all new enterprise studio that brings together traditional machine learning along with new generative AI capabilities powered by foundation models.

Neural networks can be classified into different types, which are used for different purposes. While this isn’t a comprehensive list of types, the below would be representative of the most common types of neural networks that you’ll come across for its common use cases:

The perceptron is the oldest neural network, created by Frank Rosenblatt in 1958.

Feedforward neural networks, or multi-layer perceptrons (MLPs), are what we’ve primarily been focusing on within this article. They are comprised of an input layer, a hidden layer or layers, and an output layer. While these neural networks are also commonly referred to as MLPs, it’s important to note that they are actually comprised of sigmoid neurons, not perceptrons, as most real-world problems are nonlinear. Data usually is fed into these models to train them, and they are the foundation for computer vision, natural language processing , and other neural networks.

Convolutional neural networks (CNNs) are similar to feedforward networks, but they’re usually utilized for image recognition, pattern recognition, and/or computer vision. These networks harness principles from linear algebra, particularly matrix multiplication, to identify patterns within an image.

Recurrent neural networks (RNNs) are identified by their feedback loops. These learning algorithms are primarily leveraged when using time-series data to make predictions about future outcomes, such as stock market predictions or sales forecasting.

Deep Learning and neural networks tend to be used interchangeably in conversation, which can be confusing. As a result, it’s worth noting that the “deep” in deep learning is just referring to the depth of layers in a neural network. A neural network that consists of more than three layers—which would be inclusive of the inputs and the output—can be considered a deep learning algorithm. A neural network that only has two or three layers is just a basic neural network.

To learn more about the differences between neural networks and other forms of artificial intelligence, like machine learning, please read the blog post “ AI vs. Machine Learning vs. Deep Learning vs. Neural Networks: What’s the Difference? ”

The history of neural networks is longer than most people think. While the idea of “a machine that thinks” can be traced to the Ancient Greeks, we’ll focus on the key events that led to the evolution of thinking around neural networks, which has ebbed and flowed in popularity over the years:

1943: Warren S. McCulloch and Walter Pitts published “ A logical calculus of the ideas immanent in nervous activity (link resides outside ibm.com)” This research sought to understand how the human brain could produce complex patterns through connected brain cells, or neurons. One of the main ideas that came out of this work was the comparison of neurons with a binary threshold to Boolean logic (i.e., 0/1 or true/false statements).

1958: Frank Rosenblatt is credited with the development of the perceptron, documented in his research, “ The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain ” (link resides outside ibm.com). He takes McCulloch and Pitt’s work a step further by introducing weights to the equation. Leveraging an IBM 704, Rosenblatt was able to get a computer to learn how to distinguish cards marked on the left vs. cards marked on the right.

1974: While numerous researchers contributed to the idea of backpropagation, Paul Werbos was the first person in the US to note its application within neural networks within his PhD thesis (link resides outside ibm.com).

1989: Yann LeCun published a paper (link resides outside ibm.com) illustrating how the use of constraints in backpropagation and its integration into the neural network architecture can be used to train algorithms. This research successfully leveraged a neural network to recognize hand-written zip code digits provided by the U.S. Postal Service.

Design complex neural networks. Experiment at scale to deploy optimized learning models within IBM Watson Studio.

Build and scale trusted AI on any cloud. Automate the AI lifecycle for ModelOps.

Take the next step to start operationalizing and scaling generative AI and machine learning for business.

- < Previous

Home > GRAD > THESIS > 402

## Artificial Neural Network Concepts and Examples

Harcharan Kabbay , University of Missouri-St. Louis Follow

## Document Type

Master of Arts

Mathematics

## Date of Defense

Graduate advisor.

Adrian Clingher

Dr. Adrian Clingher, Ph.D

Dr. Qingtang Jiang, Ph.D

Dr. Haiyan Cai, Ph.D

Artificial Neural Networks have gained much media attention in the last few years. Every day, numer- ous articles on Artificial Intelligence, Machine Learning, and Deep Learning exist. Both academics and business are becoming increasingly interested in deep learning. Deep learning has innumerable uses, in- cluding autonomous driving, computer vision, robotics, security and surveillance, and natural language processing. The recent development and focus have primarily been made possible by the convergence of related research efforts and the introduction of APIs like Keras. The availability of high-speed compute resources such as GPUs and TPUs has also been instrumental in developing deep learning models.

While the development of the APIs like Keras offers a layer of abstraction and makes the model development convenient, the Mathematical logic behind the working of the Neural Networks is often misunderstood. The thesis focuses on the building blocks of a Neural Network in terms of Mathemat- ical terms and formulas. The research article also includes the details on the core parts of the Deep Learning algorithms like Forwardpropagation, Gradient Descent, and Backpropagation.

The research briefly covers the basic operations in Convolution Neural Networks, and a working example of multi-class classification problem using Keras library in R. CNN is a vast area of research in itself, and covering all the aspects of the ConvNets is out of scope of this paper. However, it provides an excellent foundation for understanding how Neural Networks work and how a CNN uses the concepts of the building blocks of a primary Neural Network in an image classification problem.

## Recommended Citation

Kabbay, Harcharan, "Artificial Neural Network Concepts and Examples" (2022). Theses . 402. https://irl.umsl.edu/thesis/402

Since September 29, 2022

## Included in

Other Mathematics Commons

Advanced Search

- Notify me via email or RSS
- Collections
- Disciplines

## Participate

- Submission Guidelines
- Submit Research
- University of Missouri, St. Louis
- UMSL Library

Home | About | FAQ | My Account | Accessibility Statement

Privacy Copyright

## IMAGES

## VIDEO

## COMMENTS

In this thesis, I investigate two major directions for explaining deep neural networks.

Research in Neural Networks is becoming more popular each year. Re-search has introduced different ways to utilize Neural Networks, but an im-portant aspect is missing: Testing. There are only 16 papers that strictly address Testing Neural Networks with a majority of them focusing on Deep Neural Networks and a small part on Recurrent Neural ...

1000 image pairs from the SINTEL dataset to be used in the training of a neural network. This neural network was coded in Python using Keras, an open source neural network library. The final product of the testing provided results on par with modern image classification neural network benchmarks, and together they produce an easy path for fine ...

Neural Networks (NNs){Deep Neural Networks (DNNs)in particular { are a burgeoning area of arti cial intelligence research, rife with impressive computational results on a wide variety of tasks. Beginning in 2006, when the term Deep Learning was coined [32], there have been numerous contest-winning neural network architectures developed. That is not

This thesis tackles the bottlenecks of modern methods directly, achieving state-of-the-art performance by efficiently designing efficient deep neural networks.

Training Recurrent Neural Networks is the PhD thesis of Ilya Sutskever, a leading researcher in machine learning and artificial intelligence. In this work, he presents novel methods and applications of recurrent neural networks, such as natural language processing, speech recognition, and image generation. Learn from the expert how to design and optimize these powerful models.

In this thesis, we address the fundamental problem of modeling intelligence that can learn to represent and reason about the world. We study both questions from the lens of graph neural networks, a class of neural networks acting on graphs.

Scaling graph neural networks to larger graphs Citation Zhang, William. 2022. Scaling graph neural networks to larger graphs. Bachelor's thesis,

development convenient, the Mathematical logic behind the working of the Neural Networks is often misunderstood. The thesis focuses on the building blocks of a Neural Network in terms of Mathemat-ical terms and formulas. The research article also includes the details on the core parts of the Deep

The goal of our research is to develop methods advancing automatic visual recognition. In order to predict the unique or multiple labels associated to an image, we study different kind of Deep Neural Networks architectures and methods for supervised features learning. We first draw up a state-of-the-art review of the Convolutional Neural Networks aiming to understand the history behind this ...

The objective of this thesis was to study the application of deep learning in image classification using convolutional neural networks. The Python programming language with the TensorFlow framework and Google Colaboratory hardware were used for the thesis. Models were chosen from available ones online and adjusted by the author.

Abstract. Data driven methods based on deep neural networks (DNNs) have ushered in a new era in the field of machine learning computer vision. Conventional algorithmic approaches are being replaced by end-to-end deep learning systems that can leverage big data. Deep learning has begun revolutionizing human centric fields such as health-care and ...

Therefore, several directions for explaining neural models have recently been explored. In this thesis, I investigate two major directions for explaining deep neural networks. The rst direction consists of feature-based post-hoc explanatory methods, that is, methods that aim to explain an already trained and

Deep learning models stand for a new learning paradigm in artificial intelligence (AI) and machine learning. Recent breakthrough results in image analysis and speech recognition have generated a massive interest in this field because also applications in many other domains providing big data seem possible. On a downside, the mathematical and computational methodology underlying deep learning ...

This thesis studies the introduction of a priori structure into the design of learning systems based on artificial neural networks applied to sequence recognition, in particular to phoneme recognition in continuous speech. Because we are interested in sequence analysis, algorithms for training recurrent networks are studied and an original ...

Thesis or Dissertation Abstract Learning powerful data embeddings has recently become the core of machine learning algorithms especially in natural language processing and computer vision domains.

Furthermore, a parallel line of research, Graph Neural Network (GNN) approaches, typically employs a semi-supervised approach, which requires a large number of seeds and lacks the capacity to distill knowledge transferable to unseen graphs. In my dissertation, I have taken two approaches to address these limitations.

The new model family introduced in this thesis is summarized under the term Recursive Deep Learning. The models in this family are variations and extensions of unsupervised and supervised recursive neural networks (RNNs) which generalize deep and feature learning ideas to hierarchical structures. The RNN models of this thesis

the same structure as the Dense Neural Network explained in the section dedicated to it). 1.2.2.2 Standard CNNs W e list here the main CNN Architectures which led the foundation of today's Computer

The thesis investigates three different learning settings that are instances of the aforementioned scheme: (1) constraints among layers in feed-forward neural networks, (2) constraints among the states of neighboring nodes in Graph Neural Networks, and (3) constraints among predictions over time. Material. Download the slides here

The thesis also covers the practical applications of artificial neural network technology and how it is used in different fields of industry. Positive and negative properties of artificial neural networks and how they should be developed in the future will be observed. For this thesis a lot of research material from different research areas was ...

In this thesis some fundamental theoretical problems about artificial neural networks and their application in communication and control systems are discussed. We consider the convergence properties of the Back-Propagation algorithm which is widely used for training of artificial neural networks, and two stepsize variation techniques are proposed to accelerate convergence.

This thesis examines the viability of applying ANNs to problems in the finance domain using Australian financial data. The research focuses on two particular areas: predicting distress in credit unions of New South Wales and modeling the Australian/US dollar foreign exchange trading systems.

This PhD thesis combines two of the most exciting research areas of the last decades: quantum computing and machine learning. We introduce dissipative quantum neural networks (DQNNs), which are designed for fully quantum learning tasks, are capable of universal quantum computation and have low memory requirements while training. These networks are optimised with training data pairs in form of ...

Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms. Their name and structure are inspired by the human brain, mimicking the way that biological neurons signal to one another.

The thesis focuses on the building blocks of a Neural Network in terms of Mathemat- ical terms and formulas. The research article also includes the details on the core parts of the Deep Learning algorithms like Forwardpropagation, Gradient Descent, and Backpropagation.