Neural architecture search

As a representative computational intelligence method, the artificial neural network originated in the 1940s, flourished in the 1950s and 1960s, declined in the 1970s, revived in the 1980s, and has attracted extensive attention over the last decade. It is now a frontier field with maturing theory and steadily expanding applications. The article published by Hinton et al. in Science in 2006 triggered the upsurge of deep neural network research. Faced with the challenge of big data, deep neural network models such as the deep belief network, the convolutional neural network, and the recursive neural network show clear advantages and potential in many application fields, and these advantages become more prominent as data volume and dimensionality grow. For example, Google's AlphaGo, built on deep learning, learns to make good decisions from a massive number of games. Microsoft's speech recognition uses deep learning to greatly reduce the recognition error rate. Baidu's deep-learning-based robot "Xiaodu" has surpassed humans in cross-age face recognition.

After years of research and development, recognition methods based on artificial neural networks have gradually replaced traditional pattern recognition methods. Neural networks are now an advanced technology used to solve many challenging recognition tasks, such as character recognition, speech recognition, fingerprint recognition, remote sensing image recognition, face recognition, and handwritten character recognition. The mainstream neural network models are the convolutional neural network and the recursive neural network. The convolutional neural network was proposed by Yann LeCun in 1998. Since AlexNet used this framework to win first place in the 2012 ImageNet competition, convolutional neural networks have quickly become popular and are widely used in visual tasks. The most advanced convolutional neural networks can now even exceed human accuracy in image recognition. The recursive neural network was proposed in 1990 as a generalization of the recurrent neural network. It can introduce gating mechanisms to learn long-distance dependencies, is suitable for machine learning tasks involving structural relationships, and has important applications in sequence recognition.

Deep neural networks and deep learning algorithms are very popular because of their remarkable achievements in scientific research and engineering tasks. They replace traditional manual feature extraction and can extract and learn features automatically, end to end. The remarkable success of deep neural networks is usually attributed to successful architecture design, and the focus of research has shifted from extracting features to finding the optimal architecture. Generally speaking, the larger the capacity of a model, the better the network tends to perform, since a sufficiently large network can fit almost any function. Therefore, to improve performance, network structures have been designed to be more and more complex. For example, VGG-16 has about 140 million floating-point parameters, and the whole network occupies more than 500 megabytes of storage. It takes about 15 billion floating-point operations to process one image of size $224 \times 224$. Although deeper networks and more complex topologies can learn features more effectively, the growth in network scale means that more trial-and-error time is needed when designing networks by hand, and even experts need a great deal of resources and time to create models that perform well.

Neural architecture search (NAS) is a new method for learning network structures automatically, which reduces the heavy cost of network design. Networks designed by NAS methods have already outperformed hand-designed architectures. NAS can be regarded as a sub-field of automated machine learning (AutoML), and it overlaps significantly with hyperparameter optimization and meta-learning. NAS methods differ mainly along three dimensions: search space, search strategy, and performance evaluation, each of which is surveyed below.

Search space: The search space defines all the candidate structures and operations of the network, and is usually exponentially large or even unbounded. Incorporating prior knowledge when designing the search space, that is, drawing on existing advanced structural designs for the task at hand, can effectively narrow the search space and simplify the search. However, this also introduces bias, which limits the ability of the search to discover structures beyond current human knowledge.

Search strategy: Once the search space is defined, the search strategy guides the search toward high-performance model architectures. The difficulty is balancing exploration and exploitation: on the one hand, we want to find well-performing architectures quickly; on the other hand, we need to avoid premature convergence to sub-optimal architectures.

Performance evaluation: The purpose of NAS is to find an architecture that generalizes well to unseen data, so once a model is generated, its performance needs to be evaluated. The intuitive method is to train the model to convergence on the training set and measure its performance on the validation set, but this consumes enormous computing power and thus limits the number of network structures that can be explored. Some advanced methods focus on reducing the computational cost of performance evaluation, but they introduce errors. Balancing the efficiency and accuracy of evaluation is therefore a problem that needs to be studied.

From a computational point of view, a neural network represents a function that transforms an input variable $X$ into an output variable $Y$ through a series of operations. In the language of computational graphs, a neural network can be represented as a directed acyclic graph (DAG) in which each node represents a tensor $z^{(k)}$ and is connected to its parent nodes $I(k)$ by edges, each edge representing an operation $o^{(k)}$ selected from the candidate operation set $O$. The tensor at node $k$ is computed as:

$$z^{(k)} = o^{(k)}\left(\{\, z^{(i)} : i \in I(k) \,\}\right)$$

The candidate operation set $O$ mainly includes basic operations such as convolution, pooling, activation functions, skip connections, concatenation, and addition. In addition, to further improve model performance, some advanced hand-designed modules can also be used as candidate operations, such as depthwise separable convolution, dilated convolution, and group convolution. Different hyperparameters can be selected depending on the type of operation, such as the choice of input nodes and the number, size, and stride of convolution kernels. Search spaces differ in how they select and combine operations, so their parameterizations also differ. Generally speaking, a good search space should exclude human bias and be flexible enough to cover a wide range of model architectures.
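The DAG view described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: tensors are replaced by plain floats, each node carries a single operation applied to the collected outputs of its parents, and the names `evaluate_dag`, `PARENTS`, `OPS` and the three toy operations are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# A toy "operation" maps the list of parent outputs to one output value.
Op = Callable[[Sequence[float]], float]


def op_double(inputs: Sequence[float]) -> float:
    return 2.0 * inputs[0]


def op_relu(inputs: Sequence[float]) -> float:
    return max(0.0, inputs[0])


def op_sum(inputs: Sequence[float]) -> float:
    return sum(inputs)


def evaluate_dag(x: float, parents: Dict[int, List[int]], ops: Dict[int, Op]) -> float:
    """Compute z[k] = ops[k]([z[i] for i in parents[k]]) for every node k.

    Node 0 is the input. Parents are assumed to have smaller indices than
    their children, so increasing index order is a topological order.
    """
    z = {0: x}
    for k in sorted(ops):
        z[k] = ops[k]([z[i] for i in parents[k]])
    return z[max(z)]  # the node with the largest index is the output


# A three-node network with one skip connection from the input to node 3.
PARENTS = {1: [0], 2: [1], 3: [0, 2]}
OPS = {1: op_double, 2: op_relu, 3: op_sum}

if __name__ == "__main__":
    print(evaluate_dag(2.0, PARENTS, OPS))
```

In this encoding, a search space corresponds to the set of allowed `PARENTS` and `OPS` dictionaries, and a search strategy chooses among them.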

The global search space has a high degree of freedom and searches over the complete network structure. The simplest example is the chain-structured search space, shown on the left of Figure 1: a fixed number of nodes are stacked in sequence, and each node receives only the output of the previous node as its input. Each node represents a layer with a specified operation. The right of Figure 1 introduces more complex skip connections and multi-branch structures, in which the current node can combine the outputs of all previous nodes as its input, greatly increasing the freedom of the search. Many networks are special cases of multi-branch networks, such as:

1) Chain networks: each node takes only the output of its immediate predecessor as input.

2) Residual networks: the output of the previous node is summed with the output of an earlier node.

3) DenseNets: the outputs of all previous nodes are concatenated.
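The three special cases differ only in how earlier outputs are combined into the input of the next node. The sketch below illustrates this with lists of floats standing in for feature maps. The function name `node_input` is hypothetical, and the choice to sum with the output two steps back in the residual case is an assumption made for the example.

```python
from typing import List


def node_input(outputs: List[List[float]], kind: str) -> List[float]:
    """Combine the outputs of earlier nodes into the input of the next node."""
    if kind == "chain":
        # Only the immediate predecessor is used.
        return list(outputs[-1])
    if kind == "residual":
        # Previous output summed element-wise with an earlier output
        # (assumed here to be the one two steps back).
        if len(outputs) < 2:
            return list(outputs[-1])
        return [a + b for a, b in zip(outputs[-1], outputs[-2])]
    if kind == "dense":
        # All previous outputs are concatenated.
        return [value for out in outputs for value in out]
    raise ValueError(f"unknown connection pattern: {kind}")


if __name__ == "__main__":
    earlier = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    for kind in ("chain", "residual", "dense"):
        print(kind, node_input(earlier, kind))
```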

Although searching the global structure is easy to implement, it has some shortcomings. First, the size of the search space grows exponentially with the depth of the network, so finding a deep network with good generalization performance is very expensive. In addition, the generated architectures lack portability and flexibility: a model generated on a small data set may not suit a larger one. Some studies show that the choice of initial architecture is very important when searching the global structure. Under appropriate initial conditions, an architecture that performs as well as one from the cell-based search space can be obtained, but the guiding principles for choosing the initial architecture are not yet clear.

The cell-based search space is inspired by hand-crafted design experience: many effective network structures reuse fixed structures, such as repeating LSTM blocks in RNNs or stacking residual modules. Therefore, it suffices to search for such repeated cells, and the whole neural architecture search problem reduces to finding the optimal cell structure in the cell search space, which greatly shrinks the search space. Most studies that compare experimental results on the global and cell-based search spaces show that the cell-based search space achieves better performance. Another advantage of the cell-based search space is that it generalizes easily across data sets and tasks, because the complexity of the architecture can be changed almost arbitrarily by increasing or decreasing the number of convolution kernels and cells.

NASNet is one of the earliest cell-based search spaces and remains the most popular choice. Most subsequent improvements make only small modifications to its operation selection and cell combination strategy. As shown in Figure 2, it consists of two types of cells: the normal cell and the reduction cell. Each cell consists of B blocks, and each block is defined by its two inputs and the corresponding operations. The candidate inputs include the outputs of the two previous cells and the outputs of previously defined blocks in the same cell, so skip connections across cells are supported. The outputs of unused blocks are concatenated to form the output of the cell, and the cells are finally stacked according to predefined rules.
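The block structure described above can be illustrated by sampling a random cell. This is a minimal sketch, not the NASNet implementation: the operation names in `OPS` are illustrative placeholders, the function `sample_cell` is hypothetical, and how the two branches of a block are merged is left out.

```python
import random
from typing import Dict, List

# Illustrative operation names, assumed for this example.
OPS = ["sep_conv_3x3", "sep_conv_5x5", "avg_pool_3x3", "max_pool_3x3", "identity"]


def sample_cell(num_blocks: int, rng: random.Random) -> Dict[str, list]:
    """Sample a cell made of `num_blocks` blocks.

    Indices 0 and 1 stand for the outputs of the two previous cells;
    block b gets index 2 + b. Each block picks two inputs among the
    indices defined before it and one operation per input.
    """
    blocks: List[dict] = []
    for b in range(num_blocks):
        available = list(range(2 + b))
        inputs = (rng.choice(available), rng.choice(available))
        ops = (rng.choice(OPS), rng.choice(OPS))
        blocks.append({"inputs": inputs, "ops": ops})
    used = {i for block in blocks for i in block["inputs"]}
    # Blocks whose output no other block consumes are concatenated
    # to form the output of the cell.
    output = [2 + b for b in range(num_blocks) if 2 + b not in used]
    return {"blocks": blocks, "output": output}


if __name__ == "__main__":
    cell = sample_cell(5, random.Random(0))
    for index, block in enumerate(cell["blocks"], start=2):
        print(index, block)
    print("concatenated outputs:", cell["output"])
```

Because a block can only read from indices defined before it, the last block is never consumed by another block and always appears in the cell output.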

Rather than connecting cells according to a manually defined macro structure, a hierarchical structure takes the cell structures generated in the previous step as the basic components of the next cell structure, and obtains the final network structure iteratively. The hierarchical search space of Hier generates high-level cells by merging low-level cells, which optimizes the cell level and the network level simultaneously. The method is divided into three levels. The first level contains a series of basic operations. The second level connects first-level operations through directed acyclic graphs to construct different cells, with the graph structure encoded by an adjacency matrix. The third level is the network-level encoding, which determines how the second-level cells are connected and combined into a complete network. The cell-based search space can be regarded as a special case of this hierarchical search space.

Reinforcement learning can effectively model a sequential decision-making process in which an agent interacts with an environment and learns to improve its behavior so as to maximize the expected return. Figure 3 gives an overview of NAS algorithms based on reinforcement learning. The agent is usually a recurrent neural network (RNN). At each step t it performs an action, sampling a new architecture from the search space, receives an observation of the state together with a reward from the environment, and updates its sampling policy. This formulation suits neural architecture search well: the action of the agent is to generate a neural structure, and the action space is the search space. The environment is the training and evaluation of the network generated by the agent, and the reward is the predicted performance of the trained network on unseen data, which is obtained after the last action.
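The agent, action, and reward loop can be sketched with a plain REINFORCE update. To keep the example self-contained, several things are assumed: the RNN controller is swapped for a table of independent softmax logits per layer, training and evaluating a network is replaced by the hypothetical function `toy_reward`, and a moving-average baseline is used to reduce variance.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_reward(arch: List[int]) -> float:
    """Hypothetical stand-in for the validation accuracy of a trained network."""
    target = [2, 0, 1]
    return sum(a == t for a, t in zip(arch, target)) / len(target)


def reinforce_search(num_layers: int = 3, num_ops: int = 3, steps: int = 2000,
                     lr: float = 0.1, seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    logits = [[0.0] * num_ops for _ in range(num_layers)]
    baseline = 0.0
    for _ in range(steps):
        # Action: sample one operation per layer from the current policy.
        probs = [softmax(row) for row in logits]
        arch = [rng.choices(range(num_ops), weights=p)[0] for p in probs]
        # Environment: "train and evaluate" the sampled architecture.
        reward = toy_reward(arch)
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        # Policy update: gradient of log pi(action) w.r.t. each logit.
        for layer, action in enumerate(arch):
            for a in range(num_ops):
                grad = (1.0 if a == action else 0.0) - probs[layer][a]
                logits[layer][a] += lr * advantage * grad
    # Return the most probable architecture under the learned policy.
    return [max(range(num_ops), key=row.__getitem__) for row in logits]


if __name__ == "__main__":
    best = reinforce_search()
    print(best, toy_reward(best))
```

In a real system the call to `toy_reward` dominates the cost, since every sampled architecture has to be trained before its reward is known.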

4.2 Evolutionary algorithm

An evolutionary algorithm is a mature global optimization method with strong robustness and wide applicability, and many studies use evolutionary algorithms to optimize neural network structures. The algorithm evolves a population of models, that is, a set of networks. In each generation, at least one model is selected from the population as a parent and mutated to produce offspring. After the offspring are trained, their fitness is evaluated and they are added to the population.

A typical evolutionary algorithm consists of selection, crossover, mutation, and update. For selection, tournament selection is generally used to sample parents, with the fittest sampled individual becoming the parent. LEMONADE applies kernel density estimation to fitness, so that the probability of selecting a network is inversely proportional to its density. Crossover schemes vary with the encoding scheme. Mutation applies operations to the parent, such as adding or removing layers, changing the hyperparameters of a layer, adding skip connections, and changing the training hyperparameters. For the offspring, most methods randomly initialize the weights of the child networks, while LEMONADE uses network morphism to transfer the weights learned by the parent network to its children. Real et al. let offspring inherit all parameters of their parents that are not affected by the mutation. Although this kind of inheritance is not strictly function-preserving, it can accelerate learning. When new networks are generated, some individuals must be removed from the population. Real et al. removed the worst individual from the population, while AmoebaNet removed the oldest. Other methods discard all individuals at regular intervals or never delete individuals at all. EENA uses a variable to adjust the probabilities of deleting the worst model and the oldest model.
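The selection, mutation, and update steps can be sketched as follows, using tournament selection and removal of the oldest individual. The fitness function `toy_fitness` is a hypothetical stand-in for training and validating a network, crossover is omitted, and all names and default values are assumptions made for this example.

```python
import collections
import random
from typing import List, Tuple

Individual = Tuple[List[int], float]  # (architecture encoding, fitness)


def toy_fitness(arch: List[int]) -> float:
    """Hypothetical stand-in for validation accuracy after training."""
    return sum(arch) / (2 * len(arch))  # operations are encoded as 0, 1 or 2


def mutate(arch: List[int], num_ops: int, rng: random.Random) -> List[int]:
    """Change the operation at one randomly chosen position."""
    child = list(arch)
    pos = rng.randrange(len(child))
    child[pos] = rng.choice([op for op in range(num_ops) if op != child[pos]])
    return child


def evolve(pop_size: int = 20, sample_size: int = 5, cycles: int = 200,
           num_layers: int = 6, num_ops: int = 3, seed: int = 0) -> Individual:
    rng = random.Random(seed)
    population = collections.deque()
    for _ in range(pop_size):
        arch = [rng.randrange(num_ops) for _ in range(num_layers)]
        population.append((arch, toy_fitness(arch)))
    best = max(population, key=lambda ind: ind[1])
    for _ in range(cycles):
        # Selection: the fittest member of a random tournament is the parent.
        tournament = rng.sample(list(population), sample_size)
        parent = max(tournament, key=lambda ind: ind[1])
        # Mutation and evaluation of the offspring.
        child = mutate(parent[0], num_ops, rng)
        individual = (child, toy_fitness(child))
        # Update: add the offspring and remove the oldest individual.
        population.append(individual)
        population.popleft()
        if individual[1] > best[1]:
            best = individual
    return best


if __name__ == "__main__":
    print(evolve())
```

Replacing `population.popleft()` with removal of the lowest-fitness individual gives the other update rule mentioned in the text.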

Sequential model-based optimization (SMBO) uses a surrogate model to approximate the objective function. In other words, not every sampled network structure has to be trained; instead, a surrogate model is trained and used to predict network performance. In practice, it is usually enough to obtain the performance ranking of the architectures rather than their exact loss values, so the surrogate model only needs to predict relative scores and pick out promising candidate architectures. Only the architectures with good predicted performance are then evaluated, and the surrogate model is updated with their validation accuracy, so only a small number of candidate architectures need to be trained completely, which greatly reduces the search time. The surrogate model is usually trained to minimize the squared error. Writing $\hat{f}$ for the surrogate and $H$ for the set of evaluated architectures $\alpha$ with observed performance $c$ (notation introduced here for illustration), this is:

$$\min_{\hat{f}} \sum_{(\alpha, c) \in H} \left(\hat{f}(\alpha) - c\right)^2$$
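A minimal sketch of the surrogate idea is shown below. It assumes a linear surrogate on one-hot architecture encodings, fitted by gradient descent on the squared error, and a hypothetical scoring function `toy_score` in place of real training. The names `encode`, `fit_surrogate`, and `toy_score` are illustrative and not taken from any particular method.

```python
import random
from typing import Callable, List, Tuple


def toy_score(arch: List[int]) -> float:
    """Hypothetical stand-in for the validation accuracy of a trained network."""
    return sum((op + 1) * (pos + 1) for pos, op in enumerate(arch)) / 30.0


def encode(arch: List[int], num_ops: int) -> List[float]:
    """One-hot encode a list of operation indices."""
    feats: List[float] = []
    for op in arch:
        feats.extend(1.0 if op == o else 0.0 for o in range(num_ops))
    return feats


def fit_surrogate(history: List[Tuple[List[int], float]], num_ops: int,
                  epochs: int = 200, lr: float = 0.05) -> Callable[[List[int]], float]:
    """Fit a linear surrogate by gradient descent on the squared error."""
    dim = len(encode(history[0][0], num_ops))
    w = [0.0] * dim
    for _ in range(epochs):
        for arch, score in history:
            x = encode(arch, num_ops)
            err = sum(wi * xi for wi, xi in zip(w, x)) - score
            # Gradient of (prediction - score)^2 with respect to w is 2 * err * x.
            w = [wi - lr * 2.0 * err * xi for wi, xi in zip(w, x)]
    return lambda arch: sum(wi * xi for wi, xi in zip(w, encode(arch, num_ops)))


if __name__ == "__main__":
    rng = random.Random(0)
    evaluated = [[rng.randrange(3) for _ in range(4)] for _ in range(40)]
    history = [(arch, toy_score(arch)) for arch in evaluated]
    predict = fit_surrogate(history, num_ops=3)
    candidates = [[rng.randrange(3) for _ in range(4)] for _ in range(10)]
    # Only the top-ranked candidates would be trained and added to the history.
    ranked = sorted(candidates, key=predict, reverse=True)
    print(ranked[:3])
```

Only the ranking of the candidates is used here, which reflects the point above that relative scores are usually sufficient.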

Bayesian optimization (BO) is one of the most popular methods for hyperparameter optimization. The most classic variant is BO based on Gaussian processes, in which the validation results of the generated neural structures are modeled as a Gaussian process. However, the inference time of Gaussian-process-based BO scales cubically with the number of observations, and it does not handle variable-length neural networks well. Some works use tree-based or random-forest-based methods to search efficiently in very high-dimensional spaces and have achieved excellent results on many problems. Negrinho and Gordon exploit the tree structure of their search space and use Monte Carlo tree search. Although no complete comparison exists, preliminary evidence suggests that these methods can outperform evolutionary algorithms.

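The search strategies above sample neural structures from a discrete search space. DARTS instead proposes a continuous relaxation of the search space, so that the search is carried out in a continuous and differentiable space, as shown in Figure 4. The discrete choice of operation on each edge $(i, j)$ is relaxed with the following softmax function:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in O} \frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{o' \in O} \exp\left(\alpha_{o'}^{(i,j)}\right)} \, o(x)$$

where $\alpha_o^{(i,j)}$ is the architecture parameter that weights operation $o$ on edge $(i, j)$.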

After relaxation, the task of structure search becomes the joint optimization of the architecture parameters and the network weights. The two kinds of parameters are optimized alternately, the network weights on the training set and the architecture parameters on the validation set, which forms a bi-level optimization problem.
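The softmax relaxation on a single edge can be sketched as below. Scalars stand in for tensors and the three candidate operations are toy placeholders, so `CANDIDATE_OPS`, `mixed_op`, and `discretize` are assumptions made for the example. The alternating gradient updates of weights and architecture parameters are not shown.

```python
import math
from typing import Callable, Dict, List


def softmax(alphas: List[float]) -> List[float]:
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


# Toy candidate operations on one edge (illustrative only).
CANDIDATE_OPS: Dict[str, Callable[[float], float]] = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x: float, alphas: List[float]) -> float:
    """Softmax-weighted sum of every candidate operation applied to x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretize(alphas: List[float]) -> str:
    """After the search, keep the operation with the largest architecture weight."""
    names = list(CANDIDATE_OPS)
    return names[max(range(len(alphas)), key=alphas.__getitem__)]


if __name__ == "__main__":
    alphas = [0.1, 2.0, -1.0]  # one architecture parameter per candidate
    print(mixed_op(3.0, alphas))
    print(discretize(alphas))
```

Because `mixed_op` is a smooth function of `alphas`, the architecture parameters can be updated by gradient descent like ordinary weights, which is what makes the relaxed search differentiable.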

To guide the search process, the performance of each generated neural network must be evaluated. An intuitive method is to train the network to convergence and then evaluate its performance, but this requires a great deal of time and computing resources. Several methods have therefore been proposed to accelerate model evaluation.

To reduce the computational burden, performance can be estimated with a low-fidelity approximation of the actual performance. Ways to do this include shortening the training time, training on a subset of the data set, training on low-resolution images, using fewer channels in each layer, and stacking fewer cells. The best network or cell found under the low-fidelity setting is used to build the final structure, which is then retrained on the full data set to obtain the target network. Although these low-fidelity approximations reduce the training cost, they underestimate performance and inevitably introduce errors. Recent research shows that when the gap between the low-fidelity evaluation and the complete evaluation is large, the relative ranking of network performance may change greatly, and emphasizes that this error gradually increases.

Early stopping was originally used to prevent over-fitting. Some studies predict network performance at the initial stage of training, and models predicted to perform poorly on the validation set are forced to stop training, which speeds up model evaluation. One way to estimate network performance early is learning curve extrapolation. Domhan et al. proposed extrapolating the learning curve from the initial part of training and terminating the training of network structures with poor predicted performance. Swersky et al. also take the hyperparameters of the network architecture into account when evaluating the learning curve. Another method stops early according to local statistics of the gradient; it no longer depends on the validation set, which allows the optimizer to make full use of all the training data.
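Learning curve extrapolation can be illustrated with a deliberately simple model. The sketch assumes that validation accuracy follows acc(t) = a - b / t, fits a and b by least squares on the first few epochs, and stops a run whose predicted final accuracy falls below the best result so far. The single parametric form and the names `extrapolate_final_accuracy` and `should_stop` are assumptions made for illustration and are much simpler than the methods cited above.

```python
from typing import List


def extrapolate_final_accuracy(partial_curve: List[float], final_epoch: int) -> float:
    """Fit acc(t) = a - b / t to the observed epochs and predict acc(final_epoch).

    `partial_curve[i]` is the validation accuracy after epoch i + 1.
    At least two epochs are required.
    """
    n = len(partial_curve)
    us = [1.0 / t for t in range(1, n + 1)]  # regress accuracy on u = 1 / t
    mean_u = sum(us) / n
    mean_y = sum(partial_curve) / n
    cov = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, partial_curve))
    var = sum((u - mean_u) ** 2 for u in us)
    slope = cov / var                     # equals -b
    intercept = mean_y - slope * mean_u   # equals a
    return intercept + slope / final_epoch


def should_stop(partial_curve: List[float], final_epoch: int, best_so_far: float) -> bool:
    """Terminate training if the extrapolated accuracy cannot beat the best run."""
    return extrapolate_final_accuracy(partial_curve, final_epoch) < best_so_far


if __name__ == "__main__":
    curve = [0.9 - 0.5 / t for t in range(1, 6)]  # synthetic first five epochs
    print(extrapolate_final_accuracy(curve, 100))
    print(should_stop(curve, 100, best_so_far=0.95))
```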

A surrogate model can be trained to predict network performance. PNAS trains a surrogate network (an LSTM) to predict the performance of a network structure. PNAS does not consider the learning curve; instead it predicts performance from the characteristics of the structure, and it extrapolates to network structures larger than those seen during training. SemiNAS is a semi-supervised NAS method that further improves search efficiency by using a large number of unlabeled architectures, which do not need to be trained because their accuracy is predicted by the surrogate model alone. The main difficulty in predicting network performance is that, to speed up the search, good predictions must be made from relatively few evaluations of a large search space. When the optimization space is too large and hard to quantify, and the evaluation cost of each structure is extremely high, surrogate-based methods are not applicable.

A surrogate model can also be used to predict network weights. A hypernetwork is a neural network trained to generate the weights of various architectures. Hypernetworks save the time needed to train candidate architectures during the search, because the weights of the candidates are obtained from the hypernetwork's predictions. Zhang et al. proposed a graph representation and used a graph hypernetwork to predict the weights of all possible structures faster and more accurately than the traditional SMASH algorithm.

Weight inheritance lets a new network structure inherit the weights of other network structures that have already been trained. One such method is network morphism. The usual way to design a network is to define a structure first and then train it and test its performance on the validation set; if the performance is poor, the network is redesigned. This approach clearly wastes a great deal of work and therefore takes a long time. With network morphism, the structure is modified on the basis of the original network, and the modified network reuses the previously trained weights. The transformations are function-preserving, meaning the new network initially represents the same function as the original, so the performance of the child network is at least no worse than that of the original network, and it can grow into a more robust network after a short training time. Specifically, network morphism can handle arbitrary nonlinear activation functions, add skip connections, and add layers or channels to obtain deeper or wider equivalent models. Classical network morphism can only make the network larger, which may make it overly complex. Approximate network morphism, proposed later, uses knowledge distillation to allow the network structure to be simplified as well. Evolutionary algorithms often use mutations based on network morphism, or let children inherit the weights of their parents directly before applying ordinary mutation operations, so that the generated network starts from a better initial point and does not need to be trained from scratch.
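A function-preserving transformation can be illustrated with the simplest case: inserting an identity-initialized layer into a small ReLU network. The sketch assumes fully connected layers without biases and a ReLU after every layer, in which case the inserted layer leaves the output unchanged because ReLU applied to a non-negative vector returns the same vector. The names `forward` and `deepen` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, x) for x in v]


def matvec(w: Matrix, v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def forward(layers: List[Matrix], x: List[float]) -> List[float]:
    """Apply each weight matrix followed by a ReLU."""
    h = x
    for w in layers:
        h = relu(matvec(w, h))
    return h


def deepen(layers: List[Matrix], position: int) -> List[Matrix]:
    """Insert an identity-initialized layer after layer `position`.

    The existing weights are reused unchanged, and the deeper network
    computes the same function as the original one.
    """
    width = len(layers[position])  # output width of the chosen layer
    identity = [[1.0 if i == j else 0.0 for j in range(width)] for i in range(width)]
    return layers[:position + 1] + [identity] + layers[position + 1:]


if __name__ == "__main__":
    parent = [[[1.0, -2.0], [0.5, 0.5]], [[1.0, 1.0]]]
    child = deepen(parent, 0)
    x = [1.0, 2.0]
    print(forward(parent, x), forward(child, x))
```

The child starts with the accuracy of its parent, and further training can then adjust the new layer away from the identity.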