Guanqiu Qi, Huan Wang, Matthew Haner, Chenjie Weng, Sixin Chen, Zhiqin Zhu
1 Department of Mathematics & Computer and Information Science, Mansfield University of Pennsylvania, Mansfield, PA 16933, USA
2 College of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, People's Republic of China
Abstract: Precise real-time obstacle recognition is both vital to vehicle automation and extremely resource intensive. Current deep-learning based recognition techniques generally reach high recognition accuracy, but require extensive processing power. This study proposes a region-of-interest extraction method based on the maximum difference method and morphology, and a target recognition solution built on a deep convolutional neural network. In the proposed solution, the central processing unit and graphics processing unit work collaboratively. Compared with traditional deep learning solutions, the proposed solution decreases algorithmic complexity, and improves both computational efficiency and recognition accuracy. Overall, it achieves a good balance between accuracy and computation.
The number of both commercial and passenger cars worldwide continues to grow unabated as the world modernises. Although vehicular safety has continually improved, few improvements have actually prevented collisions; most have instead improved survivability after a crash. Moreover, the vast majority of automotive accidents are caused by human error rather than mechanical failure. Given the sheer volume of information and decisions required to drive for even a single minute, it is impressive that there are not even more collisions and accidents.
Within the last decade or so, modern processing power and affordable sensing technology have given us the opportunity to change the very nature of vehicular transportation. Following the development of image processing, sensing, and artificial intelligence, various intelligent-driving functions have been gradually realised and applied to newly produced vehicles. Although the new technologies, when employed in somewhat controlled settings, are already making driving safer than before, they are not perfect and can still cause traffic accidents.
In general, these new technologies greatly reduce the number of accidents caused by careless driving. Such systems must satisfy both high-precision and real-time requirements for safe vehicle operation. Possibly the greatest of the many challenges to overcome before widespread adoption of autonomous vehicles can take place is obstacle detection and identification. One only needs to look to the news to see that, when a car is in a fully autonomous mode, a missed or mislabelled object can lead to tragedy.
In object detection, image/video processing techniques are applied to sensor data for perception and identification [1, 2]. However, the algorithms face the same challenges as humans do – the driving environment is complicated and changes rapidly. The required real-time reaction and high precision raise the computational demands. Cloud computing resources have the requisite high computation capacity, and are often used in image/video processing, including obstacle perception and identification [3, 4]. In short, high-precision, real-time obstacle detection techniques can be employed to avoid traffic accidents, effectively reducing property damage and, more importantly, promoting safety.
Obstacle detection techniques can be categorised into different types, such as reconstruction of vehicle visual perception, edge detection, colour detection, optical flow field calculation, machine learning, and so on.
Vehicle visual perception reconstruction: This technique is a perceptual method that utilises two cameras for stereoscopic vision. Based on the height of a targeted image pixel, a judgement matrix is used to determine whether the pixel is out of the setting range. This allows the obstacle to be identified. Disler identified objects by using three-dimensional reconstruction [5]. The identification process that forms the theoretical framework of visual reconstruction has three steps: feature extraction, depth-of-field recovery, and representation and identification. This theoretical framework has been widely used in automotive stereoscopic vision systems. First, it performs a three-dimensional reconstruction of the target. Then the capturing device is calibrated to obtain sparse or dense parallaxes by three-dimensional matching. Finally, it completes the detection and identification of the target. Unfortunately, this method is not robust with respect to image processing, and the recognition rate is easily degraded. Moreover, this method has high time complexity, and it struggles to return recognised objects in time. Aloimonos also researched three-dimensional reconstruction algorithms, proposing a biological-vision approach to identify objects [6]. It also contains three main stages. First, associate the relevant objects with the targeted object; then explain the associated objects; and finally, select the most accurate description of the targeted object from all associated objects. It demonstrates that there is no necessary relationship between object recognition and reconstruction. The detailed information of visible and infrared images of the same scene was also analysed to reconstruct the vehicle contour for identification [7].
Edge detection: An edge consists of the pixels where the features of adjacent-region image pixels change. Edges are one of the basic features that can comprehensively and effectively segment the foreground and background of the target image. Sivaraman detected vertical edge features of the source image [8]. According to the detected features, a targeted vehicle was marked in the corresponding location of the source image to achieve detection. According to image edge features, Wong categorised the image into parts, such as background, foreground, and two horizontal sections [9]. For instance, sky and road are background, vehicles are treated as foreground, and the integrated edge information is used as two horizontal sections. Edge-detection based methods can distinguish obvious targets from the background. Due to surroundings, silhouettes, and other interference, the detection accuracy decreases under bright lighting. Not limited to geometric features, detailed information, such as the morphological similarity of the image, cartoon and texture information, and structure information, was used for the further identification of objects [10, 11].
Colour detection: Colour is a low-level image feature. The difference in colour can be used to distinguish between the background and target. Colour detection is commonly employed for obstacle detection. Chromaticity and saturation are typically not exactly the same among different objects. Although the intensity can be greatly affected by the amount of light, colour-based segmentation is only slightly disturbed. Guo proposed a vehicle detection method that combined vehicle taillight colour and morphic mask recognition [12]. It located the vehicle position in the source image by detecting vehicle taillights. Overall, it was found that over a wide range of brightness scenarios, the detection accuracy of colour-based detection methods is not stable, and can easily be affected by changes of environmental colours.
Optical flow field calculation: Horn and Schunck proposed the calculation method of optical flow field [13]. The method follows two cases: one is based on the grey-gradient invariant feature, and the other is based on the constant brightness assumption. Schaub and Baumgartner proposed their vehicle detection method based on the combination of double-level optical flow clustering [14]. It clusters the features of vehicle optical flow information first, then completes the vehicle detection and distance estimation in real time to prevent vehicle crashes automatically. Li used an optical-flow method to estimate the image optical flow information from time and space features, and then performed a three-dimensional reconstruction for vehicle identification and road-scene simulation [15]. The optical-flow based detection method is good for the detection of obvious obstacles, and does not depend on prior knowledge, but it is easily affected by changes of light and noise in the environment.
Machine learning: Recent developments in machine learning have also been applied to obstacle detection. Li proposed an online-boosting based vehicle tracking method, which analyses the high volume of vehicle features and realises real-time tracking of a targeted vehicle with low computation costs [16]. Based on Haar-like features, Khammar used the Adaboost method to verify the hypothetical area of vehicles [17]. Sivaraman combined an SVM [18] classifier with the histogram of oriented gradients (HOG) [19] feature detection method to sense vehicles by using two-step active learning [20]. Lee used statistical features to achieve real-time detection and tracking of multiple vehicles under 3D constraints [21]. Lee pointed out that a multi-scale classifier can be used to detect vehicles from any viewing angle, and that the targeted object and background can be described by two kinds of presented histograms. Satzoda employed adaptive learning on multiple sub-regions of the vehicle, such as lights, bumpers, and other symmetric objects [22]. The trained classifier was used to classify symmetric objects and extract the location of vehicles in images. It was an effective method for rapid vehicle detection. Zhang proposed a sparse-representation based vehicle identification and tracking method [23]. This shallow learning method uses a single learned dictionary to identify pedestrians and vehicles. The performance of this method is determined by the feature classifier. However, extracted single features cannot represent the target accurately. Traditional feature classifiers focus on solving binary classification, and their scalability is weak, so a traditional feature classifier is not suitable for target detection in complex scenes. Qi used machine learning techniques to explore low- and high-frequency components in source images for further obstacle identification [24, 25].
The geometric features of images, such as edges, contours, and textures, are analysed and used to classify and identify the objects in captured images [26, 27].
Neural-network based obstacle identification includes shallow and deep learning. Compared to a single-layer neural network of shallow learning, deeper networks can be more accurate, uncover more comprehensive information such as sign text, and directly extract features of various obstacles within the training set. They can classify environmental obstacles adaptively and effectively, and, moreover, can process multiple obstacles synchronously. Meanwhile, intelligent perception can be enhanced by deep learning to ensure real-time processing and high obstacle detection accuracy. Traditional obstacle detection and recognition methods suffer from low detection accuracy, while data over-fitting is an issue that seriously affects convolutional neural networks (CNNs). To achieve accurate and comprehensive detection and identification, this paper improves the traditional CNN and further explores detection and identification methods. Specifically, this paper proposes a deep CNN utilising a 15-layer network with obstacle detection and obstacle-area recommendation.
Five convolutional layers and two pooling layers are used to extract obstacle features. A four-layer region proposal network (RPN) recommends the obstacle area. A region of interest (ROI) pooling layer generates a fixed-size feature map for each region of interest. Two fully connected layers are employed to reduce data dimensionality and perform further feature extraction of targets. The output layer has five neurons, meaning that five categories of obstacles can be detected and identified. The proposed CNN trains its parameters from captured images.
This paper also proposes an algorithm for obstacle-area extraction in vehicle operation. The proposed solution not only analyses the underlying image features and applies four types of ROI extraction methods, but also considers visual attention, human–computer interaction, and the huge amount of real-time video data to be processed. Based on the region growing algorithm, it designs an automatic obstacle-area extraction method that integrates a maximum variance method with morphological operations. This method transforms the vehicle image from RGB colour space to Lab colour space. The grey-level pixel features and maximum variance are used to separate obstacles from the background. It applies morphological operations to eliminate redundant objects, fill area holes, and smooth edges. A regular pattern is used to extract obstacles from the vehicle image, and an obstacle dataset is created. The proposed obstacle detection and identification method can be applied to different types of obstacles, such as cars, motorcycles, and pedestrians. Compared to the traditional Gaussian mixture model with a Kalman filter, the proposed method improves obstacle categorisation and increases detection accuracy. This paper uses vehicle video from a real environment to test and verify the proposed obstacle detection and identification method. In brief, most conventional methods are designed based on optical flow and image edge features. The shape of the target is not considered as a whole, so the overall effect of target recognition is not good in complex environments. When a deep CNN is used to identify objects, the ROI is divided and calculated by deep neural networks (DNNs) and graphics processing units. This results in a significant increase in computation and power consumption. This paper makes the following three main contributions to solve existing issues in identification accuracy and efficiency:
(i) An ROI extraction solution based on the maximum difference method and morphology. It locks onto the domain of the targeted region efficiently, which lays a solid foundation for subsequent target recognition. It also makes the best use of computing resources to quickly extract the targeted ROI from source images.
(ii) An improved CNN method to achieve target recognition. An over-fitting solution is proposed and applied to CNN training. The over-fitting phenomenon in the CNN is effectively reduced. At the same time, the accuracy, effectiveness, and efficiency of multi-target recognition are improved.
(iii) An image target recognition solution that integrates the traditional extraction of the interested region and CNN target recognition. It improves the accuracy of image recognition and also the efficiency of calculation. Compared with conventional deep learning solutions, it can achieve better recognition effects with low computing resources.
The rest of this paper is structured as follows: Section 2 presents and specifies the proposed solutions; Section 3 simulates the proposed solutions and analyses experiment results; and Section 4 concludes this paper.
In theory, adding layers to neural networks should increase identification accuracy. However, beyond a point, adding layers does not always achieve better detection and identification rates. Different constraints affect the actual performance, such as the difficulty of network training, the size of the database, the support of hardware, the constraints of the specific application scenario, and so on. Utilising existing data resources, hardware devices, and so on, this paper builds a 15-layer neural network for obstacle detection and obstacle-area recommendation. Based on a large number of experiments, it finds that a 15-layer neural network reaches the best balance point for both detection accuracy and processing time. For neural networks, processing time grows rapidly with linear increases in layers. This work shows that a 15-layer neural network can achieve the expected detection accuracy efficiently.
The 15-layer neural network is composed as follows: five convolutional layers and two pooling layers are used to extract obstacle features. A four-layer RPN recommends obstacle areas. A one-layer ROI pooling layer generates a fixed-size feature map for each ROI. Two fully connected layers reduce data dimensionality and also perform feature extraction of the target. The output layer is set to five neurons, meaning five obstacle categories can be detected and identified. This paper improves the CNN and proposes a specific training method. The trained parameters are used during testing in experimentation. Based on the ROI algorithm, there are several improvements in the proposed obstacle detection solution. It uses the maximum variance of the region growing algorithm to obtain the dividing edges of the image automatically. Then a morphological operation is used to smooth the image edges. Finally, it separates the image edges from the ROI area completely. The detailed steps of obstacle detection and identification are shown in Fig. 1.

Fig. 1 Proposed vehicle obstacle recognition framework

Fig. 2 Proposed deep CNN

Fig. 3 Workflow of the proposed deep CNN
2.1.1 Establishment of neural network for obstacle detection: Based on the AlexNet structure model, the proposed deep CNN shown in Fig. 2 integrates deep learning algorithms, DNNs, and obstacle detection and identification methods. The proposed solution can be applied to the detection and identification of various obstacles in vehicle operation. It is essentially a training network that extracts obstacle features from a source image. The features of various obstacles, such as pedestrians or vehicles, are abstracted to obtain corresponding high-level features that can represent the whole image. Finally, it reports the classification probability for each obstacle category.
The positioning part of the BoundingBox and the ROI algorithms are integrated into the traditional neural network. This reduces both the complexity of the network and computational costs. Increases in the complexity of the environment or the number of obstacle categories result in more uncontrollable factors. As the network deepens, it can learn richer and more complex features. In theory, identification accuracy grows with the number of layers. In practice, when the number of layers increases, the identification accuracy does not always increase. Many factors constrain identification performance, such as the details of the network-training implementation, database size, hardware conditions, and specific application scenarios.
This paper builds a 15-layer deep CNN. Before Conv5, there are seven layers for the feature extraction of obstacles, including five convolutional layers and two pooling layers. Some network layers use the Maxout activation function and a normalisation operation. The RPN is a four-layer network consisting of Rpn_Conv1, Rpn_Cls Score, Rpn_Bbox_Pred, and Proposal. It is used to generate the recommendation of the obstacle area and reduce the image retrieval range. The ROI algorithm is then added as the ROI Pool1 layer, which generates a fixed-size feature map for each ROI. Two added fully connected layers achieve a reduction of data dimensionality and further feature extraction of the target. The output layer is set to five neurons, as five categories of obstacles can be detected and identified. In this paper, the created obstacle dataset and the PASCAL VOC 2007 and 2012 databases are used as sample data to train and test the proposed deep CNN. The key data transmission process of the proposed deep CNN is shown in Fig. 3. The input data is processed by convolution, Maxout, normalisation, and pooling operations, sequentially. After that, the output data is obtained.
For instance, it takes an image of size 227*227*3 (3 represents the three channels of an RGB colour image) as input, and applies five convolutional layers and two max-pooling layers to the input. The feature map obtained by conv1 is activated by the Maxout activation function. Normalisation uses the LRN function, with its parameter α set to 0.00005 and β to 0.75. Similarly, the previously standardised results are pooled in the following operations. After five convolutions and two poolings, the feature maps are 13*13*256, where 13*13 is the size of each feature map, and 256 is the number of feature maps.
After the fifth convolutional layer, the introduced RPN extracts image features. It works with the subsequent Rpn_Cls Score and Rpn_Bbox_Pred network layers to provide several candidate regions for the neural network, and obtains 36 feature maps of size 13*13 along with image proposals. The ROI Pool1 pooling layer is introduced to process the data. It uses the output of the fifth convolutional layer and the proposal results obtained by the RPN as input to obtain 256 feature maps of size 7*7. Constructing two fully connected layers, it obtains 4096 one-dimensional values as 1*1*4096-dimensional data. The convolution operation is performed by using a 1*1 convolution kernel, and the number of classification categories of the neural network is set to 5. The Maxout function activates the neurons, and obtains the recognition probability of the deep CNN for five types of obstacles as five float-type real numbers between 0 and 1.
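The 13*13 spatial size quoted above follows from the standard convolution arithmetic out = (in + 2*pad - kernel)/stride + 1. The sketch below traces an input of 227*227 through an AlexNet-style front end; the per-layer kernel/stride/padding values are assumptions for illustration, as the exact hyper-parameters are not listed in the text:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# AlexNet-style stack (kernel, stride, pad per layer are assumed values):
s = conv_out(227, 11, 4, 0)  # conv1 -> 55
s = conv_out(s, 3, 2)        # pool1 -> 27
s = conv_out(s, 5, 1, 2)     # conv2 -> 27
s = conv_out(s, 3, 2)        # pool2 -> 13
s = conv_out(s, 3, 1, 1)     # conv3 -> 13
s = conv_out(s, 3, 1, 1)     # conv4 -> 13
s = conv_out(s, 3, 1, 1)     # conv5 -> 13
print(s)  # 13, matching the 13*13 feature maps described above
```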
2.1.2 Improvements of convolutional neural network: There are three main improvements of the CNN in multi-category obstacle detection and identification.
(i) Maxout function: Traditional neural networks use the sigmoid or tanh function to activate neurons. ImageNet-series networks use the ReLU function, which better suits the biological neuron information transmission mechanism. For a deep CNN, optimisation strategies can not only accelerate network training, but also improve the accuracy of target detection in multi-class classification applications.
This paper analyses the Maxout activation function and further studies the improvement of the training process in neural networks. Maxout follows the idea of dropout but has its own working mechanism. Dropout disables each input neuron with a certain probability. Maxout is more extreme, taking the maximum value of multiple feature maps across the previous layer as the output. The proposed Maxout network shown in Fig. 4 has four neurons. From the input layer to the output of Maxout, it takes three affine transformations. There is no non-linear activation function within the transformations, which are purely linear. The three transformations take the maximum value to obtain the four Maxout output neurons. The mathematical expression of the Maxout activation function is shown as follows:
$$h_{i}(x) = \max_{j \in [1, k]} z_{ij}, \qquad z_{ij} = x^{\mathrm{T}} W_{\cdot ij} + b_{ij}, \quad W \in \mathbb{R}^{d \times m \times k}$$
where i and j are the layer number and the neuron number, respectively; $z_{ij}$ is the output of the jth neuron in the ith layer; W is the weight; b is the bias; d and m are the numbers of neurons in the input and hidden layers, respectively; k means that k neurons are picked from the hidden layer as a group; and h is the output of the corresponding hidden layer. The Maxout operation takes the maximum of the k output values of the k neurons in a group as the input of the next neuron. The Maxout function has powerful fitting ability and can fit any convex function. A multi-layer perceptron (MLP) is a general-purpose function fitter. Since each Maxout model can contain arbitrary affine transformations, which make the Maxout model a general function fitter, the Maxout model is an MLP-based activation function. As shown in Fig. 4, the blue neurons in the middle layer are set as a group of k neurons.
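As a concrete illustration of the definition above, here is a minimal pure-Python sketch of the Maxout operation: each of the m output neurons takes the maximum of k affine pieces, with no non-linearity inside. The weight layout [m][k][d] and the toy values are assumptions made for the example:

```python
def maxout(x, W, b):
    """Maxout layer: each of the m output neurons takes the maximum of
    k affine pieces z_ij = W_ij . x + b_ij (purely linear inside)."""
    m = len(W)              # number of output (group) neurons
    k = len(W[0])           # number of linear pieces per group
    out = []
    for i in range(m):
        z = [sum(W[i][j][t] * x[t] for t in range(len(x))) + b[i][j]
             for j in range(k)]
        out.append(max(z))  # max over the k linear pieces
    return out

# One group (m=1) with k=2 pieces over a 2-dimensional input:
# piece 0 selects x[0], piece 1 selects x[1]; Maxout keeps the larger.
print(maxout([1.0, 2.0], [[[1, 0], [0, 1]]], [[0.0, 0.0]]))  # [2.0]
```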
(ii) RPN: In this paper, the constructed RPN is a four-layer network for targeted-area recommendation. It is integrated after the CNN for target detection. The RPN and the target detection network share the convolutional layers. After the CNN extracts input image features, the obtained feature map is used in the RPN. The RPN convolutional layer and the fully connected layer provide a number of candidate areas for searching the target location.
(iii) ROI pooling layer: The ROI pooling layer is used to normalise input images of different sizes and unify them to a fixed scale. It generates same-size outputs for both the feature maps from the CNN and the areas recommended by the RPN. The subsequent fully connected layer further learns from the obtained images. It trains on the features of objects and their corresponding position information in the obtained images. It classifies the obstacles that appear and, at the same time, locates their specific positions in the original image or video. This makes obstacle detection and identification more accurate.
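The fixed-size output described above can be sketched as max-pooling over an adaptive grid of bins. This is a simplified pure-Python illustration; the integer-division bin boundaries are one plausible scheme, not necessarily the exact one used by the authors:

```python
def roi_pool(feature, out_h, out_w):
    """Max-pool an H x W feature window into a fixed out_h x out_w grid,
    so regions of any size yield a same-size feature map."""
    H, W = len(feature), len(feature[0])
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # adaptive bin boundaries; each bin covers at least one cell
            h0, h1 = i * H // out_h, max((i + 1) * H // out_h, i * H // out_h + 1)
            w0, w1 = j * W // out_w, max((j + 1) * W // out_w, j * W // out_w + 1)
            row.append(max(feature[r][c]
                           for r in range(h0, h1) for c in range(w0, w1)))
        pooled.append(row)
    return pooled

# A 13x13 feature map (as produced after conv5) pooled down to 7x7:
fmap = [[r * 13 + c for c in range(13)] for r in range(13)]
out = roi_pool(fmap, 7, 7)
print(len(out), len(out[0]))  # 7 7
```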
2.1.3 CNN training of vehicle identification: This paper uses the back propagation (BP) algorithm as the training method; it is the most commonly used learning algorithm for supervised neural networks. It needs tagged input and output samples as the objects of network learning and as the template for error adjustment. The CNN is one of the classic neural network models with supervised learning. To maximise the learning of input-sample features, the BP algorithm is used to adjust the network parameters constantly, so that the network approaches the global optimum as far as possible.
BP is based on gradient descent. The difference between the actual output and the expected output is transmitted back to the previous network layer. The neurons in the previous layer autonomously update their parameters according to the difference. For the weight correction in the network, a cost function must be defined to quantify the difference between the actual output and the expected output, so that BP can actually be performed with mathematical operations. For a single input sample (x, y), the cost function is defined as
$$E(W, b) = \frac{1}{2}\,\lVert O - y \rVert^{2}$$
where x and y are the input and expected-output vectors; E is the cost (error); W is the weight; b is the bias; and O is the actual output. When the number of samples is N, the cost function is defined as
$$E(W, b) = \frac{1}{N} \sum_{p=1}^{N} \frac{1}{2}\,\bigl\lVert O^{(p)} - y^{(p)} \bigr\rVert^{2}$$

Fig. 4 Working mechanism of maxout function
where p indexes the pth sample; the other parameters are the same as in the previous equation. In order to train the network to its best state, the weight matrix (W, b) needs to be adjusted to minimise the cost function E(W, b). As mentioned, the BP algorithm updates network parameters based on gradient descent. It searches for the optimal solution along the steepest descending direction of the cost-function gradient. In order to achieve the optimal training effect for the whole neural network, optimal training is carried out for each part of the neural network. Each weight is adjusted along the optimal gradient-descent direction.
The operation process of the BP algorithm is described in detail as follows. Global optimality means the cost function $E_n$ of a single training sample n in the neural network training set is minimal, as shown in the following equation:
$$E_{n} = \frac{1}{2} \sum_{k=1}^{K} \bigl(t_{k} - o_{k}\bigr)^{2}$$
where K is the number of output units as well as the number of target types in the dataset; $o_k$ is the output of the kth neuron in the output layer; and $t_k$ corresponds to the expected output of neuron k. In the training process of the neural network, each neuron in the network autonomously corrects its own weight and offset vector according to the gradient-descent algorithm. The update formulae are as follows:
$$W_{pq}^{(l)} \leftarrow W_{pq}^{(l)} - \eta\,\frac{\partial E(W, b)}{\partial W_{pq}^{(l)}}, \qquad b_{p}^{(l)} \leftarrow b_{p}^{(l)} - \eta\,\frac{\partial E(W, b)}{\partial b_{p}^{(l)}}$$
where η is the learning rate of the neural network; $W_{pq}^{(l)}$ is the connection weight between the qth neuron of layer l and the pth neuron of layer l + 1; and $b_{p}^{(l)}$ is the bias term of layer l feeding the pth neuron of layer l + 1.
To estimate the contribution of a neuron in any layer to the total error of the neural network, the residual variable is defined as
$$\delta_{p}^{(l)} = \frac{\partial E}{\partial z_{p}^{(l)}}$$
where $z_{p}^{(l)}$ is the weighted input of the pth neuron in layer l.
The residual of an output neuron is
$$\delta_{p}^{(L)} = \bigl(o_{p} - t_{p}\bigr)\, f'\!\bigl(z_{p}^{(L)}\bigr)$$
where L is the number of neural network layers and $S_l$ is the total number of neurons in the lth layer.
The residual of a neuron in a hidden layer is
$$\delta_{p}^{(l)} = \Biggl(\sum_{q=1}^{S_{l+1}} W_{qp}^{(l)}\, \delta_{q}^{(l+1)}\Biggr) f'\!\bigl(z_{p}^{(l)}\bigr)$$
According to the chain rule, the partial derivatives with respect to the connection weights and offsets are calculated as follows:
$$\frac{\partial E}{\partial W_{pq}^{(l)}} = a_{q}^{(l)}\, \delta_{p}^{(l+1)}, \qquad \frac{\partial E}{\partial b_{p}^{(l)}} = \delta_{p}^{(l+1)}$$
where $a_{q}^{(l)}$ is the activation of the qth neuron in layer l.
The training of a CNN has three parts: (i) according to the input training data, calculate the corresponding actual output by using forward propagation; (ii) compute the difference between the actual output and the corresponding ideal output; (iii) use BP to adjust the weight matrix to minimise the error.
Forward propagation: A set of initial data is given to the neural network. The neurons of the input layer transmit it to the corresponding neuron positions of the hidden layer. An intermediate output is obtained from the weighted sum of the inputs with the neuron connection weights. Then an appropriate function is selected to activate the intermediate output. The output of this layer, from front to back, is obtained. The obtained result is used as the input of the next network layer for further training in subsequent networks.
Backward propagation: After the completion of a forward propagation, the actual output of the neural network is obtained. The actual output is compared with the expected output sample, and the performance of the network is evaluated initially. However, the network output obtained by forward propagation alone is typically far from the expectation; at this stage, the network has no self-learning ability. So a backward propagation mechanism needs to be introduced to train the network and ensure that the network can achieve real self-learning. The data related to forward propagation is processed in backward propagation. As the equivalent of a network feedback operation, it transmits information from back to front. Based on the basic criterion of gradient descent, backward propagation transmits the difference between the actual output and the expected output to the previous network layer. According to the difference, the neurons in the previous network layer update their parameters automatically.
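The forward and backward passes described above, together with the residual and gradient-descent update equations, can be sketched end-to-end on a toy network. The architecture (one input, two sigmoid hidden neurons, one sigmoid output), the single training sample, and the learning rate are all illustrative assumptions:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# Toy network: 1 input -> 2 sigmoid hidden neurons -> 1 sigmoid output,
# trained by back propagation on a single sample (x, t) = (1.0, 1.0).
W1 = [random.uniform(-1, 1) for _ in range(2)]  # input -> hidden weights
b1 = [0.0, 0.0]
W2 = [random.uniform(-1, 1) for _ in range(2)]  # hidden -> output weights
b2 = 0.0
x, t, eta = 1.0, 1.0, 0.5                       # eta is the learning rate

losses = []
for _ in range(50):
    # forward propagation
    h = [sigmoid(W1[i] * x + b1[i]) for i in range(2)]
    o = sigmoid(sum(W2[i] * h[i] for i in range(2)) + b2)
    losses.append(0.5 * (o - t) ** 2)           # cost E = 1/2 (o - t)^2
    # backward propagation: output residual, then hidden residuals
    delta_o = (o - t) * o * (1 - o)
    delta_h = [W2[i] * delta_o * h[i] * (1 - h[i]) for i in range(2)]
    # gradient-descent parameter updates
    for i in range(2):
        W2[i] -= eta * delta_o * h[i]
        W1[i] -= eta * delta_h[i] * x
        b1[i] -= eta * delta_h[i]
    b2 -= eta * delta_o

print(losses[0] > losses[-1])  # True: the cost decreases with training
```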
In this paper, obstacle detection and identification has three main parts, as follows. The first part is the extraction of obstacle areas. It analyses the underlying features of the image and the related ROI extraction algorithm. Based on the maximum variance algorithm combined with the morphological operations of the region growing algorithm, it completes the extraction of the interested image regions as the extraction of obstacle areas. In the second part, the CNN is used to extract the features of targeted obstacles. It builds a deep CNN for multiple types of obstacle detection and identification in the vehicle environment. The RPN is integrated to achieve the extraction of targeted-object features and the recommendation of targeted areas. The third part is obstacle identification. In a real vehicle environment, it tests CNN performance on various types of obstacle detection, such as real-time detection of obstacles, identification accuracy, and locating the obstacle position with a rectangular frame.
Based on the maximum variance method combined with morphological operations [28], the proposed method can extract the ROI region automatically. First, it extracts the RGB colour information of the image, and separates the corresponding R, G, and B components. Then it uses the makecform and applycform functions to convert the RGB colour space to the Lab colour space, and selects one colour component of the Lab space. Based on the grey features of the image, it uses the maximum variance method to automatically obtain the image-segmentation threshold. The graythresh function is used to divide the image into two categories, target and background. To binarise the grey image, it calls the dilation function imdilate and the erosion function imerode to process the binarised image. It performs different operations, such as smoothing the boundary, clearing small redundant objects, and filling small holes between different areas. It tries to keep the ROI areas as complete as possible. It uses the imfill function as a filling operation to fill the holes in ROI areas surrounded by boundary lines. All colour information in non-ROI areas of the image is set to 0 to eliminate the background, keeping the colour information in the ROI areas. In this way, the vehicle images can be separated out of the ROI area.
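The imdilate/imerode clean-up step can be illustrated with a minimal pure-Python version of binary morphology. A fixed 3x3 structuring element and clamped image borders are simplifying assumptions here; MATLAB's functions are considerably more general:

```python
def dilate(img):
    """Binary dilation with a 3x3 structuring element (borders clamped)."""
    H, W = len(img), len(img[0])
    return [[int(any(img[rr][cc]
                     for rr in range(max(0, r - 1), min(H, r + 2))
                     for cc in range(max(0, c - 1), min(W, c + 2))))
             for c in range(W)] for r in range(H)]

def erode(img):
    """Binary erosion with a 3x3 structuring element (borders clamped)."""
    H, W = len(img), len(img[0])
    return [[int(all(img[rr][cc]
                     for rr in range(max(0, r - 1), min(H, r + 2))
                     for cc in range(max(0, c - 1), min(W, c + 2))))
             for c in range(W)] for r in range(H)]

# A 7x7 binary mask: a 3x3 object with a one-pixel hole at its centre.
mask = [[1 if 2 <= r <= 4 and 2 <= c <= 4 and not (r == 3 and c == 3) else 0
         for c in range(7)] for r in range(7)]
closed = erode(dilate(mask))  # closing: fills the hole, smooths the edge
print(closed[3][3], closed[0][0])  # 1 0 (hole filled, background kept)
```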
There is a large amount of work and complex computation when a CNN is applied to obstacle detection and identification over an entire image or video. In the vehicle environment, driving vehicles may collide or scrape only when obstacles appear in a certain area around them, so an ROI extraction algorithm is introduced. The ROI extraction algorithm not only effectively reduces the search range and the network complexity, but also locates and separates the target area from the source image, which improves the accuracy of obstacle detection and identification. In order to handle large-scale image processing, this paper proposes a new ROI extraction algorithm based on the region growing algorithm. This algorithm uses the maximum variance criterion within the region growing algorithm to obtain the separating boundary of the image automatically, and then uses the morphological operation to smooth the image boundary so that the ROI region can be separated completely.
ROI extraction with the region growing algorithm first separates the image according to certain criteria and then annotates the separated regions. Finally, the ROI is extracted from these results by different methods, such as the histogram method, the P-parameter method, the threshold method based on maximum variance, and the grey difference method based on pixels in adjacent regions. Building on the automatic threshold acquisition of the maximum variance method in the region growing algorithm, this paper integrates the morphological operation to improve the ROI extraction algorithm, realising the automatic separation and extraction of interested image regions.
An image that contains only two types of areas, target and background, is taken as an example. The maximum variance method obtains the image threshold automatically, and the whole ROI extraction process then proceeds from this threshold. According to the grey-level threshold T, the pixels of the image are divided into two regions C0 and C1 as follows:

$$C_0 = \{(x, y) \mid f_{\min} \le f(x, y) \le T\}, \qquad C_1 = \{(x, y) \mid T < f(x, y) \le f_{\max}\}$$
where $f_{\min}$ and $f_{\max}$ ($f_{\min} \le f \le f_{\max}$) are the minimum and maximum grey-level values of image f(x, y), respectively. Let $N_i$ be the number of pixels with grey level i; the total number of pixels of image f(x, y) is then $N = \sum_{i=f_{\min}}^{f_{\max}} N_i$. Suppose that the occurrence probability of each grey level is $P(i) = N_i / N$.
The total occurrence probability of area C0 is

$$w_0 = \sum_{i=f_{\min}}^{T} P(i)$$
Its average value is

$$u_0 = \frac{1}{w_0} \sum_{i=f_{\min}}^{T} i \, P(i)$$
The total occurrence probability of area C1 is

$$w_1 = \sum_{i=T+1}^{f_{\max}} P(i) = 1 - w_0$$
Its average value is

$$u_1 = \frac{1}{w_1} \sum_{i=T+1}^{f_{\max}} i \, P(i)$$
The average grey-level value of the whole image f(x, y) is $u = w_0 u_0 + w_1 u_1$, where $u_0$ and $u_1$ are the average grey-level values of areas C0 and C1, respectively, and u is the average grey-level value of the whole image. The variance between the two regions is

$$\sigma^2(T) = w_0 (u_0 - u)^2 + w_1 (u_1 - u)^2$$
When the variance between the two regions to be divided reaches its maximum value, the current threshold is the optimal threshold $T^{\ast} = \arg\max_{f_{\min} \le T < f_{\max}} \sigma^2(T)$. At this point, the optimal segmentation of the two regions is obtained.
Based on the maximum variance, this method obtains the threshold automatically. The grey-level features of the image are used to achieve the best separation of target and background: the larger the variance between foreground and background, the less similar the features of the two image regions are. When the variance between the two regions reaches its maximum value, the image is divided at the best point; missing the optimal threshold T* may cause errors or misjudgement. For multiple targeted areas in an image, the above method only needs to be extended and iterated multiple times.
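The threshold selection described above can be sketched as a short pure-Python routine. This is an illustrative reimplementation of the maximum variance criterion, not the paper's Matlab graythresh call; the 256-level histogram and the function name are assumptions.

```python
# Hedged sketch of maximum-variance (Otsu-style) threshold selection,
# following the w0, u0, w1, u1, sigma^2 definitions above.

def otsu_threshold(pixels, levels=256):
    """Return the grey level T that maximises the between-class variance."""
    n = len(pixels)
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    prob = [h / n for h in hist]              # P(i) = N_i / N
    best_t, best_var = 0, -1.0
    for t in range(levels - 1):
        w0 = sum(prob[:t + 1])                # total probability of C0
        w1 = 1.0 - w0                         # total probability of C1
        if w0 == 0 or w1 == 0:
            continue
        u0 = sum(i * prob[i] for i in range(t + 1)) / w0
        u1 = sum(i * prob[i] for i in range(t + 1, levels)) / w1
        u = w0 * u0 + w1 * u1                 # mean of the whole image
        var = w0 * (u0 - u) ** 2 + w1 * (u1 - u) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t
```

For a clearly bimodal grey-level distribution, the returned T falls between the two modes, which is exactly the "best separation of target and background" the text describes.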
To study problems more accurately and comprehensively, the training set of a DNN is generally very large; for example, the classic AlexNet network model uses the ImageNet dataset, which contains 15 M images in 2200 categories. If there are not sufficient training samples to ensure that the network can learn enough features and generalise, the overfitting problem caused by insufficient data volume must be solved. Dropout and DropConnect are two methods for avoiding overfitting.
Dropout optimisation: The working principle of dropout is shown in Fig. 5. It breaks the conventional arrangement of the network and varies the combination of active connections. When the number of network training parameters is large and the training samples are insufficient, dropout can reduce overfitting effectively and generalise the network to a certain extent. The BP algorithm is used in the network training process.


Fig. 5 Working principle of dropout
For normally connected neurons, the parameter transfer formula is

$$z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \mathbf{y}^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\big(z_i^{(l+1)}\big) \quad (18)$$

For neurons to which dropout is applied, the parameter transfer formula is

$$r_j^{(l)} \sim \text{Bernoulli}(p), \qquad \tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} \ast \mathbf{y}^{(l)}, \qquad z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)} \quad (19)$$
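The two parameter-transfer variants can be sketched in a few lines of Python. This is an illustrative toy, not the paper's network code; the function names and the keep_p parameter (probability of keeping a unit) are assumptions.

```python
import random

# Hedged sketch of a single neuron's weighted sum, with and without
# dropout: each input is kept with probability keep_p via a Bernoulli
# mask r, then the usual sum z = w . x + b is computed.

def dense(x, w, b):
    """Ordinary neuron: z = sum_j w_j * x_j + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def dropout_dense(x, w, b, keep_p, rng=random):
    """Dropout variant: x~_j = r_j * x_j, with r_j ~ Bernoulli(keep_p)."""
    r = [1 if rng.random() < keep_p else 0 for _ in x]
    return sum(wj * rj * xj for wj, rj, xj in zip(w, r, x)) + b
```

With keep_p = 1 the dropout neuron reduces to the ordinary one; with keep_p = 0 only the bias survives, which makes the masking behaviour easy to check.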
Local response normalisation, which is integrated into the convolution operation of the network, is used to normalise local outputs after the convolution operation. It performs lateral inhibition of some neurons; essentially, it is a way of activating neurons selectively. The formula is as follows:

$$b_{x,y}^{i} = a_{x,y}^{i} \Bigg/ \Bigg(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a_{x,y}^{j}\big)^2\Bigg)^{\beta} \quad (20)$$
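The normalisation can be sketched directly from the formula above. This is an illustrative Python version; the hyperparameter values k, alpha, beta, and n are common AlexNet-style defaults, not values reported in this paper.

```python
# Hedged sketch of local response normalisation across channels:
# each activation a^i at one spatial position is divided by a term that
# grows with the squared activations of its n neighbouring channels.

def lrn(activations, k=2.0, alpha=1e-4, beta=0.75, n=5):
    """activations: per-channel values a^i_{x,y} at one spatial position."""
    N = len(activations)
    out = []
    for i, a in enumerate(activations):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        s = sum(activations[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (k + alpha * s) ** beta)
    return out
```

Channels with strong neighbours are damped more, which is the lateral inhibition the text describes.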
DropConnect: This is an improvement of Dropout; it is essentially the same, with a slightly different implementation. Unlike Dropout, which sets the output of neurons in the hidden layer to 0 with probability p, DropConnect sets each connection weight between neuron nodes to 0 with probability p. DropConnect randomly abandons some connections in the fully connected layer, so that adaptive dependence is reduced before activation. Fig. 6 shows the working principle of DropConnect. The formula of DropConnect is as follows:

$$y = f\big((M \ast W)\, x\big), \qquad M_{ij} \sim \text{Bernoulli}(p) \quad (21)$$
The symbols in (21) are the same as those defined in (18) and (19). When the network does not use the DropConnect operation, the parameter transmission between neurons follows (18); when DropConnect is added, each network connection weight is set to 0 with probability p drawn from a Bernoulli distribution, as in (21).
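For comparison with the dropout sketch, a DropConnect-style neuron masks individual weights rather than inputs. Again a hedged, illustrative Python toy under the same keep_p assumption, not the paper's implementation.

```python
import random

# Hedged sketch of DropConnect as in (21): a Bernoulli mask M is applied
# to the weights W themselves, so individual connections (not whole
# inputs) are dropped before the weighted sum.

def dropconnect_dense(x, w, b, keep_p, rng=random):
    """z = sum_j (M_j * w_j) * x_j + b, with M_j ~ Bernoulli(keep_p)."""
    m = [1 if rng.random() < keep_p else 0 for _ in w]
    return sum(mj * wj * xj for mj, wj, xj in zip(m, w, x)) + b
```

The only difference from the dropout variant is where the mask sits: on W here, on x there — which is exactly the distinction the surrounding text draws.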

Fig. 6 Working principle of DropConnect
Both Dropout and DropConnect are proposed to solve the overfitting problem caused by insufficient training samples; DropConnect is a de-overfitting method that improves on Dropout. Both perform sparse processing of the network connection layer, eliminating redundant information in the network training process as much as possible and improving the generalisation ability of the network. There are some differences between the two. Dropout mainly eliminates the network's dependency on specific features: it changes the network structure by randomly 'dropping out' neuron connections, so it can provide more training modes and finally improve the robustness of the network. DropConnect performs the sparse operation on the weights of the network: it reduces the training parameters and improves the training speed, while also enhancing the generalisation ability of the network. The comparison of Dropout and DropConnect is further discussed in Section 3.
This section analyses the detection and identification of multiple obstacles with the proposed CNN. It mainly focuses on the detection and identification of the car, person, motorbike, bus, and bicycle classes that frequently appear in the real driving environment. The proposed method is simulated and compared with existing solutions.
The proposed method is implemented in a Windows environment using Matlab 2016a. Table 1 shows the environment and configuration information. Caffe [30] is a general deep learning framework developed by the Berkeley Vision and Learning Center. Based on the Matlab framework of Caffe, this paper uses a text format to define the new model in design and implementation. Caffe's code and models are open source and it runs fast, which facilitates secondary development by researchers. An NVIDIA GeForce GTX 980 Ti graphics card is used to implement the related graphics operations. In order to demonstrate the effectiveness and advantage of the proposed method, this paper selects both PASCAL VOC 2007 and PASCAL VOC 2012 [29] as the training and testing sets of the improved deep CNN. PASCAL VOC 2007 contains 9963 images used in training, verifying, and testing, and marks 24,640 objects. Similarly, PASCAL VOC 2012 contains 11,540 images and marks 27,450 objects. A 10 h video with 85 clips is collected from the real driving environment and is used in network training and testing for real-time obstacle detection. The PASCAL VOC database contains 20 types of objects; this paper only detects and identifies five of them, namely car, person, motorbike, bus, and bicycle.

Table 1 Experimentation environment and configurations
There are three main steps as follows:
• Extraction of obstacle areas: Collect a large number of vehicle videos, separate them into sub-frames, and obtain a large number of single-frame images. The ROI region of each image is extracted by the maximum variance method of the region growing algorithm integrated with the morphological operation. The target area of the ROI is marked by a rectangular frame, and the target is split from the original image. An obstacle dataset with 20,000 images is created.
• Feature extraction: The obstacle dataset, the PASCAL VOC 2007, and the PASCAL VOC 2012 databases are used to train and test the improved deep CNN. Five layers of convolution and pooling operations are used to extract obstacle features. The RPN network and the CNN, as shared convolutional layers, recommend obstacle areas. ROI pooling unifies the size of the feature maps in target detection and area recommendation, and feeds them into the follow-up training of the fully connected and convolutional layers to achieve comprehensive extraction of image features.
• Obstacle identification: The output layer has five neurons, as five types of obstacles can be detected and identified. It obtains the category information and identification accuracy of each obstacle, and locates obstacles in real time with a BoundingBox, producing two types of output information, 'Detections' and 'Scores'.
This section first tests the image classification of multiple-group networks using the PASCAL VOC database. Then, it tests obstacle detection and identification in the real environment. Finally, it compares the results from the simulated and real environments.
Testing of the improved neural network: The previous section discussed de-overfitting of the CNN; Dropout and DropConnect are often used to solve overfitting issues caused by insufficient samples. Here, the results of the improved neural network with and without Dropout and DropConnect, respectively, are compared.
Table 2 shows the identification accuracy of the improved neural network with and without Dropout. The results illustrate that Dropout can enhance target identification: the corresponding identification accuracy of the improved neural network on PASCAL VOC 2007 and PASCAL VOC 2012 increases by 0.3% and 3%, respectively.
The identification accuracy of the improved neural network with and without DropConnect is shown in Table 3. Similarly, DropConnect can also improve the target identification accuracy: the corresponding identification accuracy of the improved neural network on PASCAL VOC 2007 and PASCAL VOC 2012 increases by 1.5% and 4.1%, respectively.

Table 2 Identification accuracy comparison of improved CNN with and without dropout
Comparing Dropout and DropConnect, both improve the identification accuracy of the neural network on PASCAL VOC 2007 and PASCAL VOC 2012. According to our experimental results, DropConnect is slightly better than Dropout.
Tables 4 and 5 show the testing results of different methods on the PASCAL VOC 2007 and PASCAL VOC 2012 databases, respectively. Compared with RCNN, Fast RCNN, and Faster RCNN, the proposed CNN network has shorter processing time and higher classification accuracy. On the PASCAL VOC 2007 database, the average classification accuracy of the proposed CNN is 22.8% higher than RCNN. Compared with Faster RCNN, the proposed CNN still improves average classification accuracy by 2.7%. The processing times of Faster RCNN and the proposed CNN are 0.067 and 0.055 s/frame, respectively, so the processing speed increases by about 18%. Compared with RCNN at 6.9 s/frame and Fast RCNN at 3 s/frame, the proposed method improves processing speed significantly. YOLO has the shortest processing time among the five methods, so the proposed CNN still needs further improvement, especially in processing time.
Although the processing speed and average classification accuracy on PASCAL VOC 2012 are lower than the corresponding results on PASCAL VOC 2007, the proposed CNN still performs better than the others. The average classification accuracy of the proposed CNN is significantly higher than RCNN and YOLO, 5.45% higher than Fast RCNN, and slightly higher than Faster RCNN. The proposed CNN has much faster processing speed than RCNN and Fast RCNN. Compared with Faster RCNN, which has high classification accuracy, the proposed method increases the processing speed from 0.072 to 0.058 s/frame, about 19.4% growth. YOLO has the best processing time, but it sacrifices classification accuracy.

Table 4 PASCAL VOC 2007 comparison testing

Table 5 PASCAL VOC 2012 comparison testing
The results of the above two comparison experiments show that the deep CNN applied to obstacle detection and identification in the vehicle environment can meet the real-time and accuracy requirements of practical applications.
In order to verify the feasibility of the CNN in multi-category obstacle detection and identification, vehicle videos are collected from the real driving environment for verification. All videos used in the experiments were captured under normal light.
Fig. 7 shows the result of obstacle detection and identification in a simple traffic condition. The proposed CNN network achieves accurate detection and identification of cars. The detection accuracy for the close-distance and far-distance cars is 99.2% and 87.3%, respectively. The overall accuracy is high, and the close-distance car is detected and identified with higher accuracy than the far-distance car.
In most cases, however, the traffic condition is complex, with many types of obstacles such as cars, persons, motorcycles, and buses. The proposed deep convolutional neural network can detect and identify all of these obstacles with high accuracy. Even if the captured obstacle information is incomplete, the obstacle can still be accurately detected and identified. Fig. 8 contains only vehicles, and the proposed CNN network detects and identifies all of them. In Figs. 9 and 10, there are cars, persons, and motorbikes; the multi-category obstacles are all detected and identified by the proposed CNN network. All identification accuracies are >60%, and most of them are >90%.

Fig. 7 Obstacle detection and identification in simple traffic condition

Fig. 8 Obstacle detection and identification of only vehicles in driving environment

Fig. 9 Obstacle detection and identification of human–vehicle interaction in driving environment – 1

Fig. 10 Obstacle detection and identification of human–vehicle interaction in driving environment – 2
The obstacle detection of a single-frame image by the CNN network takes about 0.055 s, the same as the classification time for a PASCAL VOC image, which ensures real-time detection. The follow-up processing time of testing is not repeated here. The test results show that the proposed CNN network has high accuracy in real-time detection and identification of multi-category obstacles under normal light, regardless of whether the traffic condition is simple or complex.
In order to systematically analyse the performance of the proposed method, it is compared with a Gaussian mixture model with Kalman filter for obstacle detection and identification. The Gaussian mixture model with Kalman filter detects and identifies obstacles based on the colour information of the image and the moving trend of the target in the image. This method selects several Gaussian models to model the colour information of the image, and then analyses and predicts the colour information of the foreground and the background, respectively. At the same time, the Kalman filter is used to predict the moving trend of the dynamic object and track the obstacles. Because this method only uses colour information to detect obstacles, it is easily affected by the environment and has a high error rate in detection and identification. Tables 6 and 7 show the detection and identification results of these two methods. In Table 6, the proposed method has more than 90% identification accuracy on car and person. For motorbike, bus, and bicycle, the identification accuracy is a little lower, but it is still more than 80%. Table 7 shows the identification results of the Gaussian mixture model with Kalman filter. The Gaussian mixture model can only identify car and person; the other three types of objects cannot be identified. Based on the same number of samples, the Gaussian mixture model only achieves 61.6% and 56.5% identification accuracy on car and person, respectively. The proposed CNN extracts colour and contour information of obstacles, and its hierarchical structure can extract the high-level features of the target layer by layer, so it has high identification accuracy.

Table 6 Performance of the proposed CNN in multi-category obstacle identification

Table 7 Performance of Gaussian mixture model in multi-category obstacle identification
The comparison results show that the proposed deep CNN not only improves the efficiency and accuracy of obstacle detection and identification, but also achieves the detection and identification of multi-category obstacles.
In this paper, we have devised a deep CNN to detect and identify multi-category obstacles. It improves on the traditional CNN, and achieves a higher identification accuracy as shown in Tables 6 and 7. Meanwhile, it adds three additional categories of obstacles while being more algorithmically efficient.
A large number of samples is used to test the performance of the proposed solution. The proposed solution has more than 80% overall identification accuracy in the complex driving environment; for humans and cars, the identification accuracy reaches 90%. Comparison of the experimental results shows that the proposed solution has better performance in efficiency and identification accuracy than traditional solutions.
This work is jointly supported by the National Natural Science Foundation of China under grant 61703347; the Chongqing Natural Science Foundation under grant cstc2016jcyjA0428; the Common Key Technology Innovation Special of Key Industries under grant nos. cstc2017zdcy-zdyf0252 and cstc2017zdcy-zdyfX0055; the Artificial Intelligence Technology Innovation Significant Theme Special Project under grant nos. cstc2017rgzn-zdyf0073 and cstc2017rgzn-zdyf0033; and the China University of Mining and Technology Teaching and Research Project (2018ZD03, 2018YB10).
[1] Li, Y., Sun, Y., Huang, X., et al.: 'An image fusion method based on sparse representation and sum modified-Laplacian in NSCT domain', Entropy, 2018, 20, (7), p. 522
[2] Zhu, Z., Qi, G., Chai, Y., et al.: 'A novel multi-focus image fusion method based on stochastic coordinate coding and local density peaks clustering', Future Internet, 2016, 8, (4), p. 53
[3] Tsai, W., Qi, G., Chen, Y.: 'A cost-effective intelligent configuration model in cloud computing'. 2012 32nd Int. Conf. on Distributed Computing Systems Workshops, Macau, China, June 2012, pp. 400–408. doi: 10.1109/ICDCSW.2012.46
[4] Tsai, W.T., Qi, G.: 'DICB: dynamic intelligent customizable benign pricing strategy for cloud computing'. 2012 IEEE Fifth Int. Conf. on Cloud Computing, Honolulu, HI, USA, June 2012, pp. 654–661
[5] Disler, D.G., Scott, M.D., Rosenthal, D.I.: 'Accuracy of volume measurements of computed tomography and magnetic resonance imaging phantoms by three-dimensional reconstruction and preliminary clinical application', Invest. Radiol., 1994, 29, (8), pp. 739–745
[6] Aloimonos, Y., Rosenfeld, A.: 'A response to "ignorance, myopia, and naiveté in computer vision systems" by R.C. Jain and T.O. Binford', CVGIP, Image Underst., 1991, 53, (1), pp. 120–124
[7] Zhu, Z., Qi, G., Chai, Y., et al.: 'A novel visible-infrared image fusion framework for smart city', Int. J. Simul. Process Model., 2018, 13, (2), pp. 144–155
[8] Sivaraman, S., Trivedi, M.M.: 'Vehicle detection by independent parts for urban driver assistance', IEEE Trans. Intell. Transp. Syst., 2013, 14, (4), pp. 1597–1608
[9] Wong, C.C., Siu, W.C., Jennings, P., et al.: 'A smart moving vehicle detection system using motion vectors and generic line features', IEEE Trans. Consum. Electron., 2015, 61, (3), pp. 384–392
[10] Qi, G.: 'Multi-focus image fusion via morphological similarity-based dictionary construction and sparse representation', CAAI Trans. Intell. Technol., 2018, 3, pp. 83–94
[11] Zhu, Z., Yin, H., Chai, Y., et al.: 'A novel multi-modality image fusion method based on image decomposition and sparse representation', Inf. Sci., 2018, 432, pp. 516–529
[12] Guo, J.M., Hsia, C.H., Wong, K., et al.: 'Night-time vehicle lamp detection and tracking with adaptive mask training', IEEE Trans. Veh. Technol., 2016, 65, (6), pp. 4023–4032
[13] Horn, B.K., Schunck, B.G.: 'Determining optical flow', Artif. Intell., 1981, 17, (1), pp. 185–203
[14] Schaub, A., Baumgartner, D., Burschka, D.: 'Reactive obstacle avoidance for highly maneuverable vehicles based on a two-stage optical flow clustering', IEEE Trans. Intell. Transp. Syst., 2016, 18, (8), pp. 2137–2152
[15] Li, Y., Liu, Y., Su, Y., et al.: 'Three-dimensional traffic scenes simulation from road image sequences', IEEE Trans. Intell. Transp. Syst., 2016, 17, (4), pp. 1121–1134
[16] Li, W.H., Liu, P.X., Wang, Y., et al.: 'Co-training algorithm based on on-line boosting for vehicle tracking'. 2013 IEEE Int. Conf. on Information and Automation (ICIA), Yinchuan, China, August 2013, pp. 592–596
[17] Khammari, A., Nashashibi, F., Abramson, Y., et al.: 'Vehicle detection combining gradient analysis and AdaBoost classification'. Proc. 2005 IEEE Intelligent Transportation Systems, Vienna, Austria, September 2005, pp. 66–71
[18] Bennett, K.P., Bredensteiner, E.J.: 'Duality and geometry in SVM classifiers'. Proc. of the Seventeenth Int. Conf. on Machine Learning, ICML '00, San Francisco, CA, USA, 2000, pp. 57–64. ISBN 1-55860-707-2
[19] Xie, Y., Liu, L.-F., Li, C.-H., et al.: 'Unifying visual saliency with HOG feature learning for traffic sign detection'. 2009 IEEE Intelligent Vehicles Symp., Xi'an, China, June 2009, pp. 24–29
[20] Sivaraman, S., Trivedi, M.M.: 'Active learning for on-road vehicle detection: a comparative study', Mach. Vis. Appl., 2014, 25, (3), pp. 599–611
[21] Lee, K.H., Hwang, J.N., Chen, S.I.: 'Model-based vehicle localization based on 3-D constrained multiple-kernel tracking', IEEE Trans. Circuits Syst. Video Technol., 2015, 25, (1), pp. 38–50
[22] Satzoda, R.K., Trivedi, M.M.: 'Multipart vehicle detection using symmetry-derived analysis and active learning', IEEE Trans. Intell. Transp. Syst., 2016, 17, (4), pp. 926–937
[23] Zhang, Z., Xu, H., Chao, Z., et al.: 'A novel vehicle reversing speed control based on obstacle detection and sparse representation', IEEE Trans. Intell. Transp. Syst., 2015, 16, (3), pp. 1321–1334
[24] Qi, G., Wang, J., Zhang, Q., et al.: 'An integrated dictionary-learning entropy-based medical image fusion framework', Future Internet, 2017, 9, (4), p. 61
[25] Qi, G., Zhu, Z., Erqinhu, K., et al.: 'Fault-diagnosis for reciprocating compressors using big data and machine learning', Simul. Modelling Pract. Theory, 2018, 80, (Supplement C), pp. 104–127
[26] Wang, K., Qi, G., Zhu, Z., et al.: 'A novel geometric dictionary construction approach for sparse representation based image fusion', Entropy, 2017, 19, (7), p. 306
[27] Zhu, Z., Qi, G., Chai, Y., et al.: 'A geometric dictionary learning based approach for fluorescence spectroscopy image fusion', Appl. Sci., 2017, 7, (2), p. 161
[28] Wenqing, L., Jianzhuang, L.: 'The automatic thresholding of gray-level pictures via two-dimensional Otsu method', Acta Autom. Sin., 1993, 19, (1), p. 101
[29] Hoiem, D., Chodpathumwan, Y., Dai, Q.: 'Diagnosing error in object detectors'. Computer Vision – ECCV 2012, Berlin, Heidelberg, 2012, pp. 340–353
[30] Jia, Y., Shelhamer, E., Donahue, J., et al.: 'Caffe: convolutional architecture for fast feature embedding'. CoRR, abs/1408.5093, 2014. Available at http://arxiv.org/abs/1408.5093
CAAI Transactions on Intelligence Technology, 2019, Issue 2