Image Deraining for UAV Using Split Attention Based Recursive Network

2020-09-16 01:13:10

Transactions of Nanjing University of Aeronautics and Astronautics, 2020, Issue 4


College of Computer Science and Technology/College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, P.R. China

(Received 15 June 2020; revised 10 July 2020; accepted 25 July 2020)

Abstract: Images captured on rainy days suffer from noticeable degradation of scene visibility. Unmanned aerial vehicles (UAVs), as important outdoor image acquisition systems, demand a proper rain removal algorithm to improve the visual quality of captured images as well as the performance of many subsequent computer vision applications. To deal with rain streaks of different sizes and directions, this paper proposes to employ convolutional kernels of different sizes in a multi-path structure. Split attention is leveraged to enable communication across multiscale paths at the feature level, which allows an adaptive receptive field to tackle complex situations. We incorporate the multi-path convolution and the split attention operation into the basic residual block without increasing the channels of feature maps. Moreover, every block in our network is unfolded four times to compress the network volume without sacrificing the deraining performance. The performance on various benchmark datasets demonstrates that our method outperforms state-of-the-art deraining algorithms in both quantitative and qualitative comparisons.

Key words: unmanned aerial vehicle (UAV); deep neural network; image deraining; recursive computation; split attention

0 Introduction

Unmanned aerial vehicles (UAVs) have a broad range of applications, such as disaster relief, aerial surveillance, film making, cargo transport and military reconnaissance. Because UAVs are important outdoor imaging systems, the images they capture inevitably suffer from degradations caused by bad weather conditions. Rain, as the most common bad weather, greatly decreases the visibility of the background image (Fig.1(a)), which has negative impacts on not only human perception but also subsequent computer vision applications.

Considering the forms of source inputs, rain removal algorithms can be divided into single-image-based and video-based methods. Rain removal from videos, which are composed of sequential image frames, takes advantage of both the similarity between adjacent frames and the dynamics of rain to recover the background image. In contrast, single-image-based deraining is more challenging due to the lack of temporal information. Early work solves this ill-posed problem mainly within an optimization framework constrained by various image priors, such as the Gaussian mixture model[1] and low-rank representations[2-4]. Limited by the manually designed models, these methods are only capable of removing certain types of rain streaks.

Fig.1 Image captured by a UAV on a rainy day and a UAV affected by rain streaks

Recently, the performance of single image deraining has been boosted by deep neural networks that learn from massive data simulated using synthetic rain streaks of different directions and densities. These methods usually train an end-to-end deep network model to learn a mapping between rainy images and the corresponding ground truths, which are regarded as no-rain images. In Ref.[5], a novel large-scale dataset containing 14 000 rainy/clean image pairs with 14 orientations and scales is synthesized, and then the ResNet model is used as the parameter layers to learn a negative mapping between clean and rainy images. In Ref.[6], a lightweight neuron attention (NA) architectural mechanism is presented to improve the channel attention mechanism, which adaptively recalibrates neuron-wise feature responses by modelling interdependencies and mutual influence between neurons. Stage-level information is concatenated and fused dynamically by the NA module, which benefits the removal of different types of rain. Such efforts to adopt and refine advanced network designs for image deraining improve the learning ability and adaptability of deraining algorithms.

In this paper, we focus on the rain removal of images captured by UAV devices. UAV photography is highly dynamic and has a broad view. UAVs usually return their footage in the form of videos. High dynamicity makes the background change more intensively within the same time interval, which greatly increases the difficulty of utilizing the temporal information. This difficulty can be alleviated by applying a single-image-based rain removal algorithm frame by frame, but this requires an efficient and lightweight solution to process rainy frames across different devices. Besides, UAVs take a bird's-eye view of the landscape and thus have a broad view of a variety of objects with abundant textures and scales. This requires a scale-aware deraining scheme to accurately separate rain streaks from the background image.

To this end, we propose a recursive multiscale split attention deraining network (RMSA-derainNet) for UAVs, where multiscale features are extracted in parallel branches and fused adaptively by the split attention (SA) module in each basic block. Different from NASNet[6], which uses channel-wise attention only to fuse stage-level features, we model feature-level interdependencies and mutual influences using SA. This separate-and-fuse process repeats several times (depending on the block number) in the deraining process. With only a few more parameters than a basic residual block, our RMSA-derainNet provides a lightweight but effective solution to tackle various types of rain streaks without spoiling important image details. Moreover, since the modifications are made at the block level, recursive computation can be applied among blocks, which leads to a much smaller model size. Extensive experiments show that our RMSA-derainNet is computationally efficient and addresses the deraining problem effectively for UAVs.

In summary，our contributions are as follows:

(1) We study the difficulty of rain removal on UAV images and propose a novel RMSA-derainNet for rain removal of UAV images. Grouping strategies are adopted at different levels of the network, which allows a lightweight (grouping of blocks) but effective (grouping of feature maps and their attentions) model for extracting fine-grained features and recovering detailed background information.

(2) SA and the channel grouping strategy[7-8] have been proved to be powerful tools for high-level vision tasks such as recognition and classification. We demonstrate for the first time that these strategies, combined with multi-scale convolutions, can also boost the performance of the low-level task of rainy image recovery, thanks to the adaptive receptive field and efficient information propagation.

(3) Extensive experiments demonstrate that our method removes different types of rain streaks better and is able to recover fine background details, which benefits not only visual perception but also subsequent applications.

1 Related Work

In real applications, removing rain streaks from a single image is an extremely challenging task because of its ill-posed nature and the lack of temporal information. To solve this issue, many deep-learning-based models have been proposed recently, which aim to extract a suitable low-level feature representation of the rain streaks. These methods generally contain a main structure that plays an important role in their architectures and preserves a precise low-level feature representation during the training process.

Fu et al.[5] first proposed to adopt a three-layer CNN to learn negative rain streak residuals from the high-frequency components. In NASNet[6], a lightweight NA architectural mechanism is presented to adaptively recalibrate neuron-wise feature responses by modelling interdependencies and mutual influence between neurons. Stage-level information is concatenated and fused dynamically by the NA module, which benefits the removal of different types of rain. Zhang et al.[9] labeled rain images with density levels to deal with rain streaks of different densities. Yang et al.[10] learned the binary map of rain streak positions to focus on rain-affected areas. Wang et al.[11] regularized the distance between the deraining result and the ground-truth image in the gradient domain. Wang et al.[12] first learned the motion blur kernel of the rain streak interference to guide the deraining network. Besides, Wang et al.[13], Pan et al.[14] and Yang et al.[15] applied residual blocks to deepen the network. Specifically, Pan et al.[14] proposed to recover image structures and details separately using two parallel network branches. Some deraining methods[16] removed rain streaks progressively in multiple cascaded stages. In RESCAN[16], different stages are connected using recurrent neural networks (RNN), which makes the information from previous stages available to the next stage. Li et al.[17] employed dense blocks to fully connect different layers. The generative adversarial network (GAN)[18] is exploited to refine the deraining results for more visually appealing effects. Li et al.[19] put forth a two-stage network that first removes dense rain accumulation and then corrects the artefacts introduced by the first stage using a depth-guided GAN. Zhang et al.[20] proposed a conditional GAN-based framework that consists of a densely connected generator and a multi-scale discriminator. Some studies exploit multi-branch or multi-stage network architectures for image rain removal. A parallel dual-branch structure has also been proposed to learn different image components and ease the task of each single network branch. Deng et al.[21] took advantage of another network branch to retrieve lost details, which validated the effectiveness of separate learning for the deraining problem.

In addition, some studies are dedicated to improving time and memory efficiency in order to increase the practical value of rain removal algorithms. Fan et al.[22] built a separable network supervised at the block level, where a small number of blocks can be applied in resource-limited scenarios. Fu et al.[23] simplified the learning process using shallow networks on the Laplacian image pyramid. Moreover, they proposed to utilize recursive computation among blocks to further reduce the number of parameters with negligible performance loss. The recursive strategy is also leveraged by PreNet[24], which shares parameters not only among blocks but also among stages.

In general, existing studies either incorporate advanced network designs into the rain removal problem or combine problem-related information to guide the feature learning process. Popular network modules include the residual block[5, 24], the dense block[15-17] and novel convolutional operations such as the dilated convolution[9, 19] and the squeeze-and-excitation[15, 19]. These modules have been proved to be effective not only for high-level tasks, such as image recognition[25] and object detection[26], but also for low-level tasks, e.g., single image deraining, image inpainting[27], image colorization[28] and image super resolution[29]. In this paper, we present a new multi-scale attentive residual block as the basic module of our network.

2 RMSA-derainNet

We propose an end-to-end neural network for single image rain removal, called RMSA-derainNet. Multiscale paths are adopted to allow different levels of receptive field, which benefits the extraction of multiscale image semantics and the recovery of complex backgrounds. The overall framework is illustrated in Fig.2. Our proposed network consists of novel multiscale attentive residual blocks (MARBs), which for the first time combine multiscale paths with SA in a block structure, as shown in Fig.3. In particular, to reduce the computational complexity and memory cost of the network, we implement the basic network function by recursively unfolding a group of MARBs several times, which significantly reduces the number of parameters with negligible performance sacrifice. In the following sections, we introduce the MARB and the recursive network architecture in detail.

Fig.2 Overall framework of our proposed network

Fig.3 Multiscale attentive residual block

2.1 Multiscale attentive residual block

Due to their simple and modular structure, ResNet and its variants are widely used as the backbone network in image-to-image translation and other pixel-wise prediction tasks. In spite of their promising performance in different tasks, variants of ResNet suffer from massive parameters and require more computing resources. Inspired by the work in Ref.[7], we propose a MARB to construct our network for single image deraining, which requires only a few more parameters than the original residual block but still achieves state-of-the-art performance.

As shown in Fig.3, the MARB used in our network consists of two cardinal groups, and each cardinal group contains three split paths constructed from convolutional kernels of different scales. Multiscale kernels have different receptive fields and suit the need of filtering semantic objects at different scales. Each path consists of a 1×1 and a k×k convolutional layer. Denote the input and output features of the MARB as X and Y. Mathematically, the output Y can be described as

where φ(·) denotes the function of the cardinal group, which can be formulated as

where h(·) indicates the function of the split path and SA(·) represents the function of the SA module (SAM). Specifically, h(·) in our network can be expressed as

where k in the three split paths is set as 3, 5 and 7, respectively.
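
Because the outputs of the three split paths are fused by the SAM, they have to share the same spatial size even though their kernels differ. The following minimal sketch checks this with the standard convolution output-size formula. It assumes stride 1 and zero padding of (k-1)/2 per side, neither of which is stated in the text; the helper names are hypothetical, and the side length of 256 is the training patch size reported in Section 3.

```python
def same_padding(kernel_size: int) -> int:
    """Padding per side that preserves spatial size for an odd kernel at stride 1."""
    if kernel_size % 2 == 0:
        raise ValueError("expected an odd kernel size")
    return (kernel_size - 1) // 2


def conv_output_size(size: int, kernel_size: int, padding: int, stride: int = 1) -> int:
    """Standard convolution output-size formula along one spatial axis."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Kernel sizes of the three split paths named in the text.
PATH_KERNELS = (3, 5, 7)

# Assumption: stride 1 and "same" zero padding in every path.
output_sizes = {
    k: conv_output_size(256, k, same_padding(k)) for k in PATH_KERNELS
}
print(output_sizes)
```

Under these assumptions every path keeps the 256-pixel side length, so the path outputs can be combined element-wise.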

The detailed structure of the SAM is illustrated in Fig.4. It is proposed in Ref.[30] as an enhancement of the original squeeze-and-excitation module. In the SAM, features from different paths are added together and go through a normal "squeeze" process, which contains a global pooling layer and a fully connected (FC) layer followed by batch normalization (BN) and the rectified linear unit (ReLU) activation function. Then, the features are split again, fed into multiple FC paths and activated by the r-Softmax function, thus "exciting" each individual path using shared features. We argue that this particular structure of the SAM can well model the interdependency and mutual influence among multiple paths. The reason for adopting the SAM in the proposed multi-path structure is twofold: (1) the SAM learns the weights of each path adaptively in the training phase and therefore allows a flexible receptive field according to the source input, which adapts to the high dynamicity and broad view of UAV-captured images and gives better recovery of background objects of varying sizes; (2) the SAM can be adopted at the feature level rather than the stage or branch level, which leads to more fine-grained feature learning. Feature-level SA can be incorporated into the block structure, where the original feature channels can be split to form multiple paths and thus make full use of the information in limited channels.
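
The "excite" step described above can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: the per-path logits, which in the SAM would be produced by the pooling, FC, BN and ReLU layers, are supplied directly as hypothetical inputs, and each channel is represented by a single scalar rather than a full feature map. The names `r_softmax` and `fuse_paths` are illustrative only.

```python
import math
from typing import List


def r_softmax(logits: List[List[float]]) -> List[List[float]]:
    """Softmax across paths, applied independently for each channel.

    logits[p][c] is the (hypothetical) logit of path p for channel c.
    """
    num_paths = len(logits)
    num_channels = len(logits[0])
    weights = [[0.0] * num_channels for _ in range(num_paths)]
    for c in range(num_channels):
        column = [logits[p][c] for p in range(num_paths)]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for p in range(num_paths):
            weights[p][c] = exps[p] / total
    return weights


def fuse_paths(features: List[List[float]], weights: List[List[float]]) -> List[float]:
    """Attention-weighted sum over paths; features[p][c] stands in for a feature map."""
    num_paths = len(features)
    num_channels = len(features[0])
    return [
        sum(weights[p][c] * features[p][c] for p in range(num_paths))
        for c in range(num_channels)
    ]


# Three paths (k = 3, 5, 7) and two channels; all values are made up.
features = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
logits = [[0.0, 2.0], [0.0, 0.0], [0.0, -2.0]]

weights = r_softmax(logits)
fused = fuse_paths(features, weights)
print(weights, fused)
```

With equal logits (channel 0) the fusion reduces to a plain average of the paths, while unequal logits (channel 1) shift the result toward the path with the largest logit, which is how the attention adapts the effective receptive field per channel.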

Fig.4 Split attention module[7]

2.2 Recursive network architecture

The overall recursive network architecture is shown in Fig.2. The number of filters is set to 64, and all the filters in our network are of size 3×3 with padding 1×1. Our proposed network includes two convolution layers and 16 MARBs. The first layer can be interpreted as an encoder, which transforms the rainy image into feature maps, and the last layer recovers the RGB channels from the feature maps. Considering that a long skip connection compensates for the long-range information and avoids vanishing gradients, the feature maps of the first convolution layer are added to the last feature maps of the MARBs.

Moreover, motivated by Refs.[31-32], we divide the 16 MARBs into four groups, each consisting of four MARBs. Within each group, we recursively unfold one MARB three more times, so that the four MARBs in the same group share the same parameters, as shown in Fig.5. Since the network parameters mainly come from the MARBs, this intra-stage recursive design leads to a much smaller model size.
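
The effect of this parameter sharing can be sketched with a toy example. The `CountingBlock` class below is a hypothetical stand-in for a MARB (it only scales its input in a residual fashion), and the figure of 1000 parameters per block is an arbitrary placeholder, not a number from the paper. The sketch only illustrates that four stored blocks applied four times each give 16 block applications.

```python
from typing import List


class CountingBlock:
    """Toy stand-in for one MARB; counts how often it is applied."""

    def __init__(self, scale: float, num_params: int) -> None:
        self.scale = scale
        self.num_params = num_params
        self.calls = 0

    def __call__(self, x: float) -> float:
        self.calls += 1
        return x + self.scale * x  # residual-style update


def run_recursive(x: float, blocks: List[CountingBlock], applications: int) -> float:
    """Apply each stored block `applications` times in sequence (shared weights)."""
    for block in blocks:
        for _ in range(applications):
            x = block(x)
    return x


PARAMS_PER_BLOCK = 1000  # hypothetical placeholder value

# Four groups, one stored block per group, each applied four times.
groups = [CountingBlock(0.1, PARAMS_PER_BLOCK) for _ in range(4)]
out = run_recursive(1.0, groups, applications=4)

total_applications = sum(b.calls for b in groups)
shared_params = sum(b.num_params for b in groups)
unshared_params = total_applications * PARAMS_PER_BLOCK
print(total_applications, shared_params, unshared_params)
```

In this toy setting the block-level parameter count drops by a factor of four while the depth of computation stays at 16 block applications; the two plain convolution layers of the real network are not shared and are left out of the count.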

Fig.5 MARBs in one group

2.3 Comprehensive multi-scale loss function

Recently, hybrid loss functions, e.g., adversarial loss[20] and mean-square error (MSE) + structural similarity (SSIM)[33], have been widely adopted for low-level image tasks. Inspired by Ref.[34], we utilize a comprehensive multi-scale loss function L_c to train our network, which can be formulated as

where λ_1 and λ_2 are the weighting parameters, both fixed to 1 in our experiments, and L_1 indicates the L1 loss function, which can be formulated as

where f(·) is the function that we try to learn, I is the rainy image and I′ is the ground-truth rain-free image. Correspondingly, L_MS-SSIM denotes the multiscale SSIM loss[34], which calculates the SSIM value at different scales of the image. Mathematically, it can be expressed as

where MS_SSIM(·) denotes the operation of calculating SSIM at different scales. The combination of the L_1 loss and the multi-scale SSIM loss benefits the preservation of fine details in complex backgrounds, since the L_1 loss tends to preserve sharp edges and the multiscale SSIM loss constrains the structural similarity of the whole image at different scales.
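
Written out from the definitions in this section, the three loss terms can be sketched as follows. The weighted-sum form of L_c and the L1 norm follow directly from the text; the "one minus MS-SSIM" form of the last term is the common convention for turning a similarity into a loss and is an assumption here, as is any normalization of the L1 term over the number of pixels.

```latex
% Sketch reconstructed from the surrounding definitions, not the typeset original.
\begin{aligned}
L_c &= \lambda_1 L_1 + \lambda_2 L_{\text{MS-SSIM}},
      \qquad \lambda_1 = \lambda_2 = 1, \\
L_1 &= \lVert f(I) - I' \rVert_1, \\
% Assumption: the similarity is converted to a loss by subtracting it from 1.
L_{\text{MS-SSIM}} &= 1 - \operatorname{MS\_SSIM}\bigl(f(I),\, I'\bigr).
\end{aligned}
```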

3 Experiments

Our proposed method is implemented in the PyTorch framework. The Adam optimizer is adopted with a mini-batch size of 1 to train the network, where the size of each image patch is set to 256 pixel×256 pixel. We initialize the learning rate as 0.01, which is reduced by 50% every 30 epochs. To fully train our proposed network, we set the total number of training epochs to 240, which takes around 1.5 d to train a model on the training set of the Rain200H dataset. All the experiments are performed using an Nvidia 2080 GPU.
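
The learning rate schedule described above can be written as a simple step decay. The sketch below assumes zero-based epoch indices, so that the first halving takes effect at epoch 30; the function name is hypothetical.

```python
def learning_rate(epoch: int, base_lr: float = 0.01, step: int = 30, decay: float = 0.5) -> float:
    """Step decay: multiply the base rate by `decay` once per completed `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Learning rate for each of the 240 training epochs (zero-based index assumed).
schedule = [learning_rate(epoch) for epoch in range(240)]
print(schedule[0], schedule[30], schedule[-1])
```

Over 240 epochs the rate is halved seven times, so the final epochs run at 0.01 × 0.5^7. In PyTorch the same behaviour is typically obtained with `torch.optim.lr_scheduler.StepLR` using `step_size=30` and `gamma=0.5`, although the text does not say which scheduler was used.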

To validate the effectiveness of deraining models on UAV-captured images, we collect 500 UAV-captured no-rain outdoor images and 100 UAV-captured rainy images from Google and YouTube, and thus build a dataset containing both synthetic and real UAV rainy images. For the synthetic images, we add different types of rain streaks to the no-rain images using PS. This dataset is used to test and compare the performance of our method and existing deraining models.

3.1 Comparison with state-of-the-art methods

In this section, we compare our method with several state-of-the-art deraining methods on several datasets, including two traditional deraining methods, i.e., Gaussian mixture model (GMM)[1] and discriminative sparse coding (DSC)[4], and three learning-based deraining methods, i.e., deep detail network (DDN)[5], spatial attentive network (SPANet)[13] and dual CNN (DualCNN)[14]. The codes of the compared methods are provided by their authors and can be downloaded from GitHub. We use the default configurations provided by the authors and only change the training datasets to train DDN, DualCNN and SPA-Net.

For the synthetic datasets, we evaluate the performance of our proposed method on three datasets, including Rain200H, Rain200L and Rain800. Rain200H and Rain200L are provided by Yang et al.[10], and each contains 1 800 images for training and 200 images for testing. Compared with Rain200L, Rain200H is more challenging, since it is synthesized with five streak directions and heavier rain. Zhang et al.[20] also collected and synthesized the Rain800 dataset, which contains 700 training images and 100 testing images. Here we use the peak signal-to-noise ratio (PSNR) and SSIM as evaluation metrics for the quantitative comparison. The quantitative evaluation results of PSNR and SSIM are shown in Table 1. As can be observed in Table 1, our proposed method achieves the highest PSNR and SSIM values among the compared methods on all three synthetic datasets.
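
For reference, PSNR is defined as 10·log10(MAX² / MSE), where MSE is the mean squared error between the ground truth and the derained result and MAX is the largest possible pixel value. The sketch below computes it in pure Python on flattened pixel lists. The choice of 8-bit images (MAX = 255) and the function name are assumptions, and the text does not state on which color channels the reported values are computed.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up pixel values: each estimate is off by 10 gray levels, so MSE = 100.
example = psnr([100.0, 100.0], [110.0, 90.0])
print(round(example, 2))
```

Higher PSNR means the derained image is closer to the ground truth in the mean-squared sense, whereas SSIM compares local structure, which is why the two metrics are reported together.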

Table 1 Quantitative experiments evaluated on three recognized synthetic datasets

Visual comparisons are shown in Figs.6-7, from which one can observe that our method better retains the structure and preserves the details of the images. For example, GMM and DSC cannot remove the rain streaks completely. The other deraining methods, i.e., DDN, DualCNN and SPA-Net, remove the rain streaks but produce stripes. In contrast, our method removes the rain streaks well.

For the real-world rainy datasets, since no rain-free ground truths are available for real-world images, we perform a user study on several real-world datasets collected by Yang et al.[10] and Li et al.[19].

Furthermore, to validate the practicability of our method, we also show the deraining results on some real-world rainy images in Fig.8. One can observe that our deraining result maintains more image details than the others.

Fig.6 The first group of image deraining results tested on synthetic datasets using GMM, DSC, DDN, SPA-Net, DualCNN, and Ours, respectively

Fig.7 The second group of image deraining results tested on synthetic datasets using GMM, DSC, DDN, SPA-Net, DualCNN, and Ours, respectively

Fig.8 Image deraining results tested on real-world datasets using GMM, DSC, DDN, SPA-Net, DualCNN, UGSM[35] and Ours, respectively

Inspired by Fu et al.[36], we first randomly select 30 rainy images from the real-world rainy datasets. Second, we apply the state-of-the-art methods to remove the rain streaks from these images. Then, 10 people are asked to score each derained result along with its original rainy image from 1 to 5 (1 represents the worst quality and 5 the best quality). The results are presented in random order, without revealing the corresponding methods.

The average scores of different deraining methods are shown in Fig.9, where each point in the scatter plot indicates an image. The horizontal axis denotes the average score of the rainy images, and the vertical axis represents the average score of the derained images. From Fig.9 one can see that our method also achieves a better result.

Fig.9 Average scores of DDN, DualCNN, SPA-Net, and Ours

3.2 Ablations and discussion

In this subsection, we assess the effect of several key modules of our network, including the MARB and the recursive computation. All the ablation studies are conducted on Rain200H, which contains 1 800 rainy images for training and 200 rainy images for testing.

Here we replace the MARB with a residual block (RB) in our network and show the result in Table 2. One can see that the network with the MARB leads to higher average PSNR and SSIM values, which indicates the effectiveness of the proposed MARB.

Table 2 Results of our network by replacing MARB with RB

Our proposed network adopts recursive computation, which reduces the number of network parameters. In Table 3, we evaluate the deraining result with and without recursive computation in our network, and find a negligible loss in the deraining performance.

Table 3 Performance and the number of parameters with and without recursive computation in our network

4 Conclusions

This paper proposes a deraining method for the recovery of UAV-captured images using an SA-based recursive network. A novel MARB is constructed by enabling attention on different groups of feature maps, which are obtained after conducting multiscale convolution operations. The learned feature representations are improved to boost the performance of single image deraining, which is validated by our ablation study in Table 2. With this improvement, image structures and semantically important details are well preserved after removing the rain streaks, leading to a higher practical value.

Furthermore, the recursive computation is adopted at the module level, which allows several blocks to collaborate to remove rain streaks progressively using the same parameters. In Table 3, we demonstrate that the model size is greatly reduced with only a negligible sacrifice in deraining performance, which increases the practical value for rainy image recovery on resource-limited platforms.

In future work, since the rainy haze caused by rain drop accumulation can be easily observed from the UAV's view, efforts can be made to remove not only rain streaks but also rainy haze while keeping the deraining model compact and efficient for real-time and cross-platform use.
