Ting Xiao·Kunlong Yin·Tianlu Yao·Shuhao Liu
Abstract Landslide susceptibility mapping is vital for landslide risk management and urban planning.In this study,we used three statistical models[frequency ratio,certainty factor and index of entropy(IOE)]and a machine learning model[random forest(RF)]for landslide susceptibility mapping in Wanzhou County, China. First, a landslide inventory map was prepared using earlier geotechnical investigation reports,aerial images,and field surveys.Then,the redundant factors were excluded from the initial fourteen landslide causal factors via factor correlation analysis.To determine the most effective causal factors,landslide susceptibility evaluations were performed based on four cases with different combinations of factors(‘‘cases'').In the analysis,465(70%)landslide locations were randomly selected for model training,and 200(30%)landslide locations were selected for verification. The results showed that case 3 produced the best performance for the statistical models and that case 2 produced the best performance for the RF model.Finally,the receiver operating characteristic(ROC)curve was used to verify the accuracy of each model's results for its respective optimal case.The ROC curve analysis showed that the machine learning model performed better than the other three models,and among the three statistical models,the IOE model with weight coefficients was superior.
Keywords Landslide susceptibility mapping·Statistical model·Machine learning model·Four cases
Landslides are among the most common geological disasters in the world. They occur widely and with high frequency, and are characterized by fast movement and severe destruction (Yin and Yan 1988). According to the data released by the Ministry of Land and Resources, approximately 349 people were killed, 51 went missing and 218 were injured by geological disasters in China in 2014. In addition, the total estimated economic loss from these disasters was 0.9 billion US dollars. In that year, there were 8128 landslides, which accounted for 74.5% of the total geological hazards. To develop and use land resources more responsibly, geological disaster prevention and control need to be conducted in a planned and methodical way. A landslide susceptibility map is generally the first step in landslide hazard and risk management and is also very useful for urban infrastructure planning in mountainous areas (Guzzetti et al. 2006; Fell et al. 2008).
In recent years, landslide susceptibility mapping has been largely conducted using analytical models based on geographic information systems (GIS) (Chen et al. 2017). These GIS-based models fall into two categories: statistical (such as logistic regressions, decision trees, and certainty factors) and machine learning (such as artificial neural networks, random forests, and support vector machines). A variety of simple and useful statistical models are used in landslide susceptibility evaluations (Devkota et al. 2013). Some researchers have used logistic regression models to evaluate the landslide susceptibility of the Kakuda-Yahiko mountain area in Japan and areas in Colorado in the USA (Ayalew and Yamagishi 2005; Lee 2007). Uncertainty and probability approaches have also been used in landslide susceptibility evaluation (Feizizadeh et al. 2014; Pourghasemi et al. 2014). Currently, machine learning methods such as the random forest, neuro-fuzzy, and PSO-support vector machine methods are increasingly applied to landslide susceptibility (Trigila et al. 2013; Zhou et al. 2018). The trigger factors of landslides differ between regions, and the types of data available in different study areas are not exactly the same. Consequently, there is no uniformly optimal model for landslide susceptibility mapping. Instead, mapping must be conducted on a case-by-case (that is, region-by-region) basis: each region must be analyzed using multiple models, and the results of the different models must be evaluated and compared to determine which model is the most effective for that region. In this study, landslide susceptibility in Wanzhou County, China, was mapped using three statistical models and one machine learning model, and the differences between the results of the four models were analyzed by a detailed comparison.
In the landslide susceptibility assessment of a region, the causal factors of landslides are first selected and classified. An "evaluation model of susceptibility" is then derived from historical landslide data using a chosen algorithm. The model is applied to the whole region to obtain a "map of susceptibility", which is an important step in the process of landslide susceptibility evaluation. Four different models, namely, the frequency ratio (FR), the certainty factor (CF), the index of entropy (IOE) and the random forest (RF) model, were used in this study, and their maps of susceptibility were compared.
The study area (Wanzhou County) is located in the Municipality of Chongqing in the southwestern part of China between longitudes 107°55′22′′—108°53′25′′E and latitudes 30°24′25′′—31°14′58′′N(Fig.1).It covers an area of approximately 3457 km2.The study area is in a subtropical humid monsoon climate zone with mild climate,abundant sunshine,and sufficient rainfall.
The study area is located in the northeastern Sichuan Basin and belongs to the Yangtze River valley belt in the eastern Sichuan parallel ridge-and-valley area.The elevation gradually decreases from east to west,forming a hilly landscape with an overall step-like pattern.On both sides of the Yangtze River,the exposed strata[which mostly formed during the Mesozoic(Triassic and Jurassic)from 2.3 to 137 Ma]become older as distance from the river increases.Jurassic strata are the most widely distributed,with some Permian strata dating from 299 to 252 Ma and some Quaternary strata from 2.5 Ma.The rivers and valleys in the study area are well developed.The Yangtze River runs throughout Wanzhou County from southwest to northeast,and 93 large and small streams form a complex surface runoff network.
The first model used to evaluate landslide susceptibility in this study is the FR model, which is a relatively simple statistical model (Hong et al. 2016; Kumar and Anbalagan 2015). The landslide susceptibility index (LSI) is the sum of the FR values of the factors, calculated by Eqs. (1) and (2) (Regmi et al. 2013):

Each factor is divided into several classes. For any class, A is the total study area and S is the area of the class. A1 is the area of landslides that occurred in the entire study area, and S1 is the area of landslides that occurred in the class. FR is the frequency ratio of each factor's class.
The CF model is used to analyze the sensitivity of the factors that are associated with the occurrence of an event. It has been widely applied to landslide susceptibility (Wu et al. 2016). The model is a probability function that was first proposed by Shortliffe and improved by Heckerman (Shortliffe 1975; Heckerman 1985). CF values are calculated according to the following formula:
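The definitions above can be illustrated with a short sketch. It assumes the commonly used form FR = (S1/S)/(A1/A), that is, the landslide density within a class divided by the landslide density of the whole study area, and an LSI obtained by summing the FR values of the classes that a map cell falls into. The function names and the example class are hypothetical and are not taken from Table 1.

```python
def frequency_ratio(S1, S, A1, A):
    """FR of one class: landslide density in the class / density in the study area.

    S1: landslide area in the class, S: area of the class,
    A1: total landslide area, A: total study area (same units throughout).
    The form (S1/S)/(A1/A) is an assumption consistent with the definitions in the text.
    """
    return (S1 / S) / (A1 / A)


def landslide_susceptibility_index(fr_values):
    """LSI of a map cell: sum of the FR values of the classes it belongs to."""
    return sum(fr_values)


# Totals reported in the text: 102.64 km2 of landslides in a 3457 km2 study area.
A, A1 = 3457.0, 102.64
# Hypothetical class: 500 km2 in extent, containing 30 km2 of landslides.
fr_example = frequency_ratio(S1=30.0, S=500.0, A1=A1, A=A)
print(round(fr_example, 3))
```

An FR above 1 means that the class holds a larger share of landslides than of the study area, and an FR below 1 means the opposite.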

For any class, PPa is the conditional probability of landslide event occurrence in the class, and PPs is the prior probability of the total landslide events in the research area. It can be seen from Eq. (3) that the range of CF is [-1, 1]. A positive CF value indicates a high likelihood of a landslide, and the closer the CF is to 1, the more likely a landslide is to occur. A negative value indicates a low likelihood of a landslide, and the closer the CF is to -1, the less likely a landslide is to occur.
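As a hedged illustration of how a CF value behaves, the sketch below assumes the piecewise form that is widely used in the certainty factor literature: (PPa - PPs) / (PPa (1 - PPs)) when PPa >= PPs, and (PPa - PPs) / (PPs (1 - PPa)) otherwise. The function name and the example probabilities are hypothetical.

```python
def certainty_factor(ppa, pps):
    """CF of a class from conditional (ppa) and prior (pps) landslide probability.

    Assumed piecewise form (not confirmed against the paper's Eq. 3):
      ppa >= pps: (ppa - pps) / (ppa * (1 - pps))
      ppa <  pps: (ppa - pps) / (pps * (1 - ppa))
    """
    if not (0.0 <= ppa < 1.0 and 0.0 < pps < 1.0):
        raise ValueError("need 0 <= ppa < 1 and 0 < pps < 1")
    if ppa >= pps:
        return (ppa - pps) / (ppa * (1.0 - pps))
    return (ppa - pps) / (pps * (1.0 - ppa))


# Hypothetical prior: the share of the study area covered by landslides.
prior = 0.0297
print(certainty_factor(0.06, prior))   # class denser than average: positive CF
print(certainty_factor(0.01, prior))   # class sparser than average: negative CF
```

Under this form, a class with no landslides receives exactly -1 and a class whose landslide density equals the prior receives 0.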

Fig.1 Location map of the study area.a Location of the Three Gorges Reservoir(TGR)in China.b Location of the study area in the TGR.c The digital elevation model(DEM)showing landslide locations
The third model used for evaluating landslide susceptibility in this study is the IOE model.The entropy index indicates the extent of disorder in the environment and also expresses which factors in the environment are most relevant to the occurrence of landslides(Devkota et al.2013;Wang et al.2015).This second item is accomplished by calculating a weight for each input variable.The weight value of each variable is represented by the entropy index.The weight parameter is derived from the definition of the entropy index.The following seven formulas are used to calculate the entropy index:


where a is the domain percentage, b is the landslide percentage, i and j represent the serial number of factors and classes, respectively, Sj is the number of classes, Hj and Hjmax are the entropy values, Ij is the information coefficient, and Wj represents the resulting weight value for the parameter as a whole.
The random forest (RF) method, which was proposed by Breiman in 2001, is considered to be a relatively new and powerful approach in classification, regression, and unsupervised learning (Youssef et al. 2016; Lagomarsino et al. 2017). A random forest is an ensemble learning method that uses "bagging" to generate multiple independent training sets and to generate multiple regression trees for prediction. The main idea is that the combined results of multiple classifiers are superior to those of a single classifier. This method has been widely used in various fields (Lagomarsino et al. 2017; Hong et al. 2017). In this study, the RF model was constructed in MATLAB 2015b.
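The weight calculation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes P = b/a per class, normalises P over the classes of a factor, uses base-2 logarithms for the entropy, and takes the weight as the information coefficient multiplied by the mean P of the factor (one common variant of the IOE weight). The function name and inputs are hypothetical.

```python
import math


def ioe_weight(domain_pct, landslide_pct):
    """Entropy-based weight of one factor from its per-class percentages.

    domain_pct[k]    = a: share of the study area in class k
    landslide_pct[k] = b: share of landslides in class k
    """
    if len(domain_pct) < 2 or len(domain_pct) != len(landslide_pct):
        raise ValueError("need at least two classes and equal-length inputs")
    p = [b / a for a, b in zip(domain_pct, landslide_pct)]    # P = b / a per class
    total = sum(p)
    if total <= 0:
        raise ValueError("no landslides recorded in any class")
    density = [v / total for v in p]                           # normalised P
    h = -sum(d * math.log2(d) for d in density if d > 0)       # entropy H
    h_max = math.log2(len(p))                                  # H_max = log2(number of classes)
    info = (h_max - h) / h_max                                 # information coefficient I in [0, 1]
    return info * (total / len(p))                             # W = I * mean(P), an assumed variant


# Landslides spread evenly across classes carry no information: weight 0.
print(ioe_weight([25, 25, 25, 25], [25, 25, 25, 25]))
# Landslides concentrated in one class give a larger weight.
print(ioe_weight([50, 50], [100, 0]))
```

The design point is that a factor whose classes all have similar landslide density receives a weight near zero, however many landslides it contains.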
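The bagging idea described above can be sketched in a few lines. The study built its RF model in MATLAB; the block below is only a toy illustration in plain Python, and it deliberately simplifies the method: each "tree" is a one-split decision stump trained on a bootstrap resample with a single randomly chosen feature, and the ensemble output is the share of stumps voting for the landslide class. All names and the six sample points (slope in degrees, altitude in metres) are hypothetical.

```python
import random


def fit_stump(rows, labels, feature):
    """Best single-threshold split on one feature (a one-level tree)."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        for left_label in (0, 1):
            # Predict left_label at or below the threshold, the other class above it.
            errors = sum(
                (left_label if r[feature] <= threshold else 1 - left_label) != y
                for r, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return feature, threshold, left_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Bagging: each stump sees a bootstrap resample and one random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]      # sample with replacement
        feature = rng.randrange(n_features)             # random feature choice
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def predict_proba(forest, row):
    """Share of stumps voting for the landslide class (label 1)."""
    votes = sum(
        left_label if row[feature] <= threshold else 1 - left_label
        for feature, threshold, left_label in forest
    )
    return votes / len(forest)


# Hypothetical samples: gentle, low-lying points are landslides (1), others are not (0).
rows = [(10, 200), (12, 250), (8, 300), (30, 900), (35, 1000), (28, 1100)]
labels = [1, 1, 1, 0, 0, 0]
forest = fit_forest(rows, labels)
print(predict_proba(forest, (5, 150)), predict_proba(forest, (40, 1500)))
```

A full random forest grows deep trees and samples a feature subset at every split; the vote share returned here plays the role of the susceptibility score.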
A landslide inventory map is the key to landslide susceptibility mapping, and the initial map comes from several sources. Landslide records were provided by the geological environment monitoring station of Wanzhou County. Landslides were described via intensive field surveys and aerial photographs, and 665 distinct landslide locations were identified. In this study, the total area of landslides was 102.64 km2, accounting for about 2.97% of the total study area. The maximum and minimum sizes of the identified landslides were approximately 9.6 × 10^5 and 30 m2, respectively. There were 60 landslides with an area of less than 2500 m2, and these landslides accounted for 9.2% of the total number of landslides and 0.06% of the total landslide area. In this study, input samples included landslide points and non-landslide points. Non-landslide points were randomly selected using SPSS software, and their number was the same as that of the landslide points. The input samples were randomly split into two groups following a ratio of 70/30. The larger group was used for model training, and the smaller group was used for model verification (Fig. 2).
Each landslide is caused by a combination of factors arising from the internal geology of the slope as well as the external environment. The internal factors, which include stratum lithology, topography and geomorphology (among others), play a controlling role in the development and priming of a slope for landslides. The external factors, which include the local hydrogeological environment and human engineering activities, are generally responsible for the triggering of a landslide (Borrelli et al. 2018; Luo and Liu 2018; Nicu 2018). Based on data analysis and previous research, fourteen causal factors were selected for modeling in the study area: altitude, slope, aspect, plan curvature, profile curvature, stream power index (SPI), topographic wetness index (TWI), terrain ruggedness index (TRI), bedding structure, lithology, land use, geological structure, distance to rivers, and distance to roads/railways.
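The random 70/30 split can be sketched as below. The study selected points with SPSS; this Python version is only an illustration, and the function name, the seed, and the use of integer IDs in place of real point records are assumptions.

```python
import random


def split_70_30(points, train_fraction=0.7, seed=0):
    """Shuffle a copy of the points and cut it into training and verification sets."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))   # truncate, so 665 points give 465
    return shuffled[:n_train], shuffled[n_train:]


# 665 landslide locations, represented here by hypothetical integer IDs.
landslide_ids = list(range(665))
train, verify = split_70_30(landslide_ids)
print(len(train), len(verify))
```

Applying the same function to an equally sized list of non-landslide points keeps the two classes balanced in both groups.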

Fig.2 Flow chart of the landslide susceptibility evaluation
Since the function of these factors is not the same, there are different methods (Jenks natural break classification, equal interval method, near-zero values method, etc.) for the classification of continuous factors (Table 1) (Pourghasemi et al. 2018). In general, factors such as altitude, slope, SPI and TWI can be classified by the Jenks natural break method (Calvello and Ciurleo 2016). The classification of proximity layers such as rivers and roads/railways was based on an equal interval method. Plan and profile curvatures were divided into concave, flat and convex, and the division points are near zero. However, the classification of certain factors is not standardized in the landslide literature because most scholars base their classifications on their personal experience (Süzen and Doyuran 2004). In Table 1, "FR" and "CF" represent the contribution of each class calculated from the FR and CF models, respectively, and "Wj" is the weighting coefficient of each factor (calculated using the IOE model). It is worth noting that the difference in Wj between factors is much larger than the difference in FR or CF values between classes.
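Two of the classification methods mentioned above are simple enough to sketch directly; Jenks natural breaks needs an optimisation step and is not shown. The helper names are hypothetical, the choice to treat each break as an inclusive upper bound is an assumption, and the 0.001 curvature tolerance follows the class limits given for plan and profile curvature.

```python
import bisect


def classify(value, breaks):
    """0-based class index of a value, given ascending class upper bounds."""
    return bisect.bisect_left(breaks, value)


def equal_interval_breaks(lo, hi, n_classes):
    """Interior break points that cut [lo, hi] into n_classes equal intervals."""
    return [lo + (hi - lo) * k / n_classes for k in range(1, n_classes)]


def curvature_class(curvature, eps=0.001):
    """Near-zero method: values within +/- eps of zero count as flat."""
    if curvature < -eps:
        return "concave"
    if curvature > eps:
        return "convex"
    return "flat"


# Altitude classes from the text: 120-350, 350-500, 500-700, 700-900, 900-1100, 1100-1656 m.
altitude_breaks = [350, 500, 700, 900, 1100]
print(classify(300, altitude_breaks), classify(1200, altitude_breaks))
print(equal_interval_breaks(0, 1000, 5))
print(curvature_class(-0.5), curvature_class(0.0), curvature_class(2.3))
```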
4.2.1 Altitude
In the study area,massive landslides were induced by the Yangtze River,heavily skewing the landslide distribution toward lower altitudes.The altitude range in the study area was 120—1656 m and was divided into six classes:120—350, 350—500, 500—700, 700—900, 900—1100, and 1100—1656 m(Fig.3a).
4.2.2 Slope
Slope steepness is an important factor, since it is related both to the shear stresses acting on the hillslope and to the displacement of the landslide mass (Bui et al. 2017). The slope was derived from the digital elevation model (DEM) with a resolution of 25 m and was divided into six classes: 0°–6°, 6°–14°, 14°–21°, 21°–28°, 28°–37°, and 37°–80° (Fig. 3b).
4.2.3 Aspect
A slope's aspect largely determines its exposure to sunlight and prevailing winds. This exposure influences the slope's soil moisture and vegetation, both of which play significant roles in the slope's landslide susceptibility. Aspects were divided into nine classes: flat, north, northeast, east, southeast, south, southwest, west, and northwest (Fig. 3c).
4.2.4 Plan curvature
Plan curvature directly affects the convergence and dispersion of surface fluids (Nasiri Aghdam et al. 2016; Pourghasemi and Rossi 2016; Pradhan 2013; Pourghasemi et al. 2013). The plan curvature was within the range of -17.41 to 9.46 and was divided into three classes: concave (-17.41 to -0.001), flat (-0.001 to 0.001), and convex (0.001 to 9.46) (Fig. 3d).
4.2.5 Profile curvature
The profile curvature controls the acceleration or deceleration of the material on the slope, which affects the deposition of the material. The profile curvature was within the range of -13.33 to 13.05 and was divided into three classes: concave (-13.33 to -0.001), flat (-0.001 to 0.001), and convex (0.001 to 13.05) (Fig. 3e).
4.2.6 SPI and TWI
SPI can reflect the ability of a water system to erode surfaces,and TWI accurately describes the impact of topographic changes on soil runoff(Moore and Grayson 1991).The SPI values were divided into four categories:0—500,500—2250,2250—7000,and 7000—31,811(Fig.3f),and the TWI values were divided into five classes:5.8—9,9—12,12—15,15—20,and 20—28(Fig.3g).
4.2.7 TRI
TRI refers to the ratio of the surface area to the horizontal projected area,which reflects the degree of surface erosion and fluctuations in a certain area.In this study,the TRI values were divided into five classes:0—3,3—7,7—11,11—16 and 16—59(Fig.3h).
4.2.8 Bedding structure
Bedding structure is the spatial relationship of strata and slope,and it plays a crucial role in the development of landslides(Zhou et al.2018).The bedding structure was divided into seven classes(BS1 through BS7)based on the combination of four indices:slope,aspect,bed dip angle and bed dip direction(Table 2,Fig.3i).
4.2.9 Lithology
Lithology is an important internal factor in the formation and development of landslides. In particular, lithology plays a significant role in a landslide's scale and type, since strata with low strength parameters are prone to detachment from the slope. The twelve sedimentary rock lithological units in the study area are bounded by the Yangtze River and develop symmetrically on both sides. The twelve lithologies present in the study area are J3s, T3xj, J3p, T2b, T1j, J1z, J1z-2z, J2x, J2xs, J2s, P2 and T1d, which are reported as positive integers from 1 to 12, respectively (Fig. 3j).

Table 1 Spatial relationships between causal factors and landslides

4.2.10 Land use
In the study area,human engineering activities such as urban construction,traffic construction,and mining are frequent and play a significant role in the triggering of landslides.These engineering activities involve cutting or excavating slopes,thus breaking the original geological conditions and altering the slope's stability.Based on land use planning data in the study area,the eight classes of land use were residential areas,power stations,transportation areas,forest and grass,farmland,water bodies,hydraulic engineering land and others,which are reported as positive integers from 1 to 8,respectively(Fig.3k).

Fig. 3 Thematic maps of the study area: a altitude; b slope; c aspect; d plan curvature; e profile curvature; f SPI; g TWI; h TRI; i bedding structure; j lithology; k land use; l geological structure; m distance to rivers; n distance to roads/railways
4.2.11 Geological structure
Landslide susceptibility is highly correlated to geological structure,because the strength of the rocks in a slope is lowered by processes which affect the slope's geological structure(for example,tectonic fracturing).The geological structure of the study area is complex,with multiple folds and faults.According to the characteristics of each structure,the scope of influence was classified,and the landslides in each interval were counted.Finally,the scope of influence of four major geological structures was divided into six classes(Table 3,Fig.3l).


Table 2 Classification of bedding structure

Table 3 Classification of geological structure

Table 4 Classification of rivers
4.2.12 Distance to rivers
When rivers differ in spatial distribution, scale, and direction of flow, their respective relationships with landslide susceptibility will also differ. The rivers in the study area may be divided into three levels. The first-level river is the main stream of the Yangtze River, the second-level rivers are medium streams with small runoff values, and the third-level rivers are seasonal streams. The distance to rivers was divided into five classes (Table 4, Fig. 3m).
4.2.13 Distance to roads/railways
Distance to roads/railways is an important anthropogenic factor that influences the topography of natural slopes.There are six types of roads/railways in the study area,which,listed in order of their influence on landslides from largest to smallest, are as follows: railway, highway,national highway,provincial road,county road and rural road.The distance to roads/railways was divided into six classes(Table 5,Fig.3n).

Table 5 Classification of roads/railways
Although each of the fourteen factors used in this study is closely related to landslide susceptibility, there may be some correlation between them. The inclusion of partially redundant (that is, correlated) index factors without consideration or correction may skew the results of the analysis. Therefore, it is necessary to test for correlation to ensure that the index factors are sufficiently independent from one another. Spearman coefficients were used to test the correlation among the factors. Table 6 shows that TWI and TRI are both strongly correlated with slope. TWI, which describes the effect of topographic changes on surface water runoff, is inversely related to slope. This is expected, because when a slope is flatter, topographic changes will have a greater impact on runoff. In contrast, TRI is directly related to slope. This is also expected, because steeper slopes have a larger surface area relative to their horizontally projected area.
The spatial relationships between landslides and the conditioning factors are shown in Table 1. The FR, CF and IOE statistical models and the RF machine learning model were applied to assess the landslide susceptibility. Since TWI, TRI and slope are correlated, one or two of them may need to be removed. To achieve the best model performance, landslide susceptibility evaluations were carried out based on four cases with different factor combinations. Case 1 contained all fourteen factors (none were eliminated), while cases 2, 3 and 4 removed the TRI factor, the TRI and TWI factors, and the slope factor, respectively.
The area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate the accuracy of each model's results, and the results of these evaluations are shown in Table 7. The ROC curve mainly reflects the change in the number of landslides in each susceptibility interval from high to low. For the FR and CF models, case 1 had the lowest accuracy, followed by case 4 and case 2, and case 3 had the highest accuracy. For the IOE model, the accuracy was the same for the four cases. For the RF model, case 2 had the highest accuracy. In summary, removing both the TRI and TWI factors (as in case 3) produced the highest accuracy for the FR and CF statistical models, but had no effect on the accuracy of the IOE statistical model. Removing only the TRI factor (as in case 2) produced the highest accuracy for the RF machine learning model.
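A Spearman coefficient is the Pearson correlation of the ranks of two variables, which the following sketch computes with average ranks for ties. The slope values and TWI values are hypothetical; the TRI values use the relation that, for a planar facet, the ratio of surface area to horizontally projected area equals 1/cos(slope), which is a simplification of how TRI would be derived from a DEM.

```python
import math


def average_ranks(values):
    """1-based ranks of the values; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


slope_deg = [3, 10, 18, 25, 33, 40]                                   # hypothetical cells
area_ratio = [1 / math.cos(math.radians(s)) for s in slope_deg]       # planar-facet assumption
twi = [14.2, 11.5, 9.8, 8.1, 7.0, 6.3]                                # hypothetical, falls with slope
print(spearman(slope_deg, area_ratio), spearman(slope_deg, twi))
```

Because the coefficient depends only on rank order, any monotonic relation with slope yields a value near +1 or -1, even when the relation is far from linear.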
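The four factor combinations can be written down as a small configuration, which makes the design of the comparison explicit. The factor names follow the list of fourteen causal factors given earlier; the dictionary layout and function name are hypothetical.

```python
ALL_FACTORS = [
    "altitude", "slope", "aspect", "plan curvature", "profile curvature",
    "SPI", "TWI", "TRI", "bedding structure", "lithology", "land use",
    "geological structure", "distance to rivers", "distance to roads/railways",
]

# Factors removed in each case, as described in the text.
REMOVED = {
    1: set(),             # all fourteen factors kept
    2: {"TRI"},
    3: {"TRI", "TWI"},
    4: {"slope"},
}


def factors_for_case(case):
    """Factors used as model input in the given case."""
    return [f for f in ALL_FACTORS if f not in REMOVED[case]]


for case in sorted(REMOVED):
    print(case, len(factors_for_case(case)))
```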


Table 7 The prediction accuracy with elimination of redundant factors

Fig.4 Landslide susceptibility map produced using a FR,b CF,c IOE,and d RF models
After the initial accuracy assessment, the landslide susceptibility of the study area was mapped using the best case for each model.The landslide susceptibility index was divided into five classes by the natural break method:very high,high,moderate,low and very low.The resulting maps are shown in Fig.4.
Validation is used to verify the rationality of landslide susceptibility models. The landslide statistics of different susceptibility levels are shown in Table 8 and Fig. 5. These statistical results show that the frequency ratios are positively correlated with the susceptibility levels in all four models. This indicates that the landslide density is higher in areas with a higher susceptibility level, that is, all four models produce predictive maps that are consistent with the observed landslide distribution. The frequency ratio of the RF model in the very high level was the largest (7.304), whereas the values of the IOE, FR and CF models were much smaller (4.906, 4.729 and 3.690, respectively).

Table 8 Accuracy statistics of the FR,CF,IOE and RF models

Fig.5 Landslide frequency ratio for each susceptibility level
When using statistical methods to evaluate model performance, one should note that cutoff-dependent approaches require reclassification of the LSI, because the results of cutoff-dependent evaluations will vary depending on the breakpoints. The ROC curve is a cutoff-independent evaluation method (Hanley and McNeil 1983). The ROC curves presented in Fig. 6 reflect the change in landslide area for each landslide susceptibility interval. For each curve, the AUC value indicates the overall quality of the corresponding model, with a higher AUC value indicating a better model. The RF machine learning model achieved the best performance in the AUC assessment, with AUC values for the training and verification datasets of 0.834 and 0.801, respectively. For the FR, CF, and IOE models, the AUC values of the training datasets were 0.737, 0.746 and 0.752, respectively, and the AUC values of the verification datasets were 0.728, 0.732 and 0.738, respectively. These results indicate that the RF machine learning model is more suitable for landslide susceptibility mapping in the study area than the other models. Among the statistical models, the IOE model had the best results, followed by the CF and FR models.
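The cutoff independence of the AUC is easiest to see in its rank interpretation: it equals the probability that a randomly chosen landslide sample receives a higher score than a randomly chosen non-landslide sample, with ties counted as one half. The sketch below computes it that way for point samples; it is an illustration with hypothetical scores, and it does not reproduce the area-based curves of Fig. 6.

```python
def auc(scores, labels):
    """AUC as the share of (landslide, non-landslide) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical susceptibility scores for two landslide and two non-landslide points.
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

No threshold on the score appears anywhere in the calculation, which is why the result does not depend on how the LSI is later reclassified.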

Fig.6 The ROC curves:a training and b verification
Although the FR and CF models use different algorithms, the trends in factor assignment are the same. According to Table 1, most landslides occur in areas where the altitude is less than 350 m. This is mainly due to the large number of wading landslides on both banks of the Yangtze River, which is surrounded by relatively low-elevation terrain. Since Wanzhou County is dominated by large-scale, near-horizontal landslides, landslides with slopes in the range of 6°–14° are the most common. Most landslides occurred on flat or west-facing slopes. The inclination of the strata on the banks of the Yangtze River is approximately 220°, which means that slopes on the right bank are more likely to form forward slopes, whereas slopes are mostly west-facing on the right bank. For the bedding structure, the FR values are the largest in BS1. J2s is an interbedded layer of sandstone and mudstone in the Shaximiao Formation. Most landslides that occurred in J2s were caused by weak layers of mudstone that are prone to softening and failure, thereby triggering landslides. In Wanzhou County, the well-known Anlesi Landslide, Caojiezi Landslide, and Taibaiyan Landslide are all old landslides with a volume of more than ten million cubic meters. They all developed in near-level sandstone and mudstone interbedded strata. The dip angle is generally only approximately 3°–5°, and the landslide surface is nearly horizontal. Among the land use classes, most landslides fall in the water body class, since the landslides in the study area are mostly reservoir-induced. With increasing proximity to a geological structure or a river, the number of landslides increases; therefore, the geological structures and the rivers are important factors in the formation of landslides. The impact of roads/railways on landslides is relatively insignificant in the study area.
Different models yielded their most accurate results for different cases (that is, different combinations of factors). From Table 7 and Fig. 6, the highest AUC value among all the models and all cases was yielded by the RF model for case 2. The IOE model's performance was the most constant across all four cases (owing to the small weight coefficients of TWI, TRI and slope), suggesting that it was somewhat superior to the other two statistical models. Both the CF and the FR models had the highest accuracy in case 3, with CF slightly exceeding FR. The ultimate superiority of the machine learning model is not surprising, because it involves the construction of an analysis system based on data learning and does not rely on explicit, pre-defined construction rules (as is the case with all statistical models). The advantage of the IOE model over the other statistical models is that it adds a weight value to each factor, which gives it more completeness and stability.
Landslide susceptibility mapping is vital for landslide risk management and urban planning. In this study, landslide susceptibility was evaluated for Wanzhou County by analyzing fourteen causal factors with three statistical models (frequency ratio, certainty factor, and index of entropy) and one machine learning model (random forest). Correlation analysis of the fourteen factors showed that TWI, TRI, and slope are strongly correlated. Unlike many prior studies, we found the optimal combination of factors for each model instead of simply eliminating the most correlated factors directly. To accomplish this optimization, landslide susceptibility evaluations were performed using each model with four combinations of factors. Overall, the RF model performed significantly better than the other models, and among the other three models, the IOE model with weight coefficients was superior.
Acknowledgements This paper was prepared as part of the projects "The risk assessment of geological hazards induced by reservoir water level fluctuation in Chongqing, Three-Gorges Reservoir, China" (No. 2016065135) and "The study of mechanism and forecast criterion of the gentle-dip landslides in The Three Gorges Reservoir Region, China" (No. 41572292) funded by the National Natural Science Foundation of China.