Hui JIANG, Wenhui XU, Su YANG, Yani ZHU, Zijiang ZHOU, and Jie LIAO
National Meteorological Information Center, China Meteorological Administration, Beijing 100081
ABSTRACT We developed an integrated global land surface dataset (IGLD) at the National Meteorological Information Center of the China Meteorological Administration. The IGLD consists of hourly data for 75 variables from five data sources. It contains not only the most widely used variables (e.g., pressure, temperature, dew-point temperature, and precipitation), but also visibility, cloud cover, snow depth, and so on. A hierarchy of data sources was created to identify duplicate records. The records located higher in the hierarchy were adopted preferentially in the IGLD. A comprehensive quality control procedure, including an extreme value test, an internal consistency check, and a spatiotemporal consistency check, was applied to the IGLD. The IGLD consists of land surface observations at more than 20,000 global sites from 1901 to 2018, of which about 17,000 stations are currently active. The number of global observatories generally increased over time, except from the 1960s to the 1970s. It increased from about 2300 in 1951 to 17,000 in 2018. The observations over America, Europe, and eastern Asia always showed a high temporal integrity and dense spatial coverage, whereas measurements were sparser in South America, Africa, Russia, and the Mediterranean regions. In general, the standard and intermediate standard times for observation suggested by the World Meteorological Organization (WMO) were followed globally, except in Australia, where few data were measured on the WMO schedule. The IGLD has been used in China’s first-generation global atmospheric reanalysis product (CRA) and the global daily precipitation dataset.
Key words: surface observation, integration, quality control
Hourly surface-based meteorological observations are the most-used and most-requested type of climatological data. They are useful for studying changes in the earth’s climate and for reanalyzing individual meteorological events. For example, surface synoptic data have been used to quantify the frequency of precipitation (Dai, 2001a) and its diurnal cycle (Dai, 2001b), the diurnal variations of surface wind and wind divergence fields (Dai and Deser, 1999), recent changes in surface humidity (Dai, 2006; Willett et al., 2008), and the variations of cloudiness (Dai et al., 2006) and global surface pressure (Dai and Wang, 1999). Willett et al. (2007) derived a homogenized gridded dataset of surface humidity from hourly data to examine changes in surface specific humidity during the late twentieth century. Surface meteorological data extracted from the Integrated Surface Database (ISD), maintained at that time by the US National Climatic Data Center (NCDC), were used to study the effects of meteorology on ozone in urban areas in eastern America (Camalier et al., 2007). Zou (2010) applied the ISD in a comparative evaluation of the accuracy of models used to assess environmental exposure risk. Hourly dew-point temperature data at 10 stations in the contiguous United States were used to develop a method to detect inhomogeneities (Brown and DeGaetano, 2009). The International Surface Pressure Databank was used to develop a global pressure reanalysis dataset for the twentieth century (Compo et al., 2011).

During the last few decades, the Global Telecommunication System (GTS), operated under the auspices of the World Meteorological Organization (WMO), has allowed national meteorological and hydrological services (NMHSs) to share a wide variety of meteorological data regionally and worldwide. However, not all meteorological data transmitted from NMHSs reach all the other nodes (including operational meteorological centers).
The ISD is one of the world’s most extensive global hourly datasets and is hosted by the National Centers for Environmental Information (NCEI, formerly known as NCDC) of NOAA. It is an archive of hourly surface observations from a large number of global surface stations (Lott, 2004; Smith et al., 2011; www.ncei.noaa.gov/products/land-based-station/integrated-surface-database). In spite of over two billion surface observations from more than 20,000 stations worldwide, the ISD has relatively low station densities over Asia (via the GTS), especially in China, compared with America and Europe. The uncertainties arising from poor spatial coverage tend to increase at the local scale in the study of global surface temperature (Ilyas et al., 2017).
The National Meteorological Information Center (NMIC) of the China Meteorological Administration (CMA) is in charge of meteorological data in China. NMIC has been promoting the integrity and quality of meteorological observational data in China by digitizing the historical paper data archives and applying systematic quality control procedures. The aim of this study was to develop a comprehensive integrated global land surface dataset (IGLD) for the period 1901-2018. This dataset has been established and is now conditionally open to the public. Users may apply for access to the data through the email address or telephone numbers provided on the website http://data.cma.cn/. The IGLD has been used in China’s first-generation global atmospheric reanalysis product (Liu et al., 2017) and the global daily precipitation dataset (Yang et al., 2020).
The rest of the paper is organized as follows. Section 2 gives a brief introduction to the data sources. Section 3 describes the methods used for the integration of multiple data sources, the quality control algorithms, and the assessment of the product. Section 4 discusses the performance of the IGLD. The conclusions from this study are presented in Section 5.

Table 1. Basic provenance information for IGLD data sources
We collected data from five data sources, including four global sources and one regional source, to build a compilation dataset (namely IGLD) that uses the best features of the five individual sources. Table 1 lists the basic provenance information for the IGLD. In the table, the data archived by NMIC include the GTS data received in Beijing since 1980 (NMIC_GTS) and the hourly surface data from about 2400 national meteorological sites over China since 1951 (NMIC_China) in the CMA Net, which are updated in nearly real time. We refer to these data as the NMIC data.
We also collected the data assimilated in the Climate Forecast System Reanalysis (CFSR) from 1979 to 2014 (Saha et al., 2010) and the Global Data Assimilation System (GDAS) from 2015 to 2018, from the operational data archives of NCEP/NOAA. The meteorological variables available include station pressure, temperature, dew-point temperature, and wind speed and direction. CFSR and GDAS use the NCEP operational observation quality control procedures, performing only rudimentary limit checks of surface pressure observations compared with the background (Saha et al., 2010). We refer to the data assimilated in CFSR and GDAS as the NCEP data.
The ISD of NCEI contains data from over 100 original data sources that collectively archive hundreds of meteorological variables. The overall period of record is currently from 1901 to the present day. The number of active station locations has now reached 13,000, making the ISD one of the world’s most extensive global datasets of sub-daily observations; the updates are delayed by about two days. The most common meteorological variables in the ISD include station pressure, wind speed and direction, temperature, dew-point temperature, cloud data, sea-level pressure, altimeter setting, weather phenomena, visibility, amounts of precipitation for various time periods, and snow depth. The quality control algorithms for the ISD include a series of validity checks, extreme value checks, internal consistency checks, and external (versus another observation for the same station) continuity checks, but do not include spatial quality control (Smith et al., 2011).
We aimed to establish a global hourly meteorological dataset containing records that were as comprehensive as possible by combining the best features of the datasets described in Section 2. The synoptic surface report (FM12; WMO, 2019) and aerodrome meteorological report (FM15; WMO, 2019) in the NCEI ISD and NCEP data were used in the integration. Because FM15 reports are not currently decoded at NMIC, only FM12 was considered in the NMIC data. The spatial coverage and volume of the FM12 data are notably low in America (see Fig. S1 in the online supplemental material) because America does not generally use a synoptic format (FM12) and the reports are augmented with FM15 data (www.webaugur.com/dave/weather/ref/metar/Weather-ObDefFormat/OMF-SYNOP.htm). The FM15 data added to the IGLD significantly increase the coverage and volume of data in America (see Fig. S2 in the online supplemental material). Compared with the NCEP data, the FM15 data in the ISD have the advantage of more stations and longer records (see Fig. S3 in the online supplemental material). The ISD is therefore taken as the data source for the FM15 reports in the IGLD.
We focused on the integration of the FM12 data in the data sources. A hierarchy of the five datasets was created before integration. In China, NMIC_China was considered a unique data source with top priority, given that NMIC is in charge of all the meteorological data and quality for this country. For regions beyond China, records higher in the hierarchy were preferentially incorporated into the IGLD if there were several optional data sources for one site. The priority of all datasets excluding NMIC_China was determined by the quality control procedures, the stability of the data, the number of stations, and the application (Table 1). The priority score (PS) of each data source was defined and calculated as,

$$\mathrm{PS} = P_1 + P_2 + P_3 + P_4,$$

where $P_1$, $P_2$, $P_3$, and $P_4$ represent the stability score, the score for the number of observation stations, the quality score, and the score for application in different fields, respectively (see Table 2 for details).
The first column of Table 1 provides the final priority of each data source in the integration. The ISD was given the highest priority (PS = 8) as a result of systematic quality control checks, the largest number of observations, and extensive applications. The ISD has the highest number of stations and data volumes and the highest global density of sites, especially in Europe, Japan, and Australia (see Figs. S1 and S3 in the online supplemental material). We gave the NCEP data the second highest priority (PS = 6; $P_1 = 2$, $P_2 = 2$, $P_3 = 1$, and $P_4 = 1$) because of the higher spatial coverage and higher number of observation sites than the NMIC_GTS. Fewer observatories and simple quality control procedures resulted in the NMIC_GTS being given the lowest priority (PS = 5; $P_1 = 2$, $P_2 = 1$, $P_3 = 1$, and $P_4 = 1$).
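The component values quoted above are consistent with PS being the plain sum of the four components (2 + 2 + 1 + 1 = 6 for NCEP and 2 + 1 + 1 + 1 = 5 for NMIC_GTS). The following minimal Python sketch reproduces that arithmetic under this assumption. The class and field names are hypothetical, and the ISD is left out because its individual component scores are not quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceScores:
    """Component scores of one data source (hypothetical container)."""
    stability: int       # P1
    station_count: int   # P2
    quality: int         # P3
    application: int     # P4

    def priority_score(self) -> int:
        # Assumption: PS is the plain sum of the four components.
        return (self.stability + self.station_count
                + self.quality + self.application)


# Component values quoted in the text for the two lower-ranked sources.
scores = {
    "NCEP": SourceScores(2, 2, 1, 1),
    "NMIC_GTS": SourceScores(2, 1, 1, 1),
}

# Rank sources from highest to lowest priority score.
ranking = sorted(scores, key=lambda name: scores[name].priority_score(),
                 reverse=True)
```

Here `ranking` places NCEP ahead of NMIC_GTS, matching the order given in the text.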
To facilitate the integration process, the chosen datasets and accompanying station metadata were reformatted into a common format and duplicate records were removed. The NCEP and NMIC data use the WMO five-digit station numbers to identify the stations, whereas the ISD uses a six-digit number. We unified all the identification numbers by converting all the station numbers into the type used by the ISD. If records from different data sources shared the same station identification number and the same observation date and time, the integration was carried out according to the determined priority.
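The duplicate handling described above can be sketched in Python as follows. This is an illustration only, not the authors' implementation: the record layout, the function names `to_isd_id` and `integrate`, and the `PRIORITY` table are hypothetical. The conversion from a five-digit WMO number to a six-digit identifier by appending a zero is an assumption, since the text does not state the conversion rule. The sketch covers stations outside China, where NMIC_China does not take part in the ranking.

```python
# Lower rank number = higher priority (ISD > NCEP > NMIC_GTS outside China).
PRIORITY = {"ISD": 0, "NCEP": 1, "NMIC_GTS": 2}


def to_isd_id(station_id: str) -> str:
    """Return a six-digit identifier.

    Assumption: a five-digit WMO number becomes six digits by
    appending '0'; six-digit identifiers are passed through unchanged.
    """
    if len(station_id) == 6:
        return station_id
    return station_id.zfill(5) + "0"


def integrate(records):
    """Keep one record per (station, time), preferring higher priority."""
    best = {}
    for rec in records:
        key = (to_isd_id(rec["station"]), rec["time"])
        current = best.get(key)
        if current is None or PRIORITY[rec["source"]] < PRIORITY[current["source"]]:
            # Store a copy that carries the unified station identifier.
            best[key] = {**rec, "station": key[0]}
    return best


records = [
    {"station": "03772", "time": "2018-01-01T00:00", "source": "NMIC_GTS", "temp": 7.9},
    {"station": "037720", "time": "2018-01-01T00:00", "source": "ISD", "temp": 8.0},
    {"station": "03772", "time": "2018-01-01T03:00", "source": "NCEP", "temp": 7.1},
]
merged = integrate(records)
```

In this example the 0000 UTC record is taken from the ISD, and the 0300 UTC record, which has only one source, is kept from NCEP.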
Drawing on the quality control methods developed by the NCEP Meteorological Assimilation Data Ingest System (madis.ncep.noaa.gov/madis_sfc_qc_notes.shtml) and those applied to the UK Met Office Hadley Centre observational datasets (Dunn et al., 2012), we implemented a quality control flow for the IGLD to identify gross errors. The flow included an extreme value check, an internal consistency check, a temporal consistency check, and a spatial consistency check. The data quality check results were divided into three levels: credible, suspicious, and erroneous. The quality control results at each step jointly determined the final quality control level. Only the records passing all the tests were recognized as credible; otherwise, they were deemed erroneous (failing one or more tests) or suspicious (with one or more suspicious quality control results in the quality control flow).
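One way to combine the per-step results into a final level is to take the worst result across the steps, as in the Python sketch below. The names `QCLevel` and `final_level` are hypothetical. Letting an erroneous result take precedence over a suspicious one when both occur is an assumption, because the text does not say how such a mix is resolved.

```python
from enum import IntEnum


class QCLevel(IntEnum):
    """Quality control levels, ordered from best to worst."""
    CREDIBLE = 0
    SUSPICIOUS = 1
    ERRONEOUS = 2


def final_level(step_results):
    """Return the worst level found across all quality control steps.

    A record is credible only if every step returned CREDIBLE.
    Assumption: ERRONEOUS outranks SUSPICIOUS when both are present.
    """
    return max(step_results, default=QCLevel.CREDIBLE)


# Results of the extreme value, internal, temporal, and spatial checks.
steps = [QCLevel.CREDIBLE, QCLevel.SUSPICIOUS,
         QCLevel.CREDIBLE, QCLevel.CREDIBLE]
level = final_level(steps)
```

With the four step results shown, `level` is `QCLevel.SUSPICIOUS`.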

Table 2. Components of the priority score (PS) of a data source
We evaluated the spatial coverage and integrity of the IGLD through analysis of the P_OBS value calculated as follows:

$$\mathrm{P\_OBS} = \frac{D_{\mathrm{obs}}}{D_{\mathrm{all}}} \times 100\%,$$

where $D_{\mathrm{obs}}$ is the number of days with meteorological measurements on a 1° grid and $D_{\mathrm{all}}$ is the total number of days in the evaluation period. The value of P_OBS represents the integrity of the data in each grid box. The spatial coverage of grids with P_OBS > 0 represents the spatial coverage of the IGLD.
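As an illustration, P_OBS can be computed per 1° grid box as the percentage of days in the evaluation period that have at least one measurement in the box. The Python sketch below assumes that definition. The input layout of (latitude, longitude, date) tuples and the function names are hypothetical, and indexing each box by its lower-left corner is a choice made here for simplicity.

```python
import math
from datetime import date


def grid_cell(lat: float, lon: float):
    """Index the 1-degree box containing (lat, lon) by its lower-left corner."""
    return (math.floor(lat), math.floor(lon))


def p_obs(observations, start: date, end: date):
    """Return {grid cell: P_OBS in percent} for the period [start, end].

    observations: iterable of (lat, lon, day) tuples, one per measurement.
    """
    d_all = (end - start).days + 1  # total days in the evaluation period
    days_by_cell = {}
    for lat, lon, day in observations:
        if start <= day <= end:
            # A set counts each day once, however many reports it has.
            days_by_cell.setdefault(grid_cell(lat, lon), set()).add(day)
    return {cell: 100.0 * len(days) / d_all
            for cell, days in days_by_cell.items()}


obs = [
    (39.9, 116.4, date(2018, 1, 1)),
    (39.2, 116.9, date(2018, 1, 1)),  # same box and day as the line above
    (39.9, 116.4, date(2018, 1, 2)),
    (51.5, -0.1, date(2018, 1, 1)),
]
result = p_obs(obs, date(2018, 1, 1), date(2018, 1, 4))
```

Boxes without any measurement do not appear in `result`, which corresponds to P_OBS = 0 and therefore to no spatial coverage.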
This section discusses the performance of IGLD temperature, dew-point temperature, pressure, wind direction, wind speed, 6-h cumulative precipitation (prep_6h), 12-h cumulative precipitation (prep_12h), and 24-h cumulative precipitation (prep_24h), which are the most widely used of the 75 variables in the IGLD (see Table S1 in the online supplemental material), based on analysis of the spatiotemporal distributions of the data volumes and the integrity of the data.
The IGLD contains data from more than 20,000 stations worldwide from 1901 to 2018, and over 17,000 active stations are updated in the dataset.Figure 1a shows the changes in the number of stations with measurements of more than one variable each year from 1901 to 2018.It is clear that the number of sites has increased over time in the last 118 years, except for the 1960s-1970s, a result of the transition from the keying-in of data to the digital transmission/receipt of data (Smith et al., 2011).
Figure 1b shows the monthly volumes of data for different variables in the IGLD during 1901-2018. The data volumes for pressure, temperature, dew-point temperature, wind direction, and wind speed show the same variation trend and are higher than the data volumes for prep_6h, prep_12h, and prep_24h. The IGLD had few stations and low data volumes before 1930. Figure 1b shows the gaps in the early 1970s before the GTS came into existence.
There were two significant leaps in the volumes of data for pressure, temperature, dew-point temperature, wind direction, and wind speed: (1) around 1997, when automated Meteorological Aviation Reports started (Saha et al., 2010), and (2) in the early 2000s, when the observation pattern in China changed from manual observations four times daily to automatic observations every hour.
There were fluctuations in the prep_6h data volumes from 1998 to 2001 when the active stations in America were unstable (data not shown). A notable increase in the volume of prep_24h data occurred during 2000-2006, when the frequency of observations of prep_24h changed from once a day to 8 times a day in China and from no station to a high spatial coverage in Europe (data not shown).
A significant gap in the prep_24h data occurred during 2006-2014 because of a deterioration of the data integrity in China. There was an increase in the volume of prep_6h, prep_12h, and prep_24h data around 2016 when the cumulative precipitation for all 24-h periods was obtained by summing up the 1-h cumulative precipitation for China.
Figure 2 shows global distributions of P_OBS above 10% for pressure, temperature, dew-point temperature, wind direction, and wind speed. Panels I, II, and III represent the results for 1931-1960, 1961-1990, and 1991-2018, respectively. The spatial coverage and integrity of the surface pressure are different from those of temperature, dew-point temperature, wind direction, and wind speed. Before 1960, the latter four variables had a high spatial coverage over America, Europe, India, and eastern Asia and a low spatial coverage over Africa, South America, and the Mediterranean regions (Fig. 2). However, observations of surface pressure before 1960 were absent outside China, and on average, P_OBS for pressure reached only approximately 30% in most regions of China.
During 1961-1990, measurements were available over all the global land surface, despite the sparse coverage in South America, Africa, and the Mediterranean regions. Compared with the period 1931-1960, the coverage in the low-density regions had significantly improved and the integrity reached about 60%. In the regions of America, Europe, and China with a high spatial coverage, the integrity was high and had increased by about 60% relative to that in 1931-1960.

Fig. 1. Temporal (1901-2018) evolution of (a) annual number of stations with measurements of more than one variable in the IGLD in regions outside China (orange shading) and in China (green shading), and (b) monthly volume of data for different variables in the IGLD. The measurements of pressure and prep_6h in the IGLD started from the early 1950s.
Compared with Panels I and II, Panel III shows significantly higher spatial coverage, especially over America, Europe, and eastern Asia, where the measurements covered almost the whole land surface and P_OBS for each grid reached about 90% in 1991-2018. The integrity of most grids in South America, Australia, Africa, the Mediterranean regions, and eastern Russia reached 80%-90%, although the measurements did not completely cover these regions. The integrity improved significantly by 20%-40% globally, especially in Europe, South America, Africa, and Australia, relative to that in the time period 1961-1990.

Fig. 2. Spatial distributions of P_OBS (%) above 10% for pressure (top row), temperature (second row), dew-point temperature (third row), wind direction (fourth row), and wind speed (bottom row) for each 1° grid over the whole globe. (I) Left column: 1931-1960; (II) middle column: 1961-1990; and (III) right column: 1991-2018. The ISD is the unique data source for the IGLD before 1951, when there were measurements of temperature and wind, but not of pressure.
The spatial distributions of prep_6h, prep_12h, and prep_24h were not consistent with those of pressure, temperature, dew-point temperature, wind direction, and wind speed. Figure 3 shows that there was no site with continuous prep_6h measurements (P_OBS ≥ 10%) anywhere in the world during 1931-1960. In that period, the prep_12h and prep_24h records were mainly located in China, with a low integrity of 10%-20%, because the NMIC data enter the IGLD only from 1951.
From 1961 to 1990, the measurements of prep_6h covered almost the whole globe and there was a high spatial coverage in America, Europe, and eastern Asia, where P_OBS reached 20%-50%. The spatial coverage of prep_12h with P_OBS above 10% expanded only in some parts of Europe and Russia relative to that in the time period 1931-1960. The integrity was high in China, where the measurements covered almost every grid and P_OBS reached about 90% after 1961. The spatial distribution of continuous measurements of prep_24h (P_OBS ≥ 10%) shows that the measurements covered almost all land areas of the globe, apart from Europe and the Mediterranean regions (P_OBS < 10%). Meanwhile, the integrity was low globally (see the small P_OBS values denoted by dark blue dots), apart from China, where P_OBS reached about 90%. The coverage and integrity of prep_6h, prep_12h, and prep_24h in 1961-1990 were better than those in 1931-1960.

Fig. 3. Spatial distributions of P_OBS (%) above 10% for prep_6h (top row), prep_12h (middle row), and prep_24h (bottom row) for each 1° grid over the whole globe. (I) Left column: 1931-1960; (II) middle column: 1961-1990; and (III) right column: 1991-2018. Note that the ISD is the unique data source for the IGLD before 1951, when there were only a few measurements of prep_6h, prep_12h, and prep_24h, and there was no site with continuous prep_6h measurements (P_OBS ≥ 10%). In addition, there were nearly no prep_12h data in North America due to unstable and discrete measurements in this region (P_OBS < 10%) during all three time periods.
Panel III in Fig. 3 shows the high spatial coverage and integrity of prep_6h in America, Europe, and eastern Asia, where the measurements covered almost the whole land surface in the time period 1991-2018. P_OBS reached a high integrity (90%) only in Europe, Australia, and parts of the Americas. Compared with 1961-1990, the integrity improved significantly across the globe during 1991-2018. The P_OBS of prep_6h increased by about 40% in Europe; 30% in eastern America, South America, and southern Africa; 50% in parts of India and Australia; and 60% in China. The spatial coverage of prep_12h was high in Asia, Europe, Russia, and Australia, with noteworthy gaps in America and Africa, and the only regions with a high integrity were China and Europe. The observations of prep_24h covered almost the whole globe; their distribution is similar to that of prep_6h but with fewer sites of high integrity (red dots) in Africa, Europe, South America, and Australia. Compared with 1961-1990, the integrity improved significantly during 1991-2018 in Europe, Australia, the Mediterranean regions, and Russia, where P_OBS increased by about 60%.
The spatial coverages of temperature, dew-point temperature, wind direction, wind speed, prep_6h, and prep_24h have been similar since the 1990s, when the measurements in the IGLD covered almost all of America, Europe, and eastern Asia. However, there were notable gaps in South America, Africa, Russia, and the Mediterranean regions, where the observation sites need upgrading. The integrities of pressure, temperature, dew-point temperature, wind direction, and wind speed reached about 90% in most regions, significantly higher than those of prep_6h, prep_12h, and prep_24h.
The WMO advises that observations/measurements be made across the globe at standard times or at intermediate standard times (WMO, 2011; Table 3). Figure 4 shows hourly spatial distributions of P_OBS above 10% for 24-h observations of temperature on a 1° grid from 2008 to 2018. The spatial coverage was high in America, Europe, and eastern Asia, where the integrity reached about 90%-100% during the entire 24-h period. The spatial coverage in eastern South America was high throughout the 24-h period, but the integrity was low because of the large number of new observation sites there since 2016 (figure omitted). High-integrity observations in parts of South America occurred at both standard times and intermediate standard times. The observations were generally concentrated at standard and intermediate standard times in India, where high coverage occurred at 0300 and 1200 UTC. High coverage occurred in South and Northwest Africa at standard and intermediate standard times. The observations in southeastern Australia were concentrated at non-standard times and those in northeastern Australia occurred once every 3 h from 0500 UTC.
We found that the observations in most regions, apart from Australia, followed the observational times advised by the WMO. Table 3 shows that the observational times recommended for Region V, which includes Australia, are the standard or intermediate standard times. In practice, however, the observations in Australia were concentrated at non-standard times, for example, at 0200, 0800, 1400, and 2000 UTC. In addition, the observations in the Americas, Europe, and eastern Asia were made not only at the suggested observational times, but also at other times. Most observations in India were concentrated at the suggested observational times (0300 and 1200 UTC), with a few observations carried out at all times during the 24-h period.
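A simple way to quantify how closely a station follows the recommended schedule is to classify each observation hour and compute the share that falls on the schedule. The Python sketch below assumes the usual main synoptic hours (0000, 0600, 1200, and 1800 UTC) and intermediate hours (0300, 0900, 1500, and 2100 UTC). Table 3 may list region-specific times, so these sets and the function names are illustrative assumptions rather than the values used in this study.

```python
# Assumed schedule: main and intermediate synoptic hours in UTC.
MAIN_HOURS = frozenset({0, 6, 12, 18})
INTERMEDIATE_HOURS = frozenset({3, 9, 15, 21})


def classify_hour(hour_utc: int) -> str:
    """Label an observation hour relative to the assumed schedule."""
    if hour_utc in MAIN_HOURS:
        return "standard"
    if hour_utc in INTERMEDIATE_HOURS:
        return "intermediate"
    return "non-standard"


def share_on_schedule(hours) -> float:
    """Fraction of observation hours at standard or intermediate times."""
    hours = list(hours)
    if not hours:
        return 0.0
    on_schedule = sum(1 for h in hours if classify_hour(h) != "non-standard")
    return on_schedule / len(hours)


# Hours quoted in the text as typical for Australia.
australia_share = share_on_schedule([2, 8, 14, 20])
```

For the four Australian hours quoted above, none falls on the assumed schedule, so `australia_share` is 0.0.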

Table 3. Observational times recommended by the WMO in seven regions
We established an hourly integrated global land surface dataset (IGLD), which contains 75 surface meteorological variables, by integrating five datasets. To make the best use of these data sources, the data format and identification codes of the observational stations were unified, and a hierarchy of the five datasets was set up based on their quality control procedures, data stability, number of observatories, and application (e.g., as input data for reanalysis products). The records located higher in the hierarchy were adopted in the IGLD if there were several sources of data. A comprehensive quality control procedure was applied to the IGLD, including tests of extreme values, internal consistency, temporal consistency, and spatial consistency.
The IGLD includes over 20,000 stations worldwide from 1901 to 2018 and more than 17,000 active stations that are regularly updated. After the 1990s, the measurements in the IGLD covered almost all of America, Europe, and eastern Asia. It should be noted that South America, Africa, Russia, and the Mediterranean regions have always had a low spatial coverage. The IGLD contains not only the most widely used variables (e.g., pressure, temperature, dew-point temperature, wind direction, wind speed, prep_6h, prep_12h, and prep_24h), but also visibility, cloud cover, and snow depth. The volumes of monthly data have been increasing over time, especially after the 1960s.
Based on the IGLD, we found that most countries carry out their observations at the standard and intermediate standard times suggested by the WMO. In particular, America, Europe, and eastern Asia exhibit a high data integrity at all measurement times. Australia seems to prefer a local measurement schedule.

Fig. 4. Spatial distributions of hourly P_OBS (%) above 10% for temperature in each 1° grid over the whole globe for a 24-h period (from 0000 to 2300 UTC) during 2008-2018.
The IGLD has been used in China’s first-generation global atmospheric reanalysis product (Liu et al., 2017) and land reanalysis product (Liang et al., 2020). It is also the data source for the global daily precipitation dataset (Yang et al., 2020). These applications inspire a certain level of confidence in the accuracy and stability of the IGLD. It is updated in real time based on global surface data from the NMIC_GTS and NMIC_China on the CMA Data-as-a-Service platform.
Acknowledgments. The authors would like to thank Zhisen Zhang for his assistance in programming and the relevant agencies for providing the source data for the IGLD of NMIC.
Journal of Meteorological Research, 2021, Issue 5