999精品在线视频,手机成人午夜在线视频,久久不卡国产精品无码,中日无码在线观看,成人av手机在线观看,日韩精品亚洲一区中文字幕,亚洲av无码人妻,四虎国产在线观看 ?

Rank regression: an alternative regression approach for data with outliers

2014-12-08 07:38:34TianCHENWanTANGYingLUXinTU
上海精神醫(yī)學 2014年5期
關鍵詞:方法模型

Tian CHEN, Wan TANG, Ying LU, Xin TU*

?Biostatistics in psychiatry (23)?

Rank regression: an alternative regression approach for data with outliers

Tian CHEN1, Wan TANG1, Ying LU2, Xin TU1*

normal distribution, non-normal distribution, linear regression, semi-parametric regression models, rank regression, sexual health

1. Introduction

Regression is widely used in mental health research and related services research to model relationships involving health and service utilization outcomes and clinical and socio-demographic factors. Regression models measure changes in the dependent variable in response to changes in a set of independent variables of interest. Linear regression focuses on continuous dependent variables, while other regression models such as logistic and log-linear regression consider noncontinuous dependent variables such as binary and count outcomes. The dependent variable is often called the response, while the independent variables are frequently referred to as the explanatory variables,predictors, or covariates.

Linear regression is arguably the most popular regression model in practice, because of the ubiquity of continuous outcomes and because it is relatively easy to understand the modeled relationship and interpret the model estimates. Fitting such models is convenient because all major software packages (R,SAS, SPSS and STATA) provide both the model estimates and the diagnostics of the model fit. However, the wide popularity and routine use of the linear regression also creates some problems. Many researchers apply the model without first checking assumptions about the normal distribution of the data underlying the validity of model estimates. The classic normal-based linear regression imposes strong constraints on data, and its estimates are also quite sensitive to departures from assumed mathematical models. Without careful checking of the model assumptions, estimates generated by linear regression models may be difficult to interpret and conclusions drawn from such estimates may be misleading.

2. Different approaches to deal with non-normal study data in regression analyses

Classic linear regression assumes a normally distributed response,yi, and models the mean of this response variable as a function of a set of independent variables,xi= (xi1 , xi2 ...., xip)Tas follows∶

whereβ= (β1, β2, ..., βp)Tis the vector of parameters,nis the sample size,εidenotes the error term,N(μ,σ2)denotes a normal distribution with meanμand varianceσ2, andεi~N(0,σ2) means thatεifollows a normal distribution with mean 0 and varianceσ2. The wellshaped bell curve of the normal distribution is often at odds with the distribution of data arising in real studies,because of its symmetric shape and extremely thin tails(exponential decay). Over the years, various methods have been developed to improve the limitations of the classic linear model. All the different methods can be grouped into 3 major categories.

One approach is to use mathematical distributions that more closely resemble the data distribution in the study.[1]For example, by positing a t-distribution for the errorεi, the resulting linear model can accommodate data distributions with thicker tails. This is possible because the t-distribution has an additional degree of freedom parameter to control the thickness of the tail.However, like the normal distribution, the t-distribution is also symmetric. To model skewed data distributions,a popular approach is to use the chi-square distribution.Although this parametric alternative broadens the scope of data distributions that can be accommodated, it is still quite limited because mathematical distributions always have more regular shapes than those arising in practice.

A second popular alternative is to use semiparametric or distribution-free models.[2]Under this approach, no mathematical model is assumed for the data distribution (the non-parametric part) and the relationship betweenyiandxiis represented by the mean ofyiafter adjustment forxi(parametric component). The latter parametric component is implied by the specification of the classic linear regression in (1)and is given by∶

whereE(yi|xi) denotes mathematical expectation. For those unfamiliar with mathematical expectation, the above expression simply means that the population-level average of the responseyiis a linear function ofxi. This linear relationship is also implicit in the normal-based linear regression in (1). Thus, the semi-parametric linear model in (2) only requires a linear relationship between the response and the set of explanatory variables,thereby offering valid inference for a wide class of data distributions.

Although significantly improving the utility of linear regression, the semi-parametric model still has limited applications. A major problem is that like the classic model it continues to model the mean of the response.Like the sample mean of a variable, model estimates from this approach can be quite biased when there are extremely large or small values, or outliers, in the response.

Various approaches have been developed to address this important issue of outliers. A common approach in psychosocial research is to trim outliers using ad-hoc rules. For example, limiting the values of all observations to 3 times the interquartile range when estimating the mean of an outcome (i.e., a ‘trimmed’mean).[3]However, these ad-hoc methods induce artifacts because of their dependence on the specific rules used, and the use of different rules can result in different outcomes.

Another approach to limiting the influence of outliers is to employ rank tests. The Mann-Whitney-Wilcoxon rank sum test is widely used to compare two groups in such situations. Within the setting of regression analysis, rank regression is a popular approach for dealing with outliers.[4,5]Like the Mann-Whitney-Wilcoxon rank sum test, rank regression does not use the observed responsesyidirectly, but,rather, uses information about the ranking of these observations, thereby yielding estimates that are much less sensitive to outliers.

3. Simulation studies to compare different approaches

The data were simulated from a study with one binary variable and one continuous covariate. To show differences across the different methods, we selected a large sample size (n=500) to reduce the effect of sampling variability on model estimates. We performed simulation of data and fitted the different models to the data generated using the R software. All simulations were performed with a Monte Carlo sample size M=1000 and a type I errorα=0.05.

We simulatedyifrom the following linear model∶

We then simulated 50 (or 10% of the sample size) values from a uniformU(500, 1000000), ordered them as∶

and added the valuesu(1)from the uniform to the 50 largest values ofyi, i.e.,

to form a set of outlying observations, i.e.,

To assess the robustness of the different methods,we replacedy(451)<y(452)< ...< y(500)in the original sample with the valuesz(451)<z(452)< ...< z(500), and fit the models to the resulting observations∶

Table 1 shows the estimates ofβ1andβ2, the corresponding standard errors, and type I error rates from fitting the three methods to data simulated from the normal-distributed errorN(0, 1/2) based on 1000 Monte Carlo simulations both with and without included outliers. (The interceptβ0is estimated by the rank regression and so this estimate is missing in the table.) In the table, values in the column titled‘mean’ are the averaged estimates of each parameter over 1000 Monte Caro replications; the ‘a(chǎn)symptotic standard error’ is the model-based standard error; the‘empirical standard error’ is the standard errors of the 1000 estimates of each parameter; and the ‘type I error’is the percent of times the null hypothesis - that the estimated parameter is equal to the true parameter -is rejected. For example, the empirical type I error rates forβ1in the data set without outliers is the percent oftimes of rejecting the nullH0∶β1=1.

If a model performs well, (a) the averaged value of estimates of each parameter (in the ‘mean’ column)should be close to the true value of the respective parameter; (b) the magnitude of the asymptotic standard error should be close to that of the empirical standard error; and (c) the empirical type I error rate should be close to the nominal value 0.05. As shown in Table 1, in the absence of outliers, all three methods performed well, with the averaged estimates all nearly identical to the true value 1, the asymptotic standard errors all close to their empirical counterparts, and the type I error rate all close to the nominal levelα=0.05.Further, all three methods yielded near identical standard errors, indicating that there is practically no loss of power by using the two robust alternatives instead of the classic linear model for the simulated normal data.

However, results are very different in the presence of outliers. As shown in the Table 1, both the classic and semi-parametric models yielded extremely large estimates that are un-interpretable, impossibly large standard errors, and type I errors close to 1. In contrast,the rank regression model for bothβ1andβ2generated estimates close to the true value 1, reasonable asymptotic and empirical standard errors that were equal to each other, and type I errors that, though elevated, were close to the nominal 0.05 level.

Table 2 shows the results of a similar simulation when the data were simulated from t-distributed error, ,instead of from normal-distributed error. In the absence of outliers the mean estimate and type 1 error of the two parameters were acceptable for all three models;however, the empirical standard error was much larger than the asymptotic standard error for the classical and semi-parametric models while these two types of standard error were similar in magnitude in the rank regression model. In the presence of outliers, as was the case in the normal-error simulation, the estimates generated by the classic and semi-parametric models were un-interpretable while those generated by the rank regression model were acceptable. Thus, for data with t-distribution error the rank regression model preforms better than the classic linear and the semiparametric models both in the absence and in the presence of outliers.

4. A real-life example

To illustrate the three approaches to dealing with outliers, we use results from a recent randomizedcontrolled study[6]to evaluate the efficacy of a sexual risk-reduction intervention program targeting teenage girls in low-income urban settings who are at elevated risk for HIV, sexually transmitted infections, and unintended pregnancies. The study recruited sexuallyactive urban adolescent girls aged 15 to 19 and randomized them to a sexual risk reduction intervention or to a structurally-equivalent health promotion control group. Assessments and behavioral data were collected at baseline, 3, 6 and 12 months post-baseline.The primary interest of the study was to compare the frequency of unprotected vaginal sex between the two treatment conditions. A difficult problem with the study data was the extremely large values reported by some subjects for their sexual activities. For example, five subjects reported over 100 episodes of unprotected vaginal sex over the past 3 months at the 6 month follow-up. If linear regression is applied directly to this outcome, estimates will be severely biased and become un-interpretable. Alternative models need to be considered when analyzing the data.

Table 1. Estimates (mean), asymptotic and empirical standard errors, and empirical type I error rates from fitting the classic linear, semi-parametric, and rank regression models to data simulated from normal-distributed errors

The linear regression for the different methods is specified as follows∶

whereyiis the number of episodes of unprotected vaginal sex,xi1is the binary indicator for the treatment condition (1 for the intervention and 0 for the control group), andεiis the model error. The model errorεifollows the normal distribution for the classic linear regression, while the distribution is unspecified for the semi-parametric and rank regression methods.

To highlight the differences in the models we removed zero observations (i.e., individuals who reported no episodes of unprotected sex in the prior three months) and fit all three models (classic linear,semi-parametric, and rank regression) to the remaining data. In addition, we also recomputed the estimates for the classic linear model and the semi-parametric model after trimming the observed responses to decrease the influence of outliers. We trimmed the observed responses of number of episodes of unprotected vaginal sex in the prior three months at 3 times the interquartile range; the 25%, 50% and 75% quartiles were 2, 4, and 10 episodes, respectively, so the interquartile range was 8 (10 - 2) and any observations below -20 (4 - 3*8)or above +28 (4 + 3*8) were considered outliers. There were no observations below -20 so no lower-level trimming was necessary, but all observations above 28 were trimmed to 28.

Table 3 shows the resulting estimates ofβ1for the treatment condition in the linear model (3) and the corresponding asymptotic standard errors and p-values using the different models. As was the case in the simulation study with outliers, the huge values for the estimates and standard errors using the classic linear and semi-parametric models clearly show that the estimates are profoundly affected by the outliers and,thus, are un-interpretable. In comparison, the classic and semi-parametric methods yielded more reasonable estimates when applied to the trimmed observations.However, results using the trimmed data were still quite different from those generated from the rank regression model; the estimates from the two models that used trimmed data were more than 50% higher than that using the rank regression method and the standard errors were more than double that from the rank regression analysis. Results from the simulation study suggest that rank regression is quite robust against outliers and, unlike models that use trimmed data,are not vulnerable to change when different trimming criteria are employed.

Table 2. Estimates (mean), asymptotic and empirical standard errors, and empirical type I error rates from fitting the classic linear, semi-parametric, and rank regression models to data simulated from t-distributed errors

Table 3. Estimates, standard errors, and p-values from fitting the classic linear, semi-parametric,rank regression, classic linear with trimmed outliers, and semi-parametric with trimmed outliers models to the risk-reduction intervention study

5. Sotfware for alternative linear regression models

Most major software such as R and SAS has the capability of fitting the semi-parametric linear regression model. In R, there are several packages available for fitting the generalized estimating equations (GEE).Although GEE is an extension of the semi-parametric method for longitudinal data, we may still use these packages for fitting the semi-parametric model to crosssectional data by introducing an ‘ID’ variable that has unique values for each of the observations. For example,if the GEE package is installed, then one may apply the following codes to fit the semi-parametric linear regression model∶

where y is the outcome and x is the covariate matrix.

Similarly, SAS also offers ‘Procedures’ for fitting the GEE which can be utilized to provide estimates for semiparametric linear regression models. For example, by adding an ID variable to the SAS data set, we may apply the Procedure GENMOD to fit the semi-parametric model∶

At the time of writing, SAS does not have the capability to fit the rank regression. For our simulated and real study examples, packages in R were used to fit this robust alternative model. To perform this regression model, first download the R functions from the website∶http∶//www.stat.wmich.edu/mckean/HMC/Rcode/AppendixB/ww.r. Then, we use the following command in R to obtain estimates from fitting the rank regression∶

where y is the outcome and x is the covariate matrix.

Note that while SAS is a commercial software package, R is free to download, install, and run. In addition, software for newer statistical methods are generally first available in R. However, unlike SAS, R has no designated technical support so users generally rely on peer-support, web postings, and books for resolving issues concerning applications of specific packages and general data management problems.

6. Discussion

Classic linear regression has a number of weaknesses,limiting its applications to real study data. We discussed two robust alternatives, the semi-parametric model and the rank regression model. Although the former yields more valid estimates than the classic linear model, it breaks down when there are extremely large (or small)observations in the response (i.e., the dependent variable). In the presence of such outliers, the rank regression model provides much more robust estimates.Unlike ad-hoc methods such as trimming outliers based on 3 x interquartile range, rank regression generates the same estimates regardless of the actual values of the response as long as the rankings of the observations remain the same. This formal approach not only removes any subjective element in the estimates, but it also makes it easier to compare results of different analyses based on the same study data and to compare results between different studies. Further, the rank regression model is also capable of addressing outliers in the independent variables, although this tutorial only discussed outliers in the response variable.

Currently, rank regression is only available in some selected software packages such as R - we included sample R codes for fitting this robust regression model in this report to facilitate its use by readers. As this approach becomes more popular, it is likely that other major software giants such as SAS will have similar offerings.

Unlike the classic and semi-parametric linear regression models, rank regression is only available for fitting cross-sectional data. This is, in part, due to the complexity of computing estimates and asymptotic standard errors. However, as longitudinal studies become the norm rather than the exception in modern clinical research, it will become increasingly important to develop software that can extend this robust model to longitudinal research data and, thus, help investigators more effectively deal with imperfections in real study data.

Conflict of interest

The authors report no conflict of interest related to this manuscript.

Funding

The preparation of this manuscript was supported in part by DA027521 and GM108337 from the National Institutes of Health.

1. Kowalski J, Tu XM, Day RS, Mendoza-Blanco JR. On the rate of convergence of the ECME algorithm for multiple regression models with t-distributed errors.Biometrika. 1997; 84∶269-281. doi∶ http∶//dx.doi.org/10.1093/biomet/84.2.269

2. Tang W, He H,Tu XM.Applied Categorical and Count Data Analysis. Boca Raton, Florida, USA∶ Chapman & Hall/CRC Press. 2012

3. Schroder EB, Liao DP, Chambless LE, Prineas RJ, Evans GW,Heiss G. Hypertension, blood pressure, and heart rate variability∶ the Atherosclerosis Risk in Communities (ARIC)study.Hypertension.2003; 42(6)∶ 1106-1111. doi∶ http∶//dx.doi.org/10.1161/01.HYP.0000100444.71069.73

4. Jaeckel LA. Estimating regression coefficients by minimizing the dispersion of the residuals.Ann Math Statist. 1972;43(5)∶ 1449-1458

5. Jureckova J. Nonparametric estimate of regression coefficients.Ann Math Statist.1971; 42(4)∶ 1328-1338

6. Morrison-Beedy D, Jones S, Xia Y, Tu XM, Crean H, Carey M. Reducing sexual risk behavior in adolescent girls∶results from a randomized controlled trial.J Adolesc Health.2013; 52∶ 314-321. doi∶ http∶//dx.doi.org/10.1016/j.jadohealth.2012.07.005

∶ 2014-10-08; accepted∶ 2014-10-10)

Ms. Tian Chen is a fifth-year PhD student in the Department of Biostatistics and Computational Biology, School of Medicine and Dentistry, University of Rochester. Her PhD thesis focuses on semiparametric and rank-based statistical models, and variable selection methods for regression models for both cross-sectional and longitudinal data. She has applied these statistical methods in the analysis of mental health and related research.

等級回歸:離群數(shù)據(jù)的另一種回歸方法

Tian CHEN, Wan TANG, Ying LU, Xin TU

正態(tài)分布,非正態(tài)分布,線性回歸,半?yún)?shù)回歸模型,等級回歸,性健康

Summary:Linear regression models are widely used in mental health and related health services research.However, the classic linear regression analysis assumes that the data are normally distributed, an assumption that is not met by the data obtained in many studies. One method of dealing with this problem is to use semi-parametric models, which do not require that the data be normally distributed. But semi-parametric models are quite sensitive to outlying observations, so the generated estimates are unreliable when study data includes outliers. In this situation, some researchers trim the extreme values prior to conducting the analysis, but the ad-hoc rules used for data trimming are based on subjective criteria so different methods of adjustment can yield different results. Rank regression provides a more objective approach to dealing with non-normal data that includes outliers. This paper uses simulated and real data to illustrate this useful regression approach for dealing with outliers and compares it to the results generated using classical regression models and semi-parametric regression models.

[Shanghai Arch Psychiatry. 2014; 26(5)∶ 310-316. doi∶ http∶//dx.doi.org/10.11919/j.issn.1002-0829.214148]

1Department of Biostatistics and Computational Biology, University of Rochester, NY, USA

2Department of Biostatistics, Stanford University, Stanford, CA, USA

*correspondence∶ xin_tu@urmc.rochester.edu

A full-text Chinese translation of this article will be available at www.shanghaiarchivesofpsychiatry.org on November 25, 2014.

概述: 線性回歸模型被廣泛應用于精神衛(wèi)生和衛(wèi)生服務相關研究。然而,經(jīng)典線性回歸分析是假設該數(shù)據(jù)為正態(tài)分布的,但是很多研究所獲得的數(shù)據(jù)并不符合這種假設。解決該問題的方法之一是采用不要求數(shù)據(jù)為正態(tài)分布的半?yún)?shù)模型。但是,半?yún)?shù)模型對離散數(shù)據(jù)相當敏感,因此在處理包含離散值的數(shù)據(jù)時產(chǎn)生的估計值是不可靠的。在這種情況下,一些研究者在刪減這些極端值后再進行分析,但是,刪減數(shù)據(jù)的事先法則(ad-hoc rules)是基于主觀標準的,所以不同的調整方法就會產(chǎn)生不同的結果。等級回歸為處理包括離散值的非正態(tài)分布數(shù)據(jù)提供了更為客觀的方法。本文采用虛擬和實際數(shù)據(jù)來闡述這個非常有用的處理離散值的回歸方法,并與采用經(jīng)典回歸模型和半?yún)?shù)回歸模型所得出的結果進行比較。

本文全文中文版從2014年11月25日起在www.shanghaiarchivesofpsychaitry.org可供免費閱覽下載

猜你喜歡
方法模型
一半模型
重要模型『一線三等角』
重尾非線性自回歸模型自加權M-估計的漸近分布
學習方法
3D打印中的模型分割與打包
用對方法才能瘦
Coco薇(2016年2期)2016-03-22 02:42:52
FLUKA幾何模型到CAD幾何模型轉換方法初步研究
四大方法 教你不再“坐以待病”!
Coco薇(2015年1期)2015-08-13 02:47:34
賺錢方法
捕魚
主站蜘蛛池模板: 国产精品无码AⅤ在线观看播放| 国产丝袜啪啪| 国产爽歪歪免费视频在线观看| 国产精品999在线| 久久久久国产精品熟女影院| jijzzizz老师出水喷水喷出| 亚洲成人高清在线观看| 成人噜噜噜视频在线观看| 国模粉嫩小泬视频在线观看| 男女性午夜福利网站| 高潮毛片免费观看| 人妻免费无码不卡视频| 制服无码网站| 在线看AV天堂| 成人亚洲国产| 麻豆精品在线播放| 丝袜国产一区| 久久久精品无码一区二区三区| 黄色三级网站免费| 四虎永久在线精品影院| 区国产精品搜索视频| 亚洲人精品亚洲人成在线| 国产丝袜91| 思思热在线视频精品| 亚洲三级网站| 狠狠亚洲五月天| 无码av免费不卡在线观看| 91最新精品视频发布页| 久久 午夜福利 张柏芝| 欧美精品导航| 久久黄色一级视频| 成人午夜福利视频| 欧美福利在线| 国产精品视频第一专区| 久久精品亚洲中文字幕乱码| 国产91高清视频| av一区二区三区高清久久| 亚洲最大福利视频网| 国产精品一区在线观看你懂的| 国产精品视频导航| 国产第一福利影院| 一级片一区| 亚洲欧美另类日本| 亚洲综合精品香蕉久久网| 久久久久人妻一区精品| 午夜人性色福利无码视频在线观看| 久久国产香蕉| 色偷偷一区二区三区| 免费人成网站在线高清| 日韩精品无码免费专网站| 国产激情无码一区二区三区免费| 欧美视频在线不卡| 中文毛片无遮挡播放免费| 高清码无在线看| 国产精品美女自慰喷水| 国产尹人香蕉综合在线电影| 婷婷色在线视频| 国产精品v欧美| 国产一区二区三区夜色 | 九色视频在线免费观看| 国产视频 第一页| 亚洲最大看欧美片网站地址| 波多野结衣无码AV在线| 成人福利一区二区视频在线| 9啪在线视频| 久久影院一区二区h| 日本不卡视频在线| 欧美亚洲国产日韩电影在线| 日本高清有码人妻| 日韩精品成人网页视频在线| 熟女日韩精品2区| 热思思久久免费视频| 无码精油按摩潮喷在线播放 | 国产国语一级毛片| 亚洲国产高清精品线久久| 天堂中文在线资源| 特级毛片8级毛片免费观看| Jizz国产色系免费| 亚洲第一成网站| 日本一本在线视频| 视频一区视频二区日韩专区| 欧美精品在线视频观看|