The topic of heteroskedasticity-consistent (HC) standard errors arises in statistics and econometrics in the context of linear regression and time series analysis. These are also known as heteroskedasticity-robust standard errors (or simply robust standard errors), Eicker–Huber–White standard errors (also Huber–White standard errors or White standard errors),[1] to recognize the contributions of Friedhelm Eicker,[2] Peter J. Huber,[3] and Halbert White.[4]
In regression and time-series modelling, basic forms of models make use of the assumption that the errors or disturbances ui have the same variance across all observation points. When this is not the case, the errors are said to be heteroskedastic, or to have heteroskedasticity, and this behaviour will be reflected in the residuals estimated from a fitted model. Heteroskedasticity-consistent standard errors are used to allow the fitting of a model that does contain heteroskedastic residuals. The first such approach was proposed by Huber (1967), and further improved procedures have been produced since for cross-sectional data, time-series data and GARCH estimation.
Heteroskedasticity-consistent standard errors that differ from classical standard errors may indicate model misspecification. Substituting heteroskedasticity-consistent standard errors does not resolve this misspecification, which may lead to bias in the coefficients. In most situations, the problem should be found and fixed.[5] Other types of standard error adjustments, such as clustered standard errors or HAC standard errors, may be considered as extensions to HC standard errors.
History
Heteroskedasticity-consistent standard errors were introduced by Friedhelm Eicker,[6][7] and popularized in econometrics by Halbert White.
Problem
Consider the linear regression model for the scalar $y$:

$$y = \mathbf{x}^{\top}\boldsymbol{\beta} + \varepsilon,$$

where $\mathbf{x}$ is a $k \times 1$ column vector of explanatory variables (features), $\boldsymbol{\beta}$ is a $k \times 1$ column vector of parameters to be estimated, and $\varepsilon$ is the residual error.
The ordinary least squares (OLS) estimator is

$$\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbf{y},$$

where $\mathbf{y}$ is an $n \times 1$ vector of observations $y_i$, and $\mathbb{X}$ denotes the $n \times k$ matrix of stacked row vectors $\mathbf{x}_i^{\top}$ observed in the data.
If the sample errors have equal variance $\sigma^2$ and are uncorrelated, then the least-squares estimate of $\boldsymbol{\beta}$ is BLUE (best linear unbiased estimator), and its variance is estimated with

$$\widehat{V}\big[\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}}\big] = s^2 (\mathbb{X}^{\top}\mathbb{X})^{-1}, \qquad s^2 = \frac{\sum_{i}\widehat{\varepsilon}_i^{\,2}}{n-k},$$

where $\widehat{\varepsilon}_i = y_i - \mathbf{x}_i^{\top}\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ are the regression residuals.
When the error terms do not have constant variance (i.e., the assumption of $\mathbb{E}[\mathbf{u}\mathbf{u}^{\top}] = \sigma^2 I_n$ is untrue), the OLS estimator loses its desirable properties. The formula for variance now cannot be simplified:

$$V\big[\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}}\big] = V\big[(\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbf{y}\big] = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\Sigma\,\mathbb{X}(\mathbb{X}^{\top}\mathbb{X})^{-1},$$

where $\Sigma = V[\mathbf{u}]$ is the covariance matrix of the errors.
While the OLS point estimator remains unbiased, it is not «best» in the sense of having minimum mean square error, and the OLS variance estimator does not provide a consistent estimate of the variance of the OLS estimates.
For any non-linear model (for instance logit and probit models), however, heteroskedasticity has more severe consequences: the maximum likelihood estimates of the parameters will be biased (in an unknown direction), as well as inconsistent (unless the likelihood function is modified to correctly take into account the precise form of heteroskedasticity).[8][9] As pointed out by Greene, “simply computing a robust covariance matrix for an otherwise inconsistent estimator does not give it redemption.”[10]
Solution
If the regression errors $\varepsilon_i$ are independent, but have distinct variances $\sigma_i^2$, then $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, which can be estimated with $\widehat{\Sigma} = \operatorname{diag}(\widehat{\varepsilon}_1^{\,2}, \ldots, \widehat{\varepsilon}_n^{\,2})$. This provides White's (1980) estimator, often referred to as HCE (heteroskedasticity-consistent estimator):

$$\widehat{V}_{\mathrm{HCE}} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\operatorname{diag}(\widehat{\varepsilon}_1^{\,2}, \ldots, \widehat{\varepsilon}_n^{\,2})\,\mathbb{X}(\mathbb{X}^{\top}\mathbb{X})^{-1} = \Big(\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\Big)^{-1}\Big(\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\widehat{\varepsilon}_i^{\,2}\Big)\Big(\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\Big)^{-1},$$

where as above $\mathbb{X}$ denotes the matrix of stacked $\mathbf{x}_i^{\top}$ values from the data. The estimator can be derived in terms of the generalized method of moments (GMM).
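As an illustration (not part of the original article), the estimator can be computed directly from the OLS residuals. A minimal R sketch on simulated data, which can be checked against vcovHC(fit, type = "HC0") from the sandwich package:

# simulated heteroskedastic data (illustrative only)
set.seed(42)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 2 * x)   # error variance grows with x

fit <- lm(y ~ x)
X <- model.matrix(fit)                  # stacked regressor matrix
e <- residuals(fit)

bread <- solve(crossprod(X))            # (X'X)^(-1)
meat  <- crossprod(X * e)               # sum_i e_i^2 x_i x_i'
V_hce <- bread %*% meat %*% bread       # White's sandwich estimator

sqrt(diag(V_hce))                       # heteroskedasticity-robust standard errors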
Also often discussed in the literature (including White's paper) is the covariance matrix $\widehat{\Omega}_n$ of the $\sqrt{n}$-consistent limiting distribution:

$$\sqrt{n}\,\big(\widehat{\boldsymbol{\beta}}_n - \boldsymbol{\beta}\big) \xrightarrow{d} \mathcal{N}(0, \Omega),$$

where

$$\Omega = \mathbb{E}\big[\mathbf{X}\mathbf{X}^{\top}\big]^{-1}\operatorname{Var}\big[\mathbf{X}\varepsilon\big]\,\mathbb{E}\big[\mathbf{X}\mathbf{X}^{\top}\big]^{-1}$$

and

$$\widehat{\Omega}_n = \Big(\tfrac{1}{n}\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\Big)^{-1}\Big(\tfrac{1}{n}\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\widehat{\varepsilon}_i^{\,2}\Big)\Big(\tfrac{1}{n}\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\Big)^{-1}.$$

Thus,

$$\widehat{\Omega}_n = n \cdot \widehat{V}_{\mathrm{HCE}}$$

and

$$\widehat{\operatorname{Var}}\big[\mathbf{X}\varepsilon\big] = \tfrac{1}{n}\,\mathbb{X}^{\top}\operatorname{diag}(\widehat{\varepsilon}_1^{\,2}, \ldots, \widehat{\varepsilon}_n^{\,2})\,\mathbb{X}.$$

Precisely which covariance matrix is of concern is a matter of context.
Alternative estimators have been proposed in MacKinnon & White (1985) that correct for unequal variances of regression residuals due to different leverage.[11] Unlike the asymptotic White’s estimator, their estimators are unbiased when the data are homoscedastic.
Of the four widely available options, often denoted HC0–HC3, the HC3 specification appears to work best, with tests relying on the HC3 estimator featuring better power and closer proximity to the targeted size, especially in small samples. The larger the sample, the smaller the difference between the estimators.[12]
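For reference, the four variants differ only in how the squared residuals entering the estimator are rescaled. Writing $h_{ii}$ for the $i$-th leverage (diagonal element of the hat matrix $\mathbb{X}(\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}$), the usual definitions, following MacKinnon & White (1985), replace $\widehat{\varepsilon}_i^{\,2}$ in the estimator above with

$$\text{HC0: } \widehat{\varepsilon}_i^{\,2}, \qquad \text{HC1: } \frac{n}{n-k}\,\widehat{\varepsilon}_i^{\,2}, \qquad \text{HC2: } \frac{\widehat{\varepsilon}_i^{\,2}}{1-h_{ii}}, \qquad \text{HC3: } \frac{\widehat{\varepsilon}_i^{\,2}}{(1-h_{ii})^{2}}.$$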
An alternative to explicitly modelling the heteroskedasticity is using a resampling method such as the Wild Bootstrap. Given that the studentized Bootstrap, which standardizes the resampled statistic by its standard error, yields an asymptotic refinement,[13] heteroskedasticity-robust standard errors remain nevertheless useful.
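A minimal sketch of the wild bootstrap idea in R (a basic, non-studentized variant with Rademacher weights; the model and data are illustrative assumptions, not taken from the sources cited above):

set.seed(1)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 2 * x)   # heteroskedastic errors
fit <- lm(y ~ x)
e <- residuals(fit)
yhat <- fitted(fit)

B <- 999
boot_slopes <- replicate(B, {
  w <- sample(c(-1, 1), n, replace = TRUE)    # Rademacher weights
  yb <- yhat + w * e                          # regenerate y, keeping x fixed
  coef(lm(yb ~ x))[2]
})

sd(boot_slopes)                               # wild-bootstrap standard error of the slope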
Instead of accounting for the heteroskedastic errors, most linear models can be transformed to feature homoskedastic error terms (unless the error term is heteroskedastic by construction, e.g. in a Linear probability model). One way to do this is using Weighted least squares, which also features improved efficiency properties.
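For example, if the error standard deviation is (or can be assumed to be) proportional to one of the regressors, weighting each observation by the inverse error variance restores homoskedasticity. A short R sketch under that assumption:

set.seed(7)
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)       # error sd proportional to x

ols <- lm(y ~ x)                        # valid point estimates, invalid classical SEs
wls <- lm(y ~ x, weights = 1 / x^2)     # weights proportional to 1 / Var(error_i)

summary(wls)$coefficients               # efficient estimates with valid standard errors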
See also
- Delta method
- Generalized least squares
- Generalized estimating equations
- Weighted least squares, an alternative formulation
- White test — a test for whether heteroskedasticity is present.
- Newey–West estimator
- Quasi-maximum likelihood estimate
Software
- EViews: EViews version 8 offers three different methods for robust least squares: M-estimation (Huber, 1973), S-estimation (Rousseeuw and Yohai, 1984), and MM-estimation (Yohai 1987).[14]
- Julia: the CovarianceMatrices package offers several methods for heteroskedasticity-robust variance-covariance matrices.[15]
- MATLAB: see the hac function in the Econometrics toolbox.[16]
- Python: the statsmodels package offers various robust standard error estimates; see statsmodels.regression.linear_model.RegressionResults for further descriptions.
- R: the vcovHC() command from the sandwich package.[17][18]
- RATS: the robusterrors option is available in many of the regression and optimization commands (linreg, nlls, etc.).
- Stata: the robust option is applicable in many pseudo-likelihood based procedures.[19]
- Gretl: the option --robust to several estimation commands (such as ols) in the context of a cross-sectional dataset produces robust standard errors.[20]
References
- ^ Kleiber, C.; Zeileis, A. (2006). «Applied Econometrics with R» (PDF). UseR-2006 conference. Archived from the original (PDF) on April 22, 2007.
- ^ Eicker, Friedhelm (1967). «Limit Theorems for Regression with Unequal and Dependent Errors». Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 5. pp. 59–82. MR 0214223. Zbl 0217.51201.
- ^ Huber, Peter J. (1967). «The behavior of maximum likelihood estimates under nonstandard conditions». Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 5. pp. 221–233. MR 0216620. Zbl 0212.21504.
- ^ White, Halbert (1980). «A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity». Econometrica. 48 (4): 817–838. CiteSeerX 10.1.1.11.7646. doi:10.2307/1912934. JSTOR 1912934. MR 0575027.
- ^ King, Gary; Roberts, Margaret E. (2015). «How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It». Political Analysis. 23 (2): 159–179. doi:10.1093/pan/mpu015. ISSN 1047-1987.
- ^ Eicker, F. (1963). «Asymptotic Normality and Consistency of the Least Squares Estimators for Families of Linear Regressions». The Annals of Mathematical Statistics. 34 (2): 447–456. doi:10.1214/aoms/1177704156.
- ^ Eicker, Friedhelm (January 1967). «Limit theorems for regressions with unequal and dependent errors». Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. 5 (1): 59–83.
- ^ Giles, Dave (May 8, 2013). «Robust Standard Errors for Nonlinear Models». Econometrics Beat.
- ^ Guggisberg, Michael (2019). «Misspecified Discrete Choice Models and Huber-White Standard Errors». Journal of Econometric Methods. 8 (1). doi:10.1515/jem-2016-0002.
- ^ Greene, William H. (2012). Econometric Analysis (Seventh ed.). Boston: Pearson Education. pp. 692–693. ISBN 978-0-273-75356-8.
- ^ MacKinnon, James G.; White, Halbert (1985). «Some Heteroskedastic-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties». Journal of Econometrics. 29 (3): 305–325. doi:10.1016/0304-4076(85)90158-7. hdl:10419/189084.
- ^ Long, J. Scott; Ervin, Laurie H. (2000). «Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model». The American Statistician. 54 (3): 217–224. doi:10.2307/2685594. ISSN 0003-1305.
- ^ C., Davison, Anthony (2010). Bootstrap methods and their application. Cambridge Univ. Press. ISBN 978-0-521-57391-7. OCLC 740960962.
- ^ «EViews 8 Robust Regression».
- ^ CovarianceMatrices: Robust Covariance Matrix Estimators
- ^ «Heteroskedasticity and autocorrelation consistent covariance estimators». Econometrics Toolbox.
- ^ sandwich: Robust Covariance Matrix Estimators
- ^ Kleiber, Christian; Zeileis, Achim (2008). Applied Econometrics with R. New York: Springer. pp. 106–110. ISBN 978-0-387-77316-2.
- ^ See online help for _robust option and regress command.
- ^ «Robust covariance matrix estimation» (PDF). Gretl User's Guide, chapter 19.
Further reading
- Freedman, David A. (2006). «On The So-Called 'Huber Sandwich Estimator' and 'Robust Standard Errors'». The American Statistician. 60 (4): 299–302. doi:10.1198/000313006X152207. S2CID 6222876.
- Hardin, James W. (2003). «The Sandwich Estimate of Variance». In Fomby, Thomas B.; Hill, R. Carter (eds.). Maximum Likelihood Estimation of Misspecified Models: Twenty Years Later. Amsterdam: Elsevier. pp. 45–74. ISBN 0-7623-1075-8.
- Hayes, Andrew F.; Cai, Li (2007). «Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation». Behavior Research Methods. 39 (4): 709–722. doi:10.3758/BF03192961. PMID 18183883.
- King, Gary; Roberts, Margaret E. (2015). «How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It». Political Analysis. 23 (2): 159–179. doi:10.1093/pan/mpu015.
- Wooldridge, Jeffrey M. (2009). «Heteroskedasticity-Robust Inference after OLS Estimation». Introductory Econometrics : A Modern Approach (Fourth ed.). Mason: South-Western. pp. 265–271. ISBN 978-0-324-66054-8.
- Buja, Andreas, et al. «Models as approximations: a conspiracy of random regressors and model deviations against classical inference in regression». Statistical Science (2015): 1.
Since heteroskedasticity does not bias the coefficient estimates, OLS can still be used. It is not the coefficient estimates themselves that are biased and inconsistent, but their standard errors, so the formula for computing standard errors under homoskedasticity is not appropriate in the heteroskedastic case.

A natural idea in this situation is to correct the formula for the standard errors so that it gives a "right" (consistent) answer. One can then again validly carry out tests (for example, of the insignificance of coefficients) and construct confidence intervals. The corresponding "right" standard errors are called heteroskedasticity-consistent (heteroskedasticity-robust) standard errors.1 The original formula for computing them was proposed by White, which is why they are sometimes also called White standard errors. White's consistent estimator of the covariance matrix of the vector of coefficient estimates is

$$\widehat{V}\left(\widehat{\beta}\right) = n\left(X'X\right)^{-1}\left(\frac{1}{n}\sum_{s=1}^{n} e_s^2\, x_s x_s'\right)\left(X'X\right)^{-1},$$

where $x_s$ is the $s$-th row of the regressor matrix $X$. This formula is clearly more cumbersome than the formula $\widehat{V}\left(\widehat{\beta}\right) = \left(X'X\right)^{-1} S^2$ derived in Chapter 3 for the homoskedastic case. Fortunately, in practice the corresponding computations present no difficulty, since the ability to compute White standard errors automatically is built into every modern econometrics package. The generally accepted label for this version of the standard errors is "HC0". MacKinnon and White (1985) and Davidson and MacKinnon (2004) proposed alternative versions, usually denoted "HC1", "HC2" and "HC3" in econometrics packages. Their formulas differ somewhat, but the idea is the same: they consistently estimate the standard deviations of the OLS coefficient estimates under heteroskedasticity.
For simple (two-variable) regression, the heteroskedasticity-consistent standard error of the slope coefficient estimate is

$$se\left(\widehat{\beta}_2\right) = \sqrt{\frac{1}{n}\cdot\frac{\frac{1}{n-2}\sum_{i=1}^{n}\left(x_i - \overline{x}\right)^2 e_i^2}{\widehat{\operatorname{var}}(x)^2}}.$$
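To make the formula concrete, here is a short R sketch computing it directly on simulated data (R is not the package used in this chapter's examples, and the data below are purely illustrative, not the Agriculture data used in Example 5.1):

set.seed(5)
n <- 100
x <- runif(n, 0, 10)
y <- 3 + 1.5 * x + rnorm(n, sd = 0.3 * x)    # heteroskedastic errors
fit <- lm(y ~ x)
e <- residuals(fit)
vx <- mean((x - mean(x))^2)                  # sample variance of x (divisor n)

se_b2 <- sqrt((1 / n) * (sum((x - mean(x))^2 * e^2) / (n - 2)) / vx^2)
se_b2   # equals the HC1 robust standard error of the slope,
        # e.g. sqrt(diag(sandwich::vcovHC(fit, type = "HC1")))[2]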
A formal proof of consistency will be given in the next chapter. For now, let us work through an example that illustrates why using robust standard errors matters.

Example 5.1. Assessing the effectiveness of fertilizer use

The file Agriculture in the materials accompanying this textbook contains the following data for 2010 on the yields of spring and winter wheat in the Spassky district of Penza Oblast:

PRODP – yield in monetary terms, thousand rubles per 1 ha,

SIZE – size of the arable field, ha,

LABOUR – labour costs, rubles per 1 ha,

FUNG1 – fungicides, seed treatment, expenditure in rubles per 1 ha,

FUNG2 – fungicides, applied during growth, expenditure in rubles per 1 ha,

GIRB – herbicides, expenditure in rubles per 1 ha,

INSEC – insecticides, expenditure in rubles per 1 ha,

YDOB1 – ammophos, applied at sowing, expenditure in rubles per 1 ha,

YDOB2 – ammonium nitrate, applied during growth, expenditure in rubles per 1 ha.

Imagine that you want to answer the question: does the use of fungicides affect the yield of a field?

(a) Estimate the dependence of yield in monetary terms on a constant and the variables FUNG1, FUNG2, YDOB1, YDOB2, GIRB, INSEC, LABOUR. Write the regression equation in standard form, reporting the coefficient of determination and (in parentheses under the corresponding coefficients) the standard errors for the homoskedastic case. Which variables are significant at the 5 percent significance level?

(b) Redo the previous part, this time using heteroskedasticity-consistent standard errors. Compare the conclusions about the significance (at the 5 percent level) of the variables describing fungicide use.

Solution:

(a) We estimate the required equation:

Model 1: OLS, using observations 1-200

Dependent variable: PRODP
| | Coefficient | Std. Error | t-statistic | p-value | |
|---|---|---|---|---|---|
| const | -38.4019 | 7.5273 | -5.1017 | <0.00001 | *** |
| FUNG1 | 0.0445755 | 0.0487615 | 0.9142 | 0.36178 | |
| FUNG2 | 0.103625 | 0.049254 | 2.1039 | 0.03669 | ** |
| GIRB | 0.0776059 | 0.0523553 | 1.4823 | 0.13990 | |
| INSEC | 0.0782521 | 0.0484667 | 1.6146 | 0.10805 | |
| LABOUR | 0.0415064 | 0.00275277 | 15.0781 | <0.00001 | *** |
| YDOB1 | 0.0492168 | 0.0233328 | 2.1093 | 0.03621 | ** |
| YDOB2 | -0.0906824 | 0.025864 | -3.5061 | 0.00057 | *** |

| Sum squared resid | 150575.6 | S.E. of regression | 28.00443 |
|---|---|---|---|
| R-squared | 0.801958 | Adjusted R-squared | 0.794738 |
| F(7, 192) | 111.0701 | P-value (F) | 5.08e-64 |
The variables FUNG2, LABOUR, YDOB1 and YDOB2 are significant at the 5 percent significance level (LABOUR and YDOB2 also at the 1 percent level).

Writing the same results in the form of an equation gives:

$$\widehat{\mathit{PRODP}}_i = -\underset{(7.53)}{38.40} + \underset{(0.05)}{0.04}\cdot\mathit{FUNG1}_i + \underset{(0.05)}{0.10}\cdot\mathit{FUNG2}_i + \underset{(0.05)}{0.08}\cdot\mathit{GIRB}_i + \underset{(0.05)}{0.08}\cdot\mathit{INSEC}_i\;+$$

$$+\;\underset{(0.003)}{0.04}\cdot\mathit{LABOUR}_i + \underset{(0.02)}{0.05}\cdot\mathit{YDOB1}_i - \underset{(0.03)}{0.09}\cdot\mathit{YDOB2}_i, \qquad R^2 = 0.802$$
(b) Using the alternative standard errors, we obtain the following result:

Model 2: OLS, using observations 1-200

Dependent variable: PRODP

Heteroskedasticity-robust standard errors, variant HC1
| | Coefficient | Std. Error | t-statistic | p-value | |
|---|---|---|---|---|---|
| const | -38.4019 | 7.40425 | -5.1865 | <0.00001 | *** |
| FUNG1 | 0.0445755 | 0.0629524 | 0.7081 | 0.47975 | |
| FUNG2 | 0.103625 | 0.0624082 | 1.6604 | 0.09846 | * |
| GIRB | 0.0776059 | 0.0623777 | 1.2441 | 0.21497 | |
| INSEC | 0.0782521 | 0.0536527 | 1.4585 | 0.14634 | |
| LABOUR | 0.0415064 | 0.00300121 | 13.8299 | <0.00001 | *** |
| YDOB1 | 0.0492168 | 0.0197491 | 2.4921 | 0.01355 | ** |
| YDOB2 | -0.0906824 | 0.030999 | -2.9253 | 0.00386 | *** |

| Sum squared resid | 150575.6 | S.E. of regression | 28.00443 |
|---|---|---|---|
| R-squared | 0.801958 | Adjusted R-squared | 0.794738 |
| F(7, 192) | 119.2263 | P-value (F) | 2.16e-66 |
The coefficient estimates are the same as in part (a), which is natural: we are still using ordinary OLS. The standard errors, however, are now slightly different, and in some cases this changes the conclusions of the significance tests.

The variables LABOUR, YDOB1 and YDOB2 are significant at the 5 percent significance level (LABOUR and YDOB2 also at the 1 percent level).

The variable FUNG2 is no longer significant at the 5 percent level. Thus, with the correct standard errors we should conclude that this type of treatment does not matter for yield. Note that had we used the "ordinary" standard errors, we would have reached the opposite conclusion (see part (a)).

* * *

It is worth stressing that in real cross-sectional data heteroskedasticity is, to one degree or another, almost always present. And even if it is absent, heteroskedasticity-consistent standard errors are still... consistent (and will be close to the "ordinary" standard errors computed with the formulas from Chapter 3). For this reason, modern applied work uses robust standard errors by default when estimating equations, rather than standard errors for the homoskedastic case. We strongly recommend that the reader do the same.2 From this point on in this textbook, and in all subsequent chapters unless explicitly stated otherwise, heteroskedasticity-consistent standard errors are always used for the OLS parameter estimates.
1. Since it is rather tedious to say the full name every time, these are often simply called robust standard errors, a shorthand that has been borrowed directly into Russian econometric usage as well. The anglicism may grate on some ears, but in speech it really is far more convenient than its long alternative.↩︎

2. Just remember to switch on the corresponding option in your econometrics package.↩︎
First of all, is it heteroskedasticity or heteroscedasticity? According to
McCulloch (1985),
heteroskedasticity is the proper spelling, because when transliterating Greek words, scientists
use the Latin letter k in place of the Greek letter κ (kappa). κ sometimes is transliterated as
the Latin letter c, but only when these words entered the English language through French, such
as scepter.
Now that this is out of the way, we can get to the meat of this blogpost (foreshadowing pun).
A random variable is said to be heteroskedastic, if its variance is not constant. For example,
the variability of expenditures may increase with income. Richer families may spend a similar
amount on groceries as poorer people, but some rich families will sometimes buy expensive
items such as lobster. The variability of expenditures for rich families is thus quite large.
However, the expenditures on food of poorer families, who cannot afford lobster, will not vary much.
Heteroskedasticity can also appear when data is clustered; for example, variability of
expenditures on food may vary from city to city, but is quite constant within a city.
To illustrate this, let’s first load all the packages needed for this blog post:
library(robustbase)
library(tidyverse)
library(sandwich)
library(lmtest)
library(modelr)
library(broom)
First, let’s load and prepare the data:
data("education") education <- education %>% rename(residents = X1, per_capita_income = X2, young_residents = X3, per_capita_exp = Y, state = State) %>% mutate(region = case_when( Region == 1 ~ "northeast", Region == 2 ~ "northcenter", Region == 3 ~ "south", Region == 4 ~ "west" )) %>% select(-Region)
I will be using the education data set from the {robustbase} package. I renamed some columns and changed the values of the Region column. Now, let's do a scatterplot of per capita expenditures on per capita income:
ggplot(education, aes(per_capita_income, per_capita_exp)) + geom_point() + theme_dark()
It would seem that, as income increases, variability of expenditures increases too. Let's look at the same plot by region:
ggplot(education, aes(per_capita_income, per_capita_exp)) + geom_point() + facet_wrap(~region) + theme_dark()
I don’t think this shows much; it would seem that observations might be clustered, but there are
not enough observations to draw any conclusion from this plot (in any case, drawing conclusions
from plots only is dangerous).
Let’s first run a good ol’ linear regression:
lmfit <- lm(per_capita_exp ~ region + residents + young_residents + per_capita_income, data = education)

summary(lmfit)

## 
## Call:
## lm(formula = per_capita_exp ~ region + residents + young_residents + 
##     per_capita_income, data = education)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.963 -25.499  -2.214  17.618  89.106 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -467.40283  142.57669  -3.278 0.002073 ** 
## regionnortheast     15.72741   18.16260   0.866 0.391338    
## regionsouth          7.08742   17.29950   0.410 0.684068    
## regionwest          34.32416   17.49460   1.962 0.056258 .  
## residents           -0.03456    0.05319  -0.650 0.519325    
## young_residents      1.30146    0.35717   3.644 0.000719 ***
## per_capita_income    0.07204    0.01305   5.520 1.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.88 on 43 degrees of freedom
## Multiple R-squared:  0.6292, Adjusted R-squared:  0.5774 
## F-statistic: 12.16 on 6 and 43 DF,  p-value: 6.025e-08
Let’s test for heteroskedasticity using the Breusch-Pagan test that you can find in the {lmtest}
package:
bptest(lmfit)

## 
##  studentized Breusch-Pagan test
## 
## data:  lmfit
## BP = 17.921, df = 6, p-value = 0.006432
This test shows that we can reject the null that the variance of the residuals is constant,
thus heteroskedasticity is present. To get the correct standard errors, we can use the vcovHC()
function from the {sandwich}
package (hence the choice for the header picture of this post):
lmfit %>% 
  vcovHC() %>% 
  diag() %>% 
  sqrt()

##       (Intercept)   regionnortheast       regionsouth        regionwest 
##      311.31088691       25.30778221       23.56106307       24.12258706 
##         residents   young_residents per_capita_income 
##        0.09184368        0.68829667        0.02999882
By default vcovHC()
estimates a heteroskedasticity consistent (HC) variance covariance
matrix for the parameters. There are several ways to estimate such a HC matrix, and by default
vcovHC()
estimates the “HC3” one. You can refer to Zeileis (2004)
for more details.
We see that the standard errors are much larger than before! The intercept and regionwest
variables
are not statistically significant anymore.
You can achieve the same in one single step:
coeftest(lmfit, vcov = vcovHC(lmfit))

## 
## t test of coefficients:
## 
##                      Estimate  Std. Error t value Pr(>|t|)  
## (Intercept)       -467.402827  311.310887 -1.5014  0.14056  
## regionnortheast     15.727405   25.307782  0.6214  0.53759  
## regionsouth          7.087424   23.561063  0.3008  0.76501  
## regionwest          34.324157   24.122587  1.4229  0.16198  
## residents           -0.034558    0.091844 -0.3763  0.70857  
## young_residents      1.301458    0.688297  1.8908  0.06540 .
## per_capita_income    0.072036    0.029999  2.4013  0.02073 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is also easy to change the estimation method for the variance-covariance matrix:
coeftest(lmfit, vcov = vcovHC(lmfit, type = "HC0"))

## 
## t test of coefficients:
## 
##                      Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)       -467.402827  172.577569 -2.7084  0.009666 ** 
## regionnortheast     15.727405   20.488148  0.7676  0.446899    
## regionsouth          7.087424   17.755889  0.3992  0.691752    
## regionwest          34.324157   19.308578  1.7777  0.082532 .  
## residents           -0.034558    0.054145 -0.6382  0.526703    
## young_residents      1.301458    0.387743  3.3565  0.001659 ** 
## per_capita_income    0.072036    0.016638  4.3296 8.773e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As I wrote above, by default, the type
argument is equal to “HC3”.
Another way of dealing with heteroskedasticity is to use the lmrob()
function from the
{robustbase}
package. This package is quite interesting, and offers quite a lot of functions
for robust linear, and nonlinear, regression models. Running a robust linear regression
is just the same as with lm()
:
lmrobfit <- lmrob(per_capita_exp ~ region + residents + young_residents + per_capita_income, data = education) summary(lmrobfit) ## ## Call: ## lmrob(formula = per_capita_exp ~ region + residents + young_residents + per_capita_income, ## data = education) ## --> method = "MM" ## Residuals: ## Min 1Q Median 3Q Max ## -57.074 -14.803 -0.853 24.154 174.279 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -156.37169 132.73828 -1.178 0.24526 ## regionnortheast 20.64576 26.45378 0.780 0.43940 ## regionsouth 10.79694 29.42747 0.367 0.71549 ## regionwest 45.22589 33.07950 1.367 0.17867 ## residents 0.03406 0.04412 0.772 0.44435 ## young_residents 0.57896 0.25512 2.269 0.02832 * ## per_capita_income 0.04328 0.01442 3.000 0.00447 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Robust residual standard error: 26.4 ## Multiple R-squared: 0.6235, Adjusted R-squared: 0.571 ## Convergence in 24 IRWLS iterations ## ## Robustness weights: ## observation 50 is an outlier with |weight| = 0 ( < 0.002); ## 7 weights are ~= 1. The remaining 42 ones are summarized as ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.05827 0.85200 0.93870 0.85250 0.98700 0.99790 ## Algorithmic parameters: ## tuning.chi bb tuning.psi refine.tol ## 1.548e+00 5.000e-01 4.685e+00 1.000e-07 ## rel.tol scale.tol solve.tol eps.outlier ## 1.000e-07 1.000e-10 1.000e-07 2.000e-03 ## eps.x warn.limit.reject warn.limit.meanrw ## 1.071e-08 5.000e-01 5.000e-01 ## nResample max.it best.r.s k.fast.s k.max ## 500 50 2 1 200 ## maxit.scale trace.lev mts compute.rd fast.s.large.n ## 200 0 1000 0 2000 ## psi subsampling cov ## "bisquare" "nonsingular" ".vcov.avar1" ## compute.outlier.stats ## "SM" ## seed : int(0)
This, however, gives you different coefficient estimates than fitting the model with lm(): unlike the robust covariance matrices above, which leave the point estimates untouched and only change the standard errors, lmrob() uses a different estimation method altogether, one that is also robust to outliers (at least that's my understanding, I haven't read the theoretical papers behind the package yet).
Finally, it is also possible to bootstrap the standard errors. For this I will use the
bootstrap()
function from the {modelr}
package:
resamples <- 100

boot_education <- education %>% 
  modelr::bootstrap(resamples)
Let’s take a look at the boot_education
object:
boot_education ## # A tibble: 100 x 2 ## strap .id ## <list> <chr> ## 1 <S3: resample> 001 ## 2 <S3: resample> 002 ## 3 <S3: resample> 003 ## 4 <S3: resample> 004 ## 5 <S3: resample> 005 ## 6 <S3: resample> 006 ## 7 <S3: resample> 007 ## 8 <S3: resample> 008 ## 9 <S3: resample> 009 ## 10 <S3: resample> 010 ## # ... with 90 more rows
The column strap
contains resamples of the original data. I will run my linear regression
from before on each of the resamples:
( boot_lin_reg <- boot_education %>% mutate(regressions = map(strap, ~lm(per_capita_exp ~ region + residents + young_residents + per_capita_income, data = .))) ) ## # A tibble: 100 x 3 ## strap .id regressions ## <list> <chr> <list> ## 1 <S3: resample> 001 <S3: lm> ## 2 <S3: resample> 002 <S3: lm> ## 3 <S3: resample> 003 <S3: lm> ## 4 <S3: resample> 004 <S3: lm> ## 5 <S3: resample> 005 <S3: lm> ## 6 <S3: resample> 006 <S3: lm> ## 7 <S3: resample> 007 <S3: lm> ## 8 <S3: resample> 008 <S3: lm> ## 9 <S3: resample> 009 <S3: lm> ## 10 <S3: resample> 010 <S3: lm> ## # ... with 90 more rows
I have added a new column called regressions
which contains the linear regressions on each
bootstrapped sample. Now, I will create a list of tidied regression results:
( tidied <- boot_lin_reg %>% mutate(tidy_lm = map(regressions, broom::tidy)) ) ## # A tibble: 100 x 4 ## strap .id regressions tidy_lm ## <list> <chr> <list> <list> ## 1 <S3: resample> 001 <S3: lm> <data.frame [7 × 5]> ## 2 <S3: resample> 002 <S3: lm> <data.frame [7 × 5]> ## 3 <S3: resample> 003 <S3: lm> <data.frame [7 × 5]> ## 4 <S3: resample> 004 <S3: lm> <data.frame [7 × 5]> ## 5 <S3: resample> 005 <S3: lm> <data.frame [7 × 5]> ## 6 <S3: resample> 006 <S3: lm> <data.frame [7 × 5]> ## 7 <S3: resample> 007 <S3: lm> <data.frame [7 × 5]> ## 8 <S3: resample> 008 <S3: lm> <data.frame [7 × 5]> ## 9 <S3: resample> 009 <S3: lm> <data.frame [7 × 5]> ## 10 <S3: resample> 010 <S3: lm> <data.frame [7 × 5]> ## # ... with 90 more rows
broom::tidy()
creates a data frame of the regression results. Let’s look at one of these:
tidied$tidy_lm[[1]] ## term estimate std.error statistic p.value ## 1 (Intercept) -515.19835839 129.61828003 -3.9747353 2.648477e-04 ## 2 regionnortheast -11.75706535 18.72014312 -0.6280436 5.332970e-01 ## 3 regionsouth -8.89139412 15.51588707 -0.5730510 5.695950e-01 ## 4 regionwest 25.96217940 17.48335459 1.4849656 1.448476e-01 ## 5 residents -0.08921914 0.05557157 -1.6054819 1.157079e-01 ## 6 young_residents 1.41203710 0.35297506 4.0003877 2.447887e-04 ## 7 per_capita_income 0.08609340 0.01191468 7.2258276 6.069534e-09
This format is easier to handle than the standard lm()
output:
tidied$regressions[[1]] ## ## Call: ## lm(formula = per_capita_exp ~ region + residents + young_residents + ## per_capita_income, data = .) ## ## Coefficients: ## (Intercept) regionnortheast regionsouth ## -515.19836 -11.75707 -8.89139 ## regionwest residents young_residents ## 25.96218 -0.08922 1.41204 ## per_capita_income ## 0.08609
Now that I have all these regression results, I can compute any statistic I need. But first,
let’s transform the data even further:
list_mods <- tidied %>% pull(tidy_lm)
list_mods
is a list of the tidy_lm
data frames. I now add an index and
bind the rows together (by using map2_df()
instead of map2()
):
mods_df <- map2_df(list_mods, seq(1, resamples), ~mutate(.x, resample = .y))
Let’s take a look at the final object:
head(mods_df, 25) ## term estimate std.error statistic p.value ## 1 (Intercept) -515.19835839 129.61828003 -3.9747353 2.648477e-04 ## 2 regionnortheast -11.75706535 18.72014312 -0.6280436 5.332970e-01 ## 3 regionsouth -8.89139412 15.51588707 -0.5730510 5.695950e-01 ## 4 regionwest 25.96217940 17.48335459 1.4849656 1.448476e-01 ## 5 residents -0.08921914 0.05557157 -1.6054819 1.157079e-01 ## 6 young_residents 1.41203710 0.35297506 4.0003877 2.447887e-04 ## 7 per_capita_income 0.08609340 0.01191468 7.2258276 6.069534e-09 ## 8 (Intercept) -526.78466908 142.64004523 -3.6931050 6.207704e-04 ## 9 regionnortheast 3.35909252 21.97165783 0.1528830 8.792057e-01 ## 10 regionsouth 2.62845462 17.20973263 0.1527307 8.793251e-01 ## 11 regionwest 26.40527188 20.50214280 1.2879274 2.046593e-01 ## 12 residents -0.04834920 0.05722962 -0.8448282 4.028830e-01 ## 13 young_residents 1.49618012 0.37091593 4.0337445 2.208994e-04 ## 14 per_capita_income 0.07489435 0.01179888 6.3475800 1.140844e-07 ## 15 (Intercept) -466.17160566 130.18954610 -3.5807146 8.659014e-04 ## 16 regionnortheast -6.75977787 17.24994600 -0.3918724 6.970880e-01 ## 17 regionsouth 6.31061828 16.27953099 0.3876413 7.001938e-01 ## 18 regionwest 27.89261897 15.43852463 1.8066894 7.781178e-02 ## 19 residents -0.08760677 0.05007780 -1.7494134 8.735434e-02 ## 20 young_residents 1.23763741 0.32965208 3.7543746 5.168492e-04 ## 21 per_capita_income 0.08469609 0.01236601 6.8491057 2.128982e-08 ## 22 (Intercept) -316.44265722 166.96769979 -1.8952328 6.479692e-02 ## 23 regionnortheast 11.54907449 14.95746934 0.7721276 4.442620e-01 ## 24 regionsouth 9.25759229 16.89974638 0.5477947 5.866654e-01 ## 25 regionwest 31.83905855 16.56287888 1.9223143 6.120738e-02 ## resample ## 1 1 ## 2 1 ## 3 1 ## 4 1 ## 5 1 ## 6 1 ## 7 1 ## 8 2 ## 9 2 ## 10 2 ## 11 2 ## 12 2 ## 13 2 ## 14 2 ## 15 3 ## 16 3 ## 17 3 ## 18 3 ## 19 3 ## 20 3 ## 21 3 ## 22 4 ## 23 4 ## 24 4 ## 25 4
Now this is a very useful format, because I now can group by the term
column and compute any
statistics I need, in the present case the standard deviation:
( r.std.error <- mods_df %>% 
    group_by(term) %>% 
    summarise(r.std.error = sd(estimate)) )

## # A tibble: 7 x 2
##   term              r.std.error
##   <chr>                   <dbl>
## 1 (Intercept)          226.    
## 2 per_capita_income      0.0211
## 3 regionnortheast       19.7   
## 4 regionsouth           19.1   
## 5 regionwest            18.7   
## 6 residents              0.0629
## 7 young_residents        0.509
We can append this column to the linear regression model result:
lmfit %>% 
  broom::tidy() %>% 
  full_join(r.std.error) %>% 
  select(term, estimate, std.error, r.std.error)

## Joining, by = "term"

##                term      estimate    std.error  r.std.error
## 1       (Intercept) -467.40282655 142.57668653 226.08735720
## 2   regionnortheast   15.72740519  18.16259519  19.65120904
## 3       regionsouth    7.08742365  17.29949951  19.05153934
## 4        regionwest   34.32415663  17.49459866  18.72767470
## 5         residents   -0.03455770   0.05318811   0.06285984
## 6   young_residents    1.30145825   0.35716636   0.50928879
## 7 per_capita_income    0.07203552   0.01305066   0.02109277
As you see, using the whole bootstrapping procedure is longer than simply using either one of
the first two methods. However, this procedure is very flexible and can thus be adapted to a very
large range of situations. Either way, in the case of heteroskedasticity, you can see that
results vary a lot depending on the procedure you use, so I would advise using them all as robustness tests and discussing the differences.
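To follow that advice, one way is to extend the comparison table from above with the sandwich-based HC3 standard errors as well (a sketch reusing the lmfit and r.std.error objects defined earlier; the column name hc3.std.error is mine):

robust_hc3 <- sqrt(diag(vcovHC(lmfit)))        # HC3 standard errors from {sandwich}

lmfit %>% 
  broom::tidy() %>% 
  mutate(hc3.std.error = robust_hc3[term]) %>%  # match by coefficient name
  full_join(r.std.error) %>% 
  select(term, estimate, std.error, hc3.std.error, r.std.error)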
If you found this blog post useful, you might want to follow me on twitter
for blog post updates.
Here we briefly discuss how to estimate robust standard errors for linear regression models.
Contents
- 1 Which package to use
- 2 Heteroskedasticity robust standard errors
- 3 Autocorrelation and heteroskedasticity robust standard errors
- 4 Heteroskedasticity Robust F-tests
- 5 Footnotes
Which package to use
There are a number of pieces of code available to facilitate this task[1]. Here I recommend using the «sandwich» package, which has the most comprehensive robust standard error options I am aware of.
As described in more detail in R_Packages you should install the package the first time you use it on a particular computer:
install.packages("sandwich")
and then call the package at the beginning of your script into the library:
library(sandwich)
All code snippets below assume that you have done so. In fact, you may instead want to use another package called «AER», which contains the sandwich package and other relevant packages (such as the one used for instrumental variables estimation, IV_in_R). See the relevant CRAN webpage.
Heteroskedasticity robust standard errors
I assume that you know that the presence of heteroskedastic error terms renders OLS estimators of linear regression models inefficient (although they remain unbiased). More seriously, however, heteroskedastic error terms also imply that the usual standard errors that are computed for your coefficient estimates (e.g. when you use the summary()
command as discussed in R_Regression), are incorrect (or sometimes we call them biased). This implies that inference based on these standard errors will be incorrect (incorrectly sized). What we need are coefficient estimate standard errors that are correct even when regression error terms are heteroskedastic, sometimes called White standard errors.
Let’s assume that you have calculated a regression (as in R_Regression):
# Run a regression
reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)
The function from the «sandwich» package that you want to use is called vcovHC()
and you use it as follows:
vcv <- vcovHC(reg_ex1, type = "HC1")
This saves the heteroskedasticity-robust variance-covariance matrix in vcv
[2]. Now you can calculate robust t-tests by using the estimated coefficients and the new standard errors (square roots of the diagonal elements on vcv
). But note that inference using these standard errors is only valid for sufficiently large sample sizes (asymptotically normally distributed t-tests).
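For example, the manual calculation described above amounts to something like the following (a sketch reusing the reg_ex1 and vcv objects from the snippets above):

robust_se <- sqrt(diag(vcv))               # robust standard errors
t_stats <- coef(reg_ex1) / robust_se       # robust t-statistics
p_vals <- 2 * pnorm(-abs(t_stats))         # asymptotic (normal) p-values
cbind(coef(reg_ex1), robust_se, t_stats, p_vals)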
You may actually want a neat way to see the standard errors, rather than having to calculate the square roots of the diagonal of this matrix. This is done with the following function (this is part of the lmtest package which will be automatically installed if you installed the AER package as recommended above):
coeftest(reg_ex1, vcv)
if you already calculated vcv
. Try it out and you will find the regression coefficients along with their new standard errors, t-stats and p-values. If not, you may as well use this line
coeftest(reg_ex1, vcov = vcovHC(reg_ex1,type="HC1"))
which incorporates the call to the vcovHC
function.
Autocorrelation and heteroskedasticity robust standard errors
When the error terms are autocorrelated (and potentially heteroskedastic) all of the above applies and we need to use yet another estimator for the coefficient estimate standard errors, sometimes called the Newey-West estimators.
The function from the «sandwich» package that you want to use is called vcovHAC()
and you use it as follows:
vcv <- vcovHAC(reg_ex1)
Everything is as for heteroskedastic error terms. I.e. you would print these standard errors along with the coefficient estimates, t-statistics and p-values from:
coeftest(reg_ex1, vcov = vcovHAC(reg_ex1))
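The sandwich package also provides a dedicated NeweyWest() function. It is a closely related, though not identical, estimator to vcovHAC(), so the following is an alternative worth comparing:

coeftest(reg_ex1, vcov = NeweyWest(reg_ex1))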
Heteroskedasticity Robust F-tests
To illustrate robust F-tests, we shall basically replicate the example from the standard inference section. We first estimate a somewhat larger regression model.
reg_ex2 <- lm(lwage~exper+log(huswage)+age+educ,data=mydata)
reg_ex2_sm <- summary(reg_ex2)
and now we want to test whether the inclusion of the extra two variables age and educ is statistically significant. In the standard inference section we learned that one way to do that is by means of the following command
> waldtest(reg_ex2, .~. - age - educ)
which delivered
Wald test
Model 1: lwage ~ exper + log(huswage) + age + educ
Model 2: lwage ~ exper + log(huswage)
  Res.Df Df      F    Pr(>F)    
1    423                        
2    425 -2 24.741 6.895e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
But this procedure assumed that the error terms were homoskedastic. If you want to allow for heteroskedastic error terms you merely have to add another input to the waldtest
function call.
> waldtest(reg_ex2, .~. - age - educ, vcov=vcovHC)
Wald test
Model 1: lwage ~ exper + log(huswage) + age + educ
Model 2: lwage ~ exper + log(huswage)
Res.Df Df F Pr(>F)
1 423
2 425 -2 28.854 1.791e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The input vcov=vcovHC
instructs R to use a robust version of the variance covariance matrix.
As you can see it produces slightly different results, although there is no change in the substantial conclusion that you should not omit these two variables as the null hypothesis that both are irrelevant is soundly rejected.
If you prefer the lht
function to perform F-tests, you can calculate robust F-tests by adding the argument white.adjust = TRUE
to your function call.
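For instance, with the larger regression from above, a call along the following lines should reproduce the robust F-test (a sketch; lht() comes from the car package, which is loaded together with AER):

lht(reg_ex2, c("age = 0", "educ = 0"), white.adjust = TRUE)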
Footnotes
- ↑ An alternative option is discussed here but it is less powerful than the sandwich package.
- ↑ Predictably, the type option in this function indicates that there are several variants (actually «HC0» to «HC4»). Using «HC1» will replicate the robust standard errors you would obtain using Stata.