The topic of heteroskedasticity-consistent (HC) standard errors arises in statistics and econometrics in the context of linear regression and time series analysis. These are also known as heteroskedasticity-robust standard errors (or simply robust standard errors), Eicker–Huber–White standard errors (also Huber–White standard errors or White standard errors),[1] to recognize the contributions of Friedhelm Eicker,[2] Peter J. Huber,[3] and Halbert White.[4]
In regression and time-series modelling, basic forms of models make use of the assumption that the errors or disturbances ui have the same variance across all observation points. When this is not the case, the errors are said to be heteroskedastic, or to have heteroskedasticity, and this behaviour will be reflected in the residuals estimated from a fitted model. Heteroskedasticity-consistent standard errors are used to allow the fitting of a model that does contain heteroskedastic residuals. The first such approach was proposed by Huber (1967), and further improved procedures have been produced since for cross-sectional data, time-series data and GARCH estimation.
Heteroskedasticity-consistent standard errors that differ from classical standard errors may indicate model misspecification. Substituting heteroskedasticity-consistent standard errors does not resolve this misspecification, which may lead to bias in the coefficients. In most situations, the problem should be found and fixed.[5] Other types of standard error adjustments, such as clustered standard errors or HAC standard errors, may be considered as extensions to HC standard errors.
History
Heteroskedasticity-consistent standard errors were introduced by Friedhelm Eicker,[6][7] and popularized in econometrics by Halbert White.
Problem
Consider the linear regression model for the scalar $y$:

$$y = \mathbf{x}^{\top}\boldsymbol{\beta} + \varepsilon,$$

where $\mathbf{x}$ is a $k \times 1$ column vector of explanatory variables (features), $\boldsymbol{\beta}$ is a $k \times 1$ column vector of parameters to be estimated, and $\varepsilon$ is the residual error.
The ordinary least squares (OLS) estimator is

$$\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbf{y},$$

where $\mathbf{y}$ is an $n \times 1$ vector of observations $y_i$, and $\mathbb{X}$ denotes the $n \times k$ matrix of stacked row vectors $\mathbf{x}_i^{\top}$ observed in the data.
If the sample errors have equal variance $\sigma^2$ and are uncorrelated, then the least-squares estimate of $\boldsymbol{\beta}$ is BLUE (best linear unbiased estimator), and its variance is estimated with

$$\widehat{V}\big[\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}}\big] = s^2 (\mathbb{X}^{\top}\mathbb{X})^{-1}, \qquad s^2 = \frac{\sum_{i}\widehat{\varepsilon}_i^{\,2}}{n-k},$$

where $\widehat{\varepsilon}_i = y_i - \mathbf{x}_i^{\top}\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ are the regression residuals.
When the error terms do not have constant variance (i.e., the assumption of $\mathbb{E}[\mathbf{u}\mathbf{u}^{\top}] = \sigma^2 I_n$ is untrue), the OLS estimator loses its desirable properties. The formula for variance now cannot be simplified:

$$V\big[\widehat{\boldsymbol{\beta}}_{\mathrm{OLS}}\big] = V\big[(\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbf{y}\big] = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\Sigma\,\mathbb{X}(\mathbb{X}^{\top}\mathbb{X})^{-1},$$

where $\Sigma = V[\mathbf{u}]$ is the covariance matrix of the errors.
While the OLS point estimator remains unbiased, it is not «best» in the sense of having minimum mean square error, and the OLS variance estimator does not provide a consistent estimate of the variance of the OLS estimates.
For any non-linear model (for instance logit and probit models), however, heteroskedasticity has more severe consequences: the maximum likelihood estimates of the parameters will be biased (in an unknown direction), as well as inconsistent (unless the likelihood function is modified to correctly take into account the precise form of heteroskedasticity).[8][9] As pointed out by Greene, “simply computing a robust covariance matrix for an otherwise inconsistent estimator does not give it redemption.”[10]
Solution
If the regression errors $\varepsilon_i$ are independent, but have distinct variances $\sigma_i^2$, then $\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, which can be estimated with $\widehat{\Sigma} = \operatorname{diag}(\widehat{\varepsilon}_1^{\,2}, \ldots, \widehat{\varepsilon}_n^{\,2})$. This provides White's (1980) estimator, often referred to as HCE (heteroskedasticity-consistent estimator):

$$\widehat{V}_{\mathrm{HCE}} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\operatorname{diag}(\widehat{\varepsilon}_1^{\,2}, \ldots, \widehat{\varepsilon}_n^{\,2})\,\mathbb{X}(\mathbb{X}^{\top}\mathbb{X})^{-1} = \Big(\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\Big)^{-1}\Big(\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\widehat{\varepsilon}_i^{\,2}\Big)\Big(\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\Big)^{-1},$$

where as above $\mathbb{X}$ denotes the matrix of stacked $\mathbf{x}_i^{\top}$ values from the data. The estimator can be derived in terms of the generalized method of moments (GMM).
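As an illustration (not part of the original article), the estimator can be computed directly from the OLS residuals. A minimal R sketch on simulated data, which can be checked against vcovHC(fit, type = "HC0") from the sandwich package:

# simulated heteroskedastic data (illustrative only)
set.seed(42)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 2 * x)   # error variance grows with x

fit <- lm(y ~ x)
X <- model.matrix(fit)                  # stacked regressor matrix
e <- residuals(fit)

bread <- solve(crossprod(X))            # (X'X)^(-1)
meat  <- crossprod(X * e)               # sum_i e_i^2 x_i x_i'
V_hce <- bread %*% meat %*% bread       # White's sandwich estimator

sqrt(diag(V_hce))                       # heteroskedasticity-robust standard errors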
Also often discussed in the literature (including White's paper) is the covariance matrix $\widehat{\Omega}_n$ of the $\sqrt{n}$-consistent limiting distribution:

$$\sqrt{n}\,\big(\widehat{\boldsymbol{\beta}}_n - \boldsymbol{\beta}\big) \xrightarrow{d} \mathcal{N}(0, \Omega),$$

where

$$\Omega = \mathbb{E}\big[\mathbf{X}\mathbf{X}^{\top}\big]^{-1}\operatorname{Var}\big[\mathbf{X}\varepsilon\big]\,\mathbb{E}\big[\mathbf{X}\mathbf{X}^{\top}\big]^{-1}$$

and

$$\widehat{\Omega}_n = \Big(\tfrac{1}{n}\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\Big)^{-1}\Big(\tfrac{1}{n}\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\widehat{\varepsilon}_i^{\,2}\Big)\Big(\tfrac{1}{n}\sum_i \mathbf{x}_i\mathbf{x}_i^{\top}\Big)^{-1}.$$

Thus,

$$\widehat{\Omega}_n = n \cdot \widehat{V}_{\mathrm{HCE}}$$

and

$$\widehat{\operatorname{Var}}\big[\mathbf{X}\varepsilon\big] = \tfrac{1}{n}\,\mathbb{X}^{\top}\operatorname{diag}(\widehat{\varepsilon}_1^{\,2}, \ldots, \widehat{\varepsilon}_n^{\,2})\,\mathbb{X}.$$

Precisely which covariance matrix is of concern is a matter of context.
Alternative estimators have been proposed in MacKinnon & White (1985) that correct for unequal variances of regression residuals due to different leverage.[11] Unlike the asymptotic White’s estimator, their estimators are unbiased when the data are homoscedastic.
Of the four widely available options, often denoted HC0–HC3, the HC3 specification appears to work best, with tests relying on the HC3 estimator featuring better power and closer proximity to the targeted size, especially in small samples. The larger the sample, the smaller the difference between the estimators.[12]
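For reference, the four variants differ only in how the squared residuals entering the estimator are rescaled. Writing $h_{ii}$ for the $i$-th leverage (diagonal element of the hat matrix $\mathbb{X}(\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}$), the usual definitions, following MacKinnon & White (1985), replace $\widehat{\varepsilon}_i^{\,2}$ in the estimator above with

$$\text{HC0: } \widehat{\varepsilon}_i^{\,2}, \qquad \text{HC1: } \frac{n}{n-k}\,\widehat{\varepsilon}_i^{\,2}, \qquad \text{HC2: } \frac{\widehat{\varepsilon}_i^{\,2}}{1-h_{ii}}, \qquad \text{HC3: } \frac{\widehat{\varepsilon}_i^{\,2}}{(1-h_{ii})^{2}}.$$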
An alternative to explicitly modelling the heteroskedasticity is using a resampling method such as the Wild Bootstrap. Given that the studentized Bootstrap, which standardizes the resampled statistic by its standard error, yields an asymptotic refinement,[13] heteroskedasticity-robust standard errors remain nevertheless useful.
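A minimal sketch of the wild bootstrap idea in R (a basic, non-studentized variant with Rademacher weights; the model and data are illustrative assumptions, not taken from the sources cited above):

set.seed(1)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 2 * x)   # heteroskedastic errors
fit <- lm(y ~ x)
e <- residuals(fit)
yhat <- fitted(fit)

B <- 999
boot_slopes <- replicate(B, {
  w <- sample(c(-1, 1), n, replace = TRUE)    # Rademacher weights
  yb <- yhat + w * e                          # regenerate y, keeping x fixed
  coef(lm(yb ~ x))[2]
})

sd(boot_slopes)                               # wild-bootstrap standard error of the slope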
Instead of accounting for the heteroskedastic errors, most linear models can be transformed to feature homoskedastic error terms (unless the error term is heteroskedastic by construction, e.g. in a Linear probability model). One way to do this is using Weighted least squares, which also features improved efficiency properties.
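For example, if the error standard deviation is (or can be assumed to be) proportional to one of the regressors, weighting each observation by the inverse error variance restores homoskedasticity. A short R sketch under that assumption:

set.seed(7)
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)       # error sd proportional to x

ols <- lm(y ~ x)                        # valid point estimates, invalid classical SEs
wls <- lm(y ~ x, weights = 1 / x^2)     # weights proportional to 1 / Var(error_i)

summary(wls)$coefficients               # efficient estimates with valid standard errors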
See also
- Delta method
- Generalized least squares
- Generalized estimating equations
- Weighted least squares, an alternative formulation
- White test — a test for whether heteroskedasticity is present.
- Newey–West estimator
- Quasi-maximum likelihood estimate
Software
- EViews: EViews version 8 offers three different methods for robust least squares: M-estimation (Huber, 1973), S-estimation (Rousseeuw and Yohai, 1984), and MM-estimation (Yohai 1987).[14]
- Julia: the CovarianceMatrices package offers several methods for heteroskedasticity-robust variance-covariance matrices.[15]
- MATLAB: see the hac function in the Econometrics toolbox.[16]
- Python: the statsmodels package offers various robust standard error estimates; see statsmodels.regression.linear_model.RegressionResults for further descriptions.
- R: the vcovHC() command from the sandwich package.[17][18]
- RATS: the robusterrors option is available in many of the regression and optimization commands (linreg, nlls, etc.).
- Stata: the robust option is applicable in many pseudo-likelihood based procedures.[19]
- Gretl: the option --robust to several estimation commands (such as ols) in the context of a cross-sectional dataset produces robust standard errors.[20]
References
- ^ Kleiber, C.; Zeileis, A. (2006). «Applied Econometrics with R» (PDF). UseR-2006 conference. Archived from the original (PDF) on April 22, 2007.
- ^ Eicker, Friedhelm (1967). «Limit Theorems for Regression with Unequal and Dependent Errors». Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 5. pp. 59–82. MR 0214223. Zbl 0217.51201.
- ^ Huber, Peter J. (1967). «The behavior of maximum likelihood estimates under nonstandard conditions». Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 5. pp. 221–233. MR 0216620. Zbl 0212.21504.
- ^ White, Halbert (1980). «A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity». Econometrica. 48 (4): 817–838. CiteSeerX 10.1.1.11.7646. doi:10.2307/1912934. JSTOR 1912934. MR 0575027.
- ^ King, Gary; Roberts, Margaret E. (2015). «How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It». Political Analysis. 23 (2): 159–179. doi:10.1093/pan/mpu015. ISSN 1047-1987.
- ^ Eicker, F. (1963). «Asymptotic Normality and Consistency of the Least Squares Estimators for Families of Linear Regressions». The Annals of Mathematical Statistics. 34 (2): 447–456. doi:10.1214/aoms/1177704156.
- ^ Eicker, Friedhelm (January 1967). «Limit theorems for regressions with unequal and dependent errors». Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. 5 (1): 59–83.
- ^ Giles, Dave (May 8, 2013). «Robust Standard Errors for Nonlinear Models». Econometrics Beat.
- ^ Guggisberg, Michael (2019). «Misspecified Discrete Choice Models and Huber-White Standard Errors». Journal of Econometric Methods. 8 (1). doi:10.1515/jem-2016-0002.
- ^ Greene, William H. (2012). Econometric Analysis (Seventh ed.). Boston: Pearson Education. pp. 692–693. ISBN 978-0-273-75356-8.
- ^ MacKinnon, James G.; White, Halbert (1985). «Some Heteroskedastic-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties». Journal of Econometrics. 29 (3): 305–325. doi:10.1016/0304-4076(85)90158-7. hdl:10419/189084.
- ^ Long, J. Scott; Ervin, Laurie H. (2000). «Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model». The American Statistician. 54 (3): 217–224. doi:10.2307/2685594. ISSN 0003-1305.
- ^ C., Davison, Anthony (2010). Bootstrap methods and their application. Cambridge Univ. Press. ISBN 978-0-521-57391-7. OCLC 740960962.
- ^ «EViews 8 Robust Regression».
- ^ CovarianceMatrices: Robust Covariance Matrix Estimators
- ^ «Heteroskedasticity and autocorrelation consistent covariance estimators». Econometrics Toolbox.
- ^ sandwich: Robust Covariance Matrix Estimators
- ^ Kleiber, Christian; Zeileis, Achim (2008). Applied Econometrics with R. New York: Springer. pp. 106–110. ISBN 978-0-387-77316-2.
- ^ See online help for _robust option and regress command.
- ^ «Robust covariance matrix estimation» (PDF). Gretl User's Guide, chapter 19.
Further reading
- Freedman, David A. (2006). «On The So-Called 'Huber Sandwich Estimator' and 'Robust Standard Errors'». The American Statistician. 60 (4): 299–302. doi:10.1198/000313006X152207. S2CID 6222876.
- Hardin, James W. (2003). «The Sandwich Estimate of Variance». In Fomby, Thomas B.; Hill, R. Carter (eds.). Maximum Likelihood Estimation of Misspecified Models: Twenty Years Later. Amsterdam: Elsevier. pp. 45–74. ISBN 0-7623-1075-8.
- Hayes, Andrew F.; Cai, Li (2007). «Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation». Behavior Research Methods. 39 (4): 709–722. doi:10.3758/BF03192961. PMID 18183883.
- King, Gary; Roberts, Margaret E. (2015). «How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It». Political Analysis. 23 (2): 159–179. doi:10.1093/pan/mpu015.
- Wooldridge, Jeffrey M. (2009). «Heteroskedasticity-Robust Inference after OLS Estimation». Introductory Econometrics : A Modern Approach (Fourth ed.). Mason: South-Western. pp. 265–271. ISBN 978-0-324-66054-8.
- Buja, Andreas, et al. «Models as approximations: a conspiracy of random regressors and model deviations against classical inference in regression». Statistical Science (2015): 1.
Since heteroskedasticity does not bias the coefficient estimates, OLS can still be used. It is not the coefficient estimates themselves that are biased and inconsistent, but their standard errors, so the formula for computing standard errors under homoskedasticity is not appropriate in the heteroskedastic case.

A natural idea in this situation is to correct the formula for the standard errors so that it gives a "right" (consistent) answer. One can then again validly carry out tests (for example, of the insignificance of coefficients) and construct confidence intervals. The corresponding "right" standard errors are called heteroskedasticity-consistent (heteroskedasticity-robust) standard errors.1 The original formula for computing them was proposed by White, which is why they are sometimes also called White standard errors. White's consistent estimator of the covariance matrix of the vector of coefficient estimates is

$$\widehat{V}\left(\widehat{\beta}\right) = n\left(X'X\right)^{-1}\left(\frac{1}{n}\sum_{s=1}^{n} e_s^2\, x_s x_s'\right)\left(X'X\right)^{-1},$$

where $x_s$ is the $s$-th row of the regressor matrix $X$. This formula is clearly more cumbersome than the formula $\widehat{V}\left(\widehat{\beta}\right) = \left(X'X\right)^{-1} S^2$ derived in Chapter 3 for the homoskedastic case. Fortunately, in practice the corresponding computations present no difficulty, since the ability to compute White standard errors automatically is built into every modern econometrics package. The generally accepted label for this version of the standard errors is "HC0". MacKinnon and White (1985) and Davidson and MacKinnon (2004) proposed alternative versions, usually denoted "HC1", "HC2" and "HC3" in econometrics packages. Their formulas differ somewhat, but the idea is the same: they consistently estimate the standard deviations of the OLS coefficient estimates under heteroskedasticity.
For simple (two-variable) regression, the heteroskedasticity-consistent standard error of the slope coefficient estimate is

$$se\left(\widehat{\beta}_2\right) = \sqrt{\frac{1}{n}\cdot\frac{\frac{1}{n-2}\sum_{i=1}^{n}\left(x_i - \overline{x}\right)^2 e_i^2}{\widehat{\operatorname{var}}(x)^2}}.$$
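To make the formula concrete, here is a short R sketch computing it directly on simulated data (R is not the package used in this chapter's examples, and the data below are purely illustrative, not the Agriculture data used in Example 5.1):

set.seed(5)
n <- 100
x <- runif(n, 0, 10)
y <- 3 + 1.5 * x + rnorm(n, sd = 0.3 * x)    # heteroskedastic errors
fit <- lm(y ~ x)
e <- residuals(fit)
vx <- mean((x - mean(x))^2)                  # sample variance of x (divisor n)

se_b2 <- sqrt((1 / n) * (sum((x - mean(x))^2 * e^2) / (n - 2)) / vx^2)
se_b2   # equals the HC1 robust standard error of the slope,
        # e.g. sqrt(diag(sandwich::vcovHC(fit, type = "HC1")))[2]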
A formal proof of consistency will be given in the next chapter. For now, let us work through an example that illustrates why using robust standard errors matters.

Example 5.1. Assessing the effectiveness of fertilizer use

The file Agriculture in the materials accompanying this textbook contains the following data for 2010 on the yields of spring and winter wheat in the Spassky district of Penza Oblast:

PRODP – yield in monetary terms, thousand rubles per 1 ha,

SIZE – size of the arable field, ha,

LABOUR – labour costs, rubles per 1 ha,

FUNG1 – fungicides, seed treatment, expenditure in rubles per 1 ha,

FUNG2 – fungicides, applied during growth, expenditure in rubles per 1 ha,

GIRB – herbicides, expenditure in rubles per 1 ha,

INSEC – insecticides, expenditure in rubles per 1 ha,

YDOB1 – ammophos, applied at sowing, expenditure in rubles per 1 ha,

YDOB2 – ammonium nitrate, applied during growth, expenditure in rubles per 1 ha.

Imagine that you want to answer the question: does the use of fungicides affect the yield of a field?

(a) Estimate the dependence of yield in monetary terms on a constant and the variables FUNG1, FUNG2, YDOB1, YDOB2, GIRB, INSEC, LABOUR. Write the regression equation in standard form, reporting the coefficient of determination and (in parentheses under the corresponding coefficients) the standard errors for the homoskedastic case. Which variables are significant at the 5 percent significance level?

(b) Redo the previous part, this time using heteroskedasticity-consistent standard errors. Compare the conclusions about the significance (at the 5 percent level) of the variables describing fungicide use.

Solution:

(a) We estimate the required equation:

Model 1: OLS, using observations 1-200

Dependent variable: PRODP
| | Coefficient | Std. Error | t-statistic | p-value | |
|---|---|---|---|---|---|
| const | -38.4019 | 7.5273 | -5.1017 | <0.00001 | *** |
| FUNG1 | 0.0445755 | 0.0487615 | 0.9142 | 0.36178 | |
| FUNG2 | 0.103625 | 0.049254 | 2.1039 | 0.03669 | ** |
| GIRB | 0.0776059 | 0.0523553 | 1.4823 | 0.13990 | |
| INSEC | 0.0782521 | 0.0484667 | 1.6146 | 0.10805 | |
| LABOUR | 0.0415064 | 0.00275277 | 15.0781 | <0.00001 | *** |
| YDOB1 | 0.0492168 | 0.0233328 | 2.1093 | 0.03621 | ** |
| YDOB2 | -0.0906824 | 0.025864 | -3.5061 | 0.00057 | *** |

| Sum squared resid | 150575.6 | S.E. of regression | 28.00443 |
|---|---|---|---|
| R-squared | 0.801958 | Adjusted R-squared | 0.794738 |
| F(7, 192) | 111.0701 | P-value (F) | 5.08e-64 |
The variables FUNG2, LABOUR, YDOB1 and YDOB2 are significant at the 5 percent significance level (LABOUR and YDOB2 also at the 1 percent level).

Writing the same results in the form of an equation gives:

$$\widehat{\mathit{PRODP}}_i = -\underset{(7.53)}{38.40} + \underset{(0.05)}{0.04}\cdot\mathit{FUNG1}_i + \underset{(0.05)}{0.10}\cdot\mathit{FUNG2}_i + \underset{(0.05)}{0.08}\cdot\mathit{GIRB}_i + \underset{(0.05)}{0.08}\cdot\mathit{INSEC}_i\;+$$

$$+\;\underset{(0.003)}{0.04}\cdot\mathit{LABOUR}_i + \underset{(0.02)}{0.05}\cdot\mathit{YDOB1}_i - \underset{(0.03)}{0.09}\cdot\mathit{YDOB2}_i, \qquad R^2 = 0.802$$
(b) Using the alternative standard errors, we obtain the following result:

Model 2: OLS, using observations 1-200

Dependent variable: PRODP

Heteroskedasticity-robust standard errors, variant HC1
| | Coefficient | Std. Error | t-statistic | p-value | |
|---|---|---|---|---|---|
| const | -38.4019 | 7.40425 | -5.1865 | <0.00001 | *** |
| FUNG1 | 0.0445755 | 0.0629524 | 0.7081 | 0.47975 | |
| FUNG2 | 0.103625 | 0.0624082 | 1.6604 | 0.09846 | * |
| GIRB | 0.0776059 | 0.0623777 | 1.2441 | 0.21497 | |
| INSEC | 0.0782521 | 0.0536527 | 1.4585 | 0.14634 | |
| LABOUR | 0.0415064 | 0.00300121 | 13.8299 | <0.00001 | *** |
| YDOB1 | 0.0492168 | 0.0197491 | 2.4921 | 0.01355 | ** |
| YDOB2 | -0.0906824 | 0.030999 | -2.9253 | 0.00386 | *** |

| Sum squared resid | 150575.6 | S.E. of regression | 28.00443 |
|---|---|---|---|
| R-squared | 0.801958 | Adjusted R-squared | 0.794738 |
| F(7, 192) | 119.2263 | P-value (F) | 2.16e-66 |
The coefficient estimates are the same as in part (a), which is natural: we are still using ordinary OLS. The standard errors, however, are now slightly different, and in some cases this changes the conclusions of the significance tests.

The variables LABOUR, YDOB1 and YDOB2 are significant at the 5 percent significance level (LABOUR and YDOB2 also at the 1 percent level).

The variable FUNG2 is no longer significant at the 5 percent level. Thus, with the correct standard errors we should conclude that this type of treatment does not matter for yield. Note that had we used the "ordinary" standard errors, we would have reached the opposite conclusion (see part (a)).

* * *

It is worth stressing that in real cross-sectional data heteroskedasticity is, to one degree or another, almost always present. And even if it is absent, heteroskedasticity-consistent standard errors are still... consistent (and will be close to the "ordinary" standard errors computed with the formulas from Chapter 3). For this reason, modern applied work uses robust standard errors by default when estimating equations, rather than standard errors for the homoskedastic case. We strongly recommend that the reader do the same.2 From this point on in this textbook, and in all subsequent chapters unless explicitly stated otherwise, heteroskedasticity-consistent standard errors are always used for the OLS parameter estimates.
1. Since it is rather tedious to say the full name every time, these are often simply called robust standard errors, a shorthand that has been borrowed directly into Russian econometric usage as well. The anglicism may grate on some ears, but in speech it really is far more convenient than its long alternative.↩︎

2. Just remember to switch on the corresponding option in your econometrics package.↩︎
First of all, is it heteroskedasticity or heteroscedasticity? According to
McCulloch (1985),
heteroskedasticity is the proper spelling, because when transliterating Greek words, scientists
use the Latin letter k in place of the Greek letter κ (kappa). κ sometimes is transliterated as
the Latin letter c, but only when these words entered the English language through French, such
as scepter.
Now that this is out of the way, we can get to the meat of this blogpost (foreshadowing pun).
A random variable is said to be heteroskedastic, if its variance is not constant. For example,
the variability of expenditures may increase with income. Richer families may spend a similar
amount on groceries as poorer people, but some rich families will sometimes buy expensive
items such as lobster. The variability of expenditures for rich families is thus quite large.
However, the expenditures on food of poorer families, who cannot afford lobster, will not vary much.
Heteroskedasticity can also appear when data is clustered; for example, variability of
expenditures on food may vary from city to city, but is quite constant within a city.
To illustrate this, let’s first load all the packages needed for this blog post:
library(robustbase)
library(tidyverse)
library(sandwich)
library(lmtest)
library(modelr)
library(broom)
First, let’s load and prepare the data:
data("education") education <- education %>% rename(residents = X1, per_capita_income = X2, young_residents = X3, per_capita_exp = Y, state = State) %>% mutate(region = case_when( Region == 1 ~ "northeast", Region == 2 ~ "northcenter", Region == 3 ~ "south", Region == 4 ~ "west" )) %>% select(-Region)
I will be using the education data set from the {robustbase} package. I renamed some columns and changed the values of the Region column. Now, let's do a scatterplot of per capita expenditures on per capita income:
ggplot(education, aes(per_capita_income, per_capita_exp)) + geom_point() + theme_dark()
It would seem that, as income increases, variability of expenditures increases too. Let's look at the same plot by region:
ggplot(education, aes(per_capita_income, per_capita_exp)) + geom_point() + facet_wrap(~region) + theme_dark()
I don’t think this shows much; it would seem that observations might be clustered, but there are
not enough observations to draw any conclusion from this plot (in any case, drawing conclusions
from plots only is dangerous).
Let’s first run a good ol’ linear regression:
lmfit <- lm(per_capita_exp ~ region + residents + young_residents + per_capita_income, data = education)

summary(lmfit)

## 
## Call:
## lm(formula = per_capita_exp ~ region + residents + young_residents + 
##     per_capita_income, data = education)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.963 -25.499  -2.214  17.618  89.106 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -467.40283  142.57669  -3.278 0.002073 ** 
## regionnortheast     15.72741   18.16260   0.866 0.391338    
## regionsouth          7.08742   17.29950   0.410 0.684068    
## regionwest          34.32416   17.49460   1.962 0.056258 .  
## residents           -0.03456    0.05319  -0.650 0.519325    
## young_residents      1.30146    0.35717   3.644 0.000719 ***
## per_capita_income    0.07204    0.01305   5.520 1.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.88 on 43 degrees of freedom
## Multiple R-squared:  0.6292, Adjusted R-squared:  0.5774 
## F-statistic: 12.16 on 6 and 43 DF,  p-value: 6.025e-08
Let’s test for heteroskedasticity using the Breusch-Pagan test that you can find in the {lmtest}
package:
bptest(lmfit)

## 
##  studentized Breusch-Pagan test
## 
## data:  lmfit
## BP = 17.921, df = 6, p-value = 0.006432
This test shows that we can reject the null that the variance of the residuals is constant,
thus heteroskedasticity is present. To get the correct standard errors, we can use the vcovHC()
function from the {sandwich}
package (hence the choice for the header picture of this post):
lmfit %>% 
  vcovHC() %>% 
  diag() %>% 
  sqrt()

##       (Intercept)   regionnortheast       regionsouth        regionwest 
##      311.31088691       25.30778221       23.56106307       24.12258706 
##         residents   young_residents per_capita_income 
##        0.09184368        0.68829667        0.02999882
By default vcovHC()
estimates a heteroskedasticity consistent (HC) variance covariance
matrix for the parameters. There are several ways to estimate such a HC matrix, and by default
vcovHC()
estimates the “HC3” one. You can refer to Zeileis (2004)
for more details.
We see that the standard errors are much larger than before! The intercept and regionwest
variables
are not statistically significant anymore.
You can achieve the same in one single step:
coeftest(lmfit, vcov = vcovHC(lmfit))

## 
## t test of coefficients:
## 
##                      Estimate  Std. Error t value Pr(>|t|)  
## (Intercept)       -467.402827  311.310887 -1.5014  0.14056  
## regionnortheast     15.727405   25.307782  0.6214  0.53759  
## regionsouth          7.087424   23.561063  0.3008  0.76501  
## regionwest          34.324157   24.122587  1.4229  0.16198  
## residents           -0.034558    0.091844 -0.3763  0.70857  
## young_residents      1.301458    0.688297  1.8908  0.06540 .
## per_capita_income    0.072036    0.029999  2.4013  0.02073 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is also easy to change the estimation method for the variance-covariance matrix:
coeftest(lmfit, vcov = vcovHC(lmfit, type = "HC0"))

## 
## t test of coefficients:
## 
##                      Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept)       -467.402827  172.577569 -2.7084  0.009666 ** 
## regionnortheast     15.727405   20.488148  0.7676  0.446899    
## regionsouth          7.087424   17.755889  0.3992  0.691752    
## regionwest          34.324157   19.308578  1.7777  0.082532 .  
## residents           -0.034558    0.054145 -0.6382  0.526703    
## young_residents      1.301458    0.387743  3.3565  0.001659 ** 
## per_capita_income    0.072036    0.016638  4.3296 8.773e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As I wrote above, by default, the type
argument is equal to “HC3”.
Another way of dealing with heteroskedasticity is to use the lmrob()
function from the
{robustbase}
package. This package is quite interesting, and offers quite a lot of functions
for robust linear, and nonlinear, regression models. Running a robust linear regression
is just the same as with lm()
:
lmrobfit <- lmrob(per_capita_exp ~ region + residents + young_residents + per_capita_income, data = education) summary(lmrobfit) ## ## Call: ## lmrob(formula = per_capita_exp ~ region + residents + young_residents + per_capita_income, ## data = education) ## --> method = "MM" ## Residuals: ## Min 1Q Median 3Q Max ## -57.074 -14.803 -0.853 24.154 174.279 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -156.37169 132.73828 -1.178 0.24526 ## regionnortheast 20.64576 26.45378 0.780 0.43940 ## regionsouth 10.79694 29.42747 0.367 0.71549 ## regionwest 45.22589 33.07950 1.367 0.17867 ## residents 0.03406 0.04412 0.772 0.44435 ## young_residents 0.57896 0.25512 2.269 0.02832 * ## per_capita_income 0.04328 0.01442 3.000 0.00447 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Robust residual standard error: 26.4 ## Multiple R-squared: 0.6235, Adjusted R-squared: 0.571 ## Convergence in 24 IRWLS iterations ## ## Robustness weights: ## observation 50 is an outlier with |weight| = 0 ( < 0.002); ## 7 weights are ~= 1. The remaining 42 ones are summarized as ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.05827 0.85200 0.93870 0.85250 0.98700 0.99790 ## Algorithmic parameters: ## tuning.chi bb tuning.psi refine.tol ## 1.548e+00 5.000e-01 4.685e+00 1.000e-07 ## rel.tol scale.tol solve.tol eps.outlier ## 1.000e-07 1.000e-10 1.000e-07 2.000e-03 ## eps.x warn.limit.reject warn.limit.meanrw ## 1.071e-08 5.000e-01 5.000e-01 ## nResample max.it best.r.s k.fast.s k.max ## 500 50 2 1 200 ## maxit.scale trace.lev mts compute.rd fast.s.large.n ## 200 0 1000 0 2000 ## psi subsampling cov ## "bisquare" "nonsingular" ".vcov.avar1" ## compute.outlier.stats ## "SM" ## seed : int(0)
This, however, gives you different coefficient estimates than fitting the model with lm(): unlike the robust covariance matrices above, which leave the point estimates untouched and only change the standard errors, lmrob() uses a different estimation method altogether, one that is also robust to outliers (at least that's my understanding, I haven't read the theoretical papers behind the package yet).
Finally, it is also possible to bootstrap the standard errors. For this I will use the
bootstrap()
function from the {modelr}
package:
resamples <- 100

boot_education <- education %>% 
  modelr::bootstrap(resamples)
Let’s take a look at the boot_education
object:
boot_education ## # A tibble: 100 x 2 ## strap .id ## <list> <chr> ## 1 <S3: resample> 001 ## 2 <S3: resample> 002 ## 3 <S3: resample> 003 ## 4 <S3: resample> 004 ## 5 <S3: resample> 005 ## 6 <S3: resample> 006 ## 7 <S3: resample> 007 ## 8 <S3: resample> 008 ## 9 <S3: resample> 009 ## 10 <S3: resample> 010 ## # ... with 90 more rows
The column strap
contains resamples of the original data. I will run my linear regression
from before on each of the resamples:
( boot_lin_reg <- boot_education %>% mutate(regressions = map(strap, ~lm(per_capita_exp ~ region + residents + young_residents + per_capita_income, data = .))) ) ## # A tibble: 100 x 3 ## strap .id regressions ## <list> <chr> <list> ## 1 <S3: resample> 001 <S3: lm> ## 2 <S3: resample> 002 <S3: lm> ## 3 <S3: resample> 003 <S3: lm> ## 4 <S3: resample> 004 <S3: lm> ## 5 <S3: resample> 005 <S3: lm> ## 6 <S3: resample> 006 <S3: lm> ## 7 <S3: resample> 007 <S3: lm> ## 8 <S3: resample> 008 <S3: lm> ## 9 <S3: resample> 009 <S3: lm> ## 10 <S3: resample> 010 <S3: lm> ## # ... with 90 more rows
I have added a new column called regressions
which contains the linear regressions on each
bootstrapped sample. Now, I will create a list of tidied regression results:
( tidied <- boot_lin_reg %>% mutate(tidy_lm = map(regressions, broom::tidy)) ) ## # A tibble: 100 x 4 ## strap .id regressions tidy_lm ## <list> <chr> <list> <list> ## 1 <S3: resample> 001 <S3: lm> <data.frame [7 × 5]> ## 2 <S3: resample> 002 <S3: lm> <data.frame [7 × 5]> ## 3 <S3: resample> 003 <S3: lm> <data.frame [7 × 5]> ## 4 <S3: resample> 004 <S3: lm> <data.frame [7 × 5]> ## 5 <S3: resample> 005 <S3: lm> <data.frame [7 × 5]> ## 6 <S3: resample> 006 <S3: lm> <data.frame [7 × 5]> ## 7 <S3: resample> 007 <S3: lm> <data.frame [7 × 5]> ## 8 <S3: resample> 008 <S3: lm> <data.frame [7 × 5]> ## 9 <S3: resample> 009 <S3: lm> <data.frame [7 × 5]> ## 10 <S3: resample> 010 <S3: lm> <data.frame [7 × 5]> ## # ... with 90 more rows
broom::tidy()
creates a data frame of the regression results. Let’s look at one of these:
tidied$tidy_lm[[1]] ## term estimate std.error statistic p.value ## 1 (Intercept) -515.19835839 129.61828003 -3.9747353 2.648477e-04 ## 2 regionnortheast -11.75706535 18.72014312 -0.6280436 5.332970e-01 ## 3 regionsouth -8.89139412 15.51588707 -0.5730510 5.695950e-01 ## 4 regionwest 25.96217940 17.48335459 1.4849656 1.448476e-01 ## 5 residents -0.08921914 0.05557157 -1.6054819 1.157079e-01 ## 6 young_residents 1.41203710 0.35297506 4.0003877 2.447887e-04 ## 7 per_capita_income 0.08609340 0.01191468 7.2258276 6.069534e-09
This format is easier to handle than the standard lm()
output:
tidied$regressions[[1]] ## ## Call: ## lm(formula = per_capita_exp ~ region + residents + young_residents + ## per_capita_income, data = .) ## ## Coefficients: ## (Intercept) regionnortheast regionsouth ## -515.19836 -11.75707 -8.89139 ## regionwest residents young_residents ## 25.96218 -0.08922 1.41204 ## per_capita_income ## 0.08609
Now that I have all these regression results, I can compute any statistic I need. But first,
let’s transform the data even further:
list_mods <- tidied %>% pull(tidy_lm)
list_mods
is a list of the tidy_lm
data frames. I now add an index and
bind the rows together (by using map2_df()
instead of map2()
):
mods_df <- map2_df(list_mods, seq(1, resamples), ~mutate(.x, resample = .y))
Let’s take a look at the final object:
head(mods_df, 25) ## term estimate std.error statistic p.value ## 1 (Intercept) -515.19835839 129.61828003 -3.9747353 2.648477e-04 ## 2 regionnortheast -11.75706535 18.72014312 -0.6280436 5.332970e-01 ## 3 regionsouth -8.89139412 15.51588707 -0.5730510 5.695950e-01 ## 4 regionwest 25.96217940 17.48335459 1.4849656 1.448476e-01 ## 5 residents -0.08921914 0.05557157 -1.6054819 1.157079e-01 ## 6 young_residents 1.41203710 0.35297506 4.0003877 2.447887e-04 ## 7 per_capita_income 0.08609340 0.01191468 7.2258276 6.069534e-09 ## 8 (Intercept) -526.78466908 142.64004523 -3.6931050 6.207704e-04 ## 9 regionnortheast 3.35909252 21.97165783 0.1528830 8.792057e-01 ## 10 regionsouth 2.62845462 17.20973263 0.1527307 8.793251e-01 ## 11 regionwest 26.40527188 20.50214280 1.2879274 2.046593e-01 ## 12 residents -0.04834920 0.05722962 -0.8448282 4.028830e-01 ## 13 young_residents 1.49618012 0.37091593 4.0337445 2.208994e-04 ## 14 per_capita_income 0.07489435 0.01179888 6.3475800 1.140844e-07 ## 15 (Intercept) -466.17160566 130.18954610 -3.5807146 8.659014e-04 ## 16 regionnortheast -6.75977787 17.24994600 -0.3918724 6.970880e-01 ## 17 regionsouth 6.31061828 16.27953099 0.3876413 7.001938e-01 ## 18 regionwest 27.89261897 15.43852463 1.8066894 7.781178e-02 ## 19 residents -0.08760677 0.05007780 -1.7494134 8.735434e-02 ## 20 young_residents 1.23763741 0.32965208 3.7543746 5.168492e-04 ## 21 per_capita_income 0.08469609 0.01236601 6.8491057 2.128982e-08 ## 22 (Intercept) -316.44265722 166.96769979 -1.8952328 6.479692e-02 ## 23 regionnortheast 11.54907449 14.95746934 0.7721276 4.442620e-01 ## 24 regionsouth 9.25759229 16.89974638 0.5477947 5.866654e-01 ## 25 regionwest 31.83905855 16.56287888 1.9223143 6.120738e-02 ## resample ## 1 1 ## 2 1 ## 3 1 ## 4 1 ## 5 1 ## 6 1 ## 7 1 ## 8 2 ## 9 2 ## 10 2 ## 11 2 ## 12 2 ## 13 2 ## 14 2 ## 15 3 ## 16 3 ## 17 3 ## 18 3 ## 19 3 ## 20 3 ## 21 3 ## 22 4 ## 23 4 ## 24 4 ## 25 4
Now this is a very useful format, because I now can group by the term
column and compute any
statistics I need, in the present case the standard deviation:
( r.std.error <- mods_df %>% 
    group_by(term) %>% 
    summarise(r.std.error = sd(estimate)) )

## # A tibble: 7 x 2
##   term              r.std.error
##   <chr>                   <dbl>
## 1 (Intercept)          226.    
## 2 per_capita_income      0.0211
## 3 regionnortheast       19.7   
## 4 regionsouth           19.1   
## 5 regionwest            18.7   
## 6 residents              0.0629
## 7 young_residents        0.509
We can append this column to the linear regression model result:
lmfit %>% 
  broom::tidy() %>% 
  full_join(r.std.error) %>% 
  select(term, estimate, std.error, r.std.error)

## Joining, by = "term"

##                term      estimate    std.error  r.std.error
## 1       (Intercept) -467.40282655 142.57668653 226.08735720
## 2   regionnortheast   15.72740519  18.16259519  19.65120904
## 3       regionsouth    7.08742365  17.29949951  19.05153934
## 4        regionwest   34.32415663  17.49459866  18.72767470
## 5         residents   -0.03455770   0.05318811   0.06285984
## 6   young_residents    1.30145825   0.35716636   0.50928879
## 7 per_capita_income    0.07203552   0.01305066   0.02109277
As you see, using the whole bootstrapping procedure is longer than simply using either one of
the first two methods. However, this procedure is very flexible and can thus be adapted to a very
large range of situations. Either way, in the case of heteroskedasticity, you can see that
results vary a lot depending on the procedure you use, so I would advise using them all as robustness tests and discussing the differences.
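To follow that advice, one way is to extend the comparison table from above with the sandwich-based HC3 standard errors as well (a sketch reusing the lmfit and r.std.error objects defined earlier; the column name hc3.std.error is mine):

robust_hc3 <- sqrt(diag(vcovHC(lmfit)))        # HC3 standard errors from {sandwich}

lmfit %>% 
  broom::tidy() %>% 
  mutate(hc3.std.error = robust_hc3[term]) %>%  # match by coefficient name
  full_join(r.std.error) %>% 
  select(term, estimate, std.error, hc3.std.error, r.std.error)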
If you found this blog post useful, you might want to follow me on twitter
for blog post updates.
Here we briefly discuss how to estimate robust standard errors for linear regression models.
Contents
- 1 Which package to use
- 2 Heteroskedasticity robust standard errors
- 3 Autocorrelation and heteroskedasticity robust standard errors
- 4 Heteroskedasticity Robust F-tests
- 5 Footnotes
Which package to use
There are a number of pieces of code available to facilitate this task[1]. Here I recommend using the «sandwich» package, which has the most comprehensive robust standard error options I am aware of.
As described in more detail in R_Packages you should install the package the first time you use it on a particular computer:
install.packages("sandwich")
and then call the package at the beginning of your script into the library:
library(sandwich)
All code snippets below assume that you have done so. In fact, you may instead want to use another package called «AER», which contains the sandwich package and other relevant packages (such as the one used for instrumental variables estimation, IV_in_R). See the relevant CRAN webpage.
Heteroskedasticity robust standard errors
I assume that you know that the presence of heteroskedastic error terms renders OLS estimators of linear regression models inefficient (although they remain unbiased). More seriously, however, heteroskedastic error terms also imply that the usual standard errors that are computed for your coefficient estimates (e.g. when you use the summary()
command as discussed in R_Regression), are incorrect (or sometimes we call them biased). This implies that inference based on these standard errors will be incorrect (incorrectly sized). What we need are coefficient estimate standard errors that are correct even when regression error terms are heteroskedastic, sometimes called White standard errors.
Let’s assume that you have calculated a regression (as in R_Regression):
# Run a regression
reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)
The function from the «sandwich» package that you want to use is called vcovHC()
and you use it as follows:
vcv <- vcovHC(reg_ex1, type = "HC1")
This saves the heteroskedasticity-robust variance-covariance matrix in vcv
[2]. Now you can calculate robust t-tests by using the estimated coefficients and the new standard errors (square roots of the diagonal elements on vcv
). But note that inference using these standard errors is only valid for sufficiently large sample sizes (asymptotically normally distributed t-tests).
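For example, the manual calculation described above amounts to something like the following (a sketch reusing the reg_ex1 and vcv objects from the snippets above):

robust_se <- sqrt(diag(vcv))               # robust standard errors
t_stats <- coef(reg_ex1) / robust_se       # robust t-statistics
p_vals <- 2 * pnorm(-abs(t_stats))         # asymptotic (normal) p-values
cbind(coef(reg_ex1), robust_se, t_stats, p_vals)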
You may actually want a neat way to see the standard errors, rather than having to calculate the square roots of the diagonal of this matrix. This is done with the following function (this is part of the lmtest package which will be automatically installed if you installed the AER package as recommended above):
coeftest(reg_ex1, vcv)
if you already calculated vcv
. Try it out and you will find the regression coefficients along with their new standard errors, t-stats and p-values. If not, you may as well use this line
coeftest(reg_ex1, vcov = vcovHC(reg_ex1,type="HC1"))
which incorporates the call to the vcovHC
function.
Autocorrelation and heteroskedasticity robust standard errors
When the error terms are autocorrelated (and potentially heteroskedastic) all of the above applies and we need to use yet another estimator for the coefficient estimate standard errors, sometimes called the Newey-West estimators.
The function from the «sandwich» package that you want to use is called vcovHAC()
and you use it as follows:
vcv <- vcovHAC(reg_ex1)
Everything is as for heteroskedastic error terms. I.e. you would print these standard errors along with the coefficient estimates, t-statistics and p-values from:
coeftest(reg_ex1, vcov = vcovHAC(reg_ex1))
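The sandwich package also provides a dedicated NeweyWest() function. It is a closely related, though not identical, estimator to vcovHAC(), so the following is an alternative worth comparing:

coeftest(reg_ex1, vcov = NeweyWest(reg_ex1))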
Heteroskedasticity Robust F-tests
To illustrate robust F-tests, we shall basically replicate the example from the standard inference section. We first estimate a somewhat larger regression model.
reg_ex2 <- lm(lwage~exper+log(huswage)+age+educ,data=mydata)
reg_ex2_sm <- summary(reg_ex2)
and now we want to test whether the inclusion of the extra two variables age and educ is statistically significant. In the standard inference section we learned that one way to do that is by means of the following command
> waldtest(reg_ex2, .~. - age - educ)
which delivered
Wald test
Model 1: lwage ~ exper + log(huswage) + age + educ
Model 2: lwage ~ exper + log(huswage)
  Res.Df Df      F    Pr(>F)    
1    423                        
2    425 -2 24.741 6.895e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
But this procedure assumed that the error terms were homoskedastic. If you want to allow for heteroskedastic error terms you merely have to add another input to the waldtest
function call.
> waldtest(reg_ex2, .~. - age - educ, vcov=vcovHC)
Wald test
Model 1: lwage ~ exper + log(huswage) + age + educ
Model 2: lwage ~ exper + log(huswage)
Res.Df Df F Pr(>F)
1 423
2 425 -2 28.854 1.791e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The input vcov=vcovHC
instructs R to use a robust version of the variance covariance matrix.
As you can see it produces slightly different results, although there is no change in the substantial conclusion that you should not omit these two variables as the null hypothesis that both are irrelevant is soundly rejected.
If you prefer the lht
function to perform F-tests, you can calculate robust F-tests by adding the argument white.adjust = TRUE
to your function call.
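For instance, with the larger regression from above, a call along the following lines should reproduce the robust F-test (a sketch; lht() comes from the car package, which is loaded together with AER):

lht(reg_ex2, c("age = 0", "educ = 0"), white.adjust = TRUE)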
Footnotes
- ↑ An alternative option is discussed here but it is less powerful than the sandwich package.
- ↑ Predictably, the type option in this function indicates that there are several variants (actually «HC0» to «HC4»). Using «HC1» will replicate the robust standard errors you would obtain using Stata.