# Spatial Regression
Regression (and prediction more generally) provides us a perfect case to examine how spatial structure can help us understand and analyze our data. In this chapter, we discuss how spatial structure can be used to both validate and improve prediction algorithms, focusing on linear regression specifically.
## What is spatial regression and why should I care?
Usually, spatial structure helps regression models in one of two ways.
The first (and most clear) way space can have an impact on our data is when the process generating the data is itself explicitly spatial.
Here, think of something like the prices for single family homes.
It’s often the case that individuals pay a premium on their house price in order to live in a better school district for the same quality house.
Alternatively, homes closer to noise or chemical polluters like waste water treatment plants, recycling facilities, or wide highways, may actually be cheaper than we would otherwise anticipate.
In cases like asthma incidence, the locations individuals tend to travel to throughout the day, such as their places of work or recreation, may have more impact on their health than their residential addresses.
In this case, it may be necessary to use data from other sites to predict the asthma incidence at a given site.
Regardless of the specific case at play, here, geography is a feature: it directly helps us make predictions about outcomes because those outcomes obtain from geographical processes.
An alternative (and more skeptical) understanding reluctantly acknowledges geography's instrumental value.
Often, in the analysis of predictive methods and classifiers, we are interested in analyzing what we get wrong.
This is common in econometrics; an analyst may be concerned that the model systematically mis-predicts some types of observations.
If we know our model routinely performs poorly on a known set of observations or type of input, we might make a better model if we can account for this.
Among other kinds of error diagnostics, geography provides us with an exceptionally-useful embedding to assess structure in our errors.
Mapping classification/prediction error can help show whether or not there are clusters of error in our data.
If we know that errors tend to be larger in some areas than in other areas (or if error is “contagious” between observations), then we might be able to exploit this structure to make better predictions.
Spatial structure in our errors might arise when geography should enter the model as an attribute somehow, but we are not sure exactly how to include it.
It might also arise because there is some other feature whose omission causes the spatial patterns in the error we see; if this additional feature were included, the structure would disappear.
Or, it might arise from the complex interactions and interdependencies between the features that we have chosen to use as predictors, resulting in intrinsic structure in mis-prediction.
Most of the predictors we use in models of social processes contain embodied spatial information: patterning intrinsic to the feature that we get for free in the model.
Whether we intend it or not, using a spatially-patterned predictor in a model can result in spatially-patterned errors; using more than one can amplify this effect.
Thus, regardless of whether or not the true process is explicitly geographic, additional information about the spatial relationships between our observations or more information about nearby sites can make our predictions better.
In this chapter, we build space into the traditional regression framework. We begin with a standard linear regression model, devoid of any geographical reference. From there, we formalise space and spatial relationships in three main ways: first, encoding it in exogenous variables; second, through spatial heterogeneity, or as systematic variation of outcomes across space; third, as dependence, or through the effect associated to the characteristics of spatial neighbors. Throughout, we focus on the conceptual differences each approach entails rather than on the technical details.
```python
from pysal.lib import weights
from pysal.explore import esda
import numpy
import pandas
import geopandas
import matplotlib.pyplot as plt
import seaborn
import contextily
```
## Data: San Diego AirBnB
To learn a little more about how regression works, we’ll examine information about AirBnB properties in San Diego, CA.
This dataset contains house intrinsic characteristics, both continuous (number of beds as in beds
) and categorical (type of renting or, in AirBnB jargon, property group as in the series of pg_X
binary variables), but also variables that explicitly refer to the location and spatial configuration of the dataset (e.g. distance to Balboa Park, d2balboa
or neighborhood id, neighbourhood_cleansed
).
```python
db = geopandas.read_file("../data/airbnb/regression_db.geojson")
```
These are the explanatory variables we will use throughout the chapter.
```python
variable_names = [
    "accommodates",  # Number of people it accommodates
    "bathrooms",  # Number of bathrooms
    "bedrooms",  # Number of bedrooms
    "beds",  # Number of beds
    # Below are binary variables, 1 True, 0 False
    "rt_Private_room",  # Room type: private room
    "rt_Shared_room",  # Room type: shared room
    "pg_Condominium",  # Property group: condo
    "pg_House",  # Property group: house
    "pg_Other",  # Property group: other
    "pg_Townhouse",  # Property group: townhouse
]
```
## Non-spatial regression, a (very) quick refresh
Before we discuss how to explicitly include space into the linear regression framework, let us show how basic regression can be carried out in Python, and how one can begin to interpret the results. By no means is this a formal and complete introduction to regression so, if that is what you are looking for, we recommend [GH06], in particular chapters 3 and 4, which provide a fantastic, non-spatial introduction.
The core idea of linear regression is to explain the variation in a given (dependent) variable as a linear function of a collection of other (explanatory) variables. For example, in our case, we may want to express the price of a house as a function of the number of bedrooms it has and whether it is a condominium or not. At the individual level, we can express this as:
\[
P_i = \alpha + \sum_k \mathbf{X}_{ik}\beta_k + \epsilon_i
\]
where \(P_i\) is the AirBnB price of house \(i\), and \(X\) is a set of covariates that we use to explain such price (e.g. No. of bedrooms and condominium binary variable). \(\beta\) is a vector of parameters that give us information about in which way and to what extent each variable is related to the price, and \(\alpha\), the constant term, is the average house price when all the other variables are zero. The term \(\epsilon_i\) is usually referred to as "error" and captures elements that influence the price of a house but are not included in \(X\). We can also express this relation in matrix form, excluding subindices for \(i\), which yields:
\[
P = \alpha + \mathbf{X}\beta + \epsilon
\]
A regression can be seen as a multivariate extension of bivariate correlations. Indeed, one way to interpret the \(\beta_k\) coefficients in the equation above is as the degree of correlation between the explanatory variable \(k\) and the dependent variable, keeping all the other explanatory variables constant. When one calculates bivariate correlations, the coefficient of a variable is picking up the correlation between the variables, but it is also subsuming into it variation associated with other correlated variables, also called confounding factors. Regression allows us to isolate the distinct effect that a single variable has on the dependent one, once we control for those other variables.
Practically speaking, linear regressions in Python are rather streamlined and easy to work with. There are also several packages which will run them (e.g. statsmodels
, scikit-learn
, pysal
). We will import the spreg
module in Pysal:
```python
from pysal.model import spreg
```
In the context of this chapter, it makes sense to start with spreg
as that is the only library that will allow us to move into explicitly spatial econometric models. To fit the model specified in the equation above with \(X\) as the list defined, we only need the following code:
```python
# Fit OLS model
m1 = spreg.OLS(
    # Dependent variable
    db[["log_price"]].values,
    # Independent variables
    db[variable_names].values,
    # Dependent variable name
    name_y="log_price",
    # Independent variable name
    name_x=variable_names,
)
```
We use the command OLS
, part of the spreg
sub-package, and specify the dependent variable (the log of the price, so we can interpret results in terms of percentage change) and the explanatory ones. Note that both objects need to be arrays, so we extract them from the pandas.DataFrame
object using .values
.
In order to inspect the results of the model, we can print the summary
attribute:
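```python
print(m1.summary)
```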
```
REGRESSION
----------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:        6110
Mean dependent var  :      4.9958                Number of Variables   :          11
S.D. dependent var  :      0.8072                Degrees of Freedom    :        6099
R-squared           :      0.6683
Adjusted R-squared  :      0.6678
Sum squared residual:    1320.148                F-statistic           :   1229.0564
Sigma-square        :       0.216                Prob(F-statistic)     :           0
S.E. of regression  :       0.465                Log likelihood        :   -3988.895
Sigma-square ML     :       0.216                Akaike info criterion :    7999.790
S.E of regression ML:      0.4648                Schwarz criterion     :    8073.685

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT       4.3883830       0.0161147     272.3217773       0.0000000
        accommodates       0.0834523       0.0050781      16.4336318       0.0000000
           bathrooms       0.1923790       0.0109668      17.5419773       0.0000000
            bedrooms       0.1525221       0.0111323      13.7009195       0.0000000
                beds      -0.0417231       0.0069383      -6.0134430       0.0000000
     rt_Private_room      -0.5506868       0.0159046     -34.6244758       0.0000000
      rt_Shared_room      -1.2383055       0.0384329     -32.2198992       0.0000000
      pg_Condominium       0.1436347       0.0221499       6.4846529       0.0000000
            pg_House      -0.0104894       0.0145315      -0.7218393       0.4704209
            pg_Other       0.1411546       0.0228016       6.1905633       0.0000000
        pg_Townhouse      -0.0416702       0.0342758      -1.2157316       0.2241342
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           11.964

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2       2671.611           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test               10        322.532           0.0000
Koenker-Bassett test             10        135.581           0.0000
================================ END OF REPORT =====================================
```
A full detailed explanation of the output is beyond the scope of this chapter, so we will focus on the relevant bits for our main purpose. This is concentrated on the Coefficients
section, which gives us the estimates for \(\beta_k\) in our model. In other words, these numbers express the relationship between each explanatory variable and the dependent one, once the effect of confounding factors has been accounted for. Keep in mind however that regression is no magic; we are only discounting the effect of confounding factors that we include in the model, not of all potentially confounding factors.
Results are largely as expected: houses tend to be significantly more expensive if they accommodate more people (accommodates
), if they have more bathrooms and bedrooms, and if they are a condominium or part of the “other” category of house type. Conversely, given a number of rooms, houses with more beds (i.e.. listings that are more “crowded”) tend to go for cheaper, as it is the case for properties where one does not rent the entire house but only a room (rt_Private_room
) or even shares it (rt_Shared_room
). Of course, you might conceptually doubt the assumption that it is possible to arbitrarily change the number of beds within an AirBnB without eventually changing the number of people it accommodates, but methods to address these concerns using interaction effects won’t be discussed here.
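A note on magnitudes: because the dependent variable is the log of the nightly price, a coefficient \(\beta_k\) is usually read as an approximate percentage change; the exact change is \(100(e^{\beta_k}-1)\)%. A quick sketch of that conversion, using the pg_Condominium estimate from the summary above:

```python
# Approximate vs exact percentage effect of a log-price coefficient
beta_condo = 0.1436347  # pg_Condominium estimate from the OLS summary
print(
    f"approximate: {beta_condo * 100:.1f}%, "
    f"exact: {(numpy.exp(beta_condo) - 1) * 100:.1f}%"
)
```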
## Hidden Structures
In general, our model performs well, predicting about two-thirds (\(R^2 = 0.67\)) of the variation in the mean nightly price using the covariates we've discussed above.
But, our model might display some clustering in the errors, which may be a problem, as it violates the assumption of independent, identically distributed (i.i.d.) errors that linear models are usually built on.
To interrogate this, we can do a few things.
One simple concept might be to look at the correlation between the error in predicting an AirBnB and the error in predicting its nearest neighbor.
To examine this, we first might want to split our data up by regions and see if we’ve got some spatial structure in our residuals.
One reasonable theory might be that our model does not include any information about beaches, a critical aspect of why people live and vacation in San Diego.
Therefore, we might want to see whether or not our errors are higher or lower depending on whether or not an AirBnB is in a “beach” neighborhood, a neighborhood near the ocean:
```python
# Create a Boolean (True/False) with whether a
# property is coastal or not
is_coastal = db.coastal.astype(bool)
# Split residuals (m1.u) between coastal and not
coastal = m1.u[is_coastal]
not_coastal = m1.u[~is_coastal]
# Create histogram of the distribution of coastal residuals
plt.hist(coastal, density=True, label="Coastal")
# Create histogram of the distribution of non-coastal residuals
plt.hist(
    not_coastal,
    histtype="step",
    density=True,
    linewidth=4,
    label="Not Coastal",
)
# Add Line on 0
plt.vlines(0, 0, 1, linestyle=":", color="k", linewidth=4)
# Add legend
plt.legend()
# Display
plt.show()
```
While it appears that the neighborhoods on the coast have only slightly higher average errors (and have lower variance in their prediction errors), the two distributions are significantly distinct from one another when compared using a classic \(t\)-test:
```python
from scipy.stats import ttest_ind

ttest_ind(coastal, not_coastal)
```
Ttest_indResult(statistic=array([13.98193858]), pvalue=array([9.442438e-44]))
There are more sophisticated (and harder to fool) tests that may be applicable for this data, however. We cover them in the Challenge section.
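As a flavor of what such a check might look like, a rank-based test such as the Mann-Whitney U, which does not assume normally distributed residuals, can be run in one line. This is only an illustration on our part, not the test covered in the Challenge:

```python
from scipy.stats import mannwhitneyu

# Rank-based comparison of coastal vs. non-coastal residuals,
# flattened because the residuals are stored as Nx1 arrays
mannwhitneyu(coastal.flatten(), not_coastal.flatten())
```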
Additionally, it might be the case that some neighborhoods are more desirable than other neighborhoods due to unmodeled latent preferences or marketing.
For instance, despite being close to the sea, living near Camp Pendleton (a Marine base in the north of the city) may incur some significant penalties on area desirability due to noise and pollution. These are questions that domain knowledge provides and data analysis can help us answer.
For us to determine whether this is the case, we might be interested in the full distribution of model residuals within each neighborhood.
To make this more clear, we’ll first sort the data by the median residual in that neighborhood, and then make a box plot, which shows the distribution of residuals in each neighborhood:
```python
# Create column with residual values from m1
db["residual"] = m1.u
# Obtain the median value of residuals in each neighbourhood
medians = (
    db.groupby("neighborhood")
    .residual.median()
    .to_frame("hood_residual")
)

# Increase fontsize
seaborn.set(font_scale=1.25)
# Set up figure
f = plt.figure(figsize=(15, 3))
# Grab figure's axis
ax = plt.gca()
# Generate boxplot of values by neighbourhood
# Note the data includes the median values merged on-the-fly
seaborn.boxplot(
    "neighborhood",
    "residual",
    ax=ax,
    data=db.merge(
        medians, how="left", left_on="neighborhood", right_index=True
    ).sort_values("hood_residual"),
    palette="bwr",
)
# Auto-format of the X labels
f.autofmt_xdate()
# Display
plt.show()
```
No neighborhood is disjoint from one another, but some do appear to be higher than others, such as the well-known downtown tourist neighborhoods of the Gaslamp Quarter, Little Italy, or the Core.
Thus, there may be a distinctive effect of intangible neighborhood fashionableness that matters in this model.
Noting that many of the most over- and under-predicted neighborhoods are near one another in the city, it may also be the case that there is some sort of contagion or spatial spillovers in the nightly rent price.
This often is apparent when individuals seek to price their AirBnB listings to compete with similar nearby listings.
Since our model is not aware of this behavior, its errors may tend to cluster.
One exceptionally simple way we can look into this structure is by examining the relationship between an observation’s residuals and its surrounding residuals.
To do this, we will use spatial weights to represent the geographic relationships between observations.
We cover spatial weights in detail in Chapter 4, so we will not repeat ourselves here.
For this example, we'll start off with a \(KNN\) matrix where \(k=1\), meaning we're focusing only on the link between each AirBnB and its single closest listing.
```python
knn = weights.KNN.from_dataframe(db, k=1)
```
This means that, when we compute the spatial lag of the residual using that \(KNN\) weight, we get the residual of the AirBnB listing closest to each observation.
```python
lag_residual = weights.spatial_lag.lag_spatial(knn, m1.u)
ax = seaborn.regplot(
    m1.u.flatten(),
    lag_residual.flatten(),
    line_kws=dict(color="orangered"),
    ci=None,
)
ax.set_xlabel("Model Residuals - $u$")
ax.set_ylabel("Spatial Lag of Model Residuals - $W u$");
```
In this plot, we see that our prediction errors tend to cluster!
Above, we show the relationship between our prediction error at each site and the prediction error at the site nearest to it.
Here, we’re using this nearest site to stand in for the surroundings of that AirBnB.
This means that, when the model tends to over-predict a given AirBnB’s nightly log price, sites around that AirBnB are more likely to also be over-predicted. An interesting property of this relationship is that it tends to stabilize as the number of nearest neighbors used to construct each AirBnB’s surroundings increases.
Consult the Challenge section for more on this property.
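One way to see this stabilization, sketched below, is to recompute the correlation between the residuals and their spatial lag for increasing values of \(k\). The weights built here are throwaway copies, so the knn object above is left untouched:

```python
# Correlation between residuals and their spatial lag as k grows
for k in [1, 5, 10, 20, 50]:
    w_k = weights.KNN.from_dataframe(db, k=k)
    w_k.transform = "R"  # row-standardise so the lag is an average
    lag_k = weights.spatial_lag.lag_spatial(w_k, m1.u)
    r = numpy.corrcoef(m1.u.flatten(), lag_k.flatten())[0, 1]
    print(f"k={k:>2}: correlation = {r:.3f}")
```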
Given this behavior, let's look at the case of \(k=20\) nearest neighbors, where this relationship stabilizes.
Examining the relationship between this stable surrounding average and the focal AirBnB, we can even find clusters in our model error.
Recalling the local Moran statistics in Chapter 7, we can identify certain areas where our predictions of the nightly (log) AirBnB price tend to be significantly off:
```python
# Re-weight W to 20 nearest neighbors
knn.reweight(k=20, inplace=True)
# Row standardise weights
knn.transform = "R"
# Run LISA on residuals
outliers = esda.moran.Moran_Local(m1.u, knn, permutations=9999)
# Select only LISA cluster cores
error_clusters = outliers.q % 2 == 1
# Filter out non-significant clusters
error_clusters &= outliers.p_sim <= 0.001
# Add `error_clusters` and `local_I` columns
ax = (
    db.assign(
        error_clusters=error_clusters, local_I=outliers.Is
    )
    # Retain error clusters only
    .query("error_clusters")
    # Sort by I value to largest plot on top
    .sort_values("local_I")
    # Plot I values
    .plot("local_I", cmap="bwr", marker=".")
)
# Add basemap
contextily.add_basemap(ax, crs=db.crs)
# Remove axes
ax.set_axis_off();
```
Thus, these areas tend to be locations where our model significantly under-predicts the nightly AirBnB price both for that specific observation and observations in its immediate surroundings.
This is critical since, if we can identify how these areas are structured — if they have a consistent geography that we can model — then we might make our predictions even better, or at least not systematically mis-predict prices in some areas while correctly predicting prices in other areas. Since significant under- and over-predictions do appear to cluster in a highly structured way, we might be able to use a better model to fix the geography of our model errors.
## Bringing space into the regression framework
There are many different ways that spatial structure shows up in our models, predictions, and our data, even if we do not explicitly intend to study it.
Fortunately, there are nearly as many techniques, called spatial regression methods, that are designed to handle these sorts of structures.
Spatial regression is about explicitly introducing space or geographical context into the statistical framework of a regression.
Conceptually, we want to introduce space into our model whenever we think it plays an important role in the process we are interested in, or when space can act as a reasonable proxy for other factors we cannot measure but should include in our model.
As an example of the former, we can imagine how houses at the seafront are probably more expensive than those in the second row, given their better views.
To illustrate the latter, we can think of how the character of a neighborhood is important in determining the price of a house; however, it is very hard to identify and quantify “character” per se, although it might be easier to get at its spatial variation, hence a case of space as a proxy.
Spatial regression is a large field of development in the econometrics and statistics literatures.
In this brief introduction, we will consider two related but very different processes that give rise to spatial effects: spatial heterogeneity and spatial dependence. Before diving into them, we begin with another approach that introduces space in a regression model without modifying the model itself but rather creates spatially explicit independent variables.
For more rigorous treatments of the topics introduced here, we refer you to [Ans03, AR14, GH06].
### Spatial Feature Engineering: proximity variables
Using geographic information to “construct” new data is a common approach to bring in spatial information into data analysis.
Often, this reflects the fact that processes are not the same everywhere in the map of analysis, or that geographical information may be useful to predict our outcome of interest. In this section, we will briefly present how to insert spatial features, or \(X\) variables that are constructed from geographical relationships, into a standard linear model. We discuss spatial feature engineering extensively in Chapter 12, though, and the depth and extent of spatial feature engineering is difficult to overstate. Rather than going into that detail here, this section will show how spatially explicit variables you engineer can be "plugged" into a model to improve its performance or help you explain the underlying process of interest with more accuracy.
One relevant proximity-driven variable that could influence our San Diego model is each listing's proximity to Balboa Park. A common tourist destination, Balboa Park is a central recreation hub for the city of San Diego, containing many museums and the San Diego Zoo. Thus, it could be the case that people searching for AirBnBs in San Diego are willing to pay a premium to live closer to the park. If this were true and we omitted it from our model, we may indeed see a significant spatial pattern caused by this distance decay effect.
This is sometimes called a spatially-patterned omitted covariate: geographic information our model needs to make good predictions but which we have left out. Let's build a new model containing this distance to Balboa Park covariate. First, though, it helps to visualize the structure of the distance covariate itself:
```python
ax = db.plot("d2balboa", marker=".", s=5)
contextily.add_basemap(ax, crs=db.crs)
ax.set_axis_off();
```
To run a linear model that includes the additional variable of distance to the park, we add the name to the list of variables we included originally:
```python
balboa_names = variable_names + ["d2balboa"]
```
And then fit the model using the OLS class in Pysal’s spreg
:
```python
m2 = spreg.OLS(
    db[["log_price"]].values,
    db[balboa_names].values,
    name_y="log_price",
    name_x=balboa_names,
)
```
When you inspect the regression diagnostics and output, you see that this covariate is not quite as helpful as we might anticipate:
```python
pandas.DataFrame(
    [[m1.r2, m1.ar2], [m2.r2, m2.ar2]],
    index=["M1", "M2"],
    columns=["R2", "Adj. R2"],
)
```
 | R2 | Adj. R2 |
---|---|---|
M1 | 0.668345 | 0.667801 |
M2 | 0.668502 | 0.667904 |
It is not statistically significant at conventional significance levels, and the model fit does not substantially change:
```python
# Set up table of regression coefficients
pandas.DataFrame(
    {
        # Pull out regression coefficients and
        # flatten as they are returned as Nx1 array
        "Coeff.": m2.betas.flatten(),
        # Pull out and flatten standard errors
        "Std. Error": m2.std_err.flatten(),
        # Pull out P-values from t-stat object
        "P-Value": [i[1] for i in m2.t_stat],
    },
    index=m2.name_x,
)
```
 | Coeff. | Std. Error | P-Value |
---|---|---|---|
CONSTANT | 4.379624 | 0.016915 | 0.000000e+00 |
accommodates | 0.083644 | 0.005079 | 1.156896e-59 |
bathrooms | 0.190791 | 0.011005 | 9.120139e-66 |
bedrooms | 0.150746 | 0.011179 | 7.418035e-41 |
beds | -0.041476 | 0.006939 | 2.394322e-09 |
rt_Private_room | -0.552996 | 0.015960 | 2.680270e-240 |
rt_Shared_room | -1.235521 | 0.038462 | 2.586867e-209 |
pg_Condominium | 0.140459 | 0.022225 | 2.803765e-10 |
pg_House | -0.013302 | 0.014623 | 3.630396e-01 |
pg_Other | 0.141176 | 0.022798 | 6.309880e-10 |
pg_Townhouse | -0.045784 | 0.034356 | 1.826992e-01 |
d2balboa | 0.001645 | 0.000967 | 8.902052e-02 |
And, there still appears to be spatial structure in our model’s errors:
```python
lag_residual = weights.spatial_lag.lag_spatial(knn, m2.u)
ax = seaborn.regplot(
    m2.u.flatten(),
    lag_residual.flatten(),
    line_kws=dict(color="orangered"),
    ci=None,
)
ax.set_xlabel("Residuals ($u$)")
ax.set_ylabel("Spatial lag of residuals ($Wu$)");
```
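To attach a single number to these plots, a global Moran's I on each model's residuals is a common check. A minimal sketch, using the esda package imported at the top and the \(k=20\), row-standardised knn matrix built earlier:

```python
# Global Moran's I of the residuals of the two models so far
for label, resid in [("M1", m1.u), ("M2", m2.u)]:
    mi = esda.moran.Moran(resid.flatten(), knn, permutations=999)
    print(f"{label}: Moran's I = {mi.I:.3f}, pseudo p-value = {mi.p_sim:.3f}")
```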
Finally, the distance to Balboa Park variable does not fit our theory about how distance to amenities should affect the price of an AirBnB; the coefficient estimate is positive, meaning that people are paying a premium to be further from the park. We will revisit this result later on, when we consider spatial heterogeneity, and will be able to shed some light on it. Further, the next chapter is an extensive treatment of spatial feature engineering, presenting many more methods of this kind. Here, we have only shown how to include these engineered features in a standard linear modeling framework.
### Spatial Heterogeneity
So far we have assumed that our proximity variable might stand in for a difficult-to-measure premium individuals pay when they're close to a recreational zone (a park in this case). Our approach in that case was to incorporate space through a very specific channel, that is, the distance to an amenity we thought might be influencing the final price. However, not all neighborhoods have the same house prices; some neighborhoods may be systematically more expensive than others, regardless of their proximity to Balboa Park. If this is the case, we need some way to account for the fact that each neighborhood may experience these kinds of gestalt, unique effects. One way to do this is by capturing spatial heterogeneity. At its most basic, spatial heterogeneity means that parts of the model may vary systematically with geography, changing in different places. For example, changes to the intercept, \(\alpha\), may reflect the fact that different areas have different baseline exposures to a given process. Changes to the slope terms, \(\beta\), may indicate some kind of geographical mediating factor that makes the relationship between the independent and dependent variables vary across space, such as when a governmental policy is not consistently applied across jurisdictions. Finally, changes to the variance of the residuals, commonly denoted \(\sigma^2\), can introduce spatial heteroskedasticity. We deal with the first two in this section.
#### Spatial Fixed Effects
The literature commonly refers to geographic variations of \(\alpha\) as "spatial fixed effects". To illustrate them, let us consider the house price example from the previous section. Sometimes, spatial fixed effects are said to capture "space as a proxy", in that we know the outcome varies over space, we (hope to) know the pattern it follows (in our case, by neighbourhood), and we can thus incorporate that knowledge into the model by letting \(\alpha\) vary accordingly. The rationale goes as follows. Given we are only including a few explanatory variables in the model, it is likely we are missing some important factors that play a role in determining the price at which a house is sold. Some of them, however, are likely to vary systematically over space (e.g. different neighborhood characteristics). If that is the case, we can control for those unobserved factors by using traditional binary variables but basing their creation on a spatial rule. For example, let us include a binary variable for every neighborhood, indicating whether a given house is located within that area (1) or not (0). Mathematically, we are now fitting the following equation:
\[
\log{P_i} = \alpha_r + \sum_k \mathbf{X}_{ik}\beta_k + \epsilon_i
\]
where the main difference is that we are now allowing the constant term, \(\alpha\), to vary by neighborhood \(r\), i.e. \(\alpha_r\).
Programmatically, we will show two different ways we can estimate this: one,
using statsmodels
; and two, with spreg
. First, we will use statsmodels
, the econometrician’s toolbox in Python.
```python
import statsmodels.formula.api as sm
```
This package provides a formula-like API, which allows us to express the equation we wish to estimate directly:
```python
f = (
    "log_price ~ "
    + " + ".join(variable_names)
    + " + neighborhood - 1"
)
print(f)
```
log_price ~ accommodates + bathrooms + bedrooms + beds + rt_Private_room + rt_Shared_room + pg_Condominium + pg_House + pg_Other + pg_Townhouse + neighborhood - 1
The tilde operator in this statement is usually read as “log price is a function of …”, to account for the fact that many different model specifications can be fit according to that functional relationship between log_price
and our covariate list. Critically, note that the trailing -1
term means that we are fitting this model without an intercept term. This is necessary, since including an intercept term alongside unique means for every neighborhood would make the underlying system of equations underspecified.
Using this expression, we can estimate the unique effects of each neighborhood, fitting the model in statsmodels
(note how the specification of the model, formula and data, is separated from the fitting step):
```python
m3 = sm.ols(f, data=db).fit()
```
We could rely on the summary2()
method to print a similar summary report from the regression but, given it is a lengthy one in this case, we will illustrate how you can extract the spatial fixed effects into a table for display.
```python
# Store variable names for all the spatial fixed effects
sfe_names = [i for i in m3.params.index if "neighborhood[" in i]
# Create table
pandas.DataFrame(
    {
        "Coef.": m3.params[sfe_names],
        "Std. Error": m3.bse[sfe_names],
        "P-Value": m3.pvalues[sfe_names],
    }
)
```
 | Coef. | Std. Error | P-Value |
---|---|---|---|
neighborhood[Balboa Park] | 4.280766 | 0.033292 | 0.0 |
neighborhood[Bay Ho] | 4.198251 | 0.076878 | 0.0 |
neighborhood[Bay Park] | 4.329223 | 0.050987 | 0.0 |
neighborhood[Carmel Valley] | 4.389261 | 0.056553 | 0.0 |
neighborhood[City Heights West] | 4.053518 | 0.058378 | 0.0 |
neighborhood[Clairemont Mesa] | 4.095259 | 0.047699 | 0.0 |
neighborhood[College Area] | 4.033697 | 0.058258 | 0.0 |
neighborhood[Core] | 4.726186 | 0.052643 | 0.0 |
neighborhood[Cortez Hill] | 4.608090 | 0.051526 | 0.0 |
neighborhood[Del Mar Heights] | 4.496910 | 0.054337 | 0.0 |
neighborhood[East Village] | 4.545469 | 0.029373 | 0.0 |
neighborhood[Gaslamp Quarter] | 4.775799 | 0.047304 | 0.0 |
neighborhood[Grant Hill] | 4.306742 | 0.052365 | 0.0 |
neighborhood[Grantville] | 4.053298 | 0.071396 | 0.0 |
neighborhood[Kensington] | 4.302671 | 0.077176 | 0.0 |
neighborhood[La Jolla] | 4.682084 | 0.025809 | 0.0 |
neighborhood[La Jolla Village] | 4.330311 | 0.077237 | 0.0 |
neighborhood[Linda Vista] | 4.191149 | 0.056916 | 0.0 |
neighborhood[Little Italy] | 4.666742 | 0.046838 | 0.0 |
neighborhood[Loma Portal] | 4.301909 | 0.033236 | 0.0 |
neighborhood[Marina] | 4.558298 | 0.047994 | 0.0 |
neighborhood[Midtown] | 4.366661 | 0.028394 | 0.0 |
neighborhood[Midtown District] | 4.584938 | 0.065087 | 0.0 |
neighborhood[Mira Mesa] | 3.989562 | 0.056101 | 0.0 |
neighborhood[Mission Bay] | 4.515479 | 0.022422 | 0.0 |
neighborhood[Mission Valley] | 4.275960 | 0.074231 | 0.0 |
neighborhood[Moreno Mission] | 4.400942 | 0.056730 | 0.0 |
neighborhood[Normal Heights] | 4.097400 | 0.049022 | 0.0 |
neighborhood[North Clairemont] | 3.984440 | 0.069149 | 0.0 |
neighborhood[North Hills] | 4.253425 | 0.025478 | 0.0 |
neighborhood[Northwest] | 4.173752 | 0.069728 | 0.0 |
neighborhood[Ocean Beach] | 4.437164 | 0.030088 | 0.0 |
neighborhood[Old Town] | 4.420160 | 0.041893 | 0.0 |
neighborhood[Otay Ranch] | 4.185941 | 0.081597 | 0.0 |
neighborhood[Pacific Beach] | 4.438829 | 0.022417 | 0.0 |
neighborhood[Park West] | 4.440907 | 0.044768 | 0.0 |
neighborhood[Rancho Bernadino] | 4.180906 | 0.072010 | 0.0 |
neighborhood[Rancho Penasquitos] | 4.162428 | 0.061776 | 0.0 |
neighborhood[Roseville] | 4.386992 | 0.058623 | 0.0 |
neighborhood[San Carlos] | 4.334991 | 0.083040 | 0.0 |
neighborhood[Scripps Ranch] | 4.082380 | 0.076244 | 0.0 |
neighborhood[Serra Mesa] | 4.312967 | 0.059925 | 0.0 |
neighborhood[South Park] | 4.225311 | 0.053643 | 0.0 |
neighborhood[University City] | 4.193718 | 0.036965 | 0.0 |
neighborhood[West University Heights] | 4.297672 | 0.043134 | 0.0 |
The approach above shows how spatial FE are a particular case of a linear regression with a categorical variable. Neighborhood membership is modeled using binary dummy variables. Thanks to the formula grammar used in statsmodels
, we can express the model abstractly, and Python parses it, appropriately creating binary variables as required.
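For intuition, the same design matrix can be assembled by hand. A minimal sketch of what the formula is doing under the hood (the names neigh_dummies and X_fe are ours):

```python
# Build the neighborhood dummies explicitly and attach them to the
# original covariates; this mirrors the design matrix implied by the
# formula above (one column per neighborhood, no constant term)
neigh_dummies = pandas.get_dummies(db["neighborhood"])
X_fe = db[variable_names].join(neigh_dummies)
print(X_fe.shape)
```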
The second approach leverages spreg
Regimes functionality. We will see regimes below but, for now, think of them as a generalisation of spatial fixed effects where not only \(\alpha\) can vary. This framework allows the user to specify which variables are to be estimated separately for each group. In this case, instead of describing the model in a formula, we need to pass each element of the model as separate arguments.
```python
# spreg spatial fixed effect implementation
m4 = spreg.OLS_Regimes(
    # Dependent variable
    db[["log_price"]].values,
    # Independent variables
    db[variable_names].values,
    # Variable specifying neighborhood membership
    db["neighborhood"].tolist(),
    # Allow the constant term to vary by group/regime
    constant_regi="many",
    # Variables to be allowed to vary (True) or kept
    # constant (False). Here we set all to False
    cols2regi=[False] * len(variable_names),
    # Allow separate sigma coefficients to be estimated
    # by regime (False so a single sigma)
    regime_err_sep=False,
    # Dependent variable name
    name_y="log_price",
    # Independent variables names
    name_x=variable_names,
)
```
Similarly as above, we could rely on the summary
attribute to print a report with all the results computed. For simplicity here, we will only confirm that, to the 12th decimal, the parameters estimated are indeed the same as those we get from statsmodels
:
```python
import numpy

numpy.round(m4.betas.flatten() - m3.params.values, decimals=12)
```
array([ 0.e+00, -0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, -1.e-12, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, -0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00, -0.e+00, -0.e+00, 0.e+00, -0.e+00, -0.e+00, 0.e+00, 0.e+00, 0.e+00, 0.e+00])
Econometrically speaking, what the neighborhood FEs we have introduced imply is that, instead of comparing all house prices across San Diego as equal, we only derive variation from within each neighborhood. Remember that the interpretation of \(\beta_k\) is the effect of variable \(k\), given all the other explanatory variables included remain constant. By including a binary variable for each area, we are effectively forcing the model to compare as equal only house prices that share the same value for each of those variables; or, in other words, only houses located within the same area. Introducing FE affords a higher degree of isolation of the effects of the variables we introduce in the model because we can control for unobserved effects that align spatially with the distribution of the FE introduced (by neighborhood, in our case). To make a map of neighborhood fixed effects, we need to process the results from our model slightly.
First, we extract only the effects pertaining to the neighborhoods:
```python
neighborhood_effects = m3.params.filter(like="neighborhood")
neighborhood_effects.head()
```
```
neighborhood[Balboa Park]          4.280766
neighborhood[Bay Ho]               4.198251
neighborhood[Bay Park]             4.329223
neighborhood[Carmel Valley]        4.389261
neighborhood[City Heights West]    4.053518
dtype: float64
```
Then, we need to extract just the neighborhood name from the index of this Series. A simple way to do this is to strip all the characters that come before and after our neighborhood names:
```python
# Create a sequence with the variable names without
# `neighborhood[` and `]`
stripped = neighborhood_effects.index.str.strip(
    "neighborhood["
).str.strip("]")
# Reindex the neighborhood_effects Series on clean names
neighborhood_effects.index = stripped
# Convert Series to DataFrame
neighborhood_effects = neighborhood_effects.to_frame("fixed_effect")
# Print top of table
neighborhood_effects.head()
```
 | fixed_effect |
---|---|
Balboa Park | 4.280766 |
Bay Ho | 4.198251 |
Bay Park | 4.329223 |
Carmel Valley | 4.389261 |
City Heights West | 4.053518 |
Good, we're back to our raw neighborhood names. These allow us to join the effects to an auxiliary file with neighborhood boundaries that is indexed on the same names. Let's read the boundaries first:
```python
sd_path = "../data/airbnb/neighbourhoods.geojson"
neighborhoods = geopandas.read_file(sd_path)
```
And we can then merge the spatial fixed effects and plot them on a map:
```python
# Plot base layer with all neighborhoods in grey
ax = neighborhoods.plot(
    color="k", linewidth=0, alpha=0.5, figsize=(12, 6)
)
# Merge SFE estimates (note not every polygon
# receives an estimate since not every polygon
# contains AirBnb properties)
neighborhoods.merge(
    neighborhood_effects,
    how="left",
    left_on="neighbourhood",
    right_index=True
    # Drop polygons without a SFE estimate
).dropna(
    subset=["fixed_effect"]
    # Plot quantile choropleth
).plot(
    "fixed_effect",  # Variable to display
    scheme="quantiles",  # Choropleth scheme
    k=7,  # No. of classes in the choropleth
    linewidth=0.1,  # Polygon border width
    cmap="viridis",  # Color scheme
    ax=ax,  # Axis to draw on
)
# Add basemap
contextily.add_basemap(
    ax,
    crs=neighborhoods.crs,
    source=contextily.providers.CartoDB.PositronNoLabels,
)
# Remove axis
ax.set_axis_off()
# Display
plt.show()
```
We can see a clear spatial structure in the SFE estimates. The most expensive neighborhoods tend to be located nearby the coast, while the cheapest ones are more inland.
#### Spatial Regimes
At the core of estimating spatial FEs is the idea that, instead of assuming the dependent variable behaves uniformly over space, there are systematic effects following a geographical pattern that affect its behavior. In other words, spatial FEs introduce econometrically the notion of spatial heterogeneity. They do this in the simplest possible form: by allowing the constant term to vary geographically. The other elements of the regression are left untouched and hence apply uniformly across space. The idea of spatial regimes (SRs) is to generalize the spatial FE approach to allow not only the constant term to vary but also any other explanatory variable. This implies that the equation we will be estimating is:
\[
\log{P_i} = \alpha_r + \sum_k \mathbf{X}_{ik}\beta_{k-r} + \epsilon_i
\]
where we are not only allowing the constant term to vary by region (\(\alpha_r\)), but also every other parameter (\(\beta_{k-r}\)).
To illustrate this approach, we will use the "spatial differentiator" of whether a house is in a coastal neighborhood or not (the coastal binary variable in our dataset) to define the regimes. The rationale behind this choice is that renting a house close to the ocean might be a strong enough pull that people might be willing to pay at different rates for each of the house's characteristics.
To implement this in Python, we use the OLS_Regimes
class in spreg
, which does most of the heavy lifting for us:
```python
# Pysal spatial regimes implementation
m5 = spreg.OLS_Regimes(
    # Dependent variable
    db[["log_price"]].values,
    # Independent variables
    db[variable_names].values,
    # Variable specifying the regime (coastal or not)
    db["coastal"].tolist(),
    # Allow the constant term to vary by group/regime
    constant_regi="many",
    # Allow separate sigma coefficients to be estimated
    # by regime (False so a single sigma)
    regime_err_sep=False,
    # Dependent variable name
    name_y="log_price",
    # Independent variables names
    name_x=variable_names,
)
```
The result can be explored and interpreted similarly to the previous ones. If you inspect the summary
attribute, you will find the parameters for each variable mostly conform to what you would expect, across both regimes. To compare them, we can plot them side by side on a bespoke table:
```python
# Results table
res = pandas.DataFrame(
    {
        # Pull out regression coefficients and
        # flatten as they are returned as Nx1 array
        "Coeff.": m5.betas.flatten(),
        # Pull out and flatten standard errors
        "Std. Error": m5.std_err.flatten(),
        # Pull out P-values from t-stat object
        "P-Value": [i[1] for i in m5.t_stat],
    },
    index=m5.name_x,
)
# Coastal regime
## Extract variables for the coastal regime
coastal = [i for i in res.index if "1_" in i]
## Subset results to coastal and remove the 1_ underscore
coastal = res.loc[coastal, :].rename(lambda i: i.replace("1_", ""))
## Build multi-index column names
coastal.columns = pandas.MultiIndex.from_product(
    [["Coastal"], coastal.columns]
)
# Non-coastal model
## Extract variables for the non-coastal regime
ncoastal = [i for i in res.index if "0_" in i]
## Subset results to non-coastal and remove the 0_ underscore
ncoastal = res.loc[ncoastal, :].rename(lambda i: i.replace("0_", ""))
## Build multi-index column names
ncoastal.columns = pandas.MultiIndex.from_product(
    [["Non-coastal"], ncoastal.columns]
)
# Concat both models
pandas.concat([coastal, ncoastal], axis=1)
```
 | Coastal: Coeff. | Coastal: Std. Error | Coastal: P-Value | Non-coastal: Coeff. | Non-coastal: Std. Error | Non-coastal: P-Value |
---|---|---|---|---|---|---|
CONSTANT | 4.479904 | 0.025094 | 0.000000e+00 | 4.407242 | 0.021516 | 0.000000e+00 |
accommodates | 0.048464 | 0.007881 | 8.253761e-10 | 0.090186 | 0.006474 | 1.893020e-43 |
bathrooms | 0.247478 | 0.016566 | 1.381278e-49 | 0.143376 | 0.014268 | 1.418804e-23 |
bedrooms | 0.189740 | 0.017923 | 5.783965e-26 | 0.112963 | 0.013827 | 3.731742e-16 |
beds | -0.050608 | 0.010743 | 2.522348e-06 | -0.026272 | 0.008838 | 2.964354e-03 |
rt_Private_room | -0.558628 | 0.028312 | 4.723759e-84 | -0.529334 | 0.018918 | 3.546091e-162 |
rt_Shared_room | -1.052854 | 0.084174 | 1.836512e-35 | -1.224459 | 0.042597 | 1.657163e-170 |
pg_Condominium | 0.204447 | 0.033943 | 1.810152e-09 | 0.105307 | 0.028131 | 1.831822e-04 |
pg_House | 0.075353 | 0.023378 | 1.274269e-03 | -0.045447 | 0.017957 | 1.140318e-02 |
pg_Other | 0.295485 | 0.038645 | 2.394157e-14 | 0.060753 | 0.027637 | 2.796727e-02 |
pg_Townhouse | -0.073508 | 0.049367 | 1.365396e-01 | -0.010397 | 0.045673 | 8.199294e-01 |
An interesting question arises around the relevance of the regimes. Are estimates for each variable across regimes statistically different? For this, the model object also calculates for us what is called a Chow test. This is a statistic that tests the null hypothesis that estimates from different regimes are indistinguishable. If we reject the null, we have evidence suggesting the regimes actually make a difference.
Results from the Chow test are available on the summary
attribute, or we can extract them directly from the model object, which we will do here. There are two types of Chow test. First is a global one that jointly tests for differences between the two regimes:
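A minimal way to pull the global statistic out of the fitted model (its by-variable counterpart, chow.regi, is used just below):

```python
# Global Chow test: (statistic, p-value)
m5.chow.joint
```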
(328.86902143020575, 7.113548767626822e-64)
The first value represents the statistic, while the second one captures the p-value. In this case, the two regimes are statistically different from each other. The next step then is to check whether each of the coefficients in our model differs across regimes. For this, we can pull them out into a table:
```python
pandas.DataFrame(
    # Chow results by variable
    m5.chow.regi,
    # Name of variables
    index=m5.name_x_r,
    # Column names
    columns=["Statistic", "P-value"],
)
```
 | Statistic | P-value |
---|---|---|
CONSTANT | 4.832180 | 2.793329e-02 |
accommodates | 16.735685 | 4.296522e-05 |
bathrooms | 22.671471 | 1.922004e-06 |
bedrooms | 11.503786 | 6.945459e-04 |
beds | 3.060313 | 8.022620e-02 |
rt_Private_room | 0.740097 | 3.896298e-01 |
rt_Shared_room | 3.308838 | 6.890820e-02 |
pg_Condominium | 5.057283 | 2.452265e-02 |
pg_House | 16.792503 | 4.169771e-05 |
pg_Other | 24.409876 | 7.786847e-07 |
pg_Townhouse | 0.880564 | 3.480471e-01 |
As we can see in the table, most variables do indeed differ across regimes, statistically speaking. This points to systematic differences in the data generating processes across spatial regimes.
### Spatial Dependence
As we have just discussed, SH is about effects of phenomena that are explicitly linked
to geography and that hence cause spatial variation and clustering. This
encompasses many of the kinds of spatial effects we may be interested in when we fit
linear regressions. However, in other cases, our focus is on the effect of the spatial
configuration of the observations, and the extent to which that has an effect on the
outcome we are considering. For example, we might think that the price of a house not
only depends on whether it is a townhouse or an apartment, but also on
whether it is surrounded by many more townhouses than skyscrapers with more
apartments. This, we could hypothesize, might be related to the different “look and feel” a
neighborhood with low-height, historic buildings has as compared to one with
modern high-rises. To the extent these two different spatial configurations
enter differently the house price determination process, we will be
interested in capturing not only the characteristics of a house, but also of
its surrounding ones.
This kind of spatial effect is fundamentally different
from SH in that it is not related to inherent characteristics of the geography but relates
to the characteristics of the observations in our dataset and, especially, to their spatial
arrangement. We call this phenomenon by which the values of observations are related to
each other through distance spatial dependence [Ans88].
There are several ways to introduce spatial dependence in an econometric
framework, with varying degrees of econometric sophistication (see
[Ans02] for a good overview). Common to all of them however is the way space is
formally encapsulated: through spatial weights matrices (\(\mathbf{W}\)), which we discussed in Chapter 4. In this section, we consider three ways in which spatial dependence, through spatial weights matrices, can be incorporated in a regression framework. We begin with the "least invasive" one, where we only modify the set of independent variables, and then move into more substantial modifications of the baseline linear model.
#### Exogenous effects: The SLX Model
Let us come back to the house price example we have been working with. So far, we
have hypothesized that the price of a house rented in San Diego through AirBnB can
be explained using information about its own characteristics as well as some
relating to its location such as the neighborhood or the distance to the main
park in the city. However, it is also reasonable to think that prospective renters
care about the set of neighbours a house has, not only about the house itself, and would
be willing to pay more for a house that was surrounded by certain types of houses,
and less if it was located in the middle of other types. How could we test this idea?
When it comes to regression, the most straightforward way to introduce spatial dependence between the observations in the data is by
considering not only a given explanatory variable, but also its spatial lag. Conceptually, this approach falls more within the area of spatial feature engineering, which embeds space in a model through the explanatory variables it uses rather than the functional form of the model, and which we delve into with more detail in Chapter 12. But we think it is interesting to discuss it in this context for two reasons. First, it provides “intellectual scaffolding” to learn the intuition of building spatial dependence into regression. And second, because it also illustrates how many of the techniques we cover in Chapter 12 can be embedded in a regression model (and, by extension, in other predictive approaches).
In our
example case, in addition to including a dummy for the type of house (pg_XXX
), we
can also include the spatial lag of each type of house. This addition implies
we are also including, as an explanatory factor of the price of a given house, the proportion of
neighboring houses of each type. Mathematically, this implies estimating the following model:
\[
\log(P_i) = \alpha + \sum^{p}_{k=1} X_{ik}\beta_k + \sum^{p}_{k=1}\left(\sum^{N}_{j=1}w_{ij}x_{jk}\right)\gamma_k + \epsilon_i
\]
where \(\sum_{j=1}^N w_{ij}x_{jk}\) represents the spatial lag of the \(k\)-th explanatory variable.
This can be stated in matrix form using the spatial weights matrix, \(\mathbf{W}\), as:
\[
\log(P) = \alpha + \mathbf{X}\beta + \mathbf{WX}\gamma + \epsilon
\]
This splits the model into two main effects: \(\beta\) and \(\gamma\). The
\(\beta\) effect describes the change in \(y_i\) when \(X_{ik}\) changes by one.
The subscript for site \(i\) is important here: since we're dealing
with a \(\mathbf{W}\) matrix, it's useful to be clear about where the change occurs.
Indeed, this matters for the \(\gamma\) effect, which represents the
indirect association of a change in \(X_i\) with the house price. This can be conceptualized in two ways.
First, one could think of \(\gamma\) as simply the association between the price of a given house and a unit change in its average surroundings.
This is useful and simple. But this interpretation blurs where the change
might occur. In truth, a change in a variable at site \(i\) will result in a spillover to its surroundings:
when \(x_i\) changes, so too does the spatial lag of any site near \(i\).
The precise size of the change in the lag will depend on the structure of \(\mathbf{W}\), and can be
different for every site \(i\) is connected with. For example, think of a very highly-connected "focal" site in a
row-standardized weights matrix. This focal site will not be strongly affected
if a neighbor changes by a single unit, since each neighbor only contributes a
small amount to the lag at the focal site. Alternatively, consider a site with only
one neighbor: its lag will change by exactly the amount its sole neighbor changes.
Thus, to discover the exact indirect effect on \(y\) caused by a change
at a specific site \(x_i\), you would need to compute the change in the spatial lag
and then use that as your change in \(X\). We will work through an example of this below.
In Python, we can calculate the spatial lag of each variable whose name starts with pg_ by first creating a list of all of those names, and then applying pysal's lag_spatial to each of them:
```python
# Select only columns in `db` containing the keyword `pg_`
wx = (
    db.filter(like="pg_")
    # Compute the spatial lag of each of those variables
    .apply(lambda y: weights.spatial_lag.lag_spatial(knn, y))
    # Rename the spatial lag, adding w_ to the original name
    .rename(columns=lambda c: "w_" + c)
    # Remove the lag of the binary variable for apartments
    .drop("w_pg_Apartment", axis=1)
)
```
Once computed, we can run the model using OLS estimation because, in this
context, the spatial lags included do not violate any of the assumptions OLS
relies on (they are essentially additional exogenous variables):
```python
# Merge original variables with the spatial lags in `wx`
slx_exog = db[variable_names].join(wx)
# Fit linear model with `spreg`
m6 = spreg.OLS(
    # Dependent variable
    db[["log_price"]].values,
    # Independent variables
    slx_exog.values,
    # Dependent variable name
    name_y="l_price",
    # Independent variables names
    name_x=slx_exog.columns.tolist(),
)
```
As in the previous cases, printing the summary
attribute of the model object would show a full report table. The variables we included in the original regression
display similar behavior, albeit with small changes in size, and can be
interpreted also in a similar way. To focus on the aspects that differ from the previous models here, we will only pull out results for the variables for which we also included their spatial lags:
```python
# Collect names of variables of interest
vars_of_interest = (
    db[variable_names].filter(like="pg_").join(wx).columns
)
# Build full table of regression coefficients
pandas.DataFrame(
    {
        # Pull out regression coefficients and
        # flatten as they are returned as Nx1 array
        "Coeff.": m6.betas.flatten(),
        # Pull out and flatten standard errors
        "Std. Error": m6.std_err.flatten(),
        # Pull out P-values from t-stat object
        "P-Value": [i[1] for i in m6.t_stat],
    },
    index=m6.name_x
    # Subset for variables of interest only and round to
    # four decimals
).reindex(vars_of_interest).round(4)
```
 | Coeff. | Std. Error | P-Value |
---|---|---|---|
pg_Condominium | 0.1063 | 0.0222 | 0.0000 |
pg_House | 0.0328 | 0.0157 | 0.0368 |
pg_Other | 0.0862 | 0.0240 | 0.0003 |
pg_Townhouse | -0.0277 | 0.0338 | 0.4130 |
w_pg_Condominium | 0.5928 | 0.0690 | 0.0000 |
w_pg_House | -0.0774 | 0.0319 | 0.0152 |
w_pg_Other | 0.4851 | 0.0551 | 0.0000 |
w_pg_Townhouse | -0.2724 | 0.1223 | 0.0260 |
The spatial lag of each type of property
(w_pg_XXX
) is the new addition. We observe that, except for the case
of townhouses (same as with the binary variable, pg_Townhouse
), they are all
significant, suggesting our initial hypothesis on the role of the surrounding
houses might indeed be at work here.
As an illustration, let’s look at some of the direct/indirect effects.
The direct effect of the pg_Condominium variable means that condominiums are
typically 11% more expensive (\(\beta_{pg\_Condominium}=0.1063\)) than the benchmark
property type, apartments. More relevant to this section, any given house surrounded by
condominiums also receives a price premium. But, since pg_Condominium is a dummy variable,
the spatial lag at site \(i\) represents the proportion of properties near \(i\) that are
condominiums, which lies between \(0\) and \(1\).
So, a unit change in this variable means that you would increase the condominium
share by 100 percentage points. Thus, a \(0.1\) increase in w_pg_Condominium
(a change of ten percentage points)
would result in an increase of roughly 5.9% in the property price (\(\beta_{w\_pg\_Condominium} = 0.5928\)).
Similar interpretations can be derived for all other spatially lagged variables to derive the
indirect effect of a change in the spatial lag.
To compute the indirect change for a given site \(i\), we need to examine the change in predicted prices. In this example, since we are using a row-standardized weights matrix with twenty nearest neighbors, the impact of changing \(x_i\) is the same for all of its neighbors and for any site \(i\). Thus, the effect is always \(\frac{\gamma}{20}\), or about \(0.0296\). However, it is interesting to consider that this would not be the case for many other kinds of weights (like Kernel, Queen, Rook, DistanceBand, or Voronoi), where each observation potentially has a different number of neighbors. To illustrate this, we will construct the indirect effect for a specific \(i\) in the condominium group.
First, predicted values for \(y_i\) are stored in the predy
attribute of any spreg
model:
```python
# Print first three predicted values
m6.predy[:3]
```
array([[5.27285901], [5.39966259], [4.28834686]])
To build new predictions, we need to follow the underlying equation of our model.
To illustrate the effect that a change in the values at one location has on other locations, we will switch one of the properties into the condominium category. Consider the third observation, which is the first apartment in the data:
```python
# Print values for third observation for columns spanning
# from `pg_Apartment` to `pg_Townhouse`
db.loc[2, "pg_Apartment":"pg_Townhouse"]
```
```
pg_Apartment      1
pg_Condominium    0
pg_House          0
pg_Other          0
pg_Townhouse      0
Name: 2, dtype: object
```
Let’s now make a copy of our data and change the value to magically turn the apartment into a condominium:
```python
# Make copy of the dataset
db_scenario = db.copy()
# Make Apartment 0 and condo 1 for third observation
db_scenario.loc[2, ["pg_Apartment", "pg_Condominium"]] = [0, 1]
```
We’ve successfully made the change:
db_scenario.loc[2, "pg_Apartment":"pg_Townhouse"]

pg_Apartment      0
pg_Condominium    1
pg_House          0
pg_Other          0
pg_Townhouse      0
Name: 2, dtype: object
Now, we also need to update the spatial lag variables:
# Select only columns in `db_scenario` containing the keyword `pg_`
wx_scenario = (
    db_scenario.filter(like="pg")
    # Compute the spatial lag of each of those variables
    .apply(lambda y: weights.spatial_lag.lag_spatial(knn, y))
    # Rename the spatial lag, adding w_ to the original name
    .rename(columns=lambda c: "w_" + c)
    # Remove the lag of the binary variable for apartments
    .drop("w_pg_Apartment", axis=1)
)
And build a new exogenous $\mathbf{X}$ matrix (the constant term is handled separately, through the intercept estimate, when predicting):
slx_exog_scenario = db_scenario[variable_names].join(wx_scenario)
Now, our new prediction (in the scenario where we have changed site `2` from an apartment into a condominium) can be computed by translating the model equation into Python code and plugging in the simulated values we have just created:
# Compute new set of predicted values
y_pred_scenario = m6.betas[0] + slx_exog_scenario @ m6.betas[1:]
Note the only difference between this set of predictions and the one in the original `m6` model is that we have switched site `2` from apartment to condominium. Hence, every property which is not connected to site `2` (or is not site `2` itself) will be unaffected. The neighbors of site `2`, however, will have different predictions. To explore these, let's first identify who is in this group:
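This list can be pulled directly from the weights object (a quick sketch using the `neighbors` attribute of the `knn` object used above):

# Retrieve the IDs of the twenty nearest neighbors of site 2
knn.neighbors[2]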
[772, 2212, 139, 4653, 2786, 1218, 138, 808, 1480, 4241, 1631, 3617, 2612, 1162, 135, 23, 5528, 3591, 407, 6088]
Now, the effect of changing site `2` from an apartment into a condominium is associated with the following changes to the predicted (log) price, which we calculate by subtracting the original predicted values from the new ones and subsetting to site `2` and its neighbors:
# Difference between new and original predicted values
(
    y_pred_scenario - m6.predy
    # Subset to site `2` and its neighbors
).loc[[2] + knn.neighbors[2]]
| | 0 |
|---|---|
| 2 | 0.106349 |
| 772 | 0.029642 |
| 2212 | 0.029642 |
| 139 | 0.029642 |
| 4653 | 0.029642 |
| 2786 | 0.029642 |
| 1218 | 0.029642 |
| 138 | 0.029642 |
| 808 | 0.029642 |
| 1480 | 0.029642 |
| 4241 | 0.029642 |
| 1631 | 0.029642 |
| 3617 | 0.029642 |
| 2612 | 0.029642 |
| 1162 | 0.029642 |
| 135 | 0.029642 |
| 23 | 0.029642 |
| 5528 | 0.029642 |
| 3591 | 0.029642 |
| 407 | 0.029642 |
| 6088 | 0.029642 |
We see the first row, representing the direct effect, is exactly equal to the estimate for `pg_Condominium`. For the neighbors, though, the prediction changes by about $0.03$: the marginal effect of `w_pg_Condominium` multiplied by the change in the lag itself, which equals the weight of the spatial relationship between site `2` and each neighbor (in this case, $\frac{1}{20}$ for every neighbor, but note that if different neighbors had different cardinalities, this would differ).
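If useful, this arithmetic can be checked directly against the fitted model (assuming, as in the tables above, that `m6.name_x` lines up element-by-element with `m6.betas`):

# Per-neighbor change: the SLX coefficient on `w_pg_Condominium`
# divided by the number of neighbors (twenty)
beta_w_condo = m6.betas.flatten()[m6.name_x.index("w_pg_Condominium")]
beta_w_condo / 20  # roughly 0.0296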
Introducing a spatial lag of an explanatory variable, as we have just seen, is the most straightforward way of incorporating the notion of spatial dependence in a linear regression framework. It does not require additional estimation machinery, it can be estimated with OLS, and its interpretation is rather similar to that of non-spatial variables, so long as one keeps the distinction between direct and indirect effects in mind.
The field of spatial econometrics, however, is much broader and has produced, over the last few decades, many techniques to deal with spatial effects and spatial dependence in different ways. Although this might be an oversimplification, one can say that most of these efforts for the case of a single cross-section are focused on two main variants: the spatial lag and the spatial error model. Both are similar to the case we have seen in that they are based on the introduction of a spatial lag, but they differ in the component of the model they modify and affect.
Spatial Error#
The spatial error model includes a spatial lag in the error term of the equation:
$$
\log{P_i} = \alpha + \sum_k \beta_k X_{ki} + u_i
$$

$$
u_i = \lambda u_{lag-i} + \epsilon_i
$$

where $u_{lag-i} = \sum_j w_{i,j} u_j$.
Although it appears similar, this specification violates the assumptions about the error term in a classical OLS model, so alternative estimation methods are required. Pysal incorporates functionality to estimate several of the most advanced techniques developed in the spatial econometrics literature. For example, we can use a general method of moments estimator that accounts for heteroskedasticity [ADKP10]:
# Fit spatial error model with `spreg`
# (GMM estimation allowing for heteroskedasticity)
m7 = spreg.GM_Error_Het(
    # Dependent variable
    db[["log_price"]].values,
    # Independent variables
    db[variable_names].values,
    # Spatial weights matrix
    w=knn,
    # Dependent variable name
    name_y="log_price",
    # Independent variables names
    name_x=variable_names,
)
As before, the `summary` attribute will return a full-featured table of results. For the most part, it may be interpreted in similar ways to those above. The main difference is that, in this case, we can also recover an estimate and inference for the $\lambda$ parameter in the error term:
# Build full table of regression coefficients
pandas.DataFrame(
    {
        # Pull out regression coefficients and
        # flatten as they are returned as Nx1 array
        "Coeff.": m7.betas.flatten(),
        # Pull out and flatten standard errors
        "Std. Error": m7.std_err.flatten(),
        # Pull out P-values from z-stat object
        "P-Value": [i[1] for i in m7.z_stat],
    },
    index=m7.name_x
    # Subset for lambda parameter and round to
    # four decimals
).reindex(["lambda"]).round(4)
| | Coeff. | Std. Error | P-Value |
|---|---|---|---|
| lambda | 0.6449 | 0.0187 | 0.0 |
Spatial Lag#
The spatial lag model introduces a spatial lag of the dependent variable. In the example we have covered, this would translate into:
$$
\log{P_i} = \alpha + \rho \log{P_{lag-i}} + \sum_k \beta_k X_{ki} + \epsilon_i
$$
Although it might not seem very different from the previous equation, this model violates the exogeneity assumption, which is crucial for OLS to work. Put simply, this occurs because $P_i$ exists on both "sides" of the equals sign: since $P_i$ is included in computing $P_{lag-i}$, exogeneity is violated.
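One way to see this, using a standard rearrangement rather than anything specific to this dataset, is to stack the model in matrix form and solve for the vector of log prices; the lag is a function of the outcome itself and so cannot be treated as a fixed, exogenous regressor:

$$
\log{P} = (I - \rho \mathbf{W})^{-1}(\alpha + \mathbf{X}\beta + \epsilon)
$$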
Similarly to the case of the spatial error, several techniques have been proposed to overcome this limitation, and Pysal implements several of them. In the example below, we use a two-stage least squares estimation [Ans88], where the spatial lag of all the explanatory variables is used as an instrument for the endogenous lag:
# Fit spatial lag model with `spreg`
# (spatial two-stage least squares estimation)
m8 = spreg.GM_Lag(
    # Dependent variable
    db[["log_price"]].values,
    # Independent variables
    db[variable_names].values,
    # Spatial weights matrix
    w=knn,
    # Dependent variable name
    name_y="log_price",
    # Independent variables names
    name_x=variable_names,
)
And let's summarise the coefficients in a table as before (the usual disclaimer about the `summary` object applies):
# Build full table of regression coefficients
pandas.DataFrame(
    {
        # Pull out regression coefficients and
        # flatten as they are returned as Nx1 array
        "Coeff.": m8.betas.flatten(),
        # Pull out and flatten standard errors
        "Std. Error": m8.std_err.flatten(),
        # Pull out P-values from z-stat object
        "P-Value": [i[1] for i in m8.z_stat],
    },
    index=m8.name_z
    # Round to four decimals
).round(4)
| | Coeff. | Std. Error | P-Value |
|---|---|---|---|
| CONSTANT | 2.7440 | 0.0727 | 0.0000 |
| accommodates | 0.0698 | 0.0048 | 0.0000 |
| bathrooms | 0.1627 | 0.0104 | 0.0000 |
| bedrooms | 0.1604 | 0.0105 | 0.0000 |
| beds | -0.0365 | 0.0065 | 0.0000 |
| rt_Private_room | -0.4981 | 0.0151 | 0.0000 |
| rt_Shared_room | -1.1157 | 0.0366 | 0.0000 |
| pg_Condominium | 0.1073 | 0.0209 | 0.0000 |
| pg_House | -0.0004 | 0.0137 | 0.9766 |
| pg_Other | 0.1208 | 0.0215 | 0.0000 |
| pg_Townhouse | -0.0186 | 0.0323 | 0.5653 |
| W_log_price | 0.3416 | 0.0148 | 0.0000 |
Similarly to the effects in the SLX regression, changes in the spatial lag regression need to be interpreted with care. Here, `W_log_price` applies consistently over all observations and effectively changes the strength of each of the $\beta$ coefficients. Thus, it is useful to use predictions and scenario-building to predict $y$ when changing $X$, which allows you to analyze the direct and indirect components.
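One way to sketch such a scenario exercise for the spatial lag model is through its reduced form. This is an illustration, not functionality shipped with `spreg`; it assumes the coefficient on `W_log_price` is stored as the last element of `m8.betas`, as in the table above:

import numpy
from scipy.sparse import identity
from scipy.sparse.linalg import spsolve

# Reduced-form prediction for the spatial lag model, y = (I - rho W)^(-1) X b,
# ignoring the error term
W = knn.sparse                # sparse spatial weights matrix
rho = m8.betas[-1][0]         # spatial autoregressive coefficient
b = m8.betas[:-1]             # constant + covariate coefficients
X = numpy.hstack(
    [numpy.ones((W.shape[0], 1)), db[variable_names].values]
)
y_hat_lag = spsolve(
    (identity(W.shape[0]) - rho * W).tocsc(), (X @ b).flatten()
)

Changing a column of `X` (as we did for the condominium scenario above) and re-solving would then propagate that change through all connected observations, not only the immediate neighbors.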
Other ways of bringing space into regression#
We have covered here only a few ways to formally introduce space in a regression framework. There are, however, many other advanced spatial regression methods routinely used in statistics, data science, and applied analysis. For example, Generalized Additive Models [GOP15, Woo06] have been used to apply spatial kernel smoothing directly within a regression function. Other smoothing methods, such as spatial Gaussian process models [BFC10] or kriging, also conceptualize the dependence between locations as smooth. Other methods in spatial regression that consider graph-based geographies (rather than distance/kernel effects) include variations on the conditional autoregressive model, which examines spatial relationships at locations conditional on their surroundings, rather than as jointly co-emergent with them. Full coverage of these topics is beyond the scope of this book, though [BGFS08] provides a detailed and comprehensive discussion. We have not covered these (and other existing) methods not because we do not think they are important or useful, far from it, but because we consider them a bit more advanced than the level at which we wanted to pitch this chapter.
Both the set of models we have introduced here and others that exist but we have not covered share the feature of embedding space as a "first-class citizen" in the regression framework. In all these cases, we conceptualise space (e.g. through a spatial weights matrix) and modify the functional form of our original model in a way that recognises the location of its observations when generating predictions or inference. This approach results in modifications of the modelling framework to accommodate geography. In the next chapter, we discuss a different perspective for embedding space in our modelling efforts. Rather than injecting space through the functional form, we will do it through the variables we use to explain or predict the outcome of interest. In that case, we will be using geography to enrich our data before it is passed to a modelling framework.
Questions#
- One common kind of spatial econometric model is the "Spatial Durbin Model," which combines the SLX model with the spatial lag model. Alternatively, the "Spatial Durbin Error Model" combines the SLX model with the spatial error model. Fit a Spatial Durbin variant of the spatial models fit in this chapter.
  - Do these variants improve the model fit?
  - What happens to the spatial autocorrelation parameters ($\rho$, $\lambda$) when the SLX term is added? Why might this occur?
- Fortunately for us, spatial error models recover the same estimates (asymptotically) as a typical OLS estimate, although their confidence intervals will change. Statistically, this occurs because OLS mis-estimates the standard errors (though not the coefficients) when there is spatial correlation and/or spatial heteroskedasticity. How much do the confidence intervals change when the spatial error model is fit?
- One common justification for the SLX model (and the Spatial Durbin variants) is about omitted, spatially-patterned variables. That is, if an omitted variable is associated with the included variables and is spatially-patterned, then we can use the spatial structure of our existing variables to mimic the omitted variable. In our spatial lag model,
  - what variables might we be missing that are important to predict the price of an AirBnB?
  - would these omitted variables have a similar spatial pattern to our included variables? Why or why not?
- Where spatial regression models generally focus on how nearby observations are similar to one another, platial models focus on how observations in the same spatial group are similar to one another. These are often dealt with using multilevel or spatial mixed-effect models. When do these two ideas work together well? And when might they disagree?
Challenge Questions#
The following discussions are a bit challenging, but reflect extra enhancements to the discussions in the chapter that may solidify or enhance an advanced understanding of the material.
The random coast#
In the section analyzing our naive model residuals, we ran a classic two-sample $t$-test to identify whether or not our coastal and not-coastal residential districts tended to have the same prediction errors. Often, though, it's better to use straightforward, data-driven testing and simulation methods rather than assume that the mathematical assumptions behind the $t$-statistic are met.

To do this, we can shuffle the coast/not-coast assignments and check whether the distribution of observed residuals differs from the distributions obtained under random assignment. That is, we shuffle which observations are labelled as being on the coast, and plot the resulting cumulative distributions.

Below, we do this, running 100 simulated re-assignments of districts to either "coast" or "not coast," and comparing the distributions of randomly-assigned residuals to the observed distribution of residuals. We plot the empirical cumulative distribution function rather than the histogram directly, because the empirical cumulative distribution function is usually easier to examine visually, especially for subtle differences.

In the plot, the black lines represent our simulations, and the colored patches below them represent the observed distribution of residuals. If the black lines tend to sit to the left of the colored patch, then the simulations (where prediction error is totally random with respect to our categories of "coastal" and "not coastal") tend to have more negative residuals than our actual model. If the black lines tend to sit to the right, they tend to have more positive residuals. As a refresher, positive residuals mean that our model is under-predicting, and negative residuals mean that our model is over-predicting. Below, our simulations provide direct evidence for the claim that our model may be systematically under-predicting coastal prices and over-predicting non-coastal prices.
n_simulations = 100
f, ax = plt.subplots(1, 2, figsize=(12, 3), sharex=True, sharey=True)
# Observed cumulative distributions of residuals for each group
ax[0].hist(
    coastal,
    color="r",
    alpha=0.5,
    density=True,
    bins=30,
    label="Coastal",
    cumulative=True,
)
ax[1].hist(
    not_coastal,
    color="b",
    alpha=0.5,
    density=True,
    bins=30,
    label="Not Coastal",
    cumulative=True,
)
# Shuffle the residuals and split them into "coastal"/"not coastal" groups
for simulation in range(n_simulations):
    shuffled_residuals = m1.u[numpy.random.permutation(m1.n)]
    random_coast, random_notcoast = (
        shuffled_residuals[is_coastal],
        shuffled_residuals[~is_coastal],
    )
    if simulation == 0:
        label = "Simulations"
    else:
        label = None
    ax[0].hist(
        random_coast,
        density=True,
        histtype="step",
        color="k",
        alpha=0.05,
        bins=30,
        label=label,
        cumulative=True,
    )
    # Note: plot the shuffled *not coastal* residuals in the second panel
    ax[1].hist(
        random_notcoast,
        density=True,
        histtype="step",
        color="k",
        alpha=0.05,
        bins=30,
        label=label,
        cumulative=True,
    )
ax[0].legend()
ax[1].legend()
plt.tight_layout()
plt.show()
The K-neighbor correlogram#
Further, it might be the case that spatial dependence in our mis-predictions only matters for sites that are extremely close to one another, and decays quickly with distance. To investigate this, we can examine the correlation between each site's residual and the average residual of its $k$ nearest neighbors, increasing $k$ until the estimate stabilizes. This idea is central to a geostatistical concept, the correlogram, which gives the correlation between sites of the attribute being studied as distance increases.
One quick way to check whether or not what we’ve seen is unique or significant is to compare it to what happens when we just assign neighbors randomly.
If what we observe is substantially different from what emerges when neighbors are random, then the structure of the neighbors embeds a structure in the residuals.
We won’t spend too much time on this theory specifically, but we can quickly and efficiently compute the correlation between our observed residuals and the spatial lag of an increasing (k)-nearest neighbor set:
correlations = []
nulls = []
for order in range(1, 51, 5):
    # Re-weight the KNN object in place, quickly and efficiently avoiding copies
    knn.reweight(k=order, inplace=True)
    knn.transform = "r"
    lag_residual = weights.spatial_lag.lag_spatial(knn, m1.u)
    # Shuffle the residuals: identical to assigning random neighbors in the KNN graph
    random_residual = m1.u[numpy.random.permutation(len(m1.u))]
    random_lag_residual = weights.spatial_lag.lag_spatial(
        knn, random_residual
    )
    correlations.append(
        numpy.corrcoef(m1.u.flatten(), lag_residual.flatten())[0, 1]
    )
    nulls.append(
        numpy.corrcoef(m1.u.flatten(), random_lag_residual.flatten())[0, 1]
    )
plt.plot(range(1, 51, 5), correlations)
plt.plot(range(1, 51, 5), nulls, color="orangered")
plt.hlines(
    numpy.mean(correlations[-3:]), *plt.xlim(), linestyle=":", color="k"
)
plt.hlines(numpy.mean(nulls[-3:]), *plt.xlim(), linestyle=":", color="k")
plt.text(
    s="Long-Run Correlation: ${:.2f}$".format(numpy.mean(correlations[-3:])),
    x=25,
    y=0.3,
)
plt.text(
    s="Long-Run Null: ${:.2f}$".format(numpy.mean(nulls[-3:])), x=25, y=0.05
)
plt.xlabel("$K$: number of nearest neighbors")
plt.ylabel(
    "Correlation between site n and neighborhood average of size $K$"
)
plt.show()
Clearly, the two curves are different. The observed correlation reaches a peak of around $r=0.34$ when around 20 nearest listings are used, meaning that adding more than 20 nearest neighbors does not significantly change the correlation in the residuals. Further, the lowest correlation is for the single nearest neighbor, and correlation rapidly increases as more neighbors close to the listing are added. Thus, there does appear to be an unmeasured spatial structure in the residuals, since they are more similar to one another when they are near than when they are far apart. Further, while it is not shown here (since it becomes computationally intractable), as the number of nearest neighbors gets very large (approaching the number of observations in the dataset), the average of the $k$ nearest neighbors' residuals goes to the global average of the residuals, which is zero. This means that the correlation between the residuals and a vector that is nearly constant approaches zero.
The null correlations, however, use randomly-chosen neighbors (drawn without replacement). Since sampling is truly random in this case, the average of $k$ randomly-chosen residuals is usually close to zero (the global mean), and so is its correlation with the observed residuals. Increasing the number of randomly-chosen neighbors does not move this long-run average away from zero.
Taken together, we can conclude that there is distinct positive spatial dependence in the error.
This means that our over- and under-predictions are likely to cluster.
Next Steps#
For additional reading on the topics covered in this chapter, please consult the following resources:
For a more in-depth discussion of the fundamentals of spatial econometrics and applications in both GUI and command-line software, consult
Anselin, Luc and Sergio Rey. 2014. Modern Spatial Econometrics in Practice: A Guide to GeoDa, GeoDaSpace, and Pysal. GeoDa Press.
For additional mathematical detail and more extensive treatment of space-time models, consult:
Cressie, Noel and Christopher N. Wikle. 2011. Statistics for Spatio-Temporal Data. Singapore: Wiley Press.
For an alternative perspective on regression and critique of the spatial econometric perspective, consider:
Gibbons, Stephen and Henry G. Overman. 2012. “Mostly Pointless Spatial Econometrics.” Journal of Regional Science 52: 172-191.
And for a useful overview of the discussions around multilevel modelling, with references therein to further resources, consider:
Owen, Gwilym, Richard Harris, and Kelvyn Jones. 2016. "Under Examination: Multilevel models, geography and health research." Progress in Human Geography 40(3): 394-412.
1. Since we use the log price as the dependent variable, our $\beta$ coefficients can be interpreted as the percentage change in the price associated with a unit change in the explanatory variable.
2. Discover this for yourself: what is the average of `numpy.array([True, True, True, False, False, True])`?
Objectives
- Determine whether there is spatial structure to your data.
- Figure out how to model this using both spatial error models and spatial lag models.
- Compare spatial regression models.
Pre-lab assignment
- Read through the RSpatial chapter on spatial regression
- If you are unfamiliar with regression, consider reading this or watching a bit of this or this. If you like math, read this.
My assumptions
- I am going to assume the word "regression" is not new to you, that you know what a $\beta$ coefficient is, that you get what $y = mx + b$ is about, and crucially that you know what a residual is. Confused? Watch this.
- Regression is a complex art-form. Junk regressions are everywhere. This tutorial focuses on how to deal with SPATIAL data and less on the art of regression and all the assumptions underlying OLS.
- I’ve peppered this tutorial with additional resources. If you’re unfamiliar with some of the basic stats stuff here, I’m assuming you’ll go look into the resources I’ve provided.
- You completed last week’s tutorial and understand how to make a spatial weights matrix and neighborhood.
- You remember that correlation DOES NOT IMPLY causation.
Set up
library(tidyverse)
library(ggExtra) # for awesome plots
library(rgdal)
library(spdep)
library(spgwr)
OLS
Classic linear regression estimates a linear relationship between a dependent variable $y$ and a set of explanatory variables $X$:

$$y_i = X_i \beta + \epsilon_i$$

Ordinary least squares (OLS) estimation is referred to as BLUE, the Best Linear Unbiased Estimator. If we are interested in estimating the effect of education ($x_1$) on income ($y$), our OLS estimate of $\beta$ in the equation above is found by minimizing the sum of the squared prediction errors. Say what? Let's visualize this:
Say you're interested in predicting the effect of education on income because you want to know why all the PhDs make the BIG BUCKS (sarcasm). You go out and collect information on folks' income and education levels and plot them on the nice plot above. OLS estimates the line that minimizes the sum of the squared prediction errors, or the distance between the line and each point (vertical lines). This distance is the difference between the observed data from our sample (Bob's income and education) and our predicted estimate of the population relationship (average effect of education on income for white male individuals).1 In this figure, the line represents the predicted effect of education on income. The slope of this predicted effect line is our $\beta$ coefficient in the equation above. The fit of our modeled effect depends on how far the predicted effect is from each actual observation.
If the key assumptions required to use this BLUE approach are true, then we can make statistical inferences about the entire population based on the estimated (beta) coefficient. These assumptions include:
- Random errors have a mean of zero (i.e. no systematic bias or mis-specification in the population regression equation);
- The random errors have constant variance (i.e. are homoskedastic) and are uncorrelated;
- The random errors have a normal distribution.
Do these assumptions hold with spatial data?
When a value in one location depends on the values of its neighbors, our errors are no longer uncorrelated and may not have a normal distribution. Depending on the nature of the spatial dependence, OLS will be either inefficient with incorrect standard errors or biased and inconsistent. The take home message? OLS tends to break down in the face of spatial dependence…
We’ve already talked a lot about spatial dependence, but let’s take a second to think through why spatial dependence might occur. First, our actual data collection process may induce measurement error. The spatial units at which the data is collected may not reflect the underlying process (i.e. population sliced at Census tract boundaries or yield sliced at county levels). Second, our variable of interest may have an important spatial dynamic, i.e. location and position may drive the observed values of the process of interest.
The best visualization of these different types of spatial dependence I have found comes from this overview of spatial regression. First, let’s look at a situation in which our error is correlated across space, henceforth referred to as a SPATIAL ERROR MODEL:
Here, the subscripts $i$ and $j$ refer to two different locations in space, for example county $i$ and county $j$. Say we are looking at how county-level average education influences county-level average income. We may not expect location and distance to influence income and education.2 We will, however, expect that the way in which counties are sliced could mean that neighboring counties are more similar than distant counties, and that this similarity may influence our income/education estimates. In this case, the ERROR in our model is correlated. This means we violate our OLS BLUE assumption that our error terms are uncorrelated, and therefore any OLS estimates will be inefficient and biased. These spatially correlated errors are indicative of omitted (spatially correlated) covariates. We can specify how counties neighbor one another (you know how to do this from last week) and explicitly estimate this spatial structure in the model to control for correlation in the error term.
The other type of spatial dependence is a bit more complex. In this case, the value of our outcome variable at position $i$ is influenced by the value of our outcome variable at position $j$. We also expect that values of our covariates at location $i$ influence values at location $j$. In addition, we anticipate that our errors are correlated. Basically, this is a case in which you feel fairly confident that location/distance really drive the values of the variables you're interested in... perhaps events in one place predict an increased likelihood of similar events in neighboring places (spatial diffusion). We'll refer to this type of model as a SPATIAL LAG MODEL, though you may also hear it referred to as a spatial auto-regressive model in the literature.
Formally, if we write the standard OLS as follows:
$$y_i = X_i \beta + \epsilon_i$$

Then the spatial error model and spatial lag model differ in the way we define our error term $\epsilon_i$. For spatial error models, we define this term as

$$\epsilon_i = \lambda w_i \epsilon_i + u_i$$

and for spatial lag models we define this term as

$$\epsilon_i = \rho w_i y_j + u_i$$

I'll explain these key differences below.
Steps
Before I dive into the differences between spatial error models and spatial lag models, I want to provide a brief overview of the steps you should take when modeling your data with a spatial regression. Before you ever do anything with any data, you should tidy and visualize your data to understand/identify outliers and missing values, and to familiarize yourself with your data. Plot your data and look for signs of spatial patterns. Test for spatial autocorrelation in your variables of interest (`moran.test()`). If you feel like your data has spatial dependence, then I recommend working through the following steps:
- Inspect your data, construct an appropriate spatial neighborhood, and assign weights.
- Run an OLS regression with your variables of interest. This regression assumes spatial independence.
- Inspect results and plot your residuals.
- Compute the Moran’s I value for the residuals using the weights you constructed in Step 1. If you see signs of spatial dependence, continue!
- To determine whether to estimate a spatial error model or a spatial lag model, compute the Lagrange Multiplier statistics. If these statistics are the same, go robust. Don’t worry, I explain this below.
- Run that spatial regression!
To be really thorough, I would test sensitivity of your major test results and regressions to different neighborhood/weighting assumptions.
I'll explain each of these steps below. First, I'll provide a bit more information on the two types of spatial dependence we will explore in this lab.
Spatial error models
Remember that spatial error models assume that only the error terms in the regression are correlated. Let’s look at this another way:
$$y_i = X_i \beta + \epsilon_i$$

$$\epsilon_i = \lambda w_i \epsilon_i + u_i$$

where $u_i \sim N(0, \sigma^2 I)$.

Therefore, our final spatial error model can be specified as:

$$y_i = X_i \beta + \lambda w_i \epsilon_i + u_i$$
Mmmmm, mathy. This is actually pretty easy to understand. In a normal regression we estimate a single error term, $\epsilon_i$. In a spatial error model, we split this error into random error and spatially structured error. Here, random error (unexplained by our model) is $u_i$, a set of independent, identically distributed (i.i.d.) errors. Our spatially structured error is composed of the spatial error coefficient, $\lambda$, and our original $\epsilon$ error term, now weighted by the weight matrix we learned how to build last week, $w_i$. If there is no spatial correlation between our errors, then $\lambda = 0$. If $\lambda \neq 0$, OLS remains unbiased and consistent, but our standard errors will be wrong and the betas will be inefficient.
Let's move on to implementing this in `R`. A quick note about our data before we begin: our data comes from the Applied Spatial Data Analysis in R text by Bivand and colleagues. This dataset covers 281 census tracts in New York.
ny <- readOGR("./data/NY8_utm18.shp")
## OGR data source with driver: ESRI Shapefile
## Source: "./data/NY8_utm18.shp", layer: "NY8_utm18"
## with 281 features
## It has 17 fields
glimpse(ny@data)
## Observations: 281
## Variables: 17
## $ AREANAME <fctr> Binghamton city, Binghamton city, Binghamton city,...
## $ AREAKEY <fctr> 36007000100, 36007000200, 36007000300, 36007000400...
## $ X <dbl> 4.069397, 4.639371, 5.709063, 7.613831, 7.315968, 8...
## $ Y <dbl> -67.3533, -66.8619, -66.9775, -65.9958, -67.3183, -...
## $ POP8 <dbl> 3540, 3560, 3739, 2784, 2571, 2729, 3952, 993, 1908...
## $ TRACTCAS <dbl> 3.08, 4.08, 1.09, 1.07, 3.06, 1.06, 2.09, 0.02, 2.0...
## $ PROPCAS <dbl> 0.000870, 0.001146, 0.000292, 0.000384, 0.001190, 0...
## $ PCTOWNHOME <dbl> 0.32773109, 0.42682927, 0.33773959, 0.46160483, 0.1...
## $ PCTAGE65P <dbl> 0.14661017, 0.23511236, 0.13800481, 0.11889368, 0.1...
## $ Z <dbl> 0.14197, 0.35555, -0.58165, -0.29634, 0.45689, -0.2...
## $ AVGIDIST <dbl> 0.2373852, 0.2087413, 0.1708548, 0.1406045, 0.15777...
## $ PEXPOSURE <dbl> 3.167099, 3.038511, 2.838229, 2.643366, 2.758587, 2...
## $ Cases <dbl> 3.08284, 4.08331, 1.08750, 1.06515, 3.06017, 1.0638...
## $ Xm <dbl> 4069.397, 4639.371, 5709.063, 7613.831, 7315.968, 8...
## $ Ym <dbl> -67353.3, -66861.9, -66977.5, -65995.8, -67318.3, -...
## $ Xshift <dbl> 423391.0, 423961.0, 425030.6, 426935.4, 426637.5, 4...
## $ Yshift <dbl> 4661502, 4661993, 4661878, 4662859, 4661537, 466192...
Our dependent variable, `Z`, is the log transformation of the mean counts of leukemia cases by census tract in NY.3 Our covariates of interest (independent variables) include `PEXPOSURE`, the inverse distance to the closest inactive hazardous waste site, `PCTAGE65P`, the proportion of people aged 65 or older, and `PCTOWNHOME`, the percentage of people who own their own home. Now, it may be reasonable to assume that space/location/distance are driving our observed rates of leukemia, but for now we'll assume we only expect to see spatial interactions in the error terms due to the random slicing of our data into census tracts. We'll relax this when we estimate a spatial lag model below, and then we can compare the two!
Anytime you start a regression, or any data analysis for that matter, start by LOOKING at your data. Let's start by inspecting the relationship between our covariate of interest, distance to the closest TCE location (`PEXPOSURE`), and rates of leukemia:
library(ggExtra)
p <- ggplot(data = ny@data, aes(x = ny$PEXPOSURE, y = ny$Z, color = ny$AREANAME)) +
geom_point() +
theme(legend.position = "none") +
xlab("Exposure") +
ylab("Leukemia rates")
ggMarginal(p, type = "histogram")
I don't see any clear relationship here, but let's keep playing. As I said above in the Steps section, it's often a good idea just to start with a simple linear model that doesn't account for spatial dynamics. You can pull the residuals out of the model (the noise not explained by your model, $\hat{\epsilon}_i = y_i - \hat{y}_i$) and see if there are interesting spatial patterns at play. This would suggest that spatial dynamics explain some of the variation in your data and that you should explore a bit more! Let's estimate a nice old OLS regression with our variables of interest:
ny.ols <- lm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny)
summary(ny.ols)
##
## Call:
## lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7417 -0.3957 -0.0326 0.3353 4.1398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.51728 0.15856 -3.262 0.00124 **
## PEXPOSURE 0.04884 0.03506 1.393 0.16480
## PCTAGE65P 3.95089 0.60550 6.525 3.22e-10 ***
## PCTOWNHOME -0.56004 0.17031 -3.288 0.00114 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6571 on 277 degrees of freedom
## Multiple R-squared: 0.1932, Adjusted R-squared: 0.1844
## F-statistic: 22.1 on 3 and 277 DF, p-value: 7.306e-13
`lm()` takes an equation and a dataset as its arguments. You specify the formula as `DEPENDENT_VAR ~ INDEPENDENT_VAR_1 + INDEPENDENT_VAR_2`, etc. Our results show us the `Estimate` (think $\beta$ coefficient), the standard error of that estimate (`Std. Error`), and whether that estimate is significant (`Pr(>|t|)`). These results suggest that `PEXPOSURE` is not significant, but that our other census variables are (home ownership and percent oldies). It's also useful to glance at our `R-squared` estimates. $R^2$ gives us a sense of how well our model fits the data... it tells us how much variation we've explained out of the TOTAL variation in the model. A higher $R^2$ means we've explained more of the variation. If you add tons of variables to a regression, you'll probably explain more of the variation. Adjusted $R^2$ penalizes you as you add more variables to your model by taking into account the ratio $(N-1)/(N-k-1)$, where $N$ is the number of observations and $k$ is the number of predictors in your model. In our case, the fit ain't stellar, i.e. there are probably other significant factors that explain variation in leukemia rates that we haven't included in our model. More on $R^2$ here and here.
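For reference, the standard formula behind that penalty is:

$$
R^2_{adj} = 1 - (1 - R^2)\frac{N - 1}{N - k - 1}
$$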
What we really care about is our residuals. In classic $y_i = X_i\beta + \epsilon_i$ regression world, our residuals are caught up in $\epsilon$, and the residual standard error measures the standard deviation of $\epsilon$. If the residual standard error were `0`, our model would fit the data perfectly. This never happens... there's always some noise in our data that we can't explain, and this is captured in our residuals. Remember that residuals are the distance between the observed values (Bob's income and education) and the predicted relationship between the variables of interest (the estimated linear effect of education on income).
Let's look at our residuals.4
ny$resid <- residuals(ny.ols) # pull residuals out of the OLS object and add as a column to our ny shapefile
spplot(ny, "resid") # map the residuals
There appears to be some spatial autocorrelation at play, but let’s test this formally. To do this, we will compute the Moran’s I for the residuals in the model. We first have to construct our neighborhood and weights (see last tutorial). I am assuming a KNN neighborhood where the four nearest neighbors are relevant.
library(spdep)
id <- row.names(as(ny, "data.frame"))
coords <- coordinates(ny)
ny_nb <- knn2nb(knearneigh(coords, k = 4), row.names = id)
plot(ny)
plot(ny_nb, coords, add=T, main = "KNN = 4 Neighborhood")
We then assign weights to our neighborhood and use `lm.morantest()` to test for spatial autocorrelation in our detrended data (i.e. the regression residuals):
ny_w <- nb2listw(ny_nb, style = "B", zero.policy = T)
lm.morantest(ny.ols, ny_w)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
##
## Moran I statistic standard deviate = 2.1191, p-value = 0.01704
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.069602396 -0.010104423 0.001414785
HOUSTON, WE HAVE SPATIAL AUTOCORRELATION. Note here that we are NOT using the `moran.test()` function, but a special function intended to test for spatial autocorrelation in regression residuals, `lm.morantest()`. This test takes into account the fact that the variable under consideration is a residual, computed from a regression; the usual Moran's I test statistic does not. Since our `p-value` is significant, we can reject our null of randomly distributed data and assume we're dealing with spatially autocorrelated residuals. What this really means is that if we include spatial dependence in our model, we can remove spatial noise from our estimate of the effect of proximity to hazardous sites on leukemia rates. Important. Let's do it. We'll start with our simple spatial error model, in which only the error terms are correlated. We can do this with the `errorsarlm()` function:
ny.sem <- errorsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, listw = ny_w, quiet = T)
summary(ny.sem)
##
## Call:errorsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME,
## data = ny, listw = ny_w, quiet = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.618805 -0.387857 -0.024901 0.368261 3.991433
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.553553 0.172557 -3.2079 0.001337
## PEXPOSURE 0.049443 0.041132 1.2020 0.229345
## PCTAGE65P 3.853170 0.618974 6.2251 4.813e-10
## PCTOWNHOME -0.475158 0.184211 -2.5794 0.009897
##
## Lambda: 0.041922, LR test value: 3.3325, p-value: 0.067925
## Asymptotic standard error: 0.021396
## z-value: 1.9594, p-value: 0.050071
## Wald statistic: 3.8391, p-value: 0.050071
##
## Log likelihood: -277.0626 for error model
## ML residual variance (sigma squared): 0.4186, (sigma: 0.64699)
## Number of observations: 281
## Number of parameters estimated: 6
## AIC: 566.13, (AIC for lm: 567.46)
We don't see any changes in the direction of our estimated effects, but we do see changes in their magnitude. Notice that `Lambda` ($\lambda$), which is indicative of spatial autocorrelation, is marginally significant (p ≈ 0.05). Let's visualize the trend component of our model and the stochastic component (the residuals):
ny$fitted_sem <- ny.sem$fitted.values
spplot(ny, "fitted_sem", main = "Trend")
The trend shows the tract-level pattern in leukemia rates predicted by the covariates in the model. Now for the residuals...
ny$resid_sem <- ny.sem$residuals
spplot(ny, "resid_sem", main = "Residuals")
The residuals show us the variation in leukemia rates that we haven't accounted for in our model. Visualizing these residuals can help us inspect spatial patterns in this unexplained variation and may give us ideas about other (spatially-structured) predictors to include in our model!
Spatial lag model (SAR)
The second type of spatial dependence is a bit more complex. In this case, we believe that the values of $y$ in one unit $i$ are directly influenced by the values of $y$ found in $i$'s neighbors.5 We'll refer to this type of model as a SPATIAL LAG MODEL.
Formally:
$$y_i = X_i \beta + \epsilon_i$$

$$\epsilon_i = \rho w_i y_j + u_i$$

Therefore, our final spatial lag model can be specified as:

$$y_i = X_i \beta + \rho w_i y_j + u_i$$

where a positive value of $\rho$ indicates that counties are expected to have higher rates of leukemia if, on average, their neighbors have higher leukemia values. Our $\beta$ coefficient is different than in classic OLS in that we are estimating the effect of exposure on leukemia while controlling for the extent to which variation in a county's level of leukemia can be explained by the rate of leukemia in connected counties. $w_i y_j$ is the spatially lagged dependent variable for weights matrix $w_i$, $X_i$ is a matrix of observations on the explanatory variables, $u_i$ is a vector of error terms, and $\rho$ is the spatial coefficient. If there is no spatial dependence, and $y$ does not depend on neighboring $y$ values, then $\rho = 0$. Rho ($\rho$) reflects the spatial dependence inherent in our sample data by measuring the average influence on observations of their neighboring observations. If $\rho$ is significant, OLS is biased and inconsistent (so don't use it!). Assuming we've already confirmed that we have residual autocorrelation after specifying a simple OLS (`lm.morantest()` and visualizations of residuals) and that our variables of interest are spatially autocorrelated, we can incorporate this spatial dependence in our previous OLS model as follows:
ny.lag <- lagsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, ny_w, zero.policy = T)
summary(ny.lag)
##
## Call:
## lagsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny,
## listw = ny_w, zero.policy = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.55786 -0.38297 -0.01927 0.36193 3.95890
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.468433 0.157234 -2.9792 0.002890
## PEXPOSURE 0.030857 0.034826 0.8860 0.375597
## PCTAGE65P 3.701524 0.601607 6.1527 7.616e-10
## PCTOWNHOME -0.470071 0.168432 -2.7909 0.005257
##
## Rho: 0.04704, LR test value: 5.5728, p-value: 0.018242
## Asymptotic standard error: 0.019371
## z-value: 2.4283, p-value: 0.015168
## Wald statistic: 5.8969, p-value: 0.015168
##
## Log likelihood: -275.9424 for lag model
## ML residual variance (sigma squared): 0.41473, (sigma: 0.64399)
## Number of observations: 281
## Number of parameters estimated: 6
## AIC: 563.88, (AIC for lm: 567.46)
## LM test for residual autocorrelation
## test value: 1.267, p-value: 0.26033
As expected, our spatial autoregressive parameter ($\rho$, `Rho`) is highly significant (`p = .015`). Our test for residual autocorrelation is not significant, which suggests that we cannot reject the null of randomly distributed residuals. Let's inspect our residuals:
ny$lag_resid <- residuals(ny.lag)
spplot(ny, "lag_resid")
It’s always worth restructuring your weights matrix to see how residual dependence changes. Let’s create a row-standardized queen neighborhood and re-estimate our model:
queen <- poly2nb(ny)
ny_qw <- nb2listw(queen) # remember this defaults to row-standardized
Now let’s re-estimate our model:
ny.lag.queen <- lagsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, ny_qw, zero.policy = T)
summary(ny.lag.queen)
##
## Call:
## lagsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny,
## listw = ny_qw, zero.policy = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.623321 -0.393140 -0.013717 0.346620 4.056860
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.500508 0.155993 -3.2085 0.001334
## PEXPOSURE 0.043560 0.034486 1.2631 0.206547
## PCTAGE65P 3.635007 0.598853 6.0700 1.279e-09
## PCTOWNHOME -0.408772 0.169152 -2.4166 0.015667
##
## Rho: 0.23013, LR test value: 7.5038, p-value: 0.0061569
## Asymptotic standard error: 0.082188
## z-value: 2.8001, p-value: 0.0051087
## Wald statistic: 7.8405, p-value: 0.0051087
##
## Log likelihood: -274.9769 for lag model
## ML residual variance (sigma squared): 0.41046, (sigma: 0.64067)
## Number of observations: 281
## Number of parameters estimated: 6
## AIC: 561.95, (AIC for lm: 567.46)
## LM test for residual autocorrelation
## test value: 1.385, p-value: 0.23926
It looks like this model does a bit better, but not by much. If you encounter this issue with your data, explore different neighborhood structures and, alas, you may have to explore more complex spatial models than what we've explored here.
Interpreting effects in spatial lag models is complicated. A change in a predictor at one observation has a direct effect on that observation's outcome as well as indirect effects that spill over to its neighbors' outcomes. You can read more about how to tease out these effects here.
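For the curious, the usual way these effects are summarised in the spatial econometrics literature (not something we compute in this lab) is through the matrix of partial derivatives of the outcome with respect to the $k$-th covariate,

$$
\frac{\partial y}{\partial x_k'} = (I - \rho W)^{-1} \beta_k,
$$

where the average of the diagonal elements gives the direct effect and the average of the off-diagonal row sums gives the indirect (spillover) effect.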
Comparing models
You can test for spatial autocorrelation in specific variables using `moran.test()` and for spatial autocorrelation in the residuals of your regression using `lm.morantest()`. These tests tell you whether spatial autocorrelation is a concern and whether your OLS estimates are garbage. If you find spatial autocorrelation, however, you don't really know whether you should estimate a spatial error model or a spatial lag model. The Lagrange Multiplier tests specify alternative hypotheses of a spatial lag in the dependent variable and of a spatial lag specifically in the error term (Nice). A test for spatial dependence in linear models!?! Thank you `R`!
We can use the `lm.LMtests()` function with the option `test = "all"` to test for different types of spatial structure. You need to pass the function a regression object (`ny.ols`) and a spatial `listw` weights object (`ny_w`). This test returns:
- `LMerr`: LM test for error dependence.
- `LMlag`: LM test for a missing spatially lagged dependent variable.
- Variants of these robust to the presence of the other, i.e. `RLMerr` tests for error dependence in the possible presence of a lagged dependent variable, and `RLMlag` tests for a missing lagged dependent variable in the possible presence of error dependence.
If you look back at our equations, these tests test for $\lambda = 0$ (in the case of our spatial error model) and $\rho = 0$ (in the case of our spatial lag model). We have to use the same weights matrix for this test. You can and should test the sensitivity of the test to different weights matrices.
LM <- lm.LMtests(ny.ols, ny_w, test = "all")
LM
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
##
## LMerr = 3.266, df = 1, p-value = 0.07073
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
##
## LMlag = 5.9832, df = 1, p-value = 0.01444
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
##
## RLMerr = 1.0467, df = 1, p-value = 0.3063
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
##
## RLMlag = 3.7639, df = 1, p-value = 0.05237
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data =
## ny)
## weights: ny_w
##
## SARMA = 7.0299, df = 2, p-value = 0.02975
Here `LMlag` refers to our spatial lag model and `LMerr` refers to our spatial error model. Since LMerr and LMlag are both at least marginally significant, we need to look at their robust counterparts. These "robust" tests are robust to the presence of the other type of autocorrelation: RLMerr accounts for the possible presence of a lagged dependent variable, and RLMlag for the possible presence of error dependence. Their outputs are in the `RLMerr` and `RLMlag` rows above. `SARMA` tests for the joint significance of `LMerr + RLMlag`.6
t <- sacsarlm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, ny_w, zero.policy = T)
It appears that our `RLMlag` model has the best fit, so we should run a spatial lag model on our data. What does this mean in the real world? That there is spatial autocorrelation between leukemia rates in neighboring census tracts in the state of New York. If you're interested in learning more about Lagrange Multiplier tests, I recommend this.
Luc Anselin (2005) developed a fantastic visualization of this process, which you can read all about in this extremely detailed overview of spatial regression in `R`.
Another common way to compare model fit and make a decision about which model to use is to compare the log likelihood of the models you estimate. You can think of this statistic as the likelihood of the parameters given the data. Look back at the `summary()` outputs. You'll see that the log likelihood is reported with each test. We want to maximize the log likelihood, so go with the model with the highest value. As you test different weight matrices, it's good practice to compare how the log likelihood changes.
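The AIC reported in the same outputs combines the log likelihood with a penalty for the number of estimated parameters (this is its standard definition), so a higher log likelihood and a lower AIC generally point to the same model:

$$
AIC = 2k - 2\log L
$$

where $k$ is the number of parameters estimated.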
Geographic weighted regression
Another nice exploratory tool is geographic weighted regression (GWR). This is good for exploring spatial heterogeneity in the relationships between variables... or, does your effect ($\beta$) vary across space?

$$y_i = \sum_{j} \beta_j (u_i, v_i) x_{ij} + \epsilon_i$$

where our simple $\beta$ coefficient is replaced with functions $\beta_j(\cdot,\cdot)$ of location $(u,v)$. Think of this as estimating the relationship between $y_i$ and our data $x_{ij}$ within a circle centered on location $(u,v)$ with radius $r$. Our findings are sensitive to this radius $r$, also known as the bandwidth. The `spgwr` package includes a function called `gwr.sel()` that computes a cross-validation score for GWRs with different bandwidths and returns the bandwidth that minimizes this score (i.e. the best-fit model).
library(spgwr)
bw <- gwr.sel(Z ~ PEXPOSURE, data = ny)
## Bandwidth: 68843.15 CV score: 148.1085
## Bandwidth: 111279.3 CV score: 147.9664
## Bandwidth: 137506.4 CV score: 147.9317
## Bandwidth: 146795.2 CV score: 147.9234
## Bandwidth: 159456.3 CV score: 147.9143
## Bandwidth: 167281.4 CV score: 147.9096
## Bandwidth: 172117.5 CV score: 147.907
## Bandwidth: 175106.4 CV score: 147.9055
## Bandwidth: 176953.7 CV score: 147.9046
## Bandwidth: 178095.3 CV score: 147.9041
## Bandwidth: 178800.9 CV score: 147.9038
## Bandwidth: 179237 CV score: 147.9036
## Bandwidth: 179506.5 CV score: 147.9034
## Bandwidth: 179673 CV score: 147.9034
## Bandwidth: 179776 CV score: 147.9033
## Bandwidth: 179839.6 CV score: 147.9033
## Bandwidth: 179878.9 CV score: 147.9033
## Bandwidth: 179903.2 CV score: 147.9033
## Bandwidth: 179918.3 CV score: 147.9033
## Bandwidth: 179927.5 CV score: 147.9033
## Bandwidth: 179933.3 CV score: 147.9033
## Bandwidth: 179936.8 CV score: 147.9033
## Bandwidth: 179939 CV score: 147.9033
## Bandwidth: 179940.4 CV score: 147.9033
## Bandwidth: 179941.2 CV score: 147.9033
## Bandwidth: 179941.7 CV score: 147.9033
## Bandwidth: 179942 CV score: 147.9033
## Bandwidth: 179942.2 CV score: 147.9033
## Bandwidth: 179942.4 CV score: 147.9033
## Bandwidth: 179942.4 CV score: 147.9033
## Bandwidth: 179942.5 CV score: 147.9033
## Bandwidth: 179942.5 CV score: 147.9032
## Bandwidth: 179942.5 CV score: 147.9032
## Bandwidth: 179942.5 CV score: 147.9032
## Bandwidth: 179942.5 CV score: 147.9032
## Bandwidth: 179942.5 CV score: 147.9032
## Bandwidth: 179942.6 CV score: 147.9032
## Bandwidth: 179942.6 CV score: 147.9032
bw
## [1] 179942.6
ny.gwr <- gwr(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny, bandwidth = bw)
ny.gwr
## Call:
## gwr(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = ny,
## bandwidth = bw)
## Kernel function: gwr.Gauss
## Fixed bandwidth: 179942.6
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu. Max. Global
## X.Intercept. -0.522172 -0.520740 -0.520154 -0.514439 -0.511092 -0.5173
## PEXPOSURE 0.047176 0.048032 0.049527 0.049722 0.050477 0.0488
## PCTAGE65P 3.911526 3.933832 3.959192 3.962334 3.979552 3.9509
## PCTOWNHOME -0.559358 -0.557968 -0.557682 -0.555498 -0.554563 -0.5600
We can attach our estimated coefficients to our original ny
dataset to visualize how the relationship between exposure and leukemia varies across space:
ny$gwr_exp_coeff <- ny.gwr$SDF$PEXPOSURE
spplot(ny, "gwr_exp_coeff")
There appears to be an interesting spatial gradient here. What about our other variables?
ny$gwr_th_coeff <- ny.gwr$SDF$PCTOWNHOME
spplot(ny, "gwr_th_coeff")
ny$gwr_pop_coeff <- ny.gwr$SDF$PCTAGE65P
spplot(ny, "gwr_pop_coeff")
Note that this approach does NOT account for the spatial structure of the data. It's really just a bunch of mini-regressions within a circle of bandwidth $r$. This is why I recommend using GWR only as an exploratory tool.
If you’re interested in learning more about geographic weighted regression, see here and here.
Additional resources
- This overview of spatial regression in `R`.
- Nice overview of general ideas.
- If you're a visual learner, this is a good overview of spatial regression.
- If you plan on using spatial regression in your research and want to do a deep dive, review this. It includes lots of `R` code. Code and data can be downloaded here.
- Finally, if you feel that SEM and/or SLM don't fit your data, this provides an overview of some of the other spatial structures that can be specified.
If the ordinary least squares (OLS) diagnostics discussed in the previous chapter indicate the existence of spatial lag or spatial error dependence, the researcher will wish to model the type of dependence indicated by these diagnostics. If the OLS diagnostics indicate the presence of a diffusion process, the researcher will wish to estimate a spatial lag model via maximum likelihood (ML) estimation or an instrumental variables specification incorporating instruments for the spatially lagged dependent variable. Alternatively, if the OLS diagnostics indicate the existence of spatial error dependence, the researcher may choose to estimate a more fully specified OLS model to model the spatial dependence or may choose to employ a ML or generalized method of moments (GMM) approach incorporating the spatial dependence in the errors.
The spatial dependence diagnosed via the diagnostics discussed in Chapter 5 may alternatively be produced by spatial heterogeneity in the effects of covariates. If this is the only source of spatial dependence, modeling this heterogeneity will be sufficient to capture the spatial dependence. As a consequence, any specification search should also consider the possibility of spatial heterogeneity, which is the focus of Chapter 7. This chapter will first, however, examine alternative approaches for modeling spatial dependence if spatial heterogeneity is not present.
This chapter begins by examining ML estimation of spatial lag models that derives from Ord (1975). Next, I explore alternative instrumental variables and GMM estimators for spatial lag dependence. Next, I turn to approaches for estimating spatial error models. I conclude by considering areas of concern in the estimation of spatial models. These include estimators for large sample sizes and diagnostics for continued spatial dependence.
MAXIMUM LIKELIHOOD SPATIAL LAG ESTIMATION
The mixed regressive, spatial autoregressive model, or spatial lag model, extends the pure spatial autoregressive model considered in Section 3.2 to include also the set of covariates and associated parameters:
y = ρWy + Xβ + ε
where X is again an N by K matrix of observations on the covariates, β is a K by 1 vector of parameters, and the remaining notation is as discussed in Section 3.2.
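For context, and stating the standard Gaussian result rather than quoting the chapter, the ML estimator of this model maximizes a log-likelihood that differs from the OLS case by a Jacobian term in ρ:

$$
\ln L = -\frac{N}{2}\ln(2\pi\sigma^2) + \ln|I - \rho W| - \frac{(y - \rho W y - X\beta)'(y - \rho W y - X\beta)}{2\sigma^2}
$$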