In the previous assignment, you were
asked to use OLS regression to examine the relationship between
median house values and several neighborhood characteristics, using
Philadelphia data at the Census block group level. For the current
assignment, you will use GeoDa and R to run spatial lag, spatial
error and geographically weighted regression to see whether these
methods can account for the spatial autocorrelation that might remain
in the OLS residuals.
Remember
that this report needs to be written in the same format as your
previous submission, with an introduction, methods, results, and
discussion. Do not simply copy the questions and answer them.
Below, you will find an outline that you are asked to follow when
writing your report.
Data Description
The attribute table of the Philadelphia Census block group level
dataset Regression Data.shp contains the following variables:
- AREAKEY: Census Block Group ID
- MEDHVAL: Median value of all owner-occupied housing units
- PCTBACHMOR: Proportion of residents in Block Group with at least a bachelor’s degree
- PCTVACANT: Proportion of housing units that are vacant
- PCTSINGLES: Percent of housing units that are detached single-family houses
- NBELPOV100: Number of households with incomes below 100% of the poverty level (i.e., number of households living in poverty)
- MEDHHINC: Median household income
Note that the original Philadelphia block group dataset has 1816
observations. We clean the data by removing the following block
groups:
- Block groups where population < 40
- Block groups where there are no housing units
- Block groups where the median house value is lower than $10,000
- One North Philadelphia block group which had a very high median house value (over $800,000) and a very low median household income (less than $8,000)
The final dataset which you are given contains 1720 block groups.
INSTRUCTIONS
SUGGESTION:
READ THE ENTIRE SET OF INSTRUCTIONS BEFORE STARTING TO WORK ON THE
ASSIGNMENT
IMPORTANT:
- When
working in GeoDa, be sure to save your work often. To do this, go to
File -> Save As, and save everything as a new shapefile. Saving as a
new shapefile is the only way to ensure that the new variables that
you create are saved in the table and retained there after you close
GeoDa. Unlike in ArcGIS, new fields are not saved automatically.
- You
may do this assignment in R instead.
- Note
that you have done some of the steps previously in HW 1. Here, I
take you through these steps in GeoDa if you choose to use it.
-
In GeoDa, open the file Regression
Data.shp.
-
Recreate the variable LNNBELPOV100
in GeoDa. (This is to give
you a bit of practice with using GeoDa for new variable
calculation.)
- Recall
that you first need to add 1 to NBELPOV100
prior to taking the natural log, because otherwise you would be
taking the logarithm of 0 in block groups
where NBELPOV100 = 0
(and, as you may recall, the logarithm of 0 is undefined).
Unfortunately, new variable creation in GeoDa is a bit tedious,
and needs to be done in two separate steps. That is, you cannot
simply input the formula LN(NBELPOV100
+ 1) into GeoDa. Instead,
you need to first create a variable NBELPOV100
+ 1 and only then take the
natural log of that sum.
-
Let’s first create the variable
PLUS1,
defined as NBELPOV100 + 1
-
To do this, first open the attribute
table
and right click anywhere on the table
that opens up. Then select Add
Variable.
- In
the box that pops up, select the following settings. Basically,
you’re creating a real (continuous) variable called PLUS1
that will be placed at the end of the table (i.e., the last
column in the table).
- In
the table, right click on PLUS1
(which contains all 0’s),
select Variable
Calculation. Then compute
the variable as below:
- Using
the steps outlined in 1.a.ii.2 above, create another new variable,
called LNNBELPOV.
Again, this variable will be defined as follows: LNNBELPOV
= LN(NBELPOV100+1) = LN(PLUS1).
- Your
table should now contain the variable called LNNBELPOV.
Again, it will be the field on the far right of the table. Right
click on LNNBELPOV
and select Variable
Calculation. The variable
should be calculated like this:
-
If LNMEDHVAL, the log of the dependent variable
(log of (median house value + 1)), isn’t already in the dataset,
calculate that variable too.
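For readers who find it easier to see the two-step calculation as code, here is a minimal Python sketch of the same logic: add 1, then take the natural log. The sample counts are made up for illustration and are not taken from Regression Data.shp.

```python
import math

# Hypothetical NBELPOV100 values for three block groups (not real data).
nbelpov100 = [0, 12, 57]

# Step 1: PLUS1 = NBELPOV100 + 1, so that no value is 0.
plus1 = [value + 1 for value in nbelpov100]

# Step 2: LNNBELPOV = LN(PLUS1).
lnnbelpov = [math.log(value) for value in plus1]

print(lnnbelpov)
```

A block group with NBELPOV100 = 0 ends up with LNNBELPOV = LN(1) = 0 rather than an undefined value.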
-
Create a Queen weight file (like this):
In the window that pops up, click Create
and then select the options specified below:
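As a rough illustration of what a Queen weight file encodes, the Python sketch below lists queen neighbors on a toy regular grid, where two cells are neighbors if they share an edge or a corner. This is only an assumption-laden teaching example: real block groups are irregular polygons, GeoDa derives contiguity from the shapefile geometry, and the function name queen_neighbors is hypothetical.

```python
def queen_neighbors(n_rows, n_cols):
    """Return {cell_id: [neighbor_ids]} for a regular grid.

    Cells are numbered row by row starting at 0. Two cells are queen
    neighbors if they touch along an edge or at a corner.
    """
    neighbors = {}
    for row in range(n_rows):
        for col in range(n_cols):
            cell = row * n_cols + col
            neighbors[cell] = [
                (row + d_row) * n_cols + (col + d_col)
                for d_row in (-1, 0, 1)
                for d_col in (-1, 0, 1)
                if (d_row, d_col) != (0, 0)
                and 0 <= row + d_row < n_rows
                and 0 <= col + d_col < n_cols
            ]
    return neighbors


grid = queen_neighbors(3, 3)
print(grid[0])       # neighbors of a corner cell
print(len(grid[4]))  # number of neighbors of the center cell
```

In a 3 x 3 grid, the center cell has eight queen neighbors, while a corner cell has only three.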
-
Now, we are ready for some analysis.
Using the instructions on the slides, for the variable LNMEDHVAL,
compute the global Moran’s I using the Queen weight matrix
created above. Then check to see whether the Moran’s I value is
significant (using 999 permutations). Take a screenshot of your
results to present in your report (Moran’s I value for the
sample, histogram of Moran’s I values for the permutations, and
the p-value that you obtain will need to be included).
-
Here and throughout, be sure to crop
the screenshots so that only relevant parts are included in the
report.
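The global Moran’s I and the permutation test described above can be sketched in a few lines of Python. This is a simplified illustration, not GeoDa’s implementation: it assumes row-standardized weights (each neighbor of a block group gets weight 1 divided by the number of neighbors), a one-sided pseudo p-value of (M + 1) / (permutations + 1), where M is the number of permuted statistics at least as large as the observed one, and a made-up four-area example. The function names are hypothetical.

```python
import random


def morans_i(values, neighbors):
    """Global Moran's I with row-standardized weights.

    neighbors maps each index to the list of its neighbors' indices.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross_product = 0.0  # sum of w_ij * dev_i * dev_j
    weight_sum = 0.0     # sum of all weights
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue
        w = 1.0 / len(nbrs)
        for j in nbrs:
            cross_product += w * dev[i] * dev[j]
            weight_sum += w
    variance_term = sum(d * d for d in dev)
    return (n / weight_sum) * (cross_product / variance_term)


def permutation_test(values, neighbors, permutations=999, seed=0):
    """Shuffle values across locations and recompute Moran's I each time."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least_as_large += 1
    pseudo_p = (at_least_as_large + 1) / (permutations + 1)
    return observed, pseudo_p


# Four hypothetical areas arranged in a line: 0 - 1 - 2 - 3.
line_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
observed_i, pseudo_p = permutation_test([1, 2, 3, 4], line_neighbors)
print(observed_i, pseudo_p)
```

With 999 permutations, the smallest pseudo p-value this procedure can return is 1 / 1000 = 0.001.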
- Run
the local
Moran’s I (LISA) analysis using the Queen weight matrix. Take a
screenshot of your results, which will need to be included in the
final report.
-
Now, we’re ready to run some
regression analysis! First, let’s rerun the OLS regression in
GeoDa.
-
To do this, on the main menu, select
Regression,
as is done below.
- We
start out with OLS (Classic) regression – the very same
regression we ran in R for the previous assignment. To do this,
select the settings as below, (be sure to navigate to the queen
weights file that you created in step 1.b), and click Run.
Depending on the version of GeoDa that you use, the variable
selection box may look slightly different from the one below, but
all the steps should be the same.
- Once
the regression finishes running, you will get the output. It
should match the output you obtained in R, but will contain a few
additional diagnostics.
- Copy
the regression output into Word or a text file. You
will be expected to present it in your report.
- For
best visualization, present this output using the Courier New
font, Size 8, single spaced.
- Go
back to the dialog box above, and click Save
to Table.
Clicking Save to Table
enables you to save OLS residuals
and OLS predicted values to
the table.
- So,
in the dialog box that pops up (Save
Regression Results), check
Residual.
This new variable will be given a name along the lines of
OLS_RESIDU,
as shown below. Click OK.
- Take
a look at the table: it now contains a new field called OLS_RESIDU
with values of the OLS
regression residuals at the very end of the table.
- Now,
let’s use GeoDa to create the weighted (i.e., spatially lagged)
residuals. That is, for each block group, we will compute an
average of the OLS residuals of the block group’s queen
neighbors. For instance, if block group 1’s queen neighbors are
block groups 3, 5 and 8, then the value of the weighted residual
for block group 1 will be the average of the residuals of block
groups 3, 5 and 8.
- In
order to do that, first create a new variable called WT_RESIDU.
- Then,
calculate the value of the variable as shown below. Again, for the
weight, select the queen weight matrix that you created earlier.
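The weighted residual calculation can be sketched in Python using the example given earlier, in which block group 1 has queen neighbors 3, 5 and 8. The residual values are invented for illustration, and the function name is hypothetical; GeoDa performs the same averaging using the queen weight file.

```python
def weighted_residual(block_group, residuals, neighbors):
    """Average the OLS residuals of a block group's queen neighbors."""
    nbrs = neighbors[block_group]
    return sum(residuals[j] for j in nbrs) / len(nbrs)


# Hypothetical OLS residuals, keyed by block group number.
ols_residu = {1: 0.10, 3: 0.50, 5: -0.25, 8: 0.50}

# Queen neighbors of block group 1, as in the example in the text.
queen = {1: [3, 5, 8]}

wt_residu_1 = weighted_residual(1, ols_residu, queen)
print(wt_residu_1)  # (0.50 - 0.25 + 0.50) / 3
```

Note that block group 1’s own residual (0.10 here) does not enter its weighted residual; only the neighbors’ residuals do.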

- Now,
let’s look at a scatterplot that shows OLS residuals plotted against
the average residuals of their queen neighbors. Because one of the
assumptions of OLS regression is independence of observations, if
this assumption holds, there will be no relationship between OLS
residuals and those of their neighbors. However, this assumption is
likely to be violated here.
- On
the main menu, go to Explore
-> Scatter Plot.
- Select
WT_RESIDU
as the independent variable
and OLS_RESIDU
as the dependent variable (as shown below), and click OK.
- Right-click
on the scatterplot that pops up, and check Display
Statistics. Some statistics
will be displayed at the bottom of the plot, including Slope
b (and corresponding
significance results) – this is the coefficient of WT_RESIDU
when you regress OLS_RESIDU
on WT_RESIDU.
- Note
that this is the same thing as running a simple regression
with OLS_RESIDU
as the dependent variable and WT_RESIDU
as the predictor. The Beta coefficient of WT_RESIDU
in that regression will be the same as Slope
b.
- Take
a screenshot of that scatterplot and the statistics that appear at
the bottom of it to present in your report.
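To make the link between Slope b and simple regression concrete, here is a short Python sketch of the least-squares slope obtained when regressing OLS_RESIDU on WT_RESIDU. The numbers are hypothetical, and the sketch computes only the slope, not the significance results that GeoDa also displays.

```python
def regression_slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    return sum_xy / sum_xx


# Hypothetical values: WT_RESIDU is the predictor, OLS_RESIDU the outcome.
wt_residu = [-1.0, 0.0, 1.0, 2.0]
ols_residu = [-0.5, 0.0, 0.5, 1.0]

slope_b = regression_slope(wt_residu, ols_residu)
print(slope_b)
```

A slope close to 0 would suggest little relationship between a block group’s residual and those of its neighbors, while a clearly positive slope points to positive spatial autocorrelation in the residuals.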
- Using
the same steps as in (1.c) above, look at the Moran’s I of the
OLS regression residuals to see whether there is spatial
autocorrelation.
- Again,
use the queen matrix that you calculated here.
- Test
whether the Moran’s I value is significant by running 999
permutations.
- Take
a screenshot of the Moran’s I results (both the Moran
scatterplot and the significance test). You will be expected to
present this in your report.
- Now,
let’s run the spatial lag
regression model in GeoDa.
- On
the main menu, go to Regression.
- In
the regression dialog box that pops up, select the following
settings:
- Above,
use the same queen weights file that we have created earlier.
- Once
you click Run, you will get the output. Copy the output into Word
or a text file. You will be expected to present it in your report.
- For
best visualization, present this output using the Courier New
font, Size 8, single spaced.
- After
the regression is done running, you will also be able to go back
to the regression dialog box, and click on Save
to Table, as shown below.
You will be asked to save the spatial lag regression residuals as
you did for OLS residuals.
- Now,
using the same steps as in (1.h) above, look at the Moran’s I
value of the Spatial Lag (SL) residuals, and run 999 permutations
to see whether the spatial autocorrelation in the SL residuals is
statistically significant. Once again, be sure to take a
screenshot of the Moran’s I results (both the Moran scatterplot
and the significance test). You will be expected to present this
in your report.
- Now,
repeat steps 1.i.i – 1.i.vi for spatial
error regression. That is,
keep everything the same except choose spatial
error instead of spatial
lag.
- Before
proceeding, make sure that you have the following outputs saved
somewhere:
- Global
and local Moran’s I results for the variable LNMEDHVAL
- OLS
Regression Results
- Spatial
Lag Regression Results
- Spatial
Error Regression Results
- A
scatterplot of OLS_RESIDU
and WT_RESIDU,
with statistics displayed
- Moran’s
I scatterplot (and results of 999 permutations) for OLS Regression
- Moran’s
I scatterplot (and results of 999 permutations) for Spatial Lag
Regression
- Moran’s
I scatterplot (and results of 999 permutations) for Spatial Error
Regression
- Be
sure to save your file (go to File
-> Save As, and save as
RegressionFinal.shp,
or something of the sort). Now, you may close GeoDa.
- Now,
open the file RegressionFinal.shp
in R and modify the code in the provided R Markdown to run GWR on
this data set. The same dependent variable and predictors as above
should be used. In your report, you will need to present the
following:
- Global
regression output (specifically, global R-squared, AIC and AICc).
- Present
a choropleth map of the local R-squared values.
- Moran’s
I scatter plot and random permutations test for GWR residuals,
which may be done in R using the code in the provided R Markdown,
or in GeoDa. If you choose to do this in GeoDa, export the GWR
results as a shapefile, open it in GeoDa, recalculate the queen
weight matrix and calculate the Moran’s I of GWR residuals. Then,
test whether the Moran’s I value is significant by running 999
permutations, and take a screenshot of the Moran’s I results
(both the Moran scatterplot and the significance test). You will be
expected to present this in your report.
- Follow
instructions on the slides to map local regression results.
Specifically, for each predictor, present a map of the ratio of the
beta coefficient to its standard error estimate.
- Use
dark red when the ratio is < -2, pink when the ratio is
between -2 and 0, light blue when the ratio is between 0 and 2,
and dark blue when the ratio is > 2.
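The color scheme above amounts to a simple classification rule, sketched here in Python. How values falling exactly on -2, 0 or 2 are assigned is not stated in these instructions, so the boundary handling below is an assumption, and the function name is hypothetical.

```python
def ratio_color(beta, standard_error):
    """Map a local coefficient / standard error ratio to a map color."""
    ratio = beta / standard_error
    if ratio < -2:
        return "dark red"
    if ratio < 0:        # assumption: exactly -2 falls in the pink class
        return "pink"
    if ratio <= 2:       # assumption: exactly 0 and 2 fall in light blue
        return "light blue"
    return "dark blue"


print(ratio_color(-0.9, 0.3))  # ratio = -3.0
print(ratio_color(0.5, 0.2))   # ratio = 2.5
```

Ratios beyond -2 or 2 are the ones flagged as possibly significant, which is why they get the darkest colors.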
Now,
you are finally ready to start writing your report!
REPORT OUTLINE
A successful report will address all
the points presented in this outline. You are strongly encouraged to
use the outline as a backbone for your report.
The outline here is structured as an
outline for a journal article. That is, in the Methods section, only
talk about the techniques that you use, present the formulas, etc. Do
not present any results in the methods section. In the Results
section, present the output from GeoDa and R, any figures,
etc., and describe your output.
-
Introduction (~2 paragraphs)
Section
Title
-
State the problem and the setting of
the analysis (Philadelphia).
-
Indicate that in the previous report,
you carried out OLS regression to examine the relationship between
your dependent variable and predictors (state what the DV and
predictors are).
-
State that OLS analysis is often
inappropriate when dealing with datasets that have a spatial
component
-
Mention that the purpose of this report
is to use spatial lag, spatial error and geographically weighted
regression to see whether these methods perform better than OLS.
-
Methods (~5 pages)
Section
Title
- A
Description of the Concept of Spatial Autocorrelation
Subsection Title
- Mention
the First Law of Geography
- Talk
about Moran’s I
- Present
and explain formula for Moran’s I
- As
with all the formulas, be sure to explain what each term is.
- Mention
and explain the weight matrix that you’re using.
-
Indicate that throughout this report,
you will be using this weight matrix.
-
Specify why statisticians sometimes
like to use more than one spatial weight matrix in their
analyses. Explain why this is done.
- In
your own words, talk about how you test whether the spatial
autocorrelation (Moran’s I) is significant. State what
hypotheses you’re testing (present the null and alternative
hypotheses) and describe the random permutation process.
- Describe
the concept of local spatial autocorrelation (no need for formulas
here), and how the significance tests are carried out.
-
A Review of OLS Regression and
Assumptions
Subsection
Title
-
Begin by giving a brief
(3-5 sentence) overview of OLS regression. Specifically, list the
assumptions of OLS
-
Refer the reader to your HW 1 for
more information on OLS.
-
State that when the data has a spatial
component, the assumption that your errors are random/independent
often doesn’t hold
-
Indicate that you can test the
assumption in (ii) above by examining the spatial autocorrelation
of the residuals using Moran’s I.
-
Indicate that another way to test OLS
residuals for spatial autocorrelation is to regress them on
nearby residuals (here, these nearby residuals are residuals at
neighboring block groups, as defined by the Queen matrix).
-
Explain what Slope
b at the bottom of the
scatterplot of OLS_RESIDU
and WT_RESIDU is,
and how it is calculated
-
State that GeoDa or R, [the tool that
you’re using to run your OLS regression], also has a way of
testing other regression assumptions.
-
The first is the assumption of
homoscedasticity,
which is tied to the assumption of independence of errors.
-
State which test(s) is/are used to
examine data for heteroscedasticity in GeoDa/R, and state the
null and alternative hypotheses.
-
Another assumption is that of
normality of errors.
-
State which test is used to test for
normality of errors in GeoDa/R, and state the null and
alternative hypotheses.
-
Spatial Lag and Spatial Error
Regression Subsection
Title
-
State whether you will be using GeoDa
or R for running spatial lag and spatial error regressions.
-
Describe the method of spatial lag
regression in several sentences.
-
Present the model equation for the
spatial lag model.
-
Instead of writing X1…X4, write
the names of the actual predictors that you’re using in this
assignment (e.g., PCTVACANT)
-
Explain what each term is (the β
coefficients, ρ, ε, etc)
-
Describe the method of spatial error
regression in several sentences.
-
Present the model equation for the
spatial error model.
-
Instead of writing X1…X4, write
the names of the actual predictors that you’re using in this
assignment (e.g., PCTVACANT)
-
Explain what each term is (the β
coefficients, λ, ε, u, etc)
-
Indicate that the assumptions that are
needed for OLS are still needed for both spatial lag and spatial
error regression models (except that of spatial independence of
observations).
-
State the goal of spatial lag and
spatial error regression (i.e., what you hope will happen with
regression residuals as a result of using these methods).
-
Mention that you will compare the
results of spatial lag regression with OLS and the results of
spatial error regression with OLS, and will decide whether the
spatial models perform better than OLS based on a number of criteria.
-
These criteria include
-
Akaike Information Criterion/Schwarz
Criterion;
-
Log Likelihood;
-
Likelihood Ratio Test
-
Be sure to describe what each of the
above criteria is, and how you decide which model is better based
on this criterion (state any null/alternative hypotheses, if
applicable).
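As a starting point for describing these criteria, the Python sketch below uses their standard textbook definitions: AIC = -2 lnL + 2k, Schwarz Criterion = -2 lnL + k ln(n), and likelihood ratio statistic = 2 (lnL of the spatial model - lnL of OLS). The log likelihoods and parameter counts are hypothetical, and how GeoDa counts the parameters k for each model is not covered here, so treat this as a sketch rather than a reproduction of GeoDa’s output.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion; lower values indicate better fit."""
    return -2 * log_likelihood + 2 * k


def schwarz_criterion(log_likelihood, k, n):
    """Schwarz Criterion; lower values indicate better fit."""
    return -2 * log_likelihood + k * math.log(n)


def likelihood_ratio(log_likelihood_ols, log_likelihood_spatial):
    """LR statistic for a spatial model against the nested OLS model."""
    return 2 * (log_likelihood_spatial - log_likelihood_ols)


# Hypothetical log likelihoods and parameter counts (not real results).
n_obs = 1720
ll_ols, k_ols = -100.0, 5
ll_lag, k_lag = -90.0, 6

print(aic(ll_ols, k_ols), aic(ll_lag, k_lag))
print(schwarz_criterion(ll_ols, k_ols, n_obs))
print(likelihood_ratio(ll_ols, ll_lag))
```

With one additional spatial parameter, the likelihood ratio statistic is compared against a chi-square distribution with 1 degree of freedom, whose 5% critical value is about 3.84.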
-
State that another way of comparing
OLS results with spatial lag and spatial error results is by
looking at the Moran’s I of regression residuals.
-
Indicate how you would decide which
model is better based on this criterion.
-
Geographically Weighted
Regression Subsection
Title
-
State that you will do your GWR
analyses in R.
-
Introduce GWR by talking about the
concepts of Simpson’s paradox and local regression.
-
Present the GWR equations and explain
them in your own words
-
Talk about how local regression is run
-
Discuss the concept of bandwidth, and
talk about adaptive vs. fixed bandwidth.
-
State that here, you will be using
adaptive bandwidth
-
Explain why adaptive bandwidth is
more appropriate in this problem than the fixed bandwidth
-
Mention that the OLS assumptions still
hold in GWR.
-
When mentioning multicollinearity,
talk about the Condition Number, and the issues of
multicollinearity/clustering in GWR.
-
Indicate why p-values are not part of
the GWR output.
-
Results (~3-5 pages, excluding
maps, figures & tables)
Section
Title
- Spatial
Autocorrelation
Subsection Title
- Present
and describe the global Moran’s I value of the dependent
variable and the random permutations test results.
- Is
LNMEDHVAL significantly
spatially autocorrelated?
- For
local Moran’s I results, present the Significance Map and
Cluster Map obtained by running the local Moran’s I analysis.
- Discuss
the results: what are the not significant, high-high, high-low,
low-high and low-low areas on the Cluster Map? Where in the city
are these areas?
-
A Review of OLS Regression and
Assumptions: Results Subsection
Title
-
Present the OLS output from GeoDa
(call this Table 1)
-
Give a brief two-sentence overview of
the OLS results (feel free to paste this from your description in
HW 1). That is, simply indicate which predictors are significant
and what % of variance in LNMEDHVAL
has been explained by the model.
-
Comment on the results of the tests
on heteroscedasticity
-
Are the results from the different
tests consistent with each other?
-
Do they indicate a problem with
heteroscedasticity?
-
Is this conclusion consistent with
the conclusion from the residual by predicted plot you presented
in HW 1?
-
Include that plot in the current
report as well.
-
Comment on the results of the test on
normality of errors (Jarque-Bera test)
-
Do test results indicate a problem
with normality?
-
Is this conclusion consistent with
the histogram of residuals (errors) you presented in HW 1? If
not, comment why not.
-
Include the histogram in the
current report as well.
-
Present the scatterplot of OLS_RESIDU
by WT_RESIDU
and describe the results.
-
Is Slope
b at
the bottom of the scatterplot significant,
meaning that there’s significant spatial autocorrelation?
-
Present the Moran’s I scatterplot
and results from the 999 permutations for OLS regression
residuals.
-
Are you seeing significant spatial
autocorrelation in your OLS residuals, and is this problematic?
-
Do Moran’s I and the Beta
coefficient of weighted (spatially lagged) residuals tell a
similar story?
-
Spatial Lag and Spatial Error
Regression Results Subsection
Title
-
Present results of Spatial Lag
regression (call this Table 2)
-
Talk about the W_LNMEDHVAL
term in the spatial lag regression output. State whether it is
significant, and how the results can be interpreted.
-
Are the remaining terms (i.e., the
predictors LNNBELPOV,
PCTBACHMOR,
PCTSINGLES,
and PCTVACANT)
in the model significant?
-
Compare these results to OLS
results.
-
State whether, based on the
Breusch-Pagan test, the spatial lag regression residuals are
still heteroscedastic.
-
Compare the Spatial Lag regression
and OLS regression models based on the Akaike Information
Criterion/Schwarz Criterion, the Log Likelihood, and the
Likelihood Ratio Test.
-
Present the Moran’s I scatterplot
of spatial lag regression residuals. Does there seem to be less
spatial autocorrelation in these residuals than in OLS residuals?
-
Overall, which model is doing better
based on all of these criteria?
-
Present results of Spatial Error
regression (call this Table 3)
-
Talk about the LAMBDA
term in the spatial error regression output. State whether it is
significant, and how the results can be interpreted.
-
Are the remaining terms (i.e., the
predictors LNNBELPOV,
PCTBACHMOR,
PCTSINGLES,
and PCTVACANT)
in the model significant?
-
Compare these results to OLS
results.
-
State whether, based on the
Breusch-Pagan test, the spatial error regression residuals are
still heteroscedastic.
-
Compare the Spatial Error regression
and OLS regression based on the Akaike Information
Criterion/Schwarz Criterion, the Log Likelihood, and the
Likelihood Ratio Test.
-
Present the Moran’s I scatterplot
of spatial error regression residuals. Does there seem to be less
spatial autocorrelation in these residuals than in OLS residuals?
-
Overall, which model is doing better
based on all of these criteria?
-
Compare the Spatial Lag and Spatial
Error results with each other
-
Recall that you should not be using
the likelihood-ratio test for this because the models are not
nested (i.e., neither model is a special case of the other).
However, it is OK to compare two non-nested models, such as
spatial lag and spatial error, based on the Akaike Information
Criterion and the Schwarz Information Criterion.
-
Which model has better (lower)
Akaike Information Criterion and Schwarz Information Criterion
values?
-
Geographically Weighted Regression
Results Subsection
Title
-
Present the global GWR results
-
Compare the (overall) R-squared of
the GWR regression with the R-squared of the OLS regression.
State which regression method seems to be doing a better job of
explaining the variance in the dependent variable.
-
Compare the Akaike Information
Criterion (AIC, not AICc) of GWR with those of the OLS, Spatial Lag
and Spatial Error models. Which model seems to be doing a better
job based on that? (Remember, the lower the Akaike Information
Criterion, the better the fit.)
-
Present the Moran’s I scatterplot
of GWR residuals. Does there seem to be less spatial
autocorrelation in these residuals than in OLS residuals? What
about the Spatial Lag and Spatial Error residuals?
-
Be sure to discuss local regression
results, as is done on the slides.
-
Present the maps of coefficients
divided by the standard error that you created earlier. Are there
locations in the city where the relationships between each of the
predictors and the dependent variable are possibly significant?
-
Present and discuss the choropleth
map of local R-squared values.
-
Discussion (~1 page)
Section
Title
-
In a couple of sentences, recap what you
did in the paper and your findings. Discuss what conclusions you
can draw, and which of the four regression methods (OLS, Spatial
Lag, Spatial Error, GWR) was the best, based on the results.
-
Give a brief description of the
limitations (i.e., which assumptions were not met).
-
Discuss what is meant by weighted
(i.e., spatially lagged)
residuals, as opposed to spatial
lag [model]
residuals. [This is a common source of confusion, and being able to
explain this in your own words is important.]
-
Make sure that you are using the
correct terminology throughout the report
-
Mention why ArcGIS is problematic for
GWR.