Homework 2: Using Spatial Lag, Spatial Error and Geographically Weighted Regression to Predict Median House Values in Philadelphia Block Groups

In the previous assignment, you were asked to use OLS regression to examine the relationship between median house values and several neighborhood characteristics, using Philadelphia data at the Census block group level. For the current assignment, you will use GeoDa and ArcGIS to run spatial lag, spatial error and geographically weighted regression to see whether these methods can account for the spatial autocorrelation that might remain in the OLS residuals.

Remember that this report needs to be written in the same format as your previous submission, with an introduction, methods/results, and discussion. Do not simply copy the questions and answer them. Below, you will find an outline which you’re asked to follow when writing your report.

Data Description

The attribute table of the Philadelphia Census block group level dataset Regression Data.shp contains the following variables:

AREAKEY: Census Block Group ID
MEDHVAL: Median value of all owner occupied housing units
PCBACHMORE: Proportion of residents in Block Group with at least a bachelor’s degree
PCTVACANT: Proportion of housing units that are vacant
PCTSINGLES: Percent of housing units that are detached single family houses
NBELPOV100: Number of households with incomes below 100% poverty level (i.e., number of households living in poverty)
MEDHHINC: Median household income

Note that the original Philadelphia block group dataset has 1816 observations. We clean the data by removing the following block groups:

Block groups where population < 40
Block groups where there are no housing units
Block groups where the median house value is lower than $10,000
One North Philadelphia block group which had a very high median house value (over $800,000) and a very low median household income (less than $8,000)

The final dataset which you are given contains 1720 block groups.

INSTRUCTIONS

SUGGESTION: READ THE ENTIRE SET OF INSTRUCTIONS BEFORE STARTING TO WORK ON THE ASSIGNMENT

IMPORTANT:

When working in GeoDa, be sure to save your work often. To do this, go to File -> Save As, and save everything as a new shapefile. Saving as a new shapefile is the only way to ensure that the new variables you create are saved in the table and retained there once you close GeoDa. Unlike in ArcGIS, new fields are not saved automatically.
You may do this assignment in R instead.
Note that you have done some of the steps previously in HW 1. Here, I take you through these steps in GeoDa if you choose to use it.

In GeoDa, open the file Regression Data.shp.

Recreate the variable LNNBELPOV100 in GeoDa. (This is to give you a bit of practice with using GeoDa for new variable calculation.)

Recall that you first need to add 1 to NBELPOV100 prior to taking the natural log, because otherwise you may have a situation where you are taking logarithms of 0’s in block groups where NBELPOV100 = 0 (and as you may recall, logarithms of 0’s are undefined). Unfortunately, new variable creation in GeoDa is a bit tedious, and needs to be done in two separate steps. That is, you cannot simply input the formula LN(NBELPOV100 + 1) into GeoDa. Instead, you need to first create a variable NBELPOV100 + 1 and only then take the natural log of that sum.
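If it helps to see the two-step logic outside of GeoDa, here is a minimal Python sketch of the same transformation. The function name log_plus_one and the four NBELPOV100 values are made up for illustration; they are not taken from the dataset.

```python
import math

def log_plus_one(values):
    """Mirror the two GeoDa steps: PLUS1 = x + 1, then LN(PLUS1)."""
    plus1 = [v + 1 for v in values]        # step 1: PLUS1 = NBELPOV100 + 1
    return [math.log(p) for p in plus1]    # step 2: LNNBELPOV = LN(PLUS1)

# Hypothetical NBELPOV100 counts for four block groups (illustrative only).
nbelpov100 = [0, 9, 99, 250]
lnnbelpov = log_plus_one(nbelpov100)

# A block group with 0 households in poverty gets LN(0 + 1) = 0
# instead of an undefined LN(0).
print(lnnbelpov)
```

Adding 1 first means that block groups with NBELPOV100 = 0 end up with a transformed value of 0 rather than an undefined result.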

Let’s first create the variable PLUS1, defined as NBELPOV100 + 1

To do this, first open the attribute table and right-click anywhere on the table that opens up. Then select Add Variable.

In the box that pops up, select the following settings. Basically, you’re creating a real (continuous) variable called PLUS1 that will be placed at the end of the table (i.e., the last column in the table).

In the table, right click on PLUS1 (which contains all 0’s), select Variable Calculation. Then compute the variable as below:

Using the steps outlined in 1.a.ii.2 above, create another new variable, called LNNBELPOV. Again, this variable will be defined as follows: LNNBELPOV = LN(NBELPOV100+1) = LN(PLUS1).

Your table should now contain the variable called LNNBELPOV. Again, it will be the field on the very right of the table. Right click on LNNBELPOV and select Variable Calculation. The variable should be calculated like this:

If the log of the dependent variable, (log of (median house value + 1)), LNMEDHVAL, isn’t already in the dataset, calculate that variable too.

Create a Queen weight file (like this):

In the window that pops up, click Create and then select the options specified below:

Now, we are ready for some analysis. Using the instructions on the slides, for the variable LNMEDHVAL, compute the global Moran’s I using the Queen weight matrix created above. Then check to see whether the Moran’s I value is significant (using 999 permutations). Take a screenshot of your results to present in your report (Moran’s I value for the sample, histogram of Moran’s I values for the permutations, and the p-value that you obtain will need to be included).
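To make the permutation logic concrete, here is a rough Python sketch of global Moran's I with a random permutation test. It is only an illustration under several assumptions: the weights are row-standardized queen weights, the data are a made-up 4 x 4 grid rather than the Philadelphia block groups, and the pseudo p-value is the one-sided (M + 1) / (permutations + 1) form. The function names are hypothetical, and GeoDa's internals may differ in detail.

```python
import random

def queen_neighbors_grid(rows, cols):
    """Queen contiguity on a regular grid: cells sharing an edge or a corner."""
    nbrs = []
    for r in range(rows):
        for c in range(cols):
            cell = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if (dr, dc) == (0, 0):
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        cell.append(rr * cols + cc)
            nbrs.append(cell)
    return nbrs

def morans_i(values, neighbors):
    """Global Moran's I using row-standardized weights (1 / number of neighbors)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = 0.0   # sum over i, j of w_ij * dev_i * dev_j
    s0 = 0.0    # sum of all weights
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            continue
        w = 1.0 / len(nbrs)
        for j in nbrs:
            num += w * dev[i] * dev[j]
            s0 += w
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)

def permutation_test(values, neighbors, n_perm=999, seed=0):
    """Shuffle values across locations and count how often I is at least the observed I."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)

# Toy data: values increase from the top row to the bottom row, so similar
# values sit next to each other (positive spatial autocorrelation).
w = queen_neighbors_grid(4, 4)
values = [float(i // 4) for i in range(16)]
observed, p_value = permutation_test(values, w)
print(observed, p_value)
```

With 999 permutations the smallest pseudo p-value you can obtain is 1/1000 = 0.001, which is why results are often reported at that level.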

Here and throughout, be sure to crop the screenshots so that only relevant parts are included in the report.

Run the local Moran’s I (LISA) analysis using the Queen weight matrix. Take a screenshot of your results, which will need to be included in the final report.
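As a rough illustration of how the cluster map categories arise, the sketch below assigns each unit to a Moran scatterplot quadrant based on whether its own value and the average of its neighbors' values fall above or below the mean. This is a simplification: it leaves out the permutation-based significance test that LISA uses to decide which units are actually shown as clusters, and the five values and their neighbor lists are invented for the example.

```python
def moran_quadrants(values, neighbors):
    """Label each unit by the sign of its deviation and of its neighbors' mean deviation."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    labels = []
    for i, nbrs in enumerate(neighbors):
        lag = sum(dev[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        if dev[i] >= 0 and lag >= 0:
            labels.append("high-high")
        elif dev[i] < 0 and lag < 0:
            labels.append("low-low")
        elif dev[i] >= 0:
            labels.append("high-low")
        else:
            labels.append("low-high")
    return labels

# Five hypothetical units along a line; each unit neighbors the units beside it.
toy_values = [10, 9, 1, 2, 1]
toy_neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
labels = moran_quadrants(toy_values, toy_neighbors)
print(labels)
```

In the actual LISA output, units whose local statistic is not significant are labeled not significant regardless of their quadrant.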

Now, we’re ready to run some regression analysis! First, let’s rerun the OLS regression in GeoDa.

To do this, on the main menu, select Regression, as is done below.

We start out with OLS (Classic) regression – the very same regression we ran in R for the previous assignment. To do this, select the settings as below, (be sure to navigate to the queen weights file that you created in step 1.b), and click Run. Depending on the version of GeoDa that you use, the variable selection box may look slightly different from the one below, but all the steps should be the same.

Once the regression finishes running, you will get the output. It should be the output you obtained with R, but will contain a few additional diagnostics.

Copy the regression output into Word or a text file. You will be expected to present it in your report.

For best visualization, present this output using the Courier New font, Size 8, single spaced.

Go back to the dialog box above, and click Save to Table. Clicking Save to Table enables you to save OLS residuals and OLS predicted values to the table.

So, in the dialog box that pops up (Save Regression Results), check Residual. This new variable will be given a name along the lines of OLS_RESIDU, as shown below. Click OK.

Take a look at the table: it now contains a new field called OLS_RESIDU with values of the OLS regression residuals at the very end of the table.

Now, let’s use GeoDa to create the weighted (i.e., spatially lagged) residuals. That is, for each block group, we will compute the average of the OLS residuals of the block group’s queen neighbors. For instance, if block group 1’s queen neighbors are block groups 3, 5 and 8, then the value of the weighted residual for block group 1 will be the average of the residuals of block groups 3, 5 and 8.
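The block group 1 example can be sketched in a few lines of Python. The residual values are invented for illustration, and the function name spatial_lag is hypothetical; the point is only that the weighted residual is a plain average over the queen neighbors.

```python
def spatial_lag(values, neighbors):
    """For each unit, average the values of its neighbors (row-standardized weights)."""
    lagged = {}
    for unit, nbrs in neighbors.items():
        lagged[unit] = sum(values[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
    return lagged

# Hypothetical OLS residuals keyed by block group number.
ols_residu = {1: 0.40, 3: 0.10, 5: -0.20, 8: 0.70}
# Block group 1's queen neighbors are block groups 3, 5 and 8.
queen = {1: [3, 5, 8]}

wt_residu = spatial_lag(ols_residu, queen)
print(wt_residu[1])  # (0.10 - 0.20 + 0.70) / 3, i.e. about 0.2
```

Note that block group 1's own residual (0.40) does not enter its weighted residual; only the neighbors' residuals do.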

In order to do that, first create a new variable called WT_RESIDU.

Then, calculate the value of the variable as shown below. Again, for the weight, select the queen weight matrix that you created earlier.

Now, let’s look at a scatterplot that shows OLS residuals plotted against the residuals of their queen neighbors. One of the assumptions of OLS regression is independence of observations; if this assumption holds, there will be no relationship between OLS residuals and their neighbors’ residuals. However, this assumption is likely to be violated here.

On the main menu, go to Explore -> Scatter Plot.

Select WT_RESIDU as the independent variable and OLS_RESIDU as the dependent variable (as shown below), and click OK.

Right-click on the scatterplot that pops up, and check Display Statistics. Some statistics will be displayed at the bottom of the plot, including Slope b (and corresponding significance results) – this is the coefficient of WT_RESIDU when you regress OLS_RESIDU on WT_RESIDU.

Note that this is the same thing as running a simple regression with OLS_RESIDU as the dependent variable and WT_RESIDU as the predictor. The Beta coefficient of WT_RESIDU in that regression will be the same as Slope b.
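For reference, the slope of a simple regression is the covariance of the two variables divided by the variance of the predictor. The sketch below computes it in Python for made-up WT_RESIDU and OLS_RESIDU values (constructed so that the true slope is 0.5); it is not GeoDa's code, and it does not reproduce the significance test shown under the scatterplot.

```python
def ols_slope(x, y):
    """Slope of the simple regression of y on x: cov(x, y) / var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical values: WT_RESIDU is the predictor, OLS_RESIDU the dependent variable.
wt_residu = [-0.2, 0.0, 0.1, 0.3]
ols_residu = [0.5 * v + 0.1 for v in wt_residu]

slope_b = ols_slope(wt_residu, ols_residu)
print(slope_b)  # close to 0.5 by construction
```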

Take a screenshot of that scatterplot and the statistics that appear at the bottom of it to present in your report.

Using the same steps as in (1.c) above, look at the Moran’s I of the OLS regression residuals to see whether there is spatial autocorrelation.

Again, use the queen matrix that you calculated here.

Test whether the Moran’s I value is significant by running 999 permutations.

Take a screenshot of the Moran’s I results (both the Moran scatterplot and the significance test). You will be expected to present this in your report.

Now, let’s run the spatial lag regression model in GeoDa.

On the main menu, go to Regression.

In the regression dialog box that pops up, select the following settings:

Above, use the same queen weights file that we have created earlier.

Once you click Run, you will get the output. Copy the output into Word or a text file. You will be expected to present it in your report.

For best visualization, present this output using the Courier New font, Size 8, single spaced.

After the regression is done running, you will also be able to go back to the regression dialog box, and click on Save to Table, as shown below. You will be asked to save the spatial lag regression residuals as you did for OLS residuals.

Now, using the same steps as in (1.h) above, look at the Moran’s I value of the Spatial Lag (SL) residuals, and run 999 permutations to see whether the spatial autocorrelation in the SL residuals is statistically significant. Once again, be sure to take a screenshot of the Moran’s I results (both the Moran scatterplot and the significance test). You will be expected to present this in your report.

Now, repeat steps 1.i.i – 1.i.vi for spatial error regression. That is, keep everything the same except choose spatial error instead of spatial lag.

Before proceeding, make sure that you have the following outputs saved somewhere:

Global and local Moran’s I results for the variable LNMEDHVAL

OLS Regression Results

Spatial Lag Regression Results

Spatial Error Regression Results

A scatterplot of OLS_RESIDU and WT_RESIDU, with statistics displayed

Moran’s I scatterplot (and results of 999 permutations) for OLS Regression

Moran’s I scatterplot (and results of 999 permutations) for Spatial Lag Regression

Moran’s I scatterplot (and results of 999 permutations) for Spatial Error Regression

Be sure to save your file (go to File -> Save As, and save as RegressionFinal.shp, or something of the sort). Now, you may close GeoDa.

Now, open the file RegressionFinal.shp in R and modify the code in the provided R Markdown to run GWR on this data set. The same dependent variable and predictors as above should be used. In your report, you will need to present the following:

Global regression output (specifically, global R-squared, AIC and AICc).

Present a choropleth map of the local R-squared values.

Moran’s I scatter plot and random permutations test for GWR residuals, which may be done in R using the code in the provided R Markdown, or in GeoDa. If you choose to do this in GeoDa, export the GWR results as a shapefile, open it in GeoDa, recalculate the queen weight matrix and calculate the Moran’s I of GWR residuals. Then, test whether the Moran’s I value is significant by running 999 permutations, and take a screenshot of the Moran’s I results (both the Moran scatterplot and the significance test). You will be expected to present this in your report.

Follow instructions on the slides to map local regression results. Specifically, present maps of the ratio of the beta coefficients and the standard error estimates.

Use dark red when the ratio is < -2, pink when the ratio is between -2 and 0, light blue when the ratio is between 0 and 2, and dark blue when the ratio is > 2.
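The classification rule can be written as a small function, sketched here in Python. The function name ratio_color is hypothetical, and how the exact boundary values (-2, 0 and 2) are assigned is an assumption, since the instructions do not say which class they fall in.

```python
def ratio_color(beta, std_error):
    """Map a local coefficient / standard error ratio to one of the four map colors."""
    ratio = beta / std_error
    if ratio < -2:
        return "dark red"
    if ratio < 0:          # between -2 and 0 (assumption: -2 itself is pink)
        return "pink"
    if ratio <= 2:         # between 0 and 2 (assumption: 0 and 2 are light blue)
        return "light blue"
    return "dark blue"     # ratio > 2

# Hypothetical local coefficients, each with a standard error of 0.1.
for beta in (-0.35, -0.10, 0.15, 0.42):
    print(beta, ratio_color(beta, 0.1))
```

Ratios beyond roughly 2 in absolute value are the ones usually read as possibly significant, which is why the darkest colors are reserved for them.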

Now, you are finally ready to start writing your report!

REPORT OUTLINE

A successful report will address all the points presented in this outline. You are strongly encouraged to use the outline as a backbone for your report.

The outline here is structured as an outline for a journal article. That is, in the Methods section, only talk about the techniques that you use, present the formulas, etc. Do not present any results in the methods section. In the Results section, actually present the output from R and ArcGIS, any figures, etc, and describe your output.

Introduction (~2 paragraphs) Section Title
1. State the problem and the setting of the analysis (Philadelphia).
2. Indicate that in the previous report, you carried out OLS regression to examine the relationship between your dependent variable and predictors (state what the DV and predictors are).
3. State that OLS analysis is often inappropriate when dealing with datasets that have a spatial component
4. Mention that the purpose of this report is to use spatial lag, spatial error and geographically weighted regression to see whether these methods perform better than OLS.

Methods (~5 pages) Section Title
1. A Description of the Concept of Spatial Autocorrelation Subsection Title
  1. Mention the First Law of Geography
  2. Talk about Moran’s I
    1. Present and explain formula for Moran’s I
      1. As with all the formulas, be sure to explain what each term is.
  3. Mention and explain the weight matrix that you’re using.
    1. Indicate that throughout this report, you will be using this weight matrix.
    2. Specify why statisticians sometimes like to use more than one spatial weight matrix in their analyses. Explain why this is done.
  4. In your own words, talk about how you test whether the spatial autocorrelation (Moran’s I) is significant. State what hypotheses you’re testing (present the null and alternative hypotheses) and describe the random permutation process.
  5. Describe the concept of local spatial autocorrelation (no need for formulas here), and how the significance tests are carried out.

A Review of OLS Regression and Assumptions Subsection Title
1. Begin by giving a brief (3-5 sentence) overview of OLS regression. Specifically, list the assumptions of OLS
  1. Refer the reader to your HW 1 for more information on OLS.
2. State that when the data has a spatial component, the assumption that your errors are random/independent often doesn’t hold
  1. Indicate that you can test the assumption in (ii) above by examining the spatial autocorrelation of the residuals using Moran’s I.
  2. Indicate that another way to test OLS residuals for spatial autocorrelation is to regress them on nearby residuals (here, these nearby residuals are residuals at neighboring block groups, as defined by the Queen matrix).
    1. Mention what is Slope b at the bottom of the scatterplot of OLS_RESIDU and WT_RESIDU, and how it is calculated
3. State that GeoDa or R, [the tool that you’re using to run your OLS regression], also has a way of testing other regression assumptions.
  1. The first is the assumption of homoscedasticity, which is tied to the assumption of independence of errors.
    1. State which test(s) is/are used to examine data for heteroscedasticity in GeoDa/R, and state the null and alternative hypotheses.
  2. Another assumption is that of normality of errors.
    1. State which test is used to test for normality of errors in GeoDa/R, and state the null and alternative hypotheses.

Spatial Lag and Spatial Error Regression Subsection Title
1. State whether you will be using GeoDa or R for running spatial lag and spatial error regressions.
2. Describe the method of spatial lag regression in several sentences.
  1. Present the model equation for the spatial lag model.
    1. Instead of writing X1…X4, write the names of the actual predictors that you’re using in this assignment (e.g., PCTVACANT)
    2. Explain what each term is (the β coefficients, ρ, ε, etc)
3. Describe the method of spatial error regression in several sentences.
  1. Present the model equation for the spatial error model.
    1. Instead of writing X1…X4, write the names of the actual predictors that you’re using in this assignment (e.g., PCTVACANT)
    2. Explain what each term is (the β coefficients, λ, ε, u, etc)
4. Indicate that the assumptions that are needed for OLS are still needed for both spatial lag and spatial error regression models (except that of spatial independence of observations).
5. State the goal of spatial lag and spatial error regression (i.e., what you hope will happen with regression residuals as a result of using these methods).
6. Mention that you will compare the results of spatial lag regression with OLS and the results of spatial error regression with OLS, and will decide whether the spatial models perform better than OLS based on a number of criteria.
  1. These criteria include
    1. Akaike Information Criterion/Schwarz Criterion;
    2. Log Likelihood;
    3. Likelihood Ratio Test
  2. Be sure to describe what each of the above criteria is, and how you decide which model is better based on this criterion (state any null/alternative hypotheses, if applicable).
  3. State that another way of comparing OLS results with spatial lag and spatial error results is by looking at the Moran’s I of regression residuals.
    1. Indicate how you would decide which model is better based on this criterion.
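As a reminder of how the comparison criteria are computed from a model's log likelihood, here is a minimal Python sketch. The log likelihood values and parameter counts are hypothetical, not results from this dataset, and software packages can differ in which parameters they count in k, so treat the numbers as illustrative only. The likelihood ratio p-value assumes a chi-square distribution with 1 degree of freedom, which applies when the spatial model adds one parameter to OLS.

```python
import math

def aic(log_lik, k):
    """Akaike Information Criterion: -2 * log likelihood + 2 * k."""
    return -2.0 * log_lik + 2.0 * k

def schwarz(log_lik, k, n):
    """Schwarz Criterion: -2 * log likelihood + k * ln(n)."""
    return -2.0 * log_lik + k * math.log(n)

def likelihood_ratio_test(ll_restricted, ll_full):
    """LR statistic and p-value for one extra parameter (chi-square, 1 df)."""
    stat = 2.0 * (ll_full - ll_restricted)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

n = 1720                # number of block groups in the final dataset
ll_ols = -100.00        # hypothetical log likelihood of the OLS model
ll_spatial = -98.08     # hypothetical log likelihood of a spatial model

print(aic(ll_ols, 5), aic(ll_spatial, 6))                   # lower is better
print(schwarz(ll_ols, 5, n), schwarz(ll_spatial, 6, n))     # lower is better
lr_stat, lr_p = likelihood_ratio_test(ll_ols, ll_spatial)
print(lr_stat, lr_p)    # a statistic near 3.84 corresponds to p near 0.05
```

In this made-up case the spatial model has the higher log likelihood and the lower AIC, but the Schwarz Criterion, which penalizes the extra parameter more heavily when n is large, favors OLS. The criteria do not always agree.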

Geographically Weighted Regression Subsection Title
1. State that you will do your GWR analyses in R.
2. Introduce GWR by talking about the concepts of Simpson’s paradox and local regression.
3. Present the GWR equations and explain them in your own words
4. Talk about how local regression is run
5. Discuss the concept of bandwidth, and talk about adaptive vs. fixed bandwidth.
  1. State that here, you will be using adaptive bandwidth
    1. Explain why adaptive bandwidth is more appropriate in this problem than the fixed bandwidth
6. Mention that the OLS assumptions still hold in GWR.
  1. When mentioning multicollinearity, talk about the Condition Number, and the issues of multicollinearity/clustering in GWR.
7. Indicate why p-values are not part of the GWR output.
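To illustrate the ideas of local regression and adaptive bandwidth, here is a simplified Python sketch. It uses a bisquare kernel, a single predictor, and ten made-up observations along a line; the function names are hypothetical. This is not what the GWR code in the provided R Markdown does internally, only a small picture of the mechanics: the bandwidth at each regression point is the distance to its k-th nearest observation, and nearer observations get more weight in the local fit.

```python
import math

def adaptive_bisquare_weights(point, coords, k):
    """Bandwidth = distance to the k-th nearest observation; bisquare kernel weights."""
    dists = [math.dist(point, c) for c in coords]
    bandwidth = sorted(dists)[k - 1]
    if bandwidth == 0:
        raise ValueError("bandwidth is zero; increase k")
    weights = [(1 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0 for d in dists]
    return bandwidth, weights

def weighted_slope(x, y, w):
    """Slope of a weighted simple regression of y on x."""
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx

# Ten hypothetical observations along a west-east line.
coords = [(float(i), 0.0) for i in range(10)]
x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10, 5, 4, 3, 2, 1]   # slope +2 in the west, -1 in the east

_, w_west = adaptive_bisquare_weights(coords[0], coords, k=5)
_, w_east = adaptive_bisquare_weights(coords[9], coords, k=5)
slope_west = weighted_slope(x, y, w_west)
slope_east = weighted_slope(x, y, w_east)
print(slope_west, slope_east)
```

The two local slopes have opposite signs even though a single global regression would report only one slope for the whole area, which is the kind of situation local regression is meant to reveal. With an adaptive bandwidth, each local fit draws on the same number of nearby observations whether they are densely or sparsely spaced.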

Results (~3-5 pages, excluding maps, figures & tables) Section Title
1. Spatial Autocorrelation Subsection Title
  1. Present and describe the global Moran’s I value of the dependent variable and the random permutations test results.
    1. Is LNMEDHVAL significantly spatially autocorrelated?
  2. For Local Moran’s I results, present the Significance Map and Cluster Map obtained by running the Local Moran’s I.
    1. Discuss the results: what are the not significant, high-high, high-low, low-high and low-low areas on the Cluster Map? Where in the city are these areas?

A Review of OLS Regression and Assumptions: Results Subsection Title
1. Present the OLS output from GeoDa (call this Table 1)
  1. Give a brief 2 sentence overview of the OLS results (feel free to paste this from your description in HW 1). That is, simply indicate which predictors are significant and what % of variance in LNMEDHVAL has been explained by the model.
  2. Comment on the results of the tests on heteroscedasticity
    1. Are the results from the different tests consistent with each other?
    2. Do they indicate a problem with heteroscedasticity?
    3. Is this conclusion consistent with the conclusion from the residual by predicted plot you presented in HW 1?
      1. Include that plot in the current report as well.
  3. Comment on the results of the test on normality of errors (Jarque-Bera test)
    1. Do test results indicate a problem with normality?
    2. Is this conclusion consistent with the histogram of residuals (errors) you presented in HW 1? If not, comment why not.
      1. Include the histogram in the current report as well.
2. Present the scatterplot of OLS_RESIDU by WT_RESIDU and describe the results.
  1. Is Slope b at the bottom of the scatterplot significant, meaning that there’s significant spatial autocorrelation?
3. Present the Moran’s I scatterplot and results from the 999 permutations for OLS regression residuals.
  1. Are you seeing significant spatial autocorrelation in your OLS residuals, and is this problematic?
  2. Do Moran’s I and the Beta coefficient of weighted (spatially lagged) residuals tell a similar story?

Spatial Lag and Spatial Error Regression Results Subsection Title
1. Present results of Spatial Lag regression (call this Table 2)
  1. Talk about the W_LNMEDHVAL term in the spatial lag regression output. State whether it is significant, and how the results can be interpreted.
  2. Are the remaining terms (i.e., the predictors LNNBELPOV, PCTBACHMOR, PCTSINGLES, and PCTVACANT) in the model significant?
    1. Compare these results to OLS results.
  3. State whether, based on the Breusch-Pagan test, the spatial lag regression residuals are still heteroscedastic.
  4. Compare the Spatial Lag regression and OLS regression models based on the Akaike Information Criterion/Schwarz Criterion, the Log Likelihood, and the Likelihood Ratio Test.
  5. Present the Moran’s I scatterplot of spatial lag regression residuals. Does there seem to be less spatial autocorrelation in these residuals than in OLS residuals?
  6. Overall, which model is doing better based on all of these criteria?
2. Present results of Spatial Error regression (call this Table 3)
  1. Talk about the LAMBDA term in the spatial error regression output. State whether it is significant, and how the results can be interpreted.
  2. Are the remaining terms (i.e., the predictors LNNBELPOV, PCTBACHMOR, PCTSINGLES, and PCTVACANT) in the model significant?
    1. Compare these results to OLS results.
  3. State whether, based on the Breusch-Pagan test, the spatial error regression residuals are still heteroscedastic.
  4. Compare the Spatial Error regression and OLS regression based on the Akaike Information Criterion/Schwarz Criterion, the Log Likelihood, and the Likelihood Ratio Test.
  5. Present the Moran’s I scatterplot of spatial error regression residuals. Does there seem to be less spatial autocorrelation in these residuals than in OLS residuals?
  6. Overall, which model is doing better based on all of these criteria?
3. Compare the Spatial Lag and Spatial Error results with each other
  1. Recall that you should not use the likelihood ratio test for this because the models are not nested (i.e., neither model is a special case of the other). However, it is OK to compare two non-nested models, such as spatial lag and spatial error, based on the Akaike Information Criterion and the Schwarz Information Criterion.
    1. Which model has better (lower) Akaike Information Criterion and Schwarz Information Criterion values?

Geographically Weighted Regression Results Subsection Title
1. Present the global GWR results
  1. Compare the (overall) R-squared of the GWR regression with the R-squared of the OLS regression. State which regression method seems to be doing a better job of explaining the variance in the dependent variable.
  2. Compare the Akaike Information Criteria (AIC and not AICc) of GWR with those of OLS, Spatial Lag and Spatial Error models. Which model seems to be doing a better job based on that (remember, the lower the Akaike Information Criterion, the better the fit).
  3. Present the Moran’s I scatterplot of GWR residuals. Does there seem to be less spatial autocorrelation in these residuals than in OLS residuals? What about the Spatial Lag and Spatial Error residuals?
2. Be sure to discuss local regression results, as is done on the slides.
  1. Present the maps of coefficients divided by the standard error that you created earlier. Are there locations in the city where the relationships between each of the predictors and the dependent variable are possibly significant?
  2. Present and discuss the choropleth map of local R-squares.

Discussion (~1 page) Section Title
1. In a couple sentences, recap what you did in the paper and your findings. Discuss what conclusions you can draw, and which of the four regression methods (OLS, Spatial Lag, Spatial Error, GWR) was the best, based on the results.
2. Give a brief description of the limitations (i.e., which assumptions were not met).
3. Discuss what is meant by weighted (i.e., spatially lagged) residuals, as opposed to spatial lag [model] residuals. [This is a common source of confusion, and being able to explain this in your own words is important.]
  1. Make sure that you are using the correct terminology throughout the report
4. Mention why ArcGIS is problematic for GWR.