Analyzing Madison Housing Data
Sam West
April 4, 2016
I recently learned that Madison has an open data portal after visiting the Big Data Madison Meetup and I decided to explore some of the datasets publicly available. One dataset that caught my eye was the Assessor Property Information dataset. I don’t own any property but have a general interest in housing development and urban planning. The first piece of information I was interested was finding the data on the house my roommates and I rent which was surprisingly easy.
source("propertyAssessment.R")
## Warning: package 'e1071' was built under R version 3.2.4
df <- read.csv("propertyAssessment.csv")
df <- cleanPropertyDF(df)
dfHouse <- getResidentialFamilyDF(df)
dfHouse <- dfHouse[dfHouse$Current.Year.Improvement.Value>0 & dfHouse$Year.Built >0,]
dfHouseAndApartment <- getFamAndApartmentLiving(df)
dfHouseAndApartment <- dfHouseAndApartment[dfHouseAndApartment$Current.Year.Improvement.Value>0 & dfHouseAndApartment$Year.Built >0,]
#df[df$Address=="Enter your address",]
Madison has been a growing city and I was curious how housing development has been impacted by that growth. At first I just wanted to see when houses were built. To do this I looked for all houses with a property use of “Single family”, a Property Class of “Residential” and which had a current year improvement value greater than $0 to ensure that the property had some sort of building. I also had to filter out houses with a built year of 0. I’m not sure if this signifies houses with no known built year or some other value but this became relevant later on in my analysis. What immediately sticks out is the rises and dips such as the huge housing surge which occurs just after WWII.
ggplot(dfHouse,aes(Year.Built)) + geom_histogram(binwidth=1,colour="dodgerblue1") + labs(title="Count of Houses Built by Year",x="Year Built",y="Count Built")
I was then curios how housing construction compared to apartment construction. There have been quite a few apartment complexes built recently and I thought that maybe this was an increasing trend. I did another graph comparing apartments.
ggplot(dfHouseAndApartment, aes(Year.Built, colour = IS_APARTMENT)) + geom_freqpoly(binwidth = 1) + labs(title="Count of Houses and Apartments Built by Year",x="Year Built",y="Count Built")
Well that was a bit unexpected. It looks like apartments used to match housing but seems to have tapered off recently. However, it turns out that apartments don’t have the year built value set as often as houses. Once I factored in units listed as being built on year 0 we can see a significant number of the total apartments just don’t have a year built specified.
dfHouseAndApartment <- getFamAndApartmentLiving(df)
ggplot(dfHouseAndApartment,aes(Year.Built,fill=IS_APARTMENT)) + geom_histogram(binwidth=20) + labs(title="Count of Houses Built by Year",x="20 Year Period Built",y="Count Built")
Since the number of units is stored in the property use field I pulled that information to store a variable for the number of units for each property address. As I only evaluate a house as one unit even though it could house more people than a studio apartment, the metric is a bit misleading, but gives an idea of the amount of living space available by the apartments that have been built compared to houses.
dfHouseAndApartment <-addNumUnits(dfHouseAndApartment)
aggregate(Num_Units ~ IS_APARTMENT,data=dfHouseAndApartment,sum)
## IS_APARTMENT Num_Units
## 1 FALSE 46178
## 2 TRUE 42991
Next I wanted to examine property values by school district. What stuck out immediately was that Madison has the most expensive houses but in general the typical houses in Middleton/Cross Plains are more expensive.
ggplot(dfHouse, aes( School.District,Current.Year.Total.Value)) + geom_boxplot(fill="dodgerblue1")
I was then curious how housing sizes have increased over time. I plotted this out to see what the trend has been. There is seemingly an interesting dip from the 20’s to 50’s but in general it appears housing size has increased over time.
plot(dfHouse$Total.Living.Area ~ dfHouse$Year.Built,col="blue",xlab="Year Built",ylab="Total Living Area Ft^2")
fit <- lm(dfHouse$Total.Living.Area ~ dfHouse$Year.Built)
abline(fit,col="red")
Finally I wanted to make a simple linear model to see what factors seemed to be relevant in determining housing price. First I just created a histogram to see how the housing data is distributed.
skewness(dfHouse$Current.Year.Total.Value)
## [1] 3.237356
ggplot(dfHouse, aes(Current.Year.Total.Value)) + geom_histogram(binwidth=50000,colour="dodgerblue1")
I took the square root of the total current value to help remove some of the skewness.
skewness(dfHouse$SQR_ROOT_TOTAL_VALUE)
## [1] 1.662436
ggplot(dfHouse, aes(SQR_ROOT_TOTAL_VALUE)) + geom_histogram(binwidth=10,colour="dodgerblue1")
Without spending too much time selecting features for the model, I picked a few features which I was fairly confident would influence housing prices. The parameters appear to be significant although the error tends to grow with larger fitted values and residuals have heavy tails.
fit <- lm(SQR_ROOT_TOTAL_VALUE ~ Water.Frontage + Year.Built + Full.Baths + Foundation + Fireplaces + Bedrooms + Garage.Stalls.1 + Total.Living.Area,data=dfHouse)
summary(fit)
##
## Call:
## lm(formula = SQR_ROOT_TOTAL_VALUE ~ Water.Frontage + Year.Built +
## Full.Baths + Foundation + Fireplaces + Bedrooms + Garage.Stalls.1 +
## Total.Living.Area, data = dfHouse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -532.99 -34.18 -4.48 31.50 1071.71
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.681e+03 2.582e+01 65.103 < 2e-16 ***
## Water.FrontageTRUE 2.082e+02 2.691e+00 77.359 < 2e-16 ***
## Year.Built -7.146e-01 1.309e-02 -54.594 < 2e-16 ***
## Full.Baths 2.088e+01 5.043e-01 41.410 < 2e-16 ***
## FoundationConcrete -2.356e+01 6.943e+00 -3.394 0.000690 ***
## FoundationConcrete Block -3.298e+01 6.988e+00 -4.719 2.38e-06 ***
## FoundationStone -2.386e+01 7.119e+00 -3.352 0.000803 ***
## FoundationWood -7.653e+01 1.920e+01 -3.986 6.72e-05 ***
## Fireplaces 1.634e+01 4.431e-01 36.875 < 2e-16 ***
## Bedrooms -1.695e+00 4.316e-01 -3.927 8.62e-05 ***
## Garage.Stalls.1 2.718e+00 4.974e-01 5.465 4.66e-08 ***
## Total.Living.Area 1.188e-01 6.836e-04 173.726 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.6 on 45715 degrees of freedom
## Multiple R-squared: 0.7194, Adjusted R-squared: 0.7194
## F-statistic: 1.066e+04 on 11 and 45715 DF, p-value: < 2.2e-16
par( mfrow = c( 2, 2 ) )
plot(fit)
Overall the model seems reasonably accurate.
par( mfrow = c( 1, 1 ) )
plot(dfHouse$SQR_ROOT_TOTAL_VALUE~fitted(fit),col="blue",ylab="Square Root Home Value",main="Fitted vs Actual Values")
Wow, that is fun! My favorites are the bubbles. Liked current year to value. Liked all of them, actually! Nice work! Makes me wonder what else is out there!
ReplyDelete