Property Assessment: High Schools and Decision Trees
Sam West
April 10, 2016
I wanted to continue looking into Madison’s property assessment data as well as learn more about what the data could tell us about various neighborhoods in Madison. There is a field in the data set for the high school for a property and I thought this could be a good proxy for Madison’s neighborhoods. Madison has four high schools that I’m aware of: West, East, Memorial, and La Follette. Before starting this analysis I only knew of three of them, knew where one was located, and for some reason own a Madison East High School shirt. This is all to say that I had little knowledge of the history of Madison’s high schools and was interested in learning a bit more.
library(tree)
source("propertyAssessment.R")
df <- read.csv("propertyAssessment.csv")
df <- cleanPropertyDF(df)
# We only want single-family residential housing with a lot size greater than 1 square foot
df <- getResidentialFamilyDF(df)
Explore Data
First let’s take a look at some of the data we’re dealing with. It seems like there is a fairly even distribution of single-family housing among the four main high schools.
xtabs( ~ High.School, data = df)
## High.School
##                  East Lafollette   Memorial   Optional       West
##       1918      10878      10051      12590        293       9997
I’m assuming the blank entries are properties with no high school recorded, and that Optional is for property which could belong to multiple high schools. I don’t think we’ll be able to accurately assign these, so I’m going to remove both from the analysis.
# We only want East, Lafollette, Memorial, and West high schools
df <- df[df$High.School %in% c("East", "Lafollette", "Memorial", "West"), ]
df$High.School <- factor(df$High.School) # Drop the unused factor levels
Madison is a city centered on the Capitol building downtown, and I suspect the city grew out from there. For this reason I would expect a relationship between the year a house was built and the high school it belongs to.
boxplot(df$Year.Built ~ df$High.School, col="blue",main="Year Built by High School",ylab="Year Built",xlab="High School")
It definitely looks like houses near Madison East and West tend to be in older neighborhoods. After searching online I found that Madison East was established in 1922, Madison West in 1930, La Follette in 1963, and Memorial in 1966. I also found that East and West are located closer to downtown, while Memorial and La Follette are closer to the edges of the city. Since downtown land is more expensive than land on the edge of town, I thought there might be a trend between the size of the lot a property sits on and its high school.
boxplot(df$Lot.Size.Sq.Ft ~ df$High.School, col="blue",main="Lot Size by High School",ylab="Lot Size Square Ft.",xlab="High School",ylim=c(1,50000))
My hypothesis seems to hold some weight. Lots near East and West tend to be smaller than those near Lafollette and Memorial, but the trend is likely small.
Creating a Decision Tree
Next I wanted to see if we could make a predictive model that classifies houses by the high school they are near, using a classification decision tree. The goal of a classification tree is to take a set of input variables and use binary splits to classify a target variable. For example, we saw that Madison East and West tend to be in neighborhoods with older houses. A basic partition would be: if the house was built after 1960, assume it belongs to Lafollette or Memorial; if it was built before 1960, assume it belongs to East or West.
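To make the idea concrete, here’s a toy one-split “tree” implementing the hypothetical 1960 rule above. The function name and the cutoff are illustrative only; a real tree learns its splits from the data.

```r
# A single hand-written binary split, mimicking one node of a decision tree.
# The 1960 cutoff comes from the hypothetical rule above, not a fitted model.
classifyByYear <- function(yearBuilt) {
  ifelse(yearBuilt > 1960, "Lafollette or Memorial", "East or West")
}

classifyByYear(c(1925, 1975))
## [1] "East or West"           "Lafollette or Memorial"
```

A fitted classification tree simply stacks many such splits, each chosen automatically to reduce misclassification.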
First we’re going to split our data into a training set and a test set. We’ll do this in order to create a tree off our training data and then evaluate our accuracy off the test set which we held back. Often this is done when taking a sample from a population of data. Since we technically have the entire population of Madison housing data this step may not be necessary, but we can use it to help validate the end model.
set.seed(15)
trainIndex <- sample(1:nrow(df), nrow(df) * 0.9)
train <- df[trainIndex, ]
test <- df[-trainIndex, ]
I selected a few parameters which I thought might be relevant for the decision tree: whether the property has water frontage, the lot size, the year it was built, the number of stories, the number of bedrooms, the home style, the total living area, the value of the land, and a new variable I created for the value of the land per square foot of lot. I could have added features such as the middle school for a house, but that would be too closely correlated with the high school. It would give us an accurate model, but I was more interested in the features of the neighborhoods around a high school, not the middle schools which feed into it.
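LandCostRatio isn’t defined in the code shown here, so presumably it comes from propertyAssessment.R. My assumption is that it’s simply land value divided by lot size; here’s a minimal sketch on a toy data frame (the two rows are made up for illustration):

```r
# Hypothetical construction of LandCostRatio: land value per square foot of lot.
# The example values below are fabricated, not real assessment data.
toy <- data.frame(Current.Year.Land.Value = c(100000, 50000),
                  Lot.Size.Sq.Ft = c(5000, 10000))
toy$LandCostRatio <- toy$Current.Year.Land.Value / toy$Lot.Size.Sq.Ft
toy$LandCostRatio
## [1] 20  5
```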
hsTree <- tree(High.School ~ Water.Frontage + Lot.Size.Sq.Ft + Year.Built + Stories + Bedrooms + Home.Style + Total.Living.Area+ LandCostRatio + Current.Year.Land.Value,data=train)
summary(hsTree)
##
## Classification tree:
## tree(formula = High.School ~ Water.Frontage + Lot.Size.Sq.Ft +
## Year.Built + Stories + Bedrooms + Home.Style + Total.Living.Area +
## LandCostRatio + Current.Year.Land.Value, data = train)
## Variables actually used in tree construction:
## [1] "Current.Year.Land.Value" "Year.Built"
## [3] "LandCostRatio" "Lot.Size.Sq.Ft"
## Number of terminal nodes: 11
## Residual mean deviance: 1.616 = 63270 / 39150
## Misclassification error rate: 0.3273 = 12819 / 39164
Looking at the summary we can see that the only features used in the tree were the current year land value, the year the property was built, the cost of land per square foot of lot, and the lot size. We can also see a misclassification rate of about 33%. While that’s not great, I think it’s pretty impressive to identify the high school for a property based mainly on the land value and the year the house was built, and it’s much better than guessing at random.
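For context on that 33% error rate, consider a naive baseline. Using the class counts from the table earlier, a model that always guesses the most common school (Memorial) would be wrong about 71% of the time, and a uniform random guess among the four schools would be wrong 75% of the time:

```r
# Class counts for the four high schools (from the xtabs table above).
counts <- c(East = 10878, Lafollette = 10051, Memorial = 12590, West = 9997)

# Error rate of always predicting the majority class (Memorial).
majorityError <- 1 - max(counts) / sum(counts)
round(majorityError, 3)
## [1] 0.711
```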
One nice characteristic of classification trees is that they are fairly easy to interpret and visualize. Starting from the top (or root), the tree splits on whether Current.Year.Land.Value is less than 63550: if so, go left; otherwise go right. We can follow a path all the way down until we reach the leaves of the tree, which are labeled with high schools.
plot(hsTree)
text(hsTree,pretty=0,cex=2)
Pruning the Tree
Next, we’re going to prune our tree to see if we can reduce the number of steps we take to make a classification without sacrificing too much accuracy. If we wanted, we could make the decision tree incredibly detailed, to the point where it correctly identifies each of the roughly 45,000 houses through a nearly unique combination of housing price, lot size, and any number of other features. The problem is that we would be overfitting our tree to the data we are training on. If we were to add another house in the future, it’s unlikely our tree would classify it correctly. We want our tree to generalize the trends in the data, not memorize the data itself.
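To see overfitting in action without the property data, here’s a sketch using R’s built-in iris data set: loosening tree’s growth controls (the mincut, minsize, and mindev arguments to tree.control) lets the tree keep splitting until training error nearly vanishes, while a default tree stops much earlier. The lower training error here reflects memorization, which is exactly what pruning guards against.

```r
library(tree)

# Default tree: growth stops once further splits stop helping much.
defaultTree <- tree(Species ~ ., data = iris)

# Deliberately overgrown tree: mincut = 1, minsize = 2, and mindev = 0
# remove the brakes, so the tree memorizes the training data.
overgrown <- tree(Species ~ ., data = iris,
                  control = tree.control(nrow(iris), mincut = 1,
                                         minsize = 2, mindev = 0))

summary(defaultTree)$misclass[1]  # training errors for the default tree
summary(overgrown)$misclass[1]    # fewer (or zero) training errors
```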
cv_tree=cv.tree(hsTree,FUN=prune.misclass)
plot(cv_tree)
This plot shows the number of misclassifications our tree makes versus the number of terminal nodes it uses. In general, the simpler we can make the model, the better. We don’t appear to get a big boost in accuracy beyond 8 nodes, so we can prune to that number. As you can see below, the pruned tree is a little easier to interpret without a significant loss in accuracy, with a misclassification rate still around 33%.
prunedTree <- prune.misclass(hsTree,best=8)
summary(prunedTree)
##
## Classification tree:
## snip.tree(tree = hsTree, nodes = c(7L, 13L, 4L))
## Variables actually used in tree construction:
## [1] "Current.Year.Land.Value" "Year.Built"
## [3] "LandCostRatio"
## Number of terminal nodes: 8
## Residual mean deviance: 1.728 = 67680 / 39160
## Misclassification error rate: 0.3292 = 12893 / 39164
plot(prunedTree)
text(prunedTree,pretty=0,cex=2)
Validating the Tree
The last thing we’re going to do is revisit the test data we held back earlier. We can evaluate the tree we created on the test data and observe its predictions. From there we can compute the misclassification rate as (number of observations - number of correct predictions) / number of observations. As shown below, our misclassification rate is just over 33%, which is about what we saw with our training data.
testPrediction <- predict(prunedTree, test, type = "class")
numCorrect <- sum(testPrediction == test$High.School)
numObservations <- length(testPrediction)
(numObservations - numCorrect) / numObservations
## [1] 0.3336397
We can also get an idea of where our errors are coming from by charting our predictions against the actual values. It seems like we classify Lafollette, Memorial, and West fairly well, but our misclassification rate is very high for East. Something in the model isn’t explaining the differences between East and the other high schools; for East the model performs only slightly better than randomly selecting a high school. Learning which feature is missing for houses near East High School is an area where this model could be improved.
evalDF <- data.frame(test$High.School,testPrediction)
names(evalDF) <- c("Actual","Predicted")
errorTable <- xtabs( ~ Actual + Predicted,data=evalDF)
errorTable
## Predicted
## Actual East Lafollette Memorial West
## East 361 469 162 143
## Lafollette 72 693 187 16
## Memorial 4 40 1109 102
## West 89 27 141 737
for (highSchool in rownames(errorTable)) {
  correct <- errorTable[highSchool, highSchool]
  total <- sum(errorTable[highSchool, ])
  print(paste0(highSchool, " Misclassification Rate: ", (total - correct) / total))
}
## [1] "East Misclassification Rate: 0.681938325991189"
## [1] "Lafollette Misclassification Rate: 0.284090909090909"
## [1] "Memorial Misclassification Rate: 0.116334661354582"
## [1] "West Misclassification Rate: 0.258551307847082"
Here’s a graph to visualize this. The X axis is the actual classification of the property and the Y axis is the predicted classification. Ideally the biggest section in each column would be where the predicted and actual labels line up. This is the case for Lafollette, Memorial, and West, but not East: when a property actually belongs to East we often predict Lafollette, and to a lesser extent Memorial or West.
plot(errorTable, col=c("#f0f9e8","#bae4bc","#7bccc4","#2b8cbe"),main="Predicted vs Actual Classifications" )
Final Thoughts
Overall it seems some of my initial predictions were correct, and we can get a decent idea of the composition of the housing near a high school from the year the houses were built and the land value. These features give a fairly accurate classification for Memorial, Lafollette, and West. The partitions for Lafollette and Memorial tend to capture properties on cheaper land with houses built more recently, while West sits on more expensive land built further in the past. Houses near East High School weren’t classified as well; I think this is because those neighborhoods contain a more diverse range of housing with regard to year built and land value, which makes them harder to classify than the other neighborhoods.