
Welcome to our team's Zillow Clustering Project! Amanda Gomez, Steve Kane, and Lori Segovia set out to discover which drivers most affect the validity of the Zestimate score.


SKG-ZillowClusteringProject/zillow_clustering_project



[Scenario] [Project Planning] [Key Findings] [Tested Hypotheses] [Take Aways] [Data Dictionary] [Workflow]


Scenario:

Selling homes in our new normal has just gotten easier with Zillow Offers®. Homeowners can now hand over the burden of selling their property by selling directly to us based on our state-of-the-art Zestimate score.

The accuracy and integrity of our Zestimate score are of high importance. As junior data scientists on the Zillow data science team, we are tasked with uncovering which drivers most affect the validity of the Zestimate score. Validity is measured by our target variable, logerror: the difference between the log of the Zestimate and the log of the actual sale price.

logerror = log (Zestimate) − log (ActualSalePrice)
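
As a small worked example of the formula above, the sketch below computes logerror for three made-up price pairs. The function name, the prices, and the use of the natural log are assumptions for illustration; the README does not state the log base.

```python
import math


def logerror(zestimate: float, sale_price: float) -> float:
    """Return log(Zestimate) - log(actual sale price), using natural logs (assumed base)."""
    return math.log(zestimate) - math.log(sale_price)


# Hypothetical prices, chosen only for illustration.
over = logerror(330_000, 300_000)   # Zestimate 10% above the sale price
under = logerror(270_000, 300_000)  # Zestimate 10% below the sale price
exact = logerror(300_000, 300_000)  # Zestimate equal to the sale price

print(f"over={over:.4f} under={under:.4f} exact={exact:.4f}")
```

A positive logerror means the Zestimate was above the sale price, a negative one means it was below, and zero means a perfect estimate.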


Goal:

The goal of this project is to create a model that accurately predicts the Zestimate's logerror. In doing so, we will uncover which features in the Zillow dataset drive the error.

Initial Hypotheses:

Hypothesis₁

There is a relationship between home_age and logerror.

Hypothesis₂

There is a relationship between lot_sqft and logerror.

Hypothesis₃

There is a relationship between home_value and logerror.

Hypothesis₄

The county a property is in affects the mean logerror.

Project Planning Initial Thoughts:

First iteration:

Build an MVP: do the easiest thing at each stage to move forward. The MVP won't fulfill every detail of the project spec, and trying to do so at first isn't a good use of time.

  • Cluster: home_age, home_value
  • New features:
    • home_age: current year - yearbuilt
    • tax_rate: taxamount/taxvaluedollarcnt
    • bed_bath_ratio: bedroomcnt/bathroomcnt
    • property_age_bin
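
The planned features above could be derived roughly as follows. This is a dependency-free sketch, not the notebook's code: the `add_features` and `age_bin` helpers, the sample row, and especially the age-bin cutoffs are assumptions, since the README does not say how `property_age_bin` is cut.

```python
from datetime import date
from typing import Optional

# Hypothetical bin edges in years; the real cutoffs are not given in the README.
AGE_BINS = [(0, 20, "young"), (20, 60, "middle-aged"), (60, float("inf"), "old")]


def age_bin(home_age):
    """Return the label of the half-open bin [low, high) that contains home_age."""
    for low, high, label in AGE_BINS:
        if low <= home_age < high:
            return label
    return None  # negative age, e.g. a yearbuilt later than the reference year


def add_features(row: dict, current_year: Optional[int] = None) -> dict:
    """Return a copy of row with the four planned features added."""
    year = current_year if current_year is not None else date.today().year
    out = dict(row)
    out["home_age"] = year - row["yearbuilt"]
    out["tax_rate"] = row["taxamount"] / row["taxvaluedollarcnt"]
    # Guard against division by zero for homes listed with no bathrooms.
    out["bed_bath_ratio"] = (
        row["bedroomcnt"] / row["bathroomcnt"] if row["bathroomcnt"] else None
    )
    out["property_age_bin"] = age_bin(out["home_age"])
    return out


# Made-up record using the raw column names mentioned above.
sample = {
    "yearbuilt": 1965,
    "taxamount": 4500.0,
    "taxvaluedollarcnt": 360000.0,
    "bedroomcnt": 3,
    "bathroomcnt": 2.0,
}
features = add_features(sample, current_year=2020)
print(features)
```

Passing `current_year` explicitly keeps `home_age` reproducible; leaving it out falls back to the year in which the code runs.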

Deliverables:

  • Final clean, interactive Python notebook


Exploration Key Findings:

  • In property counts, Ventura is about a quarter of LA, and OC is about half of LA
  • Lot size is thrown off by outliers
  • Median home_value is $355,758
  • land_value has a similar distribution to home_value, but at lower prices
  • home_age is almost normally distributed

The following features appear to have clusters to explore:

  • home_age & home_value
  • home_age & sqft
  • lot_sqft & sqft
  • home_value & sqft
  • longitude & property_quality
  • home_age & property_quality
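
To illustrate how a feature pair such as home_age & home_value might be clustered, here is a minimal k-means sketch in plain Python. The README does not name the clustering algorithm, so k-means, the number of clusters, the deterministic start, and the eight made-up homes are all assumptions. Both features are standardized first because home_value in dollars would otherwise dominate the distance.

```python
import math
from statistics import fmean, pstdev


def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    mu, sigma = fmean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def kmeans(points, k, iterations=20):
    """Basic k-means with a deterministic start (k evenly spaced points)."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(fmean(dim) for dim in zip(*members))
    return labels, centroids


# Made-up homes: four newer, pricier ones and four older, cheaper ones.
home_age = [5, 8, 10, 12, 60, 65, 70, 75]
home_value = [650_000, 700_000, 620_000, 680_000, 250_000, 300_000, 280_000, 260_000]

points = list(zip(standardize(home_age), standardize(home_value)))
labels, centroids = kmeans(points, k=2)
print(labels)
```

The resulting cluster label can then be treated as a new categorical feature and tested against logerror like any other grouping.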

Tested Hypotheses:

Hypothesis₁

H₀ = No correlation between home_age and logerror.

H𝛼 = There IS a correlation between home_age and logerror.

  • REJECT null hypothesis.
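
A correlation hypothesis like this one can be checked along the following lines. The sketch assumes a Pearson correlation, which the README does not confirm, and uses synthetic data. The p-value comes from a normal approximation to the t statistic, which is only reasonable for large samples; the function names are illustrative.

```python
import math
from statistics import NormalDist, fmean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_test(x, y):
    """Return (r, t, p); p is two-sided and uses a large-sample normal approximation."""
    n = len(x)
    r = pearson_r(x, y)
    t = r * math.sqrt((n - 2) / (1 - r * r))  # undefined when |r| == 1
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return r, t, p


# Synthetic data: a small upward trend in logerror with age, plus a deterministic wobble.
home_age = list(range(1, 101))
logerror = [0.001 * age + 0.05 * math.sin(age) for age in home_age]

r, t, p = correlation_test(home_age, logerror)
alpha = 0.05  # assumed significance level
print(f"r={r:.3f} t={t:.2f} p={p:.4f} reject_null={p < alpha}")
```

Rejecting the null only says the correlation is unlikely to be zero; the size of r still has to be read separately to judge how strong the relationship is.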

Hypothesis₂

H₀ = No correlation between lot_sqft and logerror.

H𝛼 = There IS a correlation between lot_sqft and logerror.

  • FAIL to reject null hypothesis.

Hypothesis₃

H₀ = No correlation between home_value and logerror.

H𝛼 = There IS a correlation between home_value and logerror.

  • FAIL to reject null hypothesis.

Hypothesis₄

H₀ = Mean logerror is the same for small homes on small lots and average-sized homes on small lots.

H𝛼 = Mean logerror differs between small homes on small lots and average-sized homes on small lots.

  • FAIL to reject null hypothesis.

Hypothesis₅

H₀ = Mean logerror is the same for properties in Los Angeles County & Orange County.

H𝛼 = Mean logerror for properties in Los Angeles County & Orange County is different.

  • REJECT null hypothesis.

Hypothesis₆

H₀ = Mean logerror is the same for properties in Los Angeles County & Ventura County.

H𝛼 = Mean logerror for properties in Los Angeles County & Ventura County is different.

  • FAIL to reject null hypothesis.

Hypothesis₇

H₀ = Mean logerror is the same for properties in Orange County & Ventura County.

H𝛼 = Mean logerror for properties in Orange County & Ventura County is different.

  • FAIL to reject null hypothesis.
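
The county comparisons above are two-sample tests of means. The README does not say which test was used, so the sketch below assumes Welch's t-test (unequal variances) and, as a large-sample shortcut, a normal approximation for the p-value. The two samples are synthetic stand-ins, not project data.

```python
import math
from statistics import NormalDist, fmean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def two_sided_p(t):
    """Two-sided p-value from a normal approximation (reasonable for large df)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Synthetic logerror samples for two hypothetical counties.
la = [0.02 + 0.01 * math.sin(i) for i in range(200)]
orange = [0.01 * math.cos(i) for i in range(200)]

t, df = welch_t(la, orange)
p = two_sided_p(t)
alpha = 0.05  # assumed significance level
print(f"t={t:.2f} df={df:.1f} reject_null={p < alpha}")
```

With three counties, running all pairwise tests inflates the chance of a false positive, so a correction such as Bonferroni or a single ANOVA may be worth considering.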

Take Aways:

  • home_age and logerror had a weak linear relationship, at best

  • lot_sqft did not have a significant effect on logerror, which we found surprising

  • Also surprising was the lack of a significant correlation between home_value and logerror

  • Among our engineered features, small homes of all ages, large homes, and homes that are considered "best quality" seem to be drivers of logerror

Data Dictionary:

| column_name | description | dtype |
|---|---|---|
| parcelid | Unique identifier for parcels (lots) | int64 |
| bathrooms | Number of bathrooms in home, including fractional bathrooms | float64 |
| bedrooms | Number of bedrooms in home | int64 |
| property_quality | Overall assessment of condition of the building from best (lowest) to worst (highest) | int64 |
| sqft | Calculated total finished living area of the home | float64 |
| fips | Federal Information Processing Standard code - see https://en.wikipedia.org/wiki/FIPS_county_code for more details | int64 |
| latitude | Latitude of the middle of the parcel multiplied by 10^6 | float64 |
| longitude | Longitude of the middle of the parcel multiplied by 10^6 | float64 |
| lot_sqft | Area of the lot in square feet | float64 |
| rawcensustractandblock | Census tract and block ID combined - also contains blockgroup assignment by extension | float64 |
| regionidcity | City in which the property is located (if any) | float64 |
| zip_code | Zip code in which the property is located | int64 |
| roomcnt | Total number of rooms in the principal residence | int64 |
| unitcnt | Number of units the structure is built into (e.g., 2 = duplex, 3 = triplex) | int64 |
| yearbuilt | Year the principal residence was built | int64 |
| structure_value | Assessed value of the built structure on the parcel | float64 |
| home_value | Total tax assessed value of the parcel | float64 |
| land_value | Assessed value of the land area of the parcel | float64 |
| taxamount | Total property tax assessed for that assessment year | float64 |
| logerror | Difference between the log of the Zestimate and the log of the actual sale price | float64 |
| transactiondate | Date the property sold | object |
| county | County in which the property is located | object |
| home_age | Current age of the home in years | int64 |
| logerror_quartiles | logerror distributed into 4 bins | category |

Engineered features:

| column_name | description | dtype |
|---|---|---|
| young_smhome | Indicates if the property is a young, small square footage home. 1 = yes, 0 = no | uint8 |
| middleaged_smhome | Indicates if the property is a mid-aged, small square footage home. 1 = yes, 0 = no | uint8 |
| old_smhome | Indicates if the property is an old, small square footage home. 1 = yes, 0 = no | uint8 |
| young_avghome | Indicates if the property is a young, average-sized square footage home. 1 = yes, 0 = no | uint8 |
| veteran_avghome | Indicates if the property is a mid-to-old aged, average-sized square footage home. 1 = yes, 0 = no | uint8 |
| lghome | Indicates if the property is a large square footage home. 1 = yes, 0 = no | uint8 |
| smlot_smhome | Indicates if the property is a small square footage home on a small lot. 1 = yes, 0 = no | uint8 |
| smlot_avghome | Indicates if the property is an average-sized square footage home on a small lot. 1 = yes, 0 = no | uint8 |
| smlot_lghome | Indicates if the property is a large square footage home on a small lot. 1 = yes, 0 = no | uint8 |
| mdlot | Indicates if the property is on a medium-sized lot. 1 = yes, 0 = no | uint8 |
| lglot | Indicates if the property is on a large lot. 1 = yes, 0 = no | uint8 |
| xllot | Indicates if the property is on an extra-large lot. 1 = yes, 0 = no | uint8 |
| structure_dollar_per_sqft | Value of the structure divided by square footage ($) | float64 |
| land_dollar_per_sqft | Value of the land divided by square footage ($) | float64 |
| bed_bath_ratio | Number of bedrooms divided by number of bathrooms | float64 |
| sqft_binned | Square footage distributed into 3 evenly sized bins: Small, Medium, Large | category |
| LA | Indicates if the property is located in Los Angeles County. 1 = yes, 0 = no | uint8 |
| Orange | Indicates if the property is located in Orange County. 1 = yes, 0 = no | uint8 |
| avgqualityavgage | Indicates if the property is a mid-aged home built of average quality. 1 = yes, 0 = no | uint8 |
| poor_quality_old_age | Indicates if the property is an old home built of poor quality. 1 = yes, 0 = no | uint8 |
| avq_quality_young_age | Indicates if the property is a young home built of average quality. 1 = yes, 0 = no | uint8 |
| avg_quality_old_age | Indicates if the property is an old home built of average quality. 1 = yes, 0 = no | uint8 |
| bestest | Indicates if the property is built of high quality. 1 = yes, 0 = no | uint8 |
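
Several columns in the data dictionary are derived encodings. The sketch below shows one way the `county` label, the `LA` and `Orange` indicator columns, and `logerror_quartiles` could be produced. The FIPS codes are the standard ones for the three counties; the helper names, the sample values, and the choice of right-inclusive quartile bins are assumptions, not the notebook's code.

```python
from bisect import bisect_left
from statistics import quantiles

# Standard FIPS county codes for the three counties in the data.
FIPS_TO_COUNTY = {6037: "Los Angeles", 6059: "Orange", 6111: "Ventura"}


def county_flags(fips):
    """Return the county name plus LA/Orange indicators (Ventura is the baseline)."""
    county = FIPS_TO_COUNTY[fips]
    return {
        "county": county,
        "LA": int(county == "Los Angeles"),
        "Orange": int(county == "Orange"),
    }


def quartile_labels(values):
    """Label each value 1-4 by quartile, using right-inclusive bins."""
    cuts = quantiles(values, n=4)  # three cut points
    return [bisect_left(cuts, v) + 1 for v in values]


# Made-up logerror values for eight properties.
sample_logerror = [0.03, -0.10, 0.30, 0.00, 0.05, -0.02, 0.12, 0.01]
bins = quartile_labels(sample_logerror)

print(county_flags(6059))
print(bins)
```

Only two county indicators are needed because a row with `LA = 0` and `Orange = 0` must be in Ventura County.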

Workflow:

Please clone the repo first, then use the following links to guide you through the data science pipeline. Enjoy!

  1. Prep Your Repo
  2. Import
  3. Acquire Data
  4. Clean, Prep & Split Data
  5. Explore Data
  6. Evaluate Data
  7. Modeling
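
For the split in step 4, a train/validate/test split might look like the sketch below. The 60/20/20 proportions, the seed, and the function name are assumptions; the README does not state how the data was split.

```python
import random


def train_validate_test_split(rows, validate=0.2, test=0.2, seed=123):
    """Shuffle rows reproducibly and cut them into train, validate, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test)
    n_validate = round(len(shuffled) * validate)
    test_rows = shuffled[:n_test]
    validate_rows = shuffled[n_test:n_test + n_validate]
    train_rows = shuffled[n_test + n_validate:]
    return train_rows, validate_rows, test_rows


# Stand-in for 100 property records.
train, validate, test = train_validate_test_split(range(100))
print(len(train), len(validate), len(test))
```

Exploration, clustering, and hypothesis testing would then run on the train portion only, so that validate and test stay unseen until modeling.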
