Collecting data from Zillow with R

My mom has been house hunting over the past couple of weeks, so I decided to try and use R to look at the local market. Here’s what I’ve learned:

Collecting data from Zillow was pretty easy, overall. I mostly used the R packages rvest, xml2, and tidyr.

library(rvest)
library(tidyr)
library(xml2)

Next, I went to Zillow and searched for homes in Denver, CO. I zoomed in on an area that I wanted to analyze, copied the link, and read the page into R:

url<-"https://www.zillow.com//homes//for_sale//Denver-CO_rb//?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy"
webpage<-read_html(url)

The next part is harder to explain. You essentially have to find the information you want in the webpage, which looks like a bunch of scrambled text. It’s helpful to go back to the webpage, right click, and select “View Page Source.” This will help you identify the structure of the webpage and pull the data you want. I started by parsing out the links in the page’s pagination section, which tell me how many pages of results there are. You’ll have to remove characters to parse out the data, which I show below:

houses<- webpage %>%
  html_nodes(".zsg-pagination a") %>%
  html_attr("href")

houses<-houses[!is.na(houses)]
houses <-strsplit(houses,"/")
houses<-lapply(houses, function(x) x[length(x)])
houses<-as.numeric(gsub('[_p]','',houses))
houses <-max(houses)
urls<-c(url,paste0(url,2:houses,'_p/'))
urls
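To see what the parsing steps above are doing, here is a minimal sketch that runs the same string operations on a few made-up pagination links. The hrefs values are hypothetical, invented for illustration, and real Zillow links may be shaped differently.

```r
# Hypothetical pagination hrefs (invented for illustration, not real Zillow output)
hrefs <- c("/homes/for_sale/Denver-CO_rb/2_p/",
           NA,
           "/homes/for_sale/Denver-CO_rb/3_p/",
           "/homes/for_sale/Denver-CO_rb/20_p/")

# Drop links that had no href attribute
hrefs <- hrefs[!is.na(hrefs)]

# Split each link on "/" and keep only the final piece, e.g. "2_p"
parts <- strsplit(hrefs, "/")
last <- vapply(parts, function(x) x[length(x)], character(1))

# Strip the "_" and "p" characters, leaving just the page number
pages <- as.numeric(gsub("[_p]", "", last))

# The largest page number is the number of result pages to request
max_page <- max(pages)
print(max_page)
```

With these sample links, max_page comes out as 20, so the loop over result pages would run from 2 to 20.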

Then I used Jonkatz2’s parser function to strip the data down even further. The rest of his functions didn’t work for me =/

getZillow <- function(urls) {
  lapply(urls, function(u) {
    #print each URL as it is requested
    cat(u, '\n')
    houses <- read_html(u) %>%
      html_nodes("article")
    #return the listing nodes for this page
    houses
  })
}
zdata<- getZillow(urls)

Instead, I ended up breaking down different parts of his function to get the data that I needed. I had to write all of this complicated syntax because the data is saved as a list within lists.

#to pull ID
getID<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ids<-num %>% html_attr("id")
  })
}
id<-getID(zdata)

#get latitude
getLAT<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    lat<-num %>% html_attr("data-latitude")
    
  })
}
lats<-getLAT(zdata)

#get longitude
getLONG<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    long<-num %>% html_attr("data-longitude")
    
  })
}
longs<-getLONG(zdata)

#get price
getPrice<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    price<-num %>%  
      html_node(".zsg-photo-card-price") %>%
      html_text() 
  })
}
price<-getPrice(zdata)

#house description
getHdesc<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    Hdesc<-num %>%  
      html_node(".zsg-photo-card-info") %>%
      html_text() %>%
      strsplit("\u00b7")
  })
}
hdesc<-getHdesc(zdata)

#needs to be stripped down further
hdesc[[1]][[1]]

beds<-list()
getBeds<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
     beds[[i]]<- t1 %>%
       purrr::map_chr(1)
  }
  return(beds)
}
beds<-getBeds(hdesc)

baths<-list()
getBath<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
    baths[[i]]<- t1 %>%
      purrr::map_chr(2)
  }
  return(baths)
}
baths<-getBath(hdesc)

sqft<-list()
getSQft<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
    sqft[[i]]<- t1 %>%
      purrr::map_chr(3)
  }
  return(sqft)
}
sqft<-getSQft(hdesc)
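The getBeds, getBath, and getSQft functions above all do the same thing: split each description on the middle dot and take the first, second, or third piece. Here is a minimal sketch of that idea on made-up descriptions. The sample strings and the nth_piece helper are hypothetical, and the sketch uses base R's vapply in place of purrr::map_chr.

```r
# Hypothetical card descriptions: one character vector per results page
desc_pages <- list(
  c("3 bds \u00b7 2 ba \u00b7 1,500 sqft", "4 bds \u00b7 3 ba \u00b7 2,250 sqft"),
  c("2 bds \u00b7 1 ba \u00b7 900 sqft")
)

# Split every description on the middle dot, giving a list (pages)
# of lists (cards) of character vectors (pieces)
split_pages <- lapply(desc_pages, strsplit, split = "\u00b7")

# Hypothetical helper: pull the n-th piece from every card on every page.
# A card with fewer than n pieces yields NA instead of an error.
nth_piece <- function(pages, n) {
  lapply(pages, function(cards) {
    vapply(cards, function(p) trimws(p[n]), character(1))
  })
}

beds_demo  <- nth_piece(split_pages, 1)
baths_demo <- nth_piece(split_pages, 2)
sqft_demo  <- nth_piece(split_pages, 3)
print(beds_demo)
```

Each result keeps the page-by-page list structure, so it can be unlisted later in the same way as the other variables.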

#house type
getHtype<- function(data) {
    ldata<-1:length(data)
    lapply(1:length(ldata), function(x) {
      num<-data[[x]]
      Htype<-num %>%  
        html_node(".zsg-photo-card-spec") %>%
        html_text()
  })
}
htype<-getHtype(zdata)

#address
getAddy<-function(data) {
  ldata<- 1:length(data)
  lapply(1:length(ldata),function(x) {
    num<-data[[x]]
    addy<- num %>%
      html_nodes(".zsg-photo-card-address") %>%
      html_text() %>%
      strsplit("\u00b7")
  })
}

address<-getAddy(zdata)

#listing type
getLtype<-function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ltype<-num %>% html_attr("data-pgapt")
    
  })
}
list_type<-getLtype(zdata)

Now you can unlist one level:

address<-lapply(address, function(x) unlist(x))
htype<-lapply(htype, function(x) unlist(x))
id<-lapply(id, function(x) unlist(x))
lats<-lapply(lats,function(x) unlist(x))
longs<-lapply(longs,function(x) unlist(x))
list_type<-lapply(list_type,function(x) unlist(x))
price<-lapply(price,function(x) unlist(x))

Then, I put it all in a data frame:

df<-data.frame()
list<-list(id, price, address, beds, baths, sqft, list_type,longs, lats, htype)
makeList<-function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ll<-num %>% unlist(recursive=FALSE) 
  })
}
List<-makeList(list)
df<-data.frame(id=c(List[[1]]), price=c(List[[2]]), address=c(List[[3]]),
               beds=c(List[[4]]), baths=c(List[[5]]), sqft=c(List[[6]]),
               l_type=c(List[[7]]), long=c(List[[8]]), lat=c(List[[9]]),
               h_type=c(List[[10]]))

Some of these variables are not correctly formatted. For example, latitude and longitude values were stripped of their decimal points, so I need to add them back in by first removing the factor formatting and then doing some division.

df$long <-as.numeric(as.character(df$long)) / 1000000
df$lat<-as.numeric(as.character(df$lat)) / 1000000
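The as.character step matters because calling as.numeric directly on a factor returns the internal level codes, not the original numbers. Here is a small sketch with two made-up latitude values:

```r
# Hypothetical scraped latitudes, stored as a factor
lat_raw <- factor(c("39739236", "39745001"))

# Converting the factor directly gives the level codes (1 and 2 here)
codes <- as.numeric(lat_raw)
print(codes)

# Converting to character first recovers the real values,
# and dividing by one million puts the decimal point back
lat <- as.numeric(as.character(lat_raw)) / 1000000
print(lat)
```

The first result is just 1 and 2, while the second holds the usable coordinates 39.739236 and 39.745001 (R may round them when printing).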

Also, some of my other variables have non-numeric characters in them, so I want to remove those too:

df$beds <-as.numeric(gsub("[^0-9]", "",df$beds))
#keep the decimal point so that a value like 2.5 baths does not become 25
df$baths <-as.numeric(gsub("[^0-9.]", "",df$baths))
df$sqft <-as.numeric(gsub("[^0-9]", "",df$sqft))
df$price <-as.numeric(gsub("[^0-9]", "",df$price))
#replace NAs with 0
df[is.na(df)]<-0
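Here is a small sketch of what this kind of cleanup does to typical values. The sample strings are made up. One thing to watch for is that the pattern "[^0-9]" also removes decimal points, while "[^0-9.]" keeps them.

```r
# Hypothetical raw values as they might come off the page
raw_price <- c("$450,000", "$389,900")
raw_baths <- c("2 ba", "2.5 ba")
raw_other <- "Studio"

# Remove everything that is not a digit
price_clean <- as.numeric(gsub("[^0-9]", "", raw_price))   # 450000 389900

# The same pattern also drops the decimal point
baths_digits <- as.numeric(gsub("[^0-9]", "", raw_baths))  # 2 25

# Allowing "." inside the brackets keeps the decimal point
baths_clean <- as.numeric(gsub("[^0-9.]", "", raw_baths))  # 2 2.5

# A value with no digits becomes an empty string, which converts to NA
other_clean <- as.numeric(gsub("[^0-9]", "", raw_other))   # NA

print(baths_clean)
```

The last case is where the NA values come from when a listing has no number for a field.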

Now I can map my data, in addition to conducting any analyses that I may want to do. Since there’s a ton of stuff out there on conducting analyses in R, I’ll just show you how I mapped my data using the leaflet package:

library(leaflet)
m <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng=df$long, lat=df$lat, popup=df$id) 
m
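The popup can hold any text, not just the ID. Here is a minimal sketch, assuming a small made-up data frame with the same column names as above, that shows the ID and a formatted price in each popup. The demo_df values and the label column are hypothetical.

```r
library(leaflet)

# Hypothetical listings with the same column names used in this post
demo_df <- data.frame(
  id    = c("zpid_1", "zpid_2"),
  price = c(450000, 389900),
  long  = c(-104.990251, -104.980112),
  lat   = c(39.739236, 39.745001)
)

# Build the popup text: ID plus price with thousands separators
demo_df$label <- paste0(demo_df$id, ": $", format(demo_df$price, big.mark = ","))

# Passing the data frame to leaflet() lets the columns be
# referenced with the ~ formula syntax
m2 <- leaflet(demo_df) %>%
  addTiles() %>%
  addMarkers(lng = ~long, lat = ~lat, popup = ~label)
m2
```

Clicking a marker on this map would show text such as "zpid_1: $450,000".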

It should look like this:

[Image: DenverZillow]

If you click on the markers, they will show you the house IDs that they are associated with. You can see the web version by going to my OSF account, where I also posted the R program that I used.
