Mapping – Stella Min

#Load libraries
library(tidyverse)
library(rvest)
library(magrittr)
library(ggmap)
library(stringr)
library(viridis)
library(scales)

#Retrieve data (the page is the source cited in the map captions;
#the html_nodes("table") step is an assumed reconstruction of the table selection)
html.global_pop <- read_html("http://www.worldometers.info/world-population/population-by-country/")
df.global_pop_RAW <- html.global_pop %>%
  html_nodes("table") %>%
  extract2(1) %>%
  html_table()

#Check data
head(df.global_pop_RAW)

#Check for unnecessary spaces in values
glimpse(df.global_pop_RAW)

#Check if country names match those in the map package
as.factor(df.global_pop_RAW$`Country (or dependency)`) %>% levels()

#Renaming countries to match how they are named in the package
df.global_pop_RAW$`Country (or dependency)` <- recode(df.global_pop_RAW$`Country (or dependency)`
                                                     ,'U.S.' = 'USA'
                                                     ,'U.K.' = 'UK')

#Convert population to numeric--you have to remove the "," before converting
df.global_pop_RAW$`Population (2018)` <- as.numeric(gsub(",", "", df.global_pop_RAW$`Population (2018)`))
sapply(df.global_pop_RAW, class) #Check that it worked

#Generate a world map
world_map <- map_data('world')

#Join map data with our data
map.world_joined <- left_join(world_map, df.global_pop_RAW, by = c('region' = 'Country (or dependency)'))

#Take only top 10 countries (ranked explicitly by population)
df.global_pop10 <- df.global_pop_RAW %>% top_n(10, `Population (2018)`)

#Printing to check
df.global_pop10

#Change data to numeric
df.global_pop10$`Population (2018)` <- as.numeric(gsub(",", "", df.global_pop10$`Population (2018)`))

#Check if worked correctly
sapply(df.global_pop10, class)

#Join map data to our data
map.world_joined2 <- left_join(world_map, df.global_pop10, by = c('region' = 'Country (or dependency)'))

#Create Flag to indicate that it will be colored in for the map
map.world_joined2 <- map.world_joined2 %>% mutate(tofill2 = ifelse(is.na(`#`), FALSE, TRUE))

#Now generate the map
ggplot() +
  geom_polygon(data = map.world_joined2, aes(x = long, y = lat, group = group, fill = tofill2)) +
  scale_fill_manual(values = c("lightcyan2","darkturquoise")) +
  labs(title = 'Top 10 Countries with Largest populations (2018)'
       ,caption = "source: http://www.worldometers.info/world-population/population-by-country/") +
  theme_minimal() +
  theme(text = element_text(family = "Gill Sans")
        ,plot.title = element_text(size = 16)
        ,plot.caption = element_text(size = 5)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        ,legend.position = "none"
        )

#Generate map data (again)
world_map <- map_data('world')

#re-join with data
map.world_joined <- left_join(world_map, df.global_pop_RAW, by = c('region' = 'Country (or dependency)'))

#flag to fill ALL countries that match with the map package
map.world_joined <- map.world_joined %>% mutate(tofill = ifelse(is.na(`#`), FALSE, TRUE))

#Check that it worked correctly
head(map.world_joined,12)

#Then generate new map (color and size are set on the polygon layer, where they take effect)
ggplot(data = map.world_joined, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = `Population (2018)`), color = "white", size = .001) +
  scale_fill_viridis(option = 'magma') +
  labs(title = 'Population by Country (2018)'
       ,caption = "source: http://www.worldometers.info/world-population/population-by-country/") +
  theme_minimal() +
  theme(text = element_text(family = "Gill Sans")
        ,plot.title = element_text(size = 18)
        ,plot.caption = element_text(size = 5)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        )
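The name-matching step is easy to get wrong, because any country whose name differs between the two tables silently ends up with NA after the left join. A quick way to list the mismatches is an anti join. This is only a sketch on made-up toy tables (toy_map and toy_pop are hypothetical stand-ins for the map data and the scraped table), not the real data:

```r
library(dplyr)

# Toy stand-ins for map_data('world') and the scraped table (values are made up)
toy_map <- data.frame(region = c("USA", "UK", "China", "India"),
                      stringsAsFactors = FALSE)
toy_pop <- data.frame(country = c("U.S.", "U.K.", "China", "India"),
                      rank = 1:4,
                      stringsAsFactors = FALSE)

# Rows of the scraped table with no matching region in the map data
unmatched_before <- anti_join(toy_pop, toy_map, by = c("country" = "region"))
unmatched_before$country

# Recode the mismatched names, then check again
toy_pop$country <- recode(toy_pop$country, "U.S." = "USA", "U.K." = "UK")
unmatched_after <- anti_join(toy_pop, toy_map, by = c("country" = "region"))
nrow(unmatched_after)
```

Once the second check returns zero rows, every country in the table has a polygon to fill.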

My mom has been house hunting for the past couple of weeks, so I decided to use R to look at the local market. Here’s what I’ve learned:

Collecting data from Zillow was pretty easy overall. I mostly used the R packages rvest, xml2, and tidyr.

library(rvest)
library(tidyr)
library(xml2)

Next, I went to Zillow and searched for homes in Denver, CO. I zoomed in on the area that I wanted to analyze, copied the link, and pulled the page into R:

url<-"https://www.zillow.com//homes//for_sale//Denver-CO_rb//?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy"
webpage<-read_html(url)

The next part is harder to explain. You essentially have to find the information you want in the webpage, which looks like a bunch of scrambled text. It’s helpful to go back to the webpage, right click, and select “View Page Source.” This will help you identify the structure of the page and pull out the data you want. I started by parsing the links to the other results pages out of the pagination markup, so I would know how many pages of houses there were. You’ll have to remove some characters to get at the page numbers, which I show below:

houses<- webpage %>%
  html_nodes(".zsg-pagination a") %>%
  html_attr("href")

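#drop links without an href, then keep the last path segment of each (e.g. "20_p")
houses<-houses[!is.na(houses)]
houses <-strsplit(houses,"/")
houses<-lapply(houses, function(x) x[length(x)])
#strip the "_p" suffix and keep the largest page number
houses<-as.numeric(gsub('[_p]','',houses))
houses <-max(houses)
#page links are built from the address without its query string
base_url<-sub("\\?.*$","",url)
urls<-c(url,paste0(base_url,2:houses,'_p/'))
urls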
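To make the page-number parsing easier to follow, here is the same idea on a few made-up links. The hrefs and base_url below are hypothetical placeholders shaped like what the code above expects, not real scraped values:

```r
# Hypothetical hrefs, shaped like the pagination links the selector above returns
hrefs <- c("/homes/for_sale/Denver-CO_rb/2_p/",
           "/homes/for_sale/Denver-CO_rb/3_p/",
           NA,
           "/homes/for_sale/Denver-CO_rb/20_p/")

hrefs <- hrefs[!is.na(hrefs)]

# Keep the last path segment of each link, e.g. "20_p"
last_segment <- vapply(strsplit(hrefs, "/"), function(x) x[length(x)], character(1))

# Drop the "_p" suffix and take the largest page number
last_page <- max(as.numeric(sub("_p$", "", last_segment)))

# Build one address per results page (placeholder base address)
base_url <- "https://www.example.com/homes/for_sale/Denver-CO_rb/"
page_urls <- c(base_url, paste0(base_url, 2:last_page, "_p/"))
length(page_urls)
```

The first page keeps the plain address, and every later page gets its number plus "_p/" appended.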

Then I used Jonkatz2’s parser function to strip the data down even further. The rest of his functions didn’t work for me =/

getZillow <- function(urls) {
  lapply(urls, function(u) {
    cat(u, '\n')
    houses <- read_html(u) %>%
      html_nodes("article")
    houses
  })
}
zdata<- getZillow(urls)

Instead, I ended up breaking down different parts of his function to get the data that I needed. I had to write all of this complicated syntax because the data is saved as a list within lists: one list per results page, each holding the listings on that page.

#to pull ID
getID<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ids<-num %>% html_attr("id")
  })
}
id<-getID(zdata)

#get latitude
getLAT<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    lat<-num %>% html_attr("data-latitude")
    
  })
}
lats<-getLAT(zdata)

#get longitude
getLONG<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    long<-num %>% html_attr("data-longitude")
    
  })
}
longs<-getLONG(zdata)

#get price
getPrice<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    price<-num %>%  
      html_node(".zsg-photo-card-price") %>%
      html_text() 
  })
}
price<-getPrice(zdata)

#house description
getHdesc<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    Hdesc<-num %>%  
      html_node(".zsg-photo-card-info") %>%
      html_text() %>%
      strsplit("\u00b7")
  })
}
hdesc<-getHdesc(zdata)

#needs to be stripped down further
hdesc[[1]][[1]]
#number of listings on the first results page
ldata2<-length(hdesc[[1]])

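#beds, baths, and square footage are the 1st, 2nd, and 3rd pieces of each description;
#.default fills in NA when a listing has fewer pieces
getBeds<- function(data) {
  beds<-vector("list", length(data))
  for(i in seq_along(data)) {
    t1<-data[[i]]
    beds[[i]]<- t1 %>%
      purrr::map_chr(1, .default = NA_character_)
  }
  return(beds)
}
beds<-getBeds(hdesc)

getBath<- function(data) {
  baths<-vector("list", length(data))
  for(i in seq_along(data)) {
    t1<-data[[i]]
    baths[[i]]<- t1 %>%
      purrr::map_chr(2, .default = NA_character_)
  }
  return(baths)
}
baths<-getBath(hdesc)

getSQft<- function(data) {
  sqft<-vector("list", length(data))
  for(i in seq_along(data)) {
    t1<-data[[i]]
    sqft[[i]]<- t1 %>%
      purrr::map_chr(3, .default = NA_character_)
  }
  return(sqft)
}
sqft<-getSQft(hdesc)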

#house type
getHtype<- function(data) {
    ldata<-1:length(data)
    lapply(1:length(ldata), function(x) {
      num<-data[[x]]
      Htype<-num %>%  
        html_node(".zsg-photo-card-spec") %>%
        html_text()
  })
}
htype<-getHtype(zdata)

#address (html_node returns one result per listing, NA if missing, so rows stay aligned)
getAddy<-function(data) {
  ldata<- 1:length(data)
  lapply(1:length(ldata),function(x) {
    num<-data[[x]]
    addy<- num %>%
      html_node(".zsg-photo-card-address") %>%
      html_text() %>%
      strsplit("\u00b7")
  })
}

address<-getAddy(zdata)

#listing type
getLtype<-function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ltype<-num %>% html_attr("data-pgapt")
    
  })
}
list_type<-getLtype(zdata)
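The getter functions above all follow the same pattern: loop over the pages, then pull either an attribute or the text of a child node from every listing on that page. As a rough sketch of that pattern, the two helpers below (get_attr and get_text are hypothetical names, and the HTML is a tiny made-up example rather than Zillow's markup) do the same thing with lapply directly over the pages:

```r
library(rvest)
library(xml2)

# Two tiny made-up "pages" standing in for the scraped results
page1 <- read_html('<article id="a1" data-latitude="39739200"><span class="price">$100</span></article>
                    <article id="a2" data-latitude="39750000"></article>')
page2 <- read_html('<article id="b1" data-latitude="39760000"><span class="price">$300</span></article>')
pages <- list(html_nodes(page1, "article"), html_nodes(page2, "article"))

# Hypothetical helper: one attribute value per listing, page by page
get_attr <- function(pages, attr) {
  lapply(pages, function(nodes) html_attr(nodes, attr))
}

# Hypothetical helper: text of the first matching child per listing (NA if absent)
get_text <- function(pages, css) {
  lapply(pages, function(nodes) html_text(html_node(nodes, css)))
}

ids    <- get_attr(pages, "id")
prices <- get_text(pages, ".price")
unlist(ids)
unlist(prices)
```

Because html_node returns a placeholder for listings that lack the element, the second listing gets NA instead of being dropped, which keeps every column the same length.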

Now you can unlist one level:

address<-lapply(address, function(x) unlist(x))
htype<-lapply(htype, function(x) unlist(x))
id<-lapply(id, function(x) unlist(x))
lats<-lapply(lats,function(x) unlist(x))
longs<-lapply(longs,function(x) unlist(x))
list_type<-lapply(list_type,function(x) unlist(x))
price<-lapply(price,function(x) unlist(x))

Then, I put it all in a data frame:

#collect the columns in one list (named cols so it doesn't mask the list() function),
#then flatten each column across pages
cols<-list(id, price, address, beds, baths, sqft, list_type,longs, lats, htype)
makeList<-function(data) {
  lapply(data, function(x) unlist(x, recursive=FALSE))
}
List<-makeList(cols)
df<-data.frame(id=c(List[[1]]), price=c(List[[2]]), address=c(List[[3]]),
               beds=c(List[[4]]), baths=c(List[[5]]), sqft=c(List[[6]]),
               l_type=c(List[[7]]), long=c(List[[8]]), lat=c(List[[9]]),
               h_type=c(List[[10]]))

Some of these variables are not correctly formatted. For example, latitude and longitude values were stripped of their decimal points, so I need to add them back in by first removing the factor formatting and then doing some division.

df$long <-as.numeric(as.character(df$long)) / 1000000
df$lat<-as.numeric(as.character(df$lat)) / 1000000
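The as.character() step matters when the column is a factor, which is how data.frame() stored strings by default before R 4.0. A small illustration with made-up coordinates (long_raw is a hypothetical stand-in for the scraped column):

```r
# Made-up coordinates stored as a factor
long_raw <- factor(c("-104990300", "-104985000", "-104990300"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(long_raw)
codes

# Converting to character first recovers the values, which can then be rescaled
long <- as.numeric(as.character(long_raw)) / 1000000
long
```

Skipping the character conversion would put every house at a longitude of 1 or 2 instead of around -105.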

Also, some of my other variables have non-numeric characters in them, so I want to remove those too:

df$beds <-as.numeric(gsub("[^0-9]", "",df$beds))
#keep the decimal point so that half baths (e.g. 2.5) are not read as 25
df$baths <-as.numeric(gsub("[^0-9.]", "",df$baths))
df$sqft <-as.numeric(gsub("[^0-9]", "",df$sqft))
df$price <-as.numeric(gsub("[^0-9]", "",df$price))
#replace NAs with 0
df[is.na(df)]<-0
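One thing to watch when stripping characters this way: a pattern that removes everything except digits also removes decimal points. The sketch below uses made-up strings in the style of the listing text (baths_raw and price_raw are hypothetical) to show the difference:

```r
# Made-up values in the style of the scraped listing text
baths_raw <- c("2 ba", "2.5 ba", "-- ba")
price_raw <- c("$450,000", "$1,200,000", NA)

# "[^0-9]" drops the decimal point, so 2.5 is read as 25
baths_digits_only <- as.numeric(gsub("[^0-9]", "", baths_raw))
baths_digits_only

# "[^0-9.]" keeps the decimal point; strings with no digits become NA
baths <- as.numeric(gsub("[^0-9.]", "", baths_raw))
price <- as.numeric(gsub("[^0-9.]", "", price_raw))
baths
price
```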

Now I can map my data, in addition to conducting any analyses that I may want to do. Since there’s a ton of stuff out there on conducting analyses in R, I’ll just show you how I mapped my data using the leaflet package:

library(leaflet)
m <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng=df$long, lat=df$lat, popup=df$id) 
m

It should look like this:

If you click on the markers, they will show you the house IDs that they are associated with. You can see the web version by going to my OSF account, where I also posted the R program that I used.
