2018 World Population Day!

2018 World Population Day!

Happy World Population Day!

In case you’re unfamiliar with World Population Day, it started in 1989 by the Governing Council of the United Nations Development Programme. July 11th was chosen because in 1987, it marked the approximate date in which the world’s population reached 5 billion people. The purpose of World Population Day is to draw attention to issues related to the global population, including the implications of population growth on the environment, economic development, gender equality, education, poverty, and human rights. The latter issue is the theme celebrated this year. Specifically, family planning as a human right, as this year marks the 50th anniversary of the 1968 International Conference on Human Rights, where family planning was for the first time globally affirmed to be a human right.

This year, the world population is estimated to be around 7,632,819,325 people. If you want to see estimates of the global population in real time, you can visit the Worldometers website, which will also show other interesting estimates of population-related statistics, such as healthcare expenditures, energy consumption, and water use. If you’re interested in where they get their data and their methods, you can visit their FAQ section.

World Population #Rstats Edition

In celebration of World Population Day, I thought I would share an R program that pulls data from the Worldometers site:

worldometer-countries

and creates a world map that highlights the top 10 countries with the largest total populations:

WorldPop
The top 10 countries with the largest total populations is highlighted in dark green.

R code below:

#Load libraries
library(tidyverse)
library(rvest)
library(magrittr)
library(ggmap)
library(stringr)
library(viridis)
library(scales)
#Retrieve data:
html.global_pop %
  extract2(1) %>%
  html_table()

#Check data
head(df.global_pop_RAW) 

#Check for unnecessary spaces in values
glimpse(df.global_pop_RAW)

#Check if country names match those in the map package
as.factor(df.global_pop_RAW$`Country (or dependency)`) %>% levels()

#Renaming countries to match how they are named in the package
df.global_pop_RAW$`Country (or dependency)` <- recode(df.global_pop_RAW$`Country (or dependency)`
                                   ,'U.S.' = 'USA'
                                   ,'U.K.' = 'UK')

#Convert population to numeric--you have to remove the "," before converting 
df.global_pop_RAW$`Population (2018)`<-as.numeric(as.vector(unlist(gsub(",", "",df.global_pop_RAW$`Population (2018)` ))))
sapply(df.global_pop_RAW,class) #Check that it worked

#Generate a world map
world_map<- map_data('world')

#Join map data with our data
map.world_joined <- left_join(world_map, df.global_pop_RAW, 
                              by = c('region' = 'Country (or dependency)'))

#Take only top 10 countries
df.global_pop10 %
  top_n(10)

#Printing to check
df.global_pop10 

#Change data to numeric
df.global_pop10$`Population (2018)`<-as.numeric(as.vector(unlist(gsub(",", "",df.global_pop10$`Population (2018)` ))))

#Check if worked correctly
sapply(df.global_pop10,class)

#Join map data to our data
map.world_joined2 <- left_join(world_map, df.global_pop10, 
                              by = c('region' = 'Country (or dependency)'))

#Create Flag to indicate that it will be colored in for the map
map.world_joined2 %
  mutate(tofill2 = ifelse(is.na(`#`), F, T))


#Now generate the map
ggplot() +
  geom_polygon(data = map.world_joined2, 
               aes(x = long, y = lat, group = group, fill = tofill2)) +
  scale_fill_manual(values = c("lightcyan2","darkturquoise")) +
  labs(title =  'Top 10 Countries with Largest populations (2018)'
       ,caption = "source: http://www.worldometers.info/world-population/population-by-country/") +
  theme_minimal() +
  theme(text = element_text(family = "Gill Sans")
        ,plot.title = element_text(size = 16)
        ,plot.caption = element_text(size = 5)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        ,legend.position = "none"
  )

Alternatively, you could include all the countries and use a gradient to indicate population size. However, China and India’s population is so large relative to other countries that it becomes difficult to see any real comparison.

#Generate map data (again)
world_map<- map_data('world')

#re-join with data
map.world_joined <- left_join(world_map, df.global_pop_RAW, 
                              by = c('region' = 'Country (or dependency)'))

#flag to fill ALL countries that match with the map package
map.world_joined %
  mutate(tofill = ifelse(is.na(`#`), F, T))

#Check that it worked correctly
head(map.world_joined,12)

#Then generate new map
ggplot(data = map.world_joined, aes(x = long, y = lat, group = group), color="white", size=.001) +
  geom_polygon(aes(x = long, y = lat, group = group, fill = `Population (2018)`)) +
  scale_fill_viridis(option = 'magma') +
  labs(title =  'Top 10 Countries with Largest populations'
       ,caption = "source: http://www.worldometers.info/world-population/population-by-country/") +
  theme_minimal() +
  theme(text = element_text(family = "Gill Sans")
        ,plot.title = element_text(size = 18)
        ,plot.caption = element_text(size = 5)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
  )

Which should produce this map:
PopMap2
You can see that the other countries that made the top 10 list are not black, which reflects the smallest population sizes. It’s really that this map accentuates how large China’s and India’s population are relative to the other countries.

More population data and viz

If you want to know more about the global population and how it has changed over time, here are some great resources:

Our World in Data— see also their estimates for future population growth

8 min PBS video

Hans Rosling Tedx video (10 min)

20 min Hans Rosling video –this uses the gapminder data I often code with in Python and in R

Kurzgsagt animated video (6.5 min)

If you’re interested in theories and analytical concepts of demography, here are some links to free online class material:

Johns Hopkins Demographic Methods –or here

Johns Hopkins Principles of Population Change

Collecting data from Zillow with R

Collecting data from Zillow with R

My mom has been house hunting over the past couple of weeks, so I decided to try and use R to look at the local market. Here’s what I’ve learned:

Collecting data from Zillow was pretty easy, overall. I mostly used R packages rvest, xlm2, and tidyr.

library(rvest)
library(tidyr
library(xml2)

Next, I went to Zillow and searched for homes in Denver, CO. I zoomed in on an area that I wanted to analyze and then copied the link and pulled the data in R:

url<-"https://www.zillow.com//homes//for_sale//Denver-CO_rb//?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy"
webpage<-read_html(url)

The next part gets pretty complicated to explain. You essentially have to find the information you want from the webpage,which looks like a bunch of scrambled text. It’s helpful to go back to the webpage, right click, and select “View Page Source.” This will help you identify the structure of the webpage and pull the data you want. I started by parsing out the housing links from the metadata. You’ll have to remove characters to parse out the data, which I show below:

houses<- webpage %>%
  html_nodes(".zsg-pagination a") %>%
  html_attr("href")

houses<-houses[!is.na(houses)]
houses <-strsplit(houses,"/")
houses<-lapply(houses, function(x) x[length(x)])
houses<-as.numeric(gsub('[_p]','',houses))
houses <-max(houses)
urls<-c(url,paste0(url,2:houses,'_p/'))
urls

Then I used Jonkatz2 parser function to strip the data down even further. The rest of his functions didn’t work for me =/

getZillow <- function(urls) {
   lapply(urls, function(u) {
   cat(u, '\n')
   houses <- read_html(u) %>%
              html_nodes("article") houses })
 }
zdata<- getZillow(urls)

Instead, I ended breaking down different parts of his function to get the data that I need. The reason I had to write all of this complicated syntax is because the data is saved in a list within lists.

#to pull ID
getID<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ids<-num %>% html_attr("id")
  })
}
id<-getID(zdata)

#get latitude
getLAT<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    lat<-num %>% html_attr("data-latitude")
    
  })
}
lats<-getLAT(zdata)

#get longitude
getLONG<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    long<-num %>% html_attr("data-longitude")
    
  })
}
longs<-getLONG(zdata)

#get price
getPrice<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    price<-num %>%  
      html_node(".zsg-photo-card-price") %>%
      html_text() 
  })
}
price<-getPrice(zdata)

#house description
getHdesc<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    Hdesc<-num %>%  
      html_node(".zsg-photo-card-info") %>%
      html_text() %>%
      strsplit("\u00b7")
  })
}
hdesc<-getHdesc(zdata)

#needs to be stripped down further
hdesc[[1]][[1]]
ldata2<-length(hdesc[[ldata]])

beds<-list()
getBeds<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
     beds[[i]]<- t1 %>%
       purrr::map_chr(1)
  }
  return(beds)
}
beds<-getBeds(hdesc)

baths<-list()
getBath<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
    baths[[i]]<- t1 %>%
      purrr::map_chr(2)
  }
  return(baths)
}
baths<-getBath(hdesc)

sqft<-list()
getSQft<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
    sqft[[i]]<- t1 %>%
      purrr::map_chr(3)
  }
  return(sqft)
}
sqft<-getSQft(hdesc)

#house type
getHtype<- function(data) {
    ldata<-1:length(data)
    lapply(1:length(ldata), function(x) {
      num<-data[[x]]
      Htype<-num %>%  
        html_node(".zsg-photo-card-spec") %>%
        html_text()
  })
}
htype<-getHtype(zdata)

#address
getAddy<-function(data) {
  ldata<- 1:length(data)
  lapply(1:length(ldata),function(x) {
    num<-data[[x]]
    addy<- num %>%
      html_nodes(".zsg-photo-card-address") %>%
    html_text() %>%
      strsplit("\u00b7")
  })
}

address<-getAddy(zdata)

#listing type
getLtype<-function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ltype<-num %>% html_attr("data-pgapt")
    
  })
}
list_type<-getLtype(zdata)

Now you can unlist one level:

address<-lapply(address, function(x) unlist(x))
htype<-lapply(htype, function(x) unlist(x))
id<-lapply(id, function(x) unlist(x))
lats<-lapply(lats,function(x) unlist(x))
longs<-lapply(longs,function(x) unlist(x))
list_type<-lapply(list_type,function(x) unlist(x))
price<-lapply(price,function(x) unlist(x))

Then, I put it all in a data frame:

df<-data.frame()
list<-list(id, price, address, beds, baths, sqft, list_type,longs, lats, htype)
makeList<-function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ll<-num %>% unlist(recursive=FALSE) 
  })
}
List<-makeList(list)
df<-data.frame(id=c(List[[1]]), price=c(List[[2]]), address=c(List[[3]]),
               beds=c(List[[4]]), baths=c(List[[5]]), sqft=c(List[[6]]),
               l_type=c(List[[7]]), long=c(List[[8]]), lat=c(List[[9]]),
               h_type=c(List[[10]]))

Some of these variables are not correctly formatted. For example, latitude and longitude values were stripped of their decimal points, so I need to add them back in by first removing the factor formatting and then doing some division.

df$long <-as.numeric(as.character(df$long)) / 1000000
df$lat<-as.numeric(as.character(df$lat)) / 1000000

Also, some of my other variables have characters in them, so I want to remove that too:

df$beds <-as.numeric(gsub("[^0-9]", "",df$beds, ignore.case = TRUE))
df$baths <-as.numeric(gsub("[^0-9]", "",df$baths, ignore.case = TRUE))
df$sqft <-as.numeric(gsub("[^0-9]", "",df$sqft, ignore.case = TRUE))
df$price <-as.numeric(gsub("[^0-9]", "",df$price, ignore.case = TRUE))
#replace NAs with 0
df[is.na(df)]<-0

Now I can map my data, in addition to conducting any analyses that I may want to do. Since there’s a ton of stuff out there on conducting analyses in R, I’ll just show you how I mapped my data using the leaflet package:

library(leaflet)
m <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng=df$long, lat=df$lat, popup=df$id) 
m

It should look like this:

DenverZillow

If you click on the markers, they will show you the house IDs that they are associated with. You can see the web version by going to my OSF account, where I also posted the R program that I used.