2018 World Population Day!

Happy World Population Day!

In case you're unfamiliar with World Population Day, it was established in 1989 by the Governing Council of the United Nations Development Programme. July 11th was chosen because it marked the approximate date in 1987 on which the world's population reached 5 billion people. The purpose of World Population Day is to draw attention to issues related to the global population, including the implications of population growth for the environment, economic development, gender equality, education, poverty, and human rights. The latter is this year's theme: specifically, family planning as a human right, since 2018 marks the 50th anniversary of the 1968 International Conference on Human Rights, where family planning was globally affirmed to be a human right for the first time.

This year, the world population is estimated to be around 7,632,819,325 people. If you want to see estimates of the global population in real time, you can visit the Worldometers website, which will also show other interesting estimates of population-related statistics, such as healthcare expenditures, energy consumption, and water use. If you’re interested in where they get their data and their methods, you can visit their FAQ section.

World Population #Rstats Edition

In celebration of World Population Day, I thought I would share an R program that pulls data from the Worldometers site:

worldometer-countries

and creates a world map that highlights the top 10 countries with the largest total populations:

WorldPop
The top 10 countries with the largest total populations are highlighted in dark green.

R code below:

#Load libraries
library(tidyverse)
library(rvest)
library(magrittr)
library(ggmap)
library(stringr)
library(viridis)
library(scales)
#Retrieve data:
html.global_pop <- read_html("http://www.worldometers.info/world-population/population-by-country/")

df.global_pop_RAW <- html.global_pop %>%
  html_nodes("table") %>%
  extract2(1) %>%
  html_table()

#Check data
head(df.global_pop_RAW) 

#Check for unnecessary spaces in values
glimpse(df.global_pop_RAW)

#Check if country names match those in the map package
as.factor(df.global_pop_RAW$`Country (or dependency)`) %>% levels()

#Renaming countries to match how they are named in the package
df.global_pop_RAW$`Country (or dependency)` <- recode(df.global_pop_RAW$`Country (or dependency)`
                                   ,'U.S.' = 'USA'
                                   ,'U.K.' = 'UK')

#Convert population to numeric--you have to remove the "," before converting 
df.global_pop_RAW$`Population (2018)`<-as.numeric(as.vector(unlist(gsub(",", "",df.global_pop_RAW$`Population (2018)` ))))
sapply(df.global_pop_RAW,class) #Check that it worked

#Generate a world map
world_map<- map_data('world')

#Join map data with our data
map.world_joined <- left_join(world_map, df.global_pop_RAW, 
                              by = c('region' = 'Country (or dependency)'))

#Take only top 10 countries
df.global_pop10 <- df.global_pop_RAW %>%
  top_n(10, `Population (2018)`)

#Printing to check
df.global_pop10 

#Change data to numeric
df.global_pop10$`Population (2018)`<-as.numeric(as.vector(unlist(gsub(",", "",df.global_pop10$`Population (2018)` ))))

#Check if worked correctly
sapply(df.global_pop10,class)

#Join map data to our data
map.world_joined2 <- left_join(world_map, df.global_pop10, 
                              by = c('region' = 'Country (or dependency)'))

#Create a flag to indicate which countries will be colored in on the map
map.world_joined2 <- map.world_joined2 %>%
  mutate(tofill2 = ifelse(is.na(`#`), F, T))


#Now generate the map
ggplot() +
  geom_polygon(data = map.world_joined2, 
               aes(x = long, y = lat, group = group, fill = tofill2)) +
  scale_fill_manual(values = c("lightcyan2","darkturquoise")) +
  labs(title =  'Top 10 Countries with Largest populations (2018)'
       ,caption = "source: http://www.worldometers.info/world-population/population-by-country/") +
  theme_minimal() +
  theme(text = element_text(family = "Gill Sans")
        ,plot.title = element_text(size = 16)
        ,plot.caption = element_text(size = 5)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
        ,legend.position = "none"
  )

Alternatively, you could include all the countries and use a gradient to indicate population size. However, China's and India's populations are so large relative to the other countries' that it becomes difficult to see any real comparison.

#Generate map data (again)
world_map<- map_data('world')

#re-join with data
map.world_joined <- left_join(world_map, df.global_pop_RAW, 
                              by = c('region' = 'Country (or dependency)'))

#flag to fill ALL countries that match with the map package
map.world_joined <- map.world_joined %>%
  mutate(tofill = ifelse(is.na(`#`), F, T))

#Check that it worked correctly
head(map.world_joined,12)

#Then generate new map
ggplot(data = map.world_joined, aes(x = long, y = lat, group = group), color="white", size=.001) +
  geom_polygon(aes(x = long, y = lat, group = group, fill = `Population (2018)`)) +
  scale_fill_viridis(option = 'magma') +
  labs(title =  'Top 10 Countries with Largest populations'
       ,caption = "source: http://www.worldometers.info/world-population/population-by-country/") +
  theme_minimal() +
  theme(text = element_text(family = "Gill Sans")
        ,plot.title = element_text(size = 18)
        ,plot.caption = element_text(size = 5)
        ,axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank()
  )

Which should produce this map:
PopMap2
You can see that the other countries that made the top 10 list are not black (the color that reflects the smallest population sizes), but this map really just accentuates how large China's and India's populations are relative to the other countries.
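If you wanted that gradient version to show more contrast among the smaller top-10 countries, one option (a tweak of my own, not part of the map above) is to put the fill on a log scale:

#A log-scaled variant of the gradient map
ggplot(data = map.world_joined, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = `Population (2018)`)) +
  scale_fill_viridis(name = "Population (log scale)", option = 'magma',
                     trans = 'log10', labels = comma) + #comma() comes from the scales package loaded above
  labs(title = 'World population by country (2018)'
       ,caption = "source: http://www.worldometers.info/world-population/population-by-country/") +
  theme_minimal() +
  theme(axis.text = element_blank()
        ,axis.title = element_blank()
        ,axis.ticks = element_blank())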

More population data and viz

If you want to know more about the global population and how it has changed over time, here are some great resources:

Our World in Data— see also their estimates for future population growth

8 min PBS video

Hans Rosling TEDx video (10 min)

20 min Hans Rosling video –this uses the gapminder data I often code with in Python and in R

Kurzgesagt animated video (6.5 min)

If you’re interested in theories and analytical concepts of demography, here are some links to free online class material:

Johns Hopkins Demographic Methods –or here

Johns Hopkins Principles of Population Change

Collecting data from Zillow with R

My mom has been house hunting over the past couple of weeks, so I decided to try and use R to look at the local market. Here’s what I’ve learned:

Collecting data from Zillow was pretty easy, overall. I mostly used the R packages rvest, xml2, and tidyr.

library(rvest)
library(tidyr)
library(xml2)

Next, I went to Zillow and searched for homes in Denver, CO. I zoomed in on an area that I wanted to analyze and then copied the link and pulled the data in R:

url<-"https://www.zillow.com//homes//for_sale//Denver-CO_rb//?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy"
webpage<-read_html(url)

The next part gets pretty complicated to explain. You essentially have to find the information you want from the webpage, which looks like a bunch of scrambled text. It's helpful to go back to the webpage, right click, and select "View Page Source." This will help you identify the structure of the webpage and pull the data you want. I started by parsing the housing links out of the metadata. You'll have to remove characters to parse out the data, which I show below:

houses<- webpage %>%
  html_nodes(".zsg-pagination a") %>%
  html_attr("href")

houses<-houses[!is.na(houses)]
houses <-strsplit(houses,"/")
houses<-lapply(houses, function(x) x[length(x)])
houses<-as.numeric(gsub('[_p]','',houses))
houses <-max(houses)
urls<-c(url,paste0(url,2:houses,'_p/'))
urls

Then I used Jonkatz2's parser function to strip the data down even further. The rest of his functions didn't work for me =/

getZillow <- function(urls) {
  lapply(urls, function(u) {
    cat(u, '\n')
    houses <- read_html(u) %>%
      html_nodes("article")
    houses
  })
}
zdata<- getZillow(urls)

Instead, I ended up breaking down different parts of his function to get the data that I need. The reason I had to write all of this complicated syntax is that the data are saved in lists within lists.

#to pull ID
getID<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ids<-num %>% html_attr("id")
  })
}
id<-getID(zdata)

#get latitude
getLAT<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    lat<-num %>% html_attr("data-latitude")
    
  })
}
lats<-getLAT(zdata)

#get longitude
getLONG<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    long<-num %>% html_attr("data-longitude")
    
  })
}
longs<-getLONG(zdata)

#get price
getPrice<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    price<-num %>%  
      html_node(".zsg-photo-card-price") %>%
      html_text() 
  })
}
price<-getPrice(zdata)

#house description
getHdesc<- function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    Hdesc<-num %>%  
      html_node(".zsg-photo-card-info") %>%
      html_text() %>%
      strsplit("\u00b7")
  })
}
hdesc<-getHdesc(zdata)

#needs to be stripped down further
hdesc[[1]][[1]]
ldata2<-length(hdesc[[ldata]])

beds<-list()
getBeds<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
     beds[[i]]<- t1 %>%
       purrr::map_chr(1)
  }
  return(beds)
}
beds<-getBeds(hdesc)

baths<-list()
getBath<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
    baths[[i]]<- t1 %>%
      purrr::map_chr(2)
  }
  return(baths)
}
baths<-getBath(hdesc)

sqft<-list()
getSQft<- function(data) {
  for(i in 1:length(data)) {
    t1<-data[[i]]
    sqft[[i]]<- t1 %>%
      purrr::map_chr(3)
  }
  return(sqft)
}
sqft<-getSQft(hdesc)

#house type
getHtype<- function(data) {
    ldata<-1:length(data)
    lapply(1:length(ldata), function(x) {
      num<-data[[x]]
      Htype<-num %>%  
        html_node(".zsg-photo-card-spec") %>%
        html_text()
  })
}
htype<-getHtype(zdata)

#address
getAddy<-function(data) {
  ldata<- 1:length(data)
  lapply(1:length(ldata),function(x) {
    num<-data[[x]]
    addy<- num %>%
      html_nodes(".zsg-photo-card-address") %>%
    html_text() %>%
      strsplit("\u00b7")
  })
}

address<-getAddy(zdata)

#listing type
getLtype<-function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ltype<-num %>% html_attr("data-pgapt")
    
  })
}
list_type<-getLtype(zdata)

Now you can unlist one level:

address<-lapply(address, function(x) unlist(x))
htype<-lapply(htype, function(x) unlist(x))
id<-lapply(id, function(x) unlist(x))
lats<-lapply(lats,function(x) unlist(x))
longs<-lapply(longs,function(x) unlist(x))
list_type<-lapply(list_type,function(x) unlist(x))
price<-lapply(price,function(x) unlist(x))

Then, I put it all in a data frame:

df<-data.frame()
list<-list(id, price, address, beds, baths, sqft, list_type,longs, lats, htype)
makeList<-function(data) {
  ldata<-1:length(data)
  lapply(1:length(ldata), function(x) {
    num<-data[[x]]
    ll<-num %>% unlist(recursive=FALSE) 
  })
}
List<-makeList(list)
df<-data.frame(id=c(List[[1]]), price=c(List[[2]]), address=c(List[[3]]),
               beds=c(List[[4]]), baths=c(List[[5]]), sqft=c(List[[6]]),
               l_type=c(List[[7]]), long=c(List[[8]]), lat=c(List[[9]]),
               h_type=c(List[[10]]))

Some of these variables are not correctly formatted. For example, latitude and longitude values were stripped of their decimal points, so I need to add them back in by first removing the factor formatting and then doing some division.

df$long <-as.numeric(as.character(df$long)) / 1000000
df$lat<-as.numeric(as.character(df$lat)) / 1000000

Also, some of my other variables have characters in them, so I want to remove that too:

df$beds <-as.numeric(gsub("[^0-9]", "",df$beds, ignore.case = TRUE))
df$baths <-as.numeric(gsub("[^0-9]", "",df$baths, ignore.case = TRUE))
df$sqft <-as.numeric(gsub("[^0-9]", "",df$sqft, ignore.case = TRUE))
df$price <-as.numeric(gsub("[^0-9]", "",df$price, ignore.case = TRUE))
#replace NAs with 0
df[is.na(df)]<-0

Now I can map my data, in addition to conducting any analyses that I may want to do. Since there’s a ton of stuff out there on conducting analyses in R, I’ll just show you how I mapped my data using the leaflet package:

library(leaflet)
m <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng=df$long, lat=df$lat, popup=df$id) 
m

It should look like this:

DenverZillow

If you click on the markers, they will show you the house IDs that they are associated with. You can see the web version by going to my OSF account, where I also posted the R program that I used.
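If you want the popups to show more than the ID, you could paste extra fields into the popup text, something like this (just a variation on the code above, using the columns created earlier):

#Hypothetical variation: show price and square footage in the popup
m2 <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng=df$long, lat=df$lat,
             popup=paste0("ID: ", df$id,
                          "<br>Price: $", df$price,
                          "<br>Sqft: ", df$sqft))
m2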

Collecting Twitter Data using the twitteR package in Rstudio

Last week, I wrote a blog post about collecting data using Tweepy in Python. As usual, I decided to recreate my work in R so that I can compare my experience using different analytical tools. I will walk you through what I did, but I assume that you already have Rstudio installed. If not, and you wish to follow along, here's a link to a good resource that explains how to download and install Rstudio.

Begin by loading the following libraries–download them if you don’t have them already installed.

#To download:
#install.packages(c("twitteR", "purrr", "dplyr", "stringr"),dependencies=TRUE)

library(twitteR)
library(purrr)
suppressMessages(library(dplyr))
library(stringr)

Next, initiate the OAuth protocol. This of course assumes that you have registered your Twitter app. If not, here’s a link that explains how to do this.


api_key <- "your_consumer_api_key"
api_secret <-"your_consumer_api_secret"
token <- "your_access_token"
token_secret <- "your_access_secret"

setup_twitter_oauth(api_key, api_secret, token, token_secret)

Now you can use the package twitteR to collect the information that you want. For example, #rstats or #rladies <–great hashtags to follow on Twitter, btw 😉

tw = searchTwitter('#rladies + #rstats', n = 20)

which will return a list of (20) tweets that contain the two search terms that I specified:

RstatsRlaides

*If you want more than 20 tweets, simply increase the number following n=
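For example, a larger pull might look something like this (retryOnRateLimit is optional, but it helps if you hit Twitter's rate limit):

tw_large = searchTwitter('#rladies + #rstats', n = 500, retryOnRateLimit = 120)
tw_large.df <- twListToDF(tw_large) #optional: convert the list of tweets to a data frame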

Alternatively, you can collect data on a specific user. For example, I am going to collect tweets from this awesome R-Lady, @Lego_RLady:

Again, using the twitteR package, type the following:


LegoRLady <- getUser("LEGO_RLady") #for info on the user
RLady_tweets<-userTimeline("LEGO_RLady",n=30,retryOnRateLimit=120) #to get tweets
tweets.df<-twListToDF(RLady_tweets) #turn into data frame
write.csv(tweets.df, "Rlady_tweets.csv", row.names = FALSE) #export to Excel

Luckily, she only has 27 tweets total. If you are collecting tweets from a user who has been on Twitter for longer, you'll likely have to use a loop to keep collecting tweets because of the rate limit (I sketch such a loop below the screenshot). If you export to Excel, you should see something like this:
Excel

*Note: I bolded the column names and created the border to help distinguish the data
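If you do need to page through an older account, a rough sketch of that loop might look like the following. This is my own sketch, not tested against a large account; maxID is a twitteR argument, and the object names are just illustrative:

#Sketch: keep requesting older tweets until nothing new comes back
all_tweets <- list()
max_id <- NULL
repeat {
  batch <- userTimeline("LEGO_RLady", n = 200, maxID = max_id,
                        retryOnRateLimit = 120)
  if (length(batch) == 0) break
  all_tweets <- c(all_tweets, batch)
  if (length(batch) < 200) break #nothing older left to request
  max_id <- batch[[length(batch)]]$id #oldest ID in this batch; maxID is inclusive,
                                      #so the oldest tweet repeats and is de-duplicated below
}
all_tweets.df <- unique(twListToDF(all_tweets))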

If you’re interested in the retweets and replies to @LEGO_RLady, then you can search for that specifically. To limit the amount of data, let’s limit it to any replies since the following tweet:

target_tweet<-"991771358634889222"
atRLady <- searchTwitter("@LEGO_RLady", 
                       sinceID=target_tweet, n=25, retryOnRateLimit = 20)
atRLady.df<-twListToDF(atRLady)

The atRLady.df data frame should look like this:

atRlady

There’s much more data if you scroll right. You should have 16 variables total.

Sometimes there are characters in the tweet that result in errors. To make sure that the tweet is in plain text, you can do the following:

replies <- unlist(atRLady) #make sure to use the list and not the data frame

#helper function to remove characters:
clean_tweets <- function (tweet_list) {
  lapply(tweet_list, function (x) {
    x <- x$getText() # get text alone
    x <- gsub("&amp", "", x) # rm ampersands
    x <- gsub("(f|ht)(tp)(s?)(://)(.*)[.|/](.*) ?", "", x) # rm links
    x <- gsub("#\\w+", "", x) # rm hashtags
    x <- gsub("@\\w+", "", x) # rm usernames
    x <- iconv(x, "latin1", "ASCII", sub="") # rm emojis
    x <- gsub("[[:punct:]]", "", x) # rm punctuation
    x <- gsub("[[:digit:]]", "", x) # rm numbers
    x <- gsub("[ \t]{2}", " ", x) # rm tabs
    x <- gsub("\\s+", " ", x) # rm extra spaces
    x <- trimws(x) # rm leading and trailing white space
    x <- tolower(x) # convert to lower case
  })
}
tweets_clean <- unlist(clean_tweets(replies))
# If you want to recombine the text with the metadata (user, time, favorites, retweets)
tweet_data <- data.frame(text=tweets_clean)
tweet_data <- tweet_data[tweet_data$text != "",]
tweet_data<-data.frame(tweet_data)
tweet_data$user <-atRLady.df$screenName
tweet_data$time <- atRLady.df$created
tweet_data$favorites <- atRLady.df$favoriteCount
tweet_data$retweets <- atRLady.df$retweetCount
tweet_data$time_bin <- cut.POSIXt(tweet_data$time, breaks="3 hours", labels = FALSE)
tweet_data$isRetweet <- atRLady.df$isRetweet

You can pull other information from the original data frame as well, but I don’t find that information very helpful since it is usually NA (e.g., latitude and longitude). The final data frame should look like this:

cleantweets

Now you can analyze it. For example, you can graph the number of retweets for each reply:

library(ggplot2)
ggplot(data = tweet_data, aes(x = retweets)) +
  geom_bar(aes(fill = ..count..)) +
  theme(legend.position = "none") +
  xlab("Retweets") +
  scale_fill_gradient(low = "midnightblue", high = "aquamarine4")
dev.copy(png,'myplot.png')
dev.off()

myplot

If you have more data, you can conduct a sentiment analysis of all the words in the text of the tweets or create a wordcloud (example below)

BigDataWordMap-1264x736
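If you want to roll your own wordcloud, here is a minimal sketch using the wordcloud package (assuming you still have tweets_clean from above and have installed wordcloud):

library(wordcloud)
library(RColorBrewer)

#split the cleaned tweets into words and count them
words <- unlist(strsplit(tweets_clean, " "))
word_freq <- sort(table(words), decreasing = TRUE)

set.seed(123) #so the layout is reproducible
wordcloud(names(word_freq), freq = as.numeric(word_freq),
          min.freq = 2, colors = brewer.pal(8, "Dark2"))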

Overall, using R to collect data from Twitter was really easy. Honestly, it was pretty easy to do in Python too. However, I must say that the R community is slightly better when it comes to sharing resources and blogs that make it easy for beginners to follow what they've done. I really love the open source community and I'm excited that I am a part of this movement!

PS- I forgot to announce that I am officially an R-Lady (see directory)! To all my fellow lady friends, I encourage you to join!

US Fertility Heat Map DIY

The US fertility heat maps that I made a couple of weeks ago received a lot of attention, and one of the questions I've been asked is how I produced them, which I describe in this post.

As I mentioned in my previous post, I simply followed the directions specified in this article, but I limited the UN data to the US. Overall, I think the article does a good job of explaining how they created their heat map in Tableau. The reason why I remade the heat map in R is because I was just frustrated with the process of trying to embed the visualization into WordPress. Both Tableau and WordPress charge you to embed visualizations in a format that is aesthetically pleasing. Luckily, recreating the heat map in R was extremely easy and just as pretty, at least in my opinion. Here’s how I did it:

First, download the data from the UN website–limit the data to the US only. Alternatively, I’ve linked to the (formatted) data on my OSF account, which also provides access to my code.

Now type the following in Rstudio:


#load libraries:
#if you need to install first, type: install.packages("package_name",dependencies=TRUE)
library(tidyverse)
library(viridis)
library(ggthemes) #for theme_tufte()

#set your working directory to the folder your data is stored in
setwd("C:/Users/Stella/Documents/blog/US birth Map")
#if you don't know what directory is currently set to, type: getwd()

#now import your data
us_fertility<-read.csv("USBirthscsv.csv", header=TRUE) #change the file name if you did not use the data I provided (osf.io/h9ta2)

#limit to relevant data
dta <- us_fertility %>% select(Year, January:December)

#gather (i.e., reshape) the data of interest, in preparation for graphing
dta <- dta %>%
  gather(Month, births, January:December) %>%
  arrange(Year)

#ordering the data by most frequent incidence of births
bb2 <- dta %>%
  group_by(Year) %>%
  mutate(rank = dense_rank(desc(births)))

#plot the data
plot<- ggplot(bb2, aes(x = fct_rev(Month),
                       y = Year,
                       fill = rank)) +
  scale_x_discrete(name = "Months", labels = c("Jan", "Feb", "Mar",
                                               "Apr", "May", "Jun",
                                               "Jul", "Aug", "Sep",
                                               "Oct", "Nov", "Dec")) +
  scale_fill_viridis(name = "Births", option = "magma") + #optional command to change the colors of the heat map
  geom_tile(colour = "White", size = 0.4) +
  labs(title = "Heat Map of US Births",
       subtitle = "Frequency of Births from 1969-2014",
       x = "Month",
       y = "Year",
       caption = "Source: UN Data") +
  theme_tufte()

plot+ aes(x=fct_inorder(Month))

#if you want to save the graph
dev.copy(png, "births.png")
dev.off()

And that’s it! Simple, right?!

Understanding Linear Regression

My husband and I were discussing the intuition behind OLS regression today, and I decided to share the materials that I generated to break down covariance, correlations, and the linear regression equation. It may help to follow along in the Excel workbook in which I did this (see link).

First, let’s say that we have the following data:

Y X1 X2
1 40 25
2 45 20
1 38 30
3 50 30
2 48 28
3 55 30
3 53 34
4 55 36
4 58 32
3 40 34
5 55 38
3 48 28
3 45 30
2 55 36
4 60 34
5 60 38
5 60 42
5 65 38
4 50 34
3 58 38

We plot the relationship between each X variable and Y, to get a visual look at their relationships:

x1y

x2y

I also like to look at the relationship in a 3D plot (this plot was made in Rstudio with plotly–see link for tutorial):

https://plot.ly/~allets/6.embed

To calculate the mean of each column, sum the column and divide by the total number of observations (n=20):

Y X1 X2
Sum 65 1038 655
N 20 20 20
Mean 3.25 51.9 32.75

Now we need the sum of squared deviations for each column (labeled SD in the tables that follow). This is something most students do not find very intuitive unless you force them to calculate it step by step, either by hand or in Excel. Start by subtracting each column's mean from each observation in that column:

(Y-MeanY) (X1-MeanX1) (X2-MeanX2)
1-3.25=-2.25 40-51.9=-11.9 25-32.75=-7.75
2-3.25=-1.25 45-51.9=-6.9 20-32.75=-12.75
1-3.25=-2.25 38-51.9=-13.9 30-32.75=-2.75
3-3.25=-0.25 50-51.9=-1.9 30-32.75=-2.75
2-3.25=-1.25 48-51.9=-3.9 28-32.75=-4.75
3-3.25=-0.25 55-51.9=3.1 30-32.75=-2.75
3-3.25=-0.25 53-51.9=1.1 34-32.75=1.25
4-3.25=0.75 55-51.9=3.1 36-32.75=3.25
4-3.25=0.75 58-51.9=6.1 32-32.75=-0.75
3-3.25=-0.25 40-51.9=-11.9 34-32.75=1.25
5-3.25=1.75 55-51.9=3.1 38-32.75=5.25
3-3.25=-0.25 48-51.9=-3.9 28-32.75=-4.75
3-3.25=-0.25 45-51.9=-6.9 30-32.75=-2.75
2-3.25=-1.25 55-51.9=3.1 36-32.75=3.25
4-3.25=0.75 60-51.9=8.1 34-32.75=1.25
5-3.25=1.75 60-51.9=8.1 38-32.75=5.25
5-3.25=1.75 60-51.9=8.1 42-32.75=9.25
5-3.25=1.75 65-51.9=13.1 38-32.75=5.25
4-3.25=0.75 50-51.9=-1.9 34-32.75=1.25
3-3.25=-0.25 58-51.9=6.1 38-32.75=5.25

Then you will square each value in each column:

(Y-MeanY)^2 (X1-MeanX1)^2 (X2-MeanX2)^2
-2.25^2=5.0625 -11.9^2=141.61 -7.75^2=60.0625
-1.25^2=1.5625 -6.9^2=47.61 -12.75^2=162.5625
-2.25^2=5.0625 -13.9^2=193.21 -2.75^2=7.5625
-0.25^2=0.0625 -1.9^2=3.61 -2.75^2=7.5625
-1.25^2=1.5625 -3.9^2=15.21 -4.75^2=22.5625
-0.25^2=0.0625 3.1^2=9.61 -2.75^2=7.5625
-0.25^2=0.0625 1.1^2=1.21 1.25^2=1.5625
0.75^2=0.5625 3.1^2=9.61 3.25^2=10.5625
0.75^2=0.5625 6.1^2=37.21 -0.75^2=0.5625
-0.25^2=0.0625 -11.9^2=141.61 1.25^2=1.5625
1.75^2=3.0625 3.1^2=9.61 5.25^2=27.5625
-0.25^2=0.0625 -3.9^2=15.21 -4.75^2=22.5625
-0.25^2=0.0625 -6.9^2=47.61 -2.75^2=7.5625
-1.25^2=1.5625 3.1^2=9.61 3.25^2=10.5625
0.75^2=0.5625 8.1^2=65.61 1.25^2=1.5625
1.75^2=3.0625 8.1^2=65.61 5.25^2=27.5625
1.75^2=3.0625 8.1^2=65.61 9.25^2=85.5625
1.75^2=3.0625 13.1^2=171.61 5.25^2=27.5625
0.75^2=0.5625 -1.9^2=3.61 1.25^2=1.5625
-0.25^2=0.0625 6.1^2=37.21 5.25^2=27.5625

If you sum up each column of the squared values, you get the sum of squared deviations for each variable (the SD row):

SD 29.75 1091.8 521.75

In addition to calculating the mean and SD term for Y, X1, and X2, you will also need to calculate the relationships between Y, X1, and X2 by first multiplying them together and then repeating each of the steps that we did above:

X1*Y X2*Y X1*X2
1*40=40 1*25=25 25*40=1000
2*45=90 2*20=40 20*45=900
1*38=38 1*30=30 30*38=1140
3*50=150 3*30=90 30*50=1500
2*48=96 2*28=56 28*48=1344
3*55=165 3*30=90 30*55=1650
3*53=159 3*34=102 34*53=1802
4*55=220 4*36=144 36*55=1980
4*58=232 4*32=128 32*58=1856
3*40=120 3*34=102 34*40=1360
5*55=275 5*38=190 38*55=2090
3*48=144 3*28=84 28*48=1344
3*45=135 3*30=90 30*45=1350
2*55=110 2*36=72 36*55=1980
4*60=240 4*34=136 34*60=2040
5*60=300 5*38=190 38*60=2280
5*60=300 5*42=210 42*60=2520
5*65=325 5*38=190 38*65=2470
4*50=200 4*34=136 34*50=1700
3*58=174 3*38=114 38*58=2204

Again, (1) sum each column and (2) divide by the total number of observations (n=20) to get the mean.

X1*Y X2*Y X1*X2
Sum 3513 2219 34510
N 20 20 20
Mean 175.65 110.95 1725.5

(3) In a separate table, subtract the respective mean for each column from each row value:

(X1*Y)-Mean(X1*Y) (X2*Y)-Mean(X2*Y) (X1*X2)-Mean(X1*X2)
40-175.65=-135.65 25-110.95=-85.95 1000-1725.5=-725.5
90-175.65=-85.65 40-110.95=-70.95 900-1725.5=-825.5
38-175.65=-137.65 30-110.95=-80.95 1140-1725.5=-585.5
150-175.65=-25.65 90-110.95=-20.95 1500-1725.5=-225.5
96-175.65=-79.65 56-110.95=-54.95 1344-1725.5=-381.5
165-175.65=-10.65 90-110.95=-20.95 1650-1725.5=-75.5
159-175.65=-16.65 102-110.95=-8.95 1802-1725.5=76.5
220-175.65=44.35 144-110.95=33.05 1980-1725.5=254.5
232-175.65=56.35 128-110.95=17.05 1856-1725.5=130.5
120-175.65=-55.65 102-110.95=-8.95 1360-1725.5=-365.5
275-175.65=99.35 190-110.95=79.05 2090-1725.5=364.5
144-175.65=-31.65 84-110.95=-26.95 1344-1725.5=-381.5
135-175.65=-40.65 90-110.95=-20.95 1350-1725.5=-375.5
110-175.65=-65.65 72-110.95=-38.95 1980-1725.5=254.5
240-175.65=64.35 136-110.95=25.05 2040-1725.5=314.5
300-175.65=124.35 190-110.95=79.05 2280-1725.5=554.5
300-175.65=124.35 210-110.95=99.05 2520-1725.5=794.5
325-175.65=149.35 190-110.95=79.05 2470-1725.5=744.5
200-175.65=24.35 136-110.95=25.05 1700-1725.5=-25.5
174-175.65=-1.65 114-110.95=3.05 2204-1725.5=478.5

(4) Square those values:

((X1*Y)-Mean(X1*Y))^2 ((X2*Y)-Mean(X2*Y))^2 ((X1*X2)-Mean(X1*X2))^2
-135.65^2=18400.9225 -85.95^2=7387.4025 -725.5^2=526350.25
-85.65^2=7335.9225 -70.95^2=5033.9025 -825.5^2=681450.25
-137.65^2=18947.5225 -80.95^2=6552.9025 -585.5^2=342810.25
-25.65^2=657.9225 -20.95^2=438.9025 -225.5^2=50850.25
-79.65^2=6344.1225 -54.95^2=3019.5025 -381.5^2=145542.25
-10.65^2=113.4225 -20.95^2=438.9025 -75.5^2=5700.25
-16.65^2=277.2225 -8.95^2=80.1025 76.5^2=5852.25
44.35^2=1966.9225 33.05^2=1092.3025 254.5^2=64770.25
56.35^2=3175.3225 17.05^2=290.7025 130.5^2=17030.25
-55.65^2=3096.9225 -8.95^2=80.1025 -365.5^2=133590.25
99.35^2=9870.4225 79.05^2=6248.9025 364.5^2=132860.25
-31.65^2=1001.7225 -26.95^2=726.3025 -381.5^2=145542.25
-40.65^2=1652.4225 -20.95^2=438.9025 -375.5^2=141000.25
-65.65^2=4309.9225 -38.95^2=1517.1025 254.5^2=64770.25
64.35^2=4140.9225 25.05^2=627.5025 314.5^2=98910.25
124.35^2=15462.9225 79.05^2=6248.9025 554.5^2=307470.25
124.35^2=15462.9225 99.05^2=9810.9025 794.5^2=631230.25
149.35^2=22305.4225 79.05^2=6248.9025 744.5^2=554280.25
24.35^2=592.9225 25.05^2=627.5025 -25.5^2=650.25
-1.65^2=2.7225 3.05^2=9.3025 478.5^2=228962.25

(5) Now sum up each column to get the SD row:

SD 135118.6 56918.95 4279623

Putting it all together, you should get a table like this:

Y X1 X2 X1*Y X2*Y X1*X2
Sum 65 1038 655 3513 2219 34510
N 20 20 20 20 20 20
Mean 3.25 51.9 32.75 175.65 110.95 1725.5
SD 29.75 1091.8 521.75 135118.6 56918.95 4279623

Now we can derive the covariance between each variable, as well as the correlation, using these formulas:
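The covariance entries here are the corrected sums of cross-products, and the correlations divide those by the square roots of the SD terms. Written out for X1 and Y (and consistent with the Excel calculations shown below):

\[
S_{X_1 Y} = \sum X_1 Y - \frac{(\sum X_1)(\sum Y)}{n} = 3513 - \frac{1038 \times 65}{20} = 139.5,
\qquad
r_{X_1 Y} = \frac{S_{X_1 Y}}{\sqrt{S_{X_1 X_1}\, S_{YY}}} = \frac{139.5}{\sqrt{1091.8 \times 29.75}} \approx 0.77
\]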

Such that your table should look like this:

y X1 X2
Y 29.75 139.5 90.25
X1 0.77 1091.8 515.5
X2 0.72 0.68 521.75

Notice that the numbers on the diagonal (blue) are the SD terms that we calculated. The numbers in the bottom triangle (underlined) are the correlations, and the numbers in the top triangle (red) are the covariance terms. Below, I show you how I calculated each value in each cell (in Excel):

y X1 X2
Y 29.75 3513-(1038*65)/20=139.5 2219-(655*65)/20=90.25
X1 139.5/SQRT(1091.8*29.75)=0.77403 1091.8 515.5
X2 90.25/SQRT(521.75*29.75)=0.724 515.5/SQRT(1091.8*521.75)=0.683 521.75

Now, you can calculate the betas for X1 and X2 using these formulas:
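These are the standard two-predictor least-squares formulas, written in the same S notation as above (they match the numbers plugged in below):

\[
b_1 = \frac{S_{X_2 X_2}\, S_{X_1 Y} - S_{X_1 X_2}\, S_{X_2 Y}}{S_{X_1 X_1}\, S_{X_2 X_2} - S_{X_1 X_2}^2},
\qquad
b_2 = \frac{S_{X_1 X_1}\, S_{X_2 Y} - S_{X_1 X_2}\, S_{X_1 Y}}{S_{X_1 X_1}\, S_{X_2 X_2} - S_{X_1 X_2}^2}
\]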

Notice that you are standardizing each variable by accounting for its covariance with the other predictor. Plugging in each value, you should get the following:

b1 (521.75*139.5-515.5*90.25)/(1091.8*521.75-515.5^2)=0.086
b2 (1091.8*90.25-515.5*139.5)/(1091.8*521.75-515.5^2)=0.088

Remember that the formula for OLS regression is simply:
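In the notation used here, that formula is:

\[
\hat{Y} = a + b_1 X_1 + b_2 X_2
\]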

So, using algebra, plug in the variables to calculate the constant:
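Rearranging for the intercept gives:

\[
a = \bar{Y} - b_1 \bar{X}_1 - b_2 \bar{X}_2
\]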

a 3.25-(0.086*51.9)-(0.088*32.75)=-4.104

Now we have our regression equation:

Y' = -4.104 + 0.086*X1 + 0.088*X2

Now we can calculate our column for y-hat by plugging each X1 and X2 value into the equation. You should get a column that looks something like this:

Y’
1.54282
1.536857
1.80801
2.844918
2.496897
3.276963
3.454552
3.802573
3.711394
2.331235
3.977777
2.496897
2.412873
3.802573
4.059415
4.409822
4.760228
4.841867
3.195325
4.237004

Now you can plot the actual Y versus predicted Y (i.e., Y’):

Yyhat
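If you would like to double-check these hand calculations, here is a quick R sketch (the vector names are my own; the values come from the data table at the top of this post):

#Verify the hand calculations with lm()
y  <- c(1,2,1,3,2,3,3,4,4,3,5,3,3,2,4,5,5,5,4,3)
x1 <- c(40,45,38,50,48,55,53,55,58,40,55,48,45,55,60,60,60,65,50,58)
x2 <- c(25,20,30,30,28,30,34,36,32,34,38,28,30,36,34,38,42,38,34,38)

fit <- lm(y ~ x1 + x2)
coef(fit) #should be roughly a = -4.10, b1 = 0.086, b2 = 0.088

#plot actual Y against predicted Y (Y')
plot(fitted(fit), y, xlab = "Predicted Y (Y')", ylab = "Actual Y")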

There you have it! Hopefully this breakdown provides a better intuition of the numbers behind the OLS regression formula!

Gapminder gif with Rstudio

I decided to remake the Gapminder gif that I made the other day in Python, but in Rstudio this time. I'll probably continue doing this for a while, as I try to figure out the advantages of using one program over the other. Here's a walk-through of what I did to recreate it:

#install these packages if you haven't already
install.packages(c("devtools", "dplyr", "ggplot2", "readr"))
devtools::install_github("dgrtwo/gganimate",force=TRUE)

library(devtools)
library(dplyr)
library(readr)
library(viridis)
library(ggplot2)
library(gganimate)
library(animation)

#Set up ImageMagick --for gifs
install.packages("installr",dependencies = TRUE)
library(installr)

#Configure your environment--change the location
Sys.setenv(PATH = paste("C:/Program Files/ImageMagick-7.0.7-Q16", Sys.getenv("PATH"), sep = ";")) #change the path to where you installed ImageMagick
#Again, change the location:
magickPath <- shortPathName("C:/Program Files/ImageMagick-7.0.7-Q16/magick.exe")
ani.options(convert=magickPath) #point the animation package at ImageMagick

If you need to download ImageMagick, go to this link

Load data and create plot

Once you’ve installed the appropriate packages and configured ImageMagick to work with Rstudio, you can load your data and plot as usual.

gapminder_data<-read.csv("https://python-graph-gallery.com/wp-content/uploads/gapminderData.csv", header=TRUE)

glimpse(gapminder_data) #print to make sure it loaded correctly
## Observations: 1,704
## Variables: 6
## $ country    Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ year       1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ pop        8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ continent  Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ lifeExp    28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ gdpPercap  779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
# Helper function for string wrapping. 
# Default 40 character target width.
swr = function(string, nwrap=40) {
  paste(strwrap(string, width=nwrap), collapse="\n")
}
swr = Vectorize(swr)

gapminder_plot<-ggplot(gapminder_data) +
  aes(x = gdpPercap,
      y = lifeExp,
      colour = continent,
      size = pop, 
      frame=year) +
      scale_x_log10() +
  scale_size_continuous(guide =FALSE) + #suppresses the second legend (size=pop)
  geom_point() +
  scale_color_viridis(discrete=TRUE)+ #optional way to change colors of the plot
  theme_bw() +
  labs(title=swr("Relationship Between Life Expectancy and GDP per Capita"),
       x= "GDP Per Capita",
       y= "Life expectancy",
       caption="Data: Gapminder") +
  theme(legend.position = "none",
        axis.title.x=element_text(size=.2),
        axis.title.y=element_text(size=.2),
        plot.caption = element_text(size=.1))

#getOption("device") #try running this if your plot doesn't immediately show gapminder_plot

#if you want to save the plot:
ggsave("title.png", 
       plot = last_plot(), # or give ggplot object name as in myPlot,
       width = 5, height = 5, 
       units = "in", # other options c("in", "cm", "mm"), 
       dpi = 300)

Notice that I created the swr function to wrap the title text. If I don’t include that function, the title runs off the plot, like this:

gapminderplot

Animate the plot

Now you can animate the plot using gganimate. Also, if you want to change any of the axis-titles or any other feature of the plot, I like to reference STHDA.

#remember to assign a working directory first:
#setwd() <--use this to change the working directory, if needed
gganimate(gapminder_plot,interval=.5,"gapminderplot.gif")

All in all, I'd say that creating the gif was equally easy in Python and R, although I had more trouble initially configuring Python with ImageMagick. I might have found it easier in R simply because I used Python to figure this out the first time. On the other hand, I like the way the Python gif looks much more than the gif that Rstudio rendered.

animated_gapminder

Looks like I’ll have to continue experimenting.

My Custom R Package!

One thing that drives me crazy: copying latent class model results from Mplus, pasting them into Excel, and then filtering out the parts of the output that are unnecessary. You can of course speed this process up by using vLookups or some form of indexing in Excel, but it's still cumbersome. I'm spoiled by Stata, which has custom packages that allow the user to output their results into nicely formatted tables (see, for example, my previous posts).

My solution? Make my own package that does this for Mplus output. I considered doing this in Stata, but I need to wean myself off of Stata and become proficient in programs like R and Python. So, I chose R. I didn’t do this for any particular reason other than that I am taking courses in Python, so I wanted to get more practice in R.

How it works

1. Install the package

If R is not installed on your computer, the first step is actually to install R. In addition to installing R, I recommend installing R Studio, which provides a friendly user interface for R, especially if you’re uncomfortable coding in a command prompt type window.

If R is installed on your computer, go to my github profile and download the LCA2xl_0.0.1.tar.gz file. Place the file in your R library. For example, my R library is “C:/Users/Stella/Documents/R/win-library/3.4”. Then type the following:

#First check that your working directory is set to the same folder you just placed the package in:
getwd()
#--If it isn't, type the following (replace with the correct information):
setwd('C:/Users/Username/Rpath')

#Second, install the package and load it:
install.packages("LCA2xl_0.0.1.tar.gz", type="source", repos=NULL) 
library(LCA2xl)

#You may also want to load the following packages:
library(dplyr)
library(tidyr)
library(plyr)
library(stringi)
library(stringr)
library(excel.link)
#--if you need to install them, type: install.packages("package_name", dependencies=TRUE)

2. Load your data

Next, tell R which Mplus.out file you would like to use. (Change your working directory to the correct folder first):

mplusfile<-"example.out"
#Alternatively, you can just use the "example.out" file directly in the function command. Shown below.

#Do not type the following. I am only doing this to give you an idea of what the file looks like (for a 2-Class Model):
readLines(file("example.out"))

file

As you can see, it’s a massive file of text that you have to hunt through for the data that you want.

3. Use LCA2xl to extract models

The main function that you will likely want to use is getProbResults. It extracts the section of the output file that displays the results in probability scale. The relevant section is Category 2 for each variable/measure, because this is the probability that a person in a given class exhibits a particular trait or, in my case, the probability that a person will experience a particular event. The information you will need in order to use this function is the usevariable list and the number of classes in the model. Note that you don't have to provide a usevariable list. It's better if you do, however, because Mplus often truncates variable names, or because your variables have nondescriptive names (to prevent Mplus from truncating them). If you are okay with the usevariable names that you listed in Mplus, you can use an optional function that I created, called getUsevars. Here's an example:

#To use the custom function that I provided:
usevars<-getUsevars(mplusfile) #it extracts the list you provided in Mplus
#--you could also just write: usevars<-getUsevars("example.out")

usevars #to view the results
 [1] "CUREMP2" "NSCHOOL2" "COHAB2" "MARRIED2" "PARENT2" "CUREMP3" "NSCHOOL3" 
 [8] "COHAB3" "MARRIED3" "PARENT3" "CUREMP4" "NSCHOOL4" "COHAB4" "MARRIED4" 
[15] "PARENT4" "CUREMP5" "NSCHOOL5" "COHAB5" "MARRIED5" "PARENT5" "CUREMP6" 
[22] "NSCHOOL6" "COHAB6" "MARRIED6" "PARENT6" "CUREMP8" "NSCHOOL8" "COHAB8" 
[29] "MARRIED8" "PARENT8" "CUREMP12" "NSCHOOL12" "COHAB12" "MARRIED12" "PARENT12" 
#Alternatively, you can assign a list. Just be sure to assign the same number of variables and in the correct order
usevars2<-c("EMPLOYED2","SCHOOL2", "COHAB2", "MARRIED2", "PARENT2", 
            "EMPLOYED3","SCHOOL3", "COHAB3", "MARRIED3", "PARENT3",
            "EMPLOYED4","SCHOOL4", "COHAB4", "MARRIED4", "PARENT4",
            "EMPLOYED5","SCHOOL5", "COHAB5", "MARRIED5", "PARENT5",
            "EMPLOYED6","SCHOOL6", "COHAB6", "MARRIED6", "PARENT6",
            "EMPLOYED7","SCHOOL7", "COHAB7", "MARRIED7", "PARENT7",
            "EMPLOYED8","SCHOOL8", "COHAB8", "MARRIED8", "PARENT8")

usevars2
 [1] "EMPLOYED2" "SCHOOL2" "COHAB2" "MARRIED2" "PARENT2" "EMPLOYED3" "SCHOOL3" 
 [8] "COHAB3" "MARRIED3" "PARENT3" "EMPLOYED4" "SCHOOL4" "COHAB4" "MARRIED4" 
[15] "PARENT4" "EMPLOYED5" "SCHOOL5" "COHAB5" "MARRIED5" "PARENT5" "EMPLOYED6"
[22] "SCHOOL6" "COHAB6" "MARRIED6" "PARENT6" "EMPLOYED7" "SCHOOL7" "COHAB7" 
[29] "MARRIED7" "PARENT7" "EMPLOYED8" "SCHOOL8" "COHAB8" "MARRIED8" "PARENT8" 
#Now you can extract the results in probability scale
results1<-getProbResults(mplusfile, classes=2, usevariableList=usevars)
#Don't worry about the red text. It's just a warning; I didn't assign attributes to the file. It's unnecessary information and it won't affect your results.
results1 #The results are put into a data frame (dataset)

#You will see pretty similar results if you use your custom variable list
results2<-getProbResults(mplusfile, classes=2, usevariableList=usevars2)
results2

Here is what each data set looks like:

results1

results2

Notice that I extracted the indicator (the number following the variable name) and also added stars to the estimates: *** p<.001, ** p<.01, * p<.05

4. Export the results to Excel

Now you can export the results to an Excel workbook. You also have the option of exporting both models' results, if you would like:

listr<-list(results1) #It has to be in a list to work
LCA2xl(modelList=listr, returnfile="Results1.xlsx")

#or you can create a list of models to export to Excel:
listresults<-list(results1, results2) #place both data frames in one list
listresults
LCA2xl(modelList=listresults, returnfile="Results2.xlsx")

When you run the LCA2xl function, you’ll notice that an Excel Workbook will open on your computer. For the workbook with two models, you should see that each model was saved to a separate sheet.

Excel

5. Extras

Lastly, I added some extra functions: functions that extract Tech 11 and Tech 14 output, a function that extracts the average probability of class membership, and a final function that takes the square root of the results in probability scale. The latter is mainly for graphing; it rescales the y-axis. Below, I've provided a demonstration of each function.

#For Techoutput
getTech11(mplusfile, classes=2)

getTech14(mplusfile, classes=2)

sqprobLCA(mplusfile, classes=2, usevariableList = usevars)

extract_LCprob(mplusfile, classes=2)

tech11

tech14

sqprob

probs

You can add these results to the list of dataframes that you export to Excel if you would like them included in your results.
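For example, building on the calls above, something like the following should put everything in one workbook (the object names here are just placeholders I chose):

tech11 <- getTech11(mplusfile, classes=2)
tech14 <- getTech14(mplusfile, classes=2)
avgprob <- extract_LCprob(mplusfile, classes=2)

alloutput <- list(results1, results2, tech11, tech14, avgprob)
LCA2xl(modelList=alloutput, returnfile="AllResults.xlsx")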

That’s it! Now go have fun with all that extra time you saved!